# ask-community
a
Hi team, we have a run status sensor to track some metrics on successful jobs. It has been working fine, but recently we've been seeing cases where it is not getting triggered for some job runs (please note the screenshots below). Any thoughts on how I can debug it?
Job run on June 6
Run for the same job on June 7
d
Hey Arun - the big thing I'd want to check here is the daemon logs shortly after the run finished (since this is several days ago now, though, I'm not sure if those are still available)
a
Thanks Daniel. I looked at the daemon logs for the next sensor tick. I don't see any errors; however, it does not create requests for all the successful runs. Only a small subset of job runs were handled:
2022-06-16 19:07:17 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: job_run_success_sensor
2022-06-16 19:07:18 +0000 - dagster.daemon.SensorDaemon - INFO - Completed a reaction request for run 43674f8b-1c58-4310-ab6a-2c57cc8e4e80: Sensor "job_run_success_sensor" acted on run status SUCCESS of run 43674f8b-1c58-4310-ab6a-2c57cc8e4e80.
2022-06-16 19:07:18 +0000 - dagster.daemon.SensorDaemon - INFO - Completed a reaction request for run 007ee87b-8311-4e2e-8643-d7ceb5a9c8c6: Sensor "job_run_success_sensor" acted on run status SUCCESS of run 007ee87b-8311-4e2e-8643-d7ceb5a9c8c6.
2022-06-16 19:07:18 +0000 - dagster.daemon.SensorDaemon - INFO - Completed a reaction request for run 04e701cf-143d-48a9-bc90-bd30b92c8eab: Sensor "job_run_success_sensor" acted on run status SUCCESS of run 04e701cf-143d-48a9-bc90-bd30b92c8eab.
2022-06-16 19:07:18 +0000 - dagster.daemon.SensorDaemon - INFO - Completed a reaction request for run a319a3ff-9221-498a-bca3-ac2189e75460: Sensor "job_run_success_sensor" acted on run status SUCCESS of run a319a3ff-9221-498a-bca3-ac2189e75460.
2022-06-16 19:07:18 +0000 - dagster.daemon.SensorDaemon - INFO - Completed a reaction request for run f7ac1f38-712c-49bd-84a1-0af48a4e9956: Sensor "job_run_success_sensor" acted on run status SUCCESS of run f7ac1f38-712c-49bd-84a1-0af48a4e9956.
d
Got it - what version of dagster is this, and is this just a normal out of the box run status sensor?
a
v0.14.5. Yes, here is what the definition looks like:
@run_status_sensor(
    pipeline_run_status=PipelineRunStatus.SUCCESS,
    minimum_interval_seconds=constants.FAST_SENSOR_INTERVAL_SECONDS,
)
def job_run_success_sensor(context: RunStatusSensorContext):
    try:
        run_id = context.pipeline_run.run_id
        job_type = context.pipeline_run.tags.get("metrics-repo", None)

        if not job_type:
            logger.error(
                f"The job run with run_id {run_id} is missing the required 'metrics-repo' "
                f"tag for the job type. Skipping metrics tracking."
            )
        else:
            # update prometheus metric
            ...
    except Exception as e:
        ...
        raise e
d
and another thing to double-check: the runs that it's monitoring are all in the same repository as the sensor, right?
a
Yes, it should be able to monitor all the jobs in the repository where it exists, right?
d
yeah, I was just making sure you weren't expecting it to handle other repositories as well
typically the sensor works in batches of 5 - you're sure that the missing runs weren't included in the next tick?
(that wouldn't apply to your original post where it was clearly missing)
is there anything that the missing runs have in common? like are they all from a particular job or anything like that?
a
I don't think so. Basically I am using this sensor to emit prometheus metrics for all types of job runs in our repository. Yesterday, I saw a clear drop in our metric, and came across this issue
d
Could the problem be that it's only emitting 5 runs each tick and then skipping the rest? (I looked through the code and it didn't seem like that should be happening, but curious if that matches what you are seeing)
a
Is there any chance that the job status sensor processed a job run, but failed to emit an engine event (seen in the run log) and a log line (in the daemon logs)? The prometheus metrics look decent today; however, I don't see an engine event in most of the run logs
I looked at the log for the next tick; it returned an empty result. However, I am sure there were successful runs in that interval:
2022-06-16 19:36:46 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: job_run_success_sensor
2022-06-16 19:36:46 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor job_run_success_sensor skipped: Sensor function returned an empty result
d
would the theory that it's only ever running the first 5 runs when there are more than 5 match the data that you're seeing?
(in a given sensor tick interval)
a
It looks like that might not be the case, because the last tick returned empty results per the logs even though there were successful runs
Are these reaction requests stored in the DB tables? I can take a look at the source of truth directly
d
they're not, unfortunately. We don't have quite the same level of visibility into run reaction requests that we do with RunRequests from standard sensors
"It looks that might not be the case, because the last tick returned empty results as per the logs even though there are successful runs" - that would be consistent with the theory if we're only launching the last 5 runs and then skipping to the end for some reason
the theory would be that the bug is that if there are 15 runs, it only launches 5, then skips the rest
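The two cursor behaviors Daniel is distinguishing can be sketched as a toy model (hypothetical `correct_tick`/`buggy_tick` helpers for illustration only, not Dagster's actual implementation):

```python
# Toy model: a queue of 15 finished runs and a per-tick batch limit of 5.
runs = list(range(15))
LIMIT = 5

def correct_tick(cursor):
    """Launch up to LIMIT runs and advance the cursor past only those."""
    batch = runs[cursor:cursor + LIMIT]
    return batch, cursor + len(batch)

def buggy_tick(cursor):
    """The suspected bug: launch up to LIMIT runs but skip to the end."""
    batch = runs[cursor:cursor + LIMIT]
    return batch, len(runs)

# Correct behavior: three ticks drain all 15 runs.
cursor, handled = 0, []
for _ in range(3):
    batch, cursor = correct_tick(cursor)
    handled += batch
print(len(handled))  # 15

# Buggy behavior: one tick handles 5 runs; the next tick finds nothing.
batch, cursor = buggy_tick(0)
print(len(batch), runs[cursor:])  # 5 []
```

Under the buggy behavior, the runs between the two batches are never reacted to, which would match the missing engine events.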
a
However, in between those two ticks there were successful runs. If it launches the first 5, then the second tick should have launched some reactions instead of empty results?
d
Yeah
There's definitely a bug - I'm trying to find some clues as to which runs are getting skipped and why
I verified with a local test that it's not just taking 5 runs and then throwing out the rest, so there's something trickier going on here
this is with a postgres DB right?
a
Yes, it's a postgres DB.
d
Quick question - you said "the next tick" there, but that's like 30 minutes later. Is the run status sensor only running once every 30 minutes?
The first one you posted was 2022-06-16 19:07:17, then the next one was 2022-06-16 19:36:46
a
Nope, it's scheduled to run every 1 min. But we run a lot of sensors, and I think sensor ticks are being queued
Could that cause any issues?
d
I can't think how it would cause this particular issue
do you possibly have the value of the cursor for this run status sensor from before it did a tick where it missed runs? Or would it be possible to pull that? The cursor should be visible on the sensor page
Wait, sorry, quick update - @prha informs me that I'm mistaken and we actually do store the run IDs that we believe were triggered in the database - it's just not currently displayed in dagit
So if there was a way for us to see the tick history for that sensor in your job_ticks table, we could take a look and see if the expected runs are there
a
Ah cool. Which table can I query for the reaction requests from that sensor?
d
and just to double-check, there are no failed ticks for that sensor, right?
a
Thanks Daniel. Will query the job_ticks table and get back to you
Yes, there are no failures
d
It might not be trivial to find them though - they're in a serialized "tick_body" column
if you wanted to send us a dump of that table we might be able to help find it
and we'll look into showing them in dagit too, we really should be doing that
the history of cursors across a tick where there were some missing runs would be helpful too
and that part is visible in dagit
a
Another finding: it looks like the run IDs from this trigger were from June 6
Currently the sensor is creating reaction requests for runs that happened on June 7. The cursor for the last tick is:
{
  "__class__": "RunStatusSensorCursor",
  "record_id": 20549343,
  "update_timestamp": "2022-06-07T10:02:17.216280"
}
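How far behind this cursor sits can be checked with a short script (no Dagster API involved; this just parses the cursor JSON and the tick timestamp copied from the daemon logs above):

```python
import json
from datetime import datetime

# Cursor value copied from the sensor page (see above)
cursor_raw = """
{
  "__class__": "RunStatusSensorCursor",
  "record_id": 20549343,
  "update_timestamp": "2022-06-07T10:02:17.216280"
}
"""
cursor = json.loads(cursor_raw)
last_seen = datetime.fromisoformat(cursor["update_timestamp"])

# Timestamp of the daemon tick that reported an empty result
tick_time = datetime.fromisoformat("2022-06-16T19:36:46")

lag = tick_time - last_seen
print(f"record_id={cursor['record_id']}, lag={lag.days} days")
# record_id=20549343, lag=9 days
```

So by June 16 the sensor was still reacting to runs from June 7 - roughly nine days behind.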
d
Hm, interesting, so it's just fallen way behind?
Ah OK, i'm starting to formulate a theory here
here's a question for you: approximately what percent of runs in your instance should be triggered by this sensor? (Ballpark is fine)
what I think is happening is that on each tick it's finding 5 runs that succeeded that don't apply to that particular sensor (because they are from some other repository), and they're getting filtered out. And then, because of this behavior where your sensor is only running every 30 minutes, it's just fallen extremely behind
we have some ideas of how we can fix this
a
The entire instance? Maybe around 2500 jobs per day
d
and how many runs are in this one repo?
(per day)
a
Probably around 1600 - 1800
d
so more than half of the runs each day are from this repo?
a
Yes
d
Hm, I'm not sure why most of the ticks aren't returning any runs then
that doesn't quite fit with my theory
(Is it true that most ticks are not returning runs? Or just that they are operating on runs, but the runs are many days old?)
the big project that we have going on that will help with it falling behind is letting you run multiple sensors in parallel - since it's only doing 5 runs every 30 minutes, it's not too surprising that it's falling behind
starting to work through that in this PR here: https://github.com/dagster-io/dagster/pull/8265
a
Nope, I think the only problem is that somehow the sensor has fallen behind and is looking for runs with older timestamps. The sensor returning empty results is just a side effect of the issue
d
I think at one point we discussed the possibility of working with you all too to see if there was some way for you to have fewer sensors
which could also help and wouldn't require waiting for us to add new features
a
I think the sensors would not be in the 1000s, probably just in the 100s.
d
some quick back-of-the-envelope math - if you have 2500 runs per day, you'll want this sensor running at least every couple of minutes in order for it to be able to keep up
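The back-of-the-envelope math can be written out (numbers are the approximations from this thread; 5 runs per tick is the batch size mentioned above):

```python
# How often must the sensor tick to keep up, given ~2500 runs/day
# and a 5-run batch per tick?
RUNS_PER_DAY = 2500
RUNS_PER_TICK = 5
MINUTES_PER_DAY = 24 * 60

ticks_needed = RUNS_PER_DAY / RUNS_PER_TICK          # 500 ticks/day
interval = MINUTES_PER_DAY / ticks_needed            # minutes between ticks
print(f"Needs a tick at least every {interval:.2f} minutes")  # 2.88

# At the observed ~30-minute cadence it can only process:
throughput = (MINUTES_PER_DAY / 30) * RUNS_PER_TICK  # runs/day
print(f"Throughput at 30-minute ticks: {throughput:.0f} runs/day")  # 240
```

At 240 runs/day against ~2500 arriving, the sensor falls further behind every day, which matches the nine-day-old cursor.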
a
What logic would have prevented the sensor from creating more than 5 reaction requests?
If it had processed all the runs, then this issue would not have happened, right?
d
The problems all come down to the sensors not being parallelized - ideally each one can do work in small chunks and be called again soon
I.e., there wouldn't be a problem if it was actually running once a minute like it was supposed to
a
Sorry, I still don't understand why every tick is emitting only 5 reaction requests at most. Is this a limit specified in the implementation?
d
There's a time limit for each sensor tick
It could probably be higher than 5, but the core problem is that your sensors are taking 30x longer to run than they are supposed to
If they are only running every 30 minutes, making it 10 instead of 5 isn't really going to help
a
It's 60 seconds, right? Is emitting reaction requests an expensive operation? I was thinking that the sensor could emit maybe even 100 reaction requests within 60 seconds
d
It's possible that increasing the limit will serve as a bandaid
You are ok waiting up to 30 minutes for this sensor to run after a run finishes?
a
So is my understanding right that currently every sensor tick would only emit at most 5 reaction requests?
d
That's the current implementation yeah
a
It's not ideal, but should be better than falling behind
d
The assumption is that it will run frequently emitting a few runs each time
We can make the 5 tunable, that'll be a quick change
Technically you can do it without us making changes by implementing your own sensor based on the built-in run status sensor implementation, but that might be a pain
a
Thanks Daniel. Will look into it. For now, what value should I set the cursor of this sensor to in order to bring it back to the current time frame? Looks like I need to change both the record_id and update_timestamp?
d
I think if you make it empty it will reset to the end next time
But it will fall behind again pretty quickly no?
a
Got it. I am actually planning to work on another change to batch multiple job runs into one, so that should reduce the total number of runs, which might help
Follow-up unrelated question: currently the job run tags take only str as a value. Is there any way we can add str[] as a tag value and search on the runs page by tag-contains? When I merge multiple jobs, I might want to add multiple entries to a particular tag key and be able to search by one of them
d
Would you mind making a new post for that?
So we can surface it to UI folks
a
Sure, makes sense
Hey @daniel, in order to reset the cursor, could you please advise on what value I should set the record_id to? Looks like I can't remove the key
d
empty string should work
a
Let me try. Also, taking a step back, is a run status sensor still the right place to track metrics about jobs? Not sure if there is a better place that the team can recommend. Related old thread
d
Once it's able to keep up, I think it's a reasonable place to do it
a
@daniel You were right that it would still fall behind even if we reduced the number of job runs. Looks like I might have to wait for your change to parallelize sensors. I looked into the run status sensor implementation, but I don't feel comfortable re-implementing it in our repo