My multi asset sensor is suddenly getting an error...
# ask-community
c
My multi-asset sensor is suddenly getting an error like this:
```
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1672664343.554830497","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>
```
What do I do here?
Okay, so it seems the post-materialization sensor runs again and again even though I use `advance_all_cursors`.
d
Hi Casper - there's a 60-second timeout on sensor evaluation. Is it possible that the body of your sensor function is taking longer than 60 seconds to execute?
c
I don't imagine the body will, unless you are counting the actual run that results from the RunRequest too? All it does is check the metadata of the materialized asset records against literal values and yield RunRequests for them. It is, however, doing that for about 38 different assets.
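For reference, a minimal sketch of the kind of sensor body being described here, with hypothetical asset keys and job name and the actual metadata comparison elided (the real code is in the gist linked below):

```python
from dagster import AssetKey, RunRequest, multi_asset_sensor

# Hypothetical stand-ins for the ~38 monitored assets and the downstream job.
MONITORED_ASSET_KEYS = [AssetKey("asset_a"), AssetKey("asset_b")]


@multi_asset_sensor(monitored_assets=MONITORED_ASSET_KEYS, job_name="downstream_job")
def post_materialization_sensor(context):
    # Look at the latest materialization record for each monitored asset.
    for asset_key, record in context.latest_materialization_records_by_key().items():
        if record is None:
            continue
        # The real sensor compares materialization metadata against literal
        # values here; that check is elided in this sketch.
        yield RunRequest(run_key=f"{asset_key.to_user_string()}_{record.storage_id}")
    # Advance all cursors so the same materializations aren't processed again.
    context.advance_all_cursors()
```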
d
Any chance you could share the code? The run doesn't count toward the time, no
(The code of just the sensor, that is)
It's also possible (maybe more likely) that just doing the DB queries to check the status of the 38 different asset streams is what's eating up the 60 seconds.
c
https://gist.github.com/C0DK/deba711251478c99ab1de95e4db6e393 there you go. I've changed business-specific literals, but the code does exactly the same.
It's got to be said that it doesn't fail consistently. It hasn't failed for the last several hours, but it did fail a bunch of times and then stopped. It might just be my VM (or Postgres DB?) that is a bit slow.
d
Do you have visibility into slow queries coming into your Postgres DB while the sensor is running? My first guess would be that it's doing a query per asset key and one or more of them is taking a while to execute.
c
I am not a super user regarding databases, but in theory I could probably see it, as I have full access to it. Do you have any suggestions/directions to look in? I'd gladly supply the best possible bug report, I just don't know how 😄
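One way (not Dagster-specific) to surface slow queries is the `pg_stat_statements` extension; below is a rough sketch of querying it from Python with placeholder connection details. Note that on Postgres 12 and earlier the columns are `total_time`/`mean_time` rather than `total_exec_time`/`mean_exec_time`.

```python
# Rough sketch: list the heaviest queries recorded by pg_stat_statements.
# Assumes the extension is enabled on the server (it must be in
# shared_preload_libraries and created with CREATE EXTENSION pg_stat_statements).
# Connection details below are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-db-host", dbname="dagster", user="dagster", password="...")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT calls,
               round(total_exec_time) AS total_ms,
               round(mean_exec_time)  AS mean_ms,
               left(query, 120)       AS query
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 20;
        """
    )
    for row in cur.fetchall():
        print(row)
```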
c
Hi Casper, in 1.0.15 we added database indexes that should speed up the performance of multi-asset sensor queries. Have you run this migration? If you are able to determine which query is causing the timeout, that would be really helpful!
c
Do you have any tips on how to figure out which query? I'll try to migrate and see if that fixes it, thanks.
@claire I thought it was a thing of the past, but I am getting it again. How do I migrate the database etc.?
I can see I am running Dagster version 1.1.3.
c
Hi Casper, in order to migrate the database you would need to run `dagster instance migrate`. If you have access to your Postgres DB, you can check which indexes are present. These two indexes were the ones added in 1.0.15 which should speed up this query: https://github.com/dagster-io/dagster/blob/61eeaa75909b1d580022b01d254c3b4f25555bf3/python_modules/dagster/dagster/_core/storage/event_log/schema.py#L101-L122
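A quick way to check which indexes are actually present, assuming the default Postgres event log table name (`event_logs`) and a placeholder connection URL (adjust both to your instance):

```python
# List the indexes on the Dagster event log table to verify the 1.0.15
# migration indexes exist. The table name and connection URL are assumptions.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://dagster:password@my-db-host:5432/dagster")
inspector = inspect(engine)

for index in inspector.get_indexes("event_logs"):
    print(index["name"], index["column_names"])
```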
c
```
Updating run storage...
Skipping already applied data migration: run_partitions
Skipping already applied data migration: run_repo_label_tags
Skipping already applied data migration: bulk_action_types
Updating event storage...
Skipping already applied data migration: asset_key_index_columns
Updating schedule storage...
Skipping already applied migration: schedule_jobs_selector_id
```
seems promising!
Solved it! Thanks for the great support once again, Dagster, yay!
Much development later, I've added a reconciliation_sensor which now has the same issue. Is there a way to clean up the database? We've renamed a bunch of assets quite a lot of times, so that might make the whole thing slower. I imagine that there is a bunch of historical data that we might be able to do without.
```
dagster._core.errors.DagsterUserCodeUnreachableError: The sensor tick timed out due to taking longer than 60 seconds to execute the sensor function. One way to avoid this error is to break up the sensor work into chunks, using cursors to let subsequent sensor calls pick up where the previous call left off.
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/sensor.py", line 489, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/sensor.py", line 552, in _evaluate_sensor
    sensor_runtime_data = repo_location.get_external_sensor_execution_data(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/host_representation/repository_location.py", line 823, in get_external_sensor_execution_data
    return sync_get_external_sensor_execution_data_grpc(
  File "/usr/local/lib/python3.10/site-packages/dagster/_api/snapshot_sensor.py", line 63, in sync_get_external_sensor_execution_data_grpc
    api_client.external_sensor_execution(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 398, in external_sensor_execution
    chunks = list(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 186, in _streaming_query
    self._raise_grpc_exception(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 137, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1675673483.967854180","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 182, in _streaming_query
    yield from self._get_streaming_response(
  File "/usr/local/lib/python3.10/site-packages/dagster/_grpc/client.py", line 171, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 826, in _next
    raise self
```
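For context, the chunking-with-cursors advice in that error message applies to hand-written sensors (the built-in reconciliation sensor can't be split up this way). A minimal sketch of the pattern, with hypothetical asset keys and job name:

```python
from dagster import (
    AssetKey,
    DagsterEventType,
    EventRecordsFilter,
    RunRequest,
    sensor,
)

# Hypothetical asset keys; the chunk size is arbitrary.
ALL_ASSET_KEYS = [AssetKey(f"asset_{i}") for i in range(45)]
CHUNK_SIZE = 10


@sensor(job_name="downstream_job")
def chunked_sensor(context):
    # The cursor remembers how far the previous tick got, so each tick only
    # inspects a slice of the assets and stays well under the 60-second limit.
    start = int(context.cursor) if context.cursor else 0
    chunk = ALL_ASSET_KEYS[start : start + CHUNK_SIZE]

    for asset_key in chunk:
        records = context.instance.get_event_records(
            EventRecordsFilter(
                event_type=DagsterEventType.ASSET_MATERIALIZATION,
                asset_key=asset_key,
            ),
            limit=1,
        )
        if records:
            yield RunRequest(
                run_key=f"{asset_key.to_user_string()}_{records[0].storage_id}"
            )

    # Wrap around once every asset has been covered.
    next_start = start + CHUNK_SIZE
    context.update_cursor(str(next_start if next_start < len(ALL_ASSET_KEYS) else 0))
```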
@claire
I can see our database is running at 99% CPU usage
(There are 45 assets in the job, btw. Not sure if that is too much, but it seems like a small number in theory.)
o
hi @Casper Weiss Bang! mind sharing what version of dagster you're running? there have been a fair number of performance improvements in the last few releases, so it's possible that these could help. I wouldn't expect historical data for asset keys that are no longer referenced in your repository to have much performance impact. The sensor should (in theory) be able to handle 45 assets with no problem -- we have people with 100s of assets monitored by the sensor, and I've tested it up to a few thousand (although that was running on my local machine). But it's definitely possible some specifics of your setup are causing different perf issues to arise. If you're on the newest version, mind letting me know a bit more about the shape of the asset graph that you're using the sensor for? (lots of sources? using partitions?)
c
The performance fixes Owen mentioned above will likely help you. But to answer your question about how to clean up the database, here's an example of how you might be able to delete old data: https://github.com/dagster-io/dagster/discussions/12047
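The linked discussion has the details; as a very rough illustration of the idea, here is a sketch that prunes all but the most recent N runs from a script. This permanently deletes run history (and the associated event log entries), so treat it as an assumption-laden example and test it against a non-production instance first.

```python
from dagster import DagsterInstance

# Keep only the most recent N runs and delete everything older.
# Destructive: back up the database before running anything like this.
KEEP = 1000

if __name__ == "__main__":
    with DagsterInstance.get() as instance:
        runs = instance.get_runs()  # returned newest-first
        for run in runs[KEEP:]:
            instance.delete_run(run.run_id)
```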
c
1.1.15 (afaik the newest one). I upgraded today for this reason, reran the migrations mentioned earlier in this thread, and ran the reindex command. I'll try to spend the next few hours figuring out which requests are slow; I need to figure out how to debug a Postgres server.
@owen partitions on the majority of ingress assets. Each asset is then duplicated in a cleaned version, after which we have some "unions" of a few different assets.
Partitions are monthly back to 2017 as far as I remember. I can give you the hardware specs too in a while.
To update: I am running 1.1.15. I have 45 assets, of which maybe half have a partition (`MonthlyPartitionsDefinition(start_date="2019-01-01", end_offset=1)`). I can also see I 'only' have 5790 run records (I deleted ~2000 to see if it helped; it didn't seem to, now with only 3762 runs). I have three code locations (well, repositories), and one of them occasionally succeeds while the other two always time out. The database is a `GP_Gen5_2` from Azure. It's troublesome to debug with you guys due to timezones, but I'll try to gather as much intel for you awesome people tonight. I can add a GitHub issue too, if that is better for you 🙂
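For reference, a tiny sketch of the asset shape described above, with hypothetical asset names: a partitioned ingress asset and its cleaned counterpart on the same monthly partitioning.

```python
from dagster import MonthlyPartitionsDefinition, asset

monthly = MonthlyPartitionsDefinition(start_date="2019-01-01", end_offset=1)


@asset(partitions_def=monthly)
def raw_events():
    """Hypothetical partitioned ingress asset."""
    ...


@asset(partitions_def=monthly)
def cleaned_events(raw_events):
    """Cleaned copy of the ingress asset, same partitioning."""
    ...
```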
o
Great, thanks for digging! Feel free to leave the info on Slack if that's easier for you. If you're able to connect to your production Dagster instance from a local script, one thing that I've done to profile the reconciliation sensor is running the following:
```python
from dagster import (
    AssetSelection,
    DagsterInstance,
    build_asset_reconciliation_sensor,
    build_sensor_context,
)

from my_package import my_repo

if __name__ == "__main__":
    # you may need to change this line to get your prod dagster instance
    with DagsterInstance.get() as instance:
        sensor = build_asset_reconciliation_sensor(asset_selection=AssetSelection.all())
        result = sensor.evaluate_tick(
            build_sensor_context(
                instance=instance,
                repository_def=my_repo,
            )
        )
```
This cuts to the chase a bit, allowing you to directly run a tick of the sensor. From there, you can run `time sudo py-spy record -o profile.svg -- python that_file.py`, which will generate a flame graph (which would be super helpful in determining what's slowing things down).
c
I'm sadly going on vacation till Friday. Will give it a go then and give you some information
o
no problem!
c
Back from vacation. Had a bunch of things going on, but now I'm here. I got another related error running the same codebase on our newly configured production server:
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
This is when running a backfill, and it's a clean server, so it shouldn't have a bunch of historic values, which is what I assumed was the cause on the dev server (the one I reported the other error on). I have the profile.svg but would prefer not to push this to a public Slack channel. Do you have an email I can ping? :)
o
got it -- feel free to either dm me or send it to owen@elementl.com 🙂