a

Adam Bloom

10/19/2022, 6:33 PM
any tips for debugging grpc deadline exceeded errors from dagit/dagster-daemon to user code deployments? Just rolled a new user code release out to prod with assets + sensors, and it's very unhappy. dagster-daemon doesn't seem to log much after startup, so I'm a bit lost as to where to look
d

daniel

10/19/2022, 6:34 PM
Hey Adam - do you have a full stack trace you can share? Curious which call it's happening in exactly
Sensors are the most common one - in which case the suggested way to address it is to do less work in each tick to stay under the 60-second-per-tick limit
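Rough shape of what I mean - just a sketch, not your code (my_job / fetch_new_records are placeholders): keep a cursor and cap the batch so each tick finishes quickly

from dagster import RunRequest, SkipReason, job, op, sensor


@op
def process_record():
    ...


@job
def my_job():
    process_record()


def fetch_new_records(after_id, limit):
    # placeholder for whatever external lookup the sensor actually does
    return []


@sensor(job=my_job)
def batched_sensor(context):
    # resume from the last cursor instead of rescanning everything each tick
    last_seen = int(context.cursor) if context.cursor else 0

    # cap the batch so a single tick stays well under the 60s grpc deadline
    records = fetch_new_records(after_id=last_seen, limit=100)
    if not records:
        yield SkipReason("no new records")
        return

    for record in records:
        yield RunRequest(run_key=str(record["id"]))

    context.update_cursor(str(records[-1]["id"]))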
a

Adam Bloom

10/19/2022, 6:36 PM
sensors are definitely the most upset. This is running just fine in our QA environment, and I don't see how any of our sensors would be taking anywhere near that long. but here are a bunch of logs
huh, just got a run failure sensor output...30 minutes after it should have fired. so things are somehow making their way through....slowly
1.0.13 if not in those logs, btw
d

daniel

10/19/2022, 6:39 PM
Is the daemon still spewing new errors now? Like in the last minute or so?
a

Adam Bloom

10/19/2022, 6:40 PM
yup
and dagit is no longer displaying the overview timeline or sidebar (since the repositories list is failing?)
d

daniel

10/19/2022, 6:41 PM
How's the user code deployment doing? Seems like the daemon (and maybe dagit) is having some trouble connecting to it
a

Adam Bloom

10/19/2022, 6:43 PM
up and running just fine, supposedly. I've already restarted it once - things seemed to work for a bit but clearly failed again. it's not logging anything at all, so hard to say if that's normal/abnormal (at least to stdout/stderr)
we're using s3 for logging, but I think that's only run log storage?
d

daniel

10/19/2022, 6:43 PM
do you know what version of grpcio it has installed? Probably not the issue, but just ruling it out - some newer versions were causing mysterious hangs that we pinned away
(the user code deployment, i mean)
a

Adam Bloom

10/19/2022, 6:44 PM
I can track that down!
d

daniel

10/19/2022, 6:44 PM
This spooks me a bit https://pypi.org/project/grpcio/#history - although I thought we had a pin up
a

Adam Bloom

10/19/2022, 6:45 PM
grpcio-1.47.2 looks like
d

daniel

10/19/2022, 6:45 PM
ok, that's not it then
a

Adam Bloom

10/19/2022, 6:46 PM
is it a scaling thing? this user code deployment has 22 repositories, probably ~50 sensors enabled right now
our QA environment is 2 repositories and only a handful of sensors, so that's the obvious difference between same image working and not working
d

daniel

10/19/2022, 6:47 PM
Do you have anything in this area activated in the helm chart? i.e. is useThreads true?
sensors:
    # Whether to evaluate sensors using an asynchronous thread pool.  Defaults to false
    useThreads: false
    # The max number of worker threads to use when asynchronously evaluating sensors. Will use the
    # default value used by Python, which depends on the number of cores available.
    numWorkers: ~
a

Adam Bloom

10/19/2022, 6:48 PM
probably not, since this is our first deployment that uses them. we're probably all defaults in that section
d

daniel

10/19/2022, 6:50 PM
ok, the sensors should be running one-by-one then. I wouldn't expect 22 repositories to be a problem... is the main thing that changed between when it last worked and when it didn't the move to a bunch more repositories?
And do the repositories have a ton of assets? I wouldn't expect a scaling issue but it's something we could investigate
a

Adam Bloom

10/19/2022, 6:51 PM
repository count hasn't changed since last release. moving schedules -> sensors and adding software defined assets (most repos have ~85 or so)
d

daniel

10/19/2022, 6:51 PM
Got it - so 22 repos with ~85 assets each?
a

Adam Bloom

10/19/2022, 6:52 PM
yup
worth flipping that sensors threads flag?
d

daniel

10/19/2022, 6:52 PM
and we went from ~0 assets to ~1700 assets and it started failing?
ah no, the sensors threads flag is unlikely to improve this - I would dig into it as a potential cause if it were on 🙂
a

Adam Bloom

10/19/2022, 6:53 PM
a lot of these assets existed, but the keys changed (so temporarily doubled). I did try to clear out the assets that we're no longer using, although dagit froze for the first time during that, so not sure how far it got
d

daniel

10/19/2022, 6:53 PM
got it - what was the approx asset count before and after?
a

Adam Bloom

10/19/2022, 6:54 PM
by asset, do you mean software defined asset or asset ever materialized?
d

daniel

10/19/2022, 6:54 PM
just trying to get a sense of what changed when the problem started happening
software defined assets in this case
since those would affect the data that gets loaded in the calls that seem to be timing out now
a

Adam Bloom

10/19/2022, 6:55 PM
0 before, probably 2k now
the sensors that are failing are all multi asset sensors too
d

daniel

10/19/2022, 6:56 PM
is it possible to see if the problem goes away with the sensors turned off? Based on what you've said here my guess is it will not go away, but it would be a helpful data point
a

Adam Bloom

10/19/2022, 6:56 PM
yeah. I can turn more off for sure. I'll leave the run failure sensors but turn off the rest real quick
well, actually, can't do that without restarting things so that I can see things to click in dagit haha. i'll go ahead and do that though
yup, still getting context deadline exceeded with only the job failure sensors running
one thing worth noting: these sensors have repeated names between repositories. that wasn't an issue with schedules, but maybe it is an issue with sensors? there are technically 22 of these sensors (since they are currently repository specific):
[dagster-daemon-8959d56bb-lq8s5 dagster] 2022-10-19 19:00:59 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: slack_on_run_failure
[dagster-daemon-8959d56bb-lq8s5 dagster] 2022-10-19 19:01:00 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor slack_on_run_failure skipped: Sensor function returned an empty result
only seeing logs in the daemon for one of them
d

daniel

10/19/2022, 7:03 PM
I'm going to surface that to the assets experts and see how outside of the norm 2000 assets in a single code location is. One option in the meantime would be to split it into multiple code locations? That would mean 22 pods which could have its own challenges, but there's a --attribute arg that you can add to the dagsterApiGrpcArgs that scopes it down to a single repository within the code that gets loaded
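Roughly like this, if it helps - a sketch with made-up module/repo names, one repository per deployment loaded from the same image:

from dagster import asset, repository


@asset
def team_a_asset():
    ...


@asset
def team_b_asset():
    ...


@repository
def team_a_repo():
    return [team_a_asset]


@repository
def team_b_repo():
    return [team_b_asset]


# hypothetical helm values for one of the user code deployments, scoping it to a
# single repository in this module:
#
#   dagsterApiGrpcArgs:
#     - "-m"
#     - "repos"
#     - "--attribute"
#     - "team_a_repo"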
a

Adam Bloom

10/19/2022, 7:05 PM
it's definitely an option if we need to, but it's not a simple change since we'll lose partition history (I can migrate it again if needed)
(the code location is part of the repository name used for partition tracking)
we're not seeing any resource issues on the current deployment/pod - it seems to be running ok
to add some more observations to share - I was testing locally with just dagit - it loads the assets no problem. never ran dagster-daemon locally with the production config, but I can certainly try (just need to misconfigure something so any triggered jobs all fail and don't actually touch prod)
d

daniel

10/19/2022, 7:10 PM
running the daemon locally would be a good data point yeah - if it works but production k8s does not, the next thing i'd want to rule out was that the pod might be hitting a memory limit?
a

Adam Bloom

10/19/2022, 7:11 PM
no limits, we're all good there. I'm actually looking at the database right now. I'm suspicious something is going on with these event_logs queries. we don't have any settings to expire old event_logs, so that's an 18GB table
d

daniel

10/19/2022, 7:12 PM
that's also possible, although i would be surprised if that presented as a timeout when the daemon tries to load your code (which is what I see in those logs) - I don't think that goes through the DB
Do the latest errors after it's started up for a few minutes still have that
[dagster-daemon-8959d56bb-lq8s5 dagster]     res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
line in the stack trace?
a

Adam Bloom

10/19/2022, 7:13 PM
not seeing that one right now, and dagit can actually load things
d

daniel

10/19/2022, 7:13 PM
ah what are the latest errors?
maybe that previous one was just a startup thing
a

Adam Bloom

10/19/2022, 7:14 PM
after restarting the user code pod, back to just this one:
[dagster-daemon-8959d56bb-lq8s5 dagster] 2022-10-19 19:12:49 +0000 - dagster.daemon.SensorDaemon - ERROR - Sensor daemon caught an error for sensor dbt_test_and_doc_vault : dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
[dagster-daemon-8959d56bb-lq8s5 dagster]
[dagster-daemon-8959d56bb-lq8s5 dagster] Stack Trace:
[dagster-daemon-8959d56bb-lq8s5 dagster]   File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 481, in _process_tick_generator
[dagster-daemon-8959d56bb-lq8s5 dagster]     sensor_debug_crash_flags,
[dagster-daemon-8959d56bb-lq8s5 dagster]   File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 546, in _evaluate_sensor
[dagster-daemon-8959d56bb-lq8s5 dagster]     instigator_data.cursor if instigator_data else None,
[dagster-daemon-8959d56bb-lq8s5 dagster]   File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 817, in get_external_sensor_execution_data
[dagster-daemon-8959d56bb-lq8s5 dagster]     cursor,
[dagster-daemon-8959d56bb-lq8s5 dagster]   File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_sensor.py", line 61, in sync_get_external_sensor_execution_data_grpc
[dagster-daemon-8959d56bb-lq8s5 dagster]     cursor=cursor,
[dagster-daemon-8959d56bb-lq8s5 dagster]   File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 375, in external_sensor_execution
[dagster-daemon-8959d56bb-lq8s5 dagster]     sensor_execution_args
[dagster-daemon-8959d56bb-lq8s5 dagster]   File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 166, in _streaming_query
[dagster-daemon-8959d56bb-lq8s5 dagster]     raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
[dagster-daemon-8959d56bb-lq8s5 dagster]
[dagster-daemon-8959d56bb-lq8s5 dagster] The above exception was caused by the following exception:
[dagster-daemon-8959d56bb-lq8s5 dagster] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
[dagster-daemon-8959d56bb-lq8s5 dagster] 	status = StatusCode.DEADLINE_EXCEEDED
[dagster-daemon-8959d56bb-lq8s5 dagster] 	details = "Deadline Exceeded"
[dagster-daemon-8959d56bb-lq8s5 dagster] 	debug_error_string = "{"created":"@1666206767.859389389","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
[dagster-daemon-8959d56bb-lq8s5 dagster] >
d

daniel

10/19/2022, 7:14 PM
ok, that one does look like the sensor is taking >60 seconds to run. Is dbt_test_and_doc_vault one of our sensors or something you wrote?
a

Adam Bloom

10/19/2022, 7:15 PM
one of mine. but I just disabled all of those and it's still showing up
pretty trivial sensor:
# snippet from inside a larger function - asset_keys, config, environment,
# schema_cfg all come from the enclosing scope
@multi_asset_sensor(
    asset_keys=asset_keys,
    job=dbt_test_doc_job_factory(config, environment, schema_cfg),
    name="dbt_test_and_doc_vault" if config.is_vault else f"dbt_test_and_doc_{config.schema}",
    default_status=DefaultSensorStatus.RUNNING,
)
def dbt_test_docs_sensor(context):
    asset_events = context.latest_materialization_records_by_key()
    if all(asset_events.values()):
        context.advance_all_cursors()
        yield RunRequest()
    else:
        return SkipReason("Waiting for new dbt assets")
d

daniel

10/19/2022, 7:15 PM
it might linger briefly after turning off, but on the next iteration it should pick up that it's gone
Ah ok, it's possible that latest_materialization_records_by_key is timing out?
a

Adam Bloom

10/19/2022, 7:16 PM
aaaand the list repositories failure just returned (locking up dagit)
it stuck around much longer than a tick (unless I missed turning off one - which is totally possible). I'm restarting both the user code deployment and daemon pods
d

daniel

10/19/2022, 7:19 PM
what i'm imagining might be happening is that those sensor calls are hanging and locking up other requests
(like the ListRepositories call)
if you restart the daemon while all of those sensors are off i'd be curious if things are more stable
a

Adam Bloom

10/19/2022, 7:19 PM
yeah, it is
a

alex

10/19/2022, 7:20 PM
do you have resource requests/limits set for the user code servers? what size nodes do you have in the node pool?
a

Adam Bloom

10/19/2022, 7:21 PM
the other thing that could be happening: some of those assets haven't been materialized yet. I think in QA I added the jobs that materialized the assets before the sensors, but those changes went to prod together
I don't see any requests/limits on the user code pod, that node group is on m5.2xlarge
I'm going to turn a few sensors back on, just to see where the breaking point is (might need to restart in between)
user code deployment is currently using ~250MB RAM, don't think we have a resource issue at all
:ack: 1
first one is back on, completed in under a second:
[dagster-daemon-559fb67f7f-4rhv8 dagster] 2022-10-19 19:26:52 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: dbt_test_and_doc_vault
[dagster-daemon-559fb67f7f-4rhv8 dagster] 2022-10-19 19:26:53 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor dbt_test_and_doc_vault skipped: Sensor function returned an empty result
I'll post in #dagster-feedback later, but it'd be great to add the repository names to these logs - pretty much impossible to isolate repeated names without them
d

daniel

10/19/2022, 7:28 PM
Yeah good call, that's a quick change
a

Adam Bloom

10/19/2022, 7:33 PM
definitely database side, some of these queries are just hanging. trying to get an explain analyze to see if it's just missing an index or what
d

daniel

10/19/2022, 7:33 PM
an explain analyze would be amazing
but just knowing what the query is exactly that's hanging would also be very useful
a

Adam Bloom

10/19/2022, 7:34 PM
I left the minimum_interval_seconds at defaults - I could try setting those higher. but these queries should also not be taking this long
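e.g. something like this is what I'd try - just a sketch with placeholder keys, assuming multi_asset_sensor takes minimum_interval_seconds like the other sensor decorators:

from dagster import (
    AssetKey,
    DefaultSensorStatus,
    RunRequest,
    SkipReason,
    job,
    multi_asset_sensor,
    op,
)


@op
def run_dbt_tests():
    ...


@job
def dbt_test_doc_job():
    run_dbt_tests()


@multi_asset_sensor(
    asset_keys=[AssetKey(["vault", "some_table"])],  # placeholder key
    job=dbt_test_doc_job,
    # back off from the 30s default so slow ticks don't pile up while the
    # underlying event_logs queries are still slow
    minimum_interval_seconds=300,
    default_status=DefaultSensorStatus.RUNNING,
)
def dbt_test_docs_sensor(context):
    asset_events = context.latest_materialization_records_by_key()
    if all(asset_events.values()):
        context.advance_all_cursors()
        yield RunRequest()
    else:
        yield SkipReason("Waiting for new dbt assets")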
d

daniel

10/19/2022, 7:34 PM
yeah i think the slow query is the thing we need to squash
a

Adam Bloom

10/19/2022, 7:35 PM
Limit  (cost=0.44..7277.46 rows=1 width=595) (actual time=101759.561..101759.562 rows=0 loops=1)
  ->  Index Scan Backward using event_logs_pkey on event_logs  (cost=0.44..2910810.03 rows=400 width=595) (actual time=101759.560..101759.560 rows=0 loops=1)
        Filter: ((dagster_event_type = 'ASSET_MATERIALIZATION'::text) AND ((asset_key = '["<key_part_1>", "<key_part_2>"]'::text) OR (asset_key = '<key>'::text)))
        Rows Removed by Filter: 23635163
Planning Time: 0.102 ms
Execution Time: 101759.581 ms
d

daniel

10/19/2022, 7:35 PM
ah do you have the query too?
a

Adam Bloom

10/19/2022, 7:35 PM
it's trying to use the pkey index, so that's a full table scan?
a

alex

10/19/2022, 7:35 PM
have you had this instance for many dagster versions? have you run dagster instance migrate / dagster instance reindex?
s

sandy

10/19/2022, 7:36 PM
are you using build_asset_reconciliation_sensor?
a

Adam Bloom

10/19/2022, 7:36 PM
SELECT event_logs.id, event_logs.event 
FROM event_logs 
WHERE event_logs.dagster_event_type = 'ASSET_MATERIALIZATION' AND (event_logs.asset_key = '["<key_part_1>", "<key_part_2>"]' OR event_logs.asset_key = '<key_part_1>.<key_part_2>') ORDER BY event_logs.id DESC 
 LIMIT 1;
d

daniel

10/19/2022, 7:36 PM
@sandy the sensor code is above (i think the answer is no)
a

Adam Bloom

10/19/2022, 7:36 PM
yeah, not using an asset reconciliation sensor - just using some simple multi asset sensors
I see indexes on asset_key and event_type, so I think we're ok on migrations?
We first deployed back in June I want to say - haven't seen release notes calling for migrations since then (although I might have missed something)
👍 1
idx_event_type is an index over both dagster_event_type and event id - curious. I bet we can fix this with a new index. kinda tempted to add one just to see what happens if I add an asset_key index that is restricted to ASSET_MATERIALIZATION event types
other thing that's interesting - the multi-asset sensor is issuing individual queries for each asset_key instead of a single query for all assets. that's likely less efficient too
a

alex

10/19/2022, 7:46 PM
ya you could almost certainly add an index and resolve the poor performance
create index concurrently test_events_by_asset_idx on event_logs (asset_key, id) where asset_key is not null;
s

sandy

10/19/2022, 7:46 PM
@claire - thoughts here?
a

alex

10/19/2022, 7:48 PM
i think we'll release a similar (if not identical) index migration, id should be in the asset key index to be able to do ordered cursored lookups
a

Adam Bloom

10/19/2022, 7:48 PM
ah, makes sense. I'm testing
CREATE INDEX CONCURRENTLY idx_asset_key_materializations ON event_logs (asset_key) WHERE dagster_event_type='ASSET_MATERIALIZATION' AND asset_key IS NOT NULL;
now, but that doesn't have the ID
ah yeah, it needs to be there to use the index since there's an order by in the query
hmmm....still using the pkey index (so, full table scan)
a

alex

10/19/2022, 8:04 PM
hmm, what does the \d event_logs look like currently (assuming psql)?
a

Adam Bloom

10/19/2022, 8:19 PM
ahh, my index failed to create, it shows as invalid. probably need to drop/recreate
dagster=> \d event_logs
                                            Table "public.event_logs"
       Column       |            Type             | Collation | Nullable |                Default
--------------------+-----------------------------+-----------+----------+----------------------------------------
 id                 | integer                     |           | not null | nextval('event_logs_id_seq'::regclass)
 run_id             | character varying(255)      |           |          |
 event              | text                        |           | not null |
 dagster_event_type | text                        |           |          |
 timestamp          | timestamp without time zone |           |          |
 step_key           | text                        |           |          |
 asset_key          | text                        |           |          |
 partition          | text                        |           |          |
Indexes:
    "event_logs_pkey" PRIMARY KEY, btree (id)
    "idx_asset_key" btree (asset_key)
    "idx_asset_key_materializations" btree (id, asset_key) WHERE dagster_event_type = 'ASSET_MATERIALIZATION'::text AND asset_key IS NOT NULL INVALID
    "idx_asset_partition" btree (asset_key, partition)
    "idx_event_type" btree (dagster_event_type, id)
    "idx_run_id" btree (run_id)
    "idx_step_key" btree (step_key)
a

alex

10/19/2022, 8:21 PM
you can just do a reindex to repair i think. Also I believe having id first vs second may also be a factor, so may want to try (asset_key, id) if the one you have doesn't work.
thanks for all the follow up on this report by the way, much appreciated
➕ 1
a

Adam Bloom

10/19/2022, 8:25 PM
a reindex/rebuild locks the table, and dagster is still running, so I went with drop/re-create. trying that now. And back at all of you - super appreciate the support for an open source product!
:dagsir: 1
new index is in place, and queries seem to be using it. I'm going to turn on a few more sensors (get back past the tipping point).
:nice: 1
ok, they're all back on. they're still taking quite a while in some cases (I saw ~1 minute), but I haven't seen any of the grpc timeouts
lol, I suppose that was tempting fate. One did just show up. I can probably work around this too by modifying the asset selection. it's still taking 1-2 seconds per asset to run that query, so having a sensor selector of all upstream assets isn't helping my case.
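something like this is what I'm thinking for the workaround - rough sketch, nothing final: a factory that builds several smaller multi-asset sensors over chunks of keys instead of one sensor watching every upstream asset, so each tick issues far fewer per-key queries

from dagster import DefaultSensorStatus, RunRequest, SkipReason, multi_asset_sensor


def make_dbt_test_sensor(name, asset_keys, job):
    # hypothetical factory: each sensor only watches a small group of upstream keys
    @multi_asset_sensor(
        asset_keys=asset_keys,
        job=job,
        name=name,
        default_status=DefaultSensorStatus.RUNNING,
    )
    def _sensor(context):
        asset_events = context.latest_materialization_records_by_key()
        if all(asset_events.values()):
            context.advance_all_cursors()
            yield RunRequest()
        else:
            yield SkipReason("Waiting for new dbt assets")

    return _sensor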
a

alex

10/19/2022, 8:44 PM
what's the updated explain analyze?
a

Adam Bloom

10/19/2022, 8:45 PM
Limit  (cost=0.43..2172.80 rows=1 width=595) (actual time=4019.545..4019.546 rows=0 loops=1)
  ->  Index Scan Backward using idx_asset_key_materializations on event_logs  (cost=0.43..864605.62 rows=398 width=595) (actual time=4019.544..4019.545 rows=0 loops=1)
        Filter: ((asset_key = '["<key_part_1>", "<key_part_2>"]'::text) OR (asset_key = '<key_part_1>.<key_part_2>'::text))
        Rows Removed by Filter: 3439164
Planning Time: 0.485 ms
Execution Time: 4019.567 ms
could try flipping the column order?
🙏 1
🤞 1
a

alex

10/19/2022, 8:47 PM
that's what i was about to suggest
a

Adam Bloom

10/19/2022, 8:48 PM
I'll add another index with it flipped, see if it switches to that
yeah, it very much wanted that order flipped
Limit  (cost=1612.63..1612.63 rows=1 width=595) (actual time=0.029..0.030 rows=0 loops=1)
  ->  Sort  (cost=1612.63..1613.62 rows=398 width=595) (actual time=0.029..0.029 rows=0 loops=1)
        Sort Key: id DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Bitmap Heap Scan on event_logs  (cost=28.29..1610.64 rows=398 width=595) (actual time=0.025..0.025 rows=0 loops=1)
              Recheck Cond: (((asset_key = '["<key_part_1>", "<key_part_2>"]'::text) AND (dagster_event_type = 'ASSET_MATERIALIZATION'::text)) OR ((asset_key = '<key_part_1>.<key_part_2>'::text) AND (dagster_event_type = 'ASSET_MATERIALIZATION'::text)))
              ->  BitmapOr  (cost=28.29..28.29 rows=398 width=0) (actual time=0.023..0.023 rows=0 loops=1)
                    ->  Bitmap Index Scan on idx_asset_key_materializations_2  (cost=0.00..14.05 rows=199 width=0) (actual time=0.018..0.018 rows=0 loops=1)
                          Index Cond: (asset_key = '["<key_part_1>", "<key_part_2>"]'::text)
                    ->  Bitmap Index Scan on idx_asset_key_materializations_2  (cost=0.00..14.05 rows=199 width=0) (actual time=0.005..0.005 rows=0 loops=1)
                          Index Cond: (asset_key = '<key_part_1>.<key_part_2>'::text)
Planning Time: 0.163 ms
Execution Time: 0.053 ms
💰 1
a

alex

10/19/2022, 8:53 PM
awesome, we'll get a migration added that fixes up these indexes
a

Adam Bloom

10/19/2022, 8:56 PM
final index definition:
CREATE INDEX CONCURRENTLY idx_asset_key_materializations_2 ON event_logs (asset_key, id) WHERE dagster_event_type='ASSET_MATERIALIZATION' AND asset_key IS NOT NULL;
Just following up to thank all of you for your help yesterday - everything looks to have run smoothly overnight. Really appreciate your suggestions...I might have gotten to that database query eventually, but would have taken me a bit longer. Only other snag I hit I covered in a separate thread (regarding sensors being mistakenly labelled as schedules in the database, and therefore not running) - lmk if I should just move that one to a github issue.