# ask-community
a
Hi team, we are currently in the process of migrating our Dagster prod k8s environment from 0.15.8 to 1.3.1. Our migration pod has been stuck for a long time, and I'm not sure if there is any way to monitor what's going on. Here are the logs I am seeing on the migrate pod. It's been stuck at
Querying run storage
for more than 30 mins
Copy code
time="2023-05-04T00:32:16Z" level=info msg="spawning process: [dagster instance migrate]" app=vault-env
$DAGSTER_HOME: /opt/dagster/dagster_home

Updating run storage...
Skipping already applied data migration: run_partitions
Starting data migration: run_repo_label_tags
Querying run storage.
@daniel Sorry for tagging. We are in the middle of the migration, but wanted to check if you have any thoughts before reverting the migration process
Looks like we are stuck here: https://github.com/dagster-io/dagster/blob/5ffc94927bac25a82463d85dd469aa3e8468a3f[…]/python_modules/dagster/dagster/_core/storage/runs/migration.py. It's trying to fetch runs in chunks of 100 to migrate tags, and we have more than 1.3 million runs.
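For anyone reading later: that data migration walks the runs table in chunks, deserializes each run, and writes back tag rows. A rough, self-contained sketch of that shape (not Dagster's actual code; the table/column names and SQL here are illustrative assumptions) looks like this:
```python
# Illustrative sketch of a chunked run-tag data migration -- NOT Dagster's
# actual implementation. Table/column names (runs, run_tags, run_body) are
# assumptions for the example only.
import json
from sqlalchemy import create_engine, text

CHUNK_SIZE = 100  # the real migration also pages through runs in chunks of 100


def backfill_run_tags(conn_str: str) -> None:
    engine = create_engine(conn_str)
    last_id = 0
    with engine.begin() as conn:  # single transaction for simplicity
        while True:
            rows = conn.execute(
                text(
                    "SELECT id, run_id, run_body FROM runs "
                    "WHERE id > :last_id ORDER BY id LIMIT :limit"
                ),
                {"last_id": last_id, "limit": CHUNK_SIZE},
            ).fetchall()
            if not rows:
                break
            for row_id, run_id, run_body in rows:
                # deserialization step: a corrupt row would raise JSONDecodeError here
                run = json.loads(run_body)
                for key, value in run.get("tags", {}).items():
                    conn.execute(
                        text(
                            "INSERT INTO run_tags (run_id, key, value) "
                            "VALUES (:run_id, :key, :value)"
                        ),
                        {"run_id": run_id, "key": key, "value": value},
                    )
                last_id = row_id
```
With 1.3 million rows, scanning and rewriting the table 100 runs at a time is inherently slow, which matches what we are seeing.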
d
I could definitely imagine that migration taking a while for that many runs, yeah
a
Hmm, we waited for more than 2.5 hrs and it was not even able to finish 10% of the runs. Then the migration failed with the following error:
Copy code
Updating run storage...
Skipping already applied data migration: run_partitions
Starting data migration: run_repo_label_tags
Querying run storage.
Traceback (most recent call last):
  File "/usr/local/bin/dagster", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/dagster/_cli/__init__.py", line 46, in main
    cli(auto_envvar_prefix=ENV_PREFIX)  # pylint:disable=E1123
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_cli/instance.py", line 49, in migrate_command
    instance.upgrade(click.echo)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/instance/__init__.py", line 830, in upgrade
    self._run_storage.migrate(print_fn)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 986, in migrate
    self._execute_data_migrations(REQUIRED_DATA_MIGRATIONS, print_fn, force_rebuild_all)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 980, in _execute_data_migrations
    migration_fn()(self, print_fn)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/migration.py", line 197, in migrate_run_repo_tags
    run = deserialize_value(row[0], DagsterRun)
  File "/usr/local/lib/python3.7/site-packages/dagster/_serdes/serdes.py", line 658, in deserialize_value
    packed_value = seven.json.loads(val)
  File "/usr/local/lib/python3.7/json/__init__.py", line 361, in loads
    return cls(**kw).decode(s)
  File "/usr/local/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 3631 (char 3630)
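The JSONDecodeError at that offset suggests at least one row in the runs table contains a truncated or otherwise invalid serialized run. Before retrying, a small script like the one below can locate the bad rows (assumptions: a SQLAlchemy-compatible connection string, and that the serialized run lives in a run_body column of a runs table; adjust to the actual run storage schema):
```python
# Hypothetical diagnostic: list run rows whose serialized body fails json.loads.
# Table/column names are assumptions -- check the actual run storage schema.
import json
from sqlalchemy import create_engine, text


def find_corrupt_runs(conn_str, chunk_size=1000):
    engine = create_engine(conn_str)
    bad, last_id = [], 0
    with engine.connect() as conn:
        while True:
            rows = conn.execute(
                text(
                    "SELECT id, run_id, run_body FROM runs "
                    "WHERE id > :last_id ORDER BY id LIMIT :limit"
                ),
                {"last_id": last_id, "limit": chunk_size},
            ).fetchall()
            if not rows:
                break
            for row_id, run_id, run_body in rows:
                try:
                    json.loads(run_body)
                except (TypeError, ValueError):
                    bad.append(run_id)  # row whose body cannot be parsed as JSON
                last_id = row_id
    return bad
```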
d
How important is it to keep all 1.3 million runs?
(it's fine if the answer is 'very important', just checking)
a
Now the DB is in a corrupted state and we had downtime. So we tried to revert to the old state by restoring the DB snapshot. However, the daemon keeps hanging every 5 mins.
d
This is on the old version but before the migration?
er on the new version?
a
Yes, this is the old version which we snapshotted before starting the migration
d
I think you're going to need to complete the migration for things to run smoothly on the new version, especially with 1.3 million runs
i'll surface this with the team and check about that error
lmk the answer to my question above about whether or not clearing out some old runs might be an option
a
We actually want to get back to the old state to avoid more downtime, hence we reverted the DB to the old snapshot we took before starting the migration. However, that does not seem to work.
d
The same code and database that were working before are no longer working?
👌 1
If you can send along logs from your daemon while it's hanging, we can take a look
a
Clearing some old runs might be fine. But at this point we are just trying to get back to the old state
d
I don't have an immediate explanation for why things would stop working if nothing changed
but logs may help
a
I don't see anything off in the daemon logs. It just stopped abruptly at a particular time.
Copy code
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Launching run for fabricator_cng_clip_item_embeddings_v2_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Completed launch of run 5c0a127d-8d11-4b22-9b14-462a329d15ac for fabricator_cng_clip_item_embeddings_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_clip_query_embedding_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Completed launch of run a45eb160-ccc1-4fa2-9f45-0c3f27dbc190 for fabricator_cng_clip_item_embeddings_v2_sensor
2023-05-04 15:31:42 +0000 - dagster - DEBUG - fabricator_cng_clip_query_embedding - f6209ab3-5058-434c-96e1-b2d3bfccb1db - ASSET_MATERIALIZATION_PLANNED - fabricator_cng_clip_query_embedding intends to materialize asset ["datalake", "fact_clip_cng_query_embedding"]
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Launching run for fabricator_cng_clip_query_embedding_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_clip_query_embeddings_feature_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor fabricator_cng_clip_query_embeddings_feature_sensor skipped: Skipping because following dependencies for cng_clip_query_embeddings_feature were not satisfied : 0 13 * * *
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Completed launch of run f6209ab3-5058-434c-96e1-b2d3bfccb1db for fabricator_cng_clip_query_embedding_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_consumer_item_embeddings_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor fabricator_cng_consumer_item_embeddings_sensor skipped: Skipping because following dependencies for cng_consumer_item_embeddings were not satisfied : 0 12 * * *
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_consumer_query_embeddings_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor fabricator_cng_consumer_query_embeddings_sensor skipped: Skipping because following dependencies for cng_consumer_query_embeddings were not satisfied : 0 12 * * *
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_daily_sku_count_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_eta_delivery_base_instance_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_infp_v3_predictions_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_eta_delivery_base_instance_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_infp_v3_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_daily_sku_count_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_infp_v3_predictions_sensor
After this I don't see any logs
d
It's hanging? Or crashed?
What version of dagster is this?
a
The daemon pod looks healthy though
It's 0.15.8
d
Right
Tough situation - the old version is missing tons and tons of perf improvements and bugfixes, but is now running at scale - the new version has migration difficulties
a
The old version was running at the same scale before and it was working fine. I am trying to see if there is anything off anywhere
d
The same py-spy tool that I recommended on a different thread for perf investigation can also help identify why a python process is hanging, if that's what's happening here: https://github.com/benfred/py-spy
a
The CPU and memory seem to be fine for the daemon pod. When I restart the daemon pod, it works for some time but eventually hangs
d
got it - if it's hanging, running my-spy is what i'd recommend to get to the bottom of why
py-spy rather
but i have no explanation for why this would be happening now if the data and code haven't changed
a
Ok, I am able to find this:
Copy code
Traceback (most recent call last):
  File "/usr/local/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj={})  # pylint:disable=E1123
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 273, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 248, in check_daemon_heartbeats
    failed_daemons=failed_daemons
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SCHEDULER']
The daemon hangs only after this error. Trying to find if there is anything wrong with the scheduler daemon
d
what version of grpcio do you have installed?
at some point we introduced a <1.48.0 pin due to some hangs - but that's just a guess
question i'd have is whether the thread died or is hanging - if it's hanging, py-spy will explain why
a
It's 1.47.0
d
OK, likely not that then
a
How can I run py-spy in a running pod? Do I have to attach it before starting the deployment?
d
in k8s you have to add this securityContext: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-kubernetes But then you can just pip install py-spy and run it
(on the pod)
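To make that concrete, per the py-spy README the container needs the SYS_PTRACE capability in its securityContext, roughly:
```yaml
# Added to the daemon container spec so py-spy can attach to processes:
securityContext:
  capabilities:
    add:
      - SYS_PTRACE
```
With that in place, you can exec into the pod, install py-spy, and dump the stacks of the hanging process (the PID here is an assumption; the daemon is usually PID 1 in the container, but adjust if not):
```bash
kubectl exec -it <daemon-pod-name> -- /bin/bash
pip install py-spy
py-spy dump --pid 1   # assumes the daemon process is PID 1
```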
a
I see. Not sure how I can tweak the security context; most of the k8s tooling is controlled by our internal team. At this point, I am not sure that's the fastest way to get back to a healthy state
d
that's the main tool in my toolbox for debugging hanging pods - if that's not an option, you could try selectively disabling schedules until things get back in a good state, or take a closer look through the logs, find the last thing that happened, and see if it's suspicious...
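If you go the selective-disabling route, the schedule CLI can do it (exact flags for pointing the CLI at your workspace/repository may vary by version and setup):
```bash
dagster schedule list                   # list the schedules the instance knows about
dagster schedule stop <schedule_name>   # stop one schedule at a time
```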
a
Is there any way we can disable the scheduler daemon?
d
you could set dagsterDaemon.heartbeatTolerance to a very high number in your helm chart and that would prevent it from crashing if the scheduler daemon is hanging
i don't think there's a way to specifically disable the scheduler daemon
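For reference, in the Helm values that would be something like this (the tolerance is in seconds; 1200 is an arbitrary example):
```yaml
dagsterDaemon:
  heartbeatTolerance: 1200  # seconds a daemon thread may go without a heartbeat before the process exits
```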
a
We had to truncate all the past runs and complete the migration
d
Got it - are things going more smoothly now?
a
It looks to be running fine now. I assume the sensor cursors and ticks are still persisted though? So that we are not re-running all the jobs
d
Yeah, those should stay the same
a
Thanks Daniel for the support. However, this is surely not ideal, as we had a long downtime and a very tough migration (we spent a long time trying different things to make it work). Some warnings in the migration guide about scale considerations would have been helpful. Also, the current migration approach does not seem scalable; having to update run tags 100 runs at a time might not be ideal. Not sure how it worked for other deployments' migrations.
Do you think there is any way we can backfill the runs into our prod DB? I don't think we can afford to lose all the past run information
d
The data migration should be idempotent - so if you added them back and re-ran it, that could work. I'm not sure that would be a zero-downtime operation though
I'll pass the feedback about making large-scale migrations easier and with less downtime on to the team
a
Thanks, I am not sure we want to take the risk of doing manual DB operations and messing things up now. We might be just fine losing the run information.
Does the Dagster team have any recommended practices for periodic cleanup of runs and other similar tables? Do you know how other teams handle it today?
d
We have this for ticks but not yet for runs: https://docs.dagster.io/deployment/dagster-instance#data-retention I'll check about the status of run retention
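For reference, the tick retention from those docs is configured in dagster.yaml along these lines (the day counts are just examples):
```yaml
retention:
  schedule:
    purge_after_days: 90  # purge schedule ticks older than 90 days
  sensor:
    purge_after_days:
      skipped: 7
      failure: 30
      success: -1  # -1 keeps these ticks indefinitely
```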
a
Yeah, we are already using it for sensor ticks. Would also be great to have a similar setting for runs
d
I think the feature request tracking that is here: https://github.com/dagster-io/dagster/issues/4100
I'm about to head out - can you make a new post for the new question?
👍 1
thankyou 1
Here’s a better example for cleaning up older runs https://github.com/dagster-io/dagster/discussions/12047
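As a rough sketch of that approach (a cleanup script that deletes old runs through the public DagsterInstance API; the cutoff, batch size, and the handling of naive timestamps are illustrative assumptions):
```python
# Hypothetical cleanup script: delete runs older than a cutoff, oldest first,
# in small batches. Deleting a run through the instance also removes its
# associated event log entries and tags.
from datetime import datetime, timedelta, timezone

from dagster import DagsterInstance


def _as_utc(dt):
    # Treat naive timestamps as UTC so the comparison below always works.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def purge_old_runs(days=90, batch_size=100):
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    instance = DagsterInstance.get()  # requires DAGSTER_HOME to point at the instance
    while True:
        # Oldest runs first; RunRecord exposes create_timestamp and the run itself.
        records = instance.get_run_records(limit=batch_size, ascending=True)
        old = [r for r in records if _as_utc(r.create_timestamp) < cutoff]
        if not old:
            break
        for record in old:
            instance.delete_run(record.dagster_run.run_id)


if __name__ == "__main__":
    purge_old_runs()
```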
a
Thanks for sharing. Curious if we are expected to do this at a regular cadence? Are there any numbers on what scale Dagster can support today? I want to understand this to determine how often I need to run these cleanup jobs
d
I don't think there's a single number, because there are so many variables at play (the size of your DB, the size of the runs, the # of events, etc.)
a
In this case it was solely dependent on the size of the runs table, since the migration finished quickly once we truncated it. I am just wondering how it worked well for other users, and I want to take appropriate actions to avoid this in our next upgrade
d
The issue you ran into there was fairly specific to that migration I think - I don't expect similar problems going forward
thankyou 1
a
Also, previously I was told that I can assume the entire migration runs in a single transaction. It does not look like that's true anymore?
d
I think that applies to schema migrations - this was a data migration
a
I see. Sorry for the bunch of questions, and thanks for answering 🙂 Just want to avoid the pain during our next upgrade and prepare better. Next time, we probably need to have a better idea of what kind of migration is going to happen, or even test it beforehand on a prod DB replica.
d
Testing it out first on a replica sounds like a great idea to me yeah
👍 1
a
In case you have any thoughts on this question, I would really appreciate your response. Thanks! https://dagster.slack.com/archives/C01U954MEER/p1683239393374199