# ask-community
a
Hi team, we are currently in the process of migrating our Dagster prod k8s environment from 0.15.8 to 1.3.1. Our migration pod has been stuck for a long time, and I'm not sure if there is any way to monitor what's going on. Here are the logs I am seeing on the migrate pod. It's been stuck at
Querying run storage
for more than 30 mins
Copy code
time="2023-05-04T00:32:16Z" level=info msg="spawning process: [dagster instance migrate]" app=vault-env
$DAGSTER_HOME: /opt/dagster/dagster_home

Updating run storage...
Skipping already applied data migration: run_partitions
Starting data migration: run_repo_label_tags
Querying run storage.
@daniel Sorry for tagging. We are in the middle of the migration, but wanted to check if you have any thoughts before reverting the migration process
Looks like we are stuck here: https://github.com/dagster-io/dagster/blob/5ffc94927bac25a82463d85dd469aa3e8468a3f[…]/python_modules/dagster/dagster/_core/storage/runs/migration.py. It's trying to fetch runs in chunks of 100 to migrate tags, and we have more than 1.3 million runs.
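For anyone reading later: that data migration walks the runs table in chunks, deserializes each run, and writes back tag rows. A rough, self-contained sketch of that shape (not Dagster's actual code; the table/column names and SQL here are illustrative assumptions) looks like this:
```python
# Illustrative sketch of a chunked run-tag data migration -- NOT Dagster's
# actual implementation. Table/column names (runs, run_tags, run_body) are
# assumptions for the example only.
import json
from sqlalchemy import create_engine, text

CHUNK_SIZE = 100  # the real migration also pages through runs in chunks of 100


def backfill_run_tags(conn_str: str) -> None:
    engine = create_engine(conn_str)
    last_id = 0
    with engine.begin() as conn:  # single transaction for simplicity
        while True:
            rows = conn.execute(
                text(
                    "SELECT id, run_id, run_body FROM runs "
                    "WHERE id > :last_id ORDER BY id LIMIT :limit"
                ),
                {"last_id": last_id, "limit": CHUNK_SIZE},
            ).fetchall()
            if not rows:
                break
            for row_id, run_id, run_body in rows:
                # deserialization step: a corrupt row would raise JSONDecodeError here
                run = json.loads(run_body)
                for key, value in run.get("tags", {}).items():
                    conn.execute(
                        text(
                            "INSERT INTO run_tags (run_id, key, value) "
                            "VALUES (:run_id, :key, :value)"
                        ),
                        {"run_id": run_id, "key": key, "value": value},
                    )
                last_id = row_id
```
With 1.3 million rows, scanning and rewriting the table 100 runs at a time is inherently slow, which matches what we are seeing.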
d
I could definitely imagine that migration taking a while for that many runs, yeah
a
Hmm, we waited for more than 2.5 hrs and it was not even able to finish 10% of the runs. Then the migration failed with the following error:
Copy code
Updating run storage...
Skipping already applied data migration: run_partitions
Starting data migration: run_repo_label_tags
Querying run storage.
Traceback (most recent call last):
  File "/usr/local/bin/dagster", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/dagster/_cli/__init__.py", line 46, in main
    cli(auto_envvar_prefix=ENV_PREFIX)  # pylint:disable=E1123
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_cli/instance.py", line 49, in migrate_command
    instance.upgrade(click.echo)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/instance/__init__.py", line 830, in upgrade
    self._run_storage.migrate(print_fn)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 986, in migrate
    self._execute_data_migrations(REQUIRED_DATA_MIGRATIONS, print_fn, force_rebuild_all)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 980, in _execute_data_migrations
    migration_fn()(self, print_fn)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/migration.py", line 197, in migrate_run_repo_tags
    run = deserialize_value(row[0], DagsterRun)
  File "/usr/local/lib/python3.7/site-packages/dagster/_serdes/serdes.py", line 658, in deserialize_value
    packed_value = seven.json.loads(val)
  File "/usr/local/lib/python3.7/json/__init__.py", line 361, in loads
    return cls(**kw).decode(s)
  File "/usr/local/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 3631 (char 3630)
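The JSONDecodeError at that offset suggests at least one row in the runs table contains a truncated or otherwise invalid serialized run. Before retrying, a small script like the one below can locate the bad rows (assumptions: a SQLAlchemy-compatible connection string, and that the serialized run lives in a run_body column of a runs table; adjust to the actual run storage schema):
```python
# Hypothetical diagnostic: list run rows whose serialized body fails json.loads.
# Table/column names are assumptions -- check the actual run storage schema.
import json
from sqlalchemy import create_engine, text


def find_corrupt_runs(conn_str, chunk_size=1000):
    engine = create_engine(conn_str)
    bad, last_id = [], 0
    with engine.connect() as conn:
        while True:
            rows = conn.execute(
                text(
                    "SELECT id, run_id, run_body FROM runs "
                    "WHERE id > :last_id ORDER BY id LIMIT :limit"
                ),
                {"last_id": last_id, "limit": chunk_size},
            ).fetchall()
            if not rows:
                break
            for row_id, run_id, run_body in rows:
                try:
                    json.loads(run_body)
                except (TypeError, ValueError):
                    bad.append(run_id)  # row whose body cannot be parsed as JSON
                last_id = row_id
    return bad
```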
d
How important is it to keep all 1.3 million runs?
(it's fine if the answer is 'very important', just checking)
a
Now the DB is in a corrupted state and we had downtime. So we tried to revert to the old state by restoring the DB snapshot. However, the daemon keeps hanging every 5 mins.
d
This is on the old version but before the migration?
er on the new version?
a
Yes, this is the old version which we snapshotted before starting the migration
d
I think you're going to need to complete the migration for things to run smoothly on the new version, especially with 1.3 million runs
i'll surface this with the team and check about that error
lmk the answer to my question above about whether or not clearing out some old runs might be an option
a
We actually want to get back to the old state to avoid more downtime, hence we reverted the DB to the old snapshot we took before starting the migration. However, that does not seem to work.
d
The same code and database that were working before are no longer working?
👌 1
If you can send along logs from your daemon while it's hanging, we can take a look
a
Clearing some old runs might be fine. But at this point we are just trying to get back to the old state
d
I don't have an immediate explanation for why things would stop working if nothing changed
but logs may help
a
I don't see anything off in the daemon logs. It just stopped abruptly at a particular time.
Copy code
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Launching run for fabricator_cng_clip_item_embeddings_v2_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Completed launch of run 5c0a127d-8d11-4b22-9b14-462a329d15ac for fabricator_cng_clip_item_embeddings_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_clip_query_embedding_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Completed launch of run a45eb160-ccc1-4fa2-9f45-0c3f27dbc190 for fabricator_cng_clip_item_embeddings_v2_sensor
2023-05-04 15:31:42 +0000 - dagster - DEBUG - fabricator_cng_clip_query_embedding - f6209ab3-5058-434c-96e1-b2d3bfccb1db - ASSET_MATERIALIZATION_PLANNED - fabricator_cng_clip_query_embedding intends to materialize asset ["datalake", "fact_clip_cng_query_embedding"]
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Launching run for fabricator_cng_clip_query_embedding_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_clip_query_embeddings_feature_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor fabricator_cng_clip_query_embeddings_feature_sensor skipped: Skipping because following dependencies for cng_clip_query_embeddings_feature were not satisfied : 0 13 * * *
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Completed launch of run f6209ab3-5058-434c-96e1-b2d3bfccb1db for fabricator_cng_clip_query_embedding_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_consumer_item_embeddings_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor fabricator_cng_consumer_item_embeddings_sensor skipped: Skipping because following dependencies for cng_consumer_item_embeddings were not satisfied : 0 12 * * *
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_consumer_query_embeddings_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Sensor fabricator_cng_consumer_query_embeddings_sensor skipped: Skipping because following dependencies for cng_consumer_query_embeddings were not satisfied : 0 12 * * *
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_daily_sku_count_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_eta_delivery_base_instance_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_infp_v3_predictions_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_eta_delivery_base_instance_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: fabricator_cng_infp_v3_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_daily_sku_count_sensor
2023-05-04 15:31:42 +0000 - dagster.daemon.SensorDaemon - INFO - Creating new run for fabricator_cng_infp_v3_predictions_sensor
After this I don't see any logs
d
It's hanging? Or crashed?
What version of dagster is this?
a
The daemon pod looks healthy though
It's 0.15.8
d
Right
Tough situation - the old version is missing tons and tons of perf improvements and bugfixes, but is now running at scale - the new version has migration difficulties
a
The old version was running at the same scale before and it was working fine. I am trying to see if there is anything off anywhere
d
The same py-spy tool that I recommended on a different thread for perf investigation can also help identify why a python process is hanging, if that's what's happening here: https://github.com/benfred/py-spy
a
The CPU and memory seem to be fine for the daemon pod. When I restart the daemon pod, it works for some time but eventually hangs
d
got it - if it's hanging, running my-spy is what i'd recommend to get to the bottom of why
py-spy rather
but i have no explanation for why this would be happening now if the data and code haven't changed
a
Ok, I am able to find this:
Copy code
Traceback (most recent call last):
  File "/usr/local/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj={})  # pylint:disable=E1123
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 273, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 248, in check_daemon_heartbeats
    failed_daemons=failed_daemons
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SCHEDULER']
The daemon hangs only after this error. Trying to find if there is anything wrong with the scheduler daemon
d
what version of grpcio do you have installed?
at some point we introduced a <1.48.0 pin due to some hangs - but that's just a guess
question i'd have is whether the thread died or is hanging - if it's hanging, py-spy will explain why
a
It's 1.47.0
d
OK, likely not that then
a
How can I run py-spy in a running pod? Do I have to attach it before starting the deployment?
d
in k8s you have to add this securityContext: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-kubernetes But then you can just pip install py-spy and run it
(on the pod)
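To make that concrete, per the py-spy README the container needs the SYS_PTRACE capability in its securityContext, roughly:
```yaml
# Added to the daemon container spec so py-spy can attach to processes:
securityContext:
  capabilities:
    add:
      - SYS_PTRACE
```
With that in place, you can exec into the pod, install py-spy, and dump the stacks of the hanging process (the PID here is an assumption; the daemon is usually PID 1 in the container, but adjust if not):
```bash
kubectl exec -it <daemon-pod-name> -- /bin/bash
pip install py-spy
py-spy dump --pid 1   # assumes the daemon process is PID 1
```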
a
I see. Not sure how I can tweak the security context; most of the k8s tooling is controlled by our internal team. At this point, I am not sure that's the fastest way to get back to a healthy state
d
that's the main tool in my toolbox for debugging hanging pods - if that's not an option, you could try selectively disabling schedules until things get back in a good state, or take a closer look through the logs, find the last thing that happened, and see if it's suspicious...
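If you go the selective-disabling route, the schedule CLI can do it (exact flags for pointing the CLI at your workspace/repository may vary by version and setup):
```bash
dagster schedule list                   # list the schedules the instance knows about
dagster schedule stop <schedule_name>   # stop one schedule at a time
```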
a
Is there any way we can disable the scheduler daemon?
d
you could set dagsterDaemon.heartbeatTolerance to a very high number in your helm chart and that would prevent it from crashing if the scheduler daemon is hanging
i don't think there's a way to specifically disable the scheduler daemon
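For reference, in the Helm values that would be something like this (the tolerance is in seconds; 1200 is an arbitrary example):
```yaml
dagsterDaemon:
  heartbeatTolerance: 1200  # seconds a daemon thread may go without a heartbeat before the process exits
```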
a
We had to truncate all the past runs and complete the migration
d
Got it - are things going more smoothly now?
a
It looks to be running fine now. I assume the sensor cursors and ticks are still persisted though? So that we are not re-running all the jobs
d
Yeah, those should stay the same
a
Thanks Daniel for the support. However, this is surely not ideal, as we had a long downtime and a very tough migration (we spent a long time trying different things to make it work). Some warnings in the migration guide about scale considerations would have been helpful. Also, the current migration approach does not seem scalable; having to update run tags 100 runs at a time might not be ideal. Not sure how it worked for other deployments' migrations.
Do you think there is any way we can backfill the runs into our prod DB? I don't think we can afford to lose all the past run information
d
The data migration should be idempotent - so if you added them back and re-ran it, that could work. I'm not sure that would be a zero-downtime operation though
I'll pass the feedback about making large-scale migrations easier and with less downtime on to the team
a
Thanks, I am not sure we want to take the risk of doing manual DB operations and messing things up now. We might be just fine losing the run information.
Does the Dagster team have any recommended practices for periodic cleanup of runs and other similar tables? Do you know how other teams handle it today?
d
We have this for ticks but not yet for runs: https://docs.dagster.io/deployment/dagster-instance#data-retention I'll check about the status of run retention
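For reference, the tick retention from those docs is configured in dagster.yaml along these lines (the day counts are just examples):
```yaml
retention:
  schedule:
    purge_after_days: 90  # purge schedule ticks older than 90 days
  sensor:
    purge_after_days:
      skipped: 7
      failure: 30
      success: -1  # -1 keeps these ticks indefinitely
```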
a
Yeah, we are already using it for sensor ticks. Would also be great to have a similar setting for runs
d
I think the feature request tracking that is here: https://github.com/dagster-io/dagster/issues/4100
I'm about to head out - can you make a new post for the new question?
👍 1
thankyou 1
Here’s a better example for cleaning up older runs https://github.com/dagster-io/dagster/discussions/12047
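As a rough sketch of that approach (a cleanup script that deletes old runs through the public DagsterInstance API; the cutoff, batch size, and the handling of naive timestamps are illustrative assumptions):
```python
# Hypothetical cleanup script: delete runs older than a cutoff, oldest first,
# in small batches. Deleting a run through the instance also removes its
# associated event log entries and tags.
from datetime import datetime, timedelta, timezone

from dagster import DagsterInstance


def _as_utc(dt):
    # Treat naive timestamps as UTC so the comparison below always works.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def purge_old_runs(days=90, batch_size=100):
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    instance = DagsterInstance.get()  # requires DAGSTER_HOME to point at the instance
    while True:
        # Oldest runs first; RunRecord exposes create_timestamp and the run itself.
        records = instance.get_run_records(limit=batch_size, ascending=True)
        old = [r for r in records if _as_utc(r.create_timestamp) < cutoff]
        if not old:
            break
        for record in old:
            instance.delete_run(record.dagster_run.run_id)


if __name__ == "__main__":
    purge_old_runs()
```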
a
Thanks for sharing. Curious if we are expected to do this at a regular cadence? Are there any numbers on what scale Dagster can support today? I want to understand this to determine how often I need to run these cleanup jobs
d
I don't think there's a single number, because there are so many variables at play (the size of your DB, the size of the runs, the # of events, etc.)
a
In this case it was solely dependent on the size of the runs table, since the migration finished quickly once we truncated it. I am just wondering how it worked well for other users, and I want to take appropriate actions to avoid this in our next upgrade
d
The issue you ran into there was fairly specific to that migration I think - I don't expect similar problems going forward
thankyou 1
a
Also, previously I was told that I can assume the entire migration runs in a single transaction. It does not look like that's true anymore?
d
I think that applies to schema migrations - this was a data migration
a
I see. Sorry for the bunch of questions, and thanks for answering 🙂 Just want to avoid the pain during our next upgrade and prepare better. Next time, we probably need to have a better idea of what kind of migration is going to happen, or even test it beforehand on a prod DB replica.
d
Testing it out first on a replica sounds like a great idea to me yeah
👍 1
a
In case you have any thoughts on this question, I would really appreciate your response. Thanks! https://dagster.slack.com/archives/C01U954MEER/p1683239393374199