# ask-community
m
Hello channel. I ran into some major technical issues today after running Dagster for a while. My EC2 instance was spawning a very high number of Dagster-related processes (around 7k), to the point that it was no longer accessible via ssh. This happened after I moved to a bigger machine: my suspicion is that the higher capacity allowed more threads to be spawned until some hard limit was hit. For context: I have been experiencing UI performance issues that seem to get worse over time as the number of runs increases. Anyway, here are a couple of questions for you:
1. Does it make sense that the high number of spawned threads and the sluggish UI are manifestations of the same cause (the high number of accrued runs)?
2. Is there any new feature, or plans, around time-to-live in the sqlite DB?
3. I am planning to start relying on Dagster for critical production processes soon. To avoid running into performance issues, my plan is to have the live processes run on one Dagster instance and leave all backfills to another machine running a separate Dagster instance. The idea here is that this latter instance is more easily purge-able (in the absence of a better TTL-related approach). What do you think about it? Is there any recommended way to run such a ‘secondary’ instance?
a
To summarize from debugging in DMs:
* each `load_from` entry in a `workspace` will lead to separate grpc server subprocesses spawned by `dagit` and the `daemon` if they are all run on the same machine. Consolidating loading is recommended if separate python environments are not needed.
* manually managing your grpc servers is another approach to more efficiently host dagit and the daemon on the same machine: https://docs.dagster.io/concepts/repositories-workspaces/workspaces#running-your-own-grpc-server
As for the specific grpc bit: do I then need one port per repo? How can I link dagit to several ports?
if you are going to manually spawn multiple grpc servers instead of consolidating your loading, you will need to use a `workspace.yaml` to provide the multiple `load_from` entries with different ports
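For reference, a minimal `workspace.yaml` sketch along those lines (the ports and location names here are just placeholders; each server would be started separately, e.g. with the `dagster api grpc` CLI, and both dagit and the daemon read the same workspace file):
load_from:
  # one entry per manually-managed grpc server, each on its own port
  - grpc_server:
      host: localhost
      port: 4266
      location_name: "repo_one"
  - grpc_server:
      host: localhost
      port: 4267
      location_name: "repo_two"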
p
For #2, we don’t have time-to-live features on the roadmap right now, but you might be able to leverage existing APIs to implement your own on a schedule. For example, you could set up a weekly (or daily) schedule that queries the instance for run records older than a certain time and deletes them. Relevant pseudo-code would look something like this (untested):
from dagster import solid

CHUNK_SIZE = 100

@solid
def prune_old_runs(context, prune_threshold_datetime):
    has_more = True
    while has_more:
        # fetch the oldest runs first, one chunk at a time
        run_records = context.instance.get_run_records(order_by="create_timestamp", ascending=True, limit=CHUNK_SIZE)
        has_more = len(run_records) == CHUNK_SIZE
        for record in run_records:
            if record.create_timestamp > prune_threshold_datetime:
                # remaining records are newer than the cutoff, so stop entirely
                has_more = False
                break
            # removes the run along with its event log entries
            context.instance.delete_run(record.pipeline_run.run_id)
Disclaimer: this will wipe the event log records for every run older than the threshold, regardless of its run status. You may want to add extra filters / conditions to preserve certain types of runs (in-progress, most recent runs for a partition, etc). This might also affect the asset history of assets materialized in the wiped runs.
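As a rough sketch of such a guard (assuming `RunRecord.pipeline_run` exposes an `is_finished` flag; adjust to whatever your Dagster version provides), you could gate the delete on a small helper:
def should_prune(record, prune_threshold_datetime):
    # Only prune runs that are both old enough and safely finished
    # (success / failure / canceled); queued or in-progress runs are kept.
    old_enough = record.create_timestamp <= prune_threshold_datetime
    return old_enough and record.pipeline_run.is_finished
and then call `context.instance.delete_run(...)` only when `should_prune(...)` returns True. If many old runs end up being preserved this way, you would also want to paginate with a cursor so the loop above still makes progress.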
m
That’s very useful, thanks!
a
note: `get_run_records` is brand new and may be renamed when it gets formal support & docs, likely by this week’s release
m
So the grpc server solution with a single file to load from generally did the trick (the number of threads dropped to 10-20% of what it used to be). However, stopping, disabling, and then re-enabling/restarting the systemd service controlling it (as part of my deployment process) results in the following error:
I suppose I need to stop the process gracefully? Or is it something else?
Actually I think the issue is elsewhere