# ask-community
m
Hello channel. I ran into some major technical issues today after running Dagster for a while. My EC2 instance was spawning a very high number of Dagster-related processes (around 7k), to the point that it was no longer accessible via ssh. This happened after I moved to a bigger machine: my suspicion is that the higher capacity allowed more threads to be spawned until some hard limit was hit. For context: I have been experiencing UI performance issues that seem to get worse over time as the number of runs increases. Anyway, here are a couple of questions for you:
1. Does it make sense that the high number of spawned threads and the sluggish UI are manifestations of the same cause (the high number of accrued runs)?
2. Is there any new feature, or plans, around time-to-live in the sqlite DB?
3. I am planning to start relying on Dagster for critical production processes soon. To avoid running into performance issues, my plan is to have the live processes run on one Dagster instance and leave all backfills to another machine running a separate Dagster instance. The idea here is that this latter instance is more easily purge-able (in the absence of a better TTL-related approach). What do you think about it? Is there any recommended way to run such a ‘secondary’ instance?
a
To summarize from debugging in DMs:
* each `load_from` entry in a `workspace` will lead to separate grpc server subprocesses spawned by `dagit` and the `daemon` if they are all run on the same machine. Consolidating loading is recommended if separate python environments are not needed.
* manually managing your grpc servers is another approach to more efficiently host dagit and the daemon on the same machine: https://docs.dagster.io/concepts/repositories-workspaces/workspaces#running-your-own-grpc-server
As for the specific grpc bit: do I then need one port per repo? How can I link dagit to several ports?
if you are going to manually spawn multiple grpc servers instead of consolidating your loading, you will need to use a `workspace.yaml` to provide the multiple `load_from` entries with different ports
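For reference, a minimal `workspace.yaml` sketch along those lines (the ports and location names here are just placeholders; each server would be started separately, e.g. with the `dagster api grpc` CLI, and both dagit and the daemon read the same workspace file):
load_from:
  # one entry per manually-managed grpc server, each on its own port
  - grpc_server:
      host: localhost
      port: 4266
      location_name: "repo_one"
  - grpc_server:
      host: localhost
      port: 4267
      location_name: "repo_two"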
p
For #2, we don’t have time-to-live features on the roadmap right now, but you might be able to leverage existing APIs to implement your own on a schedule. For example, you could set up a weekly (or daily) schedule that queries the instance for run records older than a certain time and deletes them. Relevant pseudo-code would look something like this (untested):
from dagster import solid

CHUNK_SIZE = 100

@solid
def prune_old_runs(context, prune_threshold_datetime):
    has_more = True
    while has_more:
        # fetch the oldest runs first, one chunk at a time
        run_records = context.instance.get_run_records(order_by="create_timestamp", ascending=True, limit=CHUNK_SIZE)
        has_more = len(run_records) == CHUNK_SIZE
        for record in run_records:
            if record.create_timestamp > prune_threshold_datetime:
                # remaining records are newer than the cutoff, so stop entirely
                has_more = False
                break
            # removes the run along with its event log entries
            context.instance.delete_run(record.pipeline_run.run_id)
Disclaimer: this will wipe the event log records for every run older than the threshold, regardless of its run status. You may want to add extra filters / conditions to preserve certain types of runs (in-progress, most recent runs for a partition, etc). This might also affect the asset history of assets materialized in the wiped runs.
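As a rough sketch of such a guard (assuming `RunRecord.pipeline_run` exposes an `is_finished` flag; adjust to whatever your Dagster version provides), you could gate the delete on a small helper:
def should_prune(record, prune_threshold_datetime):
    # Only prune runs that are both old enough and safely finished
    # (success / failure / canceled); queued or in-progress runs are kept.
    old_enough = record.create_timestamp <= prune_threshold_datetime
    return old_enough and record.pipeline_run.is_finished
and then call `context.instance.delete_run(...)` only when `should_prune(...)` returns True. If many old runs end up being preserved this way, you would also want to paginate with a cursor so the loop above still makes progress.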
m
That’s very useful, thanks!
a
note: `get_run_records` is brand new and may be renamed when it gets formal support & docs, likely by this week’s release
m
So the grpc server solution with a single file to load from generally did the trick (the number of threads dropped to 10-20% of what it used to be). However, stopping, disabling, and then re-enabling/restarting the systemd service controlling it (as part of my deployment process) results in the following error:
I suppose I need to stop the process gracefully? Or is it something else?
Actually I think the issue is elsewhere