Hi, I have dagster deployed and running on version `0.10.4` (dagster #announcements)

Brian Abelson

02/10/2021, 12:24 AM

Hi, I have dagster deployed and running on version `0.10.4`. Everything runs fine, except the scheduler seems to continually shut down after about 2-3 hours with the following error (pasted below). It seems that I have to restart the daemon continually to address this. is this normal? is there a way to suppress these errors? I'm invoking `dagster-daemon` via `supervisord` with the simple `run` command.

dagster.serdes.ipc.DagsterIPCProtocolError: Timeout: read stream has not received any data in 15 seconds
  File "/usr/local/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 86, in launch_scheduled_runs
    with RepositoryLocationHandle.create_from_repository_location_origin(
  File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 57, in create_from_repository_location_origin
    return ManagedGrpcPythonEnvRepositoryLocationHandle(repo_location_origin)
  File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 192, in __init__
    self.grpc_server_process = GrpcServerProcess(
  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 1037, in __init__
    self.server_process = open_server_process(
  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 942, in open_server_process
    wait_for_grpc_server(server_process, output_file)
  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 878, in wait_for_grpc_server
    event = read_unary_response(ipc_output_file, timeout=timeout, ipc_process=server_process)
  File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 39, in read_unary_response
    messages = list(ipc_read_event_stream(output_file, timeout=timeout, ipc_process=ipc_process))
  File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 152, in ipc_read_event_stream
    raise DagsterIPCProtocolError(

Brian Abelson

02/10/2021, 12:26 AM

It's kind of odd, it actually seems like the scheduler is still running but the UI makes it seem like it has stopped:

Brian Abelson

02/10/2021, 12:38 AM

any insight here? this error seems to happen with every deploy. it almost seems as if, when there's an intermittent connection timeout, the process just fails and then that's it, your scheduler is toast.

daniel

02/10/2021, 12:38 AM

Hi, is it possible to post logs from a period of time just before this error as well? I think something might be causing a process to not be able to start up earlier.

Brian Abelson

02/10/2021, 12:39 AM

sure, i can attempt to do so. as I said, it seems to happen after a couple of hours of uptime

Brian Abelson

02/10/2021, 12:40 AM

and then the `Status` window is seemingly stuck like this

Brian Abelson

02/10/2021, 12:41 AM

even though runs are still being triggered..

Brian Abelson

02/10/2021, 12:42 AM

here are the logs

2021-02-10 00:41:32 - SchedulerDaemon - INFO - Checking for new runs for the following schedules: dbt_run_all, mysql_drupal_to_psql_warehouse_all_else, mysql_drupal_to_psql_warehouse_commerce_fields, mysql_drupal_to_psql_warehouse_commerce_core, mysql_drupal_to_psql_warehouse_ioby_sf, mysql_drupal_to_psql_warehouse_commerce_donations, mysql_drupal_to_psql_warehouse_match_programs, mysql_drupal_to_psql_warehouse_node, mysql_drupal_to_psql_warehouse_people, mysql_drupal_to_psql_warehouse_projects, mysql_drupal_to_psql_warehouse_revisions
ioby-data | 2021-02-09 19:41:38 2021-02-10 00:41:37 - dagster - INFO - system - 5ae2890f-3653-4763-bbf2-89f4f196936e - copy_mysql_drupal_tables_to_psql_warehouse - WRITING MYSQL ioby_sf_opportunities TO tmp_ioby_data_pipelines_etl_mysql_drupal_to_psql_warehouse_1326a.ioby_sf_opportunities IN WAREHOUSE
ioby-data | 2021-02-09 19:41:48 2021-02-10 00:41:48 - SchedulerDaemon - ERROR - Scheduler failed for dbt_run_all : dagster.serdes.ipc.DagsterIPCProtocolError: Timeout: read stream has not received any data in 15 seconds
ioby-data | 2021-02-09 19:41:48 
ioby-data | 2021-02-09 19:41:48 Stack Trace:
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 86, in launch_scheduled_runs
ioby-data | 2021-02-09 19:41:48     with RepositoryLocationHandle.create_from_repository_location_origin(
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 57, in create_from_repository_location_origin
ioby-data | 2021-02-09 19:41:48     return ManagedGrpcPythonEnvRepositoryLocationHandle(repo_location_origin)
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 192, in __init__
ioby-data | 2021-02-09 19:41:48     self.grpc_server_process = GrpcServerProcess(
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 1037, in __init__
ioby-data | 2021-02-09 19:41:48     self.server_process = open_server_process(
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 942, in open_server_process
ioby-data | 2021-02-09 19:41:48     wait_for_grpc_server(server_process, output_file)
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 878, in wait_for_grpc_server
ioby-data | 2021-02-09 19:41:48     event = read_unary_response(ipc_output_file, timeout=timeout, ipc_process=server_process)
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 39, in read_unary_response
ioby-data | 2021-02-09 19:41:48     messages = list(ipc_read_event_stream(output_file, timeout=timeout, ipc_process=ipc_process))
ioby-data | 2021-02-09 19:41:48   File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 152, in ipc_read_event_stream
ioby-data | 2021-02-09 19:41:48     raise DagsterIPCProtocolError(
ioby-data | 2021-02-09 19:41:48

Brian Abelson

02/10/2021, 12:44 AM

i guess this is also the combined logs of `dagit` and `dagster-daemon`, since im running both via `supervisord`

daniel

02/10/2021, 12:47 AM

It’s also possible that the error is indicative of a larger problem (e.g. an out of memory error?) because I wouldn’t expect that error alone to shut down the whole scheduler process - there’s a catch around that codepath and it would normally try again a few seconds later

Brian Abelson

02/10/2021, 12:51 AM

it seems like it does try again, but that dagit gets stuck in an error state

Brian Abelson

02/10/2021, 12:53 AM

jobs are still actively running but the UI looks like this:

Brian Abelson

02/10/2021, 12:53 AM

refreshing the repo doesn't do anything

Brian Abelson

02/10/2021, 12:54 AM

it has the same error

daniel

02/10/2021, 12:56 AM

I’ll be able to take a closer look at this in an hour or two. In the meantime if it’s possible to check if you’re close to any memory limits when this is happening, that would help rule things out

Brian Abelson

02/10/2021, 1:04 AM

it doesn't look like i am (its deployed via digital ocean app platform), but maybe there are some secondary limits i'm not aware of.

Brian Abelson

02/10/2021, 1:05 AM

the CPU is high... i'm running everything on a single node without celery etc, so jobs are executed on the same instance that dagit is running on

Brian Abelson

02/10/2021, 1:06 AM

thanks for your help!

daniel

02/10/2021, 1:11 AM

Refreshing the repo giving the same error is a very useful clue - that means something about your current system state is making it impossible to launch the process that serves the repository information (in both dagit and the daemon) - that’s what leads me to believe it’s some kind of memory or other resource issue. Are there any useful errors or other logs in the command line output of your dagit process when you try to refresh the repo and it fails?

Brian Abelson

02/10/2021, 1:16 AM

just that timeout error.

Brian Abelson

02/10/2021, 1:18 AM

I also get this warning:

OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k

daniel

02/10/2021, 1:32 AM

I wonder if it could be so CPU capped that it’s taking a really long time to spin up a server and hits the timeout?

daniel

02/10/2021, 1:33 AM

If there’s any way to temporarily bring the CPU usage down a bit and see if the problem persists, that would help rule that out

Brian Abelson

02/10/2021, 2:14 PM

will try now.

Brian Abelson

02/10/2021, 2:31 PM

okay i scaled up the server and redeployed... from googling, that `OpenBLAS` warning seems to be associated with OOM errors in other cases, but cant be sure since some things continue to run?

Brian Abelson

02/10/2021, 2:32 PM

i left it running overnight, oddly 1/2 of the jobs still got triggered by the scheduler even though it was in an "error" state according to dagit.

daniel

02/10/2021, 2:58 PM

Some jobs starting and some jobs failing is consistent with the node being overloaded - processes are probably sporadically getting shut down as they run out of resources

Brian Abelson

02/10/2021, 3:13 PM

okay, yeah. runs seem to be triggering normally now

Brian Abelson

02/10/2021, 3:13 PM

they're also going much quicker, another sign that the CPU being pinned at 100 was the root issue

Brian Abelson

02/10/2021, 3:14 PM

im assuming best practice is to isolate the scheduler in its own container for this reason.

Brian Abelson

02/10/2021, 3:17 PM

okay, to summarize: both dagit and dagster-daemon communicate with your dagster code via an RPC process which, when your node is overwhelmed, can take a long time to respond or start up, and throws this error:

dagster.serdes.ipc.DagsterIPCProtocolError: Timeout: read stream has not received any data in 15 seconds

daniel

02/10/2021, 3:23 PM

Right, they both spin up subprocesses to load your code (unless you have set up your own gRPC server to do it and specified that in your workspace.yaml). The error message here could definitely be clearer though. And yeah, putting the scheduler and Dagit in their own containers separately from the CPU-bound workers would likely help with this.

Brian Abelson

02/10/2021, 4:45 PM

I suppose I might also implement the `QueuedRunCoordinator` to ensure that there are never too many jobs running at once?

Brian Abelson

02/10/2021, 4:45 PM

actually, it happened again. even with 2X the CPU.

daniel

02/10/2021, 4:46 PM

That would also likely help, yeah - if you're running into issues when launching a bunch of runs

Brian Abelson

02/10/2021, 5:04 PM

so basically if the CPU ever hits 100% on the node that the scheduler is running on youre toast?

Brian Abelson

02/10/2021, 5:08 PM

it would be preferable, i think, for the scheduler to exit with an error when it clearly cant do its job. at least that way i could dynamically restart it using something like supervisor. but right now i basically have to redeploy the entire image because it gets stuck in this error state.
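One way to approximate the "exit with an error" behaviour requested above is to watch the daemon's log output from outside. The following is a minimal sketch, not part of Dagster: `should_restart`, `window`, and `threshold` are hypothetical names, and the only thing taken from the thread is the error text itself.

```python
from collections import deque

# Substring copied from the error message shown in the logs in this thread.
TIMEOUT_MARKER = "DagsterIPCProtocolError: Timeout"


def should_restart(log_lines, window=200, threshold=3):
    """Return True if the last `window` log lines contain at least
    `threshold` IPC timeout errors.

    Hypothetical helper: the window and threshold values are assumptions
    and would need tuning against real daemon logs.
    """
    recent = deque(log_lines, maxlen=window)  # keep only the newest lines
    hits = sum(1 for line in recent if TIMEOUT_MARKER in line)
    return hits >= threshold
```

A wrapper process that exits non-zero when this returns True would let a supervisor restart the daemon instead of leaving it stuck; that wiring is not shown here and depends on the deployment.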

daniel

02/10/2021, 5:13 PM

Hm, I'd need more information to make that conclusion (re: CPU ever hitting 100%). Certainly if your system is in a state where spinning up a new process takes more than 15 seconds, Dagster is going to run into trouble. I like the idea of adding some monitoring to try to identify that the node is getting overloaded though - and agree we probably shouldn't just keep retrying forever if something fundamental like being unable to spin up a subprocess keeps failing repeatedly.
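The condition described here (spawning a new process taking longer than 15 seconds) can be checked directly on the node. This is a rough sketch using only the standard library; it times a bare Python interpreter start, which is an assumption standing in for the real gRPC server startup, and the function names and the 50% warning fraction are made up for illustration.

```python
import subprocess
import sys
import time

# Assumption: mirrors the 15 second timeout reported in the error message.
STARTUP_TIMEOUT_SECONDS = 15.0


def time_python_spawn(timeout=STARTUP_TIMEOUT_SECONDS):
    """Return the seconds needed to start and exit a trivial Python subprocess.

    Raises subprocess.TimeoutExpired if the child does not finish in time.
    """
    start = time.monotonic()
    subprocess.run([sys.executable, "-c", "pass"], check=True, timeout=timeout)
    return time.monotonic() - start


def is_node_overloaded(elapsed, warn_fraction=0.5, timeout=STARTUP_TIMEOUT_SECONDS):
    """Flag the node when a bare spawn uses at least warn_fraction of the timeout."""
    return elapsed >= warn_fraction * timeout


if __name__ == "__main__":
    try:
        elapsed = time_python_spawn()
        print(f"spawn took {elapsed:.2f}s, overloaded={is_node_overloaded(elapsed)}")
    except subprocess.TimeoutExpired:
        print("spawn did not finish within the timeout")
```

Running this periodically while the error is occurring would show whether plain process startup is slow on the node, independent of Dagster.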

Brian Abelson

02/10/2021, 10:23 PM

im still struggling with this... i was really hoping to avoid having to do a multi-container deploy but even when i implemented the `QueuedRunCoordinator`, i still got CPU spikes and the scheduler failed as before. im now seeing whether maybe supervisor is the culprit. if not that, i may just try the CronScheduling option...

daniel

02/10/2021, 10:25 PM

Is there any way to see which process the spikes are coming from? I'm not sure the cron scheduler is going to be any better unless the spikes are specifically caused by the daemon somehow - the cron scheduler also spins up a subprocess.

Brian Abelson

02/10/2021, 10:33 PM

the error seems to be occurring now even when there aren't cpu spikes

Timeout: read stream has not received any data in 15 seconds
ioby-data | 2021-02-10 17:33:15 
ioby-data | 2021-02-10 17:33:15 Stack Trace:
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 86, in launch_scheduled_runs
ioby-data | 2021-02-10 17:33:15     with RepositoryLocationHandle.create_from_repository_location_origin(
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 57, in create_from_repository_location_origin
ioby-data | 2021-02-10 17:33:15     return ManagedGrpcPythonEnvRepositoryLocationHandle(repo_location_origin)
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 192, in __init__
ioby-data | 2021-02-10 17:33:15     self.grpc_server_process = GrpcServerProcess(
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 1037, in __init__
ioby-data | 2021-02-10 17:33:15     self.server_process = open_server_process(
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 942, in open_server_process
ioby-data | 2021-02-10 17:33:15     wait_for_grpc_server(server_process, output_file)
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 878, in wait_for_grpc_server
ioby-data | 2021-02-10 17:33:15     event = read_unary_response(ipc_output_file, timeout=timeout, ipc_process=server_process)
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 39, in read_unary_response
ioby-data | 2021-02-10 17:33:15     messages = list(ipc_read_event_stream(output_file, timeout=timeout, ipc_process=ipc_process))
ioby-data | 2021-02-10 17:33:15   File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 152, in ipc_read_event_stream
ioby-data | 2021-02-10 17:33:15     raise DagsterIPCProtocolError(
ioby-data | 2021-02-10 17:33:15

daniel

02/10/2021, 10:35 PM

Got it. And no memory pressure? Just asking because you earlier mentioned the other non-Dagster log output that you saw was associated with OOM issues

Brian Abelson

02/10/2021, 10:36 PM

no, no memory pressure

Brian Abelson

02/10/2021, 10:36 PM

the queued run coordinator is throwing the same error:

6 2021-02-10 22:35:16 - dagster-daemon - ERROR - Caught error in DaemonType.QUEUED_RUN_COORDINATOR:
ioby-data | 2021-02-10 17:35:16 SerializableErrorInfo(message='dagster.serdes.ipc.DagsterIPCProtocolError: Timeout: read stream has not received any data in 15 seconds\n', stack=['  File "/usr/local/lib/python3.8/site-packages/dagster/daemon/controller.py", line 117, in run_iteration\n    error_info = check.opt_inst(next(generator), SerializableErrorInfo)\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py", line 172, in run_iteration\n    self._dequeue_run(run, location_manager)\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py", line 206, in _dequeue_run\n    external_pipeline = location_manager.get_external_pipeline_from_run(run)\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py", line 97, in get_external_pipeline_from_run\n    ] = RepositoryLocationHandle.create_from_repository_location_origin(\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 57, in create_from_repository_location_origin\n    return ManagedGrpcPythonEnvRepositoryLocationHandle(repo_location_origin)\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 192, in __init__\n    self.grpc_server_process = GrpcServerProcess(\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 1037, in __init__\n    self.server_process = open_server_process(\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 942, in open_server_process\n    wait_for_grpc_server(server_process, output_file)\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 878, in wait_for_grpc_server\n    event = read_unary_response(ipc_output_file, timeout=timeout, ipc_process=server_process)\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", 
line 39, in read_unary_response\n    messages = list(ipc_read_event_stream(output_file, timeout=timeout, ipc_process=ipc_process))\n', '  File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 152, in ipc_read_event_stream\n    raise DagsterIPCProtocolError(\n'], cls_name='DagsterIPCProtocolError', cause=None)

Brian Abelson

02/10/2021, 10:38 PM

dagit can still access the repo though, so im confused why the daemon cannot.

daniel

02/10/2021, 10:39 PM

If you refresh the repo in dagit, does it run into the same error?

Brian Abelson

02/10/2021, 10:40 PM

nope

daniel

02/10/2021, 10:41 PM

but yesterday it was right?

Brian Abelson

02/10/2021, 10:42 PM

yeah, actually now it is down. it seemed to take a while.

Brian Abelson

02/10/2021, 10:42 PM

interestingly this time there was no cpu spike ... no jobs were even running.

Brian Abelson

02/10/2021, 10:42 PM

the scheduler just ran for a while and then shut down and then everything else shut down.

daniel

02/10/2021, 10:45 PM

Yeah, everything we're seeing so far is consistent with the system being overloaded enough that processes are randomly getting shut down and/or unable to start. But if it's not CPU and it's not memory (and presumably not disk space)... are other non-dagster processes struggling too when this is happening? The process that dagster is trying to spin up is a pretty lightweight gRPC server, and it's failing right away, before it loads any non-dagster code... so if that's failing i'd expect lots of other processes to be struggling as well.

daniel

02/10/2021, 10:47 PM

is there anything that has reliably fixed the issue so far?

daniel

02/10/2021, 10:48 PM

Or maybe if you run 'ps aux' is there anything surprising, like lots of hanging python processes, anything like that?

Brian Abelson

02/10/2021, 10:52 PM

no. the only reliable thing is that it fails. it doesn't happen when i run the docker container so it is probably some undocumented limit in digital ocean, maybe there is some limit on the amount of ram a given process can consume and its silently killing things.

daniel

02/10/2021, 10:55 PM

that would make sense - I'm not aware of any other reports of these symptoms, so could definitely be something unique about the execution environment. Sorry for the frustration :(

Brian Abelson

02/10/2021, 10:57 PM

do you know of any users that run single-node setups? is that actually advised?

Brian Abelson

02/11/2021, 3:16 PM

the output from `htop` on the node seems to indicate that all dagster-related `python` processes are running at 800% CPU

daniel

02/11/2021, 3:33 PM

hm, that's no good. Let me see if that's something we can reproduce on our side.

daniel

02/11/2021, 3:40 PM

hmm, are there any non-dagster processes running on the node, and are they running at more reasonable CPU levels? I noticed that the htop process is also at 800%, which I wouldn't expect to be the case

Brian Abelson

02/11/2021, 3:48 PM

the only other things that are running are `supervisor` and `nginx`

Brian Abelson

02/11/2021, 3:50 PM

and those are running at 0% CPU

daniel

02/11/2021, 3:57 PM

got it - is getting profiling information from py-spy an option here?

Brian Abelson

02/11/2021, 3:58 PM

i can certainly try!

Brian Abelson

02/11/2021, 3:59 PM

so i'd point it at the pid of the grpc processes? eg: `py-spy top --pid 12345`

daniel

02/11/2021, 4:00 PM

honestly they're all pretty mysterious to me, but dagit and dagster-daemon are the most mysterious since those aren't even running user code

daniel

02/11/2021, 4:00 PM

but the grpc ones would also be interesting

Brian Abelson

02/11/2021, 4:02 PM

okay give me a minute. i'll try to get it setup

daniel

02/11/2021, 4:04 PM

thanks! Very curious to see what could be maxing out dagit CPU..

Brian Abelson

02/11/2021, 4:06 PM

maybe this is my lack of understanding of htop but why are there so many entries for each process?

daniel

02/11/2021, 4:09 PM

I'm not an htop expert either, but if you're using the default run launcher those could be subprocesses? The gRPC server spawns a subprocess to carry out each launched run. That's the most likely explanation for the ones with identical arguments

daniel

02/11/2021, 4:11 PM

Dagit and the daemon will also have their own gRPC process for each of the repository locations (this is a place where we could do more to optimize for single-node deployments)

Brian Abelson

02/11/2021, 4:12 PM

heres the output from ps aux which is different

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.3  13540  7032 ?        Ss   16:03   0:00 /bin/bash /opt/dagster/ops/entrypoint.sh
root      1088  6.4  9.5 1301460 199884 ?      Sl   16:06   0:15 /usr/local/bin/python -m dagster.grpc --socket /tmp/tmpasi8
root        11  0.4  1.2  34632 25512 ?        S    16:03   0:01 /usr/bin/python2 /usr/bin/supervisord -c /etc/supervisor/co
root      1162  0.1  0.6  21480 14076 ?        S    16:06   0:00 /usr/local/bin/python -c from multiprocessing.resource_trac
root      1163  107 25.8 1165684 542872 ?      Rl   16:06   4:04 /usr/local/bin/python -c from multiprocessing.spawn import 
root      1261  0.2  0.2  12132  4404 ?        S    16:07   0:00 tail -F -c +0 /opt/dagster/storage/4551b2d6-bc6f-4965-b587-
root      1262  0.2  0.5  19060 11648 ?        S    16:07   0:00 /usr/local/bin/python /usr/local/lib/python3.8/site-package
root      1263  0.2  0.2  12132  4528 ?        S    16:07   0:00 tail -F -c +0 /opt/dagster/storage/4551b2d6-bc6f-4965-b587-
root      1265  0.3  0.6  19060 12624 ?        S    16:07   0:00 /usr/local/bin/python /usr/local/lib/python3.8/site-package
root        14  0.2  3.1 118892 65436 ?        S    16:03   0:00 nginx: master process /usr/sbin/nginx -g daemon off;
root        15  0.0  0.2  13540  6040 ?        S    16:03   0:00 /bin/bash ops/start-dagit.sh
root      1503  0.2  0.2  13804  5852 ?        Ss   16:07   0:00 bash
root        16  0.0  0.3  13540  6912 ?        S    16:03   0:00 /bin/bash ops/start-dagster-daemon.sh
root        17 10.1  6.0 683080 126244 ?       Sl   16:03   0:41 /usr/local/bin/python /usr/local/bin/dagit -h 0.0.0.0 -p 30
root        18  9.4  4.9 624816 104088 ?       Sl   16:03   0:38 /usr/local/bin/python /usr/local/bin/dagster-daemon run
root        19  0.0  2.2 119236 47208 ?        S    16:03   0:00 nginx: worker process
root        20  0.0  2.2 119236 47208 ?        S    16:03   0:00 nginx: worker process
root        21  0.0  2.2 119236 47208 ?        S    16:03   0:00 nginx: worker process
root        22  0.0  2.2 119236 47208 ?        S    16:03   0:00 nginx: worker process
root        23  0.0  2.2 119236 47208 ?        S    16:03   0:00 nginx: worker process
root      2351  2.3  0.3  13804  7396 ?        Ss   16:10   0:00 bash
root        24  0.0  2.2 119236 47148 ?        S    16:03   0:00 nginx: worker process
root        25  0.1  2.2 119236 47208 ?        S    16:03   0:00 nginx: worker process
root        26  0.0  2.2 119236 47144 ?        S    16:03   0:00 nginx: worker process
root      2660  0.0  0.0      0     0 ?        Z    16:10   0:19 [python] <defunct>
root      2853  0.0  9.4 1312028 198292 ?      Sl   16:11   0:12 /usr/local/bin/python -m dagster.grpc --socket /tmp/tmpd6mm
root      2926  0.0  0.3  17444  7448 ?        R    16:11   0:00 ps aux
root        35  7.6 10.0 1407660 210488 ?      Sl   16:04   0:30 /usr/local/bin/python -m dagster.grpc --socket /tmp/tmph9_l
root        48  0.0  0.3  13804  6944 ?        Ss   16:04   0:00 bash
root       725 14.6  0.2  13196  4676 ?        R    16:05   0:42 htop

Brian Abelson

02/11/2021, 4:12 PM

that's showing the culprit to be:

/usr/local/bin/python -c from multiprocessing.spawn import
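Picking the culprit out of `ps aux` by eye can be scripted. The sketch below parses the standard eleven-column `ps aux` layout and ranks processes by CPU; the three sample rows are copied (truncated as they appear) from the output above, and the helper names are hypothetical.

```python
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1163  107 25.8 1165684 542872 ?      Rl   16:06   4:04 /usr/local/bin/python -c from multiprocessing.spawn import
root        17 10.1  6.0 683080 126244 ?       Sl   16:03   0:41 /usr/local/bin/python /usr/local/bin/dagit -h 0.0.0.0 -p 30
root      2660  0.0  0.0      0     0 ?        Z    16:10   0:19 [python] <defunct>
"""


def parse_ps_aux(text):
    """Parse `ps aux` output into dicts with pid, cpu, mem, stat and command."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # COMMAND may itself contain spaces
        if len(fields) < 11:
            continue
        rows.append(
            {
                "pid": int(fields[1]),
                "cpu": float(fields[2]),
                "mem": float(fields[3]),
                "stat": fields[7],
                "command": fields[10].strip(),
            }
        )
    return rows


def top_cpu(rows, n=3):
    """Return the n rows with the highest %CPU."""
    return sorted(rows, key=lambda r: r["cpu"], reverse=True)[:n]


def zombies(rows):
    """Return rows whose STAT column marks a defunct (zombie) process."""
    return [r for r in rows if r["stat"].startswith("Z")]


if __name__ == "__main__":
    rows = parse_ps_aux(SAMPLE)
    for row in top_cpu(rows):
        print(row["pid"], row["cpu"], row["command"])
    print("zombies:", [r["pid"] for r in zombies(rows)])
```

Feeding it successive snapshots would also make the changing PIDs and any accumulating `<defunct>` entries easier to spot.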

daniel

02/11/2021, 4:14 PM

ah that makes more sense than every process at 800

Brian Abelson

02/11/2021, 4:14 PM

for some reason `py-spy` can't find these PIDs, it returns `Error: No such file or directory (os error 2)`

Brian Abelson

02/11/2021, 4:20 PM

it seems that the PIDs are continually changing which might be consistent with what you described: a process attempting to spawn, failing to allocate the necessary cpu/ram, and then continually retrying

daniel

02/11/2021, 4:21 PM

that would fit, yeah

Brian Abelson

02/11/2021, 4:24 PM

i can reproduce the CPU spikes in my local docker instance but i don't think it causes the outage like it does on app platform. this is the process

root      8924 46.5 11.8 861680 241936 ?       Sl   16:16   2:32 /usr/local/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=16) --multiprocessing-fork

Brian Abelson

02/11/2021, 4:24 PM

thats also continually changing its PID as well, though.

daniel

02/11/2021, 4:25 PM

A short spike when it's first starting up isn't totally unexpected

Brian Abelson

02/11/2021, 4:30 PM

yeah. as i said before, it seems to run fine on my local docker host.

Brian Abelson

02/11/2021, 4:57 PM

got this response from digital ocean support, do you think this could be related? my dags dont write anything to tmpfile, but i know dagster does. i dont really pass a lot of data between dags either

Could you please mention what would be size of temporary files that Dagster writes to disk. There is a 2GiB limit on ephemeral disk storage within the App container.  If it's filling up the temporary files the App might be restarted.

Brian Abelson

02/11/2021, 5:04 PM

could it just be that logs are clogging up the disk? i thought those were written to postgres

Brian Abelson

02/11/2021, 5:06 PM

that would be in line with the pattern of it taking 1-2 hours (basically a few jobs need to run) before it shuts down. i do have my log level set to `INFO` and print out a lot of information.
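Whether logs or other files are approaching the 2 GiB ephemeral limit quoted by support can be measured rather than guessed. This is a minimal sketch under assumptions: `/opt/dagster/storage` is taken from the `ps aux` output earlier in the thread, the temp directory is included because the gRPC sockets were created under `/tmp`, and the function names are hypothetical.

```python
import os
import tempfile

# 2 GiB: the ephemeral disk limit quoted in the support response above.
EPHEMERAL_LIMIT_BYTES = 2 * 1024 ** 3


def directory_size_bytes(root):
    """Sum the sizes of all non-symlink files under root, skipping unreadable ones."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # Files can disappear while runs write and clean up.
                continue
    return total


def report(paths, limit=EPHEMERAL_LIMIT_BYTES):
    """Return {path: (bytes_used, fraction_of_limit)} for each existing directory."""
    usage = {}
    for path in paths:
        if os.path.isdir(path):
            used = directory_size_bytes(path)
            usage[path] = (used, used / limit)
    return usage


def default_paths():
    """Directories assumed to be relevant in this deployment."""
    return ["/opt/dagster/storage", tempfile.gettempdir()]
```

Calling `report(default_paths())` inside the container every few minutes would show whether usage grows steadily over the 1-2 hours before the shutdown.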

daniel

02/11/2021, 5:35 PM

Is it possible to share your instance config? This could potentially be inputs/outputs from solids if they're going to the filesystem, depending on how that is set up

Brian Abelson

02/11/2021, 5:37 PM

this is the pipeline code. i only pass a list of tables between each solid

Brian Abelson

02/11/2021, 5:37 PM

from dagster import pipeline, solid, Output, OutputDefinition, InputDefinition

from ioby_data import modes, solids
from ioby_data.utils import sql

PIPELINE = __name__.replace(".", "_").lower()


@solid(
    required_resource_keys={"mysql_drupal"},
    config_schema={
        "from_schema": str,
        "exclude": list,
        "include": list,
    },
    output_defs=[
        OutputDefinition(
            dagster_type=dict,
            name="table_info",
            description="Information about the mysql tables selected",
        )
    ],
)
def get_mysql_drupal_tables_to_sync(context):
    """
    Fetch a list of table names inside a schema, optionally excluding some
    """

    # build exclude table clauses
    exclude = context.solid_config["exclude"]
    if len(exclude):
        exclude_table_clauses = "\n AND ".join(
            [f"table_name NOT LIKE '{exc}'" for exc in exclude if exc.strip() != ""]
        )
    else:
        exclude_table_clauses = "1=1"

    # build include table clauses
    include = context.solid_config["include"]
    if len(include):
        include_table_clauses = "\n OR ".join(
            [
                f"table_name LIKE '{inc}'"
                for inc in context.solid_config["include"]
                if inc.strip() != ""
            ]
        )
    else:
        include_table_clauses = "1=1"

    sql = f"""
        SELECT 
            table_name,
            table_rows
        FROM 
            information_schema.tables 
        WHERE
            table_rows > 0
            AND table_schema='{context.solid_config['from_schema']}'
            AND {exclude_table_clauses}
            AND (
                {include_table_clauses}
            )
                
        ORDER BY RAND()
    """
    context.log.info(f"Running query: {sql}")
    df = context.resources.mysql_drupal.df_from_query(sql)
    mysql_tables = {
        row.table_name: {"num_rows": row.table_rows, "part": (idx % 10) + 1}
        for idx, row in df.iterrows()
    }
    context.log.info(f"RETRIEVED {len(mysql_tables.keys())} MYSQL TABLES")
    table_info = {
        "from_schema": context.solid_config["from_schema"],
        "tables": mysql_tables,
    }
    yield Output(table_info, "table_info")


@solid(
    required_resource_keys={"psql_warehouse", "mysql_drupal"},
    config_schema={
        "limit": int,
    },
    input_defs=[
        InputDefinition(
            dagster_type=dict,
            name="table_info",
            description="Information about the mysql tables selected",
        )
    ],
    output_defs=[
        OutputDefinition(
            dagster_type=list,
            name="tables",
            description="A list of temp tables created in the warehouse",
        )
    ],
)
def copy_mysql_drupal_tables_to_psql_warehouse(context, table_info):

    # setup warehouse access
    wh = context.resources.psql_warehouse
    mysql = context.resources.mysql_drupal

    # setup schema to copy the table to
    from_schema = table_info["from_schema"]
    input_tables = table_info["tables"]
    tmp_schema = sql.gen_temp_schema(PIPELINE)
    context.log.info(f"CREATING TMP SCHEMA: {tmp_schema}")
    wh.create_schema(tmp_schema)

    # copy the tables
    output_tables = []

    for table, table_info in input_tables.items():
        context.log.info(f"FETCHING SCHEMA FOR {table}")
        column_schema = mysql.get_table_column_schema(table)
        create_table_stmt = wh.create_table(table, tmp_schema, column_schema)
        context.log.info(f"SUCCESSFULLY RAN IN WAREHOUSE:\n{create_table_stmt}")
        context.log.info(f"LOADING MYSQL TABLE {table} INTO WAREHOUSE")
        # export mysql table to csv
        limit_stmt = ""
        if context.solid_config["limit"] > 0:
            limit_stmt = f"LIMIT {context.solid_config['limit']}"
        rows = mysql.execute(
            f"""
        SELECT * FROM {table} {limit_stmt}
        """
        )
        context.log.info(f"WRITING MYSQL {table} TO {tmp_schema}.{table} IN WAREHOUSE")
        wh.insert_rows_to_table(rows, table, tmp_schema)
        context.log.info(f"FINISHED LOADING TABLE {table}")
        output_tables.append(f"{tmp_schema}.{table}")

    yield Output(output_tables, "tables")


@solid(
    required_resource_keys={"psql_warehouse"},
    config_schema={
        "to_schema": str,
    },
    input_defs=[
        InputDefinition(
            dagster_type=list,
            name="tables",
            description="A list of temp tables created in the warehouse",
        )
    ],
)
def replace_existing_tables_with_new_tables(context, tables):

    # setup warehouse access
    wh = context.resources.psql_warehouse
    dest_schema = context.solid_config["to_schema"]
    for src_table in tables:

        # format src/ dest table names and schema
        src_schema, table_name = src_table.split(".")
        dest_table = f"{dest_schema}.{table_name}"
        context.log.info(f"Replacing {dest_table} with {src_table}")
        # swap table in a single transaction
        wh.swap_table(src_table, dest_table)

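    # drop the temp schema for this operation. Skipped when no tables were
    # passed in, because src_schema is only defined inside the loop above.
    if tables:
        wh.drop_schema(src_schema)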


@pipeline(mode_defs=[modes.DEFAULT])
def mysql_drupal_to_psql_warehouse():
    table_info = get_mysql_drupal_tables_to_sync()
    tables = copy_mysql_drupal_tables_to_psql_warehouse(table_info)
    replace_existing_tables_with_new_tables(tables)
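
The `wh.swap_table` method is not shown in the thread. As a rough illustration of what "swap table in a single transaction" could mean on PostgreSQL, here is a minimal sketch. The helper name `build_swap_statements` is hypothetical, and it assumes both table names are schema-qualified and share the same unqualified name, as they do in the solid above.

```python
def build_swap_statements(src_table, dest_table):
    """Return SQL statements that replace dest_table with src_table.

    Hypothetical helper: this is NOT the swap_table implementation used in
    the thread, only one plausible approach that relies on PostgreSQL's
    transactional DDL.
    """
    src_schema, src_name = src_table.split(".")
    dest_schema, dest_name = dest_table.split(".")
    if src_name != dest_name:
        # Moving a table between schemas keeps its name, so a rename step
        # would be needed; this sketch does not cover that case.
        raise ValueError("source and destination table names must match")
    return [
        "BEGIN;",
        # Remove the old copy, if any.
        f'DROP TABLE IF EXISTS "{dest_schema}"."{dest_name}";',
        # Move the freshly loaded table into the destination schema.
        f'ALTER TABLE "{src_schema}"."{src_name}" SET SCHEMA "{dest_schema}";',
        "COMMIT;",
    ]
```

Because all statements run inside one transaction, readers of the destination table would see either the old table or the new one, never a missing table.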

Brian Abelson

02/11/2021, 5:48 PM

and my `dagster.yaml`:

# ==================================================================================================
# Run Storage
# ==================================================================================================
# Controls how the history of runs is persisted. Can be set to SqliteRunStorage (default) or
# PostgresRunStorage.
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username:
        env: IOBY_DAGSTER_DB_USERNAME
      password:
        env: IOBY_DAGSTER_DB_PASSWORD
      hostname:
        env: IOBY_DAGSTER_DB_HOST
      db_name:
        env: IOBY_DAGSTER_DB_NAME
      port:
        env: IOBY_DAGSTER_DB_PORT

# ==================================================================================================
# Event Log Storage
# ==================================================================================================
# Controls how the structured event logs produced by each run are persisted. Can be set to
# SqliteEventLogStorage (default) or PostgresEventLogStorage.
event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username:
        env: IOBY_DAGSTER_DB_USERNAME
      password:
        env: IOBY_DAGSTER_DB_PASSWORD
      hostname:
        env: IOBY_DAGSTER_DB_HOST
      db_name:
        env: IOBY_DAGSTER_DB_NAME
      port:
        env: IOBY_DAGSTER_DB_PORT

# ==================================================================================================
# Scheduler
# ==================================================================================================
# Provides an optional scheduler which controls execution of pipeline runs at regular intervals.
# We recommend using the default DagsterDaemonScheduler - SystemCronScheduler and K8sScheduler are
# also available but are deprecated.
scheduler:
  module: dagster.core.scheduler
  class: DagsterDaemonScheduler

# ==================================================================================================
# Schedule Storage
# ==================================================================================================
# Controls the backing storage used by the scheduler to manage the state of schedules and persist
# records of attempts.
schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username:
        env: IOBY_DAGSTER_DB_USERNAME
      password:
        env: IOBY_DAGSTER_DB_PASSWORD
      hostname:
        env: IOBY_DAGSTER_DB_HOST
      db_name:
        env: IOBY_DAGSTER_DB_NAME
      port:
        env: IOBY_DAGSTER_DB_PORT

# ==================================================================================================
# Run Launcher
# ==================================================================================================
# Component that determines where runs are executed.
run_launcher:
  module: dagster.core.launcher
  class: DefaultRunLauncher

# ==================================================================================================
# Run Coordinator
# ==================================================================================================
# Determines the policy used to determine the prioritization rules and concurrency limits for runs.
# Can be set to DefaultRunCoordinator (default) or QueuedRunCoordinator when you want to maintain
# limits on the number of runs that can be executing at once.
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 2

telemetry:
  enabled: false
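
Every storage section above reads its connection settings from environment variables, so both dagit and the daemon need to see the same five variables. Processes started by supervisord inherit supervisord's own environment rather than that of an interactive shell, so it may be worth confirming the variables are actually set. A minimal sketch of such a check follows; the function name `missing_env_vars` is hypothetical, and only the variable names come from the config.

```python
import os

# Variable names copied from the dagster.yaml above.
REQUIRED_ENV_VARS = [
    "IOBY_DAGSTER_DB_USERNAME",
    "IOBY_DAGSTER_DB_PASSWORD",
    "IOBY_DAGSTER_DB_HOST",
    "IOBY_DAGSTER_DB_NAME",
    "IOBY_DAGSTER_DB_PORT",
]


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper, not part of Dagster. Pass a dict to check a
    specific environment; defaults to the current process environment.
    """
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]
```

Calling `missing_env_vars()` from the same supervisord program definition that launches the daemon would show whether that process sees the database settings.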

Brian Abelson

02/11/2021, 5:50 PM

to be completely honest, i can't tell if it's really affecting the production env. basically the CPU spikes when the job runs, dagit reports that the daemon and repository are inaccessible, the job finishes, and then eventually the status page shows everything being healthy again.

Brian Abelson

02/11/2021, 5:52 PM

setting limits on the number of concurrent jobs seemed to help but i don't feel particularly confident in its continued reliability.
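
The limit mentioned here corresponds to `max_concurrent_runs: 2` in the `dagster.yaml` above. The general idea of a queued coordinator can be sketched in plain Python: runs wait in a queue and are launched only while free slots remain. This is an illustration of the concept, not Dagster's implementation, and the function name `runs_to_launch` is hypothetical.

```python
def runs_to_launch(queued_runs, in_progress_count, max_concurrent_runs):
    """Pick the queued runs that can start without exceeding the limit.

    Illustrative only: mimics the idea of a concurrency cap, not the
    actual QueuedRunCoordinator logic.
    """
    # Number of runs that may still start; never negative.
    free_slots = max(max_concurrent_runs - in_progress_count, 0)
    # Launch in queue order, oldest first.
    return list(queued_runs[:free_slots])
```

With a cap of 2 and one run already in progress, only the first queued run would start and the rest would stay queued, which limits how much CPU the runs can take from dagit and the daemon at once.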
