# announcements
s
I've been getting a bunch of these errors. does anyone know what might be causing them?
m
can you describe your system a little -- is this local, what OS, how many pipelines are you running concurrently, schedules, etc
s
CentOS 6, running 14 pipelines, with only about 8 running at any given time
in production
m
hmm, i know nothing about CentOS. i would be interested if you could see whether there's an abnormal number of open files or of processes
also curious about disk space
s
is there anything i can ask my systems team specifically that would be helpful?
m
i'll note also that the LocalComputeLogManager is not really intended for production - are you running in a cloud?
i assume you're not running in a container -- you're running dagster right on the metal/VM
s
systems team manages all the nodes we use in 2 datacenters
don't think we use VMs
everything is on prem
m
gotcha
s
what should i use instead of `local_compute_log_manager`? or is there a way to turn it off?
m
do you have an on-prem equivalent to an object store like S3? if not, i'd suggest configuring it to point at a shared filesystem if you have one. but i'm not certain that's the issue
i think it'd be good if your systems folks could run `lsof` and `ps` or equivalent and see if they see either an abnormal number of open files or a large number of python processes
if not, that'll at least rule some things out
how frequently do these pipelines run? and do you have any sense of how long the server has been up, and roughly how many runs it's executed in total?
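For the record, a minimal sketch of the kind of one-off check being suggested here -- assuming a Linux host with `lsof` and `pgrep` on the PATH and a reasonably modern Python to run it with; only the two command names come from the message above, everything else is illustrative:

```python
# rough diagnostic sketch: count open file descriptors and live python
# processes; run as root so lsof can see every process
import subprocess

def line_count(cmd):
    """Run a command and return how many lines it prints."""
    return len(subprocess.check_output(cmd).splitlines())

# lsof prints one header line plus one line per open file descriptor
print("open files:", line_count(["lsof"]) - 1)

# pgrep -f prints one pid per matching process
# (note: raises CalledProcessError if nothing matches)
print("python processes:", line_count(["pgrep", "-f", "python"]))
```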
s
we did have s3 set up for compute logs and one of the systems guys yelled at me saying there is no reason to store the logs there as they get sent to kafka anyway, so i had to turn it off.
they run as soon as they get 15-25k messages from kafka
so anywhere from 5 sec to the 5 min timeout
791 ran since 5pm
turned it off 10 min ago cause i didn't want to get paged throughout the night
m
are you running off master, or 0.7.15, or another version?
s
0.7.13
m
ok, as an interim step, i would turn the compute log manager off -- this is a totally fine way to run if you have some other facility that aggregates stdout/stderr
you should be able to do the following in your dagster.yaml
s
cool, so just remove the `compute_logs` section from the prod yaml file?
m
```yaml
compute_logs:
  module: dagster.core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
```
i have a hunch what might be causing this - it'd be helpful if you could provide those diagnostics - and we can dig in tomorrow
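As a quick sanity check once that config is deployed, loading the instance and inspecting it should show the new manager; a minimal sketch, assuming `DagsterInstance.get()` and the `compute_log_manager` property behave in 0.7.x the way they do in later releases:

```python
# sanity-check sketch: confirm dagster.yaml now selects the no-op manager
from dagster import DagsterInstance

# reads dagster.yaml from $DAGSTER_HOME and builds the configured instance
instance = DagsterInstance.get()

# expect this to print NoOpComputeLogManager once the change is picked up
print(type(instance.compute_log_manager).__name__)
```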
s
will do, i'll deploy this and see if i get paged less, and we'll reconvene tomorrow and share the stats. thank you for your help!
no one is online to accept my diff, guess it'll have to wait till the morning
m
apologies for this
a
> they run as soon as they get 15-25k messages from kafka
what exactly is the setup for kicking off pipeline runs? if the kafka listener is a long-lived process, it's possible the issue is a memory / file descriptor leak from accidentally holding on to references
s
once the minimum number of messages is hit, the messages are pickled to disk and the pipeline is called
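To make the leak question concrete, here is a hypothetical sketch of the trigger loop s describes, using the kafka-python client; the topic, broker address, and `kick_off_pipeline` helper are all made up, and the 5-minute flush on partial batches is left out:

```python
# hypothetical sketch of the listener described above -- not the actual code
import pickle
import tempfile

from kafka import KafkaConsumer  # kafka-python package

MIN_BATCH = 15000  # runs fire at 15-25k messages

def kick_off_pipeline(pickle_path):
    """Hypothetical stand-in for however the dagster run is launched."""
    print("would launch pipeline with", pickle_path)

# topic name and broker address are illustrative
consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= MIN_BATCH:
        # pickle the batch to disk and hand the path to the pipeline
        with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".pkl") as f:
            pickle.dump(batch, f)
        kick_off_pipeline(f.name)
        # the detail a's question turns on: in a long-lived listener,
        # rebinding the list releases the old batch for garbage collection;
        # accumulating into a structure that is never dropped would leak
        batch = []
```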