# announcements
s
I've been getting a bunch of these errors. does anyone know what might be causing them?
m
can you describe your system a little -- is this local, what OS, how many pipelines are you running concurrently, schedules, etc
s
CentOS 6, running 14 pipelines, with only about 8 running at any given time
in production
m
hmm, i know nothing about CentOS. i would be interested if you could see whether there's an abnormal number of open files or of processes
also curious about disk space
s
is there anything i can ask my systems team specifically that would be helpful?
m
i'll note also that the LocalComputeLogManager is not really intended for production - are you running in a cloud?
i assume you're not running in a container -- you're running dagster right on the metal/VM
s
systems team manages all the nodes we use in 2 datacenters
don't think we use VMs
everything is on prem
m
gotcha
s
what should i use instead of `local_compute_log_manager`? or is there a way to turn it off?
m
do you have an on-prem equivalent to an object store like S3? if not, i'd suggest configuring it to point at a shared filesystem if you have one. but i'm not certain that's the issue
i think it'd be good if your systems folks could run `lsof` and `ps` or equivalent and see if they see either an abnormal number of open files or a large number of python processes
if not, that'll at least rule some things out
how frequently do these pipelines run? and do you have any sense of how long the server has been up, and roughly how many runs it's executed in total?
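For the record, a minimal sketch of the kind of one-off check being suggested here -- assuming a Linux host with `lsof` and `pgrep` on the PATH and a reasonably modern Python to run it with; only the two command names come from the message above, everything else is illustrative:

```python
# rough diagnostic sketch: count open file descriptors and live python
# processes; run as root so lsof can see every process
import subprocess

def line_count(cmd):
    """Run a command and return how many lines it prints."""
    return len(subprocess.check_output(cmd).splitlines())

# lsof prints one header line plus one line per open file descriptor
print("open files:", line_count(["lsof"]) - 1)

# pgrep -f prints one pid per matching process
# (note: raises CalledProcessError if nothing matches)
print("python processes:", line_count(["pgrep", "-f", "python"]))
```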
s
we did have s3 set up for compute logs and one of the systems guys yelled at me saying there is no reason to store the logs there as they get sent to kafka anyway, so i had to turn it off.
they run as soon as they get 15-25k messages from kafka
so anywhere from 5 sec to the 5 min timeout
791 ran since 5pm
turned it off 10 min ago cause i didn't want to get paged throughout the night
m
are you running off master, or 0.7.15, or another version?
s
0.7.13
m
ok, as an interim step, i would turn the compute log manager off -- this is a totally fine way to run if you have some other facility that aggregates stdout/stderr
you should be able to do the following in your dagster.yaml
s
cool, so just remove the `compute_logs` section from the prod yaml file?
m
```yaml
compute_logs:
  module: dagster.core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
```
i have a hunch what might be causing this - it'd be helpful if you could provide those diagnostics - and we can dig in tomorrow
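As a quick sanity check once that config is deployed, loading the instance and inspecting it should show the new manager; a minimal sketch, assuming `DagsterInstance.get()` and the `compute_log_manager` property behave in 0.7.x the way they do in later releases:

```python
# sanity-check sketch: confirm dagster.yaml now selects the no-op manager
from dagster import DagsterInstance

# reads dagster.yaml from $DAGSTER_HOME and builds the configured instance
instance = DagsterInstance.get()

# expect this to print NoOpComputeLogManager once the change is picked up
print(type(instance.compute_log_manager).__name__)
```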
s
will do, i'll deploy this and see if i get paged less, and we'll reconvene tomorrow and share the stats. thank you for your help!
no one is online to accept my diff, guess it'll have to wait till the morning
m
apologies for this
a
> they run as soon as they get 15-25k messages from kafka
what exactly is the setup for kicking off pipeline runs? if the kafka listener is a long-lived process, it's possible the issue is a memory / file descriptor leak from accidentally holding on to references
s
once the minimum number of messages is hit, the messages are pickled to disk and the pipeline is called
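To make the leak question concrete, here is a hypothetical sketch of the trigger loop s describes, using the kafka-python client; the topic, broker address, and `kick_off_pipeline` helper are all made up, and the 5-minute flush on partial batches is left out:

```python
# hypothetical sketch of the listener described above -- not the actual code
import pickle
import tempfile

from kafka import KafkaConsumer  # kafka-python package

MIN_BATCH = 15000  # runs fire at 15-25k messages

def kick_off_pipeline(pickle_path):
    """Hypothetical stand-in for however the dagster run is launched."""
    print("would launch pipeline with", pickle_path)

# topic name and broker address are illustrative
consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= MIN_BATCH:
        # pickle the batch to disk and hand the path to the pipeline
        with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".pkl") as f:
            pickle.dump(batch, f)
        kick_off_pipeline(f.name)
        # the detail a's question turns on: in a long-lived listener,
        # rebinding the list releases the old batch for garbage collection;
        # accumulating into a structure that is never dropped would leak
        batch = []
```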