# ask-community
a
Hey! First-time user of Dagster here (been playing around with it for a couple of weeks). Something weird is going on with my Dagster setup. I'm running the Dagster Cloud local agent on a VPS on GCP. Everything runs fine when I first set it up, but after a while Dagster stops working with the Python error "OSError: [Errno 24] Too many open files". After a bit of debugging I found that the "dagster-cloud agent run" process opens a ton of sockets without closing them (ls -alht /proc/{$dagster_PID}/fd/). Eventually it reaches 1024 open sockets and the VPS starts erroring out. A bit like a memory leak, just with open sockets. I've set up a @run_failure_sensor to monitor for failed jobs, and that seems to be what triggers the creation of new sockets: if I disable it, new sockets stop being created. Here's the "ls -alht /proc/{PID}/fd" output for the dagster-agent process:
ls -alht /proc/8739/fd
total 0
lrwx------ 1 root root 64 Sep  8 13:12 13 -> 'socket:[23909]'
lrwx------ 1 root root 64 Sep  8 13:12 14 -> 'socket:[24052]'
lrwx------ 1 root root 64 Sep  8 13:12 15 -> 'socket:[23399]'
lrwx------ 1 root root 64 Sep  8 13:09 12 -> 'socket:[23156]'
lrwx------ 1 root root 64 Sep  8 13:08 11 -> 'socket:[23576]'
lrwx------ 1 root root 64 Sep  8 13:06 10 -> 'socket:[22520]'
lrwx------ 1 root root 64 Sep  8 13:03 9 -> 'socket:[21384]'
lrwx------ 1 root root 64 Sep  8 13:01 7 -> 'socket:[21333]'
lrwx------ 1 root root 64 Sep  8 13:01 8 -> 'socket:[23994]'
lrwx------ 1 root root 64 Sep  8 13:00 6 -> 'socket:[21644]'
dr-x------ 2 root root  0 Sep  8 13:00 .
lrwx------ 1 root root 64 Sep  8 13:00 0 -> /dev/pts/0
lrwx------ 1 root root 64 Sep  8 13:00 1 -> /dev/pts/0
lrwx------ 1 root root 64 Sep  8 13:00 2 -> /dev/pts/0
lrwx------ 1 root root 64 Sep  8 13:00 3 -> 'socket:[23896]'
lrwx------ 1 root root 64 Sep  8 13:00 4 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Sep  8 13:00 5 -> 'anon_inode:[eventfd]'
dr-xr-xr-x 9 root root  0 Sep  8 12:59 ..
The longer I let it run, the more sockets get created. I don't know what the next step of debugging is. I've already rebuilt the VPS once, and the same error occurred again.
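In case it helps anyone watching a process like this: rather than eyeballing ls output, a small script can count socket fds over time. This is just a sketch assuming a Linux /proc filesystem; substitute the agent's PID for os.getpid():

```python
import os

def count_open_sockets(pid):
    """Count the fds of a process that point at sockets (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.startswith("socket:"):
            count += 1
    return count

print(count_open_sockets(os.getpid()))
```

Run it in a loop (or under watch) and a steady climb confirms the leak without counting ls lines by hand.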
d
Hey Alexander - we'll take a look at this and see if we can reproduce. What's happening within the body of the run failure sensor? Do the sockets only get opened whenever a run fails, or all the time even if no runs are happening?
a
import pymsteams
from dagster import DefaultSensorStatus, RunRequest, job, op, run_failure_sensor


@op
def teams_op(context):
    webhook_url = "a url"
    msg = pymsteams.connectorcard(webhook_url)
    msg.title("Dagster Alert")
    msg.text("Task failed in Dagster.")
    msg.send()


@job()
def send_teams_msg_on_job_fail():
    teams_op()


@run_failure_sensor(
    request_job=send_teams_msg_on_job_fail,
    default_status=DefaultSensorStatus.RUNNING,
)
def job_fail_sensor(context):
    run_config = {"ops": {"teams_op": {"config": {"job_name": "make_teams_alert_on_failure"}}}}
    return RunRequest(run_key=None, run_config=run_config)
Not always, maybe every minute or three? As you can see from the timestamps in the first post, it opens a new one every now and then, so it's not consistently every 30s.
I didn't mention this earlier, but I have an older test instance where this was not an issue. I reproduced the problem on both dagster-cloud 1.0.3 and 1.0.7, since a version difference was my first thought. I also moved the affected code over to that instance to see what would happen, and it has a normal number of sockets and has been running fine for a couple of weeks.
When I check the sockets being used by the agent process on the old test instance it looks like this:
REDACTED@instance-1:~$ sudo ls -alht /proc/2969627/fd
total 0
dr-x------ 2 REDACTED REDACTED  0 Sep  8 13:21 .
lr-x------ 1 REDACTED REDACTED 64 Sep  8 13:21 0 -> /dev/null
lrwx------ 1 REDACTED REDACTED 64 Sep  8 13:21 1 -> 'socket:[8022942]'
lrwx------ 1 REDACTED REDACTED 64 Sep  8 13:21 2 -> 'socket:[8022942]'
lrwx------ 1 REDACTED REDACTED 64 Sep  8 13:21 3 -> 'socket:[8060216]'
lrwx------ 1 REDACTED REDACTED 64 Sep  8 13:21 4 -> 'socket:[8061390]'
I don't know what's up, but I notice that anon_inode:[eventpoll] and anon_inode:[eventfd] aren't there.
Done at work now. If you want any more information or have any tips lmk and I'll be back tomorrow. Thanks!
b
Just to add more color here: we run dagster on AWS ECS containers and we are hitting:
E0907 21:26:00.771067497      20 tcp_server_posix.cc:216]    Failed accept4: Too many open files
on the workflows_user_code service. We also have a bunch of sensors, including some for task failures. The container itself seems to get stuck about 2h after a deployment, for some random reason.
d
Alexander do each of those times correspond to a run? Curious if the problem is that each launched run by the sensor is leaving a socket open, even after the run finishes - I'll see if we can reproduce
a
No, it's not a run, it's a check (which is then skipped). It's from the polling, not the run itself.
d
got it, thanks. And those "socket:[22520]" lines - is the number in there a PID? (Might be displaying some linux ignorance here)
or just a random identifier
a
Don't know enough, it's a socket FD id or something. The PID is the one in the ls call: /proc/(pid)/fd
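(For later readers: that number is the socket's inode, and on Linux you can cross-reference it against /proc/net/tcp to see which TCP connection it belongs to. A rough sketch, parsing /proc/net/tcp by hand; addresses come back hex-encoded:)

```python
import os
import socket

def find_tcp_socket(inode):
    """Return (local, remote) hex-encoded addresses for a TCP socket inode,
    or None if it isn't in /proc/net/tcp (e.g. a UDP or unix socket)."""
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header row
        for line in f:
            fields = line.split()
            if fields[9] == str(inode):  # field index 9 is the inode column
                return fields[1], fields[2]
    return None
```

e.g. find_tcp_socket(23909) for the 'socket:[23909]' entry in the first post, assuming it's an IPv4 TCP socket (IPv6 lives in /proc/net/tcp6).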
👍 1
d
The other thing i'd like to check is if pressing "Redeploy" in Dagit on the Workspace tab brings the socket usage back down (that will stop and restart a subprocess on the agent)
a
Will test in 5
Yeah, this removes the sockets and restarts the process
d
OK, that's helpful to know. (It shouldn't restart the agent process, but it will restart one of its subprocesses, which seems to be the one leaving those sockets open.)
One other question - do you happen to know what version of grpcio you have installed?
We had some issues with the latest version (0.48.1) and just added a pin
a
I'm running 1.48.1 atm yeah. I'll try to downgrade and see what happens
d
Ah right sorry 1.48.1 not 0.48.1 - that might help explain why it started recently but downgrading dagster didn't fix it
a
I see that both of my servers are running 1.48.1 though. Any way of getting more verbose logging of these sensor checks or something?
d
I don't totally follow - 1.48.1 is the version that may be causing the problem
So downgrading to 1.47.0 and seeing if the problem goes away seems like the right next step to me
a
Ayy nice, that solved it actually
Downgrading grpcio to 1.47.0
d
Ah amazing
@Bianca Rosa curious if that would apply to your situation too
a
Much appreciated Daniel 🙂
b
I've been following and will try this later today! Right now we have poetry locking grpcio = ">=1.43.0", so we could be grabbing 1.47.1 - this started happening last week or so.
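(For later readers: with a floating constraint like that, adding an upper bound keeps the broken release out. A sketch of a pyproject.toml fragment, assuming Poetry; the version numbers come from this thread:)

```toml
[tool.poetry.dependencies]
# grpcio 1.48.1 appears to leak sockets per this thread; stay below 1.48
grpcio = ">=1.43.0,<1.48"
```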
d
1.48.1 is the bad one - that was released on 9/1/2022
b
Alright - we are also not using Dagster 1.x.x yet, so that upgrade could be useful for us too.