# ask-community
j
Hello — please let me know if this should go to #dagster-serverless instead since we’re on Serverless. We’ve got an S3 sensor on a bucket with many files, with a cursor and run_keys set to the S3 keys, and it has started failing increasingly often with the attached error. (See thread for more…)
I first assumed we were submitting too many runs in a single tick, so I kept limiting the number of RunRequests per tick further and further, down to just 25 now, and it is still failing. So I'm fairly sure the number of RunRequests is not the main problem. Perhaps our run_keys are too long? (Ours can now be up to ~80 chars.) I didn't see any requirements or guidelines about run_key values in the docs, so I wasn't sure what we should do. Do you think it would help if I uuidv5'd our run keys to shorten them to 36 chars?
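(For context, here is a minimal sketch of what a sensor along these lines might look like. The bucket name, job name, config shape, and the per-tick cap are placeholders rather than the actual code from this thread; it only illustrates the cursor-plus-run_key pattern being described.)

```python
import boto3
from dagster import RunRequest, SensorEvaluationContext, SkipReason, sensor

# Illustrative per-tick cap; the thread describes lowering this value repeatedly.
MAX_RUN_REQUESTS_PER_TICK = 25


@sensor(job_name="process_s3_file", minimum_interval_seconds=60)  # hypothetical job name
def s3_backlog_sensor(context: SensorEvaluationContext):
    s3 = boto3.client("s3")
    last_key = context.cursor or ""

    # List keys lexicographically after the cursor ("my-bucket" is a placeholder).
    resp = s3.list_objects_v2(Bucket="my-bucket", StartAfter=last_key)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]

    if not keys:
        yield SkipReason("No new files past the cursor")
        return

    batch = keys[:MAX_RUN_REQUESTS_PER_TICK]
    for key in batch:
        # The S3 key itself doubles as the run_key (~80 chars in this case),
        # which is what Dagster uses to de-duplicate runs across ticks.
        yield RunRequest(
            run_key=key,
            run_config={"ops": {"process_file": {"config": {"s3_key": key}}}},
        )

    # Advance the cursor so the next tick starts after the last submitted key.
    context.update_cursor(batch[-1])
```

The per-tick slice plus the cursor advance is why capping the batch size was the first lever tried here.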
d
Hi Jonathan - sorry for the trouble here, we're looking into improvements now to avoid this timeout. In the short term, as a workaround, is doing fewer RunRequests per tick with a shorter interval between ticks an option?
I.e. batches of 5 instead of 25, but 5x as often? It may or may not be viable depending on your current interval.
j
Hi, thanks for the response! It's set to 1 min right now, so I could try 5 requests and drop the interval to 15s? There's going to be a lot of keys in our backlog anyway, since this is the initial uptake of this new sensor; in the long term it won't get new entries very often. But in the short term I'm just after as much throughput as I can get without it failing, so we can finish the initial load sooner rather than later. It's not a huge deal if it takes longer while we tune things, though 👍
So far batches of 5 are working again, thanks for the tip. I put down 15s minimum interval, and the real world intervals are closer to 1min, but that’s fine — I’ll take whatever I can get while you guys are working on it 👍
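(In terms of the sketch above, the workaround amounts to changing two values; again illustrative, not the actual code. Note that minimum_interval_seconds is a floor rather than a schedule: Dagster waits at least that long between evaluations, which is consistent with the observed real-world gap of closer to a minute.)

```python
from dagster import SensorEvaluationContext, sensor

# Smaller batches, submitted more often (the workaround from this thread).
MAX_RUN_REQUESTS_PER_TICK = 5


@sensor(job_name="process_s3_file", minimum_interval_seconds=15)  # hypothetical job name
def s3_backlog_sensor(context: SensorEvaluationContext):
    ...  # body as in the earlier sketch
```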
Let me know if/when you think I could raise that batch size!
d
Will do - likely sometime later this week, but will keep you posted
j
Sounds good, thank you!
Since yesterday I’ve had to drop all the way down to 1 RunRequest per tick to get it back running again. It’s a little late now, but I’m curious, do you think it would have made a difference if we’d used shorter run_keys?
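(For reference, the run_key shortening floated earlier would look roughly like the snippet below. It is deterministic, so de-duplication on run_key behaves the same way; as the reply that follows explains, key length turned out not to be the issue.)

```python
import uuid


def short_run_key(s3_key: str) -> str:
    """Map an arbitrarily long S3 key to a deterministic 36-character run key."""
    # NAMESPACE_URL is an arbitrary but fixed namespace choice; any stable namespace works.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, s3_key))


# The same key always yields the same UUID, so re-submitting it is still de-duplicated.
print(short_run_key("raw/2024/05/01/some-very-long-object-name.json"))
```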
d
I don't think it's the size of the keys - we believe the Postgres query planner is just having trouble efficiently searching for existing runs with your run keys, so we're going to rework the query to give it more clues about how to do that performantly
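(Purely as an illustration of the kind of lookup involved, not Dagster's actual internal query: the run key is recorded on each run as a tag, assumed here to be the standard dagster/run_key tag, so honoring a RunRequest involves asking the run storage whether a run with that tag already exists. A rough user-level equivalent:)

```python
from dagster import DagsterInstance, RunsFilter

# Rough user-level equivalent of the "does a run with this run_key already exist?" check.
instance = DagsterInstance.get()
existing = instance.get_runs(
    filters=RunsFilter(tags={"dagster/run_key": "raw/2024/05/01/some-very-long-object-name.json"}),
    limit=1,
)
print(bool(existing))
```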
j
Okay, thanks for the info. I was just wondering if there was anything more I could do to help from our end! Thanks for working on it though, really appreciate it
d
We think the query planner is now playing more nicely with the run key queries - you could try bumping back up to a higher number of RunRequests per tick and see if you get better results
j
Awesome, thanks! We actually are up to date now (we manually advanced the cursor to the end of its workload and ran a script outside of Dagster to catch it up). I expect there might be a second task like this one coming down our pipeline though, so when that happens I’ll test the sensor again and let you know 👍
Hi again @daniel, coming back with an update here! It's definitely much better now, thanks 🎉 We have increased the number of files per tick to 10 and the minimum interval to ~30s, and it works well enough. It now fails with a timeout 2-6 times per day, but it's always able to get that tick done after a retry or two and move on, and it only fails while trying to submit Runs, never during a tick that would be Skipped. This throughput is more than enough for our ongoing needs 👍 The only icing on the cake now would be getting rid of those last sporadic failures, since they make noise in our alerts channel that isn't really actionable for us. Thanks for the query planner improvements since we last spoke; they're already a big help for us!