# ask-community
n
Hi everyone, my pipelines are not starting at their scheduled time. Instead they are starting 1-2 hours late. The tick history shows them starting at the scheduled time, but the actual start time is different. Could it be resource related? Is Dagster putting the pipeline in a queue?
c
There's a difference between the time a tick occurs and the time that a job kicks off. The tick just kicks off a request for a run, which is then handled by the run coordinator into actually making that run happen.
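As a rough conceptual sketch of that separation (this is an illustration, not Dagster internals; the names `tick` and `coordinator_drain` are made up): the tick only enqueues a run request, and a separate coordinator loop later turns requests into launched runs, so the launch timestamp can lag the tick timestamp.

```python
import queue
import time

# Conceptual sketch only: a tick enqueues a run request "on time";
# the coordinator drains the queue later and actually launches runs.
run_requests = queue.Queue()

def tick(job_name, tick_time):
    # The tick itself is cheap: it just records a request.
    run_requests.put((job_name, tick_time))

def coordinator_drain():
    # The coordinator launches requested runs; if it is slow or
    # resource-starved, launch time lags the tick time.
    launched = []
    while not run_requests.empty():
        job, requested_at = run_requests.get()
        launched.append((job, requested_at, time.time()))
    return launched

tick("my_job", time.time())
runs = coordinator_drain()
```

The gap between `requested_at` and the launch timestamp in this toy model is the lag being discussed in the thread.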
n
Thanks for responding @chris. Shouldn't it ideally be starting at the time it was requested?
c
ideally, the lag time should not be so high. A few questions: • Are you using the queued run coordinator? • How many ticks are being fired off? Is there potentially a memory bottleneck?
n
• We're using the DefaultRunCoordinator • A single tick is fired for each run. About the bottleneck, I am not sure; I don't see any OOM errors in the logs
@chris, should we be using queued run coordinator?
c
Queued run coordinator is definitely recommended for production workloads. Still surprised you're seeing such a large latency - how many schedules / how often are they ticking? Pinging @johann for any ideas as to what might be causing that
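For reference, the run coordinator is an instance-level setting. A sketch of the relevant `dagster.yaml` block for the 0.12.x era (the `max_concurrent_runs` value here is illustrative, not a recommendation):

```yaml
# dagster.yaml (instance config) - sketch of enabling the queued
# run coordinator; tune max_concurrent_runs for your workload.
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 10
```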
j
If it was consistently an hour late I might say investigate if dagit and the daemon are on different timezones… The runs just don’t exist for 1-2 hours after the tick? Or are they in STARTING, and take that long to reach STARTED?
n
They are not reaching the STARTING state, and from what I observed, even the tick is not available at the scheduled time. It was showing "unable to load tick history".
There are about 100 schedules. Some of them are ticking every hour.
@johann Delay is not consistent
Hi @johann, @chris, sorry to disturb you again, but can you please suggest a solution? I am having this issue in production pipelines.
j
do you have metrics on cpu/mem usage of the daemon process and the DB? Beefing those up could help
n
Ok, that means the problem is in resources. Could the queued run coordinator solve this issue without beefing up the CPU/memory?
j
the only difference with the queued run coordinator is that it will limit the maximum number of runs. It may help if you have too many runs that are overloading your system. If you don't have a bunch of in-progress runs then I wouldn't expect it to help.
n
In one of the cases there are only 7 runs occurring simultaneously in a pod. Most of the time it works fine, but about once every five runs there is a delay of 1-2 hrs
j
very strange
if it’s really once every 5 I wonder if there’s something strange with the cron string?
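One way to sanity-check a cron string is to compare when ticks should land against when they actually fire. A minimal sketch (my own example, not from the thread) for an hourly schedule like `0 * * * *`, where every tick should land exactly at the top of an hour:

```python
from datetime import datetime, timedelta

def next_top_of_hour(now: datetime) -> datetime:
    # For an hourly cron like "0 * * * *", the next tick is the
    # next top of the hour after `now`.
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)

# If observed tick times disagree with this, suspect the cron string
# (or a timezone mismatch between dagit and the daemon).
expected = next_top_of_hour(datetime(2021, 9, 1, 10, 17))
```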
n
Ohh my bad, that was an estimate; I meant a delay is observed about 20% of the time.
Hi, this issue has come up again. It got deprioritized at the time, but now we are working on it again. Can we find the reason for this behaviour in the Dagster logs?
j
It’d be good to get a full description of the issue: • Are the ticks occurring on time? Could you include a screenshot of the ticks graph? • Are the runs launching on time? Could you include a debug file of a delayed run? (Download Debug File is at the top-right dropdown on the run page)
d
We've made some pretty substantial performance improvements since 0.12.x across the board - it's possible that upgrading will make this go away