# deployment-kubernetes

Rubén Lopez Lozoya

04/22/2022, 1:53 PM
Hey team, we are experiencing an issue with our Dagster user deployments: they are constantly running a SQL query that is defined in a partition set (i.e. the partition set fn is being called constantly). This is causing our Dagit deployment to die and our DB to become overloaded. Any ideas? As you can see in the picture, it's constantly calling a function that reads a partition set from BQ (hence the download statements).

daniel

04/22/2022, 2:08 PM
Hi Ruben - in the short term is this something that you could cache / memoize within your python code? So that repeated calls only make the SQL query once?
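A minimal sketch of the caching idea suggested above, assuming a zero-argument partition function that returns a list of partition names. `run_partition_query`, the 300-second TTL, and the partition names are hypothetical placeholders, not taken from the real pipeline:

```python
import functools
import time
from typing import Callable, Dict, List, Optional, Tuple

# Counts how often the (hypothetical) expensive query actually runs.
calls: Dict[str, int] = {"count": 0}


def ttl_cache(ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache the result of a zero-argument function for ttl_seconds."""

    def decorator(fn: Callable[[], List[str]]) -> Callable[[], List[str]]:
        cached: Optional[Tuple[float, List[str]]] = None

        @functools.wraps(fn)
        def wrapper() -> List[str]:
            nonlocal cached
            now = clock()
            # Re-run the wrapped function only when nothing is cached yet
            # or the cached value is older than the TTL.
            if cached is None or now - cached[0] >= ttl_seconds:
                cached = (now, fn())
            # Return a copy so callers cannot mutate the cached list.
            return list(cached[1])

        return wrapper

    return decorator


def run_partition_query() -> List[str]:
    """Hypothetical stand-in for the SQL query against BQ."""
    calls["count"] += 1
    return ["company_a", "company_b"]


@ttl_cache(ttl_seconds=300)  # assumed TTL; tune to how fresh the list must be
def get_company_partitions() -> List[str]:
    return run_partition_query()
```

With this in place, repeated calls inside the TTL window reuse the first result, so the database sees one query instead of one per call. The cache lives in process memory, so each user code process would still run the query once on its own.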

Rubén Lopez Lozoya

04/22/2022, 2:09 PM
I'd need to get back to the team, but is this issue something you are already aware of? I just want to make sure this behavior is inherent to Dagster and not something we may be doing wrong.

daniel

04/22/2022, 2:11 PM
This is the function that returns the list of partitions?

Rubén Lopez Lozoya

04/22/2022, 2:11 PM
yep

daniel

04/22/2022, 2:11 PM
and what version of dagster is this?

Rubén Lopez Lozoya

04/22/2022, 2:12 PM
We have other partition sets defined, but this one was added recently, and only 3 days after deploying it, it started this never-ending loop of constantly running. Now, no matter how far back we roll back, it won't stop unless we stop including the set in the repository.
0.14.8

daniel

04/22/2022, 2:15 PM
and the intention is that on each schedule tick it will run on every partition (hence the partition selector)?

Rubén Lopez Lozoya

04/22/2022, 2:15 PM
so what I am trying to achieve is to have the whole list of partitions run once per day at 14:00
so I'd expect this schedule to trigger the partition_fn once per day at the desired time, and then run all items returned by that fn

daniel

04/22/2022, 2:16 PM
Is there any way to get a stack trace when it's running the partition function repeatedly? If you add something like
import traceback
traceback.print_stack()
inside get_company_partitions, that would tell us exactly where it's being called
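A small variation on the snippet above, in case the line-by-line stderr output is hard to read in the log viewer: it formats the stack into a single string and logs it as one record. The logger name, the placeholder return value, and `fake_schedule_tick` are hypothetical, used only to make the sketch self-contained:

```python
import logging
import traceback
from typing import List

logger = logging.getLogger("partition_debug")  # hypothetical logger name

# Kept in memory only for this illustration, so call sites can be inspected.
call_stacks: List[str] = []


def get_company_partitions() -> List[str]:
    # format_stack() returns the current call stack as a list of strings;
    # joining them gives one multi-line message instead of many log lines.
    stack = "".join(traceback.format_stack())
    call_stacks.append(stack)
    logger.warning(
        "get_company_partitions call #%d from:\n%s", len(call_stacks), stack
    )
    return ["company_a", "company_b"]  # placeholder for the real query result


def fake_schedule_tick() -> List[str]:
    """Hypothetical caller, standing in for whatever invokes the function."""
    return get_company_partitions()
```

Counting the entries in `call_stacks` (or the numbered log messages) also shows how many times the function is called, not just from where.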

Rubén Lopez Lozoya

04/22/2022, 2:17 PM
this is something I can try yes
does the use case I depicted make sense with my current implementation?

daniel

04/22/2022, 2:20 PM
it seems reasonable to me
I think the stack trace will give us some useful clues. I was under the impression that it should only be called if you're running a backfill (or once per day like you said)

Rubén Lopez Lozoya

04/22/2022, 2:39 PM
2022-04-22T14:36:38.925203914Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/grpc/server.py", line 521, in ExternalScheduleExecution
2022-04-22T14:36:38.925294492Z get_external_schedule_execution(
2022-04-22T14:36:38.925378256Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/grpc/impl.py", line 252, in get_external_schedule_execution
2022-04-22T14:36:38.925474883Z return schedule_def.evaluate_tick(schedule_context)
2022-04-22T14:36:38.925559545Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/definitions/schedule_definition.py", line 442, in evaluate_tick
2022-04-22T14:36:38.925648461Z result = list(ensure_gen(execution_fn(context)))
2022-04-22T14:36:38.925754251Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/definitions/partition.py", line 594, in _execution_fn
2022-04-22T14:36:38.925851548Z missing_partition_names = [
2022-04-22T14:36:38.925938358Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/definitions/partition.py", line 597, in <listcomp>
2022-04-22T14:36:38.926027145Z if partition.name not in self.get_partition_names(context.scheduled_execution_time)
2022-04-22T14:36:38.926113111Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/definitions/partition.py", line 520, in get_partition_names
2022-04-22T14:36:38.926201454Z return [part.name for part in self.get_partitions(current_time)]
2022-04-22T14:36:38.926298093Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/definitions/partition.py", line 510, in get_partitions
2022-04-22T14:36:38.926391442Z return self._partitions_def.get_partitions(current_time)
2022-04-22T14:36:38.926482648Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/definitions/partition.py", line 356, in get_partitions
2022-04-22T14:36:38.926576250Z partitions = self.partition_fn(current_time)
2022-04-22T14:36:38.926666711Z File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/definitions/partition.py", line 440, in _wrap_partition_fn
2022-04-22T14:36:38.926789120Z obj_list = partition_fn() # type: ignore
2022-04-22T14:36:38.926887857Z File "/opt/dagster/app/maquinillo/dagster_resources/underwriting/pipeline_defs/post_company_metrics_to_client_api_pipeline.py", line 409, in get_company_partitions
2022-04-22T14:36:38.926987185Z traceback.print_stack()
this is the stack trace (sorry if it's a bit messy)
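One thing the trace itself shows: the call into get_partition_names sits in the `if` clause of a list comprehension (the `<listcomp>` frame at partition.py line 597). In plain Python, a filter condition is re-evaluated for every element, so a function called there runs once per element. The sketch below is not Dagster code, only an illustration of that general behavior with hypothetical names; whether it fully explains the repeated queries in this thread is not confirmed here:

```python
from typing import Callable, Dict, List

# Counts how often the (hypothetical) partition function runs.
calls: Dict[str, int] = {"count": 0}


def partition_fn() -> List[str]:
    """Hypothetical stand-in for an expensive partition query."""
    calls["count"] += 1
    return ["a", "b", "c"]


def missing_recomputed(
    selected: List[str], get_names: Callable[[], List[str]]
) -> List[str]:
    # Same shape as the comprehension in the trace: get_names() is
    # evaluated again for every element of `selected`.
    return [name for name in selected if name not in get_names()]


def missing_computed_once(
    selected: List[str], get_names: Callable[[], List[str]]
) -> List[str]:
    # Evaluate get_names() a single time and reuse the result.
    known = set(get_names())
    return [name for name in selected if name not in known]
```

With three selected partitions, the first variant calls `partition_fn` three times inside the comprehension, while the second calls it once. If the partition function issues a query per call, the number of queries in the first shape grows with the number of partitions.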

daniel

04/22/2022, 2:41 PM
That's no problem - I'm a bit confused though, because that evaluate_tick call should only happen when the cron schedule triggers (once per day in this case). Are there any logs from your daemon that might help explain why it's repeatedly making that ExternalScheduleExecution call?

Rubén Lopez Lozoya

04/22/2022, 2:43 PM
2022-04-22T14:40:26.941372516Z external_schedule_execution_args
2022-04-22T14:40:26.941375803Z File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 119, in _streaming_query
2022-04-22T14:40:26.941379619Z raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
2022-04-22T14:40:26.941384093Z
2022-04-22T14:40:26.941387633Z The above exception was caused by the following exception:
2022-04-22T14:40:26.941391342Z grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
2022-04-22T14:40:26.941397975Z status = StatusCode.DEADLINE_EXCEEDED
2022-04-22T14:40:26.941402632Z details = "Deadline Exceeded"
2022-04-22T14:40:26.941407134Z debug_error_string = "{"created":"@1650638426.930071599","description":"Error received from peer ipv4:10.88.0.193:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"Deadline Exceeded","grpc_status":4}"
2022-04-22T14:40:26.941412324Z >
2022-04-22T14:40:26.941416031Z
2022-04-22T14:40:26.941419939Z Stack Trace:
2022-04-22T14:40:26.941423891Z File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 117, in _streaming_query
2022-04-22T14:40:26.941433701Z yield from response_stream
2022-04-22T14:40:26.941437169Z File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
2022-04-22T14:40:26.941441036Z return self._next()
2022-04-22T14:40:26.941444936Z File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
2022-04-22T14:40:26.941449299Z raise self
it says it cannot reach user code but in the user code I am seeing that it constantly runs the partition_fn

daniel

04/22/2022, 2:44 PM
Ah, is that firing in a loop?
that timeout?

Rubén Lopez Lozoya

04/22/2022, 2:44 PM
so maybe there is a weird loop
yes

daniel

04/22/2022, 2:44 PM
Is it possible that this get_company_partitions call is taking more than 60 seconds to execute?
r

Rubén Lopez Lozoya

04/22/2022, 2:44 PM
no, it shouldn't; at least when I execute it, it doesn't
I mean, if I go to the pipeline page and click on the partitions section
and then load the partitions, it takes maybe 3 seconds tops
but at some point my whole deployment crashes because of this ongoing loop of errors
my DB is overloaded
but until then, the partition works fine
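To check the duration question directly rather than by eye, a timing wrapper like the following could be used. This is a sketch under assumptions: the 60-second figure is taken from the question above and is not verified against Dagster's configuration, and the placeholder partition function only simulates a short query:

```python
import time
from typing import Callable, List, Tuple

# Figure taken from the question in the thread; treat it as an assumption.
ASSUMED_DEADLINE_SECONDS = 60.0


def timed_call(fn: Callable[[], List[str]]) -> Tuple[List[str], float]:
    """Run fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


def get_company_partitions() -> List[str]:
    """Hypothetical stand-in that simulates a short query."""
    time.sleep(0.01)
    return ["company_a", "company_b"]


partitions, elapsed = timed_call(get_company_partitions)
if elapsed >= ASSUMED_DEADLINE_SECONDS:
    print(f"partition function took {elapsed:.2f}s, over the assumed deadline")
else:
    print(f"partition function took {elapsed:.2f}s, under the assumed deadline")
```

Timing a single call in isolation may look fine (the roughly 3 seconds mentioned above) even if many calls within one request add up to more than the deadline, so it is worth logging the duration of every call while the loop is happening.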

daniel

04/22/2022, 2:46 PM
Ah, so maybe the timeout is what happens after the deployment crashes

Rubén Lopez Lozoya

04/22/2022, 2:46 PM
but the daemon and user deployments are not crashed right now, it's Dagit

daniel

04/22/2022, 2:47 PM
Can you give more output from the scheduler daemon, maybe as a text file? I'm trying to figure out if it's actually calling ExternalScheduleExecution frequently or if something else is going on
This should be much more performant in the 0.14.11 release that just went out