# ask-community

Gustavo Carvalho

02/19/2023, 8:31 PM
Hi all! I have a self-hosted deployment of Dagster in K8s. One of my code locations is frequently being marked as "not available" in Dagit's Deployment tab. This only happens when a specific schedule is active. That schedule's tick also always fails due to a code-server timeout, even though I can run the associated job without any problem (even some large backfills of this job work properly). To give a little more context, this job is built with define_asset_job with only one asset. However, the asset uses a MultiPartitionsDefinition: the first dimension is a daily partition, and the second dimension is a static partition with ~100 elements. Any idea how to make this schedule work (and hopefully keep the code location from crashing, too)? Feel free to ask for the code defining the entities described.
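(For reference, a minimal sketch of the kind of setup described above — the asset body and dimension names are hypothetical stand-ins, and only the job name matches the schedule snippet shared later in the thread.)
Copy code
from dagster import (
    DailyPartitionsDefinition,
    MultiPartitionsDefinition,
    StaticPartitionsDefinition,
    asset,
    define_asset_job,
)

# Hypothetical reconstruction: a daily x static multi-partitioned asset,
# wrapped in a single-asset job via define_asset_job.
partitions_def = MultiPartitionsDefinition(
    {
        "date": DailyPartitionsDefinition(start_date="2020-01-01"),
        "device": StaticPartitionsDefinition([f"device_{i}" for i in range(100)]),
    }
)


@asset(partitions_def=partitions_def)
def ge_influxdb_daily(context):
    # one (date, device) multipartition is materialized per run
    ...


ge_influxdb_daily_job = define_asset_job(
    "ge_influxdb_daily_job",
    selection=[ge_influxdb_daily],
    partitions_def=partitions_def,
)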
Hi, the problem was a timeout of the gRPC client: the schedule evaluation function took too long (more than 1 minute) on the gRPC server / code-location deployment. I solved it by increasing the timeout through an environment variable:
Copy code
# values.yaml

dagsterDaemon:
  ...
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: '180'
As feedback, it was rather cryptic to find this configuration option; I had to dig inside the code of dagster._grpc.client. The docs directed me to https://docs.dagster.io/deployment/dagster-instance#grpc-servers , but that setting does not work (or at least I could not make it work) with the Dagster K8s deployment.
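(Rough illustration of the behavior described above — an assumption about the mechanism, not an excerpt from dagster._grpc.client: the client-side timeout appears to fall back to 60 seconds unless the environment variable overrides it, which is why an evaluation longer than a minute kept failing.)
Copy code
import os

# Assumed behavior, based on the symptoms in this thread: without the
# override, gRPC calls to the code server time out after ~60 seconds, so a
# schedule evaluation that takes >1 minute fails until
# DAGSTER_GRPC_TIMEOUT_SECONDS is raised.
timeout_seconds = int(os.getenv("DAGSTER_GRPC_TIMEOUT_SECONDS", "60"))
print(f"gRPC timeout: {timeout_seconds}s")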

daniel

02/21/2023, 3:18 PM
Hey gustavo - any chance you could share the code of the schedule that was causing issues?

Gustavo Carvalho

02/21/2023, 3:18 PM
surely
give me a minute

daniel

02/21/2023, 3:19 PM
I can't come up with a good reason why the schedule timing out would cause the whole code location to be marked as failed on the Deployment tab - it might indicate that the whole server was just getting overloaded? I'd be a little worried that bumping up the timeout is more of a band-aid than a fix for the root cause of the problem.

Gustavo Carvalho

02/21/2023, 3:20 PM
yeah, I also don't like just bumping the timeout... though I needed it for a quasi-production setup
but if we can make the code faster, it would be better, for sure
about the code location deployment, I should also note that I had a very large backfill running against that code location
sorry for the delay, here's the code...
Copy code
ge_influxdb_daily_schedule = build_schedule_from_multipartitioned_job(
    ge_influxdb_daily_job,
    hour_of_day=1,
    minute_of_hour=5,
    tags={
        "dagster/priority": "10",
    },
)
Copy code
from dagster import (
    DefaultScheduleStatus,
    JobDefinition,
    MultiPartitionKey,
    MultiPartitionsDefinition,
    ScheduleEvaluationContext,
    StaticPartitionsDefinition,
    TimeWindowPartitionsDefinition,
    schedule,
)
import dagster._check as check


def build_schedule_from_multipartitioned_job(
    job: JobDefinition,
    description=None,
    name=None,
    minute_of_hour=None,
    hour_of_day=None,
    day_of_week=None,
    day_of_month=None,
    default_status=DefaultScheduleStatus.STOPPED,
    tags=None,
):
    multipartitions_def: MultiPartitionsDefinition = job.partitions_def
    time_window_partitions_def: TimeWindowPartitionsDefinition = (
        multipartitions_def.primary_dimension.partitions_def
    )
    static_partitions_def: StaticPartitionsDefinition = (
        multipartitions_def.secondary_dimension.partitions_def
    )

    cron_schedule = time_window_partitions_def.get_cron_schedule(
        minute_of_hour, hour_of_day, day_of_week, day_of_month
    )

    @schedule(
        cron_schedule=cron_schedule,
        job=job,
        default_status=default_status,
        execution_timezone=time_window_partitions_def.timezone,
        name=check.opt_str_param(name, "name", f"{job.name}_schedule"),
        tags=tags,
        description=check.opt_str_param(description, "description"),
    )
    def _schedule(context: ScheduleEvaluationContext):

        # Run for the latest partition.
        # Prior partitions will have been handled by prior ticks.
        time_key = time_window_partitions_def.get_last_partition_key(
            context.scheduled_execution_time
        )

        for static_key in static_partitions_def.get_partition_keys():
            multipartition_key = MultiPartitionKey(
                {
                    multipartitions_def.primary_dimension.name: time_key,
                    multipartitions_def.secondary_dimension.name: static_key,
                }
            )
            yield job.run_request_for_partition(
                partition_key=multipartition_key,
                run_key=multipartition_key,
                tags=tags,
            )

    return _schedule
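(For a sense of the fan-out, a minimal sketch — hypothetical dimension names and counts — of what one tick of the inner _schedule yields: one run request per static key, all pinned to the latest daily partition.)
Copy code
from dagster import MultiPartitionKey

# Hypothetical illustration: with a ~106-element static dimension, each tick
# produces ~106 multipartition keys, one per (latest day, static key) pair.
time_key = "2023-02-20"
static_keys = [f"device_{i}" for i in range(106)]
keys = [MultiPartitionKey({"date": time_key, "device": k}) for k in static_keys]
assert len(keys) == 106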

daniel

02/21/2023, 4:25 PM
How many run requests do you expect it to produce on each tick?

Gustavo Carvalho

02/21/2023, 4:26 PM
I use it for two schedules
one of them will produce ~50
the other, ~110
54 and 106 to be exact
the one that was having the issues is the one generating 106

daniel

02/21/2023, 4:27 PM
hmm I wouldn't expect that call to take 60 seconds to run (unless something else was generally overloading the server)
you mentioned other potentially expensive things (large backfills etc.) were happening at the same time?

Gustavo Carvalho

02/21/2023, 4:27 PM
yes
I had a backfill with more than 10k partitions queued

daniel

02/21/2023, 4:28 PM
Got it - my bet would be on that causing the slowdown rather than the schedule. I think I saw a similar issue filed by somebody else recently...

Gustavo Carvalho

02/21/2023, 4:28 PM
although only ~14 jobs were running simultaneously
(due to concurrency limitations)
@daniel I will try to set the timeout back to 1min when the code location workload is back to normal. For now, the increased timeout will do 🙂 Thanks for taking the time to look into it

daniel

02/21/2023, 5:04 PM
no problem - the other thing I was going to suggest is possibly allocating more resources to the user code deployment pod

Gustavo Carvalho

02/21/2023, 5:04 PM
yeah, we also need to check its resources; that was one of our guesses, too
also, is the code deployment able to scale horizontally?

daniel

02/21/2023, 5:06 PM
yeah, we'd like to add replica support there, which would also help here
I do see this PR which could possibly be relevant? https://github.com/dagster-io/dagster/pull/12431
👍 1

Gustavo Carvalho

02/21/2023, 5:17 PM
definitely
In fact, I guess I have even more partitions
3 years * 365 days/year * ~100 devices ≈ 110,000 multipartitions

daniel

02/21/2023, 5:44 PM
cc @claire in case we need another test case / partner for that PR
👍 1

Gustavo Carvalho

02/21/2023, 5:45 PM
@claire hmu if you need any info on this!

claire

02/21/2023, 6:15 PM
Yep, that PR should fix the schedule evaluation timing out. The source of the problem is that run_request_for_partition iterates through every multipartition key for each run request, which gets unwieldy.
🌈 2
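(For anyone hitting this on a newer Dagster release: a minimal sketch, assuming a version where RunRequest accepts partition_key directly — run_request_for_partition was eventually deprecated in its favor — that sidesteps the per-partition iteration described above. Dimension names are hypothetical.)
Copy code
from dagster import MultiPartitionKey, RunRequest

# Hedged alternative (assumes RunRequest supports partition_key): build the
# multipartition key yourself and yield RunRequest, avoiding
# run_request_for_partition's walk over every partition per request.
def run_requests_for_latest_day(time_key, static_keys, tags=None):
    for static_key in static_keys:
        key = MultiPartitionKey({"date": time_key, "device": static_key})
        yield RunRequest(run_key=str(key), partition_key=str(key), tags=tags)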

daniel

02/21/2023, 6:17 PM
ah OK it was the schedule after all then - my mistake

Gustavo Carvalho

03/03/2023, 12:19 PM
@daniel @claire I checked the PR and it is merged, but I can't tell if it is already available in 1.1.21 (I read the changelog, but it was not mentioned there). Is it in 1.1.21?

daniel

03/03/2023, 1:51 PM
Yeah, that’s live in 1.1.21
👍 1

Gustavo Carvalho

03/03/2023, 1:51 PM
Already upgraded, thanks!