# ask-community
g
Hi all! I have a self-hosted deploy of Dagster in K8s. One of my code locations is frequently being marked as "not available" in Dagit's Deployment tab. This only happens when a specific schedule is active. That schedule's tick also always fails due to a code-server timeout, even though I can run the associated job without any problem (even some large backfills of this job work properly). To give a little more context, this job is built with define_asset_job with only one asset. However, the asset uses a MultiPartitionsDefinition: the first dimension is a daily partition, and the second dimension is a static partition with ~100 elements. Any idea on how to make this schedule work (and hopefully keep the code location from crashing, too)? Feel free to ask for the code defining the described entities.
Hi, the problem was a timeout of the gRPC client: the scheduling function took too long (more than 1 minute) on the gRPC server/code-location deployment. I solved it by increasing the timeout through an environment variable:
# values.yaml

dagsterDaemon:
  ...
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: '180'
As feedback, it was rather cryptic to find this configuration option; I had to dig inside the code of dagster._grpc.client. The docs directed me to https://docs.dagster.io/deployment/dagster-instance#grpc-servers , but that approach does not work (or at least I could not make it work) on a Dagster K8s deployment.
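As a quick sanity check, here is a minimal sketch (not from the thread): per the discussion above, this variable is read on the gRPC-client side (the daemon deployment where it was set), with roughly a one-minute fallback when unset, so it can be verified from inside that container:
import os

# Reads the same environment variable the gRPC client consults; when it is
# unset, Dagster falls back to the ~60 second timeout the schedule tick was
# hitting in this thread.
timeout = int(os.getenv("DAGSTER_GRPC_TIMEOUT_SECONDS", "60"))
print(f"effective gRPC client timeout: {timeout}s")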
d
Hey gustavo - any chance you could share the code of the schedule that was causing issues?
g
sure
give me a minute
d
I can't come up with a good reason why the schedule timing out would cause the whole code location to be marked as failed on the deployment tab - it might indicate that the whole server was just getting overloaded? I'd be a little bit worried that bumping up the timeout is more of a bandaid than actually fixing the root cause of the problem
g
yeah, I also don't like just bumping the timeout... though I needed it for a quasi-production setup
but if we can make the code faster, it would be better, for sure
about the code location deployment, I should also note that I had a very large backfill running against that same code location
sorry for the delay, here's the code...
ge_influxdb_daily_schedule = build_schedule_from_multipartitioned_job(
    ge_influxdb_daily_job,
    hour_of_day=1,
    minute_of_hour=5,
    tags={
        "dagster/priority": "10",
    },
)
from dagster import (
    DefaultScheduleStatus,
    JobDefinition,
    MultiPartitionKey,
    MultiPartitionsDefinition,
    ScheduleEvaluationContext,
    StaticPartitionsDefinition,
    TimeWindowPartitionsDefinition,
    schedule,
)
import dagster._check as check


def build_schedule_from_multipartitioned_job(
    job: JobDefinition,
    description=None,
    name=None,
    minute_of_hour=None,
    hour_of_day=None,
    day_of_week=None,
    day_of_month=None,
    default_status=DefaultScheduleStatus.STOPPED,
    tags=None,
):
    multipartitions_def: MultiPartitionsDefinition = job.partitions_def
    time_window_partitions_def: TimeWindowPartitionsDefinition = (
        multipartitions_def.primary_dimension.partitions_def
    )
    static_partitions_def: StaticPartitionsDefinition = (
        multipartitions_def.secondary_dimension.partitions_def
    )

    cron_schedule = time_window_partitions_def.get_cron_schedule(
        minute_of_hour, hour_of_day, day_of_week, day_of_month
    )

    @schedule(
        cron_schedule=cron_schedule,
        job=job,
        default_status=default_status,
        execution_timezone=time_window_partitions_def.timezone,
        name=check.opt_str_param(name, "name", f"{job.name}_schedule"),
        tags=tags,
        description=check.opt_str_param(description, "description"),
    )
    def _schedule(context: ScheduleEvaluationContext):

        # Run for the latest partition.
        # Prior partitions will have been handled by prior ticks.
        time_key = time_window_partitions_def.get_last_partition_key(
            context.scheduled_execution_time
        )

        for static_key in static_partitions_def.get_partition_keys():
            multipartition_key = MultiPartitionKey(
                {
                    multipartitions_def.primary_dimension.name: time_key,
                    multipartitions_def.secondary_dimension.name: static_key,
                }
            )
            yield job.run_request_for_partition(
                partition_key=multipartition_key,
                run_key=multipartition_key,
                tags=tags,
            )

    return _schedule
d
How many run requests do you expect it to produce on each tick?
g
I use it for two schedules
one of them will produce ~50
the other, ~110
54 and 106 to be exact
the one that was having the bugs is the one generating 106
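For illustration, a minimal sketch with hypothetical device names (the real static dimensions here have 54 and 106 elements): the schedule above yields exactly one run request per static key on each tick, so the per-tick fan-out equals the size of the static dimension.
from dagster import StaticPartitionsDefinition

# Hypothetical static dimension standing in for the real ~100-device one.
devices = StaticPartitionsDefinition([f"device-{i:03d}" for i in range(106)])

# One RunRequest is yielded per static key for the single latest daily
# partition, so each tick produces this many run requests.
print(len(devices.get_partition_keys()))  # -> 106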
d
hmm I wouldn't expect that call to take 60 seconds to run (unless something else was generally overloading the server)
you mentioned other potentially expensive things (large backfills etc.) were happening at the same time?
g
yes
I had a backfill with more than 10k partitions queued
d
Got it - my bet would be on that causing the slowdown rather than the schedule, I think I saw a similar issue get filed by somebody else recently..
g
although only ~14 jobs were running simultaneously
(due to concurrency limitations)
@daniel I will try to set the timeout back to 1min when the code location workload is back to normal. For now, the increased timeout will do 🙂 Thanks for taking the time to look into it
d
no problem - the other thing I was going to suggest is possibly allocating more resources to the user code deployment pod
g
yeah, we also need to check its resources, it was one of our guesses, too
also, is the code deployment able to horizontally scale?
d
yeah, we'd like to add replica support there, which would also help here
I do see this PR which could possibly be relevant? https://github.com/dagster-io/dagster/pull/12431
👍 1
g
definitely
In fact, I guess I have even more partitions
3 years * 365 days/year * ~100 devices ≈ 110,000 multipartition keys
d
cc @claire in case we need another test case / partner for that PR
👍 1
g
@claire hmu if you need any info on this!
c
Yep, that PR should fix the schedule evaluation timing out. The source of the problem is that run_request_for_partition iterates through every multipartition key for each run request, which gets unwieldy.
🌈 2
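For readers on newer Dagster releases, a hedged alternative sketch (not the fix in the linked PR): RunRequest accepts a partition_key directly, which sidesteps run_request_for_partition's scan over every multipartition key. The helper name and signature below are illustrative, mirroring the factory posted earlier in the thread.
from dagster import MultiPartitionKey, RunRequest

# Illustrative helper (not from the thread): yields one RunRequest per static
# key, passing the multipartition key straight to RunRequest so no per-request
# iteration over all multipartition keys is needed.
def yield_run_requests(multipartitions_def, static_partitions_def, time_key, tags=None):
    for static_key in static_partitions_def.get_partition_keys():
        key = MultiPartitionKey(
            {
                multipartitions_def.primary_dimension.name: time_key,
                multipartitions_def.secondary_dimension.name: static_key,
            }
        )
        yield RunRequest(run_key=str(key), partition_key=key, tags=tags)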
d
ah OK it was the schedule after all then - my mistake
g
@daniel @claire I checked the PR and it is merged, but I can't tell if it is already available in 1.1.21 (I read the changelog, but it was not mentioned there). Is it in 1.1.21?
d
Yeah, that's live in 1.1.21
👍 1
g
Already upgraded, thanks!