# ask-community
g
Hi all! I have a self-hosted deploy of Dagster in K8s. One of my code locations is frequently being marked as "not available" in Dagit's Deployment tab. This only happens when a specific schedule is active. That schedule's tick also always fails due to a code-server timeout, even though I can run the associated job without any problem (even some large backfills of this job work properly). To give a little more context, this job is built with define_asset_job with only one asset. However, the asset uses a MultiPartitionsDefinition: the first dimension is a daily partition, and the second dimension is a static partition with ~100 elements. Any idea on how to make this schedule work (and hopefully keep the code location from crashing, too)? Feel free to ask for the code defining the described entities.
Hi, the problem was a timeout of the gRPC client: the scheduling function took too long (more than 1 minute) on the gRPC server/code-location deployment. I solved it by increasing the timeout through an environment variable:
# values.yaml

dagsterDaemon:
  ...
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: '180'
As feedback, it was rather cryptic to find this configuration option; I had to dig inside the code of dagster._grpc.client. The docs directed me to https://docs.dagster.io/deployment/dagster-instance#grpc-servers , but that approach does not work (or at least I could not make it work) on a Dagster K8s deployment.
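As a quick sanity check, here is a minimal sketch (not from the thread): per the discussion above, this variable is read on the gRPC-client side (the daemon deployment where it was set), with roughly a one-minute fallback when unset, so it can be verified from inside that container:
import os

# Reads the same environment variable the gRPC client consults; when it is
# unset, Dagster falls back to the ~60 second timeout the schedule tick was
# hitting in this thread.
timeout = int(os.getenv("DAGSTER_GRPC_TIMEOUT_SECONDS", "60"))
print(f"effective gRPC client timeout: {timeout}s")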
d
Hey gustavo - any chance you could share the code of the schedule that was causing issues?
g
sure
give me a minute
d
I can't come up with a good reason why the schedule timing out would cause the whole code location to be marked as failed on the deployment tab - it might indicate that the whole server was just getting overloaded? I'd be a little bit worried that bumping up the timeout is more of a bandaid than actually fixing the root cause of the problem
g
yeah, I also don't like just bumping the timeout... though I needed it for a quasi-production setup
but if we can make the code faster, it would be better, for sure
about the code location deployment, I should also note that I had a very large backfill running against that same code location
sorry for the delay, here's the code...
ge_influxdb_daily_schedule = build_schedule_from_multipartitioned_job(
    ge_influxdb_daily_job,
    hour_of_day=1,
    minute_of_hour=5,
    tags={
        "dagster/priority": "10",
    },
)
from dagster import (
    DefaultScheduleStatus,
    JobDefinition,
    MultiPartitionKey,
    MultiPartitionsDefinition,
    ScheduleEvaluationContext,
    StaticPartitionsDefinition,
    TimeWindowPartitionsDefinition,
    schedule,
)
import dagster._check as check


def build_schedule_from_multipartitioned_job(
    job: JobDefinition,
    description=None,
    name=None,
    minute_of_hour=None,
    hour_of_day=None,
    day_of_week=None,
    day_of_month=None,
    default_status=DefaultScheduleStatus.STOPPED,
    tags=None,
):
    multipartitions_def: MultiPartitionsDefinition = job.partitions_def
    time_window_partitions_def: TimeWindowPartitionsDefinition = (
        multipartitions_def.primary_dimension.partitions_def
    )
    static_partitions_def: StaticPartitionsDefinition = (
        multipartitions_def.secondary_dimension.partitions_def
    )

    cron_schedule = time_window_partitions_def.get_cron_schedule(
        minute_of_hour, hour_of_day, day_of_week, day_of_month
    )

    @schedule(
        cron_schedule=cron_schedule,
        job=job,
        default_status=default_status,
        execution_timezone=time_window_partitions_def.timezone,
        name=check.opt_str_param(name, "name", f"{job.name}_schedule"),
        tags=tags,
        description=check.opt_str_param(description, "description"),
    )
    def _schedule(context: ScheduleEvaluationContext):

        # Run for the latest partition.
        # Prior partitions will have been handled by prior ticks.
        time_key = time_window_partitions_def.get_last_partition_key(
            context.scheduled_execution_time
        )

        for static_key in static_partitions_def.get_partition_keys():
            multipartition_key = MultiPartitionKey(
                {
                    multipartitions_def.primary_dimension.name: time_key,
                    multipartitions_def.secondary_dimension.name: static_key,
                }
            )
            yield job.run_request_for_partition(
                partition_key=multipartition_key,
                run_key=multipartition_key,
                tags=tags,
            )

    return _schedule
d
How many run requests do you expect it to produce on each tick?
g
I use it for two schedules
one of them will produce ~50
the other, ~110
54 and 106 to be exact
the one that was having the bugs is the one generating 106
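For illustration, a minimal sketch with hypothetical device names (the real static dimensions here have 54 and 106 elements): the schedule above yields exactly one run request per static key on each tick, so the per-tick fan-out equals the size of the static dimension.
from dagster import StaticPartitionsDefinition

# Hypothetical static dimension standing in for the real ~100-device one.
devices = StaticPartitionsDefinition([f"device-{i:03d}" for i in range(106)])

# One RunRequest is yielded per static key for the single latest daily
# partition, so each tick produces this many run requests.
print(len(devices.get_partition_keys()))  # -> 106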
d
hmm I wouldn't expect that call to take 60 seconds to run (unless something else was generally overloading the server)
you mentioned other potentially expensive things (large backfills etc.) were happening at the same time?
g
yes
I had a backfill with more than 10k partitions queued
d
Got it - my bet would be on that causing the slowdown rather than the schedule, I think I saw a similar issue get filed by somebody else recently..
g
although only ~14 jobs were running simultaneously
(due to concurrency limitations)
@daniel I will try to set the timeout back to 1min when the code location workload is back to normal. For now, the increased timeout will do 🙂 Thanks for taking the time to look into it
d
no problem - the other thing I was going to suggest is possibly allocating more resources to the user code deployment pod
g
yeah, we also need to check its resources, it was one of our guesses, too
also, is the code deployment able to horizontally scale?
d
yeah, we'd like to add replica support there, which would also help here
I do see this PR which could possibly be relevant? https://github.com/dagster-io/dagster/pull/12431
👍 1
g
definitely
In fact, I guess I have even more partitions
3 years * 365 days/year * ~100 devices ≈ 110,000 multipartition keys
d
cc @claire in case we need another test case / partner for that PR
👍 1
g
@claire hmu if you need any info on this!
c
Yep, that PR should fix the schedule evaluation timing out. The source of the problem is that run_request_for_partition iterates through every multipartition key for each run request, which gets unwieldy.
🌈 2
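For readers on newer Dagster releases, a hedged alternative sketch (not the fix in the linked PR): RunRequest accepts a partition_key directly, which sidesteps run_request_for_partition's scan over every multipartition key. The helper name and signature below are illustrative, mirroring the factory posted earlier in the thread.
from dagster import MultiPartitionKey, RunRequest

# Illustrative helper (not from the thread): yields one RunRequest per static
# key, passing the multipartition key straight to RunRequest so no per-request
# iteration over all multipartition keys is needed.
def yield_run_requests(multipartitions_def, static_partitions_def, time_key, tags=None):
    for static_key in static_partitions_def.get_partition_keys():
        key = MultiPartitionKey(
            {
                multipartitions_def.primary_dimension.name: time_key,
                multipartitions_def.secondary_dimension.name: static_key,
            }
        )
        yield RunRequest(run_key=str(key), partition_key=key, tags=tags)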
d
ah OK it was the schedule after all then - my mistake
g
@daniel @claire I checked the PR and it is merged, but I can't tell if it is already available in 1.1.21 (I read the changelog, but it was not mentioned there). Is it in 1.1.21?
d
Yeah, that's live in 1.1.21
👍 1
g
Already upgraded, thanks!