# dagster-plus
g
I have a job which should conduct a backfill for partitioned assets. The assets and the job have the daily partitions definition attached. However, only the normal, non-backfill jobs show the assets as green after materialization. The backfill job successfully materializes the assets, but the partition does not show up as green - and furthermore, a backfill is not even possible ("no backfills for <<<job_name>>>" is written on the page of this job).
s
I'm not entirely following what you mean by "However, only the normal, non-backfill jobs show the assets as green after materialization". Mind expanding? Is there a screenshot you could share? What version are you on? cc @Ben Gotow - I know we had some bugs related to the Launch Backfill dialog recently
g
the latest 1.1.7
I cannot run any backfill - this is the TLDR
not sure what screenshot to share here - I see the normal partitions tab in Dagit, but have no way to launch a backfill (there is no button), or to select any partitions when clicking the Materialize button
for a job based off of assets
d
If this is in cloud, do you have a link to the partitions page in question?
g
sure - I even sent you this link on Friday or Saturday to share it in the dagster private slack ... 😉
d
ah sorry, yes you did
s
The Launch Backfill button is no longer there, but you should be able to replicate the same functionality by selecting partitions when you click the Materialize button on the tab that has the asset graph. Daniel sent me a link to your job. I noticed that the assets inside it don't appear to be partitioned - is that expected? In general, partitioned asset jobs are only meant to be used with partitioned assets
g
Well, these are statefully partitioned SCD2 assets, and they need a backfill.
So it is not idempotent (i.e., delete 2022-01-01 and overwrite); rather, I delete the whole state and backfill from the raw data.
Only the raw data assets are partitioned in the traditional sense
s
In Dagster, backfills by definition work over partitioned assets - the point of a Dagster backfill is to be able to launch separate work for each partition. Do you want to have separate runs for different subcomponents of your SCD2 assets, or just a single run to backfill the entire asset? If the latter, I'd recommend adding boolean configuration options to those assets that you can toggle to recompute them from scratch, instead of using Dagster backfills
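For reference, a minimal sketch of such a toggle - the field name `full_refresh` and the branch bodies are illustrative assumptions, not from this thread:
```python
from dagster import Field, asset


# A boolean config option that toggles recomputing the asset from scratch,
# instead of relying on a Dagster backfill.
@asset(config_schema={"full_refresh": Field(bool, default_value=False)})
def scd2_table(context):
    if context.op_config["full_refresh"]:
        ...  # delete the existing state and rebuild from all raw data
    else:
        ...  # normal incremental update
```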
g
so you mean a boolean config plus a tqdm for loop ... hm this might perhaps be the better solution
Would you suggest creating a config value (which must then be passed to all the assets) or using a tag instead?
s
Our normal recommendation for this kind of thing would be a config value that’s passed to all the assets, but I could see how a tag would be more convenient if you don’t care about being able to validate the config before launching the run
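If you go the tag route instead, a sketch of how an asset could read it - assuming a tag named `full_refresh`; tags are plain strings, hence the comparison against "true":
```python
from dagster import asset


@asset
def scd2_table(context):
    # Run tags are strings and are not validated before launch
    if context.get_tag("full_refresh") == "true":
        ...  # recompute from scratch
    else:
        ...  # normal incremental update
```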
g
One question still remains open for me: where to put the for loop. I would have loved to use the partitions/Dagster itself to run the loop - potentially parallelized, neatly isolated like normal jobs, and nicely logged as well. Are you saying that, with this backfill=true config passed, I should have a manual for loop inside the asset which a) discovers the available raw partitions, b) deletes the existing state, and c) backfills (runs the loop)? Or could/should parts of this live in a sensor which launches jobs against Dagster? What are the best practices for such a case (I guess SCD2 stateful tables are rather common)?
s
I think I'm struggling a bit to understand how you can have executions for different raw partitions happen independently when they're not idempotent. Could you potentially model the SCD2 asset as a partitioned asset where each partition depends on the prior partition?
g
Is this a possibility? How could/would this work?
Indeed, they are certainly not independent. But it would be nice to use the normal backfill UI (with all its nice observability for the individual runs) rather than coding up a second code path (including the for loop) inside each asset or perhaps the IO manager
s
Partition self-dependencies are something that we just added support for. Self-dependencies aren't yet respected in the backfill code, but I'm working on adding that right now. Here's an example:
```python
from dagster import AssetIn, DailyPartitionsDefinition, TimeWindowPartitionMapping, asset


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"),
    ins={
        # Self-dependency: each partition reads the previous day's partition of this asset
        "a": AssetIn(
            partition_mapping=TimeWindowPartitionMapping(start_offset=-1, end_offset=-1)
        )
    },
)
def a(a):
    ...
```
g
This is super interesting! I was not yet aware of this functionality. Do you already have a timeline when you plan to add this in for backfill?
s
my dream is to get it done this week, but realistically it will be more like early January
The asset reconciliation sensor already supports it - i.e., if partition N fails, it won't try to run partition N + 1 until partition N is filled
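For reference, a minimal sketch of wiring that up, reusing the asset `a` from the example above (the sensor name is illustrative):
```python
from dagster import AssetSelection, Definitions, build_asset_reconciliation_sensor

# Launches runs for missing or stale partitions, respecting self-dependency
# ordering: partition N + 1 is not attempted until partition N has materialized.
reconciliation_sensor = build_asset_reconciliation_sensor(
    asset_selection=AssetSelection.keys("a"),
    name="a_reconciliation_sensor",
)

defs = Definitions(assets=[a], sensors=[reconciliation_sensor])
```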
g
Really awesome! Please keep me updated on the progress of this feature.
I think this would be the perfect fit for this case
Given a partitions definition like `DailyPartitionsDefinition(start_date='2022-01-01', end_offset=1)`, would the corresponding `TimeWindowPartitionMapping` for each asset still look like this: `partition_mapping=TimeWindowPartitionMapping(start_offset=-1, end_offset=-1)`? In particular, I want to double-check the `end_offset`.
s
yes - that should still work
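Putting the two snippets together, a sketch of the combination under discussion (the asset name `a` is illustrative):
```python
from dagster import AssetIn, DailyPartitionsDefinition, TimeWindowPartitionMapping, asset

# Partitions definition from the question above, including end_offset=1
daily = DailyPartitionsDefinition(start_date="2022-01-01", end_offset=1)


@asset(
    partitions_def=daily,
    ins={
        # Self-dependency on the previous day's partition; per the confirmation
        # above, these mapping offsets are unchanged by the definition's end_offset
        "a": AssetIn(
            partition_mapping=TimeWindowPartitionMapping(start_offset=-1, end_offset=-1)
        )
    },
)
def a(a):
    ...
```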
g
`_TimeWindowPartitionMapping.__new__() got an unexpected keyword argument 'start_offset'`
Something works differently - or you still have the described functionality only on your feature branch
s
it was included in the most recent release - are you possibly on an earlier version?
g
it was 1.1.6 - now with 1.1.7 this works. However, it is still unclear to me how to pass the `TimeWindowPartitionMapping` to `define_asset_job` - I can only pass the partitions definition as an argument.
s
You don’t need to pass it to the job - just set it on the `AssetIn`, like in my example above
The self dependencies are respected by the asset reconciliation sensor when scheduling materializations. They don’t have an effect on scheduled jobs
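A minimal sketch of that split (the job name and selection are illustrative): the mapping lives on the asset's `AssetIn`, while `define_asset_job` only receives the partitions definition.
```python
from dagster import DailyPartitionsDefinition, define_asset_job

daily = DailyPartitionsDefinition(start_date="2022-01-01", end_offset=1)

# The partition mapping never touches the job; it is declared on the AssetIn
# of the asset definition itself (see the examples above).
scd2_job = define_asset_job("scd2_job", selection="a_scd2", partitions_def=daily)
```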
g
My situation is a little bit different: your example has asset a with a self-reference to a. I have a ---> b, where a delivers full copies of the data daily and b is the SCD2 table. b gets a daily update from a and runs the MERGE INTO operation given the fresh data and its own state. What I understand you to be saying here is that I need to explicitly get hold of the existing state as the self-reference ... this seems quite complicated (but perhaps the only sane way)
If you want, I can show and explain my stateful SCD2 IO manager that I am currently using - it implicitly derives the reference to the current state. Perhaps this might be a nice way to deal with SCD2 (once Dagit allows surfacing this self-reference for backfills)
s
My understanding of your original goal was that you wanted to be able to launch a backfill that filled in your SCD2 table in a set of sequential steps, and you had to fill in day N before day N + 1. Is that still right? If so, I don't think you should need to modify your I/O manager. You can specify the self-partition-dep without actually needing to have your I/O manager load that data: you can use a `Nothing` type on the `AssetIn` that corresponds to the self-dep. In our next release (next Thursday), asset backfills will respect the ordering of self-partition-deps
g
Yes this would be ideal
But so far - I am using Delta Lake - there are different code paths, i.e. a plain write for the initial partition and then MERGE INTO (including manual deletions) for the subsequent ones.
Still, I need to combine the state with the fresh partition, and I am confused about how your example already handles this. I can share an extended example with you on a video call if you want
s
I don't think I'll be able to video call this week, but if you have a github gist I could take a look?
g
no worries - we could also do next week; this one is also bad for me. Here you go: https://gist.github.com/geoHeil/12ce1e1403e474b44a84fd267323acb4 - perhaps this alleviates the need for a call
I hope this (`def a_scd2(context, a: pyspark.sql.DataFrame)`) makes my line of thought clearer to you - but perhaps you can now tell me how I am mis-thinking the handling of the self-references.
s
I just took a look at your example. I don't think you should need to change anything except to add a Nothing self-dependency. It would look something like this:
```python
from dagster import asset, DailyPartitionsDefinition, TimeWindowPartitionMapping, AssetIn, Nothing


@asset(partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"))
def a():
    ...


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"),
    ins={
        # Nothing-typed self-dependency: enforces partition ordering without
        # the I/O manager having to load the previous partition's data
        "a_scd2": AssetIn(
            dagster_type=Nothing,
            partition_mapping=TimeWindowPartitionMapping(start_offset=-1, end_offset=-1),
        )
    },
)
def a_scd2(a):
    ...
```
(there's actually a bug that causes the above to error currently, but it will get fixed in the next release)
g
But is this best practice? Or would you recommend refactoring to retrieve the self-reference from Dagster (without the Nothing dependency)?
Is this the error you are talking about? `@op 'a_scd2' decorated function has parameter 'a_scd2' that is one of the input_defs of type 'Nothing' which should not be included since no data will be passed for it.`
@sandy is the issue fixed now with the latest release?
s
yes - that issue with Nothing arguments to AssetIn is now fixed