#announcements

Dave W

12/27/2019, 7:06 PM
I notice there's something involving backfill and partitions in the latest git check-ins, but not in the latest PyPI package or in the docs
For now I'm just creating a resource that takes in YYYYMMDD from config and generates the appropriate file paths from that. But that means each pipeline run has to be for a separate date, so I've got to do separate runs per day... really feel like I'm missing something obvious here

abhi

12/27/2019, 9:46 PM
Hi Dave. I would love to better understand what you are trying to do. My mental model is that you are trying to take a start date and an end date and process the corresponding dated CSVs. Is that right? If so, you can do something simple like the following:
Copy code
from dagster import solid

@solid
def process_dated_csvs(context, start_date, end_date):
    # for each date going from start_date to end_date, do something to the csv
    pass
Resources are better geared towards pipeline-wide facilities: for example, an S3 bucket, a database, or your file system. These are the pieces of your external environment that you want to interact with. In your case, dates are an integral input that drives the computation you want to perform, so they would not be a resource. However, if multiple solids were using a specific directory or a temporary directory to do their work, then that directory could be a resource.
This is also a useful guide for better exploring our resource system: https://dagster.readthedocs.io/en/0.6.6/sections/learn/tutorial/resources.html
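As a minimal sketch of that distinction (hypothetical names, and kwarg spellings from later solid-era releases; the 0.6.x spellings differed slightly), a directory resource might look like:
Copy code
from dagster import ModeDefinition, pipeline, resource, solid

@resource(config_schema={"root_dir": str})
def data_dir(init_context):
    # a pipeline-wide facility: the directory every solid reads from and writes to
    return init_context.resource_config["root_dir"]

@solid(required_resource_keys={"data_dir"})
def load_csv(context):
    context.log.info(f"loading from {context.resources.data_dir}")

@pipeline(mode_defs=[ModeDefinition(resource_defs={"data_dir": data_dir})])
def my_pipeline():
    load_csv()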

Dave W

12/27/2019, 10:31 PM
Hey, thanks for the quick reply!
So the idea is that my solids are parametrized by date
In general, each solid in my pipeline takes in a CSV with a dated path and outputs a CSV with a dated path (I'd love to use intermediates, but I need more control over naming)
So I can’t do a for loop within a single solid
But I don’t want to have to put the date in the solid config of every single solid either
The other hack I thought of would be a solid that takes date as a config and then outputs that date. Then that solid could be an input to every other solid. But that seems ridiculous.
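For concreteness, a minimal sketch of that hack (hypothetical names, later-era kwarg spellings):
Copy code
from dagster import pipeline, solid

@solid(config_schema={"yyyymmdd": str})
def emit_date(context) -> str:
    # a solid whose only job is to read the date from config and fan it out
    return context.solid_config["yyyymmdd"]

@solid
def process_day(context, yyyymmdd: str):
    context.log.info(f"processing {yyyymmdd}")

@pipeline
def dated_pipeline():
    process_day(emit_date())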
Okay, so after thinking about this for a while, I think what I should actually do is have the input path and output path in the inputs for each solid and just construct the environment dict programmatically for each date
Because then stuff like automatic rehydration will still work
It just seems a little annoying because I’ve got to understand the structure of the pipeline to construct the config, so it violates DRY
OK, this is what I ended up doing
Copy code
import dagster
import pandas as pd

def run_multi_day_pipeline(md_root, start_date, end_date):
    # day_solid and get_single_day_solid_config are defined elsewhere
    solid_configs = {}

    @dagster.pipeline
    def multi_day_pipeline():
        for date in pd.bdate_range(start_date, end_date):
            yyyymmdd = date.strftime('%Y%m%d')
            solid_name = f'day_{yyyymmdd}'
            # Create and invoke an alias of day_solid for this date
            day_solid.alias(solid_name)()

            # Store config for this alias
            solid_configs.update(
                get_single_day_solid_config(solid_name, date, md_root)
            )

    config = {'solids': solid_configs}

    dagster.execute_pipeline(
        multi_day_pipeline,
        config,
        instance=dagster.DagsterInstance.get(),  # assuming the default local instance
    )
It seems sort of crazy
And because I have to generate my pipeline dynamically, I don't think I can add it to a repository (I mean, I can get away with it due to lazy loading, but that seems dangerous)
So then I think I can't use dagit to execute it?
@abhi very curious to hear your thoughts

prha

01/02/2020, 4:42 PM
@Dave W We’re working on supporting backfill / partitioning utilities to make it easier to generate config with the partition date name, but it still requires generating config for the solid
I’m curious as to whether @abhi’s suggestion of having a resource that supplies the date would work for you…

Dave W

01/02/2020, 5:20 PM
@prha thanks for the reply! The idea of having a resource that supplies the date was actually mine, and I implemented it that way originally, but it was pretty clunky. The problem with that approach is that you can’t run multiple days as part of the same pipeline, so it’s harder to use dagit and so on
Config gen is also kind of a burden, as it violates DRY and means you can’t use dagit
And if you want to use auto rehydration or materialization, you have to know the paths outside of the pipeline at the time of config gen, which is a pain as well

prha

01/02/2020, 5:25 PM
I hear you… I’m still trying to understand the structure of your pipeline, and how we might best structure the partition primitives to support what you want

Dave W

01/02/2020, 5:25 PM
Oh, and resources are also kind of painful because to be safe you have to add verbose “required resource” headers to a bunch of stuff
Dope
It’s pretty simple right now

prha

01/02/2020, 5:26 PM
in my head, a partitioned pipeline would generate a run over a single slice of the partition (e.g. single day)
and you would generate multiple runs over a date range (one per day)
the backfill tools we’re working on support scheduling a block of pipeline runs over a date range
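In the meantime the driver loop is plain Python: one execute_pipeline call per day, each with its own dated config. A sketch, reusing the hypothetical dated_pipeline from the earlier sketch:
Copy code
import pandas as pd
from dagster import execute_pipeline

def backfill(start_date, end_date):
    # one pipeline run per business day, each with its own dated config
    for date in pd.bdate_range(start_date, end_date):
        yyyymmdd = date.strftime("%Y%m%d")
        config = {"solids": {"emit_date": {"config": {"yyyymmdd": yyyymmdd}}}}
        execute_pipeline(dated_pipeline, config)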

Dave W

01/02/2020, 5:27 PM
That’s cool. I’m using GNU parallel for that right now
The question I have is: how do you account for changing file names?
Because all the config right now needs hard-coded names

prha

01/02/2020, 5:28 PM
Yeah, the way I’ve been thinking about it is to have the date as an input to the solid, and the solid would construct the paths from those inputs
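A minimal sketch of that shape, assuming a hypothetical directory layout:
Copy code
import pandas as pd
from dagster import solid

@solid
def process_dated_csv(context, yyyymmdd: str) -> str:
    # the solid receives only the date and derives its own paths from it
    in_path = f"/data/raw/{yyyymmdd}.csv"      # hypothetical layout
    out_path = f"/data/clean/{yyyymmdd}.csv"
    pd.read_csv(in_path).dropna().to_csv(out_path, index=False)
    return out_path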

Dave W

01/02/2020, 5:29 PM
Yep, that’s kind of what I’m converging on as well
But it breaks the rehydration and materialization features for custom types
Making dagster-pandas, for example, way less useful

prha

01/02/2020, 5:33 PM
I’m not sure I follow how it would break the hydration/materialization features… because it would replace the dataframe input with a date string?

Dave W

01/02/2020, 5:35 PM
This could be my own ignorance of the feature set, but yes
Right now if I want to use that stuff, I think the only way is to put paths and file types in the solid config
If I’m getting a date string instead, I’ve got to load everything myself and materialize it myself
It’s not the end of the world and I can write helper scripts
That’s been my plan actually
But it’s just work

prha

01/02/2020, 6:08 PM
Just spitballing right now, but could you use multiple inputs to get the best of both worlds? Or use composite solids to either A) do config mapping or B) abstract away the date => path => dataframe loading

Dave W

01/02/2020, 6:20 PM
Sounds promising
What’s config mapping?

prha

01/02/2020, 6:23 PM
This is functionality to support this sort of parameterization generically: a composite solid exposes a simpler config schema and maps it onto the config of the solids inside it
we might sunset it if the partitioning work we do is powerful enough, but it’s currently supported
I wonder if composite solids might achieve what you want, even without using any config mapping
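A sketch of the config-mapping option: the composite exposes a single date, and its config_fn expands it into per-solid path config (hypothetical names; the config-mapping kwargs were renamed more than once in early releases):
Copy code
from dagster import composite_solid, solid

@solid(config_schema={"input_path": str, "output_path": str})
def transform(context):
    ...

@composite_solid(
    config_schema={"yyyymmdd": str},
    config_fn=lambda cfg: {
        "transform": {
            "config": {
                "input_path": f"/data/raw/{cfg['yyyymmdd']}.csv",
                "output_path": f"/data/clean/{cfg['yyyymmdd']}.csv",
            }
        }
    },
)
def dated_transform():
    transform()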

Dave W

01/02/2020, 6:31 PM
Config mapping would work for sure
Composite solid with no config mapping I can vaguely see, but I’m wondering if you have something specific in mind

prha

01/02/2020, 6:39 PM
I was thinking more that it could be an abstraction that just takes in the date as input and, under the hood, constructs the paths in a solid and loads the dataframes as inputs to subsolids, so that we can still make use of the intermediate store for rehydration/materialization
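A sketch of that abstraction: only the date crosses the composite boundary, and the dataframes flow between subsolids where the intermediate store can see them (hypothetical names and layout):
Copy code
import pandas as pd
from dagster import composite_solid, solid

@solid
def path_for_date(context, yyyymmdd: str) -> str:
    # the dated path is constructed inside the pipeline, not in config gen
    return f"/data/raw/{yyyymmdd}.csv"

@solid
def load_frame(context, path: str):
    return pd.read_csv(path)

@solid
def clean_frame(context, df):
    # downstream solids exchange real dataframes, so intermediates still apply
    return df.dropna()

@composite_solid
def load_and_clean_day(yyyymmdd: str):
    return clean_frame(load_frame(path_for_date(yyyymmdd)))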

Dave W

01/02/2020, 7:03 PM
1. Can I make such a thing reusable? (Short of creating a closure like I did in my example above, which I think breaks multiprocess running)
2. If you’re referring to the “intermediates” functionality, that doesn’t work for me either
Because I don’t have control over the paths
By the way this level of support is amazing, really appreciating it

prha

01/02/2020, 7:16 PM
just sent you a direct message to follow up for more specifics…