# dagster-feedback
d
**Proposal: environment-aware assets**

When delivering ETL code to production, it often has to be tested on real production data (this happens a lot in ML pipelines) before merging the code into `master`. To solve this, Dagster Cloud's Branch Deployments can be used, or one can set up a custom CI/CD job manually. In both cases, the job spins up a new temporary Dagster deployment; I will refer to this deployment as the Feature Stage (FS).

**The Problem**

Usually we don't want the FS to have write access to Production, as that would be unsafe. Thus, Dagster jobs must write data into a separate FS environment, like a temporary S3 bucket. However, this means that we also can't read the production assets. One can of course write a custom IOManager that would somehow pick the read and write environment based on the asset metadata, but it would be extremely complicated to support. In fact, what I'm trying to do here is to generalize this logic and let Dagster run it instead of custom user code.

So when testing an asset, the user first has to materialize its upstream dependencies in the FS environment. This becomes a huge pain when working with heavy assets like trained machine learning models or a lot of data preprocessing for ML. It may take hours to materialize the upstream assets before the developer can finally test the code they were working on. This has an enormous impact on development speed and productivity. Now imagine adding backfills… Solving this problem would bring a lot of value to the developers affected by it.

**Proposed Solution**

1. Introduce an `env` tag on assets and op outputs. Internally it can be just a special metadata value like `__environment__`, with the default being `__environment__: default`. This tag can be displayed in Dagit. It has to be saved in Dagster's database when the asset is materialized (included in the materialization event).
2. Allow providing a dictionary of `resource_configs` instead of a single config for every resource. Maybe also a dictionary of resource definitions instead of a single one. If the user hasn't provided a dictionary, wrap the single config into `{"default": config}`.
3. When initializing a resource for an asset, pick the resource config (and possibly the resource definition) using the asset's `env` tag. If loading an asset, use the tag recorded in Dagster's database. If writing an asset, use the tag provided by the Dagster deployment.

Let's see some examples:

1. "Hey, let's not overwrite prod files in CI"
```python
with_resources(
    assets,
    resource_defs={"io_manager": my_io_manager},
    resource_config_by_key={
        "io_manager": {"prod": {"base_dir": "prod"}, "feature_stage": {"base_dir": f"fs-{FS_ID}"}}
    },
)
```
2. "Hey, let's not write into prod from CI"

```python
with_resources(
    assets,
    resource_defs={"io_manager": my_io_manager, "aws_credentials": aws_credentials},
    resource_config_by_key={
        "io_manager": {"prod": {"bucket": "prod"}, "feature_stage": {"bucket": "stage", "base_dir": f"fs-{FS_ID}"}},
        "aws_credentials": {
            "prod": {"AWS_ACCESS_KEY_ID": READONLY_PROD_AWS_ACCESS_KEY_ID},
            "feature_stage": {"AWS_ACCESS_KEY_ID": FS_AWS_ACCESS_KEY_ID},
        },
    },
)
```
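The selection rule in steps 2–3 above (wrap a bare config into `{"default": config}`, then pick a config by the `env` tag) can be sketched in plain Python. Note that `resolve_resource_config` is a hypothetical helper name for illustration, not an existing Dagster API:

```python
from typing import Any


def resolve_resource_config(config: Any, env: str = "default") -> Any:
    """Pick the resource config for a given environment tag.

    A bare config (one with no entry for the requested env) is treated as if
    it were wrapped into {"default": config}, mirroring step 2 of the proposal.
    Hypothetical helper, not Dagster API.
    """
    # An env-keyed dict containing the requested tag is resolved by tag.
    if isinstance(config, dict) and env in config:
        return config[env]
    # A single config implicitly serves as the "default" environment.
    if env == "default":
        return config
    raise KeyError(f"No config defined for environment {env!r}")


# A single config works unchanged for the default environment:
assert resolve_resource_config({"base_dir": "data"}) == {"base_dir": "data"}

# An env-keyed dict is resolved by the asset's env tag:
cfgs = {"prod": {"base_dir": "prod"}, "feature_stage": {"base_dir": "fs-123"}}
assert resolve_resource_config(cfgs, "prod") == {"base_dir": "prod"}
```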
**What happens inside the feature stage (FS) Dagster deployment**

1. The production Dagster database is cloned. The FS Dagster thus has access to all the assets materialized in production, as well as run history, asset metadata, etc.
2. When materializing a new asset, Dagster will load upstream assets from the production environment. The `env=prod` tag will tell Dagster to use the IOManager and resources that have read access to production.
3. When writing the asset, the `env=feature_stage` tag will tell Dagster to use the FS IOManager and resources, thus materializing the asset in the FS environment.

As a result, all the assets produced before the FS deployment are going to be loaded from Production. All the assets produced after the deployment will be loaded from and written to the FS environment.

The proposed changes are:
• non-breaking
• very general: users can do a lot with custom environments
• compatible with partitions
• very small codebase edits; we just have to add a few dictionaries here and there
• immediately usable by Branch Deployments in Dagster Cloud

Would love to hear what everybody thinks! It's still not very clear to me where the `env` tag has to be defined - perhaps in the `repository` decorator? Tagging @sandy and @schrockn for future discussions since I've mentioned The Problem to you guys previously.
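The read/write rule above ("read with the tag recorded at materialization time, write with the current deployment's tag") can be mimicked with a toy model. Everything here, including the `pick_env` helper and the in-memory `MATERIALIZED_ENV` record, is a hypothetical illustration, not Dagster code:

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for Dagster's materialization records:
# asset key -> env tag recorded at materialization time.
MATERIALIZED_ENV: Dict[str, str] = {"raw_data": "prod"}  # produced before the FS deployment

DEPLOYMENT_ENV = "feature_stage"  # the env tag provided by the FS deployment


def pick_env(asset_key: str, writing: bool) -> str:
    """Choose the env tag used to initialize resources for an asset access.

    Writes always use the current deployment's tag; reads use the tag
    recorded when the upstream asset was materialized (step 3 of the proposal).
    """
    if writing:
        return DEPLOYMENT_ENV
    recorded: Optional[str] = MATERIALIZED_ENV.get(asset_key)
    return recorded if recorded is not None else DEPLOYMENT_ENV


# An upstream asset materialized in prod is read via prod (read-only) resources:
assert pick_env("raw_data", writing=False) == "prod"
# A new asset is written via feature-stage resources:
assert pick_env("model", writing=True) == "feature_stage"
# Once materialized in the FS, subsequent reads use the FS tag:
MATERIALIZED_ENV["model"] = DEPLOYMENT_ENV
assert pick_env("model", writing=False) == "feature_stage"
```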
n
It reminds me of the "defer" feature in DBT: https://docs.getdbt.com/reference/node-selection/defer
d
Hey Daniel - this is a really interesting proposal that is very in line with some discussions we've been having about ways to improve branch deployments. One thing I want to call out though is that this description isn't actually correct today: "Branch Deployment CI also clones Dagster's Production database, so the FS has access to Production runs history, asset metadata, etc." Each branch deployment has a brand new/empty run history and asset history currently, although it would be very nice if it 'branched' the dagster DB too in the way you're describing.
d
re: cloning the DB: Oh I see, I've never used it myself, but for some reason that's what I thought it did.
re: proposal: do you think it sounds realistic enough to implement? Is this something I could work on and make a PR?
s
Hey Daniel - I have wanted something exactly like this in the past as well. I agree that we need to find some way to support the pattern you're bringing up: i.e. where someone wants to read from production storage, but write to staging or branched storage. In your proposal, where are you imagining that the environment tags would be specified? As an argument to `@asset`?
d
The `repository` / `Definitions` object for a global flag, and `@asset` for more fine-grained control.
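That precedence (a repository-wide default overridden per asset) could be resolved roughly like this; the `effective_env` helper and its arguments are purely hypothetical, not an existing Dagster API:

```python
from typing import Optional

REPOSITORY_DEFAULT_ENV = "default"  # hypothetical global flag on the repository / Definitions


def effective_env(asset_env: Optional[str], repo_env: str = REPOSITORY_DEFAULT_ENV) -> str:
    """An asset-level env tag wins over the repository-wide default."""
    return asset_env if asset_env is not None else repo_env


assert effective_env(None) == "default"            # falls back to the global flag
assert effective_env("feature_stage") == "feature_stage"  # per-asset override
```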