# dagster-feedback
d
**Proposal: environment-aware assets**

When delivering ETL code to production, it often has to be tested on real production data (this happens a lot in ML pipelines) before merging the code into `master`. To solve this, Dagster Cloud's Branch Deployments can be used, or one can set up a custom CI/CD job manually. In both cases, the job spins up a new temporary Dagster deployment; I will refer to this deployment as the Feature Stage (FS).

**The Problem**

Usually we don't want the FS to have write access to Production, as that would be unsafe. Thus, Dagster jobs must write data into a separate FS environment, like a temporary S3 bucket. However, this means that we also can't read the production assets. One can of course write a custom IOManager that would somehow pick the read and write environment based on the asset metadata, but it would be extremely complicated to support. In fact, what I'm trying to do here is to generalize this logic and let Dagster run it instead of custom user code.

So when testing an asset, the user first has to materialize its upstream dependencies in the FS environment. This becomes a huge pain when working with heavy assets like trained machine learning models or a lot of data preprocessing for ML. It may take hours to materialize the upstream assets before the developer can finally test the code they were working on. This has an enormous impact on development speed and productivity. Now imagine adding backfills… Solving this problem would bring a lot of value to the developers affected by it.

**Proposed Solution**

1. Introduce an `env` tag on assets and op outputs. Internally it can be just a special metadata value like `__environment__`, with the default being `__environment__: default`. This tag can be displayed in Dagit. It has to be saved in Dagster's database when the asset is materialized (included in the materialization event).
2. Allow providing a dictionary of `resource_configs` instead of a single config for every resource. Maybe also a dictionary of resource definitions instead of a single one. If the user hasn't provided a dictionary, wrap the single config into `{"default": config}`.
3. When initializing a resource for an asset, pick the resource config (and possibly the resource definition) using the asset's `env` tag. If loading an asset, use the tag recorded in Dagster's database. If writing an asset, use the tag provided by the Dagster deployment.

Let's see some examples:

1. "Hey, let's not overwrite prod files in CI"
```python
with_resources(
    assets,
    resource_defs={"io_manager": my_io_manager},
    resource_config_by_key={
        "io_manager": {"prod": {"base_dir": "prod"}, "feature_stage": {"base_dir": f"fs-{FS_ID}"}}
    },
)
```
2. "Hey, let's not write into prod from CI"

```python
with_resources(
    assets,
    resource_defs={"io_manager": my_io_manager, "aws_credentials": aws_credentials},
    resource_config_by_key={
        "io_manager": {"prod": {"bucket": "prod"}, "feature_stage": {"bucket": "stage", "base_dir": f"fs-{FS_ID}"}},
        "aws_credentials": {
            "prod": {"AWS_ACCESS_KEY_ID": READONLY_PROD_AWS_ACCESS_KEY_ID},
            "feature_stage": {"AWS_ACCESS_KEY_ID": FS_AWS_ACCESS_KEY_ID},
        },
    },
)
```
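The selection rule in steps 2–3 above (wrap a bare config into `{"default": config}`, then pick a config by the `env` tag) can be sketched in plain Python. Note that `resolve_resource_config` is a hypothetical helper name for illustration, not an existing Dagster API:

```python
from typing import Any


def resolve_resource_config(config: Any, env: str = "default") -> Any:
    """Pick the resource config for a given environment tag.

    A bare config (one with no entry for the requested env) is treated as if
    it were wrapped into {"default": config}, mirroring step 2 of the proposal.
    Hypothetical helper, not Dagster API.
    """
    # An env-keyed dict containing the requested tag is resolved by tag.
    if isinstance(config, dict) and env in config:
        return config[env]
    # A single config implicitly serves as the "default" environment.
    if env == "default":
        return config
    raise KeyError(f"No config defined for environment {env!r}")


# A single config works unchanged for the default environment:
assert resolve_resource_config({"base_dir": "data"}) == {"base_dir": "data"}

# An env-keyed dict is resolved by the asset's env tag:
cfgs = {"prod": {"base_dir": "prod"}, "feature_stage": {"base_dir": "fs-123"}}
assert resolve_resource_config(cfgs, "prod") == {"base_dir": "prod"}
```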
**What happens inside the feature stage (FS) Dagster deployment**

1. The production Dagster database is cloned. The FS Dagster thus has access to all the assets materialized in production, as well as run history, asset metadata, etc.
2. When materializing a new asset, Dagster will load upstream assets from the production environment. The `env=prod` tag will tell Dagster to use the IOManager and resources that have read access to production.
3. When writing the asset, the `env=feature_stage` tag will tell Dagster to use the FS IOManager and resources, thus materializing the asset in the FS environment.

As a result, all the assets produced before the FS deployment are going to be loaded from Production. All the assets produced after the deployment will be loaded from and written to the FS environment.

The proposed changes are:
• non-breaking
• very general: users can do a lot with custom environments
• compatible with partitions
• very small codebase edits; we just have to add a few dictionaries here and there
• immediately usable by Branch Deployments in Dagster Cloud

Would love to hear what everybody thinks! It's still not very clear to me where the `env` tag has to be defined - perhaps in the `repository` decorator? Tagging @sandy and @schrockn for future discussions since I've mentioned The Problem to you guys previously.
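The read/write rule above ("read with the tag recorded at materialization time, write with the current deployment's tag") can be mimicked with a toy model. Everything here, including the `pick_env` helper and the in-memory `MATERIALIZED_ENV` record, is a hypothetical illustration, not Dagster code:

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for Dagster's materialization records:
# asset key -> env tag recorded at materialization time.
MATERIALIZED_ENV: Dict[str, str] = {"raw_data": "prod"}  # produced before the FS deployment

DEPLOYMENT_ENV = "feature_stage"  # the env tag provided by the FS deployment


def pick_env(asset_key: str, writing: bool) -> str:
    """Choose the env tag used to initialize resources for an asset access.

    Writes always use the current deployment's tag; reads use the tag
    recorded when the upstream asset was materialized (step 3 of the proposal).
    """
    if writing:
        return DEPLOYMENT_ENV
    recorded: Optional[str] = MATERIALIZED_ENV.get(asset_key)
    return recorded if recorded is not None else DEPLOYMENT_ENV


# An upstream asset materialized in prod is read via prod (read-only) resources:
assert pick_env("raw_data", writing=False) == "prod"
# A new asset is written via feature-stage resources:
assert pick_env("model", writing=True) == "feature_stage"
# Once materialized in the FS, subsequent reads use the FS tag:
MATERIALIZED_ENV["model"] = DEPLOYMENT_ENV
assert pick_env("model", writing=False) == "feature_stage"
```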
n
It reminds me of the "defer" feature in DBT: https://docs.getdbt.com/reference/node-selection/defer
d
Hey Daniel - this is a really interesting proposal that is very in line with some discussions we've been having about ways to improve branch deployments. One thing I want to call out though is that this description isn't actually correct today: "Branch Deployment CI also clones Dagster's Production database, so the FS has access to Production runs history, asset metadata, etc." Each branch deployment has a brand new/empty run history and asset history currently, although it would be very nice if it 'branched' the dagster DB too in the way you're describing.
d
re: cloning the DB: Oh I see, I've never used it myself, but for some reason that's what I thought it did.
re: proposal: do you think it sounds realistic enough to implement? Is this something I could work on and make a PR?
s
Hey Daniel - I have wanted something exactly like this in the past as well. I agree that we need to find some way to support the pattern you're bringing up: i.e. where someone wants to read from production storage, but write to staging or branched storage. In your proposal, where are you imagining that the environment tags would be specified? As an argument to `@asset`?
d
The `repository` / `Definitions` object for a global flag, and `@asset` for more fine-grained control.
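That precedence (a repository-wide default overridden per asset) could be resolved roughly like this; the `effective_env` helper and its arguments are purely hypothetical, not an existing Dagster API:

```python
from typing import Optional

REPOSITORY_DEFAULT_ENV = "default"  # hypothetical global flag on the repository / Definitions


def effective_env(asset_env: Optional[str], repo_env: str = REPOSITORY_DEFAULT_ENV) -> str:
    """An asset-level env tag wins over the repository-wide default."""
    return asset_env if asset_env is not None else repo_env


assert effective_env(None) == "default"            # falls back to the global flag
assert effective_env("feature_stage") == "feature_stage"  # per-asset override
```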