Daniel Gafni
12/11/2022, 10:19 AM
master.
To solve this, Dagster Cloud's Branch Deployments can be used, or one can set up a custom CI/CD job manually. In both cases, the job spins up a new temporary Dagster deployment. I will refer to this deployment as Feature Stage (FS).
The Problem
Usually we don't want to have write access to Production from the FS as it would be unsafe. Thus, Dagster jobs must write data into a separate FS environment, like a temporary S3 bucket.
However, this means that we also can't read the production assets. One can of course write a custom IOManager that would somehow pick the read and write environment based on the asset metadata, but it would be extremely complicated to support it. In fact, what I'm trying to do here is to generalize this logic and let Dagster run it instead of the custom user code.
So when testing an asset, the user has to materialize its upstream dependencies in the FS environment first. This becomes a huge pain when working with heavy assets like trained machine learning models or a lot of data preprocessing for ML.
It may take hours to materialize the upstream assets before the developer can finally test the code they were working on. This has an enormous impact on development speed and productivity.
Now imagine adding backfills…
Solving this problem would bring a lot of value to the developers affected by it.
Proposed Solution
1. Introduce an env tag to assets and op outputs. Internally it can be just a special metadata value like __environment__. The default would be: __environment__: default. This tag can be displayed in Dagit. It has to be saved in Dagster's database when the asset is materialized (included in the materialization event).
2. Allow providing a dictionary of resource_configs instead of a single config for every resource. Maybe also a dictionary of resource definitions instead of a single one. If the user hasn't provided a dictionary, wrap the single config into {"default": config}.
3. When initializing a resource for the asset, pick the resource config (and possibly the resource definition) using the asset's env tag. If loading an asset, use the tag recorded in Dagster's database. If writing an asset, use the tag provided by the Dagster deployment.
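Steps 2 and 3 can be sketched in plain Python, independent of Dagster. The helper names normalize_resource_config and select_resource_config, and the idea of passing a set of known environment names, are hypothetical and exist only for this illustration; none of this is an existing Dagster API. The sketch also has to assume some rule for telling a per-environment dictionary apart from a single config (which is itself a dictionary); here a config counts as per-environment only if every top-level key is a known environment name.

```python
from typing import Any, Dict, Set

DEFAULT_ENV = "default"  # mirrors the proposed default tag value


def normalize_resource_config(
    config: Dict[str, Any], known_envs: Set[str]
) -> Dict[str, Dict[str, Any]]:
    """Return a mapping env -> config (hypothetical helper).

    Assumption: a config is treated as per-environment only if it is
    non-empty and every top-level key is a known environment name.
    Otherwise it is wrapped as {"default": config}.
    """
    if config and all(key in known_envs for key in config):
        return dict(config)
    return {DEFAULT_ENV: config}


def select_resource_config(
    config: Dict[str, Any], known_envs: Set[str], env: str
) -> Dict[str, Any]:
    """Pick the config for the given env tag (hypothetical helper)."""
    by_env = normalize_resource_config(config, known_envs)
    if env not in by_env:
        raise KeyError(f"no resource config provided for environment {env!r}")
    return by_env[env]
```

A real implementation would probably need a less ambiguous way to mark per-environment configs, for example a separate argument, since a resource could legitimately have a config field named like an environment.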
Let's see some examples:
1. "Hey let's not overwrite prod files in CI"
with_resources(
    assets,
    resource_defs={"io_manager": my_io_manager},
    resource_config_by_key={
        "io_manager": {
            "prod": {"base_dir": "prod"},
            "feature_stage": {"base_dir": f"fs-{FS_ID}"},
        }
    },
)
2. "Hey let's not write into prod from CI"
with_resources(
    assets,
    resource_defs={"io_manager": my_io_manager, "aws_credentials": aws_credentials},
    resource_config_by_key={
        "io_manager": {
            "prod": {"bucket": "prod"},
            "feature_stage": {"bucket": "stage", "base_dir": f"fs-{FS_ID}"},
        },
        "aws_credentials": {
            "prod": {"AWS_ACCESS_KEY_ID": READONLY_PROD_AWS_ACCESS_KEY_ID},
            "feature_stage": {"AWS_ACCESS_KEY_ID": FS_AWS_ACCESS_KEY_ID},
        },
    },
)
What happens inside the feature stage (FS) Dagster deployment:
1. The production Dagster database is cloned. The FS Dagster thus has access to all the assets materialized in production, as well as run history, asset metadata, etc.
2. When materializing a new asset, Dagster will load upstream assets from the production environment. The env=prod tag will tell Dagster to use the IOManager and resources that have read access to production.
3. When writing the asset, the env=feature_stage tag will tell Dagster to use the FS IOManager and resources, thus materializing the asset in the FS environment.
As a result, all the assets produced before the FS deployment are going to be loaded from Production. All the assets produced after the deployment will be loaded from and written to the FS environment.
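The read/write routing described above can be simulated with a small in-memory model. Everything here (MaterializationRecord, EnvRouter, and their methods) is a hypothetical stand-in for the cloned Dagster database and the proposed env tag, not real Dagster code; it only illustrates which environment would be chosen for loads and writes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MaterializationRecord:
    """Hypothetical stand-in for a materialization event carrying the env tag."""
    asset_key: str
    env: str


class EnvRouter:
    """Chooses the environment for loading and writing assets (illustrative only)."""

    def __init__(self, deployment_env: str, cloned_records: Dict[str, MaterializationRecord]):
        self.deployment_env = deployment_env
        # Latest materialization per asset, copied from the cloned production database.
        self.latest = dict(cloned_records)

    def env_for_load(self, asset_key: str) -> str:
        # Loading uses the tag recorded with the latest materialization.
        record = self.latest.get(asset_key)
        if record is None:
            raise LookupError(f"asset {asset_key!r} has never been materialized")
        return record.env

    def env_for_write(self, asset_key: str) -> str:
        # Writing always uses the tag provided by the current deployment.
        return self.deployment_env

    def record_materialization(self, asset_key: str) -> str:
        # Store the new materialization so later loads resolve to the write env.
        env = self.env_for_write(asset_key)
        self.latest[asset_key] = MaterializationRecord(asset_key, env)
        return env
```

With a router created for the feature_stage deployment and a cloned record saying raw_data was materialized in prod, env_for_load("raw_data") resolves to prod, while any asset materialized afterwards is both written to and subsequently loaded from feature_stage.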
The proposed changes are:
• non-breaking
• very general, users can do a lot of stuff with custom environments
• would work with partitions
• very small codebase edits, we just have to add a few dictionaries here and there
• can be immediately used by Branch Deployments in Dagster Cloud
Would love to hear what everybody thinks! It's still not very clear to me where the env tag should be defined; perhaps in the repository decorator?
Tagging @sandy and @schrockn for future discussions since I've mentioned The Problem to you guys previously.
Nicolas Parot Alvarez
12/12/2022, 2:01 PM
daniel
12/12/2022, 3:18 PM
Daniel Gafni
12/12/2022, 3:19 PM
sandy
12/14/2022, 12:22 AM
@asset?
Daniel Gafni
12/14/2022, 8:09 AM