# ask-ai
**User:** Are two runs totally independent? Is there a way to specify that, ok, in this run the assets depend on assets produced in another specific run?
**Scout:** In Dagster, runs are generally independent in terms of execution, and each run computes assets based on the dependencies defined in the asset graph. However, the assets themselves can be designed to depend on data produced by previous runs, such as when using partitioned assets, where an asset in one partition may depend on data from a previous partition.

The documentation does not describe a mechanism for making one run depend directly on the assets produced by another specific run. Instead, dependencies are defined in terms of the asset graph and partitioning scheme: the computation of an asset can depend on the materialization of its upstream assets, which may have been produced in previous runs. For example, if you have a partitioned asset that depends on the previous partition of itself, you would use a partition mapping to define this relationship (see the sketch below). This allows the asset to access the data from the previous partition, which would have been produced in a previous run.

If you need to ensure that a run only starts after another specific run has completed, you might need to implement custom logic outside of Dagster's standard execution model. This could involve checking the status of a previous run and conditionally launching a new run based on that status. For more complex scenarios where you need to orchestrate runs based on conditions beyond the standard asset graph, you might consider using sensors or custom orchestration logic to trigger runs when certain events or conditions are met.

In summary, while runs are independent, the assets within those runs can depend on the outputs of previous runs through partitioning and asset dependencies. Direct dependencies between runs are not a standard feature described in the documentation, and achieving this would likely require custom implementation.
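To make the partition-mapping idea concrete, here is a minimal sketch using `TimeWindowPartitionMapping` so that each partition of a downstream asset reads the previous day's partition of an upstream asset, i.e. data materialized by an earlier run. It is illustrated with an upstream asset rather than a self-dependency to keep it short, and the asset names and start date are placeholders:

```python
from dagster import (
    AssetIn,
    DailyPartitionsDefinition,
    TimeWindowPartitionMapping,
    asset,
)

daily = DailyPartitionsDefinition(start_date="2024-01-01")


@asset(partitions_def=daily)
def daily_events() -> list:
    # Stand-in for whatever produces each day's data.
    return ["event"]


@asset(
    partitions_def=daily,
    ins={
        # Offset the upstream time window by -1, so partition N of this
        # asset reads partition N-1 of daily_events -- data that was
        # materialized by a previous run.
        "daily_events": AssetIn(
            partition_mapping=TimeWindowPartitionMapping(start_offset=-1, end_offset=-1)
        )
    },
)
def shifted_summary(daily_events: list) -> int:
    return len(daily_events)
```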
**User:** @Scout My particular use case is being able to re-run a subset of assets if some of them fail, without having to re-run all the dependencies.
**Scout:** In Dagster, you have several options for re-executing assets when some of them fail, without needing to re-run all dependencies. Here are a few approaches based on the documentation:

1. Retry policy: You can define a `RetryPolicy` on your assets to automatically retry them a certain number of times with a specified delay if they fail (see the sketch below). This is useful for transient failures that might succeed on a subsequent attempt.
2. Re-execution: Dagster provides several re-execution options that allow you to re-run parts of your job:
   - All Ops: Re-execute the entire job from scratch.
   - Selected Op(s): Re-execute only the selected ops, regardless of their status.
   - From Selected: Re-execute the job downstream from the selected ops.
   - From Failure: Re-execute the job, skipping ops that completed successfully. This option is enabled when the run has failed and you want to re-run the failed ops and everything downstream of them.
3. Asset observability: Dagster allows you to attach code versions to your assets, and the UI can indicate when an asset has a different code version than the one used for its most recent materialization. This can help you identify which assets may need to be re-run due to changes.
4. Subset execution: Within the UI, you can select a subset of assets to re-run by specifying the desired asset names. This allows you to target only the assets that need to be re-executed.
5. Custom logic: If the built-in re-execution strategies do not fit your needs, you can implement custom logic, for example using sensors to monitor asset materializations and trigger re-execution based on custom criteria.
6. Partitioned assets: If your assets are partitioned, you can use backfills to re-materialize assets across specific partitions. This is useful if you want to re-run assets for certain time windows or datasets.
7. Run-level retries: To ensure that your ops and assets execute even after a crash, you can set up run-level retries, which retry the run if it fails for any reason.

The exact approach depends on your specific use case and the nature of the failures you're encountering. For instance, if you have a set of independent assets and want to stop the entire run if any asset fails, you might need custom logic, since Dagster's standard behavior is to continue executing other assets in parallel even if one fails. For more detailed guidance on these strategies, refer to the Dagster documentation on re-execution, retry policies, and asset materialization.
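As a sketch of options 1 and 2, the snippet below follows the re-execution pattern from the Dagster docs: a `RetryPolicy` for automatic retries, and `ReexecutionOptions.from_failure` to re-run only the failed steps and their downstream dependencies. It uses a small op-based job for brevity (the same "From Failure" choice is available for asset jobs in the UI), and names like `unreliable_job` are illustrative; it assumes the job lives in an importable module so `reconstructable` can find it:

```python
from dagster import (
    DagsterInstance,
    ReexecutionOptions,
    RetryPolicy,
    execute_job,
    job,
    op,
    reconstructable,
)


@op(retry_policy=RetryPolicy(max_retries=3, delay=30))
def step_one():
    return 1


@op
def step_two(value):
    # Pretend this step sometimes fails.
    return value + 1


@job
def unreliable_job():
    step_two(step_one())


def rerun_from_failure():
    instance = DagsterInstance.ephemeral()
    result = execute_job(reconstructable(unreliable_job), instance=instance)
    if not result.success:
        # "From Failure": skips steps that already succeeded and re-runs
        # only the failed steps and everything downstream of them.
        options = ReexecutionOptions.from_failure(run_id=result.run_id, instance=instance)
        execute_job(
            reconstructable(unreliable_job),
            instance=instance,
            reexecution_options=options,
        )
```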
**User:** @Scout ok, let's say we use No. 2 above, i.e. re-execute the failed assets. But since the re-execution will run in a separate container, how does it get the information from previous runs?
**Scout:** When you re-execute a failed asset in Dagster, especially in a distributed environment where different parts of your pipeline run in separate containers, Dagster's execution model ensures that the necessary information from previous runs is available for the re-execution. This is achieved through Dagster's storage and IO managers, which handle the inputs and outputs of assets across runs. Here's how it works:

1. IO managers: Dagster uses IO managers to handle the storage and retrieval of data. When an asset is executed, its outputs are handled by an IO manager, which can write them to a persistent storage system such as a database, an object store (e.g., S3), or a filesystem.
2. Persistent storage: By using persistent storage that is accessible across containers, the outputs of assets from previous runs are stored so that subsequent runs can retrieve them. For example, if you use an S3 bucket as your storage backend, the output of an asset from a previous run is written to a specific location in the bucket.
3. Asset materialization: When an asset is materialized, Dagster records metadata about the materialization, including the storage location of the asset's outputs. This metadata is stored in Dagster's event log, which is accessible to subsequent runs.
4. Re-execution context: During re-execution, Dagster uses the metadata from the previous run to locate the outputs of the assets that were successfully executed. The IO manager then retrieves the necessary data from the storage backend and makes it available to the re-executed asset as inputs.
5. Configurable IO managers: You can configure different IO managers for different assets or environments, so they work with the storage solutions most appropriate for your deployment and keep data accessible across container boundaries.
6. Dagster instance: The Dagster instance orchestrates execution and stores the history of runs. When you trigger a re-execution, the instance provides the context needed to access previous run data, ensuring that the re-executed asset can access the outputs of its dependencies from prior runs.

To ensure that a re-executed asset can access information from previous runs, configure your IO managers to read from and write to a storage system that is reachable from all the containers where your assets might run (see the configuration sketch below). That way, even if the re-execution occurs in a separate container, it will still have access to the necessary data from previous runs.
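For example, here is a minimal sketch of the shared-storage setup using `S3PickleIOManager` from the `dagster-aws` package. The bucket name is a placeholder, and it assumes AWS credentials are available in every container:

```python
from dagster import Definitions, asset
from dagster_aws.s3 import S3PickleIOManager, S3Resource


@asset
def upstream() -> list:
    return [1, 2, 3]


@asset
def downstream(upstream: list) -> int:
    return sum(upstream)


defs = Definitions(
    assets=[upstream, downstream],
    resources={
        # Every output is pickled to S3, so a re-executed step in a fresh
        # container can load upstream's output from the earlier run instead
        # of recomputing it.
        "io_manager": S3PickleIOManager(
            s3_resource=S3Resource(),
            s3_bucket="my-dagster-artifacts",  # placeholder bucket name
        )
    },
)
```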
**User:** @Scout is there a way to make the config of an asset optional?
**Scout:** Yes, in Dagster you can make the config of an asset optional. With Dagster's Pythonic `Config` class, fields typed as `Optional` are not required to be specified in the run config, and they have an implicit default value of `None`. To make a config field optional, define it using the `Optional` type hint from the `typing` module. Here's an example of how you can define an asset with an optional config:
```python
from typing import Optional

from dagster import Config, asset


class MyAssetConfig(Config):
    # Not required in the run config; defaults to None when omitted.
    optional_param: Optional[str] = None


@asset
def my_asset(config: MyAssetConfig) -> str:
    # Your asset logic here
    return f"Result with optional param: {config.optional_param}"
```
In this example, `optional_param` is an optional config parameter. If it is not provided in the run config, it defaults to `None`. When you materialize this asset, you can either provide a value for `optional_param` or omit it entirely; if you omit it, the asset still executes using the default value of `None`. If you want a default other than `None`, set a default value on the field:
```python
from dagster import Config, asset


class MyAssetWithDefaultConfig(Config):
    # Falls back to "default_value" when not provided in the run config.
    optional_param: str = "default_value"


@asset
def my_asset_with_default(config: MyAssetWithDefaultConfig) -> str:
    # Your asset logic here
    return f"Result with optional param: {config.optional_param}"
```
In this case, if `optional_param` is not provided in the run config, it defaults to the string `"default_value"`. Using these patterns, you can define assets with optional configs, which gives you flexibility in how they are executed and materialized.
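As a quick sanity check, here is a sketch of materializing the first asset both without and with the parameter, using `materialize` and the Pythonic `RunConfig` (this assumes the definitions above are in scope):

```python
from dagster import RunConfig, materialize

# Omit the config entirely: optional_param falls back to None.
materialize([my_asset])

# Or supply it explicitly via RunConfig, keyed by the asset's op name.
materialize(
    [my_asset],
    run_config=RunConfig(ops={"my_asset": MyAssetConfig(optional_param="hello")}),
)
```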