# ask-ai
t
Can I define assets that update each other in a circular way?
s
No, you cannot define assets that update each other in a circular way. The documentation and issues from the Dagster repository indicate that circular dependencies within the asset graph are not supported: defining them currently results in a `CircularDependencyError`. There is an open issue discussing the possibility of allowing cycles in the asset graph as long as asset partitions only depend on earlier partitions, but Dagster does not yet support circular dependencies between assets, even if they are partitioned in a way that could theoretically allow for it. For reference, the issue discussing this feature request is titled "Allow cycles in the asset graph as long as asset partitions only depend on earlier partitions" and can be found in the Dagster GitHub repository.
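As a minimal sketch of what this restriction means in practice (assuming the string-based `deps` argument of `@asset` available in recent Dagster versions; the asset names are illustrative), a direct cycle like the following is rejected:

```python
from dagster import Definitions, asset


# asset_a declares asset_b as an upstream dependency...
@asset(deps=["asset_b"])
def asset_a():
    ...


# ...and asset_b declares asset_a, forming a cycle.
@asset(deps=["asset_a"])
def asset_b():
    ...


# Resolving this asset graph (for example, loading these definitions
# or launching a materialization) fails with a circular-dependency error.
defs = Definitions(assets=[asset_a, asset_b])
```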
t
@Scout Can I use metadata to record when a partition was last materialized and only update if new data found
s
Yes, you can use metadata in Dagster to record when a partition was last materialized and conditionally update the partition only if new data is found. This is a common pattern for incrementally updating assets based on changes in the source data. To implement this, you would typically:

1. Store metadata each time you materialize your asset, such as the timestamp of the last record you observed or some sort of cursor into an update stream.
2. When your asset is next due to be materialized, retrieve the stored metadata to determine the last time the asset was updated.
3. Use this information to check the source data for any new or changed records since the last materialization.
4. If new or changed records are found, update the asset accordingly; otherwise, skip the update.

Here's a simplified example based on the documentation:
```python
import time

from dagster import AssetKey, MetadataValue, Output, asset


@asset
def my_asset(context):
    # Retrieve this asset's most recent materialization event, if any
    event = context.instance.get_latest_materialization_event(AssetKey("my_asset"))

    last_materialization_timestamp = None
    if event is not None:
        entry = event.asset_materialization.metadata.get("last_materialization_timestamp")
        if entry is not None:
            last_materialization_timestamp = entry.value

    # Logic to check for new data since last_materialization_timestamp
    new_data = check_for_new_data_since(last_materialization_timestamp)

    if new_data:
        # Logic to update the asset with new data
        update_asset_with_new_data(new_data)

        # Record the current timestamp in this materialization's metadata
        return Output(
            new_data,
            metadata={
                "last_materialization_timestamp": MetadataValue.int(int(time.time()))
            },
        )
    else:
        # No new data found, so skip the update
        context.log.info("No new data found. Skipping update.")
```
In this example, `check_for_new_data_since` is a placeholder for your logic to check the source data, and `update_asset_with_new_data` is a placeholder for your logic to update the asset with the new data. The stored timestamp is read back via `DagsterInstance.get_latest_materialization_event`, which returns the event for the asset's most recent materialization. Please note that the actual implementation will depend on the specifics of your data sources, how you're tracking changes, and how you're storing and retrieving metadata. The example above is a high-level illustration and not a complete solution.
t
@Scout how many partitions can I create for an asset?
s
The number of partitions you can create for an asset in Dagster is not explicitly limited by the system itself, but there are practical considerations to keep in mind. The documentation does not provide a specific maximum number of partitions, but it does suggest that you should consider performance and manageability when creating a large number of partitions.

When you have a large number of partitions, it can impact the performance of the Dagster instance, especially in terms of how quickly the Dagit UI loads and operates, as well as the scheduling and execution of runs. Each partition represents a slice of the data that the asset operates on, and having too many partitions can lead to a cluttered UI and potentially slow down metadata queries.

In practice, you should aim to create a number of partitions that makes sense for your use case and allows for efficient querying and management. If you find that you are considering creating thousands or tens of thousands of partitions, you may want to reconsider your partitioning strategy or consult the Dagster community for best practices on handling such cases.

For very large numbers of partitions, you might explore dynamic partitioning, which allows you to define partitions based on the data itself, potentially at runtime (a minimal sketch follows). This can help manage scenarios where the number of partitions is very large or not known ahead of time. Remember that the right number of partitions will depend on your specific use case, data volume, and how you intend to use the partitions within your Dagster pipelines.
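Here is a minimal sketch of dynamic partitioning; the partitions-definition name `regions`, the asset name, and the example partition keys are illustrative assumptions, not from this conversation:

```python
from dagster import DynamicPartitionsDefinition, asset

# A partitions definition whose keys are registered at runtime rather
# than enumerated up front.
regions_partitions = DynamicPartitionsDefinition(name="regions")


@asset(partitions_def=regions_partitions)
def region_report(context):
    # context.partition_key is whichever dynamic partition is being materialized
    context.log.info(f"Building report for region: {context.partition_key}")


# Partition keys can be added as they are discovered, e.g. from a sensor:
# instance.add_dynamic_partitions("regions", ["us-east", "eu-west"])
```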
t
@Scout can I have an asset which updates itself?
s
In Dagster, an asset updating itself, meaning an asset that depends on its own previous materialization, is not a typical pattern because it can lead to circular dependencies, which are not supported. However, there are scenarios where you might want an asset to be influenced by its previous state, such as incrementally updating a machine learning model or appending records to a dataset. To handle such cases, you can use the following approaches:

1. Self-dependency with partitions: If your asset is partitioned, you can define dependencies between partitions such that a new partition of an asset depends on the previous partition. This is often done with time-based partitions, where, for example, today's partition might depend on yesterday's partition (a sketch of this pattern follows the example below).
2. Loading the previous materialization: You can write an asset that, as part of its computation, loads the value from its previous materialization. This allows the asset to effectively "update itself" by incorporating its previous state into the new computation.

Here's an example based on the documentation that illustrates how you might load the current value of an asset when generating a new version:
```python
from dagster import AssetKey, MetadataValue, Output, asset


@asset
def my_asset(context):
    # Attempt to load the event for this asset's most recent materialization
    event = context.instance.get_latest_materialization_event(AssetKey("my_asset"))

    if event is not None:
        context.log.info(f"Found previous materialization: {event}")
        # ... load the previously stored value from your storage layer ...
        current_value = load_previous_value(event)  # placeholder
    else:
        # First execution, or no previous materialization found
        context.log.info("No current value found, starting from scratch")
        current_value = None

    # ... perform the computation to update the asset using current_value ...
    new_value = compute_new_value(current_value)  # placeholder

    # Record the new state in this materialization's metadata and return it
    return Output(new_value, metadata={"New Value": MetadataValue.text(str(new_value))})
```
In this example, `get_latest_materialization_event` is used to retrieve the most recent materialization event of the asset, while `load_previous_value` and `compute_new_value` are placeholders for your own logic. If it's the first time the asset is being materialized, or if the previous materialization cannot be found, you handle this case accordingly. Please note that the actual implementation details will depend on your specific use case and how you manage state and dependencies in your Dagster assets.
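For the partition-based self-dependency described in the first approach, Dagster supports declaring that each partition of an asset depends on the prior partition of the same asset via `TimeWindowPartitionMapping`. Below is a minimal sketch; the asset name `running_total`, the start date, and the update logic are illustrative assumptions:

```python
from dagster import (
    AssetIn,
    DailyPartitionsDefinition,
    TimeWindowPartitionMapping,
    asset,
)


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    ins={
        # An input named after the asset itself, mapped to the previous
        # day's partition, declares a self-dependency between partitions.
        "running_total": AssetIn(
            partition_mapping=TimeWindowPartitionMapping(start_offset=-1, end_offset=-1)
        )
    },
)
def running_total(context, running_total):
    # The I/O manager must handle the first partition, where no prior
    # partition exists; here we fall back to 0 in that case.
    new_records = 1  # placeholder: records ingested for this partition
    return (running_total or 0) + new_records
```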
t
@Scout give an example of a graph-multi-asset definition
s
Certainly! A `graph_multi_asset` is a way to define multiple assets that are computed using the same graph of ops and the same upstream assets. Here's an example based on the documentation that demonstrates how to define a `graph_multi_asset`:
```python
from dagster import AssetOut, Out, Output, graph_multi_asset, op


# An op with two outputs; each output backs one of the assets below
@op(out={"one": Out(), "two": Out()})
def two_outputs(upstream_asset):
    yield Output(upstream_asset, output_name="one")
    yield Output(upstream_asset, output_name="two")


@graph_multi_asset(
    outs={
        "first_asset": AssetOut(),
        "second_asset": AssetOut(),
    }
)
def two_assets(upstream_asset):
    one, two = two_outputs(upstream_asset)
    return {"first_asset": one, "second_asset": two}
```
In this example, `two_assets` is a `graph_multi_asset` that takes an `upstream_asset` as input and produces two assets, `first_asset` and `second_asset`. The op `two_outputs` takes the `upstream_asset` and returns two values, which are then mapped to the respective assets. Please note that this is a high-level example, and the actual implementation of `two_outputs` and how `upstream_asset` is defined and used would depend on your specific use case. The `graph_multi_asset` decorator allows you to create a combined definition of multiple assets that are computed using the same graph logic.
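Building on the example above, `upstream_asset` could itself be a regular asset, with everything wired together in a `Definitions` object. This is a hedged sketch; the constant return value is an arbitrary assumption:

```python
from dagster import Definitions, asset


@asset
def upstream_asset():
    return 1


# Graph-backed multi-assets can be included in Definitions alongside
# regular assets.
defs = Definitions(assets=[upstream_asset, two_assets])
```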
t
@Scout give an example of a graph-multi-asset which updates the assets by using their prior value
s
The documentation does not provide a direct example of a `graph_multi_asset` that updates assets by using their prior values. However, I can describe a conceptual approach based on the capabilities of Dagster's asset system. To create a `graph_multi_asset` that updates assets using their prior values, you would need to:

1. Define a way to retrieve the previous materialization of each asset.
2. Use the retrieved values as inputs to your computation.
3. Output the new values as the updated assets.

Here's a conceptual example based on the capabilities described in the documentation:
```python
from dagster import AssetKey, AssetOut, OpExecutionContext, graph_multi_asset, op


# Ops perform the retrieval and update work at execution time; a graph
# body only wires ops together and cannot access context.instance itself.
@op
def update_asset_one(context: OpExecutionContext):
    # Retrieve the previous materialization event (None on the first run)
    prev = context.instance.get_latest_materialization_event(AssetKey("updated_asset_one"))
    return update_logic_for_asset_one(prev)


@op
def update_asset_two(context: OpExecutionContext):
    prev = context.instance.get_latest_materialization_event(AssetKey("updated_asset_two"))
    return update_logic_for_asset_two(prev)


# Define a graph that updates both assets
@graph_multi_asset(
    outs={
        "updated_asset_one": AssetOut(),
        "updated_asset_two": AssetOut(),
    }
)
def update_assets():
    # Return the new values as a dictionary mapping to the asset keys
    return {
        "updated_asset_one": update_asset_one(),
        "updated_asset_two": update_asset_two(),
    }


# Utility functions to perform the update logic
def update_logic_for_asset_one(prev_event):
    # Implement your logic here, handling prev_event being None
    ...


def update_logic_for_asset_two(prev_event):
    # Implement your logic here, handling prev_event being None
    ...
```
In this conceptual example, `update_assets` is a `graph_multi_asset` that updates two assets, `updated_asset_one` and `updated_asset_two`, by using their previous materializations. The `update_logic_for_asset_one` and `update_logic_for_asset_two` functions contain the logic for updating the assets based on their previous values. Please note that this is a high-level conceptual example and not a complete solution. The actual implementation would depend on how you store and retrieve asset materializations and the specific logic for updating the assets. Additionally, the `get_latest_materialization_event` lookup and how you handle the absence of a previous materialization (in case it's the first run) would need to be implemented according to your specific requirements and the capabilities of the Dagster instance you are using.