# ask-ai
p
Hi @Scout, can I use multiple `Definitions` in one `__init__.py`? Let's say I want to schedule each asset differently; is this the way to go?
s
Yes, you can use multiple `Definitions` in one `__init__.py` file. The `Definitions` object in Dagster is a utility that allows you to combine various definitions such as assets, schedules, jobs, and sensors, and have them be aware of each other. This is particularly useful as your project grows and you have a variety of definitions to manage. If you want to schedule each asset differently, you can define multiple schedules and include them in your `Definitions` object. Here's an example from the documentation that shows how to add a schedule to the `Definitions` object:

```python
from dagster import (
    AssetSelection,
    Definitions,
    ScheduleDefinition,
    define_asset_job,
    load_assets_from_modules,
)

from . import assets

all_assets = load_assets_from_modules([assets])

# Define a job that will materialize the assets
hackernews_job = define_asset_job("hackernews_job", selection=AssetSelection.all())

# Addition: a ScheduleDefinition targeting the job, with a cron schedule of how frequently to run it
hackernews_schedule = ScheduleDefinition(
    job=hackernews_job,
    cron_schedule="0 * * * *",  # every hour
)

defs = Definitions(
    assets=all_assets,
    schedules=[hackernews_schedule],
)
```
In this example, a `ScheduleDefinition` is created for the `hackernews_job` and added to the `Definitions` object. If you have multiple assets that you want to schedule differently, you can create a `ScheduleDefinition` for each job associated with those assets and include them all in your `Definitions` object. Remember that each asset can be associated with its own job, and each job can have its own schedule. This way, you can have fine-grained control over when and how each asset is materialized.
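As a minimal sketch of that pattern (the asset names and cron expressions here are hypothetical), each asset gets its own job and its own schedule:

```python
from dagster import (
    AssetSelection,
    Definitions,
    ScheduleDefinition,
    asset,
    define_asset_job,
)

@asset
def daily_asset():
    ...

@asset
def hourly_asset():
    ...

# One job per asset, each selecting just that asset by key
daily_job = define_asset_job("daily_job", selection=AssetSelection.keys("daily_asset"))
hourly_job = define_asset_job("hourly_job", selection=AssetSelection.keys("hourly_asset"))

# One schedule per job, each with its own cron expression
daily_schedule = ScheduleDefinition(job=daily_job, cron_schedule="0 0 * * *")    # midnight daily
hourly_schedule = ScheduleDefinition(job=hourly_job, cron_schedule="0 * * * *")  # every hour

defs = Definitions(
    assets=[daily_asset, hourly_asset],
    schedules=[daily_schedule, hourly_schedule],
)
```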
p
@Scout but this didn't work for me:

```python
defs = Definitions(
    assets=accuranker_assets,
    resources=resources_by_env[os.getenv("ENV", "local")]
)
defs = Definitions(
    assets=gsc_assets,
    resources=resources_by_env[os.getenv("ENV", "local")]
)
defs = Definitions(
    assets=teamtailor_assets,
    resources=resources_by_env[os.getenv("ENV", "local")]
)
```
s
It looks like you're trying to create multiple `Definitions` objects with different sets of assets, but you're using the same variable name `defs` for each one. This will result in the previous `Definitions` object being overwritten each time you assign a new one to `defs`. In Python, variable names are references to objects, and if you assign a new object to the same name, the old reference is lost. If you want to maintain separate `Definitions` for each set of assets, you should use different variable names for each one. Here's an example of how you could adjust your code:
```python
import os

from dagster import Definitions

# Assuming accuranker_assets, gsc_assets, and teamtailor_assets are defined elsewhere
# and resources_by_env is a dictionary with environment-specific resources

accuranker_defs = Definitions(
    assets=accuranker_assets,
    resources=resources_by_env[os.getenv("ENV", "local")]
)

gsc_defs = Definitions(
    assets=gsc_assets,
    resources=resources_by_env[os.getenv("ENV", "local")]
)

teamtailor_defs = Definitions(
    assets=teamtailor_assets,
    resources=resources_by_env[os.getenv("ENV", "local")]
)
```
Now, each set of assets has its own `Definitions` object, and you can refer to them separately as `accuranker_defs`, `gsc_defs`, and `teamtailor_defs`. However, if your intention is to have a single `Definitions` object that includes all the assets and resources together, you should combine them into a single `Definitions` call:
```python
import os

from dagster import Definitions

# Combine all assets into a single list
all_assets = accuranker_assets + gsc_assets + teamtailor_assets

# Create a single Definitions object with all assets
all_defs = Definitions(
    assets=all_assets,
    resources=resources_by_env[os.getenv("ENV", "local")]
)
```
This way, you have one `Definitions` object named `all_defs` that includes all the assets and the resources. Remember that the `Definitions` object is meant to be a container for all related definitions in your Dagster project, so typically you would have one `Definitions` object per repository or project scope.
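As a side note, if you do want to keep building per-domain objects, recent Dagster releases provide `Definitions.merge` to combine them into the one object that gets loaded. A minimal sketch, assuming the per-domain asset lists from above and a Dagster version that includes `Definitions.merge`:

```python
from dagster import Definitions

# Per-domain Definitions objects, built as before
accuranker_defs = Definitions(assets=accuranker_assets)
gsc_defs = Definitions(assets=gsc_assets)
teamtailor_defs = Definitions(assets=teamtailor_assets)

# Merge them into the single Definitions object Dagster loads from this code location
defs = Definitions.merge(accuranker_defs, gsc_defs, teamtailor_defs)
```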
p
My goal is to schedule each Definition & asset pipeline differently @Scout
s
If you want to schedule each asset pipeline differently, you should create separate schedule definitions for each asset or asset group and then include all those schedules in a single `Definitions` object. Here's an example of how you might do that:
```python
import os

from dagster import (
    AssetSelection,
    Definitions,
    ScheduleDefinition,
    define_asset_job,
    load_assets_from_modules,
)

# Assuming you have different asset groups defined in separate modules
from . import accuranker_assets, gsc_assets, teamtailor_assets

# Load assets from each module
accuranker_all_assets = load_assets_from_modules([accuranker_assets])
gsc_all_assets = load_assets_from_modules([gsc_assets])
teamtailor_all_assets = load_assets_from_modules([teamtailor_assets])

# Define jobs for each asset group
accuranker_job = define_asset_job("accuranker_job", selection=AssetSelection.all())
gsc_job = define_asset_job("gsc_job", selection=AssetSelection.all())
teamtailor_job = define_asset_job("teamtailor_job", selection=AssetSelection.all())

# Define schedules for each job
accuranker_schedule = ScheduleDefinition(
    job=accuranker_job,
    cron_schedule="0 8 * * *",  # every day at 8 AM
)

gsc_schedule = ScheduleDefinition(
    job=gsc_job,
    cron_schedule="0 12 * * *",  # every day at 12 PM
)

teamtailor_schedule = ScheduleDefinition(
    job=teamtailor_job,
    cron_schedule="0 16 * * *",  # every day at 4 PM
)

# Combine everything into a single Definitions object
# (resources_by_env is assumed to be defined elsewhere)
defs = Definitions(
    assets=accuranker_all_assets + gsc_all_assets + teamtailor_all_assets,
    schedules=[accuranker_schedule, gsc_schedule, teamtailor_schedule],
    resources=resources_by_env[os.getenv("ENV", "local")]
)
```
In this example, we're creating separate jobs for each asset group and then defining a schedule for each job with different cron schedules. Finally, we combine all the assets and schedules into a single `Definitions` object. This way, each asset group will be scheduled to run at different times, as specified by their respective schedules. The `Definitions` object `defs` will contain all the assets and schedules, and you can use this object to manage your Dagster repository.
p
@Scout this seems like it gets out of hand very quickly when things scale bigger. What's best practice?
s
When scaling to a larger number of assets and schedules, it's important to maintain organization and avoid complexity in your codebase. Here are some best practices to consider:

1. Modularization: Keep your code organized by grouping related assets and schedules into separate modules or files. This makes it easier to manage and understand the different parts of your pipeline.
2. Dynamic generation: If your schedules follow a pattern or can be generated programmatically, consider writing a function that generates the schedules dynamically based on your assets (see the sketch after this list). This can reduce boilerplate and make it easier to adjust schedules globally.
3. Repository structure: Use a repository structure that logically groups assets and schedules. For example, you might have a directory for each domain or team within your organization, with each directory containing the assets and schedules relevant to that domain or team.
4. Configuration management: Externalize configuration where possible. Use environment variables, configuration files, or a configuration management system to manage the settings for different environments (development, staging, production).
5. Asset grouping: Group related assets together when possible, and schedule them as a single job if they are meant to run together. This can simplify scheduling and reduce the number of schedules you need to manage.
6. Documentation: Keep thorough documentation of your assets and schedules. This is especially important as the number of assets grows, as it helps new team members understand the system and aids in maintenance.
7. Code reviews and standards: Implement code review processes and coding standards to ensure that as new assets and schedules are added, they adhere to the established patterns and practices.
8. Automated testing: Implement automated tests for your assets and schedules to ensure that changes do not introduce regressions.
9. Monitoring and alerting: Set up monitoring and alerting for your schedules to quickly identify and respond to failures or performance issues.
10. Use of partitions and backfills: If your assets are partitioned, use Dagster's partitioning and backfill capabilities to manage and schedule runs for specific subsets of your data.

Remember that as your project grows, it's important to regularly refactor and reevaluate your setup to ensure that it remains maintainable. The practices that worked well for a small project may need to be adapted as the project scales. The documentation does not provide a one-size-fits-all solution for scaling, as the best approach can vary depending on the specific needs and context of your project. However, these best practices can serve as guidelines to help you structure your Dagster project in a scalable and maintainable way.
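As a sketch of the dynamic-generation idea from point 2 (the group names, cron expressions, and `all_assets` variable are assumptions carried over from the examples above), you can drive jobs and schedules from a single config mapping:

```python
from dagster import AssetSelection, Definitions, ScheduleDefinition, define_asset_job

# Hypothetical config: one entry per asset group, mapping to its cron expression
GROUP_SCHEDULES = {
    "Accuranker": "0 6 * * *",
    "GSC": "0 7 * * *",
    "TeamTailor_API": "0 8 * * *",
}

jobs = []
schedules = []
for group, cron in GROUP_SCHEDULES.items():
    # One job per group, selecting only that group's assets
    job = define_asset_job(f"{group.lower()}_job", selection=AssetSelection.groups(group))
    jobs.append(job)
    schedules.append(ScheduleDefinition(job=job, cron_schedule=cron))

defs = Definitions(
    assets=all_assets,  # assumed to be loaded elsewhere, as in the earlier examples
    jobs=jobs,
    schedules=schedules,
)
```

Adding a new pipeline then becomes a one-line change to the config mapping rather than three new blocks of boilerplate.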
p
I get the following error when I use your example above @Scout:

```
The above exception was caused by the following exception:
TypeError: Can't instantiate abstract class AssetSelection with abstract method resolve_inner
```
s
The error message you're encountering, "TypeError: Can't instantiate abstract class AssetSelection with abstract method resolve_inner," suggests that there's an attempt to instantiate the `AssetSelection` class directly, which is an abstract class in Dagster. The `AssetSelection` class is used to select a subset of assets for a job, and it provides several class methods for creating instances, such as `AssetSelection.all()`, `AssetSelection.keys()`, `AssetSelection.groups()`, and `AssetSelection.key_prefixes()`. You should use these class methods rather than trying to instantiate `AssetSelection` directly. Here's an example of how you might use `AssetSelection` correctly:
```python
from dagster import AssetSelection, define_asset_job

# Define a job that will materialize all assets
all_assets_job = define_asset_job("all_assets_job", selection=AssetSelection.all())

# Define a job that will materialize assets with specific keys
# (note: keys are passed as separate arguments, not as a single list)
specific_assets_job = define_asset_job(
    "specific_assets_job",
    selection=AssetSelection.keys("asset_key1", "asset_key2")
)
```
In this example, `AssetSelection.all()` and `AssetSelection.keys(...)` are used to create selections of assets for jobs. If you're encountering this error in a different context, please ensure that you're using the `AssetSelection` class methods correctly and not trying to create an instance of `AssetSelection` directly. If you continue to experience issues, please provide more context or code snippets, and I'll do my best to assist you further.
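For completeness, a minimal sketch of what triggers that error versus the supported constructors (the group and prefix names here are hypothetical):

```python
from dagster import AssetSelection, define_asset_job

# Instantiating the abstract base class directly reproduces the reported error:
# AssetSelection()  # TypeError: Can't instantiate abstract class AssetSelection ...

# The supported constructors are class methods:
group_job = define_asset_job("group_job", selection=AssetSelection.groups("Accuranker"))
prefix_job = define_asset_job("prefix_job", selection=AssetSelection.key_prefixes("raw"))
```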
p
That's my code @Scout:

```python
import os
from dagster import Definitions, FilesystemIOManager, load_assets_from_modules, AssetSelection, AssetsDefinition, ScheduleDefinition, define_asset_job
from dagster_gcp.gcs import GCSPickleIOManager, GCSResource
from .jt_assets.gsc import (
    gsc_api
)
from .jt_assets.accuranker import (
    accuranker_etl,
    df_to_gcs,
    gcs_to_bq,
    json_to_df
)
from .jt_assets.teamtailor import (
    tt_api
)
###########################################################
project_id = os.getenv("BIGQUERY_PROJECT_ID")

accuranker_assets = [
    accuranker_etl,
    df_to_gcs,
    gcs_to_bq,
    json_to_df
]
gsc_assets = [
    gsc_api
]
tt_assets = [
    tt_api
]
###########################################################
accuranker_assets = load_assets_from_modules(accuranker_assets, group_name="Accuranker")
gsc_assets = load_assets_from_modules(gsc_assets, group_name="GSC")
teamtailor_assets = load_assets_from_modules(tt_assets, group_name="TeamTailor_API")

all_assets = [
    *accuranker_assets,
    *gsc_assets,
    *tt_assets
]

# define jobs for each asset group
accuranker_job = define_asset_job("accuranker_job", selection=AssetSelection.all())
gsc_job = define_asset_job("gsc_job", selection=AssetSelection.all())
teamtailor_job = define_asset_job("teamtailor_job", selection=AssetSelection.all())

accuranker_schedule = ScheduleDefinition(
    job=accuranker_job,
    cron_schedule="0 6 * * *"
)
gsc_schedule = ScheduleDefinition(
    job=gsc_job,
    cron_schedule="0 6 * * *"
)
teamtailor_schedule = ScheduleDefinition(
    job=teamtailor_job,
    cron_schedule="0 6 * * *"
)
###########################################################
resources_by_env = {
    "prod": {
        "io_manager": GCSPickleIOManager(
            gcs_bucket=f"{project_id}-accuranker-orchestration-test",
            gcs_prefix="prefix"
        ),
        "gcs": GCSResource(project=f"{project_id}")
    },
    "local": {
        "io_manager": FilesystemIOManager(
            base_dir="data"
        )
    }
}
###########################################################
defs = Definitions(
    assets=all_assets,
    schedules=[accuranker_schedule, gsc_schedule, teamtailor_schedule],
    resources=resources_by_env[os.getenv("ENV", "local")]
)
###########################################################
```
s
The code you've provided is attempting to define jobs for different asset groups and schedule them with different cron schedules. However, there are a few issues that need to be addressed:

1. Asset selection: The `AssetSelection.all()` method selects all assets within the repository, so all three of your jobs currently materialize everything. If you want to schedule each asset group differently, you need a more specific selection for each job, for example `AssetSelection.keys(...)` or `AssetSelection.groups(...)` to target specific assets or groups of assets.
2. Asset loading: The `load_assets_from_modules` function expects a list of modules, not a list of asset functions. You should pass the modules that contain your assets, not the assets themselves.
3. Asset grouping: The `group_name` argument of `load_assets_from_modules` only works once you pass actual modules. Keep it so each module's assets land in a named group, which the job selections can then target.

Here's how you might adjust your code:
```python
import os

from dagster import (
    AssetSelection,
    Definitions,
    ScheduleDefinition,
    define_asset_job,
    load_assets_from_modules,
)

# Assuming you have modules named `accuranker`, `gsc`, and `teamtailor` that contain your asset definitions
from .jt_assets import accuranker, gsc, teamtailor

# Load assets from each module, assigning each module's assets to a named group
accuranker_assets = load_assets_from_modules([accuranker], group_name="Accuranker")
gsc_assets = load_assets_from_modules([gsc], group_name="GSC")
teamtailor_assets = load_assets_from_modules([teamtailor], group_name="TeamTailor_API")

# Define jobs for each asset group using specific selections
accuranker_job = define_asset_job("accuranker_job", selection=AssetSelection.groups("Accuranker"))
gsc_job = define_asset_job("gsc_job", selection=AssetSelection.groups("GSC"))
teamtailor_job = define_asset_job("teamtailor_job", selection=AssetSelection.groups("TeamTailor_API"))

# Define schedules for each job with specific cron schedules
accuranker_schedule = ScheduleDefinition(job=accuranker_job, cron_schedule="0 6 * * *")
gsc_schedule = ScheduleDefinition(job=gsc_job, cron_schedule="0 7 * * *")
teamtailor_schedule = ScheduleDefinition(job=teamtailor_job, cron_schedule="0 8 * * *")

# Combine everything into a single Definitions object
# (resources_by_env is defined as in your original code)
defs = Definitions(
    assets=accuranker_assets + gsc_assets + teamtailor_assets,
    schedules=[accuranker_schedule, gsc_schedule, teamtailor_schedule],
    resources=resources_by_env[os.getenv("ENV", "local")]
)
```
In this adjusted code, we're assuming that `accuranker`, `gsc`, and `teamtailor` are modules that contain your asset definitions. We load assets from these modules, assign each module's assets to a group, and then define jobs with specific asset selections based on those groups. Each job is then scheduled with a different cron schedule. Please ensure that your asset definitions are organized into modules and that you're using the correct selection methods to target the assets you want to include in each job. If you continue to face issues, please provide more details about the structure of your asset definitions and how they are organized within your modules.