how to run test on dbt sources in dagster dagster #ask-ai

Join Slack

how to run test on dbt sources in dagster ?

# ask-ai

Jérôme MONTIGNY

02/20/2024, 3:38 PM

how to run test on dbt sources in dagster ?

Scout

02/20/2024, 3:38 PM

To run tests on dbt sources in Dagster, you can leverage the

dagster-dbt

integration which allows you to execute dbt commands within a Dagster job. According to the documentation, you can use the

dbt_test_op

to execute a

dbt test

command. This operation requires the use of a dbt resource, which can be configured to execute the command through the CLI using the

dbt_cli_resource

. Here is an example from the documentation on how to set up a Dagster job that runs dbt tests:

Copy code

python from dagster import job from dagster_dbt import dbt_test_op, dbt_cli_resource  @job(resource_defs={"dbt": dbt_cli_resource}) def my_dbt_cli_job():     dbt_test_op()

In this example,

dbt_test_op

is an operation provided by the

dagster-dbt

library that runs the

dbt test

command. The

dbt_cli_resource

is a resource definition that specifies how to run dbt commands, such as the path to the dbt project and profiles. If you want to run tests specifically on dbt sources, you would need to ensure that your dbt project is set up with source tests and that the dbt tests are configured correctly in your dbt project files. Once you have your dbt source tests defined, running the

dbt_test_op

within a Dagster job should execute those tests. Additionally, there has been a discussion on integrating dbt tests with Dagster asset checks, which suggests that you can model dbt tests as Dagster asset checks. This is an experimental feature, and you would need to opt-in your dbt tests to be modeled as Dagster asset checks by adding specific configuration to your

dbt_project.yml

file and then invoking a dbt command that runs your tests, such as

dbt test

dbt build

. For more detailed guidance on integrating dbt tests with Dagster asset checks, you can refer to the related discussions and documentation on the

dagster-dbt

integration and asset checks.

Jérôme MONTIGNY

02/20/2024, 3:40 PM

@Scout would it be possible to do this using a define_asset_job ?

Scout

02/20/2024, 3:40 PM

According to the documentation,

define_asset_job

is a function for defining a job from a selection of assets. It is typically used to create jobs that materialize assets based on the dependencies between them. However,

define_asset_job

is not directly used for running dbt tests. To run dbt tests within the context of a Dagster job that is defined using

define_asset_job

, you would need to create a separate operation that invokes the dbt tests and include that operation in your job definition. However, this is not the typical use case for

define_asset_job

, as it is designed to work with assets rather than operations. If you want to run dbt tests as part of a job that also materializes assets, you would typically define a separate operation for the dbt tests and include it in a standard Dagster job definition, not an asset job. Here's an example of how you might define a job that includes both asset materialization and dbt tests:

Copy code

python from dagster import job, op from dagster_dbt import dbt_cli_resource  @op(required_resource_keys={"dbt"}) def run_dbt_tests(context):     # This op would invoke dbt tests using the dbt CLI resource     context.resources.dbt.run_cli(["test"])  @job(resource_defs={"dbt": dbt_cli_resource}) def my_dbt_job():     run_dbt_tests()     # Include other ops or assets that you want to materialize as part of this job

In this example,

run_dbt_tests

is a separate operation that uses the dbt CLI resource to run dbt tests. It is included in a job definition along with other operations or assets that you want to materialize. If you are looking to integrate dbt tests more closely with Dagster's asset system, you might consider using asset checks or other mechanisms provided by Dagster for running checks or tests on assets. However, the specifics of how to do this would depend on the features available in the version of Dagster you are using and might require custom implementation.

Jérôme MONTIGNY

02/20/2024, 3:45 PM

@Scout Basically, I load data using airbyte, then I want to run tests on the sources in dbt and after that run my dbt models and the test on these models. If everything worked the way it is supposed to, I want to run a job named publish using a sensor. How do I do that ?

Scout

02/20/2024, 3:45 PM

To achieve the workflow you described—loading data with Airbyte, running dbt source tests, running dbt models and tests on those models, and finally triggering a

publish

job if everything succeeds—you can follow these steps: 1. Load Data with Airbyte: Define a job that uses Airbyte to load data into your data warehouse. 2. Run dbt Source Tests: After loading the data, define a dbt job that runs the source tests. You can use the

dbt_cli_resource

and

dbt_test_op

from the

dagster-dbt

integration to run these tests. 3. Run dbt Models: If the source tests pass, run your dbt models. You can use the

dbt_run_op

operation provided by

dagster-dbt

to execute your dbt models. 4. Run dbt Model Tests: After running the dbt models, run tests on these models using

dbt_test_op

again. 5. Define a Sensor for the Publish Job: Create a sensor that triggers the

publish

job when all the previous steps are successful. You can use the

run_status_sensor

decorator to monitor the success of the dbt model tests job and trigger the

publish

job. Here's an example of how you might set up the jobs and sensor:

Copy code

python
from dagster import job, sensor, RunRequest, DagsterRunStatus
from dagster_dbt import dbt_cli_resource, dbt_run_op, dbt_test_op

# Define your Airbyte job (not shown here)

# Define your dbt job
@job(resource_defs={"dbt": dbt_cli_resource})
def dbt_job():
    dbt_run_op()  # Run dbt models
    dbt_test_op()  # Run dbt tests on models

# Define your publish job (not shown here)

# Define a sensor that triggers the publish job after dbt_job succeeds
@sensor(job=publish_job)
def dbt_job_success_sensor(context):
    runs = context.instance.get_runs(
        filters=RunsFilter(
            tags={"dagster/job": "dbt_job"},
            statuses=[DagsterRunStatus.SUCCESS],
        ),
        limit=1,
    )
    if runs:
        last_run = runs[0]
        if context.cursor != last_run.run_id:
            context.update_cursor(last_run.run_id)
            yield RunRequest(run_key=last_run.run_id)

In this example, the

dbt_job

is a Dagster job that runs your dbt models and tests. The

dbt_job_success_sensor

is a sensor that triggers the

publish

job when

dbt_job

completes successfully. The sensor uses the

context.cursor

to ensure that it only triggers the

publish

job once for each successful run of

dbt_job

. Please note that this is a simplified example. In practice, you would need to configure your dbt operations with the appropriate resource configurations, and you might need to handle dependencies between operations explicitly. Additionally, the exact implementation details may vary depending on the version of Dagster and the

dagster-dbt

integration you are using, as well as the specifics of your dbt project and

Scout

02/20/2024, 3:45 PM

Airbyte configuration.

Jérôme MONTIGNY

02/21/2024, 8:15 AM

@Scout How do I link the Airbyte and the dbt jobs. What about the lineage ? Will it show the real lineage between Airbyte tables and dbt models ?

Scout

02/21/2024, 8:15 AM

To link Airbyte and dbt jobs in Dagster and ensure that the lineage between Airbyte tables and dbt models is accurately represented, you need to follow these steps: 1. Define Airbyte Assets: Use the

build_airbyte_assets

function provided by the

dagster-airbyte

integration to define assets that represent the tables created by an Airbyte sync operation. These assets will be part of your Dagster asset graph. 2. Define dbt Assets: Use the

dbt_cli_resource

along with the

dbt_run_op

and

dbt_test_op

operations to define dbt models and tests as assets in Dagster. The

dagster-dbt

integration allows you to represent each dbt model as its own asset. 3. Set Up Dependencies: Ensure that the asset keys of your Airbyte assets match up with the asset keys of your dbt sources. Dagster determines upstream dependencies for dbt-based assets by looking inside the dbt

sources.yaml

. Each table in

sources.yaml

is converted into a Dagster asset key. If a dbt model references a table named "raw_orders" inside a dbt source named "prod_db", then the Dagster asset corresponding to that dbt model will have an asset dependency with

AssetKey(["prod_db", "raw_orders"])

. 4. Materialize Assets: When you materialize assets in Dagster, it will respect the dependencies you have set up. If you have an asset with the key

AssetKey(["prod_db", "raw_orders"])

in your Dagster deployment, then the UI will show the connection between those assets. Dagster will wait for the

raw_orders

asset to complete before running the downstream dbt model. Here's an example of how you might define the Airbyte and dbt assets and set up the dependencies: ```python from dagster import job, op, AssetKey from dagster_airbyte import build_airbyte_assets, airbyte_sync_op from dagster_dbt import dbt_cli_resource, dbt_run_op, dbt_test_op # Define Airbyte assets airbyte_assets = build_airbyte_assets( connection_id="your_airbyte_connection_id", destination_tables=["your_destination_table"], # ... other parameters ) # Define dbt assets @op(required_resource_keys={"dbt"}, out={"dbt_result": Out(asset_key=AssetKey(["prod_db", "raw_orders"]))}) def run_dbt_models(context): # This op would invoke dbt run using the dbt CLI resource context.resources.dbt.run_cli(["run"]) @op(required_resource_keys={"dbt"}, ins={"dbt_result": In(asset_key=AssetKey(["prod_db", "raw_orders"]))}) def test_dbt_models(context): # This op would invoke dbt test using the dbt CLI resource context.resources.dbt.run_cli(["test"]) # Define a job that materializes the Airbyte and dbt assets @job(resource_defs={"dbt": dbt_cli_resource}) def etl_job(): airbyte_sync_op() dbt_result = run_dbt_models() test_dbt_models(dbt_result) # Define your publish job (not shown here) ``

Scout

02/21/2024, 8:16 AM

` In this example,

airbyte_sync_op

is used to sync data from Airbyte, and

run_dbt_models

and

test_dbt_models

are ops that run dbt models and tests, respectively. The

dbt_result

output from

run_dbt_models

is passed as an input to

test_dbt_models

, establishing a dependency between the two operations. The lineage between Airbyte tables and dbt models will be visible in the Dagster UI, showing how data flows from the source (Airbyte) to the transformation (dbt models) and finally to the tests. This assumes that the asset keys in your dbt

sources.yaml

match the asset keys generated by the

build_airbyte_assets

function. Please note that the actual implementation details may vary depending on the specifics of your Airbyte and dbt configurations. You may need to customize the asset keys and dependencies to match your setup.

Jérôme MONTIGNY

02/21/2024, 9:56 AM

@Scout are you sure that the context.resources.dbt.run_cli(["test"]) function exists ? it seems to be named only cli actually

Scout

02/21/2024, 9:56 AM

You are correct; my apologies for the oversight. The function to invoke dbt commands using the

DbtCliResource

is named

cli

, not

run_cli

. The

cli

method is used to execute dbt commands programmatically within a Dagster op or asset. Here's the corrected example using the

cli

method:

Copy code

python from dagster import job, op, Out, In, AssetKey from dagster_airbyte import build_airbyte_assets, airbyte_sync_op from dagster_dbt import dbt_cli_resource  # Define Airbyte assets airbyte_assets = build_airbyte_assets(     connection_id="your_airbyte_connection_id",     destination_tables=["your_destination_table"],     # ... other parameters )  # Define dbt assets @op(required_resource_keys={"dbt"}, out={"dbt_result": Out(asset_key=AssetKey(["prod_db", "raw_orders"]))}) def run_dbt_models(context):     # This op would invoke dbt run using the dbt CLI resource     context.resources.dbt.cli(["run"])  @op(required_resource_keys={"dbt"}, ins={"dbt_result": In(asset_key=AssetKey(["prod_db", "raw_orders"]))}) def test_dbt_models(context):     # This op would invoke dbt test using the dbt CLI resource     context.resources.dbt.cli(["test"])  # Define a job that materializes the Airbyte and dbt assets @job(resource_defs={"dbt": dbt_cli_resource}) def etl_job():     airbyte_sync_op()     dbt_result = run_dbt_models()     test_dbt_models(dbt_result)  # Define your publish job (not shown here)

In this corrected example,

context.resources.dbt.cli(["run"])

and

context.resources.dbt.cli(["test"])

are used to run dbt commands within the

run_dbt_models

and

test_dbt_models

ops, respectively. The

cli

method is part of the

DbtCliResource

which is used to interact with dbt via the command line interface from within Dagster. Again, I apologize for the confusion, and thank you for pointing that out.

Jérôme MONTIGNY

02/21/2024, 9:59 AM

@Scout I want to run tests on the Airbyte tables using dbt sources, how do I run them ?

Scout

02/21/2024, 9:59 AM

To run tests on Airbyte tables using dbt sources, you'll need to follow these steps: 1. Set Up Airbyte: Ensure that Airbyte is configured to sync data from your sources to your destination tables. 2. Configure dbt: Set up your dbt project to include source definitions that correspond to the tables created by Airbyte. Define tests in your

schema.yml

files within your dbt project. 3. Install Dagster Integrations: Make sure you have the

dagster-dbt

and

dagster-airbyte

integrations installed in your environment. 4. Define Airbyte Assets: Use the

build_airbyte_assets

function from the

dagster-airbyte

integration to define assets that represent the tables created by an Airbyte sync operation. 5. Create a Dagster Job: Define a Dagster job that includes the Airbyte sync operation and dbt test operation. You will use the

dbt_cli_resource

to run dbt commands. 6. Run dbt Source Tests: Use the

dbt_cli_resource

to run the

dbt test

command, which will execute the tests defined in your dbt project against the Airbyte tables. Here's an example of how you might set up a Dagster job to run dbt tests on Airbyte tables:

Copy code

python
from dagster import job, op, AssetKey
from dagster_airbyte import build_airbyte_assets, airbyte_sync_op
from dagster_dbt import dbt_cli_resource

# Define Airbyte assets
airbyte_assets = build_airbyte_assets(
    connection_id="your_airbyte_connection_id",
    destination_tables=["your_destination_table"],
    # ... other parameters
)

# Define an op to run dbt source tests
@op(required_resource_keys={"dbt"})
def run_dbt_source_tests(context):
    # This op would invoke dbt test using the dbt CLI resource
    # The --select argument is used to run tests only on the specified sources
    context.resources.dbt.cli(["test", "--select", "source:your_source_name"])

# Define a job that materializes the Airbyte assets and runs dbt source tests
@job(resource_defs={"dbt": dbt_cli_resource})
def airbyte_dbt_test_job():
    airbyte_sync_op()
    run_dbt_source_tests()

# Execute the job
if __name__ == "__main__":
    result = airbyte_dbt_test_job.execute_in_process()

In this example,

airbyte_sync_op

is used to sync data from Airbyte, and

run_dbt_source_tests

is an op that runs dbt tests on the sources that correspond to the Airbyte tables. The

dbt_cli_resource

is used to execute dbt commands within the

run_dbt_source_tests

op. Make sure to replace

"your_airbyte_connection_id"

"your_destination_table"

, and

"source:your_source_name"

with the actual values from your Airbyte and dbt configurations. The

--select

argument in the dbt CLI command is used to specify which sources to test. You can use dbt's selection syntax to run tests on specific sources or models. Please note that the actua

Scout

02/21/2024, 10:00 AM

l implementation details may vary depending on the specifics of your Airbyte and dbt configurations. You may need to customize the asset keys, dbt project paths, and other parameters to match your setup.

Jérôme MONTIGNY

02/21/2024, 10:06 AM

@Scout can I do this using load_assets_from_airbyte_instance ?

Scout

02/21/2024, 10:06 AM

The

load_assets_from_airbyte_instance

function is used to load Airbyte assets into Dagster from a live Airbyte instance. This function will automatically generate asset definitions from your Airbyte project, creating software-defined assets for each data stream synced by your connections. However, the

load_assets_from_airbyte_instance

function does not directly provide a way to run dbt tests on the Airbyte tables. To run dbt tests on the Airbyte tables, you would still need to use the

dagster-dbt

integration to define a separate operation or job that runs the dbt tests after the Airbyte sync operation has completed. Here's how you might set up a Dagster job using

load_assets_from_airbyte_instance

to load Airbyte assets and then run dbt source tests:

Copy code

python from dagster import job, op from dagster_airbyte import load_assets_from_airbyte_instance, airbyte_resource from dagster_dbt import dbt_cli_resource  # Configure the Airbyte resource airbyte_instance = airbyte_resource.configured({     "host": "localhost",     "port": "8000",     # ... other configuration options })  # Load assets from the Airbyte instance airbyte_assets = load_assets_from_airbyte_instance(airbyte_instance)  # Define an op to run dbt source tests @op(required_resource_keys={"dbt"}) def run_dbt_source_tests(context):     # This op would invoke dbt test using the dbt CLI resource     context.resources.dbt.cli(["test", "--select", "source:your_source_name"])  # Define a job that materializes the Airbyte assets and runs dbt source tests @job(resource_defs={"dbt": dbt_cli_resource, "airbyte": airbyte_instance}) def airbyte_dbt_test_job():     # The airbyte_assets will be materialized as part of the job execution     run_dbt_source_tests()  # Execute the job if __name__ == "__main__":     result = airbyte_dbt_test_job.execute_in_process()

In this example,

load_assets_from_airbyte_instance

is used to load assets from a live Airbyte instance, and

run_dbt_source_tests

is an op that runs dbt tests on the sources that correspond to the Airbyte tables. The

dbt_cli_resource

is used to execute dbt commands within the

run_dbt_source_tests

op. Make sure to replace

"localhost"

"8000"

"your_source_name"

, and any other placeholders with the actual values from your Airbyte and dbt configurations. The

--select

argument in the dbt CLI command is used to specify which sources to test. Please note that the actual implementation details may vary depending on the specifics of your Airbyte and dbt configurations. You may need to customize the resource definitions, asset keys, dbt project paths, and other parameters to match your setup.

Jérôme MONTIGNY

02/21/2024, 10:09 AM

@Scout I selected a subpart of the airbyte tables, how does it knows which subpart to pick ?

Scout

02/21/2024, 10:09 AM

When you select a subpart of the Airbyte tables in Dagster, you typically specify which tables you want to include as assets using the

destination_tables

parameter in the

build_airbyte_assets

function or by defining a filter function when using

load_assets_from_airbyte_instance

. These mechanisms allow you to control which tables are represented as assets in the Dagster asset graph. For

build_airbyte_assets

, you explicitly list the tables you want to include:

Copy code

python airbyte_assets = build_airbyte_assets(     connection_id="your_airbyte_connection_id",     destination_tables=["table1", "table2", "table3"],     # ... other parameters )

For

load_assets_from_airbyte_instance

, you can provide a filter function that returns

True

for the connections you want to include:

Copy code

python from dagster_airbyte import load_assets_from_airbyte_instance, AirbyteResource  airbyte_instance = AirbyteResource.configured({     "host": "localhost",     "port": "8000",     # ... other configuration options })  def connection_filter_fn(connection):     # Implement logic to determine if a connection should be included     return connection.name in ["connection1", "connection2"]  airbyte_assets = load_assets_from_airbyte_instance(     airbyte_instance,     connection_filter=connection_filter_fn,     # ... other parameters )

In both cases, Dagster uses the provided information to determine which Airbyte tables to represent as assets. When you run a job that includes these assets, only the specified tables will be materialized. If you are using

load_assets_from_airbyte_instance

and want to select a subset of tables within a specific Airbyte connection, you would need to implement additional logic to filter or map the connections to the desired tables. This could involve inspecting the metadata provided by Airbyte or implementing custom logic based on your knowledge of the Airbyte configuration. It's important to ensure that the subset of tables you select aligns with the dbt source definitions you plan to test. The dbt source tests should be defined in your dbt project's

schema.yml

files, and the source names used in the dbt

test

command should match the source names defined in your dbt project. This ensures that when you run dbt tests, they are executed against the correct subset of tables that have been materialized as assets in Dagster.

Jérôme MONTIGNY

02/21/2024, 10:19 AM

@Scout ok, and after that's done using

Copy code

airbyte_assets = load_assets_from_airbyte_instance(
    airbyte_instance,
    connection_filter=connection_filter_fn,
    # ... other parameters
)

How do I run the tests on the dbt sources loaded by these table with the proper lineage ?

Scout

02/21/2024, 10:19 AM

To run tests on dbt sources that correspond to the tables loaded by Airbyte and maintain the proper lineage, you need to ensure that your dbt project is set up correctly with source definitions and tests, and then you need to integrate these dbt tests within your Dagster asset graph. Here's a step-by-step guide on how to do this: 1. Set Up dbt Sources and Tests: In your dbt project, define sources that correspond to the tables loaded by Airbyte. Then, define tests for these sources in your

schema.yml

files. 2. Load Airbyte Assets: Use

load_assets_from_airbyte_instance

with a

connection_filter

to load only the relevant Airbyte tables as assets in Dagster. 3. Define dbt Assets: Use the

dagster-dbt

integration to define dbt models and tests as assets in Dagster. You can use the

dbt_cli_resource

to run dbt commands. 4. Create Asset Dependencies: Ensure that the asset keys for the Airbyte assets match the source definitions in your dbt project. This will create the correct lineage between the Airbyte tables and the dbt models. 5. Run dbt Source Tests: Define a Dagster job that includes an operation to run dbt source tests after the Airbyte sync operation has completed. Here's an example of how you might set up a Dagster job to run dbt tests on Airbyte tables with proper lineage: ```python from dagster import job, op, AssetKey from dagster_airbyte import load_assets_from_airbyte_instance, airbyte_resource from dagster_dbt import dbt_cli_resource # Configure the Airbyte resource airbyte_instance = airbyte_resource.configured({ "host": "localhost", "port": "8000", # ... other configuration options }) # Load assets from the Airbyte instance def connection_filter_fn(connection): # Implement logic to determine if a connection should be included return True # or your custom logic airbyte_assets = load_assets_from_airbyte_instance( airbyte_instance, connection_filter=connection_filter_fn, # ... other parameters ) # Define an op to run dbt source tests @op(required_resource_keys={"dbt"}) def run_dbt_source_tests(context): # This op would invoke dbt test using the dbt CLI resource # The --select argument is used to run tests only on the specified sources context.resources.dbt.cli(["test", "--select", "source:your_source_name"]) # Define a job that materializes the Airbyte assets and runs dbt source tests @job(resource_defs={"dbt": dbt_cli_resource, "airbyte": airbyte_instance}) def airbyte_dbt_test_job(): for asset in airbyte_assets: # Materialize each Airbyte asset asset() # Run dbt source tests after Airbyte assets have been materialized run_dbt_source_tests() # Execute the job if name == "__main__": result = airbyte_dbt_test_job.execute_in

Scout

02/21/2024, 10:20 AM

_process() ``` In this example,

airbyte_assets

is a list of assets generated from the Airbyte instance, and

run_dbt_source_tests

is an op that runs dbt tests on the sources that correspond to the Airbyte tables. The

dbt_cli_resource

is used to execute dbt commands within the

run_dbt_source_tests

op. The lineage between the Airbyte tables and dbt models will be visible in the Dagster UI, showing how data flows from the source (Airbyte) to the transformation (dbt models) and finally to the tests. This assumes that the asset keys in your dbt

sources.yaml

match the asset keys generated by the

load_assets_from_airbyte_instance

function. Please note that the actual implementation details may vary depending on the specifics of your Airbyte and dbt configurations. You may need to customize the resource definitions, asset keys, dbt project paths, and other parameters to match your setup.

Jérôme MONTIGNY

02/21/2024, 10:27 AM

@Scout how do I chain it with a dbt run of a subpart of my dbt models (the subpart is in a specific folder)?

Scout

02/21/2024, 10:27 AM

To chain a dbt run of a subpart of your dbt models located in a specific folder after running dbt source tests, you can use the

dbt_cli_resource

to execute a dbt run command with a selection criteria that targets the specific folder or models you want to include. You can use the

--models

--select

flag in the dbt CLI to specify the subpart of models to run. Here's how you can set up a Dagster job to run dbt source tests followed by a dbt run for a specific subpart of your dbt models: 1. Define an Op for dbt Source Tests: Create an op that uses the

dbt_cli_resource

to run dbt source tests. 2. Define an Op for dbt Run: Create another op that uses the

dbt_cli_resource

to run a subset of dbt models located in a specific folder. 3. Chain the Ops in a Job: Define a Dagster job that first runs the dbt source tests and then, if successful, runs the dbt models. Here's an example of how you might set up the ops and job:

Copy code

python from dagster import job, op, AssetKey from dagster_dbt import dbt_cli_resource  # Define an op to run dbt source tests @op(required_resource_keys={"dbt"}) def run_dbt_source_tests(context):     context.resources.dbt.cli(["test", "--select", "source:your_source_name"])  # Define an op to run a subpart of dbt models in a specific folder @op(required_resource_keys={"dbt"}) def run_dbt_models_subpart(context):     # Replace 'your_folder_name' with the actual folder name or model selection criteria     context.resources.dbt.cli(["run", "--models", "your_folder_name.*"])  # Define a job that runs dbt source tests and then a subpart of dbt models @job(resource_defs={"dbt": dbt_cli_resource}) def dbt_test_and_run_job():     run_dbt_source_tests()     run_dbt_models_subpart()  # Execute the job if __name__ == "__main__":     result = dbt_test_and_run_job.execute_in_process()

In this example,

run_dbt_source_tests

is an op that runs dbt tests on the sources, and

run_dbt_models_subpart

is an op that runs a subset of dbt models located in a specific folder. The

dbt_test_and_run_job

job chains these two ops together, ensuring that the dbt models are run only after the source tests have completed successfully. Make sure to replace

"your_source_name"

and

"your_folder_name.*"

with the actual source name and folder or model selection criteria from your dbt project. The

--models

flag in the dbt CLI command is used to specify which models to run, and you can use dbt's selection syntax to run models based on different criteria, such as folder, tag, or model name. Please note that the actual implementation details may vary depending on the specifics of your dbt configuration. You may need to customize the dbt project paths, selection criteria, and other parameters to match your setup.

Jérôme MONTIGNY

02/21/2024, 10:42 AM

@Scout about your for loop to materialize the Airbyte assets, I am getting this :

Copy code

TypeError: 'AirbyteInstanceCacheableAssetsDefinition' object is not iterable

Any idea why ?

Scout

02/21/2024, 10:42 AM

The error message

TypeError: 'AirbyteInstanceCacheableAssetsDefinition' object is not iterable

suggests that the object returned by

load_assets_from_airbyte_instance

is not a list or iterable of assets, but rather a single asset definition object that represents all the assets for the Airbyte connection. In Dagster, when you use

load_assets_from_airbyte_instance

, you don't need to iterate over the returned object to materialize assets. Instead, the function creates a set of assets that are automatically added to your repository and materialized when you run the job that includes them. Here's how you can define a job that includes Airbyte assets and runs dbt source tests without iterating over the assets:

Copy code

python from dagster import job from dagster_airbyte import load_assets_from_airbyte_instance, airbyte_resource from dagster_dbt import dbt_cli_resource  # Configure the Airbyte resource airbyte_instance = airbyte_resource.configured({     "host": "localhost",     "port": "8000",     # ... other configuration options })  # Load assets from the Airbyte instance airbyte_assets = load_assets_from_airbyte_instance(     airbyte_instance,     # ... other parameters )  # Define an op to run dbt source tests @op(required_resource_keys={"dbt"}) def run_dbt_source_tests(context):     # This op would invoke dbt test using the dbt CLI resource     context.resources.dbt.cli(["test", "--select", "source:your_source_name"])  # Define a job that includes the Airbyte assets and runs dbt source tests @job(resource_defs={"dbt": dbt_cli_resource, "airbyte": airbyte_instance}) def airbyte_dbt_test_job():     run_dbt_source_tests()  # Execute the job if __name__ == "__main__":     result = airbyte_dbt_test_job.execute_in_process()

In this example,

airbyte_assets

is created by

load_assets_from_airbyte_instance

, and it represents all the assets for the Airbyte connection. You don't need to manually materialize these assets in the job; they are part of the asset graph and will be materialized when the job is executed. The

run_dbt_source_tests

op is defined to run dbt source tests, and it's included in the

airbyte_dbt_test_job

. When you execute this job, Dagster will automatically handle the materialization of the Airbyte assets before running the dbt source tests. Make sure to replace

"localhost"

"8000"

"your_source_name"

, and any other placeholders with the actual values from your Airbyte and dbt configurations. The

--select

argument in the dbt CLI command is used to specify which sources to test. Please note that the actual implementation details may vary depending on the specifics of your Airbyte and dbt configurations. You may need to customize the resource definitions, dbt project paths, and other parameters to match your setup.

Jérôme MONTIGNY

02/21/2024, 1:53 PM

@Scout how do I create a job that loads a subpart of the tables from an airbyte instance and triggers a job on the success of the load ?

Scout

02/21/2024, 1:53 PM

To create a job that loads a subpart of the tables from an Airbyte instance and triggers another job upon successful load, you can follow these steps: 1. Load Airbyte Assets: Use

load_assets_from_airbyte_instance

to load assets from an Airbyte instance. You can use a

connection_filter

to specify which connections (and thus which tables) you want to include. 2. Define a Job for Loading Assets: Define a job that includes the Airbyte assets. This job will be responsible for triggering the Airbyte sync and loading the data. 3. Define a Sensor: Create a sensor that monitors the success of the asset-loading job. When the job succeeds, the sensor will trigger another job. Here's an example of how you might set up the job and sensor:

Copy code

python
from dagster import job, op, sensor, RunRequest, repository
from dagster_airbyte import load_assets_from_airbyte_instance, airbyte_resource
from dagster_dbt import dbt_cli_resource

# Configure the Airbyte resource
airbyte_instance = airbyte_resource.configured({
    "host": "localhost",
    "port": "8000",
    # ... other configuration options
})

# Load assets from the Airbyte instance
def connection_filter_fn(connection):
    # Implement logic to determine if a connection should be included
    return True  # or your custom logic

airbyte_assets = load_assets_from_airbyte_instance(
    airbyte_instance,
    connection_filter=connection_filter_fn,
    # ... other parameters
)

# Define a job that includes the Airbyte assets
@job(resource_defs={"airbyte": airbyte_instance})
def airbyte_load_job():
    # The airbyte_assets will be materialized as part of the job execution
    pass

# Define another job that should be triggered after successful load
@job(resource_defs={"dbt": dbt_cli_resource})
def downstream_job():
    # Define the operations that should run after the Airbyte load
    pass

# Define a sensor that triggers the downstream job upon successful load
@sensor(job=downstream_job)
def airbyte_load_success_sensor(context):
    runs = context.instance.get_runs(
        filters=RunsFilter(
            tags={"dagster/job": "airbyte_load_job"},
            statuses=[DagsterRunStatus.SUCCESS],
        ),
        limit=1,
    )
    if runs:
        last_run = runs[0]
        if context.cursor != last_run.run_id:
            context.update_cursor(last_run.run_id)
            yield RunRequest(run_key=last_run.run_id)

# Define a repository to hold the jobs and sensor
@repository
def my_repository():
    return [airbyte_load_job, downstream_job, airbyte_load_success_sensor]

In this example,

airbyte_load_job

is a job that includes the Airbyte assets, which will be materialized when the job is executed. The

downstream_job

is another job that should be triggered after the Airbyte assets have be

Scout

02/21/2024, 1:53 PM

en successfully loaded. The

airbyte_load_success_sensor

is a sensor that monitors the success of the

airbyte_load_job

. When it detects a successful run, it triggers the

downstream_job

. The sensor uses the

context.cursor

to ensure that it only triggers the

downstream_job

once for each successful run of

airbyte_load_job

. Make sure to replace

"localhost"

"8000"

, and any other placeholders with the actual values from your Airbyte configuration. The

connection_filter_fn

should contain the logic to filter the connections you want to load as assets. Please note that the actual implementation details may vary depending on the specifics of your Airbyte and dbt configurations. You may need to customize the resource definitions, dbt project paths, and other parameters to match your setup.

Jérôme MONTIGNY

02/21/2024, 2:09 PM

@Scout Would it be possible to do so with operators instead ?

Scout

02/21/2024, 2:09 PM

Yes, you can achieve the same workflow using ops instead of assets in Dagster. To do this, you would define ops for the Airbyte sync and dbt run, and then chain these ops together in a job. You can still use a sensor to trigger another job upon the successful completion of the Airbyte sync. Here's a high-level outline of how you might set up the job using ops: 1. Define an Op for Airbyte Sync: Create an op that triggers the Airbyte sync. This op can use the

airbyte_resource

to interact with the Airbyte API. 2. Define an Op for dbt Run: Create an op that runs the dbt models. This op can use the

dbt_cli_resource

to run dbt commands. 3. Chain the Ops in a Job: Define a job that first runs the Airbyte sync op and then, if successful, runs the dbt run op. 4. Define a Sensor: Create a sensor that monitors the success of the job and triggers another job if the Airbyte sync and dbt run are successful. Here's an example of how you might set up the ops, job, and sensor:

Copy code

python
from dagster import job, op, sensor, RunRequest, repository, DagsterRunStatus, RunsFilter
from dagster_airbyte import airbyte_resource
from dagster_dbt import dbt_cli_resource

# Define an op for Airbyte sync
@op(required_resource_keys={"airbyte"})
def airbyte_sync_op(context):
    # Use the airbyte_resource to trigger the Airbyte sync
    # Replace 'your_airbyte_connection_id' with the actual connection ID
    context.resources.airbyte.sync_connection('your_airbyte_connection_id')

# Define an op for dbt run
@op(required_resource_keys={"dbt"})
def dbt_run_op(context):
    # Use the dbt_cli_resource to run dbt models
    context.resources.dbt.cli(["run"])

# Define a job that chains the Airbyte sync and dbt run ops
@job(resource_defs={"airbyte": airbyte_resource, "dbt": dbt_cli_resource})
def airbyte_to_dbt_job():
    airbyte_sync_op()
    dbt_run_op()

# Define another job that should be triggered after successful load
@job
def downstream_job():
    # Define the operations that should run after the Airbyte load
    pass

# Define a sensor that triggers the downstream job upon successful load
@sensor(job=downstream_job)
def airbyte_to_dbt_success_sensor(context):
    runs = context.instance.get_runs(
        filters=RunsFilter(
            tags={"dagster/job": "airbyte_to_dbt_job"},
            statuses=[DagsterRunStatus.SUCCESS],
        ),
        limit=1,
    )
    if runs:
        last_run = runs[0]
        if context.cursor != last_run.run_id:
            context.update_cursor(last_run.run_id)
            yield RunRequest(run_key=last_run.run_id)

# Define a repository to hold the jobs and sensor
@repository
def my_repository():
    return [airbyte_to_dbt_job, downstream_job, airbyte_to_dbt_success_sensor]

In this example,

airbyte_sync_op

is an o

Scout

02/21/2024, 2:11 PM

p that triggers an Airbyte sync, and

dbt_run_op

is an op that runs dbt models. The

airbyte_to_dbt_job

job chains these two ops together. The

airbyte_to_dbt_success_sensor

is a sensor that triggers the

downstream_job

when the

airbyte_to_dbt_job

completes successfully. Make sure to replace

"your_airbyte_connection_id"

and any other placeholders with the actual values from your Airbyte and dbt configurations. The sensor uses the

context.cursor

to ensure that it only triggers the

downstream_job

once for each successful run of

airbyte_to_dbt_job

. Please note that the actual implementation details may vary depending on the specifics of your Airbyte and dbt configurations. You may need to customize the resource definitions, dbt project paths, and other parameters to match your setup.

Jérôme MONTIGNY

02/21/2024, 2:19 PM

@Scout how does airbyte gets to know which assets to load based on this op :

Copy code

@op(required_resource_keys={"airbyte"})
def airbyte_sync_op(context):
    # Use the airbyte_resource to trigger the Airbyte sync
    # Replace 'your_airbyte_connection_id' with the actual connection ID
    context.resources.airbyte.sync_connection('your_airbyte_connection_id')

Scout

02/21/2024, 2:19 PM

The

airbyte_sync_op

you've defined is an operation that uses the

airbyte_resource

to trigger a sync in Airbyte. When you call

context.resources.airbyte.sync_connection('your_airbyte_connection_id')

, you are instructing Airbyte to start a sync job for the connection with the specified ID. Airbyte itself determines which tables to load based on the configuration of the connection you've specified. Each connection in Airbyte is configured to sync data from a source to a destination, and this configuration includes details about which tables or streams to sync. When you trigger a sync using the connection ID, Airbyte looks up the connection configuration and performs the sync accordingly. The

airbyte_resource

in Dagster is a representation of your Airbyte instance, and it provides methods to interact with the Airbyte API. The

sync_connection

method specifically triggers the sync process in Airbyte for the given connection ID. In Dagster, if you want to represent the tables loaded by Airbyte as assets, you would typically use the

load_assets_from_airbyte_instance

function, which automatically generates asset definitions from your Airbyte project. However, in the code snippet you've provided, you're not using this function; you're simply triggering an Airbyte sync within an op. If you want to create explicit asset definitions in Dagster for the tables that Airbyte loads, you would need to use either

load_assets_from_airbyte_instance

build_airbyte_assets

to define these assets. These functions will create software-defined assets in Dagster that correspond to the data streams synced by your Airbyte connections. Here's an example of how you might use

build_airbyte_assets

to create assets for specific tables:

Copy code

python from dagster_airbyte import build_airbyte_assets, airbyte_resource  # Configure the Airbyte resource airbyte_instance = airbyte_resource.configured({     "host": "localhost",     "port": "8000",     # ... other configuration options })  # Build Airbyte assets for a specific connection airbyte_assets = build_airbyte_assets(     connection_id="your_airbyte_connection_id",     destination_tables=["table1", "table2", "table3"],     # ... other parameters )  # Add the Airbyte resource to the assets airbyte_assets_with_resource = with_resources(     airbyte_assets,     {"airbyte": airbyte_instance}, )

In this example,

build_airbyte_assets

is used to create assets for the specified tables from a particular Airbyte connection. The

with_resources

function is then used to associate the

airbyte_instance

resource with these assets. This way, the lineage and dependencies between Airbyte syncs and downstream dbt models can be explicitly defined and visualized in Dagster.

Jérôme MONTIGNY

02/21/2024, 2:37 PM

@Scout how do I get the connection id then ?

Scout

02/21/2024, 2:37 PM

To obtain the connection ID for an Airbyte connection, you typically need to access the Airbyte UI and find the connection ID in the URL of the connection page. The connection ID is a unique identifier for each connection in Airbyte. Here's how you can find the connection ID according to the documentation: 1. Access the Airbyte UI: Log in to your Airbyte instance through the web interface. 2. Navigate to the Connections Page: Go to the connections section where you can see a list of all your configured connections. 3. Open the Connection Details: Click on the connection for which you want to get the connection ID. 4. Find the Connection ID in the URL: Look at the URL in your web browser's address bar. The connection ID is the part of the URL that comes after

/connections/

and before the next slash. For example, if the URL is

<https://airbyte.example.com/connections/12345678-90ab-cdef-1234-567890abcdef/status>

, then the connection ID is

12345678-90ab-cdef-1234-567890abcdef

. Here's a visual representation from the documentation: ![Screenshot of the Airbyte UI in a browser, showing the connection ID in the URL.](/_next/image?url=%2Fimages%2Fintegrations%2Fairbyte%2Fairbyte_cloud_connection_ui.png&w=3840&q=75) Once you have the connection ID, you can use it in your Dagster code to define Airbyte assets or to trigger syncs using the

airbyte_resource

. If you are using Airbyte Cloud, the process is the same; you would access the Airbyte Cloud UI to find the connection ID. If you are using a local or self-hosted instance of Airbyte, you would access the UI through the appropriate URL for your instance. Remember to handle the connection ID securely, as it is a sensitive piece of information that grants access to the configuration and data of that connection.

Jérôme MONTIGNY

02/21/2024, 2:39 PM

@Scout can't I just create an op that loads the airbyte data based on a load_assets_from_airbyte_instance function ?

Scout

02/21/2024, 2:40 PM

Yes, you can create an op in Dagster that loads data from Airbyte using the

load_assets_from_airbyte_instance

function. This function retrieves all of the connections you have defined in the Airbyte interface, creating software-defined assets for each data stream. Each connection has an associated op which triggers a sync of that connection. Here's a simplified example of how you might define such an op:

Copy code

python from dagster import op from dagster_airbyte import airbyte_resource, load_assets_from_airbyte_instance  # Configure the Airbyte resource airbyte_instance = airbyte_resource.configured({     "host": "localhost",     "port": "8000",     # ... other configuration options })  # Load assets from the Airbyte instance airbyte_assets = load_assets_from_airbyte_instance(     airbyte_instance,     # Optionally, you can use a connection_filter to load a subset of connections     connection_filter=lambda meta: "desired_connection_name" in meta.name, )  # Define an op to load data from Airbyte @op(required_resource_keys={"airbyte"}) def airbyte_sync_op(context):     # This op would trigger the sync for all connections loaded as assets     for asset in airbyte_assets:         asset.materialize(context)  # Define a job that includes the Airbyte sync op @job(resource_defs={"airbyte": airbyte_instance}) def airbyte_load_job():     airbyte_sync_op()

In this example,

airbyte_sync_op

is an op that triggers the sync for all connections loaded as assets. The

airbyte_load_job

job includes this op and is configured with the

airbyte_instance

resource. When you execute

airbyte_load_job

, it will run the

airbyte_sync_op

, which in turn will materialize each Airbyte asset, effectively triggering the Airbyte syncs for the connections that have been loaded as assets. Please note that the actual implementation details may vary depending on the specifics of your Airbyte configuration. You may need to customize the resource definitions, connection filters, and other parameters to match your setup. Additionally, the

load_assets_from_airbyte_instance

function may have other parameters that you can use to further customize the behavior, such as specifying a

key_prefix

or a

connection_to_asset_key_fn

to map connections to asset keys.

Jérôme MONTIGNY

02/21/2024, 2:44 PM

@Scout you write @op(required_resource_keys={"airbyte"}), but what is the airbyte resource in your last message ?

Scout

02/21/2024, 2:44 PM

The

airbyte

resource in the

@op(required_resource_keys={"airbyte"})

decorator refers to an instance of

AirbyteResource

that is configured to interact with an Airbyte server. This resource provides the necessary functionality to communicate with the Airbyte API, such as triggering syncs or checking the status of connections. In the example provided, the

airbyte_instance

is created using the

airbyte_resource.configured

method, which sets up the resource with the appropriate connection details to the Airbyte server. Here's the relevant part of the example that defines the `airbyte_instance`:

Copy code

python from dagster_airbyte import airbyte_resource  # Configure the Airbyte resource airbyte_instance = airbyte_resource.configured({     "host": "localhost",     "port": "8000",     # ... other configuration options })

This

airbyte_instance

is then passed to the

resource_defs

argument when defining the job, associating the resource with the key `"airbyte"`:

Copy code

python from dagster import job  # Define a job that includes the Airbyte sync op @job(resource_defs={"airbyte": airbyte_instance}) def airbyte_load_job():     airbyte_sync_op()

When the

airbyte_sync_op

is executed within the

airbyte_load_job

, it has access to the

airbyte

resource, which it can use to perform operations against the Airbyte API, such as triggering a sync. The

required_resource_keys

parameter in the

@op

decorator specifies that the op requires a resource with the key

"airbyte"

to be provided in the job's resource definitions. Please ensure that the host and port values match the actual configuration of your Airbyte instance. If you're using Airbyte Cloud or a different host, you'll need to adjust these values accordingly.

Jérôme MONTIGNY

02/21/2024, 3:28 PM

@Scout I have issues running dbt commands with the dbt.cli function : how can I make it so that I don't have to give it the dbt project location as argument to the dbt command ?

Scout

02/21/2024, 3:28 PM

If you're encountering issues running dbt commands with the

dbt.cli

function and you want to avoid having to specify the dbt project location as an argument each time, you can configure the

DbtCliResource

with the project directory and other relevant settings. This way, the resource will use the configured settings by default when you invoke dbt commands. Here's how you can configure the `DbtCliResource`: 1. Configure the
DbtCliResource
: When you define the

DbtCliResource

, you can provide the

project_dir

(and other dbt settings if necessary) in the configuration. This tells the resource where to find the dbt project. 2. Use the
DbtCliResource
in Your Op: In your op, you can then use

context.resources.dbt.cli

to run dbt commands without needing to specify the project location each time. Here's an example of how to set this up:

Copy code

python from dagster import job, op, resource from dagster_dbt import dbt_cli_resource  # Configure the DbtCliResource with the dbt project directory @resource(config_schema={"project_dir": str}) def my_dbt_cli_resource(context):     return dbt_cli_resource.configured({         "project_dir": context.resource_config["project_dir"],         # You can include other dbt settings here as needed     })  # Define an op that uses the dbt_cli_resource @op(required_resource_keys={"dbt"}) def run_dbt_command(context):     # Run a dbt command without specifying the project location     context.resources.dbt.cli(["run"])  # Define a job that includes the op and the configured dbt_cli_resource @job(resource_defs={"dbt": my_dbt_cli_resource}) def my_dbt_job():     run_dbt_command()  # Example run configuration for the job run_config = {     "resources": {         "dbt": {             "config": {                 "project_dir": "/path/to/your/dbt/project",             }         }     } }  # Execute the job with the run configuration if __name__ == "__main__":     result = my_dbt_job.execute_in_process(run_config=run_config)

In this example, the

my_dbt_cli_resource

is a resource definition that wraps the

dbt_cli_resource

and is configured with the dbt project directory. The

run_dbt_command

op uses this resource to run dbt commands, and it no longer needs to specify the project location because it's already configured in the resource. When you execute the job, you provide the run configuration with the path to your dbt project. This configuration is used by the

my_dbt_cli_resource

to set up the dbt CLI environment correctly. Please replace

"/path/to/your/dbt/project"

with the actual path to your dbt project directory. This setup should resolve the issue of having to specify the dbt project location as an argument to the dbt command.

Jérôme MONTIGNY

02/21/2024, 3:44 PM

@Scout I am still getting this : TypeError: 'AirbyteInstanceCacheableAssetsDefinition' object is not iterable How do I materialize assets from a load_assets_from_airbyte_instance function ?

Scout

02/21/2024, 3:45 PM

The

load_assets_from_airbyte_instance

function is designed to create a set of Dagster assets from a configured Airbyte instance. These assets represent the data streams synced by your Airbyte connections. When you call this function, it returns an

AirbyteInstanceCacheableAssetsDefinition

object, which is not directly iterable because it represents the collection of assets as a whole. To materialize these assets, you don't need to iterate over them manually. Instead, you should include the

AirbyteInstanceCacheableAssetsDefinition

in your repository, and Dagster will handle the materialization of these assets when you run your job. Here's how you can include the

AirbyteInstanceCacheableAssetsDefinition

in your repository and define a job that materializes the Airbyte assets:

Copy code

python from dagster import repository, job from dagster_airbyte import airbyte_resource, load_assets_from_airbyte_instance  # Configure the Airbyte resource airbyte_instance = airbyte_resource.configured({     "host": "localhost",     "port": "8000",     # ... other configuration options })  # Load assets from the Airbyte instance airbyte_assets = load_assets_from_airbyte_instance(     airbyte_instance,     # Optionally, you can use a connection_filter to load a subset of connections     connection_filter=lambda meta: "desired_connection_name" in meta.name, )  # Define a job that materializes the Airbyte assets @job(resource_defs={"airbyte": airbyte_instance}) def airbyte_sync_job():     # The job body can be empty; the assets will be materialized by Dagster     pass  # Include the Airbyte assets in your repository @repository def my_repository():     return [airbyte_assets, airbyte_sync_job]

In this example,

airbyte_sync_job

is a job that, when executed, will materialize the Airbyte assets defined by

airbyte_assets

. The

my_repository

function returns both the

airbyte_assets

and the

airbyte_sync_job

, making them available for execution. When you run

airbyte_sync_job

, Dagster will automatically materialize all the assets defined by

airbyte_assets

. There's no need to manually trigger materialization for each asset because Dagster's execution engine understands the asset dependencies and will materialize them as needed. Please ensure that the host and port values match the actual configuration of your Airbyte instance. If you're using Airbyte Cloud or a different host, you'll need to adjust these values accordingly. The

connection_filter

is optional and allows you to specify which connections you want to include as assets based on the connection metadata.

38 Views

Open in Slack

Previous Next