https://dagster.io/ logo
#ask-ai
Title
# ask-ai
p

Philip Shurpik

02/14/2024, 6:15 AM
Hi! I'm trying to build pipeline to orchestrate web search, scrapping and calls to LLMs using Dagster 🙂 If simplify this process - it should look like that: 1. We receive new request with client data and list of attributes 2. For each attribute we use google search, scrap data, then execute llm calls for each web page 3. Combine all relevant (based on llm) search outputs for each attribute 4. Run final request to llm using llm output from attributes from previous step 5. Write report as file to storage Rerun this pipeline daily, and if there some new relevant search results were found - run final request to llm and generate new report So this pipeline looks like DAG 🙂 And dagster looks like good framework to implement it Looks natural, that each new client - it's new partition in assets. For simplicity I implemented sensor that is looking for new client files and
RunRequest(partition_key=client_name)
processing when new files are found. But in assets it adds new partition, but in missing state - not materialize it. My code is fully based on: https://github.com/dagster-io/dagster/blob/master/examples/docs_snippets/docs_snip[…]cepts/partitions_schedules_sensors/dynamic_partitioned_asset.py So have two questions: 1) Is dagster good fit for this type of pipelines? Or maybe I missed smth and should look for another solutions? :) 2) How to setup pipeline that new partition will not be added in missing state, but automatically will be processed / materialized ?
s

Scout

02/14/2024, 6:15 AM
Hello! It's great to hear that you're considering Dagster for orchestrating your data pipeline. Let's address your questions one by one: 1) Is Dagster a good fit for this type of pipeline? Dagster is designed to be a flexible and powerful framework for building and orchestrating data pipelines, and it can handle complex workflows like the one you've described. The pipeline you've outlined involves a series of dependent steps, which is a common use case for Dagster. The framework allows you to define assets, which can represent the outputs of different stages in your pipeline, and it supports dynamic orchestration, where the work to be orchestrated is determined at runtime, dependent on data that's observed as part of pipeline execution. This dynamic orchestration is particularly useful for cases where you don't know in advance how many inputs you're going to have to process, which seems to align with your use case where each new client represents a new partition. 2) How to set up a pipeline so that new partitions are not added in a missing state but are automatically processed/materialized? Dagster provides the concept of partitions for assets, which can represent slices of your data, such as a set of transactions for each day. In your case, each client could be a partition. To automatically process and materialize these partitions, you can use sensors to detect new data and trigger pipeline runs for specific partitions. Additionally, Dagster has experimental features like
DynamicPartitionsDefinition
for cases where you don't know the set of partitions ahead of time, and
DynamicOutput
for dynamic orchestration within your pipeline. To ensure that new partitions are automatically processed and materialized, you can use a combination of sensors and partitioned assets. The sensor can be configured to trigger a pipeline run when new client data is detected, and within your pipeline, you can define assets with partitioning logic that processes the data for that specific client. Regarding the issue of partitions being added in a missing state, you might want to look into the
DynamicPartitionsDefinition
feature, which allows for the dynamic addition of partitions based on runtime data. This feature is experimental, so you should be aware that some APIs may change in the future. If you're following an example similar to the one in the Dagster repository and encountering issues with partitions being in a missing state, it could be related to how the sensor and partitioned assets are set up. It's important to ensure that the sensor correctly triggers the pipeline run for the new partition and that the pipeline logic is set up to materialize the asset for that partition. In summary, Dagster is well-suited for the type of pipeline you're building,
and with the right configuration, you should be able to automatically process and materialize new partitions as they are detected. If you're still having trouble with partitions being in a missing state, you may want to review your sensor and asset definitions to ensure they are correctly configured to trigger and process new partitions.
3 Views