hey friends, just writing a message to pin here to capture any potential interest. ongoing work is happening with our dagster<>airflow operators and we’d love to get feedback from anyone that is running Airflow while moving to Dagster. please DM me if you’re interested in learning more
12/08/2022, 8:41 AM
Curious if there is any plan to make it easier to track Airflow tasks as upstream dependencies for our Dagster asset jobs.
We have our central data infra team running Airflow to orchestrate all the ETL workloads for the upstream tables. However, our use-case-specific teams (like ML and Experimentation) run dynamic assets on Dagster. Most of these assets depend on those upstream tables managed by Airflow, and we are running a sensor for each asset that checks the status of the upstream Airflow task against the Airflow metadata DB. This approach has been quite error-prone and places a huge burden on the individual teams.
12/08/2022, 10:09 PM
@Arun Kumar how would you want it to work? and what kinds of errors are happening bc of the sensor based approach? cc: @sandy
12/15/2022, 11:50 AM
Hey @Dagster Jarred, I don't have a clear solution, but my initial thought was to build observable source assets for the different Airflow upstream tasks (or for datasets derived from upstream Airflow tasks) and keep them up to date by tracking any updates on those tasks. This could be done with an observation function that fetches the task's metadata from the Airflow metadata DB (or REST API) and maybe uses the end time of the task as a LogicalVersion. I see some advantages with this approach:
• All the Airflow upstream dependencies are automatically tracked in the asset lineage graph via their corresponding source assets
• The different teams building Dagster jobs with Airflow dependencies don't have to come up with their own sensor logic or the initial setup for the metadata DB connection. They can directly build downstream assets on top of this source asset, which saves a lot of effort
• Reduced load on the Airflow metadata DB, since the individual sensors would no longer poll it. Only a central routine polls it and keeps the status up to date within the asset ecosystem.
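Rough sketch of what I mean, just to make it concrete. The version-derivation part is runnable; the Dagster part is in comments because `fetch_latest_successful_dag_run` and the DAG name are made up, though `observable_source_asset` / `LogicalVersion` are the actual (experimental) names in current Dagster:

```python
from datetime import datetime, timezone

def version_from_end_time(end_time_iso: str) -> str:
    """Derive a stable logical-version string from an Airflow task's end time.

    Normalizing to UTC epoch seconds means the version only changes when the
    upstream task actually re-runs, which is exactly when downstream Dagster
    assets should be considered stale.
    """
    # fromisoformat() doesn't accept a trailing "Z" on older Pythons
    dt = datetime.fromisoformat(end_time_iso.replace("Z", "+00:00"))
    return str(int(dt.astimezone(timezone.utc).timestamp()))

# The observation function itself would then look roughly like this
# (fetch_latest_successful_dag_run is a hypothetical helper that queries
# the Airflow metadata DB or REST API; "etl_orders" is a made-up DAG id):
#
# from dagster import observable_source_asset, LogicalVersion
#
# @observable_source_asset
# def upstream_orders_table(context):
#     run = fetch_latest_successful_dag_run("etl_orders")
#     return LogicalVersion(version_from_end_time(run["end_date"]))
```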
Let me know if this makes sense