I have a question about a specific use case and what the best way to approach this would be | dagster #ask-community

Travis DePriest

04/14/2023, 3:52 PM

I have a question about a specific use case, and what the best way to approach it would be. I currently have two assets in my pipeline: 1. An HTML page that is requested from a site that contains links. 2. A CSV that contains the links scraped from the HTML page. These two assets will change, i.e. the HTML page will be updated with new links and therefore the CSV will change as well. Now for the next part. Each link in the CSV file points to a PDF/webpage. I want to read this CSV and request each PDF. However, once a PDF/webpage has been requested, I never need to request it again. My intuition is that I would use an op here instead of an asset because I don't need to ever refresh this piece of data. Or is there a way to define an asset that doesn't need to be refreshed? Could I just bake a file check into the asset definition so that when it materializes nothing happens? I will need to aggregate these pages into one big table, which will definitely need to be an asset.

Qwame

04/14/2023, 4:43 PM

You can use a sensor that evaluates the successful completion of the csv asset and then triggers a job that handles the pdfs. You can pass the list of links as a config to the job through the run request sent by the sensor

claire

04/14/2023, 5:45 PM

Hi Travis. Yes, I think Qwame's approach above would work well here. With Qwame's approach, you could have a `pdfs_asset` downstream of the csv asset that is partitioned with dynamic partitions. Then, in your sensor that evaluates the successful completion of the csv asset, you could: 1. Evaluate all of the pdfs, creating dynamic partitions for the pdfs that don't exist 2. Yield a partitioned run request for each new pdf

claire

04/14/2023, 5:50 PM

This would ensure that you won't automatically kick off duplicate requests per PDF. But if you wanted to be sure that you don't re-request PDFs e.g. via the manual "materialization" button, you could add a file check in your `pdfs_asset` to early exit if the PDF already exists. In terms of using ops versus assets, I think with both it's possible to re-request PDFs accidentally, i.e. by kicking off additional runs. But with assets you'll get additional observability like knowing if you've already requested one PDF, whereas with ops you can't see when/if it's been run for a given PDF. So I'd advocate for using assets instead, and baking in the file check if the sensor approach doesn't offer you a full guarantee.
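
As a rough illustration of the file check, the sketch below uses only the Python standard library. The function names, the hashed file naming scheme, and the injected `download` callable are assumptions made for the example; inside `pdfs_asset` one might call `fetch_once(context.partition_key, ...)` with a real HTTP client.

```python
import hashlib
from pathlib import Path
from typing import Callable


def pdf_path_for(link: str, out_dir: Path) -> Path:
    """Map a link to a stable local file name (assumed naming scheme)."""
    digest = hashlib.sha256(link.encode("utf-8")).hexdigest()[:16]
    return out_dir / f"{digest}.pdf"


def fetch_once(link: str, out_dir: Path, download: Callable[[str], bytes]) -> bool:
    """Download `link` unless its file already exists.

    Returns True if a request was made, False if the early exit was taken.
    """
    target = pdf_path_for(link, out_dir)
    if target.exists():
        # Early exit: this PDF was already requested in a previous run.
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    target.write_bytes(download(link))
    return True
```

With this guard in place, re-materializing a partition by hand becomes a no-op for PDFs that are already on disk.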

Travis DePriest

04/14/2023, 6:28 PM

Awesome! Thank you for the response. This looks like exactly what I was looking for. I will give it a shot.

Travis DePriest

04/14/2023, 10:43 PM

Can you help me understand how to access the asset that the sensor depends on? For instance, I want to access the csv file in the sensor so that I can pass the links to the job via the RunRequest.

claire

04/18/2023, 9:00 PM

Hi Travis. You could load the asset value using the `Definitions.load_asset_value` method: https://docs.dagster.io/concepts/assets/software-defined-assets#loading-asset-values-outside-of-dagster-runs

If loading the asset ends up being an expensive operation, you could also add metadata (e.g. the new file paths) to the csv asset's materialization, then within your sensor load the attached metadata.
