I have a question about a specific use-case, and w...
# ask-community
t
I have a question about a specific use case and the best way to approach it. I currently have two assets in my pipeline:
1. An HTML page, requested from a site, that contains links.
2. A CSV that contains the links scraped from the HTML page.
These two assets will change, i.e. the HTML page will be updated with new links, and therefore the CSV will change as well.
Now for the next part. Each link in the CSV file points to a PDF/webpage. I want to read this CSV and request each PDF. However, once a PDF/webpage has been requested, I never need to request it again. My intuition is that I would use an op here instead of an asset, because I never need to refresh this piece of data. Or is there a way to define an asset that doesn't need to be refreshed? Could I just bake a file check into the asset definition so that nothing happens when it materializes? I will need to aggregate these pages into one big table, which will definitely need to be an asset.
q
You can use a sensor that watches for the successful completion of the CSV asset and then triggers a job that handles the PDFs. You can pass the list of links to the job as config, through the run request sent by the sensor.
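To make the "links as config" idea concrete, here is a minimal sketch of the payload a sensor might attach to its run request. It uses only the standard library. The column name `url`, the op name `fetch_pdfs`, and both helper functions are hypothetical, chosen for illustration; only the nested `ops` / `config` layout follows Dagster's run config shape.

```python
import csv
from pathlib import Path


def read_links(csv_path: Path, column: str = "url") -> list[str]:
    """Return the non-empty values of `column`, in file order, without duplicates."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    with csv_path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            link = (row.get(column) or "").strip()
            if link:
                seen.setdefault(link, None)
    return list(seen)


def build_run_config(links: list[str], op_name: str = "fetch_pdfs") -> dict:
    """Nest the links under the (hypothetical) op's config section."""
    return {"ops": {op_name: {"config": {"links": links}}}}
```

In a sensor, the returned dictionary would be the value passed as `run_config` on the `RunRequest`. For a long list of links, one run request per link (or per batch) may be easier to retry than a single large config.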
c
Hi Travis. Yes, I think Qwame's approach above would work well here. With Qwame's approach, you could have a `pdfs_asset` downstream of the CSV asset that is partitioned with dynamic partitions. Then, in your sensor that evaluates the successful completion of the CSV asset, you could:
1. Evaluate all of the PDFs, creating dynamic partitions for the PDFs that don't exist yet.
2. Yield a partitioned run request for each new PDF.
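The two sensor steps above come down to a set difference between the links in the CSV and the partition keys that already exist. A minimal sketch of that bookkeeping in plain Python follows. The function names are hypothetical, and hashing each URL into a short key is an assumption made here to get stable, filesystem-friendly keys, not something Dagster requires.

```python
import hashlib


def partition_key_for(url: str) -> str:
    """Derive a short, stable key from a URL (hashing is an assumption of this sketch)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def plan_new_partitions(links: list[str], existing_keys: set[str]) -> dict[str, str]:
    """Map each not-yet-registered partition key to its URL, preserving link order."""
    planned: dict[str, str] = {}
    for url in links:
        key = partition_key_for(url)
        # Skip PDFs that already have a partition, and repeats within this CSV.
        if key not in existing_keys and key not in planned:
            planned[key] = url
    return planned
```

Inside the sensor, `existing_keys` would be the keys already registered for the dynamic partitions definition. The keys of the returned mapping are the ones to register, with one `RunRequest(partition_key=...)` yielded per key.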
This would ensure that you won't automatically kick off duplicate requests per PDF. But if you wanted to be sure that you don't re-request PDFs, e.g. via the manual "materialization" button, you could add a file check in your `pdfs_asset` to exit early if the PDF already exists. In terms of using ops versus assets, I think with both it's possible to re-request PDFs accidentally, i.e. by kicking off additional runs. But with assets you'll get additional observability, like knowing whether you've already requested a given PDF, whereas with ops you can't see when or if it has been run for a given PDF. So I'd advocate for using assets instead, and baking in the file check if the sensor approach doesn't offer you a full guarantee.
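The file check itself can be very small. Here is a minimal sketch, assuming each PDF is stored at a known local path. `fetch_once` and its injected `fetch` callable are hypothetical names; in practice `fetch` would wrap whatever HTTP client you use.

```python
from pathlib import Path
from typing import Callable


def fetch_once(url: str, dest: Path, fetch: Callable[[str], bytes]) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a request was made, False if the file was already there.
    """
    if dest.exists():
        return False  # early exit: this PDF was requested in a previous run

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a failed download never leaves
    # a file behind that looks like a completed PDF.
    partial = dest.with_name(dest.name + ".part")
    partial.write_bytes(fetch(url))
    partial.replace(dest)
    return True
```

Calling this at the top of the asset body means a manual re-materialization of an existing partition becomes a no-op rather than a second request.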
t
Awesome! Thank you for the response. This looks like exactly what I was looking for. I will give it a shot.
Can you help me understand how to access the asset that the sensor depends on? For instance, I want to access the CSV file in the sensor so that I can pass the links to the job via the RunRequest.
c
Hi Travis. You could load the asset value using the `Definitions.load_asset_value` function: https://docs.dagster.io/concepts/assets/software-defined-assets#loading-asset-values-outside-of-dagster-runs
If loading the asset ends up being an expensive operation, you could also add metadata (e.g. the new file paths) to the CSV asset's materialization, then load the attached metadata within your sensor.