# ask-community
Hey all. Been ramping up on Dagster this past week for a brand-new project. Hoping someone here can help me map my requirements to the Dagster conceptual model a bit.

This project requires collecting a bunch of source data on regular-ish intervals, and preserving each observation and its collection timestamp for later analysis. For the sake of example, I might want to occasionally scrape a webpage and later be able to report on changes to that page from one year to the next. My needs are roughly analogous to BuiltWith, for those who are familiar with it.

I'm trying to figure out the best way to model this in terms of software-defined assets. The naive approach of one asset per source doesn't seem to work, because I need to preserve historical observations rather than overwrite them. I'm also unsure how to manage the observation timestamp info. Storing it in metadata seems insufficient, since the data needs to be available downstream for batch processing, so ideally it should live in S3 rather than have batch jobs query Dagster's run database directly. Making things even more fun, I have some rate limits to work around.

My current thinking is that I'll need some external resource for managing parts of my pipeline state, and probably a custom IO manager to translate back and forth between observation timestamps and S3 file paths. I would then have sensors that inspect the external state-tracking resource and submit run requests based on what they see there. That said, this approach doesn't feel very functional, and seems like it would make very light use of the "asset" construct. So I'm wondering if there is a more Dagster-ish way of accomplishing the same goal here. Any thoughts?
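For what it's worth, the timestamp↔path translation you describe doesn't have to live inside Dagster at all; it can be a plain pair of functions that a custom IO manager (or the op logic) calls. A minimal sketch of that idea, where the key layout and the function names `observation_key` / `observed_at_from_key` are made up for illustration:

```python
from datetime import datetime, timezone

def observation_key(source: str, observed_at: datetime) -> str:
    """Map a source name and observation timestamp to an S3 object key.

    Hypothetical layout: one object per observation, so each scrape is
    appended to history rather than overwriting the previous one.
    """
    ts = observed_at.strftime("%Y/%m/%d/%H%M%S")
    return f"observations/{source}/{ts}.json"

def observed_at_from_key(key: str) -> datetime:
    """Recover the observation timestamp from the key (the reverse mapping)."""
    # key looks like: observations/<source>/YYYY/MM/DD/HHMMSS.json
    parts = key.rsplit("/", 4)
    y, m, d = parts[1], parts[2], parts[3]
    t = parts[4].removesuffix(".json")
    return datetime(int(y), int(m), int(d), int(t[:2]), int(t[2:4]), int(t[4:6]),
                    tzinfo=timezone.utc)
```

Because the mapping is invertible, downstream batch jobs can list a prefix in S3 and reconstruct every observation timestamp without ever touching Dagster's run database.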
Just as a general thought: I think there are two ways to approach "using assets". The first is a "strong assets" approach (I am making up this terminology), where you heavily use the io_manager to store and retrieve data assets. The second is a "weak assets" approach, where you use the assets construct to track metadata about the thing, but handle operations within the op logic itself. In this case you can use
to pass in lineage information, or you can simply save dicts to S3 (or something) and pass them in like below.
```python
from dagster import asset

@asset
def my_downstream_asset(context, last_run_info: dict):
    previous_run_at = last_run_info["created_at"]
```
I haven't implemented any custom IO managers yet, but we are moving toward a heavily assetified world, because the benefits of having that lineage map are high, as is using asset selection for jobs. I bring this up because of your comment:

> I need to preserve historical observations rather than overwrite them

If you thought of your asset output as a table of metadata about previous runs, perhaps you could keep a running state and load/unload it efficiently.
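To make that concrete, the "running state" could be nothing more than an append-only list of rows, one row per observation run. A sketch under that assumption (the field names `created_at` and `s3_key` and both helpers are invented for illustration):

```python
from typing import Optional

def record_run(history: list, created_at: str, s3_key: str) -> list:
    """Append one row of run metadata; prior rows are never overwritten."""
    return history + [{"created_at": created_at, "s3_key": s3_key}]

def latest_run(history: list) -> Optional[dict]:
    """Most recent observation, e.g. for a downstream asset or a sensor cursor."""
    return max(history, key=lambda row: row["created_at"]) if history else None
```

The asset would load this table via its IO manager, append the new row, and store it back; downstream batch jobs then read the `s3_key` values to find the raw observations, so history is preserved even though the asset itself is "one table".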
Thanks @Stephen Bailey! I appreciate the intuition behind "weak vs strong assets". Good to know it's not all or nothing. I imagine there is a healthy middle ground for my use case.