Hi. My use case is to use Dagster to generate a list of assets for each dataset in the platform we develop at Hugging Face (https://huggingface.co/datasets): list of ML “splits”, list of the first 100 rows, list of parquet files converted from the original dataset, list of column types, etc. The datasets are versioned with git, and we want to generate these assets for each commit of each dataset.
I’m wondering if I should use the “software-defined assets” paradigm for that, or rely on ops/graphs/jobs.
This actually sounds more like an ETL problem, where you're trying to build a pipeline that fetches data from some Hugging Face API and writes it into a table, but only fetches the latest updates each time? Is that correct?
01/04/2023, 5:05 PM
Hmmm yes, I think so: for each webhook sent by Hugging Face, I compute and store all the assets for the new version of the dataset, then serve them through my API.
As the computation can take a lot of time, I want to run the operations asynchronously, ideally in an isolated environment, and limit the dedicated resources (RAM, CPU) for each dataset.
Do you think Dagster is the wrong tool for that problem?
01/04/2023, 5:13 PM
Not at all! It's just a matter of mapping the Dagster language onto the problem. For example, you might frame this problem as a single "asset" that:
• every hour, fetches a list of all webhooks sent
• for each webhook, launches a subprocess to handle the different task pieces (fetch metadata, sample rows, etc.)
• writes the results of those processes into some table / storage
• logs metadata around the work done
That is one way of conceptualizing the job to be done. Whether you put it in multiple “ops”, a single asset, or multiple assets is a conceptual decision.
01/04/2023, 5:23 PM
OK, thanks. I wasn't thinking of doing this; instead, I launch the jobs as soon as possible when a webhook is received (it currently works this way, with a queue and a fixed number of workers that process the queued jobs). We want to show the dataset viewer as soon as possible, and for small datasets it only takes minutes.
But maybe switching to scheduled runs with a small interval (one hour, or less if possible?) would make everything simpler? I’ll read more on partitioned assets.
In my mind, every generated file, for every dataset (and every commit of that dataset), is an “asset”. But I’m not sure whether that fits the Dagster model. I was able to materialize assets through ops, and I think it works, but I’m wondering whether software-defined assets would be feasible too.
01/04/2023, 6:01 PM
this table has a nice overview of how to think about it: https://docs.dagster.io/guides/dagster/enriching-with-software-defined-assets#when-should-i-use-software-defined-assets
In practice, I tend to think of one asset as a table or a discrete “dataset”. While you can dynamically generate asset metadata through ops, the recommended path is to define all your permutations in code.
If you're looking for a streaming/continuous use case, Dagster may not be the best fit, although you certainly can make it work. You could do small partitions (say 5 minutes), and that lets you take advantage of backfill capabilities.
In my mind, the value of assets scales with the number of processes you're managing through Dagster, so it really pays off when you have multiple separate workloads that depend on each other (an ingestion pipeline, a transform layer, and a serving layer). For point solutions, using ops / jobs might make a ton of sense!
01/06/2023, 10:44 AM
I have been able to understand a bit better how to model my data with Dagster using software-defined assets, thanks to @sandy.
I’ll write updates on how it goes in #dagster-showcase