# ask-community
Hello! Very new to dagster, and I'm wondering if this is something possible/supported or if I need to rethink my process. We have a dataframe of ids (currently instantiated as an asset) that I would like to run a process on to generate a full dataset, which we would then like to score against a trained model. Is it possible to dynamically generate the assets to do this, or should I be looking at using some form of ops, jobs and graphs to accomplish this? Also, are there any best practices for scoring large data sets like this (dask, spark cluster, some other tool)? Thank you!
Hi @Danny Steffy, using assets for this seems reasonable to me. From what I understand, you would have one asset that is the dataset of ids. Then you would have a second asset that takes the first asset as input and generates the full dataset. For the scoring portion, you could compute the score in a third asset and have that score be the data asset. A couple of follow-up questions: I assume you’d want to repeat this process a lot to score multiple datasets, correct? Do you want to keep all of the generated datasets and scores for all of the different datasets?
Right, that's how I currently have it, with the second asset taking only a single row from the first asset. Part of the problem is - how do I duplicate that? The problem is generating the full dataset: pulling that down is about 13-20k rows per id. We have 80k ids that we want to score, so that means about 1 billion rows materialized and persisted. We want to keep the scores (we need to load them back to our main application database), but we don't care about the generated datasets.
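For reference, the row-count estimate works out as follows, using only the figures quoted above (80k ids, 13-20k rows per id):

```python
# Figures quoted in the thread.
num_ids = 80_000
rows_per_id_low = 13_000
rows_per_id_high = 20_000

# Total rows that would be generated across all ids.
total_rows_low = num_ids * rows_per_id_low
total_rows_high = num_ids * rows_per_id_high

print(f"{total_rows_low:,} to {total_rows_high:,} rows")
```

So the intermediate data lands somewhere between roughly 1.0 and 1.6 billion rows, which is why persisting it is worth avoiding.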
ok, this is definitely a complicated use case. one option is to write a “generic” asset and wrap it in a function that can be called a bunch of times to create all of the downstream assets. kinda like this op factory pattern here but for assets
I'm fairly new to python, so I'm not sure I follow this. So I would have a my_asset_factory method that would generate the assets I would need for each id to get the data to score and then score that data? what exactly is the ins and **kwargs?
no worries! factories are definitely one of the more python-y topics. basically you write a function that returns an asset and if there is anything in the asset decorator or the body of the asset function that needs to change for each asset, you pass that as a parameter to the factory function. In the example in the docs, the ins parameter allows you to change the names of the upstream assets. If all of the assets you create via the factory function will have the same upstream dependencies, then you don’t need to include ins. **kwargs is a special python syntax that allows you to pass arbitrary keyword arguments and treat them like a dictionary within the function body. In this case we pass all of the kwargs to the
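A small plain-Python illustration of how **kwargs collects and forwards keyword arguments. The function names and the keyword arguments passed in are hypothetical and have nothing to do with dagster itself.

```python
def describe(name, **kwargs):
    # Inside the function, kwargs is an ordinary dict of the extra keyword arguments.
    return f"{name}: {sorted(kwargs.items())}"


def factory(name, **kwargs):
    # The ** in the call unpacks the dict back into keyword arguments,
    # so everything the caller passed is forwarded unchanged.
    return describe(name, **kwargs)


summary = factory("scores", group_name="ml", description="model scores")
print(summary)
```

The factory never has to know which extra options exist; it only hands them on to whatever it wraps.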
Thank you for the help!
just to come back to this... would it make more sense to have this be an op rather than an asset if we don't want to materialize every data set we want to score, we just want to score the data and get back the scores as an asset?
we're also looking at distributing this work in large batched chunks instead of having workers do this a single id at a time... but I'm not sure what would be the best way to do that, some sort of partitioned op?
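One way to picture the batching idea in plain Python, assuming hypothetical generate_rows and score_batch helpers: the ids are split into fixed-size chunks, the rows for a chunk exist only while that chunk is being scored, and only the scores are kept. Each chunk could then map onto whatever unit of work gets distributed (a partition, a task on a cluster, and so on).

```python
from itertools import islice


def chunked(ids, size):
    """Yield successive lists of at most `size` ids."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def generate_rows(id_):
    # Hypothetical stand-in for pulling the 13-20k rows that belong to one id.
    return [float(id_ + n) for n in range(3)]


def score_batch(batch):
    # The generated rows are discarded as soon as each id is scored;
    # only the per-id score is returned.
    return {id_: sum(generate_rows(id_)) for id_ in batch}


all_scores = {}
for batch in chunked(range(1, 8), size=3):
    all_scores.update(score_batch(batch))
```

With 80k ids and a batch size in the thousands, this gives a few dozen large units of work instead of 80k tiny ones.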
yeah that’s definitely an option too! however, right now you can’t have an asset that depends on an op so your scoring would have to be in an op as well. you can manually tell dagster to consider the output of an op as if it’s an asset, but you don’t get all of the features around assets (like the global asset graph)
Ah I see. Would this still work in a set of assets then, with the scoring data only being materialized on a worker as needed to generate the scoring, or would we need to lose those asset features in order for our use-case to work?
Is there some other industry standard of using dagster to score millions to billions of rows that you could potentially point me to?
I'd generally recommend Dask or Spark for that
this also might be relevant to your use case? https://github.com/dagster-io/dagster/discussions/12061
One idea is to have the table of IDs as a configured resource which is passed to the rest of the assets, i.e. use a config schema on the resource to say which row of the id table to use; the resource then provides the id to all of the assets or ops you want. Then you could have a job which does the same calculation with multiple configurations.