# ask-community
k
i have been learning Dagster for the past few weeks and am trying to figure out the following. i have created some one-off assets and they use resources like mongo, redis, feather, parquet, etc, and this all works great. Next, I am trying to figure out how to create 1000+ assets by:
1. getting a list of 1000 items from a mongo query on a schedule (like once per day). is this a job or an asset or ?
2. looping/queueing them (in parallel) to make 1000 api calls (at a rate limit of X/minute); limited by both the rate limit and/or a worker limit
3. creating an asset for each of the 1000 items. in my case a parquet file is updated, then more jobs are triggered after that.

my questions are:
1. how do i define this? in a job that loops over the queue, or some other asset creation method?
2. do all 1000+ show up as assets in the UI? how do i organize that so it's not a mess? can the UI handle this?
3. where or how do i set a rate limit on an api that i connect to globally, so all other calls are included in this rate limit?
4. when each new asset is materialized, how do i trigger the other jobs i need? each of the 1000 items has to trigger a series of like 5 more jobs/assets.
5. how do i run a job/asset once ALL 1000 are done? what happens if 1 out of 1000 fails?
t
Are the 1000 results from MongoDB static or do they change over time? Assuming they change over time:
1. I'd start with building a graph of ops. This can grow dynamically based on the list you retrieve.
   a. Eventually, you can wrap them up as a multi-asset so you can make the most out of Dagster's observability, but I'd avoid that when just evaluating Dagster.
2. Is there a way that you can partition them? For example, if you're building an ETL pipeline per customer, you can have each ingestion, transformation, and report be an asset, and then have 1000 partitions, 1 per customer. This way you wouldn't have 1000+ assets.
3. You can specify limits with op-level concurrency. let me know if that wouldn't satisfy your needs though!
4. The ease of implementing an answer depends on whether you can partition the data or not. But in general, I'd recommend auto-materialize policies to kick off downstream materializations.
5. Depending on your assets and needs, auto-materialization can manage this. I'm going to acknowledge asset sensors as a more granular alternative to 4 and 5 though.
k
0: mostly static. it's just stock tickers, like AAPL, AMZN, etc. Some might get added or dropped over time, but not often.
1/2: there is no customer. it just updates a series of assets on each stock (like a few images, some analysis files) on every daily update of the initial API call that materializes the main data file daily. i don't think i need partitions. i have a parquet file for each stock and for each set of data. none of them get more than a few thousand lines of data in each file, so that acts as the “partition” or separation.
3: that concurrency doc is helpful.
4/5: i need to be able to auto-materialize and backfill new “tickers” but i haven't got that far yet. i need to figure out how to loop these 1000 assets first and do it in parallel. i can't do it one by one, that's too slow. i need to do like 10 or 20 at a time depending on the concurrency rate limits i set.
+: i created a test sensor. that might be useful to trigger the series of jobs after the initial update from the api call. i'm not doing anything real time, it's mostly just an update once per day, but i'm trying to run it as fast as possible so i can scale out workers.
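One wrinkle worth noting on the "10 or 20 at a time" point: Dagster's concurrency settings cap how many ops run at once, but they don't throttle calls-per-minute against an external API. A common pattern is to wrap the API client in a shared rate limiter and expose it through a resource, alongside an executor `max_concurrent` cap. A hedged sketch of such a limiter (not a Dagster feature; names are made up):

```python
import threading
import time


class RateLimiter:
    """Allow at most `calls` acquisitions per `period` seconds, across threads.

    `clock` and `sleep` are injectable so the limiter can be tested without
    real waiting.
    """

    def __init__(self, calls: int, period: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.calls = calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._timestamps: list[float] = []

    def acquire(self) -> None:
        """Block until a call slot is available inside the sliding window."""
        while True:
            with self._lock:
                now = self._clock()
                # drop timestamps that have aged out of the window
                self._timestamps = [t for t in self._timestamps
                                    if now - t < self.period]
                if len(self._timestamps) < self.calls:
                    self._timestamps.append(now)
                    return
                wait = self.period - (now - self._timestamps[0])
            self._sleep(max(wait, 0.01))
```

In a Dagster project you'd hold one instance of this in a resource so every op hitting the same API shares the window, and separately cap parallelism with the multiprocess executor's `max_concurrent` config.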
t
Thanks for the details! Partitioning looks like it'd help you out even more then! If that's the case, then you only need a couple of assets, partitioned per ticker. The partitioning isn't for splitting the data in storage, but to simplify the mental model and increase flexibility. If you partition, then you can re-run all assets for just one ticker rather than having to re-run everything.
k
i re-read https://docs.dagster.io/concepts/partitions-schedules-sensors/partitions#partitioned-assets-and-jobs and see it's not just for dividing time but for dividing by “any dimension along which you want to be able to materialize and monitor independently”, and it makes sense that in my case this would be the ticker symbol. i figured out how to connect all my resources and that seems to work. now i'm trying to figure out how to do this multiple (MANY) asset thing. seems partitions are a start. thanks.