Could someone help me understand if a dagster graph best supports my use case?
I’m building something similar to Plaid. I’m planning on using a dagster graph to orchestrate my data pipeline. Here’s a simplified version of what it looks like:
1. Extract the user’s bank transactions. This involves logging into their bank with their submitted credentials, scraping their bank transactions, and saving this raw data to bucket storage.
2. Transform the bank transactions. This involves downloading the raw data from bucket storage, transforming the data into our standardized format, and saving this data to a postgres table.
This graph must be run on a per-user basis. There are two instances in which this graph is run:
1. When a user is first created
2. During a daily job to refresh all users' data
Thus, if I have 1M users, then I would need something that supports 1M concurrent graph runs. Is this achievable with dagster?
12/09/2022, 7:52 PM
Is there a reason that the graph needs to be run on a per-user basis? The work you're doing for each user seems pretty low-cost, so I'm wondering if the users can be batched
In any case, that is definitely a high run volume, but dagster should be able to handle it. You can have a sensor that kicks off a run when a user is first created, and then a schedule that kicks off runs to refresh user data each day
12/12/2022, 5:42 PM
Thanks for the answer! I don’t want to batch these jobs because I don’t want the success of updating one user’s data to depend on the success of updating another’s. My understanding is if I were to batch five users together in one graph run, then step #1 would need to succeed for all five of these users before the graph moves on to step #2.
12/13/2022, 9:41 PM
Couldn't you have the graph use dynamic outputs to concurrently fan out across all users (possibly batched) that need their data refreshed? This wouldn't introduce any dependencies between users. It seems like 1M run requests would blow out much of the usefulness of the UI; I can't imagine reviewing 1M daily runs in the UI would work very well. There's also relatively significant overhead with launching a run (depending on the run launcher you're using), which might add a lot of time to the refresh.