# random
Hi, I’m new to Dagster, and have a couple of simple questions that I can’t seem to figure out by reading the documentation. Let’s say there is a task (C) that depends on tasks (A) and (B), where A and B are independent of each other. I could write a Python script where solids A and B are run from C, and they all run successfully. 1. How can I instruct Dagster to run A and B concurrently? 2. If I need to re-run A and C, but not B, how do I do that without changing the script?
Hey! I used Luigi some years ago. For your item #1, Dagster will run applicable ops concurrently by default. For your item #2, there are a few different ways, but in the Dagit web UI this is easily accomplished. You literally just select/highlight task A, and in the dropdown there is an option to re-run just task A, or to run task A and any downstream task(s). That's it. Done.
Cool. I will take your word for it and keep digging. 🙂 I once tried Prefect and got bitten pretty bad before I decided not to pursue it.
Thank you!!
By design, Dagster is a data orchestrator, not a task runner. Dagster tries to keep track of the state of the data assets being orchestrated. Whilst what @Daniel Kim says is true and will work, I think doing this goes against the way Dagster is designed / intended to be used.
Taking a step back, what is the higher-level problem you are trying to solve? Maybe there is a more “Dagster-y” way to solve it?
So following the example I gave at the top,
Task A is client grouping (i.e. client id -> client group).
More specifically, loading the client group mapping is task A.
Task B is loading transaction data which includes client id.
Task C, which depends on A & B, calculates commissions, etc. and produces a summary report showing results by client group.
Let’s say the client mapping changed after Task C successfully completed. We would want to reload the mapping (A) and rerun task C, but not B.
Hope it makes sense.
1. Concurrency is determined by your executor settings. You'll probably want to start by using the multiprocess executor (multiprocess_executor) - https://docs.dagster.io/deployment/executors - and telling it how many things it is allowed to run in parallel (when the DAG allows).
I think you'll get #2 out of the box by using memoization; take a look at https://docs.dagster.io/guides/dagster/memoization
Thank you!! I will take a look.