Hi, I’m new to Dagster, and have a couple of simpl...
# random
j
Hi, I’m new to Dagster, and have a couple of simple questions that I can’t seem to figure out by reading the documentation. Let’s say there is a task (C) that depends on tasks (A) and (B), where A and B are independent of each other. I could write a Python script where solids A and B are run from C, and they all run successfully. 1. How can I instruct Dagster to run A and B concurrently? 2. If I need to re-run A and C, but not B, how do I do that without changing the script?
d
Hey! I used Luigi some years ago. For your item #1, by default Dagster will run applicable ops concurrently. For your item #2, there are a few different ways, but in the Dagit web UI this is easily accomplished. You literally just select/highlight task A, and in the dropdown there is an option to re-run just task A, or to run task A and any downstream task(s). That's it. Done.
j
Cool. I will take your word for it and keep digging. 🙂 I once tried Prefect and got bitten pretty bad before I decided not to pursue it.
Thank you!!
m
By design, Dagster is a data orchestrator, not a task runner. Dagster is trying to keep track of the state of the data assets being orchestrated. Whilst what @Daniel Kim says is true and will work, I think doing this goes against the way Dagster is designed / intended to be used.
Taking a step back: what is the higher-level problem you are trying to solve? Maybe there is a more “Dagster-y” way to solve it?
j
So following the example I gave at the top,
Task A is client grouping (i.e. client id -> client group)
more specifically, loading client group mapping is task A.
Task B is loading transaction data which includes client id.
Task C, which depends on A & B, calculates commissions, etc. and produces a summary report showing results by client group.
Let’s say the client mapping changed after Task C successfully completed. We would want to reload the mapping (A) and rerun task C, but not B.
Hope it makes sense.
m
1. Concurrency is determined by your executor settings. You'll probably want to start with the multiprocess executor (https://docs.dagster.io/deployment/executors) and tell it how many things it is allowed to run in parallel (when the DAG allows).
2. I think you'll just get this out of the box by using memoization. Take a look at https://docs.dagster.io/guides/dagster/memoization
j
Thank you!! I will take a look.