# announcements
d
Hey everyone! I started looking at Dagster recently and fell in love with its amazing UI. Part of that feeling comes from the fact that we're currently using Luigi at my place of work, and its UI is too minimalistic. However, I couldn't find any comparisons of Dagster vs. Luigi on the internet, so I'm asking here. My question is: in Luigi, Targets -- the data artifacts produced by Tasks -- are first-class citizens. Each time you execute a pipeline, Luigi unfolds all of its dependencies and, for each one, decides whether it should be executed based on the presence of its output targets (a task is considered complete if all of its output targets exist). How does Dagster fit into this picture? If I have a parametrized Dagster pipeline and some of its solids have already been executed and produced their outputs (in my case, Hive partitions in S3 written by PySpark), will Dagster skip the subgraphs of those finished solids and execute only what's needed? Thanks in advance, I'll be happy to hear any input on this!
a
Hey! Dagster has all the building blocks to support skipping downstream graphs if a “target output” already exists, but it doesn't do this out of the box at this time. For example, you could do it manually today by using optional outputs in your solids: https://dagster.readthedocs.io/en/0.6.4/sections/learn/tutorial/multiple_outputs.html#multiple-and-conditional-outputs
d
Thanks for your answer! How does Dagster decide which downstream graphs it should skip and which it shouldn't? Luigi does it by checking whether a task is complete, and a task is considered complete iff its output(s) exist in the filesystem. The assumption is that a task is idempotent and deterministic, and that its output is fully determined by the task name and the parameter values passed in. Does Dagster do something similar under the hood?
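Luigi's default completeness check described here boils down to target existence. Roughly, as a plain-Python sketch of the idea (not Luigi's actual code):

```python
import os

def is_complete(output_paths):
    # Luigi treats a task as complete iff every declared output
    # target already exists -- no hashing or timestamps involved.
    return all(os.path.exists(p) for p in output_paths)
```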
a
How does Dagster decide which downstream graphs it should skip and which it shouldn't?
So we do not have any system-wide mode (yet) like the one you described for Luigi, but you can accomplish this manually by not yielding an optional output.
```python
@solid
def example():
    expected_path = calculate()
    if exists(expected_path):
        return  # exit early - downstream work skipped
    output = do_work()
    write(output, expected_path)
    yield Output(...)
```
A system-wide application of this behavior, like Luigi has, is something we're interested in adding in the future but don't have at this point in time.
d
Interesting. Just to reiterate and make sure I'm on the same page: am I right that if I have two different `@pipeline`s which turn out to execute the same `@solid` with the same parameters, then I shouldn't expect that solid's output to be reused by the pipeline that runs last? So basically that solid will be executed twice even though it does the exact same calculations (since the input is the same, as well as the rest of the parameters). Is this correct? Thanks!
a
if I have two different `@pipeline`s which turn out to execute the same `@solid` with the same parameters, then I shouldn't expect that solid's output will be reused for a pipeline that runs last
Correct - we don't make global assumptions about idempotency, so we don't do this. In the future we will add an “on switch” to enable this type of behavior, but in the meantime you can implement it in user space.
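One way to implement that in user space is to key a marker file on the solid name and its parameters, mirroring Luigi's completeness check. A minimal sketch, not a Dagster API -- the helper names and the marker-file scheme are assumptions:

```python
import hashlib
import json
import os

def cache_key(solid_name, params):
    # Deterministic key: solid name plus parameters, serialized in a
    # stable order so the same inputs always map to the same key.
    blob = json.dumps({"solid": solid_name, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def already_done(cache_dir, solid_name, params):
    # A run is considered "complete" iff its marker file exists,
    # just like Luigi's target-existence check.
    return os.path.exists(os.path.join(cache_dir, cache_key(solid_name, params)))

def mark_done(cache_dir, solid_name, params):
    # Record completion by touching the marker file.
    path = os.path.join(cache_dir, cache_key(solid_name, params))
    open(path, "w").close()
```

Inside a solid, you would call `already_done()` first and return without yielding the optional output when it is true, then call `mark_done()` after writing the real output.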
d
Very cool, thanks for your detailed answers!
i
@alex hi, is it still the best way to perform skipping?
I have logically separated solids:
1. loads raw data from BigQuery
2. builds a column in the raw dataset
3. persists the final dataset to disk
I want to reuse this chain of solids in many pipelines, so I wrapped those steps in a composite_solid. Yet I don't want to re-execute the whole chain every time; I want to first check that there's no saved file on disk.
```python
if os.path.exists(...):
    # read the df from disk
else:
    # run composite solid to get the df
```
I can't find a way to implement such logic at the moment. I would love to encapsulate that logic in one solid for convenience 🤔 Any suggestions?
Basically I want any pipeline to be unaware where the df comes from.
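That load-or-compute branch can be sketched as a single helper function. This is a plain-Python sketch, not Dagster-specific; the names and the pickle-based persistence are assumptions:

```python
import os
import pickle

def load_or_build(path, build_fn):
    # Return the cached result from disk if it exists; otherwise build it,
    # persist it for later runs, and return it. Callers stay unaware of
    # where the data actually came from.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

A single solid can then call `load_or_build(path, lambda: run_the_chain())` and every pipeline using it stays unaware of the source.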
After playing around a bit, I realized that I don't need such granularity in my solids, so I've combined all the steps into a single solid. After all the operations, I update the solid's config with a filename for later branching.
👍 1