# announcements

Jarryd

11/07/2019, 3:07 AM
I'm new, so forgive me. Scattered thoughts below. Are pipelines orchestrated entirely within a single compute process, and thus able to share the magic `context` resource? Contrasting with Airflow, where a worker runs a single task with zero context shared between tasks (besides the shoe-horned XCom). If tasks need to operate on different platforms, say PySpark in one, general Redshift SQL in another, finishing with a Python function, how does Dagster handle this?

nate

11/07/2019, 2:45 PM
Hi Jarryd, good questions! Yes, by default pipelines are executed in a single process, but we have built machinery to support distributed execution. In a multi-node context, Dagster will rehydrate the context and associated resources via something we call an "instance". For the data passed between solids themselves, we serialize the outputs to the filesystem or an object store like S3 or GCS. So it's something to consider when building pipelines, since you'll have serialization overhead between solids.
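[Editor's note: a minimal sketch of that idea using the 2019-era `@solid`/`@pipeline` API. The solid and pipeline names are made up, and the exact run-config keys and `execute_pipeline` signature vary by Dagster version, so treat this as illustrative rather than definitive.]

```python
from dagster import execute_pipeline, pipeline, solid


@solid
def extract(context):
    # The shared `context` carries the logger, resources, and run config.
    context.log.info("extracting rows")
    return [1, 2, 3]


@solid
def transform(context, rows):
    # `rows` is the output of `extract`. When execution spans multiple
    # processes or nodes, Dagster serializes it to the configured storage
    # (filesystem, S3, GCS) instead of handing it over in memory.
    return [r * 2 for r in rows]


@pipeline
def my_pipeline():
    transform(extract())


if __name__ == "__main__":
    # Run config selecting filesystem storage for intermediates; in a
    # multi-node setup you would point this at S3 or GCS instead.
    # (Key names are version-dependent in this era of Dagster.)
    execute_pipeline(my_pipeline, {"storage": {"filesystem": {}}})
```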
Re: PySpark vs. Redshift, we're currently iterating on PySpark support, and I'll have more to share soon! Generally, though, when handling large-scale data in Dagster you'll write solids that make API calls to other systems, and those systems are where the physical execution happens, rather than inside Dagster itself.
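[Editor's note: a hedged sketch of that pattern. The `redshift_resource` and SQL below are hypothetical stand-ins, not an official Dagster integration; the solid holds only orchestration logic and lets the warehouse do the heavy lifting.]

```python
from dagster import ModeDefinition, pipeline, resource, solid


@resource
def redshift_resource(init_context):
    # Hypothetical thin wrapper; in practice this might hold a
    # psycopg2 or SQLAlchemy connection built from resource config.
    class Redshift:
        def execute(self, sql):
            init_context.log.info("would run on Redshift: %s" % sql)

    return Redshift()


@solid(required_resource_keys={"redshift"})
def build_summary_table(context):
    # The solid just issues the SQL; Redshift performs the physical execution.
    context.resources.redshift.execute(
        "CREATE TABLE summary AS SELECT id, count(*) FROM events GROUP BY id"
    )


@pipeline(mode_defs=[ModeDefinition(resource_defs={"redshift": redshift_resource})])
def warehouse_pipeline():
    build_summary_table()
```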
Another way to state this: we're not attempting to replace PySpark, just building orchestration on top of it.