I've got a colleague who is evaluating Dagster for some data (dagster #announcements)


Tobias Macey

09/06/2019, 2:09 PM

I've got a colleague who is evaluating Dagster for some data manipulation in one of our applications, but he needs to be able to generate nested subtasks based on inputs to a given solid. I feel like that's a conversation that has happened on here previously, but a cursory search didn't turn it up. Is that a capability that Dagster currently possesses?

schrockn

09/06/2019, 2:34 PM

we currently don’t support dynamic step creation based on data flowing through the dag

schrockn

09/06/2019, 2:34 PM

what is the problem they are trying to solve using that proposed feature?

Tobias Macey

09/06/2019, 2:36 PM

So, my understanding of the situation is that we have an incoming data structure that needs to get mapped to a different model. Depending on what is inbound, there will be different nested attributes/objects that need to get handled via nested tasks to perform that mapping, which is where the dynamic creation would be needed.

Tobias Macey

09/06/2019, 2:37 PM

They looked at the map operator in Prefect (https://docs.prefect.io/guide/core_concepts/mapping.html) and that wasn't quite what they needed either, instead requiring what is described in this ticket on the Prefect project: https://github.com/PrefectHQ/prefect/issues/1311

schrockn

09/06/2019, 2:37 PM

if there is a bounded number of branches

schrockn

09/06/2019, 2:37 PM

this can be handled by multiple outputs quite nicely

schrockn

09/06/2019, 2:37 PM

if it is totally unbounded and dynamic, we don’t support that atm

Tobias Macey

09/06/2019, 2:39 PM

So, multiple outputs meaning that for each component/subobject that needs to get processed, that would be generated as an output which would then get picked up as an input to the necessary step based on the type of the data?

schrockn

09/06/2019, 2:39 PM

yeah

schrockn

09/06/2019, 2:39 PM

if i am understanding you correctly

schrockn

09/06/2019, 2:39 PM

meaning a solid can output A, B, or C

schrockn

09/06/2019, 2:40 PM

and you can optionally fire those outputs

schrockn

09/06/2019, 2:40 PM

so if you only fire B only those solids downstream from it get executed
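A minimal sketch of the optional-outputs idea described above, written in plain Python rather than against Dagster's API. The names `route_record`, `DOWNSTREAM`, and `run`, and the record shape, are hypothetical and only illustrate the pattern: the full set of branches (A, B, C) is declared up front, and only the branches whose output actually fires get executed.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

# Hypothetical record shape: nested sections keyed by output name.
Record = Dict[str, Any]


def route_record(record: Record) -> Iterator[Tuple[str, Any]]:
    """Yield (output_name, value) only for the sections present in the record."""
    for name in ("A", "B", "C"):
        if name in record:
            yield name, record[name]


# The full, bounded set of downstream steps is known ahead of time.
DOWNSTREAM: Dict[str, Callable[[Any], str]] = {
    "A": lambda value: f"handled A: {value}",
    "B": lambda value: f"handled B: {value}",
    "C": lambda value: f"handled C: {value}",
}


def run(record: Record) -> List[str]:
    """Run only the downstream steps whose upstream output actually fired."""
    return [DOWNSTREAM[name](value) for name, value in route_record(record)]


results = run({"B": 42})
print(results)  # ['handled B: 42']
```

Because the branch set is fixed, the graph itself never changes at run time; only the subset of it that executes does.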

Tobias Macey

09/06/2019, 2:41 PM

Roger that.

schrockn

09/06/2019, 2:41 PM

👍🏻

schrockn

09/06/2019, 2:41 PM

would love to hear more context on the use case at some point!

Tobias Macey

09/06/2019, 2:41 PM

I just asked for a sample data structure and pseudo code for what we are doing so I can better understand and articulate the actual situation in the event that it serves as useful input for your work.

schrockn

09/06/2019, 2:42 PM

we are generally being conservative with adding dynamic features like that since it is complicated

schrockn

09/06/2019, 2:42 PM

and has deep effects throughout the system

schrockn

09/06/2019, 2:42 PM

seeing what we can do without it

schrockn

09/06/2019, 2:43 PM

but i suspect at some point we will support dynamic behavior like that in some form

Tobias Macey

09/06/2019, 2:54 PM

So, this is the imperative code that my colleague has written for this particular case https://github.com/mitodl/open-discussions/compare/nl/integrate_micromasters_catalog

Tobias Macey

09/06/2019, 2:54 PM

His description of the overall logic flow is:
api call to edx to get course -> course run data
[10:51 AM] then api call to MM for program -> course data, which merges over the edx data
[10:51 AM] I basically need the transformed data to inform the structure of the load portion of the ETL pipeline
[10:53 AM] but if I'm doing ET portion in a pipeline already, there's no way i could see under dagster to append new solids based on outputs of solids that have run so far
[10:53 AM] in order to generate the L portion new messages
[10:53 AM] the answer might be that ET is one pipeline and L is another

alex

09/06/2019, 3:13 PM

Ya so you could model the pipeline as the full set of possible solids to run (assuming they can be known ahead of time) and use optional outputs to only "activate" the subset of them that should be run for that given invocation. Another option you sort of alluded to would be to split the pipeline. You can't dynamically change a pipeline mid run, but you can dynamically generate pipelines ahead of time. Using the @pipeline decorator makes it pretty easy to do this. How you keep tabs on these dynamic pipelines over time is its own unique challenge.
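A rough sketch of the "split the pipeline" option, again in plain Python and not Dagster's API. It assumes the extract/transform run has already finished and that its output can be inspected before the load run is defined. `make_load_step`, `build_load_pipeline`, `run_pipeline`, and the shape of `transformed` are all hypothetical names for illustration.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Step = Callable[[State], State]


def make_load_step(kind: str) -> Step:
    """Create one load step for a given kind of transformed object."""
    def load(state: State) -> State:
        # Stand-in for a real load: record which kinds were loaded, in order.
        return {**state, "loaded": state.get("loaded", []) + [kind]}
    load.__name__ = f"load_{kind}"
    return load


def build_load_pipeline(transformed: State) -> List[Step]:
    """Decide the full list of load steps before the load run starts."""
    return [make_load_step(kind) for kind in sorted(transformed) if transformed[kind]]


def run_pipeline(steps: List[Step], state: State) -> State:
    """Execute a fixed list of steps; nothing is added mid run."""
    for step in steps:
        state = step(state)
    return state


# Hypothetical output of a finished extract/transform run.
transformed = {"course": [{"id": 1}], "program": [{"id": 7}], "video": []}
load_steps = build_load_pipeline(transformed)
final_state = run_pipeline(load_steps, {})
print([step.__name__ for step in load_steps])  # ['load_course', 'load_program']
print(final_state["loaded"])  # ['course', 'program']
```

The load pipeline's structure is fully decided before it runs, which matches the constraint that a pipeline cannot change mid run.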
