# announcements
I've got a colleague who is evaluating Dagster for some data manipulation in one of our applications, but he needs to be able to generate nested subtasks based on inputs to a given solid. I feel like that's a conversation that has happened on here previously, but a cursory search didn't turn it up. Is that a capability that Dagster currently possesses?
we currently don’t support dynamic step creation based on data flowing through the dag
what is the problem they are trying to solve using that proposed feature?
So, my understanding of the situation is that we have an incoming data structure that needs to get mapped to a different model, and depending on what is inbound there will be different nested attributes/objects that need to get handled via nested tasks to perform that mapping which is where the dynamic creation would be needed.
They looked at the mapping operator in Prefect (https://docs.prefect.io/guide/core_concepts/mapping.html) and that wasn't quite what they needed either; what they need is closer to what is described in this ticket on the Prefect project: https://github.com/PrefectHQ/prefect/issues/1311
if there are bounded number of branches
this can be handled by multiple outputs quite nicely
if it is totally unbounded and dynamic, we don’t support that atm
So, multiple outputs meaning that for each component/subobject that needs to get processed, that would be generated as an output which would then get picked up as an input to the necessary step based on the type of the data?
if i am understanding you correctly
meaning a solid can output A, B, or C
and you can optionally fire those outputs
so if you only fire B only those solids downstream from it get executed
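A rough sketch of the optional-outputs idea described above, in plain Python rather than the Dagster API. Everything here (`route_record`, `DOWNSTREAM`, `run`, the record keys) is a hypothetical stand-in: the "solid" yields only the outputs that apply, and only the downstream handlers wired to those outputs execute.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple


def route_record(record: Dict[str, Any]) -> Iterator[Tuple[str, Any]]:
    """Hypothetical stand-in for a solid with optional outputs A, B, and C.

    Yields (output_name, value) only for the outputs that should fire.
    """
    if "a" in record:
        yield "A", record["a"]
    if "b" in record:
        yield "B", record["b"]
    if "c" in record:
        yield "C", record["c"]


# The full set of possible branches is fixed and known ahead of time.
DOWNSTREAM: Dict[str, Callable[[Any], str]] = {
    "A": lambda value: f"handled A: {value}",
    "B": lambda value: f"handled B: {value}",
    "C": lambda value: f"handled C: {value}",
}


def run(record: Dict[str, Any]) -> List[str]:
    """Execute only the downstream handlers whose upstream output fired."""
    return [DOWNSTREAM[name](value) for name, value in route_record(record)]
```

With this wiring, `run({"b": 2})` only invokes the handler attached to output B; the handlers for A and C are skipped entirely.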
Roger that.
would love to hear more context on the use case at some point!
I just asked for a sample data structure and pseudo code for what we are doing so I can better understand and articulate the actual situation in the event that it serves as useful input for your work.
we are generally being conservative with adding dynamic features like that since it is complicated
and has deep effects throughout the system
seeing what we can do without it
but i suspect at some point we will support dynamic behavior like that in some form
So, this is the imperative code that my colleague has written for this particular case https://github.com/mitodl/open-discussions/compare/nl/integrate_micromasters_catalog
His description of the overall logic flow is: an API call to edX to get course -> course run data, then an API call to MM for program -> course data, which merges over the edX data. "I basically need the transformed data to inform the structure of the load portion of the ETL pipeline, but if I'm doing the ET portion in a pipeline already, there's no way I could see under Dagster to append new solids based on outputs of solids that have run so far in order to generate the L portion. The answer might be that ET is one pipeline and L is another."
Ya, so you could model the pipeline as the full set of possible solids to run (assuming they can be known ahead of time) and use optional outputs to only "activate" the subset of them that should be run for a given invocation. Another option, which you sort of alluded to, would be to split the pipeline. You can't dynamically change a pipeline mid-run, but you can dynamically generate pipelines ahead of time. Using the decorator makes it pretty easy to do this. How you keep tabs on these dynamic pipelines over time is its own unique challenge.
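To make the "split the pipeline" option concrete, here is a minimal plain-Python sketch, not Dagster code. It assumes the ET phase finishes first and its output is used to build the list of load steps before the load phase starts. The function names and the program/course data are made up for illustration.

```python
from typing import Callable, Dict, List

Step = Callable[[], str]


def extract_and_transform() -> Dict[str, List[str]]:
    """Hypothetical ET phase: returns a program -> course ids mapping."""
    return {
        "program-1": ["course-a", "course-b"],
        "program-2": ["course-c"],
    }


def make_load_step(program: str, course: str) -> Step:
    """Create one load step, bound to a specific program/course pair."""
    def load() -> str:
        return f"loaded {course} into {program}"
    return load


def build_load_pipeline(transformed: Dict[str, List[str]]) -> List[Step]:
    """Generate the load steps ahead of time from the ET output."""
    return [
        make_load_step(program, course)
        for program, courses in transformed.items()
        for course in courses
    ]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run the generated steps in order; the step list never changes mid-run."""
    return [step() for step in steps]
```

The shape of the second pipeline is decided entirely before it runs, which matches the constraint that a pipeline can't be changed mid-run.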