sean

12/08/2020, 6:45 PM
Hello, I have a question about defining inputs/outputs for dagstermill solids. In the docs, we see input defs defined like this:
k_means_iris = dm.define_dagstermill_solid(
    "k_means_iris",
    script_relative_path("iris-kmeans_2.ipynb"),
    input_defs=[InputDefinition("path", str, description="Local path to the Iris dataset")],
)
This way of doing things requires duplicating input definitions/descriptions between the notebook itself and the solid definition call. Ideally the input definitions could be parsed from the special parameters-tagged cell that you need to define anyway (some special comment formatting could maybe be used for the descriptions). A similar cell could be used for outputs. Is anything like this possible now or planned? I would be willing to work on this if devs think it is a good idea but no one is working on it.
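[Editor's sketch] The parsing idea above could look roughly like the following. It is a minimal illustration using only the standard library, not something dagstermill provides: the helper names `find_tagged_cell` and `infer_inputs`, the trailing-comment convention for descriptions, and the `n_clusters` parameter are all assumptions made up for this example.

```python
import ast
import io
import tokenize


def find_tagged_cell(notebook, tag):
    """Return the source of the first code cell carrying `tag`, or None."""
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if cell.get("cell_type") == "code" and tag in tags:
            source = cell.get("source", "")
            # The notebook JSON stores source as a string or a list of lines.
            return "".join(source) if isinstance(source, list) else source
    return None


def infer_inputs(cell_source):
    """Map each `name = literal` assignment to its type name and trailing comment."""
    # Collect comments by line number so '#' inside string literals is not misread.
    comments = {
        tok.start[0]: tok.string.lstrip("# ").strip()
        for tok in tokenize.generate_tokens(io.StringIO(cell_source).readline)
        if tok.type == tokenize.COMMENT
    }
    inputs = {}
    for node in ast.parse(cell_source).body:
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target = node.targets[0]
        if not isinstance(target, ast.Name):
            continue
        try:
            default = ast.literal_eval(node.value)
        except (ValueError, TypeError):
            continue  # skip defaults that are not plain literals
        inputs[target.id] = {
            "type": type(default).__name__,
            "description": comments.get(node.end_lineno, ""),
        }
    return inputs


# Hypothetical notebook content, written as the dict that json.load would return.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Iris k-means"]},
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": [
                'path = "iris.csv"  # Local path to the Iris dataset\n',
                "n_clusters = 3  # Number of clusters\n",
            ],
        },
    ]
}

inferred = infer_inputs(find_tagged_cell(notebook, "parameters"))
```

The resulting dictionary could then be turned into `InputDefinition` objects by the caller; an outputs-tagged cell could be handled the same way with a different tag.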

max

12/08/2020, 7:25 PM
hm, i'm not averse to inferring these things if there's a sensible way to do it, but it feels hard in principle - since, e.g., you could yield an output from arbitrary code, making it impossible to reliably determine from inspection of notebook code which outputs the notebook declared

sean

12/08/2020, 7:32 PM
I'm new to dagster, so maybe I'm misunderstanding, but I don't see how the situation differs from a solid written in a text file. The yielding of outputs is always separate from output declarations, right? Like this, from the docs:
from dagster import InputDefinition, Output, OutputDefinition, solid

@solid(
    input_defs=[
        InputDefinition(name="a", dagster_type=int),
        InputDefinition(name="b", dagster_type=int),
    ],
    output_defs=[
        OutputDefinition(name="sum", dagster_type=int),
        OutputDefinition(name="difference", dagster_type=int),
    ],
)
def my_input_output_example_solid(context, a, b):
    yield Output(a + b, output_name="sum")
    yield Output(a - b, output_name="difference")
What I'm suggesting is two-fold:
• support a special cell corresponding to output definitions (there already is one for input definitions, the parameters cell).
• optionally parse the contents of these special cells for the input/output definitions to be used in the solid declaration

max

12/08/2020, 8:53 PM
i think there might be a bit of a misunderstanding about the parameters cell, but im very curious what the ergonomic issue that you're trying to solve is -- the inputs aren't inferred from the contents of the parameters cell by dagstermill; at runtime, dagstermill injects the inputs defined by the call to define_dagstermill_solid into that cell
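[Editor's sketch] To make the direction of data flow concrete: the values travel from the solid definition into the notebook, not the other way around. The snippet below is a simplified stand-in, not dagstermill's actual code. It follows the papermill convention of adding a cell tagged `injected-parameters` right after the `parameters` cell; the function name `inject_parameters` and the file paths are assumptions for illustration.

```python
import copy


def inject_parameters(notebook, values):
    """Return a copy of `notebook` with an injected cell placed after the 'parameters' cell."""
    result = copy.deepcopy(notebook)
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        # Each runtime value becomes a plain assignment that overrides the default.
        "source": "".join(f"{name} = {value!r}\n" for name, value in values.items()),
        "outputs": [],
        "execution_count": None,
    }
    cells = result["cells"]
    index = next(
        (
            i
            for i, cell in enumerate(cells)
            if "parameters" in cell.get("metadata", {}).get("tags", [])
        ),
        -1,  # no parameters cell: index + 1 == 0, so the cell goes at the top
    )
    cells.insert(index + 1, injected)
    return result


# Hypothetical two-cell notebook: defaults in the tagged cell, then code that uses them.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": 'path = "placeholder.csv"\n',
        },
        {"cell_type": "code", "metadata": {}, "source": "print(path)\n"},
    ]
}

executed = inject_parameters(notebook, {"path": "/data/iris.csv"})
```

Because the injected assignments run after the defaults, whatever the notebook author wrote in the `parameters` cell only matters when the notebook is run by hand.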

sean

12/08/2020, 9:40 PM
Right, but I'm saying the inputs (and outputs) should be inferable. Ideally, a notebook should be as independently comprehensible as possible -- you shouldn't need to refer to a definition in another file to read the descriptions of the inputs and outputs. Having to do so negates one of the main advantages of notebooks, which is the clean interleaving of docs and code. Of course you can write the input/output descriptions in both the notebook itself and the dagstermill.define_dagstermill_solid call, but then you are unnecessarily duplicating information. And since dagstermill already constrains notebook structure (a parameters cell is required if using inputs), why not provide a mechanism to infer input/output params based on further constraints (e.g. a specially tagged output cell)? Or, taking this idea further, why not provide a facility to fully define the solid within a notebook, such as a specially tagged cell or set of cells where one specifies all the info that goes into the @solid(...) decorator call?

max

12/08/2020, 9:56 PM
yep, it certainly would be nice to have less stuff split across the dagster/notebook boundary
it'd be nice to see an example of the kind of notebook you'd like to be able to write

sean

12/08/2020, 9:57 PM
Sure, I can provide that shortly.
That's an example. I'm experimenting with dagster in a scientific data analysis context. I have fairly complex pipelines that are currently in monolithic notebooks, and I am trying to break them up into smaller notebooks as dagster solids. The notebook I posted has a title and a specification of inputs/outputs at the top. This is just a custom format I'm using; I've written my own parser code to extract the input/output descriptions so that I don't have to respecify them in my pipeline defs. (I'm not suggesting this format specifically be incorporated into dagstermill, but rather just some kind of parseable format.) Or perhaps a parameter (extract_metadata?) could be added to dm.define_dagstermill_solid that takes a function reference, which gets passed the notebook object and should return a dictionary of solid params. Then the user could use whatever documentation style they want and just provide this adapter function to dagstermill. This also has the advantage of adding very little complexity on the dagstermill end.
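[Editor's sketch] The adapter idea could be prototyped outside dagstermill along these lines. Everything here is hypothetical: `extract_metadata` is only the parameter name proposed in the message above and does not exist in dagstermill, `solid_kwargs_from_notebook` is an invented wrapper, and the `- input NAME: DESCRIPTION` markdown convention is a made-up stand-in for whatever documentation style a user prefers. The input definitions are kept as plain dictionaries so the example runs without dagster installed.

```python
import json
import os
import tempfile


def solid_kwargs_from_notebook(name, notebook_path, extract_metadata):
    """Load the notebook and merge the adapter's result into the solid keyword arguments."""
    with open(notebook_path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    kwargs = {"name": name, "notebook_path": notebook_path}
    # The adapter owns all knowledge of the documentation format.
    kwargs.update(extract_metadata(notebook))
    return kwargs


def bullet_list_adapter(notebook):
    """Example adapter: read '- input NAME: DESCRIPTION' lines from markdown cells."""
    input_defs = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            if line.startswith("- input "):
                name, _, description = line[len("- input "):].partition(":")
                input_defs.append(
                    {"name": name.strip(), "description": description.strip()}
                )
    return {"input_defs": input_defs}


# Hypothetical notebook with its inputs documented in the first markdown cell.
demo_notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# k-means on Iris\n",
                "- input path: Local path to the Iris dataset\n",
            ],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

with tempfile.TemporaryDirectory() as tmp:
    nb_path = os.path.join(tmp, "iris-kmeans_2.ipynb")
    with open(nb_path, "w", encoding="utf-8") as handle:
        json.dump(demo_notebook, handle)
    solid_kwargs = solid_kwargs_from_notebook(
        "k_means_iris", nb_path, bullet_list_adapter
    )
```

A real integration would convert the returned entries into `InputDefinition` objects and pass them on to `dm.define_dagstermill_solid`; the point of the design is that dagstermill itself would only need to call the user-supplied function.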

max

12/08/2020, 10:38 PM
interesting, i like that last idea quite a bit
can you open an issue with this on github and assign it to me?

sean

12/08/2020, 11:33 PM
sure
I don't know how (or even if I can, since I'm not a dagster maintainer) to assign this issue to you, so here is the link: https://github.com/dagster-io/dagster/issues/3373