José Zavala González

11/15/2022, 3:38 PM
Hi everyone! I am wondering if anyone has used a notebook asset as an input to another SDA? A lot of our pipeline (no Dagster yet) is currently interspersed across Jupyter notebooks that each export a dataset to some file. Part of the reason for the notebooks is the "literate programming" angle, since it enables the team to follow along with how the analyses are conducted step by step. That said, is using a notebook SDA to create another SDA discouraged?
👍 1

jamie

11/15/2022, 3:42 PM
Hi @José Zavala González just to confirm, the use case you’re trying to support is something like: notebook 1 creates a dataset that is stored in a file, then notebook 2 uses the dataset in that file and does some more analysis? I don’t know with complete certainty if that currently works, i’ll write a quick example case and see
❤️ 2

José Zavala González

11/15/2022, 3:53 PM
Yeah! Basically that. Thank you for your time! I'm anticipating I'll have to just copy-paste the notebook code into Python modules that Dagster can work with, and I'm OK with that since it's not too much code. Just wanted to check before I start moving code around.
Alternatively, maybe there are some Dagster features I'm not familiar with yet that help with the "data literacy" side that a notebook is great for

jamie

11/15/2022, 3:56 PM
yeah, ideally you won’t have to do any copy-pasting! i’ll let you know what i find. For some more context, the first use case we’re supporting with notebook assets is where the executed notebook file itself is the asset. However, `define_dagstermill_asset` uses a lot of the same code as `define_dagstermill_op`, and `define_dagstermill_op` supports returning arbitrary outputs from the notebook (but those outputs won’t be assets). So `define_dagstermill_asset` may already be able to return arbitrary outputs, but i haven’t explored the implications of that deeply enough to be able to confidently recommend it
❤️ 1
:rainbow-daggy: 1
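A rough sketch of the op-level pattern described above, where the notebook yields a value back as an op output. The notebook path, op name, and output name are placeholders, and the exact wiring is an assumption based on the dagstermill API rather than anything confirmed in this thread:
from dagster import Out, job
from dagstermill import define_dagstermill_op, local_output_notebook_io_manager

clean_data_op = define_dagstermill_op(
    name='clean_data',
    notebook_path='clean_data.ipynb',  # placeholder notebook
    output_notebook_name='executed_clean_data',
    # inside the notebook: dagstermill.yield_result(df, output_name='cleaned')
    outs={'cleaned': Out()},
)

@job(resource_defs={'output_notebook_io_manager': local_output_notebook_io_manager})
def clean_data_job():
    clean_data_op()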

José Zavala González

11/15/2022, 3:59 PM
Got it, thank you! I still have to go through a lot of the intro tutorials, so no rush to find anything

jamie

11/15/2022, 4:00 PM
ok! let me know if i can help out with anything else. happy to explain any concepts if you run into issues
❤️ 2

nickvazz

11/15/2022, 4:25 PM
Hi @José Zavala González I have been using notebooks that way as a starting point for using dagster myself, i.e. `notebook_0` -> `data.parq` -> `notebook_1`, all defined as SDAs. I am currently trying to get it situated to all be partitioned assets since each notebook is essentially a separate experiment
🔥 1
to chain the notebooks together I have been using `non_argument_deps`:
notebook_asset = dm.define_dagstermill_asset(
    "some_notebook_asset",
    "template_notebook.ipynb",
    non_argument_deps={'some_other_notebook_asset'},
)
😲 1

José Zavala González

11/15/2022, 5:00 PM
The `non_argument_deps` definitely help with the lineage. What partitioning variable do you want to use for your notebooks?

nickvazz

11/15/2022, 5:08 PM
I am trying to use a directory that refers to an experiment, `{EXP_NAME}_{VERSION}`, since the same experiment could have multiple iterations
👍 1
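A minimal sketch of one way those experiment partitions could be attached to a dagstermill asset, assuming the `{EXP_NAME}_{VERSION}` keys are known up front; the keys below are made up for illustration:
from dagster import StaticPartitionsDefinition
from dagstermill import define_dagstermill_asset

# hypothetical {EXP_NAME}_{VERSION} partition keys, known ahead of time
experiment_partitions = StaticPartitionsDefinition(['exp_a_1', 'exp_a_2', 'exp_b_1'])

notebook_asset = define_dagstermill_asset(
    'some_notebook_asset',
    'template_notebook.ipynb',
    partitions_def=experiment_partitions,
    non_argument_deps={'some_other_notebook_asset'},
)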
the tricky part so far has been getting the partitioning to work the way I expect. Currently I have the multiple notebooks in a `@job` (via `define_asset_job`) and that job is using a `@config_mapping` to specify the shared parameters within each notebook in the pipeline
import dagster
import dagstermill as dm
from dagster import AssetSelection, config_mapping, define_asset_job

notebook_asset_0 = dm.define_dagstermill_asset(
    'notebook_asset_0',
    'notebook_0.ipynb',
    group_name='group_0',
    config_schema={'required_shared_var': dagster.Field(str)},
)

notebook_asset_1 = dm.define_dagstermill_asset(
    'notebook_asset_1',
    'notebook_1.ipynb',
    group_name='group_0',
    config_schema={'required_shared_var': dagster.Field(str)},
    non_argument_deps={'notebook_asset_0'},
)


@config_mapping(config_schema={'shared_var': dagster.Field(str)})
def simplified_config(values):
    return {
        'ops': {
            'notebook_asset_0': {'config': {'required_shared_var': values['shared_var']}},
            'notebook_asset_1': {'config': {'required_shared_var': values['shared_var']}},
        }
    }


notebook_pipeline_job = define_asset_job(
    name='full_notebook_pipeline_job',
    config=simplified_config,
    selection=AssetSelection.groups('group_0'),
)
that way I only have to specify `required_shared_var` once for the entire job and not for every `@op`/`@asset`, but I can't figure out how to combine partitioning with the `config_mapping`
https://dagster.slack.com/archives/C01U954MEER/p1668200854863459
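A rough sketch of wiring the pieces above together and launching the job, assuming the two assets and the job defined earlier; the Definitions/resource setup and the example shared_var value are assumptions, not something confirmed in the thread:
from dagster import Definitions
from dagstermill import local_output_notebook_io_manager

defs = Definitions(
    assets=[notebook_asset_0, notebook_asset_1],
    jobs=[notebook_pipeline_job],
    # dagstermill assets need this resource to store the executed notebooks
    resources={'output_notebook_io_manager': local_output_notebook_io_manager},
)

# with the config mapping applied, the whole run only needs shared_var
result = defs.get_job_def('full_notebook_pipeline_job').execute_in_process(
    run_config={'shared_var': 'exp_a_1'}  # hypothetical experiment key
)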