# ask-community
m
I see in the docs that the built-in `fs_io_manager` "allows users to specify a base directory where all the step outputs will be stored", but the provided example does not explicitly show how to define such a directory. Would anyone shed some light on this, please? Or is the base directory the same as `DAGSTER_HOME`? Also, if I want to store .csv instead of .pickle, should I declare my own IOManager?
o
Hi Martim! Good question, it looks like the API docs should include an example of that behavior. base_dir is an optional configuration option of the `fs_io_manager`, which means it can be set either through the configured API:
Copy code
fs_io_manager.configured({"base_dir": "path/to/basedir"})
or by passing in configuration through the run config. You can see an example of the first option in the docs for `custom_path_fs_io_manager` (where this configuration is required instead of optional). As for storing information as a CSV, you will have to write your own IOManager, although its implementation can be quite similar to the `fs_io_manager` implementation.
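For the run-config route, a minimal sketch of what the YAML might look like (the resource key `io_manager` is the default key mentioned below; the path is a placeholder):

```yaml
# run config supplied at launch time, e.g. in Dagit's playground
resources:
  io_manager:
    config:
      base_dir: path/to/basedir
```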
m
Thanks @owen! I have successfully declared a `pandas_csv_iomanager`. Now, because of the way I'm structuring my project, I happen to have many small, atomic solids inside a composite solid, of which I only intend to persist the last output. Do I need to set an `OutputDefinition` for all of them (using a default one on the upstream solids), or is there a way to say "just use my custom IOManager on the last solid's output"?
o
No problem! For that, you can use a per-output IOManager . So you only need to set the OutputDefinition on the final output (with an io_manager_key='my_pandas_csv_io_manager' or whatever you want to call it). All outputs, by default, use the io_manager_key='io_manager', so as long as you don't map the 'io_manager' resource to something different, they will continue to use the default (in-memory) IOManager.
m
Copy code
@dg.pipeline(mode_defs=[dg.ModeDefinition(resource_defs={"pandas_csv": df_csv_io_manager})])
def main():
    catalog_df = catalog_main()
Copy code
@dg.composite_solid(output_defs=[dg.OutputDefinition(io_manager_key="pandas_csv")])
def catalog_main():
    root = read_xml()   
    outDict = find_uids(root)
    formated_table = fill_records(root,outDict)
    catalog_df = load(formated_table)
    catalog_df = rename_columns(catalog_df)
    catalog_df = select_columns(catalog_df)
    catalog_df = remove_extension(catalog_df)
    catalog_df = remove_duplicates(catalog_df)
    catalog_df = reverse_creators_name(catalog_df)
    catalog_df = dates_accuracy(catalog_df)
    catalog_df = extract_dimensions(catalog_df)

    return catalog_df
this still seems to use `io_manager` for all steps. What am I doing wrong?
does the `io_manager` operate directly on the `extract_dimensions` output? It seems logical to me that I need to specify the manager on the composite solid's output, in case I ever change the order or add new solids inside of it…
o
hm interesting -- I was able to reproduce this behavior. It seems like a bug, but perhaps @sandy has more context on this? A short-term solution would be to declare the OutputDefinition with the io_manager_key on the `extract_dimensions` solid instead of on the composite solid, but I see why this can be cumbersome for your use case
Copy code
from dagster import (ModeDefinition, OutputDefinition, composite_solid,
                     fs_io_manager, pipeline, solid)

@solid
def solid_a(_):
    return 1

@solid
def solid_b(_, x):
    return x

@composite_solid(output_defs=[OutputDefinition(io_manager_key="my_io_manager")])
def my_composite_solid():
    x = solid_a()
    x = solid_b(x)

    # this output will still be processed with the default io_manager
    return x

@pipeline(mode_defs=[ModeDefinition(resource_defs={"my_io_manager": fs_io_manager})])
def my_pipeline():
    my_composite_solid()
^ code to reproduce
m
phew, I’m always happy when it’s a bug and not something I’m doing wrong. Should I open an issue?
o
hahaha I understand the feeling -- and that would be great 🙂
m
issue opened. I was wondering if it is possible to chat with anyone at Elementl/Dagster about my project, to understand the best way to structure it, as I see you're very supportive here on Slack. Mainly I want to understand where to draw the line between composite solids, pipelines, and repositories, based on what I need to be sensor-triggered and the assets I materialize… It's a really simple digital humanities project, so something a bit different from your usual ML/business applications. I understand Slack is a great channel, but a Zoom session would really allow me to point out our needs and doubts
s
Hi @Martim Passos - I'd be happy to chat with you
How's tomorrow at 10 am pacific time?
m
That works great, @sandy. Thanks for the quick reply! Send me a link on the dms?
s
Great. Will DM you a Zoom link beforehand
l
Hi @Martim Passos. We're trying to create our own pandas-to-CSV IOManager as well, but we have not been successful yet. Would you share your IOManager implementation?
m
Hi @Laura Moraes, yes by all means. Are you in Brazil by any chance? It would be really good to start a local community of Dagster users!
s
@Laura Moraes - let me know if it would be helpful to chat on zoom about this
l
Hi @sandy that would be great! I posted right now on this channel what we're trying to accomplish.