# ask-community
m
I see in the docs that the built-in `fs_io_manager` "allows users to specify a base directory where all the step outputs will be stored", but the provided example does not explicitly show how to define such a directory. Would anyone shed some light on this, please? Or is the base directory the same as `DAGSTER_HOME`? Also, if I want to store .csv instead of .pickle, should I declare my own IOManager?
o
Hi Martim! Good question, it looks like the API docs should include an example of that behavior. base_dir is an optional configuration option of the `fs_io_manager`, which means it can be set either through the configured API:
Copy code
fs_io_manager.configured({"base_dir": "path/to/basedir"})
or by passing in configuration through the run config. You can see an example of the first option in the docs for `custom_path_fs_io_manager` (where this configuration is required instead of optional). As for storing information as a CSV, you will have to write your own IOManager, although its implementation can be quite similar to the `fs_io_manager` implementation.
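For the run-config route, a minimal sketch of what the YAML might look like (the resource key `io_manager` is the default key mentioned below; the path is a placeholder):

```yaml
# run config supplied at launch time, e.g. in Dagit's playground
resources:
  io_manager:
    config:
      base_dir: path/to/basedir
```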
m
Thanks @owen! I have successfully declared a `pandas_csv_iomanager`. Now, because of the way I'm structuring my project, I happen to have many small, atomic solids inside a composite solid, of which I only intend to persist the last output. Do I need to set an `OutputDefinition` for all of them (using a default one on the upstream solids), or is there a way to say "just use my custom IOManager on the last solid's output"?
o
No problem! For that, you can use a per-output IOManager . So you only need to set the OutputDefinition on the final output (with an io_manager_key='my_pandas_csv_io_manager' or whatever you want to call it). All outputs, by default, use the io_manager_key='io_manager', so as long as you don't map the 'io_manager' resource to something different, they will continue to use the default (in-memory) IOManager.
m
Copy code
@dg.pipeline(mode_defs=[dg.ModeDefinition(resource_defs={"pandas_csv": df_csv_io_manager})])
def main():
    catalog_df = catalog_main()
Copy code
@dg.composite_solid(output_defs=[dg.OutputDefinition(io_manager_key="pandas_csv")])
def catalog_main():
    root = read_xml()   
    outDict = find_uids(root)
    formated_table = fill_records(root,outDict)
    catalog_df = load(formated_table)
    catalog_df = rename_columns(catalog_df)
    catalog_df = select_columns(catalog_df)
    catalog_df = remove_extension(catalog_df)
    catalog_df = remove_duplicates(catalog_df)
    catalog_df = reverse_creators_name(catalog_df)
    catalog_df = dates_accuracy(catalog_df)
    catalog_df = extract_dimensions(catalog_df)

    return catalog_df
this still seems to use `io_manager` for all steps. What am I doing wrong?
does the `io_manager` operate directly on the `extract_dimensions` output? It seems logical to me that I need to specify the manager on the composite solid's output, in case I ever change the order or add new solids inside of it…
o
hm interesting -- I was able to reproduce this behavior. It seems like a bug, but perhaps @sandy has more context on this? A short-term solution would be to declare the OutputDefinition with the io_manager_key on the `extract_dimensions` solid instead of on the composite solid, but I see why this can be cumbersome for your use case
Copy code
from dagster import (ModeDefinition, OutputDefinition, composite_solid,
                     fs_io_manager, pipeline, solid)

@solid
def solid_a(_):
    return 1

@solid
def solid_b(_, x):
    return x

@composite_solid(output_defs=[OutputDefinition(io_manager_key="my_io_manager")])
def my_composite_solid():
    x = solid_a()
    x = solid_b(x)

    # this output will still be processed with the default io_manager
    return x

@pipeline(mode_defs=[ModeDefinition(resource_defs={"my_io_manager": fs_io_manager})])
def my_pipeline():
    my_composite_solid()
^ code to reproduce
m
phew, I’m always happy when it’s a bug and not something I’m doing wrong. Should I open an issue?
o
hahaha I understand the feeling -- and that would be great 🙂
m
issue opened. I was wondering if it is possible to chat with anyone at Elementl/Dagster about my project, to understand the best way to structure it, as I see you're very supportive here on Slack. Mainly I want to understand where to draw the line between composite solids, pipelines, and repositories, based on what I need to be sensor-triggered and the assets I materialize… It's a really simple digital humanities project, so something a bit different from your usual ML/business applications. I understand Slack is a great channel, but a Zoom session would really allow me to point out our needs and doubts
s
Hi @Martim Passos - I'd be happy to chat with you
How's tomorrow at 10 am pacific time?
m
That works great, @sandy. Thanks for the quick reply! Send me a link on the dms?
s
Great. Will DM you a Zoom link beforehand
l
Hi @Martim Passos. We're trying to create our own pandas-to-CSV IOManager as well, but we have not been successful yet. Would you share your IOManager implementation?
m
Hi @Laura Moraes, yes by all means. Are you in Brazil by any chance? It would be really good to start a local community of Dagster users!
s
@Laura Moraes - let me know if it would be helpful to chat on zoom about this
l
Hi @sandy that would be great! I posted right now on this channel what we're trying to accomplish.