#announcements

dwall

02/18/2020, 11:54 PM
I see that the dagster pandas dataframes types have an output materialization config built in: https://github.com/dagster-io/dagster/blob/c3ebf8cbe773778412da47196361814e686bfe0a/python_modules/libraries/dagster-pandas/dagster_pandas/data_frame.py#L185 I'm assuming this means as long as we are creating a dagster pandas dataframe type, we can also provide solid output config that would cause the dataframe to materialize itself as well as store as an intermediate?

schrockn

02/18/2020, 11:55 PM
yeah, those allow specifying materializations via config
in addition to the intermediates store

dwall

02/18/2020, 11:56 PM
dope
is it frowned upon to have a downstream solid depend on the materialized artifact from an upstream solid?

abhi

02/18/2020, 11:57 PM
Couldn't you just pass that dataframe as the input to that downstream solid?

dwall

02/18/2020, 11:59 PM
the downstream solid is going to load to snowflake. Ideally, I'd want to do a bulk copy into -> merge into flow, loading data from a csv/parquet file instead of doing single-row inserts from the dataframe
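The staged bulk-load flow described here (copy into a staging table, then merge into the target) can be sketched as plain SQL-string builders. The table, stage, key, and column names below are hypothetical placeholders, not anything from this thread:

```python
# Sketch of the staged bulk-load flow: COPY INTO a staging table from a
# file, then MERGE INTO the target table. All names are hypothetical.

def copy_into_sql(staging_table, stage_path, file_format='parquet'):
    # Bulk-load the staged file into the staging table.
    return (
        f"COPY INTO {staging_table} FROM {stage_path} "
        f"FILE_FORMAT = (TYPE = {file_format.upper()})"
    )

def merge_into_sql(target_table, staging_table, key, columns):
    # Upsert staged rows into the target, keyed on `key`.
    assignments = ', '.join(f"t.{c} = s.{c}" for c in columns)
    col_list = ', '.join([key] + columns)
    src_list = ', '.join(f"s.{c}" for c in [key] + columns)
    return (
        f"MERGE INTO {target_table} t USING {staging_table} s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {assignments} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list})"
    )
```

This keeps the load as two set-based statements instead of per-row inserts from the dataframe.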

schrockn

02/19/2020, 12:05 AM
That context is super helpful. This is a bit subtle but I think the issue here is that snowflake solid is depending on a persisted serialization format rather than a dataframe

dwall

02/19/2020, 12:05 AM
yep ^

schrockn

02/19/2020, 12:05 AM
e.g. the fact that you are materializing something to parquet in a special format is business logic in this case
rather than an incidental operational concern

dwall

02/19/2020, 12:06 AM
for sure
like, I can see specifying the file path as an output config for the dataframe solid and then an input config for a snowflake load solid, but it doesn't feel great to me

schrockn

02/19/2020, 12:08 AM
Yeah, I don’t think it should depend upon config at the solid level
because in order for things to work you have to duplicate config in two places
How do you want the files persisted and how do you want to manage them?

dwall

02/19/2020, 12:09 AM
could be temporary

schrockn

02/19/2020, 12:09 AM
(like imagine if dagster wasn’t in the mix at all here)
on disk, object store, both?

dwall

02/19/2020, 12:10 AM
I like the idea of treating them as persistent artifacts for auditing. So materializing them to cloud storage is probably the optimal route
👍 1

schrockn

02/19/2020, 12:11 AM
so a resource seems like a good idea here
you can encapsulate all the logic and behavior you want there
configure it once
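The "configure it once" idea can be illustrated outside of Dagster: the resource owns the persistence policy, and every solid derives artifact locations from that single configuration instead of duplicating it. `ArtifactStore` and its naming scheme below are illustrative assumptions, not the Dagster API:

```python
# Hypothetical sketch of a resource that is configured once: it owns the
# storage policy (bucket + prefix naming), so no solid duplicates config.
class ArtifactStore:
    def __init__(self, bucket, prefix):
        self.bucket = bucket
        self.prefix = prefix

    def uri_for(self, run_id, name):
        # Every solid derives artifact locations the same way,
        # from the one place this was configured.
        return f"s3://{self.bucket}/{self.prefix}/{run_id}/{name}"

store = ArtifactStore(bucket='audit-artifacts', prefix='loads')
uri = store.uri_for('run_42', 'trips.parquet')
```

In Dagster terms, the pipeline would construct this once from resource config and solids would reach it via the context.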

dwall

02/19/2020, 12:12 AM
how would the information about the location of the artifact within the resource flow from solid to solid?

schrockn

02/19/2020, 12:12 AM
the solids always have access to the resource
context.resources.your_file_store

dwall

02/19/2020, 12:13 AM
yeah totally, but I'm saying how would a downstream solid know where to look
unless the upstream solid passes the file location as an output

schrockn

02/19/2020, 12:15 AM
pass a string between them that can be used as an argument into a method on the resource
👍 1
we have an abstraction which does something quite similar to this
the FileManager

dwall

02/19/2020, 12:15 AM
you have a link to an example?

schrockn

02/19/2020, 12:15 AM
and it deals with “Handle” objects
so you can swap in filesystem or object store implementations
depending on “mode”
it’s used in the airline demo. no super tight example.
let me see (it’s been a few months :-))
don’t have a super simple example, but unzip_file_handle in airline_demo.unzip_file_handle is the best we have right now
the solid works with s3 or the local file system
the basic idea is when you call file_manager.write
the system takes care of writing a file to some well-known spot (we currently support the file system and s3, but it's straightforward to extend)
and when you write, you get back a file handle which can be passed between solids
here’s a simpler test of this:
bar_bytes = 'bar'.encode()

@solid(output_defs=[OutputDefinition(S3FileHandle)])
def emit_file(context):
    return context.file_manager.write_data(bar_bytes)

@solid(input_defs=[InputDefinition('file_handle', S3FileHandle)])
def accept_file(context, file_handle):
    local_path = context.file_manager.copy_handle_to_local_temp(file_handle)
    assert isinstance(local_path, str)
    assert open(local_path, 'rb').read() == bar_bytes
here is a simpler example I just whipped up
but this is all pretty heavy and needs some work to be more approachable
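The handle-passing pattern in that test can be sketched in plain Python, with no Dagster APIs: write_data stores bytes at a well-known spot and returns an opaque handle, and a downstream step turns the handle back into a local path. The `LocalFileManager` class and handle format below mimic the names in the snippet but are illustrative assumptions, not the Dagster implementation:

```python
# Plain-Python sketch of the FileManager/handle idea: the manager owns the
# storage location; solids only pass a small handle between them.
import os
import tempfile

class LocalFileManager:
    def __init__(self, base_dir):
        self.base_dir = base_dir
        self._count = 0

    def write_data(self, data):
        # Write to a well-known spot and hand back an opaque handle.
        self._count += 1
        handle = f"file-{self._count}"
        with open(os.path.join(self.base_dir, handle), 'wb') as f:
            f.write(data)
        return handle

    def copy_handle_to_local_temp(self, handle):
        # Materialize the handle as a local temp path for a consumer.
        src = os.path.join(self.base_dir, handle)
        fd, dst = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as out, open(src, 'rb') as inp:
            out.write(inp.read())
        return dst

manager = LocalFileManager(tempfile.mkdtemp())
handle = manager.write_data(b'bar')                      # upstream "solid"
local_path = manager.copy_handle_to_local_temp(handle)   # downstream "solid"
assert open(local_path, 'rb').read() == b'bar'
```

Swapping `LocalFileManager` for an object-store-backed implementation is what the "mode" mechanism enables: the solids only ever see handles.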

dwall

02/19/2020, 1:03 AM
cool - this is still super helpful though. thanks @schrockn

Eric

02/19/2020, 6:57 PM
I have a question related to the new create_dagster_pandas_dataframe_type. The examples show its use as:
TripDataFrame = create_dagster_pandas_dataframe_type(
    name='TripDataFrame',
    columns=[
        PandasColumn.integer_column('bike_id', min_value=0),
        PandasColumn.categorical_column('color', categories={'red', 'green', 'blue'}),
        PandasColumn.datetime_column(
            'start_time', min_datetime=datetime(year=2020, month=2, day=10)
        ),
        PandasColumn.datetime_column('end_time', min_datetime=datetime(year=2020, month=2, day=10)),
        PandasColumn.string_column('station'),
        PandasColumn.exists('amount_paid'),
        PandasColumn.boolean_column('was_member'),
    ],
)
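Conceptually, each PandasColumn in a spec like this bundles a column name with constraint checks run over the column's values. A stdlib-only sketch of the same declarative idea (helper names are hypothetical, and a dict-of-lists stands in for a real DataFrame; dagster-pandas is not used here):

```python
# Stdlib-only sketch of declarative column constraints: each column spec
# pairs a name with a check over that column's values. Names hypothetical.
def integer_column(name, min_value=None):
    def check(values):
        return all(
            isinstance(v, int) and (min_value is None or v >= min_value)
            for v in values
        )
    return (name, check)

def categorical_column(name, categories):
    return (name, lambda values: all(v in categories for v in values))

def validate(frame, columns):
    # `frame` is a dict mapping column name -> list of values.
    return all(name in frame and check(frame[name]) for name, check in columns)

spec = [
    integer_column('bike_id', min_value=0),
    categorical_column('color', categories={'red', 'green', 'blue'}),
]
assert validate({'bike_id': [3, 7], 'color': ['red', 'blue']}, spec)
assert not validate({'bike_id': [-1], 'color': ['red']}, spec)
```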
How would you define an input_hydration_config for the new data type created with create_dagster_pandas_dataframe_type?

abhi

02/19/2020, 7:42 PM
Hi Eric. Great catch. That was an oversight on my part. I have a revision out for it now. I will notify you when it goes through!
👍 2
It’s out! See 0.7.1

dwall

02/20/2020, 11:35 PM
celebrate