or should i just go with <https://docs.dagster.io/...
# ask-community
update: i’ll go with the second approach -- the gcs file handler seems way too experimental and not easy to work with
s
file managers are used to write data inside ops/assets-- like chris said above, they’re an old pattern. IO managers handle passing data between ops/assets. There is a gcs IO manager
PickledObjectGCSIOManager
which is probably what you want: https://github.com/dagster-io/dagster/blame/master/python_modules/libraries/dagster-gcp/dagster_gcp/gcs/io_manager.py
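Roughly, wiring that up looks something like the sketch below (untested; the asset, bucket, and prefix are placeholders -- `gcs_pickle_io_manager` is the resource-style entry point that builds the pickled-object GCS IO manager under the hood):

```python
from dagster import Definitions, asset
from dagster_gcp.gcs import gcs_pickle_io_manager, gcs_resource

@asset
def my_numbers():
    # the return value is pickled to GCS by the IO manager; no manual writes needed
    return [1, 2, 3]

defs = Definitions(
    assets=[my_numbers],
    resources={
        # bucket and prefix values are placeholders
        "io_manager": gcs_pickle_io_manager.configured(
            {"gcs_bucket": "my-bucket", "gcs_prefix": "dagster-io"}
        ),
        "gcs": gcs_resource,
    },
)
```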
b
So i’m actually reading in files that someone else is posting, and the pickled object gcs manager doesn’t cut it
correct me if i’m wrong
s
ah yes that’s right-- if you want to read arbitrary files I think you should use a custom IO manager-- you can use the file manager here if you like, or you can just use any other client API for GCS
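A rough sketch of what that custom IO manager could look like, using the plain `google-cloud-storage` client and keying blobs by asset key (the class/resource names and the bytes-in/bytes-out contract are just assumptions for illustration):

```python
from dagster import IOManager, io_manager
from google.cloud import storage

class GCSBytesIOManager(IOManager):
    """Reads/writes raw bytes in a GCS bucket, keyed by the asset key path."""

    def __init__(self, bucket: str):
        self._bucket = storage.Client().bucket(bucket)

    def _blob(self, context):
        return self._bucket.blob("/".join(context.asset_key.path))

    def handle_output(self, context, obj: bytes):
        self._blob(context).upload_from_string(obj)

    def load_input(self, context) -> bytes:
        return self._blob(context).download_as_bytes()

@io_manager(config_schema={"bucket": str})
def gcs_bytes_io_manager(init_context):
    return GCSBytesIOManager(bucket=init_context.resource_config["bucket"])
```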
b
Given that the format is also weird, i think i might just write a client that loads from gcs, since it doesn’t really fit the io manager workflow either
Would it be considered “bad practice” if i just used the client from the gcs resource and loaded the files from the bucket within an asset?
s
Plenty of people do I/O inside their assets, though we’re striving to develop our IO management layer to the point that that’s not necessary. What is it about your case that doesn’t fit IO managers? I would think you could model the inputs you want to load as source assets and use the GCS client inside the
load_input
of a custom IO manager.
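Concretely, that might look like the following (a sketch only -- it assumes the `gcs_bytes_io_manager` from the snippet above lives in the same module, and the asset/resource names are made up):

```python
from dagster import Definitions, SourceAsset, asset

# the file someone else drops in the bucket, modeled as a source asset whose
# bytes are fetched by load_input of the custom IO manager sketched above
vendor_file = SourceAsset(key="vendor_file", io_manager_key="gcs_loader")

@asset
def parsed_vendor_file(vendor_file: bytes):
    # placeholder parsing of the custom file format
    return vendor_file.decode("utf-8").splitlines()

defs = Definitions(
    assets=[vendor_file, parsed_vendor_file],
    resources={"gcs_loader": gcs_bytes_io_manager.configured({"bucket": "my-bucket"})},
)
```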
b
so the catch is that the files being posted to the bucket don’t have a name format i could know upfront. the use case is the following: at any point in the day, someone posts a file to the bucket with a timestamp from yesterday. I need to ingest that file, and since it’s a custom format i also need to parse things out of it.
I could technically write a custom io manager from scratch that works with partitions and figures out which file it needs to pull
the processed file needs to be posted into snowflake
s
I don’t see why you couldn’t do this with an IO manager. This pattern makes me think you should use a dynamically partitioned asset to represent the incoming files, with each file corresponding to an asset partition. You can use a sensor to detect and generate new partitions for each file. There is an example here: https://docs.dagster.io/concepts/partitions-schedules-sensors/partitions#dynamically-partitioned-assets Your IO manager would just load the file (with the filename given by the partition key), then you can do whatever you want with it and write to differently partitioned snowflake-based assets downstream.
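Something along these lines, based on the docs page linked above (a sketch with made-up names; the bucket is a placeholder, and here the asset downloads the blob itself rather than going through an IO manager):

```python
from dagster import (
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    asset,
    define_asset_job,
    sensor,
)
from google.cloud import storage

# one partition per file that lands in the bucket
incoming_files = DynamicPartitionsDefinition(name="incoming_files")

@asset(partitions_def=incoming_files)
def raw_vendor_file(context) -> bytes:
    # the partition key is the blob name the sensor registered below
    blob = storage.Client().bucket("my-bucket").blob(context.partition_key)
    return blob.download_as_bytes()

ingest_job = define_asset_job("ingest_vendor_file", selection="raw_vendor_file")

@sensor(job=ingest_job)
def vendor_file_sensor(context):
    # compare the bucket contents against partitions we've already registered
    seen = set(context.instance.get_dynamic_partitions("incoming_files"))
    new = [
        blob.name
        for blob in storage.Client().list_blobs("my-bucket")
        if blob.name not in seen
    ]
    return SensorResult(
        run_requests=[RunRequest(partition_key=name) for name in new],
        dynamic_partitions_requests=[incoming_files.build_add_request(new)],
    )
```

The sensor adds the new blob names as partitions and requests a run per partition in the same SensorResult, so the snowflake-facing assets can hang off raw_vendor_file downstream with whatever partitioning they need.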
b
Gotten sidetracked with other stuff, i’ll start back on this tomorrow - it might just be that it’s my lack of knowledge that’s stopping me - i’ll give it a try
Thanks again Sean, Chris as well - you’re super helpful