# ask-community
Ismael Rodrigues
I have a doubt: how do you all create ingestion pipelines for data that can't be loaded into memory using assets? With ops we have dynamic outputs, but how can you do it with assets?
y
This feels similar to my question. One possible approach, I think, is to make an asset that stores the id of the result rather than the data itself, e.g. a BigQuery table id. But that doesn't seem to follow the concepts of Dagster: https://dagster.slack.com/archives/C01U954MEER/p1684657832055289
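Something like this, roughly (just a sketch; the table id and the load step are illustrative, not a real pipeline):

```python
from dagster import asset

@asset
def ingested_table() -> str:
    # Run the heavy ingestion entirely inside BigQuery, so the rows
    # never pass through this process's memory.
    table_id = "my-project.my_dataset.raw_events"  # illustrative table id
    # ... kick off a BigQuery load job / CTAS statement here ...
    return table_id  # downstream assets receive only the id string
```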
jamie
Hey @Ismael Rodrigues, is the issue that the asset is too large to be stored in memory (for example, a table with a trillion rows), but you still want to apply the concepts of assets and treat it like an asset in Dagster?
Ismael Rodrigues
@jamie My question is about dealing with GBs of data. For example, if I need to build a job that ingests a table with about 90 million rows, with ops this is easily achieved with DynamicOutput: I can read the table in chunks and then process each chunk afterwards. With assets I can't read the data in chunks, because there's no equivalent of DynamicOut for assets, so if my machine has 8 GB of RAM, that's the limit of the dataset I can ingest with an asset. My question is about understanding how people deal with tons of data using only assets. Even if you load by partitions, a particular partition may have so much data that it won't fit in memory.
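Roughly what I mean with ops, as a minimal sketch (the chunked reader and the per-chunk work are stand-ins for real I/O):

```python
from dagster import DynamicOut, DynamicOutput, job, op

def read_table_in_chunks(chunk_size):
    # Stand-in for a chunked reader, e.g. pandas.read_sql with chunksize.
    for start in range(0, 1_000_000, chunk_size):
        yield list(range(start, start + chunk_size))

@op(out=DynamicOut())
def load_chunks():
    # Emit one DynamicOutput per chunk, so only one chunk is held
    # in memory at a time and chunks can be processed in parallel.
    for i, chunk in enumerate(read_table_in_chunks(chunk_size=100_000)):
        yield DynamicOutput(chunk, mapping_key=f"chunk_{i}")

@op
def ingest_chunk(chunk):
    # Stand-in for per-chunk work, e.g. uploading the chunk to S3.
    print(f"ingesting {len(chunk)} rows")

@job
def ingestion_job():
    load_chunks().map(ingest_chunk)
```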
All my jobs today use ops and jobs, and sometimes I feel kinda sad about it, because assets are the core concept of Dagster. I was hoping to develop assets from now on instead of ops and jobs, but I found myself in this dilemma, where I couldn't even load a CSV file with 500k rows because, as a DataFrame, it consumes a lot of RAM.
jamie
yep, that makes a ton of sense! this is a use case we are actively thinking about right now and is one of the higher-priority projects on our roadmap. One of the options you have right now is to take advantage of the `non_argument_deps` parameter on `@asset` to set up the dependency structure between your assets without pulling the data corresponding to the asset into memory. The function body of the asset can then just execute some SQL (or whatever) against the externally stored asset data: https://docs.dagster.io/concepts/assets/software-defined-assets#non-argument-dependencies
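A rough sketch of the shape (here `execute_sql` is just a stand-in for whatever client runs your SQL):

```python
from dagster import asset

def execute_sql(query: str) -> None:
    # Stand-in for your warehouse client, e.g. a BigQuery or Snowflake call.
    print(f"running: {query}")

@asset
def source_table() -> None:
    # The table is built by a job that runs in the warehouse; nothing is
    # returned, so no rows enter this process's memory.
    execute_sql("CREATE OR REPLACE TABLE source_table AS SELECT ...")

@asset(non_argument_deps={"source_table"})
def summary_table() -> None:
    # Depends on source_table for ordering only; the aggregation
    # happens in the database, not in Dagster's process.
    execute_sql("CREATE OR REPLACE TABLE summary_table AS SELECT ... FROM source_table")
```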
Ismael Rodrigues
Yeah, but suppose I need to upload this data to S3; with this alternative I can't. But I do understand that this is already on your roadmap, so it's just a matter of waiting a little. Until then, I'll use assets for low volumes of data and ops and jobs for high volumes.