# ask-community
j
Still making a proof of concept for myself, and since I'm clear that I can't run parallel ops while using the in-process IO manager, I'm hoping to get some advice on how to approach a problem like this: is the idea to consolidate 2B1 + 2B2? And is there some way to chain IO managers that I'm not aware of? Thank you!
c
I think the ideal thing might be to take a step back and think about what the core outputs are that you care about, then structure things around those. You can parallelize IO into the same op/asset by having multiple assets with different IO manager keys.
j
Hmm, the main thing I'm trying to do is add to a table incrementally. But an increment might include half a day's data, so it's easier to just remove that the next day and add the full day's worth. So the logic was: if the two things don't depend on each other, I can run them in parallel.
But that move from the 2B1 dataframe into 2B2 is done in memory... and I thought the idea was to break things up.
Conceptually it's kind of difficult to see how I can make the final table an asset when it involves this many moves beforehand.
c
I don’t think I understand the full picture of what you’re trying to do - so it’s a bit hard to provide structure recommendations. Can you lay out the full scope of the job?
j
There's a table I'm trying to add data to incrementally: basically the most recent usage data. If I query BigQuery for it at 2pm, I get today's data up to 2pm. If I query the same data tomorrow at 3pm, I get all of today's data (up to and after 2pm) plus tomorrow's until 3pm, so today's data up to 2pm gets double-counted. I could dedupe by time or something, but it's easier by date, and really it's a good lesson for me on Dagster. So on the left side of the graph I'm deleting yesterday's existing data (say it only goes to 4pm), and on the right side I'm adding new data starting from yesterday (so the full day). There aren't any row IDs for me to use, so I'm going by dates.
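The delete-then-re-insert-by-date pattern described above can be sketched as a pair of SQL statement builders. This is plain Python with hypothetical table and column names (`usage_date` etc.); the real job would send these strings to BigQuery.

```python
from datetime import date

def delete_stmt(table: str, start: date) -> str:
    # Remove rows that may be partial (yesterday's half-day load).
    return (
        f"DELETE FROM `{table}` "
        f"WHERE usage_date >= DATE('{start.isoformat()}')"
    )

def insert_stmt(table: str, source: str, start: date) -> str:
    # Re-insert the full range from the source starting at the same
    # date, so yesterday is now complete and today is current.
    return (
        f"INSERT INTO `{table}` "
        f"SELECT * FROM `{source}` "
        f"WHERE usage_date >= DATE('{start.isoformat()}')"
    )

# In practice start would be date.today() minus one day.
yesterday = date(2023, 5, 1)
delete_sql = delete_stmt("proj.ds.usage", yesterday)
insert_sql = insert_stmt("proj.ds.usage", "proj.ds.usage_raw", yesterday)
```

One caveat with running these as independent parallel branches: the delete must finish before the insert lands, or the insert's rows get wiped, so in practice the insert side should depend on the delete side rather than run fully in parallel.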
c
Any reason 2B1 and 2B2 can't just be consolidated: use BigQuery to retrieve the data as a dataframe, and then the IO manager converts the dataframe to Parquet and drops it to S3? It doesn't seem like the IO manager in 2B1 is really doing anything.
j
Hey, sorry, totally missed this! I mean, I could consolidate the entire thing into one Python script, but I was trying to split business logic from IO. 2B1 is an op that takes arguments and fills out a SQL template, which it then sends to the IO manager to hit BigQuery. 2B2 is an IO manager that takes different S3 arguments, so I can reuse it for other jobs.
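The split described above, where the op only fills a SQL template and the IO layer runs it, can be sketched in plain Python (no Dagster imports); the template text and parameter names are made up for illustration.

```python
from string import Template

# Business logic: produce the query text only; no IO happens here.
QUERY_TEMPLATE = Template(
    "SELECT * FROM `$table` WHERE usage_date >= DATE('$start_date')"
)

def build_query(table: str, start_date: str) -> str:
    return QUERY_TEMPLATE.substitute(table=table, start_date=start_date)

# The IO manager would receive this string and execute it against
# BigQuery, then hand the resulting dataframe downstream.
sql = build_query("proj.ds.usage_raw", "2023-05-01")
```

Keeping the op free of IO like this is what makes the S3 side reusable: the same IO manager can accept a different bucket/prefix per job while every op stays a pure function of its arguments.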