# ask-community
j
Still making a proof of concept for myself, and since I'm clear that I can't run parallel ops while using the in-process IO manager, I'm hoping to get some advice on how to approach a problem like this: is the idea to consolidate 2B1 + 2B2? And is there some way to chain IO managers that I'm not aware of? Thank you!
c
I think the ideal thing might be to take a step back and think about what the core outputs are that you care about, then structure things around those. You can parallelize IO into the same op/asset by having multiple assets with different IO manager keys.
j
Hmm, the main thing I'm trying to do is add to a table incrementally. But an increment might include half a day's data, so it's easier to just remove that the next day and add the full day's worth. So the logic was: if the two things don't depend on each other, I can run them in parallel.
But that move from the 2B1 dataframe into 2B2 is done in memory... and I thought the idea was to break things up.
Conceptually it's kind of difficult to see how I can make the final table an asset when it involves this many moves beforehand.
c
I don’t think I understand the full picture of what you’re trying to do - so it’s a bit hard to provide structure recommendations. Can you lay out the full scope of the job?
j
There's a table I'm trying to add data to incrementally: basically the most recent usage data. If I query BigQuery for it at 2pm, I get today's data up to 2pm. If I query the same data tomorrow at 3pm, I get all of today's data (up to and after 2pm) plus tomorrow's until 3pm, so today's data up to 2pm gets double-counted. I could dedupe by time or something, but it's easier by date, and really it's a good lesson for me on Dagster. So on the left side of the graph I'm deleting yesterday's existing data (say it only goes to 4pm), and on the right side I'm adding new data starting from yesterday (so the full day). There aren't any row IDs for me to use, so I'm going by dates.
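The delete-then-re-insert-by-date pattern described above can be sketched as a pair of SQL statement builders. This is plain Python with hypothetical table and column names (`usage_date` etc.); the real job would send these strings to BigQuery.

```python
from datetime import date

def delete_stmt(table: str, start: date) -> str:
    # Remove rows that may be partial (yesterday's half-day load).
    return (
        f"DELETE FROM `{table}` "
        f"WHERE usage_date >= DATE('{start.isoformat()}')"
    )

def insert_stmt(table: str, source: str, start: date) -> str:
    # Re-insert the full range from the source starting at the same
    # date, so yesterday is now complete and today is current.
    return (
        f"INSERT INTO `{table}` "
        f"SELECT * FROM `{source}` "
        f"WHERE usage_date >= DATE('{start.isoformat()}')"
    )

# In practice start would be date.today() minus one day.
yesterday = date(2023, 5, 1)
delete_sql = delete_stmt("proj.ds.usage", yesterday)
insert_sql = insert_stmt("proj.ds.usage", "proj.ds.usage_raw", yesterday)
```

One caveat with running these as independent parallel branches: the delete must finish before the insert lands, or the insert's rows get wiped, so in practice the insert side should depend on the delete side rather than run fully in parallel.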
c
Any reason 2B1 and 2B2 can't just be consolidated: use BigQuery to retrieve the data as a dataframe, and then the IO manager converts the dataframe to Parquet and drops it to S3? It doesn't seem like the IO manager in 2B1 is really doing anything.
j
Hey, sorry, totally missed this! I mean, I could consolidate the entire thing into one Python script, but I was trying to split business logic from IO. 2B1 is an op that takes arguments and fills out a SQL template, which it then sends to the IO manager to hit BigQuery. 2B2 is an IO manager that takes different S3 arguments, so I can reuse it for other jobs.
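The split described above, where the op only fills a SQL template and the IO layer runs it, can be sketched in plain Python (no Dagster imports); the template text and parameter names are made up for illustration.

```python
from string import Template

# Business logic: produce the query text only; no IO happens here.
QUERY_TEMPLATE = Template(
    "SELECT * FROM `$table` WHERE usage_date >= DATE('$start_date')"
)

def build_query(table: str, start_date: str) -> str:
    return QUERY_TEMPLATE.substitute(table=table, start_date=start_date)

# The IO manager would receive this string and execute it against
# BigQuery, then hand the resulting dataframe downstream.
sql = build_query("proj.ds.usage_raw", "2023-05-01")
```

Keeping the op free of IO like this is what makes the S3 side reusable: the same IO manager can accept a different bucket/prefix per job while every op stays a pure function of its arguments.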