# ask-community
a
Hi! How do you manage memory issues in simple ETL pipelines with Dagster? I am trying to do a simple extract from Postgres, transform into Elasticsearch dicts, load into Elasticsearch. The issue is that the extracted table from Postgres might be very large, so I would like to chunk the outputs. I tried doing this with the DynamicOutput solution, but it seems that it yields all the steps before continuing down the pipeline. I managed to get the transform and load steps to run in the correct order by using priority tags, but that does not really matter if the extract step yields all the outputs before continuing. I have tried recreating the downstream solids as a single composite solid, changing the IOManager to a pickle filesystem one, and tweaking the priority of the tasks. So far I have been unsuccessful, and the memory usage of all the different implementations has been about the same. As you can see in my screenshot, it seems it yields all rows before continuing. I would like to see yield -> load, yield -> load and so on. TLDR: How do I chunk outputs from a solid and let the downstream solids run in order for each output to limit memory usage of the pipeline?
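For context, this is roughly the shape of the pipeline (a minimal sketch; `read_table_in_chunks`, `row_to_es_doc`, and `bulk_index` are placeholders for the real Postgres and Elasticsearch code):

```python
from dagster import DynamicOutput, DynamicOutputDefinition, pipeline, solid


@solid(output_defs=[DynamicOutputDefinition()])
def extract(context):
    # Placeholder: stream the Postgres table in chunks instead of loading it all at once.
    for index, chunk in enumerate(read_table_in_chunks()):
        yield DynamicOutput(chunk, mapping_key=str(index))


@solid
def transform(context, chunk):
    # Placeholder: turn one chunk of rows into Elasticsearch-ready dicts.
    return [row_to_es_doc(row) for row in chunk]


@solid
def load(context, docs):
    # Placeholder: bulk-index one chunk into Elasticsearch.
    bulk_index(docs)


@pipeline
def etl_pipeline():
    # Fan out over the dynamic chunks; the hope is transform/load run per chunk.
    extract().map(lambda chunk: load(transform(chunk)))
```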
Found a related issue and added my issue there as well. https://github.com/dagster-io/dagster/issues/4200
a
I responded on the issue since I saw that first; I can follow up here if you have further questions.
a
Thanks! I continued the discussion on GitHub since we are probably in different timezones. I would really like to discuss the future roadmap concerning issues such as mine, since it might be a deal-breaker for my use case if Dagster is not able to handle chunked workloads. Let me know if that discussion is better suited elsewhere!
a
I believe we are only a few small changes away from allowing the IO manager approach to work. Moved to a targeted issue: https://github.com/dagster-io/dagster/issues/4262
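For reference, the IO manager approach being discussed would look roughly like this: write each output (including each dynamic chunk) to its own file rather than keeping it in process memory between steps. This is only a sketch under the 0.12-era API; the class name and `base_dir` config are illustrative, and the linked issue tracks the remaining changes needed for this to actually bound step memory.

```python
import os
import pickle

from dagster import IOManager, io_manager


class ChunkFileIOManager(IOManager):
    """Sketch: persist every output (per mapping_key for dynamic chunks) to its own
    pickle file so chunks are not held in memory between steps."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def _path(self, step_key, name, mapping_key):
        # mapping_key is None for non-dynamic outputs, so drop empty parts.
        fname = "__".join(filter(None, [step_key, name, mapping_key])) + ".pkl"
        return os.path.join(self.base_dir, fname)

    def handle_output(self, context, obj):
        os.makedirs(self.base_dir, exist_ok=True)
        path = self._path(context.step_key, context.name, context.mapping_key)
        with open(path, "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        up = context.upstream_output
        path = self._path(up.step_key, up.name, up.mapping_key)
        with open(path, "rb") as f:
            return pickle.load(f)


@io_manager(config_schema={"base_dir": str})
def chunk_file_io_manager(init_context):
    return ChunkFileIOManager(init_context.resource_config["base_dir"])
```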