# ask-community
a
Hi all! I'm working on a task that reads about 100k rows from a CSV file and stores them in a DB. Can anybody give me a hint on how to parallelize the store-to-DB operation with Dagster?
Copy code
from dagster import op


def process_clients_file():
    dataframe = get_dataframe()
    clients_array = process_dataframe(dataframe)
    store_clients(clients_array)


@op(required_resource_keys={"dgraph_manager"})
def store_clients(context, clients_array):
    # sequential insert of ~100k rows -- this is the loop I want to speed up
    for client in clients_array:
        context.resources.dgraph_manager.load_client(client)
I need to speed up that "for" loop, but I don't know if it's possible to do that in Dagster. Maybe I need to run this step in another container or something like that. Looking into the documentation I found that Dagster has support for Dask, but I also read "we use Dask to orchestrate execution of the steps in a job, not to parallelize computation within those steps.", so now I'm lost. Thanks!
🤖 1
i
a
Wow thanks!
o
just as a quick note, dynamic graphs are useful in this case, but I wouldn't recommend having a dynamic output for each individual line. In general, the overhead of spinning up a new process (seconds) is very high compared to the cost of inserting a row into a database (milliseconds or less). I'd recommend "bucketing" the rows so that each process handles somewhere around 1k-10k rows (the exact number depends on the specifics of your situation).
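For reference, a minimal sketch of what that bucketing could look like with Dagster's dynamic outputs. It assumes the get_dataframe / process_dataframe helpers and the dgraph_manager resource from the snippet above are defined elsewhere; the op names and chunk size here are illustrative, not code from the thread.
Copy code
from dagster import DynamicOut, DynamicOutput, job, op

CHUNK_SIZE = 2_000  # roughly the 1k-10k bucket size suggested above


@op
def load_clients_dataframe():
    # get_dataframe() is the helper from the snippet above
    return get_dataframe()


@op(out=DynamicOut())
def split_into_chunks(dataframe):
    # emit one DynamicOutput per bucket of rows; each bucket becomes its
    # own step at runtime, so the executor can run buckets in parallel
    for start in range(0, len(dataframe), CHUNK_SIZE):
        chunk = dataframe.iloc[start : start + CHUNK_SIZE]
        yield DynamicOutput(chunk, mapping_key=str(start))


@op(required_resource_keys={"dgraph_manager"})
def store_chunk(context, chunk):
    # process_dataframe() is the helper from the snippet above
    for client in process_dataframe(chunk):
        context.resources.dgraph_manager.load_client(client)


@job(resource_defs={"dgraph_manager": dgraph_manager})  # your existing resource
def process_clients_file():
    chunks = split_into_chunks(load_clients_dataframe())
    chunks.map(store_chunk)
Under the default multiprocess executor, each mapped chunk runs as its own step in its own process, so the DB inserts proceed in parallel across buckets.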
a
Hi Owen! I got it working by splitting the dataframe into chunks of 2k rows, and it works like a charm. Thanks!
o
great! glad it worked for you 🙂
🖖 1