# ask-community
a
Hi all! I'm working on a task that reads about 100k rows from a CSV file and stores them in a DB. Can anybody give me a hint on how to parallelize the store-to-DB operation with Dagster?
Copy code
from dagster import op


def process_clients_file():
    dataframe = get_dataframe()
    clients_array = process_dataframe(dataframe)
    store_clients(clients_array)


@op(required_resource_keys={"dgraph_manager"})
def store_clients(context, clients_array):
    # sequential insert of ~100k rows -- this is the loop I want to speed up
    for client in clients_array:
        context.resources.dgraph_manager.load_client(client)
I need to speed up that "for" loop, but I don't know if it's possible to do that in Dagster. Maybe I need to run this step in another container or something like that. Looking into the documentation I found that Dagster has support for Dask, but I also read "we use Dask to orchestrate execution of the steps in a job, not to parallelize computation within those steps.", so now I'm lost. Thanks!
🤖 1
i
a
Wow thanks!
o
just as a quick note, dynamic graphs are useful in this case, but I wouldn't recommend having a dynamic output for each individual line. In general, the overhead of spinning up a new process (seconds) is very high compared to the cost of inserting a row into a database (milliseconds or less). I'd recommend "bucketing" the rows so that each process handles somewhere around 1k-10k rows (the exact number depends on the specifics of your situation).
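For reference, a minimal sketch of what that bucketing could look like with Dagster's dynamic outputs. It assumes the get_dataframe / process_dataframe helpers and the dgraph_manager resource from the snippet above are defined elsewhere; the op names and chunk size here are illustrative, not code from the thread.
Copy code
from dagster import DynamicOut, DynamicOutput, job, op

CHUNK_SIZE = 2_000  # roughly the 1k-10k bucket size suggested above


@op
def load_clients_dataframe():
    # get_dataframe() is the helper from the snippet above
    return get_dataframe()


@op(out=DynamicOut())
def split_into_chunks(dataframe):
    # emit one DynamicOutput per bucket of rows; each bucket becomes its
    # own step at runtime, so the executor can run buckets in parallel
    for start in range(0, len(dataframe), CHUNK_SIZE):
        chunk = dataframe.iloc[start : start + CHUNK_SIZE]
        yield DynamicOutput(chunk, mapping_key=str(start))


@op(required_resource_keys={"dgraph_manager"})
def store_chunk(context, chunk):
    # process_dataframe() is the helper from the snippet above
    for client in process_dataframe(chunk):
        context.resources.dgraph_manager.load_client(client)


@job(resource_defs={"dgraph_manager": dgraph_manager})  # your existing resource
def process_clients_file():
    chunks = split_into_chunks(load_clients_dataframe())
    chunks.map(store_chunk)
Under the default multiprocess executor, each mapped chunk runs as its own step in its own process, so the DB inserts proceed in parallel across buckets.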
a
Hi Owen! I got it working by splitting the dataframe into chunks of 2k rows, and it works like a charm. Thanks!
o
great! glad it worked for you 🙂
🖖 1