# ask-community
c
Hello folks, in production we use the `gcs_pickle_io_manager`, since it's required by the multiprocess executor. Unfortunately, this makes one of our pipelines extremely slow compared to local development, where data is shared in memory. Nodes in our graph spend almost the whole time in "preparing", as you can see in the screenshot. This pipeline is scheduled every hour and takes almost one hour, so it's at risk of not completing in time. Can you explain what happens in that "preparing" phase that makes it take so long? And potentially ways to improve performance? I'm thinking of using the single-process executor for that job. Is it common to have some jobs run in multiprocess and some in single process? Would you recommend against that? If yes, why? I can't increase run concurrency because I would risk hitting BigQuery's concurrency limit.
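Concretely, what I have in mind is something like the sketch below (the op/graph names are just placeholders, not our real code): pin that one job to the in-process executor so outputs can stay in memory, while the other jobs keep the multiprocess executor.
Copy code
from dagster import graph, in_process_executor, mem_io_manager, op

@op
def run_partition_query():
    # placeholder for one of the small per-partition BigQuery queries
    return 0

@graph
def hourly_partitions():
    run_partition_query()

# everything runs serially in one process, so results can stay in memory
hourly_partitions_job = hourly_partitions.to_job(
    name="hourly_partitions_in_process",
    executor_def=in_process_executor,
    resource_defs={"io_manager": mem_io_manager},
)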
Here you can see from the timescale that we're waiting for minutes
z
Is it not just waiting for data transfer? We're comparing local in-memory time with object store and serialization time. Depending on the size and structure of the objects, it may be entirely I/O.
c
If the data volumes were high that would make sense but here it’s not even 1 kilobyte for each of those blue bars. I wouldn’t even expect 1 second to download so little data.
Excluding opening the connection maybe
z
I guess what I'm confused about is why there are so many small jobs running in parallel? Part of me is wondering if there's a better way to express this. If you do want to debug it, you could potentially set this up to run locally with the different IO manager and throw in some breakpoints, or run it with py-spy or another profiler to see what's going on.
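As a rough sketch of what I mean (the module path and job name below are hypothetical; point them at your own code), build a local copy of the job where you can swap the IO manager, then run it under a debugger or profiler:
Copy code
from dagster import fs_io_manager

from my_project.graphs import my_graph  # hypothetical import; use your own graph

# local copy of the production job; swap the IO manager in or out to compare
debug_job = my_graph.to_job(
    name="my_job_debug",
    resource_defs={"io_manager": fs_io_manager},  # or the GCS pickle IO manager to reproduce prod
)

if __name__ == "__main__":
    # run under a debugger, or profile it, e.g.: py-spy record -o profile.svg -- python debug_run.py
    debug_job.execute_in_process()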
c
Good question. Basically, we have hundreds of partitions, and each partition becomes a suite of small ops. Those execute very quickly. We are essentially breaking down a huge query into multiple smaller ones, because BigQuery doesn't accept the full one. There is surely a better way to express this, but it works really well for us and seems to align with Dagster's philosophy of breaking down complex pipelines into simple reusable ops. The real problem is that I can't understand why Dagster waits minutes to read such small volumes from object storage. It almost seems like there is a problem in the scheduler. I can't really set a breakpoint since this is not my code but Dagster internals. Unless I can?
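To make the shape concrete, it's roughly this dynamic fan-out pattern (the op names are simplified placeholders; the real per-partition work is a BigQuery query):
Copy code
from dagster import DynamicOut, DynamicOutput, graph, op

@op(out=DynamicOut())
def split_into_partitions():
    # hundreds of partition keys in the real job
    for key in ["part_0", "part_1"]:
        yield DynamicOutput(key, mapping_key=key)

@op
def query_partition(partition_key: str) -> int:
    # placeholder for one small BigQuery query
    return 0

@op
def combine(results):
    return sum(results)

@graph
def chunked_query():
    combine(split_into_partitions().map(query_partition).collect())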
z
Dagster's ability to run locally and to poke at Dagster internals is one of the biggest reasons I use Dagster 🙂. You can use the `dagster dev` command to run it locally. We also have debug profiles set up to do this in VS Code, and have resources parameterized so that we can control which settings/resources are used via environment variables (e.g. local in-memory testing by default, staging/'live' tests when it's set to one value, and final prod when it's set to another). Setting breakpoints to figure out exactly where the time is going may be a bit challenging, though (I mean, you could just keep stepping until you find it).
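The resource parameterization is roughly this shape (the DAGSTER_DEPLOYMENT variable, bucket name, and resource keys are just how we happen to do it, so treat this as a sketch rather than a drop-in):
Copy code
import os

from dagster import fs_io_manager, graph, op
from dagster_gcp import gcs_pickle_io_manager, gcs_resource

DEPLOYMENT = os.getenv("DAGSTER_DEPLOYMENT", "local")

RESOURCES = {
    # local filesystem pickling by default
    "local": {"io_manager": fs_io_manager},
    # staging/prod swap in the GCS-backed IO manager
    "staging": {
        "io_manager": gcs_pickle_io_manager.configured({"gcs_bucket": "my-staging-bucket"}),
        "gcs": gcs_resource,
    },
}

@op
def example_op():
    return 1

@graph
def example_graph():
    example_op()

example_job = example_graph.to_job(resource_defs=RESOURCES[DEPLOYMENT])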
c
Right, it seems a bit complicated to figure out where to set a breakpoint, but I could try
What are those debug profiles?
z
I don't have an answer on why Dagster is doing this (I am not an Elementl engineer), but I'm wondering if there could be some scaling issue with running so many parts simultaneously. Can you try the same graph/job and restrict how many ops run concurrently even further? Does that help reduce the latency between ops?

I also personally think about things a bit differently, with ops and assets being outcomes that we want to orchestrate with Dagster: an asset may be a table or dataset in a data lake that the job generates (even if one partition at a time), while ops are things that have some outcome but aren't a deliverable on their own. For example, one op we have optimizes our Delta Lake tables (e.g. reorganizes them for performance, but doesn't create/delete data). I'm not aware if this is the de jure understanding or not.
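For the "restrict concurrency" experiment, the knob I mean is the executor's max_concurrent setting; a minimal sketch (the limit of 2 is arbitrary, and the op/graph names are placeholders):
Copy code
from dagster import graph, multiprocess_executor, op

@op
def small_op():
    return 1

@graph
def fan_out():
    small_op()

# same graph, but cap how many op subprocesses run at once
capped_job = fan_out.to_job(
    name="fan_out_capped",
    executor_def=multiprocess_executor.configured({"max_concurrent": 2}),
)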
Here's an example of a VS Code launch profile to run dagit for our 'staging' setting. That being said, YMMV, since you may have configured your environment/resources differently:
Copy code
"name": "STAGING - Debug Dagit",
            "type": "python",
            "request": "launch",
            "module": "dagit",
            "cwd": "${workspaceFolder}",
            "args": [
                "-w",
                "workspace.yaml"
            ],
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "DAGSTER_DEPLOYMENT": "staging",
                "DAGSTER_CLOUD_IS_BRANCH_DEPLOYMENT": "1",
                "DAGSTER_CLOUD_PULL_REQUEST_ID": "0"
            }
        },
(also note, this uses `dagit` instead of `dagster dev`; I've been procrastinating on updating them 😬)
c
Ah sorry, I thought you were from Elementl 😅 When running locally with the single-process executor it goes really quickly. I like your idea and will dive into this tomorrow when I'm back on my laptop. Thanks.
I agree with you: my entire graph is an asset, but each smaller step is an op.
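Roughly like this, i.e. a graph-backed asset where the whole graph is exposed as one asset and each step inside stays an op (names are placeholders):
Copy code
from dagster import AssetsDefinition, graph, op

@op
def fetch_chunk():
    return [1, 2, 3]

@op
def merge_chunks(chunk):
    return sum(chunk)

@graph
def merged_table():
    return merge_chunks(fetch_chunk())

# the graph as a whole is the asset; the steps inside remain ops
merged_table_asset = AssetsDefinition.from_graph(merged_table)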
j
I am not an expert, but here is my understanding: when you use the multiprocess_executor, between the execution of each asset/op there is a kind of reload of all your workspace code to find the next step to execute. So if your workspace contains a lot of jobs whose construction involves long API calls, that may explain the overhead you're seeing. (I don't know why this behavior exists, and I personally find it quite restrictive for large-scale parallelization.) With the in_process_executor there is no such constraint, but the steps run serially, which can end up being faster than running them in parallel.
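If that's what's happening, one mitigation is keeping long API calls out of module import / job-build time, since every step subprocess re-imports the code location. A sketch (fetch_table_list is a hypothetical stand-in for whatever slow call your workspace makes while building jobs):
Copy code
from functools import lru_cache

def fetch_table_list():
    # placeholder for a slow external API call used while building jobs
    return ["table_a", "table_b"]

# slow pattern: a module-level call is paid again every time a step subprocess imports the module
# TABLES = fetch_table_list()

# faster pattern: defer and cache, so only the processes that actually need the list pay, once each
@lru_cache(maxsize=1)
def get_tables():
    return fetch_table_list()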
c
Interesting. I guess let’s wait for someone from the Elementl team to chime in here. This seems like a bug to me.