# ask-community
d
is it possible to launch a run in a docker container, and then have each fanned out op step be done in_process or multi_process? I'm not sure how the DockerRunLauncher interacts with the executor for each op step.
d
Hi Danny - doing it multi_process is actually the default behavior if you use the DockerRunLauncher - it still uses the default executor which is described here: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
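To make that concrete, here's a sketch of the run config the default executor accepts (per that docs page), written as Python dicts; the variable names are just placeholders:
```python
# Sketch: run config for the default job executor, per the linked docs.

# Multiprocess (the default), with an explicit cap on concurrent subprocesses:
multiprocess_run_config = {
    "execution": {"config": {"multiprocess": {"max_concurrent": 4}}}
}

# In-process: every step runs serially inside the run worker's own process:
in_process_run_config = {
    "execution": {"config": {"in_process": {}}}
}
```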
d
we have an `Engine_Event` saying `Starting execution with step handler DockerStepHandler.`
Does that launch a new container for each step? How do I configure it not to do that for each step?
d
I would only expect that to show up if you've explicitly enabled the docker_executor in your job
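Explicitly opting in looks roughly like this (`my_op` and `my_docker_job` are placeholder names):
```python
from dagster import job, op
from dagster_docker import docker_executor


@op
def my_op():
    ...


# Placeholder job that explicitly opts into the docker_executor; with this,
# each step is launched in its own container, which is when you'd see the
# DockerStepHandler engine event.
@job(executor_def=docker_executor)
def my_docker_job():
    my_op()
```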
d
I have a job defined on a sensor that specifies that executor, but does the UI use that job for the asset as well?
d
I wouldn't expect it to if it's solely defined on the sensor - any chance you can share the code?
d
here's the graph of ops and the asset definition built from the graph:
```python
from dagster import AssetKey, AssetsDefinition, graph, mem_io_manager

@graph
def full_scored_data_set(recruiter_teams_to_score, trained_model):
    """chunk the recruiters to score and score them in batches"""
    # result = (
    #     chunking_op(recruiter_teams_to_score)
    #     .map(lambda batch: generate_score_for_ids(batch, trained_model))
    #     .collect()
    # )
    result = (
        key_to_score(recruiter_teams_to_score)
        .map(
            lambda key: score_data_set(
                data_to_score_for_key(recruiter_in_batch=key), trained_model
            )
        )
        .collect()
    )
    return append_scores(result)


graph_asset = AssetsDefinition.from_graph(
    full_scored_data_set,
    keys_by_input_name={
        "recruiter_teams_to_score": AssetKey("recruiter_teams_to_score"),
        "trained_model": AssetKey("trained_model"),
    },
    resource_defs={"mem": mem_io_manager},
)
```
here's the sensor definition:
```python
distributed_scoring_sensor(
    define_asset_job(
        "distributed_scoring_job",
        selection=AssetSelection.groups("distributed_scoring"),
        executor_def=docker_executor,
    )
),
```
d
If you're launching a job called "distributed_scoring_job" in dagit, I would expect it to use that executor config
d
I'm launching it via the "Materialize" button in the UI
d
And what's "it" here, precisely? The "distributed_scoring_job" job?
d
the `full_scored_data_set` asset
d
Hmm, I'll ask around about this with somebody who will know this for sure
d
thanks!
our end goal is to have the run spun up in a Docker container and have `data_to_score_for_key` and `score_data_set` run in-process so we can use the in-memory IO manager (we don't want to persist the data from `data_to_score_for_key` as it's a lot of data to persist to disk). If that's not possible, I'm guessing I would need a custom IO manager that could store the data in a file and then clean up the file after the downstream op ingests it?
d
is there also a Definitions object?
(just double-checking that you didn't define a default executor there)
d
ah yep, I have it in the Definitions object
d
ah that makes more sense
d
would removing that still have the run launched in a Docker container?
and then each step would use the default multiprocess executor?
d
yeah, that's controlled by the DockerRunLauncher
d
excellent! thanks for the help!
ah, so now the executor is only spinning up 4 processes at a time. Is it possible to raise that default concurrency limit for every run launched from the UI?
d
Yeah take a look at the examples with max_concurrent on that page: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
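e.g., for a single run from the Launchpad, something along these lines (written as a Python dict here; the Launchpad takes the equivalent YAML):
```python
# Roughly: raise the default executor's step concurrency for one run
run_config = {
    "execution": {"config": {"multiprocess": {"max_concurrent": 8}}}
}
```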
d
right, but can I pass that config to the Definitions object somehow? Or would it be loaded in via the dagster.yaml config?
d
Yeah, you can set default configuration by doing e.g.
```python
from dagster import Definitions, multiprocess_executor

defs = Definitions(
    ...,
    executor=multiprocess_executor.configured({"max_concurrent": 8}),
)
```
d
ah, perfect. thanks so much!
so I take it it's not possible to have the `data_to_score_for_key` op run under the multiprocess executor but have the `score_data_set` op run in-process? I was hoping to do that so that the output of `data_to_score_for_key` could be persisted only in memory until the `score_data_set` op is finished.
if it's not possible, would a custom IO manager that cleans up its own files be the better solution?
d
Splitting out executors per op isn't possible yet, but it's something that's on our radar for a future improvement
What you're describing does sound more like an IO manager
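A rough sketch of that idea, for illustration: a hypothetical `ephemeral_file_io_manager` that pickles each output to a temp file and deletes the file as soon as the downstream op loads it (this assumes exactly one downstream consumer per output and no retries that need to re-read it):
```python
import os
import pickle
import tempfile

from dagster import IOManager, InputContext, OutputContext, io_manager


class EphemeralFileIOManager(IOManager):
    """Spill each output to disk, then delete it as soon as it's consumed."""

    def _path(self, context) -> str:
        # get_identifier() is unique per run/step/output, so files don't collide
        return os.path.join(
            tempfile.gettempdir(), "_".join(context.get_identifier()) + ".pkl"
        )

    def handle_output(self, context: OutputContext, obj) -> None:
        with open(self._path(context), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context: InputContext):
        path = self._path(context.upstream_output)
        with open(path, "rb") as f:
            obj = pickle.load(f)
        # Clean up immediately; this is unsafe if more than one op consumes
        # the output, or if a retry needs to re-read it.
        os.remove(path)
        return obj


@io_manager
def ephemeral_file_io_manager(_init_context):
    return EphemeralFileIOManager()
```
You'd wire it into the graph-backed asset through `resource_defs`, the same way the `mem` key is attached above.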
d
Hm this is a bit more out of my wheelhouse, could you maybe make a new post for the IOManager piece and our OSS oncall who knows these things better can weigh in?
d
absolutely, thanks for your help with all of this! Absolutely loving Dagster so far