Vinnie
06/10/2022, 11:25 AM
I'm building an S3 client (via an STS AssumeRole) in a sensor with the build_resources() function so I can pass it to the get_s3_keys() function and then to the job/op that needs to download the file, so I don't need to assume the role twice. When passing this resource, though, I get the error "Object of type S3 is not JSON serializable". I assume this is because Dagster tries to convert the S3 resource into YAML for the job configuration, so passing the S3 client is likely not the "correct" way to go about things. I wouldn't like to pass the STS credentials, as, well, they're credentials.
The question is: what is the "proper" way to use this S3 client pattern as a resource? Or to avoid having the AssumeRole call go out multiple times during execution? I couldn't find anything in the docs, but is it possible to pass an argument to a job so it can be forwarded to the ops? Something like the following:
@job
def foo(s3_client):
    downloaded_file_path = download_from_s3(s3_client)
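For context, the setup being described might look roughly like this non-authoritative sketch, assuming boto3 for the AssumeRole call and dagster_aws's get_s3_keys helper; the role ARN, bucket, and download_job are hypothetical placeholders, and a plain boto3 client is used instead of build_resources for brevity:

import boto3
from dagster import RunRequest, sensor
from dagster_aws.s3.sensor import get_s3_keys

ROLE_ARN = "arn:aws:iam::123456789012:role/example-role"  # hypothetical
BUCKET = "example-bucket"  # hypothetical

@sensor(job=download_job)  # download_job: the job containing the download op, defined elsewhere
def new_s3_keys(context):
    # AssumeRole once per sensor tick
    creds = boto3.client("sts").assume_role(
        RoleArn=ROLE_ARN, RoleSessionName="dagster-sensor"
    )["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    new_keys = get_s3_keys(bucket=BUCKET, since_key=context.cursor, s3_session=s3)
    for key in new_keys:
        yield RunRequest(run_key=key)
    if new_keys:
        context.update_cursor(new_keys[-1])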
johann
06/10/2022, 3:45 PM
> Object of type S3 is not JSON serializable
When the sensor emits a run request, that run will start in a new subprocess of the gRPC container (or elsewhere if you've configured a run launcher, e.g. a K8s Job or an ECS task). There's currently no way to share a resource across that process boundary. If you include the S3 client resource on the job (by passing config rather than the actual Python object), it will be reinitialized by every op that needs it (because with the default multiprocess_executor, every op runs in a new subprocess). With the in_process_executor, the resource will only be initialized once for the run and shared across ops.
> I wouldn't like to pass the STS credentials, as, well, they're credentials
A few options here. A common approach is to surface them in the environment of wherever your job is running - this might be done with something like Vault, or by attaching secrets on K8s, etc. - and then use StringSource in your resource to get the config from the env.
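A minimal, non-authoritative sketch of that pattern, assuming a hypothetical resource that performs the AssumeRole itself and reads the role ARN from an environment variable via StringSource:

import boto3
from dagster import Field, StringSource, resource

@resource(config_schema={"role_arn": Field(StringSource)})
def assumed_role_s3_resource(init_context):
    # The AssumeRole call happens here, once per resource initialization
    creds = boto3.client("sts").assume_role(
        RoleArn=init_context.resource_config["role_arn"],
        RoleSessionName="dagster",
    )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

In run config this resource would then point at the env var rather than containing the value, e.g. {"resources": {"s3": {"config": {"role_arn": {"env": "DOWNLOAD_ROLE_ARN"}}}}}, with DOWNLOAD_ROLE_ARN being a hypothetical variable name.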
Example config schema: https://github.com/dagster-io/dagster/blob/7ecd4b3d3b28d0852ebec92658442b1cbd15f03[…]odules/libraries/dagster-snowflake/dagster_snowflake/configs.py
Example of using the schema, pointing at an env var: https://github.com/dagster-io/dagster/blob/7207a6e2dc3fd3a6e9705ca361b9f5a18204c1e[…]ules/libraries/dagster-snowflake/dagster_snowflake/resources.py
Vinnie
06/10/2022, 5:44 PM
Surfacing the credentials in the environment would need a try-catch block in my case, since they time out after a few minutes. I guess I'll just pass the role name and run the AssumeRole in the op that needs it. If I do need it again in subsequent ops, I can turn the whole AssumeRole call into a single op and pass the returned value; that seemed to work for me. I'd just have liked to avoid running the AssumeRole in the sensor and then again when each job starts.
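A non-authoritative sketch of that workaround; the role ARN, bucket, key, and op names are hypothetical, and the credentials dict is passed to the downstream op via the io_manager rather than through run config:

import boto3
from dagster import job, op

ROLE_ARN = "arn:aws:iam::123456789012:role/example-role"  # hypothetical

@op
def assume_role():
    # One AssumeRole call per run; the returned dict is handed to downstream ops
    return boto3.client("sts").assume_role(
        RoleArn=ROLE_ARN, RoleSessionName="dagster-run"
    )["Credentials"]

@op
def download_from_s3(creds):
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    local_path = "/tmp/example.csv"  # hypothetical bucket/key/path
    s3.download_file("example-bucket", "example.csv", local_path)
    return local_path

@job
def foo():
    download_from_s3(assume_role())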
johann
06/10/2022, 6:01 PM
Vinnie
06/10/2022, 6:19 PM
johann
06/14/2022, 2:19 PM
sandy
06/14/2022, 2:40 PM
Are you calling build_resources inside the decorated schedule function?
Vinnie
06/14/2022, 2:42 PM
schedules[f"{job_name}_schedule"] = ScheduleDefinition(
    name=f"{job_name}_schedule",
    job=run_datahub_pipeline,
    cron_schedule=job_options["schedule"],
    run_config={
        "resources": {
            "values": {"config": job_options},
            "io_manager": dagster_bucket_io_manager,
        }
    },
)
dagster_bucket_io_manager being:
@io_manager
def dagster_bucket_io_manager():
    return s3_pickle_io_manager(
        build_init_resource_context(
            config={"s3_bucket": os.getenv("S3_IO_BUCKET")},
            resources={"s3": s3_resource},  # s3_resource being dagster_aws.s3.s3_resource
        )
    )
sandy
06/16/2022, 6:04 PM
Vinnie
06/16/2022, 6:40 PM
sandy
06/16/2022, 8:23 PM
run_datahub_pipeline = some_graph.to_job(resource_defs={"io_manager": dagster_bucket_io_manager})
or
@job(resource_defs={"io_manager": dagster_bucket_io_manager})
def run_datahub_pipeline():
    ...
would that work for you?
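For reference, a non-authoritative sketch of how the dagster_bucket_io_manager referred to above could be defined without constructing an init context by hand, assuming dagster_aws and the S3_IO_BUCKET env var from the earlier snippet:

import os
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource

# Pre-configure the stock S3 pickle io_manager instead of calling build_init_resource_context manually
dagster_bucket_io_manager = s3_pickle_io_manager.configured(
    {"s3_bucket": os.getenv("S3_IO_BUCKET")}
)
# s3_pickle_io_manager also needs an "s3" resource on the same job, e.g.
# resource_defs={"io_manager": dagster_bucket_io_manager, "s3": s3_resource}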
Vinnie
06/17/2022, 6:37 AM
I ended up turning run_datahub_pipeline into a graph and then generating assets and jobs instead of passing the configs through the schedule to the job.
Overall I'd still argue that a separation in resource definitions between purely config-based resources and "everything else" would make sense for added clarity, but having the asset documentation front and center made it extremely obvious that what I was doing wasn't the recommended way to go, so maybe other people will just "get it" now.
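A non-authoritative sketch of that end state, reusing names from the earlier snippets (some_graph, job_name, job_options, schedules, dagster_bucket_io_manager, and s3_resource are assumed from above; the "values" resource is a guess based on the run_config shown earlier):

from dagster import ScheduleDefinition, make_values_resource

# Bind resources and per-job config when building the job from the graph,
# so the schedule no longer carries them in run_config
datahub_job = some_graph.to_job(
    name=f"{job_name}_job",
    resource_defs={
        "io_manager": dagster_bucket_io_manager,
        "s3": s3_resource,
        "values": make_values_resource(),
    },
    config={"resources": {"values": {"config": job_options}}},
)

schedules[f"{job_name}_schedule"] = ScheduleDefinition(
    name=f"{job_name}_schedule",
    job=datahub_job,
    cron_schedule=job_options["schedule"],
)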
sandy
06/17/2022, 6:38 PM
Vinnie
06/20/2022, 7:40 AM