Philip Graae
08/09/2019, 10:57 AM

nate
08/09/2019, 2:43 PM

Philip Graae
08/10/2019, 12:49 PM

nate
08/10/2019, 3:31 PM

processImage(img: ImagePath): ProcessedImage
you want to execute on every image; I think it'd be a good fit to just have a Spark job that loads all the images and maps that function over them, like:
loadImages("<s3://our-images/prod/2019/08/10>")
.map(processImage)
.map(saveImageToS3)
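Expanded into a runnable form, a minimal PySpark sketch of that pattern might look like the following (the bucket/prefix comes from the example above; process_image and save_image_to_s3 are hypothetical stand-ins for your own processing and upload logic):

# A sketch of the load -> map -> save pattern described above.
# Assumptions: process_image and save_image_to_s3 are placeholders
# for your own code; the S3 prefix is taken from the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-image-processing").getOrCreate()

def process_image(path_and_bytes):
    path, raw = path_and_bytes
    # decode `raw`, apply your transformation, return the processed bytes
    return path, raw  # placeholder: no-op transformation

def save_image_to_s3(path_and_bytes):
    path, processed = path_and_bytes
    # upload `processed` back to S3 here, e.g. with boto3

# binaryFiles yields (path, bytes) pairs for every object under the prefix
(
    spark.sparkContext
    .binaryFiles("s3a://our-images/prod/2019/08/10")
    .map(process_image)
    .foreach(save_image_to_s3)
)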
Then, the scheduler / Dagster would run that Spark job each day as part of a larger pipeline (which might have other jobs, notebooks, etc. as part of the overall DAG).

Philip Graae
08/11/2019, 10:46 AM

dask_kwargs
or similar, either in the solid/pipeline definitions or in the config. Just a thought.
Interesting solution. So basically moving the mapping over the spatial dimension to the execution layer, while handling the time dimension as part of the scheduling. That does seem to be a good way to separate concerns.
Thank you for the help and the quick, positive response to my feedback. It will be very interesting to follow the work you are doing on Dagster!

Cloves Almeida
08/19/2019, 12:11 PM

nate
08/19/2019, 7:48 PM

nate
08/24/2019, 1:48 AM

@solid(
    ...
    step_metadata_fn=lambda _: {'dagster-dask/resource_requirements': {'CPU': 1}},
)
def foo(context):
    pass
And this will pass along those resource requirements to Dask, assuming you launched your Dask worker with e.g. dask-worker --resources "CPU=1" scheduler-address.internal:8786.
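For reference, this is how resource restrictions behave in plain dask.distributed (standard Dask API, independent of Dagster): a task submitted with a resources requirement is only scheduled on workers that advertise matching resources. A minimal sketch, reusing the scheduler address from the example above:

# Standard dask.distributed resources example. Assumption: the scheduler
# address is the hypothetical one from nate's message above.
from dask.distributed import Client

client = Client("scheduler-address.internal:8786")

def heavy_task(x):
    return x * 2

# Only workers launched with `dask-worker --resources "CPU=1" ...`
# are eligible to run this task.
future = client.submit(heavy_task, 21, resources={"CPU": 1})
print(future.result())  # 42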
Philip Graae
09/16/2019, 10:52 AM