# ask-community
u
I'm thinking about checking out Dagster for some of our genomics/bioinformatics workflows. The Dagster model is clearly quite rich, so it's not obvious to me how best to map some of our problems onto your DSL/API.
s
The short answer here is that, when dealing directly with files instead of in-memory objects, it's often easiest to sidestep Dagster's IO manager machinery and just handle the IO yourself. So you can do something like:
from dagster import In, Nothing, job, op


# write_file_x / read_file_x are placeholders for whatever produces and
# consumes the file on disk.
@op
def op1() -> None:
    write_file_x()


# A Nothing input expresses "run after op1" without passing any data.
@op(ins={"after": In(Nothing)})
def op2() -> None:
    read_file_x()


@job
def job1():
    op2(after=op1())
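For a quick local sanity check (once those placeholder functions actually exist), you can run the job in process:

job1.execute_in_process()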
u
Is there anything functional that is lost in doing what you suggest, other than the "aesthetic" aspects of using the function args to define inputs and composing the ops in the job? (Though it would be sad to lose that)
I'm assuming that if we don't have a shared filesystem across the systems running ops, then we'd have to read from/write to S3 in preparation for each op. Is there an IO manager that takes an S3 URI as input and output and "makes it available" to the op by actually writing it to a predictable location in the local filesystem? I think this is basically what Nextflow does to deal with isolation of "ops" that all end up running command-line tools against the local filesystem: for each node in the graph, it sets up an isolated working directory, copies all inputs/outputs into it under configured names, and then your command-line invocation runs against those local filenames. One complaint is that this would be quite wasteful in terms of IO.

I suppose you could use an S3 IO manager at the "boundaries" of a graph and have all the internal ops just read/write from local disk. That seems risky, though, if you end up running your ops on different nodes without a shared filesystem, but I guess you could then just swap out ALL the IO managers to read/write to S3. But anyway, would configuring just the boundary nodes to use one IO manager and the inner ones to use another be a horrible mess in the Dagster API?
s
> But anyway, would configuring just the boundary nodes to use one IO manager and the inner ones to use another be a horrible mess in the Dagster API?
It's fairly straightforward in Dagster to say "use IO manager X for these node outputs and use IO manager Y for these other node outputs." It's hairier to say "these nodes need to run on the same machine, but these other nodes can run on different machines."
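A minimal sketch of what that per-output wiring could look like with the op/job APIs (the op bodies, bucket name, and prefix below are placeholders; the S3 IO manager comes from the dagster-aws package):

from dagster import Out, fs_io_manager, job, op
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


# Inner op: its output stays on the local filesystem.
@op(out=Out(io_manager_key="local_io_manager"))
def inner_op():
    return compute_intermediate()  # placeholder


# Boundary op: its output gets pickled to S3 instead.
@op(out=Out(io_manager_key="s3_io_manager"))
def boundary_op(intermediate):
    return finalize(intermediate)  # placeholder


@job(
    resource_defs={
        "local_io_manager": fs_io_manager,
        "s3_io_manager": s3_pickle_io_manager.configured(
            {"s3_bucket": "my-bucket", "s3_prefix": "dagster"}
        ),
        "s3": s3_resource,  # required by s3_pickle_io_manager
    }
)
def mixed_io_job():
    boundary_op(inner_op())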
> Is there anything functional that is lost in doing what you suggest, other than the "aesthetic" aspects of using the function args to define inputs and composing the ops in the job? (Though it would be sad to lose that)
Just that you're responsible for your own IO
> Is there an IO manager that takes an S3 URI as input and output and "makes it available" to the op by actually writing it to a predictable location in the local filesystem?
This is something I've played around with in the past, but we don't have anything out of the box. If you're trying to get something up and running as quickly as possible, I'd recommend against that route and instead just put download_from_s3() and upload_to_s3() at the beginning and end of your ops / assets. I do think it is the ideal solution ultimately, though, and would help you with it if you wanted to write one.
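As a rough sketch of that download/upload pattern using boto3 (the helper names match the ones above, but the bucket, keys, and the my-aligner command are made-up placeholders):

import subprocess
import tempfile
from pathlib import Path

import boto3
from dagster import op


def download_from_s3(uri: str, dest: Path) -> Path:
    # "s3://bucket/key" -> (bucket, key)
    bucket, key = uri.replace("s3://", "", 1).split("/", 1)
    boto3.client("s3").download_file(bucket, key, str(dest))
    return dest


def upload_to_s3(src: Path, uri: str) -> str:
    bucket, key = uri.replace("s3://", "", 1).split("/", 1)
    boto3.client("s3").upload_file(str(src), bucket, key)
    return uri


@op
def align_reads(reads_uri: str) -> str:
    # Stage the input locally, run a command-line tool against local paths,
    # then push the result back to S3 and hand the URI to downstream ops.
    with tempfile.TemporaryDirectory() as tmp:
        local_in = download_from_s3(reads_uri, Path(tmp) / "reads.fastq")
        local_out = Path(tmp) / "aligned.bam"
        subprocess.run(["my-aligner", str(local_in), "-o", str(local_out)], check=True)
        return upload_to_s3(local_out, "s3://my-bucket/aligned/aligned.bam")

Downstream ops can then accept the returned S3 URI as a plain string input and stage it locally the same way.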