# ask-community
o
Hi all, I am trying to develop a good workflow to go from nothing to a production-ready project. My data science users are domain experts in other fields who are moving into data science, so simplicity is key. However, I also want to minimise productionisation effort and the mismatch between production and development. All very WIP as well... so maybe this is premature optimisation.

I think it's standard to split development into two phases: R&D and then productionisation. R&D is handled by the data scientists; I imagine they will use assets and won't be expected to handle anything to do with resources and IO, e.g. they just declare their resources in the asset and get their pipeline developed. Next comes data engineering, who pull out all the resources, attach appropriate IO managers, and the rest.

To POC the idea I have:
1. created several assets that work for a small amount of data (data scientist duties)
2. created a graph that strings the assets together with some batching (data eng duties)

I was asking @chris about an unrelated topic and he mentioned that it's bad practice to use assets as ops inside graphs. Unfortunately that would mean a lot more refactoring for the data engineers, and after productionisation the data scientists might have trouble recognising and running their own code. The alternative is to write all the assets as ops and then use `AssetsDefinition.from_op` in the data scientists' role, but this takes away some of the magic and ease of SDAs IMO.

I had a look at the example projects, but nothing really seemed to match my use case. The relevant examples tend to use partitioning for scaling, which is a little difficult to implement in my situation since the number of partitions is data dependent. So a few questions, if I may:
1. suggestions for a project layout that could facilitate this?
2. why shouldn't assets be used as ops? I thought this was an intended mechanic
3. suggestions for any other workflows that might fit this use case?
thanks 🙂
Example of the code in question:
Copy code
from math import ceil
from itertools import starmap

import pandas as pd
from dagster import DynamicOut, DynamicOutput, asset, graph, op


@asset(io_manager_key='inferences_io')
def inference(model: FlairNerModel, dataset: list[object]):
    # run the model over the whole input and return its predictions
    return model.predict_many(dataset)


@op(out=DynamicOut())
def batch(context, dataset):
    # fan out: split the dataset into fixed-size batches,
    # yielding one DynamicOutput per batch keyed by its index
    batch_size = context.op_config['batch_size']
    n_batches = int(ceil(len(dataset) / batch_size))

    get_batch = lambda i: dataset.iloc[i * batch_size:(i + 1) * batch_size]
    batches = map(get_batch, range(n_batches))
    indexed_batches = zip(batches, range(n_batches))

    wrap_batch = lambda data, idx: DynamicOutput(data, mapping_key=str(idx))
    yield from starmap(wrap_batch, indexed_batches)


@op
def collect(results):
    # fan in: recombine the per-batch results into a single DataFrame
    return pd.concat(results, ignore_index=True)


@graph
def batch_inference(model, dataset):
    batch_size = 50
    batcher = batch.configured({'batch_size': batch_size}, f'batch_{batch_size}')

    batches = batcher(dataset)
    inferenced = batches.map(lambda x: inference(model, x))
    return collect(inferenced.collect())
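For reference, the dynamic fan-out/fan-in above boils down to this plain-Python round trip (a sketch with no Dagster APIs; lists stand in for the DataFrame, so slicing replaces `.iloc` and list concatenation replaces `pd.concat`):

```python
from math import ceil

def batch(dataset, batch_size):
    # same slicing as the op: ceil(len/size) chunks, keyed by batch index
    n_batches = int(ceil(len(dataset) / batch_size))
    return [dataset[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]

def collect(results):
    # fan in: concatenate the per-batch results back together
    return [row for chunk in results for row in chunk]

dataset = list(range(10))
batches = batch(dataset, batch_size=3)  # 4 chunks: 3 + 3 + 3 + 1 rows
roundtrip = collect(batches)            # recovers the original 10 rows in order
```

The batch/collect pair is a lossless round trip, which is what lets the per-batch `inference` calls run independently in between.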
j
Hi @Oliver, this guide explains the differences between ops and assets a bit better. It's written for someone potentially migrating from ops to assets, but the first two sections ("Why use software-defined assets" and "When should I use software-defined assets") are good general information.

@sandy can probably speak to this better than I can, but in a general sense, an asset is supposed to be a declarative software representation of a single persistent object (a table in a DB, an ML model, etc.), whereas an op is more like a task. So in the code snippet you gave, I'd recommend that the `inference` asset actually be an op, since you are using it to perform the same task on a variety of inputs. Without knowing more about your use case, I think what could potentially be assets are the `dataset` input to your graph and the output of the `batch_inference` graph.
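To make that split concrete, here is a plain-Python sketch of the suggested structure (no Dagster APIs; all names are hypothetical and a dict stands in for IO-managed storage): the dataset and the final inference table are the persistent "assets", while per-record inference is just a reusable "task":

```python
# Hypothetical sketch: assets = persisted objects, ops = reusable tasks.
storage = {}  # stand-in for whatever the IO managers write to

def dataset_asset():
    # asset: the persistent input table
    storage["dataset"] = ["alpha", "beta", "gamma"]
    return storage["dataset"]

def inference_task(model, records):
    # op: the same task applied to any batch of inputs
    return [model(record) for record in records]

def batch_inference_asset(model):
    # asset: the persisted output of the whole graph
    storage["inferences"] = inference_task(model, storage["dataset"])
    return storage["inferences"]

model = str.upper  # stand-in for the real FlairNerModel predictions
dataset_asset()
result = batch_inference_asset(model)  # ["ALPHA", "BETA", "GAMMA"]
```

In this framing only `dataset_asset` and `batch_inference_asset` correspond to things worth tracking as assets; `inference_task` has no persistent identity of its own.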
🌈 1
s
@Oliver - when you say "this takes away some of the magic and ease of SDA's IMO", are there elements you're thinking about in particular that you'd lose if you used ops instead of assets in this case?
Also, something I believe you could do if you want to keep `inference` as an asset is:
Copy code
inferenced = batches.map(lambda x: inference.op(model, x))
o
ah, maybe I'm just being pedantic 😅 I guess I'm thinking of two views, data eng and data science, and I want to abstract the Dagster internals out of the equation for DS; assets do that really well. Yes, that works! Would there be any caveats?
s
no caveats that I can think of - curious to hear how it goes for you
🌈 1
o
not well - couldn't scale it within a single run so ended up just going with partitioning 🌈