# ask-community
m
Hi all, new to Dagster, but I have a bit of DAG experience from AiiDA. I'm wondering if I could get some advice on best practice for how to structure a workspace:
• We have a number of physical assets. Each of these has physical sensor data logged to its own partitioned Parquet file in an S3 bucket.
• There are several models we want to run on each of these assets, tracking things like the evolution of internal/hidden states over time, with both periodic and ad-hoc calculations.
• Generally, if we deploy a new model, we want to deploy it to each of the assets.
• We then want to group a number of these assets and calculate things like distributions of usage across multiple assets.
My initial instinct would be to have a template repository and a list of configurations that combine to create a workspace with a repository per asset, but I don't know if there is an example of setting something like this up that someone could point me towards? Thanks!
s
In general, I'd recommend putting everything in a single repository unless the different assets have different sets of Python dependencies. It will make life simpler. If you want to create one-asset-per-x, you could do something like this:
from dagster import AssetIn, asset

base_assets = [...]  # the per-device source assets

model1_assets = []
for base_asset in base_assets:
    # Generate one model asset per base asset; the new asset's name and
    # input are derived from the base asset's key.
    @asset(name=base_asset.key.path[-1] + "_model1", ins={"arg1": AssetIn(key=base_asset.key)})
    def model1_asset(arg1):
        ...

    model1_assets.append(model1_asset)
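Stripped of the Dagster-specific decorator, the snippet above is an asset-factory pattern: calling a factory function binds the per-device value at call time, which avoids the classic late-binding pitfall of defining closures directly inside a loop. A minimal plain-Python sketch of the same idea (the `make_model_fn` helper and device names here are hypothetical, for illustration only):

```python
def make_model_fn(device_id):
    # Factory: device_id is bound when make_model_fn is called, so each
    # generated function keeps its own device rather than sharing the
    # loop variable's final value.
    def model_fn(readings):
        return {"device": device_id, "mean": sum(readings) / len(readings)}

    # Give each generated function a distinct, descriptive name,
    # mirroring the name= argument passed to @asset above.
    model_fn.__name__ = f"{device_id}_model1"
    return model_fn


device_ids = ["pump_a", "pump_b"]
model_fns = [make_model_fn(d) for d in device_ids]
```

Each entry in `model_fns` is an independent function, e.g. `model_fns[0]([1, 2, 3])` reports the mean for `pump_a` only.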
m
Thanks very much!
Hi Sandy, we've started using this approach for dev, but there are some concerns about how we'd scale up. Part of the appeal of Dagster is that we could see at a glance what was going on with a given physical asset (have all the calcs run, are there any faults detected, etc.). To recreate our current models we'd probably be looking at on the order of 100 ops, with around 500 as we expanded. With 1,000 physical devices, that's 10,000-50,000 nodes to look at. Having dug a bit more, I'm thinking of something like a graph defining all the ops for an arbitrary device, and then an AssetsDefinition.from_graph()-type workflow to keep things tidy, but I'm not sure if I'm missing the mark with this. We'd expect devices to be added/removed at different times, which is part of my reasoning for wanting a get_device_asset(device_id, start_date)-type workflow to be possible. Also, my first instinct was to look for something like nested partitions (partition by category, then datetime), but I don't see an easy way to do this... not sure if it's something I've missed or something that would be hacky to the point of irresponsibility!
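One stopgap sometimes used while nested partitions aren't supported first-class is encoding both dimensions in a single partition key string. A minimal plain-Python sketch of that idea (the helper names and the "|" separator are assumptions for illustration, not Dagster API):

```python
def make_partition_key(device_id: str, date_str: str) -> str:
    # Compose the two partition dimensions into one key string.
    # Assumes "|" never appears in a device id or date.
    return f"{device_id}|{date_str}"


def split_partition_key(key: str) -> tuple:
    # Recover the two dimensions from a composite key.
    device_id, date_str = key.split("|", 1)
    return device_id, date_str


# One composite key per (device, date) pair:
keys = [make_partition_key(d, "2022-01-01") for d in ("pump_a", "pump_b")]
```

The trade-off is that the UI and any partition tooling see an opaque flat list of keys, so filtering by one dimension means parsing keys yourself.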
s
Also, my first instinct was to look for something like nested partitions
This is the issue where we're tracking this: https://github.com/dagster-io/dagster/issues/4591. I've been experimenting with this recently and posted an initial implementation here: https://github.com/dagster-io/dagster/pull/9511. There are still some issues we need to sort out.
m
Ah, this looks really interesting. Thanks very much!