# ask-community
**Thibault**
Hi all! We're using Dagster to process time series data in ETLs. We have a first version that works fine using jobs, graphs, ops, and partitions. However, we are now in a situation where:
• we have to define massive partitions that combine daily schedules, days to retrieve (as we are working with time series), and other configurations (like IDs)
• we are reusing these pipelines with multiple configurations (multiple IDs, different dates to retrieve, different API keys stored in a Redis)
• the partitions are generated dynamically, as some other app "knows" which configuration should be applied

That causes the following problems:
• Dagit is struggling to display the partitions, as they have grown too large
• we are drifting away from Dagster's asset-based paradigm, using partitions for scheduling only, and missing out on features that would be of great help, such as freshness policies and asset dependencies

However, I fail to see how we could model our assets using Dagster's concepts:
• One asset per ID? If so, how do we handle things like dates to retrieve and execution failures for specific retrieved dates, especially without rematerializing the whole asset?
• One asset per (ID, date to retrieve)? If so, doesn't that shift the problem from too many partitions to too many assets?
• The configuration itself as an asset?

I would gladly explain my use case in more detail if something isn't clear. Thanks for helping out and have a great day!
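For illustration, a minimal sketch (not from the thread) of how a setup combining a daily schedule with per-ID configurations might be expressed with Dagster's `MultiPartitionsDefinition`; the start date and meter IDs are placeholders:

```python
from dagster import (
    DailyPartitionsDefinition,
    MultiPartitionsDefinition,
    StaticPartitionsDefinition,
    asset,
)

# Two dimensions: one for the daily schedule, one for the meter/config IDs.
partitions_def = MultiPartitionsDefinition(
    {
        "date": DailyPartitionsDefinition(start_date="2022-01-01"),
        "meter": StaticPartitionsDefinition(["meter_001", "meter_002"]),
    }
)

@asset(partitions_def=partitions_def)
def load_curve(context):
    # The composite partition key splits back into its two dimensions.
    keys = context.partition_key.keys_by_dimension
    date, meter_id = keys["date"], keys["meter"]
    # ... fetch the load curve for meter_id on date ...
```

Note that this still yields (days × meters) partition keys, which is exactly the cardinality problem discussed below.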
**yuhan**
@sandy re: modeling partitioned assets
**sandy**
Hi @Thibault - I'm curious: how many partitions do you have? Generally, if you can model your use case using N partitions or N assets, we'd recommend N partitions.
**Thibault**
Hi @yuhan, @sandy, thanks for following up, I really appreciate it 🙂 As of now, we have 78,862 partitions for this pipeline, and this is only bound to grow as every day passes and more connectors are required. To give you more background on our data:
• we collect electricity load curves, that is to say electrical load (W) versus time
• because all the chronological data needs to be stored, we can't have things like retention policies
• for each meter that we work with (electricity production or consumption), we need to be able to collect data for every day X that we've been allowed to query
• X may be in the past (partition backfills worked great up to a certain point) or a scheduled daily execution
• we may need to collect data every day from a starting point in the past that can be a few years to a few months ago
• in other words, we have one partition per meter per day, from a point in the past to the present

We thought of limiting the number of partitions by moving some of the logic to our backend (which knows which data is there or not) to avoid that issue. Something like the following (sketched below):
• the backend returns configurations for missing data to Dagit
• Dagit then builds partitions only for the data that hasn't been collected, thus limiting the number of partitions while still allowing backfills

However, that means losing Dagster capabilities for pipelines that have already been executed, for example if we notice that something went wrong but didn't break the pipelines.
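One hedged sketch of that backend-driven idea, as a sensor that asks the backend which (meter, date) pairs are missing and requests one run per pair. The job name, op name, and `get_missing_configurations` helper are hypothetical stand-ins for the backend call described above:

```python
from dagster import RunRequest, sensor

def get_missing_configurations():
    # Placeholder for the backend call that knows which
    # (meter, date) pairs have not been collected yet.
    return [{"meter_id": "meter_001", "date": "2023-01-15"}]

@sensor(job_name="collect_load_curves")
def missing_data_sensor():
    for cfg in get_missing_configurations():
        # One run per missing (meter, date) pair; the run_key makes
        # the request idempotent across sensor ticks.
        yield RunRequest(
            run_key=f"{cfg['meter_id']}-{cfg['date']}",
            run_config={"ops": {"collect": {"config": cfg}}},
        )
```

This keeps the partition count down but, as noted, Dagster no longer tracks already-executed (meter, date) pairs, so past runs can't be retried from the UI.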
> Generally, if you can model your use case using N partitions or N assets, we'd recommend N partitions.

I'm a bit surprised by that; could you please elaborate? I thought that Dagster was moving towards software-defined assets, using partitions only for scheduling.
**sandy**
~100k is around the upper limit of what we support. We'd eventually like to support more than this, but I can't make promises about how soon this will be a good experience. Why have a separate partition per meter? Is it that, when you launch a run, you want to be able to target a particular meter? Might it make sense to bucket a set of meters together, e.g. have a partition that includes all the meters whose `meter_id % 100 == 0`, and so forth?
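A sketch of that bucketing idea, assuming numeric meter IDs; `NUM_BUCKETS`, the key format, and the helper are illustrative, not an established Dagster pattern:

```python
from dagster import StaticPartitionsDefinition

NUM_BUCKETS = 100  # illustrative: one partition per bucket instead of per meter

bucket_partitions = StaticPartitionsDefinition(
    [f"bucket_{i}" for i in range(NUM_BUCKETS)]
)

def meters_in_bucket(bucket_key: str, all_meter_ids: list[int]) -> list[int]:
    # Map each meter to a bucket by id modulo NUM_BUCKETS.
    i = int(bucket_key.removeprefix("bucket_"))
    return [m for m in all_meter_ids if m % NUM_BUCKETS == i]
```

The trade-off, raised below, is that a single flaky meter can fail its whole bucket's run.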
> I'm a bit surprised by that; could you please elaborate? I thought that Dagster was moving towards software-defined assets, using partitions only for scheduling.

When I say N partitions, I mean a software-defined asset with N partitions (or a set of software-defined assets that all share the same N partitions).
**Thibault**
> ~100k is around the upper limit of what we support. We'd eventually like to support more than this, but I can't make promises about how soon this will be a good experience.

Yes, of course, I understand. That's why I'm trying to find a way to reduce the number of partitions we are working with 🙂
> Why have a separate partition per meter? Is it that, when you launch a run, you want to be able to target a particular meter? Might it make sense to bucket a set of meters together, e.g. have a partition that includes all the meters whose `meter_id % 100 == 0`, and so forth?

Although we could group some of them together, each meter typically belongs to a different functional scope, so a run failing for one meter shouldn't impact runs for other meters. That's why each meter has its own partition for every day. Besides, the API we're using for this is known to be rather unreliable at times for specific meters, so we can take for granted that, for some meters, the API will fail once in a while. Bucketing a set of meters together would therefore most likely often result in run failures because of a single meter in the bucket. Can you think of a way to avoid leaking one run failure onto other runs when bucketing a set of meters together, while also allowing visualization, retries, etc.?
> When I say N partitions, I mean a software-defined asset with N partitions (or a set of software-defined assets that all share the same N partitions).

Ok, I think I get it. Ideally I would like to have one asset per meter over a partitioned schedule, but I haven't found a way to define assets dynamically.
@sandy Hopefully I'm not being too pushy with this reminder, but I'd be very grateful if you could provide some insights regarding my latest comment.
**sandy**
> Can you think of a way to avoid leaking one run failure onto other runs when bucketing a set of meters together, while also allowing visualization, retries, etc.?

Alas, there is not. Based on what you've described, having a partition per meter sounds like the best option.
> Ok, I think I get it. Ideally I would like to have one asset per meter over a partitioned schedule, but I haven't found a way to define assets dynamically.

In this week's release or next week's release, we're going to introduce dynamically partitioned assets. You'll be able to add (and remove) partitions dynamically and then launch runs that target those partitions.
**Thibault**
Okay, thank you @sandy for your insight. I'll be looking out for dynamically partitioned assets in the next releases, then. Hopefully more people will run into use cases similar to mine and share these needs.
**sandy**
An experimental version of this is now available in our latest release. Here are the docs: https://dagster-git-claire-dynamic-partitions-docs-elementl.vercel.app/_apidocs/partitions#dagster.DynamicPartitionsDefinition
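For reference, a minimal sketch assuming the experimental API described in those docs; the partition set name and meter IDs are placeholders:

```python
from dagster import DagsterInstance, DynamicPartitionsDefinition, asset

meters_partitions = DynamicPartitionsDefinition(name="meters")

@asset(partitions_def=meters_partitions)
def meter_load_curve(context):
    meter_id = context.partition_key  # each partition key is one meter id
    # ... collect data for this meter ...

# Partition keys can then be registered at runtime (e.g. from a script
# or a sensor) as the backend discovers new meters:
instance = DagsterInstance.get()
instance.add_dynamic_partitions("meters", ["meter_001", "meter_002"])
```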
@Thibault btw, if you get a chance to try this out, I'm curious to hear how it goes for you
**Thibault**
@sandy yes, at the moment we have moved to another solution that limits the number of partitions (but makes us lose the ability to retry past pipelines). I'll let you know as soon as I have some feedback!