#ask-community

John Smith

06/29/2023, 10:22 PM
I appreciate that MultiPartitions with more than 2 dimensions aren't yet supported, but is there a workaround that can achieve the flow below and:
1. auto-materialize the appropriate downstream partitions when upstream dependencies are met
2. allow me to trace upstream assets per partition, e.g. via _input_data_version_
I was hoping I could nest MultiPartitions, but alas that throws an error:
multip = MultiPartitionsDefinition({
    "a": MultiPartitionsDefinition(...), 
    "b": StaticPartitionsDefinition(...)
})

owen

06/29/2023, 11:10 PM
I think the closest you could get here would be something that explicitly merges two of the dimensions together, i.e.
A = StaticPartitionsDefinition(["a1", "a2", "a3"])
B = StaticPartitionsDefinition(["b1", "b2", "b3"])
C = StaticPartitionsDefinition(["c1", "c2", "c3"])

BC = StaticPartitionsDefinition(
    ["b1|c1", "b1|c2", "b1|c3", "b2|c1", ...]
)

ABC = MultiPartitionsDefinition(
    {
        "a": A,
        "bc": BC,
    }
)
From there, you'd need to create a pair of StaticPartitionMappings to tell dagster how to take an arbitrary partition of BC and map it to either a partition of B or a partition of C. I will note, though, that this approach can definitely suffer from significant scaling issues: we generally quote ~10k as the maximum number of partitions at which things will work smoothly, and the total grows multiplicatively with every added dimension (if A, B, and C all have just 20 partition keys, that is 20 * 20 * 20 = 8,000 combinations, already close to that limit).
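To make the merged-dimension idea and the scaling concern concrete, here is a minimal sketch in plain Python (no Dagster imports). It only builds the merged key list and the two lookup dictionaries that a pair of StaticPartitionMappings would be constructed from; the helper `total_partitions` and the variable names are hypothetical and exist only for this illustration.

```python
import itertools

# Partition keys for the three logical dimensions (same toy values as above).
a_keys = ["a1", "a2", "a3"]
b_keys = ["b1", "b2", "b3"]
c_keys = ["c1", "c2", "c3"]

SEP = "|"  # separator used to merge a B key and a C key into one BC key

# Every (b, c) combination becomes a single merged key such as "b1|c1".
bc_keys = [f"{b}{SEP}{c}" for b, c in itertools.product(b_keys, c_keys)]

# Each B key maps to all merged keys that contain it; likewise for C.
b_to_bc = {b: {f"{b}{SEP}{c}" for c in c_keys} for b in b_keys}
c_to_bc = {c: {f"{b}{SEP}{c}" for b in b_keys} for c in c_keys}


def total_partitions(*dimension_sizes):
    """Multiply the dimension sizes to get the number of key combinations."""
    total = 1
    for size in dimension_sizes:
        total *= size
    return total


small = total_partitions(len(a_keys), len(bc_keys))  # 3 * 9 combinations
large = total_partitions(20, 20, 20)  # three dimensions of 20 keys each
```

With the toy keys there are only 27 combinations, but three dimensions of 20 keys each already give 8,000, which is why the ~10k guideline is reached so quickly.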

John Smith

07/03/2023, 8:37 PM
@Sean Lopp perhaps better to continue the convo in this thread
@owen is there any example usage of StaticPartitionMappings? especially for a use case above. I couldn't find any in the docs

owen

07/11/2023, 5:14 PM
Hi @John Smith! I put together an example here: https://gist.github.com/OwenKephart/600e1c791f192ea78ef06ac8dc944e42

John Smith

07/11/2023, 7:32 PM
thanks @owen that's exactly the functionality I'm after thank you!! could you put a PR in to support MultiPartitions as you've suggested so I can use dates on top of my desired partition please? I've also taken the liberty of completing your example with dummy return values so it's a runnable example.
import itertools

from dagster import (
    AssetIn,
    MultiPartitionsDefinition,
    StaticPartitionMapping,
    StaticPartitionsDefinition,
    asset,
)

A = StaticPartitionsDefinition(["a1", "a2", "a3"])
B = StaticPartitionsDefinition(["b1", "b2", "b3"])
C = StaticPartitionsDefinition(["c1", "c2", "c3"])

BC = StaticPartitionsDefinition(
    # instead of actual multi-partitions, just create strings of the form b1.c1
    [f"{b}.{c}" for b, c in itertools.product(B.get_partition_keys(), C.get_partition_keys())]
)

ABC = MultiPartitionsDefinition(
    {
        "a": A,
        "bc": BC,
    }
)


@asset(partitions_def=A)
def assetA():
    return [1, 2, 3]


@asset(partitions_def=B)
def assetB():
    return ["a", "b"]


@asset(partitions_def=C)
def assetC():
    return [11, 22, 33]


@asset(
    partitions_def=BC,
    ins={
        "assetB": AssetIn(
            partition_mapping=StaticPartitionMapping(
                {
                    # each partition of b maps to...
                    b_partition: {
                        # all multi partition keys that contain that partition
                        f"{b_partition}.{c_partition}"
                        for c_partition in C.get_partition_keys()
                    }
                    for b_partition in B.get_partition_keys()
                }
            ),
        ),
        "assetC": AssetIn(
            partition_mapping=StaticPartitionMapping(
                {
                    # each partition of c maps to...
                    c_partition: {
                        # all multi partition keys that contain that partition
                        f"{b_partition}.{c_partition}"
                        for b_partition in B.get_partition_keys()
                    }
                    for c_partition in C.get_partition_keys()
                }
            ),
        ),
    },
)
def assetBC(assetB, assetC):
    return assetB + assetC


@asset(partitions_def=ABC)
def assetABC(assetA, assetBC):
    return assetA + assetBC
As a related aside, is there a way to examine in the UI which downstream partitions are ready to be run / updated when backfilling, similar to how eager materialization would work? The view below doesn't show me which upstream dependent partitions have completed, so I can't tell which downstream partitions I can run.

owen

07/11/2023, 9:57 PM
Hi! Re "could you put a PR in to support MultiPartitions as you've suggested so I can use dates on top of my desired partition please?", how many of your dimensions are time based? In the example code above, swapping out A from a StaticPartitionsDefinition to a DailyPartitionsDefinition will work just fine, as the MultiPartitionsDefinition natively knows how to mix together those sorts of dimensions. However, I think trying to make B or C time-based would be a bit harder, as creating a StaticPartitionMapping wouldn't work -- these mappings expect to know all the keys they need to map from / to at definition time, but the set of partitions that exist inside of a DailyPartitionsDefinition changes (grows) as time goes on.
re: the UI bit, that sort of view (or really any sort of partition-mapping specific view) unfortunately doesn't exist at the moment, although you will be warned if you try to materialize a partition that has missing parents
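The point above about static mappings needing every key at definition time can be illustrated without Dagster at all. The following is a plain-Python sketch under that assumption: `daily_keys` is a hypothetical helper that stands in for a growing daily dimension, and the dictionary stands in for a mapping that was built once and never updated.

```python
from datetime import date, timedelta


def daily_keys(start, end):
    """Return one ISO date string per day in the half-open range [start, end)."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


c_keys = ["c1", "c2"]
start = date(2023, 7, 1)

# Mapping built once, "at definition time", when only three days exist.
keys_at_definition = daily_keys(start, date(2023, 7, 4))
static_mapping = {d: {f"{d}.{c}" for c in c_keys} for d in keys_at_definition}

# A week later the daily dimension has grown, but the mapping has not.
keys_later = daily_keys(start, date(2023, 7, 11))
unmapped = [d for d in keys_later if d not in static_mapping]
```

Every date from 2023-07-04 onward ends up in `unmapped`, which is the gap a fixed key-to-key mapping cannot close for a time-based dimension.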

John Smith

07/14/2023, 8:14 PM
Unfortunately A, B and C all need to be MultiPartitions, each with the same date component, unless there's another way to track a "date" dimension without partitions, e.g. a new group per day spawned from an asset factory? We don't tend to look at multiple dates together often.
@owen @Sean Lopp I've managed to achieve what I'm after using MultiPartitionMappings. I've put the code as a comment reply to your post: https://gist.github.com/OwenKephart/600e1c791f192ea78ef06ac8dc944e42?permalink_comment_id=4634120#gistcomment-4634120 Could you review it and point out any pitfalls, please?

owen

07/19/2023, 9:37 PM
Hi @John Smith -- this seems like a reasonable approach (a single shared date dimension makes sense). The main pitfall remains the same (as the dimensionality of a * b * c increases, this approach will struggle). I think it would help to learn more about what A, B, and C are representing (and the dimensionality you're imagining them having). If these really do only have a few partitions per definition, then this approach seems totally fine to me. I saw your note about wanting downstream runs to execute in-process, so I'm also wondering if it's possible to capture your desired setup with something simpler (i.e. not even explicitly breaking certain dimensions into partitions, just executing your different pieces of work in process directly).

John Smith

07/19/2023, 9:53 PM
I don't envisage there being over 200 partitions in total, which run at different times over the course of the day. Are you concerned about how quickly the tasks can spawn, or about the UI being able to keep track of all the partition combinations? I'm less concerned about the former because I don't envisage over 10 assets running simultaneously. Furthermore, each asset takes about 10s to 1 minute to complete, so a few extra seconds of overhead is tolerable. Lastly, we don't expect to do much backfilling. In fact, the only time we care about older dates is when we want to compare model performance between the current date and an older one.

owen

07/19/2023, 9:55 PM
ah so to clarify, A * B * C would be equal to ~200? in that case, I wouldn't expect too much trouble. My main concern was the UI being able to keep track of the partition combinations, but that shouldn't be an issue in your case

John Smith

07/19/2023, 10:00 PM
Oh, A * B * C * D * E would be < 200 per day. Would the number of days we store data for overwhelm the UI? If so, is there a way to toggle the number of days we display in the UI without losing the metadata, so that we can look back in time on the rare occasions we need to?

owen

07/19/2023, 10:09 PM
yep gotcha re: the per day thing, I would not expect you to experience significant issues (at least for a few years!). in general, the UI views are well-compressed when possible, but if the cardinality of one dimension gets too crazy then there's not much we can do to slim down the data into something that could fit into the UI

John Smith

07/21/2023, 8:54 AM
Quick follow-up questions:
1. Given a partition in a final asset, how do I trigger a refresh of all dependent partitions for all upstream assets?
2. Can I create a custom datetime partition which only includes certain hours of the day, e.g. 7am, 10am, 4pm, but still retain all the features that a datetime partition gives me "for free"?
3. Given a downstream partition, how do I check whether all upstream dependent partitions are available? (E.g. with date partitions, it appears as a warning, but not with our setup above.)