# ask-community
r
Is it possible to run partitions one by one (serial) ?
f
Hey @Rafael Gomes, great question! I asked for something related recently in the CoRise Slack, and @yuhan kindly replied:
it depends on how you kick off the job. if you kick them off using the backfill button in the UI, they should be run sequentially (e.g. if date partitioned, dagster will run the earliest date first)
I have not tested it yet
and I was wondering if there's a way to specify the sorting key
e.g. when you are using dynamic partitions, is it possible to define a sorting key (or sorting date) that is different from the partition key?
@yuhan dagster angel is it possible?
s
hey Rafael - can you tell me a little more about your use-case? is this with assets or op-based jobs? are you using time partitions? do you want them to be serial to avoid too much concurrent load on your systems, or because later partitions depend on the data in earlier partitions?
r
In my case, since the order doesn't matter, I ended up using the following configuration (due to high load on my system):
```yaml
tag_concurrency_limits:
  - key: "dagster/backfill"
    limit: 1
```
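(For reference, this tag concurrency limit is applied on the Dagster instance under the queued run coordinator's config in `dagster.yaml`; a minimal sketch, assuming the default QueuedRunCoordinator:)
```yaml
# dagster.yaml (instance config), assuming the QueuedRunCoordinator is in use
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      - key: "dagster/backfill"
        limit: 1  # at most one queued backfill run executes at a time
```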
s
that's what I was going to recommend - does that address your issue, or are there remaining gaps?
r
It did solve my issue 🙂 thanks anyway
f
hey @sandy, sorry to throw in my question. For my use case, later partitions depend on the data from the previous partition
s
@Félix Tremblay got it - the previous partition of the same asset or an upstream asset? if it's an upstream asset, the asset reconciliation sensor should handle this if you use a custom partition mapping. if it's the same asset, I'm actually currently investigating what it would look like to support this
and are you using ordered partitions that aren't time-based? I'm curious- what kinds of partitions?
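(For the upstream-asset case mentioned above, a minimal sketch of a custom partition mapping, using daily time partitions purely for illustration - the asset names here are hypothetical, and this does not cover the same-asset case:)
```python
from dagster import AssetIn, DailyPartitionsDefinition, TimeWindowPartitionMapping, asset

daily = DailyPartitionsDefinition(start_date="2023-01-01")


@asset(partitions_def=daily)
def upstream_asset():
    # Placeholder: produce this partition's data.
    ...


@asset(
    partitions_def=daily,
    ins={
        "upstream_asset": AssetIn(
            # Map each partition of this asset to the *previous* day's
            # partition of upstream_asset.
            partition_mapping=TimeWindowPartitionMapping(start_offset=-1, end_offset=-1)
        )
    },
)
def downstream_asset(upstream_asset):
    # Placeholder: consume the previous partition of the upstream asset.
    ...
```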
f
it's the previous partition of the same asset
we have a bunch of database backup (.bak) files in S3 and we need to process them sequentially. we can't use time-based partitions because the number of bak files per day is variable
there's generally one backup per day, but for some days we are missing the bak files. and there's a possibility that the backup frequency gets increased to multiple times per day in the future
so I was planning to use the bak_file_name as the partition key
s
got it - there's no great builtin way to do that right now, but we're interested in supporting it in the future. you could always write a custom sensor that submits runs in the order you expect
f
ok great!
I also thought of using a custom sensor, but wasn't sure how it would work
I suppose the sensor could read from the S3 bucket, sort the list of file names, and yield a job run for each file
would I have to implement logic in the sensor to determine whether a file was already processed, or does Dagster already take care of running each job "key" exactly once?
s
if you include values for the `run_key` argument of your `RunRequest`s, Dagster will avoid launching duplicate runs for the same run key
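(A minimal sketch of the kind of custom sensor discussed above, assuming a simple one-op job and a placeholder `list_bak_files` helper standing in for real S3 listing logic:)
```python
from dagster import OpExecutionContext, RunRequest, job, op, sensor


@op(config_schema={"file_name": str})
def process_bak_file(context: OpExecutionContext):
    # Placeholder: restore/process a single .bak file.
    context.log.info(f"processing {context.op_config['file_name']}")


@job
def process_bak_file_job():
    process_bak_file()


def list_bak_files() -> list[str]:
    # Placeholder: list the .bak object keys from your S3 bucket,
    # e.g. with boto3's list_objects_v2 (paginated).
    return ["2023-01-01.bak", "2023-01-02.bak"]


@sensor(job=process_bak_file_job)
def bak_file_sensor():
    # Sort the keys so runs are requested in the order the backups
    # should be processed.
    for file_name in sorted(list_bak_files()):
        yield RunRequest(
            # Dagster will not launch a second run for a run_key it has
            # already seen, so each file is requested at most once.
            run_key=file_name,
            run_config={
                "ops": {"process_bak_file": {"config": {"file_name": file_name}}}
            },
        )
```
(Because each run key is the file name, the sensor can safely re-list the whole bucket on every tick; combined with the backfill/queue concurrency limit above, runs would also execute one at a time.)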