
Jack Yin

10/31/2022, 5:19 PM
on a related note to the above, is there an easy way to run a backfill on a schedule?
• I have a data source that is sensibly partitioned into days but is updated in a “lumpy” way, i.e. not daily but X days at a time, every X days, with a variable number X.
• I would like to use asset sensors to then kick off downstream jobs upon materialization.
The obvious way to handle this (for me) would be to just attempt a backfill every day, but I don’t see an easy way to do that within the scheduler code. I’d like to avoid programmatically making calls to the dagster CLI, e.g. defining a cron job that just runs
dagster job backfill
from the console every day, but if that’s the only way to do it then I guess I can do that

prha

10/31/2022, 9:05 PM
Right now, the two ways to do this are to use the CLI:
dagster job backfill
or to use our GraphQL API, hitting the
launchPartitionBackfill
mutation (how backfills are launched from dagit).
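[Editor's note] A minimal sketch of hitting the `launchPartitionBackfill` mutation over HTTP with only the standard library. The exact shape of `LaunchBackfillParams` here is an assumption reconstructed from how dagit selects partition sets; verify the field names against your dagit's GraphiQL schema browser before relying on them, and the repository/partition-set names are placeholders.

```python
import json
from urllib import request

# launchPartitionBackfill is the mutation dagit itself uses to start a
# backfill. The variable structure below is an assumption -- check it
# against your dagit schema.
LAUNCH_BACKFILL = """
mutation LaunchBackfill($backfillParams: LaunchBackfillParams!) {
  launchPartitionBackfill(backfillParams: $backfillParams) {
    __typename
  }
}
"""

def backfill_payload(repo_location, repo_name, partition_set, partitions):
    """Build the JSON body for a launchPartitionBackfill request."""
    return {
        "query": LAUNCH_BACKFILL,
        "variables": {
            "backfillParams": {
                "selector": {
                    "repositorySelector": {
                        "repositoryLocationName": repo_location,
                        "repositoryName": repo_name,
                    },
                    "partitionSetName": partition_set,
                },
                "partitionNames": partitions,
            }
        },
    }

def launch_backfill(dagit_url, payload):
    """POST the payload to dagit's /graphql endpoint and return the response."""
    req = request.Request(
        dagit_url + "/graphql",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A daily cron job (or a Dagster schedule) could call `launch_backfill("http://localhost:3000", backfill_payload(...))` with whichever partitions need re-running.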

Jack Yin

11/01/2022, 9:26 PM
@prha makes sense. Looking into it though - what if it’s a partitioned asset and not a partitioned job?
i guess part of what i don’t understand is the relationship between asset materializations and jobs
it seems like
__ASSET_JOB_0
and other miscellaneous asset jobs are autogenerated
how would i even make an asset-materializing job explicit

prha

11/01/2022, 9:37 PM
You can create an explicit job of assets using
define_asset_job
. A job is a bound set of operations, tied to an environment. An asset materialization is the single instance of an asset execution, as part of a job run.

Jack Yin

11/01/2022, 9:50 PM
awesome, thanks!
@prha getting closer - but realizing now that
dagster job backfill
backfills everything
is there an easy way for me to only grab the missing partitions?
I can easily inject a date into that particular command, but how do I grab a list of missing partitions from dagster?

prha

11/02/2022, 12:18 AM
hmm, yeah, we don’t expose that easily in any fashion except by manually fetching tags by job, or by querying our graphql endpoint for the field

Jack Yin

11/02/2022, 12:22 AM
so would it be best to keep my own separate table of successful partitions?
or if you have any good examples of code that others have written grabbing missing partitions that’d be great too
or i guess i can query the graphql endpoint

prha

11/02/2022, 12:24 AM
Yeah, I think querying the graphql endpoint might be your best bet.

Jack Yin

11/02/2022, 12:30 AM
alright i’ll give that an honest try and report back if i get stuck

prha

11/02/2022, 12:32 AM
Here’s the query schema we use to calculate it in dagit:
query PartitionSetStatusQuery($repositorySelector: RepositorySelector!, $partitionSetName: String!) {
  partitionSetOrError(
    repositorySelector: $repositorySelector
    partitionSetName: $partitionSetName
  ) {
    ... on PartitionSet {
      id
      name
      pipelineName
      partitionsOrError {
        ... on Partitions {
          results {
            name
          }
        }
      }
      partitionStatusesOrError {
        __typename
        ... on PartitionStatuses {
          results {
            id
            partitionName
            runStatus
            runDuration
          }
        }
      }
    }
  }
}

Jack Yin

11/02/2022, 12:32 AM
nice thanks

prha

11/02/2022, 12:33 AM
I think
partitionsOrError
gives the full range of partitions, and
partitionStatusesOrError
gives the status of any runs that exist for each partition. I think the difference of those two (when selecting partition names) will give you the set of missing partitions.
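[Editor's note] The set difference prha describes can be sketched as a small pure function over the two result lists from the query above; the hand-built sample data below is illustrative, not real output.

```python
def missing_partitions(partitions, statuses):
    """Return names of partitions with no successful run, in partition order.

    `partitions` mirrors partitionsOrError.results and `statuses` mirrors
    partitionStatusesOrError.results from the GraphQL response.
    """
    succeeded = {
        s["partitionName"] for s in statuses if s.get("runStatus") == "SUCCESS"
    }
    return [p["name"] for p in partitions if p["name"] not in succeeded]

# Illustrative input shaped like the query's results:
parts = [{"name": d} for d in ("2022-10-29", "2022-10-30", "2022-10-31")]
stats = [{"partitionName": "2022-10-30", "runStatus": "SUCCESS"}]
missing = missing_partitions(parts, stats)  # ["2022-10-29", "2022-10-31"]
```

Feeding `missing` into the backfill launch (CLI or GraphQL) then re-runs only the gaps, which is the daily-cron behavior Jack is after.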

Jack Yin

11/02/2022, 12:40 AM
is there an easy way to pry into the dagit UI to see what query is being run?
i’m having trouble building out all of these objects by hand
like what is my partition set name? the name of the partitioned job?

prha

11/02/2022, 1:38 AM
Yes, in the user settings tab (the gear in the upper right corner), you can enable “Debug console logging”
That should show all the data requests being made / returned.
The partition set name is kind of an old artifact, but you should be able to get it off of your job like this:
my_job.get_partition_set_def().name

Jack Yin

11/02/2022, 4:53 AM
nice, i got it working. Thanks so much for your help!