Periodic Reporting on Jobs, Schedules and Runs
# ask-community
Stefan Adelbert
I'd like to create a job which will run periodically to query the dagster instance for information like:
• all jobs in all workspaces
• all schedules associated with those jobs
• all runs within some recent time range
I already have a job which calls `context.instance.get_runs()` and then runs some checks on those runs, so that should cover the last point above. But it's not clear from the [documentation](https://docs.dagster.io/_apidocs/internals#instance) how I could programmatically get information about workspaces, jobs and schedules from the dagster instance. I can get this information from the GraphQL API (https://docs.dagster.io/concepts/dagit/graphql#get-a-list-of-repositories), but I'd rather have a job collect this info and then "call home". Any advice on this would be very useful.
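For reference, a minimal sketch of the kind of job described above, assuming a reasonably recent dagster version where `RunsFilter` is available; the op and job names are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

from dagster import RunsFilter, job, op


@op
def check_recent_runs(context):
    # Ask the instance for runs updated within the last 24 hours.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    runs = context.instance.get_runs(filters=RunsFilter(updated_after=cutoff))
    for run in runs:
        context.log.info(f"{run.job_name} / {run.run_id}: {run.status}")


@job
def periodic_run_report():
    check_recent_runs()
```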
Denis Maciel
I am also interested in that
Stefan Adelbert
@Denis Maciel Please let me know if you find anything useful on the Dagster instance, particularly for getting info on the workspace (repos, jobs, tags, metadata). I'm going to look today and I'll post back if I find something.
I've had a look at the source code and it's not obvious how to get a snapshot of the workspace status using a dagster instance directly. I don't know enough about how the code is organised logically. Maybe there is a zero-config way of making GraphQL queries from a job. By zero-config, I mean a way to discover the GraphQL endpoint without needing to configure the job with a hostname or IP address. This obviously only matters if the job is being executed on a different machine to dagit. Perhaps the dagster instance can be queried for that information.
@Denis Maciel Any ideas on how to do this? The only way I can think of is making GraphQL calls to dagit from a dagster job. The GraphQL python client (https://docs.dagster.io/concepts/dagit/graphql-client) only caters for a small subset of the GraphQL capabilities, so the calls would need to be made directly, probably using `gql` (like the dagster GraphQL python client does).
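A minimal sketch of that approach, assuming the `gql` 3.x client with the requests transport installed; the dagit URL is a placeholder that would need to be configured or discovered somehow:

```python
from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

# Placeholder URL -- the dagit host/port has to come from config, an
# environment variable, or some naming convention.
transport = RequestsHTTPTransport(url="http://dagit:3000/graphql")
client = Client(transport=transport, fetch_schema_from_transport=False)

LIST_REPOSITORIES = gql(
    """
    {
      repositoriesOrError {
        ... on RepositoryConnection {
          nodes {
            name
          }
        }
      }
    }
    """
)

result = client.execute(LIST_REPOSITORIES)
print(result["repositoriesOrError"]["nodes"])
```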
Denis Maciel
hey stefan, I haven't really looked into the issue yet. We have a use case for it, but it's not a high priority for us right now. For me, it would be enough to set up a GraphQL client from within a job run. I'm just not sure what the easiest way to do that is.
owen
hi @Stefan Adelbert -- you're (unfortunately) correct that there's no way to find this information by directly querying the instance; GraphQL is basically the only reasonable way of accessing it in a structured way. (Simplifying a bit) we only serialize this information over gRPC calls (it's not stored in the database), and the dagit instance is generally the place where you configure your list of code locations, so that's the thing that knows which gRPC servers to make those calls to, not the dagster instance. GraphQL then adds a layer of abstraction over the gRPC calls.
I also don't think there's really a no-configuration way to know where to send those GraphQL queries, since the dagit host can change independently from the instance -- the simplest example being that dagit doesn't need to be running at all for dagster to run (the state of the instance is identical in either case).
But down to specifics, I think the query you'd want to send is along the lines of:
```graphql
{
  repositoriesOrError {
    ... on RepositoryConnection {
      nodes {
        name
        jobs {
          name
          schedules {
            name
          }
        }
      }
    }
  }
}
```

This will return data of the form:
```json
"data": {
    "repositoriesOrError": {
      "nodes": [
        {
          "name": "hacker_news_repository",
          "jobs": [
            {
              "name": "__ASSET_JOB_0",
              "schedules": []
            },
            {
              "name": "activity_analytics_job",
              "schedules": []
            },
            {
              "name": "core_job",
              "schedules": [
                {
                  "name": "core_job_schedule"
                }
              ]
            },
            {
              "name": "story_recommender_job",
              "schedules": []
            }
          ]
        },
    ]
  }
}
That is, a list containing one element per repository; each repository element contains a list of jobs, and each job contains a list of the schedules it's associated with. You can add more fields to this returned object if desired (digging around in <dagit_url>/graphql is pretty useful).
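A minimal sketch of walking that response for reporting purposes, assuming `result` is the dict returned by a GraphQL client (gql's `client.execute(...)` returns the contents of `data` directly); `report_row` is a hypothetical helper, not a dagster API:

```python
# Flatten the repositories -> jobs -> schedules structure into rows.
for repo in result["repositoriesOrError"]["nodes"]:
    for job in repo["jobs"]:
        schedule_names = [schedule["name"] for schedule in job["schedules"]]
        # `report_row` is a hypothetical "call home" / reporting helper.
        report_row(
            repository=repo["name"],
            job=job["name"],
            schedules=schedule_names,
        )
```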
Stefan Adelbert
@owen Thank you for your reply. This is pretty much where I'd landed as the intended solution. In my case, the (docker compose) service running dagit is called `dagit`, so I should be able to find the GraphQL endpoint implicitly from a user code repo container, i.e. http://dagit:3000/graphql. At least this way I can avoid explicit configuration by relying on convention. I have modified my thinking slightly in that I reckon I'll periodically retrieve a list of all jobs (as you've outlined above), which will give me a periodic snapshot. And then for runs, I'll set up a run status sensor (https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#run-status-sensors) which will update the reporting database with the relevant run info.
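A minimal sketch of that run status sensor idea, assuming a recent dagster version; `write_run_to_reporting_db` is a hypothetical helper for the reporting database:

```python
from dagster import DagsterRunStatus, RunStatusSensorContext, run_status_sensor


@run_status_sensor(run_status=DagsterRunStatus.SUCCESS)
def record_successful_runs(context: RunStatusSensorContext):
    run = context.dagster_run
    # `write_run_to_reporting_db` is a hypothetical helper, not a dagster API.
    write_run_to_reporting_db(
        run_id=run.run_id,
        job_name=run.job_name,
        status=str(run.status),
    )
```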