# ask-community
j
Hello Team, I hope you are doing well. I am currently working on a monitoring tool and am attempting to get all run IDs for a particular partition_key. I was thinking of either using the GraphQL API or somehow utilizing the Dagster instance to get the run storage. Would this be the best way to achieve this, or is there a better method? Thanks in advance
o
hi @Jay! I wrote the answer up into a GitHub discussion for better visibility here: https://github.com/dagster-io/dagster/discussions/14763. I focused on the Python API, as that's somewhat simpler, but equivalent queries could be made via GraphQL with a bit of effort (you'd want to access the assetMaterializations on the AssetNode of the asset you're interested in)
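For anyone landing here later, here's a rough sketch of that Python API approach. The helper and function names are made up for illustration, the asset key and partition key are placeholders, and the dagster calls assume a configured instance (DAGSTER_HOME set or a dagster.yaml in place):

```python
def run_ids_from_records(records):
    """Pure helper: pull run ids (deduped, in order) out of event records.

    Works on anything with an `event_log_entry.run_id`, so it can be
    exercised without a live Dagster instance.
    """
    seen, ordered = set(), []
    for record in records:
        run_id = record.event_log_entry.run_id
        if run_id not in seen:
            seen.add(run_id)
            ordered.append(run_id)
    return ordered


def partition_run_ids(asset_key, partition_key):
    """Sketch: all run ids that materialized one partition of one asset."""
    # Imported here so the pure helper above stays dagster-free.
    from dagster import AssetKey, DagsterEventType, DagsterInstance, EventRecordsFilter

    instance = DagsterInstance.get()  # needs DAGSTER_HOME / dagster.yaml
    records = instance.get_event_records(
        EventRecordsFilter(
            event_type=DagsterEventType.ASSET_MATERIALIZATION,
            asset_key=AssetKey(asset_key),      # placeholder asset key
            asset_partitions=[partition_key],   # placeholder partition key
        )
    )
    return run_ids_from_records(records)
```

e.g. `partition_run_ids("my_asset", "2023-06-01")` would return the run ids that materialized that partition, newest-event-first ordering depending on the storage.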
j
Hey @owen, thanks for the information. Unfortunately I have tried a similar approach, and our team does not have a dagster.yaml set up, meaning that the
DagsterInstance.get()
call will fail. Is there any way to access the event records without the instance?
o
What error message are you getting? I wouldn't expect the call to fail even without a dagster.yaml file (you might get a warning), but even then I'd recommend just creating an empty
dagster.yaml
file in
~/.dagster/
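Something like this is all that step amounts to (paths as suggested above; note this only gives you a fresh local instance, not access to a remote Postgres-backed deployment):

```shell
# Create a local Dagster home with an empty config file.
mkdir -p ~/.dagster
touch ~/.dagster/dagster.yaml

# Point Dagster at it for this shell session.
export DAGSTER_HOME="$HOME/.dagster"
```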
j
The error that I am getting is the following
dagster._core.errors.DagsterInvariantViolationError: $DAGSTER_HOME "/opt/dagster/dagster_home" is not a directory or does not exist. Dagster requires this environment variable to be set to an existing directory in your filesystem
. Just for context, our team uses Postgres to store run history information, so would that mean that by creating that directory and then adding an empty dagster.yaml file, I would be able to access the event storage?
o
re: using Postgres to store the run history info, I assume this means that wherever you've deployed dagster does have a
dagster.yaml
file, it's just that you don't have a copy of that locally? If so, just to be clear: an empty local dagster.yaml file wouldn't get you access to that remote storage.
is the goal to write a script that you can run locally, or will this script be deployed to the same place where dagster is deployed?
j
for testing purposes it would be nice to have the script run locally, but the end goal would be to deploy it as a job that runs hourly on Dagster
o
I see -- if you're running it in a job, then you can get access to the instance that the job is running against from the relevant
context
object, i.e.
from dagster import OpExecutionContext, op

@op
def watcher_op(context: OpExecutionContext):
    context.instance.get_event_records(...)
j
would that get the event_records for all assets?
essentially, the end goal of the job would be to get all assets/jobs under a definition and then, for each partition of an asset, get all of the corresponding run_ids.
o
it sounds like that'd require getting all of the materialization events for all assets across all time, is that right? it's doable, but would end up straining your database if you didn't have some cursor or something to keep track of which run_ids you'd already looked at
what would you be doing with the run ids?
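The cursor bookkeeping mentioned above can be plain Python; a sketch, assuming the records you feed it carry a monotonically increasing integer `storage_id` (which is the property that Dagster's `EventRecordsFilter(after_cursor=...)` pattern relies on):

```python
def advance_cursor(records, cursor=None):
    """Return (fresh_records, new_cursor) for one polling pass.

    `records` is any iterable of objects with an integer `storage_id`
    (as Dagster's EventLogRecord has); `cursor` is the highest storage_id
    already processed, or None on the very first poll.
    """
    fresh = [r for r in records if cursor is None or r.storage_id > cursor]
    # Keep the old cursor if nothing new arrived this pass.
    new_cursor = max((r.storage_id for r in fresh), default=cursor)
    return fresh, new_cursor
```

On the Dagster side you'd persist `new_cursor` between runs and pass it back in as the `after_cursor` argument of `EventRecordsFilter`, so each hourly pass only pulls events it hasn't seen.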
j
getting the metadata (start time, end time, etc.)
yes, all of the materialization events for all assets across all time would be the end goal. My thought was that we can filter for recent runs, as the job would run daily/hourly; that way we can minimize the strain on the db
o
might it make sense to do this in reverse order then? that'd be first getting all of the recent runs, then seeing which assets / partitions those runs materialized
that's not a hard requirement or anything, the get_event_records() method will let you get whatever events you want (check out the EventRecordsFilter class linked in the github discussion), just giving some other options
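A sketch of that reverse-order idea, with the function names made up for illustration; it assumes `RunsFilter`'s `updated_after` argument and `all_logs`' `of_type` filter behave as documented, and the attribute names match recent Dagster versions (older releases used `record.pipeline_run` instead of `record.dagster_run`):

```python
from datetime import datetime, timedelta, timezone


def lookback_cutoff(hours, now=None):
    """Pure helper: the 'runs updated after' timestamp for a periodic poll."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=hours)


def recent_materializations(instance, lookback_hours=1):
    """Sketch: (asset_key, partition, run_id) triples for recently updated runs.

    `instance` is e.g. `context.instance` inside an op.
    """
    # Imported here so the pure helper above stays importable without dagster.
    from dagster import DagsterEventType, RunsFilter

    results = []
    run_records = instance.get_run_records(
        filters=RunsFilter(updated_after=lookback_cutoff(lookback_hours))
    )
    for record in run_records:
        run_id = record.dagster_run.run_id
        # Only pull the materialization events out of each run's log.
        for entry in instance.all_logs(
            run_id, of_type=DagsterEventType.ASSET_MATERIALIZATION
        ):
            event = entry.dagster_event
            results.append((event.asset_key, event.partition, run_id))
    return results
```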
j
that would also work, but how would I get that information given the issues I am facing with DAGSTER_HOME? is it simply a matter of adding that directory where we deployed Dagster?
o
the instance object should be available to you from the op context as shown above without needing to call DagsterInstance.get()
j
and that would allow me to access the event_records for all assets then? and not just one?
o
depends on what you pass in as the EventRecordsFilter (this takes an optional asset_key argument to filter for events specific to a single asset key, but that does not need to be set)
j
I see, great. Will try it out. Thanks for all of your help. I appreciate your time.
Hey @owen, Out of curiosity what would be the difference between
EventLogRecord
and
RunRecord
? When should I access one over the other?
o
event log records are the structured events that happen within a run, and run records are the representations of the runs themselves. so if you're curious about high-level stats for a run (did it succeed, when did it start, etc.), you'd want the RunRecord, and if you want to know specific details about events that happened within it (when was this asset/partition materialized), you'd want the EventLogRecord
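Concretely, a sketch of which fields live where (`summarize` is a made-up name, and the attribute names match recent Dagster versions; treat them as assumptions to verify against your installed release):

```python
def summarize(run_record, event_record):
    """Sketch: run-level vs event-level fields.

    `run_record` plays the role of a Dagster RunRecord and `event_record`
    that of an EventLogRecord.
    """
    run_level = {
        "run_id": run_record.dagster_run.run_id,
        "status": run_record.dagster_run.status,  # e.g. SUCCESS / FAILURE
        "start_time": run_record.start_time,      # unix timestamps
        "end_time": run_record.end_time,
    }
    event_level = {
        "run_id": event_record.event_log_entry.run_id,
        "timestamp": event_record.event_log_entry.timestamp,
        "storage_id": event_record.storage_id,    # usable as a cursor
    }
    return run_level, event_level
```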
j
I see, thanks!