is it possible to define a graph that uses a runtime config dagster #ask-community

Charlie Bini

02/15/2022, 11:36 PM

is it possible to define a graph that uses a runtime config? I want the graph to use an “op factory” pattern based off of a config variable but the value of that variable needs to be defined at the job level. does that make sense?

Charlie Bini

02/15/2022, 11:45 PM

i’m trying to define a generic graph that will take a list of queries and generate the ops that will, in parallel, execute each query and then do some subsequent steps with the respective output

Charlie Bini

02/15/2022, 11:45 PM

the jobs created from the graph will all follow the same pattern, but the queries (and other config) passed to each job will be different

Charlie Bini

02/15/2022, 11:57 PM

so I want the graph to look something like this

@graph
def configurable_graph():
    client = get_client()

    for query in graph_config["queries"]:
        op1, op2, op3 = op_factory(query)

        out1 = op1(client)

        out2 = op2(out1)

        out3 = op3(out2)

Charlie Bini

02/15/2022, 11:58 PM

and the jobs would look like this:

job1 = configurable_graph.to_job(
    resource_defs={
        "foo": bar1,
        "spam": eggs1,
    },
    config=config_from_files(
        [
            "./config/resource.yaml",
            "./config/queries-1.yaml",
        ]
    ),
)

job2 = configurable_graph.to_job(
    resource_defs={
        "foo": bar2,
        "spam": eggs2,
    },
    config=config_from_files(
        [
            "./config/resource.yaml",
            "./config/queries-2.yaml",
        ]
    ),
)

Charlie Bini

02/16/2022, 12:01 AM

the graph should read the queries from queries-#.yaml and be accessible through what I called graph_config["queries"], but I don't know what Dagster component graph_config actually is, if it even exists

David Farnan-Williams

02/16/2022, 12:03 AM

Something similar I wrote, perhaps this is helpful:

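from typing import Any, Dict, Optional

import pandas as pd
from dagster import OpDefinition, OpExecutionContext, Out, Output, file_relative_path, op

# PREFIX_DATAFRAME is a module-level string constant defined elsewhere (not shown).


def get_query_dataframe_op(
    name: str,
    connection_resource: str,
    query: Optional[Dict[str, Any]] = None,
    parameters: Optional[Dict[str, Any]] = None,
) -> OpDefinition:
    # Copy the dict so the caller's object is never mutated, and so that
    # query=None does not fail on the membership test below.
    query = dict(query or {})

    # Fall back to queries/<name>.sql when no inline SQL was passed.
    if "sql" not in query:
        sql_query_file_path = file_relative_path(__file__, f"queries/{name}.sql")
        with open(sql_query_file_path, "r") as query_file:
            query["sql"] = query_file.read()

    output_dataframe_name = f"{PREFIX_DATAFRAME}{name}"
    out = {
        output_dataframe_name: Out(
            Any,
            metadata={"label": name},
            # asset_key=AssetKey(name),
            io_manager_key="dataset_io_manager",
        )
    }

    @op(
        name=f"query_{name}",
        description=f"\n```SQL\n{query['sql']}```",
        out=out,
        required_resource_keys={connection_resource},
    )
    def query_dataframe(context: OpExecutionContext):
        # Format into a local variable so the template is not overwritten
        # and re-executing the op formats the original SQL again.
        sql = query["sql"]
        if parameters is not None:
            sql = sql.format(**parameters)
        dataframe = pd.read_sql_query(
            con=getattr(context.resources, connection_resource).engine,
            **{**query, "sql": sql},
        )
        return Output(value=dataframe, output_name=output_dataframe_name)

    return query_dataframe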

David Farnan-Williams

02/16/2022, 12:07 AM

Sounds like you may just need one layer above that. For example, here they show building a graph from YAML: https://docs.dagster.io/concepts/ops-jobs-graphs/jobs-graphs#graph-dsl

David Farnan-Williams

02/16/2022, 12:29 AM

In their graph DSL example they don't load the graph config or YAML through the graph or job definition, or through the launchpad. I think this is because the graph is created when the code is compiled for the registry, and the launchpad configuration is intended to fit the graph that exists at that time. If you didn't pass it a YAML file there wouldn't be a graph. So it seems you need some configuration further back, at the locations or dagster yaml level, or, like they did, built into the graph-compiling code. It sounds like you want configuration exposed in the UI that could be used to construct a graph or graphs before running them through the launchpad? Or does the UI not factor into this, and you just need to execute a dynamically created graph and don't really care whether it is exposed through the UI prior to asset materialization?

Charlie Bini

02/16/2022, 12:31 AM

yeah UI doesn't really matter, building a graph from yaml is interesting but now that seems too far in the other direction lol

Charlie Bini

02/16/2022, 12:34 AM

I've got this working by spooling the queries via DynamicOutput, but I haven't been successful in passing that output to an op factory

Charlie Bini

02/16/2022, 12:37 AM

and I'm trying to use an op factory to leverage some more of the op definition features that I'm not able to hardcode (e.g. metadata)

David Farnan-Williams

02/16/2022, 2:12 AM

Example Dynamic Query Graph:

from typing import List, Tuple

import sqlalchemy as sa
from dagster import DynamicOut, DynamicOutput, graph, job, op, resource
from dagster.utils.yaml_utils import load_yaml_from_path


@op(config_schema={"yaml_path": str}, out=DynamicOut(Tuple[str, str]))
def query_iterable_from_yaml(context):
    context.log.info("Loading query yaml: " + context.op_config["yaml_path"])
    yaml_data = load_yaml_from_path(context.op_config["yaml_path"])

    # Emit one dynamic output per query, keyed by the query name.
    for name, query_config in yaml_data["queries"].items():
        yield DynamicOutput((name, query_config["sql"]), mapping_key=name)


@op(required_resource_keys={"sqlalchemy_connection"})
def execute_query(context, query_tuple):
    name, sql = query_tuple
    # Engine.execute is the SQLAlchemy 1.x API.
    result = context.resources.sqlalchemy_connection.execute(sql)
    return result


@op
def result_fan_in(results: List):
    # Placeholder for whatever should happen with the collected results.
    for result in results:
        pass


@graph
def query_graph():
    query_iterable = query_iterable_from_yaml()
    # Run execute_query once per dynamic output, then gather the results.
    results = query_iterable.map(execute_query).collect()
    result_collection = result_fan_in(results)
    return result_collection


@resource(config_schema={"connection_string": str})
def sqlalchemy_resource(context) -> sa.engine.Engine:
    engine = sa.create_engine(context.resource_config["connection_string"])
    return engine


@job(
    resource_defs={
        "sqlalchemy_connection": sqlalchemy_resource.configured(
            {
                "connection_string": "mssql+pyodbc://server/database?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
            }
        )
    }
)
def query_job():
    query_graph()


query_job.execute_in_process(
    run_config={
        "ops": {
            "query_graph": {
                "ops": {
                    "query_iterable_from_yaml": {"config": {"yaml_path": "query.yaml"}}
                }
            }
        }
    }
)

David Farnan-Williams

02/16/2022, 3:25 AM

I think the above graph is roughly what you're looking for.
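For reference, the `query_iterable_from_yaml` op above iterates over `yaml_data["queries"].items()` and reads a `sql` key from each entry, so `query.yaml` presumably maps query names to objects with a `sql` field. The file itself was not shared in the thread, so the snippet below is only a guess at its shape, parsed with PyYAML. The query names and SQL strings are made up for illustration.

```python
import yaml

# Hypothetical contents of query.yaml; names and SQL are illustrative only.
QUERY_YAML = """
queries:
  users:
    sql: SELECT * FROM users
  orders:
    sql: SELECT * FROM orders
"""

yaml_data = yaml.safe_load(QUERY_YAML)

# Same traversal as the op: one (name, sql) pair per query, keyed by name.
pairs = [(name, cfg["sql"]) for name, cfg in yaml_data["queries"].items()]
print(pairs)
```

Since each query name doubles as the `mapping_key` of a `DynamicOutput`, it is probably safest to stick to letters, digits, and underscores in those names.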
