Is `load_definitions` executed for every single asset materialization? (dagster #ask-community)

Peter Rietzler

03/06/2023, 4:27 PM

May I ask whether, under any circumstance (no matter if run locally or deployed on a cluster), the `load_definitions` function in this example is executed for every single materialization of any of the given assets? I see that `startup_time` is different for the two assets if I materialize both of them in a single run, and I would just like to verify that I can rely on this behaviour in every execution case (locally, one pod to materialize all assets, one pod per asset materialization, CacheableAssetsDefinition etc.). Thank you very much for your help!

import datetime

from dagster import Definitions, asset

# Evaluated once, whenever a process imports this module.
startup_time = datetime.datetime.utcnow()


def load_definitions() -> Definitions:
    @asset
    def asset_one(context):
        return repr({"startup_time 1": startup_time})

    @asset
    def asset_two(context):
        return repr({"startup_time 2": startup_time})

    return Definitions(assets=[asset_one, asset_two])


defs = load_definitions()

Peter Rietzler

03/06/2023, 4:33 PM

May I also ask if there is a way to find out which asset materialization is currently being executed while `load_definitions` runs? Is there an environment variable, parameter or anything else available?

chris

03/06/2023, 9:56 PM

You can rely on `load_definitions` being called once per process, not necessarily once per asset. For example, if you switch to the in_process_executor, `load_definitions` is not guaranteed to run for every single asset materialization. Likewise, if you use a graph-backed asset that contains multiple underlying ops, `load_definitions` may be called multiple times during the execution of that one graph-backed asset.
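
The once-per-process behaviour can be illustrated without Dagster at all. The sketch below is only a stand-in: `MODULE_SOURCE`, `startup_id` and `run_in_fresh_process` are hypothetical names, and the two plain functions play the role of the assets. Within one process both functions see the same module-level value; a second process evaluates the module again and gets a new one.

```python
import subprocess
import sys

# Stand-in for a definitions module (not Dagster code). The top-level
# assignment runs once per process that loads this source.
MODULE_SOURCE = """
import uuid

startup_id = uuid.uuid4().hex  # evaluated once, when the process loads the code

def asset_one():
    return startup_id

def asset_two():
    return startup_id

print(asset_one(), asset_two())
"""


def run_in_fresh_process() -> list:
    """Run the stand-in module in a new interpreter and return the two printed ids."""
    completed = subprocess.run(
        [sys.executable, "-c", MODULE_SOURCE],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.split()


first = run_in_fresh_process()   # both "assets" executed in one process
second = run_in_fresh_process()  # a separate process re-evaluates the module

print("same process, same value:", first[0] == first[1])
print("different processes, different value:", first[0] != second[0])
```

This roughly mirrors the difference chris describes: when several steps share one process they share one module-level value, while an executor that starts a process per step gives each step its own.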

chris

03/06/2023, 10:04 PM

Regarding determining which asset materialization is being executed, there’s no easy way to do so right now. I think your suggestion of having some sort of environment variable, set in a new process, that says what is being executed is pretty reasonable. Mind filing an issue for that?
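
Purely as a sketch of what the suggestion could look like from the user side: the variable name `MY_TARGET_ASSET_KEY` and the helper `current_target_asset` below are hypothetical. Dagster does not set such a variable (per the answer above), so whatever launches the process would have to set it.

```python
import os
from typing import Optional

# Hypothetical variable name; Dagster does NOT set this. It would have to be
# exported by whatever starts the process that loads the definitions.
ENV_VAR = "MY_TARGET_ASSET_KEY"


def current_target_asset() -> Optional[str]:
    """Return the asset key announced via the environment, or None if absent."""
    value = os.environ.get(ENV_VAR, "").strip()
    return value or None


# Code that builds the definitions could branch on the result.
target = current_target_asset()
if target is None:
    print("no target announced; build definitions as usual")
else:
    print(f"definitions are being loaded for: {target}")
```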

Peter Rietzler

03/07/2023, 7:51 AM

Thank you very much for your help! May I also ask whether `load_definitions` would still be called when a `CacheableAssetsDefinition` is used? As far as I understand, `CacheableAssetsDefinition` is meant for cases where `load_definitions` needs a lot of time or resources to actually compute the asset graph. I have not used it so far, so I wonder how it would behave.

Peter Rietzler

03/07/2023, 9:04 AM

In any case, I am having a hard time finding an example of how CacheableAssetsDefinition is used. Would it be possible to point me to some example code? Thanks in advance!

Mhd Mousa Hamad

03/07/2023, 9:41 AM

Allow me to iterate on this question. Consider the modified version of the code below:

import datetime

from dagster import Definitions, asset


class MyDataPipeline:
    def __init__(self):
        self.service = ...

    def load_definitions(self) -> Definitions:
        startup_time = datetime.datetime.utcnow()

        @asset
        def asset_one(context):
            # Can we reliably use `self.service` here without creating a resource for it?
            self.service.do_something()
            return repr({"startup_time 1": startup_time})

        @asset
        def asset_two(context):
            return repr({"startup_time 2": startup_time})

        return Definitions(assets=[asset_one, asset_two])


pipeline = MyDataPipeline()
dfs = pipeline.load_definitions()

Is it still acceptable practice to use `service` as defined here, compared to defining a resource for it? Are there any known pros/cons to this approach? Thank you!

owen

03/07/2023, 10:07 PM

hi! `CacheableAssetsDefinition` is an internal API and generally not recommended to be subclassed by users (the interface may break even in minor releases). But to give a bit more context, its use does not impact how and when `load_definitions` will be called.

When cacheable assets are present, information about the cacheable assets definitions is generated and serialized when your user code server starts up. This serialized information is then passed along to subprocesses when runs are launched. These subprocesses take a slightly different internal code path (reading from the serialized data to generate their assets rather than having to create that serialized data from scratch). This abstraction is mostly useful when generating that serialized data may take a long time (e.g. it hits some slow API).

As for the most recent message, the resource abstraction is useful in cases where a) you want to reuse a resource across multiple assets and have them all share some sort of config, or b) you want to substitute different definitions of the resource between environments (e.g. the local service hits a mock endpoint and the prod service hits the real thing). If neither of these is useful to you, then there should be no issue with just using the service inline.

However, is there a particular reason that you're going with this class-based approach? In general, the Definitions object maps to the entire collection of assets / jobs in a code location, so its scope should usually be larger than a single data pipeline.

Mhd Mousa Hamad

03/10/2023, 12:47 PM

Thank you Owen for your reply.

Mhd Mousa Hamad

03/10/2023, 12:50 PM

Generating the definition artefacts depends on multiple services, which are easily wired in through the class constructor, and this makes testing much easier for us.
