#ask-community

Peter Rietzler

03/06/2023, 4:27 PM
May I ask whether, under any circumstances (run locally or deployed on a cluster in whatever way), the load_definitions function in this example is executed for every single materialization of any of the given assets? I see that startup_time is different for the two assets when I materialize both of them in a single run, and I would just like to verify that I can rely on this behaviour in every execution case (locally, one pod to materialize all assets, one pod per asset materialization, CacheableAssetsDefinition, etc.). Thank you very much for your help!
import datetime

from dagster import Definitions, asset

# Evaluated once, when a process imports this module.
startup_time = datetime.datetime.utcnow()


def load_definitions() -> Definitions:
    @asset
    def asset_one(context):
        return repr({"startup_time 1": startup_time})

    @asset
    def asset_two(context):
        return repr({"startup_time 2": startup_time})

    return Definitions(assets=[asset_one, asset_two])


defs = load_definitions()
May I also ask if there is a way to find out which asset materialization is currently being executed while load_definitions is executed? Is there an environment variable, parameter or anything else available?

chris

03/06/2023, 9:56 PM
You can rely on load_definitions being called once per process, not necessarily once per asset. For example, if you switch to the in_process_executor, load_definitions is not guaranteed to run for every single asset materialization. Likewise, if you use a graph-backed asset that contains multiple underlying ops, load_definitions may be called multiple times during the execution of that graph-backed asset.
Regarding determining which asset materialization is being executed, there's no easy way to do so right now. I think your suggestion of having some sort of environment variable set in a new process that tells what is being executed is pretty reasonable. Mind filing an issue for that?
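To make the "once per process" point concrete, here is a minimal plain-Python sketch. It does not use Dagster at all: MODULE_SOURCE is a hypothetical stand-in for a definitions module, and the two "steps" merely simulate asset materializations that share one process. It only illustrates that module-level code runs again in every fresh interpreter but not again for each step inside the same process.

```python
import subprocess
import sys

# Hypothetical stand-in for a definitions module (no Dagster involved).
MODULE_SOURCE = '''
import os

load_count = 0

def load_definitions():
    global load_count
    load_count += 1
    return {"pid": os.getpid()}

defs = load_definitions()  # module-level call: runs when the process loads the code

# Two simulated "steps" executed in this same process reuse the loaded defs.
for step in ("asset_one", "asset_two"):
    print(step, defs["pid"], load_count)
'''


def run_in_new_process():
    """Run the stand-in module in a fresh interpreter; return (step, load_count) pairs."""
    result = subprocess.run(
        [sys.executable, "-c", MODULE_SOURCE],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = [line.split() for line in result.stdout.splitlines()]
    return [(step, int(count)) for step, _pid, count in rows]


first_process = run_in_new_process()
second_process = run_in_new_process()
print(first_process)
print(second_process)
```

Each fresh process reports a load count of 1 for both of its steps, which mirrors the difference between one process per step and several steps sharing a process.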

Peter Rietzler

03/07/2023, 7:51 AM
Thank you very much for your help! May I also ask whether load_definitions would also be called when CacheableAssetsDefinition is used? As far as I understand, CacheableAssetsDefinition should be used when load_definitions takes a lot of time or resources to actually compute the asset graph. I have not used it so far and thus wonder how it would behave.
I am also having a hard time finding an example of how CacheableAssetsDefinition is used. Would it be possible to point me to some example code? Thanks in advance!

Mhd Mousa Hamad

03/07/2023, 9:41 AM
Allow me to iterate on this question. Consider the modified version of the code below:
import datetime

from dagster import Definitions, asset


class MyDataPipeline:
    def __init__(self):
        self.service = ...

    def load_definitions(self) -> Definitions:
        startup_time = datetime.datetime.utcnow()

        @asset
        def asset_one(context):
            # Can we reliably use `self.service` here without creating a resource for it?
            self.service.do_something()
            return repr({"startup_time 1": startup_time})

        @asset
        def asset_two(context):
            return repr({"startup_time 2": startup_time})

        return Definitions(assets=[asset_one, asset_two])


pipeline = MyDataPipeline()
dfs = pipeline.load_definitions()
Is it still an acceptable practice to use the service as defined, compared to defining a resource for it? Are there any known pros/cons of this approach? Thank you!

owen

03/07/2023, 10:07 PM
Hi! CacheableAssetsDefinition is an internal API and generally not recommended to be subclassed by users (the interface may be broken even on minor releases). But to give a bit more context, its use does not impact how and when load_definitions will be called. When cacheable assets are present, information about the cacheable assets definitions is generated and serialized when your user code server starts up. This serialized information is then passed along to subprocesses when runs are launched. These subprocesses take a slightly different internal code path: they read from the serialized data to generate their assets rather than having to create that serialized data from scratch. This abstraction is useful mostly when generating that serialized data may take a long time (e.g. it hits some slow API).
As for the most recent message, the resource abstraction is useful in cases where a) you want to reuse a resource for multiple different assets and have them all share some sort of config, or b) you want to substitute different definitions of the resource between environments (e.g. the local service hits a mock endpoint, and the prod service hits the real thing). If neither of these is useful to you, then there should be no issues in just using the service inline.
However, is there a particular reason that you're going with this class-based approach? In general, the Definitions object maps to the entire collection of assets and jobs in a code location, so its scope should usually be larger than a single data pipeline.

Mhd Mousa Hamad

03/10/2023, 12:47 PM
Thank you Owen for your reply.
The generation of the definitions depends on multiple services, which are easily wired in through the class constructor, and this makes testing much easier for us.