# ask-community
a
hello fellow dagsters, i really need help now, and i am getting amazingly frustrated. for the life of me i can't put the pieces of the dagster docs & tutorial together for a (what i think) super-simple real-world use case:
• get/create/magically invent an asset (a ZIP file)
  ◦ unpack that ZIP file and split it into multiple assets (one per CSV file inside)
  ◦ process those CSV files in independent … jobs? whatever, "processing streams". i need the same for the CSV files, they are split as well.
BUT. let's start with the obvious question, based on this code example: where the … will the asset data be?!? `persist_to_storage` is neither imported nor explained, and the API docs of `AssetMaterialization` are of no help either. then i don't know, at all, how to connect `Job`s, `Op`s, and `Asset`s. can i write `def my_op(my_asset): …`? or something like this:
```python
from dagster import job

# for example purposes, i would love to have ...
#   my_zip_asset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
# ... and split_asset_op() to split that into singular
# assets each representing one number

@job
def my_job():
    # called automatically once zip_asset is available?!
    # how do i define zip_asset, when it is being read
    # from a queue? as an op? how?
    split_asset_op(my_zip_asset)

    # if i get this correctly, the OP will then "materialize"
    # an asset called "zip_sub_asset" or whatever i want to
    # name it, and i then can react on this ... later? somewhere else?

@job
def process_splits():
    # called automatically when the "zip_sub_asset" is available?!
    # since this is not an annotated function - how do i reference
    # the asset without a function?

    # stupid op that converts 5 to "0005"
    zfill_op(zip_sub_asset)
```
if anybody could help me with a couple of lines of code instead of documentation pointers, i would be amazingly grateful. i really really don't get it.
t
Hi! Sorry to hear about your frustration. In this case:
• `persist_to_storage` isn't a function that Dagster provides. It's a hypothetical function to show that the user would write the data to storage themselves.
• Where the asset data will be depends on where you write it. If you're manually writing to the file system, it'll be on the file system that Dagster is running on.
  ◦ If you're using an I/O manager (e.g. returning data from an asset), by default the data will be written to storage as a pickled file: under the directory defined by the `DAGSTER_HOME` env var if that's defined, or in a directory under your Dagster project prefixed with `tmp*` if `DAGSTER_HOME` isn't defined.
• Re: connecting ops, jobs, and assets:
  ◦ Because of the dynamic nature of what you're doing, I'd recommend using ops to create assets, rather than using the `@asset` definition.
  ◦ Instead, it'd be easier to use an op- and graph-based approach that generates assets. So it'd be:
    1. An op to get/create/magically invent the ZIP file
    2. A downstream op that processes those ZIP files
    3. Wrap those in a graph to let you loop over each ZIP file
    4. Meanwhile, you can use `AssetMaterialization`s to tell Dagster that these assets are being built during these runs
a
thanks. this is, unfortunately, barely helping, because, like i said, i horribly fail at creating those things you mentioned. i am experimenting with asset materializations, and the documentation is just utterly *un*helpful.
• what is a `ReadMaterializationConfig`? is that again something that is created by the user, without any hint except, maybe, the op's name `read_materialization`?
• where (in that example) is `asset_event.dagster_event.asset_key` coming from? it's not a property from `DagsterEvent`, and there is, of course, no explanation.
  ◦ is it maybe the `asset_key=` parameter from the `AssetMaterialization` creation? would `description` also be a part of it? is there any mention of that magic anywhere, at all?
the longer i look at that documentation, the more i wonder if the docs are just horribly sub-par, or the whole thing just utterly over-brained and complex. as i said, i am really getting frustrated here, and so far the only reason i didn't abandon the whole thing yet is that i am really stubborn.
d
To answer the questions in the two bullet points:
• `ReadMaterializationConfig` is intended to be a user-defined config class for the op, yes. https://docs.dagster.io/concepts/configuration/config-schema#defining-and-accessing-configuration-for-an-op-or-asset gives an example of such a class. Appreciate the feedback that this could be more clear.
• `asset_key` is actually a property of `DagsterEvent`: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/events/__init__.py#L632-L634 - we'll look into why it's not appearing in the API docs; there may be a bug in the parsing library that we're using to turn annotations into API docs.
We'll pass the broader feedback about the docs being unclear on to the team working on docs improvements as well. Appreciate that there's a steep learning curve with many concepts, and that there are many places we can make them better.
I just put out a PR for the issue with the missing properties in the API docs - thanks for reporting that
a
i "kinda" managed, and i already got the feedback that i might have missed a better way to do it: partitioned … assets, IIRC. which i looked at, but i failed to even understand what problem they solve 😆. if you're interested, see my blog post.