
Rafał Wojdyła

02/16/2021, 11:05 AM
👋 I'm trying to use `MemoizableIOManager` but I'm having a hard time; right now I'm hitting 2 issues/questions, and I would highly appreciate your help. I'm on dagster==0.10.5.
1. `has_output` appears to be getting called with context set to `None`, so I can't validate whether the external data exists. This is a simplified version of my current POC: https://gist.github.com/ravwojdyla/6a546e3fa65459b17413aac10c643f25
2. When I hard-code `has_output` to return `True` (just as a POC), I'm getting this error:
for step in self.get_steps_to_execute_by_level()[0]:
IndexError: list index out of range
I was expecting to get an error in that case, but that specific error made me think about how the memoization appears to work: is it fair to say that for memoization to work, the "central" scheduler needs to have access to the previously run "steps" (to fetch config etc.)?
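For reference, the IO manager in the gist is roughly this shape. A framework-free sketch, assuming you want file-backed memoization: the real thing subclasses dagster's `MemoizableIOManager` and receives context objects, while the class and method signatures below are illustrative only.

```python
import os
import pickle
import tempfile


class FileMemoizingIOManager:
    """Memoizes step outputs on disk, keyed by (step_key, version)."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def _path(self, step_key, version):
        return os.path.join(self.base_dir, f"{step_key}.{version}.pkl")

    def has_output(self, step_key, version):
        # True -> the step can be skipped and its output loaded from disk.
        return os.path.exists(self._path(step_key, version))

    def handle_output(self, step_key, version, obj):
        with open(self._path(step_key, version), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, step_key, version):
        with open(self._path(step_key, version), "rb") as f:
            return pickle.load(f)


# Usage: first run stores the output, second run sees it as memoized.
base = tempfile.mkdtemp()
mgr = FileMemoizingIOManager(base)
assert not mgr.has_output("extract", "v1")
mgr.handle_output("extract", "v1", {"rows": 3})
assert mgr.has_output("extract", "v1")
assert mgr.load_input("extract", "v1") == {"rows": 3}
```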

alex

02/16/2021, 3:27 PM
@chris

chris

02/16/2021, 3:30 PM
hey! thanks for giving memoization a shot. Can I get more information regarding how you're executing? As in, are you using the python API, CLI, etc?

Rafał Wojdyła

02/16/2021, 3:33 PM
👋 @chris executing via python API and CLI (tried both)
I'm happy to create GH issues for those two btw(?)

chris

02/16/2021, 3:42 PM
hey! so I was able to reproduce the second error
It's happening whenever every step in a pipeline is memoized. Thanks so much for finding this! Will make an issue
As for the first issue, two things:
1. I noticed that in the github snippet you posted, the solid has no version attribute attached. Is this the case in the actual version of the pipeline you're running?
2. Would you mind posting the exact error message?

Rafał Wojdyła

02/16/2021, 3:47 PM
@chris do you mind if I create an issue for the 2nd problem? that way I can report that back?

chris

02/16/2021, 3:48 PM
yup that's fine

Rafał Wojdyła

02/16/2021, 3:54 PM
@chris re the 1st issue: correct, no version. Is it required?
@chris actually I simplified that gist, and the issue isn't that the `context` is `None`, but that `context.log` is `None`:
File "pipelines/data_sources/efo/dagster_tasks.py", line 30, in has_output
    context.log.info(f"Trying to load from {context}")
AttributeError: 'NoneType' object has no attribute 'info'
is that a bug as well?

chris

02/16/2021, 3:57 PM
ah, so yeah, it's required that you provide a version attribute to your solids, which essentially represents the version of the code enclosed by the solid. You can set it to some fixed value, a la `@solid(version="hello")`, to get up and running

Rafał Wojdyła

02/16/2021, 3:59 PM
@chris I see, added version. Just FYI, as you probably expect, that doesn't mitigate the issue of `context.log` being `None`; maybe some kind of init is not done before `has_output`?

chris

02/16/2021, 4:05 PM
Right. Regarding the logging issue, that is definitely a bug.

Rafał Wojdyła

02/16/2021, 4:05 PM
@chris ok, will create an issue for that as well

chris

02/16/2021, 4:06 PM
Appreciated! Will try to get fixes out for these soon

Rafał Wojdyła

02/16/2021, 4:06 PM
@chris thanks for your help thus far btw. Going back to the question:
I was expecting to get an error in that case, but that specific error made me think how the memoization appears to work, is it fair to say that for memoization to work the "central" scheduler needs to have access to the previously run "steps" (to fetch config etc)?

chris

02/16/2021, 4:07 PM
Right, so this error is actually happening downstream of the memoization process

Rafał Wojdyła

02/16/2021, 4:09 PM
@chris right. For context, I am currently evaluating whether we should start using Dagster as an orchestrator, so I'm trying to grasp the assumptions dagster makes about memoization. So for memoization to work, does Dagster need information about previous runs of the pipeline or not?

chris

02/16/2021, 4:11 PM
In theory, it should not. Memoization is designed not to be tied to runs, but instead to computed versions for a given output.

Rafał Wojdyła

02/16/2021, 4:12 PM
@chris in theory? is the practice different than theory?

chris

02/16/2021, 4:15 PM
Sorry, "in theory" was a poor choice of words. "It should not" is a more accurate way of saying it.

Rafał Wojdyła

02/16/2021, 4:19 PM
@chris cool, thanks! FYI, the 2nd issue: https://github.com/dagster-io/dagster/issues/3690. Speaking of memoization, I'm sorry I have all these questions, but I don't think there is documentation for this(?). My 2nd question is about memoization depending on the value of parameters (given from the CLI); is there documentation or an example you could point me at?

chris

02/16/2021, 4:22 PM
To give a brief primer on the way memoization works:
1. Given the full execution plan, we assign a "version" to each output in the pipeline. This version is dependent upon all upstream versions in the pipeline.
2. For each output, we use the io manager to determine whether or not an object has already been stored with the version tied to that output. If a given output/version has not been stored, then that output plus all outputs downstream of it must be recomputed.
3. We execute the set of steps that have not been memoized. For steps that expect inputs from steps that have cached outputs, we just use the io manager to retrieve those inputs.
There exists documentation of how versions for each output are computed here, which also describes how versions change as parameters change (config, version arg to solid, etc). However, I don't think we expose this in our API docs.
Do you think it would be helpful to have a doc that described this process in more detail?
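The version propagation in step 1 can be sketched like this. A toy illustration only: the hashing scheme and function names are made up for the example and are not dagster's actual internals.

```python
import hashlib


def output_version(code_version, upstream_versions):
    # A step output's version folds in its own code version plus the
    # versions of every upstream output it depends on.
    h = hashlib.sha1()
    h.update(code_version.encode())
    for v in sorted(upstream_versions):
        h.update(v.encode())
    return h.hexdigest()[:8]


# extract -> transform: changing extract invalidates transform too.
v_extract_old = output_version("extract-v1", [])
v_extract_new = output_version("extract-v2", [])
v_transform_old = output_version("transform-v1", [v_extract_old])
v_transform_new = output_version("transform-v1", [v_extract_new])

assert v_transform_old != v_transform_new  # downstream must recompute
assert output_version("extract-v1", []) == v_extract_old  # deterministic
```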

Rafał Wojdyła

02/16/2021, 4:43 PM
@chris yeah, so full disclosure, I've been reading the docs and playing with dagster for only a couple of hours, and I think the docs are a bit all over the place; they're missing a kind of "golden/blessed path" example with a "standard" data pipeline following "best data engineering practices". And by a standard data pipeline, I mean a pipeline that works with files (which are "memoized") and has multiple idempotent steps. In the example there should be at least 2 such pipelines and some data dependencies between them.
@chris if you think that makes sense I can open an issue for that as well and we can continue the discussion there?

chris

02/16/2021, 4:54 PM
so are you referring to the dagster docs in general, or the memoization doc specifically?

Rafał Wojdyła

02/16/2021, 5:27 PM
@chris the dagster docs in general. There's the airline example, but isn't that too complex as an example to illustrate best practices?
@chris btw, the solids in that example do not set `version`, is that right?
@chris I'm not sure if you are familiar with luigi, but take a look at this example: https://luigi.readthedocs.io/en/stable/example_top_artists.html. I know Luigi and Dagster are not the same systems and have different assumptions (tho both are orchestrators), but arguably that example does a better job at presenting essentially the same set of features as the airline example (if we focus on the orchestration). wdyt?
@chris and regarding memoization: the airline example, which I assume presents "real life" pipelines, doesn't use `version` or per-solid hashing of files? Is that kind of per-solid file hashing something you recommend doing in real pipelines (which would require a file per solid)?

chris

02/16/2021, 6:04 PM
ah gotcha. So memoization is still an experimental feature, and only works from the CLI. Thus, we don't really expose it in examples intended to reflect more idiomatic usage yet

Rafał Wojdyła

02/16/2021, 6:06 PM
@chris so would you agree with the statement that right now dagster doesn't have idiomatic support for memoized file outputs?

chris

02/16/2021, 6:11 PM
Memoization is definitely designed around using something like file outputs; I think it's more accurate to say that memoization in general is still quite rough around the edges.
This is all super valuable signal though, really appreciate your feedback as we iterate on memoization capabilities.
As for your question of whether that kind of per-solid file hashing is something I'd recommend doing in real-life pipelines: if you need memoization as part of your use case, then I would say yes, but with the caveat that memoization isn't really ready for production use yet, so expect to run into some rough edges.

Rafał Wojdyła

02/16/2021, 6:51 PM
@chris I see, could you please elaborate why? There is an obvious cost to it (I mean readability). And maybe it's worth scoping the problem: I can understand why you would want that in the super general case of an orchestrator, BUT if we focus on an idiomatic data pipeline (idempotent ETL with date-partitioned output), in most cases you don't need to recompute on code changes (apart from limited special cases), since you just produce new data in a new date partition? Am I missing sth that dagster assumes?

chris

02/16/2021, 6:53 PM
Right, if you don't care about recomputing on code changes, then you can hard-code the version argument, and recomputation will only happen as inputs and config change.
The code versioning is more useful for the purposes of active development. But you're right: if you're at a point where you don't care to recompute on code changes, then hard-coding the version argument would be the approach to use.
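Concretely, the trade-off with a hard-coded version can be sketched like so. Again an illustrative toy hash, not dagster's internals: with a fixed code version, only config (and upstream versions) move the output version.

```python
import hashlib


def output_version(code_version, config_repr, upstream_versions):
    # Fold code version, config, and upstream versions into one digest.
    h = hashlib.sha1()
    for part in [code_version, config_repr, *sorted(upstream_versions)]:
        h.update(part.encode())
    return h.hexdigest()[:8]


FIXED = "v1"  # hard-coded, a la @solid(version="v1")

run_monday = output_version(FIXED, "date=2021-02-15", [])
rerun_monday = output_version(FIXED, "date=2021-02-15", [])
run_tuesday = output_version(FIXED, "date=2021-02-16", [])

assert run_monday == rerun_monday  # same partition: memoized, step skipped
assert run_monday != run_tuesday   # new date partition: recomputed
```

This matches the date-partitioned ETL case above: each new partition gets a new config value, so it is recomputed once and then memoized, while code edits alone never invalidate anything.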