# announcements
Hey everyone, it's been really exciting to see the community growing. Our team is thrilled by all the high-quality engagement in the Slack and by all the interesting projects everyone is working on. Please keep the feedback coming; it's absolutely invaluable. I wanted to point out some improvements we pushed out yesterday that we think will be of broader interest. We are looking for early users of these features to give feedback leading up to our next major release, planned for June 4th.

1) Re-execution UI improvements: As @yuhan noted (and she in fact implemented), we have improved the re-execution mechanisms in the run viewer. There is one button now (yay!) and it is far more straightforward to retry successful steps. We think this is a big improvement for both local development and operational workflows.

2) Asset manager: You'll notice a new item in the left-hand navigation. This is for tracking assets. We are (finally!) starting to build on our metadata abstraction to deliver real value for everyone. You can now set an "asset key" on a Materialization event, which builds up an index of assets. The intended use case is that you develop an intuitive naming scheme for your data assets (e.g. db_schema.db_name) and can then look those assets up: see when each one was last touched, its metrics (e.g. size), and which runs and pipelines affect it. We think this is going to be great for ops workflows, e.g.: a) oh my, this table is not up to date; b) look in the asset viewer; c) see that it was last touched by pipeline X two weeks ago; d) look at the runs for that pipeline and debug. It's very flexible, so you can assign asset keys to things like ML models, reports, or even emails. Totally up to you. Cheers to @prha for whipping this together. We consider this an "alpha" feature at the moment, but it would be great to get feedback leading up to our June release.
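To picture what the asset index does with those keys, here is a toy sketch in plain Python (this is not Dagster's actual API; the class, asset key, run IDs, and metadata fields are all made up for illustration): each materialization event carries an asset key, and the index records which runs touched that asset and when.

```python
from datetime import datetime

class AssetIndex:
    """Toy model of an index built from materialization events (illustrative only)."""

    def __init__(self):
        self._assets = {}

    def record_materialization(self, asset_key, run_id, pipeline, timestamp, metadata=None):
        # Each materialization updates the asset's last-touched time and metadata,
        # and appends the run/pipeline that produced it.
        entry = self._assets.setdefault(asset_key, {"runs": []})
        entry["last_touched"] = timestamp
        entry["metadata"] = metadata or {}
        entry["runs"].append((run_id, pipeline))
        return entry

    def lookup(self, asset_key):
        # Returns when the asset was last touched, its metrics, and the runs
        # that affect it, or None if the key was never materialized.
        return self._assets.get(asset_key)

index = AssetIndex()
index.record_materialization(
    "analytics.users",            # hypothetical key in the db_schema.db_name style
    run_id="run_123",
    pipeline="user_etl",
    timestamp=datetime(2020, 5, 20),
    metadata={"size_bytes": 1024},
)

asset = index.lookup("analytics.users")
```

The ops workflow above maps onto this directly: a stale table is a `lookup` on its asset key, which tells you the last-touched time and which pipeline's runs to go debug.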
3) Remote pyspark execution: @sandy has done some really interesting work so that entire steps can be executed in remote compute environments, along with a bunch of infra work to package pyspark code and ship it to S3 for execution on EMR. The net result is that switching between pure local execution and remote execution on EMR is straightforward (just a config change using out-of-the-box components), and you no longer have to roll your own deploy scripts. You can write your computations in terms of DataFrames and get a testable pyspark pipeline that can also execute remotely on EMR with very little additional work. We'll continue to build on this to enable other capabilities. This should be really exciting for pyspark users who buy into the dagster vision but haven't yet seen it come together for the pyspark use case. Thanks everyone!
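To give a flavor of the "just a config change" point, a run config for the remote case might look something like this. This is a hypothetical sketch only; the resource name, key names, and values below are illustrative assumptions, not the exact schema:

```yaml
# Hypothetical sketch: the pipeline code stays the same, and only this
# config selects remote EMR execution instead of local execution.
resources:
  pyspark_step_launcher:
    config:
      cluster_id: j-XXXXXXXXXXXXX   # EMR cluster that runs the steps
      staging_bucket: my-bucket     # S3 bucket the packaged pyspark code ships to
      region_name: us-west-2
```

Dropping this config (or swapping in the local equivalent) is the whole switch; the DataFrame computations themselves don't change.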