dwall

01/07/2020, 9:31 PM
some best practice questions for y'all - a very common pattern for us is to execute a SQL query on Snowflake, fetch the results, and return the result set as a pandas DataFrame (most of the time saving an artifact of the dataset in cloud storage as well). Is there an established pattern in Dagster for this? Is it recommended to decompose this kind of operation into multiple solids (one for running the query, one for constructing the DataFrame from the result set, etc.), or to house the whole operation in a single solid?

second question: we need to take the result set mentioned above and feed it into multiple downstream solids that can all run in parallel. These solids are largely identical and differ only by a small piece of config. Is there a way to programmatically generate these solids with different config values and have them all run in parallel?

max

01/07/2020, 9:51 PM
hi @dwall

dwall

01/07/2020, 9:51 PM
👋

max

01/07/2020, 9:52 PM
we are definitely seeing some people use factory functions to generate solids, which might be an approach you could use for (2) -- alternatively, you could use many aliases of the same solid and configure each alias explicitly in YAML, which keeps the variation visible in config rather than in code
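A minimal sketch of the factory-function approach might look like the following. The solid bodies, the `threshold` parameter, and the solid names are all hypothetical stand-ins for whatever small piece of config actually varies; note also that the `@pipeline` composition syntax shown here landed in releases just after 0.6.6, where you would instead declare dependencies via `PipelineDefinition`.

```python
import pandas as pd
from dagster import pipeline, solid


@solid
def fetch_data(_):
    # Stand-in for the Snowflake-fetching solid discussed in this thread.
    return pd.DataFrame({"score": range(100)})


def make_filter_solid(name, threshold):
    # Hypothetical factory: `threshold` represents the small piece of
    # config that differs between the otherwise-identical solids.
    @solid(name=name)
    def _filter_solid(context, df):
        context.log.info("filtering with threshold=%d" % threshold)
        return df[df["score"] > threshold]

    return _filter_solid


variants = [make_filter_solid("filter_above_%d" % t, t) for t in (10, 50, 90)]


@pipeline
def fanout_pipeline():
    df = fetch_data()
    for variant in variants:
        # Each variant depends only on `df` and not on each other, so an
        # executor that supports parallelism (e.g. multiprocess execution)
        # is free to run them concurrently.
        variant(df)
```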
the right answer to (1) probably depends on how reusable you want this pattern to be out of the box, and what intermediate format you want to standardize on. i suspect you will want a custom data type for your data frame, and if you want your solids to speak data frames to each other (that is, if you'll never or rarely use the raw format you get back from Snowflake), i would collapse the two operations into one solid
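As a concrete sketch of the collapsed single-solid version of (1): the credentials, query, and table below are hypothetical, and this goes straight through the Snowflake Python connector rather than a Dagster resource, which is where connection details would more likely live in practice.

```python
import pandas as pd
import snowflake.connector
from dagster import solid


@solid
def fetch_snowflake_df(context):
    # Hypothetical credentials and query; in a real pipeline these would
    # come from solid config or a shared resource, not hard-coded values.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
    )
    try:
        # pd.read_sql accepts any DB-API connection, including the
        # Snowflake connector's, and hands back a DataFrame directly.
        return pd.read_sql("SELECT * FROM my_table", conn)
    finally:
        conn.close()
```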
depending on how standard the process for saving the data frame to cloud storage is, you might want to use intermediate storage: https://dagster.readthedocs.io/en/0.6.6/sections/learn/tutorial/intermediates.html
this lets you write a serialization method once that will work with every one of your data frames and automagically gets you copies in cloud storage indexed by run id, etc. (not without some overhead)
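If I recall the run-config shape from that era correctly, turning on intermediate storage looked roughly like this (the `environment_dict` keyword was renamed `run_config` in later releases, and `fanout_pipeline` is the hypothetical pipeline sketched above):

```python
from dagster import execute_pipeline

# "filesystem" pickles every solid output, indexed by run id; swapping in
# {"s3": {"config": {"s3_bucket": "..."}}} from dagster-aws gets you the
# cloud-storage copies, provided the pipeline's mode includes S3 storage.
result = execute_pipeline(
    fanout_pipeline,
    environment_dict={"storage": {"filesystem": {}}},
)
```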
if you need some subset of the artifacts to live in some particular place, you might then have a specialized solid that writes the DF somewhere (and emits a Materialization) https://dagster.readthedocs.io/en/0.6.6/sections/learn/tutorial/materializations.html
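A sketch of such a specialized solid follows. The destination path is hypothetical, writing `s3://` paths via `to_parquet` assumes pyarrow and s3fs are installed, and the `Materialization` constructor shifted across versions (it was later renamed `AssetMaterialization`), so treat the exact arguments loosely.

```python
from dagster import Materialization, Output, solid


@solid
def save_df_to_s3(context, df):
    # Hypothetical destination for the artifact.
    path = "s3://my-bucket/artifacts/scores.parquet"
    df.to_parquet(path)
    # Emit a Materialization so the artifact is recorded and visible in
    # Dagit, then pass the DataFrame along to any downstream solids.
    yield Materialization(label="scores_parquet", description="Wrote " + path)
    yield Output(df)
```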

dwall

01/07/2020, 10:41 PM
cool - this is really helpful. thanks @max!