Prratek Ramchandani (03/14/2021, 12:50 AM):
solid_A fetches some data from an API and uses an IO Manager that persists the data to a JSON file in GCS. The path at which to store the file is specified as config for the IO Manager, as described in the docs here. solid_B then loads that file from GCS to BigQuery, but I'm not sure how to pass the GCS URI between the two solids. I can't have solid_B use the same IO Manager and let the IO Manager's load_input handle locating the file, because I don't want to "load" it just to perform computation. Is there a way for solid_A to access that IO Manager config and return it as an output? Also, is there a better way I could model this series of tasks?
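One pattern that might fit here (a rough sketch under assumptions, not a confirmed recommendation): configure the GCS path once on a small shared resource, and have both the IO manager and solid_B read it from there, so solid_B only ever sees the URI and never loads the data. The names gcs_path_resource, JsonGcsIOManager, and json_gcs_io_manager are hypothetical, and the GCS/BigQuery calls are elided.

    from dagster import IOManager, InputDefinition, Nothing, io_manager, resource, solid

    # Hypothetical resource holding the GCS URI; configured once in run config.
    @resource(config_schema={"gcs_uri": str})
    def gcs_path_resource(init_context):
        return init_context.resource_config["gcs_uri"]

    class JsonGcsIOManager(IOManager):
        def __init__(self, gcs_uri):
            self._gcs_uri = gcs_uri

        def handle_output(self, context, obj):
            # write obj as JSON to self._gcs_uri (GCS client calls elided)
            ...

        def load_input(self, context):
            # read the JSON back from self._gcs_uri (elided)
            ...

    @io_manager(required_resource_keys={"gcs_path"})
    def json_gcs_io_manager(init_context):
        return JsonGcsIOManager(init_context.resources.gcs_path)

    @solid(
        required_resource_keys={"gcs_path"},
        input_defs=[InputDefinition("start", Nothing)],
    )
    def solid_B(context):
        # only the URI is needed here: issue a BigQuery load job from GCS
        uri = context.resources.gcs_path
        ...

The Nothing input lets solid_B run after solid_A without receiving its output, and the path lives in exactly one place in the run config (resources.gcs_path.config.gcs_uri).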
Alex V (03/15/2021, 11:49 AM):
I'm running into this error: Missing required config entry "solids" at root level.
My execution block looks like this:

    if __name__ == "__main__":
        run_config = {
            "solids": {"load_data": {"config": {"file_path": "data/day_1_input.txt"}}}
        }
        result = execute_pipeline(day_1_pipeline, run_config=run_config)

Any idea what could be going on?
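That error means the config that was actually validated had no "solids" section at the root at all, so it's worth checking that this __main__ block is really the code path being executed (and not, say, a launch from Dagit with empty config), and that the key under "solids" matches the solid's name in the pipeline. For reference, a definition that the run_config above would satisfy looks roughly like this (the body of load_data is hypothetical):

    from dagster import pipeline, solid

    # the "load_data" key in run_config must match this solid's name (or alias) in the pipeline
    @solid(config_schema={"file_path": str})
    def load_data(context):
        with open(context.solid_config["file_path"]) as f:
            return f.read()

    @pipeline
    def day_1_pipeline():
        load_data()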
Dan Coates (03/18/2021, 1:11 AM):
I need to fetch the <title> for each of roughly 100,000 urls, then do some further (fairly simple) processing for each one. The 100,000 number is the initial backlog, and once that is complete about 1,000 more would need to be processed each day.
I'm not sure what the best way to structure this is. From reading the docs it seems like there are a few options:
1. Use partitions and have one partition for each url (not sure if partitions can scale out this much).
2. Have one pipeline that takes a single domain and does the processing required for it, then a second pipeline that takes the list of urls and executes the first pipeline for each url.
3. Do it all within one pipeline and use a dynamic graph to fan out once the list of urls is fetched (not sure if dynamic graphs can handle fan-out to 100,000; I'm guessing not? A sketch of this option follows below).
4. Do it all within one pipeline and handle the processing of all urls inside a single solid. It would be a pity to do this, as you lose a lot of the great observability that Dagster seems to offer.
Any help or thoughts much appreciated :)
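On option 3: Dagster's dynamic orchestration API fans out over values yielded at runtime (around this time it still lived under dagster.experimental). A rough sketch, where get_url_backlog and the solid/pipeline names are placeholders, and whether it copes comfortably with 100,000 mapped steps is a separate question:

    from dagster import pipeline, solid
    from dagster.experimental import DynamicOutput, DynamicOutputDefinition

    def get_url_backlog():
        # placeholder for however the ~100,000 urls are actually sourced
        return ["https://example.com", "https://example.org"]

    @solid(output_defs=[DynamicOutputDefinition(str)])
    def fetch_urls(context):
        for i, url in enumerate(get_url_backlog()):
            # mapping_key must be unique and limited to letters/digits/underscores,
            # so index the urls rather than using them directly
            yield DynamicOutput(url, mapping_key=f"url_{i}")

    @solid
    def fetch_title(context, url: str):
        # fetch the page, pull out <title>, and do the follow-up processing (elided)
        ...

    @pipeline
    def scrape_titles():
        fetch_urls().map(fetch_title)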
03/18/2021, 3:28 PMStringSource
fields that draw from an environment variable. reproduction code in the thread.King Chung Huang
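For context, StringSource config fields accept either a literal string or an env marker that resolves from an environment variable. A minimal sketch of the usual setup (the resource name, field name, and env var here are made up, not Steve's reproduction):

    from dagster import StringSource, resource

    # Hypothetical resource whose config field is typed as StringSource
    @resource(config_schema={"api_key": StringSource})
    def api_client(init_context):
        return init_context.resource_config["api_key"]

    # In run config the field can be given literally, or drawn from an env var:
    run_config = {
        "resources": {"api_client": {"config": {"api_key": {"env": "MY_API_KEY"}}}}
    }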
Brian Abelson (03/18/2021, 10:45 PM):
I have a failure_hook on all my pipelines that looks like this:

    from dagster import failure_hook, HookContext

    @failure_hook(required_resource_keys={"slack"})
    def slack_message_on_failure(context: HookContext):
        message = (
            f"*ALERT*: Dagster failed while running pipeline `{context.pipeline.pipeline_name}` "
            f"at solid `{context.solid.name}`.\n"
            f"Check out the logs at <https://dagster.ioby.network/instance/runs/{context.run_id}>"
        )
        context.resources.slack.chat_postMessage(
            channel="product-alerts",
            text=message,
            icon_emoji=":scream_cat:",
            username="Uh-oh Cat",
        )
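For completeness, a sketch of how a hook like this is typically attached, assuming the slack resource is dagster_slack's slack_resource and the solid/pipeline names are placeholders:

    from dagster import ModeDefinition, pipeline, solid
    from dagster_slack import slack_resource

    @solid
    def do_something(context):
        ...

    @pipeline(
        mode_defs=[ModeDefinition(resource_defs={"slack": slack_resource})],
        # slack_message_on_failure is the hook defined in the message above
        hook_defs={slack_message_on_failure},
    )
    def my_pipeline():
        do_something()

    # run config then supplies the bot token, e.g.
    # resources: {slack: {config: {token: "xoxb-..."}}}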