Daniel Kilcoyne
12/15/2022, 2:58 PM

Zach
12/15/2022, 4:39 PM

Daniel Kilcoyne
12/15/2022, 5:36 PM
foreachBatch and write to target, metrics, etc. tables, either on some processingTime interval of >10s (probably a bit more) or processing everything available and then shutting down. One source readStream maps to N target-table writeStreams, with each of those writeStreams writing to multiple tables. Also, Databricks clusters usually take 3-5 min to start up.
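A minimal sketch of what that fan-out can look like in PySpark, assuming a Kafka source and Delta target tables on Databricks; the broker, topic, table names, and checkpoint path are all placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One shared source readStream (placeholder Kafka config).
source = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

def write_target_a(batch_df, batch_id):
    # Inside foreachBatch, one micro-batch fans out to multiple tables.
    batch_df.persist()
    batch_df.write.format("delta").mode("append").saveAsTable("target_a")
    (batch_df.groupBy("key").count()
        .write.format("delta").mode("append").saveAsTable("metrics_a"))
    batch_df.unpersist()

# One of the N target writeStreams hanging off the shared stream.
query = (
    source.writeStream
    .foreachBatch(write_target_a)
    .option("checkpointLocation", "/checkpoints/target_a")
    .trigger(processingTime="30 seconds")  # or .trigger(availableNow=True)
    .start()                               # availableNow drains the backlog, then stops
)
query.awaitTermination()
```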
Brand new to all of this, but I'm thinking the common pattern is:
• readStream is a resource
• off that shared stream, N streams are created with different op transformations (none of them writes to disk yet)
• in foreachBatch the N streams do some more ops and each writes to its M respective tables. That's where I'd leverage the IOManager (roughly as sketched below).
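A rough sketch of that pattern in Dagster terms, assuming the resource/op/IOManager APIs current as of late 2022; the resource key, table naming, and checkpoint path are all hypothetical, and this is one possible wiring, not a confirmed recommendation:

```python
from dagster import IOManager, io_manager, job, op, resource
from pyspark.sql import SparkSession

@resource
def source_read_stream(_):
    # The single shared readStream, built once per run.
    spark = SparkSession.builder.getOrCreate()
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
        .option("subscribe", "events")                     # placeholder
        .load()
    )

class StreamingDeltaIOManager(IOManager):
    def handle_output(self, context, transformed_stream):
        # The op hands back a transformed (not-yet-written) stream; the
        # IOManager attaches foreachBatch and performs the actual writes.
        def write_batch(batch_df, _batch_id):
            # In practice each step writes to its M tables; one shown here.
            batch_df.write.format("delta").mode("append").saveAsTable(
                f"target_{context.step_key}"  # placeholder naming scheme
            )

        query = (
            transformed_stream.writeStream
            .foreachBatch(write_batch)
            .option("checkpointLocation", f"/checkpoints/{context.step_key}")
            .trigger(availableNow=True)  # drain the backlog, then stop
            .start()
        )
        query.awaitTermination()

    def load_input(self, context):
        raise NotImplementedError("streams are written out, not re-loaded")

@io_manager
def streaming_delta_io_manager(_):
    return StreamingDeltaIOManager()

@op(required_resource_keys={"source_read_stream"})
def transform_a(context):
    # One of the N transformed streams; no writes happen here yet.
    return context.resources.source_read_stream.selectExpr(
        "CAST(value AS STRING) AS payload"
    )

@job(resource_defs={
    "source_read_stream": source_read_stream,
    "io_manager": streaming_delta_io_manager,
})
def streaming_fan_out_job():
    transform_a()
```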

Zach
12/15/2022, 6:04 PM

Daniel Kilcoyne
12/15/2022, 7:32 PM

Zach
12/15/2022, 8:20 PM

Daniel Kilcoyne
12/15/2022, 10:52 PM
enableChangeDataFeed (it doesn't work well with column mapping)? I'll have upserts and deletes, so I can't do a .read.format("delta").load().filter("last_modified > last_job_run_time"), because I wouldn't retrieve the deleted rows that I need to delete downstream. Either I need to read the transaction logs myself or keep the event-type info somewhere else. Unless I'm missing something.
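For reference, if Change Data Feed does end up being viable, the incremental read looks roughly like this (hypothetical table name and version tracking; assumes the table was created with delta.enableChangeDataFeed = true):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

last_processed_version = 42  # placeholder: checkpointed by the job itself

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version + 1)
    .table("source_table")
)

# _change_type is one of insert / update_preimage / update_postimage /
# delete, so deletes stay visible downstream, unlike a last_modified filter.
deletes = changes.filter("_change_type = 'delete'")
upserts = changes.filter("_change_type IN ('insert', 'update_postimage')")
```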