I wrote a bunch of pipelines to migrate data from Mongo (App DB) to Postgres (temporary budget data warehouse until we upgrade to something like Snowflake). Realised too late that what I was really doing was defining Airbyte Sources (https://docs.airbyte.com/connector-development/tutorials/building-a-python-source) and Destinations, which are already well defined and robust.
Our main Software Engineer is touchy about App DB query load. I'd imagine that Airbyte will be at least as efficient as anything I've written? My Mongo queries retrieve all the data in a collection since a timestamp from just before the previous pipeline execution, and then upsert on a key, so a tiny bit of record overlap doesn't matter.
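(For anyone reading later, a rough sketch of that "query since just before the last run, then upsert on a key" pattern. This is not the actual pipeline code: the `updated_at` field, the `orders` table, the column names and the five-minute overlap are all made up for illustration. It only builds the Mongo filter and the Postgres statement, so it runs without either database.)

```python
from datetime import datetime, timedelta, timezone


def incremental_filter(last_run_at, overlap=timedelta(minutes=5), field="updated_at"):
    """Mongo filter for documents stamped since shortly before the last run.

    `field` and `overlap` are hypothetical; the overlap re-reads a small
    window so that records written around the previous run are not missed.
    """
    return {field: {"$gte": last_run_at - overlap}}


def upsert_sql(table, columns, key):
    """Postgres upsert statement using psycopg2-style named placeholders.

    Re-inserting an overlapping record just overwrites the existing row,
    which is why the duplicated window is harmless.
    """
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )


# Hypothetical previous run time and target table.
last_run = datetime(2022, 6, 29, 12, 0, tzinfo=timezone.utc)
mongo_filter = incremental_filter(last_run)
statement = upsert_sql("orders", ["id", "status", "updated_at"], key="id")
```

The filter would go to something like pymongo's `collection.find(mongo_filter)` and the statement to a psycopg2 cursor's `execute`, one dict of values per row. `ON CONFLICT` needs a unique constraint or primary key on the key column.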
Basically my pipelines can be a little buggy, should I just migrate? @Stephen Bailey
I do use copy_expert for my backfills, which makes them speedy; not sure if Airbyte would do that. But I guess the fact that Airbyte takes everything and loads it as JSON means that I wouldn't need to run a backfill, because there's no transformation logic to get wrong or change. Not posting on the Airbyte Slack because Airbyte devs will obviously tell me to use Airbyte.
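(A minimal sketch of a copy_expert backfill, assuming psycopg2 and assuming the rows have already been flattened into dicts. The table and column names are hypothetical. The cursor is passed in, so the functions themselves need nothing beyond the standard library.)

```python
import csv
import io


def rows_to_csv_buffer(rows, columns):
    """Serialise dict rows into an in-memory CSV file, columns in a fixed order.

    None is written as an empty field, which COPY's csv format reads as
    NULL by default.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row.get(c) for c in columns])
    buf.seek(0)  # rewind so COPY reads from the start
    return buf


def copy_rows(cursor, table, columns, rows):
    """Bulk-load rows through a psycopg2 cursor with COPY ... FROM STDIN.

    One COPY stream avoids per-row INSERT round trips, which is where the
    speed-up for backfills comes from.
    """
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    cursor.copy_expert(sql, rows_to_csv_buffer(rows, columns))
```

With a real connection this would be called as `copy_rows(conn.cursor(), "orders", ["id", "status"], rows)` followed by `conn.commit()`. COPY only inserts, so a backfill into a table that already has rows would usually go through a staging table first and be merged from there.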
06/29/2022, 3:53 PM
woof, this is a decision. i have spent a lot of time with meltano and singer taps, and it is just a lot of work to build and maintain the extractors. Having a framework is nice, and i looked into deploying airbyte for us -- still might do so -- but it's definitely a Thing to manage. for us, a lot of our key pipelines are already pretty well-defined and are a "devil you know", so we didn't end up moving them onto it. but now that we want to make it more self-service to add, e.g., airtable exports, we are revisiting the architecture.
06/29/2022, 4:16 PM
At our company we make effective use of Airbyte with Dagster. I'd certainly recommend switching to Airbyte if you can. It's another thing to manage, but when you switch to Snowflake, all you have to do is add the new destination and you're sorted. Makes bringing in other first- and third-party data much easier too.
06/29/2022, 5:17 PM
do you self-host @Isaac Harris-Holt?
06/30/2022, 8:18 AM
@Stephen Bailey yes we do
06/30/2022, 9:42 AM
Surprised it wasn't built for k8s from day one. Was assuming it would have good support (not that my workloads would really need it but nice to have standardised deployments).
I'm torn, realised most of the places that cause me difficulty would not be fixed by Airbyte (nested timestamps past annoying IDs in Mongo, so that you can't easily run update syncs). The simplicity of switching destinations had barely occurred to me though, so thanks for that reminder @Isaac Harris-Holt. Think it's really just a case of whether I have a quiet two days when everyone's on holiday to chuck it together and test it properly for the simple pipelines.
06/30/2022, 12:40 PM
Yeah, the main value to me is that 1) it seems like it's getting pretty good traction as an OSS alternative to Fivetran and has enough funding to stick around for a while, 2) i think analysts could self-serve the creation of new datasets pretty easily. I don't really want our team in the middle of ingesting different SaaS app sources, but I don't see it replacing some of our core pipelines that need a higher SLA.