Hi, I'm a data analyst and a newbie to ETL, orchestration, and data engineering in general, and I need some help here. Currently I'm using Airflow deployed on Google Cloud Composer 2 to EL data from Aurora MySQL to BigQuery on a daily basis, ~200K rows. It's costing me quite a bit (around $350 a month). My questions: can I use Dagster to perform that job? Would it be cheaper if I deploy Dagster myself on GCP? And which GCE instance should I use? Thank you so much. Appreciate the help. 🙏
08/02/2022, 2:07 AM
Not knowing the details of your EL, you could likely use Dagster to orchestrate that. I don't know exactly which GCE instance you'd want; that depends on how much memory and parallelism your workflows require. I'd imagine, though, that ~200k rows could be handled on a pretty small instance. Dagster itself doesn't really take a ton of resources; it's usually your workflow logic that incurs the biggest resource requirement (I've found 4 CPU / 16 GB to be more than sufficient for workloads that don't do big moves). You can store everything on the instance itself and just attach a larger hard drive if you want to go really cheap, but the Dagster team usually recommends a separate database for run and event storage. Google's managed database offering would work there, and you probably don't need anything more than 1 CPU and a couple GB of memory plus some storage. On AWS all of this is pretty easy to do for less than $75 a month, and it looks like you could hit a similar price on GCP (less if you went without a database).
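If you do go the separate-database route, wiring it up is just an instance-level `dagster.yaml` pointing run and event storage at Postgres. A minimal sketch, assuming a Cloud SQL Postgres instance and the `dagster-postgres` package installed; the hostname, username, database name, and environment variable below are all placeholders:

```yaml
# dagster.yaml -- Dagster instance config.
# Points run, event, and schedule storage at one Postgres database.
storage:
  postgres:
    postgres_db:
      username: dagster            # placeholder
      password:
        env: DAGSTER_PG_PASSWORD   # read from an env var, not hardcoded
      hostname: 10.0.0.5           # placeholder: Cloud SQL private IP
      db_name: dagster             # placeholder
      port: 5432
```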
08/02/2022, 1:38 PM
Hi Zach, thanks for replying. Currently I just query from Aurora MySQL, store the result in a dataframe, and insert it into BigQuery. I guess I'll try the smallest E2 instance with a Cloud SQL database and see if it works. Thanks! $75 sounds good!