An ETL project that uses Canadian housing data to demonstrate Spark, Terraform, and GCP (Dataproc, Cloud Storage, BigQuery). The main.tf Terraform file creates all the infrastructure the pipeline needs: a Google Cloud Storage bucket, a Dataproc cluster, a Dataproc job, and a BigQuery dataset. It also uploads the source CSV file and the transform.py script to the bucket so that the PySpark job running on the Dataproc cluster can access them.
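As a rough illustration of the shape such a transform.py job could take, here is a minimal sketch. It is not the project's actual script: the function names, the cleaning step, and any bucket, file, or table names passed in are hypothetical, and it assumes the spark-bigquery connector is available on the Dataproc cluster.

```python
"""Hypothetical sketch of a transform.py job; all names are placeholders."""


def gcs_uri(bucket: str, blob: str) -> str:
    """Build a gs:// URI for an object in a Cloud Storage bucket."""
    return f"gs://{bucket}/{blob.lstrip('/')}"


def run(bucket: str, source_blob: str, table: str) -> None:
    """Read a CSV from Cloud Storage, clean it, and load it into BigQuery."""
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("housing-etl").getOrCreate()

    # Extract: read the source CSV that Terraform uploaded to the bucket.
    df = spark.read.csv(gcs_uri(bucket, source_blob), header=True, inferSchema=True)

    # Transform: placeholder step that drops rows where every column is null.
    cleaned = df.dropna(how="all")

    # Load: write through the spark-bigquery connector (assumed to be on the
    # cluster), which stages the data in a temporary Cloud Storage bucket.
    (
        cleaned.write.format("bigquery")
        .option("temporaryGcsBucket", bucket)
        .mode("overwrite")
        .save(table)
    )
```

On the cluster, the script would end with a call such as `run(bucket, "housing.csv", "dataset.table")`, with the real values supplied by whatever the Terraform configuration defines.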
An ETL project that uses GCP (Pub/Sub, Cloud Run, and BigQuery) to stream simulated telemetry data. This setup suits any pipeline that needs to apply lightweight transformations to messages before storing them.
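To make "lightweight transformation" concrete, the sketch below shows how a Cloud Run service might decode a Pub/Sub push request body and reshape it into a row for BigQuery. The envelope layout (a `message` object with base64-encoded `data`) follows the Pub/Sub push format; the telemetry fields (`device_id`, `temp_c`, `recorded_at`) and the function names are assumptions made for illustration, not the project's actual schema.

```python
import base64
import json


def decode_push_message(envelope: dict) -> dict:
    """Extract the JSON payload from a Pub/Sub push request body."""
    message = envelope["message"]
    raw = base64.b64decode(message["data"]).decode("utf-8")
    return json.loads(raw)


def transform(reading: dict) -> dict:
    """Reshape one reading; the fields here are hypothetical."""
    return {
        "device_id": reading["device_id"],
        # Example of a light transformation: Celsius to Fahrenheit.
        "temp_f": round(reading["temp_c"] * 9 / 5 + 32, 2),
        "recorded_at": reading["recorded_at"],
    }


def handle(envelope: dict) -> dict:
    """Decode and transform; the result is a row ready for a BigQuery insert."""
    return transform(decode_push_message(envelope))
```

Keeping the decode and transform steps as plain functions means they can be tested without a web framework or a live Pub/Sub subscription; the HTTP handler in the Cloud Run service would only need to pass the parsed request body to `handle`.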