Performance Observability for Apache Spark
Updated Apr 6, 2025 - TypeScript
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
Ephemeral Hadoop clusters using Google Cloud Platform
Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service
EtlFlow is an ecosystem of functional libraries in Scala, based on ZIO, for running complex, auditable workflows that can interact with Google Cloud Platform, AWS, Kubernetes, databases, SFTP servers, on-prem systems and more.
Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.
Data pipeline for the Global Historical Climatology Network dataset
Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag
Creating an Inverted Index of words occurring in a large set of documents extracted from web pages using Hadoop MapReduce and Google Dataproc
E-commerce GCP streaming pipeline: Cloud Storage, Compute Engine, Pub/Sub, Dataflow, Apache Beam, BigQuery and Tableau; GCP batch pipeline: Cloud Storage, Dataproc, PySpark, Cloud Spanner and Tableau
A search engine for querying social media insights with a political theme
An educational project to build an end-to-end pipeline for near real-time and batch processing of data, which is then used for visualisation and a machine learning model.
GCP_Data_Enginner