Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Here are 8,266 public repositories matching this topic...

tobymao / sqlglot

Python SQL Parser and Transpiler

mysql python bigquery parser postgres sql spark presto hive clickhouse sqlite snowflake optimizer transpiler redshift databricks tsql trino sqlparser duckdb

Updated May 16, 2024
Python

hablapps / doric

Type safety for spark columns

scala big-data spark typesafe big dataframe spark-columns

Updated May 16, 2024
Scala

ytsaurus / ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

sql big-data spark clickhouse distributed-database lakehouse olap-database ytsaurus

Updated May 16, 2024
C++

apache / spark

Apache Spark - A unified analytics engine for large-scale data processing

python java r scala sql big-data spark jdbc

Updated May 16, 2024
Scala

zookage / zookage

Hadoop on Kubernetes on Docker Desktop.

kubernetes spark hive hadoop hbase zookeeper ozone tez trino

Updated May 16, 2024
Shell

YeonwooSung / DevOpsMisc

Miscellaneous codes and writings for DevOps

nginx aws devops sql spark serverless gcp pyspark infra devops-pipeline devop

Updated May 16, 2024
Jupyter Notebook

SuperCowPowers / sageworks

SageWorks: An easy to use Python API for creating and deploying AWS SageMaker Models

python aws machine-learning big-data spark pandas data-engineering

Updated May 16, 2024
Python

longNguyen010203 / Youtube-ETLT-Pipeline

💜🌈📊 A Data Engineering Project that implements an ETL data pipeline using Dagster, Apache Spark, Streamlit, MinIO, Apache Superset, Dbt 🌺

mysql processing docker dockerfile machine-learning spark docker-compose postgresql pyspark data-engineering minio dbt data-engineer etl-pipeline data-engineering-pipeline cleaning-data dagster

Updated May 16, 2024
Jupyter Notebook

apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

big-data spark etl graph pyspark graph-analysis data-orchestration graph-storage

Updated May 16, 2024
C++

marsfoundation / spark-app

spark ethereum dapp dai makerdao defi

Updated May 16, 2024
TypeScript

polyaxon / traceml

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

Updated May 16, 2024
Python

MobileTeleSystems / onetl

One ETL tool to rule them all

spark etl plugin-system etl-pipeline etl-components pydantic hwm

Updated May 16, 2024
Python

apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery real-time sql database spark hive hadoop etl snowflake olap query-engine redshift dbt elt iceberg hudi delta-lake lakehouse

Updated May 16, 2024
Java

apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.

spark bigdata shuffle

Updated May 16, 2024
Java

delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

big-data spark analytics acid delta-lake

Updated May 16, 2024
Scala

kamu-data / kamu-cli

New generation decentralized data lake and a streaming data pipeline

data-science sql spark jupyter blockchain open-data data-management flink data-as-code datafusion kamu open-data-fabric

Updated May 16, 2024
Rust

metabrainz / listenbrainz-server

Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

react python music typescript database web big-data spark listenbrainz-server

Updated May 16, 2024
Python

xl-xueling / xl-lighthouse

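XL-LightHouse is a general-purpose streaming big-data statistics system that supports very large data volumes and very high concurrency. Common use cases include PV and UV statistics; e-commerce sales and ordering-user counts; log volume statistics; API call counts, error counts, and latency statistics; and server operations metrics monitoring. The system supports multi-dimensional statistics and complex conditional filtering and logic, with one-click deployment and one-line-of-code integration. It makes real-time statistics over massive data straightforward and helps enterprises build a metrics system quickly and at lower cost.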

statistics big-data spark analytics clickhouse flink digital-solutions

Updated May 16, 2024
Java

gchq / Gaffer

A large-scale entity and relation database supporting aggregation of properties

big-data spark hadoop graph accumulo hbase graph-database parquet aggregation

Updated May 16, 2024
Java

NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data spark gpu rapids

Updated May 16, 2024
Scala

Created by Matei Zaharia

Released May 26, 2014

Followers: 414
Repository: apache/spark
Website: spark.apache.org
Wikipedia: Wikipedia

Related Topics