PySpark

PySpark is well suited to exploratory data analysis at scale, building machine learning pipelines, and creating ETL jobs for a data platform. If you’re already familiar with Python and libraries such as Pandas, PySpark is a natural next tool to learn for creating more scalable analyses and pipelines.

PySpark is the Python API for Apache Spark. It lets Python programs work with Spark’s Resilient Distributed Datasets (RDDs) by means of the Py4J library. Py4J is integrated within PySpark and allows Python to dynamically interface with JVM objects.


Reference