PySpark

PySpark is a good fit for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETL jobs for a data platform. If you’re already familiar with Python and libraries such as pandas, then PySpark is a natural tool to learn in order to create more scalable analyses and pipelines.
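For readers coming from pandas, here is a minimal sketch of a group-and-average step written against the PySpark DataFrame API. The sample rows, the column names `city` and `amount`, and the helper name `average_amount_by_city` are all made up for illustration. It assumes `pyspark` and a Java runtime are installed and runs Spark in local mode rather than on a cluster.

```python
# Hypothetical sample data: (city, amount) pairs invented for this sketch.
SALES = [("Oslo", 10.0), ("Oslo", 30.0), ("Lima", 5.0)]


def average_amount_by_city(rows):
    """Return {city: mean amount}, computed with a local Spark session."""
    # Imported inside the function so the module still loads
    # on machines where pyspark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # local[1] runs Spark in-process with a single worker thread.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("pyspark-dataframe-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, schema=["city", "amount"])
        # Roughly the counterpart of pandas' df.groupby("city")["amount"].mean()
        result = (
            df.groupBy("city")
            .agg(F.avg("amount").alias("avg_amount"))
            .collect()
        )
        return {row["city"]: row["avg_amount"] for row in result}
    finally:
        spark.stop()
```

Calling `average_amount_by_city(SALES)` starts a local session, runs the aggregation, and stops the session again. In a real job the session would normally be created once and reused.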

PySpark is the Python API for Apache Spark. It lets Python programs work with Spark, including its Resilient Distributed Datasets (RDDs), by means of the Py4J library. Py4J is bundled with PySpark and allows Python to dynamically interface with JVM objects.
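The following sketch shows both pieces mentioned above: an RDD built from a Python list, and a direct call into the JVM through the Py4J gateway. The helper name `squares_sum_and_jvm_version` is hypothetical. Note that `_jvm` is an internal attribute of `SparkContext`, not a public API, so treat that line as a demonstration of the mechanism rather than something to rely on. As before, this assumes `pyspark` and a Java runtime are installed.

```python
def squares_sum_and_jvm_version(numbers):
    """Sum the squares of `numbers` with an RDD and report the JVM's Java version."""
    # Imported inside the function so the module still loads
    # on machines where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("pyspark-rdd-sketch")
        .getOrCreate()
    )
    try:
        sc = spark.sparkContext

        # Distribute a Python list as an RDD, transform it, and reduce it.
        rdd = sc.parallelize(numbers)
        total = rdd.map(lambda x: x * x).sum()

        # Internal attribute (assumption: classic, non-Connect PySpark):
        # sc._jvm is a Py4J view of the driver JVM, so this line calls
        # java.lang.System.getProperty on the Java side.
        java_version = sc._jvm.java.lang.System.getProperty("java.version")

        return total, java_version
    finally:
        spark.stop()
```

For example, `squares_sum_and_jvm_version([1, 2, 3, 4, 5])` returns the sum 55 together with whatever Java version string the local JVM reports.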

Reference

https://towardsdatascience.com/a-brief-introduction-to-pyspark-ff4284701873
https://databricks.com/discover/introduction-to-data-analysis-workshop-series/intro-apache-spark
