pyspark

Here are 3,360 public repositories matching this topic...

pervaje28megh / Healthcare-system

company to make appropriate business strategies to enhance their revenue by analyzing customers' behaviors and sending offers and royalties to customers respectively

python big-data pyspark mysql-database

Updated May 23, 2023
Jupyter Notebook

aehabV / Indeed-fake-job-posting-prediction

A machine learning model is built using PySpark's MLlib library to automatically flag suspicious job postings on Indeed.com. The dataset includes 18,000 job descriptions, out of which about 800 are fake.

nlp natural-language-processing pyspark indeed pyspark-mllib fake-jobposts-prediction job-postings

Updated May 18, 2023
Jupyter Notebook

basel-ay / Hands-on-Apache-Spark

Writing dummy snippets of code to read, manipulate, and build a simple ML model with PySpark.

apache-spark linear-regression pyspark

Updated Jul 18, 2023
Jupyter Notebook

zuliani99 / All-Pairs-Docs-Similarity

Given a set of documents and a minimum required similarity threshold, find the number of document pairs that exceed the threshold.

sklearn pyspark tf-idf cosine-similarity document-similarity beir

Updated May 26, 2023
Jupyter Notebook

phricardorj / pyspark-study

🐍 | My PYSPARK studies. PySpark is an interface for Apache Spark in Python.

pyspark phyton

Updated Nov 11, 2022
Jupyter Notebook

JonathanPollyn / Spark

This notebook contains detailed code for Spark, machine learning, and Databricks.

python spark pyspark spark-sql pyspark-python

Updated Mar 15, 2023
Jupyter Notebook

data-miner00 / spark

A laboratory to carry out experiments with PySpark

python pyspark databricks

Updated Nov 5, 2023
Jupyter Notebook

khaledshabasy / Data-Modeling-Spark-udacity-capstone

An ETL pipeline for I94 immigration, global land temperatures and US demographics datasets is created to form an analytics database on immigration events. A data model is established with pandas and pyspark to find patterns of immigration to the United States.

aws s3 pandas pyspark sas7bdat

Updated Apr 6, 2023
Jupyter Notebook

furkancets / PrescreiberPipelineSpark

Trying out a best-case Apache Spark working environment for robust data pipelines

spark apache-spark hadoop pyspark

Updated Apr 1, 2023
Python

simonediluna / Distributed-Data-Analysis-and-Mining

An academic project carried out for the Distributed Data Analysis and Mining course (a. y. 2022/2023)

distributed-systems data-science pyspark

Updated May 18, 2023
Jupyter Notebook

Ayoub-etoullali / Activites-Pratiques-BigData

MapReduce Job Development, RDDs Programming, Medical Data Management, Sales Analysis, And Efficient Data Integration For Big Data Analysis. Spark: Big Data Processing, SQOOP Integration, And Spark Structured Streaming For Real-Time Data.

real-time spark apache-spark pyspark data-integration mapreduce real-time-data sqoop mapreduce-jobs sales-analysis spark-structured-streaming mapreduce-java real-time-database big-data-processing rdds sqoop-export sqoop-import big-data-analysis medical-data-management

Updated Jun 7, 2023
Java

chabir / Most-Popular-R-packages

python package r visualisation multithreading pyspark sparse longest-path sparse-matrices networkd3 beautifulsoup4

Updated Oct 4, 2015
Python

zydusss / Spark

Data Analytics using Spark

python streaming spark analytics graph pyspark dataset mllib sparksql rdd dataspark

Updated Dec 4, 2019
Jupyter Notebook

samanta-anupam / big_data_assignments

Assignments as given in the course of CSE545. All assignments are part of this course

lsh pyspark dimensionality-reduction svd word-count satellite-images blog-corpus

Updated Dec 3, 2017
Jupyter Notebook

vamshitalla / python

python spark datascience pyspark

Updated Aug 12, 2018
Python

appaulo14 / spark_analysis_of_public_data_from_askubuntu

This project focuses on analyzing the questions on askubuntu.com to find the most common topics asked about in order to better understand what areas of Ubuntu may need more attention for bug fixing and also what features might be good to add in future releases of Ubuntu. To do this, I analyzed public data from askubuntu.com using Azure HDInsight…

python spark ubuntu insights pyspark stackexchange askubuntu