
Set of Notebooks from Kaggle Competitions

What is this? 📜

This repository contains notebooks from Kaggle competitions I have participated in or am currently taking part in. They range from very simple datasets, such as the Titanic one, to NLP challenges like machine translation with Transformer architectures, as well as competitions with cash prizes.

Contents 🕵

  • Digit Recognizer - MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
    In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.

    You can see more here: https://www.kaggle.com/competitions/digit-recognizer.
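    As a taste of the simplest end of that spectrum, here is a minimal baseline sketch (not the notebook's actual pipeline) that fits a plain logistic regression on the competition's train.csv; the file path and the standard Kaggle column layout (a label column plus 784 pixel columns) are assumptions.

    ```python
    # Minimal baseline sketch for the Digit Recognizer data (assumed layout:
    # a "label" column followed by 784 pixel columns with values in 0..255).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    train = pd.read_csv("train.csv")          # path is an assumption
    X = train.drop(columns="label") / 255.0   # scale pixels to [0, 1]
    y = train["label"]

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)  # simple linear baseline
    print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
    ```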

  • House Prices - Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
    With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

    You can see more here: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques.

  • ICR - In this competition, you’ll work with measurements of health characteristic data to solve critical problems in bioinformatics. Based on minimal training, you’ll create a model to predict if a person has any of three medical conditions, with an aim to improve on existing methods.
    You can see more here: https://www.kaggle.com/competitions/icr-identify-age-related-conditions.

  • MT with Transformers - This notebook tackles one of the classic natural language processing (NLP) challenges: machine translation. We focus on the encoder-decoder RNN architecture and on a disruptive approach to NLP, the Transformer architecture. To do that, we use two English-French datasets. In the first part, we work with an easy dataset (around 180,000 sentences, 12 MB), whereas in the second part, we use a much larger one (about 22.5 million sentences, 8 GB). We translate English sentences into French ones, so we are tackling a sequence-to-sequence (seq2seq) learning problem; a minimal Transformer sketch follows below.
    You can see more here: https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench and here: https://www.kaggle.com/datasets/dhruvildave/en-fr-translation-dataset.
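    To make the seq2seq setup concrete, here is a minimal PyTorch sketch of a Transformer translation model; the vocabulary sizes, model dimensions, and random token batches are illustrative assumptions, and positional encodings are omitted for brevity (a real model needs them).

    ```python
    # Toy encoder-decoder Transformer mapping English token IDs to French logits.
    import torch
    import torch.nn as nn

    SRC_VOCAB, TGT_VOCAB, D_MODEL = 10_000, 10_000, 256  # illustrative sizes

    class TranslationModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.src_emb = nn.Embedding(SRC_VOCAB, D_MODEL)
            self.tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
            # NOTE: positional encodings are omitted here for brevity.
            self.transformer = nn.Transformer(
                d_model=D_MODEL, nhead=8,
                num_encoder_layers=2, num_decoder_layers=2, batch_first=True,
            )
            self.out = nn.Linear(D_MODEL, TGT_VOCAB)

        def forward(self, src_ids, tgt_ids):
            # Causal mask: each target position attends only to earlier positions.
            tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
            hidden = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                                      tgt_mask=tgt_mask)
            return self.out(hidden)  # logits over the French vocabulary

    model = TranslationModel()
    src = torch.randint(0, SRC_VOCAB, (8, 20))  # batch of tokenized English sentences
    tgt = torch.randint(0, TGT_VOCAB, (8, 22))  # shifted French targets (teacher forcing)
    print(model(src, tgt).shape)                # torch.Size([8, 22, 10000])
    ```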

  • Playground Series S3E09 - The dataset for this competition (both train and test) was generated from a deep learning model trained on the Concrete Strength Prediction dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. You need to predict the strength of concrete based on its characteristics.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e9.

  • Playground Series S3E10 - The dataset for this competition (both train and test) was generated from a deep learning model trained on the Pulsar Classification dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. You need to predict whether the star is a pulsar (1) or not (0).
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e10.

  • Playground Series S3E12 - The dataset for this competition (both train and test) was generated from a deep learning model trained on the Kidney Stone Prediction based on Urine Analysis dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. You need to predict the occurrence of kidney stones. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e12.

  • Playground Series S3E15 - The dataset for this competition (both train and test) was generated from a deep learning model trained on the Predicting Critical Heat Flux dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. You need to impute the missing values of the feature x_e_out [-] (equilibrium quality).
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e15.

  • Playground Series S3E18 - The dataset for this competition (both train and test) was generated from a deep learning model trained on a portion of the Multi-label Classification of Enzyme Substrates dataset. This dataset only uses a subset of features from the original (the features that had the most signal). Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. For this challenge, you are given 6 features in the training data but are only asked to predict the first two (EC1 and EC2).
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e18.

  • Playground Series S3E19 - For this challenge, you will be predicting a full year's worth of sales for various fictitious learning modules from different fictitious Kaggle-branded stores in different (real!) countries. This dataset is completely synthetic but contains many effects you see in real-world data, e.g., weekend and holiday effects, seasonality, etc. You are given the task of predicting sales for the year 2022.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e19.

  • Playground Series S3E20 - The ability to accurately monitor carbon emissions is a critical step in the fight against climate change. Precise carbon readings allow researchers and governments to understand the sources and patterns of carbon mass output. While Europe and North America have extensive systems in place to monitor carbon emissions on the ground, few such systems are available in Africa. Approximately 497 unique locations were selected from multiple areas in Rwanda, with a distribution around farmlands, cities, and power plants. The data for this competition is split by time: the years 2019-2021 are included in the training data, and your task is to predict CO2 emissions for 2022 through November.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e20.

  • Playground Series S3E21 - This is a different type of competition. Instead of submitting predictions, your task is to submit a dataset that will be used to train a random forest regressor model. This model will then be used to make predictions against a hidden test dataset. Your score will be the Root Mean Square Error (RMSE) between the model predictions and ground truth of the test set.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e21.
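    Since the scoring metric is the only moving part in this format, here is a quick reminder of how RMSE is computed, on made-up numbers rather than competition data:

    ```python
    # Root Mean Square Error on illustrative numbers.
    import numpy as np

    y_true = np.array([3.0, 5.0, 2.5, 7.0])
    y_pred = np.array([2.8, 5.4, 2.9, 6.6])
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # square, average, then root
    print(rmse)  # ~0.36
    ```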

  • Playground Series S3E22 - The dataset for this competition (both train and test) was generated from a deep learning model trained on a portion of the Horse Survival Dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. Submissions are evaluated on the micro-averaged F1-score between predicted and actual values.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e22.

  • Playground Series S3E25 - The dataset for this competition (both train and test) was generated from a deep learning model trained on the Prediction of Mohs Hardness with Machine Learning dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. Submissions are evaluated on the median absolute error (MedAE) between predicted and actual values.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e25.

  • Playground Series S3E26 - The dataset for this competition (both train and test) was generated from a deep learning model trained on the Cirrhosis Patient Survival Prediction dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s3e26.

  • Playground Series S4E03 - The dataset for this competition (both train and test) was generated from a deep learning model trained on the Steel Plates Faults dataset from UCI. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.
    You can see more here: https://www.kaggle.com/competitions/playground-series-s4e3.

  • Shap Shapley Values - This project presents the interpretability and explainability of machine learning models using Shapley values and the shap library. The library is demonstrated on both regression and classification problems, using easy datasets such as Titanic and House Prices; a minimal sketch follows below.
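    Here is a minimal sketch of the shap workflow, assuming a tree model and synthetic regression data (the notebook itself uses the Titanic and House Prices datasets):

    ```python
    # Minimal shap workflow sketch on synthetic regression data.
    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    explainer = shap.TreeExplainer(model)   # fast, exact SHAP values for tree models
    shap_values = explainer.shap_values(X)  # one attribution per feature per sample

    # Per row, the attributions sum to (prediction - explainer.expected_value).
    print(shap_values.shape)                # (200, 4)
    shap.summary_plot(shap_values, X)       # global importance and direction plot
    ```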

  • Spaceship Titanic - Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination - the torrid 55 Cancri E - the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!
    In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

    You can see more here: https://www.kaggle.com/competitions/spaceship-titanic/overview.

  • Titanic - The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
    In this challenge, we ask you to build a predictive model that answers the question: "what sorts of people were more likely to survive?" using passenger data (i.e. name, age, gender, socio-economic class, etc.).

    You can see more here: https://www.kaggle.com/competitions/titanic/overview.

  • Tyre Quality Classification - This notebook handles a simple computer vision task: binary classification of tyre images by quality (a minimal training sketch follows below). The dataset description is as follows. The dataset contains 1854 digital tyre images, categorized into two classes: defective and good condition. Each image is in a digital format and represents a single tyre. The images are labelled based on their condition, i.e., whether the tyre is defective or in good condition.
    This dataset can be used for various machine learning and computer vision applications, such as image classification and object detection. Researchers and practitioners in transportation, the automotive industry, and quality control can use this dataset to train and test their models to identify the condition of tyres from digital images. The dataset provides a valuable resource to develop and evaluate the performance of algorithms for the automatic detection of defective tyres.
    The dataset may also help improve the tyre industry's quality control process and reduce the chances of accidents due to faulty tyres. The availability of this dataset can facilitate the development of more accurate and efficient inspection systems for tyre production.

    You can see more here: https://www.kaggle.com/datasets/warcoder/tyre-quality-classification.
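    A minimal end-to-end training sketch with Keras, assuming the images sit in two class subfolders (e.g. tyres/defective and tyres/good; the directory name and model shape are illustrative, not the notebook's actual setup):

    ```python
    # Minimal CNN sketch for two-class tyre image classification.
    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "tyres", image_size=(128, 128), batch_size=32,  # path is an assumption
        validation_split=0.2, subset="training", seed=42,
    )
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "tyres", image_size=(128, 128), batch_size=32,
        validation_split=0.2, subset="validation", seed=42,
    )

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1 / 255.0),            # scale pixels to [0, 1]
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # defective vs. good
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=5)
    ```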

  • Weather Forecast - In this notebook, our main focus is on the temperature variations that have occurred in Warsaw, Poland over the past 30 years. We will be examining a time series dataset and utilizing visualizations to better understand the data. Additionally, we will be employing ARIMA and Recurrent Neural Networks to predict the weather patterns for the chosen year.
    You can see more here: https://www.kaggle.com/datasets/mateuszk013/warsaw-daily-weather.
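    Here is a minimal ARIMA sketch with statsmodels, using a synthetic daily temperature series in place of the actual Warsaw data; the (p, d, q) order is an illustrative choice, not the notebook's tuned configuration:

    ```python
    # Minimal ARIMA forecast sketch on a synthetic daily temperature series.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    idx = pd.date_range("2020-01-01", periods=365, freq="D")
    temps = pd.Series(10 + 8 * np.sin(2 * np.pi * np.arange(365) / 365)  # seasonality
                      + np.random.default_rng(0).normal(scale=2, size=365), index=idx)

    model = ARIMA(temps, order=(2, 1, 2))   # (p, d, q) chosen for illustration
    result = model.fit()
    forecast = result.forecast(steps=30)    # temperatures for the next 30 days
    print(forecast.head())
    ```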