MiguelMochizukiDev/house-prices-advanced-regression-techniques


House Prices - Advanced Regression Techniques

A comprehensive machine learning solution for predicting residential property prices in Ames, Iowa, using advanced regression techniques and ensemble methods.

Project Overview

This project tackles the Kaggle House Prices competition, developing a data-driven approach to predict house sale prices based on 79 explanatory variables describing various aspects of residential homes.

Business Objective

Support informed pricing decisions for residential properties by:

  • Reducing uncertainty in property valuation
  • Assisting sellers, developers, and real estate professionals in setting competitive listing prices
  • Minimizing financial loss from underpricing
  • Avoiding excessive overpricing that delays sales

Success Criteria

The solution is considered successful when:

  • Mean absolute percentage error (MAPE) on held-out data stays within 10–15% of actual sale prices
  • The model generalizes well to unseen properties
  • Predictions remain stable across neighborhoods and property types
  • Pricing decisions can be justified via interpretable feature contributions

Project Structure

.
├── README.md                    # Project documentation
├── environment.yml              # Conda environment specification
├── data/
│   ├── raw/                     # Original dataset files
│   │   ├── train.csv
│   │   ├── test.csv
│   │   ├── sample_submission.csv
│   │   └── data_description.txt
│   ├── interim/                 # Cleaned data (post-cleaning, pre-transform)
│   │   ├── train_clean.csv
│   │   └── test_clean.csv
│   ├── features/                # Model-ready numeric arrays
│   │   └── features.npz
│   ├── submission/              # Competition submission files
│   │   └── submission.csv
│   └── leaderboard/             # Leaderboard results
├── notebooks/
│   ├── house_prices.ipynb       # Main analysis notebook
│   └── leaderboard_data.ipynb   # Leaderboard analysis
└── src/
    └── fetch_data.py            # Kaggle data download script

Getting Started

Prerequisites

  • Git
  • Conda (used to create the environment from environment.yml)
  • Kaggle API credentials (only needed for the automated data download)

Installation

  1. Clone the repository

    git clone https://github.com/AlbertNewton254/house-prices-advanced-regression-techniques.git
    cd house-prices-advanced-regression-techniques
  2. Create and activate the conda environment

    conda env create -f environment.yml
    conda activate house-prices-advanced-regression-techniques
  3. Download the dataset

    Ensure your Kaggle API credentials are configured (~/.kaggle/kaggle.json), then run:

    python src/fetch_data.py

    Alternatively, manually download the data from Kaggle and place the files in data/raw/.

Dataset

The Ames Housing dataset includes:

  • Training set: 1,460 properties with sale prices
  • Test set: 1,459 properties (prices to be predicted)
  • Features: 79 explanatory variables describing:
      • Property characteristics (lot size, building type, number of rooms)
      • Quality ratings (overall quality, kitchen quality, basement condition)
      • Location attributes (neighborhood, zoning classification)
      • Temporal information (year built, year remodeled, month/year sold)
      • Amenities (garage, pool, fireplace, porch)

Detailed feature descriptions are available in data/raw/data_description.txt.

Methodology

The project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) process model:

1. Business Understanding

  • Define the problem scope and stakeholders
  • Establish success metrics and constraints

2. Data Understanding

  • Explore data distributions and relationships
  • Identify data quality issues
  • Analyze correlations with target variable

3. Data Preparation

  • Handle missing values
  • Remove outliers
  • Engineer features
  • Encode categorical variables
  • Scale numerical features
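
As an illustration of the imputation, encoding, and scaling steps, here is a minimal scikit-learn sketch. It is not taken from the notebook: the toy values, the choice of median and most-frequent imputation, and the one-hot encoding are assumptions made for the example. The column names (LotArea, GrLivArea, Neighborhood) do exist in the Ames data. Outlier removal and feature engineering are left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame using a few Ames column names; the values are made up.
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", np.nan],
})

numeric_cols = ["LotArea", "GrLivArea"]
categorical_cols = ["Neighborhood"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent level, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)
```

Wrapping the steps in a single ColumnTransformer lets the same fitted transformation be reused on the test set, which avoids leaking test statistics into the imputers and the scaler.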

4. Modeling

Multiple regression models are evaluated and compared:

  • Linear Models: Lasso, Ridge (regularized baselines)
  • Ensemble Models: Random Forest
  • Gradient Boosting: XGBoost, LightGBM, CatBoost

Repeated k-fold cross-validation is used to obtain robust performance estimates.
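
The comparison could look roughly like the sketch below. It uses synthetic data in place of the prepared features, and the fold count, repeat count, and hyperparameter values are assumptions for illustration rather than the settings used in the notebook. The gradient boosting libraries are omitted to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix and a log-scale target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 12.0 + X @ np.array([0.3, -0.2, 0.1, 0.0, 0.05]) + rng.normal(scale=0.1, size=200)

# 5 folds repeated 3 times gives 15 scores per model (assumed settings).
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

models = {
    "Lasso": Lasso(alpha=0.001),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is reported negated.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    results[name] = (-scores.mean(), scores.std(), len(scores))
    print(f"{name}: RMSE {-scores.mean():.4f} (std {scores.std():.4f}, n={len(scores)})")
```

Repeating the k-fold split with different shuffles averages out the luck of any single partition, which matters on a training set of only 1,460 rows.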

5. Evaluation

Models are assessed using:

  • Root Mean Squared Error (RMSE)
  • Root Mean Squared Logarithmic Error (RMSLE) - competition metric
  • Mean Absolute Percentage Error (MAPE)
  • Cross-validation scores
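
For reference, the three error metrics can be written in a few lines of plain Python. This is a sketch of the conventional definitions, not the notebook's code: the function names are hypothetical, RMSLE is computed with log(1 + x), and MAPE assumes strictly positive actual prices. The example prices are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + value), so errors are relative rather than absolute."""
    return rmse([math.log1p(t) for t in y_true], [math.log1p(p) for p in y_pred])

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up sale prices and predictions.
actual = [200000.0, 150000.0, 300000.0]
predicted = [210000.0, 135000.0, 300000.0]

print(f"RMSE:  {rmse(actual, predicted):.2f}")
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
print(f"MAPE:  {mape(actual, predicted):.2f}%")
```

Because RMSLE works on logarithms, a 10% miss on a cheap house and a 10% miss on an expensive one are penalized about equally, whereas plain RMSE is dominated by the expensive properties.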

6. Deployment

Generate predictions for the test set and create submission files for Kaggle.
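
A submission file has two columns, Id and SalePrice, matching sample_submission.csv. The sketch below shows one way to write it using only the standard library. The helper name write_submission is hypothetical, and the inverse transform assumes the model was trained on log1p(SalePrice), which this README does not state. The example writes to a temporary directory; in the project the target path would be data/submission/submission.csv.

```python
import csv
import math
import tempfile
from pathlib import Path

def write_submission(ids, log_predictions, path):
    """Write Id/SalePrice rows, converting log1p-scale predictions back to prices."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "SalePrice"])
        for house_id, log_pred in zip(ids, log_predictions):
            # Assumption: the model predicted log1p(SalePrice); expm1 inverts it.
            writer.writerow([house_id, math.expm1(log_pred)])

# Example with two made-up predictions on the log scale.
out = Path(tempfile.mkdtemp()) / "submission.csv"
write_submission([1461, 1462], [math.log1p(120000.0), math.log1p(165000.0)], out)
print(out.read_text())
```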

Usage

Running the Analysis

Open and execute the main Jupyter notebook:

jupyter notebook notebooks/house_prices.ipynb

The notebook is organized into sections matching the CRISP-DM methodology:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Generating Predictions

After running the notebook, predictions will be saved to:

data/submission/submission.csv

This file can be submitted directly to the Kaggle competition.

Models

The following models are implemented and compared:

Model          Type                                         Key Hyperparameters
Lasso          Linear regression with L1 regularization     alpha
Ridge          Linear regression with L2 regularization     alpha
Random Forest  Ensemble of decision trees                   n_estimators, max_depth
XGBoost        Gradient boosting with trees                 learning_rate, max_depth, n_estimators
LightGBM       Fast gradient boosting framework             learning_rate, num_leaves, n_estimators
CatBoost       Gradient boosting with categorical features  learning_rate, depth, iterations

Results

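Model performance and leaderboard scores are tracked in:

  • notebooks/leaderboard_data.ipynb (leaderboard analysis)
  • data/leaderboard/ (leaderboard results)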

Technologies

  • Python 3.12
  • Data Processing: pandas, numpy
  • Visualization: matplotlib, seaborn
  • Machine Learning: scikit-learn, XGBoost, LightGBM, CatBoost
  • Environment: Conda
  • Development: Jupyter Notebook

License

This project is for educational purposes as part of the Kaggle competition.

Acknowledgments

  • Dataset provided by Dean De Cock for use in data science education
  • Kaggle for hosting the competition
  • Ames, Iowa Assessor's Office for the original data

Contact

For questions or feedback, please open an issue on GitHub.


Happy modeling! 🏡📊
