A comprehensive machine learning solution for predicting residential property prices in Ames, Iowa, using advanced regression techniques and ensemble methods.
This project tackles the Kaggle House Prices competition, developing a data-driven approach to predict house sale prices based on 79 explanatory variables describing various aspects of residential homes.
Support informed pricing decisions for residential properties by:
- Reducing uncertainty in property valuation
- Assisting sellers, developers, and real estate professionals in setting competitive listing prices
- Minimizing financial loss from underpricing
- Avoiding excessive overpricing that delays sales
The solution is considered successful when:
- Mean absolute percentage error (MAPE) on held-out data stays within 10–15% of actual sale prices
- The model generalizes well to unseen properties
- Predictions remain stable across neighborhoods and property types
- Pricing decisions can be justified via interpretable feature contributions
.
├── README.md                      # Project documentation
├── environment.yml                # Conda environment specification
├── data/
│   ├── raw/                       # Original dataset files
│   │   ├── train.csv
│   │   ├── test.csv
│   │   ├── sample_submission.csv
│   │   └── data_description.txt
│   ├── interim/                   # Cleaned data (post-cleaning, pre-transform)
│   │   ├── train_clean.csv
│   │   └── test_clean.csv
│   ├── features/                  # Model-ready numeric arrays
│   │   └── features.npz
│   ├── submission/                # Competition submission files
│   │   └── submission.csv
│   └── leaderboard/               # Leaderboard results
├── notebooks/
│   ├── house_prices.ipynb         # Main analysis notebook
│   └── leaderboard_data.ipynb     # Leaderboard analysis
└── src/
    └── fetch_data.py              # Kaggle data download script
- Anaconda or Miniconda
- Kaggle API credentials (for data download)
- Clone the repository:

      git clone https://github.com/AlbertNewton254/house-prices-advanced-regression-techniques.git
      cd house-prices-advanced-regression-techniques

- Create and activate the conda environment:

      conda env create -f environment.yml
      conda activate house-prices-advanced-regression-techniques

- Download the dataset. Ensure your Kaggle API credentials are configured (~/.kaggle/kaggle.json), then run:

      python src/fetch_data.py

  Alternatively, manually download the data from Kaggle and place the files in data/raw/.
The Ames Housing dataset includes:
- Training set: 1,460 properties with sale prices
- Test set: 1,459 properties (prices to be predicted)
- Features: 79 explanatory variables describing:
  - Property characteristics (lot size, building type, number of rooms)
  - Quality ratings (overall quality, kitchen quality, basement condition)
  - Location attributes (neighborhood, zoning classification)
  - Temporal information (year built, year remodeled, month/year sold)
  - Amenities (garage, pool, fireplace, porch)
Detailed feature descriptions are available in data/raw/data_description.txt.
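The project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, covering each of its phases in order: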
- Define the problem scope and stakeholders
- Establish success metrics and constraints
- Explore data distributions and relationships
- Identify data quality issues
- Analyze correlations with target variable
- Handle missing values
- Remove outliers
- Engineer features
- Encode categorical variables
- Scale numerical features
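The preparation steps listed above can be sketched with pandas and scikit-learn. This is a minimal illustration on a made-up five-row frame, not the notebook's actual code: the column names come from the Ames data, but the 4,000 sq ft outlier cutoff, the `HouseAge` feature, and the `NoGarage` fill value are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame using real Ames column names (values are illustrative)
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0, 40000.0],
    "GrLivArea": [1710, 1262, 1786, 2198, 5642],
    "YearBuilt": [2003, 1976, 2001, 2000, 2008],
    "YrSold": [2008, 2007, 2008, 2008, 2008],
    "GarageType": ["Attchd", "Detchd", np.nan, "Attchd", "BuiltIn"],
})

# Remove outliers: the 4000 sq ft cutoff is an assumed rule for this sketch
df = df[df["GrLivArea"] < 4000].copy()

# Handle missing categories: treat a missing garage type as "no garage" (assumed label)
df["GarageType"] = df["GarageType"].fillna("NoGarage")

# Engineer features: age of the house at sale time (hypothetical example feature)
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]

numeric_cols = ["LotArea", "GrLivArea", "HouseAge"]
categorical_cols = ["GarageType"]

# Numeric columns: median imputation, then standard scaling
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# sparse_threshold=0 forces a dense array, which is easier to inspect
preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(df)
print(X.shape)  # rows kept after outlier removal, numeric + one-hot columns
```

Wrapping imputation, encoding, and scaling in a single `ColumnTransformer` means the same fitted transformations can be reused on the test set, which avoids leaking test statistics into training.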
Multiple regression models are evaluated and compared:
- Linear Models: Lasso, Ridge (regularized baselines)
- Ensemble Models: Random Forest
- Gradient Boosting: XGBoost, LightGBM, CatBoost
Repeated k-fold cross-validation is used to obtain robust performance estimates.
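As a rough sketch of how such a comparison might look, the snippet below scores the two linear baselines with scikit-learn's `RepeatedKFold`. The synthetic data, the fold and repeat counts, and the `alpha` values are all assumptions made for illustration; the notebook's actual settings may differ.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix and a log-price target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 12.0 + rng.normal(scale=0.1, size=200)

# 5 folds repeated 3 times gives 15 held-out scores per model (assumed values)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

models = {"lasso": Lasso(alpha=0.001), "ridge": Ridge(alpha=1.0)}
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores  # scikit-learn reports negated errors, so flip the sign
    print(f"{name}: RMSE {cv_rmse[name].mean():.4f} +/- {cv_rmse[name].std():.4f}")
```

Repeating the k-fold split with different shuffles averages out the luck of any single partition, which matters on a training set of only 1,460 rows.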
Models are assessed using:
- Root Mean Squared Error (RMSE)
- Root Mean Squared Logarithmic Error (RMSLE), the competition metric
- Mean Absolute Percentage Error (MAPE)
- Cross-validation scores
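For reference, the three error metrics can be written in a few lines of plain Python. This is a sketch rather than the notebook's implementation; it assumes the common `log1p` convention for RMSLE (as in scikit-learn's `mean_squared_log_error`) and uses made-up prices.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the target's own units (dollars here)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """RMSE on log1p-transformed values, so errors are judged in relative terms."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.10 means 10%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up sale prices for illustration only
actual = [200_000.0, 150_000.0, 320_000.0]
predicted = [210_000.0, 135_000.0, 320_000.0]

print(f"RMSE:  {rmse(actual, predicted):.2f}")
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
print(f"MAPE:  {mape(actual, predicted):.2%}")  # errors of 5%, 10%, 0% average to 5%
```

RMSE is dominated by errors on expensive houses, while RMSLE and MAPE weight a 10% miss the same on a cheap house as on an expensive one, which is why the success criteria above are phrased in percentage terms.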
Generate predictions for the test set and create submission files for Kaggle.
Open and execute the main Jupyter notebook:
    jupyter notebook notebooks/house_prices.ipynb

The notebook is organized into sections matching the CRISP-DM methodology:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
After running the notebook, predictions will be saved to:
data/submission/submission.csv
This file can be submitted directly to the Kaggle competition.
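The submission format follows `sample_submission.csv`: an `Id` column and a `SalePrice` column. A minimal sketch of writing such a file is shown below. The helper name `write_submission` is hypothetical, and the `expm1` back-transform assumes the model was trained on `log1p(SalePrice)`, which this README does not confirm.

```python
import csv
import math
import tempfile
from pathlib import Path

def write_submission(ids, log_predictions, path):
    """Write a Kaggle-style CSV with `Id` and `SalePrice` columns.

    Assumes predictions are on the log1p scale and maps them back to
    dollars with expm1; drop the transform if the model predicts prices directly.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "SalePrice"])
        for house_id, log_price in zip(ids, log_predictions):
            writer.writerow([house_id, round(math.expm1(log_price), 2)])
    return path

# Made-up ids and log-scale predictions, written to a temporary directory
demo_dir = Path(tempfile.mkdtemp())
out = write_submission([1461, 1462], [11.9, 12.1], demo_dir / "submission.csv")
print(out.read_text())
```

In the project itself the target path would be data/submission/submission.csv; the temporary directory here only keeps the example from overwriting a real submission.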
The following models are implemented and compared:
| Model | Type | Key Hyperparameters |
|---|---|---|
| Lasso | Linear regression with L1 regularization | alpha |
| Ridge | Linear regression with L2 regularization | alpha |
| Random Forest | Ensemble of decision trees | n_estimators, max_depth |
| XGBoost | Gradient boosting with trees | learning_rate, max_depth, n_estimators |
| LightGBM | Fast gradient boosting framework | learning_rate, num_leaves, n_estimators |
| CatBoost | Gradient boosting with categorical features | learning_rate, depth, iterations |
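The table names the hyperparameters that matter for each model but not how they are tuned. One common option, shown here purely as an assumption, is a cross-validated grid search; the sketch tunes Ridge's `alpha` on synthetic data, and the grid values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the prepared Ames features
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(scale=0.2, size=120)

# Candidate regularization strengths (assumed grid, spaced on a log scale)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best CV RMSE:", -search.best_score_)
```

The same pattern extends to the tree-based models by swapping in the estimator and a grid over the hyperparameters listed in the table, although larger grids are usually searched randomly to keep the runtime manageable.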
Model performance and leaderboard scores are tracked in:
- notebooks/house_prices.ipynb: Cross-validation results
- data/leaderboard/: Competition leaderboard scores
- Python 3.12
- Data Processing: pandas, numpy
- Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn, XGBoost, LightGBM, CatBoost
- Environment: Conda
- Development: Jupyter Notebook
This project is for educational purposes as part of the Kaggle competition.
- Dataset provided by Dean De Cock for use in data science education
- Kaggle for hosting the competition
- Ames, Iowa Assessor's Office for the original data
For questions or feedback, please open an issue on GitHub.
Happy modeling! 🏡📊