This notebook walks through regression on the California housing dataset using a range of ML models:
Linear Regression, Support Vector Regression, Decision Trees, and Random Forest Regression.
The dataset contains 20,640 entries and 10 variables:
- Longitude
- Latitude
- Housing Median Age
- Total Rooms
- Total Bedrooms
- Population
- Households
- Median Income
- Median House Value
- Ocean Proximity
In the notebook, I perform:
- Data investigation
- Data cleaning
- Removing outliers
- Exploratory data analysis
- Feature engineering
- Dimensionality reduction
- Feature encoding
- Correlation and multicollinearity assessment
- Feature scaling
- Model training (including grid search)
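As a rough illustration of how the later steps (encoding, scaling, and model training with grid search) can fit together, here is a minimal sketch using scikit-learn. It is not the notebook's actual code: the data is synthetic, the snake_case column names, the median imputation, and the parameter grid are all assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real dataset); column names are assumed.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "total_bedrooms": rng.uniform(1, 2000, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
df.loc[::25, "total_bedrooms"] = np.nan  # simulated gaps for the cleaning step
df["median_house_value"] = 40000 * df["median_income"] + rng.normal(0, 20000, n)

X = df.drop(columns="median_house_value")
y = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

numeric = ["median_income", "housing_median_age", "total_bedrooms"]
categorical = ["ocean_proximity"]

# Impute and scale numeric columns; one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(random_state=0)),
])

# Deliberately tiny grid (an assumption) so the sketch runs quickly.
param_grid = {"rf__n_estimators": [20, 50], "rf__max_depth": [None, 5]}
search = GridSearchCV(
    model, param_grid, cv=3, scoring="neg_root_mean_squared_error"
)
search.fit(X_train, y_train)

best_params = search.best_params_
# With this scoring, score() returns the negated RMSE on the held-out data.
test_neg_rmse = search.score(X_test, y_test)
```

Wrapping preprocessing and the model in one `Pipeline` means the scaler and encoder are refit inside each cross-validation fold, which avoids leaking validation data into the preprocessing step.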
The Random Forest Regression model performed best among the trained models, with a root mean squared error (RMSE) of about $43,658 on median house value:
- R^2 Score: 0.7933
- Mean Absolute Error: $29,580.49
- Mean Squared Error: 1,906,039,202.17
- Root Mean Squared Error: $43,658.21
- Mean Absolute Percentage Error: 17.00%
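For reference, the five metrics above follow standard definitions, which can be sketched in plain Python as below. The function name `regression_metrics` and the toy values are hypothetical and only illustrate the formulas; they are not the notebook's predictions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MSE, RMSE, and MAPE (in percent) for two sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # RMSE is the square root of MSE
    # MAPE assumes no true value is zero.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "mse": mse, "rmse": rmse, "mape": mape}


# Toy values for illustration only.
y_true = [100000.0, 200000.0, 300000.0, 400000.0]
y_pred = [110000.0, 190000.0, 330000.0, 380000.0]
metrics = regression_metrics(y_true, y_pred)
```

Because RMSE and MAE are in the same units as the target (dollars here), they are easier to interpret than MSE, which is in squared dollars.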