Report

Introduction

Accurate house pricing is crucial in the real estate and mortgage industries, influenced by physical attributes and neighborhood characteristics. Research has explored various machine learning techniques for house price prediction, with a study titled “An Optimal House Price Prediction Algorithm: XGBoost” comparing algorithms such as XGBoost, Support Vector Regressor, Random Forest Regressor, Multilayer Perceptron, and Multiple Linear Regression using the Ames City housing dataset. XGBoost was identified as the most effective model, offering interpretability, simplicity, and accuracy [3].

The dataset we plan on using includes features like price, area, year built, bedrooms, and bathrooms, and other amenities which are crucial for pricing. Location details will assist in regional analysis [1].

Problem Definition

In Georgia’s dynamic real estate market, accurately pricing a house is a nuanced challenge. Traditional valuation methods, such as comparative market analysis, often miss critical details and lack important context [2]. As a result, homes are frequently inaccurately appraised, and more accurate valuations can be costly in terms of both time and expenses. This project is motivated by the opportunity to overcome these limitations by leveraging machine learning models.

Methods

In the preprocessing phase, we will employ encoding to convert non-numerical features, such as the construction material of a house or its neighborhood, into numerical values our models understand. We will standardize the features to align them closer to a standard Gaussian distribution by subtracting the mean of each feature and dividing by its standard deviation. This step ensures compatibility with scikit-learn models. We will also implement feature engineering to incorporate additional relevant data, such as construction investments in the area or the quality of nearby schools, which could impact house prices.

For the machine learning algorithms, we will use Linear Regression as a baseline to identify features that linearly correlate with house prices. We will explore XGBoost, which Sharma, Harsora, and Ogunleye identified as the optimal model for predicting house prices [3]. XGBoost will be more effective in handling non-linear features introduced during feature engineering. Furthermore, we are considering a strategy that involves K-means Clustering followed by Regression. This approach acknowledges that houses in different regions may have distinct price-determining features. By clustering the houses based on similarities and applying tailored regression models to each cluster, we can address specific regional factors. For example, the quality of the local public school might heavily influence prices in one neighborhood but be irrelevant in another due to demographic differences.

Potential Results and Discussion

The quantitative metrics we plan to use include Root Mean Squared Error, Mean Absolute Error, and R-squared score [1]. The goal of our project is to achieve lower Root Mean Squared Error and Mean Absolute Error scores, along with a higher R-squared score compared to other models and traditional valuation methods. The expected result of our project is a model that can accurately predict housing prices in Georgia, discover which factors have the largest effect on price, and provide transparency in the housing market.

References

[1] F. Bergadano, R. Bertilone, D. Paolotti, and G. Ruffo, “Developing real estate automated valuation models by learning from heterogeneous data sources,” International Journal of Real Estate Studies, vol. 15, no. 1, pp. 72–85, Jun. 2021. doi:10.11113/intrest.v15n1.10

[2] G. Kamtziridis, D. Vrakas, and G. Tsoumakas, “Does noise affect housing prices? A case study in the urban area of Thessaloniki,” EPJ Data Science, vol. 12, no. 1, Oct. 2023. doi:10.1140/epjds/s13688-023-00424-3

[3] H. Sharma, H. Harsora, and B. Ogunleye, “An optimal house price prediction algorithm: Xgboost,” Journal of Analytics, vol. 3, no. 1, pp. 30–45, Jan. 2024. doi:10.3390/analytics3010003


GANTT Chart


Contribution Table

Name Contributions
Anh Dang Video Presentation, GitHub
Arjun Birthi Gantt Chart, Problem Definition
Ayush Gundawar Introduction/Background
Daniel Oh Potential Results and Discussion
Jatong Su Methods


Video

A link to the video can be found here.

A link to the slideshow can be found here.