Introduction

Accurate house pricing is essential in the real estate and mortgage industries, and prices are influenced by both physical attributes and neighborhood characteristics. Prior research has explored a range of machine learning techniques for house price prediction. A study titled “An Optimal House Price Prediction Algorithm: XGBoost” compared XGBoost, Support Vector Regressor, Random Forest Regressor, Multilayer Perceptron, and Multiple Linear Regression on the Ames, Iowa housing dataset, identifying XGBoost as the most effective model for its interpretability, simplicity, and accuracy [3].

The dataset we used was a set of listings scraped from realtor.com, covering houses sold within the past 10 years in four Georgia zip codes: 30097, 30096, 30022, and 30005. It includes features such as sold price, property location, year built, bedroom and bathroom counts, and other amenities relevant to pricing. The location details will also support future regional analysis [1].

Problem Definition

In Georgia’s fluctuating real estate market, accurately pricing a house is a complex challenge. Traditional valuation methods, such as comparative market analysis, often overlook essential details and lack critical context [2]. As a result, homes are frequently mispriced relative to their true market value, and obtaining more accurate valuations can be both time-consuming and costly. This project seeks to address these limitations by leveraging machine learning models, which can analyze large amounts of data to provide more precise and reliable property valuations.

Methods

First, we cleaned the data by removing duplicate listings and replacing null values. During the preprocessing step, we selected features that could be numerically represented for the training step. From the 29 columns of the original dataset, we retained 12 relevant features plus the target variable, sold_price. Features such as state and status were removed because their variance was zero (every listing had the same value). Irrelevant features such as mls_id, property_url, and primary_photo were also removed. Location features such as city and street were removed because longitude and latitude were already provided.
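As a rough illustration, the cleaning step might look like the sketch below. The file name realtor_listings.csv is hypothetical, and filling nulls with column medians is our assumption, since the report does not specify the imputation strategy.

```python
import pandas as pd

# Hypothetical file name; the scraped dataset is not published with this report
df = pd.read_csv("realtor_listings.csv")

# Remove duplicate listings; fill remaining numeric nulls with column medians
# (median imputation is our assumption, not a reported choice)
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Drop zero-variance columns (e.g., state, status) and irrelevant identifiers
zero_variance = [c for c in df.columns if df[c].nunique() <= 1]
irrelevant = ["mls_id", "property_url", "primary_photo", "city", "street"]
df = df.drop(columns=zero_variance + [c for c in irrelevant if c in df.columns])
```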

The following table represents the remaining features after cleaning:

 #   Column          Non-Null Count   Dtype
 0   style           21338 non-null   object
 1   beds            21338 non-null   float64
 2   full_baths      21338 non-null   float64
 3   half_baths      21338 non-null   float64
 4   sqft            21338 non-null   float64
 5   year_built      21338 non-null   float64
 6   list_date       21338 non-null   object
 7   sold_price      21338 non-null   int64
 8   lot_sqft        21338 non-null   float64
 9   latitude        21338 non-null   float64
 10  longitude       21338 non-null   float64
 11  stories         21338 non-null   float64
 12  parking_garage  21338 non-null   float64

We one-hot encoded the style feature because it is categorical. We then converted list_date to the number of days since the Unix epoch so it could be treated numerically. Examining the correlation matrix for our features, we saw that beds, full_baths, and several other features were highly correlated, so we decided to test ridge regression alongside linear regression. Finally, we standardized the features and fit the regression models.
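A minimal sketch of this preprocessing, continuing from the cleaned DataFrame df above (the get_dummies call and the epoch conversion mirror the steps described; drop_first is our choice):

```python
import pandas as pd

# One-hot encode the categorical `style` column
df = pd.get_dummies(df, columns=["style"], drop_first=True)

# Convert list_date to days since the Unix epoch so it can be used numerically
df["list_date"] = (
    pd.to_datetime(df["list_date"]) - pd.Timestamp("1970-01-01")
).dt.days

# Inspect pairwise correlations among the numeric features
corr = df.corr(numeric_only=True)
print(corr["sold_price"].sort_values(ascending=False))
```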

We chose linear regression as our ML algorithm for predicting property prices. As a supervised learning method for continuous targets, it suits our target variable, and its interpretability let us see how individual property attributes influence price directly through the model weights. Its computational efficiency and simplicity also make it an excellent baseline. As noted above, several features were highly correlated, which motivated testing ridge regression as well.
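The two baseline models could be fit as below. The 80/20 split, the random seed, and the log-spaced alpha grid for ridge are our illustrative assumptions rather than settings reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop(columns=["sold_price"])
y = df["sold_price"]
# 80/20 split and seed are our assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: standardized features + ordinary least squares
linreg = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

# Ridge with the regularization strength chosen by cross-validation
ridge = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 4, 100)),
).fit(X_train, y_train)

for name, model in [("Linear", linreg), ("Ridge", ridge)]:
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(name, "RMSE:", round(rmse), "R2:", round(r2_score(y_test, pred), 4))
```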

The test RMSE for both regressions was around 160,000, so we introduced polynomial features to better fit the data. With degree-2 polynomial features, the test RMSE dropped to around 140,000; higher degrees resulted in overfitting.
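Adding the degree-2 expansion is one extra pipeline stage; this sketch reuses X_train, y_train, and the alpha grid from the previous snippet:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

# Degree 2 adds squared and pairwise interaction terms for every feature;
# degrees above 2 overfit in our tests
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 4, 100)),
).fit(X_train, y_train)

test_rmse = mean_squared_error(y_test, poly_ridge.predict(X_test)) ** 0.5
print(f"Degree-2 test RMSE: {test_rmse:,.0f}")
```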

The following table displays results for polynomial degree 2:

Best alpha (Ridge): 6.7342

Metric      Linear       Ridge
Train RMSE  149941.43    152639.48
Test RMSE   140464.64    143193.17
Train R^2   0.7872       0.7794
Test R^2    0.8080       0.8005

The following table displays results for polynomial degree 1:

Best alpha (Ridge): 351.1192

Metric      Linear       Ridge
Train RMSE  168714.00    168894.61
Test RMSE   159313.81    159715.88
Train R^2   0.7305       0.7300
Test R^2    0.7531       0.7518

Results and Discussion

Our best models still had a high test RMSE of around 140,000; for context, the mean house price is around 490,000. We suspect this is because the relationship between price and the features is nonlinear, and raising the degree of the numerical features helped only a little. However, our R^2 values of around 0.8 indicate a relatively good fit, so the high RMSE may be partly explained by outlier data points that disproportionately inflate RMSE (such as the points in the bottom-right corner of the graph below).
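One way to check how much outliers drive the error is to compare RMSE against MAE, which does not square residuals; the snippet below is an illustrative check using the degree-2 model from the Methods section.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

pred = poly_ridge.predict(X_test)
residuals = y_test - pred  # y_test is a pandas Series, so this keeps the index

# RMSE squares residuals, so a handful of badly mispriced listings dominate it;
# a large RMSE/MAE gap suggests heavy-tailed errors rather than uniform misfit
rmse = float(np.sqrt(np.mean(residuals ** 2)))
mae = mean_absolute_error(y_test, pred)
print(f"RMSE: {rmse:,.0f}  MAE: {mae:,.0f}")

# The ten listings the model mispriced the most
print(residuals.abs().sort_values(ascending=False).head(10))
```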

The following figures show the results of linear regression, ridge regression, and the feature correlation matrix.

Figure: Linear Regression Results. Many points fall far from the regression line, which may explain the high RMSE.

Figure: Ridge Regression Results. As with the linear model, many points fall far from the fitted line, again consistent with the high RMSE.

Figure: Correlation Matrix. This heatmap shows pairwise correlations between features, with lighter colors indicating stronger correlations and darker colors weaker ones.
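A heatmap like the one described can be produced in a few lines; seaborn and the viridis colormap are our tooling choices here, since the report does not say what generated the figure.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations among numeric features; lighter cells = stronger
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap="viridis")
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()
```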

One of our next steps is to use a model such as XGBoost that can capture nonlinear relationships. We could also first cluster the data using K-means and then fit a separate regressor, such as a neural network or XGBoost, to each cluster, as sketched below.
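A hedged sketch of that cluster-then-regress idea, clustering on latitude/longitude with k = 5 and illustrative XGBoost hyperparameters (all our assumptions, not results we have run):

```python
import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBRegressor

# Cluster listings by location; k=5 is an arbitrary illustrative choice
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
train_clusters = kmeans.fit_predict(X_train[["latitude", "longitude"]])
test_clusters = kmeans.predict(X_test[["latitude", "longitude"]])

# Fit one gradient-boosted regressor per cluster
models = {
    k: XGBRegressor(n_estimators=300, max_depth=4).fit(
        X_train[train_clusters == k], y_train[train_clusters == k]
    )
    for k in range(kmeans.n_clusters)
}

# Route each test listing to its cluster's model
pred = np.empty(len(X_test))
for k, model in models.items():
    mask = test_clusters == k
    if mask.any():
        pred[mask] = model.predict(X_test[mask])
```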

References

[1] F. Bergadano, R. Bertilone, D. Paolotti, and G. Ruffo, “Developing real estate automated valuation models by learning from heterogeneous data sources,” International Journal of Real Estate Studies, vol. 15, no. 1, pp. 72–85, Jun. 2021. doi:10.11113/intrest.v15n1.10

[2] G. Kamtziridis, D. Vrakas, and G. Tsoumakas, “Does noise affect housing prices? A case study in the urban area of Thessaloniki,” EPJ Data Science, vol. 12, no. 1, Oct. 2023. doi:10.1140/epjds/s13688-023-00424-3

[3] H. Sharma, H. Harsora, and B. Ogunleye, “An optimal house price prediction algorithm: XGBoost,” Analytics, vol. 3, no. 1, pp. 30–45, Jan. 2024. doi:10.3390/analytics3010003

Gantt Chart

Contribution Table

Name             Contributions
Anh Dang         Data preprocessing, GitHub Pages
Arjun Brithi     Model Building
Ayush Gundawar   Model Building, Methods
Daniel Oh        Model Building
Jatong Su        Data preprocessing, Methods