Zillow Home Value Prediction Models

Baseline & Linear Regression

Baseline Model and Linear Regression Modeling


Create a Train/Test Split

The data is split 80/20 into training and test sets for fitting and evaluating the linear model. The specified columns are then converted to factor columns, i.e. back into categorical data, since the CSV stored these integer-encoded categories as plain numbers and did not retain the categorical information. Finally, the parcel ID is removed so the unique identifiers do not add extra noise to the model.

data = data_og

#remove the large categorical variables
data = data[, -c(16, 17, 18)]

n = nrow(data) #row number
train_index = sample(n, .8 * n) #80% train test split creation index
train_x = data[train_index, ] #index data for training
test_x = data[-train_index, ] #index data for testing
train_y = data$logerror[train_index] #create response variable - training
test_y = data$logerror[-train_index] #create response variable - testing
train_data = data.frame(train_x) #full data training
test_data = data.frame(test_x)  #full data testing
 

#using only these columns to change into categorical data - makes it a simpler model with only a few categorical encodings rather than hundreds to thousands of columns - fine for our methods here
cols_to_factor = c('ac_type', 'deck', 'tub_or_spa', 'heating_system', 'pool_hot_or_spa', 'patio', 'shed', 'flag_fireplace','tax_delinquency')
  
train_data[cols_to_factor] = lapply(train_data[cols_to_factor], factor) #apply factor function to selected columns for train
test_data[cols_to_factor] = lapply(test_data[cols_to_factor], factor) #factor selected test columns
train_data = train_data[-1]#remove parcel ID - train
test_data = test_data[-1] #remove Parcel ID - test
train_x = train_x[-1] #remove parcelid train x
test_x = test_x[-1] #remove parcelid test x


Create A Baseline Model with Linear Regression

Now, a baseline model can be constructed using all of the variables prepared in the train/test split step. This model is a linear regression on every feature in the training data except the categorical variables with more than eight unique categories, which were dropped above. We summarize the model to see how each feature relates to the response variable, and finally we predict on the test set using the linear model fit to the training data.


baseline_lin_model = glm(logerror ~ ., data = train_x) #create a baseline linear model using glm
summary(baseline_lin_model)

Call:
glm(formula = logerror ~ ., data = train_x)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.6716  -0.0386  -0.0079   0.0243   5.2435  

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.436e-05  2.110e-01   0.000 0.999719    
ac_type           2.041e-04  3.082e-04   0.662 0.507739    
num_bathroom     -6.804e-03  2.872e-03  -2.369 0.017855 *  
num_bedroom       2.047e-03  6.832e-04   2.996 0.002740 ** 
building_quality -9.816e-04  3.054e-04  -3.214 0.001310 ** 
deck             -8.726e-03  5.419e-03  -1.610 0.107336    
area_total_calc   1.308e-05  1.144e-06  11.439  < 2e-16 ***
num_fireplace    -3.507e-03  1.402e-03  -2.501 0.012372 *  
num_full_bath     6.688e-03  2.798e-03   2.390 0.016842 *  
tub_or_spa       -1.384e-02  4.657e-03  -2.972 0.002957 ** 
heating_system    8.443e-04  2.429e-04   3.476 0.000509 ***
latitude         -2.141e-02  2.791e-02  -0.767 0.443027    
longitude         1.176e-02  1.911e-02   0.615 0.538309    
num_pool         -8.466e-03  1.231e-03  -6.876 6.19e-12 ***
pool_hot_or_spa   6.424e-04  6.171e-03   0.104 0.917097    
zoning_landuse    7.350e-04  2.596e-04   2.832 0.004631 ** 
region_county     5.451e-06  1.658e-06   3.288 0.001008 ** 
patio            -1.433e-03  3.060e-03  -0.468 0.639514    
shed             -2.857e-02  1.418e-02  -2.015 0.043909 *  
property_age     -4.607e-05  2.927e-05  -1.574 0.115466    
flag_fireplace    1.561e-02  9.307e-03   1.677 0.093526 .  
tax_building     -1.016e-08  4.329e-09  -2.347 0.018949 *  
tax_total         3.569e-08  3.440e-09  10.374  < 2e-16 ***
tax_property     -3.475e-06  2.779e-07 -12.508  < 2e-16 ***
tax_delinquency   2.287e-02  2.834e-03   8.070 7.09e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.02610273)

    Null deviance: 3277.6  on 124951  degrees of freedom
Residual deviance: 3260.9  on 124927  degrees of freedom
AIC: -100914

Number of Fisher Scoring iterations: 2


#==================================================================================================


predictions = predict(baseline_lin_model, data.frame(test_x)) #make predictions with test data specified above
mse = mean((test_y - predictions)**2) #calculate the MSE score from predictions
mae = mean(abs(test_y - predictions)) # produce MAE from predictions


Results

The baseline model results in an MSE of 0.0242259 and an MAE of 0.0657223, meaning its predictions are within about 0.066 of the true log error on average across the test set. In terms of the data context, we want to minimize the difference between the estimated price of a given home and its actual price when the home sold, expressed through the log of each. Many of the variables show an apparent linear relationship with the log error, so a basic linear model already does a reasonable job of producing accurate or close-to-accurate predictions.
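
For context, the response being modeled here comes from the Zillow competition data, where the log error is defined as the difference between the log of the Zestimate and the log of the actual sale price: logerror = log(Zestimate) - log(SalePrice).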

Is there a model that can further minimize the difference between the estimate and the actual sale price?


Improve the Baseline Model by selecting Significant Features

The model can likely be improved by keeping only the variables that were significant predictors of the log error in the baseline model. The model below captures this idea by using only those significant variables, and only binary-encoded categorical variables.


#filter by only numerical data, and significant variables indicated by baseline, rather than with any categorical or full dataset
sig_lin_model = glm(logerror ~ num_bathroom + num_bedroom + building_quality + area_total_calc + num_fireplace + num_full_bath + num_pool + tax_total + tax_property + tax_delinquency, data = train_x) 
summary(sig_lin_model)

Call:
glm(formula = logerror ~ num_bathroom + num_bedroom + building_quality + 
    area_total_calc + num_fireplace + num_full_bath + num_pool + 
    tax_total + tax_property + tax_delinquency, data = train_x)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.6719  -0.0385  -0.0078   0.0243   5.2535  

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.908e-03  2.348e-03   1.238  0.21560    
num_bathroom     -4.460e-03  2.504e-03  -1.781  0.07494 .  
num_bedroom       1.467e-03  6.224e-04   2.357  0.01841 *  
building_quality -9.838e-04  3.048e-04  -3.228  0.00125 ** 
area_total_calc   1.086e-05  1.014e-06  10.717  < 2e-16 ***
num_fireplace    -3.401e-03  1.171e-03  -2.905  0.00367 ** 
num_full_bath     4.836e-03  2.477e-03   1.953  0.05086 .  
num_pool         -9.725e-03  1.164e-03  -8.352  < 2e-16 ***
tax_total         3.390e-08  3.312e-09  10.236  < 2e-16 ***
tax_property     -3.486e-06  2.753e-07 -12.663  < 2e-16 ***
tax_delinquency   2.274e-02  2.828e-03   8.043 8.87e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.02611171)

    Null deviance: 3277.6  on 124951  degrees of freedom
Residual deviance: 3262.4  on 124941  degrees of freedom
AIC: -100885

Number of Fisher Scoring iterations: 2


#==================================================================================================


predictions = predict(sig_lin_model, data.frame(test_x))
mse = mean((test_y - predictions)**2)
mae = mean(abs(test_y - predictions))


Results

This reduced model has an MSE of 0.0242305 and an MAE of 0.065721, essentially matching the baseline. Although the change in error is tiny, there is a real gain in model complexity: far fewer predictors achieve nearly identical performance. Even error differences this small can mean thousands to millions of dollars estimated more accurately over the course of months to years.

Overall, linear regression techniques set a strong baseline. The models perform reasonably well with fairly low complexity and give straightforward, immediate results.





Subset Selection

Using Subset Selection to determine the best fit


Create new train/test split for consistency

These subset selection models use the same train/test split as before, including the integer-encoded categorical variables. It is important to note that within the subset selection models the factored categorical variables become one-hot encoded and are modeled as such (see the sketch after the split code below).

data = data_og # initialize original data

#remove the large categorical variables
data = data[, -c(16, 17, 18)]

n = nrow(data) #row number
train_index = sample(n, .8 * n) #80% train test split creation index
train_x = data[train_index, ] #index data for training
test_x = data[-train_index, ] #index data for testing
train_y = data$logerror[train_index] #create response variable - training
test_y = data$logerror[-train_index] #create response variable - testing
train_data = data.frame(train_x) #full data training
test_data = data.frame(test_x)  #full data testing
 


#using only these columns to change into categorical data - makes it a simpler model with only a few categorical encodings rather than hundreds to thousands of columns - fine for our methods here
cols_to_factor = c('ac_type', 'deck', 'tub_or_spa', 'heating_system', 'pool_hot_or_spa', 'patio', 'shed', 'flag_fireplace','tax_delinquency')
  
train_data[cols_to_factor] = lapply(train_data[cols_to_factor], factor) #apply factor function to selected columns for train
test_data[cols_to_factor] = lapply(test_data[cols_to_factor], factor) #factor selected test columns
train_data = train_data[-1]#remove parcel ID - train
test_data = test_data[-1] #remove Parcel ID - test
train_x = train_x[-1] #remove parcelid train x
test_x = test_x[-1] #remove parcelid test x
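
As noted above, regsubsets() expands the factored categorical columns into dummy (one-hot) indicator columns internally via model.matrix(), which is why coefficient names such as tub_or_spa1 appear in the output below. A minimal sketch, for illustration only, of what that expansion looks like on the factored training frame:

#each factor level beyond the baseline level becomes its own 0/1 indicator column
head(model.matrix(logerror ~ ., data = train_data))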



Methods and Overview

The first model uses forward stepwise selection to build a model sequentially from the columns in the data set: the second variable chosen depends on the first, and so forth. This is a quick, effective way to assess which variables may perform best in a model, with the added benefit of possibly decreasing model complexity. The second model is similar, but instead of adding variables from the start, backward selection works from the full model down, removing one variable at a time. Here the two approaches produce nearly identical results and suggest the same variables - and rather than picking variables arbitrarily, the variables used for regression come from the fit with the best BIC evaluation score (see the sketch after the backward selection output below).

We use stepwise subset selection instead of lasso or ridge methods because it performs reasonably well and gives our baseline model a boost. Unfortunately, the cost of fitting a ridge or lasso model with some of the data's very large categorical variables is prohibitive in our case, so we stick with subset selection methods, which leave more flexibility in how the data is wrangled and fit.



Forward Selection


library(leaps) #regsubsets() comes from the leaps package
regfit.fwd = regsubsets(logerror ~ ., data = train_data, nvmax = 19, method = "forward")

sum_fwd = summary(regfit.fwd)
# sum_fwd
plot(regfit.fwd, scale = 'bic')

names(coef(regfit.fwd, 8))
[1] "(Intercept)"      "building_quality" "area_total_calc"  "num_fireplace"   
[5] "tub_or_spa1"      "num_pool"         "tax_total"        "tax_property"    
[9] "tax_delinquency1"



Backward Selection

regfit.bwd = regsubsets(logerror ~ ., data = train_data, nvmax = 19, method = "backward")

sum_bwd = summary(regfit.bwd)
# sum_bwd
plot(regfit.bwd, scale = 'bic')

names(coef(regfit.bwd, 7))
[1] "(Intercept)"      "area_total_calc"  "num_fireplace"    "tub_or_spa1"     
[5] "num_pool"         "tax_total"        "tax_property"     "tax_delinquency1"



Regression Using Subset Selected Variables

In the model below, we fit a linear regression using the best subset of variables recommended by both subset selection models. This demonstrates how subset selection can be applied to produce an even less complex model with reasonably similar performance to the baseline.

lin_subset_selection = glm(logerror ~  + area_total_calc + num_fireplace + tub_or_spa + num_pool + tax_total + tax_property + tax_delinquency + tax_building, data = train_x)

#model below is for different data - data without the same categorical columns
# lin_subset_selection = glm(logerror ~ building_quality + area_total_calc + tub_or_spa + num_pool + tax_total + tax_property + tax_delinquency, data = train_x)

summary(lin_subset_selection)

Call:
glm(formula = logerror ~ +area_total_calc + num_fireplace + tub_or_spa + 
    num_pool + tax_total + tax_property + tax_delinquency + tax_building, 
    data = train_x)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.6702  -0.0382  -0.0075   0.0244   5.2504  

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -2.603e-03  1.058e-03  -2.461 0.013875 *  
area_total_calc  1.438e-05  7.533e-07  19.096  < 2e-16 ***
num_fireplace   -3.853e-03  1.143e-03  -3.372 0.000748 ***
tub_or_spa      -1.173e-02  3.074e-03  -3.817 0.000135 ***
num_pool        -8.246e-03  1.154e-03  -7.148 8.87e-13 ***
tax_total        2.936e-08  3.399e-09   8.637  < 2e-16 ***
tax_property    -3.098e-06  2.760e-07 -11.226  < 2e-16 ***
tax_delinquency  1.761e-02  2.801e-03   6.287 3.24e-10 ***
tax_building    -1.052e-08  4.150e-09  -2.535 0.011249 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.02572527)

    Null deviance: 3228.8  on 124951  degrees of freedom
Residual deviance: 3214.2  on 124943  degrees of freedom
AIC: -102750

Number of Fisher Scoring iterations: 2
predictions = predict(lin_subset_selection, test_x)
mse = mean((test_y - predictions)**2)
mae = mean(abs(test_y - predictions))

Results

This model produces an MSE of 0.0257668 and an MAE of 0.0658934. That is reasonably strong considering the model only uses eight of the most compelling variables to predict the log error. It is still not the best we can do, but it shows how understanding the data and how the variables relate can iteratively improve the balance between model complexity and performance.





Gradient Boosting

Using Gradient Boosting for Regression

In the model below, we use boosting to branch away from simple linear regression. We selected boosting because it is a more advanced regression method that can be implemented in a few lines of code. Boosting is also reasonably straightforward, and since it is built on decision trees, the model can more reliably take intense categorical variables as inputs and still produce a good result. It is also a natural first step into tree-based methods, with a random forest as the end goal, and it gives us a good benchmark. Additionally, gradient boosting introduces tunable parameters and more control than plain linear regression. Finally, this is the first of the ensemble methods employed for this dataset, and it will serve as a good benchmark for the next planned ensemble method: regression with a random forest.

Train/Test Split

The train/test split for this model takes every variable into account, including the heavily encoded categorical variables; the model fits comfortably regardless. As seen below, the usual split occurs, then all of the available categorical variables are converted to factors. In contrast, the previous models only used a handful of the categorical variables.

data = data_og # initialize original data

n = nrow(data)
train_index = sample(n, .8 * n)
train_x = data[train_index, ]
test_x = data[-train_index, ]
train_y = data$logerror[train_index]
test_y = data$logerror[-train_index]
train_data = data.frame(train_x)
test_data = data.frame(test_x)

cols_to_factor = c('ac_type', 'deck', 'tub_or_spa', 'heating_system', 'zoning_landuse', 'region_city', 'region_county', 'pool_hot_or_spa', 'patio', 'shed', 'flag_fireplace','tax_delinquency')#only some of the categorical

train_data[cols_to_factor] = lapply(train_data[cols_to_factor], factor)
test_data[cols_to_factor] = lapply(test_data[cols_to_factor], factor)
train_data = train_data[-1] #remove parcel ID - train
test_data = test_data[-1] #remove Parcel ID - test

Boosting Model


# boosting
library(gbm) #gbm() provides gradient boosted regression trees
gbm.out <- gbm(logerror ~ ., distribution = "gaussian", interaction.depth = 3,
               n.trees = 1000, shrinkage = 0.001, data = train_x)
gbm.pred <- predict.gbm(gbm.out,newdata=test_x, n.trees = 1000)
gbm.mse <- mean((gbm.pred-test_y)^2)
print(paste0('gbm Model training MSE: ', gbm.mse))
[1] "gbm Model training MSE: 0.0259755433258037"

mae = mean(abs(test_y - gbm.pred))

Results

The boosting model produces an MSE of 0.0259755 and an MAE of 0.0658451. The model performs reasonably well and gives a result on par with the earlier models, but with the option to tune the parameters further and engineer better scenarios. Since it performs comparably, its parameters will remain as shown for now.
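
If further tuning were pursued, one option (a sketch only, not run for the results above, and assuming the same train_x, test_x, and test_y objects) is to let cross-validation choose the number of trees:

#fit with internal 5-fold cross-validation, let gbm.perf() pick the tree count
#that minimizes the cross-validated error, then predict on the test set
gbm.cv <- gbm(logerror ~ ., distribution = "gaussian", data = train_x,
              interaction.depth = 3, n.trees = 5000, shrinkage = 0.01, cv.folds = 5)
best_iter <- gbm.perf(gbm.cv, method = "cv")
gbm.pred.cv <- predict(gbm.cv, newdata = test_x, n.trees = best_iter)
mean((gbm.pred.cv - test_y)^2) #test MSE with the CV-chosen number of trees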

Random Forest

We chose random forest regression because it is an advanced regression method that offers a large amount of control over tunable parameters and tends to learn and predict very well. We expect this model to perform the best, and it also produces a variable importance plot we can use to gain additional insight into the data.

More importantly, this advanced method employs an ensemble learning technique, combining many decision trees to produce a better result. The algorithm's ability to handle many inputs was also attractive in planning the modeling techniques, and as such it will serve as the final and most advanced model in this analysis.

Advanced Regression Modeling with Random Forests


Train/Test Split

This split is the same as the one used for the boosting model, and uses all of the categorical variables to perform the regression.

data = data_og # initialize original data

n = nrow(data)
train_index = sample(n, .8 * n)
train_x = data[train_index, ]
test_x = data[-train_index, ]
train_y = data$logerror[train_index]
test_y = data$logerror[-train_index]
train_data = data.frame(train_x)
test_data = data.frame(test_x)

cols_to_factor = c('ac_type', 'deck', 'tub_or_spa', 'heating_system', 'zoning_landuse', 'region_city', 'region_county', 'pool_hot_or_spa', 'patio', 'shed', 'flag_fireplace','tax_delinquency')#only some of the categorical

train_data[cols_to_factor] = lapply(train_data[cols_to_factor], factor)
test_data[cols_to_factor] = lapply(test_data[cols_to_factor], factor)
train_data = train_data[-1] #remove parcel ID - train
test_data = test_data[-1] #remove Parcel ID - test

Random Forest Model


# random forest
library(h2o)
h2o.init(nthreads = -1)

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\Griffin\AppData\Local\Temp\RtmpcpWVqo\file496c3fed6e03/h2o_Griffin_started_from_r.out
    C:\Users\Griffin\AppData\Local\Temp\RtmpcpWVqo\file496c74be42cf/h2o_Griffin_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 170 milliseconds 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.30.0.1 
    H2O cluster version age:    18 days  
    H2O cluster name:           H2O_started_from_R_Griffin_rfr964 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   7.09 GB 
    H2O cluster total cores:    12 
    H2O cluster allowed cores:  12 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 3.6.2 (2019-12-12) 
train_h2o <- as.h2o(train_x)

rf_trained <- h2o.randomForest(
  training_frame = train_h2o,
  x=names(train_x)[-which(names(train_x)=="logerror")],
  y="logerror",
  model_id = "rf",
  ntrees = 1000,
  seed = 123)

rf.pred <- predict(rf_trained,newdata=as.h2o(test_x))

rf.mse <- mean((rf.pred-test_y)^2)
print(paste0('random forest Model training MSE: ', rf.mse))
[1] "random forest Model training MSE: 0.00135561192644133"
mae = mean(abs(test_y - rf.pred))

h2o.varimp_plot(rf_trained, 15)



h2o.shutdown(prompt=F)
detach("package:h2o")
[1] "A shutdown has been triggered. "

Results

The random forest model produces an MSE of 0.0013556 and an MAE of 0.0201656. The model performs extraordinarily well in comparison to the other methods. There are some concerns about a random forest's tendency to overfit; however, with the amount of data and the train/test split employed, the model should be fine (one way to check this is sketched below). Overall, this ensemble model produces good predictions and will ultimately be our final model for predicting home values.
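
One way to probe the overfitting concern, sketched here but not run for the results above (it assumes the same train_h2o frame and column names, and would need to execute before the h2o.shutdown() call), is to refit with k-fold cross-validation and compare the cross-validated MSE to the hold-out MSE:

#refit the forest with 5-fold cross-validation; h2o.mse(..., xval = TRUE)
#reports the cross-validated MSE for comparison with the hold-out value
rf_cv <- h2o.randomForest(
  training_frame = train_h2o,
  x = names(train_x)[-which(names(train_x) == "logerror")],
  y = "logerror",
  ntrees = 1000,
  nfolds = 5,
  seed = 123)
h2o.mse(rf_cv, xval = TRUE)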

Finally, the variable importance plot gives further insight into which variables are most important in determining the log error for a given home.
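
If the numeric importances behind that plot are also wanted as a table, a minimal sketch (again, run while the H2O cluster is still up) is:

#pull the variable importance table that backs h2o.varimp_plot()
imp <- h2o.varimp(rf_trained)
head(imp, 15) #top 15 variables, matching the plot above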

Next Steps