Competition Notebooks

Linear Regression

The author of this notebook built a simple linear regression model predicting logerror as a function of the number of full baths, finished square feet, and the parcel ID columns of the data set. They did not explain why they chose these variables, but bath count and square footage are typically strong predictors of home value.

https://www.kaggle.com/monkeyorman/simple-linear-regression-in-r-0-0648835/data

Data Pretreatment

The author imports the training-data CSV file containing the log errors, drops the transaction date, and converts parcel ID to an integer. He then creates a new data frame by running a SQL query that selects parcel ID and the average logerror, grouped by parcel ID (collapsing duplicate transactions for the same parcel).

He also reads in the dataset containing the house features and merges it with the data frame transformed above.

library(readr)  # read_csv
library(sqldf)  # run SQL queries against data frames

print('Importing train_2016 ... should be 90275')
train_2016 <- read_csv("../input/train_2016_v2.csv", col_types = cols(transactiondate = col_skip(), parcelid = col_integer())); nrow(train_2016)

print('Combining dupes (taking average) ... should be 90150')
train_2016 <- sqldf('SELECT parcelid, avg(logerror) as logerror FROM train_2016 GROUP BY parcelid'); nrow(train_2016)

print('Importing properties_2016 ... should be 2,985,217')
properties_2016 <- read_csv("../input/properties_2016.csv", col_types = cols(parcelid = col_integer())); nrow(properties_2016)

print('Combining train_2016 and properties_2016 ... should be 2,985,217')
alldata <- merge(x = properties_2016, y = train_2016, by = "parcelid", all.x = TRUE)  # left join: keep every property

Building the Model

The author builds a model predicting logerror as a function of the number of full bathrooms, finished square footage, and parcel ID, using all of the 2016 data, and then generates predictions on that same dataset. A better approach would have been to use all of the available 2016 and 2017 data and split it into training and test sets. Additionally, using parcel ID as a predictor exploits a loophole in the data structure (it is an arbitrary identifier, not a property characteristic) and would not be useful in a real-world deployment.
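A train/test split like the one suggested above takes only a few lines of base R. This is a sketch, not the author's code: the data frame below is a synthetic stand-in for alldata, with only the column names borrowed from the dataset.

```r
# Synthetic stand-in for alldata (column names match the Zillow data)
set.seed(42)
toy <- data.frame(
  fullbathcnt = sample(1:4, 500, replace = TRUE),
  calculatedfinishedsquarefeet = runif(500, 800, 4000),
  logerror = rnorm(500, 0, 0.16)
)

# 80/20 split, then fit on train and score on held-out test rows
idx   <- sample(nrow(toy), 0.8 * nrow(toy))
train <- toy[idx, ]
test  <- toy[-idx, ]

fit <- lm(logerror ~ fullbathcnt + calculatedfinishedsquarefeet, data = train)
test_mae <- mean(abs(predict(fit, newdata = test) - test$logerror))
```

Evaluating on rows the model never saw gives an honest estimate of generalization, which predicting back onto the training data cannot.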

Finally, the author appends the predictions to the full data frame, replaces missing predictions with the mean of the observed logerror values, and takes the average of the entire prediction column to report an overall average prediction value.

lr1 <- lm(logerror ~ fullbathcnt + calculatedfinishedsquarefeet + parcelid, data=alldata);
summary(lr1) # view the model

print('Making predictions ...')
predictions <- data.frame(predict(lr1, alldata))

print('Appending predictions to alldata ...')
alldata$p_lr1 <- predictions$predict.lr1..alldata.  # auto-generated column name from data.frame(predict(...))
alldata$p_lr1[is.na(alldata$p_lr1)] <- mean(alldata$logerror, na.rm = TRUE)  # Replace missing with average

print('Average prediction value is ...')
mean(alldata$p_lr1)
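Since the competition scored submissions on mean absolute error, a more informative summary than the average prediction value would be the in-sample MAE on rows where logerror is observed. A minimal sketch with synthetic stand-in data (the column names mirror the code above; this is not the author's code):

```r
# Synthetic stand-in: some rows have no observed logerror, as after the left join
set.seed(1)
alldata_toy <- data.frame(
  logerror = c(rnorm(100, 0, 0.16), rep(NA, 50)),
  p_lr1    = rnorm(150, 0, 0.05)
)

# MAE computed only where the target is observed
scored <- !is.na(alldata_toy$logerror)
mae <- mean(abs(alldata_toy$p_lr1[scored] - alldata_toy$logerror[scored]))
```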

Discussion

The author of this notebook took a very simple yet effective approach; the notebook ranked fairly highly on the public leaderboard. Because it is so basic, the model serves as a good baseline. Predictions are built on two traditional home-valuation characteristics: number of bathrooms and square footage. Most other features have a much smaller effect on home value than these two, but Zestimate is valuable precisely because it can exploit those minor characteristics to produce highly accurate valuations. Any future model worth pursuing should improve on the performance of this baseline.

Simple Boosted Regression

The author of this notebook tried numerous models but settled on a generalized boosted model using the gbm() function.

https://www.kaggle.com/venkatvurimalla/simple-boosted-regression

The author merges the training data containing the log errors with the house-feature dataset, then splits the result 80/20 into training and test sets.

train_2016 <- read.csv("train_2016_v2.csv", header = T)
properties_2016 <- read.csv("properties_2016.csv", header = T)
data <- merge.data.frame(train_2016,properties_2016, by = "parcelid")

n <- nrow(data)
train.indices.credit <- sample(n, .8 * n)  # 80/20 split; no seed is set, so the split is not reproducible
train.x <- data[train.indices.credit, ]
test.x  <- data[-train.indices.credit, ]
train.y <- data$logerror[train.indices.credit]
test.y  <- data$logerror[-train.indices.credit]
# Note: train.x/test.x already contain logerror, so these append a duplicate column
train.data <- data.frame(train.x, logerror = train.y)
test.data  <- data.frame(test.x,  logerror = test.y)

The author then builds a generalized boosted model on the training data, modeling logerror as a function of all variables except those explicitly subtracted in the formula. He excludes, for example, property zoning, building class type, and the pool-type indicators. The model uses 1,000 trees and a shrinkage of 0.001.

train.data$transactiondate <- as.factor(train.data$transactiondate)
test.data$transactiondate <- as.factor(test.x$transactiondate)

gbm.out <- gbm(logerror ~ .-(propertyzoningdesc +buildingclasstypeid + decktypeid + 
                poolcnt +pooltypeid10 + pooltypeid2 + pooltypeid7 + storytypeid + 
                  assessmentyear), distribution = "gaussian",interaction.depth = 3, 
               n.trees = 1000, shrinkage = 0.001, data = train.data)

train.pred <- predict(gbm.out, train.data, n.trees = 1000)

mse <- mean((train.pred-train.y)^2)
print(paste0('GBM Model training MSE: ', mse))
## [1] "GBM Model training MSE: 0.00819233777966675"
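A raw MSE is hard to interpret on its own; one quick sanity check is to compare it against a constant-mean baseline, since any useful model should beat predicting mean(logerror) for every row. A sketch with synthetic data standing in for the logerror column:

```r
# Synthetic stand-in for logerror
set.seed(7)
y <- rnorm(1000, mean = 0.01, sd = 0.09)

# MSE of the constant-mean predictor; equals the (biased) sample variance of y
baseline_mse <- mean((mean(y) - y)^2)
```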

Finally, the author tests the model using subsets of months from 2016 and 2017. We found this method unclear and could not get it to work as written, so we modified it to produce usable results.

pred <- predict(gbm.out, test.data, n.trees = 1000)

mse1 <- mean((pred-test.y)^2)
print(paste0('GBM Model test MSE: ', mse1))
## [1] "GBM Model test MSE: 0.00726934484405573"

Discussion

The author of this notebook used a method we did not cover in the course: generalized boosted models. Although boosting was not covered, we saw a prevalence of successful entries that incorporated it. Given the small MSE, the above model demonstrates how valuable boosting can be for predicting the log error.
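One refinement the notebook does not attempt: gbm can cross-validate the number of trees rather than fixing it at 1,000. A hedged sketch, assuming the gbm package is installed and using synthetic data rather than the author's dataset:

```r
library(gbm)

# Small synthetic regression problem
set.seed(2)
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- 0.3 * d$x1 - 0.2 * d$x2 + rnorm(300, 0, 0.1)

# cv.folds enables internal cross-validation over the boosting iterations
fit <- gbm(y ~ x1 + x2, distribution = "gaussian", data = d,
           n.trees = 1000, shrinkage = 0.01, interaction.depth = 2,
           cv.folds = 5)

# gbm.perf returns the CV-optimal number of trees
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)
pred <- predict(fit, d, n.trees = best_iter)
```

Using the CV-selected iteration count guards against both under- and over-boosting, which a hand-picked n.trees cannot.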

Feature Engineering

This was a very popular notebook on the public leaderboards because of the care paid to feature selection and engineering. The author uses all of the available home-feature data along with the transaction and test data. Specifically, the author generates new features in three sub-categories: geo-location, tax-related information, and ratios.

https://www.kaggle.com/lauracozma/eda-data-cleaning-feature-engineering

Geo-location variables

Location information is included for each home; however, many of the location variables are missing or incomplete. The author noticed that rows sharing the same latitude and longitude did not always agree on the other location variables, so for each unique (latitude, longitude) pair he keeps the row with the most complete set of location variables. Additionally, the author performs k-means clustering on latitude and longitude to produce a new feature called “Kmeans_cluster” with 405 clusters.
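The clustering step can be reproduced with stats::kmeans. The notebook uses 405 clusters on the full property table; the sketch below uses synthetic coordinates and a smaller k, so it illustrates the technique rather than the author's exact feature.

```r
# Synthetic coordinates roughly spanning the LA area; Zillow stores
# latitude/longitude scaled by 1e6
set.seed(123)
coords <- data.frame(
  latitude  = runif(2000, 33.3e6, 34.8e6),
  longitude = runif(2000, -119.4e6, -117.6e6)
)

# Cluster on the two coordinates and attach the label as a new feature
km <- kmeans(coords, centers = 25, nstart = 5)
coords$Kmeans_cluster <- factor(km$cluster)
```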

Tax features

The author ran a quick exploratory analysis to extract patterns and detect anomalies in the tax-related variables, then created several new features combining tax-specific and property information.

After detecting outliers, the author removes rows below the 0.005 quantile and above the 0.995 quantile. The author also creates a tax ratio, equal to the tax amount divided by the tax value. Additionally, the author creates a tax delinquency variable describing how much property tax is due on the parcel, and a market variability variable capturing the mean tax value and its variability across zip codes.

## outliers cleansing for variables:  taxvaluedollarcnt structuretaxvaluedollarcnt landtaxvaluedollarcnt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.012   0.013   0.014   0.014   2.188    5075
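The outlier trimming and tax ratio described above can be sketched with base R's quantile(). The data below are synthetic; only the column names follow the Zillow data dictionary.

```r
# Synthetic tax columns (log-normal, like real dollar amounts)
set.seed(99)
tax <- data.frame(
  taxvaluedollarcnt = rlnorm(1000, meanlog = 12, sdlog = 1),
  taxamount         = rlnorm(1000, meanlog = 8.5, sdlog = 1)
)

# Keep only rows inside the [0.005, 0.995] quantile band of tax value
q <- quantile(tax$taxvaluedollarcnt, probs = c(0.005, 0.995))
tax_clean <- subset(tax, taxvaluedollarcnt >= q[1] & taxvaluedollarcnt <= q[2])

# Tax ratio: amount paid relative to assessed value
tax_clean$tax_ratio <- tax_clean$taxamount / tax_clean$taxvaluedollarcnt
```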

Ratios

In the last part of the notebook, the author creates several home-characteristic features, such as average room size, calculated as total finished living area divided by room count.

#create train dataset
df_train <- train %>% 
  left_join(properties_2016, by="parcelid") %>% 
  setDT 
#define the features
df_train_features <- df_train %>% 
  mutate(
    years_life = 2018 - yearbuilt,    # life of property
    living_area_error = calculatedfinishedsquarefeet / finishedsquarefeet12, # error in calculation of the finished living area
    first_floor_living_area_error = finishedsquarefeet50 / finishedfloor1squarefeet, # error in size of the finished living area on the first (entry)
    bathroom_nr_error = calculatedbathnbr / bathroomcnt, # error in calculation of the number of bathrooms
    bathroom_count_error = (threequarterbathnbr + fullbathcnt) / bathroomcnt, # error in count of different number of bathrooms
    avg_room_area_calc = calculatedfinishedsquarefeet / roomcnt, # average room size: calculated total finished living area / nr. rooms
    avg_room_area = finishedsquarefeet12 / roomcnt, # average room size: total finished living area / nr. rooms
    living_area_prop2 = finishedsquarefeet12 / finishedsquarefeet15,   # rate area living vs area total
    bathroom_bedroom_ratio = bedroomcnt / bathroomcnt, # number of bathrooms per bedroom
    bathroom_bedroom_ratio_calc = calculatedbathnbr / bedroomcnt, # number of calculated bathrooms per bedroom
    avg_bathroom_area_calc = calculatedfinishedsquarefeet / bathroomcnt, # average bathroom size: calculated total finished living area / nr. rooms
    avg_bathroom_area = finishedsquarefeet12 / bathroomcnt , # average bathroom size: total finished living area / nr. rooms
    basement_area_ratio = basementsqft / finishedsquarefeet12, # % of basement with respect to total finished living area 
    basement_area_ratio_calc = basementsqft / calculatedfinishedsquarefeet, # % of basement with respect to calculated total finished living area 
    living_area_total_area_ratio_calc = calculatedfinishedsquarefeet / finishedsquarefeet15, # calculated total finished living area / total area
    living_area_total_area_ratio = finishedsquarefeet12 / finishedsquarefeet15, # total finished living area / total area
    living_area_lot_area_calc = calculatedfinishedsquarefeet / lotsizesquarefeet, #calculated total finished living area /  area of the lot
    living_area_lot_area = finishedsquarefeet12 / lotsizesquarefeet, #total finished living area /  area of the lot
    avg_size_pool = poolsizesum / poolcnt, # average size of pool
    pools_patio_ratio = poolsizesum / yardbuildingsqft17, #  % of area of pools in patio
    pools_shed_ratio = yardbuildingsqft26 / yardbuildingsqft17, # % of shed area in patio area
    floor_shape_calc = finishedsquarefeet13 / calculatedfinishedsquarefeet, # perimeter living area / calculated living area - compact areas will have a low ratio
    floor_shape = finishedsquarefeet13 / finishedsquarefeet12, # perimeter living area / living area - long, narrow areas will have a high ratio
    num_fireplace_area = fireplacecnt / finishedsquarefeet15) # number of fireplace per area

#Now we want to get a summary of the features created
new_features <- setdiff(names(df_train_features), names(df_train))
df_train_features %>% 
  select_(.dots = new_features) %>% 
  summary()

Random Forest

Finally, the author built a random forest model to rank features by predictive importance and lists the top 20. Several features created by the author make it into the top 20, including kmeans_cluster, which ranks 5th.

# use random forest to get a ranking of best variables

df_train_features_2 <- df_train_features %>% 
  select(-parcelid) %>% setDF

library(h2o)
h2o.init(nthreads = -1)
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\david\AppData\Local\Temp\Rtmpgl6XU0\file2cb052fc62c8/h2o_david_started_from_r.out
##     C:\Users\david\AppData\Local\Temp\Rtmpgl6XU0\file2cb05e294a45/h2o_david_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 418 milliseconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.1 
##     H2O cluster version age:    12 days  
##     H2O cluster name:           H2O_started_from_R_david_mmf047 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.52 GB 
##     H2O cluster total cores:    12 
##     H2O cluster allowed cores:  12 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.3 (2020-02-29)
train_h2o <- as.h2o(df_train_features_2)
##   |======================================================================| 100%
rf_trained <- h2o.randomForest(
  training_frame = train_h2o,
  x=names(df_train_features_2)[-which(names(df_train_features_2)=="logerror")],
  y="logerror",
  model_id = "rf",
  ntrees = 500,
  seed = 123)
##   |======================================================================| 100%
h2o.varimp_plot(rf_trained, num_of_features = 20)

h2o.shutdown(prompt = F)
## [1] "A shutdown has been triggered. "
detach("package:h2o") # avoid select conflicts with dplyr

Discussion

From the variable importance plot, we see that property county land use code is by far the most important feature, and that numerous engineered features made it into the top 20. This suggests the power of feature engineering in strengthening a dataset.

This notebook was very highly rated on the public leaderboards, we think because of the astute attention to data imputation and feature engineering. Augmenting the original dataset went a long way toward improving model performance, and many other teams likely took cues on feature engineering from this author.