Neural network analysis - listings and sales

A neural network prediction of the time it takes to sell an item after listing

A sample of 10,000 listed items was taken from the German database and exported from Hive to CSV

We will use the nnet and caret packages to create and train the neural network

In [1]:
require(nnet)
require(caret)
Loading required package: nnet
Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2

The items are read in and the metrics are min-max normalised to the [0, 1] range for more convenient analysis

In [3]:
items <- read.csv("listing.csv")
# trukme (Lithuanian for "duration"): days from listing (local_date) to sale
items$trukme <- as.numeric(as.Date(items$sale_date) - as.Date(items$local_date))
# when: days_since_registration net of the sale duration
items$when <- as.numeric(items$days_since_registration - items$trukme)
# Min-max normalise each metric to [0, 1]
items$when_n <- (items$when-min(items$when))/(max(items$when)-min(items$when))
items$trukme_n <- (items$trukme-min(items$trukme))/(max(items$trukme)-min(items$trukme))
items$days_since_registration_n <- (items$days_since_registration-min(items$days_since_registration))/(max(items$days_since_registration)-min(items$days_since_registration))
items$listing_price_eur_fixed_n <- (items$listing_price_eur_fixed-min(items$listing_price_eur_fixed))/(max(items$listing_price_eur_fixed)-min(items$listing_price_eur_fixed))
# Keep only the columns used in the models, under shorter names
foo <- data.frame(items$status, items$category, items$trukme_n, items$days_since_registration_n, items$listing_price_eur_fixed_n, items$when_n)
names(foo) <- c("status", "category", "trukme", "days_since_reg", "price", "when")
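The repeated min-max expressions above can be factored into a helper, with an inverse to map scaled predictions back into days. A minimal sketch (the function names `normalise` and `denormalise` are our own, not from the notebook):

```r
# Min-max scaling to [0, 1]; assumes max(x) > min(x)
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# Inverse transform: map a scaled value back to the original units
denormalise <- function(x_n, orig) x_n * (max(orig) - min(orig)) + min(orig)

days <- c(0, 5, 10, 20)
days_n <- normalise(days)                   # 0.00 0.25 0.50 1.00
all.equal(denormalise(days_n, days), days)  # TRUE: round-trips exactly
```

With such a helper, each `items$..._n` column above becomes a one-liner like `items$trukme_n <- normalise(items$trukme)`.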

We split the data into a 70% training set and a 30% test set

In [4]:
index <- sample(1:nrow(foo),round(0.7*nrow(foo)))
train <- foo[index,]
test <- foo[-index,]
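Because `sample` is random, re-running the notebook produces a different split. Fixing the RNG seed first makes the split reproducible; a sketch on a toy data frame (`toy` and the seed value are illustrative, not from the notebook):

```r
set.seed(42)  # fix the RNG so the same split is drawn every run
toy <- data.frame(x = 1:10)
idx <- sample(1:nrow(toy), round(0.7 * nrow(toy)))
train_toy <- toy[idx, , drop = FALSE]   # 7 rows
test_toy  <- toy[-idx, , drop = FALSE]  # the remaining 3 rows
```

caret also provides `createDataPartition`, which additionally stratifies the split on the outcome variable.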

Here is the neural network model itself; caret tunes the hidden-layer size and weight decay over a small grid

In [5]:
model <- train(trukme ~ status + price + when,
               train, method='nnet', linout=1, trace=F, maxit=10000, MaxNWts=30000,
               tuneGrid=expand.grid(.size=c(2, 3, 5),.decay=c(0.001, 0.01, 0.1)))
model
Out[5]:
Neural Network 

7000 samples
   5 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000, 7000, ... 
Resampling results across tuning parameters:

  size  decay  RMSE        Rsquared   RMSE SD      Rsquared SD
  2     0.001  0.06523965  0.5769124  0.002201100  0.02533697 
  2     0.010  0.06648973  0.5608278  0.002679545  0.02760100 
  2     0.100  0.06951389  0.5208065  0.002157327  0.02633275 
  3     0.001  0.06511631  0.5787308  0.002313365  0.02733156 
  3     0.010  0.06568067  0.5711052  0.002209810  0.02585617 
  3     0.100  0.06904968  0.5272302  0.002088746  0.02598566 
  5     0.001  0.06495730  0.5807438  0.002076235  0.02457236 
  5     0.010  0.06570316  0.5707927  0.002166298  0.02560607 
  5     0.100  0.06905890  0.5270862  0.002087202  0.02589815 

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 5 and decay = 0.001. 

Make predictions with the model on the test set and plot them against the real values

In [7]:
ps <- predict(model, test)
plot(ps, test$trukme, xlab="NN predicted", ylab="Real data")
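Beyond eyeballing the scatter plot, the test-set fit can be summarised numerically. A base-R sketch of the two metrics caret reported during tuning, shown on toy vectors (`obs` and `pred` are illustrative, not the notebook's data):

```r
# Root mean squared error between predictions and observed values
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# R-squared: 1 minus residual sum of squares over total sum of squares
r_squared <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

obs  <- c(0.1, 0.2, 0.4, 0.5)
pred <- c(0.12, 0.18, 0.41, 0.47)
rmse(pred, obs)       # ~0.0212
r_squared(pred, obs)  # 0.982
```

For the real data, caret's `postResample(ps, test$trukme)` computes RMSE, R-squared, and MAE directly.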

Compare this to a simple linear regression

In [8]:
lm.fit <- glm(trukme ~ status + price + when, data=train)
summary(lm.fit)
pr.lm <- predict(lm.fit,test)
plot(pr.lm, test$trukme)
Out[8]:
Call:
glm(formula = trukme ~ status + price + when, data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.17178  -0.05447  -0.01631   0.03007   0.73790  

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.361455   0.027569  13.111  < 2e-16 ***
statusa. New with tags -0.039691   0.028007  -1.417 0.156470    
statusb. New           -0.072101   0.027233  -2.648 0.008126 ** 
statusc. Mint          -0.024824   0.027204  -0.913 0.361518    
statusd. Very good     -0.099637   0.027165  -3.668 0.000246 ***
statuse. Good          -0.087365   0.027702  -3.154 0.001619 ** 
price                   0.016233   0.027812   0.584 0.559463    
when                   -0.394167   0.007943 -49.623  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.007353833)

    Null deviance: 70.224  on 6999  degrees of freedom
Residual deviance: 51.418  on 6992  degrees of freedom
AIC: -14513

Number of Fisher Scoring iterations: 2
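Note that `glm` with its default gaussian family fits the same ordinary least squares model as `lm`; it is used here so the fit drops into the same prediction workflow. A quick sketch on toy data (`toy` is illustrative):

```r
toy <- data.frame(y = c(1, 2, 3, 5), x = c(1, 2, 3, 4))
fit_glm <- glm(y ~ x, data = toy)       # gaussian family by default
fit_lm  <- lm(y ~ x, data = toy)
all.equal(coef(fit_glm), coef(fit_lm))  # TRUE: identical coefficients
```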

Finally, a comparison of the test-set MSE of residuals for the neural network and the linear regression model

In [9]:
MSE.lm <- sum((pr.lm-test$trukme)^2)/nrow(test)
MSE.nn <- sum((ps-test$trukme)^2)/nrow(test)
c(MSE.lm, MSE.nn)
Out[9]:
  1. 0.00612943980201201
  2. 0.00371389874798525
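Since `trukme` was min-max scaled, these MSEs are in [0, 1] units. To read the error in days, note that RMSE scales linearly with the original range; a sketch where the range value is hypothetical (the real one would be `max(items$trukme) - min(items$trukme)`):

```r
mse_nn_scaled <- 0.00371  # the neural network's test MSE from above
trukme_range  <- 100      # hypothetical listing-to-sale range, in days

# rmse_days = rmse_scaled * range, because min-max scaling is linear
rmse_days <- sqrt(mse_nn_scaled) * trukme_range
rmse_days  # ~6.1 days of typical prediction error under this assumption
```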

Conclusion: the neural network predicts the listing-to-sale time better than a simple linear model (test MSE 0.0037 vs 0.0061)
