January 2015


Overview



  1. Variable inclusion plots
  2. Model stability plots
  3. Adaptive fence method
  4. Bootstrapping glmnet
  5. Robustness considerations
  6. The mplot package
  7. Diabetes data example

A stability-based approach

Aim

To provide scientists/researchers with tools that give them more information about the model selection choices that they are making.

Method

  • interactive graphical tools
  • exhaustive searches (where feasible)
  • bootstrapping to assess selection stability

The concept of model stability was independently introduced by Meinshausen and Bühlmann (2010) and Müller and Welsh (2010) for different linear regression settings.

Some notation

  • Say we have a full model with an \(n\times p\) design matrix \(\mathbf{X}\).
  • Let \(\alpha\) be any subset of \(p_\alpha\) distinct elements from \(\{1,\ldots,p\}\).
  • We can define an \(n\times p_\alpha\) submodel with design matrix \(\mathbf{X}_\alpha\), obtained by selecting the columns of \(\mathbf{X}\) indexed by the elements of \(\alpha\).
  • Denote the set of all possible models as \(\mathcal{A}=\{\{1\},\ldots,\alpha_f\}\).
  • Coefficient vector \(\beta\).
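
To make the notation concrete, here is a small sketch using the artificialeg data from the mplot package (assuming, as in the example below, that the predictors x1 to x9 occupy the first nine columns):

require(mplot)
X = as.matrix(artificialeg[, 1:9])  # n x p full design matrix (no intercept column)
alpha = c(1, 3)                     # a candidate subset of {1, ..., p}
X.alpha = X[, alpha, drop = FALSE]  # n x p_alpha submodel design matrix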

Artificial example

require(mplot)
lm.art = lm(y ~ ., data = artificialeg)
round(summary(lm.art)$coefficients, 2)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.10       0.33   -0.31     0.76
## x1              0.64       0.69    0.92     0.36
## x2              0.26       0.62    0.42     0.68
## x3             -0.51       1.24   -0.41     0.68
## x4             -0.30       0.25   -1.18     0.24
## x5              0.36       0.60    0.59     0.56
## x6             -0.54       0.96   -0.56     0.58
## x7             -0.43       0.63   -0.68     0.50
## x8              0.15       0.62    0.24     0.81
## x9              0.40       0.64    0.63     0.53

Artificial example – Stepwise

The true data generating process is:

\[y = 0.6x_8 + \varepsilon.\]

art.true = lm(y ~ x8, data = artificialeg)
round(summary(art.true)$coefficients, 2)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.03       0.29    0.11     0.91
## x8              0.55       0.05   10.43     0.00

Tuning parameters show up frequently in model selection!

Information Criterion

  • Generalised IC: \(\text{GIC}(\alpha;\lambda) = -2\times \text{LogLik}(\alpha)+\lambda p_\alpha\)

With important special cases:

  • AIC: \(\lambda=2\)
  • BIC: \(\lambda = \log(n)\)
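
As a quick check of these special cases, a minimal sketch using the artificialeg data (note that R's logLik() degrees of freedom also count the error variance, a constant offset at fixed \(\lambda\) that does not change model rankings):

gic = function(fit, lambda) {
  ll = logLik(fit)
  -2 * as.numeric(ll) + lambda * attr(ll, "df")
}
fit = lm(y ~ x8, data = artificialeg)
c(gic(fit, 2), AIC(fit))               # lambda = 2 recovers AIC
c(gic(fit, log(nobs(fit))), BIC(fit))  # lambda = log(n) recovers BIC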

Regularisation routines

  • Lasso: minimises \(-\text{LogLik}(\alpha) + \lambda\|\beta_\alpha\|_1\)
  • Many variants of the Lasso, SCAD, …
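
For instance, the lasso path for the artificial example can be traced with glmnet (an illustrative sketch, assuming y is the response column of artificialeg):

library(glmnet)
x = as.matrix(artificialeg[, 1:9])
y = artificialeg$y
lasso.art = glmnet(x, y)
plot(lasso.art, xvar = "lambda", label = TRUE)  # coefficient paths against log(lambda)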

Variable inclusion plots

Variable inclusion plots

Aim

To visualise inclusion probabilities as a function of the penalty multiplier \(\lambda\in [0,2\log(n)]\).

Procedure

  1. Calculate (weighted) bootstrap samples \(b=1,\ldots,B\).
  2. For each bootstrap sample, at each \(\lambda\) value, find \(\hat{\alpha}_\lambda^{(b)}\in\mathcal{A}\) as the model with smallest \(\text{GIC}(\alpha;\lambda) = -2\times \text{LogLik}(\alpha)+\lambda p_\alpha\).
  3. The inclusion probability for variable \(x_j\) is estimated as \(\frac{1}{B}\sum_{b=1}^B 1\{j\in \hat{\alpha}_\lambda^{(b)}\}\).
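
A minimal, illustrative implementation of this procedure for a small linear model; this is a sketch under simplifying assumptions (exhaustive search over all subsets, exponential bootstrap weights), not the mplot internals:

set.seed(1)
n = 50; p = 4
X = matrix(rnorm(n * p), n, p)
dat = data.frame(y = 0.6 * X[, 4] + rnorm(n), X)  # true model uses X4 only
vars = paste0("X", 1:p)
# all 2^p - 1 non-empty candidate models
alphas = unlist(lapply(1:p, function(k) combn(vars, k, simplify = FALSE)),
                recursive = FALSE)
psize = sapply(alphas, length)
B = 50
lambdas = seq(0, 2 * log(n), length.out = 20)
incl = matrix(0, length(lambdas), p, dimnames = list(NULL, vars))
for (b in 1:B) {
  w = rexp(n)  # weighted (exponential) bootstrap
  m2ll = sapply(alphas, function(a)
    -2 * as.numeric(logLik(lm(reformulate(a, "y"), data = dat, weights = w))))
  for (i in seq_along(lambdas)) {
    best = alphas[[which.min(m2ll + lambdas[i] * psize)]]  # GIC-best model
    incl[i, best] = incl[i, best] + 1 / B  # update inclusion probabilities
  }
}
matplot(lambdas, incl, type = "l", lty = 1, xlab = expression(lambda),
        ylab = "Inclusion probability")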

References

  • Müller and Welsh (2010) for linear regression models
  • Murray, Heritier, and Müller (2013) for generalised linear models

Artificial example – VIP

require(mplot)
vis.art = vis(lm.art)
plot(vis.art, which = "vip")

Model stability plots

Artificial example – Loss against size

plot(vis.art, which = "lvk")

Artificial example – Loss against size

plot(vis.art, which = "lvk", highlight = "x6")

Model stability plots

Aim

To add value to the loss against size plots by choosing a symbol size proportional to a measure of stability.

Procedure

  1. Calculate (weighted) bootstrap samples \(b=1,\ldots,B\).
  2. For each bootstrap sample, identify the best model at each dimension.
  3. Add this information to the loss against size plot using model identifiers whose size is proportional to the frequency with which each model was identified as best at its size.
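
Reusing dat, vars, alphas, psize and B from the variable inclusion sketch above, the bootstrap frequencies that drive the symbol sizes can be tabulated as follows (again an illustrative sketch, with the weighted residual sum of squares as the loss):

picks = character(0)
for (b in 1:B) {
  w = rexp(n)
  rss = sapply(alphas, function(a)
    sum(w * residuals(lm(reformulate(a, "y"), data = dat, weights = w))^2))
  for (k in 1:p) {  # best model at each dimension
    idx = which(psize == k)
    picks = c(picks, paste(alphas[[idx[which.min(rss[idx])]]], collapse = "+"))
  }
}
sort(table(picks), decreasing = TRUE)  # frequency of being best at its size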

References

  • Murray, Heritier, and Müller (2013) for generalised linear models

Artificial example – Model stability plot

plot(vis.art, which = "boot")

The adaptive fence

The fence

Notation

  • Let \(Q(\alpha)\) be a measure of lack of fit
  • Specifically we consider \(Q(\alpha)=-2\text{LogLik}(\alpha)\)

Main idea

The fence (Jiang et al. 2008) is based around the inequality: \[ Q(\alpha) \leq Q(\alpha_f) + c \]

  • A model \(\alpha\) passes the fence if the inequality holds.
  • For any \(c\geq 0\), the full model always passes the fence.
  • Among the set of models that pass the fence, model(s) with smallest dimension are preferred.

Illustration

Problem: how to choose \(c\)?

Solution: Bootstrap over a range of values of \(c\).

Procedure

  1. For each value of \(c\):
    • Perform parametric bootstrap under \(\alpha_f\).
    • For each bootstrap sample, identify the smallest model that passes the fence, \(\hat{\alpha}(c)\). Jiang, Nguyen, and Rao (2009) suggest that if more than one model passes at the smallest dimension, choose the one with the smallest \(Q(\alpha)\).
    • Let \(p^*(\alpha) = P^*\{\hat{\alpha}(c) = \alpha\}\) be the empirical probability of selecting model \(\alpha\) at a given value of \(c\).
    • Calculate \(p^* = \max_{\alpha\in\mathcal{A}} p^*(\alpha)\).
  2. Plot values of \(p^*\) against \(c\) and find first peak.
  3. Use this value of \(c\) with the original data.
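
An illustrative sketch of this bootstrap, reusing dat, alphas, psize and B from the earlier sketches, with parametric bootstrap samples drawn via simulate() under the fitted full model:

full = lm(y ~ ., data = dat)
Q = function(fit) -2 * as.numeric(logLik(fit))  # lack-of-fit measure
cs = seq(0, 20, length.out = 25)
pstar = numeric(length(cs))
for (i in seq_along(cs)) {
  picks = character(B)
  for (b in 1:B) {
    db = dat
    db$y = simulate(full)[[1]]  # parametric bootstrap under alpha_f
    qs = sapply(alphas, function(a) Q(lm(reformulate(a, "y"), data = db)))
    qf = Q(lm(y ~ ., data = db))
    pass = which(qs <= qf + cs[i])                # models that pass the fence
    cand = pass[psize[pass] == min(psize[pass])]  # smallest dimension
    best = cand[which.min(qs[cand])]              # break ties by smallest Q
    picks[b] = paste(alphas[[best]], collapse = "+")
  }
  pstar[i] = max(table(picks)) / B  # empirical p* at this value of c
}
plot(cs, pstar, type = "l", xlab = "c", ylab = "p*")  # look for the first peak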

What does this look like?

Source: Jiang, Nguyen, and Rao (2009)

Our implementation

Core innovations

  • Enhanced the plot of \(p^*\) against \(c\) so that you can see which model is being selected most often
  • Allows for consideration of all models that pass the fence at the smallest dimension
  • Integration with the leaps and bestglm packages
  • Implemented multicore technology to speed up the embarrassingly parallel calculations

Optional innovations

  • Initial stepwise procedure to restrict the values of \(c\)
  • Can force variables into the model
  • Specify the size of the largest model to be considered
  • Specify the maximum value for \(c\)

Artificial example – adaptive fence

af.art = af(lm.art, B = 100, n.c = 50, n.cores = 2)
plot(af.art)

Speed (linear models; B=50; n.c=25)

Bootstrapping glmnet

Artificial example – Lasso

## Call:
## lars(x = x, y = y)
## R-squared: 0.751
## Sequence of LASSO moves:
##      x8 x6 x1 x4 x3 x7 x5 x2 x5 x6 x5 x9 x6 x2 x2 x3 x3
## Var   8  6  1  4  3  7  5  2 -5 -6  5  9  6 -2  2 -3  3
## Step  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

Stepping through the lasso path, the active set at each step (negative moves drop a variable) is:

## [1] "x8"
## [1] "x6+x8"
## [1] "x1+x6+x8"
## [1] "x1+x4+x6+x8"
## [1] "x1+x3+x4+x6+x8"
## [1] "x1+x3+x4+x6+x7+x8"
## [1] "x1+x3+x4+x5+x6+x7+x8"
## [1] "x1+x2+x3+x4+x5+x6+x7+x8"
## [1] "x1+x2+x3+x4+x6+x7+x8"
## [1] "x1+x2+x3+x4+x7+x8"
## [1] "x1+x2+x3+x4+x5+x7+x8"
## [1] "x1+x2+x3+x4+x5+x7+x8+x9"
## [1] "x1+x2+x3+x4+x5+x6+x7+x8+x9"
## [1] "x1+x3+x4+x5+x6+x7+x8+x9"
## [1] "x1+x2+x3+x4+x5+x6+x7+x8+x9"
## [1] "x1+x2+x4+x5+x6+x7+x8+x9"
## [1] "x1+x2+x3+x4+x5+x6+x7+x8+x9"

Bootstrapping glmnet

bgn.art = bglmnet(lm.art, lambda = seq(0.05, 2.75, 0.1))
plot(bgn.art, which = "models")

Bootstrapping glmnet

bgn.art = bglmnet(lm.art)
plot(bgn.art, which = "variables")

Robustness considerations

Everything presented so far is inherently non-robust

It would be great to have a complete suite of robust alternatives.

  • robust alternatives to GIC
  • robust loss measures and model fits for model stability plots
  • adaptive fence with robust measures
  • using weights

We are used to comparing and contrasting robust with non-robust plots to check for any differences.

In the absence of a full suite of robust alternatives, I have implemented a very simple approach based on a robust initial screening method (Filzmoser, Maronna, and Werner 2008).
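
A sketch of that screening step, assuming the mvoutlier package's pcout() function (an implementation of the Filzmoser, Maronna, and Werner (2008) method) and an all-numeric data frame:

require(mvoutlier)
scr = pcout(as.matrix(artificialeg))  # flag multivariate outliers
keep = scr$wfinal01 == 1              # 1 = observation retained after screening
lm.scr = lm(y ~ ., data = artificialeg[keep, ])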

The mplot package

Get it on Github

install.packages("devtools")
require(devtools)
install_github("garthtarr/mplot", quick = TRUE)
require(mplot)

Vignettes

vignette("mplot-guide",package="mplot")
vignette("mplot-stepwise",package="mplot")

Main functions

  • af() for the adaptive fence
  • vis() for VIP and model stability plots
  • bglmnet() for bootstrapping glmnet
  • mplot() for an interactive shiny interface

Diabetes example

Variables

Variable   Description
--------   -----------------------------------------------------------------------
age        Age
sex        Gender
bmi        Body mass index
map        Mean arterial pressure (average blood pressure)
tc         Total cholesterol (mg/dL)
ldl        Low-density lipoprotein ("bad" cholesterol)
hdl        High-density lipoprotein ("good" cholesterol)
tch        Blood serum measurement
ltg        Blood serum measurement
glu        Blood serum measurement (glucose?)
y          A quantitative measure of disease progression one year after baseline

Source: Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). "Least Angle Regression." The Annals of Statistics 32 (2): 407–499. doi:10.1214/009053604000000067.

Diabetes data set

Diabetes data set (with contamination)

Variable importance (clean, no screen)

Variable importance (clean, screen)

Variable importance (contaminated, no screen)

Variable importance (contaminated, screen)

Model stability (clean, no screen)

Model stability (clean, screen)

Model stability (contaminated, no screen)

Model stability (contaminated, screen)

Adaptive fence (clean, no screen)

Adaptive fence (clean, screen)

Adaptive fence (contaminated, no screen)

Adaptive fence (contaminated, screen)

glmnet VIP (clean, no screen)

glmnet VIP (clean, screen)

glmnet VIP (contaminated, no screen)

glmnet VIP (contaminated, screen)

Underlying technology

R packages

  • leaps
  • bestglm
  • googleVis
  • shiny
  • parallel
  • mvoutlier

Package development

  • RStudio makes setting up package structure easy
  • Version control with Github
  • Documentation with roxygen2
  • Vignettes (and these slides) with markdown
  • devtools is an essential package for building and loading R packages
  • R packages by Hadley Wickham

Session Info

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] xtable_1.7-4     mplot_0.4.9      glmnet_1.9-8     Matrix_1.1-4    
##  [5] mvoutlier_2.0.5  sgeostat_1.0-25  shiny_0.10.2.2   googleVis_0.5.7 
##  [9] doParallel_1.0.8 iterators_1.0.7  foreach_1.4.2    bestglm_0.34    
## [13] leaps_2.9        knitr_1.8       
## 
## loaded via a namespace (and not attached):
##  [1] cluster_1.15.3        codetools_0.2-9       colorspace_1.2-4     
##  [4] DEoptimR_1.0-2        digest_0.6.8          evaluate_0.5.5       
##  [7] formatR_1.0           GGally_0.5.0          ggplot2_1.0.0        
## [10] grid_3.1.2            gtable_0.1.2          htmltools_0.2.6      
## [13] httpuv_1.3.2          lattice_0.20-29       MASS_7.3-35          
## [16] mime_0.2              munsell_0.4.2         mvtnorm_1.0-2        
## [19] pcaPP_1.9-60          pls_2.4-3             plyr_1.8.1           
## [22] proto_0.3-10          R6_2.0.1              Rcpp_0.11.3          
## [25] reshape_0.8.5         reshape2_1.4.1        RJSONIO_1.3-0        
## [28] rmarkdown_0.3.10      robCompositions_1.9.0 robustbase_0.92-2    
## [31] rrcov_1.3-8           scales_0.2.4          stats4_3.1.2         
## [34] stringr_0.6.2         tools_3.1.2           yaml_2.1.13

References

Filzmoser, Peter, Ricardo A Maronna, and Mark Werner. 2008. “Outlier Identification in High Dimensions.” Computational Statistics & Data Analysis 52 (3): 1694–1711. doi:10.1016/j.csda.2007.05.018.

Jiang, Jiming, Thuan Nguyen, and J. Sunil Rao. 2009. “A Simplified Adaptive Fence Procedure.” Statistics & Probability Letters 79 (5): 625–29. doi:10.1016/j.spl.2008.10.014.

Jiang, Jiming, J. Sunil Rao, Zhonghua Gu, and Thuan Nguyen. 2008. “Fence Methods for Mixed Model Selection.” The Annals of Statistics 36 (4): 1669–92. doi:10.1214/07-AOS517.

Meinshausen, Nicolai, and Peter Bühlmann. 2010. “Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4): 417–73. doi:10.1111/j.1467-9868.2010.00740.x.

Murray, K, S Heritier, and Samuel Müller. 2013. “Graphical Tools for Model Selection in Generalized Linear Models.” Statistics in Medicine 32 (25): 4438–51. doi:10.1002/sim.5855.

Müller, Samuel, and Alan H. Welsh. 2010. “On Model Selection Curves.” International Statistical Review 78 (2): 240–56. doi:10.1111/j.1751-5823.2010.00108.x.