Robust methods and model selection

Garth Tarr
15-16 September 2014

MSA Pathways Committee Meeting

Part 1: Robustness

Unusual observations are unavoidable

[Figure: plot of chunk input]

Unusual observations are unavoidable

[Figure: plot of chunk plot2]

Unusual observations are unavoidable

[Figure: plot of chunk groupoutlierplot]

Unusual observations are unavoidable

[Figure: plot of chunk group5outlierplot]

Unusual observations are unavoidable, so what can we do?

Use robust estimators!

Robust estimators are designed to model the “bulk” of the data.

Examples (illustrated in the R sketch after this list):

  • trimmed mean vs mean
  • interquartile range vs standard deviation
  • MCD vs classical covariance matrix
  • MM regression vs ordinary least squares regression
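A minimal R sketch contrasting these pairs on simulated data with a few gross outliers (the robustbase package is assumed for lmrob(); the data are made up, not the MSA data):

# Minimal sketch: classical vs robust estimators under contamination.
library(robustbase)
set.seed(1)
x = c(rnorm(95), rnorm(5, mean = 20))            # 95 clean points, 5 outliers
c(mean = mean(x), trim20 = mean(x, trim = 0.2))  # trimmed mean tracks the bulk
c(sd = sd(x), iqr = IQR(x), mad = mad(x))        # robust scale stays near 1
X = data.frame(x1 = rnorm(100))
X$y = 2 * X$x1 + rnorm(100)
X$y[1:5] = 30                                    # contaminate the response
rbind(OLS = coef(lm(y ~ x1, data = X)),
      MM  = coef(lmrob(y ~ x1, data = X)))       # MM slope stays near 2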

Motivating example: consumer data (n=2938)

[Figure: plot of chunk pairs]

5 star only (n=384)

[Figure: plot of chunk pairs5]

5 star (location)

Mean (not robust)

apply(d5,2,mean)
 tender   juicy    flav overall 
  77.77   76.36   80.48   82.64 

Median (robust)

apply(d5,2,median)
 tender   juicy    flav overall 
     82      78      83      85 

5 star (location)

[Figure: plot of chunk bplot]

5 star (scale)

  • Standard deviation (not robust)
  • Pn (moderately robust)
  • MAD (very robust)
# Pn: a robust scale estimator based on pairwise means (not in base R)
x = rbind(round(apply(d5, 2, sd), 1),
          round(apply(d5, 2, Pn), 1),
          round(apply(d5, 2, mad), 1))
rownames(x) = c("sd", "Pn", "MAD")
x
    tender juicy flav overall
sd    15.3  14.0 10.5     9.3
Pn    14.2  13.7 10.5     8.9
MAD   11.9  13.3 10.4     7.4

5 star (correlation)

round(cor(d5),2)
        tender juicy flav overall
tender    1.00  0.56 0.44    0.51
juicy     0.56  1.00 0.39    0.45
flav      0.44  0.39 1.00    0.74
overall   0.51  0.45 0.74    1.00
round(cov2cor(covMcd(d5, raw.only = TRUE)$cov), 2)  # robust correlation via the raw MCD (robustbase)
        tender juicy flav overall
tender    1.00  0.77 0.96    0.96
juicy     0.77  1.00 0.78    0.79
flav      0.96  0.78 1.00    0.99
overall   0.96  0.79 0.99    1.00

5 star (classical correlation)

[Figure: plot of chunk pairs5again]

5 star (robust correlation)

[Figure: plot of chunk pairs5MCD]

Why isn't everyone using robust methods?

Some reasons:

  • education
  • availability
  • performance

A focus of my PhD was deriving the properties of new robust estimators and comparing their performance with that of existing estimators.

What does performance mean?

Getting an estimate close to the truth as often as possible.

This is measured by the efficiency of an estimator: the variance of a reference estimator (here the sample mean) divided by the variance of the estimator in question.
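The simulation chunk behind the next few slides is hidden; a sketch of how the res matrix of estimates might be generated (my assumption: TM20/TM40 are 20% and 40% trimmed means and HL is the Hodges-Lehmann estimator):

# Assumed setup: each row of res holds five location estimates from one
# sample of size n = 10.
hl = function(x) {                    # Hodges-Lehmann: median of pairwise means
  w = outer(x, x, "+")/2
  median(w[upper.tri(w, diag = TRUE)])
}
est = function(x) c(Mean = mean(x), Median = median(x),
                    TM20 = mean(x, trim = 0.2),
                    TM40 = mean(x, trim = 0.4), HL = hl(x))
set.seed(2014)
res = t(replicate(10000, est(rnorm(10))))   # sampling at the normal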

Doing robustness well (at the normal, n=10)

[Figure: plot of chunk efficiencyN10]

Doing robustness well (at the normal, n=10)

Variability of the estimators

round(apply(res,2,var),3)
  Mean Median   TM20   TM40     HL 
 0.098  0.139  0.110  0.139  0.105 

Efficiency of the estimators

round(apply(res, 2, var)[1]/apply(res, 2, var), 2)  # var(Mean)/var(estimator)
  Mean Median   TM20   TM40     HL 
  1.00   0.71   0.89   0.71   0.94 

Heavier-tailed distribution

[Figure: plot of chunk tvn]

Doing robustness well (t_4, n=10)

[Figure: plot of chunk efficiency2]

Doing robustness well (t_4, n=10)

Variability of the estimators

round(apply(res,2,var),3)
  Mean Median   TM20   TM40     HL 
 0.199  0.166  0.148  0.166  0.153 

Efficiency of the estimators

round(apply(res,2,var)[1]/apply(res,2,var),2)
  Mean Median   TM20   TM40     HL 
  1.00   1.20   1.35   1.20   1.30 
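Under the same assumed setup sketched earlier, the t_4 results come from swapping the sampling distribution:

res = t(replicate(10000, est(rt(10, df = 4))))      # heavier tails
round(apply(res, 2, var)[1]/apply(res, 2, var), 2)  # the mean is no longer best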

Scattered (cellwise) contamination

Another part of my PhD looked at estimating covariance in the presence of scattered contamination.

Important for:

  • high dimensions
  • automated data collection and analysis methods
  • e.g. -omics-type data

Regularisation techniques were incorporated to deal with p > n problems.

[Figure: plot of chunk scattered]
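A small simulated sketch of why cellwise contamination differs from the classical rowwise model: a small fraction of contaminated cells touches most rows once the dimension grows.

# With 5% of cells contaminated independently, the share of rows containing
# at least one bad cell is 1 - 0.95^p, which grows quickly with p.
set.seed(3)
n = 100; p = 10
X = matrix(rnorm(n * p), n, p)
bad = matrix(runif(n * p) < 0.05, n, p)   # scatter contamination over cells
X[bad] = 10
mean(rowSums(bad) > 0)                    # about 40% of rows affected at p = 10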

Scattered (cellwise) contamination

[Figure: plot of chunk scattered2]

Robust regression

Ordinary least squares regression is highly susceptible to outliers and influential points.

Alternatives (see the sketch after this list):

  • MM-regression
  • Quantile regression
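A sketch of both alternatives on a built-in data set (robustbase for lmrob() and quantreg for rq(); the package and data choices are mine, not from the slides):

library(robustbase)    # lmrob(): MM-regression
library(quantreg)      # rq(): quantile regression
fit.ols = lm(stack.loss ~ ., data = stackloss)
fit.mm  = lmrob(stack.loss ~ ., data = stackloss)           # MM-estimator
fit.lad = rq(stack.loss ~ ., tau = 0.5, data = stackloss)   # median regression
round(cbind(OLS = coef(fit.ols), MM = coef(fit.mm), LAD = coef(fit.lad)), 2)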

MM-regression

[Figure: plot of chunk MMreg]

Quantile regression

[Figure: plot of chunk quantreg]

Part 2: Variable selection

Standard approaches

  • Forward/backward selection using AIC or BIC
  • Lasso type methods

Why do people like these methods?

They give you “the answer”.

What's wrong with them?

They give you “the answer” but no indication of its sensitivity: if you changed a tuning parameter slightly, would you get a totally different answer?

Solution

Bootstrap the selection procedure to show scientists how often each variable is selected; a sketch follows.
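The helper below is my own illustration of the idea: refit the stepwise search on bootstrap resamples and report how often each variable survives.

# Illustrative bootstrap of stepwise selection; assumes a data frame with
# response y, as in the example that follows.
boot.freq = function(data, B = 100) {
  vars = setdiff(names(data), "y")
  counts = setNames(numeric(length(vars)), vars)
  for (b in 1:B) {
    db  = data[sample(nrow(data), replace = TRUE), ]
    fit = step(lm(y ~ ., data = db), trace = 0)
    counts = counts + (vars %in% attr(terms(fit), "term.labels"))
  }
  counts/B   # selection proportion for each candidate variable
}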

When stepwise fails

[Figure: plot of chunk step1]
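The chunk that generates df is hidden; a hypothetical construction with the same flavour (entirely an assumption on my part): the true model is y ~ x8, but x8 is built as a near-linear combination of the other predictors, so dropping x8 looks cheap and the remaining correlated predictors soak up its signal.

# Hypothetical data-generating process (assumed, not from the slides).
set.seed(79)
n = 50
z = matrix(rnorm(8 * n), n, 8)      # becomes x1..x7 and x9
x8 = rowSums(z) + 0.5 * rnorm(n)    # x8 nearly collinear with the others
y = 0.6 * x8 + rnorm(n)             # the truth involves x8 only
df = data.frame(y, z[, 1:7], x8, x9 = z[, 8])
names(df) = c("y", paste0("x", 1:9))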

When stepwise fails

Full model

mf = lm(y ~ ., data = df)   # full model with all nine predictors
round(summary(mf)$coef, 2)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.10       0.33   -0.31     0.76
x1              0.64       0.69    0.92     0.36
x2              0.26       0.62    0.42     0.68
x3             -0.51       1.24   -0.41     0.68
x4             -0.30       0.25   -1.18     0.24
x5              0.36       0.60    0.59     0.56
x6             -0.54       0.96   -0.56     0.58
x7             -0.43       0.63   -0.68     0.50
x8              0.15       0.62    0.24     0.81
x9              0.40       0.64    0.63     0.53

When stepwise fails

mf.step=step(mf)
Start:  AIC=79.3
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9

       Df Sum of Sq RSS  AIC
- x8    1      0.24 164 77.4
- x3    1      0.69 164 77.5
- x2    1      0.71 164 77.5
- x6    1      1.31 165 77.7
- x5    1      1.44 165 77.7
- x9    1      1.61 165 77.8
- x7    1      1.88 166 77.9
- x1    1      3.50 167 78.4
- x4    1      5.74 169 79.0
<none>              164 79.3

Step:  AIC=77.37
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9

       Df Sum of Sq RSS   AIC
<none>              164  77.4
- x2    1      20.4 184  81.2
- x5    1      26.0 190  82.7
- x9    1      33.6 198  84.7
- x4    1      34.5 198  84.9
- x7    1      62.1 226  91.4
- x1    1      68.3 232  92.8
- x3    1      71.3 235  93.4
- x6    1     107.9 272 100.7

When stepwise fails

Model chosen by the stepwise procedure

round(summary(mf.step)$coef,2)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.11       0.32   -0.36     0.72
x1              0.80       0.19    4.13     0.00
x2              0.40       0.18    2.26     0.03
x3             -0.81       0.19   -4.22     0.00
x4             -0.35       0.12   -2.94     0.01
x5              0.49       0.19    2.55     0.01
x6             -0.77       0.15   -5.19     0.00
x7             -0.58       0.15   -3.94     0.00
x9              0.55       0.19    2.90     0.01

  • Dropped x8; everything else is significant.
  • So we're done, right?

The fence

Variable inclusion plots

Model selection curves

Model selection curves (bootstrapped)
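These graphical approaches are implemented in the author's mplot R package; a sketch assuming its documented interface (function names and arguments per that documentation, not these slides):

library(mplot)
af1  = af(mf, B = 100)            # adaptive fence
vis1 = vis(mf, B = 100)           # variable inclusion / model selection plots
plot(af1)
plot(vis1, which = "vip")         # variable inclusion plot
plot(vis1, which = "boot")        # bootstrapped model selection curves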

Conclusion

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

John Wilder Tukey (1962)
