Robust methods and model selection

Garth Tarr
15-16 September 2014

MSA Pathways Committee Meeting

Part 1: Robustness

Unusual observations are unavoidable

[Figure: plot of chunk input]

Unusual observations are unavoidable

[Figure: plot of chunk plot2]

Unusual observations are unavoidable

[Figure: plot of chunk groupoutlierplot]

Unusual observations are unavoidable

[Figure: plot of chunk group5outlierplot]

Unusual observations are unavoidable, so what can we do?

Use robust estimators!

Robust estimators are designed to model the “bulk” of the data.

Examples (illustrated in the R sketch after this list):

  • trimmed mean vs mean
  • interquartile range vs standard deviation
  • MCD vs classical covariance matrix
  • MM regression vs ordinary least squares regression
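A minimal R sketch contrasting these pairs on simulated data with a few gross outliers (the robustbase package is assumed for lmrob(); the data are made up, not the MSA data):

# Minimal sketch: classical vs robust estimators under contamination.
library(robustbase)
set.seed(1)
x = c(rnorm(95), rnorm(5, mean = 20))            # 95 clean points, 5 outliers
c(mean = mean(x), trim20 = mean(x, trim = 0.2))  # trimmed mean tracks the bulk
c(sd = sd(x), iqr = IQR(x), mad = mad(x))        # robust scale stays near 1
X = data.frame(x1 = rnorm(100))
X$y = 2 * X$x1 + rnorm(100)
X$y[1:5] = 30                                    # contaminate the response
rbind(OLS = coef(lm(y ~ x1, data = X)),
      MM  = coef(lmrob(y ~ x1, data = X)))       # MM slope stays near 2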

Motivating example: consumer data (n=2938)

[Figure: plot of chunk pairs]

5 star only (n=384)

[Figure: plot of chunk pairs5]

5 star (location)

Mean (not robust)

apply(d5,2,mean)
 tender   juicy    flav overall 
  77.77   76.36   80.48   82.64 

Median (robust)

apply(d5,2,median)
 tender   juicy    flav overall 
     82      78      83      85 

5 star (location)

[Figure: plot of chunk bplot]

5 star (scale)

  • Standard deviation (not robust)
  • Pn (moderately robust)
  • MAD (very robust)
# Pn: a robust scale estimator based on pairwise means (not in base R)
x = rbind(round(apply(d5, 2, sd), 1),
          round(apply(d5, 2, Pn), 1),
          round(apply(d5, 2, mad), 1))
rownames(x) = c("sd", "Pn", "MAD")
x
    tender juicy flav overall
sd    15.3  14.0 10.5     9.3
Pn    14.2  13.7 10.5     8.9
MAD   11.9  13.3 10.4     7.4

5 star (correlation)

round(cor(d5),2)
        tender juicy flav overall
tender    1.00  0.56 0.44    0.51
juicy     0.56  1.00 0.39    0.45
flav      0.44  0.39 1.00    0.74
overall   0.51  0.45 0.74    1.00
round(cov2cor(covMcd(d5, raw.only = TRUE)$cov), 2)  # robust correlation via the raw MCD (robustbase)
        tender juicy flav overall
tender    1.00  0.77 0.96    0.96
juicy     0.77  1.00 0.78    0.79
flav      0.96  0.78 1.00    0.99
overall   0.96  0.79 0.99    1.00

5 star (classical correlation)

[Figure: plot of chunk pairs5again]

5 star (robust correlation)

[Figure: plot of chunk pairs5MCD]

Why isn't everyone using robust methods?

Some reasons:

  • education
  • availability
  • performance

A focus of my PhD was deriving the properties of new robust estimators and comparing their performance with that of existing estimators.

What does performance mean?

Getting an estimate close to the truth as often as possible.

This is measured by the efficiency of an estimator: the variance of a reference estimator (here the sample mean) divided by the variance of the estimator in question.
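The simulation chunk behind the next few slides is hidden; a sketch of how the res matrix of estimates might be generated (my assumption: TM20/TM40 are 20% and 40% trimmed means and HL is the Hodges-Lehmann estimator):

# Assumed setup: each row of res holds five location estimates from one
# sample of size n = 10.
hl = function(x) {                    # Hodges-Lehmann: median of pairwise means
  w = outer(x, x, "+")/2
  median(w[upper.tri(w, diag = TRUE)])
}
est = function(x) c(Mean = mean(x), Median = median(x),
                    TM20 = mean(x, trim = 0.2),
                    TM40 = mean(x, trim = 0.4), HL = hl(x))
set.seed(2014)
res = t(replicate(10000, est(rnorm(10))))   # sampling at the normal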

Doing robustness well (at the normal, n=10)

[Figure: plot of chunk efficiencyN10]

Doing robustness well (at the normal, n=10)

Variability of the estimators

round(apply(res,2,var),3)
  Mean Median   TM20   TM40     HL 
 0.098  0.139  0.110  0.139  0.105 

Efficiency of the estimators

round(apply(res, 2, var)[1]/apply(res, 2, var), 2)  # var(Mean)/var(estimator)
  Mean Median   TM20   TM40     HL 
  1.00   0.71   0.89   0.71   0.94 

Heavier-tailed distribution

[Figure: plot of chunk tvn]

Doing robustness well (t_4, n=10)

[Figure: plot of chunk efficiency2]

Doing robustness well (t_4, n=10)

Variability of the estimators

round(apply(res,2,var),3)
  Mean Median   TM20   TM40     HL 
 0.199  0.166  0.148  0.166  0.153 

Efficiency of the estimators

round(apply(res,2,var)[1]/apply(res,2,var),2)
  Mean Median   TM20   TM40     HL 
  1.00   1.20   1.35   1.20   1.30 
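Under the same assumed setup sketched earlier, the t_4 results come from swapping the sampling distribution:

res = t(replicate(10000, est(rt(10, df = 4))))      # heavier tails
round(apply(res, 2, var)[1]/apply(res, 2, var), 2)  # the mean is no longer best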

Scattered (cellwise) contamination

Another part of my PhD looked at estimating covariance in the presence of scattered contamination.

Important for:

  • high dimensions
  • automated data collection and analysis methods
  • e.g. -omics-type data

Regularisation techniques were incorporated to deal with p > n problems.

[Figure: plot of chunk scattered]
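A small simulated sketch of why cellwise contamination differs from the classical rowwise model: a small fraction of contaminated cells touches most rows once the dimension grows.

# With 5% of cells contaminated independently, the share of rows containing
# at least one bad cell is 1 - 0.95^p, which grows quickly with p.
set.seed(3)
n = 100; p = 10
X = matrix(rnorm(n * p), n, p)
bad = matrix(runif(n * p) < 0.05, n, p)   # scatter contamination over cells
X[bad] = 10
mean(rowSums(bad) > 0)                    # about 40% of rows affected at p = 10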

Scattered (cellwise) contamination

[Figure: plot of chunk scattered2]

Robust regression

Ordinary least squares regression is highly susceptible to outliers and influential points.

Alternatives (see the sketch after this list):

  • MM-regression
  • Quantile regression
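A sketch of both alternatives on a built-in data set (robustbase for lmrob() and quantreg for rq(); the package and data choices are mine, not from the slides):

library(robustbase)    # lmrob(): MM-regression
library(quantreg)      # rq(): quantile regression
fit.ols = lm(stack.loss ~ ., data = stackloss)
fit.mm  = lmrob(stack.loss ~ ., data = stackloss)           # MM-estimator
fit.lad = rq(stack.loss ~ ., tau = 0.5, data = stackloss)   # median regression
round(cbind(OLS = coef(fit.ols), MM = coef(fit.mm), LAD = coef(fit.lad)), 2)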

MM-regression

[Figure: plot of chunk MMreg]

Quantile regression

[Figure: plot of chunk quantreg]

Part 2: Variable selection

Standard approaches

  • Forward/backward selection using AIC or BIC
  • Lasso type methods

Why do people like these methods?

They give you “the answer”.

What's wrong with them?

They give you “the answer” but no indication of its sensitivity: if you changed a tuning parameter slightly, would you get a totally different answer?

Solution

Bootstrap the selection procedure to show scientists how often each variable is selected; a sketch follows.
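The helper below is my own illustration of the idea: refit the stepwise search on bootstrap resamples and report how often each variable survives.

# Illustrative bootstrap of stepwise selection; assumes a data frame with
# response y, as in the example that follows.
boot.freq = function(data, B = 100) {
  vars = setdiff(names(data), "y")
  counts = setNames(numeric(length(vars)), vars)
  for (b in 1:B) {
    db  = data[sample(nrow(data), replace = TRUE), ]
    fit = step(lm(y ~ ., data = db), trace = 0)
    counts = counts + (vars %in% attr(terms(fit), "term.labels"))
  }
  counts/B   # selection proportion for each candidate variable
}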

When stepwise fails

[Figure: plot of chunk step1]
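The chunk that generates df is hidden; a hypothetical construction with the same flavour (entirely an assumption on my part): the true model is y ~ x8, but x8 is built as a near-linear combination of the other predictors, so dropping x8 looks cheap and the remaining correlated predictors soak up its signal.

# Hypothetical data-generating process (assumed, not from the slides).
set.seed(79)
n = 50
z = matrix(rnorm(8 * n), n, 8)      # becomes x1..x7 and x9
x8 = rowSums(z) + 0.5 * rnorm(n)    # x8 nearly collinear with the others
y = 0.6 * x8 + rnorm(n)             # the truth involves x8 only
df = data.frame(y, z[, 1:7], x8, x9 = z[, 8])
names(df) = c("y", paste0("x", 1:9))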

When stepwise fails

Full model

mf = lm(y ~ ., data = df)   # full model with all nine predictors
round(summary(mf)$coef, 2)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.10       0.33   -0.31     0.76
x1              0.64       0.69    0.92     0.36
x2              0.26       0.62    0.42     0.68
x3             -0.51       1.24   -0.41     0.68
x4             -0.30       0.25   -1.18     0.24
x5              0.36       0.60    0.59     0.56
x6             -0.54       0.96   -0.56     0.58
x7             -0.43       0.63   -0.68     0.50
x8              0.15       0.62    0.24     0.81
x9              0.40       0.64    0.63     0.53

When stepwise fails

mf.step=step(mf)
Start:  AIC=79.3
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9

       Df Sum of Sq RSS  AIC
- x8    1      0.24 164 77.4
- x3    1      0.69 164 77.5
- x2    1      0.71 164 77.5
- x6    1      1.31 165 77.7
- x5    1      1.44 165 77.7
- x9    1      1.61 165 77.8
- x7    1      1.88 166 77.9
- x1    1      3.50 167 78.4
- x4    1      5.74 169 79.0
<none>              164 79.3

Step:  AIC=77.37
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9

       Df Sum of Sq RSS   AIC
<none>              164  77.4
- x2    1      20.4 184  81.2
- x5    1      26.0 190  82.7
- x9    1      33.6 198  84.7
- x4    1      34.5 198  84.9
- x7    1      62.1 226  91.4
- x1    1      68.3 232  92.8
- x3    1      71.3 235  93.4
- x6    1     107.9 272 100.7

When stepwise fails

Model chosen by the stepwise procedure

round(summary(mf.step)$coef,2)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.11       0.32   -0.36     0.72
x1              0.80       0.19    4.13     0.00
x2              0.40       0.18    2.26     0.03
x3             -0.81       0.19   -4.22     0.00
x4             -0.35       0.12   -2.94     0.01
x5              0.49       0.19    2.55     0.01
x6             -0.77       0.15   -5.19     0.00
x7             -0.58       0.15   -3.94     0.00
x9              0.55       0.19    2.90     0.01

  • Dropped x8; everything else is significant.
  • So we're done, right?

The fence

Variable inclusion plots

Model selection curves

Model selection curves (bootstrapped)
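These graphical approaches are implemented in the author's mplot R package; a sketch assuming its documented interface (function names and arguments per that documentation, not these slides):

library(mplot)
af1  = af(mf, B = 100)            # adaptive fence
vis1 = vis(mf, B = 100)           # variable inclusion / model selection plots
plot(af1)
plot(vis1, which = "vip")         # variable inclusion plot
plot(vis1, which = "boot")        # bootstrapped model selection curves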

Conclusion

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

John Wilder Tukey (1962)
