Garth Tarr
15-16 September 2014
MSA Pathways Committee Meeting
Robust estimators are designed to model the “bulk” of the data.
Mean (not robust)
apply(d5,2,mean)
tender juicy flav overall
77.77 76.36 80.48 82.64
Median (robust)
apply(d5,2,median)
tender juicy flav overall
82 78 83 85
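A toy illustration (hypothetical numbers, not the d5 sensory scores) of why the median resists outliers while the mean does not:

```r
# Hypothetical example: one wild observation drags the mean, not the median
x = c(80, 82, 78, 85, 81)
mean(x)    # 81.2
median(x)  # 81
x[1] = 8   # suppose 80 was mis-entered as 8
mean(x)    # 66.8 -- pulled well below the bulk of the data
median(x)  # still 81
```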
x = rbind( round(apply(d5,2,sd),1),
round(apply(d5,2,Pn),1),
round(apply(d5,2,mad),1))
rownames(x) = c("sd","Pn","MAD")
x
tender juicy flav overall
sd 15.3 14.0 10.5 9.3
Pn 14.2 13.7 10.5 8.9
MAD 11.9 13.3 10.4 7.4
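The same sensitivity to contamination holds for scale estimators. A sketch with made-up data (Pn is omitted here, as it is not in base R):

```r
# Hypothetical example: the sample sd reacts to a single gross outlier,
# the MAD barely moves
y = c(80, 82, 78, 85, 81, 79, 83)
round(c(sd = sd(y), MAD = mad(y)), 1)   # sd 2.4, MAD 3.0
y[1] = 8                                # one gross outlier
round(c(sd = sd(y), MAD = mad(y)), 1)   # sd explodes, MAD stays at 3.0
```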
round(cor(d5),2)
tender juicy flav overall
tender 1.00 0.56 0.44 0.51
juicy 0.56 1.00 0.39 0.45
flav 0.44 0.39 1.00 0.74
overall 0.51 0.45 0.74 1.00
round(cov2cor(covMcd(d5, raw.only = TRUE)$cov), 2)
tender juicy flav overall
tender 1.00 0.77 0.96 0.96
juicy 0.77 1.00 0.78 0.79
flav 0.96 0.78 1.00 0.99
overall 0.96 0.79 0.99 1.00
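The effect of scattered contamination on classical versus robust correlation can be reproduced on simulated data (a sketch; `covMcd` is from the robustbase package, `mvrnorm` from MASS):

```r
library(MASS)        # for mvrnorm
library(robustbase)  # for covMcd
set.seed(1)
# 95 points from a bivariate normal with true correlation 0.8,
# plus 5 scattered contaminating points
good = mvrnorm(95, mu = c(0, 0), Sigma = matrix(c(1, 0.8, 0.8, 1), 2))
bad  = matrix(runif(10, -10, 10), ncol = 2)
X = rbind(good, bad)
round(cor(X), 2)                  # classical: distorted by the outliers
round(cov2cor(covMcd(X)$cov), 2)  # MCD-based: should sit near the true 0.8
```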
A focus of my PhD was deriving the properties of new robust estimators and comparing the performance to that of existing estimators.
Getting an estimate close to the truth as often as possible.
This is measured by the efficiency of an estimator.
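The `res` matrix used on the next slides can be built with a simulation along these lines (a sketch; TM20 and TM40 are 20% and 40% trimmed means, HL is the Hodges-Lehmann estimator):

```r
set.seed(2014)
n = 10; B = 10000
# Hodges-Lehmann: median of the Walsh (pairwise) averages
hl = function(x) {
  w = outer(x, x, "+") / 2
  median(w[lower.tri(w, diag = TRUE)])
}
res = t(replicate(B, {
  x = rnorm(n)   # swap in, e.g., rt(n, df = 3) for a heavy-tailed scenario
  c(Mean = mean(x), Median = median(x),
    TM20 = mean(x, trim = 0.2), TM40 = mean(x, trim = 0.4),
    HL = hl(x))
}))
round(apply(res, 2, var), 3)
```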
Variability of the estimators
round(apply(res,2,var),3)
Mean Median TM20 TM40 HL
0.098 0.139 0.110 0.139 0.105
Efficiency of the estimators
round(apply(res,2,var)[1]/apply(res,2,var),2)
Mean Median TM20 TM40 HL
1.00 0.71 0.89 0.71 0.94
Variability of the estimators
round(apply(res,2,var),3)
Mean Median TM20 TM40 HL
0.199 0.166 0.148 0.166 0.153
Efficiency of the estimators
round(apply(res,2,var)[1]/apply(res,2,var),2)
Mean Median TM20 TM40 HL
1.00 1.20 1.35 1.20 1.30
Another part of my PhD looked at estimating covariance in the presence of scattered contamination.
Important for:
Incorporated regularisation techniques to deal with p>n type problems.
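One simple form of regularisation is linear shrinkage of the covariance matrix toward a diagonal target; a minimal sketch (the shrinkage intensity is chosen by hand here, purely for illustration):

```r
set.seed(3)
n = 20; p = 50                     # more variables than observations
X = matrix(rnorm(n * p), n, p)
S = cov(X)                         # sample covariance: singular when p > n
lambda = 0.5                       # shrinkage intensity (hand-picked)
S_shrunk = (1 - lambda) * S + lambda * diag(diag(S))
min(eigen(S, only.values = TRUE)$values)         # ~0: not invertible
min(eigen(S_shrunk, only.values = TRUE)$values)  # bounded away from 0
```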
Ordinary least squares regression is highly susceptible to outliers and influential points.
Alternatives:
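As a hypothetical sketch of one such alternative, `MASS::rlm` fits a robust M-estimator that downweights the influence of outlying observations:

```r
library(MASS)                   # for rlm, a robust M-estimator of regression
set.seed(4)
x = 1:20
y = 2 + 0.5 * x + rnorm(20, sd = 0.5)
y[20] = 30                      # one influential outlier
coef(lm(y ~ x))                 # OLS: slope distorted by the outlier
coef(rlm(y ~ x))                # robust fit: typically close to (2, 0.5)
```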
It gives you “the answer” but doesn't give you an indication of sensitivity – if you change a parameter slightly would you get a totally different answer?
Bootstrapping to give scientists an indication of how often a variable is selected.
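A sketch of this bootstrap idea, assuming `df` contains the response `y` and the predictors as on the following slides: resample the rows, rerun the selection procedure, and tally how often each variable survives.

```r
# Bootstrap variable-selection frequencies (a sketch)
set.seed(5)
B = 100
picks = replicate(B, {
  boot = df[sample(nrow(df), replace = TRUE), ]
  m = step(lm(y ~ ., data = boot), trace = 0)
  names(coef(m))
})
sort(table(unlist(picks)), decreasing = TRUE) / B  # selection proportions
```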
Full model
mf = lm(y~.,data=df)
round(summary(mf)$coef,2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.10 0.33 -0.31 0.76
x1 0.64 0.69 0.92 0.36
x2 0.26 0.62 0.42 0.68
x3 -0.51 1.24 -0.41 0.68
x4 -0.30 0.25 -1.18 0.24
x5 0.36 0.60 0.59 0.56
x6 -0.54 0.96 -0.56 0.58
x7 -0.43 0.63 -0.68 0.50
x8 0.15 0.62 0.24 0.81
x9 0.40 0.64 0.63 0.53
mf.step=step(mf)
Start: AIC=79.3
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Df Sum of Sq RSS AIC
- x8 1 0.24 164 77.4
- x3 1 0.69 164 77.5
- x2 1 0.71 164 77.5
- x6 1 1.31 165 77.7
- x5 1 1.44 165 77.7
- x9 1 1.61 165 77.8
- x7 1 1.88 166 77.9
- x1 1 3.50 167 78.4
- x4 1 5.74 169 79.0
<none> 164 79.3
Step: AIC=77.37
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x9
Df Sum of Sq RSS AIC
<none> 164 77.4
- x2 1 20.4 184 81.2
- x5 1 26.0 190 82.7
- x9 1 33.6 198 84.7
- x4 1 34.5 198 84.9
- x7 1 62.1 226 91.4
- x1 1 68.3 232 92.8
- x3 1 71.3 235 93.4
- x6 1 107.9 272 100.7
Model chosen by stepwise procedure
round(summary(mf.step)$coef,2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.11 0.32 -0.36 0.72
x1 0.80 0.19 4.13 0.00
x2 0.40 0.18 2.26 0.03
x3 -0.81 0.19 -4.22 0.00
x4 -0.35 0.12 -2.94 0.01
x5 0.49 0.19 2.55 0.01
x6 -0.77 0.15 -5.19 0.00
x7 -0.58 0.15 -3.94 0.00
x9 0.55 0.19 2.90 0.01
After x8 is dropped, everything else is significant.
“Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.”
John Wilder Tukey (1962)