February 2015

Overview



  1. Purpose of robust statistics
  2. Location estimation
  3. Covariance estimation
  4. Precision, sparsity and cellwise contamination
  5. Where to find out more

Purpose

The aim of robust statistics is to model the core (the bulk) of the data, so that a small fraction of atypical observations does not dominate the fit.

Location estimation

Location estimators

Consider a sample of \(n\) observations, \(x_1,\ldots,x_n\).

Mean (not robust)

  • sample average: \(n^{-1}\sum_{i=1}^n x_i\)

Hodges-Lehmann estimator (moderately robust)

  • median of the \(n\choose2\) pairwise means: \(H_n^{-1}(0.5)\) where \(H_n(t) = {n\choose 2}^{-1}\sum_{i<j} 1\{(x_i+x_j)/2 \leq t\}\).

Median (most robust)

  • middle of the data set: \(F_n^{-1}(0.5)\) where \(F_n(t) = n^{-1}\sum_{i=1}^n 1\{x_i\leq t\}\).
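
To make the three definitions concrete, here is a minimal R sketch on a small made-up sample with one gross outlier. The helper name `hodges_lehmann` and the sample values are illustrative assumptions, not part of the talk. Note that `median()` averages the two middle values when the count is even, which can differ slightly from the generalised inverse \(H_n^{-1}(0.5)\) used above.

```r
# Hodges-Lehmann estimate: median of all pairwise means with i < j
# (hypothetical helper, written for illustration only)
hodges_lehmann <- function(x) {
  pair_means <- combn(x, 2, FUN = mean)  # choose(n, 2) pairwise means
  median(pair_means)
}

# Made-up sample: four well-behaved values and one gross outlier
x <- c(1, 2, 3, 4, 100)

est_mean   <- mean(x)            # 22: dragged towards the outlier
est_hl     <- hodges_lehmann(x)  # 3.25
est_median <- median(x)          # 3

print(c(mean = est_mean, hodges_lehmann = est_hl, median = est_median))
```

The pairwise means are computed by brute force, which costs \(O(n^2)\) memory; that is fine for a sketch but not for large samples.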

Correlation

Consumer testing data

  • Consumer data is notoriously messy
  • In taste tests, consumers are asked to score pieces of meat out of 100 on the following categories:
    • Tenderness
    • Juiciness
    • Flavour
    • Overall
  • A meat grading system uses consumer testing data to predict the eating quality of a particular piece of meat:
    • 3 star (everyday eating)
    • 4 star (premium quality)
    • 5 star (supreme quality)

Consumer data (n=2938)

Consumer data (classical correlation)

Consumer data (robust correlation)
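
The consumer data are not reproduced here, so the contrast between classical and robust correlation is sketched below on simulated data. Everything in the block is an assumption made for illustration: the two simulated variables, the 5% of contaminated rows, and the choice of the MCD-based estimate from `MASS::cov.rob()`, which is not necessarily the robust correlation used for the slides.

```r
library(MASS)  # cov.rob() gives an MCD-based covariance/correlation estimate

set.seed(1)
n <- 500

# Simulated "clean" data with true correlation 0.8
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)
scores <- cbind(x1, x2)

# Hypothetical contamination: replace 5% of the rows by a tight cluster
# far from the bulk of the data, lying against the main trend
bad <- 1:25
scores[bad, ] <- cbind(rnorm(25, mean = 5, sd = 0.2),
                       rnorm(25, mean = -5, sd = 0.2))

classical <- cor(scores)[1, 2]
robust    <- cov.rob(scores, method = "mcd", cor = TRUE)$cor[1, 2]

print(c(classical = classical, robust = robust))
```

In this setup the classical estimate should be pulled well away from 0.8 by the cluster, while the MCD-based estimate should stay close to the correlation of the clean rows.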

Regression

Example: brain to body mass ratio

  • Brain size usually increases with body size in animals.
  • The relationship is not linear. Generally, small mammals have relatively larger brains than big ones.
  • Body weight and brain size can be modelled using an allometric relationship, \(y=\theta x^\beta\) where \(y\) is the brain weight and \(x\) is the body weight.
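
Taking logs turns the allometric relationship into a straight line, \(\log y = \log\theta + \beta\log x\), so \(\beta\) can be estimated as a regression slope. The sketch below compares least squares with an MM estimator on the `Animals` data shipped with the MASS package, which contains three dinosaurs; using this data set is an assumption, since the slides do not say which data were plotted.

```r
library(MASS)  # Animals data set and rlm()

data(Animals)  # body weight (kg) and brain weight (g) for 28 species
set.seed(1)    # the MM fit starts from a randomised S-estimate

# Least squares on the log-log scale: log(brain) = log(theta) + beta * log(body)
fit_ls <- lm(log(brain) ~ log(body), data = Animals)

# MM-estimator fitted to the same model
fit_mm <- rlm(log(brain) ~ log(body), data = Animals, method = "MM")

beta_ls <- coef(fit_ls)[["log(body)"]]
beta_mm <- coef(fit_mm)[["log(body)"]]
theta_mm <- exp(coef(fit_mm)[["(Intercept)"]])  # back-transform the intercept

print(c(beta_ls = beta_ls, beta_mm = beta_mm, theta_mm = theta_mm))
```

The dinosaurs combine very large bodies with small brains, so they tend to flatten the least squares slope, whereas the MM fit largely ignores them.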

Example: brain to body mass ratio (with outliers)

Outliers are dinosaurs

Cellwise contamination

Cellwise contamination

A key component of my PhD looked at estimating precision matrices in data sets contaminated in a cellwise manner.

Important for:

  • high dimensional data
  • automated data collection and analysis methods
  • e.g. -omics type data

Often sparsity is assumed, i.e. the precision matrix will have many zero entries.
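
One reason cellwise contamination matters in high dimensions can be shown with a short simulation. Assuming each cell is contaminated independently with probability \(\varepsilon\) (the usual cellwise model, stated here as an assumption), the expected fraction of rows containing at least one contaminated cell is \(1-(1-\varepsilon)^p\), which grows quickly with \(p\). The values of `n`, `p` and `eps` below are arbitrary illustrative choices.

```r
set.seed(1)
n   <- 10000  # number of rows (observations)
p   <- 20     # number of columns (variables)
eps <- 0.05   # probability that any single cell is contaminated

# TRUE marks a contaminated cell; cells are flagged independently
contaminated <- matrix(runif(n * p) < eps, nrow = n, ncol = p)

frac_cells       <- mean(contaminated)               # close to eps
frac_rows_sim    <- mean(rowSums(contaminated) > 0)  # rows with >= 1 bad cell
frac_rows_theory <- 1 - (1 - eps)^p                  # about 0.64 here

print(c(cells = frac_cells, rows_sim = frac_rows_sim,
        rows_theory = frac_rows_theory))
```

With only 5% of cells contaminated, well over half of the rows are affected when \(p=20\), so methods that downweight entire rows have few clean rows left to work with.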

Financial example

Aim: to estimate the dependence structure among S&P 500 stocks over the period 01/01/2003 to 01/01/2008 (before the global financial crisis, GFC).

  • We have \(n=1258\) observations (trading days) over \(p=452\) dimensions (stocks).
  • Observe \(S_{t,j}\) the closing price of stock \(j\) on day \(t\) for \(j=1,\ldots,p\) and \(t=1,\ldots,n\).
  • Look at the return series \(X_{t,j} = \log\left(\frac{S_{t,j}}{S_{t-1,j}}\right)\).
  • We want to estimate a sparse precision matrix where the zero entries correspond to (conditional) independence between the stocks.

How: using the graphical lasso with a robust covariance matrix as the input.
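
As a rough sketch of what "a robust covariance matrix as the input" could look like, the block below builds a covariance matrix entry by entry from a robust scale estimate, using the Gnanadesikan-Kettenring identity \(\mathrm{cov}(X,Y) = \tfrac14\{\mathrm{var}(X+Y) - \mathrm{var}(X-Y)\}\). The helper names `gk_cov` and `pairwise_cov`, the use of the MAD as the scale estimator, and the simulated data are all assumptions for illustration; this is not necessarily the estimator used in the talk.

```r
# Robust covariance of two variables via the Gnanadesikan-Kettenring identity
# (hypothetical helper; scale_fn can be any robust scale estimator)
gk_cov <- function(x, y, scale_fn = mad) {
  (scale_fn(x + y)^2 - scale_fn(x - y)^2) / 4
}

# Assemble a p x p matrix from pairwise robust covariances (hypothetical helper)
pairwise_cov <- function(X, scale_fn = mad) {
  p <- ncol(X)
  S <- matrix(0, p, p)
  for (j in seq_len(p)) {
    S[j, j] <- scale_fn(X[, j])^2
    for (k in seq_len(j - 1)) {
      S[j, k] <- S[k, j] <- gk_cov(X[, j], X[, k], scale_fn)
    }
  }
  S
}

# Simulated data: columns 1 and 2 have covariance 0.8, the rest are independent
set.seed(1)
n <- 1000
p <- 4
Z <- matrix(rnorm(n * p), n, p)
X <- Z
X[, 2] <- 0.8 * Z[, 1] + 0.6 * Z[, 2]

# Cellwise contamination: 2% of cells replaced by a gross error
mask <- matrix(runif(n * p) < 0.02, n, p)
X[mask] <- 20

S_classical <- cov(X)
S_robust    <- pairwise_cov(X)

print(round(S_classical, 2))
print(round(S_robust, 2))
```

A matrix like `S_robust` could then be passed to a graphical lasso routine in place of the sample covariance, for example `glasso::glasso(S_robust, rho = ...)` with a penalty `rho` chosen by the user. Pairwise matrices are not guaranteed to be positive semidefinite, so a correction step may be needed first.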

Financial example

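library(huge)    # provides the stockdata data set
data(stockdata)
n <- nrow(stockdata$data)    # number of trading days
# Log returns: log(S_t / S_{t-1}) for every stock
X <- log(stockdata$data[2:n, ] / stockdata$data[1:(n - 1), ])
# Plot the return series of the first six stocks
par(mfrow = c(3, 2), mar = c(2, 4, 1, 0.1))
for (i in 1:6) ts.plot(X[, i], main = stockdata$info[i, 3], ylab = "Return")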

Classical approach

Robust approach

Classical approach (extra contamination)

Robust approach (extra contamination)

Want to know more?

References

Location: Hodges-Lehmann estimator

Covariance: Minimum Covariance Determinant (MCD) estimator

Regression: MM estimator

Sparse precision matrix estimation: Graphical lasso

Robust precision matrix estimation

  • Tarr, Müller, Weber (2015). Robust estimation of precision matrices under cellwise contamination. Computational Statistics & Data Analysis, to appear. DOI:10.1016/j.csda.2015.02.005

Robust scale estimation

  • Tarr, Müller, Weber (2012). A robust scale estimator based on pairwise means. Journal of Nonparametric Statistics, 24(1): 187-199. DOI:10.1080/10485252.2011.621424

Overview of robust estimation of scale and dependence

  • Tarr (2014). Quantile Based Estimation of Scale and Dependence. PhD Thesis. University of Sydney, Australia. hdl.handle.net/2123/10590
