Statistics

Recommended Reading

A student today asked if there were any books on statistics that I could recommend. He was after more generalist type books. I ended up sending him this list:

  1. The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy (USYD, Amazon) is a great (generalist) read on the progression of Bayesian statistics. It’s a really fun read (for a book about statistics).
  2. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (USYD, Amazon) I quite enjoyed this one; it’s a nicely written history of some key stats players.
  3. Statistics on the Table: The History of Statistical Concepts and Methods (USYD, Amazon) is a bit drier than the above two. I haven’t made it all the way through yet; I’m waiting for a rainy day!
  4. Mostly Harmless Econometrics: An Empiricist’s Companion (USYD, Amazon) is quite a bit more technical than the above books and more focussed on econometrics (statistics for economics).
  5. The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t (Amazon) I’ve never read this but Nate Silver’s pretty hot right now.
  6. Probabilities: The Little Numbers That Rule Our Lives (USYD, Amazon) I’m reading this on and off at the moment; it has some interesting observations.
April 11th, 2013 | Statistics, Teaching

Australians love their cash!

Intrigued by this article in the SMH, I went and got some data from the RBA and the RBNZ. Using the googleVis package, available on CRAN, I made this chart to compare the value each person holds on average:

We can also translate this into the average number of notes each person has squirrelled away:

While it appears that Kiwis have more $20 notes per person than Aussies, it does seem that Australians have a stronger affinity for the higher denomination notes.

February 14th, 2013 | Feature, Statistics

The Elements of Statistical Learning

Also not really my field, but I just found out that “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” is available online for free. It is an extremely comprehensive textbook written by giants in the field, and it’s amazing that they were able to make it available for free (though if you find use for it you can (should?) still buy a hard copy).

Available here.

January 16th, 2013 | Statistics

Odds Ratio and Relative Risk

I’m far from an epidemiologist, but odds ratios and relative risk come up often enough that it’s handy to have a solid understanding of what they mean.  These measures are used when faced with contingency tables:

\begin{array}{c|cc|c}
 & D^+ & D^- & \text{Total} \\ \hline
S^+ & a & b & a+b \\
S^- & c & d & c+d \\ \hline
\text{Total} & a+c & b+d & a+b+c+d
\end{array}

where D^+ is having a disease/condition/event under study and D^- is not having the disease/condition/event under study.  Also S^+ is testing positive/symptomatic/presence of a particular trait and S^- is testing negative/asymptomatic/not having a particular trait.

Odds ratio

The odds of success is the ratio of the probability of success p to the probability of failure 1-p:

\text{Odds} = \dfrac{p}{1-p}.

In the context of disease testing, we’d consider the odds of a disease for S^+ people (those with a particular trait) against the S^- group (those without it).  The probability of having a disease for the S^+ group can be found by restricting attention to the S^+ row (the people who have the trait) and working out what proportion of those people have the disease:

P(D^+ | S^+) = \dfrac{a}{a+b}

and the probability of having a disease for the S^- group is

P(D^+ | S^-) = \dfrac{c}{c+d}.

Hence the odds of disease for S^+ patients is:

\text{Odds for }S^+ = \dfrac{P(D^+ | S^+)}{1-P(D^+ | S^+)}

and the odds of disease for S^- people is:

\text{Odds for }S^- = \dfrac{P(D^+ | S^-)}{1-P(D^+ | S^-)}.

Finally the odds ratio is:

\text{Odds ratio} = \dfrac{\text{Odds for }S^+}{\text{Odds for }S^-} = \dfrac{ad}{bc}.

That last step is just algebra.
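For completeness, that algebra can be written out using the cell counts from the table above. Nothing here is new; it only substitutes P(D^+ | S^+) = a/(a+b) and P(D^+ | S^-) = c/(c+d) into the definitions of the two odds:

```latex
\begin{aligned}
\text{Odds for }S^+ &= \dfrac{a/(a+b)}{1 - a/(a+b)} = \dfrac{a/(a+b)}{b/(a+b)} = \dfrac{a}{b}, \\
\text{Odds for }S^- &= \dfrac{c/(c+d)}{1 - c/(c+d)} = \dfrac{c/(c+d)}{d/(c+d)} = \dfrac{c}{d}, \\
\text{Odds ratio} &= \dfrac{a/b}{c/d} = \dfrac{ad}{bc}.
\end{aligned}
```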

What does it mean?

The odds ratio is a measure of effect size: how much of a difference does the positive test/symptoms/particular trait make to your chances of getting the disease?  An odds ratio of 1 indicates that the disease/condition/event under study is equally likely to occur in both groups (that is to say D and S are independent of one another). An odds ratio greater than 1 indicates that the disease is more likely to occur in the S^+ group than the S^- group. Similarly, an odds ratio less than 1 indicates that the disease is less likely to occur in the S^+ group.

For example, an odds ratio of 2 indicates that people from the S^+ group had twice the odds of having the disease as people from the S^- group.
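As a quick numerical check of the odds ratio formula, here is a minimal sketch in Python. The function names and the counts are made up purely for illustration; they do not come from any real study.

```python
def odds(p):
    """Odds corresponding to a probability p, where 0 <= p < 1."""
    return p / (1 - p)


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table with rows S+, S- and columns D+, D-.

    a, b are the D+ and D- counts in the S+ row;
    c, d are the D+ and D- counts in the S- row.
    """
    odds_s_pos = odds(a / (a + b))  # odds of disease given S+
    odds_s_neg = odds(c / (c + d))  # odds of disease given S-
    return odds_s_pos / odds_s_neg


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
a, b, c, d = 20, 80, 10, 90

print("odds ratio via conditional probabilities:", odds_ratio(a, b, c, d))
print("odds ratio via the shortcut ad/bc:       ", (a * d) / (b * c))
```

Both lines should agree up to floating-point rounding, which is the "just algebra" step in numerical form.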

When can you use it?

The odds ratio can be used in observational studies (examining the effect of a risk factor/symptom on the disease outcome), prospective studies (where subjects who are initially identified as “disease-free” and classified by presence or absence of a risk factor are followed over time to see if they develop the disease) and retrospective studies (subjects are followed back in time to check for the presence or absence of the risk factor for each individual).

Relative risk

The relative risk is a measure of the influence of a risk factor on disease.  It is the probability of contracting the disease given you have the risk factor divided by the probability of contracting the disease given you don’t have the risk factor:

\text{Relative risk} = \dfrac{P(D^+ | S^+)}{P(D^+ | S^-)} = \dfrac{a/(a+b)}{c/(c+d)}.

What does it mean?

A relative risk of 1 means there is no difference in risk (of contracting the disease) between the two groups.  A relative risk greater than 1 means the disease is more likely to occur in the S^+ group than in the S^- group.  A relative risk less than 1 means the disease is more likely to occur in the S^- group than in the S^+ group.

For example, a relative risk of 2 would mean that S^+ people are twice as likely to contract the disease as people from the S^- group.
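The relative risk formula can be sketched in Python in the same way. The function name and the counts below are hypothetical and only serve to illustrate the arithmetic.

```python
def relative_risk(a, b, c, d):
    """Relative risk for a 2x2 table with rows S+, S- and columns D+, D-.

    a, b are the D+ and D- counts in the S+ row;
    c, d are the D+ and D- counts in the S- row.
    """
    risk_s_pos = a / (a + b)  # P(D+ | S+)
    risk_s_neg = c / (c + d)  # P(D+ | S-)
    return risk_s_pos / risk_s_neg


# Hypothetical counts: 20 of 100 S+ people and 10 of 100 S- people
# contract the disease.
print("relative risk:", relative_risk(20, 80, 10, 90))
```

With these made-up counts the risk in the S^+ group is 20% against 10% in the S^- group, so the relative risk comes out at about 2.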

When can you use it?

Relative risk can only be used in prospective studies – note the wording above is all in terms of “contracting” the disease.  It is often used to compare the risk of developing a disease in people not receiving a new medical treatment (or receiving a placebo) versus people who are receiving an established treatment.

Odds ratio vs relative risk

Odds ratios and relative risks are interpreted in much the same way, and if a and c are much less than b and d (that is, the disease is rare in both groups) then the odds ratio will be almost the same as the relative risk.  In some sense the relative risk is a more intuitive measure of effect size.  Note that the choice only arises in prospective studies, where the distinction becomes important for medium to high probabilities. If action A carries a risk of 99.9% and action B a risk of 99.0% then the relative risk is just over 1, while the odds associated with action A are more than 10 times higher than the odds with B.
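The 99.9% versus 99.0% example can be checked numerically. The sketch below is only an illustration: the helper names are my own, and the "rare" risks of 0.2% and 0.1% are invented to show the other end of the scale.

```python
def odds(p):
    """Odds corresponding to a probability p, where 0 <= p < 1."""
    return p / (1 - p)


def relative_risk_from_risks(p1, p0):
    """Relative risk of group 1 versus group 0, given each group's risk."""
    return p1 / p0


def odds_ratio_from_risks(p1, p0):
    """Odds ratio of group 1 versus group 0, given each group's risk."""
    return odds(p1) / odds(p0)


# High-probability case from the text: action A at 99.9%, action B at 99.0%.
high_rr = relative_risk_from_risks(0.999, 0.990)
high_or = odds_ratio_from_risks(0.999, 0.990)
print("high risks: relative risk =", high_rr, " odds ratio =", high_or)

# Rare-event case (hypothetical risks of 0.2% and 0.1%).
rare_rr = relative_risk_from_risks(0.002, 0.001)
rare_or = odds_ratio_from_risks(0.002, 0.001)
print("rare risks: relative risk =", rare_rr, " odds ratio =", rare_or)
```

In the high-probability case the relative risk is barely above 1 while the odds ratio is above 10; in the rare case the two measures nearly coincide.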

This not being my area, naturally I turned to Wikipedia, which suggests that the odds ratio is commonly used for case-control studies, as odds, but not probabilities, are usually estimated whereas relative risk is used in randomized controlled trials and cohort studies.

Finally (and the real motivation for the post), an award-winning video has been made by Susanna Cramb discussing the differences between odds ratios and relative risk:

January 10th, 2013 | Statistics

Hans Rosling’s 200 Countries, 200 Years, 4 Minutes

Hans Rosling from The Joy of Stats on BBC Four. Another excellent example of data communication. I use it in first year lectures to elicit discussion on the issues with aggregating data, in particular how a summary statistic can hide differences between subgroups. We also talk about how many variables are being plotted. It’s something different for them – it puts what they’re learning in a global context and shows statistics as being more than just calculating means and variances.

Pretty neat, eh?

December 15th, 2012 | Statistics, Teaching

Selling Statistics

This video clip does a great job of selling statistics to a general audience (despite being created by SAS). It’s only 2:30 minutes long, a good length for adding some interest at the start of a first year statistics unit.

“Statisticians help researchers keep children healthy”

Statistics: saving children’s lives since 1850.

December 12th, 2012 | Statistics, Teaching

On the Numerical Accuracy of Spreadsheets

I came across this journal article a couple of years ago.  It’s very accessible (not at all difficult to understand as journal articles go).  It provides some interesting results that may help inform your decision about whether or not to use Excel and some background as to why we use dedicated statistical/computational software such as Matlab/Scilab/R.

The failings of Excel might seem like they only occur in extreme cases, but it is the way Excel handles the errors that is most concerning.  In many of the examples listed in the article, it will return an incorrect value rather than admit that it doesn’t know the answer to that level of precision.  That is, when a calculation is beyond the ability of the function, Excel should return an NA rather than a wrong number.
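Silent loss of precision of this kind is not unique to spreadsheets. As a rough illustration, here is a standard textbook floating-point example in Python; it is not taken from the article and makes no claim about how Excel computes anything internally. The function names and the data are made up. The one-pass "calculator" formula for the sample variance subtracts two very large, nearly equal numbers and can lose most of its significant digits, yet it still returns a number without any warning. The two-pass formula is much better behaved on the same data.

```python
def naive_variance(xs):
    """One-pass formula: (sum of squares - (sum)^2 / n) / (n - 1).

    Algebraically correct, but prone to catastrophic cancellation
    when the mean is large relative to the spread.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Hypothetical data: small spread around a very large mean.
# The deviations from the mean are -6, -3, 3, 6, so the sample
# variance should be (36 + 9 + 9 + 36) / 3 = 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("one-pass formula:", naive_variance(data))
print("two-pass formula:", two_pass_variance(data))
```

The point of the comparison is that the one-pass result gives no indication of how many of its digits can be trusted.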

The last paragraph:

“Finally, as a rule of the thumb, every user should be aware that spreadsheets have serious limitations. Other platforms are advisable, being currently R the most dependable FLOSS (Free/Libre Open Source Software, see Almiron et al. 2009).”

Almiron, M. G., Lopes, B., Oliveira, A. L. C., Medeiros, A. C., and Frery, A. C. (2010). On the numerical accuracy of spreadsheets. Journal of Statistical Software, 34(4):1–29.

December 11th, 2012 | R, Statistics

Law of large numbers

Watch the first two minutes of this video for a graphical representation of the law of large numbers (the physicist’s center of gravity is the statistician’s mean).  It’s worth a look if only for the awesome ’80s styling and soundtrack.

December 10th, 2012 | Statistics, Teaching