Number of Results Outside the Margin of Error has a Margin of Error Too.

I was reading a few tweets recently about the number of polls expected to produce results outside the margin of error (MoE). For example, @mysterypollster tweets, "...keep in mind that 1 poll in 20 should produce a result outside MoE, we've had almost that many since 1/1." This statement strongly implies that if one has conducted twenty polls, then one poll will predictably lie outside the margin of error. In fact, if each poll independently has a 5% chance of falling outside its MoE, the chance of exactly one poll in twenty doing so is only 38%. The probability of none of the twenty polls producing results outside the MoE is 36%. This leaves a 26% chance of producing two or more results outside the margin of error. The claim that "1 poll in twenty should produce a result outside the MoE" is pretty misleading, especially since the chance of that NOT happening is 62%.

The bar plot below shows the distribution of the number of results, out of 20 polls, that fall outside the MoE. And if you're curious, the 95% confidence interval for the number of polls out of 20 outside the MoE is [0, 3], with a mean of 1.
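The figures above can be reproduced with the binomial distribution. This is a minimal sketch, assuming each of the 20 polls independently has a 5% chance of landing outside its MoE; the variable names are my own.

```r
# Assumption: 20 independent polls, each with a 5% chance of a result outside the MoE.
n_polls <- 20
p_outside <- 0.05

p_none <- dbinom(0, n_polls, p_outside)          # no polls outside the MoE
p_one <- dbinom(1, n_polls, p_outside)           # exactly one poll outside the MoE
p_two_plus <- 1 - pbinom(1, n_polls, p_outside)  # two or more polls outside the MoE
print(round(c(none = p_none, one = p_one, two_plus = p_two_plus), 2))

# central 95% interval and mean of the number of polls outside the MoE
interval <- qbinom(c(0.025, 0.975), n_polls, p_outside)
mean_outside <- n_polls * p_outside
print(interval)
print(mean_outside)

# bar plot of the distribution (counts above 6 have negligible probability)
counts <- 0:6
barplot(dbinom(counts, n_polls, p_outside), names.arg = counts,
        xlab = "Polls outside the MoE (of 20)", ylab = "Probability")
```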

Posted in methods | Leave a comment

Multiply Imputing an Outcome Variable

I've become frustrated lately with the idea that one should not impute the dependent variable. While perhaps difficult to see at first, just a little more thought should reveal that the outcome is far too important to listwise delete. In fact, I'm not sure how one can suggest that MI works for explanatory variables, but not for outcomes. There might be a good argument for it, but I haven't heard one. Below I lay out an intuitive case for multiply imputing outcomes and illustrate with a simulated data set.

Suppose we are interested in estimating the effect of education on income, that survey respondents' education is completely observed, and that richer respondents are less likely to report their income. We have missingness in our outcome variable. Is MI beneficial in this case? Absolutely! Think about it simplistically. If education has a positive effect on income, as we might guess, then we should observe the income of most of our less educated respondents. Among more educated respondents, however, we'll tend to observe income only for those who make less than expected. Overall, this should cause us to underestimate the effect of education.
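A quick way to see this attenuation is to simulate it. The sketch below is only illustrative: the standardized education and income variables, the true slope of 1, and the probit-shaped missingness are all assumptions of mine, not estimates from real survey data.

```r
set.seed(1)
n <- 50000

# Assumed data-generating process: education raises income with a true slope of 1.
education <- rnorm(n)
income <- education + rnorm(n)

# Richer respondents are less likely to report income (assumed probit form).
p_missing <- pnorm(income)
observed <- rbinom(n, 1, p_missing) == 0

# slope using everyone versus only those who reported their income
full_slope <- coef(lm(income ~ education))[["education"]]
cc_slope <- coef(lm(income ~ education, subset = observed))[["education"]]

print(c(full = full_slope, complete_cases = cc_slope))
```

The complete-case slope comes out noticeably below the full-data slope, which is the underestimate described above.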

The intuitive solution, of course, is to make a guess about what the missing incomes are. Education doesn't offer us any new information, but we might think of some covariates that are causally posterior to income, such as one's occupation or the kind of car one drives. If we can come up with variables (in addition to those we'll use in the final model) to predict our outcome, we can reduce this bias.

To illustrate, I did a quick simulation. Keep in mind that in the simulation below, the missingness is non-ignorable, meaning that the probability that y is missing depends on y. In fact, in this simulation, that is all it depends on, which is one of the worst situations for MI. But because I have several variables that are caused by y, and thus wouldn't be included in the final analysis model, I can leverage these to reduce the bias caused by missingness.


# simulate a data set
n = 100000
x1 = rnorm(n) # x's are explanatory variables
x2 = rnorm(n)
y = 0 + x1 + x2 + rnorm(n) # y is the outcome variable
z1 = y + rnorm(n, 0) # z's are variables caused by y
z2 = y + rnorm(n, 0)
z3 = y + rnorm(n, 0)

# generate missing observations
p.mis = pnorm(y) # larger values of y are more likely to be missing
mis = rbinom(n, 1, p.mis)
y[mis == 1] = NA
data = data.frame(y, x1, x2, z1, z2, z3)

# mi and estimate models
library(Amelia) # provides amelia()
library(Zelig)
a = amelia(data) # the imputation model uses the z's as well as the x's
mi = zelig(y ~ x1 + x2, data = a$imputations, model = "ls") # multiple imputation
ld = zelig(y ~ x1 + x2, data = data, model = "ls") # listwise deletion
summary(mi); summary(ld)

In this simulation, the true coefficients of x1 and x2 are 1. The estimates using listwise deletion are about 0.76 with a tiny standard error of about 0.004. This is way off. The MI estimates of x1 and x2 are much better, with estimates of about 0.925. The bias is substantially reduced although still present because our simulation generates non-ignorable missingness.
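To see where the improvement comes from without Amelia or Zelig, here is a rough base-R sketch of the same idea. It swaps in a single stochastic regression imputation, which is not proper multiple imputation (its standard errors would be too small), and uses a smaller n and my own variable names. It only shows how the z's pull the point estimates back toward 1.

```r
set.seed(42)
n <- 20000

# same data-generating process as the simulation above, with a smaller n
x1 <- rnorm(n)
x2 <- rnorm(n)
y_full <- x1 + x2 + rnorm(n)
z1 <- y_full + rnorm(n)
z2 <- y_full + rnorm(n)
z3 <- y_full + rnorm(n)

# missingness depends only on y (non-ignorable)
missing <- rbinom(n, 1, pnorm(y_full)) == 1
dat <- data.frame(y = ifelse(missing, NA, y_full), x1, x2, z1, z2, z3)

# listwise deletion: lm() drops the rows with missing y
ld_coef <- coef(lm(y ~ x1 + x2, data = dat))[c("x1", "x2")]

# imputation model: predict y from the x's AND the z's, using the observed rows
imp_model <- lm(y ~ x1 + x2 + z1 + z2 + z3, data = dat)
pred <- predict(imp_model, newdata = dat[missing, ])

# fill in missing y with the prediction plus a residual-sized noise draw
dat$y_imp <- dat$y
dat$y_imp[missing] <- pred + rnorm(sum(missing), 0, summary(imp_model)$sigma)

# analysis model on the completed data
imp_coef <- coef(lm(y_imp ~ x1 + x2, data = dat))[c("x1", "x2")]

print(rbind(listwise = ld_coef, imputed = imp_coef))
```

Because the z's explain much of the variation in y, the imputed values for the missing cases sit closer to the truth than anything the x's alone could provide. Some bias remains because the imputation model is itself fit only to the observed cases.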

I think it is at least somewhat intuitive that we should multiply impute our outcomes, and this simulation illustrates that it can be helpful and offers some intuition as to why. I'd be interested in hearing from someone who opposes MI for outcomes. Perhaps there are non-trivial conditions in which it can make you worse off.

Posted in methods, R | 4 Comments

Using Inkscape to Post-edit Labels in R Graphs

I like to label points in my graphs. For example, in my previous exchange with Seth Masket, we both labeled every point. Often, default labeling schemes in R will produce text that overlaps substantially, which means either writing complicated code to prevent the overlap or labeling only a few of the points. I have recently been experimenting with post-editing graphs produced in R with Inkscape, an open-source alternative to Adobe Illustrator. I have found post-editing in Inkscape especially useful for correcting overlap when labeling points. Below I discuss how I go about this and give an example. Continue reading

Posted in graphs, R | 6 Comments

Academic Blogs as Open Ideas

The Florida State Library is holding a symposium on scholarship in a digital age and the general theme is open scholarship. One of the components is a series of lightning talks for which they are now accepting proposals. I've wanted to do one of these short talks since I started watching the Ignite presentations and I am considering proposing a talk on "Blogs as Open Ideas." While considering shifting my blog from its old WordPress.com address to my own domain, I thought a lot about why blogging is important to me. I concluded that blogging matters because blogs are an important source of idea creation and criticism. I elaborate on this below and give some examples from my own blog. Continue reading

Posted in blogging | Leave a comment

Chartjunk and Clear Comparisons of Data

Chartjunk has been mentioned several times over the last few days. I cleaned up a plot by Wired magazine, arguing that the original graph made the data difficult to compare. Nathan Yau elaborated on a similar point in his recent "5 misconceptions about visualization" post. Finally, David Smith of Revolutions takes issue with this paper and this talk by Alex Lundry, which David summarizes as follows:

It was an entertaining talk, but his main point was to encourage data visualization practitioners to actively insert a point of view into the presentation of data. For example, he encourages more charts like the one on the right below, rather than the one on the left.

[Images: Usefuljunk-costs, Usefuljunk-monster]

Continue reading

Posted in graphs | 1 Comment

Weird Circle Charts and Nice Dot Plots

Recently, Wired magazine showed a series of plots relating to "The Dead, the Dollars, the Drones" in their analysis of the cost of the war on terror. I linked to a similar infographic posted by Flowing Data in my weekly review posted on 9/11 and I linked to the plots in Wired in last week's review. These are cool infographics. They catch the reader's eye and raise questions, but don't communicate the data as well as other types of graphics. Below I compare an infographic and a statistical graphic of the same data. Continue reading

Posted in graphs | Leave a comment

Graphs, Color Gradients, and Negative Advertising

In a previous post, I argued that colors are best used when mapped to quantitative variables in natural ways. I use this technique in work for my dissertation. Jon Peltier disagreed and we had a discussion in the comments section here. Jon made several good points, and I have tried to address some of his concerns. Below I discuss the progress I've made and compare the old and new figures. Continue reading

Posted in graphs | Leave a comment

Week in Review: Pie Charts and Maps, Reproducibility, and Social versus Physical Science

Graphs from the week

  1. Interesting graphs of the cost of the war on terror.
  2. Jon Peltier looks at pie charts showing who is blamed for the mess in Washington. The Monkey Cage presents a pie chart on grading schools. Is a pie chart ever useful? Never!
  3. Maps with geographic data make amazing graphs. One map this week looks at the distance to the nearest McDonald's in the US and another replicates the figure for the UK. Another map graphs world population densities by income.
  4. Earlier this week, I added a post about colors in graphs and had a discussion with Jon Peltier in the comments section. Georgette Asherman offers some thoughts on how to use colors effectively. Some R advice on converting values to colors.
  5. Changes in militancy of religions across time. Worst graph of the year?

Reproducibility of research came up several times in several forms.

  1. An article in Nature suggests that 50% of published studies might be wrong.
  2. Andrew Gelman kicked things off with a post about the statistical significance filter, and followed up by commenting on type M errors in the lab and featuring a post about data management errors.
  3. Tyler Cowen and Bruce Booth offer their takes. Observational Epidemiology presents some thoughts as well.
  4. I offer some thoughts about how to encourage reproducible findings.
  5. An article in Significance makes some suggestions on encouraging reproducibility, and I offer comments.
  6. The Economist brought up reproducibility as well and Revolutions gave their take on it.

An interesting discussion also emerged on the differences between the hard and social sciences, especially their public perceptions.

  1. Discover Magazine kicks things off.
  2. Overcoming Bias responds.
  3. Modeled Behavior continues the discussion.
Posted in graphs, methods | Leave a comment

Reproducibility in Observational Studies

Earlier, I wrote that the editorial policies of journals encourage findings that cannot be reproduced. This was motivated in part by Andrew Gelman's recent post, which made me think that journal editors work as statistical significance filters, thus creating overestimates and findings that cannot be reproduced. The issue of reproducibility has come up several times lately on Andrew's blog (see "Reproducibility in Practice" and "Type M Errors in the Lab") and on other blogs (see Tyler Cowen's take and this post on Observational Epidemiology). To solve the problem, I suggested publishing short replications and basing acceptance only on the replications' procedures, not the results.

The newest issue of Significance Magazine has an article titled "Deming, Data, and Observational studies" that offers another procedure that is much stronger and mainly designed to encourage reproducible observational studies. The authors write:

The more startling the claim, the better. These results are published in peer-reviewed journals, and frequently make news headlines as well. They seem solid. They are based on observation, on scientific method, and on statistics. But something is going wrong. There is now enough evidence to say what many have long thought: that any claim coming from an observational study is most likely to be wrong–wrong in the sense that it will not replicate if tested rigorously. Continue reading

Posted in methods | Leave a comment

Colors, Legends, and Labeling

Using color to label data on plots sometimes creates problems because the reader can't easily remember the meaning of the various colors. The same holds for shapes. I sometimes solve this problem by using color gradients with intuitive starting and ending colors. For example, I use blue to represent campaigns that aren't advertising and red to represent campaigns that are advertising heavily. Campaigns in the middle are represented by a color between red and blue.
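As a rough illustration of such a gradient in R, the sketch below maps a quantity to colors between blue and red with colorRamp() from grDevices. The advertising intensities are made-up values scaled to run from 0 (no advertising) to 1 (heavy advertising), and this is not necessarily how the figures in the post were made.

```r
# Hypothetical advertising intensities, scaled so 0 = none and 1 = heaviest.
ad_intensity <- c(0, 0.1, 0.5, 0.9, 1)

# colorRamp() returns a function that maps values in [0, 1] to RGB rows (0-255)
ramp <- colorRamp(c("blue", "red"))
cols <- rgb(ramp(ad_intensity), maxColorValue = 255)

print(data.frame(ad_intensity, cols))
```

The resulting vector can be passed to the col argument of plot() or points(), so the color itself carries the ordering and the reader does not have to memorize a legend.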

Continue reading

Posted in graphs | 5 Comments