This is part 36 of a series of articles featuring the book Beyond Connecting the Dots, Modeling for Meaningful Results.
To help us get at this fundamental classification scheme, let’s first talk for a moment about the process of inference. Take the earlier example of determining whether wealth results in increased high-school test scores. We phrased this hypothesis in a specific way: that increased wealth will always increase test scores. This illustrative statement, however, actually differs from what is often done in practice. In general, researchers simply ask the question “Does X affect Y?” rather than “Does X increase Y?” It’s just a slight difference, but it is a more flexible question that allows for many forms of relationships. For our example, we would ask the question “Does wealth affect test scores?”
The gold standard for answering questions like this is the controlled experiment. Controlled experiments allow you to develop strong inferences, as you can see how a system responds when you hold all variables constant except for the single one you are interested in. For our example, we could imagine an experiment where we took a sample of a thousand families from a school district. When these families’ children enter high school we would randomly select half to be in a “poor” category and the other half to be in a “rich” category. Families in the rich category are given grants of $500,000 a year to spend as they wish, while the parents in the poor category are fired from their jobs and have their savings frozen for the duration of the experiment. Once the students graduate from high school, we would compare the scores for the students in the two categories.
These controlled randomized experiments are considered the ideal approach to answering inferential questions like these as they allow you to truly determine the effect of your variables, in this case wealth. For many types of questions, such experiments can be implemented (for instance, does consumption of a new drug help treat a disease?). Unfortunately, in general, complex social questions are simply impossible to answer this way. We can consider the testing procedure we just imagined to assess the effect of wealth on scores, but it would be impossible and unethical to undertake in a real community. Furthermore, even if you were to implement the experiment as described, the behavior of a family that was poor or wealthy to begin with might very well differ from a family that experiences a sudden change in income.
Traditional Model-Based Inference
Given our general inability to undertake the ideal controlled experiment, how do we answer inferential questions? The standard way is to collect data and then construct a model enabling us to measure the statistical significance of our hypothesis given the data. Due to history and simplicity, linear regression models are by far the most commonly used type of model today. A linear regression predicts an outcome (Y) based on the multiplication of variables (X’s) by a set of coefficients determining the effect of the variables on the outcome (β’s):
Y = β0 + β1 × X1 + β2 × X2…
For the education example we could collect data on a number of students, measuring their families’ wealth (X1 in the equation above) and the students’ test scores (Y). We would then run the linear regression to determine the coefficient values (β0 – the intercept – and β1 – the effect of wealth on test scores). If we thought there were other factors that affected test scores, we could measure them and include them as additional X’s in the regression.
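To make the regression step concrete, here is a minimal sketch of fitting Y = β0 + β1 × X1 by ordinary least squares. The data are synthetic and invented purely for illustration (the wealth range, the "true" coefficients of 60 and 0.1, and the noise level are all assumptions), and the function name `fit_simple_regression` is our own.

```python
import random

def fit_simple_regression(x, y):
    """Ordinary least squares for Y = b0 + b1 * X with a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariation of x with y, and variation of x around its mean.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # slope: estimated effect of X on Y
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

# Synthetic data (hypothetical): wealth in thousands of dollars, scores in points.
rng = random.Random(0)
wealth = [rng.uniform(20, 200) for _ in range(200)]
scores = [60 + 0.1 * w + rng.gauss(0, 5) for w in wealth]

b0, b1 = fit_simple_regression(wealth, scores)
print(f"intercept b0 = {b0:.2f}, wealth effect b1 = {b1:.3f}")
```

Because the data were generated from a straight line plus noise, the estimates should land close to the values used to create them. Real data offer no such guarantee.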
In addition to obtaining the values of these coefficients, as a result of the regression we also obtain the statistical significances or “p values” of these coefficients. Although p values are commonly used in statistics, they are ubiquitously misunderstood [1], so it is useful to briefly review them.
In short, a p value measures the probability of seeing the measured data (or more extreme data) assuming the null hypothesis is true. Generally the null hypothesis will be that there is no relationship between the variables and the outcomes.
When assessing the significance of coefficients, a p value is the probability of seeing that value of a coefficient (or one even further from 0), assuming that the (unknown) truth is that the coefficient actually has a value of 0. In other words, it is the probability of seeing a value at least as far from 0 as the one observed, assuming that the true value is in fact 0. Frequently, probabilities of 10%, 5% or 1% or smaller are taken as indicating statistical significance. These low values indicate that the coefficient value is so far from 0, and the probability of this occurring by chance so small, that we can reject the null hypothesis and conclude that the coefficient is not 0.
Using the p values enables inference by relying on the statistical significance of the coefficients. If the probability of a β1 (the coefficient for the effect of wealth) this far from 0 occurring due to chance (given it is 0 in reality) is less than, say, 5%, we can claim with reasonable confidence that wealth does in fact affect test scores. This is the standard approach researchers take to model-based inference and is used ubiquitously.
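The definition of a p value can be illustrated directly with a small simulation. The sketch below is not how regression software normally computes p values (that usually relies on a t distribution); instead it uses a permutation test, which we have swapped in because it mirrors the definition closely. Shuffling the scores breaks any link with wealth, so each shuffle shows what slope could arise by chance if the true coefficient were 0. All data and names here are hypothetical.

```python
import random

def slope(x, y):
    """Least-squares slope b1 for Y = b0 + b1 * X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffled data sets whose slope is at least as far from 0
    as the observed slope (a simulation of the null hypothesis b1 = 0)."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate from being exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Synthetic data (hypothetical): one outcome depends on wealth, one does not.
rng = random.Random(1)
wealth = [rng.uniform(20, 200) for _ in range(100)]
related = [60 + 0.1 * w + rng.gauss(0, 5) for w in wealth]
unrelated = [70 + rng.gauss(0, 5) for _ in wealth]

p_related = permutation_p_value(wealth, related)
p_unrelated = permutation_p_value(wealth, unrelated)
print(f"p value when wealth matters: {p_related:.4f}")
print(f"p value when it does not:    {p_unrelated:.4f}")
```

Note that the p value says how surprising the data would be if the coefficient were 0. It is not the probability that the coefficient is 0.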
A Troubled Sea of Assumptions
Let’s stop for a second and consider what we have done here. In carrying out these logical steps to apply model-based inference to determine whether wealth affects test scores, we have had to make one very large assumption: that the relationship between test scores and wealth is linear.
Our linear regression equation assumes that for every increase in one unit of wealth (X1), test scores (Y) will increase on average by the amount of the coefficient (β1). What if this were not true? For instance, we could easily imagine the case where wealth initially helped test scores by providing students more resources and opportunities to learn. After a certain point, however, wealth might negatively impact scores as very wealthy students might lack the pressure or motivation to study hard.
If we believed this were the case, then our linear regression model would be wrong, as would the inferences we obtained from the model. We could correct our model and inferences by changing our regression formula to contain a squared term that could replicate this type of relationship:
Score = β0 + β1 × Wealth + β2 × Wealth²
Using this equation, at low values of wealth the β1 × Wealth term will have the most effect on scores. Conversely, at high levels of wealth, the β2 × Wealth² term will have the most effect on scores. Thus by having a positive β1 and a negative β2 we can model wealth as having an initially beneficial and then detrimental effect. If our assumptions about the quadratic relationship are correct, then this model will yield accurate inferences. If they are wrong, our inferences will be wrong again.
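A short sketch shows how a positive β1 and a negative β2 produce the rise-then-fall shape. The coefficient values below are hypothetical, chosen only to make the shape visible. The turning point follows from setting the slope of the curve, β1 + 2 × β2 × Wealth, to zero.

```python
def predicted_score(wealth, b0, b1, b2):
    """Score = b0 + b1 * Wealth + b2 * Wealth^2."""
    return b0 + b1 * wealth + b2 * wealth ** 2

def turning_point(b1, b2):
    """Wealth at which the predicted score peaks (requires b2 < 0).
    Found by solving b1 + 2 * b2 * Wealth = 0 for Wealth."""
    return -b1 / (2 * b2)

# Hypothetical coefficients: wealth helps at first (b1 > 0), then hurts (b2 < 0).
b0, b1, b2 = 50.0, 0.4, -0.001

peak = turning_point(b1, b2)
print(f"predicted scores peak near wealth = {peak:.0f}")
for w in (0, 100, 200, 300, 400):
    print(f"wealth {w:>3}: predicted score {predicted_score(w, b0, b1, b2):.1f}")
```

Below the turning point the linear term dominates and predicted scores climb. Above it the squared term takes over and they fall.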
What are we really doing when we assume regression forms like this? It might not be immediately obvious, but we are in fact telling a story. Using our first equation, we are telling the story that as wealth increases, test scores will almost always increase. Bill Gates’ children will perform amazingly well here! Using the second equation, we are telling a different story: As wealth increases, test scores initially increase, but after a certain point increased wealth will hurt test scores. That picture isn’t so rosy for the Bill Gates of the world!
And so we arrive at a key insight. By choosing our equations to tell a story, our inferences are in fact based on narrative modeling approaches. True, these inferences build upon numerous calculations and very advanced theoretical underpinnings, but ultimately what governs our conclusions and inferences are the stories or narratives we tell about our system. These are choices that we as narrators make and they are not determined by an objective truth or reality.
Is there an alternative approach to inference that does not rely so heavily on narrative? Can we accomplish it without assuming the relationships among variables? The answer is yes. Although they are not often used, alternative prediction-based approaches to inference are available. In these approaches, rather than calculating statistical significances as a function of an assumed model, we calculate significances as a function of the simple question: “Does knowing X help us to predict Y?” This question is effectively identical to our earlier question – “Does X affect Y?” – but it is structured in an explicitly predictive manner. If the answer to the question is yes, then we can say that there is a relationship between X and Y.
The techniques to accomplish prediction-based inference are much newer than classic techniques such as linear regression. They rely on extensive computing power and would not be possible without modern technology. One of these approaches is the A3 method, which uses resampling-based algorithms to obtain estimates of predictive accuracy and statistical significance. A3 focuses purely on the predictive accuracy of a model to determine whether a variable is significant, and often requires the automatic exploration of hundreds or thousands of competing models to find the one that best describes the data. The results of these analyses are inferences that are founded only on how well the models fit the data, not on subjective assumptions.
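To give a feel for the question “Does knowing X help us to predict Y?”, here is a rough sketch of the general idea. It is not the A3 method or its implementation. The nearest-neighbor predictor, the leave-one-out resampling, and the shuffle-based significance estimate are all our own assumptions, chosen for brevity. The synthetic data follow a rise-then-fall pattern, and no equation form is assumed by the predictor.

```python
import random

def knn_predict(train_x, train_y, query, k=5):
    """Predict y at `query` as the mean y of the k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def loo_error(x, y, k=5):
    """Leave-one-out mean squared error: predict each point from all the others."""
    total = 0.0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        total += (knn_predict(rest_x, rest_y, x[i], k) - y[i]) ** 2
    return total / len(x)

def predictive_p_value(x, y, n_shuffles=200, seed=0):
    """Share of shuffles in which a scrambled X predicts Y at least as well
    as the real X does."""
    rng = random.Random(seed)
    real_error = loo_error(x, y)
    shuffled_x = list(x)
    as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled_x)
        if loo_error(shuffled_x, y) <= real_error:
            as_good += 1
    return (as_good + 1) / (n_shuffles + 1)

# Synthetic data (hypothetical): scores rise and then fall with wealth.
rng = random.Random(2)
wealth = [rng.uniform(0, 400) for _ in range(60)]
scores = [50 + 0.4 * w - 0.001 * w ** 2 + rng.gauss(0, 5) for w in wealth]

p = predictive_p_value(wealth, scores)
print(f"p value for 'wealth helps predict scores': {p:.3f}")
```

A small p value here means that scrambled wealth values almost never predict scores as well as the real ones, which is evidence of a relationship without committing to a linear or quadratic story.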
Next edition: Models and Truth: Predictive versus Narrative Modeling.
[1] These misunderstandings are not only made by on-the-ground practitioners and analysts, they are frequently shared, and propagated, by university-level statistics instructors; see, for instance, Haller, H., and Krauss, S. (2002). Misinterpretations of Significance: A Problem Students Share with Their Teachers. Methods of Psychological Research Online 7(1): 1–20.