Plus-minus data is the bedrock of a lot of thoughtful NBA analysis. If I tell you that the Celtics have outscored their opponents by 153 points in the 350 minutes that Jaylen Brown, Marcus Smart, and Al Horford have shared the court this season (per Basketball Reference), then you might reasonably conclude that these players are doing something right when they play together. At the end of the day, a valuable lineup needs a good raw plus-minus score (points scored minus opponents' points scored). If a lineup of players is, over time, consistently outscored by the opposing lineups, then this lineup cannot be considered effective.
The key question is what exactly "over time" means. One of the problems with looking at raw plus-minus scores for lineups is that they are heavily affected by small sample sizes. Especially early in the season, a good game or two can make an entirely average lineup look like world beaters.
To quantify how many possessions we need to watch to determine if a lineup is truly effective, I went back to intro statistics. Specifically, we need to go over the basics of a method called hypothesis testing.
Essentially, hypothesis testing is a way of quantifying how skeptical we are that something called the null hypothesis is true. For our purposes, the null hypothesis is that a particular lineup is exactly league average in ability. If the null hypothesis is actually true, they would defeat another league average lineup 50% of the time (if each side has equal possessions). If we want to classify a lineup as productive, we need to see evidence that the null hypothesis is not true.
Suppose we observe this lineup play 100 possessions over the course of the season. If this lineup were truly league average in ability (and we assume possessions are independent, see comments at the end), we can use a very important theorem from statistics called the Central Limit Theorem. The Central Limit Theorem tells us that the raw plus-minus score of this lineup over the 100 possessions is approximately normally distributed. What that means is that we can compute the probability that the lineup’s raw plus-minus will wind up in a certain range by finding the area under a bell curve. For example, the probability that this lineup will have a raw plus-minus in the range of +5 to +15 is 0.202, the shaded area in the plot below.
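This area-under-the-bell-curve calculation is easy to reproduce. In the sketch below, the per-possession standard deviation of net points (about 1.64) is my own assumption, backed out of the article's +27 cutoff rather than taken from any official source; the Central Limit Theorem then says the 100-possession total has a standard deviation of 1.64 times the square root of 100.

```python
# Probability that a league-average lineup's raw plus-minus over 100
# possessions lands between +5 and +15, via a normal approximation.
# ASSUMPTION: per-possession std dev of net points ~1.64, implied by
# the article's numbers, not an official figure.
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF using the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

n_possessions = 100
sigma_per_poss = 1.64                                     # assumed
sigma_total = sigma_per_poss * math.sqrt(n_possessions)   # CLT scaling

p = normal_cdf(15, 0, sigma_total) - normal_cdf(5, 0, sigma_total)
print(round(p, 3))  # ≈ 0.20, matching the shaded area
```

With a slightly different assumed per-possession variance you would recover the article's 0.202 exactly; the point is the mechanics, not the third decimal.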
Now, what hypothesis testing tells us to do is designate an area on this graph which we would be highly unlikely to end up in if the null hypothesis were true. Landing in this region, technically called the rejection region, would be strong evidence that the null hypothesis (of being an average lineup) is not true. Let’s take the right tail, which is shaded below.
If the lineup were truly average, there is only a 5% chance that it would outscore its opponents by 27 points or more over 100 possessions (I will get back to where the 27 comes from). We call 0.05 our significance level.
Now suppose our lineup actually lands in this rejection region, winding up with a +30 raw plus-minus. We take this as strong evidence that our null hypothesis of a league-average lineup is not correct. The lineup is given a p-value of about 0.031, which means that there is a 3.1% probability that an average lineup would outscore opponents by 30 or more points over 100 possessions. Since the p-value is less than 0.05 (the significance level that we set), we conclude that this lineup has some significant positive ability.
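The p-value is just the upper-tail area beyond the observed +30. A minimal sketch, again assuming the per-possession standard deviation (~1.64) implied by the +27 cutoff; small rounding differences from the quoted 3.1% come from that assumed sigma:

```python
# p-value for a +30 raw plus-minus over 100 possessions under the null
# hypothesis of a league-average lineup.
# ASSUMPTION: per-possession std dev ~1.64 (backed out of the article).
import math

def normal_sf(x, mu=0.0, sigma=1.0):
    """Upper-tail probability of a normal distribution (stdlib only)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

sigma_total = 1.64 * math.sqrt(100)   # sd of the 100-possession total
p_value = normal_sf(30, 0, sigma_total)
print(round(p_value, 3))              # ≈ 0.034 with this assumed sigma
assert p_value < 0.05                 # significant at the 0.05 level
```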
This +27 point threshold is directly related to the variance (a technical term) of an average NBA possession and the number of possessions that we observe the lineup play. The threshold will naturally increase as the number of possessions increases. If we fix our significance level at 0.05, we can plot the raw plus-minus that would be considered evidence of a "good" lineup against the number of possessions. You will see a curve that has the same shape as the square root function.
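The square-root shape falls straight out of the formula: at a one-sided 0.05 level the cutoff is z times sigma times the square root of the possession count. A short sketch, with the per-possession sigma (~1.64) again an assumption implied by the article's figures:

```python
# Raw plus-minus rejection threshold as a function of possessions.
# threshold(n) = z * sigma_per_poss * sqrt(n), so it grows like sqrt(n).
import math

Z_05 = 1.645            # one-sided 5% critical value, standard normal
SIGMA_PER_POSS = 1.64   # ASSUMPTION: per-possession std dev of net points

def plus_minus_threshold(n_possessions):
    """Smallest raw plus-minus significant at the 0.05 level."""
    return Z_05 * SIGMA_PER_POSS * math.sqrt(n_possessions)

for n in (100, 400, 900):
    print(n, round(plus_minus_threshold(n), 1))
# 100 possessions gives ~27; quadrupling possessions only doubles the bar
```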
People often like viewing lineups through something called net rating. Net rating is simply the points by which a lineup outscores its opponents, per 100 possessions. Viewed through the lens of net rating, a lineup that plays fewer possessions needs a higher net rating for us to have strong evidence that it is productive. This next plot shows the net rating bar that would be called significant (at the 0.05 level) at different possession levels.
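Dividing the raw threshold by possessions and scaling to per-100 shows why the bar falls as the sample grows: the raw cutoff grows like the square root of n, so the net-rating cutoff shrinks like one over the square root of n. Same assumed per-possession sigma (~1.64) as above:

```python
# Net rating (per 100 possessions) needed for significance at 0.05.
# ASSUMPTION: per-possession std dev ~1.64, implied by the article.
import math

Z_05 = 1.645
SIGMA_PER_POSS = 1.64  # assumed

def net_rating_threshold(n_possessions):
    """Net rating bar for significance; falls like 1/sqrt(n)."""
    raw = Z_05 * SIGMA_PER_POSS * math.sqrt(n_possessions)
    return raw / n_possessions * 100

for n in (100, 250, 1000):
    print(n, round(net_rating_threshold(n), 1))
# ~27 at 100 possessions, ~17 at 250, under 9 by 1000
```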
In the future, I plan to post a table of the most effective/least effective lineups according to their p-values. The bottom-line takeaway from this exercise is that we need to be cautious when the sample size of possessions is small. As the plot above shows, a lineup with only about 200-300 possessions would need a net rating of at least +15 to +20 before we have convincing evidence that this is more than just the result of chance alone.
When introducing the Central Limit Theorem, I noted that we really need possessions to be independent to conclude that the null distribution is normal. Independence means that the result of one possession has no effect on the result of other possessions. If one lineup is playing for a stretch of the game against another lineup, it’s reasonable to conclude possessions are not independent. Still, we could argue possessions across games are more or less independent, and the normal distribution is still fine to use as long as there is not “too much” (loosely speaking) dependence between possessions.