Using Play-by-Play Data to Examine Volume and Shot Difficulty in the NBA

There are two main aspects to shooting and scoring in the NBA: volume and efficiency. The balance of the two is an interesting topic for debate, and it's hard to evaluate a scorer without both. But is volume really necessary? Once you cross a qualifying mark and eliminate the majority of sample size issues, can efficiency be the sole determiner? One of the reasons the answer is a relatively firm "no" is that the two should be inversely correlated. Scorers who have a larger portion of a team's offensive load should have to take more difficult shots more often, dragging down their overall efficiency. So, what happens if we put it to the test using play-by-play data?

Introduction

Unfortunately, there are a few issues that make testing this philosophy difficult. The first is that better shooters are likely given more volume by their coaches. This presents a survivorship bias of sorts. The second is that player-tracking data about defender distance, which would be very valuable for answering this question, isn't publicly available. The second problem doesn't have a real solution, so instead, I'll be using shot distance as a stand-in for difficulty. Because large differences in playstyle would confound the data, I'll only be using three-point stats. As a Blazers fan, I've seen Damian Lillard launch many a thirty-footer because that's where the open shots are for him. But does this hold as a statistical trend?

The first problem has a cleaner but more complicated solution. Essentially, the plan is to find the difficulty of a shot at each distance by looking at how players shoot from there compared to their average shot. Then, I'd take this data and use it to determine the average difficulty of each NBA player's shots. Finally, I'd graph it against volume to see if a trend emerged. It's a bit complicated, so I'll explain with a simple example using two players and two shot distances. Capital M will represent a make, and lowercase m a miss.

Player A (7 shots): 25M, 25m, 25m, 25m, 25m, 25m, 35m

Player B (11 shots): 25M, 25M, 25M, 25m, 25m, 25m, 35M, 35M, 35m, 35m, 35m

As I mentioned above, we can't simply take the percentage from each distance. This is a problem because of something known as Simpson's paradox: a trend that shows up within each group of data can vanish or even reverse when the groups are combined. In this example, taking the raw 3P% would show that shooting from 25 feet (4/12) and 35 feet (2/6) is equally efficient. Both player A (1/6 vs. 0/1) and player B (3/6 vs. 2/5), though, individually shoot better from the shorter distance. How is this possible? Well, because the more efficient shooter (B) is shooting the longer shots at a higher rate, it skews the data. To solve this, we need to compare each shot's percentage (always either 100% or 0%) to the expected rate (the player's overall 3P%).

Player A (7 shots, 14.3%): 25 (+85.7), 25 (-14.3), 25 (-14.3), 25 (-14.3), 25 (-14.3), 25 (-14.3), 35 (-14.3)

Player B (11 shots, 45.5%): 25 (+54.5), 25 (+54.5), 25 (+54.5), 25 (-45.5), 25 (-45.5), 25 (-45.5), 35 (+54.5), 35 (+54.5), 35 (-45.5), 35 (-45.5), 35 (-45.5)

Then, we need to take the average conversion rate over expected from each distance (computed from the unrounded percentages):

25 feet: +3.463%

35 feet: -6.926%

After that, we can find the average difficulty of each player's shots:

Player A: +1.979%

Player B: -1.259%

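We can then stick these points on a graph, and we have our answer! In this example, the higher-volume shooter (Player B) has the lower score, meaning harder shots. With only two players the correlation is automatically perfect, but it shows the kind of negative relationship between volume and shot ease that I'm looking for.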

[Chart: shot volume vs. average shot difficulty for Players A and B]
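For readers who would rather script the toy example than work it by hand, here is a minimal Python sketch of the same three steps (per-shot value over expected, average by distance, average by player). The variable names are my own, and the percentages are computed from unrounded values, so treat it as an illustration of the idea rather than the exact spreadsheet workflow.

```python
from collections import defaultdict

# Toy shot log from the example: (player, distance_ft, made)
shots = (
    [("A", 25, 1)] + [("A", 25, 0)] * 5 + [("A", 35, 0)]
    + [("B", 25, 1)] * 3 + [("B", 25, 0)] * 3
    + [("B", 35, 1)] * 2 + [("B", 35, 0)] * 3
)


def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Each player's overall 3P% is the expected value for every one of their shots.
player_pct = {
    p: mean(made for q, _, made in shots if q == p) for p in ("A", "B")
}

# Per-shot value over expected: outcome (1 or 0) minus the shooter's 3P%.
residuals = [(p, d, made - player_pct[p]) for p, d, made in shots]

# Averaging those values at each distance gives the distance's difficulty score.
by_distance = defaultdict(list)
for _, d, r in residuals:
    by_distance[d].append(r)
difficulty = {d: mean(rs) for d, rs in by_distance.items()}

# Average difficulty score of the shots each player actually took.
player_difficulty = {
    p: mean(difficulty[d] for q, d, _ in shots if q == p) for p in ("A", "B")
}

for d in sorted(difficulty):
    print(f"{d} feet: {100 * difficulty[d]:+.3f}%")
for p in sorted(player_difficulty):
    print(f"Player {p}: {100 * player_difficulty[p]:+.3f}%")
```

Because every player's values over expected sum to zero, the shot-weighted scores across both distances also sum to zero, which is a handy sanity check on the arithmetic.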

Obviously, it was going to be much more difficult on a larger scale. The most efficient path would likely require some coding. However, because I don't know any coding languages, I decided to work in Excel and learn as I went along if I didn't know the particular formula I needed. This is the end result of a lot of trial and error. It wasn't as easy as it sounds, nor was my process as clean as it is written. If you aren't interested in the specifics, there will be a TL;DR at the end of the section. 

The Process

The first step was to download the play-by-play data. This would not have been possible without coders writing open-source scripts that they use to "scrape" data from various sites. I used play-by-play data released by the user schmadamco on Kaggle. The data is directly from Basketball-Reference, and I used the latest complete season (2019-20). Of course, it wasn't in the exact format I needed. There was no column that directly showed whether or not a shot was a three-pointer, and the column for makes/misses was still in a text format, which is awkward to work with. The solution to both issues was simple, as I could write a formula asking Excel to check a cell and see if its contents were "3-pt jump shot" for the first and "make" for the second. If so, it would return the number one and, if not, zero.
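In code, the same check is a one-line comparison. The sketch below uses a few made-up rows, and the field names (Shooter, ShotType, ShotOutcome, ShotDist) are assumptions for illustration that may not match the actual headers in the Kaggle file.

```python
# Hypothetical play-by-play rows; the field names are assumptions, not
# necessarily the real column headers in the dataset.
plays = [
    {"Shooter": "Player A", "ShotType": "3-pt jump shot", "ShotOutcome": "make", "ShotDist": 25},
    {"Shooter": "Player A", "ShotType": "2-pt layup", "ShotOutcome": "miss", "ShotDist": 2},
    {"Shooter": "Player B", "ShotType": "3-pt jump shot", "ShotOutcome": "miss", "ShotDist": 35},
    # Non-shot events (rebounds, fouls, ...) have no shot fields at all.
    {"Shooter": None, "ShotType": None, "ShotOutcome": None, "ShotDist": None},
]


def flag(value, target):
    """Return 1 if the cell equals the target text, else 0 (like an Excel IF)."""
    return 1 if value == target else 0


for row in plays:
    row["is_three"] = flag(row["ShotType"], "3-pt jump shot")
    row["made"] = flag(row["ShotOutcome"], "make")

print([(row["is_three"], row["made"]) for row in plays])
```

Turning text into ones and zeros matters because every later step is just an average of these flags.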

Next, I sorted the sheet based on whether or not the value for a three-pointer was one. I copied out all of the three-pointers along with the values for who the shooter was and whether the shot was made. At this point, I needed to know the three-point percentage of each player. Instead of having to manually plug this data in another way, though, I could thankfully use a feature of Excel known as a pivot table to get the work done for me. Pivot tables allow you to summarize one column of data grouped by another. For this, I needed it to take the average of the "make" value for each player. Because all makes are shown as a one and all misses as a zero, this average is the player's three-point percentage.

[Image: pivot table of three-point percentage by player]
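The pivot table step amounts to a group-by average. Here is a small Python sketch with made-up players and outcomes (none of it is real data) showing that averaging the 1/0 make flag per shooter gives the shooter's three-point percentage.

```python
from collections import defaultdict

# Hypothetical three-point attempts: (shooter, made), with made as 1 or 0.
threes = [
    ("Player A", 1), ("Player A", 0), ("Player A", 0), ("Player A", 0),
    ("Player B", 1), ("Player B", 1), ("Player B", 0), ("Player B", 0),
]

makes = defaultdict(int)
attempts = defaultdict(int)
for shooter, made in threes:
    makes[shooter] += made      # makes count as 1, misses as 0
    attempts[shooter] += 1

# The mean of the make flag is the three-point percentage.
three_pt_pct = {shooter: makes[shooter] / attempts[shooter] for shooter in attempts}

print(three_pt_pct)
```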

Even with the three-point percentages, though, it would still be a lot of work to manually plug in the value to every shot. Thankfully, there is an Excel command that takes a cell, references a table, and finds an identical cell in the left-most column of that table. Then, it takes the value from the same row of a column that you select. Doing this, I could add a three-point percentage value for every shot based on the player who took it. Here is a small slice of the data I now had:

[Image: sample of the shot data with each shooter's 3P% added]
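The lookup described above behaves like Excel's VLOOKUP, though the function isn't named here, so that identification is an assumption. In Python the equivalent is a dictionary lookup; the players and percentages below are invented for illustration.

```python
# Hypothetical lookup table: shooter -> overall 3P% (as produced by the
# pivot-table step). Values are made up.
three_pt_pct = {"Player A": 0.25, "Player B": 0.50}

# Hypothetical shot rows that still lack the shooter's percentage.
shots = [
    {"Shooter": "Player A", "made": 1},
    {"Shooter": "Player B", "made": 0},
    {"Shooter": "Player A", "made": 0},
]

# For each shot, find the shooter in the table and copy over their 3P%.
for shot in shots:
    shot["player_pct"] = three_pt_pct[shot["Shooter"]]

print(shots)
```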

Next, as in the example, I compared the make value to that player's 3P%. This forms 3-point Percentage Over Expected, or 3POE. As in the example, each shot's value is either one or zero minus the player's 3P%, so individual values look odd, but they average out nicely in the long run. After finding the 3POE for every shot, I used another pivot table to take the averages by distance (I also capped every distance beyond 47 feet, or half-court, at 47 to help with sample size issues). Then, I smoothed the results by taking the mean of the score for each distance and the two distances directly bordering it. Visualized in this chart are both the raw and smoothed values:

[Chart: raw and smoothed 3POE by shot distance]
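Here is a rough Python sketch of this step: per-shot 3POE, the 47-foot cap, the average by distance, and the three-distance smoothing. The shots are invented, and how the smoothing treats the shortest and longest distances (where one neighbor is missing) isn't specified above, so averaging whatever neighbors exist is my assumption.

```python
from collections import defaultdict

MAX_DIST = 47  # distances beyond half-court are capped at 47 feet

# Hypothetical shots: (distance_ft, made, shooter's overall 3P%)
shots = [
    (23, 1, 0.40), (23, 0, 0.40), (24, 1, 0.35), (24, 0, 0.35),
    (25, 0, 0.40), (25, 1, 0.35), (26, 0, 0.35), (60, 0, 0.40),
]

# Per-shot 3POE (outcome minus expected), grouped by capped distance.
poe_by_dist = defaultdict(list)
for dist, made, pct in shots:
    poe_by_dist[min(dist, MAX_DIST)].append(made - pct)

# Raw score for each distance: the plain average of its 3POE values.
raw = {d: sum(v) / len(v) for d, v in sorted(poe_by_dist.items())}


def smooth(raw_scores):
    """Average each distance with its neighbors one foot either side.

    Assumption: at the edges, or where a neighboring distance has no
    shots, only the distances that exist are averaged.
    """
    smoothed = {}
    for d in raw_scores:
        window = [raw_scores[n] for n in (d - 1, d, d + 1) if n in raw_scores]
        smoothed[d] = sum(window) / len(window)
    return smoothed


smoothed = smooth(raw)
for d in raw:
    print(f"{d} ft  raw {raw[d]:+.3f}  smoothed {smoothed[d]:+.3f}")
```

Note that this is an unweighted mean of three distance scores, as described, so a distance with very few attempts pulls on its neighbors as much as a popular one does.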

It definitely makes intuitive sense, which is great. Shots get progressively more difficult as players move away from the basket until a steep drop-off where players go from true shots to what are likely mostly heaves. Because those shots depend more on luck than skill, the percentages plateau. While this range is noisy even in the smoothed version, I decided to leave it as-is because these shots shouldn't meaningfully move a player's degree of difficulty, except in very small samples.

After that, I had to link the smoothed difficulty to the distances on a shot-by-shot level. To do this, I repeated the same process I used to link the 3P% to each player. I used the exact same formula in the exact same way, and it ended up giving the difficulty for every distance, which would be used in the next step to determine the overall difficulty of a particular player's shots. Here is a slice of what that looked like:

[Image: sample of the shot data with the smoothed difficulty for each distance added]

I was now just one pivot table away from being able to graph the final result. All I did was tabulate the average difficulty by player along with each player's volume of shots. Because I am more comfortable with the interface and slightly prefer how the graphs look, especially scatter plots with a bunch of points, I copied the data over to Google Sheets. First I graphed every player and their average shot difficulty, then I eliminated players who took fewer than 72 threes (one per game for the median team) so that the data could be more easily understood. Here are both charts:

[Chart: average shot difficulty vs. three-point attempts, all players]
[Chart: average shot difficulty vs. three-point attempts, players with at least 72 attempts]
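The last two steps (attach the smoothed difficulty to each shot, then average by player and count attempts) can be sketched as follows. The difficulty values, players, and shot counts are all invented; only the 72-attempt cutoff comes from the text.

```python
from collections import defaultdict

MIN_ATTEMPTS = 72  # cutoff used for the second chart

# Hypothetical smoothed difficulty by distance (positive = easier than average).
difficulty = {23: 0.02, 24: 0.01, 30: -0.03}

# Hypothetical three-point attempts: (shooter, distance_ft)
shots = (
    [("Player A", 23)] * 80
    + [("Player B", 30)] * 100
    + [("Player C", 24)] * 10
)

# Look up the difficulty of every shot and collect the values by shooter.
scores = defaultdict(list)
for shooter, dist in shots:
    scores[shooter].append(difficulty[dist])

# One row per player: volume and average shot difficulty.
summary = {
    shooter: {"attempts": len(vals), "avg_difficulty": sum(vals) / len(vals)}
    for shooter, vals in scores.items()
}

# Drop low-volume players, whose averages rest on very few shots.
qualified = {s: row for s, row in summary.items() if row["attempts"] >= MIN_ATTEMPTS}

for shooter, row in qualified.items():
    print(shooter, row["attempts"], f"{100 * row['avg_difficulty']:+.2f}%")
```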

TL;DR: I downloaded play-by-play data, then formatted it so it was usable. Then I got rid of all the data points that weren't attempted threes and used a feature of Excel to calculate every player's 3P%. After that, I added a 3P% for every shot based on the player who took it and compared the actual shooting percentage (either 100% or 0%) to that. I took the average of what I just calculated for each distance to decide the difficulty of a shot from there. Then, I calculated the average shot difficulty by player and used it to make the charts you can see above.

Results

I was really happy with how this turned out. You can see that there is an imperfect but clear trend towards more difficult shots for high-volume shooters, supporting my hypothesis. 59.5% of players who shot fewer than 200 threes have an average shot difficulty higher than zero (easier than average); this number drops to 53.3% for those shooting 201-400, and all the way to 42.6% for players who shot more than 400 threes in 2019-20. Overall, more than 55% of players shot easier than average. This may seem counterintuitive, but it's similar to the difference between medians and means: averaging across players gives a different result than averaging across shots because much of the shooting load is handled by a select few.
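To see how most players can be "easier than average" at once, here is a tiny Python sketch with invented numbers. Three low-volume players take slightly easy shots and one high-volume player takes slightly hard ones; the average over shots is zero while the average over players is positive.

```python
# Hypothetical players: name -> (three-point attempts, average shot difficulty).
# Positive difficulty means easier-than-average shots. All values are made up.
players = {
    "Low volume 1": (50, 0.010),
    "Low volume 2": (80, 0.005),
    "Low volume 3": (100, 0.010),
    "High volume": (950, -0.002),
}

# Share of players whose shots are easier than average.
share_easier = sum(1 for _, diff in players.values() if diff > 0) / len(players)

# Averaging across players: every player counts once.
player_mean = sum(diff for _, diff in players.values()) / len(players)

# Averaging across shots: every player counts in proportion to attempts.
total_attempts = sum(att for att, _ in players.values())
shot_mean = sum(att * diff for att, diff in players.values()) / total_attempts

print(f"share easier than average: {share_easier:.0%}")
print(f"mean over players: {100 * player_mean:+.3f}%")
print(f"mean over shots:   {100 * shot_mean:+.3f}%")
```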

The data also supports the consensus about certain players around the league. For example, the two qualifiers shooting the most difficult shots were Trae Young (-2.37%) and Damian Lillard (-2.00%). They both love to launch deep threes and highlight a general trend visible in the data. Players who are the focal points of their offenses are likely to have a higher difficulty rating, regardless of how many shots they take (at least among those who already have a high shot volume). By manually looking up the 2020 usage rate (USG%) of players who took at least 500 threes that year, we can build this chart:

[Chart: usage rate vs. average shot difficulty for players with at least 500 three-point attempts]

Even without using defender distance, which would probably show the trend even better, we can see that there is an obvious connection between load and shot difficulty. The more of an offensive load a player handles, the harder, on average, their shots are. This is unlikely to hold for players who don't take a ton of threes, as the difficulty of their shots likely depends almost entirely on defender distance rather than shot distance. Intuitively, this trend makes sense for high-volume shooters, who often fall into two categories. There are the Damian Lillards of the NBA who have offenses revolving around getting them open anywhere downtown. Then, there are players like Duncan Robinson who shoot a ton of efficient shots because they benefit from the gravity of the true stars in their offense.

There are two lower-volume outliers that stand out because of how far they are from everyone else. The first is Montrezl Harrell. Of the twelve players who took more difficult threes on average than Trae Young, eleven took ten shots or fewer. Harrell, though, took twenty-three and still finished fifth among those twelve. Ten of his shots (43.5%) were from 40+ feet. Damian Lillard, who is not only known for launching deep threes but also took more than thirty times as many shots as Harrell, took just five from that range. You have to respect players who are willing to let it fly from deep when their team would benefit from it.

On the exact opposite end of the spectrum is P.J. Tucker. This isn't exactly surprising, as Tucker is known to be a corner-three specialist. But just how much he stands out is still kind of crazy. Because of how the metric is constructed, the maximum possible average shot difficulty is +3.1%, which would mean taking every shot from the shortest three-point distance. Going one foot farther from the basket drops that to +2.6%. If you took every shot just two feet behind the line in the corner (equivalent to being right behind the line anywhere else), you would end up with an average shot difficulty of +1.89%. Tucker's is +1.92%. That takes serious scheming and is a great example of the extremes of the Morey-era Rockets.

[Image]

One final thing I wanted to do was adjust the 3P% of all of these players based on their 3POE. What 3POE essentially measures is how much above or below average any given player might shoot if they took the same shots. It isn't perfect, but by subtracting a player's 3POE from their actual 3P%, we can roughly adjust for difficulty and get an idea of what a player might shoot with a perfectly average range of shots. Subtraction is used because negative scores mean harder shots and therefore those players should be given a boost. Of course, this is far from an objective ranking of the best three-point shooters, but I think it might be a slightly better indicator than normal 3P%.

[Image: three-point percentages adjusted for shot difficulty]
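The adjustment itself is a single subtraction per player. The sketch below uses three invented shooters (the names and numbers are hypothetical, not taken from the data) to show how subtracting the average shot difficulty can reorder players who are close in raw 3P%.

```python
# Hypothetical players: name -> (raw 3P%, average shot difficulty).
# Negative difficulty means harder-than-average shots. All values are made up.
players = {
    "Deep shooter": (0.360, -0.020),
    "Corner specialist": (0.370, 0.019),
    "Average mix": (0.365, 0.000),
}

# Adjusted 3P% = raw 3P% minus average shot difficulty, so harder shot
# diets (negative scores) get a boost and easier ones get a penalty.
adjusted = {name: pct - diff for name, (pct, diff) in players.items()}

raw_rank = sorted(players, key=lambda name: players[name][0], reverse=True)
adj_rank = sorted(adjusted, key=adjusted.get, reverse=True)

print("raw ranking:     ", raw_rank)
print("adjusted ranking:", adj_rank)
```

Since only shot distance feeds the difficulty score, the adjustment ignores everything else that makes a shot hard, such as defender distance or shooting off the dribble.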

All things considered, it's a relatively minor adjustment. There is little deviation between the real 3P% and the adjusted 3P%, and this is once again primarily because I'm only considering shot distance. Players shooting this many threes are unlikely to deviate too far from an average distance. However, players on the extreme ends see some relatively major re-calculations. Trae Young goes from a below-average shooter (40th percentile) to a good one (66th). Damian Lillard goes from great (83rd) to elite (94th). P.J. Tucker sinks to a flat-out bad shooter (18th) when the unadjusted version thinks he's alright (38th). In total, 54 of 167 players see their percentile adjusted by at least five points. The rest stay in about the same place.

None of this is definitive, and it didn't make me wildly reconsider how the NBA works. However, I think it was an interesting look at both answering my original question and applying raw data to elegantly solve a problem. There was a clear trend in the direction I thought there would be, which is always nice, and it supported other basic intuitive ideas that I and others have about the NBA. It was definitely cool to go from a long incomprehensible list of plays to interesting charts that let me visualize certain trends. Overall, this was a great experience and something I may try again with a different problem.