Using Hexagonal Binning and GIS to Analyse Shooting Data

Over the past few years the process of hexagonal binning has been popping up everywhere in the data visualisation world. It's a great solution for representing density when working with large datasets made up of point data, which is why using hex bins for basketball shot charts seems like a natural progression. Geographer/journalist/analyst Kirk Goldsberry brought the use of hex bins into the basketball fan's mind with his groundbreaking 2012 presentation at the Sloan Sports Analytics Conference, which you've no doubt all seen by now.

The theory behind the visualisation process is relatively simple - a grid of hexagons (or any other shape) is created across a surface containing the point data. Each hexagon forms a kind of 'bin', and a count of the total number of points that fall within that bin is added to the bin's data.
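If you want to see the idea in action before building anything court-specific, matplotlib's hexbin function does exactly this counting in one call. A toy sketch with made-up points, not my shot data:

```python
# Toy demonstration of hexagonal binning: lay a hexagon grid over a point
# cloud and colour each cell by how many points fall inside it.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.normal(size=1000), rng.normal(size=1000)  # made-up point data

plt.hexbin(x, y, gridsize=20)          # one hexagon 'bin' per grid cell
plt.colorbar(label="points per bin")   # colour encodes the count per cell
plt.show()
```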

Why hexagons? Let's have a quick geometry refresher. Ideally the bin being used should best spatially represent the points within it. This means that the centre of the chosen shape should be as close as possible to every point within the shape. The perfect solution for this is a circle, because for a given size, points near the edge of a circle are closer to its centre than points near the edge of any other shape. Below is an example of this, an 85px circle compared with an 85px square. The furthest point in the circle is closer to the centre than the furthest point in the square:

The problem with circles, however, is that they don't fit nicely together in a grid. You'll need something with straight edges and equal areas to create a uniform grid (otherwise known as a regular tessellation) across the surface. Only triangles, squares and hexagons fit these criteria. Triangles and squares have sharper corners than hexagons (interior angles of 60° and 90°, versus 120° for a hexagon), which means their corners sit further from their centres than those of a hexagon of the same area. Points that fall within a given hexagon are therefore likely to be closer to the centre of their shape than with any other available option.
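To put rough numbers on the corner-distance argument, here's a quick calculation (my own illustration, not from the original analysis) of how far the furthest interior point sits from the centre for shapes of equal area:

```python
# For shapes of equal area, how far can a point inside be from the centre?
# (Distance from centre to the furthest corner, in units of sqrt(area).)
import math

area = 1.0

# circle: r = sqrt(A / pi)
circle = math.sqrt(area / math.pi)

# square: side s = sqrt(A), furthest point is the corner at s / sqrt(2)
square = math.sqrt(area) / math.sqrt(2)

# equilateral triangle: A = (sqrt(3)/4) s^2, circumradius = s / sqrt(3)
tri_side = math.sqrt(4 * area / math.sqrt(3))
triangle = tri_side / math.sqrt(3)

# regular hexagon: A = (3*sqrt(3)/2) s^2, circumradius = side length s
hexagon = math.sqrt(2 * area / (3 * math.sqrt(3)))

for name, d in [("circle", circle), ("hexagon", hexagon),
                ("square", square), ("triangle", triangle)]:
    print(f"{name:9s} {d:.3f}")
# circle    0.564
# hexagon   0.620
# square    0.707
# triangle  0.877
```

Of the three shapes that tessellate, the hexagon's worst case is closest to the circle's.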

Also, they look cooler.

Now that's out of the way, we can get on to the fun stuff.

In order to use hex bins to analyse shot data, you'll need a couple of input datasets. The most important, of course, is the shot locations themselves. I've covered how I extracted these for the NBL in my earlier blog, so give that a read if you're interested in the process. Secondly, you'll need a basketball court surface which has been divided up into hexagons:

I created mine using QGIS, an open-source Geographic Information System (GIS) freely available on the web. I found the Python-based MMQGIS plugin for QGIS a great solution for making the grid. There are a number of other ways to create a grid surface in other software packages, such as ArcGIS - but the MMQGIS plugin was the quickest and easiest, and it's free. There are a few parameters available when creating the hexagons, the most significant of which is the size of each cell. I've made mine larger than Goldsberry's, as there are not as many shots in the NBL (due to season length etc.) as in the NBA, so it's not possible to be as detailed with the data.
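If you'd rather script the grid than use a plugin, here's a rough Python sketch of the same idea using Shapely - a flat-topped hexagon grid over a court-sized rectangle. The dimensions and cell size below are placeholder assumptions, not the values I actually used:

```python
# Generate a flat-topped hexagon grid covering a rectangular surface.
import math
from shapely.geometry import Polygon

def hexagon(cx, cy, r):
    """Flat-topped hexagon with circumradius r, centred on (cx, cy)."""
    return Polygon([
        (cx + r * math.cos(math.radians(60 * i)),
         cy + r * math.sin(math.radians(60 * i)))
        for i in range(6)
    ])

def hex_grid(width, height, r):
    """Cover a width x height rectangle with a flat-topped hex grid."""
    dx = 1.5 * r              # horizontal spacing between column centres
    dy = math.sqrt(3) * r     # vertical spacing between row centres
    cells, col, x = [], 0, 0.0
    while x <= width + r:
        y = dy / 2 if col % 2 else 0.0   # offset every second column
        while y <= height + r:
            cells.append(hexagon(x, y, r))
            y += dy
        x += dx
        col += 1
    return cells

# e.g. a 15m x 14m half court with 1m cells -- illustrative values only
grid = hex_grid(15.0, 14.0, 1.0)
```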

Once the court has been decorated with lovely hexagonal tiles, the next step is to count how many shots fall inside each. There are numerous ways to do this and the aforementioned QGIS is one solution. You can use the Points in Polygon tool to achieve this. My shooting database contains a field for shot results, with a 1 being a made shot and a 0 being a miss. This means that while I'm performing this point count, I can ask the tool to aggregate (sum) this field to get the number of made shots in the cell:
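For those who prefer to script this step, the same count-and-sum can be done with a spatial join. Here's a minimal sketch using GeoPandas; the file and field names are my own assumptions:

```python
# Count total and made shots per hexagon via a spatial join.
# "made" is 1 for a made shot and 0 for a miss, as in my database.
import geopandas as gpd

hexes = gpd.read_file("court_hex_grid.shp")   # the hexagon grid
shots = gpd.read_file("shots.shp")            # one point per shot

# attach each shot to the hexagon it falls within
joined = gpd.sjoin(shots, hexes, how="inner", predicate="within")

# total attempts (count) and made shots (sum) in each cell
counts = joined.groupby("index_right").agg(
    attempts=("made", "size"),
    made=("made", "sum"),
)

hexes = hexes.join(counts)
hexes["pct"] = hexes["made"] / hexes["attempts"]
```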

The issue here is that it's very time-consuming. Yes, running this Points in Polygon tool will give you a count of how many shots (total and made) are in each hex, but you'll need to do it for every team, and on the entire league-wide dataset too, then compare the league data with each team's data to see if the team is shooting above or below league average in any given hexagon. Yawn.

For my purposes, I'd also need to repeat the whole process at varying points in the season to keep the data updated.

To get around this I've used Safe Software's FME to process the entire league's data at once and spit out some shapefiles (a basic vector storage format for geographic features). FME is essentially data manipulation software: you can put in data in numerous formats, process it using a bunch of transformation tools, then output it in another format. The workbench I've created looks huge, but it's really pretty simple:

As input, the FME workbench takes a .csv file containing the xy coordinates for every shot, the team that took the shot, whether it was a miss or a make, etc. It also grabs the hex grid I created earlier in QGIS as a shapefile.

The first step is to split the csv file out by team, then count each individual team's shot attempts and shots made per hexagon. The workbench then adds the league-wide attempts and shots made for each cell to the team's file so a comparison can be made. There are a few other tricky little bits here and there, but that's essentially all it does. We're left with a shapefile containing each team's shooting percentage and the league average shooting percentage within each hex. Here's a small section of Adelaide's data as an example:
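For anyone curious, the core of that logic boils down to a few grouped counts and a merge. Here's an illustrative pandas sketch (not the actual workbench; it assumes each shot has already been tagged with its hexagon's id, as in the spatial join earlier):

```python
# Per-team and league-wide counts per hexagon, merged for comparison.
# Column names are assumptions for illustration.
import pandas as pd

shots = pd.read_csv("shots.csv")  # columns: team, made (1/0), hex_id

# league-wide attempts and makes in each cell
league = shots.groupby("hex_id").agg(
    lg_attempts=("made", "size"), lg_made=("made", "sum"))

# each team's attempts and makes in each cell
teams = shots.groupby(["team", "hex_id"]).agg(
    attempts=("made", "size"), made=("made", "sum")).reset_index()

# merge so every team row carries the league numbers for the same cell
teams = teams.merge(league, on="hex_id")
teams["pct"] = teams["made"] / teams["attempts"]
teams["lg_pct"] = teams["lg_made"] / teams["lg_attempts"]
teams["UnderOver"] = teams["pct"] - teams["lg_pct"]  # used later for colour
```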

FME allows me to save that process and run it at intervals throughout the season. It takes about 15 seconds each time, rather than hours of manual work.

Now that I have the data pre-processing completed, it's time to visualise.

One advantage of using bins to aggregate data is the ability to create a multivariate visualisation. This means I have the ability to display two statistical measures on the one chart, in this case shot accuracy and shot frequency.

Shot accuracy is best represented by giving each hexagon a colour relating to its accuracy value. Hotter colours such as oranges or reds can symbolise 'hot' shooting spots, and cooler colours like blues can be used to symbolise 'cold' spots, for example. Instead of displaying the raw shooting percentage for each team in a given cell though, I've decided to show how the team's percentage compares to the league average in that spot. I believe this creates a more informative picture of a team or player's shooting performance. Everyone knows that players shoot better closer to the hoop; it's far more interesting to know how players shoot compared with other players from the same spot.

You can see in the image of the database above that I've created a field called UnderOver, which contains the difference between the team's shooting percentage and the league average shooting percentage. A value of 0 in this field means the team shoots exactly the league average at this spot, a negative value means they shoot below average, and a positive value means they shoot better than average. The structure of this field means it's best represented with a diverging colour scheme - one that shows progression out from a midpoint (the league average) in either direction.
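As a sketch of how that diverging scheme maps values (my own matplotlib illustration, not the symbology tool I actually used; the ±0.15 limits are an assumed range):

```python
# Map UnderOver through a diverging ramp centred on 0 (league average).
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

norm = TwoSlopeNorm(vmin=-0.15, vcenter=0.0, vmax=0.15)  # assumed range
cmap = plt.get_cmap("RdBu_r")   # reds above the midpoint, blues below

print(cmap(norm(0.10)))    # well above league average -> a 'hot' red
print(cmap(norm(-0.10)))   # well below league average -> a 'cold' blue
print(cmap(norm(0.0)))     # exactly league average -> the neutral midpoint
```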

The other variable I want to show is shot frequency. This can be done by scaling the size of the hexagon. When I created my hexagon dataset in FME, I created two different output files for each team. The first contains the same polygons (hex bins) that went into the process, but with the additional shooting data merged with them. The second contains a point dataset with identical data to the hex bins. This was done by asking FME to create a point at the centroid of each hexagon and transferring all the data to that point. The end result is something that looks like this (there are only points in cells that actually contain shot data):

The reason for creating this point grid rather than using the original hexagons is that I'm now able to scale these points based on the number of shot attempts contained in each one - something not possible with a fixed hexagon grid made from connected polygons.
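Scripted, that centroid step looks roughly like this (a GeoPandas approximation of the FME step; file and field names are assumptions):

```python
# One point per hexagon, at the centroid, carrying the same attributes;
# keep only cells that actually contain shot data.
import geopandas as gpd

hexes = gpd.read_file("team_hex_bins.shp")

points = hexes.copy()
points["geometry"] = points.geometry.centroid
points = points[points["attempts"] > 0]   # drop empty cells

points.to_file("team_hex_centroids.shp")
```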

(I'm going to use ESRI's ArcGIS software for the remainder of the visualisation process, but you can do this in QGIS or any number of other GIS software packages too - it all depends on what you're most comfortable with, I guess.)

The benefit of keeping all the data within a GIS is that it's all georeferenced, meaning every software package used in the process will keep the points, grid and court background in the same spots - no need to realign each time you switch programs.

Using ArcGIS's Symbolise by Multiple Attributes function, I first changed the point symbol to a hexagon shape, then used the Variation by Symbol Size tool to scale each point based on the number of shots that fall within it. The points which contain the most data scale all the way out to fill the original hexagon; points containing fewer shots scale only part of the way. When this is combined with the colour scale used to represent shot accuracy, a picture begins to form showing shot tendencies:

Chart showing only shot accuracy on the left | Chart showing both shot accuracy and shot frequency on right
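For those not working in ArcGIS, here's a rough matplotlib approximation of the same combined symbology (the file name, field names and scaling constant are my own assumptions):

```python
# Hexagon-shaped markers at the cell centroids: colour encodes UnderOver,
# size encodes attempts. An approximation of the ArcGIS symbology.
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

points = gpd.read_file("team_hex_centroids.shp")

fig, ax = plt.subplots(figsize=(8, 7))
ax.scatter(points.geometry.x, points.geometry.y,
           s=600 * points["attempts"] / points["attempts"].max(),
           marker="h",                              # hexagonal marker
           c=points["UnderOver"], cmap="RdBu_r",    # diverging colour ramp
           norm=TwoSlopeNorm(vmin=-0.15, vcenter=0.0, vmax=0.15))
ax.set_aspect("equal")
plt.show()
```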

A note on scaling: as you would expect, layups, dunks, alley-oops and tip-ins make up the majority of the data for a team. This means that a large portion of the shot count falls in the cell which coincides with the hoop itself. A problem then arises when scaling the points based on shot frequency, as there might be 400 shots in the cell covering the hoop, but only 50-60 in others. You'll end up with something pretty useless that looks like this:

It doesn't really tell you much at all - it's pretty obvious that's what a shot chart of raw frequencies would look like. The solution I've found is to use logarithmic normalisation to process the data prior to displaying it. Log transformations are used to make highly skewed data less so, which makes it easier to spot patterns and trends. Essentially the transformation takes those cells containing extreme values, such as the ones around the hoop, and puts them, along with every other cell, into a range between 0 and 1. Fortunately ArcGIS has a built-in parameter for doing exactly this. This setting makes the shot frequency distribution less skewed and helps to reveal trends in the data that would otherwise be hidden.
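In script form, the effect looks something like this (a sketch of the concept; ArcGIS's built-in option may differ in detail):

```python
# Log-normalise attempt counts into a 0-1 range before sizing markers.
import numpy as np

attempts = np.array([400, 60, 50, 5])   # e.g. the rim cell vs other cells

linear = attempts / attempts.max()                      # raw scaling
logged = np.log1p(attempts) / np.log1p(attempts.max())  # log scaling

print(linear.round(3))  # [1.    0.15  0.125 0.013] - everything tiny vs rim
print(logged.round(3))  # [1.    0.686 0.656 0.299] - differences visible
```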

This is an ongoing project for me; there is still a way to go in terms of creating a product that has real analytical value. As I capture more and more data, these charts will continue to improve - 28 games per team in an NBL season means that even at the end of the season the data is still relatively volatile. The data captured over the course of one season is not enough to properly assess a player's shot chart, so for now I have stuck to teams. Player shot charts are the real goal though, and something I plan on implementing next season. I also don't think I've yet found the optimal size for the hexagons; this is something I'll be looking into over the off-season. For now though, here is an example of what my final shot maps look like: