Using Pandas and Python to Merge Basketball Lineup Data

Over the past few weeks I've been working on a number of improvements to the SpatialJam Lineup-O-Matic. My previous blog post covers how the tool was originally created; the outcome was usable, but a bit clunky and cumbersome for users to navigate efficiently. This is because the interface relied on the data visualization tool Tableau to do much of the processing and heavy lifting. While Tableau is a great platform, asking it to process hundreds of thousands of rows of play-by-play data was unrealistic.

To work around this, I used Python together with the open-source library Pandas to process the data before pushing it to the Tableau servers.

I mentioned in my previous post some of the basic ideas SpatialJam uses to extract the lineup data from the FIBA Live Stats interface, and while this process has been amended slightly to create a cleaner output, the principles remain the same. The biggest improvement in the new script is its ability to understand whether a particular lineup is playing offense or defense and record the statistics accordingly. This means that for every line of play-by-play (for example a shot attempt or an assist occurring), two lines of data are now added to the database: one for the lineup that recorded the statistic and one for the lineup the statistic was recorded against.

The resulting database is then processed using Pandas, which allows the grouping of lineups and merging of statistics before pushing to the Tableau interface.

Pandas can quickly process large datasets by creating DataFrame objects. SpatialJam's tool uses this functionality to group every line of play-by-play attributed to a particular lineup and to both sum and calculate statistics.

Firstly, the input file is read - in this case a CSV file containing all the lineup data extracted by the previous tool:

import pandas as pd
import numpy as np
#Read CSV
inputFile = pd.read_csv('C:\\...\\NBL\\combined.csv', encoding = "ISO-8859-1", low_memory=False)

Once imported, Pandas can aggregate the data using the groupby function:

df = inputFile.groupby([columns_to_group_by]).agg(aggregation_method)

In the case of the lineup data, we want to group the data by a number of columns: the lineup itself (a string field containing all 5 players on the court separated by the ' | ' character), the team they play for and the team they were playing against. Because we're factoring in the opponent here, a particular lineup will potentially appear in the final data multiple times if they played against a number of different opponents - Tableau will take care of this for us.

Instead of aggregating just a single field, Pandas can work with many fields at once, which is useful in this instance as we want to summarise a number of statistics that each lineup has recorded, not just +/-:

aggr = {
    '(+/-)': 'sum',
    '2PA': 'sum',
    '2PM': 'sum',
    '3PA': 'sum',
    '3PM': 'sum',
    'FGA': 'sum',
    'FGM': 'sum',
    'FTA': 'sum',
    'FTM': 'sum',
    'dREB': 'sum',
    'oREB': 'sum',
    'AST': 'sum',
    'STL': 'sum',
    'BLK': 'sum',
    'TOV': 'sum',
    'PTS': 'sum',
    'o2PA': 'sum',
    'o2PM': 'sum',
    'o3PA': 'sum',
    'o3PM': 'sum',
    'oFGA': 'sum',
    'oFGM': 'sum',
    'oFTA': 'sum',
    'oFTM': 'sum',
    'odREB': 'sum',
    'ooREB': 'sum',
    'oAST': 'sum',
    'oSTL': 'sum',
    'oBLK': 'sum',
    'oTOV': 'sum',
    'oPTS': 'sum'
}
out = inputFile.groupby(['lineup', 'team', 'opponent', 'P1', 'P2', 'P3', 'P4', 'P5']).agg(aggr)

Here Pandas sums each statistic across all rows where the grouped fields match.
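As a minimal, self-contained illustration of this grouping step - using hypothetical lineups and just two of the statistics from the full list:

```python
import pandas as pd

# Three play-by-play rows: two for one lineup, one for another.
pbp = pd.DataFrame({
    'lineup':   ['A|B|C|D|E', 'A|B|C|D|E', 'V|W|X|Y|Z'],
    'team':     ['Cairns', 'Cairns', 'Perth'],
    'opponent': ['Perth', 'Perth', 'Cairns'],
    'PTS':      [2, 3, 2],
    'oPTS':     [0, 2, 3],
})

# Each statistic is summed across every row that shares the same
# lineup / team / opponent combination.
out = pbp.groupby(['lineup', 'team', 'opponent'], as_index=False).agg(
    {'PTS': 'sum', 'oPTS': 'sum'})
print(out)
```

The two Cairns rows collapse into a single lineup row with PTS = 5, while the Perth lineup keeps its single row.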

At this stage we also want to calculate a number of metrics not processed in the initial pull of data from the FIBA system. To do this, a new field is created, the formula for a particular metric is calculated and the result is added to each lineup row:

#Calculate Possessions
out['oPOS'] = (.96*(out['FGA']+out['TOV']+.44*out['FTA']-out['oREB']))
out['dPOS'] = (.96*(out['oFGA']+out['oTOV']+.44*out['oFTA']-out['ooREB']))

#Calculate Efficiency
out['oEFF'] = ((out['PTS']/(.96*(out['FGA']+out['TOV']+.44*out['FTA']-out['oREB'])))*100)
out['dEFF'] = ((out['oPTS']/(.96*(out['oFGA']+out['oTOV']+.44*out['oFTA']-out['ooREB'])))*100)

#Calculate Metrics
out['eFG%'] = ((out['FGM']+0.5*out['3PM'])/out['FGA'])
out['oeFG%'] = ((out['oFGM']+0.5*out['o3PM'])/out['oFGA'])
out['2P%'] = (out['2PM']/out['2PA'])
out['o2P%'] = (out['o2PM']/out['o2PA'])
out['3P%'] = (out['3PM']/out['3PA'])
out['o3P%'] = (out['o3PM']/out['o3PA'])
out['TOV%'] = (out['TOV']/(out['FGA']+.44*out['FTA']+out['TOV']))
out['oTOV%'] = (out['oTOV']/(out['oFGA']+.44*out['oFTA']+out['oTOV']))
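One caveat worth noting, not handled in the original script: any lineup with zero attempts in a category makes these ratio metrics divide by zero. Pandas yields inf (or NaN for 0/0) rather than raising, which can put junk values into Tableau. A small hedged cleanup sketch:

```python
import numpy as np
import pandas as pd

# A tiny frame where the second lineup attempted no threes --
# dividing would otherwise produce inf or NaN values.
out = pd.DataFrame({'3PM': [2, 0], '3PA': [5, 0]})
out['3P%'] = out['3PM'] / out['3PA']

# Replace any infinities with NaN so downstream tools treat the
# cell as 'no data' rather than charting a nonsense percentage.
out['3P%'] = out['3P%'].replace([np.inf, -np.inf], np.nan)
print(out)
```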

And that's about it. Pandas makes processing this large dataset effortless and quick, and Tableau is now left to think only about visualizing the data, without having to do any of the intensive lineup grouping or metric calculation.

The resulting dataset can be seen here

Of Note:

The +/- statistic is also now recorded during the initial data extract, something that was not possible with version 1 of the Lineup-O-Matic. Every time points are scored by a given lineup, the value of those points (1, 2 or 3pts) is recorded for that lineup (+) and against (-) the defensive lineup currently on the floor. There is of course an issue that arises with this approach: it's not uncommon for a substitution to occur prior to or during free throw shooting, meaning the points for and against will be attributed to the incorrect players. Consider the example play by play below:

Player 1 fouls Player A in the act of shooting
Player 1 is subbed out for Player 3
Player A makes both free throws
Play continues...

In this case the (-2) points would be attributed against Player 3, as he/she was on the court when the free throws occurred, rather than against Player 1, who committed the foul that led to the (-2) points.

This is a problem I'm yet to solve, and it will require some complex logic to work around. While it will only result in a small and probably insignificant difference in the lineup data, it's still something that will require work.
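One possible approach - sketched here as an untested idea, not the actual fix - is to buffer any substitution that arrives during a trip to the free throw line and apply it only after the final free throw, so the points are attributed against the lineup that was on the floor when the foul occurred. The event format below is entirely hypothetical:

```python
def attribute_free_throws(events):
    """events: dicts with a 'type' key and type-specific fields (hypothetical schema)."""
    on_court = set()
    pending_subs = []      # substitutions held while free throws are in progress
    at_line = False
    credited = []          # (frozenset of on-court players, points) pairs
    for ev in events:
        if ev['type'] == 'starters':
            on_court = set(ev['players'])
        elif ev['type'] == 'shooting_foul':
            at_line = True
        elif ev['type'] == 'substitution':
            if at_line:
                pending_subs.append(ev)          # defer until the trip ends
            else:
                on_court.discard(ev['player_out'])
                on_court.add(ev['player_in'])
        elif ev['type'] == 'freethrow':
            # credit against the lineup on court when the foul occurred
            credited.append((frozenset(on_court), ev['points']))
            if ev['last']:
                at_line = False
                for sub in pending_subs:         # now apply the held subs
                    on_court.discard(sub['player_out'])
                    on_court.add(sub['player_in'])
                pending_subs.clear()
    return credited
```

On the example above, both made free throws would now be charged against the lineup containing Player 1, with Player 3 only entering the bookkeeping once the trip to the line is complete.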

SpatialJam - Behind the Scenes [Part Two]

In this post I wanted to give a quick bit of insight into how I've gone about creating SpatialJam's Lineup-O-Matic.

The Lineup-O-Matic essentially displays all the five-man combinations used by a team and calculates the statistics accumulated for and against while they are on the court together. This is a useful metric for coaches in assessing how well a particular group of players works together and which in-game situations a player has historically been well-suited to.

As with any metric, it's a measure of what has happened in the past, rather than being predictive - but lineup data can be used to aid decision making and help coaches have a 'best-guess' at what may happen.

In the past I have measured lineups for teams in the NBL as part of statistics packages provided to coaches, but until last year this was a tricky process that relied heavily on time-consuming, manual data entry. The Lineup-O-Matic, on the other hand, is fully automated and able to process a whole season of data in just a few seconds. This makes it a much more feasible approach for providing these metrics freely and to a wide audience.

As with the Shot Machine (see Blog Post One), the Lineup-O-Matic is built using a combination of Python (to pull and collate the data) and Tableau (to visualise the data) and is actually pretty simple once you get your head around the logic involved.

The script I've created for SpatialJam essentially creates a single row in a database for every event that occurs during a game, be it a shot being made, a turnover, a shot clock violation and so on. Regardless of the event, the row of data also contains the 10 players on the court at the time it occurs: 5 on Team 1 and 5 on Team 2. Below is an example of how a couple of minutes of (very messy) game time between Cairns and Perth last season looks in the SpatialJam database:

You'll see that regardless of which team causes the event, all 10 players on the court are still attributed to it. Also note how, when Shaun Bruce is subbed out for Markel Starks at the 3:48 mark, the data reflects this from that point onwards.

Creating the list of events is simple - Python just creates a new row for each new JSON event; the trick is adding in the players who are on court at the time. To create this data, Python builds a list of players based on the substitution rows in the Play by Play feed. It looks for any event tagged as a substitution and adds a player to the 'on court' list based on what it finds.

For example at the start of every game, the starting five players are subbed onto the court:

Python creates two lists, one for the home team and one for the away team, containing the five initial players subbed onto the court. From this point onwards it can add to and subtract from these lists as players enter and leave the court, and every event along the way has these two lists of players attached to it.
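The bookkeeping above can be sketched in a few lines. The event dicts here are hypothetical stand-ins for the real FIBA JSON rows:

```python
def tag_events_with_lineups(events):
    on_court = {'home': set(), 'away': set()}
    tagged = []
    for ev in events:
        if ev['type'] == 'substitution':
            side = ev['side']
            if ev.get('player_out'):          # starters arrive with no player_out
                on_court[side].discard(ev['player_out'])
            on_court[side].add(ev['player_in'])
        else:
            # every non-substitution event gets a snapshot of both lineups
            tagged.append({**ev,
                           'home': sorted(on_court['home']),
                           'away': sorted(on_court['away'])})
    return tagged
```

Substitution events themselves never appear in the output; they only mutate the running on-court sets, so each shot or turnover row carries the lineups as they stood at that moment.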

This creates a very rich Play by Play database, not just for creating lineup data, but also other metrics that are not usually visible in a boxscore, such as a richer assist dataset showing who is passing to whom (as seen here) or a more detailed record of turnovers (travels, ball-handling errors etc), charges drawn and other statistics not usually provided by the league's official statistics.

The next step for SpatialJam is linking this Play by Play database to the Shot Machine, which will give a deeper understanding of the game. For example, users will be able to see not just where the shot happened, but who else was on the court at the time, who was credited with the assist and in the case of a missed shot, who rebounded the ball.

I don't believe the Lineup-O-Matic has reached its full potential yet either. There are still a few issues with the UI I'm not completely happy with, such as the player drop-down lists and some of the data refresh speeds. Still, this type of dataset is a first for the NBL and one I hope will become a powerful tool for coaches and fans alike in assessing the game Downunder.

Using Hexagonal Binning and GIS to Analyse Shooting Data

Over the past few years the process of hexagonal binning has been popping up everywhere in the data visualisation world. It's a great solution for representing density when working with large datasets made up of point data, which is why using hex bins for basketball shot charts seems like a natural progression. Geographer/journalist/analyst Kirk Goldsberry brought the use of hex bins into the basketball fan's mind with his groundbreaking 2012 presentation at the Sloan Sports Analytics Conference, which you've no doubt all seen by now.

The theory behind the visualisation process is relatively simple - a grid of hexagons (or any other shape) is created across a surface containing the point data. Each hexagon creates a kind of 'bin', and a count of the total number of points that fall within it is added to that bin's data.
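For a quick sense of the process, matplotlib's hexbin function implements exactly this kind of binning. The 'shot locations' below are synthetic points clustered around a hypothetical hoop position, not real NBL data:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # no display needed; we only want the bin counts
import matplotlib.pyplot as plt

# 500 synthetic shot locations clustered near a made-up hoop at (0, 1.575)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 3.0, 500)
y = rng.normal(1.575, 2.0, 500)

fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=12)   # 12 hexagons across the x extent
counts = hb.get_array()             # shots falling in each hexagonal bin
print(int(counts.sum()))
```

Every point lands in exactly one bin, so the counts sum back to the number of shots.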

Why hexagons? Let's have a quick geometry refresher. Ideally the bin being used should best spatially represent the points within it. This means that the centre of the chosen shape should be as close as possible to every point within the shape. The perfect solution is a circle, because points near the edge of a circle are closer to its centre than points near the edge of any other shape. Below is an example of this: an 85px circle compared with an 85px square. The furthest point in the circle is closer to the centre than the furthest point in the square:

The problem with circles, however, is that they don't fit nicely together in a grid. You'll need something with straight edges and equal areas to create a uniform grid (otherwise known as a regular tessellation) across the surface. Only triangles, squares and hexagons meet these criteria. Triangles and squares have more acute angles than hexagons (or 90° in the case of a square), which means their corners are further from their centres than those of a shape with larger angles, like a hexagon. Points that fall within a given hexagon, therefore, are likely closer to the centre of the shape than with any other available option.
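The angle argument can be checked with a quick computation: for regular polygons of equal area, compare the distance from the centre to the farthest point (the circumradius). The hexagon's is the smallest of the three tiling shapes:

```python
import math

def circumradius(n, area=1.0):
    # area of a regular n-gon with circumradius R: (1/2) * n * R^2 * sin(2*pi/n)
    return math.sqrt(2 * area / (n * math.sin(2 * math.pi / n)))

for n, name in [(3, 'triangle'), (4, 'square'), (6, 'hexagon')]:
    print(f'{name}: farthest point {circumradius(n):.3f} from centre')
```

For unit-area shapes this gives roughly 0.877 for the triangle, 0.707 for the square and 0.620 for the hexagon - the hexagon keeps its worst-case point closest to the centre.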

Also, they look cooler.

Now that's out of the way, we can get on to the fun stuff.

In order to use hex bins to analyse shot data, you'll need a couple of input datasets. Most important of course being the shot locations themselves. I've covered off how I extracted these for the NBL in my earlier blog, so give that a read if you're interested in the process. Secondly, you'll need a basketball court surface which has been divided up into hexagons:

I created mine using QGIS, an open-source Geographic Information System (GIS) freely available on the web. I found the Python-based MMQGIS plugin for QGIS a great solution for making the grid. There are a number of other ways to create a grid surface in other software packages, such as ArcGIS, but the MMQGIS plugin was the quickest and easiest, and it's free. There are a few parameters available when creating the hexagons, the most significant of which is the size of each cell. I've made mine larger than Goldsberry's, as there are not as many shots in the NBL (due to season length etc) as in the NBA, so it's not possible to be as detailed with the data.

Once the court has been decorated with lovely hexagonal tiles, the next step is to count how many shots fall inside each one. There are numerous ways to do this, and the aforementioned QGIS is one solution: you can use its Points in Polygon tool. My shooting database contains a field for shot results, with a 1 being a made shot and a 0 being a miss. This means that while performing the point count, I can ask the tool to aggregate (sum) this field to get the number of made shots in each cell:
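The underlying operation is simple enough to sketch in pure Python - a hedged stand-in for the Points in Polygon tool, using the standard ray-casting test for membership and summing the result field for makes:

```python
def point_in_polygon(px, py, polygon):
    """polygon: list of (x, y) vertices. Classic ray-casting test."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does a ray cast to the right from (px, py) cross this edge?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def count_shots(cell, shots):
    """shots: list of (x, y, result) tuples, result 1 = make, 0 = miss."""
    attempts = [s for s in shots if point_in_polygon(s[0], s[1], cell)]
    return len(attempts), sum(s[2] for s in attempts)
```

Run against one hex cell and a list of shots, this returns the (attempts, makes) pair that ends up attached to each bin.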

The issue here is it's very time consuming. Yes, running this Points in Polygon tool will give you a count of how many shots (total and made) are in each hex, but you'll need to do it for every team, and on the entire league-wide dataset too, then compare the league data with each team's data to see if the team is shooting above or below league average in any given hexagon. Yawn.

For my purposes, I'd also need to repeat the whole process at varying points in the season to keep the data updated.

To get around this I've used Safe Software's FME to process the entire league's data at once and spit out some shapefiles (a basic vector storage format for geographic features). FME is essentially data manipulation software: you put in data in numerous formats, process it using a bunch of transformation tools, then output it in another format. The workbench I've created looks huge, but it's really pretty simple:

The FME workbench takes as input a .csv file containing the xy coordinates for every shot, the team that took the shot, whether it was a miss or a make, and so on. It also grabs the hex grid I created earlier in QGIS as a shapefile.

The first step is to split the csv file out by team, then count each individual team's shot attempts and shots made per hexagon. The workbench then adds the league-wide attempts and shots made for each cell to the team's file so a comparison can be made. There are a few other tricky little bits here and there, but that's essentially all it does. We're left with a shapefile containing each team's shooting percentage and the league average shooting percentage within each hex. Here's a small section of Adelaide's data as an example:
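The same team-versus-league comparison can be sketched in pandas (FME does this in the real pipeline; the teams, hex ids and shots below are made up):

```python
import pandas as pd

# one row per shot: which team, which hex cell, made or missed
shots = pd.DataFrame({
    'team':   ['ADL', 'ADL', 'PER', 'PER', 'PER'],
    'hex_id': [7, 7, 7, 7, 9],
    'made':   [1, 0, 1, 1, 0],
})

# per-team, per-hex attempt and make counts
team = shots.groupby(['team', 'hex_id'], as_index=False).agg(
    attempts=('made', 'size'), makes=('made', 'sum'))
team['pct'] = team['makes'] / team['attempts']

# league-wide shooting percentage in each hex, joined back on
league = shots.groupby('hex_id')['made'].mean().rename('league_pct')
team = team.merge(league, on='hex_id')

# the UnderOver field: positive = above league average in that hex
team['UnderOver'] = team['pct'] - team['league_pct']
print(team)
```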

FME allows me to save that process and run it at intervals throughout the season. It takes about 15 seconds each time, rather than hours of manual work.

Now that I have the data pre-processing completed, it's time to visualise.

One advantage of using bins to aggregate data is the ability to create a multivariate visualisation. This means I have the ability to display two statistical measures on the one chart, in this case shot accuracy and shot frequency.

Shot accuracy is best represented by giving each hexagon a colour relating to its accuracy value. Hotter colours such as oranges or reds can symbolise 'hot' shooting spots, while cooler colours like blues can symbolise 'cold' spots. Instead of displaying the raw shooting percentage for each team in a given cell, though, I've decided to show how the team's percentage compares to the league average in that spot. I believe this creates a more informative picture of a team or player's shooting performance: everyone knows that players shoot better closer to the hoop; it's far more interesting to know how players shoot compared with other players from the same spot.

You can see in the image of the database above that I've created a field called UnderOver, which contains the difference between the team's shooting percentage and the league average shooting percentage. A value of 0 in this field means the team shoots exactly at the league average from this spot, a negative value means they shoot below average, and a positive value means they shoot better than average. The structure of this field means it's best represented with a diverging colour scheme - one that shows progression out from a midpoint (the league average) in either direction.

The other variable I want to show is shot frequency. This can be done by scaling the size of each hexagon. When I created my hexagon dataset in FME, I created two different output files for each team. The first contains the same polygons (hex bins) that went into the process, but with the additional shooting data merged into them. The second contains a point dataset with identical data to the hex bins. This was done by asking FME to create a point at the centroid of each hexagon and transferring all the data to that point. The end result is something that looks like this (there are only points in cells that actually contain shot data):

The reason for creating this point grid rather than using the original hexagons is that I'm now able to scale these points based on the number of shot attempts each one contains - something not possible with a fixed hexagon grid made from connected polygons.

(I'm going to use ESRI's ArcGIS software for the remainder of the visualising process, but you can do this in QGIS or any number of other GIS software packages too - it all depends on what you're most comfortable with, I guess.)

The benefit of keeping all the data within a GIS is that it is all georeferenced, meaning every software package used in the process keeps the points, grid and court background in the same spots - no need to realign each time you switch programs.

Using ArcGIS's Symbolise by Multiple Attributes function, I first changed the point symbol to a hexagon shape, then used the Variation by Symbol Size tool to scale the points by the number of shots falling within each one. The points containing the most data scale all the way out to fill the original hexagon; points containing fewer shots scale only part of the way. Combining this with the colour scale used to represent shot accuracy, a picture begins to form showing shot tendencies:

Chart showing only shot accuracy on the left | Chart showing both shot accuracy and shot frequency on right


A note on scaling: as you would expect, layups, dunks, alley-oops and tip-ins make up the majority of the data for a team. This means a large portion of the shot count falls in the cell that coincides with the hoop itself. A problem then arises when scaling the points based on shot frequency, as there might be 400 shots in the cell covering the hoop, but only 50-60 in others. You'll end up with something pretty useless that looks like this:

It doesn't really tell you much at all - it's pretty obvious that's what a shot chart of frequencies would look like. The solution I've found is to use logarithmic normalisation to process the data prior to displaying it. Log transformations are used to make highly skewed data less so, which makes it easier to spot patterns and trends. Essentially, the transformation takes the cells containing extreme values, such as the ones around the hoop, and maps them, along with every other cell, into a range between 0 and 1. Fortunately, ArcGIS has a built-in parameter for doing exactly this. The setting makes the shot frequency distribution less skewed and helps to reveal trends in the data that would otherwise be hidden.
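As a small sketch of the idea (the counts below are hypothetical, not from the real dataset): log-transform the raw bin counts, then rescale into the 0-1 range used for symbol sizing.

```python
import numpy as np

counts = np.array([400, 60, 50, 12, 3, 1])   # shots per hex (made up)

log_counts = np.log1p(counts)                # log(1 + x) keeps zeros safe
scaled = log_counts / log_counts.max()       # normalise into 0..1

print(np.round(scaled, 2))
```

On the raw scale the second cell would be drawn at only 15% of the hoop cell's size (60/400); after the log transform it sits at roughly two-thirds, so the mid-frequency cells stop vanishing next to the spike at the rim.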

This is an ongoing project for me; there is still a way to go in terms of creating a product with real analytical value. As I capture more and more data, these charts will continue to improve - 28 games per team in an NBL season means that even at the end of the season the data is still relatively volatile. The data captured over the course of one season is not enough to properly assess a player's shot chart, so for now I have stuck to teams. Player shot charts are the real goal though, and something I plan on implementing next season. I also don't think I've yet found the optimal size for the hexagons; this is something I'll be looking into over the off-season. For now though, here is an example of what my final shot maps look like:

SpatialJam - Behind The Scenes

I love receiving feedback from users of this website, regardless of whether it's positive or negative. It's great to hear how basketball fans have been interacting with the data and tools and how you're using the information to improve your understanding and insights into the game. This is also a good chance for me to give a big shoutout to those who have done a bit of quality checking on some of the data being served up - the site wouldn't be where it is today without your help, so thanks!

One of the most common questions I receive is around how the visualisations on the site are created, so I thought I'd go ahead and give a little background on how things are done here at SpatialJam, for those interested.



For me, the most important dataset on SpatialJam is the NBL Shot Machine, which holds all the shooting data. It was the original reason the site was set up, why the site was named SpatialJam and really my favourite visualisation to play around with. So for this first behind-the-scenes type blog, I'll be focusing on how this was created and maintained.

I'd like to preface this by saying this may not be the best or most efficient way of producing these tools, but for me it's an ever-evolving process, and as much a personal learning and experimentation exercise as it is about providing data to end users. I'd love to hear your thoughts if you have ways the process can be improved.

For years I'd been contemplating the idea of creating a shot database for the NBL. It's been done to death in the NBA, with analysts like Kirk Goldsberry (CourtVision Analytics and formerly Grantland) revolutionising and really bringing basketball shooting data into the mainstream over the past few years. The problem with creating these graphics for the NBL is access to the data, as we don't have the same databases, tools or APIs at our disposal down-under.

Shot data access is something I'd been trying to figure out for a few years, even to the point of attempting to manually track the data myself (with the help of a friend) using Geospatial software. We used ArcGIS as a platform for databasing all the spatial data we could during a basketball game. This solution did work and was the basis of my work during the 2012-13 NBL season where I collected all shots, rebounds and turnover locations and provided reports to the New Zealand Breakers as part of an overall stats package. An example of some of the turnover data collected is seen below:

It's a massive job though, as I'm sure you can imagine, and pretty soon I got tired of logging every game. The project was put on hold for a season while I looked into how best to automate the process (sadly, no SportVU for the NBL any time in the near future).

Fortunately, around the middle of last season, the FIBA LiveStats system used by the NBL for providing fans with live game information over the internet updated to an improved interface, which in turn changed how the data was served up. With access now available either through an API from SportingPulse International (at a price) or via some light hunting for the raw JSON (JavaScript Object Notation) file locations, the data was now available for nerds like me to download.

The above is an example of the JSON data available for tracking NBL shooting. It's what sits behind the FIBA LiveStats platform online and controls how each shot is displayed on the shot chart section. You'll notice each line has a bunch of attributes, notably the X, Y coordinates of the shot, the player who took the shot and the shot result (r).

There are a ton of ways to extract the data from these JSON files: you could copy and paste it into Excel and use the string parsing functions to extract the required data (gross), or use Python or a similar language that can handle JSON to pull the data from the web and push it into a database hosted either locally or in the cloud. SpatialJam uses the latter option.

Python has good libraries for processing JSON data natively and can pull the data from the web without any human interaction required (such as opening a web browser). Once I got my head around the FIBA naming conventions for the JSON tags, the data was easily extracted and pushed into a simple .csv file.
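The extraction step looks roughly like this. The field names here ('x', 'y', 'p' for player, 'r' for result) mirror the kind of tags described above but are assumptions for illustration, not the exact FIBA schema:

```python
import csv
import io
import json

# two hypothetical shot events, in place of the downloaded JSON feed
raw = '''[{"x": 250, "y": 120, "p": "J. Smith", "r": 1},
          {"x": 610, "y": 300, "p": "M. Jones", "r": 0}]'''

shots = json.loads(raw)

buffer = io.StringIO()            # stands in for the output .csv file
writer = csv.DictWriter(buffer, fieldnames=['x', 'y', 'p', 'r'])
writer.writeheader()
writer.writerows(shots)
print(buffer.getvalue())
```

In the real script the `raw` string would come from an HTTP request to the LiveStats JSON location, and the buffer would be a file on disk.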

There are a few nuances involved in making the data from FIBA actually usable in the way you see it on SpatialJam at the moment. Firstly, FIBA records data at both ends of the court when tracked live, meaning a team will have shot locations mapped to different ends depending on the half:


To make the data useful, I first needed to flip the second-half data so it would appear at the same end as the first-half data. This is done by scripting a rule that flips the X and Y values of any point past a particular threshold to the matching relative location at the other end of the court. The reverse was also required for the second team, as their data was recorded 'correctly' in the second half but not in the first.
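The flipping rule amounts to a mirror through the centre of the court. The coordinate system below (x and y both running 0-100, half-court at 50) is an assumption for the sketch; the real FIBA chart's units may differ:

```python
def flip_to_one_end(x, y, threshold=50, court_length=100, court_width=100):
    """Mirror any shot past half-court back onto the first-half end."""
    if x > threshold:
        return court_length - x, court_width - y   # 180-degree rotation
    return x, y
```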

The second piece of data manipulation was required for all shots tagged as dunks or tip-ins. I wanted these shots centred on the rim itself, but unfortunately the human-input (point and click) nature of the FIBA stat tracker means that in many cases this doesn't happen. In fact, tip-ins are almost always shown as errors on the FIBA shot charts. In the image above, you'll notice a tiny blue dot in the bottom-left corner of the court. Yup, that's all the shots tagged as tip-ins. Again, this was as simple as scripting a rule to take these shots and automatically adjust their X, Y values to coincide with the X, Y value of the hoop.
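That rule is a one-liner in spirit. The hoop coordinates and the 'subtype' field below are hypothetical placeholders, since the real values depend on the FIBA chart's coordinate system:

```python
HOOP_X, HOOP_Y = 5.25, 50.0   # assumed hoop location, not the real FIBA values

def snap_to_hoop(shot):
    """shot: dict with 'x', 'y' and 'subtype' keys. Returns a corrected copy."""
    if shot.get('subtype') in ('dunk', 'tipin'):
        return {**shot, 'x': HOOP_X, 'y': HOOP_Y}
    return shot
```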

There are a number of other NBL errors the SpatialJam script accounts for, such as misspellings of player names, incorrect or inconsistent player numbers and overtime periods, which I won't go into in detail - but suffice to say, I'm hoping the data you see on SpatialJam is a 'cleaner' version of that served up by the NBL at this stage.

Tableau Public is a free data visualisation tool which is, quite frankly, brilliant in many regards. There is a crazy amount of depth to the software, and I'm still very much in the learning stages with it. The NBL Shot Machine is made up of a number of Tableau worksheets pieced together to form the overall graphic, the main one being the court visualisation itself, which is basically just an X, Y chart with basketball court lines mapped underneath. Tableau allows the user to interact with and query the chart, unlike 'old-school' shot charts which are just images. This gives the end user a huge range of freedom and granularity when searching for information.

Due to the input limitations of the free Tableau interface, I cannot yet have the data stored on a server or even in a locally hosted SQLite database as I would like, so at this stage Tableau reads the data directly from the .csv file created by SpatialJam's Python script. The site therefore does not yet have the ability to update the data live as it happens, which means it still requires me to 'push' the data through to Tableau before the fresh data appears on SpatialJam. Doing this only takes a few seconds, but it is something I would like to address in the future. Unfortunately, at this stage the solution costs money, and I believe free access to the site is more important than being a couple of minutes quicker at updating. I don't think anyone's that desperate to get their hands on this stuff...

The brains behind the operation.


One part of the process which is now automated, however, is the running of the scripts to pull the data. The SpatialJam Python scripts live on an internet-connected Raspberry Pi running Linux and are scheduled to run in the early hours of each morning to grab the latest NBL game data. The newly grabbed data is then merged with the main dataset, ready for me to (manually...) push up to the Tableau servers and SpatialJam. This isn't really that much of a time saver, but it's a bit of a geeky side note anyway.

The NBL Shot Machine is an ever evolving project for me. I've loved having your feedback on how you use the visualisation and improvements you'd like to see in the future. As the project develops over time I am aiming to expand the data to include the SEABL and NZ NBL shooting data for the upcoming seasons as well as fine tune the querying functionality on the site. I hope you enjoy using it as much as I've enjoyed building it!