Description


Motivation

Fantasy sports and sports betting are multi-billion dollar industries [1, 2], so there are huge incentives to predict the winner of a game. To produce accurate predictions of a game’s outcome, many have turned to machine learning. Neural networks are a popular prediction technique, including feedforward, probabilistic, and generalized neural networks [3, 4, 5]. An algorithm based on the maximum entropy principle has been used to make predictions when features are correlated, with results compared against a number of other classifiers, including naive Bayes [6]. Other strategies include logistic regression, decision trees, and support vector machines [7]. These previous works show promising results, but are not perfect. Events like player injuries, retirements, and trades can greatly disrupt team dynamics. A robust model should be able to take into account the chemistry of a team and address questions such as how one team will perform against another, which substitute a team should bring in, and which five starters best counter an opponent’s starting lineup.

Many existing models are designed to address only the first question. To tackle these challenging questions, we hope to create a machine learning model that can find deep patterns and connections.

Statistics

Due to the public availability and completeness of the statistics at stats.nba.com, we decided to create such a machine learning model for the National Basketball Association (NBA). The website provides historical player performance statistics such as shooting percentage, number of rebounds, points scored, and number of fouls. Note that an NBA team can have at most 15 active players, five of whom are on the court at a time. Two of the players are forwards, two are guards, and one is a center.

Theoretical Description of the Machine Learning Model

An ideal model takes as input the five home team players and five away team players on the court and outputs some metric indicating how well the home team will do compared to the away team. We determined that the most meaningful metric is plus/minus per minute. The plus/minus statistic is the number of points the home team scores minus the number of points the away team scores. We then normalize this quantity by time. Thus, a plus/minus per minute of -4 means the away team outscores the home team by four points every minute. This statistic is easily accessible from stats.nba.com for use in training and testing. The model can be expressed as

$$f(h_1, h_2, h_3, h_4, h_5, a_1, a_2, a_3, a_4, a_5) = y$$

where $h_1, \dots, h_5$ are players from the home team, $a_1, \dots, a_5$ are players from the away team, and $y$ is the label of plus/minus per minute. For consistency, players with subscript 1 are centers, players with subscripts 2 and 3 are forwards, and players with subscripts 4 and 5 are guards. Note that this is a regression problem. An alternative approach is to look only at the sign of the plus/minus per minute, which indicates whether or not the home team is winning. This turns the problem into a simpler classification problem.
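As a minimal sketch of the label described above, the snippet below computes plus/minus per minute for one stretch of play and derives the sign-based classification label from it. The function names and the example scores are hypothetical and chosen only for illustration.

```python
def plus_minus_per_minute(home_points: int, away_points: int, minutes: float) -> float:
    """Home points minus away points, normalized by the minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (home_points - away_points) / minutes


def home_team_ahead(label: float) -> bool:
    """Classification variant: keep only the sign of the regression label."""
    return label > 0


# Hypothetical stretch of play: the away team wins a 2-minute span 20 to 12.
label = plus_minus_per_minute(home_points=12, away_points=20, minutes=2.0)
print(label)                   # -4.0
print(home_team_ahead(label))  # False
```

A label of -4.0 here matches the reading given in the text: the away team outscores the home team by four points every minute.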

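Now, we will consider how the players are represented and what information they encode. We use the extensive player statistics from the 2016-17 season. Once again, these are averaged per minute for normalization. Each home player $h_i$ and each away player $a_i$ is represented as a vector of statistics:

$$h_i = (h_{i,1}, \dots, h_{i,n})^\top, \quad a_i = (a_{i,1}, \dots, a_{i,n})^\top, \quad i = 1, \dots, 5$$

where $n$ is the number of player statistics. Thus, the model takes in $10n$ input arguments in total. Through a simple reshaping of the input, we can write the model, with $y$ the plus/minus per minute label, as

$$f(x) = y, \quad x = (h_1^\top, \dots, h_5^\top, a_1^\top, \dots, a_5^\top)^\top \in \mathbb{R}^{10n}$$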
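The reshaping step amounts to concatenating the ten per-player statistic vectors into one flat input of length ten times the number of statistics. The sketch below illustrates this under assumptions: the player names, the `stats` dictionary, and the use of only three statistics per player are all hypothetical.

```python
from typing import Dict, List, Sequence


def lineup_to_input(
    home: Sequence[str],
    away: Sequence[str],
    stats: Dict[str, List[float]],
) -> List[float]:
    """Concatenate ten per-player statistic vectors into one flat input.

    Players are expected in the order center, forward, forward, guard, guard,
    home team first and away team second.
    """
    if len(home) != 5 or len(away) != 5:
        raise ValueError("each lineup must contain exactly five players")
    flat: List[float] = []
    for player in list(home) + list(away):
        flat.extend(stats[player])
    return flat


# Hypothetical data: three per-minute statistics per player (so n = 3).
home = ["H1", "H2", "H3", "H4", "H5"]
away = ["A1", "A2", "A3", "A4", "A5"]
stats = {name: [0.1 * k, 0.2 * k, 0.3 * k]
         for k, name in enumerate(home + away, start=1)}

x = lineup_to_input(home, away, stats)
print(len(x))  # 30, i.e. 10 players times 3 statistics
```

Keeping a fixed position order in the flat vector means a given input slot always refers to the same role on the same team.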

Training Data for the Model

With so many input arguments in the model, a large amount of training data is required. A single example of training data requires knowing the exact ten players on the court during a portion of a game. This is paired with a training label, which is the plus/minus per minute during that portion of the game. Note that within a single game, there are several substitutions made by both teams that change the ten-player matchup on the court.

[Figure: Game timeline]

Therefore, in a single game, there are numerous “time capsules” and corresponding plus/minus per minute labels that can be used for training. Moreover, there are 30 NBA teams, each with 82 games per season. We use all time capsules of at least five minutes from the 2016-17 regular season. These training examples can be organized into a matrix:

Columns are training examples, ordered by season, then game, then time capsule. Rows are grouped by team and position.

| Team | Position | Season 1, Game 1, Time Cap 1 | Season 1, Game 1, Time Cap 2 | … |
| --- | --- | --- | --- | --- |
| Home | Center | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Home | Forward 1 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Home | Forward 2 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Home | Guard 1 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Home | Guard 2 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Away | Center | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Away | Forward 1 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Away | Forward 2 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Away | Guard 1 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |
| Away | Guard 2 | fgm, reb, blk, … +/- | fgm, reb, blk, … +/- | … |

Each cell in the table above is a vector of statistics. In total, the matrix has 600 rows (features, or 60 statistics for each of the ten players) by 1585 columns (examples).
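The extraction of time capsules from a single game can be sketched as follows. This is an illustration under assumptions, not the pipeline actually used: it supposes we already have the substitution times (in minutes since tip-off) and the cumulative score at each of those times, and the `Capsule` type, function name, and sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Capsule:
    start: float  # minutes since tip-off
    end: float    # minutes since tip-off
    label: float  # plus/minus per minute over [start, end]


def time_capsules(
    boundaries: Sequence[float],
    scores: Sequence[Tuple[int, int]],
    min_minutes: float = 5.0,
) -> List[Capsule]:
    """Split a game at lineup changes and label each sufficiently long stretch.

    boundaries[i] is the time of the i-th lineup change (including the start
    and end of the game); scores[i] is the cumulative (home, away) score then.
    """
    if len(boundaries) != len(scores):
        raise ValueError("need one score per boundary")
    capsules: List[Capsule] = []
    for i in range(len(boundaries) - 1):
        start, end = boundaries[i], boundaries[i + 1]
        duration = end - start
        if duration < min_minutes:
            continue  # keep only capsules of at least min_minutes
        home_delta = scores[i + 1][0] - scores[i][0]
        away_delta = scores[i + 1][1] - scores[i][1]
        capsules.append(Capsule(start, end, (home_delta - away_delta) / duration))
    return capsules


# Hypothetical fragment of a game with lineup changes at minutes 6 and 8.
boundaries = [0.0, 6.0, 8.0, 18.0]
scores = [(0, 0), (14, 11), (18, 17), (40, 32)]
capsules = time_capsules(boundaries, scores)
for capsule in capsules:
    print(capsule)
```

In this fragment the two-minute stretch between minutes 6 and 8 is dropped, which leaves two capsules that would each contribute one column to the training matrix.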

Testing Data for the Model

We decided to use the 2017 NBA playoff games as the testing data, as they represent a practical use case for the model. There were 79 playoff games, which yielded 95 time capsules.

Using the Model

How does the model address the questions posed in the introduction? The model directly answers the question of how Team A will perform compared to Team B via the predicted plus/minus per minute. How can the model choose an optimal substitute for a team, however? For each player on the bench capable of playing the position, we can plug their statistics into the model and find the one who maximizes the plus/minus per minute. The problem of finding all five starters to counter the opponent’s starting lineup is solved with a similar strategy. For each position, there is a disjoint set of players capable of playing that position. We can try all possible team combinations under the position constraints and, once again, find the one that maximizes plus/minus per minute.
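Both searches described above are brute-force maximizations over candidate lineups, which can be sketched as follows. The trained model is replaced here by a hypothetical stand-in, `toy_model`, that simply compares summed player ratings; the player names and ratings are invented for illustration, and a real predictor would take player statistics as input.

```python
from itertools import combinations, product
from typing import Callable, Sequence, Tuple

Lineup = Tuple[str, ...]
Model = Callable[[Lineup, Lineup], float]  # predicted home plus/minus per minute


def best_substitute(model: Model, home: Lineup, away: Lineup,
                    slot: int, candidates: Sequence[str]) -> str:
    """Try each bench candidate in one slot and keep the best prediction."""
    def with_sub(player: str) -> Lineup:
        lineup = list(home)
        lineup[slot] = player
        return tuple(lineup)
    return max(candidates, key=lambda p: model(with_sub(p), away))


def best_starting_five(model: Model, centers: Sequence[str],
                       forwards: Sequence[str], guards: Sequence[str],
                       away: Lineup) -> Lineup:
    """Enumerate every lineup of 1 center, 2 forwards, and 2 guards."""
    lineups = (
        (c, f[0], f[1], g[0], g[1])
        for c, f, g in product(centers,
                               combinations(forwards, 2),
                               combinations(guards, 2))
    )
    return max(lineups, key=lambda lineup: model(lineup, away))


# Hypothetical stand-in for the trained model: difference of summed ratings.
RATING = {"C1": 1.0, "C2": 2.0, "F1": 0.5, "F2": 1.5, "F3": 1.0,
          "G1": 0.2, "G2": 0.9, "G3": 0.4, "G4": 0.1,
          "A1": 1.0, "A2": 1.0, "A3": 1.0, "A4": 1.0, "A5": 1.0}


def toy_model(home: Lineup, away: Lineup) -> float:
    return sum(RATING[p] for p in home) - sum(RATING[p] for p in away)


away = ("A1", "A2", "A3", "A4", "A5")
home = ("C1", "F1", "F2", "G1", "G2")
sub = best_substitute(toy_model, home, away, slot=3, candidates=["G3", "G4"])
starters = best_starting_five(toy_model, ["C1", "C2"], ["F1", "F2", "F3"],
                              ["G1", "G2", "G3"], away)
print(sub)       # G3
print(starters)  # ('C2', 'F2', 'F3', 'G2', 'G3')
```

Because the position pools are disjoint, the number of lineups is the product of the per-position choices, which stays small for a 15-player roster and makes exhaustive search practical.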

References

  1. “Fantasy Sports now a $7 Billion Industry.” FSTA, fsta.org/press-release-fantasy-sports-now-a-7-billion-industry/. 

  2. Spear, Gillian. “Think sports gambling isn’t big money? Wanna bet?” NBCNews.com, NBCUniversal News Group, 15 July 2013, www.nbcnews.com/news/other/think-sports-gambling-isnt-big-money-wanna-bet-f6C10634316. 

  3. Torres, Renato Amorim. “Prediction of NBA games based on machine learning methods.” University of Wisconsin, Madison (2013).

  4. Loeffelholz, Bernard, Earl Bednar, and Kenneth W. Bauer. “Predicting NBA games using neural networks.” Journal of Quantitative Analysis in Sports 5.1 (2009). 

  5. McCabe, Alan, and Jarrod Trevathan. “Artificial intelligence in sports prediction.” Information Technology: New Generations, 2008. ITNG 2008. Fifth International Conference on. IEEE, 2008. 

  6. Cheng, Ge, et al. “Predicting the Outcome of NBA Playoffs Based on the Maximum Entropy Principle.” Entropy 18.12 (2016): 450. 

  7. Haghighat, Maral, Hamid Rastegari, and Nasim Nourafza. “A review of data mining techniques for result prediction in sports.” Advances in Computer Science: an International Journal 2.5 (2013): 7-12.