UW INFO 370: Introduction to Data Science
Final Project 2019
by Krishna Puvvada, Travis Neils, Zach Palmer, Paul Winebrenner
What factors are influential in determining the final score (and outcome) of a basketball game?
Can the outcome of a basketball game (win/loss, final score) be accurately predicted using machine learning?
The purpose of this project is to discover how machine learning could be applied to college basketball games to predict performance in the NCAA March Madness bracket. Sports have long been an area ripe for data analytics, given their many measurable metrics of play, and basketball is no exception, with steals, points, assists, and other factors all contributing to a win. Our goal with this project is to reduce a basketball game to its components (like scoring, rebounding, blocks, and steals) with the purpose of accurately predicting the outcome of a game given a series of inputs.
This research is of particular importance for the sport of basketball as an accurate predictive model would allow teams to not only predict the outcomes of their games, but also to analyze what areas they could improve in to better their results. This would likely have an impact not only on the college (NCAA) level, but also at other levels of the game.
Additionally, this project is important as it explores whether or not sporting events can be predicted with any level of accuracy. Sports have long been elusive to predict, hence the strong gambling market surrounding them. The combination of skill and luck makes sports by nature difficult to predict. Nowhere is this better evidenced than the NCAA tournament. Each year, 68 teams compete for a national title. The teams, selected and seeded by a special committee, are drawn from the 351 Division 1 college basketball teams. Nominally, the committee selects teams based on factors including but not limited to wins, strength of schedule, conference record, and "various computer metrics". All of these factors are based on regular season play. Once in the tournament, teams play against each other until a national champion is crowned (67 games total). Assuming the probability of picking any single game correctly is .5 (a coin flip), the probability of getting every single game in the tournament correct is 6.7762636e-21. Obviously, those odds are not great. As such, our mission is to attempt to create a series of algorithms that use both regular season statistics and information about seeding to predict NCAA Tournament games with better accuracy.
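That figure follows directly from compounding a coin flip over all 67 games; a minimal sketch of the arithmetic in Python:

```python
# Chance of a perfect bracket if every one of the 67 tournament games
# is effectively a fair coin flip.
p_per_game = 0.5
n_games = 67

p_perfect = p_per_game ** n_games
print(f"{p_perfect:.7e}")  # 6.7762636e-21
```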
In order to analyze how machine learning could be used to predict the outcome of college basketball games, we had to narrow down our approach and variables to a reasonable size. We decided to train our models on individual game data for each season, obtained from Kaggle. For every game in the regular season, the dataset recorded the field goals made, field goals attempted, three pointers made, three pointers attempted, free throws made, free throws attempted, offensive rebounds, defensive rebounds, assists, turnovers committed, steals, blocks, and personal fouls committed, as well as the final score. Since March Madness is notorious for upsets that are very difficult to predict, we thought it was best to reduce the dataset to its lowest level and focus on how we could predict the outcome of a game based on the game stats that any given team has, as well as their seed in the playoff bracket. If we can predict a game's score based on this collection of variables, we can then use that model to predict the NCAA bracket game by game using the season's aggregate data.
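A minimal loading sketch (the file name is an assumption based on the Kaggle March Madness data; the columns follow the W/L prefix convention used throughout this report):

```python
import pandas as pd

# Per-game regular season box scores from Kaggle (file name assumed).
games = pd.read_csv("RegularSeasonDetailedResults.csv")

# One row per game: winning-team statistics carry a W prefix (WFGM,
# WFGA, WFGM3, WAst, ...) and losing-team statistics an L prefix.
print(games.shape)            # roughly 82,000 rows
print(games.columns.tolist())
```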
First we analyzed the correlations between all of the variables we had to see if there were any trends worth noting. To do this we created a heat map and a pair of tables displaying the winning team and losing team correlations for reference:
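A minimal sketch of how such a heat map can be produced with seaborn (assuming the games data frame loaded above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between all numeric box score columns.
corr = games.select_dtypes("number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlations between box score variables")
plt.show()
```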
In doing this, we found some interesting relationships between certain variables. Perhaps the most notable is the relatively strong correlation between WAst (winning team assists) and WScore. An assist is recorded when a player passes the ball to a teammate and that teammate scores immediately after receiving the pass. It should be noted that an assist is only recorded when a basket is made directly following a pass, not when a shot is missed or possession is lost. While in some sense it is intuitive that the more assists a team has, the higher their score will be, this variable also carries the underlying implication that team play is important. This is of particular interest as teamwork is a difficult statistic to measure, but is often cited as a very important factor in winning basketball games. The strength of the correlation (.56) indicates that teamwork may well be an important factor in determining the score of a basketball game. Correlation, of course, does not equal causation, but the strength of the relationship is notable.
Additionally, there is a very strong correlation between the opposing team's personal fouls (LPF/WPF) and the number of free throws attempted (WFTA/LFTA). This is, of course, an obvious correlation, as fouls often result in free throws. The strength of the correlation, however, is notable. Fouls do not always result in free throws: fouls away from the ball and fouls on the floor not in the act of shooting award no free throws when the fouled team is not in the bonus. The insight here is that the strength of the correlation (.75+) would seem to indicate that personal fouls result in free throws a significant portion of the time, far more often than not. While not particularly relevant to this project, this correlation is a potential topic of further study as it speaks in part to how the game is officiated.
Finally, the most interesting correlation value we found was the r value indicating the relationship between 3 pointers made (WFGM3/LFGM3) and score (WScore, LScore). Given the trend in modern basketball towards taking more 3 point shots, the relatively low correlation (.43) was of particular note. One would have expected the number of 3 pointers made to be very strongly correlated with the winning score (perhaps a correlation of .7+) given how much the shot is emphasized at all levels of the modern game. However, the reason for this seemingly low correlation likely stems from the data set used to calculate it. The data set includes basketball games from the last 30+ years, reaching all the way back to the 1985 season. As a result, it spans several different "periods" in the game. According to Shot Tracker, an online source that tracks the shot selection of NBA teams, the 3 point "revolution" started in 2010, and the number of 3 point shots attempted per game has increased substantially each year since (1). While NBA and NCAA basketball are fundamentally different in many ways, it is undeniable that the trend toward taking more 3 point shots is essentially universal to the game, although the college level likely adapted to the increase more slowly, following in the NBA's footsteps. Nevertheless, the relative recency of the surge in 3 point shots is likely the reason the correlation value seems so low. Before 2010, 3 point shots were taken in significantly smaller numbers, resulting in fewer makes and therefore less impact on the final score of a game. The 25+ years of data in which 3 point shots were significantly less prevalent likely account for this seemingly low correlation. As a topic for further research, a comparison with post-2010 correlations between 3 pointers and WScore would be enlightening.
While the previous step established some interesting correlations between several different variables, we wanted to get an idea of which variables might be the most impactful for our final predictions. To do this we created a correlation graph between the different game variables and total points scored by the winning team below:
To see if similar variables would be correlated for the losing teams, we included a corresponding correlation graph below:
The most strongly correlated variables for both winning and losing teams are field goals made, assists, and field goals attempted, which makes sense since field goals are the most common way a team earns points.
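These rankings can be reproduced with a short sketch (again assuming the games data frame; the team ID columns are dropped since they are numeric but meaningless here):

```python
# Correlation of each team-side statistic with that side's final score.
num = games.select_dtypes("number").drop(columns=["WTeamID", "LTeamID"],
                                         errors="ignore")
w_cols = [c for c in num.columns if c.startswith("W") and c != "WScore"]
l_cols = [c for c in num.columns if c.startswith("L") and c != "LScore"]

w_corr = num[w_cols].corrwith(num["WScore"]).sort_values(ascending=False)
l_corr = num[l_cols].corrwith(num["LScore"]).sort_values(ascending=False)

print(w_corr.head())  # WFGM, WAst, and WFGA rank near the top
print(l_corr.head())
```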
Given that we now know what most directly impacts the points a team can score, we wanted to see how points would be affected by another important aspect of the NCAA bracket: the seeding. Teams given a higher seed, based on their stronger regular season record, are matched against teams with a lower seed in order to make the tournament fairer to the teams that performed better during the regular season. To see how teams with higher seeds performed in previous tournaments, we created a graph showing the relationship between each team's tournament seed in a past year and the points they scored in the tournament:
The teams with a higher seed (higher being closer to 1) generally scored more points, around 300-500, over the tournament compared to the lower seeds, which barely got above 200 points. This could be due to the higher seeded teams doing better in their games, but it is also driven by them surviving longer and accumulating points over more games as the tournament progresses. This tells us that both in-game performance and the seed a team is given will significantly impact a team's performance throughout the tournament, and both would be useful variables to include in our analysis. This analysis identified which variables appear to impact the outcome of a game, answering research question 1.
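A sketch of how this seed-versus-points view can be assembled (the tournament file and column names are assumptions based on the Kaggle data; the seed string carries a leading region letter that must be stripped):

```python
import matplotlib.pyplot as plt

# Tournament seeds and compact results (file names assumed).
seeds = pd.read_csv("NCAATourneySeeds.csv")            # Season, Seed, TeamID
tourney = pd.read_csv("NCAATourneyCompactResults.csv")

# Seeds look like "W01" or "X16a": strip the region letter, keep the digits.
seeds["SeedNum"] = seeds["Seed"].str[1:3].astype(int)

# Total points each team scored across the tournament (wins plus losses).
w_pts = (tourney.rename(columns={"WTeamID": "TeamID", "WScore": "Points"})
                .groupby(["Season", "TeamID"])["Points"].sum())
l_pts = (tourney.rename(columns={"LTeamID": "TeamID", "LScore": "Points"})
                .groupby(["Season", "TeamID"])["Points"].sum())
pts = w_pts.add(l_pts, fill_value=0).reset_index()

merged = pts.merge(seeds[["Season", "TeamID", "SeedNum"]],
                   on=["Season", "TeamID"])
merged[merged["Season"] == 2017].plot.scatter(x="SeedNum", y="Points")
plt.show()
```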
Given this baseline for what game variables influence the final outcome of any basketball game and how tournament seeding can affect a team's overall performance, we turn to machine learning to figure out how to best leverage our data to generate accurate predictions.
Data preparation for this section of our research involved several complex steps to configure the data in a manner suitable for our modeling needs. The most important step was the aggregation of season data for each team in each season using the 82,000-row box score data provided by Kaggle. We obtained the average season stats for each team, calculated the difference between the two teams' averages for each March Madness game, attached those differences to the game, and passed the resulting data frame into our models. Additionally, some school name modification was required in the dataframes.
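A condensed sketch of that aggregation step (reusing the games data frame; the W- and L-prefixed columns are stacked so each team-season appears once):

```python
# Season-average box score statistics per team.
stat_cols = ["Score", "FGM", "FGA", "FGM3", "FGA3", "FTM", "FTA",
             "OR", "DR", "Ast", "TO", "Stl", "Blk", "PF"]

w = games.rename(columns={"W" + c: c for c in stat_cols + ["TeamID"]})
l = games.rename(columns={"L" + c: c for c in stat_cols + ["TeamID"]})

season_avg = (pd.concat([w, l], sort=False)
                .groupby(["Season", "TeamID"])[stat_cols]
                .mean())

def matchup_features(season, team1, team2):
    """Difference between two teams' season-average statistics."""
    return season_avg.loc[(season, team1)] - season_avg.loc[(season, team2)]
```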
After having prepared the data, we created 5 different models to see which would predict results most accurately. We chose a Bayesian Ridge Regression, a KNeighbors Regression, a Linear Regression, an MLP Regressor, and a Decision Tree Regression as the 5 models. Each model predicted the score difference for a game given the two teams' statistics. Thus a positive score differential indicated that Team 1 won the game, while a negative one indicated that Team 2 won. We then compared the sign of the prediction to the sign of the actual score difference; if the signs matched, the prediction was considered correct.
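That scoring rule amounts to checking sign agreement; a minimal sketch:

```python
import numpy as np

# A prediction counts as correct when it gets the sign of the score
# differential right, i.e. it picks the winning team.
def sign_accuracy(y_true, y_pred):
    return np.mean(np.sign(y_true) == np.sign(y_pred))
```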
Bayesian Ridge was chosen for its capability to adapt to the data at hand, and for its regularization procedure. Additionally, it is a variation of the linear model and was chosen in part to compare with our linear method to see what differences (if any) arose.
KNearestNeighbors was chosen for its capability in classifying similar observations. In this case, it was of particular interest to us to see whether games with similar statistics turned out the same way. This model predicts based on the "nearest neighbors" system, in essence predicting a game score from the scores of games between similar teams. By building this model and comparing it to others that did not use this system, we were able to gain some insight into the random element of basketball games.
The linear regression model was chosen for the purpose of assessing whether a set of factors is linearly related to the overall winning score. Essentially, given that a significant part of a winning score can be attributed to variables like 3 pointers made, free throws made, field goals made, and assists, we chose the linear regression to see if some combination of these (or other) variables were suitable inputs for accurately predicting the winning score.
To leverage machine learning techniques to enhance our predictions, we used a neural network to test our hypothesis that certain features of the NCAA teams dataset would be positively correlated with wins and bracket standings. A neural network can dynamically adjust the weights given to certain features when building the model and is thus able to achieve high levels of predictive accuracy. While this constant adjustment of weights can bring complications such as overfitting and computational complexity, neural networks are very flexible and suit a domain like the NCAA, with its large number of different predictive variables.
The Decision Tree Regressor was chosen in large part because in our experience it predicts continuous values with a more significant spread than any of the other methods we chose to employ. In previous assignments, the model had produced significantly more interesting and variable predictions than other models. In this case, we were interested to see if a more variable model would better predict a seemingly random event.
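A minimal sketch of fitting and scoring all five models with scikit-learn (X_train/y_train and X_test/y_test are assumed names holding the feature differences and score differentials for past tournaments and for the 2018 tournament, built as in the preparation step above):

```python
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    "Bayesian Ridge": BayesianRidge(),
    "KNeighbors": KNeighborsRegressor(),
    "Linear": LinearRegression(),
    "Neural Network": MLPRegressor(max_iter=2000),
    "Decision Tree": DecisionTreeRegressor(),
}

# Fit on past tournaments, predict 2018, score by sign agreement.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, sign_accuracy(y_test, preds))
```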
The above table displays the accuracy score each model attained when asked to generate predictions for the 2018 NCAA Tournament, expressed as the proportion (out of 1) of games whose winner was correctly predicted. While all of the results are of interest, 3 of the 5 are of particular note.
The very low score of KNeighbors is particularly striking, as it performed quite poorly compared to all of the other models. In fact, the model was only .7% better than a coin flip for the 2018 tournament. It should be noted, however, that scores of less than 30 correct picks are not at all unusual in bracket competitions, indicating that while the model was not particularly accurate, it still predicted better than a significant number of people do each year. This result was interesting in large part because we had expected KNeighbors to perform significantly better than the other models thanks to its ability to predict using scores from the closest neighbors. It indicates that while we passed a significant number of factors to the model, some other factor (perhaps random chance) determines the score of a game and therefore the outcome. The insight is that while games may feature similar teams playing each other, the final outcome is not always the same; in fact, it is quite often different, indicating some other factor is at play.
The second result of particular interest is the Decision Tree's accuracy score of .567. While this result was more expected, it is interesting that the model which predicted results in the most variable manner (see analysis of Figure 2 below for further details) was inaccurate. This indicates that point spreads are not quite as variable as the Decision Tree Regressor predicted.
Finally, the last result of interest is the apparent tie between the Linear and Bayesian Ridge models in their accuracy scores. This very likely has to do with the fact that the Bayesian Ridge and Linear models are very similar in their approaches. These two models in conjunction suggest that basketball games are in some way predictable using the factors that we passed in as input, but cannot be considered entirely predictable.
Figures 1-5 compare each model's predictions to the actual score differential of each game. The y-axis displays the predicted point spread while the x-axis displays the actual point differential. In each graphic, the line represents the case where a point spread was predicted exactly correctly by the model. Points that fall in the upper right-hand and lower left-hand quadrants are considered correct predictions, as the predicted and actual differentials share the same sign, meaning the model correctly picked which team won (though not necessarily the correct point differential).
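A sketch of one such figure, reusing the preds produced for a single model in the loop above:

```python
import matplotlib.pyplot as plt

# Predicted vs. actual score differentials for one model.
plt.scatter(y_test, preds)
plt.plot([-40, 40], [-40, 40], color="gray")  # exactly correct predictions
plt.axhline(0, color="black", lw=0.5)         # quadrant boundaries: matching
plt.axvline(0, color="black", lw=0.5)         # signs = correct winner pick
plt.xlabel("Actual point differential")
plt.ylabel("Predicted point differential")
plt.show()
```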
Figure 1 displays the comparison of the KNeighbors predicted and actual values. The most interesting aspect of this graph is how little variation the KNeighbors model predicted in terms of the predicted score differentials. The lack of variability is in significant part why this model was unable to accurately predict results.
Figure 2 displays the comparison of the Decision Tree Regression predicted and actual values. As expected, the Decision Tree exhibited much more variability in its predictions, although in this case that variability hurt the accuracy score of the model more than it helped. The graphic displays several more points in the upper left and lower right-hand quadrants (these quadrants indicate an incorrect prediction of which team won) than, say, the KNeighbors model. Additionally, the model was by far the furthest off on its incorrect predictions, with points like (-20, 35), (12, -30), and (-30, 15) all being off by over 40 points. In some sense, while the model predicted point differentials with a greater (and more accurate, see below for further details) spread, it was too variable to be considered accurate.
Figure 3 displays the third-place model, the Neural Network's predictions, in comparison to the actual values. The Neural Network performed with more accuracy than the Decision Tree and KNeighbors models but was still prone to large inaccuracies, particularly, it seems, in games that actually featured large point differentials. Similarly to the KNeighbors model, this model consistently predicted lower point spreads than was actually the case. While not necessarily bad, the model failed to generate accurate predictions mainly because it did not recognize games where the actual point differential was large. This is best exhibited by the points (-30, 3), (27, 0), and (25, 2), all of which feature residuals of over 20.
Figures 4 and 5 display the Bayesian Ridge and Linear model predictions in comparison to the actual values. These models tied for the best score, likely because the underpinnings of both are the same. They predicted point differentials from -20 to +20, a much larger spread than the KNeighbors model and roughly 5 points larger (in each direction) than the Neural Network model. The significant difference with these models was their accuracy: points in Quadrants 2 and 4 were significantly reduced and sat closer to the center (indicating that the linearly based models were off by less) than in the previous models. Additionally, these models placed higher densities of points in Quadrants 1 and 3 than the other models did.
PLEASE NOTE: The Authors of this report would like to emphasize that the lines connecting each game are meaningless as the x-axis variable is discrete. This form was chosen as it provided the most compact reflection of the differences between the models for each tournament game (scatter plots seemed to balk at the inputs we attempted to pass in).
The above line graph displays the predicted score differential by each model, for each game in the 2018 NCAA Tournament. The graphic shows that the Decision Tree was very "volatile" in its predictions and, as mentioned previously, often predicted in the completely wrong direction. The Decision Tree model was, however, the closest in terms of the spread of point differentials to the actual data (see below for further details).
As can be seen, the yellow (Linear) and green (Bayesian Ridge) lines are very similar (as expected and analyzed above), however, this graph brings the new insight that the models did not perform exactly the same for every game as previously thought. There appear to be games where the models diverged in their predicted point differentials, although not by very much. These divergences all appear to occur well away from the 0 line (indicating the border between win and loss) which is likely the reason that the models did not differ in their accuracy scores.
The Linear and Bayesian Ridge models also followed the actual values (depicted by the blue line) most closely of any of the models although both seemed to consistently underestimate the actual point differential in their predictions. This indicates to us that with some finer tuning, a more accurate linearly based model might be able to predict game results with better accuracy.
The above graphic displays the trend of what each model was likely to predict. The narrower a model's distribution, the less likely the model was to predict large point differentials, while the wider models predicted much larger point differentials. Of significant note, however, is that although the graphic would seem to suggest that the Decision Tree was the most accurate, this was not actually the case. The Decision Tree was the model best at matching the score differentials in absolute value, but was wildly inaccurate compared to the other models, at times predicting point differentials in the completely wrong direction (actual = +40, predicted = -40). The narrower distributions indicate models that were less likely to predict large score differentials, instead predicting that most games would be close rather than blowouts. The blue line above displays the actual variability in scores which, as previously mentioned, was best modeled by the Bayesian Ridge and Linear models.
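A sketch of how such a comparison of distributions can be drawn (reusing the fitted models dictionary from the loop above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Density of each model's predicted score differentials, overlaid with
# the actual differentials; a narrow curve means the model rarely
# predicted blowouts.
for name, model in models.items():
    sns.kdeplot(model.predict(X_test), label=name)
sns.kdeplot(y_test, label="Actual")
plt.legend()
plt.show()
```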
In summary, this project set out to establish what factors are influential in determining the outcome of a basketball game and to build a machine learning algorithm capable of predicting the final result.
Our data research explored the correlations between all of the potential factors and the winning score (as well as the losing score and all of the other variables), returning several relatively high correlation values and indicating that there are indeed relationships between certain statistics and the final score.
For our machine learning section, we chose to compare 5 different models in the hope of:
1. determining whether the statistics we collected fully account for the outcome of a game, and
2. identifying which type of model predicts tournament results most accurately.
We achieved the first point, establishing that there are factors outside of what we included in our algorithms that have an impact on the final result, likely some combination of random chance and other factors such as seeding, refereeing, and the number of games played in the preceding few days. Given more time, an in-depth exploration of these areas would be of interest.
The second point was a moderate success. We tried 5 different models, with accuracy scores ranging between 50 and 65 percent, gathering insight along the way. Our best models were the linearly based ones: Linear Regression and Bayesian Ridge. From these linear models, we deduced that a significant portion of a basketball game's outcome is due to how a team plays (reflected in its statistics), but that there is also a significant element (or set of elements) that we were not able to identify. We also learned that while the Decision Tree was able to quite accurately model the absolute value of game spreads, it was wildly inaccurate on individual predictions.
As a whole, this project has given us a model with which to predict future March Madness tournaments and serves as a starting point for finer tuning. Using a coin flip, or another random selection method, we stood a 6.7762636e-21 chance of getting a perfect bracket. With the help of machine learning, we have improved those odds to 1.24558312e-13, a significantly better chance, but still far from realizable. Originally, we had set out to test our model on this year's Tournament, generating predictions and comparing them with real time results. Unfortunately, the start of March Madness 2019 is March 17th, 4 days after this project is due. However, we plan to use the model outside of class to see how it performs.
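As with the coin flip figure earlier, the improved figure follows from compounding per-game accuracy over 67 games; a minimal sketch (the per-game accuracy here is backed out from the figure above and is approximate):

```python
# Chance of a perfect 67-game bracket at a given per-game accuracy.
p = 0.6418               # approximate per-game accuracy of our best models
print(f"{p ** 67:.5e}")  # ~1.25e-13, versus 6.78e-21 for a coin flip
```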
(1) Shea, Stephen. “The 3-Point Revolution.” Shottracker, shottracker.com/articles/the-3-point-revolution.