Introduction
Predicting a basketball game comes down, in simple terms, to estimating the number of points each team will score. Having an expected points total for each team lets you set the point spread and the match total. Once you have decided on the handicaps based on your model, the next part is the easiest one: checking which bookmaker offers the most value and placing the bets. The hard part is building a model that makes, on average, better predictions than your bookmaker. New technology and software have been developed to track sports data, and there are now numerous game attributes that can help you analyze a basketball game. Stats such as the distance of the closest defender, the miles a player travels per game, shooting percentages, turnovers per possession, offensive rebounding percentage and the rate of getting to the foul line have dramatically changed the way analytics are used in the NBA.
For example, a metric called Expected Points per Shot was created by Ian Levy as an attempt to evaluate a team’s shot selection. It is based on the understanding that not all shots are created equal. A layup has a higher chance of going in than a long jump shot or a three-pointer. If you measure the total number of points scored on shots from each location (restricted area, the paint, mid-range, corner three, above-the-break three) and divide by the number of attempts, you will have an expected value for shots from each location. With those expected values you can calculate a player’s or team’s expected points per shot.
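The per-location calculation described above can be sketched in a few lines of Python. This is only an illustration: the shot log is made up, and the function name `points_per_shot_by_zone` is our own, not part of Ian Levy’s work.

```python
from collections import defaultdict

# Hypothetical shot log: (zone, points scored on the attempt). Made-up data.
shots = [
    ("restricted area", 2), ("restricted area", 0), ("restricted area", 2),
    ("mid-range", 0), ("mid-range", 2),
    ("corner three", 3), ("corner three", 0),
    ("above-the-break three", 0), ("above-the-break three", 3),
    ("above-the-break three", 0),
]

def points_per_shot_by_zone(shot_log):
    """Return {zone: total points / attempts} for every zone in the log."""
    points = defaultdict(int)
    attempts = defaultdict(int)
    for zone, pts in shot_log:
        points[zone] += pts
        attempts[zone] += 1
    return {zone: points[zone] / attempts[zone] for zone in attempts}

zone_values = points_per_shot_by_zone(shots)
for zone, value in zone_values.items():
    print(f"{zone}: {value:.2f} points per shot")
```

With a real shot log in place of the toy list, the same division gives the expected value of a shot from each location.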
Basketball Data
There are two types of data that can be obtained and analyzed – box score and play-by-play statistics.
The box score contains information about the teams and players in a match that has already been completed. It is a structured summary of the listed attributes and achievements per team and player. It can include the following stats: minutes played, field goals made/attempted, 3-pointers, free throws, defensive rebounds, assists, steals, blocks etc.
Play-by-play statistics are much more detailed and provide information such as timed events (player substitutions, the time of play at which an attribute was recorded for a player or team), counter-party linked events (the player who gave the assist, defenders nearby) and the type of each field goal. Moreover, play-by-play data can provide you with court coordinates for each event that happened, along with players’ positions on the court. Play-by-play data can help you determine the defensive value that certain sets of players add and the synergies between players with different skillsets. For example, by knowing which ten players are on the court, you can spot a player who attracts a lot of defensive attention and whose team-mates therefore get more open shots. Basketball-Reference is a good place to start in terms of data.
Models and Metrics
*Let’s start with team-based analysis and specifically teams’ possessions. The objective of the offensive team should be to maximize the return on each possession and score more points, while the defensive team should be looking to minimize the return on the opponent’s possessions. This insight leads us to the team-level statistics of points scored per possession and points allowed per possession. Points per possession is one of the best ways to measure the strength and quality of a team’s attack and defense. This method can take into account numerous attributes such as field goals, turnovers, rebounds, free throw percentages etc., as long as these provide a measurement of the defensive and offensive quality of a team. In ‘Basketball on Paper’, Dean Oliver identifies four factors that affect points scored and allowed more than the other attributes: shooting percentage, turnover percentage, rebounding percentage and the ability to get to the free throw line.
*Individual player statistics can be used to judge a player’s value on a per-minute basis.
We can calculate the number of wins that a player produces for his team (Wins Produced). The method considers factors that contribute to wins at a team level, mainly offensive and defensive efficiency. Offensive efficiency, or points scored per offensive possession, can be produced through field goals, free throws, assists and offensive rebounds. The Wins Produced method shares credit among the players responsible for the statistics that build up the offensive efficiency. Defensive efficiency, or the points given up per defensive possession, is allocated credit as well, and it is split equally among players on the team, weighted by minutes played. The allocation of the “credit” is done by linear regression, which means that the points scored per offensive possession for a team are regressed against individual performance statistics at a team level.
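As a rough sketch of these team-level numbers, the snippet below estimates possessions from box-score totals and computes points per possession and the four factors. The formulas are the commonly used box-score approximations (the 0.44 free throw weight is a convention, and free throw rate is taken here as free throws made per field goal attempt); they are our assumption and are not quoted from the text above. The box score itself is made up.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.44):
    """Approximate possessions from box-score totals (ft_weight is a convention)."""
    return fga - orb + tov + ft_weight * fta

def four_factors(fgm, fg3m, fga, tov, fta, ftm, orb, opp_drb, ft_weight=0.44):
    """Shooting, turnovers, offensive rebounding and free throw rate."""
    return {
        "efg_pct": (fgm + 0.5 * fg3m) / fga,             # effective field goal %
        "tov_pct": tov / (fga + ft_weight * fta + tov),  # turnovers per play
        "orb_pct": orb / (orb + opp_drb),                # share of available rebounds
        "ft_rate": ftm / fga,                            # free throws per shot attempt
    }

# Made-up single-game box score for one team.
box = dict(fgm=40, fg3m=10, fga=85, tov=14, fta=25, ftm=20, orb=10, opp_drb=30)

points = 2 * (box["fgm"] - box["fg3m"]) + 3 * box["fg3m"] + box["ftm"]
poss = estimate_possessions(box["fga"], box["orb"], box["tov"], box["fta"])
ppp = points / poss
factors = four_factors(**box)

print(f"points per possession: {ppp:.3f}")
for name, value in factors.items():
    print(f"{name}: {value:.3f}")
```

Running the same functions on the opponent’s totals gives points allowed per possession, the defensive counterpart.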
On a side note, for those unfamiliar with linear regression, it is a very basic type of predictive analysis. The estimates are used to explain the relationship between a single dependent variable and one or more independent variables.
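For readers who prefer code, here is a minimal least-squares fit with a single independent variable, written in plain Python. The data points are invented for illustration (a shooting percentage against points per possession), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up team data: shooting percentage vs points per possession.
shooting_pct = [0.48, 0.50, 0.52, 0.54, 0.56]
points_per_poss = [1.01, 1.06, 1.07, 1.12, 1.14]

slope, intercept = fit_line(shooting_pct, points_per_poss)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")

# Predict the dependent variable for a new value of the independent variable.
predicted = intercept + slope * 0.53
print(f"predicted points per possession at 0.53: {predicted:.3f}")
```

With several independent variables the idea is the same, but the fit is usually delegated to a numerical library.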
*Win Shares is another method that calculates the offensive portion in a similar way to that of Wins Produced. The defensive win shares, on the other hand, are calculated by finding the position of the player and regressing his defensive efficiency against the offensive efficiency of his opponent at the same position. The total win shares for each player are the sum of the offensive and defensive win shares.
*Player Efficiency Rating (PER) is an attempt by John Hollinger to sum up all of a player’s performance in a single number. It measures a player’s per-minute performance, while adjusting for pace. The single number takes into account field goals, free throws, 3-pointers, assists, rebounds, blocks and steals, as well as negative results such as missed shots, turnovers and personal fouls. The adjustment for pace is important because it makes sure that players on slow-paced teams are not penalized for having lower numbers on their stats sheet than players on fast-paced teams. Also, because it is a per-minute measure, this metric makes it easy to compare players with very different minutes played.
The PER metric has already been used as a component in a model. The process involves exploring the connection between teams’ performance and players’ statistics. In terms of teams’ performance, the team’s win ratio (Wins / Total Games Played × 100) over a season is a good start. We can say that the win ratio is more informative than a team’s ranking at the end of the season, because the win ratio is more quantifiable. For player performance, the player efficiency rating is a good start. PER is more detailed and accurate than raw statistical totals and per-game numbers. On the other hand, its major weakness is the lack of consideration of defense. Blocked shots and steals are considered, but the metric doesn’t capture great individual or team defense. A multiple linear regression can be used to analyze the relationship between players’ and teams’ stats. In a nutshell, there is a strong linear correlation between a team’s win ratio and the team PER metric, and this is logical because talented players are supposed to win more games.
*Player analysis can also be based on the “plus-minus” statistic of a player. This statistic is the number of points a player’s team scored while he was on the floor minus the number of points the opposing team scored while the player was on the floor. This metric can be negative, because the opposing team can score more points than the player’s team. The adjusted plus-minus (explained in our previous article here) tries to adjust for various factors that inflate or deflate the plus-minus statistic.
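To see why the two adjustments matter, the sketch below scales a raw stat total to a per-minute rate and then multiplies it by the ratio of league pace to team pace. This is not Hollinger’s PER formula, only the per-minute and pace-scaling idea behind it; the player totals and pace values are made up, and pace is assumed to mean possessions per game.

```python
def per_minute_pace_adjusted(stat_total, minutes, team_pace, league_pace):
    """Per-minute production, scaled as if the team played at league-average pace."""
    per_minute = stat_total / minutes
    return per_minute * (league_pace / team_pace)

LEAGUE_PACE = 100.0  # assumed possessions per game, illustrative value

# Player A: heavy minutes on a fast-paced team. Player B: fewer minutes, slow team.
player_a = per_minute_pace_adjusted(600, 1000, team_pace=105.0, league_pace=LEAGUE_PACE)
player_b = per_minute_pace_adjusted(280, 500, team_pace=95.0, league_pace=LEAGUE_PACE)

print(f"A raw per-minute: {600 / 1000:.3f}, adjusted: {player_a:.3f}")
print(f"B raw per-minute: {280 / 500:.3f}, adjusted: {player_b:.3f}")
```

In this toy case player B trails player A on the raw per-minute rate but comes out ahead once pace is taken into account.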
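The raw plus-minus is easy to compute once a game is split into stints, meaning stretches of play with a fixed lineup. The sketch below assumes that representation; the stints and player labels are made up.

```python
# Hypothetical stints for one team:
# (players on the floor, points scored by the team, points scored by the opponent)
stints = [
    ({"A", "B", "C", "D", "E"}, 12, 10),
    ({"A", "B", "C", "D", "F"}, 8, 11),
    ({"B", "C", "D", "E", "F"}, 15, 9),
]

def plus_minus(stint_list, player):
    """Team points minus opponent points over the stints the player was on the floor."""
    return sum(team - opp for lineup, team, opp in stint_list if player in lineup)

for name in ["A", "E", "F"]:
    print(name, plus_minus(stints, name))
```

Player A ends up negative here even though his team won two of the three stints, because he sat out the best one.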
Drawbacks in basketball models and metrics
We do need to mention that team-level models fail to account for changes in minute allocation. These changes can be the result of injuries, trades and transfers, and the stage of the season: teams tend to play their best players more often in the playoffs than they do in the regular season. All this can lead to models overvaluing teams with strong benches and undervaluing teams with weak benches.
Individual box-score models should consider that different players contribute to a player’s box-score stats. For example, 30% of a player’s points in a match can be the result of another player drawing defensive attention and freeing up the shooter for an open shot. In this case, both players should be allocated credit for the offensive event.
Defense evaluation is another problem for metrics such as Wins Produced and PER. Wins Produced divides the credit equally, which is not fair when a good defensive player plays on a team with four bad ones. Win Shares divides the credit based on the opposition at the same position. This can be unfair as well, for example when a player is blamed for free throws that were conceded because someone else fouled the opposition. The player efficiency rating doesn’t attempt to divide defensive credit.
Understanding the context of play is another weakness of the metrics mentioned above, and the box-score stats don’t distinguish between types of fouls. For example, a foul is a bad play, but a non-shooting foul is worse than a shooting foul.
The adjusted plus-minus model also ignores any player-to-player interactions that might be present, and it assumes that a team’s strength is a linear sum of player strengths.
Statistical methods and techniques
There are numerous models that take box-score data and apply machine learning techniques such as logistic regression or naive Bayes. We have already touched on regression, so here we will quickly explain naive Bayes. Naive Bayes is a collection of classification algorithms based on Bayes’ theorem. They share a common principle: every feature being classified is assumed to be independent of the value of any other feature. In other words, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
The naive Bayes classifier explained above is used by Nate Silver to predict NCAA basketball outcomes. He uses team strengths from 5 other models and pre-season predictions. He shows that the pre-season predictions are really important in the NCAA, because teams that overperform during the season relative to pre-season predictions tend to do poorly in the NCAA tournament, and vice versa. In his model, Silver uses other inputs such as distance travelled and finds that teams that travel more tend to perform worse.
Another example is the Brownian motion model that Hal Stern applied to the progress of sports scores. It fits basketball well, yielding good in-game win probability estimates. Stern mentions two drawbacks: the relative strengths of the two teams are not included in the model, and neither is which team has possession.
A model developed by Kvam and Sokol applies a hidden Markov model which can observe a team’s strength when it plays another team, and these observations get updated as teams play other teams. This model considers injuries and it also assumes that teams’ strength relations are transitive.
A Markov model, in statistics, is used to model randomly changing systems. It assumes that future states depend only on the current state, not on the events that occurred before it. A basketball game can be modelled as a sequence of transitions between discrete states, so a Markov chain, in which the probability of the next state depends only on the present state, is a good fit. A good Markov model in basketball must strike a balance: detailed enough to capture the relevant events that can occur during a game, yet simple enough to fit and interpret. A minimum requirement for such a model is that the exact number of points scored by each team can be determined from the transition counts.
The state of the Markov chain can be defined in the following way: 2 states for possession (home and away); 5 states for how possession was gained (inbound pass, steal, offensive rebound, defensive rebound and free throws); 4 states for the points scored in the previous possession (0, 1, 2, 3). The largest possible model would have 40 states (2 × 5 × 4), but because certain combinations of the three factors are impossible, the largest model would have 30 states. The number of states can be reduced further if we assume, for example, that rare events like four-point plays or loose ball fouls following missed free throws are impossible. If the Markov model fits the data well, it can provide details about in-game win probabilities for a given team and about the expected number of points scored in a possession gained in different ways, such as offensive rebounds vs defensive rebounds.
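To make the independence assumption concrete, here is a small Gaussian naive Bayes classifier written from scratch. It is a sketch under our own assumptions, not any published model: the two features (a points-per-possession gap and a rest-day gap between home and away team) and the six training games are invented.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior per class and a mean/variance per class and feature."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        stats = []
        for column in zip(*members):
            mean = sum(column) / n
            var = sum((v - mean) ** 2 for v in column) / n + 1e-9  # avoid zero variance
            stats.append((mean, var))
        model[label] = (n / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest log posterior.

    The per-feature log likelihoods are simply added up, which is exactly
    the "naive" assumption that features are independent within a class.
    """
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up games: (points-per-possession gap, rest-day gap), home minus away.
games = [(0.08, 1), (0.05, 0), (0.10, 2), (-0.06, -1), (-0.09, 0), (-0.04, -2)]
winners = ["home", "home", "home", "away", "away", "away"]

model = fit_gaussian_nb(games, winners)
print(predict(model, (0.07, 1)))
print(predict(model, (-0.07, -1)))
```

In practice the features would come from box-score data and the classifier from an established library, but the arithmetic is the same.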
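Under a Brownian motion model with drift, the in-game win probability has a closed form: with a home lead l after a fraction t of the game, the home team wins with probability Φ((l + (1 − t)μ) / (σ√(1 − t))), where μ is the expected final home margin and σ the standard deviation of the final margin. The sketch below implements that expression. The values of `mu` and `sigma` are illustrative placeholders, not Stern’s estimates.

```python
from statistics import NormalDist

def home_win_probability(lead, t, mu, sigma):
    """P(home team wins) given the current home lead at elapsed game fraction t.

    mu    : expected home margin over a full game (drift), assumed value
    sigma : standard deviation of the full-game margin, assumed value
    """
    if not 0 <= t < 1:
        raise ValueError("t must be in [0, 1)")
    remaining = 1.0 - t
    z = (lead + remaining * mu) / (sigma * remaining ** 0.5)
    return NormalDist().cdf(z)

MU, SIGMA = 4.0, 15.0  # placeholder parameters for illustration only

for lead, t in [(0, 0.0), (5, 0.5), (5, 0.9), (-5, 0.9)]:
    p = home_win_probability(lead, t, MU, SIGMA)
    print(f"lead={lead:+d}, t={t:.1f}: P(home win)={p:.3f}")
```

The same lead is worth more late in the game, because less time remains for the score difference to drift back.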
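A toy version of such a chain is sketched below. It is heavily simplified and built on our own assumptions: possessions strictly alternate, a state is the pair (team in possession, points scored on the previous possession), and the scoring probabilities are invented. Even so, it shows the two properties mentioned above: the next state is sampled from the current state alone, and the final score can be recovered from the sequence of states.

```python
import random

# Hypothetical per-possession scoring distributions: P(0, 1, 2, 3 points).
SCORING = {
    "home": [0.50, 0.05, 0.33, 0.12],
    "away": [0.52, 0.05, 0.32, 0.11],
}
OTHER = {"home": "away", "away": "home"}

def step(state, rng):
    """Sample the next state using only the current state (the Markov property)."""
    team, _ = state
    points = rng.choices([0, 1, 2, 3], weights=SCORING[team])[0]
    return (OTHER[team], points)

def simulate(n_possessions, seed=0):
    """Return the list of visited states, starting with the home team on the ball."""
    rng = random.Random(seed)
    state = ("home", 0)
    path = [state]
    for _ in range(n_possessions):
        state = step(state, rng)
        path.append(state)
    return path

def score_from_path(path):
    """Recover the score purely from the visited states."""
    score = {"home": 0, "away": 0}
    for team, points in path[1:]:
        # The points in a state were scored by the team that just gave up the ball.
        score[OTHER[team]] += points
    return score

path = simulate(200, seed=1)
print(score_from_path(path))
```

Repeating the simulation many times from a given mid-game state would give a simulated in-game win probability for that state.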
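The raw state space described above can be enumerated directly. The snippet lists all 40 combinations and then applies one exclusion rule as an example. That rule (a steal cannot follow a possession on which points were scored) is our own illustrative assumption, so the filtered count is not meant to reproduce the 30 states mentioned in the text.

```python
from itertools import product

POSSESSION = ["home", "away"]
GAINED_BY = ["inbound pass", "steal", "offensive rebound",
             "defensive rebound", "free throws"]
PREVIOUS_POINTS = [0, 1, 2, 3]

# Every combination of the three factors: 2 * 5 * 4 = 40 states.
all_states = list(product(POSSESSION, GAINED_BY, PREVIOUS_POINTS))
print(len(all_states))

def is_possible(state):
    """Illustrative rule only: a steal implies the previous possession scored nothing."""
    _, gained_by, previous_points = state
    return not (gained_by == "steal" and previous_points > 0)

feasible_states = [s for s in all_states if is_possible(s)]
print(len(feasible_states))
```

Adding further rules to `is_possible` is how the state space would be trimmed towards a model that only contains reachable states.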
Many of the features and metrics needed to create a good predictive model in basketball already exist. For example, about 60% of the games in the NBA (NBA Eastern Conference Futures Preview) are won by the home team, and for this reason it is natural to use the home team as a feature in any predictive model (NBA Betting Systems). Other important ones are: rest, which can be measured as the number of games played in the last X days or the number of days since the last game; distance travelled, which is important for the NCAA championship as mentioned by Nate Silver; and altitude, since teams located at higher altitudes tend to have a better home-court advantage than other teams.
In this article, we have already mentioned points scored and points allowed per possession and that they measure offensive and defensive efficiency. It is important to mention that these metrics do not tell us why some offenses and defenses are more efficient than others. Instead of looking at total points per possession, Ian Levy looked at expected points per shot (as we mentioned at the beginning of the article). Ian Levy’s dataset categorizes each shot into one of five types, shown along with the NBA averages in the table below.
As per the table, in order to be a good offensive team, a team should take more of the shots that have higher points per shot. Based on this, Ian created a metric called expected points per shot (XPPS). It is calculated for a particular team or player as the sum over all shot types i of the frequency of i multiplied by the average points per shot of i: XPPS = Σ_i (frequency_i × points per shot_i). We will look into that metric in more detail in another article and provide further explanations.
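The two rest measures mentioned above are straightforward to derive from a team’s schedule. The sketch below uses a made-up list of game dates; the function names are our own.

```python
from datetime import date, timedelta

def days_since_last_game(game_dates, today):
    """Days between today and the most recent earlier game, or None if there is none."""
    previous = [d for d in game_dates if d < today]
    return (today - max(previous)).days if previous else None

def games_in_last_days(game_dates, today, window_days):
    """Number of games played in the window_days days before today."""
    start = today - timedelta(days=window_days)
    return sum(1 for d in game_dates if start <= d < today)

# Made-up schedule for one team.
schedule = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 4), date(2024, 1, 8)]
game_day = date(2024, 1, 5)

print(days_since_last_game(schedule, game_day))
print(games_in_last_days(schedule, game_day, window_days=4))
```

Computing both values for the home and the away team, and taking the difference, gives a single rest feature per game.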
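The XPPS calculation described above is a frequency-weighted average, which can be sketched as follows. The league-average points per shot and the team’s attempt counts are placeholder numbers for illustration, not the values from Ian Levy’s table.

```python
def expected_points_per_shot(attempts_by_type, avg_points_by_type):
    """Sum over shot types of (share of attempts) * (average points per shot)."""
    total_attempts = sum(attempts_by_type.values())
    return sum(
        (attempts / total_attempts) * avg_points_by_type[shot_type]
        for shot_type, attempts in attempts_by_type.items()
    )

# Placeholder league averages (points per shot by type), illustrative only.
league_avg = {
    "restricted area": 1.20,
    "paint": 0.80,
    "mid-range": 0.80,
    "corner three": 1.15,
    "above-the-break three": 1.05,
}

# Made-up shot attempts for one team.
team_attempts = {
    "restricted area": 30,
    "paint": 10,
    "mid-range": 20,
    "corner three": 10,
    "above-the-break three": 30,
}

xpps = expected_points_per_shot(team_attempts, league_avg)
print(f"XPPS: {xpps:.3f}")
```

Shifting attempts from mid-range to the restricted area or the corners raises the number, which is exactly the shot-selection effect the metric is meant to capture.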
Conclusion
Modelling and forecasting the outcomes of basketball games has become a large topic of research over the past decades. These models try to forecast a large number of regular season and playoff games with the single intention of outperforming the betting market. Even with the quality of detailed data available, it is still very hard to predict the final score and the winning team. A basketball team should be analyzed as a collection of individual players rather than as a single unit.
Predictions range from human judgment to statistical analysis of historical data. Each player has a unique role on the court, on both the offensive and the defensive side. Taking individual players into account when building basketball models should lead to better game predictions.
Georgie has been in the industry for over 11 years, working as a trader and a broker for some of the largest syndicates in the world. Georgie has focused his model development on international soccer leagues.