Bayesian skill rating

This page is a work in progress.

This page details the background of a possible Bayesian skill rating system implementation in the Legacy mod. See also #403 for implementation details.

Background

Objectives:
  • Implement a new, proven metric that can be used to compare the skill of players over time
  • Get rid of XP save (not XPs!) and all of its unwanted side effects
  • Add some nice features built upon this system

What about XP Save?

ET implements XPs, which measure players' scores based on specific actions (kills, constructions, revives, ...) and are also used to control the various skill levels, similarly to an RPG. However, XPs don't give any indication of the real global performance of a player. As XP Save (which isn't part of the original game) is usually enabled on servers, XPs merely give a hint about the time a player has spent on a specific server. Saving XPs also creates some imbalance by permanently enabling skill-level abilities (such as adrenaline for medics) which were never accounted for in the original game balance, where XPs were reset at the end of matches or at the end of campaigns.

As FPS games require skills in multiple areas such as reflexes, planning, tactical analysis and teamwork, with the relative importance of these skills depending on the map, game mode, player roles or team composition, a *skill rating* could be defined as a metric measuring "all the parameters of a player that help his team to win". This can be measured with a probabilistic, Bayesian approach by simply looking at the outcome of a game instead of the various micro-parameters (such as accuracy or number of kills).

Besides being global, one important reason why a skill rating system should only consider match results to compute the skill is that accounting for the extra information given by micro-parameters could be abused by players. This can be very detrimental to team-based games, where players would try to maximize the statistics that boost their skill (e.g. their own number of kills or awards) instead of doing what is best for their team to win.

What else?

Beside rating individual players, a Bayesian skill rating system can be useful for
  • predicting the outcome of matches in a multiplayer game
  • rating the "match quality" during a given map, and
  • keeping the teams balanced to, in theory, keep the gameplay entertaining
Also, it can be used to
  • track player ratings across multiple servers world-wide with a master server (global leaderboard)
  • recommend servers to players based on the difficulty level of each server (difficulty matching)
Additionally, analysis of collected statistics can give insights that lead to improvements in gameplay by:
  • allowing server administrators to more fairly balance matches
  • allowing level and map designers to create more balanced maps, or to recommend more appropriate time limits for existing ones

Objectives

As a first step, a possible implementation should focus on giving useful information:
  • skill rating
  • match quality rating
  • outcome prediction
  • analysis of map gameplay
Once this is working in a satisfying way, additional features could be built on top of it to provide the following:
  • team balancing
  • centralized leaderboard
  • matchmaking with players of similar skills
  • ...

Existing implementations

Elo

Possibly the most prominent rating system in use today is Elo.
  • A player’s Elo rating is represented by a number which increases or decreases based upon the outcome of games between rated players.
  • After every game, the winning player takes points from the losing one. The difference between the ratings of the winner and loser determines the total number of points gained or lost after a game.
  • In a series of games between a high-rated player and a low-rated player, the high-rated player is expected to score more wins. If the high-rated player wins, then only a few rating points will be taken from the low-rated player. However, if the lower rated player scores an upset win, many rating points will be transferred. The lower rated player will also gain a few points from the higher rated player in the event of a draw.
  • This rating system is self-correcting. A player whose rating is too low should, in the long run, do better than the rating system predicts, and thus gain rating points until the rating reflects their true playing strength.
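
For reference, a minimal sketch of the standard Elo update; the 400-point logistic scale is Elo's convention, while the K-factor value is a deployment choice (often 16-32):

```c
#include <math.h>

/* Expected score of player A against player B under Elo's logistic curve. */
static double elo_expected_score(double rating_a, double rating_b)
{
    return 1.0 / (1.0 + pow(10.0, (rating_b - rating_a) / 400.0));
}

/* New rating of A after one game; score is 1 (win), 0.5 (draw) or 0 (loss).
 * k is the K-factor controlling how many points change hands per game. */
double elo_update(double rating_a, double rating_b, double score, double k)
{
    return rating_a + k * (score - elo_expected_score(rating_a, rating_b));
}
```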

Glicko

The Glicko updating system improves over Elo by incorporating the variability in parameter estimates. Prior to a rating period, a player’s skill (θ) is assumed to follow a Gaussian distribution which can be characterized by two numbers:
  • the average skill of the player (μ)
  • the degree of uncertainty in the player’s skill (σ)

Glicko models the game outcomes by the Bradley-Terry model and updates players’ skills after a rating period.
Glicko performs best when the number of games per player is around 5-10 in a rating period.
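
Under the Bradley-Terry model on Elo's scale (and ignoring Glicko's uncertainty attenuation factor g(σ) for simplicity), the expected score of player i against player j is:

```latex
E(s_{ij} \mid \mu_i, \mu_j) = \frac{1}{1 + 10^{-(\mu_i - \mu_j)/400}}
```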

The Glicko-2 rating system improves upon the Glicko rating system and further introduces the rating volatility, which indicates the degree of expected fluctuation in a player’s rating.

Though the Elo and Glicko ranking systems have been successful, they are designed for two-player games.

ETPub skill rating

ETPub implements such a metric with its "Player Rating", giving a normalized skill score.

References

Website:
Reports:

Overview

Type: single-layer neural network using the Bradley-Terry model (logistic distribution); identical to a logistic regression model

The base capabilities of the model make it useful for determining player ratings by:
  • allowing for weighting individuals
  • dealing with rating uncertainty
  • preventing rating inflation
It is extended to take into account
  • the map-side effect
  • the effects of time on gameplay
  • the server difficulty

The model allows for real-time win probability prediction.
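
As an illustration of that prediction, a minimal sketch of the Bradley-Terry (logistic) link applied to summed team ratings, matching the "identical to logistic regression" note above (the unit scale of the rating difference is an assumption):

```c
#include <math.h>

/* Predicted probability that team A beats team B, where each argument is
 * the sum of the ratings of that team's players (the model is additive). */
double bradley_terry_win_probability(double team_a_sum, double team_b_sum)
{
    return 1.0 / (1.0 + exp(-(team_a_sum - team_b_sum)));
}
```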

Notes:

Map-side effect
  • Most maps were designed such that one side has a major advantage.
Time effect
  • Can be used for real-time determination of the match outcome probability, but is not actually useful for predicting the outcome prior to the match.
Server difficulty
  • allows comparing player performance across servers more effectively
  • gives a measure of how difficult each server is, as this parameter can also be interpreted as the rating increase a given side expects for each additional player on that side. Servers where having additional players will not make up for the skill of the players are more difficult than those where a few extra players alone can decide the winner.
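
Illustratively (the notation below is ours, not the report's), a per-server difficulty term d entering the additive team rating linearly in the player count would read:

```latex
t_{\mathrm{team}} = \sum_{i \in \mathrm{team}} s_i + d \cdot n_{\mathrm{team}}
```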

Limitations and possible improvements

Parameters:

  • Rating uncertainty: the downside of this method of modeling uncertainty is that it does not allow for a later change in an individual's rating due to long periods of inactivity or to improvement over time. This could be addressed with a form of certainty decay that lowers an individual's certainty value over time (see Glicko-2 rating volatility).
  • Map side: a similar measure of uncertainty could be applied to the field-group parameters.
  • Server difficulty: The accuracy of server difficulty comparison is affected by how often players move between servers.
All associations are assumed to be additive and linear.
  • They are additive in the sense that the skill of a given team is proportional to the sum of the skill of that team’s players. It does not take into account higher-level interactions between the players.
  • It is linear because the effects of time and of the number of players per team are assumed to affect the model in a linear fashion.
Two approaches to improving the robustness of the current model include:
  • Extending the current single-layer ANN to a multi-layer ANN
  • Constructing a higher-level statistical model that allows for these additional complexities

Evaluation

  • The model accuracy is tested by predicting the same matches that were used to estimate the model parameters. The data set consisted of 4,675 matches and 5,145 players, with 14 matches on average per player. The results show an accuracy of 72.5%, but this result does not seem to have been cross-validated.

TrueSkill

The TrueSkill system is a more modern algorithm developed by Microsoft for its Xbox matchmaking service. It has the advantage over the ETPub PR that the rating starts very low and increases over time (like XPs), before stabilizing once the skill estimate is accurate (like the ETPub PR).

References

Website:
Reports:

Overview

Type: Bayesian network using the Thurstone-Mosteller model (normal distribution)

  • Allows multi-player comparison
  • Allows multi-team comparison
  • Allows draws
  • Allows partial play
  • Allows partial updates
  • Models uncertainty over time
  • Allows match quality evaluation
  • Allows win probability prediction
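
For instance, the win probability prediction for two teams reduces to a normal CDF of the scaled difference of summed skills; a minimal sketch, assuming the common default β = 25/6:

```c
#include <math.h>

#define BETA (25.0 / 6.0) /* default performance deviation, assumed here */

/* Standard normal CDF. */
static double normal_cdf(double x)
{
    return 0.5 * erfc(-x / sqrt(2.0));
}

/* Probability that team A beats team B, where mu_* and var_* are the sums
 * of the team members' skill means and variances, and n is the total
 * number of players across both teams. */
double trueskill_win_probability(double mu_a, double var_a,
                                 double mu_b, double var_b, int n)
{
    double c = sqrt(n * BETA * BETA + var_a + var_b);

    return normal_cdf((mu_a - mu_b) / c);
}
```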

Limitations and possible improvements

  • No map-side effect parameter
  • No time effect parameter

TrueSkill can be expanded to take both of these parameters into account. They can be modeled with another Gaussian in the team performance factor.
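
For instance, with player performances p_i ~ N(s_i, β²), the attacking team's performance on a given map could gain one extra Gaussian term (the symbols below are illustrative, not from the TrueSkill paper):

```latex
t_{\mathrm{team}} = \sum_i p_i + m,
\qquad m \sim \mathcal{N}(\mu_{\mathrm{map}}, \sigma_{\mathrm{map}}^2)
```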

Evaluation

The prediction error (fraction of teams that were predicted in the wrong order before the game) has been evaluated.
  • However, this measure is difficult to interpret because of the interplay of ranking and matchmaking: depending on the (unknown) true skills of all players, the smallest achievable prediction error could be as high as 50%.
  • In order to compensate for this latent, unknown variable, the authors arranged a competition between Elo and TrueSkill: each system predicted which games it considered most tightly matched and presented them to the other algorithm. The algorithm that predicts more game outcomes correctly has a better ability to identify tight matches.
  • For TrueSkill the matchmaking criterion was used, and for Elo the difference in Elo scores, s1 - s2.

The Halo 2 Beta Dataset (v1.1) has been used to evaluate the model. It consists of various match outcomes, including small team games (up to 4 players in 2 teams; 27,539 games for 4,992 players) and large team games (up to 8 players in 2 teams; 1,199 games for 2,576 players).

From this dataset,
  • 80% has been used for training
  • 10% has been used for testing
  • 10% has been used for cross-validation

Prediction errors have been shown to be 37.17% (or 62.83% accuracy) for small teams and 29.94% (or 70.06% accuracy) for large teams. TrueSkill proved to be significantly better at predicting tight matches than Elo (extended for team ranking).

Convergence:

The algorithm converges, for 2 teams with 8 players per team, after 91 games (the average number of games per player that the system ideally needs to identify the skill level).
  • This is quite slow for determining the skill rating
  • But this slowness is good as a replacement for XP save!

TrueSkill extensions

References

Various

References

Implementation

We’ll implement TrueSkill, for the following reasons:
  • Although the ETPub algorithm includes the extra map and time parameters, it severely suffers from a fixed rating uncertainty that TrueSkill actually manages properly.
  • TrueSkill is also quite simple to implement and fast to compute in the ET context, as the game doesn't involve more than 2 teams (no complex belief propagation required) and doesn't have draws either.
  • We could also extend TrueSkill for map-side and time parameters to improve it.
Overview of work required:
  • Implement rating database management
  • Take time of clients that have disconnected before the end of the game into account
  • Take time of clients that have disconnected then reconnected into account
  • Extend algorithm with field parameter
  • Extend algorithm with time parameter
  • Extend win probability with field parameter
  • Extend win probability with time parameter
  • Add cvar in default config file (when ready)
  • Implement useful commands (/rating, /allrating, ..)

See also #403 for implementation details.

Language

Although a Lua implementation is available (FAF), a C implementation would allow tighter integration in the mod and ease future feature development.

  • Implement the base algorithm directly in C
  • Implement the players database management in Lua instead. This can be integrated into our Lua administration suite.
  • What about map data collection? The mapvoting gametype already exists in C in the codebase, but it is based on a file format instead of a DB. Let's keep things separate for now.

Algorithm

  • Basics
  • Partial play
Factor 1: What Do We Already Know About Your Skill?
  • previous skill level (μ, σ) from somewhere (e.g. a player database)
  • add some uncertainty (τ) to your skill’s standard deviation to keep game dynamics interesting
Factor 2: How Are You Going To Perform?
  • add in beta (β)
Factor 3: How is Your Team Going to Perform?
  • compute the performance of a team as a whole
  • a team's performance is the sum of each team member's performance
  • weight each team member’s contribution by the amount of time that they played
Factor 4: How’d Your Team Compare?
  • compare team performances in pairs
  • subtract team performances to come up with pairwise differences
Factor 5: How Should We Interpret the Team Differences?
  • build a comparison factor based on the team performance differences
  • the comparison depends on whether the pairwise difference is considered a “win” or a “draw”
  • we imagine that there is a buffer of space called a “draw margin” (ε) where performances are considered equivalent (see the sketch below)
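
Putting the factors together, a minimal sketch of the analytic two-team, no-draw update (ε = 0); β and the dynamics noise τ use the common TrueSkill defaults and are assumptions here, not the final tuning:

```c
#include <math.h>

#define BETA (25.0 / 6.0)   /* performance deviation (Factor 2) */
#define TAU  (25.0 / 300.0) /* additive dynamics noise (Factor 1) */

typedef struct { double mu, sigma; } rating_t;

static double normal_pdf(double x)
{
    return exp(-0.5 * x * x) / 2.5066282746310002; /* sqrt(2 * pi) */
}

static double normal_cdf(double x)
{
    return 0.5 * erfc(-x / sqrt(2.0));
}

/* Update the ratings of every player on the winning and losing teams. */
void trueskill_update(rating_t *winners, int n_w, rating_t *losers, int n_l)
{
    double mu_w = 0.0, mu_l = 0.0;
    double c2 = (n_w + n_l) * BETA * BETA; /* total performance variance */
    double c, t, v, w;
    int    i;

    /* Factor 1: inflate each sigma by tau, then sum team statistics. */
    for (i = 0; i < n_w; i++) {
        winners[i].sigma = sqrt(winners[i].sigma * winners[i].sigma + TAU * TAU);
        mu_w += winners[i].mu;
        c2   += winners[i].sigma * winners[i].sigma;
    }
    for (i = 0; i < n_l; i++) {
        losers[i].sigma = sqrt(losers[i].sigma * losers[i].sigma + TAU * TAU);
        mu_l += losers[i].mu;
        c2   += losers[i].sigma * losers[i].sigma;
    }

    /* Factors 4 and 5: scaled difference of team performances, and the
     * mean/variance corrections for a win (no draw margin, epsilon = 0). */
    c = sqrt(c2);
    t = (mu_w - mu_l) / c;
    v = normal_pdf(t) / normal_cdf(t);
    w = v * (v + t); /* always in (0, 1): uncertainty can only shrink */

    for (i = 0; i < n_w; i++) {
        double s2 = winners[i].sigma * winners[i].sigma;
        winners[i].mu   += (s2 / c) * v;
        winners[i].sigma = sqrt(s2 * (1.0 - (s2 / c2) * w));
    }
    for (i = 0; i < n_l; i++) {
        double s2 = losers[i].sigma * losers[i].sigma;
        losers[i].mu   -= (s2 / c) * v;
        losers[i].sigma = sqrt(s2 * (1.0 - (s2 / c2) * w));
    }
}
```

Partial play (Factor 3) is not shown; it would weight each player's contribution and update by the fraction of the match actually played.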

Extensions

  • Map extension
We extend TrueSkill with the map-side effect and time effect parameters:
  • An additional Gaussian can be used in the team performance factor
  • An optional cvar might be required, as this requires prior map data collection

Map parameter

Approximate the map-side advantage with a discrete Bernoulli distribution:
  • p = probability of a team winning on a certain map (0 < p < 1)
  • mean = p
  • var = p * (1-p)

Both the map-side and map time parameters are taken into account here, as the length of the map determines who is more likely to win.

Note:
  • Scaling of mean and var should be done by a factor of 2 * MU
  • The number of matches required before p is valid should be defined. For the more general binomial distribution B(n, p), with n = matches and p = win probability, mean = np and var = np(1-p); the normal approximation is N(np, np(1-p)) for n > 20 and p not near 0 or 1. See Binomial proportion confidence interval and g_playerRating_mapPad from ETPub for stabilizing the map bias mean early.
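
A possible sketch of that early stabilization, in the spirit of etpub's g_playerRating_mapPad (the exact etpub formula may differ): pad the observed counts with virtual matches split evenly between the two sides, so early estimates stay close to 0.5:

```c
/* Padded estimate of the probability p that a given side wins this map.
 * 'pad' virtual matches, split evenly between the sides, pull the early
 * estimate toward 0.5 until enough real matches have been recorded;
 * then mean = p and var = p * (1 - p) per the Bernoulli model above. */
double map_side_probability(int side_wins, int matches, int pad)
{
    return (side_wins + 0.5 * pad) / (double)(matches + pad);
}
```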

Mod integration

See #403.