CRS Pilot, KPR Exposes Human Bias - 7/20/23
We have completed six weeks of our Computerized Ratings Pilot.
June 6/8 = Practice Week
June 13/15 = 1st Official Week, Scores Entered into DUPR via Phone
June 20/22 = 2nd Official Week, Scores Entered into DUPR via Administrator
June 27/29 = 3rd Official Week, All Players Must Participate in DUPR
July 11/13 = 4th Official Week, Thursday Women’s Session Cancelled due to Heat
July 18/20 = 5th Official Week, Team-Based Events
Upcoming July 25/27 = 6th Official Week, Normal Process
Upcoming August 1/3 = 7th Official Week, PICK YOUR PARTNER WEEK
We continue to learn!
130 players have played at least one game between June 13 and July 20.
If a player is mis-classified, the computer can move that player in 0.25-point increments within 1-2 weeks.
For other players, it takes the computer 3-7 weeks (12-28 games) to move a rating by 0.25 points. Even then, the player does not necessarily stay at the higher level.
About 75% of our players have moved less than +/- 0.15 ratings points (i.e., they are staying approximately where they started).
We continue to measure DUPR, but DUPR ratings are not adjusting fast enough for broad use by a club our size.
The purpose of our pilot is to test the computer. Many of you are put in challenging spots, thereby artificially hurting/helping your computer rating. After 5-6 sessions, these biases start to even out. Once a player hits about 20 matches, we begin to see a more reasonable story of how that player compares to other players.
Upsets have happened at a consistent rate throughout the pilot.
Lower Color Level / Higher Color Level vs. Higher/Higher = 31% chance of an upset.
Lower Color Level / Lower Color Level vs. Higher/Higher = 19% chance of an upset.
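For the technically curious, here is a rough sketch of how an upset rate like that gets tallied from a game log. The matchup labels and results below are invented for illustration; they are not our actual pilot data.

```python
from collections import defaultdict

# Each record: (matchup type, True if the lower-rated team won the game).
# These sample records are invented for illustration; they are not our pilot data.
games = [
    ("Lower/Higher vs. Higher/Higher", True),
    ("Lower/Higher vs. Higher/Higher", False),
    ("Lower/Higher vs. Higher/Higher", False),
    ("Lower/Lower vs. Higher/Higher", False),
    ("Lower/Lower vs. Higher/Higher", True),
]

totals = defaultdict(int)
upsets = defaultdict(int)
for matchup, underdog_won in games:
    totals[matchup] += 1
    if underdog_won:
        upsets[matchup] += 1

for matchup, played in totals.items():
    print(f"{matchup}: {upsets[matchup] / played:.0%} chance of an upset")
```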
Our First Significant Computer Failure! Or Did Kevin Fail?
Dear readers, no process works perfectly. This week, things did not work perfectly.
This week, we held a Team-Based event (Team Pebbles vs. Team Creekers). Here are the results:
Men: Team Pebbles = 16 Wins, Team Creekers = 4 Wins
Women: Team Pebbles = 7.5 Wins, Team Creekers = 7.5 Wins
The matchup among our Women was a spicy affair, with neither team ever leading by more than two wins.
However … the Men … what a mess!
I used a snake draft format (highest rated player = Team 1, 2nd/3rd highest rated = Team 2, 4th/5th highest rated = Team 1, 6th/7th highest rated = Team 2 … and so on). I used the same format for both Men and Women.
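For anyone wondering what a snake draft looks like in practice, here is a small sketch. The player names and ratings are invented for illustration; the real draft used our actual KPRs.

```python
# Snake draft, as described above: rank players by rating, then alternate
# picks 1, 2-2, 1-1, 2-2, 1 ... so the two teams stay balanced.
# The names and ratings below are invented for illustration only.
players = {
    "Player A": 3.95, "Player B": 3.70, "Player C": 3.55, "Player D": 3.40,
    "Player E": 3.30, "Player F": 3.20, "Player G": 3.05, "Player H": 2.90,
}

ranked = sorted(players, key=players.get, reverse=True)

team1, team2 = [], []
for pick, name in enumerate(ranked):
    # Picks 1, 4, 5, 8, ... go to Team 1; picks 2, 3, 6, 7, ... go to Team 2.
    (team1 if pick % 4 in (0, 3) else team2).append(name)

average = lambda team: round(sum(players[p] for p in team) / len(team), 3)
print("Team 1:", team1, "average KPR:", average(team1))
print("Team 2:", team2, "average KPR:", average(team2))
```

With those invented ratings, the two team averages land within a couple hundredths of a point of each other, which is the whole point of snaking the picks.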
Here are the KPRs prior to starting the Men’s Team Event:
Team Pebbles = 3.289
Team Creekers = 3.279
In other words, the two teams are equal. You cannot ask a computer to do a better job, can you?
Here are the KPRs after completion of the Men’s Team Event:
Team Pebbles = 3.339
Team Creekers = 3.229
There isn't a huge difference there: you'd expect Team Pebbles to gain about 0.15 points, and they gained 0.10. Part of the shortfall is that once Team Pebbles started winning, they became increasingly favored in rounds 3/4/5, earning fewer ratings points for each subsequent win.
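For the curious, here is a rough, Elo-style sketch of why a favored win pays less and less as the gap grows. The K factor and scale below are made-up, illustrative values, not the actual KPR formula.

```python
# Hedged, Elo-style illustration. K and SCALE are made-up values,
# not the actual KPR parameters.
K = 0.05      # maximum rating swing per game (illustrative)
SCALE = 0.5   # how quickly a rating gap turns into lopsided odds (illustrative)

def win_probability(team, opponent):
    """Logistic estimate of the chance that `team` beats `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - team) / SCALE))

def credit_for_win(team, opponent):
    """Rating points the winner earns: upsets pay a lot, expected wins pay little."""
    return K * (1.0 - win_probability(team, opponent))

# If Team Pebbles keeps winning, the gap keeps widening, so each new win pays less
# (assuming, for illustration, a zero-sum exchange of rating points).
pebbles, creekers = 3.289, 3.279
for rnd in range(1, 6):
    gain = credit_for_win(pebbles, creekers)
    print(f"Round {rnd}: Pebbles favored at {win_probability(pebbles, creekers):.0%}, "
          f"credit for a win = {gain:.3f}")
    pebbles += gain
    creekers -= gain
```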
So what happened? Was the computer at fault, or was Kevin at fault?
It was Kevin. His fault.
I let the computer pick teams.
I picked the matchups. I introduced a significant bias, where Team Creekers were generally expected to lose … often.
Out of 20 games, Team Creekers was favored to win one (1) game. One! Two equal teams, but when a human gets involved trying to create “fair matchups”, the human failed. Miserably!
Round 1 = Creekers favored in 0 of 4 games. They won 1 game.
Round 2 = Creekers favored in 0 of 4 games. They won 1 game.
Round 3 = Creekers favored in 0 of 4 games. They won 0 games.
Round 4 = Creekers favored in 0 of 4 games. They won 1 game.
Round 5 = Creekers favored in 1 of 4 games. They won that game and lost the other three.
We've talked about upsets during our pilot. Upsets should happen. For Team Creekers, upsets happened in 3 of the 19 games in which they were the underdog (16%). Based on our pilot, upsets should happen between 19% and 31% of the time, so Creekers did just a little worse than normal, but not by enough to matter. If Creekers had won 5 of those 19 games, the final score would have been 14-6 instead of 16-4. No difference, really.
What is the cautionary tale here?
“HUMANS CAN BE VERY BIASED WHILE THINKING THEY ARE BEING VERY FAIR.”
I messed up this team event.
Badly.
Knowing how badly I messed this up on Tuesday, I over-compensated on some of our matchups for the Women's team day. The result? A 7.5-7.5 outcome. I exaggerated to the point where I had players playing 2-3 levels above or below where they should be playing, to see if I could override the "fairness bias" that turned out to be an "unfairness bias" on Tuesday. Also, we didn't have enough A/B/G/I (Aqua, Burgundy, Green, Indigo) players to build more equal matchups, so I had to create even matchups with significant rating differences on each side of the court (e.g., Indigo with Purple against Green with Purple, or Orange with Teal vs. Purple with Red).
Again … there is a key takeaway from today:
“HUMANS CAN BE VERY BIASED WHILE THINKING THEY ARE BEING VERY FAIR.”
There is a good reason we cannot move players up to higher color levels based on results in this pilot. Imagine being a guy on Team Creekers. I hurt their KPRs by an average of 0.10 points on Tuesday. That isn’t fair. It is, however, the whole purpose of a pilot. It is our job (all of us) to learn what works and what does not work.
It wasn’t a good idea for Kevin to take a fair split of players via computer and then assign matchups based on Kevin’s biased mind.
When an additional 800 players come back to PebbleCreek in October, we want to have a body of knowledge. If we decide to use a Computerized Ratings System this Fall/Winter, we need to know what works and what does not work. We learned a lot this week, didn’t we?
Parity of Play
I received numerous comments this week from those playing in the pilot that the "parity of play" was poor. This happened for three reasons.
First, with a team-based event, the number of matchups we can employ is greatly limited. For this reason, I am leaning toward not having a team-based event at the end of August. Think about it this way: with four players in our normal format, we can get three games out of the four players, all unique combinations. Those same four players in a team format can only play each other one time. We lose two-thirds of our available matchups (there's a short sketch of this at the end of this section).
Second, hot temperatures have reduced our player pool from 32 players (and 20 on a wait list) to roughly 20 players per session.
Third, we have recently had low participation among Aqua/Burgundy/Green players. When we have too few A/B/G players, we must match Maroons with Indigos, Oranges with Maroons, Purples with Oranges … you get the picture. A lack of A/B/G players trickles down to all players.
To date, 1% of our games included Aqua players, 4% Burgundy, and 8% Green. These rates of play are reasonable. No worries.
In the past week, 0% of our games included Aqua/Burgundy players, and only 6% included Green players. This impacts parity all the way down the line.
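Here is the short sketch promised above. With four players, our normal format can rotate through three unique doubles matchups; a team format locks in just one of them. The player names are placeholders.

```python
from itertools import combinations

# Four placeholder players; names are illustrative only.
players = {"A", "B", "C", "D"}

# Normal format: every distinct way to split four players into two doubles teams.
splits = set()
for pair in combinations(sorted(players), 2):
    team1 = frozenset(pair)
    team2 = frozenset(players - team1)
    splits.add(frozenset({team1, team2}))

print(len(splits), "unique games are possible with four players:")
for split in splits:
    side1, side2 = (sorted(team) for team in split)
    print(" ", side1, "vs.", side2)

# Team format: with A/B locked on one team and C/D on the other,
# only one of those three games can ever be played.
```

One usable game out of three is exactly the "lose two-thirds of our matchups" math.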
Targeting, Part 74
One of our A/B players issued the following statement (paraphrased here) to some of our volunteers:
“The players on the court are not equal color levels. This means that one or two players are targeted. You aren’t testing a computer. You are testing the ability of a better player to target another player.”
On the surface, this is a seductive argument, one that seems challenging to refute.
Let’s refute the argument … with actual data from our pilot.
We are not testing player ability this summer.
We are testing the ability of a computer to properly assess the probability of a better team defeating a less-qualified team. If the computer can properly assess probabilities in this environment (where an Orange player might be playing with or against an Indigo player), it is absolutely going to cook when evaluating Aqua vs. Aqua, Burgundy vs. Burgundy, or Teal vs. Teal matchups.
The computer has done a fabulous job of assessing probabilities. We know this is true, because when there are significant mis-matches, the team that wins earns 0.01 ratings point or less. The highly favored team gets virtually no credit for the win.
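As a single-game illustration of that point, here is the same hedged, Elo-style credit rule from the team-event sketch above, collapsed into one function. The k and scale values are illustrative guesses, not the actual KPR formula.

```python
# Same hedged, Elo-style credit rule as the earlier sketch, collapsed into one function.
# The k and scale values are illustrative guesses, not the actual KPR parameters.
def win_credit(gap, k=0.05, scale=0.5):
    """Rating points the favorite earns for winning a game with the given rating gap."""
    favorite_win_probability = 1.0 / (1.0 + 10 ** (-gap / scale))
    return k * (1.0 - favorite_win_probability)

for gap in (0.05, 0.25, 0.50, 0.75):
    print(f"Rating gap {gap:.2f}: credit to the favorite for winning = {win_credit(gap):.3f}")
```

The bigger the expected mismatch, the closer the winner's credit falls toward zero.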
Targeting = Mismatches. Are Our Games Mismatches?
Those watching games on Court 1 might come to a different conclusion about the quality of our games than those who watch all of our players compete on all of our courts. Within our pilot, to date:
Number of games within a 0.10 rating-point difference between the teams = 120 (41%)
Number of games with a 0.10 to 0.25 rating-point difference = 109 (37%)
Number of games with a 0.25 to 0.50 rating-point difference = 58 (20%)
Number of games with a 0.51+ rating-point difference = 6 (2%)
A whopping 78% of our games are played within the ratings range of one color group … in other words, 78% of our games mirror what you'd see in a round robin.
Let that one sink in for a moment.
78% of our games are played by players of a similar quality to what you’d observe in a Round Robin.
What you see happening on Court 1 is not representative of what happens across Courts 1 through 8 in total. Watch all our volunteers play, like I do, and perceptions might change.
Probability of Winning?
0.10 to 0.25 rating point worse than opponent = 31%
0.25 to 0.50 rating point worse than opponent = 19%
0.50 to 0.75 rating point worse than opponent = 17%
Even in the allegedly "non-competitive" games with color discrepancies on the court, the team being "targeted" has a 17% to 19% chance of winning the game. They are at a 1-2 color-level disadvantage, and they still win 17%-19% of their games.
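For anyone wondering how both breakdowns above get tallied, here is a rough sketch of the bookkeeping: bucket each game by the KPR gap between the two teams, then track the share of games in each bucket and how often the lower-rated team wins. The sample records are invented; the real numbers come from our pilot's full game log.

```python
# Each record: (higher-rated team's KPR, lower-rated team's KPR, True if the lower-rated team won).
# Sample records are invented for illustration; they are not our pilot data.
games = [
    (3.42, 3.38, True), (3.50, 3.48, False), (3.55, 3.33, True),
    (3.60, 3.31, False), (3.75, 3.05, False), (3.90, 3.20, False),
]

buckets = [
    ("Within 0.10", 0.00, 0.10),
    ("0.10 to 0.25", 0.10, 0.25),
    ("0.25 to 0.50", 0.25, 0.50),
    ("0.51+", 0.50, 99.0),
]

for label, low, high in buckets:
    in_bucket = [g for g in games if low <= g[0] - g[1] < high]
    if not in_bucket:
        continue
    share_of_games = len(in_bucket) / len(games)
    underdog_win_rate = sum(1 for g in in_bucket if g[2]) / len(in_bucket)
    print(f"{label}: {share_of_games:.0%} of games, underdog won {underdog_win_rate:.0%}")
```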
Our volunteers are taking full advantage of the opportunities presented to them.
Our pilot is designed to prove or disprove the quality of a computer rating system.
Our pilot, however, provides opportunities for players who win. I’m proud of our volunteers for taking advantage of their opportunities this summer!!!
Humans can be very biased while thinking they are being very fair.