- Published on
- ·9 min read
Sorting things you cannot measure — only match
- Authors

- Name
- هاني الشاطر
Eight players, no stats sheet. You can't read anyone's skill; you can only make two of them play and watch who wins — and a single match is a coin weighted by the skill gap. How many matches until the league table is right, and who should you schedule?
Hidden true skill on the left, the Bradley-Terry estimate on the right. Press reveal true skill any time to see how close the table is. Change the scheduler to decide who plays — or set it to you scout and run the matches yourself.
This is sorting — but the comparison oracle is unreliable. In a textbook you sort by reading numbers. Here the only way to compare two players is to stage a match, and the better player merely probably wins. Sorting under that oracle is the noisy comparison (or dueling-bandit) problem, and it turns "what is the order?" into "where do I spend my matches?"
Act 1 — Sort, when you can read the numbers
Mergesort, quicksort, heapsort: comparisons against an exact oracle that says "x is larger" with certainty. The textbook closed this in 1962 and nobody reopened it. Every comparison is free of doubt, so you never ask the same one twice.
Act 2 — The oracle is a match, not a measurement
Now the eight things are footballers, wines, ad creatives, chess players — anything with a real but unobservable quality. You can't measure skill; you can only run a comparison, and the comparison is noisy. A lopsided match goes the right way almost always; an even match is close to a coin flip. The model for that coin is exactly the Bradley-Terry law: the probability the stronger wins rises smoothly with the skill gap.
The link to multi-armed bandits is direct: each of the pairs is its own Bernoulli arm, and you are running them in parallel, asking not "which arm pays most" but "which side of each pair is the win on." Your budget is the total number of matches. The estimate you carry isn't a win count — it's a fitted skill per player (Bradley-Terry), and the order is just the players sorted by fitted skill.
Math object
Latent skill & true order: ; the true order is the players sorted by (descending). Every player has one hidden number you never see — the league table is an attempt to recover the order those numbers imply.
Action: play a pair and observe one bit: who won.
Match law (Bradley-Terry): . The bigger the skill gap, the more lopsided the coin. At it is a fair flip; sets how decisive matches are.
MM estimator (the widget): . Here is 's total wins and the games and played. A few minorize-maximize passes fit the skills by maximum likelihood; the (Laplace) smoothing keeps a player who has won or lost everything from flying off to .
Metric (Kendall ): — the fraction of pairs your table orders the right way, rescaled to . is a perfect table.
Comparison-sort budget: exact oracle vs. noisy oracle repeats per close pair. With a certain oracle you never ask the same comparison twice. With a noisy oracle, a pair with skill gap needs repeated matches to be sure at confidence — close pairs (tiny ) dominate the cost, which is exactly why where you spend matches matters.
Act 3 — Where to spend the matches
The table in the hero already gives you the whole loop: schedule a pair, play the match, refit the skills, repeat. What it leaves open is who to schedule. Four policies, same league, plotted on one chart below:
- Random draws — pick any pair. Trivial, but spends as many matches confirming a blowout as resolving a coin-flip.
- Round-robin — everyone plays everyone equally: . Unbiased coverage, but wastes the budget on lopsided pairs whose order was never in doubt.
- Ladder (adjacent) — only play players one rank apart on the current table. The order is uncertain between neighbours, not between #1 and #8. Settles a correct table in ~1.5× fewer matches than random, but trusts the current (noisy) ranking to define "adjacent."
- Active (boundary) — play the closest, least-settled pair, . Information lives where two skills are nearly equal and you have little data, so this routes every match to the pair that buys the most table-correctness.
The information you need lives at the boundaries — pairs whose order is genuinely in doubt — and most pairs aren't close. Policies that aim at the boundary settle the table with far fewer matches.
Same hidden league for all four pairing policies. The curve is Kendall (table correctness) versus matches played; whichever reaches the top soonest settles the table first. Boundary-seeking policies (ladder, active) typically reach a correct table in ~1.5× fewer matches than random.
Now you be the scout
Set the hero's scheduler to you scout (click 2) and the matchmaking is yours: click any two players to stage a match, read the table, and decide who to test next. Pick a target — the #1 or the top 3 — and when you're sure, lock my pick: the widget freezes your current top-k and reveals whether you nailed it and how many matches it cost. Try to beat the active policy's budget. You'll feel the same pull the algorithms feel: the wasted matches are the lopsided ones; the matches that move you are the close calls near your cutoff.
Eight was the tutorial. Now ten thousand.
With eight players you can almost play everyone. With 10,000 you cannot: comparing every pair is million matches. And often you don't even want the full order — you want the best 100 (a shortlist, a leaderboard, a qualifying round). So the question sharpens: spend a limited match budget to recover the top-100, not to sort everyone.
Two changes make this tractable. First the estimator: fitting Bradley-Terry over 10k players every match is too slow, so we use Elo — an online, -per-match update that is essentially streaming Bradley-Terry:
The update is one SGD step on the same logistic likelihood; just rescales it so a 400-point gap means a 10:1 odds favourite. Second the matchmaking: random pairings waste most matches on blowouts far from the cutoff, while an adaptive scheme concentrates matches near the current top-100 cutline — the boundary, at scale. The metric is recall@100: how much of the true top-100 your estimated top-100 has captured.
Hidden skills for 10k players; each match is one noisy Bradley-Terry duel, scored by Elo. Adaptive matchmaking (green) vs random (grey) race on recall@100. The scatter shows estimated rating vs true skill for a 2k sample with the top-100 cutline. All-pairs round-robin would be ~50 million matches; adaptive recovers most of the top-100 in tens of thousands.
Two solvers, one goal: Bradley-Terry vs the combinatorial bandit
Finding the best K of N is a job two very different machines can do, and putting them side by side is the cleanest way to see what a combinatorial bandit actually is. Both want the top-k; they differ in the feedback they consume:
- Bradley-Terry active sorting asks for one match per round, fits latent skills by online logistic SGD, and reads the top-k off the sorted skills. Preference estimation → ranking inference — the natural tool when all you can collect is "A beat B."
- Combinatorial-UCB picks a top-k slate directly each round and observes a noisy reward per selected item (semi-bandit feedback). It never models preferences; it optimises the slate online.
This is deliberately not a fair race — Comb-UCB sees K reward samples per round, BT sees a single match bit — and that asymmetry is the lesson.
Real averaged curves from a 40-run experiment (N=30, K=5, 1000 rounds). Toggle the metric; press replay to animate. Both converge to almost the same top-k accuracy by round 1000 — BT 0.90, Comb-UCB 0.91 — but how fast shows the feedback asymmetry. By round 100 Comb-UCB is at 0.80 while BT is at 0.59, and over the full run Comb-UCB pays roughly half the cumulative regret. It pulls ahead precisely because it tastes five items every round instead of comparing two. But notice what each produces: BT hands you a full calibrated ranking of all 30 items plus a model of why (the latent skills), reusable for any k; Comb-UCB hands you a good slate and per-item means, and nothing about the items it never selected. Same goal, different artifacts — and the speed gap is the feedback, not the cleverness.
What you saw
Sorting under noise is choose-the-max in disguise — once for the best, again for the next, with comparisons you can repeat to drive down uncertainty. The budget question is exactly the bandit's explore vs exploit: spend matches on pairs you're already sure about (no new information) or on pairs whose order you don't know (every match buys real progress)? Active and ladder schedules route the budget to the boundary, and that's why they win.
It's also the first hint of structure. With independent arms every pair stands alone. Here the pairs are coupled by one global ranking: if A ≻ B and B ≻ C, then A ≻ C for free. That free transitive information is what gets exploited with a vengeance when the "arms" become the edges of a graph and the action you pick is a whole path through it.