Published on
·16 min read

From one-armed bandits to sorting you cannot measure

Authors
  • Avatar of هاني الشاطر
    Name
    هاني الشاطر
    Twitter

Five restaurants nearby. One will make you happiest on average. You can try only one tonight, your experience is noisy, and every mediocre dinner costs you a shot at a better one. That tiny dilemma is the seed of a whole family of problems — and by the end of this post it'll have grown into sorting players you can't measure and picking whole shortlists of K. The thread tying it all together never changes: explore vs exploit — where do I spend my next try?

Part I — The one-armed bandit: choosing dinner

That dinner problem is the multi-armed bandit: choose one option, observe one noisy outcome, update your belief, repeat. The hard part isn't the choosing — it's deciding how many bad dinners you're willing to buy in order to learn which restaurant is actually best.

The vocabulary is just the restaurant story in disguise. Each restaurant is an arm. A visit is a pull. The reward is whether tonight was good enough that you'd go back. You never see a restaurant's true average quality — only one noisy dinner at a time, and only for the place you chose. The others stay counterfactual: you never find out what you missed.

When you can read the numbers, it's boring

Hand me five restaurant scores — 0.3, 0.7, 0.5, 0.4, 0.6 — and "find the best" is a five-step scan. The interesting version hides the scores. Each dinner is a noisy sample of a quality you can't see: maybe the chef had a good night, maybe you ordered badly. One visit tells you something, not enough. You've got a hundred dinners — or ten thousand. Pick the order.

The cost of not knowing: regret

This is the shape of every repeated decision under uncertainty — where to eat, which email subject line to send, which widget to show. Formally: kk arms, each with a hidden reward distribution; one pull per round; a noisy reward observed; maximize cumulative reward. Equivalently, minimize regret — the gap to an oracle that always plays the best arm:

Regret(T)=t=1T(μμat),μ=maxiμi\text{Regret}(T) = \sum_{t=1}^{T}\big(\mu^* - \mu_{a_t}\big), \qquad \mu^* = \max_i \mu_i

Regret is the quality you leave on the table. If Sushi is truly best and you eat somewhere weaker tonight, the gap between those two hidden averages is tonight's expected regret. And here's the bind: you can't commit to the best arm after a handful of pulls (early estimates are noisy — you'll often crown the wrong one), but you can't explore uniformly forever either (every round on a bad arm is regret you never get back). Every algorithm is just a particular way of splitting your pulls between finding out and cashing in.

Three policies that cover most of the ground

The surprising lesson up front: the dumb-sounding policies work shockingly well, and the clever ones win by less than you'd think.

  • ε-greedy. Play the best-so-far 90% of the time; play a random arm the other 10%. The randomness is your tax against getting stuck on a wrong early guess. Hard to break, but it keeps wasting that 10% even after the winner is obvious.
  • UCB1optimism in the face of uncertainty. Score each arm by μ^i+2lnt/ni\hat{\mu}_i + \sqrt{2\ln t / n_i} and play the max: what you know, plus how wrong you might be. Barely-tried arms get a bonus that shrinks as you sample them, so exploration is directed at genuine uncertainty rather than sprayed uniformly.
  • Thompson samplingprobability matching. Keep a belief distribution per arm (a Beta for win/lose rewards); each round draw one sample from each and play the highest draw. A barely-tried arm has a wide posterior, so it sometimes draws high and gets explored; a known-bad arm has a narrow low posterior and almost never wins. Exploration fades on its own. Embarrassingly good in practice.

Notice what's shared: the loop is tiny — start each arm with a weak prior, pick, pull, update only that arm, repeat. What differs between algorithms is only how they turn "estimate plus uncertainty" into the next pull.

Race ε-greedy, UCB, and Thompson on the hard scenario (top two arms nearly tied), then race ε-greedy against pure exploit on balanced. The story tells itself: pure-exploit commits to the wrong arm about as often as the right one; ε-greedy beats it because a 10% random tax is cheaper than locking in a mistake; UCB and Thompson beat ε-greedy because their exploration is aimed; and UCB vs Thompson trade the lead by a noisy margin — bandits aren't where you find your edge.

A note on "UCB is slow": not in CPU terms — it's one index per arm. The slowness you feel is learning slowness. UCB1's 2lnt/ni\sqrt{2\ln t / n_i} bonus is a deliberately safe confidence radius (Hoeffding) valid for any bounded reward, so it keeps re-checking arms a more model-aware method would already drop. In production people reach for tighter variants (KL-UCB, UCB-Tuned) or Thompson when a Bayesian reward model is natural.

Pull it yourself — click a restaurant to eat there tonight, watch your regret line climb against a Thompson bot eating in parallel. A hundred dinners by hand is enough to feel how hard it is to beat directed exploration on instinct.

Context changes the best arm

One real-world wrinkle: the best dinner depends on the night. A quick lunch, a date night, a rainy solo dinner are different contexts. A plain bandit learns one global average per restaurant; a contextual bandit sees features of the situation first, then chooses. LinUCB is UCB with a twist — instead of one mean per arm, each arm learns weights θa\theta_a over a context vector xtx_t, predicts xtθax_t^\top \theta_a, and adds the same optimism bonus on top. The same restaurant can be wrong for lunch and perfect for a rainy night.

The one-sentence takeaway

A little cleverness wins by a lot; more cleverness wins by very little. ε-greedy, UCB, and Thompson are points on a smooth curve, and any of them crushes committing-too-early or never-committing. The textbook problem is essentially solved.

But it's a very specific problem: kk independent arms, one pull per round, reward seen immediately. The interesting world is when that structure breaks — when the "arm" you pull is a pair of items and the only thing you learn is which one is bigger. That's sorting under noise, and it's where we go next.

Part II — Sorting: when the arm is a pair

Eight players, no stats sheet. You can't read anyone's skill; you can only make two of them play and watch who wins — and a single match is a coin weighted by the skill gap. How many matches until the league table is right, and who should you schedule?

Hidden true skill on the left, the estimate on the right. Press reveal true skill any time to see how close the table is. Change the scheduler to decide who plays — or set it to you scout and run the matches yourself.

This is sorting — but the comparison oracle is unreliable. In a textbook you sort by reading numbers. Here the only way to compare two players is to stage a match, and the better player merely probably wins. Sorting under that oracle is the noisy comparison (or dueling-bandit) problem, and it turns "what is the order?" into "where do I spend my matches?"

We'll build it in three moves: the noise model (how a match becomes evidence), the estimator (how matches become a ranking), and the scheduler (which match to play next) — and then see why the "bandit" version that only cares about the top behaves so differently.

Act 1 — Sort, when you can read the numbers

Mergesort, quicksort, heapsort: O(nlogn)O(n \log n) comparisons against an exact oracle that says "x is larger" with certainty. The textbook closed this in 1962. Every comparison is free of doubt, so you never ask the same one twice, and there's nothing to estimate. Take that certainty away and everything changes.

Act 2 — The oracle is a match, not a measurement

Now the eight things are footballers, wines, ad creatives, chess players — anything with a real but unobservable quality. You can't read skill; you can only stage a match, and the match is noisy. A lopsided match goes the right way almost always; an even match is close to a coin flip. The model for that coin is the Bradley-Terry law:

P(ij)=σ ⁣(β(sisj))P(i \succ j) = \sigma\!\big(\beta\,(s_i - s_j)\big)

Each player carries one hidden number sis_i (their skill); σ\sigma is the logistic curve; β\beta is how decisive the sport is. The intuition is all in the gap sisjs_i - s_j: equal skill (si=sjs_i = s_j) gives a fair 50/50 coin; a wide gap pushes the coin toward a near-certain result. So one match is a single noisy sample of one pairwise gap — and that's exactly a Bernoulli arm. Sorting under noise is just running n(n1)/2n(n-1)/2 such arms in parallel and asking, for each pair, "which side does the win land on?"

The cost of certainty follows directly. A pair separated by a gap Δ\Delta needs about log(1/δ)/Δ2\log(1/\delta)/\Delta^2 repeated matches before you can call it at confidence 1δ1-\delta. Big gaps are settled in one match; near-tied pairs are where the budget goes. Hold onto that — it decides everything about scheduling.

How the table actually learns: from a batch fit to one step per match

We have match results; we want skills. Two ways to get them, and the second is the one that scales.

Batch fit (what the 8-player table does). Given all the matches, find the skills that best explain them — maximum likelihood for the Bradley-Terry model. The widget does this with a few minorize-maximize passes; you don't need the derivation, just the picture: it nudges everyone's skill until each player's predicted win-rate matches their actual win-rate. Then the ranking is simply the players sorted by fitted skill. Clean, but refitting from scratch after every match is hopeless once you have thousands of players.

Online fit = Elo = one gradient step per match. Treat each match as a single logistic-regression example and take one SGD step on it. For a match where ii beat jj:

sisi+η(1p),sjsjη(1p),p=σ(sisj)s_i \leftarrow s_i + \eta\,(1 - p), \qquad s_j \leftarrow s_j - \eta\,(1 - p), \qquad p = \sigma(s_i - s_j)

That is Elo (with step size η\eta playing the role of Elo's KK). The intuition is lovely: the update is the surprise, 1p1 - p. If the favourite wins (p1p \approx 1) the ratings barely move — you learned nothing new. If an underdog pulls an upset (pp small) the ratings swing hard. The system self-corrects toward whatever the matches keep saying, O(1)O(1) per match, no refit.

So the whole recipe in the noisy world is three separable dials:

estimate skills by SGD on matches → read off the ranking by sorting them → choose which match to sample next.

Change the estimator (batch vs Elo) without touching the rest; change the scheduler without touching the estimator. The next two sections are about that third dial — because which matches you feed the SGD is what decides how fast the sort converges.

Act 3 — Where to spend the matches

Here's the payoff of the Δ2\Delta^2 budget. Look again at the Elo step, η(1p)\eta(1-p):

  • A lopsided match (p1p \approx 1) produces a step near zero and never reorders anyone. It confirms what you already knew.
  • A near-even match (p12p \approx \tfrac12) produces a real step and it's between two players whose order is genuinely in doubt.

The informative matches are the close ones at the boundary. A good scheduler aims there. Four policies, same league, raced on the chart below:

  • Random — pick any pair. Trivial, but spends as many matches confirming blowouts as resolving coin-flips.
  • Round-robin — play the least-played pair, argmin(i,j)nij\arg\min_{(i,j)} n_{ij}. Fair coverage, but blind: it keeps re-playing settled blowouts.
  • Ladder — only play players one rank apart on the current table. The order is uncertain between neighbours, not between #1 and #8, so this aims at the boundary — ~1.5× fewer matches than random — but it trusts a noisy ranking to decide who's "adjacent."
  • Active — play the closest, least-settled pair, argmax(i,j)[closeness(s^is^j)+uncertainty(1/(1+nij))]\arg\max_{(i,j)}\big[\text{closeness}(|\hat{s}_i - \hat{s}_j|) + \text{uncertainty}(1/(1+n_{ij}))\big]. It explicitly routes each match to where it buys the most table-correctness.

This is the explore/exploit trade in its purest form: a match on a pair you're already sure about is pure exploitation with no information; a match on an uncertain pair is exploration that actually moves the answer.

Same hidden league for all four; the curve is Kendall τ\tau (table correctness) vs matches played. Whichever reaches the top soonest settles the table first — and the boundary-seekers win.

Now you be the scout

Set the hero's scheduler to you scout (click 2) and the matchmaking is yours: click two players to stage a match, read the table, and decide who to test next. You'll feel the same pull the algorithms feel: the wasted matches are the lopsided ones; the matches that move you are the close calls near your cutoff.

Why your candidate pool must be wider than K

Now suppose you don't want the whole order — just the best K. The tempting shortcut is to only play matches inside the current top-K. It backfires three different ways, and seeing why is the most useful idea in the post:

  1. Selection bias (the feedback loop). A genuinely top-K player who's currently underrated at rank K+1 never gets scheduled, so never climbs back. You let a noisy line decide who's allowed to play, and the line freezes itself in place.
  2. Dead gradients. Inside a settled top-K every match is lopsided (p1p \approx 1), so the Elo step η(1p)0\eta(1-p) \approx 0. Nothing reorders. The matches that actually move the cutline are the near-even ones that straddle it.
  3. Broken identifiability. Bradley-Terry/Elo skills are only well-defined while the "who-played-whom" graph stays connected. A top-K-only schedule disconnects the top group from the field, so "K-th place" loses its anchor entirely.

The fix is exactly the instinct you'd reach for in a real league: keep a candidate band around the cutline — everyone whose uncertainty still overlaps K-th place — and let the band narrow as you grow confident. Too tight (just K) and you get all three failures; too wide (the whole field) and you're back to round-robin, burning matches on decided blowouts. That shrinking band is precisely successive elimination: keep contenders until the data rules them out.

Eight was the tutorial. Now ten thousand.

With eight players you can almost play everyone. With 10,000 you cannot: every pair is n(n1)/250n(n-1)/2 \approx 50 million matches. So we lean on both ideas above. Estimator: Elo, the O(1)O(1)-per-match SGD update, because refitting Bradley-Terry over 10k players every match is hopeless. Scheduler: adaptive, which estimates the K-th rating (the cutline) and pairs players near it — the boundary band, at scale. The metric is recall@100: how much of the true top-100 your estimated top-100 has captured.

Each match is one noisy Bradley-Terry duel scored by Elo; adaptive matchmaking (green) vs random (grey) race on recall@100, and the scatter shows estimated rating vs true skill with the top-100 cutline. Random wastes itself on blowouts far from the line; adaptive recovers most of the top-100 in tens of thousands of matches instead of 50 million — entirely by spending its budget in the band around the cutline.

Part III — Combinatorial bandits: picking a whole top-K

Everything so far lives in one world: the only thing you ever observe is who won — a comparison bit. There's a second world where each thing you try hands back its own number: a click, a watch-time, a benchmark score. Top-K shows up in both, but they are not the same problem, and they need different machines.

  • Comparison feedback (this whole post): Bradley-Terry / Elo active sorting. Fit skills from "A beat B," sort, schedule the boundary.
  • Per-item reward feedback: Combinatorial-UCB. It's the exact UCB idea from Part I, scaled from one arm to a whole slate. Keep, for each item, a running mean of its rewards plus the same optimism bonus that's large when you've barely tried it, ui=xˉi+clnt/niu_i = \bar{x}_i + \sqrt{c\,\ln t / n_i}. Each round, pick the top-K by uiu_i (that's the "combinatorial" part — your action is a whole K-set, not a single arm), observe a reward for each of the K you picked ("semi-bandit" feedback), and update only those.

The honest catch — and it's the crux: Comb-UCB cannot run on "who won." It needs a reward per item; the football world never gives one. So the panel below is not Comb-UCB solving the sorting problem. It's the two machines on the same hidden item qualities but through different feedback channels — one drawing comparison bits, one drawing per-item rewards.

Averaged curves from a 40-run experiment (N=30, K=5, 1000 rounds). Both reach ~0.90 top-k accuracy by round 1000 — but how fast exposes the asymmetry: by round 100 Comb-UCB is at 0.80 while BT is at 0.59, because Comb-UCB tastes five rewards per round while BT extracts a single comparison bit (the widget's "feedback / round: 1 vs 5"). It's deliberately not a fair race — that's the lesson, not a verdict. And notice the artifacts differ: BT leaves you a full calibrated ranking of all 30 items plus a model of why (the latent skills), reusable for any K; Comb-UCB leaves you a good slate of 5 and per-item means, and nothing about the items it never picked. Same goal, different feedback, different machinery — pick the machine your signal can actually feed.

What you saw

Sorting under noise is choose-the-max in disguise, and it cleanly separates into three dials: estimate (SGD on matches → skills), read off (sort), schedule (aim at the boundary). The budget question — spend matches on pairs you're sure about, or on pairs whose order is in doubt? — is the bandit's explore/exploit, and the answer is always "the boundary," widened into a band whenever you only need the top-K.

It's also the first hint of structure. With independent arms every pair stands alone; here the pairs are tied together by one global ranking, so A ≻ B and B ≻ C buys you A ≻ C for free. That free transitive information is what gets exploited with a vengeance once the "arms" become edges of a graph and the action you pick is a whole path through it.