Scoring

How scoring works

Agent5 does not score you only on being right. It scores you on being calibrated: how close your stated confidence was to what actually happened. The tool for that is a proper scoring rule, the same kind serious forecasters use.

A single prediction

You answer each question with a probability p that the answer is yes (say 0.80 for “80% likely yes”). When it resolves to an outcome o (1 for yes, 0 for no), your score is:

Score = (1 − (p − o)²) × 100

That is the Brier score, the squared error (p − o)², inverted and rescaled so that higher is better and the range is 0 to 100. A perfect prediction earns 100. The worst possible one, full confidence in the wrong outcome, earns 0, and no score goes below that.
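As a minimal sketch, the formula above can be written as a small Python function. The name score and the input checks are illustrative assumptions, not Agent5's actual code.

```python
def score(p: float, outcome: int) -> float:
    """Rescaled Brier score: (1 - (p - o)^2) * 100.

    p is the stated probability of "yes"; outcome is 1 for yes, 0 for no.
    The function name and the validation are illustrative, not Agent5's API.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 1 (yes) or 0 (no)")
    return (1.0 - (p - outcome) ** 2) * 100.0


# An 80% "yes" that resolves yes, then the same answer resolving no.
print(round(score(0.80, 1), 1))
print(round(score(0.80, 0), 1))
```

Plugging in 0.80 by hand gives (1 − 0.04) × 100 = 96 for a yes and (1 − 0.64) × 100 = 36 for a no, which is what the two print calls should show.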

Some numbers

Confident and right: p = 0.82, resolves yes, scores 96.8
Cautious and right: p = 0.55, resolves yes, scores 79.8
Confident and wrong: p = 0.90, resolves no, scores 19.0
Pure fence-sit: p = 0.50, any outcome, scores 75.0

Notice the asymmetry: confidence is rewarded when you are right and punished more heavily when you are wrong, while a flat 50/50 always banks 75 no matter what. The points live in conviction. To climb, you have to be confident and right, repeatedly.
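To put numbers on that asymmetry, the sketch below compares a few confidence levels against the 50/50 baseline of 75. The helper names score and swing are hypothetical, chosen only for this illustration.

```python
def score(p, outcome):
    """Rescaled Brier score from the formula above (illustrative helper)."""
    return (1.0 - (p - outcome) ** 2) * 100.0


# A 50/50 answer scores the same whichever way the question resolves.
BASELINE = score(0.5, 1)


def swing(p):
    """Return (points gained if yes, points lost if no) versus the baseline."""
    return score(p, 1) - BASELINE, BASELINE - score(p, 0)


for p in (0.60, 0.75, 0.90):
    gain, loss = swing(p)
    print(f"p={p:.2f}  gain if right: {gain:5.2f}  loss if wrong: {loss:5.2f}")
```

The loss column grows faster than the gain column, so extra confidence only pays off when it is backed by a matching hit rate.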

Why this formula

Two design choices a forecaster will want spelled out.

First, the rule is quadratic (Brier), not logarithmic. Both are proper rules, but a log score punishes a confident miss brutally, heading toward minus infinity as the probability you gave to the actual outcome approaches zero. Brier is bounded: the worst a single prediction can score is 0, so one bad call never wrecks your record. For a game you play hundreds of times, forgiving beats savage.
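The difference between the two rules is easiest to see on confident misses. The sketch below is only an illustration: log_score uses the natural logarithm of the probability assigned to the actual outcome, which is one common convention and not something Agent5 uses, and both function names are assumptions.

```python
import math


def brier_points(p, outcome):
    """Rescaled Brier score, bounded between 0 and 100."""
    return (1.0 - (p - outcome) ** 2) * 100.0


def log_score(p, outcome):
    """Natural-log score: 0 is perfect, more negative is worse, unbounded below."""
    prob_of_outcome = p if outcome == 1 else 1.0 - p
    return math.log(prob_of_outcome) if prob_of_outcome > 0 else -math.inf


# Increasingly confident "yes" answers on a question that resolves no.
for p in (0.90, 0.99, 0.999999, 1.0):
    print(f"p={p}  brier={brier_points(p, 0):.4f}  log={log_score(p, 0):.2f}")
```

The Brier column bottoms out at 0 while the log column keeps falling and reaches minus infinity at p = 1.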

Second, absolute, not peer-relative. Your score depends only on your probability and the outcome, not on how the crowd answered. A 96.8 means the same thing whether 50 people or 5,000 played the question, and you can check it yourself with a calculator.

You cannot game it

A proper scoring rule means your best expected score comes only from reporting your true belief. If you genuinely think 70%, then 70% is your highest-scoring answer, and shading to 90% or 50% to play the meta only lowers it. Good forecasting and a good score are the same thing.
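The 70% example can be checked directly. This sketch assumes your belief is the true chance of a yes and computes the expected score for several reported probabilities. The function name expected_score and the 1% search grid are illustrative choices.

```python
def expected_score(belief, report):
    """Expected points when yes happens with probability `belief` and you answer `report`."""
    points_if_yes = (1.0 - (report - 1.0) ** 2) * 100.0
    points_if_no = (1.0 - report ** 2) * 100.0
    return belief * points_if_yes + (1.0 - belief) * points_if_no


belief = 0.70
for report in (0.50, 0.70, 0.90):
    print(f"report {report:.2f}: expected {expected_score(belief, report):.2f}")

# Search every whole-percent answer for the one with the best expected score.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda q: expected_score(belief, q))
print(f"best report: {best:.2f}")
```

By hand, reporting 0.70 gives an expected 79 points, while shading to either 0.90 or 0.50 gives 75, and the grid search should land on 0.70 as well.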

Voided questions

Some questions cannot be fairly settled: the event is cancelled or redefined, or the source goes quiet. Those resolve void and are scored neutrally, with no points and no penalty for anyone. A guessed resolution would poison every prediction attached to it, so when in doubt a question is voided rather than fudged.

Your record

Each prediction is scored on its own, and your record is the average of those scores over time. Because the score is absolute and proper, that average is a genuine calibration history, a measure of how well-tuned your AI instincts are, not a function of how many people you happened to out-guess on any given day. Make more predictions and the picture of your calibration gets sharper.
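A record along these lines could be computed as a plain average. The sketch assumes that a void question is simply left out of the average (one reading of "no points and no penalty") and represents it with an outcome of None. The data layout and the name record_average are hypothetical.

```python
from statistics import mean


def score(p, outcome):
    """Rescaled Brier score for one resolved prediction."""
    return (1.0 - (p - outcome) ** 2) * 100.0


def record_average(predictions):
    """Average score over resolved predictions.

    Each item is (p, outcome) with outcome 1, 0, or None for a void question.
    Voids are skipped, which is an assumption about how "neutral" is applied.
    Returns None when nothing has resolved yet.
    """
    scores = [score(p, o) for p, o in predictions if o is not None]
    return mean(scores) if scores else None


history = [
    (0.82, 1),     # confident and right
    (0.55, 1),     # cautious and right
    (0.90, 0),     # confident and wrong
    (0.50, 1),     # fence-sit
    (0.70, None),  # voided question, ignored
]
print(round(record_average(history), 2))
```

With the four resolved examples from earlier on this page, the average is (96.76 + 79.75 + 19 + 75) / 4, or about 67.63, and the voided entry does not move it.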