Public Methodology
Effectiveness results are reported only after treatment and control groups are defined by a specific assignment method. GEO does not merge observational matches, holdouts, and randomized rollouts into a single blended score because each design answers a different causal question.
Matched observational
Optimized runs are paired to the nearest baseline runs on intent bucket, engine architecture, visibility profile, and recency before any lift estimate is published.
Treatment: Runs tagged as optimized through explicit assignment metadata, AutoGEO metadata, or higher template versions.
Control: Comparable baseline runs without optimization signals.
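The pairing step can be sketched as follows. This is a minimal illustration, not the production matcher: the `Run` fields, the `match_nearest_baseline` helper, and the `max_gap_days` cutoff are all hypothetical names. The sketch assumes the three categorical attributes must match exactly and that "nearest" is decided by recency alone.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Run:
    # Hypothetical record layout; field names are assumptions for illustration.
    run_id: str
    intent_bucket: str
    engine_architecture: str
    visibility_profile: str
    day: int          # run date as days since an arbitrary epoch
    optimized: bool   # True when any optimization signal is present


def match_nearest_baseline(
    treated: Run, pool: Sequence[Run], max_gap_days: int = 14
) -> Optional[Run]:
    """Return the closest-in-time baseline run sharing the treated run's strata."""
    strata = (
        treated.intent_bucket,
        treated.engine_architecture,
        treated.visibility_profile,
    )
    candidates = [
        b
        for b in pool
        if not b.optimized
        and (b.intent_bucket, b.engine_architecture, b.visibility_profile) == strata
        and abs(b.day - treated.day) <= max_gap_days
    ]
    if not candidates:
        return None  # no comparable baseline: the run stays unmatched
    # Ties on recency are broken by run_id so the pairing is deterministic.
    return min(candidates, key=lambda b: (abs(b.day - treated.day), b.run_id))
```

Returning `None` instead of relaxing the match keeps unmatched optimized runs out of the lift estimate, which is one way to honor the "before any lift estimate is published" constraint.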
Page holdout
Treatment and holdout cohorts remain explicit and are never blended with observational matches.
Treatment: Pages or queries explicitly marked for rollout.
Control: Pages or queries explicitly held out from rollout.
Randomized rollout
Reserved for research-partner experiments with explicit randomized assignment and clear cohort separation.
Treatment: Randomly assigned treatment cells.
Control: Randomly assigned control cells.
Bootstrap confidence intervals
GEO resamples run-level treatment and control measurements with replacement for 1,000 iterations. The 95% confidence interval is built from the empirical distribution of the resulting treatment-minus-control deltas.
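A minimal sketch of this procedure, using only the Python standard library. The function name `bootstrap_delta_ci` is hypothetical, and two details are assumptions not stated above: each arm is resampled independently, and the compared statistic is the arm mean. The interval is read off the sorted deltas with the percentile method.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_delta_ci(
    treatment: Sequence[float],
    control: Sequence[float],
    iterations: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    if not treatment or not control:
        raise ValueError("both arms need at least one measurement")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    deltas = []
    for _ in range(iterations):
        # Resample each arm with replacement, keeping its original size.
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        deltas.append(mean(t) - mean(c))
    deltas.sort()
    tail = (1.0 - confidence) / 2.0
    # Nearest-rank positions in the sorted empirical distribution.
    lo_idx = round(tail * (iterations - 1))
    hi_idx = round((1.0 - tail) * (iterations - 1))
    return deltas[lo_idx], deltas[hi_idx]
```

For example, `bootstrap_delta_ci(treated_scores, baseline_scores)` returns a `(low, high)` pair in the same units as the inputs; `treated_scores` and `baseline_scores` stand in for whatever run-level metric is being compared.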
Wilson intervals for proportions
GEO uses Wilson intervals with z=1.96 (a 95% confidence level) for bounded proportions such as recommendation and citation rates, because Wilson intervals hold their coverage better than the naive normal approximation at small sample sizes and for rates near 0 or 1.
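The Wilson score interval has a closed form, shown here as a short sketch. The function name `wilson_interval` and the choice to return the full range (0, 1) for an empty sample are assumptions made for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n == 0:
        return (0.0, 1.0)  # assumption: no data means no information
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # The center is pulled toward 0.5, which is what keeps the bounds inside [0, 1].
    center = (p + z2 / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n))
    return (max(0.0, center - half_width), min(1.0, center + half_width))
```

With 5 citations in 10 runs this gives roughly (0.237, 0.763). With 0 in 10 the upper bound is still about 0.278, whereas a naive normal interval would collapse to a zero-width interval at 0.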
p-value interpretation
The public p-value is a two-sided bootstrap sign estimate. It is a stability signal for the direction of the observed delta, not a standalone business-decision rule.
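One common way to build such a sign-based estimate from bootstrap deltas is sketched below. The exact production formula is not given here, so the doubling of the smaller tail and the function name `bootstrap_sign_p_value` should be read as assumptions.

```python
from typing import Sequence


def bootstrap_sign_p_value(deltas: Sequence[float]) -> float:
    """Two-sided sign estimate: twice the share of draws on the minority side of zero."""
    n = len(deltas)
    if n == 0:
        raise ValueError("need at least one bootstrap delta")
    share_non_positive = sum(1 for d in deltas if d <= 0) / n
    share_non_negative = sum(1 for d in deltas if d >= 0) / n
    # Doubling the smaller tail makes the estimate two-sided; cap at 1.0.
    return min(1.0, 2.0 * min(share_non_positive, share_non_negative))
```

If 20 of 1,000 bootstrap deltas fall at or below zero, the estimate is 0.04: the sign of the delta flipped in 2% of draws, doubled for the two-sided reading.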
Example experimental result using the current reporting format. Values are shown in percentage points.
Assignment methods stay separate
Observational matches, holdouts, and randomized rollouts are shown side by side rather than blended into one headline number.
Intervals carry more weight than point estimates
A positive point estimate whose interval overlaps zero is reported as "no conclusion" rather than as uplift.
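As a small illustration of that rule, the label can be derived from the interval bounds alone. The function name `classify_effect` and the "decline" label are hypothetical; an interval that merely touches zero is treated here as overlapping it.

```python
def classify_effect(ci_low: float, ci_high: float) -> str:
    """Label a treatment-minus-control interval by where it sits relative to zero."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_low > 0:
        return "uplift"
    if ci_high < 0:
        return "decline"
    # The interval contains zero, so the direction is not established.
    return "no conclusion"
```

The point estimate is deliberately not an input: a delta of +0.8 percentage points with an interval of (-0.3, 1.9) gets the same label as a delta of 0.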
p-values are directional stability checks
They indicate how often the bootstrap draws reverse sign, not whether a change matters to the business.
Production implementation anchors: packages/monitor/src/geo_monitor/services/effectiveness.py and packages/monitor/src/geo_monitor/services/statistics.py. GEO Index scoring details are published separately in the GEO Index methodology.