Public Methodology

How GEO validates measured lift.

Effectiveness results are reported only after treatment and control groups are defined by a specific assignment method. GEO does not merge observational matches, holdouts, and randomized rollouts into a single blended score because each design answers a different causal question.

Matched observational

Optimized runs are paired to the nearest baseline runs on intent bucket, engine architecture, visibility profile, and recency before any lift estimate is published.

Treatment: Runs tagged as optimized through explicit assignment metadata, AutoGEO metadata, or a higher template version than the baseline.

Control: Comparable baseline runs without optimization signals.
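The pairing step can be illustrated with a minimal sketch. It assumes exact agreement on intent bucket, engine architecture, and visibility profile, with recency used as the tie-breaking distance. The `Run` fields, the exact-match rule, and the reuse of a baseline for several optimized runs are illustrative assumptions, not a description of GEO's production matcher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    # All field names are hypothetical stand-ins for the matching dimensions.
    run_id: str
    intent_bucket: str
    engine: str
    visibility: str
    day: int          # recency expressed as a day index
    optimized: bool


def match_nearest_baseline(runs):
    """Pair each optimized run with the closest-in-time baseline run that
    shares its intent bucket, engine, and visibility profile."""
    baselines = [r for r in runs if not r.optimized]
    pairs = {}
    for t in (r for r in runs if r.optimized):
        key = (t.intent_bucket, t.engine, t.visibility)
        candidates = [
            b for b in baselines
            if (b.intent_bucket, b.engine, b.visibility) == key
        ]
        if candidates:  # optimized runs with no comparable baseline stay unmatched
            best = min(candidates, key=lambda b: abs(b.day - t.day))
            pairs[t.run_id] = best.run_id
    return pairs


runs = [
    Run("t1", "compare", "rag", "high", 10, True),
    Run("b1", "compare", "rag", "high", 3, False),
    Run("b2", "compare", "rag", "high", 9, False),
    Run("b3", "howto", "rag", "high", 10, False),
    Run("t2", "howto", "llm", "low", 5, True),
]
pairs = match_nearest_baseline(runs)
```

In this toy data, `t1` pairs with `b2` (one day apart) rather than `b1`, and `t2` is left out because no baseline shares its profile.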

Page holdout

Treatment and holdout cohorts remain explicit and are never blended with observational matches.

Treatment: Pages or queries explicitly marked for rollout.

Control: Pages or queries explicitly held out from rollout.

Randomized rollout

Reserved for research-partner experiments with explicit randomized assignment and clear cohort separation.

Treatment: Randomly assigned treatment cells.

Control: Randomly assigned control cells.

Statistical treatment

Bootstrap confidence intervals

GEO resamples run-level treatment and control measurements with replacement for 1,000 iterations. The 95% confidence interval is built from the empirical distribution of the resulting treatment-minus-control deltas.
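A minimal sketch of this resampling procedure follows. It assumes each group is resampled independently, the delta is a difference of means, and the interval uses nearest-rank percentiles of the sorted deltas. The function name, the seed parameter, and the sample values are hypothetical; the production code may differ on each of these points.

```python
import random
from statistics import mean


def bootstrap_delta_ci(treatment, control, iterations=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    deltas = []
    for _ in range(iterations):
        # Resample each group with replacement, keeping its original size.
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        deltas.append(mean(t) - mean(c))
    deltas.sort()
    alpha = 1.0 - confidence
    # Nearest-rank percentiles of the empirical delta distribution.
    lo_idx = round((alpha / 2) * (iterations - 1))
    hi_idx = round((1 - alpha / 2) * (iterations - 1))
    return deltas[lo_idx], deltas[hi_idx]


# Made-up run-level measurements in percentage points.
treatment = [41, 43, 40, 42, 39, 44, 41, 40]
control = [35, 34, 36, 33, 35, 34, 36, 35]
ci_low, ci_high = bootstrap_delta_ci(treatment, control)
observed_delta = mean(treatment) - mean(control)
```

Resampling at the run level keeps the unit of analysis the same as the unit that was assigned to treatment or control.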

Wilson intervals for proportions

GEO uses Wilson intervals with z=1.96 for bounded proportions such as recommendation and citation rates because Wilson intervals behave better than naive normal approximations at small sample sizes.
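For reference, the standard Wilson score interval can be sketched as below. The function name and the clamping to [0, 1] are choices made for this example rather than details taken from the production code.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against tiny floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical small sample: 7 citations observed in 20 runs.
low, high = wilson_interval(7, 20)

# With zero successes the interval still has a non-zero upper bound,
# whereas the naive normal approximation collapses to a zero-width interval.
zero_low, zero_high = wilson_interval(0, 10)
```

The zero-successes case shows the small-sample behavior the text refers to: the interval stays inside [0, 1] and still reflects uncertainty.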

p-value interpretation

The public p-value is a two-sided bootstrap sign estimate. It is a stability signal for the direction of the observed delta, not a standalone business-decision rule.
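One common way to compute a two-sided bootstrap sign estimate is sketched here: count how often the resampled delta falls on each side of zero and double the smaller share. This formula is an assumption about what "sign estimate" means, and the function name and data are hypothetical.

```python
import random
from statistics import mean


def bootstrap_sign_p_value(treatment, control, iterations=1000, seed=0):
    """Two-sided share of bootstrap deltas on the less frequent side of zero."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    non_positive = 0
    non_negative = 0
    for _ in range(iterations):
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        delta = mean(t) - mean(c)
        non_positive += delta <= 0
        non_negative += delta >= 0
    # Double the smaller tail and cap at 1.0.
    return min(1.0, 2 * min(non_positive, non_negative) / iterations)


# Clearly separated groups: no resampled delta can reverse sign.
p_separated = bootstrap_sign_p_value([41, 43, 40, 42], [35, 34, 36, 33])

# Identical groups: the sign flips in roughly half of the draws.
same = [1, 2, 3, 4, 5, 6]
p_identical = bootstrap_sign_p_value(same, same)
```

A small value means the direction of the delta is stable under resampling; it says nothing about whether the size of the delta matters.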

Worked example

Example experimental result using the current reporting format. Values are shown in percentage points.

Treatment mean: 41.2

Control mean: 34.7

Lift delta: +6.5

95% confidence interval: 2.1 to 10.4

Bootstrap p-value: 0.03

Interpretation rules

Assignment methods stay separate

Observational matches, holdouts, and randomized rollouts are shown side by side rather than blended into one headline number.

Intervals carry more weight than point estimates

A positive point estimate with an interval overlapping zero is reported as no conclusion rather than uplift.
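The interval rule can be written as a small decision function. This is a sketch only: the "uplift" and "no conclusion" labels follow the wording above, while the "decline" label for an interval entirely below zero is an assumption added for completeness.

```python
def classify_lift(ci_low, ci_high):
    """Label a result from its confidence interval, not its point estimate."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_low > 0:
        return "uplift"
    if ci_high < 0:
        return "decline"  # assumed label; the source only names the other two
    return "no conclusion"


# Values from the worked example, in percentage points.
delta = round(41.2 - 34.7, 1)
label = classify_lift(2.1, 10.4)

# A positive point estimate whose interval overlaps zero is not reported as uplift.
overlapping = classify_lift(-0.8, 5.0)
```

The point estimate is deliberately not an input: two results with the same delta can receive different labels depending on their intervals.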

p-values are directional stability checks

They indicate how often the bootstrap draws reverse sign, not whether a business change is automatically important.

Production implementation anchors: packages/monitor/src/geo_monitor/services/effectiveness.py and packages/monitor/src/geo_monitor/services/statistics.py. GEO Index scoring details are published separately in GEO Index methodology.