Kwangmin Kim - Kohavi Ch.19 Overview: A/A Test (Null Test)

1 Definition

Definition: A/A Test (Null Test)

An experiment with the same setup as an A/B test, except that Treatment and Control are identical. Users are randomly split into two groups, and both groups receive the same experience (Kohavi, Tang, Xu, 2020, Ch.19).

1.0.0.1 Expected results

True null (no effect):
  - On average, about 5% of metrics show p < 0.05 (false positives, as expected)
  - p-value distribution: uniform on [0, 1]

When the system is healthy:
  - Across repeated A/A trials, the false positive rate is ~5%
  - p-values are uniformly distributed

When the system is buggy:
  - The false positive rate is noticeably above or below 5%
  - p-values are non-uniform (skewed, or with mass at specific values)

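1.0.0.2 Five purposes

Purpose	What it verifies
Type I error control	That the actual false positive rate matches the nominal 5%
Variance estimation	That variance is estimated correctly (especially for ratio metrics)
Bias detection	Carry-over effects, system anomalies
System of record	Agreement with the external logging system
Variance for power	Input to A/B sample size calculations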

Quote from the source (Kohavi): “If everything is under Control, then you’re running an A/A test.”

Key insight: an A/A test is not just a validation tool; it is the foundation of trust in the platform. Without passing A/A tests, every A/B result is suspect. Continuous A/A testing is the safety net for production.

2 Concepts and Principles

2.1 Why A/A tests are critical

The authors' introduction stresses: “Running A/A tests is a critical part of establishing trust in an experimentation platform. The idea is so useful because the tests fail many times in practice, which leads to re-evaluating assumptions and identifying bugs.”

2.1.1 The five purposes in depth

2.1.1.1 Purpose 1: Type I Error Control

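Theory:
  With a standard test (alpha = 0.05), the false positive rate is 5%

What the A/A test checks:
  1000 A/A trials → ~50 trials are false positives
  - 50 plus or minus chance variation (binomial SD ≈ 7, so roughly 36 to 64) → OK
  - 100 or more, or 10 or fewer → a problem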
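
The "50 plus or minus chance variation" check can be made concrete with a binomial band. The sketch below is only an illustration: the function names and the 1.96 multiplier (a normal approximation to a 95% band) are assumptions, not something the book prescribes.

```python
import math

def false_positive_band(n_trials, alpha=0.05, z=1.96):
    """Approximate 95% band for the number of false positives in n_trials A/A runs."""
    expected = n_trials * alpha
    sd = math.sqrt(n_trials * alpha * (1 - alpha))  # binomial standard deviation
    return expected - z * sd, expected + z * sd

def check_false_positives(n_significant, n_trials, alpha=0.05):
    """Classify an observed count of significant A/A runs against the band."""
    low, high = false_positive_band(n_trials, alpha)
    if n_significant > high:
        return "too_many"  # hints at underestimated variance or a bug
    if n_significant < low:
        return "too_few"   # hints at overestimated variance
    return "ok"

low, high = false_positive_band(1000)
print(f"band for 1000 trials: {low:.1f} to {high:.1f}")
for observed in (52, 100, 10):
    print(observed, check_false_positives(observed, 1000))
```

Counts outside the band are a prompt to investigate, not proof of a bug; with many metrics checked at once, some will fall outside by chance.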

2.1.1.2 Possible problems

False positive rate > 5%:
  - Variance is underestimated
  - The i.i.d. assumption is violated (within-user correlation in CTR)
  - Outlier influence
  - System bug

False positive rate < 5%:
  - Variance is overestimated
  - The analysis is conservative (loss of statistical power)
  - The discrepancy itself needs investigation

2.1.1.3 Purpose 2: Validating Variance Estimation

The authors stress: “We can examine data from an A/A test to establish how a metric’s variance changes over time as more users are admitted into the experiment, and the expected reduction in variance of the mean may not materialize.”

Theory:
  Var(Ȳ) = σ² / n
  → As sample size n grows, the variance of the mean shrinks as 1/n

In an A/A test:
  Measure the variance at various n
  - Small n: large variance
  - Large n: variance reduced in proportion to 1/n

When this fails in practice:
  - Correlation between users
  - Correlation over time
  - Cluster effects
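
A small simulation can show why the 1/n reduction may not materialize. This is a sketch under assumed parameters (50 hypothetical users, a shared per-user effect, Gaussian noise); it is not taken from the book. When observations from the same user share an effect, adding more observations per user barely reduces the variance of the mean.

```python
import random
import statistics

def variance_of_mean(n_users, obs_per_user, user_sd, noise_sd, n_reps=200, seed=0):
    """Empirical variance of the overall mean across repeated synthetic datasets."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_reps):
        total = 0.0
        for _ in range(n_users):
            user_effect = rng.gauss(0, user_sd)  # shared by all of this user's observations
            for _ in range(obs_per_user):
                total += user_effect + rng.gauss(0, noise_sd)
        means.append(total / (n_users * obs_per_user))
    return statistics.variance(means)

# Independent observations: quadrupling n should cut the variance about fourfold.
iid_small = variance_of_mean(50, 10, user_sd=0.0, noise_sd=1.0)
iid_large = variance_of_mean(50, 40, user_sd=0.0, noise_sd=1.0)

# Within-user correlation: the same increase in n barely helps.
corr_small = variance_of_mean(50, 10, user_sd=1.0, noise_sd=1.0)
corr_large = variance_of_mean(50, 40, user_sd=1.0, noise_sd=1.0)

print(f"independent: variance ratio {iid_small / iid_large:.2f} (theory: 4)")
print(f"correlated:  variance ratio {corr_small / corr_large:.2f} (theory: about 1.07)")
```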

2.1.1.4 Purpose 3: Bias Detection

The authors cite (Kohavi et al. 2012, Bing): “Bing uses continuous A/A testing to identify a carry-over effect (or residual effect), where previous experiments would impact subsequent experiments run on the same users.”

How a carry-over effect arises:
  Day 1~7: experiment X runs
    Part of the user pool experiences X's Treatment
  Day 8~14: experiment Y starts (same user pool)
    Earlier exposure to X's Treatment affects Y's results

How an A/A test detects it:
  Keep running A/A tests on the same user pool
  - Residual effects of the earlier experiment distort the A/A results
  - The difference in means between the two groups is systematically non-zero
  - The p-value distribution is skewed

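2.1.1.5 Why resetting is critical

Avoiding the bias:
  - Reset the user pool to a different set of users
  - Or wait long enough for the residual effect to attenuate
  - Or correct for the bias before analysis

Continuous A/A:
  - Run an A/A test every day with a new randomization seed
  - On detection → alert the analysts immediately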
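
One common way to get "a new randomization seed every day" is to salt a hash of the user ID with the seed. The sketch below assumes SHA-256 bucketing and made-up seed strings such as "aa-2024-01-01"; the actual hashing scheme used by Bing is not described in the source.

```python
import hashlib

def assign_variant(user_id, seed, n_variants=2):
    """Deterministically map (seed, user_id) to a variant index."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_variants

users = [f"user-{i}" for i in range(10_000)]  # hypothetical user IDs
day1 = [assign_variant(u, "aa-2024-01-01") for u in users]
day2 = [assign_variant(u, "aa-2024-01-02") for u in users]

share_b = sum(day1) / len(users)
# With a fresh seed, about half of the users should switch groups,
# which breaks the link between yesterday's and today's assignment.
overlap = sum(a == b for a, b in zip(day1, day2)) / len(users)
print(f"share in group B: {share_b:.3f}, same group on both days: {overlap:.3f}")
```

Because the mapping is deterministic for a given seed, the same split can be reproduced later when an alert needs to be investigated.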

2.1.1.6 Purpose 4: Comparison with the System of Record

The authors state: “Compare data to the system of record.”

Scenario:
  Logging system A: used by the experimentation platform
  Logging system B: business reporting (system of record)

  Do the two systems' data agree?
  - Does the same metric have the same value?
  - Are the user counts the same?

Using the A/A test:
  Compare the experimentation platform's user count against business reporting
  - Match → trustworthy
  - Mismatch → leakage or a logging issue

2.1.1.7 Detecting user leakage

The authors stress: “If the system of records shows X users visited the website during the experiment and you ran Control and Treatment at 20% each, do you see around 20% X users in each? Are you leaking users?”

How leakage arises:
  - Some users in the experimentation platform are not logged
  - The platform shows fewer users than the system of record
  - This is "user leakage"

Detection:
  Total A/A users / system-of-record total ≈ 0.4 (20% × 2)
  If the ratio is materially different → leakage
  → The system needs debugging
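
The ratio check above is simple enough to automate. The following is a minimal sketch; the function name, the user counts, and the 1 percentage point tolerance are illustrative assumptions, and a real platform would pick its own threshold.

```python
def leakage_check(variant_users, system_of_record_users, allocation, tolerance=0.01):
    """Compare each variant's share of system-of-record users with its allocation."""
    report = {}
    for name, count in variant_users.items():
        observed = count / system_of_record_users
        report[name] = {
            "observed": observed,
            "expected": allocation,
            "ok": abs(observed - allocation) <= tolerance,
        }
    return report

# Hypothetical counts: 1,000,000 users in the system of record, 20% allocated per variant.
report = leakage_check(
    {"control": 199_800, "treatment": 187_000},
    system_of_record_users=1_000_000,
    allocation=0.20,
)
for name, row in report.items():
    status = "ok" if row["ok"] else "possible leakage"
    print(f"{name}: observed {row['observed']:.3f}, expected {row['expected']:.2f} -> {status}")
```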

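2.1.1.8 Purpose 5: Variance for Power Calculation

Sample size calculation for an A/B test (per variant, two-sided test, equal-sized variants):
  N = 2 × (z_{alpha/2} + z_beta)² × σ² / Δ²
  σ² is an input

The A/A test provides an accurate estimate of σ²:
  - The variance of actual user behavior
  - Empirical, with no distributional assumptions
  - A trustworthy input to the power calculation

This estimate feeds the platform's sample size calculator.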
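
A sample size calculator built on this input might look like the sketch below. It assumes a two-sided test with two equal-sized variants, which gives the per-variant form N = 2 × (z_{alpha/2} + z_beta)² × σ² / Δ²; the function name and the example numbers are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(sigma2, delta, alpha=0.05, power=0.8):
    """Users needed per variant to detect an absolute difference `delta`.

    sigma2 is the metric variance, e.g. as estimated from A/A data.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma2 / delta ** 2)

n = sample_size_per_variant(sigma2=1.0, delta=0.1)
n_half_delta = sample_size_per_variant(sigma2=1.0, delta=0.05)
print(n, n_half_delta)
```

With the default alpha and power, the constant 2 × (1.96 + 0.84)² is about 15.7, which is where the familiar 16σ²/Δ² rule of thumb comes from. Halving the detectable difference roughly quadruples the required sample size.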

2.1.2 Continuous A/A: the industry standard

The authors recommend: “We highly recommend running continuous A/A tests in parallel with other experiments to uncover problems.”

Running continuous A/A:
  - Run one or more A/A tests daily (different user pool, different seed)
  - Automated analysis and alerting
  - Pass: trust is maintained
  - Fail: investigate the platform immediately

Cost:
  - A share of user traffic (~1% or less)
  - Compute (analysis is automated)
  - Engineer time (debugging on failure)

ROI:
  - Assured trust
  - Bugs caught early
  - Carry-over detection

This continuous monitoring is the safety net for a production platform.

2.2 Five industry examples: a preview

Described by the authors (Ch.19.2~19.4); covered in detail in F-KOH19-1 and F-KOH19-2.

2.2.1 Example 1: i.i.d. violation in CTR

Finding (Kohavi, Longbotham et al. 2009):
  The variance estimate for CTR was wrong
  A/A tests showed a false positive rate > 5%
  Root cause: a page-level metric with user-level randomization
  → i.i.d. violation
  Solution: delta method or bootstrap
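
To make the delta-method fix concrete, here is a minimal sketch for CTR defined as total clicks divided by total page views, with users as the i.i.d. unit. The synthetic data (users whose true CTR is either 0.05 or 0.5) is an assumption chosen to create strong within-user correlation; the variance formula is the standard first-order delta-method approximation for a ratio of means.

```python
import random
import statistics

def delta_method_variance(clicks, views):
    """Delta-method variance of sum(clicks) / sum(views), with users as i.i.d. units."""
    n = len(clicks)
    mean_c = statistics.fmean(clicks)
    mean_v = statistics.fmean(views)
    ratio = mean_c / mean_v
    var_c = statistics.variance(clicks)
    var_v = statistics.variance(views)
    cov_cv = sum((c - mean_c) * (v - mean_v) for c, v in zip(clicks, views)) / (n - 1)
    return (var_c - 2 * ratio * cov_cv + ratio ** 2 * var_v) / (n * mean_v ** 2)

def naive_page_variance(clicks, views):
    """Binomial variance that wrongly treats every page view as independent."""
    total_views = sum(views)
    ctr = sum(clicks) / total_views
    return ctr * (1 - ctr) / total_views

rng = random.Random(0)
views = [rng.randint(1, 20) for _ in range(2000)]      # page views per user
user_ctr = [rng.choice([0.05, 0.5]) for _ in views]    # hypothetical per-user click propensity
clicks = [sum(rng.random() < p for _ in range(v)) for v, p in zip(views, user_ctr)]

delta_var = delta_method_variance(clicks, views)
naive_var = naive_page_variance(clicks, views)
print(f"delta-method variance: {delta_var:.2e}")
print(f"naive page-level variance: {naive_var:.2e}")
```

The naive page-level variance comes out several times smaller than the delta-method variance on this data, which is exactly the underestimate that inflates false positives in an A/A test.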

2.2.2 Example 2: Optimizely Peeking

Case (Kohavi 2014, Borden 2014):
  Optimizely encouraged peeking and early stopping
  Users stopped as soon as statistical significance was reached

Problem:
  Multiple peeks → multiple tests
  The false positive rate increases
  In practice, false significance well above 5%

Solution:
  Optimizely's New Stats Engine (Pekelis 2015)
  Uses always-valid p-values
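
The inflation caused by peeking is easy to reproduce in an A/A simulation. The sketch below assumes standard normal data with known unit variance and ten evenly spaced looks; these settings are illustrative and do not model Optimizely's actual product.

```python
import math
import random

def peeking_false_positive_rates(n_sims=500, n_per_group=500, look_every=50,
                                 z_crit=1.96, seed=0):
    """Return (fixed-horizon rate, stop-at-first-significance rate) under a true null."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        significant_at_any_look = False
        z = 0.0
        for n in range(1, n_per_group + 1):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if n % look_every == 0:
                # z-test with known unit variance: SE of the mean difference is sqrt(2/n)
                z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
                if abs(z) > z_crit:
                    significant_at_any_look = True
        # n_per_group is a multiple of look_every, so z now holds the final look
        fixed_hits += abs(z) > z_crit
        peeking_hits += significant_at_any_look
    return fixed_hits / n_sims, peeking_hits / n_sims

fixed_rate, peeking_rate = peeking_false_positive_rates()
print(f"single final look: {fixed_rate:.1%}")
print(f"stop at first significant look: {peeking_rate:.1%}")
```

The fixed-horizon rate stays near the nominal 5%, while the stop-when-significant rate is several times higher even though no effect exists.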

2.2.3 Example 3: Browser Redirect

Case (Kohavi and Longbotham 2010):
  An experiment redirecting to a new site, v2
  Variant B redirects → A wins

Three root causes:
  1. Performance: latency added by the redirect
  2. Bots: handle redirects differently
  3. Bookmarks: contamination

Solution:
  Avoid redirects, or redirect both variants (so both are degraded equally)

2.2.4 Example 4: Unequal Splits

Case (Kohavi and Longbotham 2010):
  LRU cache bias under a 10/90 split
  The larger variant gets a higher cache hit rate → unfair advantage

Solution:
  Include the experiment ID in the cache key
  Or use a 10/10 split (80% of the data is lost, but the comparison is valid)

2.2.5 Example 5: Hardware Differences

Facebook case (Bakshy and Frachtenberg 2015):
  V1 service vs V2 service (different fleets)
  The hardware was assumed to be identical
  The A/A test failed

Root cause:
  Subtle hardware differences (CPU, memory, disk)
  The same software produced different results

2.3 Operating A/A Tests: How to Run

Described by the authors (Ch.19.5); covered in detail in F-KOH19-2.

2.3.1 1000 A/A Simulation

Procedure:
  1. Store the last week's raw data
  2. Simulate A/A tests with 1000 random seeds:
     for i in range(1000):
         seed = i
         # The seed must enter the hash so each iteration yields a different split
         group_A = users[hash((user, seed)) % 2 == 0]
         group_B = users[hash((user, seed)) % 2 == 1]
         # Same data, different grouping
         p_value_i = t_test(metric_A, metric_B)

  3. Plot the distribution of the 1000 p-values
  4. Run a goodness-of-fit test (Anderson-Darling, KS)
     - Check whether the distribution is Uniform [0,1]

Ideal: the histogram is flat
In practice: it is often skewed

2.3.2 The value of the replay technique

The authors stress: “you can just simulate the A/A test.”

Advantages of replay:
  - No impact on users (no production change)
  - Fast (1000 simulations are feasible)
  - Essentially free (the raw data is already stored)
  - Can be applied retroactively to new metrics

Limitations:
  - Cannot catch performance issues (e.g., the effect of a real redirect)
  - Cannot detect shared-resource effects (LRU cache)
  - Catches only some of the issues that occur in real production

This hybrid (replay plus production A/A) is the industry standard.

2.4 Diagnosing A/A Failures

Described by the authors (Ch.19.6); covered in detail in F-KOH19-3.

2.4.1 Three failure patterns

2.4.1.1 Pattern 1: Skewed Distribution

The p-value distribution has heavy mass near 0:
  → False positive rate > 5%
  → Variance is underestimated

Possible causes:
  - i.i.d. violation (ratio metric)
  - Highly skewed metric distribution
  - No capping (outliers)

Fixes:
  - Apply the delta method
  - Add capping
  - Increase the sample size (CLT convergence)

2.4.1.2 Pattern 2: Mass around p=0.32

A large mass of p-values near 0.32:
  → A single outlier dominates both the mean difference and the variance estimate, swamping all other samples
  → t-stat ≈ ±1
  → p-value ≈ 0.32
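
The t ≈ ±1 behaviour can be checked numerically. In the sketch below, the data, the outlier value of 1e6, and the use of a normal approximation for the p-value are all illustrative assumptions. Whichever group receives the outlier has a mean of about M/n and a standard error of about M/n, so the t-statistic lands at about 1 regardless of the split.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def welch_t(a, b):
    """Welch t-statistic for the difference in means of two samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(a) - fmean(b)) / se

rng = random.Random(0)
values = [rng.gauss(0, 1) for _ in range(2000)]
values[0] = 1e6  # hypothetical single extreme outlier, e.g. a bot

p_values = []
for _ in range(20):
    shuffled = values[:]
    rng.shuffle(shuffled)  # a fresh random A/A split each time
    t = welch_t(shuffled[:1000], shuffled[1000:])
    # Normal approximation to the two-sided p-value (reasonable at n = 1000 per group)
    p_values.append(2 * (1 - NormalDist().cdf(abs(t))))

print(f"min p = {min(p_values):.3f}, max p = {max(p_values):.3f}")
```

Every split produces essentially the same p-value near 0.32, which is why this failure shows up as a spike in the p-value histogram.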

Causes:
  - A single very large outlier
  - A bot or anomalous user

Fixes:
  - Bot detection
  - Capping
  - Root cause analysis of the outlier

2.4.1.3 Pattern 3: Discrete Values

P-values take only a few discrete values:
  → The data is almost entirely a single value (0) with rare non-zero values
  → The difference of means can take only a limited set of values

Causes:
  - Sparse metric (mostly 0, occasionally non-zero)
  - Example: high-value purchases

Fixes:
  - Detection is still possible if the Treatment effect is large
  - Transform the metric (binarization)
  - Use the bootstrap (avoids parametric assumptions)

Intuition: the essence of the A/A test

An A/A test is the system's self-check. It validates the platform along five dimensions.

2.4.1.4 1. Statistical correctness

Assumptions of the standard analysis:
  - i.i.d. samples
  - Normal distribution (CLT)
  - Accurate variance estimation
  - Independent Treatment and Control

What the A/A test verifies:
  Do the assumptions actually hold?
  If they do not, the test fails

2.4.1.5 2. System integrity

Platform components:
  - Randomization (hash function)
  - Cache (shared resource)
  - Logging (data capture)
  - Analysis (aggregation)

What the A/A test catches:
  Bugs in individual components
  Cross-component interaction issues

2.4.1.6 3. Data quality

Data pipeline:
  - Bot detection
  - Filtering
  - Cleaning

What the A/A test catches:
  Filter bias (filter rates that differ by variant)
  Systemic data quality issues

2.4.1.7 4. Carry-over Detection

Continuous A/A:
  - A/A tests on the same user pool over time
  - Detects residual effects of earlier experiments

Bing's standard practice:
  - Daily A/A (ongoing)
  - Reset the user pool when carry-over is found

2.4.1.8 5. Trust Foundation

A/A pass = "this platform can be trusted"
A/A fail = "this platform's results are suspect"

What an A/B result means:
  A/A passes: "this +5% is credible"
  A/A fails: "this +5% may be noise or a platform bug"

The A/A test is the foundational check for platform trust.

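3 Why It Is Needed

Without A/A tests:

Variance underestimation: the false positive rate is actually above 5%, and nobody knows
Accumulating bugs: silent platform issues distort analysis results
Carry-over: residual effects from earlier experiments
Disconnected system of record: data across logging systems does not agree
No trust: low confidence in A/B results

With A/A tests in place:

Trustworthy statistics: the false positive rate is at the nominal 5%
Bugs caught early: continuous monitoring
Bias detected immediately: carry-over is flagged automatically
Systems agree: reconciled with the system of record
Trust foundation: confidence in A/B results

This gap is what platform trust comes down to. It is standard practice at mature companies.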

4 Application: Bing's Continuous A/A

Bing's daily A/A (Kohavi et al. 2012):

Every day:
  - A new random seed
  - A portion of the user pool (millions of users)
  - The standard A/B test infrastructure
  - Automated analysis of all metrics

Pass:
  - Trust is maintained
  - No alerts are raised

Fail:
  - Immediate alert
  - Engineers perform root cause analysis
  - A carry-over effect or bug is found
  - Re-run the A/A test after the fix

Examples of issues found:
  - Carry-over in specific time windows
  - i.i.d. violation in a new metric
  - Cache invalidation bug

This practice is the backbone of trust in Bing's platform. Without daily runs, accumulated issues surface all at once.

5 Next posts in the Ch.19 series

Post	Topic	KOH lines
F19-1	Why A/A? + Examples 1, 2	L:3085~3153
F19-2	Examples 3~5 + How to Run	L:3154~3205
F19-3	P-value Distribution + When A/A Fails	L:3192~3228

6 Code Example: 1000 A/A Simulation

An implementation of the simulation pattern the authors describe.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-user data (stand-in for last week's metric)
n_users = 10_000
user_metric = rng.lognormal(2, 1, n_users)  # right-skewed, like many real metrics

# 1000 A/A simulation
n_simulations = 1000
p_values = []

for sim_i in range(n_simulations):
    sim_rng = np.random.default_rng(sim_i)
    # Random group assignment
    group = sim_rng.choice([0, 1], n_users, p=[0.5, 0.5])
    metric_A = user_metric[group == 0]
    metric_B = user_metric[group == 1]
    # T-test
    _, p = stats.ttest_ind(metric_A, metric_B)
    p_values.append(p)

p_values = np.array(p_values)

# False positive rate
fp_rate = (p_values < 0.05).sum() / n_simulations
print(f"False positive rate at alpha=0.05: {fp_rate*100:.1f}%")
print(f"Expected: 5.0%")

# P-value distribution check (Kolmogorov-Smirnov)
ks_stat, ks_p = stats.kstest(p_values, 'uniform')
print(f"\nKS test for uniformity:")
print(f"KS statistic: {ks_stat:.4f}")
print(f"P-value: {ks_p:.4f}")
if ks_p > 0.05:
    print("=> P-values consistent with uniform (A/A test PASSED)")
else:
    print("=> P-values NOT uniform (A/A test FAILED)")

# Distribution analysis
print("\nP-value distribution quantiles:")
print(f"P10: {np.percentile(p_values, 10):.4f} (expected: 0.10)")
print(f"P25: {np.percentile(p_values, 25):.4f} (expected: 0.25)")
print(f"P50: {np.percentile(p_values, 50):.4f} (expected: 0.50)")
print(f"P75: {np.percentile(p_values, 75):.4f} (expected: 0.75)")
print(f"P90: {np.percentile(p_values, 90):.4f} (expected: 0.90)")

# Histogram
print(f"\nP-value histogram (bin width 0.1):")
hist, _ = np.histogram(p_values, bins=np.linspace(0, 1, 11))
for i, count in enumerate(hist):
    print(f"  [{i*0.1:.1f}, {(i+1)*0.1:.1f}): {count:>4} ({count/n_simulations*100:.1f}%, expected 10%)")

# Simulate an i.i.d. violation (a single synthetic dataset)
print("\n=== Simulation of i.i.d. violation ===")
# Multiple pages from the same user are correlated
n_users_iid = 1000
pages_per_user = 50

# Generate page-level data with within-user correlation:
# each page value = the user's baseline + independent page-level noise
user_baseline = rng.normal(0, 1, n_users_iid)
user_index = np.repeat(np.arange(n_users_iid), pages_per_user)
pages = user_baseline[user_index] + rng.normal(0, 0.5, user_index.size)

# 1000 A/A at page level
p_values_iid = []
for sim_i in range(1000):
    sim_rng = np.random.default_rng(sim_i)
    user_group = sim_rng.choice([0, 1], n_users_iid)
    page_group = user_group[user_index]
    pages_A = pages[page_group == 0]
    pages_B = pages[page_group == 1]
    _, p = stats.ttest_ind(pages_A, pages_B)
    p_values_iid.append(p)

p_values_iid = np.array(p_values_iid)
fp_rate_iid = (p_values_iid < 0.05).sum() / 1000
print("Page-level analysis (i.i.d. violation):")
print(f"False positive rate: {fp_rate_iid*100:.1f}% (expected 5%)")
if fp_rate_iid > 0.10:
    print(f"=> A/A FAILED: variance underestimate due to within-user correlation")

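Intuition: what the simulation shows

The key takeaways from this code.

6.0.0.1 Healthy A/A

User-level data:
  False positive rate: ~5% (normal)
  KS test: pass (uniform)
  Histogram: flat (each bin ~10%)

→ Platform is trustworthy

6.0.0.2 Violation A/A (i.i.d. violation)

Page-level analysis (wrong unit of analysis):
  False positive rate: far above 5% (on the order of 75% with these parameters, because the naive standard error is too small by a factor of about 6)
  KS test: fail (skewed)
  Histogram: heavy mass near 0

→ Variance is underestimated
→ False positives surge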

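6.0.0.3 Implications

What this simulation shows:

A/A tests make silent issues visible: the accuracy of the analysis becomes numerically measurable
The value of continuous monitoring: daily runs catch platform changes immediately
Quantifying platform trust: a false positive rate close to the nominal 5% → trustworthy

6.0.0.4 Industry-standard practice

1. Run a 1000 A/A simulation before every platform launch
2. Continuous A/A (on a share of production traffic)
3. KS test on the p-value distribution
4. Automatic alerts when thresholds are violated
5. Root cause analysis by engineers

This practice is standard at the Run and Fly stages.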

7 Related Topics

Prerequisites

F18-*: Ch.18 Variance/CUPED (accuracy of variance estimates)
F14-1: Randomization vs Analysis Unit
Related chapters

F21-*: Ch.21 SRM (Sample Ratio Mismatch detection)
F16-1: Bot Detection · Capping