Kwangmin Kim - The Nature of the P-value Distribution · Three Diagnoses for A/A Failures

1 Definition

Definition: P-value Distribution under the Null Hypothesis

A standard fact about hypothesis tests: when the null hypothesis is true, the p-value is uniformly distributed on [0, 1] (Kohavi, Tang, Xu, 2020, Ch.19.6).

1.0.0.1 Statistical basis

Theorem:
  Continuous metric + simple null hypothesis (e.g., equal means)
  Under the null, p-value ~ U(0, 1)

Implication:
  P(p < 0.05) = 0.05 (exactly, when the test's assumptions hold)
  P(p < 0.10) = 0.10
  P(p < 0.50) = 0.50

In an A/A test the null is always true:
  - B = A, so effect = 0
  - The p-values from repeated A/A tests are uniform

Original quote (Ch.19.6, citing Dickhaus 2014 and Blocker et al. 2006): “When the metric of interest is continuous and you have a simple Null hypothesis, such as equal means in our A/A test example, then the distribution of p-values under the Null should be uniform.”

Key insight: the uniformity of the p-value distribution is itself a numerical check of trustworthiness. If the distribution is not uniform, an assumption of the analysis is violated.

2 Concepts and Principles

2.1 The Nature of the P-value Distribution

2.1.1 Why Uniform?

2.1.1.1 Intuition

P-value definition:
  Tail probability of the test statistic under the null
  P(T ≥ t_observed | H_0)   (one-sided form)

Null hypothesis:
  T ~ a specific distribution (e.g., t-distribution)

Distribution of the p-value:
  Treat the p-value itself as a random variable
  It is uniform on [0, 1]

2.1.1.2 Mathematical derivation

Test statistic T has continuous CDF F under H_0
P-value = 1 - F(T)

If the null is true:
  T ~ F (exactly)
  → F(T) ~ U(0, 1)   (probability integral transform)
  → 1 - F(T) ~ U(0, 1)

This result is the statistical foundation of the A/A test.
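
The derivation can be checked numerically. The following is a minimal sketch, assuming normally distributed data and a two-sample t-test; the function name `null_p_values` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

def null_p_values(n_sims=5000, n_per_group=200, seed=0):
    """Simulate A/A tests: both groups come from the same distribution."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    # One t-test per row; returns one p-value per simulated A/A test.
    return stats.ttest_ind(a, b, axis=1).pvalue

p = null_p_values()
for alpha in (0.05, 0.10, 0.50):
    # Under uniformity, the share of p-values below alpha is close to alpha.
    print(f"P(p < {alpha:.2f}) ~ {np.mean(p < alpha):.3f}")
```

Each printed share should land close to its threshold, which is what "p ~ U(0, 1)" means in practice.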

2.1.2 Implications

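Practical implications:

Pass:
  P-values uniform on [0, 1]
  → Histogram is flat
  → KS test against U(0, 1) does not reject

Fail (variance underestimated):
  P-values cluster near 0
  → False positives surge
  → Large mass on the left side of the histogram

Fail (variance overestimated):
  P-values shift toward 1 (|t| is shrunk toward 0)
  → Conservative test, reduced power
  → Histogram rises toward the right, with too little mass near 0

Fail (outlier-driven):
  P-values cluster near one specific value
  → The test result is dominated by the outlier
  → Typically a mass around p ≈ 0.32

Fail (discrete data):
  P-values take only a few values
  → Distribution is not continuous
  → The t-test approximation is inaccurate

Each pattern points to a different root cause.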

2.2 Failure Pattern 1: Skewed Distribution

Stated by the authors (Ch.19.7).

2.2.1 Pattern

Histogram shape:
  [0, 0.05): large mass (e.g., 30~50%)
  Remaining bins: small
  Right side nearly flat
Visual:
  ###############
  #####
  ##
  #
  #
  #
  #
  #
  #
  #

This shape is the visual signature of a surge in false positives.

2.2.2 Root cause 1: i.i.d. violation

The authors ask: “Is the independence assumption violated (as in the CTR example) because the randomization unit differs from the analysis unit?”

CTR analyzed at the page level + randomization at the user level:
  - Pages from the same user are correlated
  - The naive variance estimate is too small
  - Small SE → large t-stat → small p-value
  - Mass near 0

Fixes:
  - Delta method (Ch.18)
  - Bootstrap (resampling users, not pages)
  - Reformulate the metric at the user level
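
As an illustration of the delta-method fix, the sketch below computes the standard error of a ratio metric (total clicks / total page views) while treating users as the independent unit. It uses the standard first-order delta-method approximation for a ratio of means; the function name and the simulated per-user data are hypothetical, not taken from the book.

```python
import numpy as np

def delta_method_ratio_se(clicks, views):
    """Delta-method SE of sum(clicks) / sum(views), with users as the i.i.d. unit."""
    clicks = np.asarray(clicks, dtype=float)
    views = np.asarray(views, dtype=float)
    n = clicks.size
    ratio = clicks.mean() / views.mean()
    cov = np.cov(clicks, views, ddof=1)  # 2x2 sample covariance matrix
    var = (cov[0, 0] - 2 * ratio * cov[0, 1] + ratio**2 * cov[1, 1]) / (n * views.mean() ** 2)
    return float(np.sqrt(var))

# Hypothetical per-user data: page views and clicks, with CTR varying by user.
rng = np.random.default_rng(0)
n_users = 2000
views = rng.integers(1, 30, n_users)
user_ctr = rng.beta(2, 8, n_users)
clicks = rng.binomial(views, user_ctr)

ctr = clicks.sum() / views.sum()
naive_se = float(np.sqrt(ctr * (1 - ctr) / views.sum()))  # treats pages as independent
delta_se = delta_method_ratio_se(clicks, views)
print(f"CTR={ctr:.4f}  naive SE={naive_se:.5f}  delta-method SE={delta_se:.5f}")
```

Because users differ in their underlying CTR, the page-level (naive) standard error comes out smaller than the user-level one, which is the variance underestimate that pushes p-values toward 0.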

2.2.3 Root cause 2: Highly Skewed Distribution

The authors ask: “Does the metric have a highly skewed distribution? Normal approximation may fail for a small number of users.”

Heavy long-tailed metrics:
  Revenue, time-on-site, search count
  → Distribution deviates strongly from normal

With a sufficient sample size (CLT):
  The sample mean is asymptotically normal
  The t-test assumptions hold

With an insufficient sample size:
  The CLT approximation is still poor
  The distributional assumption of the t-test is violated
  → P-values are inaccurate

2.2.3.1 Quantitative limit

The authors, citing Kohavi et al. (2014): “the minimum sample size may need to be over 100,000 users.”

Sample size needed for the mean to be approximately normal (illustrative orders of magnitude):
  Normal data: N=30 is enough
  Skewed (skew=2): N≈1,000
  Very heavy-tailed (skew well above 5): N=100,000+

Below these sizes, convergence under the CLT is weak
and t-test results should be treated with suspicion.
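
For a rough quantitative handle, Kohavi et al. (2014) give a rule of thumb of about 355·s² users per variant, where s is the skewness coefficient of the metric. That rule is not stated in the text above and is brought in here as an assumption; the helper name is hypothetical.

```python
import math

def min_users_for_normality(skewness):
    """Rule-of-thumb minimum users per variant: 355 * skewness^2 (Kohavi et al. 2014)."""
    return math.ceil(355 * skewness**2)

for s in (1, 2, 5, 8, 17):
    print(f"skewness={s:>2}: at least {min_users_for_normality(s):,} users per variant")
```

Under this rule, the 100,000-user figure in the quote corresponds to a skewness of roughly 17, so the tiers listed above are best read as orders of magnitude.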

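2.2.4 Fix: Capping or Minimum Sample Size

The authors state: “Capped metrics or setting minimum sample sizes may be necessary.”

Option 1: Capping
  Truncate extreme values at a threshold (Ch.18)
  Weakens the tail of the distribution
  Speeds up convergence to normality

Option 2: Minimum sample size
  Enforced by the experimentation platform
  Refuse to analyze the metric when N is below the minimum (e.g., 100,000)
  Wait for a larger sample

Option 3: Bootstrap
  Avoids the distributional assumption
  Empirical p-value
  Less sensitive to heavy tails, though very heavy tails still need large samples

Most platforms: a hybrid of capping + minimum sample size.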
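
A minimal sketch of Option 1, assuming a percentile-based cap and simulated lognormal "revenue" data; the function name, the 99th-percentile default, and the data are illustrative assumptions, not a prescribed method.

```python
import numpy as np
from scipy import stats

def cap_at_percentile(values, pct=99.0):
    """Replace values above the given percentile with the percentile itself."""
    values = np.asarray(values, dtype=float)
    cap = np.percentile(values, pct)
    return np.minimum(values, cap)

# Hypothetical heavy-tailed metric.
rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)
capped = cap_at_percentile(revenue, 99.0)

print(f"skewness before capping: {stats.skew(revenue):.1f}")
print(f"skewness after capping:  {stats.skew(capped):.1f}")
```

Capping keeps every user in the sample and only bounds the extreme values, so the sample size is unchanged while the tail, and with it the skewness, shrinks.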

2.3 Failure Pattern 2: Mass around p=0.32

Stated by the authors (Ch.19.7).

2.3.1 Pattern

Histogram shape:
  Mostly flat
  Large mass near 0.32 (~10~30%)

Visual:
  ###
  ###
  ###
  #####
  ###########
  ###############  ← near 0.32
  ###
  ###
  ###
  ###

2.3.2 Mathematical root cause

Stated by the authors (Ch.19.7).

2.3.2.1 Decomposing the t-statistic

t = Δ / √Var(Δ)

Suppose the data contain a single very large outlier o, with n users per variant:

Δ (difference of means):
  - The outlier falls into one variant, say T
  - Δ = mean_T - mean_C
  - mean_T is dominated by the outlier; every other value is swamped
  - Δ ≈ o / n (outlier value / sample size)
  - or Δ ≈ -o / n (when the outlier falls into C)

Var(Δ):
  - The outlier also dominates the sample variance of its variant
  - s² ≈ o² / n
  - Variance of the mean: Var(Δ) ≈ s² / n ≈ o² / n²
  - SE = √Var(Δ) ≈ o / n

2.3.2.2 Computing the t-stat

t = Δ / SE
  ≈ ± (o/n) / (o/n)
  = ± 1

The outlier value o and the sample size n both cancel, so t ≈ ±1 regardless of how large the outlier is.

Resulting p-value (as described by Kohavi et al.):
  t ≈ ±1
  p-value ≈ 2 × (1 - Φ(1)) ≈ 0.32
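
The cancellation can be verified with a small worked example. This is a sketch under idealized assumptions: every value is 0 except one outlier, both variants have n users, and the p-value uses the normal approximation. The values n = 1000 and o = 5000 are arbitrary.

```python
import numpy as np
from scipy import stats

n = 1000      # users per variant (arbitrary)
o = 5000.0    # outlier value (arbitrary)

treatment = np.zeros(n)
treatment[0] = o          # the single outlier lands in Treatment
control = np.zeros(n)

delta = treatment.mean() - control.mean()                            # o / n
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)    # also o / n
t = delta / se
p = 2 * stats.norm.sf(abs(t))

print(f"delta={delta:.3f}  SE={se:.3f}  t={t:.3f}  p={p:.4f}")
```

Changing `o` to any other large value leaves t and p unchanged, which is why this failure shows up as a fixed spike in the p-value histogram.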

2.3.2.3 Intuition

Dominance of a single outlier:
  - Mean: shifted by the outlier
  - Variance: dominated by the outlier
  - SE: dominated by the outlier
  - t-stat: a ratio of two outlier-dominated quantities
  - Close to ±1
  - P-value close to 0.32

Result:
  A real effect goes undetected (the t-stat stays near ±1)
  The test measures only the outlier

2.3.3 Fix

The authors state: “the reason for the outlier needs to be investigated or the data should be capped.”

Step 1: Find the root cause of the outlier:
  - Bot? (automated clicks)
  - Anomalous user (real but extreme)
  - Bug? (duplicated logging)
  - Power user (legitimate)

Step 2: Action:
  - Bot/anomaly: filter
  - Bug: fix logging
  - Power user: cap

In most cases capping resolves it.

2.3.3.1 Why the cap is critical

Without cap:
  A single outlier dominates
  t-stat ≈ ±1
  False negative for a real effect

With cap:
  The outlier's influence is bounded
  Variance returns to a normal range
  A real effect becomes detectable

This reinforces the capping recommendation of Ch.18.

2.4 Failure Pattern 3: Few Discrete Values

Stated by the authors (Ch.19.7).

2.4.1 Pattern

Histogram shape:
  Most bins empty
  Large mass (~30~50%) at a few specific p-values

Visual:
                     ###############
                                                 ###############
  (else 0)

2.4.2 Root cause

The authors note: “This happens when the data is single-valued (e.g., 0) with a few rare instances of non-zero values.”

2.4.2.1 Sparse Metric

Scenario:
  Metric: revenue per user
  Most users: $0 (no purchase)
  A few users: $100 (purchase)
  Purchase is a very rare event

Distribution:
  Discrete (0 or a specific value)
  Heavy mass at 0

Difference of means:
  Δ = mean_T - mean_C
  Takes only a limited set of values (depends on how the purchases split between variants)

Possible Δ (equal group size n):
  Same number of $100 purchases in T and C: Δ = 0
  One more $100 purchase in T: Δ = +$100/n
  Two more $100 purchases in T: Δ = +$200/n
  ...
  Discrete values

2.4.2.2 Discreteness of the t-stat

t = Δ / SE
SE is nearly the same in every A/A run (it is determined by the total number of rare events)

Possible values of t:
  Discrete Δ ÷ nearly constant SE
  → Few discrete t values
  → Few discrete p-values

2.4.3 Fix

The authors state: “the t-test is not accurate, but this is not as serious as the prior scenario, because if a new Treatment causes the rare event to happen often, the Treatment effect will be large and statistically significant.”

Less critical:
  Sparse metrics are common
  Only large real effects are detected reliably (small effects are hard)
  Requires caution, but acceptable

Fix options:

1. Bootstrap:
   - Makes no normality assumption, so it copes with discrete distributions
   - Empirical p-value

2. Binarization:
   - "Did the rare event occur?" as a boolean
   - Sparse metric → boolean metric
   - Variance is bounded

3. Larger sample size:
   - More occurrences of the rare event
   - More possible values of Δ
   - Distribution moves closer to continuous

4. Aggregation:
   - Per-region level instead of per-user level
   - Regional sums are less sparse

2.4.3.1 Acceptable when the Treatment effect is large

The authors note: “if a new Treatment causes the rare event to happen often, the Treatment effect will be large and statistically significant.”

Acceptable case for a sparse metric:
  Treatment dramatically increases the frequency of the rare event:
    Control: 1% conversion
    Treatment: 5% conversion
    → Δ is large and detectable

Problematic case for a sparse metric:
  Treatment yields a tiny increment:
    Control: 1% conversion
    Treatment: 1.05% conversion
    → Small Δ
    → Buried in the noise of the discrete distribution

Small effects on sparse metrics are therefore hard to detect.
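
Treating the rare event as a boolean conversion (the binarization option), the two cases can be compared with a pooled two-proportion z-test. This is a sketch: the sample size of 100,000 users per variant is an assumption made only for illustration, and the function name is hypothetical.

```python
import math
from scipy import stats

def two_proportion_p_value(conv_t, n_t, conv_c, n_c):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    pooled = (conv_t + conv_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    return float(2 * stats.norm.sf(abs(z)))

n_users = 100_000  # assumed users per variant
large = two_proportion_p_value(5_000, n_users, 1_000, n_users)   # 5% vs 1%
small = two_proportion_p_value(1_050, n_users, 1_000, n_users)   # 1.05% vs 1%
print(f"1% -> 5%:    p = {large:.3g}")
print(f"1% -> 1.05%: p = {small:.3g}")
```

Even at this sample size the 1% → 1.05% change is not statistically significant, while the 1% → 5% change is overwhelmingly so.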

2.5 Operating Continuous A/A Tests

Stated by the authors (Ch.19.7).

2.5.1 Drift Detection

The authors recommend: “we recommend regularly running A/A tests concurrently with your A/B tests to identify regressions in the system or a new metric that is failing because its distribution has changed or because outliers started showing up.”

2.5.1.1 A drift scenario

Day 0: A/A pass (all metrics)
Day 30: new logging system introduced
Day 60: new metric added
Day 90: A/A fail (some metrics)

Possible sources of drift:
  - The new logging drops or leaks part of the user pool
  - The new metric violates i.i.d.
  - Outlier behavior changes (bots evolve)
  - Unequal cache pressure between variants (auto-scaling)

2.5.1.2 Detection with continuous A/A

Daily A/A:
  Track the KS p-value of each metric every day
  Detect shifts in the p-value distribution

  On failure:
    Alert
    Correlate with recent changes
    Root cause analysis

2.5.2 Mechanism of Distribution Drift Detection

Time series of KS p-values:
  Day 1~30: mean 0.5 (consistent with uniform, pass)
  Day 31~60: mean 0.4 (slight drift)
  Day 61~90: mean 0.1 (clear fail)

Threshold-based alert:
  KS p < 0.01: critical alert
  Average over 7 days < 0.1: warning

Trend analysis:
  Mann-Kendall trend test
  Detect the change point
  Root cause: system changes near the change point

2.5.2.1 Bing example

Operations at Bing (hypothetical reconstruction, not a documented incident):
  Day 1: all metrics pass A/A
  Day 30: KS p of some metrics starts to drift
  Day 35: alert
  Day 36: investigation
    - Finding: the new logging drops users from some regions
    - Root cause: a new filter rule
  Day 37: fix deployed
  Day 38~: A/A passes again

This cycle is the standard for production maintenance.
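
The threshold and trend rules above can be sketched as follows. The thresholds are the ones listed in the text; the function names are hypothetical, the input is assumed to be a non-empty list of daily KS p-values, and Kendall's tau against the day index stands in for the Mann-Kendall test (which is based on the same rank statistic).

```python
import numpy as np
from scipy import stats

def classify_drift(daily_ks_p, critical=0.01, warning=0.1, window=7):
    """Classify the latest day from a non-empty series of daily KS p-values."""
    recent = np.asarray(daily_ks_p[-window:], dtype=float)
    if recent[-1] < critical:
        return "critical"
    if recent.size == window and recent.mean() < warning:
        return "warning"
    return "ok"

def trend(daily_ks_p):
    """Kendall's tau of the KS p-values against time; negative tau means decline."""
    days = np.arange(len(daily_ks_p))
    tau, p_value = stats.kendalltau(days, daily_ks_p)
    return tau, p_value

print(classify_drift([0.5] * 7))             # ok
print(classify_drift([0.5] * 6 + [0.005]))   # critical
print(classify_drift([0.05] * 7))            # warning
```

In practice a single low KS p-value is expected occasionally by chance, which is why the warning rule averages over a window rather than reacting to every dip.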

Intuition: a diagnostic mental model for A/A failure patterns

A unified mental model for the three patterns.

2.5.2.2 Immediate judgment per pattern

Look at the histogram:

If large mass near 0 (skew):
  → Variance underestimated
  → Likely cause: i.i.d. violation OR heavy tail
  → Fix: delta method OR capping/min sample

If mass near 0.32:
  → Single outlier dominance
  → Likely cause: bot OR extreme user
  → Fix: bot detection OR capping

If discrete points only:
  → Sparse data
  → Likely cause: rare-event metric
  → Fix: binarization OR bootstrap (if the expected effect is small)
  → Or acceptable as is (if the expected effect is large)
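
The decision rules above can be turned into a rough automatic classifier. This is a heuristic sketch only: the function name and every threshold (share of unique values, 30% mass near 0.32, 10% false positive rate) are hypothetical choices for illustration, not values from the book.

```python
import numpy as np
from scipy import stats

def diagnose(p_values, alpha=0.01):
    """Heuristically label a set of A/A p-values with a failure pattern."""
    p = np.asarray(p_values, dtype=float)
    # Pattern 3: very few distinct p-values (after rounding to 3 decimals).
    n_unique = np.unique(np.round(p, 3)).size
    if n_unique < 0.1 * min(p.size, 1000):
        return "discrete"
    # Uniformity check: if the KS test does not reject, the metric passes.
    if stats.kstest(p, "uniform").pvalue >= alpha:
        return "pass"
    # Pattern 2: excess mass around p = 0.32 (15% expected under uniformity).
    if np.mean((p > 0.25) & (p < 0.40)) > 0.30:
        return "outlier"
    # Pattern 1: excess mass near 0 (5% expected below 0.05 under uniformity).
    if np.mean(p < 0.05) > 0.10:
        return "skewed"
    return "other"

grid = np.linspace(0.0005, 0.9995, 1000)
print(diagnose(grid))                            # evenly spread p-values
print(diagnose(grid ** 3))                       # mass pushed toward 0
print(diagnose(np.linspace(0.26, 0.39, 1000)))   # mass around 0.32
print(diagnose(np.tile([0.1, 0.5, 0.9], 300)))   # three distinct values
```

The order of the checks matters: the discreteness check runs first because a handful of distinct p-values can otherwise look like a spike near 0 or near 0.32.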

2.5.2.3 Recursive verification

Re-verify after the fix:
  Run 1000 new simulations
  Run a new KS test
  Pass: trust restored
  Fail: another cause remains
  → Fix iteratively

2.5.2.4 Industry practice

How a modern platform operates:
  1. Pre-launch: 1000 simulations, validate every metric
  2. Continuous: daily A/A, drift detection
  3. Post-issue: investigation + fix + re-validation
  4. Documentation: trust history of each metric

These four stages are standard operation for a mature platform.

3 Why It Matters

Without analysis of the three failure patterns:

Skewed pattern: variance underestimation accumulates, false positives surge
0.32 mass pattern: the outlier dominates, real effects are buried
Discrete pattern: noise from the sparse metric

With detection and diagnosis of each pattern in place:

Per-pattern fix: a clear action for each root cause
Drift detection: continuous monitoring
Ongoing trust: production quality is maintained

This diagnostic framework reflects the statistical maturity of a platform.

4 Application: a validation process for adding a new metric

Adding a new metric, "average session duration":

Step 1: Pre-launch simulation
  - Use the last week of raw data
  - 1000 A/A simulations
  - KS p-value: 0.001 (FAIL)
  - Histogram: mass near 0

Step 2: Diagnosis
  - Pattern: skewed
  - Possible cause: i.i.d. violation or heavy tail

Step 3: Investigation
  - User-level vs session-level
  - Measure distribution skew (skewness statistic)
  - Answer: heavy tail (skew = 8)

Step 4: Fix attempt
  - Cap at the 99th percentile
  - Re-run the simulation
  - KS p: 0.45 (PASS)

Step 5: Production deploy
  - Apply capping
  - Continuous A/A monitoring
  - Track trust history for 30 days

This process is the quality gate for introducing a new metric.
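
Steps 1 and 4 both rely on replaying A/A splits over historical data. The sketch below shows one way to do that, assuming one metric value per user, a Welch t-test, and a KS test against U(0, 1); the function names, the 0.05 pass threshold, and the simulated input are hypothetical.

```python
import numpy as np
from scipy import stats

def aa_p_values(values, n_sims=1000, seed=0):
    """Re-randomize the same users into two halves n_sims times; one p-value per split."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    half = values.size // 2
    p = np.empty(n_sims)
    for i in range(n_sims):
        shuffled = rng.permutation(values)
        p[i] = stats.ttest_ind(shuffled[:half], shuffled[half:], equal_var=False).pvalue
    return p

def validate_metric(values, n_sims=1000, alpha=0.05, seed=0):
    """Summarize the simulated A/A p-values: KS p-value, false positive rate, verdict."""
    p = aa_p_values(values, n_sims, seed)
    ks_p = float(stats.kstest(p, "uniform").pvalue)
    return {"ks_p": ks_p, "fp_rate": float(np.mean(p < 0.05)), "passed": ks_p >= alpha}

# Hypothetical well-behaved metric: one value per user.
rng = np.random.default_rng(1)
report = validate_metric(rng.normal(50.0, 10.0, 2000))
print(report)
```

The same helper can be run on the raw metric and again on the capped metric, so the before/after comparison in Step 4 uses identical splits.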

5 Wrapping Up the Ch.19 Series

Four parts completed:

F19-0: A/A test definition, 5 purposes, map of the 5 examples
F19-1: the 5 purposes in depth, Examples 1 (CTR) and 2 (Optimizely)
F19-2: Examples 3 (Redirect), 4 (Unequal), 5 (Hardware), running 1000 simulations
F19-3: P-value distribution, 3 failure patterns, continuous A/A

Next: Ch.20 (Triggering, 6 parts).

6 Code Example: visualizing the 3 failure patterns

A simulation of each pattern.

import numpy as np
from scipy import stats

# Number of simulated A/A tests per pattern; each simulation seeds its own generator.
n = 1000

# === Pattern 1: Skewed (variance underestimate) ===
print("=== Pattern 1: Skewed (variance underestimate) ===")
# A page-level t-test on user-randomized data violates the i.i.d. assumption.
p_skewed = []
for sim in range(n):
    sim_rng = np.random.default_rng(sim)
    # User-level data with within-user correlation
    n_users = 1000
    user_baseline = sim_rng.normal(0, 1, n_users)
    pages_per_user = sim_rng.integers(1, 30, n_users)

    # Page-level data: every page of a user shares that user's baseline,
    # which creates within-user correlation.
    user_ids = np.repeat(np.arange(n_users), pages_per_user)
    page_values = user_baseline[user_ids] + sim_rng.normal(0, 0.3, user_ids.size)

    # Randomize at the user level, but analyze at the page level.
    group_a = sim_rng.choice(n_users, n_users // 2, replace=False)
    is_a = np.isin(user_ids, group_a)
    a = page_values[is_a]
    b = page_values[~is_a]
    _, p = stats.ttest_ind(a, b)
    p_skewed.append(p)

p_skewed = np.array(p_skewed)
fp_rate = (p_skewed < 0.05).sum() / n
ks_stat, ks_p = stats.kstest(p_skewed, 'uniform')
print(f"False positive rate: {fp_rate*100:.1f}% (expected 5%)")
print(f"KS test: stat={ks_stat:.4f}, p={ks_p:.4f}")
print(f"Histogram (10 bins):")
hist, _ = np.histogram(p_skewed, bins=np.linspace(0, 1, 11))
for i, count in enumerate(hist):
    print(f"  [{i*0.1:.1f}, {(i+1)*0.1:.1f}): {count}")

# === Pattern 2: Mass at 0.32 (outlier) ===
print("\n=== Pattern 2: Mass at 0.32 (outlier) ===")
p_outlier = []
for sim in range(n):
    sim_rng = np.random.default_rng(sim)
    n_users = 1000
    data = sim_rng.normal(50, 10, n_users)
    # Add single very large outlier
    data[0] = 5000  # 100x normal value

    group = sim_rng.choice([0, 1], n_users, p=[0.5, 0.5])
    a = data[group == 0]
    b = data[group == 1]
    _, p = stats.ttest_ind(a, b)
    p_outlier.append(p)

p_outlier = np.array(p_outlier)
fp_rate_o = (p_outlier < 0.05).sum() / n
print(f"False positive rate: {fp_rate_o*100:.1f}%")
mass_at_32 = ((p_outlier > 0.25) & (p_outlier < 0.40)).sum() / n
print(f"Mass at p=[0.25, 0.40]: {mass_at_32*100:.1f}% (expected 15% if uniform)")
print(f"Histogram (10 bins):")
hist, _ = np.histogram(p_outlier, bins=np.linspace(0, 1, 11))
for i, count in enumerate(hist):
    print(f"  [{i*0.1:.1f}, {(i+1)*0.1:.1f}): {count}")

# === Pattern 3: Discrete (sparse data) ===
print("\n=== Pattern 3: Discrete (sparse data) ===")
p_discrete = []
for sim in range(n):
    sim_rng = np.random.default_rng(sim)
    n_users = 1000
    # Sparse: 99% are 0, 1% are 100
    data = np.zeros(n_users)
    rare = sim_rng.choice(n_users, 10, replace=False)
    data[rare] = 100

    group = sim_rng.choice([0, 1], n_users, p=[0.5, 0.5])
    a = data[group == 0]
    b = data[group == 1]
    _, p = stats.ttest_ind(a, b)
    p_discrete.append(p)

p_discrete = np.array(p_discrete)
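unique_p, counts = np.unique(np.round(p_discrete, 3), return_counts=True)
print(f"Number of unique p-values (rounded): {len(unique_p)}")
top = unique_p[np.argsort(counts)[::-1][:10]]  # ten most frequent rounded p-values
print(f"Most common p-values: {top.tolist()}")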
print(f"Histogram (10 bins):")
hist, _ = np.histogram(p_discrete, bins=np.linspace(0, 1, 11))
for i, count in enumerate(hist):
    print(f"  [{i*0.1:.1f}, {(i+1)*0.1:.1f}): {count}")

Intuition: the visual fingerprint of the 3 patterns

Numerical characteristics of each pattern.

6.0.0.1 Pattern 1: Skewed

Visual: left-heavy
False positive rate: far above the nominal 5% (the size depends on the strength of the within-user correlation)
KS test: fail (p < 0.001)
Histogram: spike near 0

6.0.0.2 Pattern 2: Mass at 0.32

Visual: mass around 0.32
False positive rate: below the nominal 5% (|t| is pinned near 1 and rarely reaches 1.96)
0.25-0.40 mass: far above the 15% expected under uniformity
Histogram: spike around the [0.3, 0.4) bin

6.0.0.3 Pattern 3: Discrete

Visual: a few discrete spikes
Unique p-values (rounded to 3 decimals): well below the roughly 630 that 1000 uniform draws would produce
Histogram: mass in only some bins

6.0.0.4 Automating the diagnosis

Automatic detection on a modern platform:
  - KS test (uniformity check)
  - Mass-at-0.32 detection
  - Unique value count
  - Skewness statistic

Automatic alert + suggestion per pattern:
  - Pattern 1 → "Try delta method or capping"
  - Pattern 2 → "Investigate outlier"
  - Pattern 3 → "Consider binarization"

This automation is standard at the Run and Fly stages.

7 Related Topics

Prerequisites

Wrap-up of the Ch.19 series: four parts completed. Next: Ch.20 (Triggering).
