Kwangmin Kim - Kohavi Ch.18 Overview: Variance Estimation and Improving Sensitivity (CUPED)

1 Definition

Definition: Why variance is statistically central

Variance underlies almost every statistical concept used in experiment analysis (Kohavi, Tang, Xu, 2020, Ch.18).

1.0.0.1 Five concepts derived from variance

Concept	Role of variance
p-value	Depends on the distribution of the test statistic, which depends on variance
Statistical significance	The decision threshold depends on variance
Confidence interval	Half-width = 1.96 × √(variance/n) at the 95% level
Statistical power	Power falls as variance rises (the detectable effect scales with √variance)
Sample size requirement	N ∝ variance

1.0.0.2 Consequences of estimating variance incorrectly

Overestimated variance → more false negatives
  - A real effect is buried in noise
  - Wrong conclusion: "no effect"
  - A change with real value is not launched

Underestimated variance → more false positives
  - Chance noise looks "significant"
  - Wrong conclusion: "there is an effect"
  - Spurious launch

Original text (Ch.18): “What is the point of running an experiment if you cannot analyze it in a trustworthy way? Variance is the core of experiment analysis.”

Key insight: every statistical test is only as accurate as its variance estimate. A wrong variance estimate undermines every conclusion built on it. Improving sensitivity means reducing variance.

2 Concepts and principles

2.1 Review of the standard variance computation

As summarized by the authors (Ch.18.1).

2.1.1 Standard formulas

2.1.1.1 Assumption

i.i.d. (independent, identically distributed) samples \(Y_1, ..., Y_n\).

In most cases i indexes a user. It can also index a session, page, user-day, and so on.

2.1.1.2 Three steps

Step 1: Sample mean
  Ȳ = (1/n) × Σ Y_i

Step 2: Sample variance
  s² = (1/(n-1)) × Σ (Y_i - Ȳ)²

Step 3: Variance of the mean (the quantity used in experiment analysis)
  Var(Ȳ) = s²/n

2.1.1.3 Standard error

SE(Ȳ) = s/√n

95% CI: Ȳ ± 1.96 × SE

Every analysis starts from these formulas, but they are valid only under the i.i.d. assumption.
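The three steps and the confidence interval can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the observations really are i.i.d.; the function name `mean_with_ci` and the sample values are hypothetical.

```python
import numpy as np

def mean_with_ci(y, z=1.96):
    """Return (mean, sample variance, standard error, CI) for i.i.d. observations."""
    y = np.asarray(y, dtype=float)
    n = y.size
    y_bar = y.mean()                 # Step 1: sample mean
    s2 = y.var(ddof=1)               # Step 2: sample variance with the 1/(n-1) factor
    se = np.sqrt(s2 / n)             # Step 3: sqrt of Var(Y_bar) = s^2 / n
    return y_bar, s2, se, (y_bar - z * se, y_bar + z * se)

y_bar, s2, se, ci = mean_with_ci([3, 5, 4, 6, 7])
print(f"mean={y_bar:.2f}, s2={s2:.2f}, SE={se:.3f}, CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

Note the `ddof=1` argument: NumPy's default is the population variance (division by n), which is not the s² defined in Step 2.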

2.2 Three common pitfalls in depth

Identified by the authors (Ch.18.1). Details are in F-KOH18-1.

2.2.1 Pitfall 1: Delta vs Delta %

Absolute difference (Delta):
  Δ = Ȳ_t - Ȳ_c
  Unit: that of the original metric (e.g. 0.01 sessions)

Relative difference (Delta %):
  Δ% = (Ȳ_t - Ȳ_c) / Ȳ_c
  Unit: percent (e.g. +5%)

2.2.1.1 Why Delta % is preferred

The authors: “It is difficult to know if 0.01 more sessions from an average user are a lot.”

From the decision maker's point of view:
  "0.01 sessions per user" → hard to interpret
  "5% more sessions" → intuitive

Comparison across contexts:
  "+5% engagement" can be set next to "+10% revenue"
  Absolute differences are hard to compare

2.2.1.2 The variance trap

Incorrect estimate (a common mistake):
  Var(Δ%) ≈ Var(Δ) / Ȳ_c²

Why this is wrong:
  Ȳ_c is itself a random variable
  The variance of a ratio has additional terms

Correct estimate (delta method):
  Var(Δ%) = ... (details in F-KOH18-1)

This trap was a systemic error in web analytics during the 1990s and 2000s.
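As a sketch of the difference between the two estimates: for independent Treatment and Control groups, the first-order delta method gives Var(Δ%) ≈ Var(Ȳ_t)/Ȳ_c² + Ȳ_t² × Var(Ȳ_c)/Ȳ_c⁴. The code below compares that with the naive Var(Δ)/Ȳ_c². The function name and the tiny sample arrays are hypothetical, and the independence of the two groups is an assumption.

```python
import numpy as np

def relative_delta_variance(treatment, control):
    """Return (naive, delta-method) variance of (mean_t - mean_c) / mean_c."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    mean_t, mean_c = t.mean(), c.mean()
    var_mean_t = t.var(ddof=1) / t.size        # Var of the Treatment mean
    var_mean_c = c.var(ddof=1) / c.size        # Var of the Control mean
    # Naive: treats the denominator mean_c as a fixed constant
    naive = (var_mean_t + var_mean_c) / mean_c**2
    # Delta method: accounts for the randomness of mean_c in the denominator
    delta_method = var_mean_t / mean_c**2 + mean_t**2 * var_mean_c / mean_c**4
    return naive, delta_method

naive, delta_method = relative_delta_variance([12, 14, 10, 16], [8, 10, 12, 10])
print(f"naive={naive:.5f}, delta method={delta_method:.5f}")
```

The two estimates differ only in the Control term, by the factor (Ȳ_t/Ȳ_c)², so the gap grows with the size of the relative effect.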

2.2.2 Pitfall 2: Ratio metrics violate i.i.d.

Examples of ratio metrics:
  CTR = clicks / pageviews
  Revenue per click = revenue / clicks

How i.i.d. is violated:
  Randomization is by user, but the metric (CTR) is analyzed at the page level
  → Pages Y_1, Y_2, Y_3 may all belong to the same user
  → Within-user correlation
  → The i.i.d. assumption breaks

Naive variance estimate:
  s² = (1/(n-1)) × Σ (Y_i - Ȳ)²
  → Underestimates the variance (ignores within-user correlation)
  → More false positives

2.2.2.1 The delta method fix

The authors, citing Deng et al. 2017: “the ratio of two averages, is also normally distributed. Therefore, by the delta method we can estimate the variance.”

Rewrite the ratio in terms of user-level quantities:
  X_i = clicks of user i
  Y_i = pageviews of user i
  M = X̄ / Ȳ (ratio of two user-level averages)

Delta method variance:
  Var(M) = (1/Ȳ²) × Var(X̄)
         + (X̄²/Ȳ⁴) × Var(Ȳ)
         - 2 × (X̄/Ȳ³) × Cov(X̄, Ȳ)

An accurate variance estimate is what makes the resulting decision trustworthy.
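The formula above can be sketched as follows. The simulated data are hypothetical: each user gets their own click propensity, which is what creates within-user correlation. The naive estimate treats every pageview as an independent Bernoulli trial.

```python
import numpy as np

def ratio_metric_variance(x, y):
    """Delta-method variance of M = mean(x) / mean(y) from user-level totals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    var_x_bar = x.var(ddof=1) / n
    var_y_bar = y.var(ddof=1) / n
    cov_xy_bar = np.cov(x, y, ddof=1)[0, 1] / n
    return (var_x_bar / y_bar**2
            + x_bar**2 * var_y_bar / y_bar**4
            - 2 * x_bar * cov_xy_bar / y_bar**3)

rng = np.random.default_rng(1)
n_users = 2000
user_ctr = rng.beta(2, 18, n_users)            # per-user click propensity (assumed)
pageviews = rng.poisson(20, n_users) + 1       # at least one pageview per user
clicks = rng.binomial(pageviews, user_ctr)     # clicks correlate within a user

ctr = clicks.sum() / pageviews.sum()
naive_var = ctr * (1 - ctr) / pageviews.sum()  # pretends pageviews are i.i.d.
delta_var = ratio_metric_variance(clicks, pageviews)
print(f"CTR={ctr:.4f}, naive SE={naive_var**0.5:.5f}, delta-method SE={delta_var**0.5:.5f}")
```

In this setup the delta-method variance comes out larger than the naive one, which is the underestimation described above.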

2.2.2.2 The bootstrap alternative

The authors: “there are metrics that cannot be written in the form of the ratio of two user-level metrics, for example, 90th percentile of page load time. For these metrics, we may need to resort to bootstrap method.”

Bootstrap:
  - Resample users with replacement
  - Compute the metric on each resample
  - Use the distribution over the resamples (e.g. 1,000) as the empirical variance

Pros:
  - Applies to any metric
  - No distributional assumption

Cons:
  - Computationally expensive (about 1,000 × the cost of one analysis)
  - More complex to implement

Typical platform setup: delta method by default, bootstrap as the fallback.
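A minimal sketch of a user-level bootstrap for the 90th percentile of page load time. The helper name `bootstrap_variance` and the lognormal latency data are assumptions made for illustration. The key point is that whole users are resampled, not individual page loads.

```python
import numpy as np

def bootstrap_variance(values_by_user, statistic, n_resamples=1000, seed=0):
    """Empirical variance of `statistic` under user-level resampling with replacement."""
    rng = np.random.default_rng(seed)
    n_users = len(values_by_user)
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):
        picked = rng.integers(0, n_users, n_users)     # resample users, not pages
        pooled = np.concatenate([values_by_user[i] for i in picked])
        estimates[b] = statistic(pooled)
    return estimates.var(ddof=1)

rng = np.random.default_rng(2)
# Hypothetical data: each user has 1 to 9 page loads with lognormal latency (ms)
load_times = [rng.lognormal(6.0, 0.5, rng.integers(1, 10)) for _ in range(300)]

p90_var = bootstrap_variance(load_times, lambda v: np.percentile(v, 90), n_resamples=200)
print(f"bootstrap SE of p90 latency: {p90_var**0.5:.1f} ms")
```

The cost scales linearly with `n_resamples`, which is why this is usually kept as a fallback rather than the default.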

2.2.3 Pitfall 3: Outliers

The authors stress: “Outliers have a big impact on both the mean and variance. In statistical testing, the impact on the variance tends to outweigh the impact on the mean.”

2.2.3.1 Effect of an outlier on mean vs variance

Baseline data:
  10 users, mean 5 sessions, std 2

Add one outlier (1 user with 100 sessions):
  New mean: (50 + 100) / 11 ≈ 13.6 (about 2.7x)
  New std: about 29 (about 14x)

Comparison:
  Mean: about 2.7x
  Variance: about 200x (std² ≈ 29² ≈ 820 vs 2² = 4)

2.2.3.2 Effect on the t-statistic

t = (mean_T - mean_C) / SE
SE = √(var/n)

With an outlier:
  Mean rises → numerator rises
  Variance rises → denominator rises even more

Result:
  The t-statistic shrinks
  → Less significant

2.2.3.3 Simulation (the authors' Figure 18.1)

The authors' description of the figure: “as we increase the size of the (single) outlier, the two-sample test goes from being very significant to not significant at all.”

Illustrative pattern as the outlier grows (indicative values, not read off the figure):
  Outlier 0x: very large t-stat, p < 0.001
  Outlier 5x: smaller t-stat, p ≈ 0.05
  Outlier 10x: smaller still, p > 0.10
  Outlier 20x: t-stat far below the significance threshold, p > 0.50

→ The outlier masks a large effect
→ The real effect goes undetected

2.2.3.4 Fix: capping

The authors: “A practical and effective method is to simply cap observations at a reasonable threshold. For example, human users are unlikely to perform a search over 500 times or have over 1,000 pageviews in one day.”

Capping example:
  Threshold: 1,000 pageviews/day

  Original outlier: 10,000 pageviews
  Capped: 1,000

  Effect:
    - The outlier's influence drops (minimal effect on the mean)
    - Its effect on the variance drops dramatically
    - The t-statistic recovers

2.2.3.5 Other methods

The authors, citing Hodge and Austin 2004: “There are many other outlier removal techniques.”

Other outlier methods:
  1. Winsorization (cap at top/bottom percentiles)
  2. Trimming (drop top/bottom percentiles)
  3. Robust statistics (median, MAD)
  4. Z-score based detection
  5. ML-driven anomaly detection

Typical platform setup: a hybrid of capping and bot detection (Ch.16).

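Intuition: the asymmetric effect of outliers

The statistical reason variance is more sensitive to outliers than the mean.

2.2.3.6 Mathematical decomposition

Mean: a linear function of the data
  Y_i rises by 1 unit → the mean rises by 1/n
  Outlier influence: its value / n

Variance: a quadratic function
  Y_i rises by 1 unit → (Y_i - Ȳ)² grows quadratically
  Outlier influence: roughly the square of its distance from the mean, divided by n

2.2.3.7 Quantitative comparison

N = 10,000 users, mean 5 sessions, std 2 (variance 4)

Single outlier (1,000 sessions = 200x the average):

Effect on the mean:
  New mean = (50,000 + 1,000) / 10,001 ≈ 5.1
  Change: 0.1 (2%)

Effect on the variance:
  Variance ≈ Σ(Y_i - mean)² / N
  Outlier's contribution: (1000-5)² = 990,025
  Other users' contribution: ~10,000 × 4 = 40,000
  New variance: ~1,030,000 / 10,000 ≈ 103 (about 26x the original variance of 4)
  Change: about +2,500% (vs +2% for the mean)

→ The relative effect on the variance is more than 1,000 times the effect on the mean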
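The arithmetic above can be checked numerically. This is a small sketch on simulated data. The normal distribution with mean 5 and std 2 is an assumption chosen to match the example, not real session data.

```python
import numpy as np

rng = np.random.default_rng(3)
sessions = rng.normal(5.0, 2.0, 10_000)        # hypothetical clean users: mean 5, std 2
with_outlier = np.append(sessions, 1000.0)     # one extra user with 1,000 sessions

mean_ratio = with_outlier.mean() / sessions.mean()
var_ratio = with_outlier.var(ddof=1) / sessions.var(ddof=1)
print(f"mean grows x{mean_ratio:.3f}, variance grows x{var_ratio:.1f}")
```

The mean moves by a couple of percent while the variance grows by more than an order of magnitude, which is the asymmetry that capping exploits.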

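2.2.3.8 Consequences for the decision

Without capping:
  - Mean almost unchanged
  - Variance rises dramatically
  - Small t-statistic
  - Not significant
  - Wrong conclusion: "no effect"

With capping:
  - Mean almost unchanged
  - Variance normal
  - t-statistic normal
  - Significant
  - Correct decision

This asymmetry is what makes capping valuable: it changes the mean minimally while protecting the variance.

2.2.3.9 Natural caps

Commonly assumed upper limits on human user behavior:

Search: 500/day (about 1 per minute for 8 hours)
Pageview: 1000/day
Click: 500/day
Session: 50/day

Anything above these limits is almost certainly a bot or an anomaly, so capping at these values is natural.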

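2.3 Seven techniques for improving sensitivity

Listed by the authors (Ch.18.2). Details are in F-KOH18-2.

2.3.1 Technique 1: A metric with smaller variance

The authors: “Create an evaluation metric with a smaller variance while capturing similar information.”

2.3.1.1 Example 1: Searches vs Searchers

Searches per user (count):
  Mean: 5
  Variance: 50 (high, driven by a few heavy users)

Searchers (boolean: did the user search?):
  Mean: 0.7 (70% search)
  Variance: 0.21 (low, bounded in 0~1)

Similar information (search engagement), and the raw variance is 1/238 as large.
The two metrics are on different scales, so the fair comparison for sample size is variance relative to the squared mean:
  Searches: 50 / 5² = 2.0
  Searchers: 0.21 / 0.7² ≈ 0.43
→ For the same relative effect, roughly 1/4.7 of the sample size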
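A quick way to compare two metrics on different scales is the rule-of-thumb sample size n ≈ 16σ²/δ² (roughly 80% power at α = 0.05), with δ expressed as a relative effect times the mean. The sketch below assumes, purely for illustration, that the treatment moves both metrics by the same relative amount. In practice that has to be checked per metric.

```python
def users_per_variant(mean, variance, relative_effect):
    """Rule-of-thumb sample size n = 16 * sigma^2 / delta^2 per variant."""
    delta = mean * relative_effect          # absolute effect to detect
    return 16 * variance / delta**2

searches = users_per_variant(mean=5.0, variance=50.0, relative_effect=0.01)
searchers = users_per_variant(mean=0.7, variance=0.21, relative_effect=0.01)
ratio = searches / searchers
print(f"searches: {searches:,.0f}, searchers: {searchers:,.0f}, ratio: {ratio:.2f}")
```

What matters for sample size is σ²/μ², not the raw variance, because a boolean metric also has a smaller absolute effect to detect.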

2.3.1.2 Example 2: Purchase amount vs conversion (boolean)

The authors, citing Kohavi et al. 2009: “using conversion rate instead of purchasing spend reduced the sample size needed by a factor of 3.3.”

Purchase amount (real-valued):
  Most users: 0
  Some users: $10~$1000
  Variance: very high (long tail)

Conversion (boolean):
  0 or 1
  Variance: bounded

Sensitivity:
  Conversion needs 1/3.3 of the sample size
  → About 1/3.3 of the run time for the same sensitivity, at constant traffic

Metric design of this kind is central to sensitivity, provided the chosen metric does not lose the information that matters.

2.3.2 Technique 2: Transformation

Three transformations:

1. Capping:
   - Limits outliers (the third pitfall)
   - Protects the variance

2. Binarization:
   - Real-valued → boolean
   - Example: Netflix streaming hours
   - Boolean "watched more than X hours"
   - Xie & Aurisset (2016)

3. Log transformation:
   - For heavy long-tailed metrics (revenue, time)
   - Convert to log scale
   - Variance ↓
   - At the cost of interpretability

Choosing among them:
   - Information loss vs variance reduction
   - Business meaning vs statistical sensitivity

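2.3.3 Technique 3: Triggered analysis

The authors cross-reference Ch.20.

Triggered analysis:
  Analyze only users who reached the part of the product the experiment changes
  Exclude non-triggered users → less noise

The sensitivity gain can be dramatic:
  Feature with a 5% trigger rate:
    - All-user analysis: effect 0.5% (5% × 10%)
    - Triggered analysis: effect 10% (the actual effect)
    - The effect to detect is 20x larger. With similar per-user variance, this needs roughly 20x less total traffic

This is effectively mandatory for niche features, and it is among the highest-ROI techniques for general features as well.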
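The dilution arithmetic can be sketched with the same n ≈ 16σ²/δ² rule of thumb. The simplifying assumptions here are that per-user variance is the same for triggered and non-triggered users and that non-triggered users have exactly zero effect. Real metrics usually deviate from both.

```python
def users_per_variant(variance, delta):
    """Rule-of-thumb sample size n = 16 * sigma^2 / delta^2 per variant."""
    return 16 * variance / delta**2

trigger_rate = 0.05
triggered_effect = 0.10     # effect among triggered users (hypothetical units)
variance = 1.0              # assumed equal for triggered and all users

# All-user analysis sees the diluted effect trigger_rate * triggered_effect
all_users = users_per_variant(variance, trigger_rate * triggered_effect)
# Triggered analysis sees the full effect, but only a fraction of traffic triggers
triggered_users = users_per_variant(variance, triggered_effect)
traffic_for_triggered = triggered_users / trigger_rate

saving = all_users / traffic_for_triggered
print(f"all-user: {all_users:,.0f}, triggered (total traffic): {traffic_for_triggered:,.0f}, saving: {saving:.0f}x")
```

Under these assumptions the saving in total traffic equals 1 / trigger rate, so the rarer the trigger, the larger the gain.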

2.3.4 Technique 4: Stratification, control variates, CUPED

The authors cite Deng et al. 2013, Soriano 2017, Xie & Aurisset 2016, Jackson 2018, and Deb et al. 2018.

2.3.4.1 Stratification

Stratification in the sampling phase:
  - Sample separately within each stratum (e.g. region)
  - Analyze each stratum separately
  - Combine the results

Variance:
  Var(Ȳ_strat) = Σ_h (n_h/n)² × Var(Ȳ_h), where Var(Ȳ_h) = s_h²/n_h
  → Only within-stratum variance contributes → smaller

Drawbacks:
  - Stratified sampling is expensive to operate at large scale
  - Complex to implement

Fix: post-stratification
  - Sampling stays random
  - Stratification is applied in the analysis phase
  - Nearly the same variance reduction when the sample is large enough
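A minimal sketch of post-stratification on randomly sampled data. The function name, the two regions, and the assumption that population shares are known in advance are all illustrative choices.

```python
import numpy as np

def post_stratified_mean(y, strata, weights):
    """Weighted mean of per-stratum means and its estimated variance.

    `weights` maps each stratum label to its population share (assumed known).
    """
    y = np.asarray(y, dtype=float)
    strata = np.asarray(strata)
    mean, var = 0.0, 0.0
    for label, w_h in weights.items():
        y_h = y[strata == label]
        mean += w_h * y_h.mean()
        var += w_h**2 * y_h.var(ddof=1) / y_h.size   # share^2 * Var(stratum mean)
    return mean, var

rng = np.random.default_rng(4)
n = 4000
region = rng.choice(["KR", "US"], size=n, p=[0.3, 0.7])     # random, unstratified sample
y = np.where(region == "KR", rng.normal(10, 2, n), rng.normal(20, 2, n))

strat_mean, strat_var = post_stratified_mean(y, region, {"KR": 0.3, "US": 0.7})
naive_var = y.var(ddof=1) / n
print(f"mean={strat_mean:.2f}, naive var={naive_var:.5f}, post-stratified var={strat_var:.5f}")
```

The gain comes entirely from the between-region difference in means. If the strata had the same mean, the two variances would be nearly identical.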

2.3.4.2 CUPED (the most powerful technique)

CUPED = Controlled Experiment Using Pre-Experiment Data.

The authors: “CUPED is an application of these techniques for online experiments, that emphasizes utilization of pre-experiment data.”

How CUPED works:
  Pre-period (before the experiment): measure each user's baseline metric
  Experiment period: measure Treatment vs Control

  Adjusted metric:
    Y' = Y - θ × X
    where X = pre-period metric, θ = Cov(Y, X) / Var(X) (the variance-minimizing coefficient)

  Variance:
    Var(Y') = Var(Y) × (1 - ρ²)
    where ρ = correlation between the pre-period and experiment-period metric

  Variance is multiplied by (1 - ρ²), and the required sample size scales the same way:
  ρ=0.5: variance 75% (sample size 75%)
  ρ=0.7: variance 51% (sample size 51%)
  ρ=0.9: variance 19% (sample size 19%)

This is the most powerful of the techniques. Details are in F-KOH18-2.
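The adjustment can be sketched in a few lines. This version centers X so that the mean of the adjusted metric is unchanged, which differs from Y - θ × X only by a constant. The simulated pre/post relationship is a hypothetical example, not data from the book.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return Y' = Y - theta * (X - mean(X)) with theta = Cov(Y, X) / Var(X)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    theta = np.cov(y, x, ddof=1)[0, 1] / x.var(ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(5)
n = 20_000
pre = rng.normal(10, 3, n)                    # pre-experiment metric X
post = 0.8 * pre + rng.normal(0, 2, n)        # in-experiment metric Y, correlated with X
adjusted = cuped_adjust(post, pre)

rho = np.corrcoef(post, pre)[0, 1]
var_ratio = adjusted.var(ddof=1) / post.var(ddof=1)
print(f"rho={rho:.3f}, variance ratio={var_ratio:.3f}, 1 - rho^2={1 - rho**2:.3f}")
```

In an actual experiment, θ would typically be estimated once on the pooled Treatment and Control data and applied to both groups, so that the adjustment does not bias the difference between them.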

2.3.5 Technique 5: Granular randomization

Page-level randomization (vs user-level):
  - About 50x more samples if a user views 50 pages on average
  - Variance ↓ (sample size ↑)

Trade-offs:
  - User-level metrics cannot be measured (Ch.14)
  - The user experience becomes inconsistent

Where it applies:
  - Stateless metrics (page latency, single-page CTR)
  - When user-level effects are irrelevant
  - Mostly server-side optimization experiments

2.3.6 Technique 6: Paired experiment (interleaving)

The authors cite Chapelle et al. 2012 and Radlinski and Craswell 2013.

2.3.6.1 Interleaving design

Standard A/B (different users per variant):
  User A: variant A
  User B: variant B
  → Between-user variability

Interleaving:
  User A: sees the ranked lists of variants A and B interleaved
  Which list's items does the user click?
  → Within-user comparison
  → Between-user variability removed

2.3.6.2 Application

Search ranking:
  Treatment ranking: [A, C, E, G, ...]
  Control ranking: [B, D, F, H, ...]
  Interleaved: [A, B, C, D, ...]

  Click analysis:
    - "A click on A credits Treatment"
    - "A click on B credits Control"

Advantages:
  - Much better sample efficiency (on the order of 10~100x)
  - User-specific differences removed

Drawbacks:
  - Complex implementation
  - Limited use cases (ranked lists)
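The alternating example above can be sketched as follows. This is a deliberately simplified scheme in which Treatment always takes the first slot. Production systems use randomized schemes such as team-draft interleaving to avoid that position bias. The function names are hypothetical.

```python
from itertools import zip_longest

def interleave(treatment, control):
    """Alternate items from the two rankings, remembering which arm supplied each."""
    merged, source = [], {}
    for t_item, c_item in zip_longest(treatment, control):
        for item, arm in ((t_item, "treatment"), (c_item, "control")):
            if item is not None and item not in source:   # skip duplicates
                source[item] = arm
                merged.append(item)
    return merged, source

def credit(clicks, source):
    """Count clicks per arm based on which ranking contributed the clicked item."""
    tally = {"treatment": 0, "control": 0}
    for item in clicks:
        if item in source:
            tally[source[item]] += 1
    return tally

merged, source = interleave(["A", "C", "E", "G"], ["B", "D", "F", "H"])
tally = credit(["A", "D", "C"], source)
print(merged, tally)
```

Each user contributes a within-user preference between the two rankings, which is why between-user variability drops out of the comparison.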

2.3.7 Technique 7: Pooled control groups

The authors: “If you have several experiments splitting traffic and each has their own Control, consider pooling the separate controls.”

Standard setup (each experiment splits its own slice of traffic):
  Experiment A: 50% T_A, 50% C_A
  Experiment B: 50% T_B, 50% C_B
  Experiment C: 50% T_C, 50% C_C

Pooled:
  Shared Control: 50% (used by every experiment)
  Per-experiment Treatment: 50% / N_experiments each

Effect on variance:
  Larger control → smaller variance of mean_C
  All experiments benefit
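The variance effect follows from Var(Δ) = σ²/n_T + σ²/n_C. The sketch below compares separate and pooled controls for three experiments sharing a hypothetical 300,000 users, assuming equal per-user variance in every group.

```python
def delta_variance(sigma2, n_treatment, n_control):
    """Variance of the difference in means for independent groups with equal variance."""
    return sigma2 / n_treatment + sigma2 / n_control

sigma2 = 1.0
total_users = 300_000
n_experiments = 3

# Separate controls: each experiment gets 1/3 of traffic, split 50/50
slice_size = total_users // n_experiments
separate = delta_variance(sigma2, slice_size // 2, slice_size // 2)

# Pooled control: 50% shared Control, the other 50% split across Treatments
pooled = delta_variance(sigma2, total_users // 2 // n_experiments, total_users // 2)

print(f"separate={separate:.2e}, pooled={pooled:.2e}, ratio={pooled / separate:.3f}")
```

With the same Treatment size, the larger shared Control removes most of the Control term, cutting the variance of the delta to about two thirds in this configuration.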

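2.3.7.1 Considerations

The authors list three.

1. Trigger conditions:
   A shared control is hard to use when each experiment has a different trigger

2. Comparing treatments:
   Comparing T_A directly against T_B may be underpowered

3. Balanced sizes:
   Unequal T and C sample sizes → slower convergence to normality

On most platforms this is an optional feature. Applying it to every experiment is hard.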

Intuition: comparing the ROI of the seven techniques

Implementation cost vs return for each technique.

2.3.7.2 Cost-ROI matrix

Technique                  | Implementation cost | Variance reduction | Scope
1. Smaller variance metric | Low                 | Medium             | Any experiment (by choice)
2. Transformation          | Low                 | Low~Medium         | Some metrics
3. Triggered analysis      | Medium              | Very High          | Niche features
4. Stratification          | Medium              | Low~Medium         | Any experiment
4'. CUPED                  | High                | Very High          | Any experiment
5. Granular randomization  | Medium              | High               | Some experiments
6. Paired experiment       | High                | Very High          | Ranked lists
7. Pooled control          | Medium              | Medium             | Some experiments

2.3.7.3 Typical industry priority

1. Capping (outlier handling): standard for every experiment
2. CUPED: the best single technique for variance reduction
3. Triggered analysis: standard for niche features
4. Smaller variance metric: the first decision at design time
5. Pooled control: optional, on some platforms
6. Granular randomization: specific use cases
7. Paired experiment: search ranking

2.3.7.4 Cumulative variance reduction

Single technique:
  CUPED: variance about 50% (ρ=0.7)

Combined (illustrative):
  CUPED + capping: variance about 40% (additional gain from limiting outliers)
  CUPED + triggered: only triggered users are analyzed → the effect stands out

→ Combined, a 5~10x gain in sensitivity is possible
→ 1/5~1/10 of the sample size for the same power
→ Much shorter experiments

This cumulative effect is where modern A/B platforms get their sample-size efficiency. Reducing variance is more efficient than simply adding samples.

3 Why it matters

Without accurate variance estimation and sensitivity improvements:

Wrong statistical conclusions: false negatives when variance is overestimated, false positives when it is underestimated
Missed effects on niche features: trigger dilution makes the effect look like zero
Silent damage from outliers: a single bot can distort the result
Sample-size inefficiency: 5~10x more users to detect the same effect
Longer experiments: slower decisions

With them:

Trustworthy statistics: delta method, bootstrap, capping
5~10x sensitivity: CUPED, triggering, smaller variance metrics
Decision speed: conclusive results from shorter experiments
Niche feature analysis: triggered analysis avoids dilution

This gap reflects a platform's statistical maturity. These are the advanced capabilities of mature companies.

4 Application: binarization at Netflix

The authors cite Xie and Aurisset 2016.

Netflix's streaming hours metric:

Original:
  Streaming hours per week, per user
  - Most users: 0~5 hours
  - Some users: 50+ hours (heavy)
  - Variance: very high

Binarization:
  Boolean "Streamed > X hours in week"
  - 0 or 1
  - Variance: bounded

Sensitivity:
  Binarization needs 1/N of the sample size (N depends on context)
  - Critical for short experiments
  - Binarized metric: a decision is possible after one week
  - Original metric: one week is not enough

This design choice is one factor in Netflix's experiment throughput.

5 Next posts in the Ch.18 series

Post	Topic	KOH lines
F18-1	Common Pitfalls (Delta vs Delta %, Ratio Metrics, Outliers)	L:2986~3046
F18-2	Improving Sensitivity (CUPED, Deng et al. 2013)	L:3047~3072

6 Code example: simulating the effect of an outlier on the t-statistic

A parallel to the authors' Figure 18.1.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Baseline data: Treatment has a true +5% effect
n_users = 1000
control_mean = 100
treatment_mean = 105  # +5% effect
sigma = 30
delta = treatment_mean - control_mean

control = rng.normal(control_mean, sigma, n_users)
treatment_clean = rng.normal(treatment_mean, sigma, n_users)

# Outlier size as a multiple of the true delta (0 = no outlier).
# With n = 1000 and sigma = 30, the outlier must reach the thousands to matter.
multipliers = [0, 10, 100, 500, 1000, 2000, 5000]


def with_outlier(mult):
    """Copy of the Treatment sample with one user replaced by a single outlier."""
    sample = treatment_clean.copy()
    if mult > 0:
        sample[0] = treatment_mean + mult * delta
    return sample


print("=== Statistical significance by outlier size ===\n")
print(f"True effect: +{delta} ({delta / control_mean * 100:.1f}%)")
print(f"\n{'Outlier×':>10} {'New mean':>12} {'New std':>12} {'t-stat':>10} {'p-value':>12} {'Significant?':>12}")

for mult in multipliers:
    treated = with_outlier(mult)
    t_stat, p_val = stats.ttest_ind(treated, control)
    sig = "Yes" if p_val < 0.05 else "No"
    print(f"{mult:>10} {treated.mean():>12.2f} {treated.std(ddof=1):>12.2f} {t_stat:>10.2f} {p_val:>12.4f} {sig:>12}")

# Effect of capping: clip both groups at the same threshold
print("\n=== With capping ===\n")
cap_threshold = treatment_mean * 5  # 525, far above any clean observation here
capped_control = np.minimum(control, cap_threshold)

for mult in multipliers:
    capped = np.minimum(with_outlier(mult), cap_threshold)
    t_stat, p_val = stats.ttest_ind(capped, capped_control)
    sig = "Yes" if p_val < 0.05 else "No"
    print(f"Outlier×{mult}: capped mean={capped.mean():.2f}, t={t_stat:.2f}, p={p_val:.4f}, sig={sig}")

Intuition: the message of the simulation

The simulation quantifies how a single outlier erodes statistical significance.

For one outlier of size v in a group of n users, t ≈ (Δ + v/n) / √(2σ²/n + v²/n²), which tends to 1 as v grows.

With n = 1,000, σ = 30, Δ = 5 (approximate, in expectation):
  No outlier: t ≈ 3.7 (significant)
  Outlier a few tens of times Δ: t barely moves
  Outlier above roughly 4,800 (about 950× Δ): t falls below 1.96 (not significant)
  Very large outlier: t → 1, p → about 0.32

A single outlier can wipe out the result of an entire experiment. One bot or one anomalous user is enough.

6.0.0.1 Recovery with capping

With capping (5x mean = 525):
  Outliers below the threshold: unchanged
  Outliers above the threshold: clipped to 525, so t stays close to the no-outlier value

→ Capping reduces the outlier's influence to almost zero
→ Statistical significance is recovered

6.0.0.2 Practical implications

1. Outlier handling is required for every experiment:
   - Capping or winsorization
   - Applied automatically

2. Choosing the threshold:
   - Domain knowledge ("a user does not do more than X per day")
   - Analysis of the historical distribution
   - Common choice: cap the top 0.1~1%

3. Separate from bot detection:
   - Capping: outliers among legitimate users
   - Bot detection: deliberate anomalies
   - Both are needed

7 Related topics

Related chapters

F19-*: Ch.19 A/A Test (validating variance)
F20-*: Ch.20 Triggering (improving sensitivity)