Kwangmin Kim - Five Purposes of the A/A Test · CTR i.i.d. Violation

1 Definition

Definition: the five purposes of the A/A test

The critical uses of the A/A test listed in Kohavi (2020), Ch.19.1.

Purpose	What is validated	Issues it can surface
Type I Error	False positive rate	Variance underestimation
Variance	Accuracy of the variance estimate	i.i.d. violation, time correlation
Bias	Balance between Treatment and Control	Carry-over, system anomaly
System of Record	Agreement with external logging	User leakage, data loss
Power Calculation	A/B sample size calculation	Wrong variance assumption

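Quoted from the original (Ch.19): “The A/A test is highly useful for establishing trust in your experimentation platform.”

Key insight: A/A tests fail surprisingly often in industry. In the authors' experience, they almost always fail when first introduced. Each of the five purposes validates a different kind of trust at a different layer.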

2 Concepts and Principles

2.1 Purpose 1: Type I Error Control

Stated by the authors.

2.1.1 Theoretical background

Standard Hypothesis Testing:
  Null hypothesis (H_0): no effect
  Alternative (H_a): effect exists

  Reject H_0 when p-value < 0.05
  Type I error = false positive (wrongly reject H_0)
  Theoretical rate: 5%

2.1.2 What the A/A test checks

Ground truth in an A/A test:
  No real effect (B = A)
  → H_0 is always true
  → Every rejection is a false positive

Expected result:
  ~50 of 1000 trials (5%) reject
  5% of the p-value distribution falls below 0.05

In practice:
  Rejection rate > 5% → Type I error rate is inflated
  → False positives lead to spurious launches

2.1.3 Root causes of Type I error inflation

1. Variance underestimation:
   → SE too small → t-stat too large → p-value too small
   → False positives ↑

2. Multiple testing:
   → 100 metrics × 5% = 5 expected false positives (when every null is true)
   → Automatic highlighting then surfaces noise

3. Peeking:
   → Multiple looks at the data
   → The chance of a rejection compounds with each look

4. Distribution assumption violation:
   → The normality assumption of the t-test breaks
   → The p-value distribution is distorted
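
The multiple-testing arithmetic in item 2 can be checked with a minimal sketch. It assumes the metrics are independent and that every null hypothesis is true, which real metric sets rarely satisfy exactly; the function names are illustrative only.

```python
def expected_false_positives(n_metrics: int, alpha: float = 0.05) -> float:
    """Expected number of metrics flagged when every null hypothesis is true."""
    return n_metrics * alpha


def prob_at_least_one(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one metric is flagged (assumes independent metrics)."""
    return 1.0 - (1.0 - alpha) ** n_metrics


print(expected_false_positives(100))     # about 5 metrics flagged by chance
print(round(prob_at_least_one(100), 3))  # 0.994
```

With 100 metrics, a scorecard with no flagged metric at all would be the unusual outcome, which is why automatic highlighting of p < 0.05 needs care.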

2.1.4 How the A/A test detects it

1000 A/A simulations:
  - Different random seeds
  - Same underlying data
  - Independent trials

Histogram of p-values:
  Pass: uniform distribution
  Fail: skewed, mass at specific values

KS test:
  Goodness-of-fit to uniform
  p > 0.05: pass
  p < 0.05: fail
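
The check above can be sketched as follows. This is a minimal illustration on synthetic i.i.d. normal data, not a platform implementation; `aa_p_values` and its parameters are names assumed for this example, and Welch's t-test stands in for whatever test the platform actually runs.

```python
import numpy as np
from scipy import stats


def aa_p_values(n_trials=1000, n_per_group=500, seed=0):
    """Run n_trials A/A comparisons on synthetic i.i.d. data and collect p-values."""
    rng = np.random.default_rng(seed)
    p_values = np.empty(n_trials)
    for t in range(n_trials):
        a = rng.normal(0.0, 1.0, n_per_group)  # group A
        b = rng.normal(0.0, 1.0, n_per_group)  # group B, same distribution
        p_values[t] = stats.ttest_ind(a, b, equal_var=False).pvalue
    return p_values


p_values = aa_p_values()
fp_rate = float((p_values < 0.05).mean())           # should sit near 5%
ks_stat, ks_p = stats.kstest(p_values, "uniform")   # goodness-of-fit to U(0, 1)
print(f"false positive rate: {fp_rate:.3f}, KS p-value: {ks_p:.3f}")
```

Because the KS test is itself a hypothesis test, a healthy platform will still fail it about 5% of the time, so a single failure is a prompt to rerun rather than proof of a defect.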

2.2 Purpose 2: Validating the variance estimate

Cited by the authors (Kohavi et al. 2012).

2.2.1 How variance depends on sample size

Theory:
  Var(Ȳ) = σ² / N

  As N grows:
    Variance ↓ (at rate 1/N)
    SE ↓ (at rate 1/√N)
    Confidence interval narrows

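2.2.2 Variance check with A/A tests

Procedure:
  Run A/A at different N values:
    N=1000: variance of the mean v_1
    N=10000: variance of the mean v_2
    N=100000: variance of the mean v_3

  Expected:
    v_2 ≈ v_1 / 10
    v_3 ≈ v_1 / 100

  When this does not hold:
    v_2 > v_1 / 10
    → Increasing the sample size helps less than expected
    → Correlation between observations exists (for example, many events from the same users)
    → Time correlation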
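
A minimal sketch of this check on synthetic data, assuming a simple random-effects model (each observation is a user-level effect plus independent noise, both with unit variance). The model and the name `var_of_mean` are assumptions for illustration. When N grows only because the same 100 users generate more events, the variance of the mean barely shrinks.

```python
import numpy as np


def var_of_mean(n_obs, n_users, n_reps=400, seed=0):
    """Empirical variance of the sample mean over n_reps replications.

    Each observation = user effect + noise. n_users == n_obs gives i.i.d. data;
    n_users < n_obs means several correlated observations per user.
    """
    rng = np.random.default_rng(seed)
    events_per_user = n_obs // n_users
    user_effect = rng.normal(0.0, 1.0, (n_reps, n_users))
    # The mean of m unit-variance noise terms has standard deviation 1/sqrt(m).
    noise_mean = rng.normal(0.0, 1.0 / np.sqrt(events_per_user), (n_reps, n_users))
    sample_means = (user_effect + noise_mean).mean(axis=1)
    return float(sample_means.var(ddof=1))


# i.i.d. case: one observation per user, so N and the user count grow together.
ratio_iid = var_of_mean(10_000, n_users=10_000) / var_of_mean(1_000, n_users=1_000)
# Correlated case: N grows 10x but the same 100 users produce all observations.
ratio_corr = var_of_mean(10_000, n_users=100) / var_of_mean(1_000, n_users=100)
print(f"v_2 / v_1 (i.i.d.):     {ratio_iid:.3f}")   # close to 0.1
print(f"v_2 / v_1 (correlated): {ratio_corr:.3f}")  # far above 0.1
```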

2.2.3 Possible causes

Stated by the authors (Kohavi 2012): “the expected reduction in variance of the mean may not materialize.”

Candidate causes:
  1. Within-user correlation:
     - Multiple events from the same user
     - Arises in page-level analysis

  2. Time correlation:
     - Day-of-week effect
     - Seasonality

  3. Cluster effect:
     - Users in the same cluster (city, network) are correlated
     - Network effects

  4. Carry-over:
     - Influence of a previous experiment
     - Persistent change in user behavior

Each cause requires a different fix.

2.3 Purpose 3: Bias Detection

Cited by the authors (Kohavi et al. 2012, Bing): “Bing uses continuous A/A testing to identify a carry-over effect.”

2.3.1 Mechanism of the carry-over effect

Scenario:
  Day 1~7: experiment X runs
    50% Treatment X (T_X)
    50% Control X (C_X)

  Day 8~14: experiment Y starts
    Uses the same user pool
    50% Treatment Y (T_Y)
    50% Control Y (C_Y)

Carry-over:
  Users of T_X and T_Y overlap (assignment by random hash)
  T_X changed the behavior of its users
  → The result of T_Y is distorted (residual of X)

2.3.1.1 Detection via A/A

A/A test on Day 8~14:
  Same user pool, no real difference

  If there is no carry-over:
    A/A passes (p-values uniform)

  If there is carry-over:
    A/A fails (p-values skewed)
    The residual of experiment X is spread unevenly across the A/A groups

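2.3.1.2 Why a reset is needed

Ways to avoid the bias:
  1. Reset the user pool:
     - Use a fresh user pool
     - Removes the influence of the previous experiment
     - Cost: less efficient use of the sample

  2. Cool-down period:
     - Wait long enough
     - The residual effect attenuates
     - Cost: time

  3. Bias correction:
     - Remove the pre-period difference Y_c (the residual effect of X)
     - Correction applied at analysis time

Author recommendation: “Bing uses continuous A/A testing.” → Problems are fixed as soon as they are found.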

2.4 Purpose 4: Comparison with the System of Record

Stated by the authors.

2.4.1 Kinds of systems

Experimentation platform logging:
  - Experiment IDs
  - Variant assignment
  - User-level events

System of record (business reporting):
  - Total revenue
  - Total user count
  - Business KPIs
  - May be a different logging system

2.4.2 Consistency check

Using the A/A test:
  Compare the experimentation platform's user count with the system of record

  Example:
    Experimentation platform: 980,000 users
    System of record: 1,000,000 users
    Difference: 2%

  Possible causes:
    - A filter excludes some users
    - The logging system drops some events
    - Differences in bot detection

2.4.3 Leakage Detection

Stated by the authors: “If the system of records shows X users visited the website during the experiment and you ran Control and Treatment at 20% each, do you see around 20% X users in each? Are you leaking users?”

Leakage scenario:
  Experiment setup: 20% Treatment, 20% Control
  → 40% of users should be trackable in the experimentation platform

  System of record: 1M users visited
  Experimentation platform: 380,000 users (expected 400,000, so 5% short)
  → 5% leakage

Possible causes:
  - The experiment ID is missing from some requests
  - The platform is not deployed in some regions
  - Instrumentation is missing on some devices

Fix:
  Logging system audit
  Coverage analysis
2.5 Purpose 5: Variance for Power Calculation

A/B sample size calculation (per variant, two-sided test on a difference of means):
  N = 2 × (z_{α/2} + z_β)² × σ² / Δ²

  Inputs:
    α (Type I error): 0.05 → z_{α/2} = 1.96
    β (Type II error): 0.20 → z_β = 0.84
    σ: standard deviation of the metric (σ² is its variance)
    Δ: minimum detectable effect

  An accurate estimate of σ is critical
  If it is off:
    Underestimate σ → N too small → underpowered
    Overestimate σ → N too large → wasted time
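
A minimal sketch of this calculation, assuming the standard per-variant formula for a two-sided, two-sample comparison of means, n = 2 (z_{α/2} + z_β)² σ² / Δ². The function name is illustrative, and σ is assumed to come from an A/A test or historical data.

```python
import math

from scipy import stats


def sample_size_per_variant(sigma: float, delta: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in each variant to detect an absolute difference delta."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)           # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


n_base = sample_size_per_variant(sigma=1.0, delta=0.1)
n_double_sigma = sample_size_per_variant(sigma=2.0, delta=0.1)
print(n_base, n_double_sigma)
```

Since N scales with σ², doubling σ roughly quadruples the required sample, which is why a variance estimate that is off by even a modest factor changes the experiment duration noticeably.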

2.5.0.1 Estimating σ from the A/A test

A natural by-product of the A/A test:
  Sample variance is measured directly
  No assumption (empirical)

Input to the power calculator:
  σ²_estimated = sample variance from the A/A test
  → Accurate sample size

This use case is the backbone of the sample size calculator.

Intuition: a unified mental model of the five purposes

The five purposes validate trust at different layers.

2.5.0.2 Layer breakdown

Layer 1: statistical correctness (Type I, Variance):
  Correctness of the hypothesis test itself
  Does the rejection rate match alpha?

Layer 2: system integrity (Bias):
  Bias in the platform itself
  Carry-over, anomalies

Layer 3: data quality (System of record):
  Completeness of logging
  No leakage

Layer 4: future design (Power):
  Sample size calculation
  Deciding the experiment duration

2.5.0.3 Layered defense

All 4 layers pass:
  → Strong trust
  → A/B results can be trusted

Even one layer fails:
  → The issue lies in that layer
  → A/B results are suspect
  → Fix, then re-validate

2.5.0.4 Standard industry operation

Pre-launch:
  - 1000 A/A simulations
  - Validate all 4 layers
  - Launch only on pass

Post-launch (continuous):
  - Daily A/A
  - Drift detection
  - Carry-over monitoring
  - Auto alert on fail

This layered, continuous approach is standard in the Run and Fly stages.

2.6 Example 1: i.i.d. violation in CTR

Stated by the authors (Ch.19.2).

2.6.1 Two definitions of CTR

Stated by the authors (Equation 19.1, 19.3).

2.6.1.1 CTR_1: Total/Total

\[CTR_1 = \frac{\sum_{i=1}^{n} \sum_{j=1}^{K_i} X_{i,j}}{N}\]

Description:
  Total clicks / Total pageviews

Variables:
  n = number of users
  K_i = number of pageviews of user i
  X_{i,j} = clicks on the j-th page of user i
  N = total pageviews (the sum of K_i over all users)

2.6.1.2 Example

User 1: 1 pageview, 0 clicks
User 2: 2 pageviews, 2 clicks (a click on every page)

CTR_1 = (0 + 2) / (1 + 2) = 2/3

2.6.1.3 CTR_2: Average of User CTRs

\[CTR_2 = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{j=1}^{K_i} X_{i,j}}{K_i}\]

Description:
  Average of each user's CTR
  Double average

Procedure:
  1. Compute each user's CTR: clicks_i / pageviews_i
  2. Average the CTRs over users

2.6.1.4 Example

User 1 CTR: 0/1 = 0
User 2 CTR: 2/2 = 1
CTR_2 = (0 + 1) / 2 = 0.5
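
The two definitions can be checked on the two-user example with a few lines of NumPy. The variable names are chosen for this sketch only.

```python
import numpy as np

# Per-user totals from the worked example: user 1 and user 2.
clicks = np.array([0, 2])
pageviews = np.array([1, 2])

ctr_1 = clicks.sum() / pageviews.sum()   # total clicks / total pageviews
ctr_2 = np.mean(clicks / pageviews)      # average of per-user CTRs

print(ctr_1)  # 2/3, about 0.667
print(ctr_2)  # 0.5
```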

2.6.2 How the two definitions differ

CTR_1: 2/3 (≈0.67)
CTR_2: 0.5

Same data, different results.

Reason:
  CTR_1 weights every pageview equally
  CTR_2 weights every user equally
  → Users differ in their number of pageviews
  → Different weights → different means

2.6.3 Which one is right?

Stated by the authors: “There is no right or wrong in these definitions, both are useful definitions for CTR, but using different user averages yields different results.”

Practical recommendation (Kohavi):
  - Report both
  - CTR_2 is generally recommended (robust to outliers)
  - Less affected by bots

2.6.3.1 Robustness of CTR_2

Scenario: a bot user with 1000 pageviews and 0 clicks
Real users: 5 pageviews on average, click rate 0.5

CTR_1:
  The bot's 1000 pageviews inflate the denominator only
  Numerator: real_clicks (the bot adds 0)
  Denominator: 1000 (bot) + real_pageviews
  → The bot dominates and drags CTR_1 toward 0

CTR_2:
  Bot's user-level CTR: 0/1000 = 0
  Real user CTRs: 0.5 on average
  Average: the bot counts as a single user
  → The bot's weight is 1/n (n = number of users)

This robustness is the reason CTR_2 is recommended.
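
A small numeric sketch of the bot scenario. The population is an assumption made for illustration: 100 real users with exactly 6 pageviews and 3 clicks each (CTR 0.5), plus one bot with 1000 pageviews and 0 clicks.

```python
import numpy as np

# Assumed toy population: 100 identical real users, then one bot appended.
real_clicks = np.full(100, 3)
real_pageviews = np.full(100, 6)
clicks = np.append(real_clicks, 0)           # bot never clicks
pageviews = np.append(real_pageviews, 1000)  # bot generates 1000 pageviews

ctr_1_clean = real_clicks.sum() / real_pageviews.sum()
ctr_1_bot = clicks.sum() / pageviews.sum()   # 300 / 1600
ctr_2_clean = np.mean(real_clicks / real_pageviews)
ctr_2_bot = np.mean(clicks / pageviews)      # 50 / 101

print(f"CTR_1: {ctr_1_clean:.3f} -> {ctr_1_bot:.4f}")  # 0.500 -> 0.1875
print(f"CTR_2: {ctr_2_clean:.3f} -> {ctr_2_bot:.4f}")  # 0.500 -> 0.4950
```

One bot moves CTR_1 by more than half of its value, while CTR_2 shifts by about 1%, in line with the bot's 1/n weight.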

2.6.4 Wrong variance estimate (i.i.d. violation in CTR_1)

Stated by the authors (Equation 19.5):

\[\text{VAR}(CTR_1) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{K_i} (X_{i,j} - CTR_1)^2}{N^2}\]

2.6.4.1 Assumption

This formula assumes:
  X_{i,j} are i.i.d.
  → All page-level samples are independent

2.6.4.2 Violation

Reality:
  Pages from the same user (X_{1,1}, X_{1,2}, ..., X_{1,K_1}) are correlated
  → They share the user's behavior pattern
  → i.i.d. is violated

Variance:
  The naive formula underestimates it
  Within-user correlation is ignored

2.6.5 Detection by the A/A test

Stated by the authors: “We initially made this observation not because it was an obvious violation of the independence assumption, but because in our A/A tests, \(CTR_1\) was statistically significant far more often than the expected 5%.”

A/A result:
  Expected: 5% false positives
  Actual: well above 5% (for example 15~20%; illustrative, the quote gives no figure)
  → KS test fails
  → Variance is underestimated

Discovery process:
  Empirical failure → root cause analysis → i.i.d. violation recognized

→ The A/A test makes a silent issue visible

2.6.6 Fix

Cited by the authors (Tang 2010, Deng 2011, Deng et al. 2017): “use the delta method or bootstrapping.”

Delta method:
  Reformulate CTR_1 as a ratio of user-level means, X̄ / Ȳ
  X_i = clicks of user i
  Y_i = pageviews of user i
  CTR = X̄ / Ȳ

  Var(CTR) = delta method formula
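
For reference, the standard first-order delta-method approximation for the variance of a ratio of means is written out below. This is the textbook form and matches the expression used in the code example in section 5; here \(\bar{X}\) and \(\bar{Y}\) are the user-level means of clicks and pageviews, and the variances and covariance are those of the means (sample variance or covariance divided by the number of users).

\[\text{VAR}\left(\frac{\bar{X}}{\bar{Y}}\right) \approx \frac{1}{\bar{Y}^2}\,\text{VAR}(\bar{X}) + \frac{\bar{X}^2}{\bar{Y}^4}\,\text{VAR}(\bar{Y}) - 2\,\frac{\bar{X}}{\bar{Y}^3}\,\text{COV}(\bar{X}, \bar{Y})\]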

Bootstrap:
  Resample at the user level (with replacement)
  Recompute CTR each time
  Empirical variance

After this fix the A/A test passes.
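
The bootstrap variant can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation; `bootstrap_ctr_se`, the Beta(1, 19) spread of user CTRs, and the lognormal pageview counts are all assumptions of the example.

```python
import numpy as np


def bootstrap_ctr_se(clicks, pageviews, n_boot=1000, seed=0):
    """Standard error of CTR_1 from a user-level bootstrap.

    Users (not pageviews) are resampled with replacement, so each user's
    correlated pageviews always stay together.
    """
    rng = np.random.default_rng(seed)
    clicks = np.asarray(clicks)
    pageviews = np.asarray(pageviews)
    n = len(clicks)
    idx = rng.integers(0, n, size=(n_boot, n))  # one row of user indices per resample
    ctr = clicks[idx].sum(axis=1) / pageviews[idx].sum(axis=1)
    return float(ctr.std(ddof=1))


# Synthetic users whose personal CTR varies around 5% (assumed for the example).
rng = np.random.default_rng(1)
n_users = 2000
pageviews = rng.lognormal(2, 1, n_users).astype(int) + 1
clicks = rng.binomial(pageviews, rng.beta(1, 19, n_users))

ctr = clicks.sum() / pageviews.sum()
se_naive = float(np.sqrt(ctr * (1 - ctr) / pageviews.sum()))  # treats pageviews as i.i.d.
se_boot = bootstrap_ctr_se(clicks, pageviews)
print(f"naive SE: {se_naive:.5f}, user-level bootstrap SE: {se_boot:.5f}")
```

On data like this the bootstrap standard error comes out larger than the naive one, which is the underestimation that the A/A test exposes.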

2.7 Example 2: the Optimizely peeking trap

Stated by the authors (Ch.19.3, Kohavi 2014, Borden 2014, Pekelis 2015).

2.7.1 Optimizely's early advice

Siroker and Koomen 2013 (the Optimizely book):
  "Once the test reaches statistical significance, you'll have your answer"
  "When the test has reached a statistically significant conclusion..."

Recommended behavior:
  1. Start the experiment
  2. Check results every day (peeking)
  3. Stop once p < 0.05
  4. Conclude effect

2.7.2 The problem with peeking

Emphasized by the authors: “The statistics commonly used assume that a single test will be made at the end of the experiment and ‘peeking’ violates that assumption, leading to many more false positives than expected using classical hypothesis testing.”

2.7.2.1 Mechanism

Single test:
  Type I error rate: alpha = 5%

Multiple peeking + early stopping:
  Every peek is another chance of a Type I error
  Compound:
    Day 1 peek: 5%
    Day 2 peek: P(false positive at day 1) + P(false positive at day 2 and not at day 1)
    ...

  Expected combined Type I error:
    Depends on the number of looks; with very frequent looks it can exceed 50%
  → Up to 10x the nominal 5% false positive rate

2.7.2.2 Illustration

Single fixed-time test:
  Run for 1 week → analyze once → 5% false positive

Daily peeking + stop early (1 week max):
  Start the experiment → analyze every day
  Stop once p < 0.05
  Total false positive: roughly 15~20% with 7 equally spaced looks
  → 3~4x worse than designed
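
The inflation from daily peeking can be estimated with a short A/A simulation. This is a sketch under simplifying assumptions: a unit-variance normal metric, a known-variance z-test, and equal daily traffic. The function name and defaults are illustrative.

```python
import numpy as np
from scipy import stats


def peeking_fp_rates(n_days=7, users_per_day=200, n_sims=2000, alpha=0.05, seed=0):
    """Return (false positive rate analysing once at the end,
               false positive rate stopping at the first significant daily peek)."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Daily sums of the metric for groups A and B; there is no true difference.
    a = rng.normal(0.0, 1.0, (n_sims, n_days, users_per_day)).sum(axis=2)
    b = rng.normal(0.0, 1.0, (n_sims, n_days, users_per_day)).sum(axis=2)
    n_cum = users_per_day * np.arange(1, n_days + 1)   # users per group so far
    diff = (np.cumsum(a, axis=1) - np.cumsum(b, axis=1)) / n_cum
    z = diff / np.sqrt(2.0 / n_cum)                    # per-user variance is 1
    significant = np.abs(z) > z_crit
    return float(significant[:, -1].mean()), float(significant.any(axis=1).mean())


fp_fixed, fp_peeking = peeking_fp_rates()
print(f"single analysis at day 7: {fp_fixed:.3f}")
print(f"daily peeking, stop early: {fp_peeking:.3f}")
```

Raising `n_days` in the sketch shows how the second rate keeps climbing with the number of looks while the first stays near 5%.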

2.7.3 “How Optimizely (Almost) Got Me Fired”

Cited by the authors (Borden 2014).

The case:
  An engineer uses Optimizely
  Checks every day
  Launches once p < 0.05
  After launch the real effect is absent or negative
  → Job at risk

Root cause:
  False positives from peeking
  Spurious significance

2.7.4 Fix: Optimizely’s New Stats Engine

Cited by the authors (Pekelis 2015, Pekelis, Walsh, Johari 2015).

2.7.4.1 Always-Valid P-value

Sequential testing:
  A valid p-value at every peek
  Corrects for multiple looks
  Stopping once p < 0.05 is OK
  False positive rate guaranteed to be at most 5%

Method:
  Mixture of likelihood ratios (mixture sequential probability ratio test)
  Optional stopping is allowed
  Strict Type I error control

2.7.4.2 Comparison

Classical test:
  - Fixed sample size
  - Analyzed exactly once
  - Invalid under peeking

Always-valid p-value:
  - Variable sample size OK
  - Continuous monitoring
  - Peeking valid
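
To make the idea concrete, here is a minimal sketch of an always-valid p-value built from a mixture sequential probability ratio test (mSPRT) with a normal mixing distribution. It is a textbook-style illustration, not Optimizely's actual engine: it assumes a stream of A minus B differences with known variance `sigma2`, and the mixing variance `tau2` is an arbitrary choice for the example.

```python
import numpy as np


def always_valid_p_values(diffs, sigma2, tau2=1.0):
    """Always-valid p-values for H0: mean(diffs) = 0 after each observation.

    Uses the mSPRT likelihood ratio with a N(0, tau2) mixture over the effect:
    L_n = sqrt(s2 / (s2 + n*t2)) * exp(n^2 * t2 * mean_n^2 / (2 * s2 * (s2 + n*t2)))
    and p_n = min(p_{n-1}, 1 / L_n).
    """
    diffs = np.asarray(diffs, dtype=float)
    n = np.arange(1, len(diffs) + 1)
    mean = np.cumsum(diffs) / n
    log_lr = (0.5 * np.log(sigma2 / (sigma2 + n * tau2))
              + (n ** 2 * tau2 * mean ** 2) / (2 * sigma2 * (sigma2 + n * tau2)))
    p = np.minimum(1.0, np.exp(-log_lr))
    return np.minimum.accumulate(p)  # the p-value can only decrease over time


rng = np.random.default_rng(0)
sigma2 = 2.0   # variance of a difference of two unit-variance observations
n_sims = 500

# A/A: peek after every observation and stop the first time p < 0.05.
rejections = sum(
    always_valid_p_values(rng.normal(0.0, np.sqrt(sigma2), 1000), sigma2)[-1] < 0.05
    for _ in range(n_sims)
)
fp_rate = rejections / n_sims

# A stream with a real effect of 0.5 should still be detected.
effect_p = always_valid_p_values(rng.normal(0.5, np.sqrt(sigma2), 1000), sigma2)[-1]
print(f"A/A false positive rate under continuous peeking: {fp_rate:.3f}")
print(f"final p-value with a true effect: {effect_p:.2e}")
```

Because the running minimum is taken, checking the last element is equivalent to asking whether the p-value ever dropped below 0.05 at any peek. The price of this guarantee is lower power than a fixed-horizon test at the same sample size.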

2.7.5 Verification with A/A tests

A/A test of a peeking system:
  - Check every day
  - Stop early on significance
  - False positive rate over 1000 simulations

Classical (Pre-Pekelis):
  Expected 5%, actual well above 5% → fail

Always-valid (Post-Pekelis):
  Actual at most about 5% → pass

This fix is what restored trust in Optimizely.

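Assumption: how universal the peeking trap is

Assumption: the analyst checks results every day and stops early once p < 0.05.

2.7.5.1 Result

False positive rate inflation (approximate, for equally spaced looks with a classical z-test):
  Day 1 alone: 5%
  Day 7 cumulative: ~17%
  Day 14 cumulative: ~22%
  Day 30 cumulative: ~28%
  Hundreds of looks or more: 40~50%+

Implication:
  With daily checks, a large share of experiments with no real effect reach spurious significance
  → Many launches are false launches
  → No real effect stands behind them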

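2.7.5.2 How to avoid it

Strategy 1: fix the duration in advance
  - Decide the duration before the experiment starts
  - Analyze only at that point
  - Use the standard test

Strategy 2: Always-valid p-value
  - Sequential testing
  - Every peek is valid
  - Stop early OK

Strategy 3: Bonferroni correction
  - With K peeks, use alpha = 0.05/K
  - Conservative

Strategy 4: verify with A/A tests
  - Validates the platform's peeking policy
  - With always-valid p-values, the 5% rate holds
  - With classical tests, the inflation shows up

This verification is what justifies a platform's peeking policy. The always-valid p-value is the modern standard.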

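3 Why It Matters

If the five purposes and the two examples are ignored:

Type I error inflation: spurious launches surge
Variance underestimation: peeking and i.i.d. violations accumulate
Carry-over bias: the influence of earlier experiments persists
System leakage: no agreement with external systems
Wrong power: sample size too small or too large

If they are put into practice:

Trustworthy statistics: the 5% rate is guaranteed
Accurate variance: delta method or bootstrap
Bias detection: continuous monitoring
System agreement: logging integrity
Efficient sample size: accurate power calculation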

4 Application: Optimizely Recovery

Pre-Pekelis (2010~2014):
  Classical hypothesis testing
  Peeking recommended
  False positive rate: well above the nominal 5%
  Crisis of user trust

Post-Pekelis (2015~):
  Always-valid p-value
  Sequential testing
  False positive rate: controlled at 5%
  Trust restored

Lessons of this transition for the A/B SaaS industry:
  - Statistical correctness is critical
  - A user-friendly tool must still guarantee statistical correctness
  - The A/A test is the verification tool

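5 Code Example: CTR i.i.d. simulation

Compares the naive (page-level) variance with the delta-method (user-level) variance for CTR_1 in simulated A/A tests.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulation setup
n_users = 1000
n_simulations = 1000

# Pageviews per user (heterogeneous)
user_pageviews = rng.lognormal(2, 1, n_users).astype(int) + 1
# Per-user CTR varies around a 5% mean (Beta(1, 19)). This heterogeneity is what
# correlates pageviews within a user; with one identical CTR for everyone the
# pageviews would be i.i.d. and the naive analysis would look fine.
# Both groups are drawn from the same users, so the null is still true.
user_ctr = rng.beta(1, 19, n_users)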

# 1000 A/A simulation
naive_p_values = []
delta_p_values = []

for sim_i in range(n_simulations):
    sim_rng = np.random.default_rng(sim_i)
    # Random group assignment (user-level)
    group = sim_rng.choice([0, 1], n_users, p=[0.5, 0.5])

    # Simulate clicks
    user_clicks = sim_rng.binomial(user_pageviews, user_ctr)  # clicks ~ Binomial(pageviews, CTR) per user

    # Group A
    a_users = group == 0
    a_clicks_total = user_clicks[a_users].sum()
    a_pageviews_total = user_pageviews[a_users].sum()

    # Group B
    b_users = group == 1
    b_clicks_total = user_clicks[b_users].sum()
    b_pageviews_total = user_pageviews[b_users].sum()

    # === Naive (page-level) ===
    a_ctr = a_clicks_total / a_pageviews_total
    b_ctr = b_clicks_total / b_pageviews_total

    a_var_naive = a_ctr * (1 - a_ctr) / a_pageviews_total
    b_var_naive = b_ctr * (1 - b_ctr) / b_pageviews_total
    diff = a_ctr - b_ctr
    se_naive = np.sqrt(a_var_naive + b_var_naive)
    z_naive = diff / max(se_naive, 1e-10)
    p_naive = 2 * (1 - stats.norm.cdf(abs(z_naive)))
    naive_p_values.append(p_naive)

    # === Delta method (user-level) ===
    a_x = user_clicks[a_users]
    a_y = user_pageviews[a_users]
    b_x = user_clicks[b_users]
    b_y = user_pageviews[b_users]

    n_a = len(a_x)
    n_b = len(b_x)

    # User-level mean
    a_x_mean = a_x.mean()
    a_y_mean = a_y.mean()
    b_x_mean = b_x.mean()
    b_y_mean = b_y.mean()

    a_ratio = a_x_mean / a_y_mean
    b_ratio = b_x_mean / b_y_mean

    # Delta method variance
    a_var_x = a_x.var(ddof=1) / n_a
    a_var_y = a_y.var(ddof=1) / n_a
    a_cov_xy = np.cov(a_x, a_y, ddof=1)[0, 1] / n_a
    a_var_ratio = (1 / a_y_mean**2) * a_var_x + (a_x_mean**2 / a_y_mean**4) * a_var_y - 2 * (a_x_mean / a_y_mean**3) * a_cov_xy
    a_var_ratio = max(a_var_ratio, 1e-10)

    b_var_x = b_x.var(ddof=1) / n_b
    b_var_y = b_y.var(ddof=1) / n_b
    b_cov_xy = np.cov(b_x, b_y, ddof=1)[0, 1] / n_b
    b_var_ratio = (1 / b_y_mean**2) * b_var_x + (b_x_mean**2 / b_y_mean**4) * b_var_y - 2 * (b_x_mean / b_y_mean**3) * b_cov_xy
    b_var_ratio = max(b_var_ratio, 1e-10)

    diff_delta = a_ratio - b_ratio
    se_delta = np.sqrt(a_var_ratio + b_var_ratio)
    z_delta = diff_delta / se_delta
    p_delta = 2 * (1 - stats.norm.cdf(abs(z_delta)))
    delta_p_values.append(p_delta)

naive_p_values = np.array(naive_p_values)
delta_p_values = np.array(delta_p_values)

# Analysis
print("=== Naive (incorrect) Analysis ===")
print(f"False positive rate (alpha=0.05): {(naive_p_values < 0.05).sum() / n_simulations * 100:.1f}%")
print("Expected: 5.0%")
ks_stat_naive, ks_p_naive = stats.kstest(naive_p_values, 'uniform')
print(f"KS test for uniformity: stat={ks_stat_naive:.4f}, p={ks_p_naive:.4f}")
print(f"=> {'PASS' if ks_p_naive > 0.05 else 'FAIL'} (p-values uniform)")

print("\n=== Delta Method (correct) Analysis ===")
print(f"False positive rate (alpha=0.05): {(delta_p_values < 0.05).sum() / n_simulations * 100:.1f}%")
print("Expected: 5.0%")
ks_stat_delta, ks_p_delta = stats.kstest(delta_p_values, 'uniform')
print(f"KS test for uniformity: stat={ks_stat_delta:.4f}, p={ks_p_delta:.4f}")
print(f"=> {'PASS' if ks_p_delta > 0.05 else 'FAIL'} (p-values uniform)")

# Comparison
print("\n=== Comparison ===")
print(f"Naive false positive: {(naive_p_values < 0.05).sum() / n_simulations * 100:.1f}%")
print(f"Delta method false positive: {(delta_p_values < 0.05).sum() / n_simulations * 100:.1f}%")

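Intuition: the key message of the simulation

What this code shows.

5.0.0.1 The naive analysis fails

Expected false positive: 5%
Actual (Naive): well above 5% (on the order of 15~25%, i.e. 3~5x inflation; the exact value depends on how much CTR varies across users)

Why:
  - Within-user correlation is ignored
  - Variance is underestimated
  - SE too small → test statistic too large → p-value too small

→ A/A test FAIL

5.0.0.2 The delta method passes

Expected: 5%
Actual (Delta): ~5% (normal)

Why:
  - The i.i.d. assumption holds at the user level
  - Correct variance (delta method)
  - Trustworthy

→ A/A test PASS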

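5.0.0.3 Implications

What this simulation shows:

The A/A test detects an i.i.d. violation quantitatively
Same data, different analysis → different result
The platform default is critical: if the naive analysis is the default, every analysis is invalid
Continuous A/A monitoring is essential: new metrics are validated automatically when added

5.0.0.4 Industry standard

Most modern platforms:
  - Analyze ratio metrics such as CTR at the user level automatically
  - Block the naive analysis outright
  - Require A/A validation
  - Alert on failure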
