Kwangmin Kim - A/A Test Examples 3~5

1 Definition

Definition: Examples 3~5 and the operating standard

Additional examples from Kohavi (2020) Ch.19, plus how A/A tests are run in practice.

Example	Case	Pitfalls
3	Browser Redirect	Performance · Bots · Bookmarks
4	Unequal Split (10/90)	LRU cache · Normality convergence
5	Hardware Differences	Subtle differences between servers

1.0.0.1 Operating standard

1000 A/A simulations:
  Validate the platform before launch
  KS test on the p-value distribution

Continuous A/A:
  Daily monitoring in production
  Drift detection

Original quote (Ch.19.5): “Always run a series of A/A tests before utilizing an A/B testing system. Ideally, simulate a thousand A/A tests and plot the distribution of p-values.”

Key insight: A/A testing is an ongoing operation, not a one-time check. The issues shown in Examples 3~5 are silent killers in production, and continuous A/A tests are what catch them.

2 Concepts and Principles

2.1 Example 3 — Browser Redirects

Source: Ch.19.4 (Kohavi and Longbotham 2010, section 2).

2.1.1 Scenario

Setup:
  v1 (existing site) vs v2 (new site)
  Treatment: users are redirected to v2
  Control: users stay on v1

Standard A/B result:
  v2 almost always loses

Spoiler: B (Treatment) will lose with high probability.

2.1.2 Three root causes

2.1.2.1 Cause 1 — Performance Differences

The authors state: “Users who are redirected suffer that extra redirect. This may seem fast in the lab, but users in other regions may see wait times of 1–2 seconds.”

Sources of redirect latency:
  - An extra HTTP request (302 redirect)
  - A new connection (TCP handshake)
  - A DNS lookup (when the domain differs)
  - TLS negotiation (HTTPS)

Measured in the lab (illustrative figures):
  - Same data center
  - Fast network
  - Roughly 50~100ms added

Real users across regions (illustrative figures):
  - Mobile cellular: roughly 200~500ms added
  - Developing regions: roughly 500~2000ms added
  - VPN: slower still

Effect on users:
  - Abandonment ↑ (Ch.5)
  - Engagement metrics ↓
  - Treatment looks inferior (in fact a by-product of the redirect)

2.1.2.2 Cause 2 — Bots Behave Differently

The authors state: “Robots handle redirects differently: some may not redirect; some may see this as a new unseen area and crawl deeply, creating a lot of non-human traffic that could impact your key metrics.”

Bot behavior varies:

Type 1: Ignores the redirect
  - Simple bot (script)
  - Ignores the HTTP 302
  - Crawls only v1

Type 2: Follows the redirect
  - More sophisticated bot (search engine)
  - Is redirected to v2
  - Crawls it

Type 3: Deep-crawls an area it has not seen before
  - "Is v2 a new site?"
  - A large number of page requests
  - Far more traffic than a typical user (possibly orders of magnitude more)
  - Distorts metrics

Implications:
  Bot traffic is a by-product on the Treatment (v2) side
  - Pageview spikes
  - Distorted user metrics
  - May slip past the usual bot detection (Ch.16)

2.1.2.3 Cause 3 — Bookmarks·Shared Links Contamination

The authors state: “Users that go deep into a website (e.g., to a product detail page) using a bookmark or from a shared link must still be redirected.”

Scenario:
  User A has bookmarked v1.example.com/product/123
  User A clicks the bookmark and requests the v1 product page

  Experiment setup:
    The bookmark URL points at v1
    Random assignment puts A in Treatment (so A should see v2)

  Choices:
    Option 1: Serve A's bookmark request from v1 (ignore the Treatment assignment)
              → Breaks random assignment
              → Self-selection bias

    Option 2: Redirect A to v2
              → The bookmark itself still points at v1 (as the bookmark UI expects)
              → Every bookmarked visit pays the redirect again (latency)

    Option 3: Symmetric redirect
              → If A is in Control, redirect v1 → v1 as well (degraded)
              → Avoids the "only Treatment is redirected" trap
              → But Control's latency also increases

2.1.2.4 The Symmetric Redirect Trade-off

The authors' recommendation: “execute a redirect for both Control and Treatment (which degrades the Control group).”

Symmetric redirect:
  - Control: v1 → v1 (a redirect back to itself)
  - Treatment: v1 → v2 (redirect)

  Latency increases for both
  Same redirect cost on both sides
  Only the v1 vs v2 difference is measured

The cost is self-inflicted degradation:
  - Control latency also increases, which affects users
  - Short-term metrics take a hit
  - But the comparison is fair
2.1.2.5 Recommendation — Server-side Routing

The authors state: “Either build things so that there are no redirects (e.g., server-side returns one of two home pages).”

Server-side routing:
  Request arrives
   ↓
  Server decides the variant
   ↓
  Server responds directly with that variant's HTML
   ↓
  No redirect

Pros:
  - No redirect cost
  - Symmetric (Treatment and Control take the same path)
  - Trustworthy analysis

Cons:
  - Engineering complexity (both versions must be served from the same server)
  - Hard when the versions live on different domains

Server-side routing is the usual default on modern A/B platforms; a redirect is the last resort.
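A minimal sketch of the no-redirect flow described above: the server hashes the user into a variant and returns that variant's page directly with a 200 response. The names `assign_variant`, `handle_request`, `PAGES`, and the experiment ID `home_v2` are hypothetical, and SHA-256 bucketing is one common choice assumed here, not something the source prescribes.

```python
import hashlib

# Hypothetical page bodies for the two variants.
PAGES = {
    "control": "<html>v1 home</html>",
    "treatment": "<html>v2 home</html>",
}

def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to a variant by hashing (assumed scheme)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def handle_request(user_id: str, experiment_id: str = "home_v2"):
    """Return (status, body); both variants take the same request path."""
    variant = assign_variant(user_id, experiment_id)
    return 200, PAGES[variant]  # always a direct 200 response, never a 302

status, body = handle_request("user_000123")
print(status, body)
```

Because the same user always hashes to the same bucket, a bookmarked or shared URL resolves to the assigned variant without an extra round trip.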

2.1.3 Detection with A/A

A/A test (with redirect):
  Both Treatment and Control are redirected (the same redirect, even if the target servers differ)

  Issues:
    - Performance: identical (both are redirected)
    - Bots: one side may look like a new site, so bot impact can differ
    - Bookmarks: contamination on both sides

  Result:
    The A/A test can still fail (unequal bot impact)

An A/A failure here is itself grounds to reject the redirect approach.

2.2 Example 4 — Unequal Splits

Source: Ch.19.4 (Kohavi and Longbotham 2010, section 4).

2.2.1 The unequal-split trap

Scenario:
  Treatment: 10% (small)
  Control: 90% (large)
  Both share the same resource (a cache)

2.2.2 The LRU Cache Mechanism

The authors emphasize: “Uneven splits (e.g., 10%/90%) may suffer from shared resources providing a clear benefit to the larger variant.”

LRU (Least Recently Used) cache:
  - The cache has a size limit
  - When full, it evicts the least recently used entry

Shared cache (Treatment + Control):
  - One cache serves every user's requests
  - Treatment entries reflect the access pattern of 10% of users
  - Control entries reflect the access pattern of 90% of users

Eviction:
  - Entries that are accessed more often survive in the cache
  - Control entries are accessed about 9x as often (90% vs 10%)
  - Control entries are hit more often
  - Treatment entries are evicted more often

Result (hit rates are illustrative):
  Cache hit rate:
    Control: 90%+ (usually cached)
    Treatment: 50% (often evicted)

  Latency:
    Control: fast (cache hit)
    Treatment: slow (cache miss)

  Engagement:
    Control: good (low latency)
    Treatment: poor (high latency)

  Conclusion (wrong):
    "Control is better, reject Treatment"
    → The true Treatment effect is 0 (the gap is a by-product of the shared cache)
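The eviction argument above can be sketched with a toy simulation. This is a simplified model under stated assumptions, not a measurement: cache entries are keyed per variant, items are requested uniformly at random, and the catalog size, cache capacity, and function name `simulate_shared_lru` are all hypothetical.

```python
import random
from collections import OrderedDict

def simulate_shared_lru(n_requests=200_000, treatment_share=0.10,
                        n_items=1_000, capacity=1_000, seed=0):
    """Hit rate per variant when both variants share one LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()  # insertion order doubles as recency order
    hits = {"control": 0, "treatment": 0}
    total = {"control": 0, "treatment": 0}
    for _ in range(n_requests):
        variant = "treatment" if rng.random() < treatment_share else "control"
        key = (variant, rng.randrange(n_items))  # each variant caches its own value
        total[variant] += 1
        if key in cache:
            hits[variant] += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return {v: hits[v] / total[v] for v in hits}

rates = simulate_shared_lru()
print(rates)
```

With identical code in both arms, the 90% arm still ends up with a much higher hit rate, which is the kind of gap an A/A test on a 10/90 split would surface.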

2.2.3 Why the Cache Key Is Critical

The authors emphasize: “experiment IDs must always be part of any caching system that could be impacted by the experiment, as the experiments may cache different values for the same hash key.”

Naive cache key:
  cache_key = hash(URL + user_id)

Issue:
  Treatment and Control return different responses for the same URL
  The cache key is the same, so one variant's response is served to the other
  → Variant pollution

Correct cache key:
  cache_key = hash(URL + user_id + experiment_id + variant)

→ A separate cache entry per variant
→ No cross-contamination
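The two key schemes above can be written out as a short sketch. The function name `cache_key`, the `|` separator, and the use of SHA-256 are assumptions made for illustration; the source only specifies which fields belong in the key.

```python
import hashlib

def cache_key(url, user_id, experiment_id=None, variant=None):
    """Build a cache key; include experiment ID and variant when given."""
    parts = [url, user_id]
    if experiment_id is not None:
        parts += [experiment_id, variant]
    raw = "|".join(str(p) for p in parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Naive key: the variant is not part of the key, so both variants collide.
naive_control = cache_key("/product/123", "user_1")
naive_treatment = cache_key("/product/123", "user_1")

# Experiment-aware key: one entry per variant.
safe_control = cache_key("/product/123", "user_1", "exp_42", "control")
safe_treatment = cache_key("/product/123", "user_1", "exp_42", "treatment")

print(naive_control == naive_treatment)  # same key for both variants
print(safe_control == safe_treatment)    # distinct keys per variant
```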

2.2.4 The 10/10 Split as an Alternative

The authors state: “In some cases, it is easier to run a 10%/10% experiment (not utilizing 80% of the data so useful in theory) to avoid LRU caching issues.”

Alternative:
  Treatment: 10%
  Control: 10%
  Other: 80% (a separate group, excluded from the analysis)

Pros:
  - Symmetric pressure on the cache
  - Avoids the LRU bias
  - Clean comparison

Cons:
  - Data from 80% of users goes unused in the analysis
  - Less statistical power (only 20% of traffic is used)
  - The experiment has to run longer

Trade-off:
  Power vs trustworthiness
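To put a rough number on the power side of this trade-off: for a difference of two independent means with equal per-user variance, the standard error scales with sqrt(1/n_T + 1/n_C). The sketch below compares splits on that basis; `relative_se` is a hypothetical helper and the equal-variance assumption is mine.

```python
import math

def relative_se(share_t: float, share_c: float) -> float:
    """Standard error of (mean_T - mean_C) up to a constant, per unit of total traffic."""
    return math.sqrt(1 / share_t + 1 / share_c)

se_10_90 = relative_se(0.10, 0.90)
se_10_10 = relative_se(0.10, 0.10)

se_ratio = se_10_10 / se_10_90    # how much wider the 10/10 interval is
duration_ratio = se_ratio ** 2    # extra run time needed to match the 10/90 precision

print(f"SE ratio (10/10 vs 10/90): {se_ratio:.3f}")
print(f"Run-time multiplier to match precision: {duration_ratio:.2f}")
```

Under these assumptions the 10/10 design has a standard error about 1.34 times that of 10/90, so it needs roughly 1.8 times the run time for the same precision. The power cost is real but smaller than "discarding 80% of the data" might suggest, because the 10% arm is the bottleneck in both designs.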

2.2.5 Runtime vs Post-hoc

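The authors emphasize: “this must be done at runtime; you cannot run 10%/90% and throw away data.”

The wrong attempt:
  Run 10/90
  At analysis time, use only a random 10% subsample of the 90% arm
  Analyze it as if it were 10/10

Problem:
  The cache was operating under the 10/90 split
  The cache bias has already happened
  The subsample carries the same bias

The right way:
  Set up 10/10 at runtime
  Put the remaining 80% on a separate cache area (or serve it as a single control)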

2.2.6 Normality Convergence

The authors state: “the rate of convergence to a Normal Distribution is different. If you have a highly skewed distribution for a metric, the Central Limit Theorem states that the average will converge to Normal, but when the percentages are unequal, the rate will be different.”

CLT:
  Mean of N samples → Normal as N → ∞
  Convergence rate: on the order of 1/√N

Difference between Treatment and Control:
  N_T = 100K, N_C = 900K
  Both converge to Normal, but at different rates
  The smaller N_T converges more slowly
  → The Treatment mean is further from Normal

Implications for analysis:
  The t-test's normality assumption is violated (on the small-N side)
  P-values are inaccurate
  False positives or false negatives

Fix:
  Same-size variants
  → Same convergence rate
  → The skew of the two means cancels in the delta, giving a symmetric distribution
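The cancellation in the last step can be made concrete with standard moment identities. For independent arms drawn from the same distribution (the A/A case), the third central moment of a sample mean is mu_3 / n^2, so the skewness of the delta is proportional to 1/n_T^2 - 1/n_C^2. The helper `delta_skewness` is hypothetical, and the metric skewness of 6.185 is the theoretical value for a lognormal with sigma = 1, used here only as an example of a highly skewed metric.

```python
def delta_skewness(metric_skewness: float, n_t: int, n_c: int) -> float:
    """Skewness of (mean_T - mean_C) when both arms share one distribution."""
    third_moment_term = 1 / n_t**2 - 1 / n_c**2  # arms enter with opposite signs
    variance_term = (1 / n_t + 1 / n_c) ** 1.5
    return metric_skewness * third_moment_term / variance_term

skew_unequal = delta_skewness(6.185, 100_000, 900_000)  # 10/90 split
skew_equal = delta_skewness(6.185, 500_000, 500_000)    # 50/50 split

print(f"10/90 delta skewness: {skew_unequal:.4f}")
print(f"50/50 delta skewness: {skew_equal:.4f}")
```

With equal arms the third-moment term is exactly zero, so the delta is symmetric regardless of how skewed the metric is; with a 10/90 split some skew remains and shrinks only as the smaller arm grows.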

2.2.6.1 Detection with A/A

A/A with a 10/90 split:
  Unequal cache pressure
  → Hit-rate difference
  → Latency difference
  → Metric difference
  → A/A fails (a spurious significant difference)

Fix:
  Add the experiment ID to the cache key
  or use a 10/10 split

2.3 Example 5 — Hardware Differences

Source: Ch.19.4 (Bakshy and Frachtenberg 2015, Facebook).

2.3.1 Scenario

A Facebook service:
  V1 service: existing fleet (server group)
  V2 service: new fleet (a different server group)
  Both assumed to be "identical hardware"

A/A test:
  Same code, different fleet
  Expected: pass (no real difference)

Result: FAIL

2.3.2 Root cause

Subtle hardware differences, for example:
  - CPU (same model, different batch)
  - Memory speed (same spec, slight differences)
  - Network card (same vendor, different firmware)
  - Disk performance (SSD wear level)
  - Server age (8 months vs 1 year)

Cumulative effect:
  Any single hardware difference: tiny
  Combined: detectable
  Service performance: a measurable difference

2.3.3 Implications

The authors emphasize: “Small hardware differences can lead to unexpected differences.”

Lessons for any service migration:
  - Assuming "the same hardware" is risky
  - Always run A/A first
  - Quantify the hardware difference

Modern infrastructure:
  - Cloud (AWS, GCP, Azure)
  - Hardware varies without the user choosing it
  - Even the same instance type can differ
  - Continuous A/A detects it

2.3.3.1 Detection with A/A

A/A result (illustrative numbers):
  Service on the V1 fleet: latency p99 = 100ms
  Service on the V2 fleet: latency p99 = 105ms (5% slower)

  Statistical test: significant difference
  → Hardware-induced difference
  → The new hardware is slower or faster (architecture differences)
  → Once this effect is isolated, the software change can be analyzed

This kind of check is a sensible standard procedure in the cloud era.
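A minimal sketch of such a fleet-level A/A check on simulated data. Everything here is assumed for illustration: latencies are drawn from a lognormal distribution, the V2 fleet is made 5% slower by construction, and a Welch t-test on log-latency stands in for whatever test a real platform would use.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 20_000

# Simulated request latencies in ms; same code on both fleets,
# with the V2 fleet assumed to be 5% slower.
v1_latency = rng.lognormal(mean=np.log(40), sigma=0.3, size=n)
v2_latency = 1.05 * rng.lognormal(mean=np.log(40), sigma=0.3, size=n)

# Welch t-test on log-latency, which is closer to Normal than raw latency.
t_stat, p_value = stats.ttest_ind(
    np.log(v1_latency), np.log(v2_latency), equal_var=False
)

p99_v1 = np.percentile(v1_latency, 99)
p99_v2 = np.percentile(v2_latency, 99)
print(f"p99 V1 = {p99_v1:.1f} ms, p99 V2 = {p99_v2:.1f} ms")
print(f"t = {t_stat:.2f}, p-value = {p_value:.2e}")
```

A 5% shift that is hard to see by eye becomes clearly significant at this sample size, which is why the fleet difference has to be measured before a software change is compared across fleets.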

2.4 A/A Operating Procedure — How to Run

Source: Ch.19.5.

2.4.1 1000 A/A Simulations

The authors' recommendation: “simulate a thousand A/A tests and plot the distribution of p-values. If the distribution is far from uniform, you have a problem.”

Procedure:
  1. Store the last week of raw data (every user event)
  2. For each simulation:
     a. Pick a random seed
     b. Assign each user to a variant (group A or B)
     c. Compare the metric between groups A and B
     d. Compute the p-value
  3. Analyze the distribution of the 1000 p-values

2.4.2 The Value of Replay

The authors state: “you can just simulate the A/A test.”

2.4.2.1 Why replay?

Real A/A in production:
  - Running 1000 trials is expensive (each one splits production traffic)
  - Each trial takes at least several days (sample size)
  - Total: months

Replay (past data):
  - Given one week of raw data
  - 1000 simulations finish in minutes
  - Different seeds, same data
  - Cost: minimal

2.4.2.2 Limitations

The authors emphasize: “you will not catch performance issues or shared resources such as the LRU cache mentioned above, but it is a highly valuable exercise.”

Replay catches:
  - Variance estimation issues
  - i.i.d. violations
  - Distribution skew
  - Outlier effects

Replay does not catch:
  - Performance (real latency effects)
  - Cache (real cache pressure)
  - Hardware differences (replay assumes the same hardware)
  - Real network behavior

Recommended hybrid:
  Replay (covers most checks) + continuous production A/A (real environment)

2.4.3 Goodness-of-Fit Test

The authors cite Anderson-Darling and Kolmogorov-Smirnov.

2.4.3.1 Kolmogorov-Smirnov (KS)

KS test:
  Compare the empirical CDF with the theoretical CDF (uniform)
  Statistic: the maximum difference

  H_0: the data are uniform
  H_a: the data are not uniform

  Reject if KS_stat > critical value
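The "maximum difference" statistic above can be computed by hand for the uniform case, where the theoretical CDF is F(x) = x on [0, 1]. The sketch below is a hand-rolled version for illustration (the name `ks_uniform_statistic` is hypothetical), cross-checked against `scipy.stats.kstest`.

```python
import numpy as np
from scipy import stats

def ks_uniform_statistic(p_values):
    """Max gap between the empirical CDF and the U(0, 1) CDF, F(x) = x."""
    x = np.sort(np.asarray(p_values, dtype=float))
    n = len(x)
    d_plus = np.max(np.arange(1, n + 1) / n - x)  # ECDF above the diagonal
    d_minus = np.max(x - np.arange(0, n) / n)     # ECDF below the diagonal
    return max(d_plus, d_minus)

rng = np.random.default_rng(1)
sample = rng.uniform(0, 1, 1000)  # stand-in for 1000 A/A p-values

manual = ks_uniform_statistic(sample)
reference = stats.kstest(sample, "uniform").statistic
print(f"manual = {manual:.6f}, scipy = {reference:.6f}")
```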

2.4.3.2 Anderson-Darling

AD test:
  Similar to KS, but more sensitive to differences in the tails
  Stricter as a result
  Better at detecting skewed distributions

  Platforms commonly use KS, AD, or both

2.4.3.3 Visualization — Histogram

Figures the authors refer to: “Figure 19.1 is a real histogram showing far from uniform distribution. Figure 19.2 shows that after applying the delta method, distribution was much more uniform.”

Pass histogram:
  Each bin (10 bins of width 0.1) holds ~10%
  Nearly flat

Fail histogram (variance underestimated):
  0~0.05 bin: large mass (for example ~30%)
  Everything else: small
  Spike near 0

Fail histogram (outlier):
  Spike near 0.32
  Everything else: flat

Fail histogram (sparse data):
  A few discrete point masses
  Everything else: 0
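The outlier pattern above has a simple mechanical explanation: when one extreme value dominates both the mean difference and the variance estimate, the t-statistic is pinned near ±1, and a two-sided p-value for |t| = 1 is about 0.32. A small sketch on simulated data follows; the outlier value of 1e9 and the metric distribution are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_users = 10_000
metric = rng.lognormal(2, 1, n_users)
metric[0] = 1e9  # a single extreme outlier (assumed for illustration)

p_values = []
for sim in range(200):
    # Random equal split of users into groups A and B.
    perm = np.random.default_rng(sim).permutation(n_users)
    a = metric[perm[: n_users // 2]]
    b = metric[perm[n_users // 2:]]
    p_values.append(stats.ttest_ind(a, b, equal_var=False).pvalue)

p_values = np.array(p_values)
print(f"min p = {p_values.min():.4f}, max p = {p_values.max():.4f}")
```

Every simulated A/A test lands at essentially the same p-value regardless of which group receives the outlier, which is what shows up as a spike near 0.32 in the histogram.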

2.4.4 Continuous A/A — Industry Standard

The authors' recommendation: “Even after an A/A test passes, we recommend regularly running A/A tests concurrently with your A/B tests to identify regressions in the system or a new metric that is failing.”

Continuous A/A:
  An A/A test with a new random seed every day
  - On a user pool of 1% or less
  - Every metric analyzed automatically
  - Pass: no notification
  - Fail: immediate alert

What it checks:
  - Daily drift (effects of system changes)
  - i.i.d. violations in new metrics
  - Carry-over effects
  - System bugs
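One practical wrinkle with "fail means alert": a healthy platform still produces p < 0.05 in about 5% of A/A comparisons, so alerting on every single one would be noisy. A possible approach, sketched below as an assumption rather than anything the source prescribes, is to accumulate the daily p-values for a metric and test whether the count of significant results exceeds what alpha predicts. The function `aa_false_positive_alert` and its thresholds are hypothetical.

```python
from scipy import stats

def aa_false_positive_alert(daily_p_values, alpha=0.05, alert_level=0.01):
    """Alert when significant A/A results occur more often than alpha predicts."""
    n = len(daily_p_values)
    k = sum(p < alpha for p in daily_p_values)
    # One-sided binomial test: is the observed rate above the nominal alpha?
    result = stats.binomtest(k, n, alpha, alternative="greater")
    return {
        "n": n,
        "significant": k,
        "binom_p": float(result.pvalue),
        "alert": bool(result.pvalue < alert_level),
    }

# Healthy metric: p-values spread evenly over (0, 1).
healthy = aa_false_positive_alert([(i + 0.5) / 100 for i in range(100)])

# Broken metric: 30 of 100 daily p-values fall below 0.05.
broken = aa_false_positive_alert([0.01] * 30 + [0.5] * 70)

print(healthy)
print(broken)
```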

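2.4.4.1 The Bing standard

Bing's daily A/A:
  New random seed every day
  Sample of millions of users
  20+ metrics analyzed automatically

  Examples of issues found:
    - Outliers in some metrics (insufficient capping)
    - Bias from a new logging system
    - Cache invalidation bug
    - Carry-over (effects of a previous experiment)

This practice is the backbone of the Run and Fly maturity stages.

Intuition — A/A operation as layered defense

A/A testing defends platform trust in multiple layers.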

2.4.4.2 Layer 1 — Pre-Launch Validation

When adopting the platform:
  1000 A/A simulations
  Goodness-of-fit test
  → Trust is established for the first time

2.4.4.3 Layer 2 — Continuous Production A/A

Daily production A/A:
  Real environment (including cache and hardware)
  Drift detection
  → Ongoing trust monitoring

2.4.4.4 Layer 3 — Post-Bug Verification

When a new metric is added:
  - Replay simulation
  - 1000 trials
  - Validate the new metric

When the system changes:
  - Re-run the production A/A
  - Validate the impact of the change

2.4.4.5 Layer 4 — Specific Investigation

When an A/B result looks suspicious:
  - A/A on the same user pool
  - Carry-over check
  - Bias detection

2.4.4.6 Putting the 4 layers together

Full confidence:
  L1 + L2 + L3 + L4 all pass
  → Strong trust

Only L1 passes:
  → Weak trust (initial only)
  → No continuous monitoring

L1+L2 pass:
  → Strong trust, maintained over time
  → Industry standard

These 4 layers are the standard for a mature platform.

3 Why It Is Needed

Without the lessons of the 3 examples and without ongoing A/A operation:

Example 3: Browser redirect produces a false comparison
Example 4: Unequal split produces cache bias
Example 5: Hardware migration hides a difference
No ongoing operation: Without continuous monitoring, drift goes undetected

With them in place:

Symmetric design: Server-side routing, equal split
Cache key policy: Include the experiment ID
Hardware A/A: Validate every migration
Continuous monitoring: Catch drift immediately

This practice is the backbone of trust in production.

4 Application — A/A Operation at Microsoft ExP

A/A operation at Microsoft ExP:

Pre-launch:
  - 1000 A/A simulations (raw data replay)
  - Every metric validated
  - Goodness-of-fit (KS, AD)
  - The platform can launch once every metric passes

Continuous (daily):
  - Random seed
  - 1% production traffic
  - Every active metric analyzed automatically
  - Trust signal on a dashboard
  - On failure, alert the platform engineers

Examples of issues found:
  - Leakage in some regions from a new logging system
  - Sparse distribution of a new metric
  - Cache key differences for part of the user pool
  - Latency impact of a hardware migration

This practice is the foundation of platform trust at Microsoft.

5 Code Example — Replay A/A Simulation

An implementation of 1000 A/A simulations over past data.

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-in for the last week of user data
n_users = 50_000
user_ids = [f"user_{i:06d}" for i in range(n_users)]

# Weekly metric per user (simulated, skewed)
user_metric = rng.lognormal(2, 1, n_users)

# Replay simulation
def run_aa_simulation(user_ids, user_metric, n_simulations=1000):
    p_values = []
    for sim in range(n_simulations):
        sim_seed = sim
        # Random group assignment
        rng_sim = np.random.default_rng(sim_seed)
        group_assignment = rng_sim.choice([0, 1], len(user_ids), p=[0.5, 0.5])

        # Calculate metric for each group
        a_metric = user_metric[group_assignment == 0]
        b_metric = user_metric[group_assignment == 1]

        # T-test
        _, p = stats.ttest_ind(a_metric, b_metric)
        p_values.append(p)

    return np.array(p_values)

print("=== Run 1000 A/A Simulations ===")
p_values = run_aa_simulation(user_ids, user_metric, n_simulations=1000)

# Analysis
print(f"\nP-value statistics:")
print(f"  Mean: {p_values.mean():.4f} (expected 0.5)")
print(f"  Median: {np.median(p_values):.4f} (expected 0.5)")
print(f"  Std: {p_values.std():.4f}")

print(f"\nFalse positive rates:")
for alpha in [0.01, 0.05, 0.10]:
    fp_rate = (p_values < alpha).sum() / len(p_values)
    print(f"  alpha={alpha}: {fp_rate*100:.1f}% (expected {alpha*100:.1f}%)")

# Goodness-of-fit (KS)
ks_stat, ks_p = stats.kstest(p_values, 'uniform')
print(f"\nKS test for uniformity:")
print(f"  KS statistic: {ks_stat:.4f}")
print(f"  P-value: {ks_p:.4f}")
print(f"  Result: {'PASS' if ks_p > 0.05 else 'FAIL'}")

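# Goodness-of-fit (Anderson-Darling)
# scipy.stats.anderson does not accept dist='uniform', so compute the A^2
# statistic directly against the fully specified U(0, 1) CDF, F(x) = x.
u = np.sort(np.clip(p_values, 1e-12, 1 - 1e-12))  # avoid log(0)
n_p = len(u)
rank = np.arange(1, n_p + 1)
ad_stat = -n_p - np.mean((2 * rank - 1) * (np.log(u) + np.log1p(-u[::-1])))
ad_crit_5 = 2.492  # asymptotic 5% critical value for a fully specified null
print(f"\nAnderson-Darling test:")
print(f"  AD statistic: {ad_stat:.4f}")
print(f"  Critical (5%): {ad_crit_5:.3f}")
print(f"  Result: {'PASS' if ad_stat < ad_crit_5 else 'FAIL'}")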

# Histogram
print(f"\nP-value histogram (10 bins):")
hist, _ = np.histogram(p_values, bins=np.linspace(0, 1, 11))
for i, count in enumerate(hist):
    bar = "#" * int(count / 5)
    print(f"  [{i*0.1:.1f}, {(i+1)*0.1:.1f}): {count:>4} | {bar}")

# Compare with a violation case: randomize by user, analyze by page
print("\n=== Page-Level Analysis Simulation (i.i.d. violation) ===")
# Generate user-page level data. Pages from one user share a baseline,
# so page-level observations are correlated within a user.
n_users_v = 1000
page_data = []
for u in range(n_users_v):
    n_pages = max(1, int(rng.lognormal(2, 1)))  # skewed pages-per-user count
    user_baseline = rng.normal(0, 1)
    for _ in range(n_pages):
        page_data.append((u, user_baseline + rng.normal(0, 0.3)))

page_df = pd.DataFrame(page_data, columns=["user_id", "metric"])
page_user = page_df["user_id"].to_numpy()
page_metric = page_df["metric"].to_numpy()

# 1000 simulations analyzed at page level (the wrong unit of analysis)
p_values_violated = []
for sim in range(1000):
    sim_rng = np.random.default_rng(sim)
    user_group = sim_rng.choice([0, 1], n_users_v)
    page_group = user_group[page_user]  # each page inherits its user's group
    a = page_metric[page_group == 0]
    b = page_metric[page_group == 1]
    _, p = stats.ttest_ind(a, b)
    p_values_violated.append(p)

p_values_violated = np.array(p_values_violated)
fp_rate_v = (p_values_violated < 0.05).sum() / 1000
ks_v_stat, ks_v_p = stats.kstest(p_values_violated, 'uniform')

print(f"False positive rate (alpha=0.05): {fp_rate_v*100:.1f}% (expected 5%)")
print(f"KS test: stat={ks_v_stat:.4f}, p={ks_v_p:.4f}")
print(f"Result: {'PASS' if ks_v_p > 0.05 else 'FAIL'}")

Intuition — What the Replay Simulation Shows

What this code demonstrates.

5.0.0.1 Normal case (user-level)

False positive rate: ~5%
KS test: pass
AD test: pass
Histogram: flat

→ Platform trustworthy

5.0.0.2 Violation case (page-level analysis with user randomization)

False positive rate: far above the nominal 5%
KS test: fail
AD test: fail
Histogram: skewed (spike near 0)

→ Variance underestimated
→ A/A FAIL
→ Fix: delta method or aggregate by user
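The delta-method fix named above can be sketched as follows. The per-page average is treated as a ratio of per-user totals, sum(y) / sum(n), and its variance is estimated with users as the independent unit. The data below are simulated under assumed distributions, and `delta_method_ratio`, `y_sum`, and `n_pages` are hypothetical names.

```python
import numpy as np
from scipy import stats

def delta_method_ratio(y_sum, n_pages):
    """Ratio metric sum(y)/sum(n) and its delta-method variance (unit = user)."""
    k = len(y_sum)
    mu_y, mu_n = y_sum.mean(), n_pages.mean()
    cov = np.cov(y_sum, n_pages, ddof=1)  # 2x2 covariance of per-user totals
    ratio = mu_y / mu_n
    var = (cov[0, 0] - 2 * ratio * cov[0, 1] + ratio**2 * cov[1, 1]) / (k * mu_n**2)
    return ratio, var

rng = np.random.default_rng(12345)
n_users = 2_000
n_pages = 1 + rng.poisson(9, n_users)     # pages per user (assumed distribution)
baseline = rng.normal(1.0, 1.0, n_users)  # per-user level shared by that user's pages
y_sum = n_pages * baseline + rng.normal(0, 0.3 * np.sqrt(n_pages))  # per-user total

p_values = []
for sim in range(500):
    group = np.random.default_rng(sim).integers(0, 2, n_users)  # randomize by user
    r_a, v_a = delta_method_ratio(y_sum[group == 0], n_pages[group == 0])
    r_b, v_b = delta_method_ratio(y_sum[group == 1], n_pages[group == 1])
    z = (r_a - r_b) / np.sqrt(v_a + v_b)
    p_values.append(2 * stats.norm.sf(abs(z)))

fp_rate = np.mean(np.array(p_values) < 0.05)
print(f"A/A false positive rate with delta method: {fp_rate:.3f}")
```

Because the variance now reflects between-user variation rather than treating pages as independent, the A/A false positive rate should fall back to roughly the nominal 5%.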

5.0.0.3 The value of replay

This simulation:
  - Uses no real production traffic
  - Runs 1000 trials from past data alone
  - Produces results in minutes
  - Validates before platform launch

ROI:
  - Cost: minimal (compute)
  - Value: a foundation of trust
  - Industry standard

5.0.0.4 Typical industry usage

On a modern platform:
  1. Run a replay simulation when a new metric is added
  2. Enable it in production only if it passes
  3. If it fails, fix it and re-run the simulation
  4. Add it to continuous A/A

This process is the quality gate for metrics.

6 Related Topics

Next post

F19-3 — P-value Distribution + When Fails

Links to other categories

Engineering — Server-side Routing
Engineering — Cache Architecture — Cache key design
Statistics — Goodness-of-fit (KS, AD)