Kwangmin Kim - The Statistical Definition of SRM and Two Scenarios

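1 Definition

Definition: the statistical nature of SRM

A sample ratio mismatch (SRM) is the state in which the observed sample ratio differs from the design by a statistically significant amount. It is not normal random variation but an indicator of a bug or bias (Kohavi, Tang, Xu, 2020, Ch.21.1).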

1.0.0.1 Key idea

Variant assignment:
  Each user is randomly assigned to Treatment or Control
  This step is not affected by the Treatment
  → The sample ratio is expected to match the design

Influence of the Treatment:
  If variant assignment is affected by the outcome of the Treatment → the experiment is invalid
  The random-assignment assumption is violated

1.0.0.2 Law of Large Numbers

Coin-flip example:
  10 flips: 4 heads, 6 tails (40/60)
    → Large swings are possible (random)
  10,000 flips: 4,900 heads, 5,100 tails (49/51)
    → Small variation (LLN)
  1M flips: 499,900 heads, 500,100 tails
    → Almost exactly 50/50

A/B example:
  Standard error of the observed ratio with N users (50/50 design): 1/(2√N)
  → The ratio becomes more precise as N grows
  → A large deviation is strong evidence of a bug
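
To make the 1/(2√N) rule concrete, here is a minimal sketch that evaluates the standard error of the observed Treatment fraction for a few sample sizes. The function name `ratio_standard_error` and the chosen values of N are illustrative assumptions, not from the book.

```python
import math


def ratio_standard_error(n_users: int, p_design: float = 0.5) -> float:
    """Standard error of the observed Treatment fraction under the design ratio."""
    return math.sqrt(p_design * (1 - p_design) / n_users)


# For a 50/50 design this reduces to 1 / (2 * sqrt(N))
for n in (10, 10_000, 1_000_000):
    print(f"N={n:>9,}: SE = {ratio_standard_error(n):.5f}")
```

With a million users the standard error is a small fraction of a percentage point, which is why a deviation that looks negligible can still be many standard errors away from the design.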

Original quote (Ch.21.1): “the decision to expose a user to a variant must be independent of the Treatment, so the ratio of the users in the variants should match the experimental design.”

Key insight: the SRM check is the first line of defense among trust checks. If the observed deviation is less likely than random chance would allow, suspect a bug. The choice of p-value threshold (0.05, 0.01, 0.001) is a trade-off between false alarms and missed bugs.

2 Concepts and Principles

2.1 Mechanics of the Statistical Test

2.1.1 Standard Test

The authors state: “You can use a standard t-test or chi-squared test to compute the p-value.”

2.1.1.1 Chi-square Test

H_0: the users' variant assignment matches the design ratio
H_a: the users' variant assignment differs from the design ratio

Test statistic:
  χ² = Σ_variants (O_i - E_i)² / E_i

  Where:
    O_i = observed count in variant i
    E_i = expected count = N × p_i
    N = total sample size
    p_i = design probability for variant i

Degrees of freedom:
  df = K - 1
  K = number of variants (usually 2)

2.1.1.2 Worked example

Setup:
  Design: 50/50 (p_T = 0.5, p_C = 0.5)
  Observed: T=500, C=502

  N = 1002
  E_T = 1002 × 0.5 = 501
  E_C = 1002 × 0.5 = 501

  χ² = (500 - 501)² / 501 + (502 - 501)² / 501
     = 1/501 + 1/501
     = 0.004

  df = 1
  p-value = P(χ² > 0.004 | df=1) ≈ 0.95
  → Pass (well within normal variation)
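
The worked example can be checked with SciPy. This is a minimal sketch, assuming `scipy.stats.chisquare` as the goodness-of-fit routine; the helper name `srm_p_value` is hypothetical. It accepts any number of variants and any design ratios.

```python
from scipy import stats


def srm_p_value(observed, design):
    """Chi-square goodness-of-fit test of observed counts against design ratios.

    observed: user counts per variant, e.g. [500, 502]
    design:   design probabilities per variant, e.g. [0.5, 0.5] (must sum to 1)
    """
    total = sum(observed)
    expected = [total * p for p in design]  # E_i = N * p_i
    chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    return chi2, p_value


chi2, p = srm_p_value([500, 502], [0.5, 0.5])
print(f"chi2 = {chi2:.4f}, p-value = {p:.2f}")
```

The degrees of freedom default to K - 1, so the same call works for designs with three or more variants.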

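2.1.1.3 Sensitivity

Sensitivity to sample size:
  Same deviation at every N: observed Treatment fraction 0.5025 vs. design 0.5
  (a 50.25/49.75 split, so the two arms differ by 0.5% of all users)
  N=1,000: p-value ~0.87 (not detected)
  N=10,000: p-value ~0.62
  N=100,000: p-value ~0.11
  N=1,000,000: p-value ~6 × 10^-7 (detected)
  N=10,000,000: p-value < 10^-50 (strongly detected)

→ Sensitivity increases with sample size
→ Production experiments (millions of samples) are very sensitive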
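
The sample-size effect can be reproduced with a short sketch that holds the observed Treatment fraction fixed and varies N. The fraction 0.5025 and the helper name `srm_p_for_fraction` are assumptions chosen for illustration.

```python
import math

from scipy import stats


def srm_p_for_fraction(n_users: int, observed_fraction: float, p_design: float = 0.5) -> float:
    """Two-sided SRM p-value when a fixed Treatment fraction is observed among n_users."""
    se = math.sqrt(p_design * (1 - p_design) / n_users)
    z = (observed_fraction - p_design) / se
    # For two variants the chi-square statistic equals z squared (df = 1)
    return float(stats.chi2.sf(z * z, df=1))


for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    print(f"N={n:>10,}: p-value = {srm_p_for_fraction(n, 0.5025):.2e}")
```

The same split is harmless noise at a thousand users and overwhelming evidence at a million, because z grows with √N.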

2.1.2 T-test (Equivalent)

z-score from the binomial distribution:
  z = (p_observed - p_expected) / SE
  where SE = √(p_expected × (1 - p_expected) / N)

Equivalent to chi-square (df=1):
  χ² = z² (exact for two variants; the t-test approaches this z-test for large samples)
  P-value: identical

Most platforms use chi-square, which also handles more than two variants.
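
The equivalence can be checked numerically. The sketch below computes both statistics for the 4,900 / 5,100 split from the coin-flip example; the function name `z_and_chi2` is hypothetical.

```python
import math

from scipy import stats


def z_and_chi2(n_t: int, n_c: int, p_design: float = 0.5):
    """Return (z, chi2, p_z, p_chi2) for a two-variant sample ratio check."""
    n = n_t + n_c
    se = math.sqrt(p_design * (1 - p_design) / n)
    z = (n_t / n - p_design) / se

    exp_t, exp_c = n * p_design, n * (1 - p_design)
    chi2 = (n_t - exp_t) ** 2 / exp_t + (n_c - exp_c) ** 2 / exp_c

    p_z = 2 * stats.norm.sf(abs(z))      # two-sided z-test
    p_chi2 = stats.chi2.sf(chi2, df=1)   # chi-square test, df = 1
    return z, chi2, p_z, p_chi2


z, chi2, p_z, p_chi2 = z_and_chi2(4_900, 5_100)
print(f"z^2 = {z * z:.4f}, chi2 = {chi2:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi-square) = {p_chi2:.4f}")
```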

2.2 Scenario 1 — Simple SRM

From the authors (Ch.21.2, Scenario 1).

2.2.1 Setup

Experiment design:
  Control: 50%
  Treatment: 50%

Observed:
  Control: 821,588
  Treatment: 815,482
  Total: 1,637,070

Ratio:
  T/C = 815,482 / 821,588 = 0.9926
  T fraction = 815,482 / 1,637,070 = 0.4981

Deviation:
  Expected fraction: 0.5
  Observed fraction: 0.4981
  Difference: 0.0019 (0.19 percentage points)

2.2.2 P-value Calculation

Chi-square:
  E_each = 1,637,070 / 2 = 818,535

  χ² = (821,588 - 818,535)² / 818,535
     + (815,482 - 818,535)² / 818,535

  Numerator: 3053² = 9,320,809

  χ² = 9,320,809 / 818,535 + 9,320,809 / 818,535
     = 11.387 + 11.387
     = 22.77

  df = 1
  p-value: P(χ² > 22.77 | df=1) ≈ 1.8 × 10^-6

2.2.3 Interpretation

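The authors state: “the probability of seeing this ratio or more extreme, in a design with an equal number of users in Control and Treatment, is 1.8E-6, or less than 1 in 500,000!”

What a p-value of 1.8E-6 means:
  - If assignment were truly 50/50, an imbalance this large or larger would occur by chance less than 1 time in 500,000
  - Seeing it in a single run of the experiment is extremely rare
  - So the cause is systematic, not random

Conclusion:
  - A bug or bias is almost certain
  - Every other metric in the experiment becomes suspect
  - Investigation is mandatory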

2.2.3.1 Why threshold

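Standard alpha:
  0.05: the usual hypothesis-test level
    - About 5 false alarms per 100 healthy experiments
    - Alerts too often

  0.001: the usual SRM threshold
    - About 1 false alarm per 1,000 healthy experiments
    - Balances the cost of false alarms against the cost of a missed bug
    - Default on mature platforms

  0.0001: very strict
    - Too conservative
    - Some real SRMs are missed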
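
The false-alarm side of this trade-off can be illustrated with a small simulation of healthy (A/A-style) experiments, where assignment really is 50/50. This is a sketch under assumed settings: 20,000 simulated experiments of 100,000 users each, and the helper name `false_alarm_rate` is hypothetical.

```python
import numpy as np
from scipy import stats


def false_alarm_rate(alpha: float, n_experiments: int = 20_000,
                     n_users: int = 100_000, seed: int = 0) -> float:
    """Fraction of healthy 50/50 experiments whose SRM p-value falls below alpha."""
    rng = np.random.default_rng(seed)
    n_t = rng.binomial(n_users, 0.5, size=n_experiments)  # Treatment counts
    n_c = n_users - n_t                                   # Control counts
    chi2 = (n_t - n_c) ** 2 / n_users                     # 50/50 chi-square, df = 1
    p_values = stats.chi2.sf(chi2, df=1)
    return float((p_values < alpha).mean())


for alpha in (0.05, 0.001, 0.0001):
    print(f"alpha={alpha}: false alarm rate ~ {false_alarm_rate(alpha):.4f}")
```

The simulation covers only false alarms. How many real SRMs a stricter threshold misses depends on the size of the imbalance and the sample size, which this sketch does not model.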

2.2.3.2 Recommended

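Microsoft ExP, LinkedIn, Google:
  Threshold: 0.001
  False alarm rate: 0.1%
  Real SRM detection rate: ~99%+ (a rough figure; actual power depends on sample size and the size of the imbalance)

This 0.001 threshold is the industry standard.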

2.3 Scenario 2 — Subtle SRM with Bing Real Data

From the authors (Ch.21.2, Scenario 2, Figure 21.1).

2.3.1 Setup

Experiment design: 50/50

Overall:
  Treatment: 959,716
  Control: 965,679
  Total: 1,925,395

Ratio:
  T/C = 0.9938
  T fraction = 0.4985

Deviation:
  0.0015 (0.15 percentage points)
  Smaller than Scenario 1

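2.3.2 P-value

Chi-square (same computation as Scenario 1):
  E_each = 1,925,395 / 2 = 962,697.5, deviation = ±2,981.5
  χ² = 2 × 2,981.5² / 962,697.5 ≈ 18.5
  p-value: 2 × 10^-5

→ Weaker evidence than Scenario 1 (smaller deviation, larger p-value), but still far below the 0.001 threshold
→ Clearly detected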

2.3.3 Pitfall: All Metrics Significant

Emphasized by the authors (real Bing scorecard).

2.3.3.1 Overall Scorecard (all users)

| Metric          | Delta % | P-Value | Significance |
|-----------------|---------|---------|--------------|
| Sessions/UU     | +0.54%  | 0.0094  | Yes |
| Metric 2        | +0.20%  | 7E-11   | Strongly |
| Metric 3        | +0.49%  | 2E-10   | Strongly |
| Metric 4        | -0.46%  | 4E-5    | Yes |
| Metric 5        | +0.24%  | 0.0001  | Yes |

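→ All 5 metrics are statistically significant
→ "The Treatment appears to have a real effect on every metric"

2.3.3.2 The decision trap

If the SRM is ignored:
  "All 5 metrics moved, each with p < 0.01"
  Decision: launch

But:
  The samples are mismatched
  → The significance may be a false positive
  → The wrong decision gets made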

2.3.4 Investigation

The authors emphasize: “the excluded users were those that used an old version of the Chrome browser, which was the cause of the SRM. Also, a bot was not properly classified due to some changes in the Treatment, causing an SRM.”

2.3.4.1 Browser segment analysis

Investigation step 1:
  Sample ratio by browser:
    Chrome (current): 50/50 (passes SRM)
    Chrome (old version): T=10K, C=15K (illustrative counts; fails)
    Firefox: 50/50 (passes)
    Safari: 50/50 (passes)
    Edge: 50/50 (passes)

  → The mismatch is concentrated in old Chrome
  → Fewer users in this segment are counted in Treatment

2.3.4.2 The bot-classification issue

Investigation step 2:
  Bot-detection results:
    Some Treatment users are classified as bots
    The Treatment change alters user behavior
    → Some normal users match the bot heuristic
    → Treatment bot count > Control bot count
    → After filtering, Treatment has fewer users

2.3.5 After Fix (Excluded affected users)

After excluding old Chrome + reclassified bots:
  Treatment: 924,240
  Control: 924,842
  Total: 1,849,082

  Ratio: T/C = 0.9993
  T fraction: 0.4998

  Chi-square: ~0.2
  p-value: 0.658
  → SRM passes (p > 0.001)
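
The before/after comparison can be recomputed directly from the counts quoted in this section. This is a minimal sketch: the helper name `srm_50_50` is hypothetical, and it uses the 50/50 shortcut χ² = (T - C)² / (T + C).

```python
from scipy import stats


def srm_50_50(n_t: int, n_c: int):
    """Chi-square statistic and p-value for a 50/50 design (df = 1)."""
    chi2 = (n_t - n_c) ** 2 / (n_t + n_c)
    return chi2, float(stats.chi2.sf(chi2, df=1))


overall = srm_50_50(959_716, 965_679)   # all users
segment = srm_50_50(924_240, 924_842)   # after excluding the affected users

# Share of users removed by the exclusion
excluded_share = 1 - (924_240 + 924_842) / (959_716 + 965_679)

print(f"Overall: chi2 = {overall[0]:.2f}, p = {overall[1]:.1e}")
print(f"Segment: chi2 = {segment[0]:.2f}, p = {segment[1]:.3f}")
print(f"Excluded share of users: {excluded_share:.2%}")
```

The excluded share is the complement of the trigger rate reported on the scorecard.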

2.3.5.1 Re-analysis results

After fix:
| Metric          | Delta % | P-Value | Significance |
|-----------------|---------|---------|--------------|
| Sessions/UU     | +0.19%  | 0.3754  | NO |
| Metric 2        | +0.04%  | 0.1671  | NO |
| Metric 3        | +0.13%  | 0.0727  | Marginal |
| Metric 4        | -0.12%  | 0.2877  | NO |
| Metric 5        | +0.01%  | 0.8275  | NO |

→ None of the 5 metrics remains significant at the 0.05 level (Metric 3 is marginal)
→ The Treatment's true effect: minimal

2.3.5.2 Lesson

Original (with SRM):
  "The Treatment moves every metric significantly" (false)
  Launch decision (wrong)

After fix (SRM resolved):
  "The Treatment's effect is minimal" (true)
  Decision: reject or redesign

Difference:
  About 4% of users (trigger rate 96.04%) were unequally represented (the SRM cause)
  → False significance on all 5 metrics
  → Risk of a wrong launch

An effect this dramatic shows how critical the SRM check is; the Bing case is the strongest illustration of its value.

Intuition: the hidden danger of a subtle SRM

2.3.5.3 The magnitude paradox

SRM 의 magnitude:
  Scenario 1: 0.19 percentage points (0.4981 vs 0.5)
  Scenario 2: 0.15 percentage points (0.4985 vs 0.5)

  Very small deviations
  Apparently negligible

But:
  Statistical significance:
    Scenario 1: p = 1.8E-6
    Scenario 2: p = 2E-5
  → Strong evidence of bug

2.3.5.4 Why small deviation can cause large effect

Change in sample composition:
  About 4% of users were treated differently
  - The heavy-user vs. light-user mix may differ
  - The high-spender vs. low-spender mix may differ
  - The regional distribution may differ

This composition difference:
  Shifts the average of each metric
  Gives Treatment and Control different baselines
  → Apparent lift (really a composition difference)
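
A toy calculation shows how a composition difference alone can produce a lift. All numbers here are hypothetical: typical users average 10.0 sessions, atypical users average 5.0, the Treatment has no effect on anyone, and 5,000 atypical users are missing from Treatment.

```python
def mean_metric(n_typical: int, n_atypical: int,
                mean_typical: float, mean_atypical: float) -> float:
    """Average metric of a variant made up of two user groups."""
    total = n_typical + n_atypical
    return (n_typical * mean_typical + n_atypical * mean_atypical) / total


# Hypothetical populations: the Treatment changes nobody's behavior
control = mean_metric(960_000, 40_000, 10.0, 5.0)
treatment = mean_metric(960_000, 35_000, 10.0, 5.0)  # 5,000 atypical users dropped

sample_ratio = (960_000 + 35_000) / (960_000 + 40_000)
apparent_lift = treatment / control - 1

print(f"Sample ratio T/C: {sample_ratio:.4f}")
print(f"Control mean: {control:.4f}, Treatment mean: {treatment:.4f}")
print(f"Apparent lift with zero true effect: {apparent_lift:+.2%}")
```

The missing users pull the Treatment average up only because they were below-average to begin with, so the lift says nothing about the Treatment itself.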

2.3.5.5 The Bing case in numbers

The ~4% of users behind the mismatch:
  - Old Chrome users (atypical users)
  - Reclassified bots (different behavior)

Effect of this ~4%:
  Sessions/UU: +0.54% lift (apparent)
  → The excluded users' average behavior differs from everyone else's
  → Much of the +0.54% is a by-product of that difference

After fix (homogeneous sample):
  Sessions/UU: +0.19% (not significant)
  → Not distinguishable from zero

2.3.5.6 Implication

Even a tiny SRM has a dramatic impact:
  ~4% mismatch → false significance on 5 metrics

Therefore:
  Keep the SRM threshold strict (p < 0.001)
  Detect even small deviations
  Investigate every failure

This pattern is why the SRM check is mandatory.

2.3.6 Format of the Real Scorecard

Overall layout of the authors' Figure 21.1:

| Metadata          | Overall      | Segment       |
|-------------------|--------------|---------------|
| ScorecardId       | 96699772     | 96762547      |
| Sample Ratio      | 0.9938 (FAIL)| 0.9993 (PASS) |
| Trigger Rate      | -            | 96.04%        |
| Sessions/UU       | +0.54%, p=0.009 | +0.19%, p=0.38 |
| ...               | ...          | ...           |

2.3.6.1 How modern platforms display it

Actual platform UI:
  [Trust Section]
  Sample Ratio (overall): 0.9938 (FAIL, p = 2E-5)  ← first row
  Sample Ratio (segment): 0.9993 (PASS, after segment exclusion)

  [Metric Section]
  - On SRM fail: hidden
  - On pass: shown normally

  [Segment Section]
  - Results after excluding the segment
  - Re-analysis available

This dual presentation is standard in ExP.

2.4 Dimensions of the Sample Ratio Definition

2.4.1 By User vs By Page

From the authors (the two rows of the Bing scorecard):

Sample Ratio [by user]:
  Count of unique users
  T_users / C_users

Sample Ratio [by page]:
  Page view count
  T_pages / C_pages

SRM check on both:
  By user fails → randomization issue
  By page fails (but by user passes) → page views per user differ
    → The Treatment affects user behavior
    → Some metrics become suspect
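
Both ratios can be derived from the same page-view log. This is a minimal sketch, assuming one row per page view with hypothetical column names `user_id` and `variant` (values "T" and "C").

```python
import pandas as pd


def sample_ratios(page_views: pd.DataFrame) -> dict:
    """T/C sample ratio by unique user and by page view."""
    users = page_views.groupby("variant")["user_id"].nunique()  # unique users per variant
    pages = page_views.groupby("variant").size()                # page views per variant
    return {
        "by_user": users["T"] / users["C"],
        "by_page": pages["T"] / pages["C"],
    }


# Toy log: two users per variant, but Treatment users view more pages
log = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u3", "u3", "u4"],
    "variant": ["T", "T", "T", "T", "C", "C", "C"],
})
ratios = sample_ratios(log)
print(ratios)
```

In the toy log the by-user ratio is balanced while the by-page ratio is not, which is the pattern attributed here to a Treatment effect on visit frequency.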

2.4.1.1 Implications

By page fail:
  The Treatment affects page-view frequency:
    - Better experience in Treatment → more visits
    - Worse experience in Treatment → fewer visits

  This is OK (it is the Treatment effect itself):
    If by user is healthy, randomization is healthy
    The by-page difference is part of the effect

  But:
    Some metrics are analyzed at page level (e.g., CTR)
    → The analysis must change (delta method, F-KOH18)

2.4.1.2 Industry standard

Typical platform practice: by-user SRM as the first check
  + by-page SRM as the second check

Both fail:
  Major issue (randomization or pipeline)

Only by user fails:
  Randomization or segment bias

Only by page fails:
  The Treatment affects visit frequency (acceptable)
  But take care when analyzing page-level metrics

This dual check is standard on mature platforms.

3 Why It Is Needed

Without the Scenario 1 and 2 checks:

False launch: the 5 significant metrics at Bing → wrong decision
Hidden bug: the ~4% mismatch stays invisible
No investigation: the browser bug goes undiscovered

With the checks enabled:

Trust check enforcement
Bug detection
Decision quality

This enforcement is the foundation of trust in the platform.

4 Application: Bing's SRM Resolution Workflow

Bing's operation in practice, reconstructed from the authors' case (the day numbers are illustrative):

Day 1: Experiment starts
Day 7: Scorecard generated
Day 7: SRM check
  - Overall: fail (p = 2E-5)
  - All 5 metrics significant (false)

Day 8: Investigation starts
  - Browser segment SRM check
    Old Chrome: fail (T/C ≈ 0.67 using the illustrative counts T=10K, C=15K from the browser analysis)
  - Bot detection check
    Bot rate is higher in Treatment

Day 9: Root cause identified
  - Old Chrome browser bug + bot reclassification
  - In total, about 4% of users unequally represented

Day 10: Fix
  - Exclude old Chrome users
  - Fix the bot-detection logic
  - Re-analyze

Day 10: Re-analysis
  - SRM: pass (p = 0.658)
  - Significance of the 5 metrics disappears
  - Decision: reject (no real lift)

Day 11+: Deploy the code fix
  - Browser compatibility
  - Bot detection improvement
  - Re-experiment

This workflow is a typical SRM investigation cycle.

5 Code Example: Bing-style SRM Detection with Segment Analysis

A simulation of SRM detection by browser segment.

import numpy as np
from scipy import stats

ALPHA = 0.001  # SRM threshold used in this post
BROWSERS = ["Chrome_current", "Chrome_old", "Firefox", "Safari", "Edge"]

rng = np.random.default_rng(42)

# Hypothetical experiment: 2M users, 50/50 design
n_users = 2_000_000
treatment = rng.choice([0, 1], n_users, p=[0.5, 0.5])
browser = rng.choice(BROWSERS, n_users, p=[0.40, 0.05, 0.20, 0.20, 0.15])

# Simulated bug: the Treatment causes 50% of its old-Chrome users
# to be wrongly classified as bots (vectorized; no per-user Python loop)
at_risk = (treatment == 1) & (browser == "Chrome_old")
is_bot = at_risk & (rng.uniform(0, 1, n_users) < 0.5)

# Standard pipeline: filter out bots before analysis
valid_users = ~is_bot


def srm_test(mask):
    """Chi-square SRM test (50/50 design) on the users selected by mask."""
    n_t = int((mask & (treatment == 1)).sum())
    n_c = int((mask & (treatment == 0)).sum())
    expected = (n_t + n_c) / 2
    chi2 = (n_t - expected) ** 2 / expected + (n_c - expected) ** 2 / expected
    # sf is the survival function (1 - cdf); it stays accurate for tiny p-values
    return n_t, n_c, stats.chi2.sf(chi2, df=1)


def verdict(p):
    return "PASS" if p > ALPHA else "FAIL"


# Overall SRM check
n_t, n_c, p_overall = srm_test(valid_users)
print("=== Overall SRM ===")
print(f"T: {n_t:,}, C: {n_c:,}")
print(f"Ratio: {n_t / n_c:.4f}")
print(f"P-value: {p_overall:.2e}")
print(f"Result: {verdict(p_overall)}")

# SRM per browser segment; remember which segments fail
print("\n=== Segment-level SRM ===")
offending = []
for b in BROWSERS:
    n_t_seg, n_c_seg, p_seg = srm_test((browser == b) & valid_users)
    print(f"{b}: T={n_t_seg:,}, C={n_c_seg:,}, "
          f"ratio={n_t_seg / max(n_c_seg, 1):.4f}, p={p_seg:.2e}, {verdict(p_seg)}")
    if p_seg < ALPHA:
        offending.append(b)

# After excluding old Chrome
print("\n=== After Excluding Old Chrome ===")
n_t_new, n_c_new, p_new = srm_test(valid_users & (browser != "Chrome_old"))
print(f"T: {n_t_new:,}, C: {n_c_new:,}")
print(f"Ratio: {n_t_new / n_c_new:.4f}")
print(f"P-value: {p_new:.4f}")
print(f"Result: {verdict(p_new)}")

# Decision recommendation
print("\n=== Decision Recommendation ===")
if p_overall < ALPHA:
    print(f"Initial SRM FAIL (p={p_overall:.2e})")
    print("  → Investigate root cause")
    print("  → Check segments")

    print("\nOffending segments:")
    for b in offending:
        print(f"  - {b}: SRM in segment")

    print("\nAfter excluding offending segments:")
    if p_new > ALPHA:
        print(f"  SRM PASS (p={p_new:.4f})")
        print("  → Re-analyze metrics")
        print("  → If significant lift exists, decide based on remaining sample")
        print("  → If lift disappeared, reject Treatment (false positive)")

Intuition: Segment-level Investigation

What this code shows.

5.0.0.1 Root cause of the overall failure

Overall SRM fail:
  Some Treatment users are missing (classified as bots)
  Sample mismatch

Per-segment analysis:
  Chrome_current: pass
  Chrome_old: fail (more reclassification in Treatment)
  Other browsers: pass

→ A bug specific to one browser
→ A hidden implementation issue

5.0.0.2 Choosing the fix

Option 1: Fix the browser bug
  - Correct the bot-detection logic in the code
  - Classify users the same way in every browser
  - Re-experiment

Option 2: Exclude the browser (analysis only)
  - Drop old Chrome users from the analysis
  - Analyze the remaining ~95% of users
  - Re-analyze once the SRM check passes

In most cases:
  Option 2 is the immediate fallback
  Option 1 is the long-term fix

Bing case:
  Result of Option 2: the significance of all 5 metrics disappeared
  → It was a false positive
  → The original launch decision was wrong

5.0.0.3 Implication

Dramatic impact of SRM:
  ~4% mismatch → false significance on 5 metrics

For example, if the real lift were 0.5% and the SRM by-product added another +0.5%
→ the result would look like +1% (false)

Reversal of the decision:
  Original: launch
  After fix: reject

Value of this reversal:
  Avoids a false launch
  Protects long-term ROI

5.0.0.4 Standard industry procedure

SRM workflow on modern platforms:

1. Continuous monitoring:
   - Daily SRM check
   - Alert on fail

2. Automatic segment analysis:
   - Browser, country, device, day
   - Identify the offending segment

3. Recommended action:
   - Segment exclude (immediate)
   - Bug fix (long-term)
   - Re-experiment

4. Documentation:
   - SRM history per experiment
   - Common cause patterns
   - Resolution playbook

This procedure is standard on mature platforms.

6 Related Topics

Related chapters

F16-1 — Bot Filtering
