Kwangmin Kim - Long-Term Holdout · Replication

1 Definition

Definition: Long-term Holdout and Post-ramp Cleanup

The two final stages in Kohavi (2020), Ch.15.3~15.4.

1.0.0.1 Long-term Holdout

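After Treatment has launched to nearly all users, a small share of users (5~10%) stays in Control permanently or for an extended period. It is a tool for measuring the long-term truth and for replication.

Standard ramp (no holdout):
  100% Treatment, 0% Control
  → No long-term comparison possible

Holdout ramp:
  90% Treatment, 10% Control (for several months)
  → Long-term effect can be measured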
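
One common way to keep the same 10% of users in Control for months is deterministic hashing of the user ID. The following is a minimal sketch under that assumption; the salt name `feature_x_holdout` and the helper names are hypothetical and not taken from the book.

```python
import hashlib

def bucket(user_id: str, salt: str, n_buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, n_buckets) via SHA-256."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def assign_variant(user_id: str, salt: str = "feature_x_holdout",
                   holdout_pct: int = 10) -> str:
    """Users in the lowest `holdout_pct` buckets stay in Control."""
    return "control" if bucket(user_id, salt) < holdout_pct else "treatment"

users = [f"user_{i}" for i in range(100_000)]
assignments = [assign_variant(u) for u in users]
control_share = assignments.count("control") / len(users)
print(f"control share: {control_share:.3f}")
```

Because the assignment depends only on the salt and the user ID, a user lands in the same arm on every request for as long as the salt is unchanged.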

1.0.0.2 Post Final Ramp Cleanup

Cleaning up dead code paths after the 100% launch. The approach depends on the architecture.

Code fork architecture: remove the dead branch
Parameter system: change the default value

Original text (Ch.15.4): “in the first case, it can be disastrous when a dead code path that is not being maintained for a while is accidentally executed.”

Key insight: a holdout has a cost (10% of users get an inferior or a superior experience), so it is a selective tool, not the default for every experiment. Cleanup has no visible cost either way, but skipped cleanups accumulate into a large risk. Both are hidden quality factors of a ramp.

2 Concepts and Principles

2.1 Long-term Holdout: 3 Usage Scenarios

Stated by the author (Ch.15.3).

2.1.1 Scenario 1: Long-term ≠ Short-term

2.1.1.1 Sub-case 1a: Novelty/Primacy Effect

Novelty effect:
  Users are curious about a new feature → early engagement ↑
  The effect attenuates over time
  The true long-term effect is what remains after attenuation

Primacy effect:
  Users are used to their existing behavior
  They resist the new feature → early effect ↓
  After learning, effect ↑

2.1.1.2 How the Holdout Works

Standard measurement (1-week MPR):
  Treatment (1 week of exposure): +10% engagement
  Control (1 week): baseline

Interpretation 1: "+10% effect"
  → Cannot tell how much of it is a novelty effect

Long-term holdout (3 months):
  Day 7: T +10%, C baseline → +10%
  Day 30: T +6%, C baseline → +6% (novelty partly attenuated)
  Day 90: T +5%, C baseline → +5% (true long-term effect)

Interpretation 2: "True effect +5%, Novelty +5%"

Separating the two is what makes the launch decision unbiased.
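
The split in Interpretation 2 is simple arithmetic, sketched below with the illustrative Day 7/30/90 figures from above. Treating the last measured day as the long-term plateau is an assumption of this sketch; in practice the lift may still be drifting at Day 90.

```python
# Observed lift of Treatment over the holdout Control, by days since launch
# (illustrative figures from the example above).
observed_lift = {7: 0.10, 30: 0.06, 90: 0.05}

# Assumption: the last measurement approximates the long-term plateau.
long_term_effect = observed_lift[max(observed_lift)]

# Novelty component = observed lift minus the long-term effect.
novelty = {day: round(lift - long_term_effect, 4)
           for day, lift in observed_lift.items()}

for day in sorted(observed_lift):
    print(f"Day {day:>2}: observed {observed_lift[day]:+.0%}, "
          f"long-term {long_term_effect:+.0%}, novelty {novelty[day]:+.0%}")
```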

2.1.1.3 Sub-case 1b: Large Short-term Impact

The author states: “the short-term impact on key metrics is so large that we must ensure that the impact is sustainable for reasons, such as financial forecasting.”

Scenario: Treatment shows a +30% revenue lift in the short-term measurement

Decision challenges:
  - Is the short-term +30% sustainable?
  - Is it a seasonality or event effect?
  - Is it safe to assume +30% in the financial forecast?

Value of the holdout:
  - A 3-month holdout verifies sustainability
  - It measures how much of the +30% attenuates
  - Financial forecasting accuracy ↑

2.1.1.4 Sub-case 1c: Small Short-term Impact, Suspected Long-term Effect

The author states: “the short-term impact is small-to-none, but teams believe in a delayed effect (e.g., due to adoption or discoverability).”

Scenario: 1-week MPR of a new feature

Treatment vs Control:
  +0.2% engagement (not statistically significant)
  → Not detected

The team's hypothesis, however:
  "This feature pays off only after users discover and learn it"
  "It could reach +5% after a month"

Value of the holdout:
  Measure after a 3-month holdout:
  - Does the effect grow after adoption?
  - Quantitative verification

This is the most delicate sub-case. With no short-term effect the launch decision is hard to make, and the holdout supplies the long-term truth.

2.1.2 Scenario 2: Early Indicator → True-North Metric

The author states: “When an early indicator metric shows impact, but the true-north metric is a long-term metric, such as a one-month retention.”

2.1.2.1 Early indicator vs True-north

Early indicators (measurable in the short term):
  - Click rate
  - Session duration
  - 1-day engagement

True-north metrics (long-term, critical for decisions):
  - 1-month retention
  - Lifetime value (LTV)
  - 3-month engagement

2.1.2.2 Role of the Holdout

Scenario:
  Early indicator: Treatment +5% click rate (1 week)
  True-north: 1-month retention still has to be measured

Option 1: launch without a holdout
  - After 1 month every user is in Treatment
  - No Control when retention is measured → no comparison possible
  - "Does Treatment affect retention?" stays unanswered

Option 2: 1-month holdout
  - 90% Treatment, 10% Control
  - Measure retention after 1 month
  - Compare Treatment retention with Control retention
  - Confirms the causal effect
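
The retention comparison in Option 2 can be sketched as a two-proportion z-test. The user and retention counts below are made up for illustration, and the pooled z-test is one standard choice, not a method prescribed by the book.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(success_t: int, n_t: int, success_c: int, n_c: int):
    """Return (difference in rates, z statistic, two-sided p-value)."""
    p_t, p_c = success_t / n_t, success_c / n_c
    pooled = (success_t + success_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, z, p_value

# Hypothetical 90/10 holdout after one month:
# 90,000 Treatment users with 36,900 retained, 10,000 Control users with 4,000 retained.
diff, z, p_value = two_proportion_ztest(36_900, 90_000, 4_000, 10_000)
print(f"retention lift={diff:.3f}, z={z:.2f}, p={p_value:.3f}")
```

With these made-up counts a one-point retention lift falls just short of significance at alpha = 0.05, which shows how a 10% Control limits what the comparison can detect.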

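2.1.2.3 Checking Whether the Early Indicator Agrees with the True-North Metric

Possibility 1: agreement
  Click rate ↑ → retention ↑
  The early indicator is a leading indicator of the true-north metric
  Future decisions can rely on click rate alone

Possibility 2: disagreement
  Click rate ↑ → retention ↓ (clickbait effect, Goodhart)
  The early indicator is misleading
  Only the true-north metric can be trusted → a holdout is always needed

This check connects to the metric validation in Ch.7 (OEC). The holdout becomes a tool for validating the metric itself.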
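
A rough way to run this check is to look at how often the two metrics moved in the same direction across past holdouts. The sketch below uses invented results for five hypothetical experiments; a real validation would also account for the uncertainty of each estimate.

```python
# Invented results from past holdouts: short-term click-rate delta and
# 1-month retention delta for each (hypothetical) experiment.
past_experiments = [
    {"name": "exp_A", "click_delta": 0.050, "retention_delta": 0.010},
    {"name": "exp_B", "click_delta": 0.020, "retention_delta": 0.004},
    {"name": "exp_C", "click_delta": 0.080, "retention_delta": -0.015},  # clickbait-like
    {"name": "exp_D", "click_delta": -0.010, "retention_delta": -0.002},
    {"name": "exp_E", "click_delta": 0.030, "retention_delta": 0.006},
]

def same_direction(exp: dict) -> bool:
    """True when the early indicator and the true-north metric share a sign."""
    return (exp["click_delta"] > 0) == (exp["retention_delta"] > 0)

agreement_rate = sum(same_direction(e) for e in past_experiments) / len(past_experiments)
disagreements = [e["name"] for e in past_experiments if not same_direction(e)]

print(f"sign agreement: {agreement_rate:.0%}")
print(f"disagreeing experiments: {disagreements}")
```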

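2.1.3 Scenario 3: Variance Reduction

The author states (Ch.15.3, cross-referencing Ch.22): “When there is a benefit of variance reduction for holding longer.”

2.1.3.1 How Variance Reduction Works

1-week MPR:
  N = 1,000,000 users (1 week)
  Variance = σ² / N

3-month holdout (10% C):
  N = 3,000,000 user-months (3 months × 1M users)
  Variance ↓

→ Sensitivity ↑ (smaller effects become detectable)

Caveat: counting user-months as N assumes each extra month adds roughly independent information. Repeated observations of the same user are correlated, so the real variance reduction is smaller than σ² / N with N = 3,000,000 suggests.

2.1.3.2 Limits of Effect Detection

Detection limit of a 1-week MPR (illustrative figures):
  - Effect ≥ 1%: reliably detected
  - Effect 0.5%: hard to detect

3-month holdout (illustrative figures):
  - Effect ≥ 0.2%: reliably detected
  - Smaller effects become detectable

This is the value for small-effect detection, and it comes simply from running longer.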
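
The relationship between sample size and the smallest detectable effect can be sketched with the usual minimum detectable effect (MDE) approximation. The function name `mde`, the 80% power target, and the assumption that 13 weeks yield 13 times as many independent observations are choices of this sketch, not statements from the book.

```python
from math import sqrt
from statistics import NormalDist

def mde(sigma: float, n_t: int, n_c: int,
        alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate minimum detectable difference in means for a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sigma * sqrt(1 / n_t + 1 / n_c)

# 1 week at 50/50 versus 13 weeks at 50/50, optimistically assuming
# every user-week is an independent observation.
mde_1w = mde(1.0, 500_000, 500_000)
mde_13w = mde(1.0, 13 * 500_000, 13 * 500_000)

print(f"1-week MDE:  {mde_1w:.5f}")
print(f"13-week MDE: {mde_13w:.5f}")
print(f"ratio: {mde_1w / mde_13w:.2f}")  # equals sqrt(13) under this assumption
```

The MDE shrinks with the square root of the number of observations, so 13 times the data buys about a 3.6-fold improvement at best.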

2.2 Operating a Holdout: 90/10 vs Staying at MPR

The author emphasizes (Ch.15.3): “There is a misconception that holdout should always be conducted with a majority of the traffic in Treatment, such as 90% or 95%. While this may work well in general, for the 1c scenario discussed here where the short-term impact is already too small to be detected at MPR, we should continue the holdout at MPR if possible.”

2.2.1 Standard Holdout: 90/10

Standard scenario:
  - The effect was detected at MPR
  - A 90/10 holdout is enough (to verify sustainability)

Why:
  - 90% of users benefit from the new feature
  - Compromise: verify with a small holdout

2.2.2 Exception: Scenario 1c (Small Short-term Impact)

Scenario 1c:
  The short-term effect was not detected at MPR

With a 90/10 holdout:
  - Treatment N = 9,000,000 (3 months)
  - Control N = 1,000,000
  - Variance factor: 1/0.9 + 1/0.1 = 11.1 (vs 4 at MPR)
  - Variance is 2.78× that of MPR, so the standard error is about 1.67× larger

In this case → staying at MPR 50/50 is more sensitive
  - Treatment N = 5,000,000 (3 months)
  - Control N = 5,000,000
  - Variance factor: 4
  - Highest sensitivity
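
The variance factors above follow from a short derivation, sketched here assuming equal per-unit variance σ² in both arms, N total units, and a Treatment share q (these symbols are introduced for this sketch):

```latex
\begin{align*}
\operatorname{Var}(\hat{\Delta})
  &= \frac{\sigma^2}{qN} + \frac{\sigma^2}{(1-q)N}
   = \frac{\sigma^2}{N}\left(\frac{1}{q} + \frac{1}{1-q}\right) \\
q = 0.5 &: \quad \frac{1}{0.5} + \frac{1}{0.5} = 4 \\
q = 0.9 &: \quad \frac{1}{0.9} + \frac{1}{0.1} \approx 11.1 \\
\frac{\operatorname{Var}_{90/10}}{\operatorname{Var}_{50/50}}
  &\approx \frac{11.1}{4} \approx 2.78,
  \qquad
  \frac{\operatorname{SE}_{90/10}}{\operatorname{SE}_{50/50}}
  \approx \sqrt{2.78} \approx 1.67
\end{align*}
```

The factor is minimized at q = 0.5, which is why MPR gives the highest sensitivity for a fixed N.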

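2.2.2.1 Trade-off

50/50 holdout (3 months):
  + Highest sensitivity
  - 50% of users go without Treatment for 3 months
  - Opportunity cost (if Treatment truly has a positive effect)

90/10 holdout:
  + 90% of users benefit
  - Variance 2.78× that of MPR (standard error about 1.67×)
  - Small effects are hard to detect

The choice depends on the hypothesized effect size: stay at MPR when a small effect is expected, use 90/10 when a large effect is expected.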

2.3 Uber Holdout

The author states (Ch.15.3): “In addition to holdouts at the experiment level, there are companies that have uber holdouts, where some portion of traffic is withheld from any feature launch over a long term (often a quarter) to measure the cumulative impact across experiments.”

2.3.1 Mechanism

Standard (per-experiment holdout):
  Experiment A: 90% T_A / 10% C_A
  Experiment B: 90% T_B / 10% C_B
  Each experiment holds out a different set of users

Uber holdout:
  A fixed group of users (e.g., a 10% pool) is in Control for every experiment
  → They keep the baseline experience for the whole holdout period
  → The other 90% receive the cumulative result of all experiments

Comparison:
  Uber holdout users: baseline for one quarter
  Remaining 90%: the X features launched over that quarter
  Difference = cumulative effect for the quarter
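
In assignment code, an uber holdout amounts to a gate that runs before any per-experiment assignment. This is a minimal sketch assuming hash-based bucketing; the function names and the `uber_holdout` salt are hypothetical.

```python
import hashlib

def bucket(key: str, n_buckets: int = 100) -> int:
    """Stable bucket in [0, n_buckets) derived from a string key."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % n_buckets

def in_uber_holdout(user_id: str, holdout_pct: int = 10) -> bool:
    """The same pool of users is withheld from every experiment."""
    return bucket(f"uber_holdout:{user_id}") < holdout_pct

def variant_for(user_id: str, experiment: str, treatment_pct: int = 90) -> str:
    # Gate first: uber-holdout users never see any Treatment.
    if in_uber_holdout(user_id):
        return "control"
    # Everyone else is randomized per experiment with an experiment-specific salt.
    return "treatment" if bucket(f"{experiment}:{user_id}") < treatment_pct else "control"

users = [f"user_{i}" for i in range(20_000)]
held_out = [u for u in users if in_uber_holdout(u)]
always_control = all(variant_for(u, exp) == "control"
                     for u in held_out for exp in ("exp_A", "exp_B"))
holdout_share = len(held_out) / len(users)
print(f"holdout share: {holdout_share:.3f}, always control: {always_control}")
```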

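2.3.2 Bing's Global 10% Holdout

Cited by the author (Kohavi et al. 2013): “Bing conducts a global holdout to measure the overhead of experimentation platform, where 10% of Bing users are withheld from any experiments.”

Note that the purpose stated in the quote is measuring the overhead of the experimentation platform itself. The operation and measurement value listed below apply the general uber-holdout pattern to such a 10% pool; they are this note's illustration, not a procedure documented in the quote.

Operation:
  10% of users = global holdout
  - Control in every experiment
  - Not exposed when new features launch
  - Reset every quarter

Measurement value:
  - Cumulative effect of every feature launched during the quarter
  - "How much value did the company add for users this quarter?"
  - Direct input to meta-analysis (Ch.8)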

2.3.3 Reverse Experiment

The author states: “There can also be reverse experiments, where users are put back into Control several weeks (or months) after the Treatment launches to 100% (see Chapter 23).”

How a reverse experiment works:

Day 1: Treatment X launched to 100%
Day 30: Users have become used to Treatment X
Day 60: Reverse experiment starts
  - Some users (e.g., 5%) are put back into Control
  - "How do they behave without this feature?"

Value:
  - Measures the sustained effect
  - Verifies the stability of user adoption
  - Estimates the cost of removal (should the feature ever be removed)

2.3.3.1 Ethical Side of Reverse Experiments

From the user's point of view:
  - Day 1~60: they get used to the feature
  - Day 60+: the feature disappears for some users only (Control)
  - Frustration: "Why did it vanish just for me?"

Mitigation:
  - Keep the reversed share small (5% or less)
  - Communicate explicitly (where possible)
  - End the reversal quickly (re-enable the feature once the analysis is done)

These ethical concerns are a reason for restraint, so reverse experiments are used sparingly.

2.4 Replication: Verifying Surprising Results

The author states (Ch.15.3): “When experiment results are surprising, a good rule of thumb is to replicate them. Rerun the experiment with a different set of users or with an orthogonal re-randomization. If the results remain the same, you can have a lot more confidence that the results are trustworthy.”

2.4.1 How Replication Works

Original experiment:
  Day 1~7: Treatment vs Control
  Result: +20% engagement (surprisingly large)

Replication option 1: different users
  Day 8~14: rerun on a different user sample
  Result: +18% (consistent → trustworthy)
       or +5% (inconsistent → the original was likely spurious)

Replication option 2: orthogonal re-randomization
  Day 8~14: same user population, but a different randomization seed
  Result: the same user may receive a different variant than before
  → Checks whether the effect was spurious
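
Orthogonal re-randomization can be sketched as re-hashing the same users with a new seed, so the rerun assignment is statistically independent of the original one. The seed strings `run_1` and `run_2` and the helper name are hypothetical.

```python
import hashlib

def assign(user_id: str, seed: str, treatment_pct: int = 50) -> str:
    """Hash-based assignment; changing the seed reshuffles all users."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_pct else "control"

users = [f"user_{i}" for i in range(100_000)]

# Users who were in Treatment in the original run.
original_treatment = [u for u in users if assign(u, "run_1") == "treatment"]

# With an orthogonal seed, about half of them should land in Treatment again.
rerun_share = (sum(assign(u, "run_2") == "treatment" for u in original_treatment)
               / len(original_treatment))
print(f"share of original Treatment users in Treatment again: {rerun_share:.3f}")
```

A share near 50% means that whatever imbalance the first randomization carried by chance is not carried over into the rerun.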

2.4.2 Selection Bias from Multiple Iterations

The author emphasizes: “when there have been many iterations of an experiment, the results from the final iteration may be biased upwards. A replication run reduces the multiple-testing concern and provides an unbiased estimate.”

2.4.2.1 How the Selection Bias Arises

iteration 1: Treatment v1, +2%
iteration 2: Treatment v2, +4%
iteration 3: Treatment v3, +6%
iteration 4: Treatment v4, +8% ← adopted

Final estimate: +8%

Problem:
  Is the +8% of iteration 4 the true effect?
  Or is it noise plus selection (picking the best-looking result)?

2.4.2.2 Correcting with Replication

Replication after adopting iteration 4:
  - Re-measure on a different user sample
  - If the true effect is +8% → the replication also shows about +8%
  - If the true effect is +5% (noise +3%) → the replication shows about +5%

The gap between the original and the replicated estimate quantifies the selection bias.
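
The upward bias can be reproduced with a small simulation. It makes a simplifying assumption that differs from the example above: all four iterations share the same true effect of +5%, so any difference between them is pure measurement noise. The noise level and the number of simulated runs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.05   # assumed identical for every iteration
noise_sd = 0.02      # assumed measurement noise of one experiment
n_iterations = 4
n_sims = 20_000

# Observed effect of each iteration = true effect + independent noise.
observed = true_effect + rng.normal(0.0, noise_sd, size=(n_sims, n_iterations))

# Selection: adopt whichever iteration looked best.
winner_estimate = observed.max(axis=1)

# Replication: measure the adopted variant again on fresh users.
replication_estimate = true_effect + rng.normal(0.0, noise_sd, size=n_sims)

print(f"true effect:             {true_effect:.4f}")
print(f"mean winner estimate:    {winner_estimate.mean():.4f}")
print(f"mean replication result: {replication_estimate.mean():.4f}")
```

The winner's original estimate is inflated on average, while the replication is centered on the true effect because its noise is independent of the selection step.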

The author's cross-reference: see Ch.17 (The Statistics behind Online Controlled Experiments) for details.

2.5 Post Final Ramp: Cleanup

The author states (Ch.15.4): “We have not discussed what happens after an experiment is ramped to 100%.”

2.5.1 Cleanup by Architecture

2.5.1.1 Architecture 1: Code Fork

# Branching form of the code
if variant == "treatment":
    new_logic()
else:
    old_logic()

2.5.1.2 Cleanup after the 100% launch

# After cleanup
new_logic()  # the if branch itself is removed

2.5.1.3 Risk of skipping cleanup

The author emphasizes: “it can be disastrous when a dead code path that is not being maintained for a while is accidentally executed, which could happen when the experiment system has an outage.”

Risks of a dead code path:

1. No maintenance:
   - New features only update new_logic()
   - old_logic() stays frozen in its state from, say, a year ago
   - old_logic() breaks when other dependencies change
   - The breakage goes unnoticed because the normal flow never reaches it

2. Fallback during an outage:
   - The experiment system goes down
   - The if statement falls back to its default branch → old_logic()
   - old_logic() is broken → users get a broken experience

3. Confusion during refactoring:
   - "Who wrote this code? Why is it here?"
   - Knowledge loss
   - Risk of incorrect changes
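
Until the dead branch is removed, one stopgap for risk 2 is to make the outage fallback point at the launched variant instead of the old one. This is a sketch of that idea only, not something the book prescribes; `fetch_variant`, `LAUNCHED_DEFAULTS`, and the flag names are hypothetical stand-ins for a real experiment client.

```python
class ExperimentSystemDown(Exception):
    """Raised when the (hypothetical) experiment service cannot be reached."""

def fetch_variant(flag: str, user_id: str, system_up: bool = True) -> str:
    """Stand-in for a call to the experiment system."""
    if not system_up:
        raise ExperimentSystemDown(flag)
    return "treatment"

# Flags already launched to 100%: fall back to the launched variant on outage.
LAUNCHED_DEFAULTS = {"new_checkout": "treatment"}

def resolve_variant(flag: str, user_id: str, system_up: bool = True) -> str:
    try:
        return fetch_variant(flag, user_id, system_up)
    except ExperimentSystemDown:
        # Unlaunched flags still fall back to Control, the safe choice for them.
        return LAUNCHED_DEFAULTS.get(flag, "control")

print(resolve_variant("new_checkout", "user_1", system_up=False))
print(resolve_variant("still_ramping", "user_1", system_up=False))
```

This only narrows the window of risk; the unmaintained branch still exists until the cleanup removes it.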

2.5.1.4 Architecture 2: Parameter System

# The code reads a parameter
config_value = config.get("feature_param", default="old_value")
result = process(config_value)

2.5.1.5 Cleanup after the 100% launch

# Cleanup: change only the default
config_value = config.get("feature_param", default="new_value")
# Or remove the config lookup and hardcode the value
result = process("new_value")

2.5.1.6 Advantages of the parameter system

Simple cleanup:
  - Only the default value changes
  - No if branch
  - Almost no dead code paths

Maintainability:
  - Every path is active
  - Refactoring is safe
  - Knowledge is preserved

2.5.1.7 Industry trend

A rough characterization in these notes, not a measured statistic:

2010s: mostly code forks
2020s: gradual shift toward parameter systems
Reasons:
  - Avoids accumulating dead code
  - Maintainability ↑
  - Cleanup can be automated

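Hypothetical: Cumulative Effect of Having No Cleanup SLA

Assumption: a company launches 100 experiments a year and has no cleanup SLA.

2.5.1.8 After 1 year

Dead code paths from 100 launched experiments:
  - All remain in the production codebase
  - About 200 lines per path on average
  - Total dead code: 20,000 lines

Impact:
  - Dead code grows into a sizable share of the codebase over time (the note's 30~50% figure would apply only to a codebase of roughly 40,000 to 67,000 lines)
  - Cognitive load during code review
  - Confusion about where to make changes when developing new features
  - Build and test time ↑
  - Bug surface area ↑ (bugs in dead paths)

2.5.1.9 Possible incident after 2 years

During an experiment system outage:
  - The if branch falls back to its default
  - The old logic of an experiment launched a year ago runs
  - It is incompatible with code changed since then
  - Users get a broken experience
  - Recovery is hard (code ownership is gone)

2.5.1.10 Remedies

1. Explicit SLA:
   - Clean up within 30 days of launch
   - Cleanup tickets are created automatically
   - Managers enforce it

2. Architecture change:
   - Code fork → parameter system
   - Simplifies cleanup

3. Automation:
   - A linter detects dead code paths
   - Warning: "Launched X days ago, please cleanup"

4. Enforcement in code review:
   - Request cleanup when a new PR touches dead code
   - Clean up as part of refactoring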
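
The automated warning in remedy 3 can be sketched as a check against a registry of launch dates. The registry format, the flag names, and the dates are invented for illustration; a real platform would read them from its experiment metadata.

```python
from datetime import date

def stale_flags(registry: dict, today: date, sla_days: int = 30) -> list:
    """Return flags launched to 100% more than `sla_days` ago, oldest first."""
    overdue = [(launched, name) for name, launched in registry.items()
               if launched is not None and (today - launched).days > sla_days]
    return [name for _, name in sorted(overdue)]

# Hypothetical registry: flag name -> date of the 100% launch (None = still ramping).
registry = {
    "new_checkout": date(2024, 1, 5),
    "search_v2": date(2024, 3, 20),
    "ranker_v3": None,
}

today = date(2024, 4, 1)
overdue_flags = stale_flags(registry, today)
for name in overdue_flags:
    days = (today - registry[name]).days
    print(f"{name}: launched {days} days ago, please clean up")
```

Running such a check in CI or on a schedule turns the SLA into something visible instead of relying on memory.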

This cleanup discipline is the silent quality of a mature platform. It does not show up in ordinary metrics, but it is central to long-term codebase health.

3 Why It Is Needed

Without Phase 4:

Long-term truth unknown: over-launching because of novelty effects
Sustained effect unverified: inaccurate financial forecasts for big lifts
Spurious results launched: selection bias from multiple iterations

With Phase 4:

Phase 4 holdout: separates out the long-term effect
Replication: verifies surprising results and corrects for selection
Cleanup SLA: keeps the codebase healthy

This phase is what long-term quality rests on. It concerns the quality of long-term decisions more than the short-term launch decision.

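4 Application: Operating a Bing-style Global Holdout

An illustrative quarterly operation modeled on Bing's global 10% holdout (the quarterly cadence and reset rule are assumptions of these notes):

Start of each quarter:
  - 10% of users = global holdout (randomly selected)
  - Not exposed to any feature launched during the quarter

During the quarter:
  - 90% of users accumulate launched features
  - 10% stay on the baseline

End of the quarter:
  - Compare holdout users with everyone else
  - Measure the cumulative effect for the quarter
  - Input to the company's performance report

Reset:
  - A new random set of holdout users is drawn the next quarter
  - No user stays in the holdout permanently (ethical)

Operating this way marks the mature stage of a modern A/B platform and feeds the quarterly measurement of company performance.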

5 Code Example: Comparing a 90/10 Holdout with an MPR Holdout

An analytic sensitivity comparison (no random sampling is involved).

import numpy as np
from scipy import stats

# Scenario 1: effect 1% (medium)
# Scenario 2: effect 0.2% (small, the 1c scenario)

scenarios = [
    {"name": "Medium effect (1%)", "effect": 0.01},
    {"name": "Small effect (0.2%)", "effect": 0.002},
]

# Holdout splits to compare
holdout_options = [
    {"name": "MPR (50/50)", "T_pct": 50, "C_pct": 50},
    {"name": "90/10", "T_pct": 90, "C_pct": 10},
    {"name": "95/5", "T_pct": 95, "C_pct": 5},
]

n_total = 1_000_000  # 1M users held for 3 months
duration_factor = 13  # 13 weeks (3 months); treats each user-week as independent
sigma = 1.0

print("=== Sensitivity Comparison ===")

for scenario in scenarios:
    print(f"\n=== {scenario['name']} ===")
    effect = scenario["effect"]

    for option in holdout_options:
        n_T = int(n_total * option["T_pct"] / 100)
        n_C = int(n_total * option["C_pct"] / 100)

        # Standard error of each arm and of the difference
        se_T = sigma / np.sqrt(n_T * duration_factor)
        se_C = sigma / np.sqrt(n_C * duration_factor)
        se_diff = np.sqrt(se_T**2 + se_C**2)

        # Z-score of the assumed effect
        z_score = effect / se_diff
        # Two-tailed p-value
        p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

        # Detection at alpha = 0.05 (two-tailed, consistent with the p-value)
        detected = abs(z_score) > 1.96

        # Variance factor 1/q + 1/(1-q)
        q_T = option["T_pct"] / 100
        var_factor = 1/q_T + 1/(1-q_T)

        print(f"  {option['name']:20s}: "
              f"N_T={n_T:>9,}, N_C={n_C:>9,}, "
              f"VarFactor={var_factor:.1f}, "
              f"Z={z_score:.2f}, p={p_value:.4f}, "
              f"detect={'Yes' if detected else 'No'}")

# Key message
print("\n=== Key message ===")
print("Medium effect (1%): detected under every holdout option")
print("Small effect (0.2%):")
print("  - MPR (50/50): variance factor 4.0, highest sensitivity")
print("  - 90/10: variance factor 11.1, variance 2.78x MPR (SE 1.67x)")
print("  - 95/5: variance factor 21.1, variance 5.26x MPR (SE 2.29x)")
print("\n→ For small effects, keep the holdout at MPR (sensitivity first)")

Expected output.

=== Sensitivity Comparison ===

=== Medium effect (1%) ===
  MPR (50/50)         : N_T=  500,000, N_C=  500,000, VarFactor=4.0, Z=18.03, p=0.0000, detect=Yes
  90/10               : N_T=  900,000, N_C=  100,000, VarFactor=11.1, Z=10.82, p=0.0000, detect=Yes
  95/5                : N_T=  950,000, N_C=   50,000, VarFactor=21.1, Z=7.86, p=0.0000, detect=Yes

=== Small effect (0.2%) ===
  MPR (50/50)         : N_T=  500,000, N_C=  500,000, VarFactor=4.0, Z=3.61, p=0.0003, detect=Yes
  90/10               : N_T=  900,000, N_C=  100,000, VarFactor=11.1, Z=2.16, p=0.0305, detect=Yes
  95/5                : N_T=  950,000, N_C=   50,000, VarFactor=21.1, Z=1.57, p=0.1160, detect=No

=== Key message ===
Medium effect (1%): detected under every holdout option
Small effect (0.2%):
  - MPR (50/50): variance factor 4.0, highest sensitivity
  - 90/10: variance factor 11.1, variance 2.78x MPR (SE 1.67x)
  - 95/5: variance factor 21.1, variance 5.26x MPR (SE 2.29x)

→ For small effects, keep the holdout at MPR (sensitivity first)

Intuition: Choosing a Holdout for Small Effects

The key message of this calculation.

5.0.0.1 Medium effect (1%)

Every holdout option detects the effect, so sensitivity is not the constraint and user benefit can take priority. In this setup even 95/5 is adequate (95% of users get the new feature).

5.0.0.2 Small effect (0.2%)

MPR: detected (z=3.61)
90/10: marginal (z=2.16, p=0.03)
95/5: NOT detected (z=1.57, p=0.12)

So for a small effect, stay at MPR. In the trade-off between giving 95% of users the benefit and measurement quality, quality comes first.

5.0.0.3 Decision framework

Estimated effect size (prior hypothesis):

Small effect (< 0.5%):
  → Keep the holdout at MPR (50/50)
  → Sensitivity first

Medium effect (0.5%~3%):
  → 90/10 holdout
  → User benefit with adequate sensitivity (more margin than 95/5)

Large effect (> 3%):
  → 95/5 holdout
  → User benefit first

Effect size unknown (uncertain hypothesis):
  → Conservative 90/10
  → Middle ground

This decision framework serves as a practical rule of thumb for operating holdouts.
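
The framework can be written down as a small lookup function. The thresholds are the ones listed above; the function name and the convention of passing `None` for an unknown effect size are choices of this sketch.

```python
from typing import Optional

def recommend_holdout_split(expected_effect: Optional[float] = None) -> str:
    """Map a hypothesized relative effect size to a holdout split."""
    if expected_effect is None:
        return "90/10"            # effect size unknown: conservative middle ground
    size = abs(expected_effect)
    if size < 0.005:
        return "50/50 (MPR)"      # small effect: sensitivity first
    if size <= 0.03:
        return "90/10"            # medium effect: benefit plus adequate sensitivity
    return "95/5"                 # large effect: user benefit first

for effect in (0.002, 0.01, 0.05, None):
    print(f"expected effect {effect}: {recommend_holdout_split(effect)}")
```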

6 Wrapping Up the Ch.15 Series

All 4 parts complete:

F15-0: Definition of ramping, SQR framework, map of the 4 phases
F15-1: SQR trade-offs, ramp up/down asymmetry, origin of MPR
F15-2: Phase 1 (Pre-MPR) ring structure, Phase 2 (MPR) 1-week recommendation, Phase 3 (Post-MPR)
F15-3: Phase 4 (Holdout) 3 scenarios, Replication, Post-final cleanup

Next: Ch.16 (Scaling Experiment Analyses, 3 parts).

7 Related Topics

Prerequisites

Next chapter

F16-*: Ch.16 Scaling Experiment Analyses

Related chapters

F8-*: Ch.8 Institutional Memory, input to meta-analysis
F19-*: Ch.19 A/A Test, a tool for replication
F23-*: Ch.23 Long-term Effects, novelty/primacy in detail
F22-*: Ch.22 Leakage, variance reduction

Links to other categories

Engineering: Feature Flag Cleanup
Engineering: Code Refactoring, dead code management
Statistics: Multiple Testing, selection correction via replication