Kwangmin Kim - Definition of Ramping and the SQR Framework

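1 Definition

Definition: Ramping (Controlled Exposure)

A procedure that gradually increases the share of traffic assigned to a new Treatment while validating reliability, quality, and operational load at each stage (Kohavi, Tang, Xu, 2020, Ch. 15.1).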

1.0.0.1 Example stages

Day 1, hour 1: 0.5% (single data center)
Day 1, hour 8: 1% (all regions)
Day 2: 5% (early adopter ring)
Day 3: 25% (broader audience)
Day 5: 50% (MPR, the measurement stage)
Day 12: 75%
Day 13: 100%

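1.0.0.2 Why "ramping" is not just "scaling"

Ramping:
  - Gradual increase plus explicit validation at each stage
  - Per-stage alerts and decision points
  - On an incident, hold at the current stage

Scaling (simply increasing exposure):
  - Exposure grows automatically over time
  - No validation
  - No defense against incidents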
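
The contrast above can be sketched in a few lines. This is a minimal illustration rather than the authors' tooling: `gated_ramp`, `time_based_scaling`, and the `stage_is_healthy` callback are hypothetical names, and the stage shares reuse the example schedule.

```python
from typing import Callable, Sequence, Tuple

def gated_ramp(stages: Sequence[float],
               stage_is_healthy: Callable[[float], bool]) -> Tuple[float, bool]:
    """Dial up stage by stage; hold at the current stage if its check fails.

    Returns (final_share, completed).
    """
    current = 0.0
    for share in stages:
        current = share                   # expose this share of traffic
        if not stage_is_healthy(share):   # explicit decision point
            return current, False         # hold here; do not advance
    return current, True

def time_based_scaling(stages: Sequence[float]) -> Tuple[float, bool]:
    """Plain scaling: exposure grows on schedule with no validation at all."""
    return stages[-1], True

STAGES = [0.005, 0.01, 0.05, 0.25, 0.50, 0.75, 1.00]

# Hypothetical guardrail: a problem becomes visible once exposure reaches 25%.
problem_at_25 = lambda share: share < 0.25

print(gated_ramp(STAGES, problem_at_25))   # (0.25, False)
print(time_based_scaling(STAGES))          # (1.0, True)
```

With the same underlying problem, the gated version stops at 25% while plain scaling carries the problem to every user.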

Quote (Thomas A. Edison): “The real measure of success is the number of experiments that can be crowded into 24 hours.”

Key insight: Edison's quote stresses the value of speed. Ramping, however, pursues speed and safety together. The goal is not simply many experiments but many trustworthy experiments.

2 Concepts and principles

2.1 The essence of ramping: a trade-off between two extremes

The authors state (Ch. 15.1): “Ramping too slowly wastes time and resources. Ramping too quickly may hurt users and risks making suboptimal decisions.”

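2.1.1 Costs at the two extremes

2.1.1.1 Ramping too slowly

Ramp takes a month or longer:
  Costs:
    - Delayed innovation cycle
    - Competitors may run several experiments (say, five) in the same time
    - Wasted engineering time
    - Users wait longer for a good feature (opportunity cost)

  Analogy:
    "The caution trap": chasing zero risk ends up creating zero value

2.1.1.2 Ramping too quickly

Ramp finishes within a day:
  Costs:
    - A bug found late already affects 100% of users
    - Damaged user trust
    - Suboptimal launch decision (insufficient measurement)
    - An incident in the style of the Healthcare.gov launch

  Analogy:
    "The haste trap": chasing speed ends in a major incident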

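2.1.1.3 Finding the balance

The authors ask: “How do we decide which incremental ramps we need and how long we should stay at each increment?”

A reasonable ramp (rule of thumb):
  - Number of stages: 4 to 6 (too many is inefficient, too few is risky)
  - Time per stage: hours to days (long enough for operational validation)
  - Total: 1 to 2 weeks (balanced against the innovation cycle)

Striking this balance is the core of the SQR framework.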

2.1.1.4 Combining principles and tooling

The authors emphasize: “we need principles on how to ramp to guide experimenters and ideally, tooling to automate the process and enforce the principles at scale.”

Principles:
  - SQR framework (conceptual guide)
  - Explicit goal for each phase
  - Alert thresholds

Tooling:
  - Automatic dial-up
  - Real-time guardrails
  - Automated phase advancement

Ramping works at scale only when the two are combined. When, say, 1,000 engineers at a company run experiments concurrently, automation is essential.

2.2 The asymmetry between ramp up and ramp down

The authors emphasize (Ch. 15.1): “We primarily focus on the process of ramping up. Ramping down is typically used when we have a bad Treatment, in which case we typically shut it down to zero very quickly to limit the user impact.”

2.2.1 How the asymmetry works

Ramp UP (gradual):
  Purpose: validation and safe expansion
  Speed: slow (validated stage by stage)
  Number of phases: 4 to 6
  Total time: 1 to 2 weeks

Ramp DOWN (immediate):
  Purpose: cut off impact during an incident
  Speed: fast (straight to 0%)
  Number of phases: 1 (immediate)
  Total time: minutes to hours

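2.2.1.1 Why immediacy matters on the way down

Scenario: a bug is found at MPR

Option 1 (slow ramp down):
  50% → 25% (1 day)
  25% → 10% (1 day)
  10% → 5% (1 day)
  5% → 0% (1 day)
  → users are affected for 4 days

Option 2 (immediate ramp down):
  50% → 0% (5 minutes)
  → impact is cut off after 5 minutes

Option 2 is clearly the right choice. When a bug has significant impact, stop immediately.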
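
The gap between the two options can be put in numbers. The sketch below assumes each step holds its starting share for the stated duration (the text does not specify this), and measures exposure as share of users multiplied by days; `exposure_user_days` is a hypothetical helper.

```python
# (share of users exposed, days spent at that share)
slow_ramp_down = [(0.50, 1.0), (0.25, 1.0), (0.10, 1.0), (0.05, 1.0)]
immediate_ramp_down = [(0.50, 5 / (24 * 60))]   # 5 minutes at 50%

def exposure_user_days(steps):
    """Sum of share x duration: fraction of the user base times days exposed."""
    return sum(share * days for share, days in steps)

slow = exposure_user_days(slow_ramp_down)
fast = exposure_user_days(immediate_ramp_down)
print(f"slow: {slow:.4f}, immediate: {fast:.4f}, ratio: {slow / fast:.0f}x")
# slow: 0.9000, immediate: 0.0017, ratio: 518x
```

Under these assumptions the slow ramp down exposes users to the bug roughly 500 times longer in aggregate.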

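2.2.1.2 Engineering for immediacy

Infrastructure required for ramp down:
  - Server-side feature flag (instant toggle)
  - Cache invalidation (takes effect from the next request)
  - A maximum TTL on client-side caches (typically minutes)

Ramp down for a mobile app:
  - Toggle the server flag
  - Wait for the next client fetch (minutes to hours)
  - Offline clients pick up the change only when they next come online
  - Total: minutes to days (depending on the share of clients online)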
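
The effect of a client-side cache TTL on ramp-down latency can be illustrated with a toy model. `FlagServer` and `CachingClient` are hypothetical classes written for this sketch, not a real feature-flag SDK, and the 5-minute TTL is an assumed value.

```python
class FlagServer:
    """Stands in for a server-side feature flag store (hypothetical)."""
    def __init__(self) -> None:
        self.enabled = True

    def kill(self) -> None:
        self.enabled = False   # ramp down to 0% in one toggle

class CachingClient:
    """Client that re-fetches the flag only after its cache TTL expires."""
    def __init__(self, server: FlagServer, ttl_seconds: float) -> None:
        self._server = server
        self._ttl = ttl_seconds
        self._cached = None
        self._fetched_at = 0.0

    def is_enabled(self, now: float) -> bool:
        if self._cached is None or now - self._fetched_at >= self._ttl:
            self._cached = self._server.enabled   # fetch from the server
            self._fetched_at = now
        return self._cached

server = FlagServer()
client = CachingClient(server, ttl_seconds=300)   # 5-minute TTL (assumed)

seen = [client.is_enabled(now=0)]         # first fetch: Treatment is on
server.kill()                             # incident: flag turned off
seen.append(client.is_enabled(now=60))    # still cached, so still on
seen.append(client.is_enabled(now=300))   # TTL expired, re-fetch, now off
print(seen)  # [True, True, False]
```

The server toggle is instant, but each client keeps serving the Treatment until its cache expires, which is why the maximum TTL bounds how fast a ramp down actually takes effect.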

2.2.1.3 The enterprise exception

The authors state: “large enterprises usually control their own client-side updates, so they are effectively excluded from some experiments and ramping exposure.”

Enterprise customers (e.g., Microsoft 365 enterprise):
  - The IT department controls updates
  - Updates follow the customer's own schedule
  - They sit outside the ramp, regardless of what the vendor intends

Implications:
  - Some users (enterprise) are outside the ramp
  - Exclude them from the sample during analysis
  - Ramp results from general users alone are a weak basis for enterprise decisions

2.3 Why run controlled experiments: three purposes

As stated by the authors (Ch. 15.1).

2.3.1 Purpose 1: Measure

"To measure the impact and Return-On-Investment (ROI) of the Treatment if it launched to 100%."

Key questions:
  - What metric impact would this Treatment have at a 100% launch?
  - Is the ROI positive?
  - How large is the business value, in numbers?

Method:
  - 50/50 split at MPR
  - Measure for at least 1 week
  - Effect size estimate with a confidence interval

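2.3.1.1 Statistical power of measurement

Why a 50/50 split has the power advantage:
  - Lowest variance of the estimated difference for a given total sample
  - The same effect is detectable with a smaller sample
  - Faster decisions

Comparison of power:
  90/10 split: lower power (higher variance)
  50/50 split: maximum power
  10/90 split: lower power (symmetric to 90/10)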

2.3.2 Purpose 2: Reduce risk

"To reduce risk by minimizing damage and cost to users and business during an experiment (i.e., when there is a negative impact)."

Key questions:
  - If this Treatment harms users, how do we cut it off?
  - How do we limit the number of users exposed to an incident?
  - How quickly is an incident detected?

Method:
  - Small rings before MPR
  - Real-time guardrails
  - Ability to ramp down quickly
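
One way to picture a real-time guardrail is a check that compares the Treatment failure rate against Control and signals a ramp down when the gap is too large to be noise. The sketch below is an assumption for illustration: `guardrail_breached` is a hypothetical function using a pooled two-proportion z-statistic with a conservative threshold of 3, and the counts are made up.

```python
import math

def guardrail_breached(ctrl_bad: int, ctrl_n: int,
                       trt_bad: int, trt_n: int,
                       z_threshold: float = 3.0) -> bool:
    """True if the Treatment failure rate is significantly above Control."""
    p_ctrl = ctrl_bad / ctrl_n
    p_trt = trt_bad / trt_n
    pooled = (ctrl_bad + trt_bad) / (ctrl_n + trt_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ctrl_n + 1 / trt_n))
    if se == 0:
        return False   # identical all-success or all-failure groups
    return (p_trt - p_ctrl) / se > z_threshold

# Control: 0.1% errors. Treatment ring of 5,000 users.
print(guardrail_breached(100, 100_000, 30, 5_000))  # True  (0.6% errors)
print(guardrail_breached(100, 100_000, 6, 5_000))   # False (0.12% errors)
```

Even a small pre-MPR ring can catch a sixfold error-rate increase, which is the kind of clear harm this phase is meant to contain.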

2.3.2.1 Classifying risk by severity

Risk severity:
  Low: small UX issue, negligible impact per user
  Medium: part of a feature breaks, some users frustrated
  High: system down, all users blocked
  Critical: data loss, security breach

What each level asks of ramping:
  Low risk: a fast ramp is acceptable
  High risk: ramp slowly, through multiple rings
  Critical risk: ramp only after extensive testing (or postpone the ramp altogether)

2.3.3 Purpose 3: Learn

The authors state: “To learn about users’ reactions, ideally by segments, to identify potential bugs, and to inform future plans.”

Dimensions of learning:
  - User reaction (engagement, satisfaction)
  - Segment-specific effects
  - Edge-case bugs
  - Ideas for future features

Method:
  - Verbatim feedback (whitelisted ring)
  - Long-term holdout (sustainability of the effect)
  - A/A tests (validating the system)

2.3.3.1 Designing for learning

The authors state (cross-reference to Ch. 5): “either as part of running any standard experiments, or when running experiments designed for learning.”

Standard experiments plus learning:
  - Every experiment has a learning component
  - Segment analysis beyond the decision metric
  - Edge-case checks

Experiments designed for learning:
  - Deliberately inferior treatment (slowdown experiment)
  - Quantifying trade-offs
  - Understanding causal mechanisms

2.4 The statistical origin of MPR (Maximum Power Ramp): a detailed walkthrough

Authors' footnote (Ch. 15.1, footnote 1): “If the experiment has the entire 100% traffic with only one Treatment, the variance in the two-sample t-test is proportional to 1/q(1-q), where q is the treatment traffic percentage. The MPR in this case has a 50/50 traffic allocation.”

2.4.1 Statistical derivation

2.4.1.1 Variance in the two-sample t-test

Treatment share: q
Control share: 1-q
Total sample: N

Sample size of each group:
  N_T = qN
  N_C = (1-q)N

Variance of each sample mean (assuming both groups share the same variance σ²):
  Var(mean_T) = σ²/N_T = σ²/(qN)
  Var(mean_C) = σ²/N_C = σ²/((1-q)N)

Variance of the difference (the two groups are independent, so the variances add):
  Var(mean_T - mean_C) = σ²/(qN) + σ²/((1-q)N)
                      = (σ²/N) × (1/q + 1/(1-q))
                      = (σ²/N) × ((1-q+q)/(q(1-q)))
                      = (σ²/N) / (q(1-q))

Variance ∝ 1 / (q × (1-q))
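
The derivation above can be checked numerically. The following is a small Monte Carlo sketch under the same assumptions (independent groups, common variance σ², normally distributed observations); the function names, sample size, and repetition count are choices made for this illustration.

```python
import numpy as np

def empirical_diff_variance(q, n_total=400, sigma=1.0, reps=4000, seed=0):
    """Simulate many experiments and measure Var(mean_T - mean_C) directly."""
    rng = np.random.default_rng(seed)
    n_t = int(round(q * n_total))
    n_c = n_total - n_t
    mean_t = rng.normal(0.0, sigma, size=(reps, n_t)).mean(axis=1)
    mean_c = rng.normal(0.0, sigma, size=(reps, n_c)).mean(axis=1)
    return float(np.var(mean_t - mean_c))

def theoretical_diff_variance(q, n_total=400, sigma=1.0):
    """Closed form from the derivation: (sigma^2 / N) / (q (1 - q))."""
    return sigma**2 / (n_total * q * (1 - q))

for q in (0.5, 0.25, 0.1):
    emp = empirical_diff_variance(q)
    theo = theoretical_diff_variance(q)
    print(f"q={q}: simulated={emp:.5f}, formula={theo:.5f}")
```

The simulated and closed-form values should agree to within a few percent, with the smallest variance at q = 0.5.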

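2.4.1.2 Choosing q to minimize variance

Lower variance means higher power
→ the larger q(1-q) is, the smaller the variance
→ where is q(1-q) maximal?

Differentiate:
  d/dq [q(1-q)] = 1 - 2q = 0
  → q = 0.5
  (the second derivative is -2 < 0, so this is a maximum)

Therefore q = 0.5 (a 50/50 split) gives minimum variance and hence maximum power.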

2.4.1.3 Variance at different values of q

q = 0.5: Variance ∝ 1 / 0.25 = 4
q = 0.25: Variance ∝ 1 / 0.1875 = 5.33 (33% higher)
q = 0.1: Variance ∝ 1 / 0.09 = 11.11 (178% higher)
q = 0.05: Variance ∝ 1 / 0.0475 = 21.05 (426% higher)
q = 0.01: Variance ∝ 1 / 0.0099 = 101 (2425% higher)

→ the further q moves from 50%, the more sharply variance grows
→ to detect the same effect, the required total sample size grows by the same ratio

2.4.1.4 Generalization: less than 100% of traffic

The authors state: “If there is only 20% traffic available to experiment between one Treatment and one Control, the MPR has a 10/10 split.”

Available traffic: a
Treatment share (of all traffic): q_T
Control share (of all traffic): q_C
Constraint: q_T + q_C = a

Same logic:
  Variance ∝ 1/q_T + 1/q_C = a / (q_T × q_C)
  With q_T + q_C = a fixed, the product q_T × q_C is largest when the two shares are equal
  Minimum at q_T = q_C = a/2

Therefore, with 20% available, 10% T + 10% C is the MPR.

2.4.1.5 Generalization to multiple variants

The authors state: “If there are four variants splitting 100% traffic, then each variant should get 25%.”

K variants splitting 100%:
  Each variant: 100/K %
  The traffic is split equally so that each pairwise test has maximum power
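
Both generalizations reduce to "split the available traffic equally". A minimal sketch, with `mpr_allocation` and `variance_factor` as hypothetical helper names, plus a brute-force check of the 20% case in whole percentage points:

```python
def mpr_allocation(available_traffic: float, n_variants: int) -> list:
    """Equal split of the available traffic across all variants (incl. Control)."""
    return [available_traffic / n_variants] * n_variants

def variance_factor(q_t: float, q_c: float) -> float:
    """Term the variance of the difference is proportional to: 1/q_T + 1/q_C."""
    return 1 / q_t + 1 / q_c

print(mpr_allocation(0.20, 2))   # [0.1, 0.1]
print(mpr_allocation(1.00, 4))   # [0.25, 0.25, 0.25, 0.25]

# With 20% available, try every Treatment share from 1% to 19%.
best_pct = min(range(1, 20),
               key=lambda pct: variance_factor(pct / 100, (20 - pct) / 100))
print(best_pct)  # 10
```

The search lands on 10% Treatment and 10% Control, matching the a/2 result.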

2.4.1.6 What MPR really means

Statistical meaning of MPR:
  "Minimum variance and maximum power"

Operational meaning of MPR:
  "The point of trustworthy measurement"
  "The most reliable point for a decision"
  "The best window for learning"

How long to stay at MPR:
  - At least 1 week (to cover time-dependent factors)
  - Longer when novelty or primacy effects are suspected

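Intuition: what the 50/50 MPR implies

2.4.1.7 Why 50/50 rather than 90/10?

Intuitively, "exposing fewer users (10%) is safer", but for measurement the statistically right answer is 50/50.

Scenario: detecting the same effect with 1,000,000 users in total

90/10 split (10% Treatment):
  Treatment: 100,000 users
  Control: 900,000 users
  Variance ∝ 1/0.1 + 1/0.9 = 11.1
  Lower power → needs about 2.78 times the total sample of a 50/50 split

50/50 split:
  Treatment: 500,000 users
  Control: 500,000 users
  Variance ∝ 1/0.5 + 1/0.5 = 4
  Higher power → the baseline total sample

Sample size difference: 11.1 / 4 ≈ 2.78 times

Interpretation:
  90/10: the reassurance that "only 10% of users are affected"
  But: less statistical power → larger sample needed → longer experiment
  Result: fewer users exposed, but for a longer time
  Trade-off: exposure time per user vs. number of users exposed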

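2.4.1.8 Additional value of 50/50

Hidden advantages of 50/50:
  - More precise results in the same time period
  - Faster decision (launch or reject)
  - Faster iteration (move on to the next experiment)

Hidden costs of 90/10:
  - Longer time to reach a result
  - Slower iteration
  - Fewer experiments completed in the same time (roughly 1/2.78 as many at the same traffic)

Therefore 50/50 is the standard for the measurement stage. Unequal splits such as 90/10 are used only when risk takes priority (pre-MPR).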

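2.4.1.9 Exception: unequal splits before MPR

In the pre-MPR stage, a Treatment share of 1 to 5% puts risk first:

Quality is hard to measure (small sample, high noise)
But containing risk takes priority
Quality is measured in the next phase (MPR)

This phase-by-phase change in split is how the SQR framework operates. Once the priority shifts from Risk to Quality, the split becomes 50/50.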

2.4.2 Intermediate Ramp Stages

The authors state (Ch. 15.1): “We may also need intermediate ramp stages between MPR and 100%. For example, for operational reasons we may need to wait at 75% to ensure that the new services or endpoints can scale to the increasing traffic load.”

Examples of operational ramps:

75% ramp of a new ML model:
  - GPU memory pressure on servers
  - Changes in cache hit rate
  - Database connection pool usage
  - Monitor at 75%, then decide on 100%

75% ramp of a new service endpoint:
  - p99 latency under the first 75% load
  - Verify that auto-scaling kicks in
  - Balance between database master and replicas

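2.4.2.1 What 75% means

Step from 50% MPR to 75% (assuming load scales with Treatment traffic):
  Treatment users: 50% → 75% (1.5 times)
  Server load: 1.5 times
  Traffic to the new service endpoint: 1.5 times

Validating this 1.5x load:
  - Monitor peak hours for 1 day
  - Is p99 latency stable?
  - Is memory usage normal?
  - No shortage of database connections?

Once validation passes:
  75% → 100% (about 1.33 times the 75% load, i.e., another 0.5 times the MPR load)
  Minimal additional risk

This stage is what secures operational stability.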
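
The load arithmetic and the "validate, then advance" step can be sketched as follows. This assumes load is proportional to the Treatment share; `load_multiplier`, `stage_passes`, the metric names, and the limits are all hypothetical values chosen for illustration.

```python
def load_multiplier(from_share: float, to_share: float) -> float:
    """Relative load increase when Treatment traffic grows between two stages."""
    return to_share / from_share

print(round(load_multiplier(0.50, 0.75), 2))  # 1.5
print(round(load_multiplier(0.75, 1.00), 2))  # 1.33

def stage_passes(observed: dict, limits: dict) -> bool:
    """Every monitored metric must stay at or below its limit."""
    return all(observed[name] <= limit for name, limit in limits.items())

# Hypothetical operational limits and peak-hour readings at the 75% stage.
limits = {"p99_latency_ms": 400, "memory_util": 0.80, "db_pool_util": 0.90}
observed_at_75 = {"p99_latency_ms": 310, "memory_util": 0.72, "db_pool_util": 0.65}
print(stage_passes(observed_at_75, limits))  # True
```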

2.4.3 Long-term holdout: a detailed walkthrough

The authors state: “Another common example is to learn. While learning should be part of every ramp, we sometimes conduct a long-term holdout ramp, where a small fraction (e.g., 5–10%) of users do not receive the new Treatment for a period of time (e.g., two months) primarily for learning purposes.”

2.4.3.1 How a holdout works

Regular ramp:
  Phase 1: 1% → 5% → 25%
  Phase 2: 50% (MPR, 1 week)
  Phase 3: 75% → 100%
  → every user gets the Treatment

Long-term holdout:
  Phase 1: 1% → 5% → 25%
  Phase 2: 50% (MPR, 1 week)
  Phase 3: 75% → 90%
  Phase 4: 90% Treatment / 10% Control (several months)
  → 10% stay in Control for the holdout period, enabling a long-term comparison

2.4.3.2 The value of a holdout

Short-term and long-term effects can differ (illustrative numbers):

Day 1 (MPR measurement):
  Treatment +5% engagement
  Control baseline

Day 30 (Holdout):
  Treatment +5% engagement (sustained)
  Control baseline (no change)

Day 90 (Holdout):
  Treatment +3% engagement (some attenuation)
  Control baseline

Insight:
  Short-term +5% → long-term +3%
  The novelty effect accounts for about 2 percentage points
  The sustained lift is +3%

This attenuation is invisible from short-term data alone. The holdout is the tool that reveals it.
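
The decomposition above is simple arithmetic, shown here with the same illustrative lifts. The dictionary layout and variable names are assumptions made for this sketch.

```python
# Lift of Treatment over the holdout Control, by days since launch (from the text)
lift_by_day = {1: 0.05, 30: 0.05, 90: 0.03}

short_term = lift_by_day[min(lift_by_day)]   # earliest measurement
long_term = lift_by_day[max(lift_by_day)]    # latest holdout measurement
novelty = short_term - long_term             # part of the lift that faded

print(f"short-term lift: {short_term:+.0%}")   # +5%
print(f"sustained lift:  {long_term:+.0%}")    # +3%
print(f"novelty part:    {novelty:+.0%}")      # +2%
```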

2.4.3.3 Chapter 23 cross-reference

The authors note: “See more discussion in Chapter 23.” (Long-Term Effects, covered in detail in the F-KOH23 series).

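3 Why it is needed

Without the SQR framework and ramping:

Healthcare.gov-style incident: a big-bang launch failure
Insufficient quality: wrong launch decisions
Lost speed: the caution trap (delayed experiments)
Insufficient learning: the long-term truth stays unknown

With them in place:

Balance across three dimensions: Speed, Quality, and Risk are all acceptable
Explicit priority per phase: clear decision criteria at each stage
Statistical grounding: power maximization at MPR
Long-term truth: sustainability checked through holdouts

This framework is the operational backbone of a modern A/B platform.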

4 Application: ramping operations at Bing

An example from the authors (the Bing example in Ch. 15.2).

Bing's ramping schedule:

Phase 1 (Pre-MPR):
  - 0.5 to 2% (single data center)
  - Time: 4 to 12 hours
  - Detects memory leaks and other slow leaks

  → Once the single data center passes:
  - Small traffic share in all data centers (1 to 5%)
  - Time: 1 day

Phase 2 (MPR):
  - 50/50 split
  - Time: at least 1 week (to cover both heavy and light users)

Phase 3 (Post-MPR):
  - 75% (1 day)
  - 90% (1 day)
  - 100% (final)

Phase 4 (Holdout):
  - Selective (large launches only)
  - 10% kept as a long-running control (Bing's global 10% holdout)

This schedule is Bing's industrial standard, and every launch follows it.
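
As a sanity check, the schedule above can be written down as data and summed. The sketch assumes the upper end of each share range and exactly one week at MPR (the text says "at least"); the tuple layout and names are choices made for this illustration.

```python
# (stage, treatment share, min hours, max hours), following the schedule above
SCHEDULE = [
    ("pre-MPR, single data center", 0.02, 4, 12),
    ("pre-MPR, all data centers",   0.05, 24, 24),
    ("MPR",                         0.50, 7 * 24, 7 * 24),
    ("post-MPR 75%",                0.75, 24, 24),
    ("post-MPR 90%",                0.90, 24, 24),
]

min_hours = sum(stage[2] for stage in SCHEDULE)
max_hours = sum(stage[3] for stage in SCHEDULE)
print(f"time to 100%: {min_hours / 24:.1f} to {max_hours / 24:.1f} days")
# time to 100%: 10.2 to 10.5 days
```

Roughly ten days to reach 100% sits inside the 1 to 2 week total suggested earlier for a balanced ramp.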

5 Code example: MPR power calculation

Compares statistical power across values of q (the Treatment share).

import numpy as np
import pandas as pd
from scipy import stats

# Total sample size needed to detect the same effect (delta = 0.05), by q
def required_sample_size(q, delta=0.05, sigma=1.0, alpha=0.05, power=0.8):
    """Total N (both groups) for a two-sided test, normal approximation."""
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # Var(mean_T - mean_C) = (sigma^2 / N) * (1/q + 1/(1-q)), solved for N
    n_total = (z_alpha + z_beta)**2 * sigma**2 * (1/q + 1/(1-q)) / delta**2
    return int(np.ceil(n_total))

q_values = [0.5, 0.25, 0.10, 0.05, 0.01]
results = []
for q in q_values:
    n_required = required_sample_size(q)
    relative = n_required / required_sample_size(0.5)
    variance_factor = 1/q + 1/(1-q)
    results.append({
        "q (Treatment share)": q,
        "Variance factor (1/q + 1/(1-q))": variance_factor,
        "Required N": n_required,
        "Relative to 50/50": f"{relative:.2f}x",
    })

df = pd.DataFrame(results)
print("=== Sample Size by Treatment share ===")
print(df.to_string(index=False))

# Power by q for a fixed total N
print("\n=== Power by q for N=10000 ===")
N_total = 10_000
delta = 0.05
sigma = 1.0
alpha = 0.05

for q in q_values:
    n_T = int(N_total * q)
    n_C = N_total - n_T
    se = np.sqrt(sigma**2 * (1/n_T + 1/n_C))   # standard error of the difference
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_critical = z_alpha
    z_score_at_delta = delta / se
    # Upper-tail approximation; the negligible lower rejection tail is ignored
    power = 1 - stats.norm.cdf(z_critical - z_score_at_delta)
    print(f"q={q:.2f}: N_T={n_T}, N_C={n_C}, SE={se:.4f}, Power={power:.3f}")

# Summary of what MPR means
print("\n=== What MPR means ===")
print("As q moves away from 0.5:")
print("  - Variance increases")
print("  - Required sample size increases")
print("  - Time to decision increases")
print("\nA 50/50 split is the statistical optimum.")

Expected output (column spacing may differ slightly by pandas version).

=== Sample Size by Treatment share ===
 q (Treatment share)  Variance factor (1/q + 1/(1-q))  Required N Relative to 50/50
                0.50                         4.000000       12559             1.00x
                0.25                         5.333333       16745             1.33x
                0.10                        11.111111       34884             2.78x
                0.05                        21.052632       66096             5.26x
                0.01                       101.010101      317127            25.25x

=== Power by q for N=10000 ===
q=0.50: N_T=5000, N_C=5000, SE=0.0200, Power=0.705
q=0.25: N_T=2500, N_C=7500, SE=0.0231, Power=0.581
q=0.10: N_T=1000, N_C=9000, SE=0.0333, Power=0.323
q=0.05: N_T=500, N_C=9500, SE=0.0459, Power=0.192
q=0.01: N_T=100, N_C=9900, SE=0.1005, Power=0.072

=== What MPR means ===
As q moves away from 0.5:
  - Variance increases
  - Required sample size increases
  - Time to decision increases

A 50/50 split is the statistical optimum.

Intuition: the dramatic difference in sample size

5.0.0.1 Key message

With q = 0.01 (1% Treatment):

Required N: 317,127 (about 25 times the 12,559 needed at 50/50)
About 25 times the sample to detect the same effect
About 25 times longer at the same traffic

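5.0.0.2 Limits of the pre-MPR ring

Pre-MPR ring 1 (1% traffic):
  - Containing risk comes first
  - Quality is hard to judge because the sample is too small
  - Power is about 0.07 at N = 10,000 (an effect of delta = 0.05 is almost never detected)
  - Not sufficient as a decision tool; quality decisions have to wait for MPR

So pre-MPR is a "safety check" rather than a "validation".
It works as a threshold: "if there is no obvious catastrophe at this stage, move to the next."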

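5.0.0.3 How power changes across ramp phases

Using the example above (N = 10,000, delta = 0.05, sigma = 1):

Phase 1 (1%): power 0.07 → catches only clear disasters
Phase 2 (5%): power 0.19 → only large effects are reliable
Phase 3 (25%): power 0.58 → moderate effects start to be detectable
Phase 4 (50% MPR): power 0.71 → the highest attainable at this N

→ Power peaks at MPR, so quality decisions belong there (reaching 0.8 at 50/50 needs N ≈ 12,559)
→ Earlier phases are for containing risk only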

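5.0.0.4 Practical implications

Decision priority by phase:
  Phase 1: no major disaster? (too little power → safety check)
  Phase 2: can smaller problems be caught too? (low power → partial validation)
  Phase 3: most issues caught (moderate power → fuller validation)
  Phase 4 MPR: precise effect estimate (maximum power → decision)
  Phase 5+: scaling (decision already made)

This phase-by-phase change in power is the statistical core of the SQR framework. Each phase offers a different level of precision and therefore calls for different decision criteria.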

6 Related topics

Prerequisite

F15-0: Ch. 15 overview (Ramping and SQR)

Related chapters

F4-0: Ch. 4 overview (Maturity), Crawl/Walk/Run/Fly
F12-2: Implication 1 (Anticipate), pre-shipped variants
F18-*: Ch. 18 Variance/CUPED, variance reduction
F23-*: Ch. 23 long-term effects, long-term holdout

Links to other categories

Engineering: Canary Deployment (the DevOps counterpart of ramping)
Engineering: Blue-Green vs Canary
Statistics: Statistical Power (definition and calculation of power)
Statistics: Sample Size Calculation