Kwangmin Kim - Kohavi Ch.16 Overview: Scaling Experiment Analyses

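1 Definition

Definition: Scaling Experiment Analysis

The practice of building experiment analysis into the experimentation platform itself once an organization reaches the Run and Fly phases of the maturity model (Ch.4) (Kohavi, Tang, Xu, 2020, Ch.16).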

1.0.0.1 Three-stage pipeline

Stage	Purpose	Main tasks
Data Processing	Turn raw logs into an analyzable form	Sort, join, clean, enrich
Data Computation	Compute metrics and segments, run statistical tests	Per-user stats, scorecard
Visualization	Deliver results to decision makers	OEC highlight, segment drill-down

1.0.0.2 Two complementary paths

NRT (Near Real-Time) path:
  - Simple metrics (sums, counts)
  - Updated every few minutes
  - Catches egregious problems immediately
  - Automatic shut-off

Batch path:
  - Full processing (clean, enrich)
  - Updated every few hours to every few days
  - Input to trustworthy decisions

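Quote (Thomas J. Watson): “If you want to increase your success rate, double your failure rate.”

Key insight: Experiment throughput grows many times over in the Run and Fly phases, and ad-hoc analysis cannot keep up. Only when the analysis itself is built into the platform does running hundreds or thousands of experiments a year become feasible. That integration is the essence of maturity.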

2 Concepts and Principles

2.1 Why Care: The Analysis Crunch at Run/Fly Maturity

From the authors' introduction: “For a company to move to the later phases of experimentation maturity (‘Run’ or ‘Fly’), incorporating data analysis pipelines as part of the experimentation platform can ensure that the methodology is solid, consistent, and scientifically founded.”

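2.1.1 Analysis patterns by maturity phase

Crawl (~10 experiments/year):
  - Ad-hoc SQL queries written per experiment
  - One analyst handles everything
  - Days per experiment

Walk (~50/year):
  - Standard SQL templates
  - Several analysts
  - About a day per experiment

Run (~250/year):
  - Automatically generated scorecards
  - Engineers self-serve
  - Hours per experiment

Fly (1000+/year):
  - Automated analysis plus alerts
  - Every decision maker self-serves
  - Minutes per experiment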

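2.1.2 Why ad-hoc analysis does not scale

Limits of ad-hoc analysis:

1. Inconsistency:
   - Each analyst defines metrics differently
   - The question "why does this result differ from that one?" keeps coming back

2. No methodology validation:
   - SQL is written by hand
   - Bugs are likely
   - Statistical correctness is hard to verify

3. Cannot scale:
   - At 1000 experiments/year, on the order of 100 analysts would be needed (assuming roughly 10 experiments per analyst)
   - Unrealistic cost

4. Delayed decisions:
   - Every custom analysis takes time
   - Analysis becomes the bottleneck of the innovation cycle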

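2.1.2.1 Effects of platform integration

Benefits of platform analysis:

1. Consistency:
   - Standard metric definitions
   - Same analysis, same result

2. Trustworthiness:
   - Validated methodology
   - SRM and A/A checks applied automatically
   - Statistical correctness verified once in the platform rather than per analysis

3. Scale:
   - Analysis is automated, so analyst headcount no longer caps the number of experiments
   - Engineers self-serve

4. Faster decisions:
   - Real-time or daily scorecards
   - Faster innovation

This difference is the inflection point of maturity. The transition from ad-hoc to platform analysis is what defines the Run phase.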

2.2 Stage 1: Data Processing

Per the authors (Ch.16.1), this stage has three sub-steps.

2.2.1 Sub-step 1: Sort and Group

Input: raw logs from multiple sources
  - Client logs (browser, mobile)
  - Server logs
  - Separate logs from multiple data centers

Work:
  - Sort by user ID
  - Sort by timestamp
  - Join across sources (Ch.13)

Output:
  - Chronological event sequence per user
  - Grouped into sessions or visits
  - The raw input to analysis
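
The sort-and-group step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's implementation: the `Event` fields and the 30-minute inactivity gap used to split sessions are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import NamedTuple

class Event(NamedTuple):
    user_id: str
    ts: float      # seconds since epoch
    source: str    # e.g. "client" or "server"
    name: str

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity gap that closes a session

def sort_and_group(events):
    """Return {user_id: [session, ...]}; each session is a time-ordered list of events."""
    by_user = defaultdict(list)
    for event in events:               # merge all sources, keyed by user
        by_user[event.user_id].append(event)

    sessions = {}
    for user_id, user_events in by_user.items():
        user_events.sort(key=lambda e: e.ts)          # chronological order per user
        user_sessions = [[user_events[0]]]
        for prev, cur in zip(user_events, user_events[1:]):
            if cur.ts - prev.ts > SESSION_GAP_SECONDS:
                user_sessions.append([cur])           # long silence: start a new session
            else:
                user_sessions[-1].append(cur)
        sessions[user_id] = user_sessions
    return sessions

events = [
    Event("u1", 100.0, "client", "pageview"),
    Event("u2", 50.0, "server", "request"),
    Event("u1", 40.0, "server", "request"),
    Event("u1", 5000.0, "client", "click"),
]
sessions = sort_and_group(events)
```

Here `u1` ends up with two sessions, because the click at `ts=5000` comes more than 30 minutes after the previous event.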

2.2.1.1 Materialized vs Virtual Join

The authors note: “You may not need to materialize this join, as a virtual join as a step during processing and computation may suffice.”

Materialized:
  - The joined data is stored
  - Also useful for other purposes (debugging, hypothesis generation)
  - Higher storage cost

Virtual:
  - Joined only at computation time
  - Saves storage
  - Pays the join cost on every computation

Choosing:
  - Used for several purposes: materialize
  - Used only once: virtual

2.2.2 Sub-step 2: Data Cleaning

Five cleaning tasks named by the authors.

1. Remove bot/fraud sessions:
   - Heuristics:
     - Too much activity in a session (e.g. 10K events/min)
     - Too little activity (an impossible scenario)
     - Too little time between events (e.g. <10 ms)
     - Too many clicks on a page
     - Violations of the "laws of physics" (e.g. being in several locations at once)

2. Fix instrumentation issues:
   - Remove duplicate events
   - Correct timestamp anomalies

3. Missing events:
   - Cannot be recovered by cleaning
   - Some lossiness is inherent (Kohavi 2010)

4. Check for filter bias:
   - If a cleaning filter affects one variant more than the other
   - an SRM (Sample Ratio Mismatch) results (Ch.21)

5. Edge case handling:
   - Time zone changes
   - Daylight saving transitions
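
One way to check for the filter bias described in item 4 is to rerun an SRM test on the user counts that survive cleaning. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom. The counts, the 50/50 design split, and the 0.001 alert threshold are assumptions made for the example.

```python
import math

SRM_P_THRESHOLD = 0.001  # assumed alert threshold

def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = control_users + treatment_users
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((treatment_users - expected_t) ** 2 / expected_t
            + (control_users - expected_c) ** 2 / expected_c)
    # With 1 degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts before and after a bot filter that hits Treatment harder.
p_before = srm_p_value(control_users=50_000, treatment_users=50_100)
p_after = srm_p_value(control_users=49_500, treatment_users=48_200)

filter_introduced_srm = p_before >= SRM_P_THRESHOLD and p_after < SRM_P_THRESHOLD
```

If the split is healthy before a filter and fails after it, the filter itself is the likely cause, and the filtered data should not be trusted until that is understood.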

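2.2.2.1 Violating the “Laws of Physics”

Things a real user can never do:
  - 1000 clicks in one second
  - Be in two places at once (London and New York)
  - Have a session of negative length (timestamps running backwards)

When such a pattern is detected, it points to a bot or an instrumentation bug,
so the session is removed during cleaning.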
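
The heuristics above translate naturally into rule checks on each session. The following is a sketch under stated assumptions: the `Session` fields and all three thresholds are hypothetical and would need tuning on real traffic.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only.
MAX_EVENTS_PER_MINUTE = 1_000
MIN_GAP_SECONDS = 0.010
MAX_CLICKS_PER_PAGE = 500

@dataclass
class Session:
    session_id: str
    event_times: list            # event timestamps in seconds, in logged order
    max_clicks_on_one_page: int

def violation(session):
    """Return the name of the first rule the session breaks, or None."""
    times = session.event_times
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if any(gap < 0 for gap in gaps):
        return "time_runs_backwards"      # negative session length
    if any(gap < MIN_GAP_SECONDS for gap in gaps):
        return "events_too_close"         # faster than a human can act
    # Treat very short sessions as lasting one minute to avoid inflated rates.
    duration_minutes = max((times[-1] - times[0]) / 60.0, 1.0) if times else 1.0
    if len(times) / duration_minutes > MAX_EVENTS_PER_MINUTE:
        return "too_many_events"
    if session.max_clicks_on_one_page > MAX_CLICKS_PER_PAGE:
        return "too_many_clicks"
    return None

sessions = [
    Session("human", [0.0, 4.2, 9.8, 31.0], 3),
    Session("fast_bot", [0.000, 0.002, 0.004], 2),
    Session("bad_clock", [10.0, 3.0], 1),
    Session("burst_bot", [i * 0.05 for i in range(1201)], 1),
    Session("click_bot", [0.0, 1.0, 2.0], 900),
]
kept = [s for s in sessions if violation(s) is None]
```

Returning the rule name rather than a boolean makes it possible to count removals per rule and per variant, which feeds directly into the filter-bias check.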

2.2.3 Sub-step 3: Enrich

Per the authors.

Enrichment dimensions:

Per-event:
  - Browser family/version (user agent parsing)
  - OS, device type
  - Day of week
  - Time of day (hour, minute)

Per-session:
  - Total event count
  - Total duration
  - Unique pages visited
  - Whether a conversion occurred

Per-user:
  - Session count
  - Total time on site
  - Last visit timestamp

Experiment-specific:
  - Is this session part of the experiment?
  - When the treatment changed (transition)
  - Variant version

2.2.3.1 The value of annotations

The authors note: “These annotations are pieces of business logic that are often added during enrichment for performance reasons.”

Why annotate during enrichment:
  - Recomputing at every computation step is wasteful
  - Everything is computed once, at enrichment time
  - Later stages only look the values up
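
A per-event enrichment step might look like the sketch below. The event layout (`ts`, `user_agent`) is an assumption, and the user-agent matching is deliberately crude: a production pipeline would use a dedicated parser.

```python
from datetime import datetime, timezone

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

def browser_family(user_agent):
    """Rough user-agent classification; real parsers cover far more cases."""
    ua = user_agent.lower()
    # Order matters: Edge user agents also contain "chrome/", and Chrome's contain "safari/".
    for token, family in (("edg/", "Edge"), ("firefox/", "Firefox"),
                          ("chrome/", "Chrome"), ("safari/", "Safari")):
        if token in ua:
            return family
    return "Other"

def enrich_event(event):
    """Return a copy of the event with annotations computed once, for later lookup."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return {
        **event,
        "browser_family": browser_family(event["user_agent"]),
        "day_of_week": DAYS[ts.weekday()],
        "hour_of_day": ts.hour,
    }

raw = {
    "user_id": "u1",
    "ts": 1577890800,  # 2020-01-01 15:00:00 UTC
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}
enriched = enrich_event(raw)
```

Downstream stages can then segment by `browser_family` or `day_of_week` without touching the raw user-agent string or timestamp again.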

2.3 Stage 2: Data Computation

Per the authors (Ch.16.2).

2.3.1 Two architectures

2.3.1.1 Architecture 1: Materialize Per-User Stats

Pipeline:
  Step 1: Compute per-user statistics
    - For each user:
      - Pageview count
      - Click count
      - Impression count
      - Other metrics
    - Stored as a table keyed by user

  Step 2: User-experiment mapping table
    - Which variant of which experiment is each user in?

  Step 3: Join + aggregation
    - User stats × experiment mapping gives metrics per variant
    - Statistical test of Treatment vs Control

Advantages:
  - Per-user stats serve other purposes too (overall business reporting)
  - Compute once, use many times
  - Efficient use of compute

Disadvantages:
  - Storage cost of materialization
  - Some segments and metrics need a finer grain than the user level
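
The three steps of Architecture 1 can be sketched with plain dictionaries. The tables and values below are made up for illustration; in practice both tables would live in a data warehouse and the final step would include a statistical test.

```python
from collections import defaultdict
from statistics import mean

# Step 1: per-user statistics, materialized once and shared by every experiment.
per_user_stats = {
    "u1": {"pageviews": 10, "clicks": 2},
    "u2": {"pageviews": 7, "clicks": 4},
    "u3": {"pageviews": 3, "clicks": 1},
    "u4": {"pageviews": 9, "clicks": 3},
}

# Step 2: user-experiment mapping as (experiment_id, user_id, variant) rows.
assignments = [
    ("exp_42", "u1", "T"), ("exp_42", "u2", "T"),
    ("exp_42", "u3", "C"), ("exp_42", "u4", "C"),
    ("exp_7", "u1", "C"), ("exp_7", "u3", "T"),
]

def variant_means(exp_id, metric):
    """Step 3: join the two tables for one experiment and aggregate per variant."""
    values = defaultdict(list)
    for assigned_exp, user_id, variant in assignments:
        if assigned_exp == exp_id and user_id in per_user_stats:
            values[variant].append(per_user_stats[user_id][metric])
    return {variant: mean(xs) for variant, xs in values.items()}

exp_42_clicks = variant_means("exp_42", "clicks")
exp_7_clicks = variant_means("exp_7", "clicks")
```

Both experiments read the same `per_user_stats` table, which is the reuse benefit listed above; the cost is that the table has to be stored.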

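2.3.1.2 Architecture 2: Integrated Per-Experiment

Pipeline:
  Step 1: For each experiment
    - Compute per-user metrics (for that experiment only)
    - No materialization (in memory or streaming)

  Step 2: Aggregation + statistical test

Advantages:
  - Per-experiment flexibility
  - Saves storage (nothing is materialized)
  - Metric definitions can differ per experiment

Disadvantages:
  - Hard to keep pipelines consistent with each other
  - Needs a mechanism for sharing definitions
  - The same metric may be computed differently in different pipelines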
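
A sketch of Architecture 2 with a shared definition registry follows. `METRIC_DEFINITIONS` stands in for whatever definition-sharing mechanism a platform provides; its name and the event layout are assumptions for this example.

```python
from collections import defaultdict

# Shared registry: the single place where each metric is defined.
METRIC_DEFINITIONS = {
    "clicks_per_user": lambda events: sum(1 for e in events if e["name"] == "click"),
    "pageviews_per_user": lambda events: sum(1 for e in events if e["name"] == "pageview"),
}

def analyze_experiment(event_stream, assignment, metric_names):
    """Compute per-variant means for one experiment without storing per-user stats.

    assignment maps user_id -> variant for this experiment only.
    """
    events_by_user = defaultdict(list)
    for event in event_stream:
        if event["user_id"] in assignment:      # ignore users outside the experiment
            events_by_user[event["user_id"]].append(event)

    per_variant = defaultdict(lambda: defaultdict(list))
    for user_id, variant in assignment.items():
        user_events = events_by_user[user_id]   # empty list for users with no events
        for name in metric_names:
            per_variant[variant][name].append(METRIC_DEFINITIONS[name](user_events))

    return {variant: {name: sum(xs) / len(xs) for name, xs in metrics.items()}
            for variant, metrics in per_variant.items()}

events = [
    {"user_id": "u1", "name": "click"},
    {"user_id": "u1", "name": "click"},
    {"user_id": "u1", "name": "pageview"},
    {"user_id": "u2", "name": "pageview"},
    {"user_id": "u3", "name": "click"},   # u3 is not in this experiment
]
result = analyze_experiment(events, {"u1": "T", "u2": "C", "u4": "C"}, ["clicks_per_user"])
```

Assigned users with no events (`u4` here) still count in the denominator with a value of zero, which matters for per-user averages.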

2.3.1.3 Typical progression

Early: Architecture 1 (materialize)
  - Simple
  - Quick to start

After growth: Hybrid
  - Common metrics: materialized
  - Experiment-specific metrics: integrated

Mature: Architecture 2 + shared definitions
  - Flexibility first
  - Strong governance (definition consistency)

This evolution is one axis of platform maturity.

2.3.2 Speed and Efficiency

The authors on the scale involved: “Bing, LinkedIn, and Google all process terabytes of experiment data daily.”

2.3.2.1 Initial vs Modern

Early days (2010s):
  - Daily scorecard
  - About a 24-hour lag (Monday's data available by end of day Tuesday)
  - Simple

Modern (2020s):
  - NRT path (minutes)
  - Batch path (hours to days)
  - Two layers

2.3.2.2 NRT (Near Real-Time) Path

Per the authors: “The NRT path has simpler metrics and computations (i.e., sums and counts, no spam filtering, minimal statistical tests) and is used to monitor for egregious problems.”

Characteristics of NRT:
  - Simple metrics (sums, counts)
  - Updated every few minutes
  - No statistical tests (or minimal ones)
  - No spam filtering (raw data)

Purpose:
  - Catch egregious bugs immediately
  - Auto shut-off (stop the experiment automatically)

Example triggers:
  - Crash rate spike (10x baseline)
  - Dramatic increase in p99 latency
  - Surge in error rate

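2.3.2.3 Batch Path

Characteristics of batch:
  - Full data processing (clean, enrich)
  - Runs every few hours to every few days
  - Statistical tests applied
  - Trustworthy

Purpose:
  - Reliable input to decisions
  - Daily/weekly scorecards

Value of running both:
  - NRT: safety net
  - Batch: accurate measurement
  - The two layers complement each other

Intuition: the design rationale of the dual path

NRT and batch do different jobs at different layers.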

2.3.2.4 Limits of NRT

Why does NRT carry only simple metrics?
  - Spam filtering takes time (minutes)
  - Accurate statistical tests need sample size, which takes hours to accumulate
  - NRT works on raw data, so it is noisy

What NRT is sufficient for:
  - "Is this experiment dangerous?" (egregious problems)
  - Catching only obvious disasters
  - Subtle effects are left to batch

2.3.2.5 Limits of Batch

Why is batch not fast?
  - Full processing pipeline (clean, enrich, join)
  - Statistical tests (with multiple-testing correction)
  - Many metric × segment combinations

What batch is sufficient for:
  - Trustworthy decisions
  - Measuring subtle effects
  - Long-term analysis

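2.3.2.6 The essence of the hybrid

NRT path:
  - Safety net for users (stop immediately)
  - A fast answer to "is this experiment dangerous?"

Batch path:
  - Decision quality
  - An accurate answer to "should this experiment launch?"

The two paths answer different questions.
One path alone is not enough; running both is the standard for modern platforms.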

2.3.2.7 Example: Microsoft ExP

How Microsoft operates:
  NRT path:
    - Every 5 minutes
    - Simple counts/sums
    - SRM detection (immediate)
    - Crash rate alerts

  Batch path:
    - Hourly (intra-day)
    - Full statistical analysis
    - Many metrics
    - Daily scorecard

This dual path is the backbone of the platform.

2.3.3 Three Platform Recommendations

Per the authors (Ch.16.2): “we recommend that every experimentation platform:”

2.3.3.1 Recommendation 1: Common Metric Definitions

A standard vocabulary:
  - The same metric has the same definition in every pipeline
  - Same analysis, same result
  - Avoids the question "why does this result differ from that one?"

2.3.3.2 Recommendation 2: Implementation Consistency

Common implementation:
  - A single code base
  - Or testing that ensures separate implementations agree

Why:
  - If different pipelines give different results for the same metric
  - analysts get confused
  - and trust erodes
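
When two implementations of the same metric cannot share code, a consistency test on a common fixture is one way to keep them aligned. The sketch below compares a batch-style and a streaming-style click-through rate; both classes and the fixture are illustrative assumptions.

```python
import math

def ctr_batch(events):
    """Reference implementation: click-through rate over a complete list of events."""
    impressions = sum(1 for e in events if e == "impression")
    clicks = sum(1 for e in events if e == "click")
    return clicks / impressions if impressions else 0.0

class CtrStreaming:
    """Incremental implementation, as an NRT path might keep it."""

    def __init__(self):
        self.impressions = 0
        self.clicks = 0

    def update(self, event):
        if event == "impression":
            self.impressions += 1
        elif event == "click":
            self.clicks += 1

    def value(self):
        return self.clicks / self.impressions if self.impressions else 0.0

def implementations_agree(events, tolerance=1e-12):
    """Feed the same events to both implementations and compare the results."""
    streaming = CtrStreaming()
    for event in events:
        streaming.update(event)
    return math.isclose(ctr_batch(events), streaming.value(), abs_tol=tolerance)

fixture = ["impression", "click", "impression", "impression", "scroll", "click"]
agree = implementations_agree(fixture)
```

Running such a check in continuous integration, including edge cases like an empty event list, catches definition drift before it reaches a scorecard.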

2.3.3.3 Recommendation 3: Change Management

The authors note: “Changing the definition of an existing metric is often more challenging than additions or deletions.”

Challenges of changing a metric:

1. Backfilling past data?
   - Recompute historical data under the new definition?
   - Storage cost?
   - Compute cost?
   - How far back?

2. Confusion when comparing results:
   - "Old metric +5% vs new metric +3%"
   - Which one is right?

3. Notifying stakeholders:
   - Metric owners must be aware of the change
   - Impact on decisions

2.3.3.4 Enforcing the change process

Common practice:
  - Metric changes go through a review process
  - Impact analysis of the change is mandatory
  - Dual reporting (old + new) for a fixed period
  - An explicit deprecation date

This is the essence of metric governance: an organizational process, not just a code change.
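
Dual reporting with an explicit deprecation date can be sketched as a versioned metric registry. Everything here is hypothetical: the metric name, the two definitions, and the dates are invented to show the mechanism.

```python
from datetime import date

# Hypothetical registry: two versions of one metric, the old one with a deprecation date.
METRIC_VERSIONS = {
    "active_user_rate": [
        {"version": "v1", "deprecated_after": date(2024, 6, 30),
         "is_active": lambda u: u["sessions"] >= 1},
        {"version": "v2", "deprecated_after": None,
         "is_active": lambda u: u["sessions"] >= 1 and u["seconds_on_site"] >= 10},
    ]
}

def report(metric_name, users, today):
    """Return {version: value} for every version still inside its reporting window."""
    results = {}
    for spec in METRIC_VERSIONS[metric_name]:
        cutoff = spec["deprecated_after"]
        if cutoff is not None and today > cutoff:
            continue  # past the deprecation date: stop reporting the old definition
        active = sum(1 for u in users if spec["is_active"](u))
        results[spec["version"]] = active / len(users)
    return results

users = [
    {"sessions": 1, "seconds_on_site": 3},
    {"sessions": 2, "seconds_on_site": 120},
    {"sessions": 0, "seconds_on_site": 0},
    {"sessions": 3, "seconds_on_site": 45},
]
during_overlap = report("active_user_rate", users, date(2024, 6, 1))
after_cutoff = report("active_user_rate", users, date(2024, 7, 1))
```

During the overlap, scorecards show both values side by side, so stakeholders can see how much of a difference comes from the definition rather than from the treatment.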

2.4 Stage 3: Visualization (Preview)

Per the authors (Ch.16.3); details are in F-KOH16-2.

2.4.1 Core principles

1. Trust signals:
   - Hide the scorecard when a trust check such as SRM fails
   - Standard practice in Microsoft ExP

2. Hierarchy:
   - Emphasize the OEC and critical metrics
   - Show guardrail and quality metrics as well
   - Less important metrics may be hidden

3. Visibility of statistics:
   - Show relative change
   - Show statistical significance (color, indicator)
   - Allow filtering

4. Segment drill-down:
   - Automatically highlight interesting segments
   - Heterogeneous treatment effect (HTE) detection

2.4.2 Accessibility

The authors note: “scorecard visualizations should be accessible to people with various technical backgrounds.”

Audiences:
  - Marketers: business metrics only, simple visualization
  - Product Managers: OEC + segments
  - Data Scientists: all metrics, raw data access
  - Engineers: system metrics in addition
  - Executives: high-level summary

Layered presentation in the tool:
  - Default: the most important metrics
  - Drill-down: more detail
  - Expert mode: full statistical information

This accessibility is one axis of culture. A healthy decision process requires that decision makers can get at the analysis itself.

2.4.3 Multiple Testing

Per the authors (Ch.16.3): “Multiple testing (Romano et al. 2016) becomes more important as the number of metrics grow.”

The multiple testing problem:
  100 metrics × 0.05 alpha = 5 false positives expected on average, even when the treatment has no effect
  Experimenters ask "why did this metric move?" and the answer is mostly noise

Options:
  1. A stricter p-value threshold (0.001)
  2. Benjamini-Hochberg (FDR control)
  3. Bonferroni correction

Cross-reference:
  Covered in detail in Ch.17 (F-KOH17 not written; excluded from Phase F)
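
The three correction options listed above can be compared on a small set of p-values. This is a sketch using the textbook definitions of the Bonferroni correction and the Benjamini-Hochberg step-up procedure; the p-values are invented for illustration.

```python
def strict_threshold(p_values, threshold=0.001):
    """Option 1: flag only p-values below a fixed, stricter threshold."""
    return [p <= threshold for p in p_values]

def bonferroni(p_values, alpha=0.05):
    """Option 3: compare each p-value against alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Option 2: step-up procedure controlling the false discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p-value
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank                            # keep the largest passing rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

p_values = [0.0005, 0.012, 0.02, 0.035, 0.30]   # made-up p-values for five metrics

flags_strict = strict_threshold(p_values)
flags_bonferroni = bonferroni(p_values)
flags_bh = benjamini_hochberg(p_values)
```

On this input the fixed threshold and Bonferroni each flag only the first metric, while Benjamini-Hochberg flags the first four. That reflects the usual trade-off: controlling the false discovery rate is less conservative than controlling the chance of any false positive.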

2.4.4 Automatically identifying metrics of interest

Per the authors: “The platform can automatically identify these metrics by combining multiple factors, such as the importance of these metrics for the company, statistical significance, and false positive adjustment.”

Identification logic:
  factors = [
    metric_importance,  # Pre-defined company priority
    statistical_significance,  # p-value
    false_positive_adjustment,  # Multiple testing
    direction_alignment,  # Does it match the treatment's expected effect?
    magnitude  # Effect size
  ]

  score = combine(factors)
  highlight = score > threshold

This automated highlighting is a key decision-making tool when there are 1000+ metrics.
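
The source leaves `combine` unspecified, so the following is only one possible sketch. It assumes a gate on the multiple-testing-adjusted p-value followed by importance scaled by effect size; the field names, weights, and thresholds are all hypothetical, and direction alignment is left out for brevity.

```python
ALPHA = 0.05               # significance level applied to adjusted p-values
HIGHLIGHT_THRESHOLD = 0.5  # assumed cut-off for highlighting

def interest_score(metric):
    """Hypothetical combine(): 0 unless significant after adjustment,
    otherwise importance scaled by effect size relative to a practical threshold."""
    if metric["adjusted_p_value"] > ALPHA:
        return 0.0
    size = min(1.0, abs(metric["relative_change"]) / metric["practical_threshold"])
    return metric["importance"] * size

metrics = [
    {"name": "revenue", "importance": 1.0, "adjusted_p_value": 0.001,
     "relative_change": 0.02, "practical_threshold": 0.01},
    {"name": "scroll_depth", "importance": 0.2, "adjusted_p_value": 0.03,
     "relative_change": 0.05, "practical_threshold": 0.01},
    {"name": "latency", "importance": 0.8, "adjusted_p_value": 0.40,
     "relative_change": 0.01, "practical_threshold": 0.01},
]

highlighted = [m["name"] for m in metrics if interest_score(m) > HIGHLIGHT_THRESHOLD]
```

Only `revenue` is highlighted here: `scroll_depth` is significant but unimportant, and `latency` is important but not significant after adjustment.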

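3 Why It Is Needed

Without scaling.

Ad-hoc bottleneck: too few analysts, delayed decisions
Inconsistency: different pipelines give different results
Trust crisis: methodology is not validated
Cannot scale: 1000 experiments/year is out of reach

With scaling in place.

Self-service: engineers run the analysis themselves
Consistency: standard metrics and methodology
Trustworthy: validated pipeline
Scale: hundreds of experiments handled concurrently

This gap is the maturity transition, and closing it is a prerequisite for reaching Run and Fly.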

4 Applications: Analysis Platforms by Company

Company	NRT path	Batch path	Visualization
Microsoft (ExP)	5 min	hourly	Scorecard with SRM gate
Google	minute	hourly	Internal dashboard
LinkedIn (XLNT)	minute	hourly	Concourse dashboard
Bing	5 min	hourly	Integrated with ExP
Facebook	minute	hourly	Internal tools
Netflix	minute	hourly	Custom dashboards

Each company runs a dual path with integrated visualization, the standard architecture of the Run and Fly phases.

5 Next Posts in the Ch.16 Series

Post	Topic	KOH lines
F16-1	Data Processing + Computation	L:2757~2785
F16-2	Results Summary and Visualization	L:2786~2813

6 Code Example: NRT vs Batch Path Pseudocode

A comparison of the logic of the two paths.

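import time

# Helper names used below (trigger_alert, auto_shutoff, the stream object, and the
# batch-stage functions) are placeholders for platform services, not real APIs.

SRM_Z_THRESHOLD = 4          # alert only on an extreme imbalance
CRASH_RATIO_THRESHOLD = 5    # Treatment crash rate relative to Control
MIN_CRASHES = 20             # assumed floor so a handful of crashes cannot stop an experiment

# === NRT Path (simple, minute-level) ===
def nrt_path(raw_events_stream, experiments):
    """Real-time monitoring for egregious problems."""
    while True:
        # Raw events from the last minute (get_window is assumed not to block)
        events = raw_events_stream.get_window(seconds=60)

        for exp_id, exp_config in experiments.items():
            exp_events = [e for e in events if e.exp_id == exp_id]
            t_events = [e for e in exp_events if e.variant == "T"]
            c_events = [e for e in exp_events if e.variant == "C"]
            t_count, c_count = len(t_events), len(c_events)

            # Crude SRM check: z-score of the Treatment event count under a 50/50 split
            n = t_count + c_count
            if n > 0:
                srm_z = (t_count - n * 0.5) / (n * 0.25) ** 0.5
                if abs(srm_z) > SRM_Z_THRESHOLD:
                    trigger_alert(exp_id, f"SRM: T={t_count}, C={c_count}")

            # Crash-rate check
            t_crashes = sum(1 for e in t_events if e.is_crash)
            c_crashes = sum(1 for e in c_events if e.is_crash)
            t_crash_rate = t_crashes / max(1, t_count)
            c_crash_rate = c_crashes / max(1, c_count)
            # Require a minimum number of crashes; otherwise a zero Control rate
            # would let a single Treatment crash shut the experiment off.
            if t_crashes >= MIN_CRASHES and t_crash_rate > c_crash_rate * CRASH_RATIO_THRESHOLD:
                trigger_alert(exp_id, f"Crash spike: {t_crash_rate:.4f} vs {c_crash_rate:.4f}")
                auto_shutoff(exp_id)  # stop the experiment automatically

        time.sleep(60)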

# === Batch Path (full pipeline, hourly to daily) ===
def batch_path(raw_logs, experiments, run_time):
    """Trustworthy daily/hourly analysis."""
    # Stage 1: Data Processing
    sorted_data = sort_by_user_timestamp(raw_logs)
    cleaned_data = clean_bots_and_anomalies(sorted_data)
    enriched_data = enrich_with_dimensions(cleaned_data)

    # Stage 2: Data Computation
    # Assumed to return a pandas DataFrame with one row per user
    per_user_stats = compute_per_user_metrics(enriched_data)

    scorecards = {}
    for exp_id, exp_config in experiments.items():
        # User-experiment mapping: assumed DataFrame with columns user_id, variant.
        # The join is what attaches the variant column to the per-user stats.
        assignments = get_users_in_experiment(exp_id)
        exp_data = per_user_stats.merge(assignments, on="user_id", how="inner")

        # Compute scorecard
        scorecard = {
            "exp_id": exp_id,
            "metrics": {},
            "segments": {},
            "trust_checks": {},
        }

        # Compute every metric
        for metric_name in exp_config["metrics"]:
            t_data = exp_data[exp_data["variant"] == "T"][metric_name]
            c_data = exp_data[exp_data["variant"] == "C"][metric_name]
            c_mean = c_data.mean()
            scorecard["metrics"][metric_name] = {
                # Relative change; undefined when the Control mean is zero
                "delta": (t_data.mean() - c_mean) / c_mean if c_mean != 0 else float("nan"),
                "p_value": compute_p_value(t_data, c_data),
                "ci_95": compute_ci(t_data, c_data, 0.95),
            }

        # Trust checks (SRM etc.)
        scorecard["trust_checks"]["srm"] = check_srm(exp_data)
        scorecard["trust_checks"]["aa"] = check_aa(exp_data)

        # Segment drill-down
        for segment in exp_config["segments"]:
            scorecard["segments"][segment] = compute_segment_analysis(exp_data, segment)

        scorecards[exp_id] = scorecard

    # Stage 3: Visualization, gated per experiment so that one failing
    # experiment does not put a warning on every scorecard
    for exp_id, scorecard in scorecards.items():
        if scorecard["trust_checks"]["srm"]["passed"]:
            publish_scorecards({exp_id: scorecard})
        else:
            publish_scorecards_with_warning({exp_id: scorecard})

    return scorecards

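This pseudocode outlines the dual-path architecture: NRT handles simple alerts, batch handles the full analysis.

Intuition: what the pseudocode shows

6.0.0.1 The simplicity of the NRT path

NRT logic:
  - Raw event stream
  - Simple counts, SRM, crash rate
  - Threshold-based alerts
  - Auto shut-off

Lines of code: ~30
Run time: minutes
Complexity: low

6.0.0.2 The complexity of the batch path

Batch logic:
  - Sort, join, clean, enrich
  - Per-user metrics, segments
  - Statistical tests, multiple-testing correction
  - Trust checks, visualization

Lines of code: hundreds (tens of thousands in a real system)
Run time: hours to days
Complexity: very high

6.0.0.3 The value of running both paths

NRT: safety (catch problems immediately)
Batch: accuracy (trustworthy decisions)

Why they run separately:
  - If NRT were made as accurate as batch, its latency would rise and it would no longer work as a safety net
  - If batch were made as fast as NRT, its accuracy would drop and it would no longer be trustworthy

Each is best at its own specialty.

This dual path is the core of a modern A/B platform's architecture. A single path cannot serve both purposes at once.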

7 Related Topics

Related chapters

F8-*: Ch.8 Institutional Memory, input to meta-analysis
F19-*: Ch.19 A/A Test, trust check
F21-*: Ch.21 SRM, automatic detection
