Kwangmin Kim - Trustworthy Triggering · 3 Common Pitfalls

1 Definition

Definition: Trustworthy triggering checks and three pitfalls

The trust framework from Kohavi (2020), Ch.20.8~20.9.

1.0.0.1 Two checks

Check	What it examines	What it can reveal
Triggered SRM	Sample ratio among triggered users	Counterfactual bias
Complement A/A	A/A comparison on non-triggered users	Incomplete trigger condition

1.0.0.2 Three pitfalls

#	Pitfall	Key point
1	Tiny Segment	Diluted impact is small, with an exception when the idea generalizes (MSN Hotmail)
2	Triggered Lifetime	Include all future activity after the first trigger
3	Counterfactual Latency	Logging cost hides the production impact

Quote (Ch.20.8): “There are two checks you should do to ensure a trustworthy use of triggering. We have found these to be highly valuable and they regularly point to issues.”

Key insight: Triggering is only as valuable as it is trustworthy. Implementation bugs (faulty counterfactual logging, missing lifetime tracking) silently distort the analysis. The triggered SRM check and the complement A/A check are the tools for detecting them.

2 Concepts and Principles

2.1 Trustworthy Check 1 — Triggered SRM

Stated by the authors (Ch.20.8).

2.1.1 The standard SRM check

SRM in a standard A/B test:
  Experiment setup: 50/50
  Actual: does the observed split match 50/50?

  N_T / N_total ≈ 0.50
  Z-score: (observed - expected) / SE
  |Z| > threshold (e.g., 4) → SRM detected
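
As a minimal sketch of the Z-score above, assuming a binomial standard error for the Treatment share (the function name `srm_z_score` and the user counts are hypothetical, chosen only for illustration):

```python
import math

def srm_z_score(n_t: int, n_c: int, expected_ratio: float = 0.5) -> float:
    """Z-score of the observed Treatment share against the designed ratio."""
    n = n_t + n_c
    observed = n_t / n
    # Binomial standard error of a proportion under the designed ratio.
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    return (observed - expected_ratio) / se

# Hypothetical counts for a 50/50 design.
z_ok = srm_z_score(50_210, 49_790)   # small imbalance
z_bad = srm_z_score(51_000, 49_000)  # large imbalance

for label, z in [("ok", z_ok), ("bad", z_bad)]:
    verdict = "SRM detected" if abs(z) > 4 else "no SRM"
    print(f"{label}: Z = {z:.2f} -> {verdict}")
```

The same function can be applied to the triggered counts N_θT and N_θC; only the inputs change.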

2.1.2 What the triggered SRM checks

SRM within the triggered subset:
  Triggered T: N_θT
  Triggered C: N_θC
  Expected: N_θT ≈ N_θC (same trigger condition)

  N_θT / (N_θT + N_θC) ≈ 0.50
  Run the SRM check on these counts

2.1.2.1 Why it is critical

The overall SRM can pass (50/50 matches) while the triggered SRM fails:
  → The trigger condition itself is biased

Possible causes:
  1. Implementation bug in counterfactual logging
     - The counterfactual is missing for some Treatment users
     - Control fires the trigger at a different rate

  2. Latency difference
     - Evaluation is slower in Treatment
     - Some users time out before triggering
     - Trigger events differ between variants

  3. Filter difference
     - Filters produce different results in Treatment and Control
     - The triggered sets end up unequal

2.1.3 Detection mechanism

Triggered SRM check:
  Step 1: Classify every user as triggered or not
  Step 2: Count triggered T and triggered C
  Step 3: Chi-square test
    Expected: 50/50 split
    Observed: actual N_θT, N_θC
    Compute the test statistic (for two groups, chi-square = Z²)
  Step 4: Alert when the threshold is violated (|Z| > 4 or p < 0.001)

2.1.3.1 Investigation procedure

When a triggered SRM is found:
  Step 1: Inspect the trigger event log
    - Frequency of trigger events in Treatment
    - Frequency of trigger events in Control

  Step 2: Verify counterfactual logging
    - Are both variants evaluated in both arms?
    - Is the output logged correctly?

  Step 3: Deep dive on a subset of the user pool
    - Which segments do triggered users belong to?
    - How do they differ from non-triggered users?

  Step 4: Confirm the bug, then fix it

2.2 Trustworthy Check 2 — Complement A/A

The authors emphasize: “Generate a scorecard for never triggered users, and you should get an A/A scorecard.”

2.2.1 Mechanism

Complement subset:
  Non-triggered users (treatment effect = 0 by definition)
  Compare Treatment vs Control

Expected:
  No effect (the result of an A/A test)
  About 5% of metrics significant at the 0.05 level (false positives)
  P-values uniformly distributed
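
To make the expected A/A behavior concrete, here is a small simulation sketch. It assumes 1,000 independent metrics with identical distributions in both arms (all numbers are hypothetical) and checks that roughly 5% come out significant at 0.05 and that the p-values look uniform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_metrics, n_users = 1000, 2000  # hypothetical sizes

p_values = np.empty(n_metrics)
for i in range(n_metrics):
    # Both arms drawn from the same distribution: a true A/A comparison.
    t = rng.normal(50, 15, n_users)
    c = rng.normal(50, 15, n_users)
    p_values[i] = stats.ttest_ind(t, c, equal_var=False).pvalue

false_positive_rate = (p_values < 0.05).mean()
# Kolmogorov-Smirnov test of the p-values against Uniform(0, 1).
ks_stat, ks_p = stats.kstest(p_values, "uniform")

print(f"share of metrics with p < 0.05: {false_positive_rate:.3f}")
print(f"KS test against uniform: statistic = {ks_stat:.3f}, p = {ks_p:.3f}")
```

On a real complement scorecard, a false positive share well above 5% or a p-value histogram piled up near zero would suggest that Treatment reaches users outside the trigger.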

2.2.2 Why it is critical

Complement A/A passes:
  → Treatment effect on non-triggered users = 0
  → Trigger condition is correct

Complement A/A fails (significant effect):
  → Treatment also affects non-triggered users
  → Trigger condition is incomplete
  → Possible implementation bug

2.2.2.1 Real-world example

Scenario:
  Trigger condition: weather widget shown on weather queries

Complement A/A:
  Treatment vs Control on non-weather queries
  Expected: no effect

Found:
  Engagement -2% (Treatment affects non-weather queries too!)
  Possible causes:
    - Page latency: the weather widget's load time affects every page
    - Code path bug: queries unrelated to weather still execute part of the weather code path

Investigation:
  Page latency analysis:
    - Latency +50ms on all Treatment queries
    - The weather widget loads in the background
  → Trigger condition is incomplete
  → The latency impact reaches every query

Fix:
  Widen the trigger condition:
    Trigger every user whose page load differs under Treatment
  Or change the architecture:
    Lazy-load the weather widget (do not load it unless it is shown)

This check is what brings the hidden bug to the surface.

2.2.3 Standard industry practice

Modern platforms:
  - Every triggered analysis automatically runs a complement A/A
  - Pass: show the scorecard
  - Fail: warning + analyst review

Microsoft ExP:
  - Complement A/A is on by default
  - The scorecard is hidden when too many metrics fail

This enforcement is the standard for trustworthy triggering.

Intuition: the two checks as layered defense

2.2.3.1 Division of labor

Check 1, Triggered SRM:
  Verifies sample composition
  "Bias in the trigger itself"

Check 2, Complement A/A:
  Verifies completeness of the trigger
  "Impact on users outside the trigger"

2.2.3.2 Different failure modes, different causes

Triggered SRM fails:
  - Counterfactual logging implementation bug
  - Latency of the trigger event
  - Filter difference

Complement A/A fails:
  - Incomplete trigger condition (Treatment has effects outside it)
  - Architectural spillover
  - Bug

2.2.3.3 Combined analysis

Both pass:
  → Trustworthy
  → Ready for a decision

Triggered SRM fails, Complement passes:
  → Trigger condition itself is OK
  → Bug in counterfactual logging
  → Fix the logging

Triggered SRM passes, Complement fails:
  → Triggered sample is sound
  → There is impact outside the trigger
  → Isolate the architecture or widen the trigger

Both fail:
  → Major issue
  → Comprehensive investigation

These four cases form the mental model for trusting a triggered analysis.
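
The four cases can be written down as a small lookup, sketched below. The function name `diagnose` and the wording of the returned messages are illustrative choices, not part of any platform's API.

```python
def diagnose(triggered_srm_pass: bool, complement_aa_pass: bool) -> str:
    """Map the two trust-check outcomes to the suggested next step."""
    if triggered_srm_pass and complement_aa_pass:
        return "trustworthy: ready for a decision"
    if not triggered_srm_pass and complement_aa_pass:
        return "suspect counterfactual logging: fix the logging"
    if triggered_srm_pass and not complement_aa_pass:
        return "impact outside the trigger: isolate architecture or widen trigger"
    return "major issue: comprehensive investigation"

# Print the full 2x2 matrix.
for srm_pass in (True, False):
    for aa_pass in (True, False):
        print(f"SRM pass={srm_pass}, A/A pass={aa_pass} -> {diagnose(srm_pass, aa_pass)}")
```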

2.3 Pitfall 1 — Tiny Segment Generalization

Stated by the authors (Ch.20.9).

2.3.1 The pitfall

Scenario:
  Triggered = 0.1% of users
  Triggered effect: +5%
  Diluted impact: +5% × 0.1% = +0.005%
  → Negligible

Decision:
  Tiny effect → reject? Or explore whether it generalizes?

2.3.1.1 The parallel with Amdahl’s Law

The authors write: “In computer architecture, Amdahl’s law is often mentioned as a reason to avoid focusing on speeding up parts of the system that are a small portion of the overall execution time.”

Amdahl's law:
  Total speedup = 1 / ((1 - f) + f/s)

  Where:
    f = fraction of time spent on optimized part
    s = speedup of optimized part

  If f = 0.001 (0.1%):
    Maximum speedup (s → ∞) = 1 / 0.999 ≈ 1.001
    → At most a 0.1% improvement

A/B parallel:
  If trigger rate = 0.001:
    Maximum overall lift ≈ 0.001 × triggered effect
    Even a 100% triggered effect yields only 0.1% overall

This limit is the general pitfall of tiny segments.
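
The arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the overall-lift function uses the same rough approximation as the text (trigger rate times triggered effect).

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Total speedup when a fraction f of the time is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

def approx_overall_lift(trigger_rate: float, triggered_effect: float) -> float:
    """Rough overall lift: trigger rate times the relative triggered effect."""
    return trigger_rate * triggered_effect

# Optimizing 0.1% of the execution time, even with an enormous speedup.
speedup = amdahl_speedup(f=0.001, s=1e9)
print(f"max total speedup: {speedup:.6f}")

# A 100% lift on a 0.1% segment.
lift = approx_overall_lift(trigger_rate=0.001, triggered_effect=1.0)
print(f"overall lift: {lift:.3%}")
```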

2.3.2 Exception: Generalization

The authors emphasize (Ch.20.9, MSN example): “There is one important exception to this rule, which is generalizations of a small idea.”

2.3.2.1 MSN UK Hotmail example (2008)

Experiment:
  Site: MSN UK
  Change: Hotmail link opens in a new tab (Control: same tab)

Trigger:
  User clicks the Hotmail link

Triggered population:
  ~5% of MSN visitors
  Hotmail users only

Triggered effect:
  +8.9% engagement (clicks/user on homepage)
  A very large effect

Diluted impact:
  ~8.9% × 5% = ~0.45%
  Modest but visible

2.3.2.2 Generalization process

2008: MSN UK Hotmail link in new tab
  → +8.9% on triggered

Generalized hypothesis:
  "Opening external links in a new tab → engagement ↑"
  Reason: users do not leave the site

Iterative experiments:
  - Other external links in a new tab
  - Search results in a new tab
  - MSN sites in various countries

Multi-year evolution:
  Each experiment adds confirmation
  Support for the generalization builds incrementally

2.3.2.3 2011: MSN US Search Results

Experiment:
  MSN US, 12M users
  Search results open in a new tab

Triggered:
  Users who click search results
  Larger segment than original Hotmail

Triggered effect:
  +5% engagement (clicks per user)
  Massive at scale

Diluted impact:
  +5% across a much larger user base
  → One of the company's major wins

The authors emphasize (Kohavi et al. 2014, Kohavi and Thomke 2017): “This was one of the best features that MSN ever implemented in terms of increasing user engagement.”

2.3.2.4 Lesson: the value of a tiny segment

General case:
  Tiny segment = insufficient ROI
  Reject

Special case (generalization):
  Tiny segment = small test of a big idea
  Large ROI after generalizing

Criteria:
  - Is the mechanism general?
  - Does it apply to other use cases?
  - Is there a path of iterative experiments?

2.3.2.5 Decision framework

When a tiny-segment effect is found:
  Step 1: Analyze the mechanism behind the effect
    "Why does this effect occur?"
  Step 2: Identify other use cases
    "Where else does a similar mechanism apply?"
  Step 3: Design generalization experiments
    A separate experiment per use case
  Step 4: Scale iteratively
    Move to larger segments once confirmed
  Step 5: Final large launch

The MSN example is the model case for this path.

2.4 Pitfall 2: Lifetime Tracking of Triggered Users

Stated by the authors (Ch.20.9).

2.4.1 The pitfall

The authors emphasize: “As soon as a user triggers, the analysis must include them going forward.”

2.4.1.1 Scenario: daily analysis

Scenario:
  Treatment: a very bad experience (e.g., a bug)
  Users visit less after triggering

Day 1: user triggers
  - Treatment effect is measured (bad experience)
Day 2~7: user visits less
  - Pattern: "Treatment users visit less"
  - Long-term harm from Treatment

With daily analysis:
  Analyzing Day 1 only:
    - Captures only the short-term effect of Treatment
    - Ignores the impact on long-term retention
  → Underestimates the harm

2.4.1.2 Scenario: session-based analysis

Scenario:
  Treatment increases page load latency
  User abandonment ↑

Session 1: trigger
  - User is frustrated and abandons
Session 2~: user does not return
  - Sessions per user drop

With per-session analysis:
  Only session-level metrics:
    "Conversion rate of Treatment sessions"
    Only the triggering session is analyzed
  Changes in visits-per-user show up only as a separate metric

Issue:
  Visits-per-user is itself a Treatment effect
  Per-session analysis ignores this effect
  → Underestimates the harm
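
A toy calculation makes the gap visible. The numbers below are hypothetical: Treatment users return for fewer sessions, but every session that does happen converts at the same rate as in Control, so a per-session view shows nothing while the per-user view shows the full loss.

```python
def summarize(users: int, sessions_per_user: float, conversion_per_session: float) -> dict:
    """Return the per-session conversion rate and conversions per user."""
    sessions = users * sessions_per_user
    conversions = sessions * conversion_per_session
    return {
        "conversion_per_session": conversions / sessions,
        "conversions_per_user": conversions / users,
    }

# Hypothetical inputs: only the number of sessions per user differs.
control = summarize(users=1000, sessions_per_user=5, conversion_per_session=0.10)
treatment = summarize(users=1000, sessions_per_user=3, conversion_per_session=0.10)

per_session_change = (
    treatment["conversion_per_session"] / control["conversion_per_session"] - 1
)
per_user_change = (
    treatment["conversions_per_user"] / control["conversions_per_user"] - 1
)
print(f"per-session change: {per_session_change:+.0%}")
print(f"per-user change:    {per_user_change:+.0%}")
```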

2.4.2 Solution: include all future activity of triggered users

The authors caution: “If you analyze users by day or by session, you will underestimate the Treatment effect.”

Implementation:
  On a user's first trigger → set a triggered flag
  All subsequent activity of that user → included in the triggered analysis

  Future activity:
    - Other actions in the same session
    - Future sessions
    - Future days
    - The user's lifetime (within the experiment period)

2.4.2.1 Visits-per-Triggered-User Metric

The authors emphasize:
  "If visits-per-user has not significantly changed statistically, you can get statistical
  power by looking at triggered visits."

Metric definition:
  Visits-per-triggered-user:
    Visit count of each triggered user (within the experiment period)

  Measuring the Treatment effect:
    Mean visit count of triggered T
    Mean visit count of triggered C
    Difference

This metric is the standard for lifetime tracking.
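
One possible way to implement the sticky flag and the metric with pandas is sketched below. The event log, its column names (`user_id`, `variant`, `day`, `triggered_event`), and the convention that each row is one visit are all assumptions made for the illustration.

```python
import pandas as pd

# Hypothetical visit log: one row per visit.
events = pd.DataFrame({
    "user_id":         [1, 1, 1, 2, 2, 2, 3, 4, 4],
    "variant":         ["T", "T", "T", "C", "C", "C", "T", "C", "C"],
    "day":             [1, 2, 3, 1, 2, 3, 1, 1, 2],
    "triggered_event": [False, True, False, True, False, False, True, False, False],
})

# Step 1: the day of each user's first trigger (the sticky flag).
first_trigger = (
    events[events["triggered_event"]]
    .groupby("user_id", as_index=False)["day"]
    .min()
    .rename(columns={"day": "first_trigger_day"})
)

# Step 2: keep every visit from the first trigger onward, not only trigger visits.
tracked = events.merge(first_trigger, on="user_id", how="inner")
tracked = tracked[tracked["day"] >= tracked["first_trigger_day"]]

# Step 3: visits per triggered user, averaged within each variant.
visits = tracked.groupby(["variant", "user_id"]).size()
visits_per_triggered_user = visits.groupby(level="variant").mean()
print(visits_per_triggered_user)
```

User 4 never triggers and drops out of the analysis, while user 1 contributes the two visits from the trigger day onward. In a real pipeline the comparison would use timestamps rather than whole days.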

2.4.3 Example: E-commerce Free Shipping

Scenario:
  Treatment: free shipping once the cart reaches $25
  Triggered users: those with a cart between $25 and $35

Analysis option 1 (wrong):
  Only the conversion of the triggered session
  - Triggered-session conversion +5% (short term)
  - Other sessions ignored

Analysis option 2 (right):
  All future activity of triggered users
  - First trigger session: +5% conversion
  - Later sessions: +3% (residual effect from the perception of free shipping)
  - Total lifetime: +4% per user

Option 2 gives the true effect.

Lifetime tracking is at the core of a trustworthy triggered analysis.

2.5 Pitfall 3: Performance Impact of Counterfactual Logging

Stated by the authors (Ch.20.9; extends the ML triggering discussion in F-KOH20-2).

2.5.1 The pitfall restated

During counterfactual logging:
  Users in both variants run inference on both V1 and V2
  → Both variants pay the same latency cost

After launch to production:
  V1 users run only V1 (faster)
  V2 users run only V2 (slower if V2 is heavier)

Experiment ("lab") result:
  V2's advantage (recommendation quality)
  → +5% engagement

Production reality:
  V2's advantage (recommendation quality)
  V2's disadvantage (higher latency)
  → Net effect is lower than in the experiment

2.5.2 Solution 1: Awareness

Stated by the authors: “Awareness of this issue. The code can log the timing for each model so that they can be directly compared.”

Implementation:
  Log the timing of each model inference:
    - V1 inference time
    - V2 inference time
  With counterfactual logging, record the timing of both

  Comparison:
    Mean latency of V1
    Mean latency of V2
    Difference

  Decision:
    If V2 is 50% slower than V1, expect a user-facing latency impact at production launch
    Factor this latency cost into the decision
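
A minimal sketch of per-model timing during counterfactual logging follows. The two model functions are stand-ins invented for the example, and `time.perf_counter` is used as the timer; a production system would more likely write the timings to its telemetry pipeline than to an in-memory dictionary.

```python
import time
from statistics import mean

def model_v1(features):
    """Stand-in for the current model (hypothetical)."""
    return sum(features)

def model_v2(features):
    """Stand-in for the heavier new model (hypothetical)."""
    return sum(v * v for v in features)

def timed(model_fn, features):
    """Run a model and return (output, elapsed milliseconds)."""
    start = time.perf_counter()
    output = model_fn(features)
    return output, (time.perf_counter() - start) * 1000.0

timing_log = {"v1": [], "v2": []}

def serve_with_counterfactual(features, assigned):
    """Run both models, log both timings, serve only the assigned output."""
    out_v1, ms_v1 = timed(model_v1, features)
    out_v2, ms_v2 = timed(model_v2, features)
    timing_log["v1"].append(ms_v1)
    timing_log["v2"].append(ms_v2)
    return out_v1 if assigned == "v1" else out_v2

for _ in range(100):
    serve_with_counterfactual(list(range(1000)), assigned="v2")

avg_ms = {name: mean(samples) for name, samples in timing_log.items()}
print(avg_ms)
```

Comparing `avg_ms["v1"]` with `avg_ms["v2"]` gives the direct latency comparison the quote refers to.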

2.5.3 Solution 2: A/A’/B Experiment

Stated by the authors: “Run an A/A’/B experiment, where A is the original system (Control), A’ is the original system with counterfactual logging, and B is the new Treatment with counterfactual logging. If A and A’ are significantly different, you can raise an alert that counterfactual logging is making an impact.”

2.5.3.1 3-arm experiment

Setup:
  A (~33%): Original V1, no counterfactual
  A' (~33%): V1 + counterfactual logging (V1 + V2 inference)
  B (~33%): V2 + counterfactual logging (V1 + V2 inference)

Comparisons:
  A vs A':
    Measures the cost of counterfactual logging
    Same V1 inference + an extra V2 inference
    Difference = latency impact of counterfactual logging

  A' vs B:
    V2 effect with the same logging cost in both arms
    Only what the user is shown differs
    The counterfactual cost cancels out
    → V2 effect net of logging

  A vs B:
    Expected effect if V2 launches with counterfactual logging still on
    Includes the counterfactual logging cost

2.5.3.2 Using the results

A vs A' significant:
  → Counterfactual logging cost is large
  → Logging must be removed at production launch
  → Or reduce the logging sample rate (e.g., 1%)

A vs A' not significant:
  → Counterfactual logging cost is minimal
  → Production launch is possible (with logging)
  → The A' vs B effect is the expected production effect
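
The three comparisons and the A vs A' alert could be computed along these lines. This is a sketch on simulated data: the arm means are assumed effects (a -2% logging penalty and a +4% V2 effect on top of it), and Welch's t-test is used as one reasonable choice of test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 50_000  # users per arm (hypothetical)

# Simulated engagement per user; the means encode the assumed effects.
arms = {
    "A": rng.normal(50.00, 15, n),   # V1, no counterfactual logging
    "A'": rng.normal(49.00, 15, n),  # V1 + logging: assumed -2% from latency
    "B": rng.normal(50.96, 15, n),   # V2 + logging: assumed +4% over A'
}

def compare(control, treated):
    """Relative lift in percent and Welch t-test p-value."""
    lift = (treated.mean() - control.mean()) / control.mean() * 100
    p = stats.ttest_ind(treated, control, equal_var=False).pvalue
    return lift, p

results = {
    "A vs A'": compare(arms["A"], arms["A'"]),
    "A' vs B": compare(arms["A'"], arms["B"]),
    "A vs B": compare(arms["A"], arms["B"]),
}
for name, (lift, p) in results.items():
    print(f"{name}: lift {lift:+.2f}%, p = {p:.2g}")

# Alert when counterfactual logging itself moves the metric.
logging_has_impact = results["A vs A'"][1] < 0.001
print("counterfactual logging alert:", logging_has_impact)
```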

2.5.3.3 Example: V2 is 50% slower than V1

(All numbers are illustrative; the latency penalties are assumed, not derived.)

A (V1 only): latency 100ms
A' (V1 + V2 logging): latency 150ms (+50ms)
B (V2 + V1 logging): latency 175ms (+75ms)

A vs A':
  Latency +50ms
  Engagement -2% (latency penalty)
  → Counterfactual cost: -2%

A' vs B:
  Latency +25ms
  Engagement +5% (V2 quality) - 1% (latency)
  → V2 effect with equal logging in both arms: +4%

Production prediction (V1 only vs V2 only, no logging):
  Production V1: 100ms
  Production V2: 150ms
  Engagement: V2 +5% (quality), -3% (latency)
  Net: +2%

Decision:
  V2 effect in the experiment: +4% (with logging)
  V2 effect in production: +2% (no logging, V2's own latency)

  Launch decision:
    Is a +2% lift acceptable?
    Yes → launch
    No → optimize V2 latency first

The A/A’/B design is how the experiment result is checked against production reality.

2.5.4 Limits of a Shared Control

Stated by the authors (partly covered in F-KOH20-2).

Standard shared control:
  One Control shared by several experiments
  Control users are the baseline for every experiment

With counterfactual logging:
  Control also runs the Treatment inference (counterfactual)
  → Cost in Control increases
  → Control no longer matches the baseline of the other experiments

Options:
  Option 1: Give up the shared control
    A separate control per experiment
    Variance ↑ (smaller sample size)

  Option 2: Detect the trigger condition another way
    Trigger without the counterfactual
    Suboptimal or erroneous

  Option 3: Sample-based counterfactual
    Counterfactual logging for only 1% of users
    Estimate from the sample at analysis time
    Reduced cost

Each option has its trade-offs.
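
For Option 3, one common way to pick a stable 1% of users is to hash the user ID, sketched below. The salt string, the function name, and the use of SHA-256 are assumptions for the illustration; the point is only that the same user always lands on the same side of the cutoff.

```python
import hashlib

def in_counterfactual_sample(user_id: str, salt: str = "cf-logging-v1",
                             rate: float = 0.01) -> bool:
    """Deterministically select about `rate` of users for counterfactual logging."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < rate

# Hypothetical user IDs.
user_ids = [f"user-{i}" for i in range(100_000)]
sampled = [u for u in user_ids if in_counterfactual_sample(u)]
sample_share = len(sampled) / len(user_ids)
print(f"sampled share: {sample_share:.3%}")
```

Only the sampled users would pay the extra inference cost, and trigger rates estimated on the sample would then be scaled up at analysis time.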

Thought experiment: ignoring the three pitfalls

Assumption: the analyst ignores all three pitfalls.

2.5.4.1 Ignoring Pitfall 1

A +5% effect is found in a tiny segment
Diluted: +0.005% (negligible)
Rejected without exploring generalization

Lost opportunity:
  Parallel to MSN Hotmail
  Generalized, it could have been one of the company's major wins
  → Lost innovation

2.5.4.2 Ignoring Pitfall 2

Daily analysis only:
  Only the triggered session is analyzed
  Short-term effect of Treatment

Long-term harm ignored:
  Reduced user visits
  Lost retention
  → Harm is underestimated

Decision:
  +5% lift on a single day → launch
  Six months into production:
    Long-term retention -10%
    Net negative

2.5.4.3 Ignoring Pitfall 3

The cost of counterfactual logging is ignored
  Lab: V2 +5% lift expected
  Production: V2 +2% (including latency cost)

Decision quality:
  Gap between expected and actual
  Damaged stakeholder trust
  Reputation risk for future ML A/B tests

2.5.4.4 Combined outcome

If all three pitfalls are ignored:
  - Lost innovation opportunities
  - Wrong launches
  - Accumulating long-term harm
  - Weakened trust in the platform

2.5.4.5 Remedies

1. Pitfall 1: explore generalization of the tiny segment
   - Analyze the mechanism
   - Identify other use cases
   - Iterative experimentation

2. Pitfall 2: lifetime tracking
   - Include all future activity after the first trigger
   - Prefer per-user metrics

3. Pitfall 3: A/A'/B experiment
   - Verify the cost of counterfactual logging
   - Predict production reality

These three lessons are the standard for advanced triggering.

3 Why It Is Needed

Without the trust checks and awareness of the three pitfalls:

Hidden bias in the trigger condition: wrong analysis
Lost ROI from tiny segments: generalization opportunities ignored
No lifetime tracking: long-term harm underestimated
Hidden counterfactual cost: gap between experiment and production

With them in place:

Trustworthy triggering: trustworthy analysis
A framework for generalization: a path to innovation
Accurate lifetime tracking: the long-term truth
Production reality: accurate predictions

This framework marks the mature stage of advanced triggering.

4 Application: Trigger Operations at Microsoft Bing

How Bing operates triggered analysis:

Automated scorecard layers:
  Layer 1: Triggered effect (Δ_θ)
  Layer 2: Diluted effect (Δ_θ / M_ωC × τ, with Δ_θ the absolute triggered effect and M_ωC the overall Control mean)
  Layer 3: Triggered SRM check
  Layer 4: Complement A/A check
  Layer 5: Lifetime tracking (visits-per-triggered-user)
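
For the diluted-effect layer, the sketch below checks numerically how a triggered effect translates into an overall relative effect. It assumes a per-user additive metric, a triggered population whose baseline is higher than average, and, to remove sampling noise, uses the same simulated users for both the Control and the Treatment outcome. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
triggered = rng.uniform(size=n) < 0.10  # trigger rate of about 10%

# Triggered users have a higher baseline (mean 100) than the rest (mean 40).
control = np.where(triggered, rng.normal(100, 10, n), rng.normal(40, 10, n))
delta_theta = 5.0  # absolute effect, applied to triggered users only
treatment = control + np.where(triggered, delta_theta, 0.0)

m_omega_c = control.mean()             # overall Control mean
m_theta_c = control[triggered].mean()  # triggered Control mean
tau = triggered.mean()                 # trigger rate

direct = (treatment.mean() - m_omega_c) / m_omega_c  # measured on everyone
diluted = delta_theta / m_omega_c * tau              # absolute effect / overall mean x rate
naive = delta_theta / m_theta_c * tau                # relative triggered effect x rate

print(f"direct overall effect: {direct:.4%}")
print(f"diluted estimate:      {diluted:.4%}")
print(f"naive estimate:        {naive:.4%}")
```

When the triggered users' baseline differs from the overall baseline, simply multiplying the relative triggered effect by the trigger rate no longer matches the overall effect; here it understates it.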

Review of the trigger condition design:
  - Symmetric (evaluated in both T and C)
  - Counterfactual logging (for ML experiments)
  - Sticky lifetime flag

A/A'/B as the standard:
  ML A/B tests always run as A/A'/B
  Accurate production predictions
  Enforces decision quality

Examples of issues found:
  - Counterfactual logging missing for some users (Triggered SRM fail)
  - Latency impact (A vs A' significant)
  - Incomplete trigger condition (Complement A/A fail)

This is how Bing keeps its triggered analysis trustworthy.

5 Code Example: Triggered SRM + Complement A/A

An implementation of the automated checks.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated experiment data. One million users, so that the small trigger
# bias injected below is large enough to be detected at p < 0.001.
n_users = 1_000_000
treatment = rng.choice([0, 1], n_users, p=[0.5, 0.5])

# Trigger condition (10% of users)
triggered = rng.uniform(0, 1, n_users) < 0.10

# Simulated implementation bug: Treatment fires the trigger slightly more
# often (an extra 0.5 percentage points of Treatment users).
trigger_bias = (treatment == 1) & (rng.uniform(0, 1, n_users) < 0.005)
triggered_with_bias = triggered | trigger_bias

# Engagement metric: +10% for triggered Treatment users, no effect otherwise
baseline = rng.normal(50, 15, n_users)
treatment_effect = np.where(
    triggered_with_bias & (treatment == 1),
    baseline * 0.10,
    0
)
engagement = baseline + treatment_effect


def srm_check(n_t, n_c):
    """Chi-square goodness-of-fit test of the two counts against a 50/50 split."""
    chi2, p = stats.chisquare([n_t, n_c])
    return chi2, p


# === Check 1: Overall SRM ===
print("=== Overall SRM Check ===")
n_t_overall = (treatment == 1).sum()
n_c_overall = (treatment == 0).sum()
chi2_overall, p_overall = srm_check(n_t_overall, n_c_overall)
print(f"N_T: {n_t_overall}, N_C: {n_c_overall}")
print(f"Chi-square: {chi2_overall:.2f}, p: {p_overall:.4f}")
print(f"Result: {'PASS' if p_overall > 0.001 else 'FAIL'}")

# === Check 2: Triggered SRM ===
print("\n=== Triggered SRM Check ===")
n_t_triggered = (triggered_with_bias & (treatment == 1)).sum()
n_c_triggered = (triggered_with_bias & (treatment == 0)).sum()
chi2_triggered, p_triggered = srm_check(n_t_triggered, n_c_triggered)
print(f"N_θT: {n_t_triggered}, N_θC: {n_c_triggered}")
print(f"Chi-square: {chi2_triggered:.2f}, p: {p_triggered:.4f}")
print(f"Result: {'PASS' if p_triggered > 0.001 else 'FAIL'}")

# === Check 3: Complement A/A ===
print("\n=== Complement A/A Check ===")
non_triggered = ~triggered_with_bias
non_t_t = engagement[non_triggered & (treatment == 1)]
non_t_c = engagement[non_triggered & (treatment == 0)]
_, p_complement = stats.ttest_ind(non_t_t, non_t_c)
lift_complement = (non_t_t.mean() - non_t_c.mean()) / non_t_c.mean() * 100
print(f"Non-triggered T mean: {non_t_t.mean():.2f}, C mean: {non_t_c.mean():.2f}")
print(f"Lift: {lift_complement:.2f}%, p: {p_complement:.4f}")
print(f"Result: {'PASS (no effect on non-triggered)' if p_complement > 0.05 else 'FAIL (effect on non-triggered)'}")

# === Triggered Analysis ===
print("\n=== Triggered Analysis ===")
trig_t = engagement[triggered_with_bias & (treatment == 1)]
trig_c = engagement[triggered_with_bias & (treatment == 0)]
_, p_triggered_analysis = stats.ttest_ind(trig_t, trig_c)
lift_triggered = (trig_t.mean() - trig_c.mean()) / trig_c.mean() * 100
print(f"Triggered T mean: {trig_t.mean():.2f}, C mean: {trig_c.mean():.2f}")
print(f"Lift: {lift_triggered:.2f}%, p: {p_triggered_analysis:.4f}")

# === Summary ===
print("\n=== Summary ===")
all_checks_pass = (p_overall > 0.001) and (p_triggered > 0.001) and (p_complement > 0.05)
if all_checks_pass:
    print("All trust checks PASSED. Trustworthy triggered analysis.")
else:
    print("One or more trust checks FAILED. Investigate before launch.")
    if p_overall <= 0.001:
        print("  - Overall SRM fail: experiment setup issue")
    if p_triggered <= 0.001:
        print("  - Triggered SRM fail: counterfactual logging issue")
    if p_complement <= 0.05:
        print("  - Complement A/A fail: trigger condition incomplete")

Intuition: what the simulation shows

The takeaways from this code:

5.0.0.1 Detection by the triggered SRM

Implementation bug:
  Treatment fires the trigger slightly more often (+0.5 percentage point bias)

Detection:
  Triggered SRM check
  Chi-square test
  Fails when p-value < 0.001
  (a bias this small only becomes detectable with enough triggered users)

  → Bug detected
  → Fix, then re-analyze

5.0.0.2 Detection by the complement A/A

Incomplete trigger condition:
  Treatment also affects non-triggered users
  (e.g., latency, architectural spillover)

Detection:
  T vs C among non-triggered users
  Fails on a significant difference

  → Widen the trigger condition or isolate the architecture

5.0.0.3 Combined operation

Automation on a modern platform:
  - Every triggered analysis runs all three checks
  - On pass, show the scorecard
  - On fail, warning + investigation
  - Trustworthy only when every check passes

This enforcement is the foundation of trust in triggering.

6 Related Topics

Next post

F20-5: Open Questions