Triggering Examples 4-5: Coverage Change · ML Counterfactual

Symmetric Difference · Recommender Model A/B · 2x Compute · Latency Implications

A close look at Examples 4 and 5 from Kohavi (2020) Ch.20.4. Using code and worked cases, this post covers the symmetric difference that arises under a coverage change (where the Treatment adds a condition rather than simply expanding eligibility), counterfactual triggering for ML model A/B tests (V1 vs V2 recommendations), the cost of 2x model inference, the latency impact, the computational trade-offs, and the limits this places on shared controls.

Experimentation
A/B Test
Author

Kwangmin Kim

Published

May 8, 2026

1 Definition

Definition: Coverage Change · Counterfactual ML

Advanced triggering patterns from Kohavi (2020) Ch.20.4.

1.0.0.1 Pattern 4: Coverage Change

Coverage is not a simple expansion (T ⊃ C) but a change of condition, e.g. T = (cart > $25) ∧ (no return in 60d).

Set notation:
  Control coverage C = A
  Treatment coverage T = B
  The coverages differ (T ≠ C, and T ⊅ C)

Triggered = (A \ B) ∪ (B \ A) (symmetric difference)

1.0.0.2 Pattern 5: Counterfactual ML

An A/B test of an ML model: compare the outputs of model V1 and model V2 for the same user.

Trigger:
  V1(input) ≠ V2(input) for the user

Implementation:
  Run both V1 and V2 inference for every user
  → Identify the counterfactual
  → 2x compute cost

Quoted from the book (Ch.20.4): “The key observation is that if the new model overlaps the old model for most users, as when making the same classifications or recommendations for the same inputs, then the Treatment effect is zero for those users.”

Key insight: sensitivity is critical in ML A/B tests. If V2 makes the same prediction as V1 for 90% of users, the effect measured over all users shrinks to about 10% of its size among the affected users. Counterfactual logging is the tool that avoids this dilution, at the price of 2x inference compute.

2 Concepts and Principles

2.1 Pattern 4: Coverage Change (Changed Condition)

Stated by the author (Ch.20.4, Example 4).

2.1.1 Scenario: Changing the Free Shipping Condition

Original (Control):
  Free shipping if cart > $35

Treatment:
  Free shipping if (cart > $25) AND (no return in 60 days)

Coverage analysis:
  Control = {users | cart > $35}
  Treatment = {users | cart > $25 ∧ no_return_60d}

2.1.2 The Five Regions

Region 1: cart > $35, with a return (within 60d)
  Control: free (cart > $35)
  Treatment: paid (has a return → condition not met)
  → Different experience → Trigger

Region 2: cart > $35, no return
  Control: free
  Treatment: free
  → Same experience → No trigger

Region 3: $25 < cart ≤ $35, no return
  Control: paid (cart ≤ $35)
  Treatment: free (cart > $25 ∧ no return)
  → Different experience → Trigger

Region 4: $25 < cart ≤ $35, with a return
  Control: paid
  Treatment: paid (has a return)
  → Same experience → No trigger

Region 5: cart ≤ $25
  Control: paid
  Treatment: paid
  → Same experience → No trigger

2.1.2.1 Triggered subset
Triggered = Region 1 ∪ Region 3
  = {(cart > $35) ∧ (return in 60d)} ∪ {($25 < cart ≤ $35) ∧ (no return)}

Symmetric difference:
  (Control \ Treatment) ∪ (Treatment \ Control)
  = {free only under Control} ∪ {free only under Treatment}
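
The region table above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `User` record with only the two fields the rules need; the class and function names are illustrative and not from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    cart: float          # cart value in dollars
    returned_60d: bool   # True if the user returned an item in the last 60 days


def control_free(u: User) -> bool:
    """Control rule: free shipping if cart > $35."""
    return u.cart > 35


def treatment_free(u: User) -> bool:
    """Treatment rule: free shipping if cart > $25 and no return in 60 days."""
    return u.cart > 25 and not u.returned_60d


def is_triggered(u: User) -> bool:
    """True when exactly one rule grants free shipping (symmetric difference)."""
    return control_free(u) != treatment_free(u)


# One hypothetical user per region of the table above.
regions = {
    1: User(cart=40.0, returned_60d=True),
    2: User(cart=40.0, returned_60d=False),
    3: User(cart=30.0, returned_60d=False),
    4: User(cart=30.0, returned_60d=True),
    5: User(cart=20.0, returned_60d=False),
}
for region, user in regions.items():
    print(region, control_free(user), treatment_free(user), is_triggered(user))
```

Only regions 1 and 3 come out as triggered. A cart of exactly $35 with no return belongs to region 3, because the Control rule uses a strict inequality.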

2.1.2.2 Venn diagram (after the author's Figure 20.2)
       Control     Treatment
       coverage    coverage
        ┌────┐    ┌────┐
        │    │    │    │
        │  ┌─┼────┼─┐  │
        │  │ │    │ │  │
        │  │ │ ∩  │ │  │
        │  │ │    │ │  │
        │  └─┼────┼─┘  │
        │    │    │    │
        └────┘    └────┘
        C\T  ∩    T\C
        ↓    ↓    ↓
     Trigger Same Trigger

2.1.3 The Need for Counterfactual Evaluation

The author stresses: “Both Control and Treatment must evaluate the ‘other’ condition, that is, the counterfactual, and mark users as triggered only if there is a difference between the two variants.”

2.1.3.1 Mechanism
Handling a Control user:
  Step 1: Evaluate the Control rule (cart > $35)
    → True/False
  Step 2: Evaluate the Treatment rule (counterfactual)
    → cart > $25 ∧ no return → True/False
  Step 3: Compare the results
    Same → No trigger
    Different → Trigger

Handling a Treatment user:
  Step 1: Evaluate the Treatment rule
  Step 2: Evaluate the Control rule (counterfactual)
  Step 3: Compare the results
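
The three steps can be sketched as a single serving function that applies the assigned rule, evaluates the other rule as the counterfactual, and logs whether they differ. This is an illustration under assumptions: `serve_shipping`, `RULES`, and the log record fields are hypothetical names, not an API from the book.

```python
def control_rule(cart: float, returned_60d: bool) -> bool:
    return cart > 35


def treatment_rule(cart: float, returned_60d: bool) -> bool:
    return cart > 25 and not returned_60d


RULES = {"control": control_rule, "treatment": treatment_rule}


def serve_shipping(variant: str, cart: float, returned_60d: bool):
    """Return (free_shipping_shown, log_record) for one user."""
    other = "treatment" if variant == "control" else "control"
    factual = RULES[variant](cart, returned_60d)         # Step 1: assigned rule
    counterfactual = RULES[other](cart, returned_60d)    # Step 2: the other rule
    record = {
        "variant": variant,
        "factual": factual,
        "counterfactual": counterfactual,
        "triggered": factual != counterfactual,          # Step 3: compare
    }
    return factual, record


shown, record = serve_shipping("control", cart=30.0, returned_60d=False)
print(shown, record)
```

The user only ever sees the factual result. The trigger flag is the same whichever variant the user was assigned to, which is what keeps the triggered populations comparable.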

2.1.3.2 Implementation cost
Standard implementation:
  Control: evaluates only the Control rule
  Treatment: evaluates only the Treatment rule
  → 1 rule per user

Counterfactual implementation:
  Both variants: evaluate both rules
  → 2 rules per user
  → up to ~2x compute (depending on the cost of each rule evaluation)

2.1.3.3 Free shipping: a cheap evaluation
Free shipping rule:
  Cart amount check (a simple comparison)
  Return history check (a DB lookup)

  Cost: small relative to serving the page
  Counterfactual cost: about 2x the rule cost, still small

→ Counterfactual logging has a good ROI here

2.1.4 Generalizing Pattern 4

Generalization:
  When the Treatment is not a simple expansion of the Control,
  the symmetric difference is the trigger condition

Use cases:
  - Pricing rule changes (added conditions)
  - Promotion eligibility changes
  - Changes to an algorithm's branching conditions
  - Changes to user segmentation boundaries

Each use case requires counterfactual evaluation.

Intuition: a mental model for Coverage Change

2.1.4.1 The core: symmetric difference
Symmetric difference (Δ):
  A Δ B = (A ∪ B) \ (A ∩ B)
        = (A \ B) ∪ (B \ A)

Triggered:
  The region where the coverages differ
  Users with the same experience are not triggered

2.1.4.2 Visual analogy
Pattern 3 (Coverage Increase):
  T is a superset of C (T ⊃ C)
  T \ C = the additional region

Pattern 4 (Coverage Change):
  T and C overlap, and each has its own unique region
  T Δ C = symmetric difference

2.1.4.3 Implementation differences
Pattern 3:
  Trigger condition: in T but not in C
  → "is_treatment_eligible AND NOT is_control_eligible"
  → Single boolean

Pattern 4:
  Trigger condition: difference in eligibility
  → "is_treatment_eligible XOR is_control_eligible"
  → Requires evaluating both sides

Cost difference:
  Pattern 3: 1 check (C \ T is empty, so only membership in the added region T \ C matters)
  Pattern 4: 2 rules (both Treatment and Control)
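
To see how the two trigger conditions relate, the sketch below enumerates all four eligibility combinations and lists where they disagree. The function names are illustrative only.

```python
from itertools import product


def trigger_pattern3(in_t: bool, in_c: bool) -> bool:
    """Coverage increase: in T but not in C."""
    return in_t and not in_c


def trigger_pattern4(in_t: bool, in_c: bool) -> bool:
    """Coverage change: eligibility differs (XOR)."""
    return in_t != in_c


# Collect the eligibility combinations on which the two conditions disagree.
disagreements = [
    (in_t, in_c)
    for in_t, in_c in product([False, True], repeat=2)
    if trigger_pattern3(in_t, in_c) != trigger_pattern4(in_t, in_c)
]
print(disagreements)  # [(False, True)]
```

The only disagreement is a user who is eligible under Control but not under Treatment. That case cannot occur when T ⊇ C, so the AND NOT form is the XOR form specialized to a coverage increase.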

2.1.4.4 Connection to Pattern 5 (ML)
Generalizing Pattern 4:
  Rule-based coverage → Function-based output

Pattern 5 (ML):
  Function = ML model
  Coverage = the set of inputs for which the model returns a specific output

  Same logic as Pattern 4:
    V1 output ≠ V2 output → Trigger

This unified view is the mathematical basis of advanced triggering.

2.2 Pattern 5: Counterfactual ML Triggering

Stated by the author (Ch.20.4, Example 5).

2.2.1 Scenario: ML Model A/B

2.2.1.1 Recommender System
Setup:
  V1 recommender: existing model (Control)
  V2 recommender: new model (Treatment)

  When a user visits a product page:
    V1: recommends [A, B, C]
    V2: recommends [A, D, E]

Classifying users:
  V1 == V2 (same recommendations): no trigger
  V1 ≠ V2 (different recommendations): trigger

2.2.1.2 Classifier
Setup:
  V1 classifier: user → one of [promo_A, promo_B, promo_C]
  V2 classifier: new model

  When a user visits the site:
    V1 classification: promo_A
    V2 classification: promo_B

Classifying users:
  Same promo chosen: no trigger
  Different promo chosen: trigger
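
For both scenarios the trigger is an inequality test on what the user would be shown. A minimal sketch follows, assuming that recommendations are compared as ordered top-k lists; the cutoff `k` and the order-sensitive comparison are illustrative choices, not something the book prescribes.

```python
from typing import Sequence


def recs_differ(v1_recs: Sequence[str], v2_recs: Sequence[str], k: int = 3) -> bool:
    """Trigger if the top-k lists shown to the user would differ (order-sensitive)."""
    return list(v1_recs[:k]) != list(v2_recs[:k])


def label_differs(v1_label: str, v2_label: str) -> bool:
    """Trigger if the two classifiers choose different promos."""
    return v1_label != v2_label


print(recs_differ(["A", "B", "C"], ["A", "D", "E"]))  # True
print(label_differs("promo_A", "promo_A"))            # False
```

Comparing only what is actually displayed matters: two models whose outputs differ below the visible cutoff give the user the same experience, so that user should not be triggered.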

2.2.2 Why Counterfactual?

The author stresses: “if the new model overlaps the old model for most users, as when making the same classifications or recommendations for the same inputs, then the Treatment effect is zero for those users.”

2.2.2.1 Degree of overlap
Scenario:
  V1 and V2 give the same recommendation to 90% of users

Effect dilution:
  Naive analysis:
    - 90% of users: 0 effect
    - 10% of users: full effect
    - Average: 10% × full_effect

  Triggered analysis:
    - Analyze only the 10% of users
    - Effect: full magnitude
    - 10x larger measured effect on 1/10 of the users
      (with equal per-user variance: about √10 ≈ 3.2x larger t-statistic,
       i.e. roughly 10x fewer total users needed)
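
The arithmetic behind the dilution can be written out as a small helper. This is a sketch under a simplifying assumption: per-user variance is taken to be the same in the triggered and the full population, which need not hold in real data. The function name is illustrative.

```python
import math


def dilution_summary(trigger_rate: float, triggered_effect: float):
    """Return (naive_effect, effect_ratio, t_stat_ratio) for a given trigger rate.

    Assumes equal per-user variance in the triggered and full populations.
    """
    naive_effect = trigger_rate * triggered_effect   # effect averaged over all users
    effect_ratio = 1.0 / trigger_rate                # triggered effect / naive effect
    # t is proportional to effect * sqrt(n); triggering multiplies the effect by
    # 1/rate and n by rate, so t changes by (1/rate) * sqrt(rate) = sqrt(1/rate).
    t_stat_ratio = math.sqrt(1.0 / trigger_rate)
    return naive_effect, effect_ratio, t_stat_ratio


naive, effect_ratio, t_ratio = dilution_summary(trigger_rate=0.10, triggered_effect=1.0)
print(f"naive effect = {naive:.2f} x full effect")
print(f"effect ratio = {effect_ratio:.1f}x, t-statistic ratio = {t_ratio:.2f}x")
```

Because required sample size scales with the inverse square of the effect, a 10x larger effect on one tenth of the users corresponds to needing about one tenth of the total traffic.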
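
2.2.2.2 Why overlap is high
Typical evolution of an ML model:
  Changes from V1 → V2:
    - Adding/removing features (incremental)
    - Hyperparameter tuning (minor)
    - Architecture change (major)

  For most changes:
    - Only some edge-case users are affected
    - Mainstream users get the same prediction
    - Overlap is often high (80-95% in this post's examples)

For this reason triggered analysis is the usual choice for ML A/B tests.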

2.2.3 How Counterfactual Logging Works

Stated by the author.

2.2.3.1 Procedure
Handling a Control user:
  Step 1: V1 inference → output_1
  Step 2: V2 inference (counterfactual) → output_2
  Step 3: Expose output_1 to the user (because Control)
  Step 4: Log: user_id, output_1, output_2
    → Trigger if output_1 ≠ output_2

Handling a Treatment user:
  Step 1: V1 inference (counterfactual) → output_1
  Step 2: V2 inference → output_2
  Step 3: Expose output_2 to the user (because Treatment)
  Step 4: Log: user_id, output_1, output_2
    → Trigger if output_1 ≠ output_2

2.2.3.2 Symmetry
Both variants:
  - Run both V1 and V2 inference
  - Differ only in what is exposed to the user
  - Log the same trigger information

Triggered set:
  {user | output_V1(user) ≠ output_V2(user)}

  Defined identically in Treatment and Control
  → Trustworthy comparison
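
The procedure can be sketched as one serving function shared by both variants. Everything named here is hypothetical: `serve_with_counterfactual`, the log record layout, and the two threshold-based stand-in models exist only to make the example runnable.

```python
from typing import Any, Callable, Dict, List


def serve_with_counterfactual(
    user_id: str,
    features: Any,
    variant: str,
    v1: Callable[[Any], Any],
    v2: Callable[[Any], Any],
    log: List[Dict[str, Any]],
) -> Any:
    """Run both models, expose only the assigned one, and log both outputs."""
    output_1 = v1(features)   # factual for Control, counterfactual for Treatment
    output_2 = v2(features)   # factual for Treatment, counterfactual for Control
    shown = output_1 if variant == "control" else output_2
    log.append({
        "user_id": user_id,
        "variant": variant,
        "output_1": output_1,
        "output_2": output_2,
        "triggered": output_1 != output_2,
    })
    return shown


# Hypothetical stand-in models: they differ only for scores in [0.5, 0.6).
def v1_stub(score: float) -> str:
    return "promo_A" if score < 0.5 else "promo_B"


def v2_stub(score: float) -> str:
    return "promo_A" if score < 0.6 else "promo_B"


log: List[Dict[str, Any]] = []
print(serve_with_counterfactual("u1", 0.55, "control", v1_stub, v2_stub, log))
print(serve_with_counterfactual("u2", 0.55, "treatment", v1_stub, v2_stub, log))
print(serve_with_counterfactual("u3", 0.20, "treatment", v1_stub, v2_stub, log))
print([r["triggered"] for r in log])
```

Users `u1` and `u2` have the same features and are both triggered even though they see different promos, while `u3` is not triggered in either variant.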

2.2.4 Compute Cost

The author stresses: “the computational cost in this scenario rises (e.g., the model inference cost doubles with one Treatment) as both machine learning models must be executed.”

2.2.4.1 Cost analysis
Single inference:
  Cost: C (CPU + memory + time)

Counterfactual inference:
  V1 cost: C
  V2 cost: C
  Total: 2C (assuming the two models cost about the same)

Total compute:
  Standard A/B: C per user
  Counterfactual: 2C per user
  → 100% additional compute

2.2.4.2 Latency

The author stresses: “Latency could also be impacted if the two models are not run concurrently and the controlled experiment cannot expose differences in the model’s execution (e.g., if one is faster or takes less memory) as both executed.”

2.2.4.3 Sequential execution
Sequential:
  Time: t_V1 + t_V2

  Issues:
    - Total latency roughly doubles
    - Affects user experience (slower pages)
    - Critical for real-time applications

2.2.4.4 Parallel execution
Parallel (separate threads or a separate service):
  Time: max(t_V1, t_V2) = latency of the slower model

  Issues:
    - 2x compute resources (CPU)
    - 2x memory
    - More complex implementation (thread pool or separate service)

2.2.4.5 Deciding the trade-off
Sequential vs Parallel:
  Sequential: simple but slow
  Parallel: fast but resource-intensive

Decision factors:
  - Compute budget
  - Real-time requirement
  - User-facing latency tolerance
  - Multi-tenant cost (cloud)
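
The latency difference between the two execution modes can be observed with stand-in models. In this sketch the models are hypothetical stubs that sleep for 100ms and 150ms to imitate a remote inference call, so the measured times only illustrate t_V1 + t_V2 versus max(t_V1, t_V2).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def v1_model(x: float) -> bool:
    time.sleep(0.10)   # stand-in for ~100ms of V1 inference
    return x > 0.0


def v2_model(x: float) -> bool:
    time.sleep(0.15)   # stand-in for ~150ms of V2 inference
    return x > 0.1


def run_sequential(x: float):
    start = time.perf_counter()
    outputs = (v1_model(x), v2_model(x))
    return outputs, time.perf_counter() - start


def run_parallel(x: float, pool: ThreadPoolExecutor):
    start = time.perf_counter()
    future_1 = pool.submit(v1_model, x)
    future_2 = pool.submit(v2_model, x)
    outputs = (future_1.result(), future_2.result())
    return outputs, time.perf_counter() - start


with ThreadPoolExecutor(max_workers=2) as pool:
    seq_out, seq_seconds = run_sequential(0.05)
    par_out, par_seconds = run_parallel(0.05, pool)

print(f"sequential: {seq_seconds * 1000:.0f}ms, parallel: {par_seconds * 1000:.0f}ms")
```

Threads overlap here because the stubs only wait. For CPU-bound inference written in pure Python, threads would typically not give this speedup, which is one reason the parallel option often means a separate process or service.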

2.2.5 The Hidden Performance Issue

The author stresses (Ch.20.4): “this will not be visible in the controlled experiment.”

2.2.5.1 Scenario: V2 is slower
Real production (no counterfactual):
  V1 user: V1 latency (100ms)
  V2 user: V2 latency (150ms, 50% slower)
  → The latency difference is visible in the user experience

Counterfactual logging:
  Both variants: run both V1 and V2 inference
  Total latency: 150ms (parallel, max) or 250ms (sequential)
  → Users in both variants see the same latency (both pay the same cost)
  → The V1 vs V2 latency difference is invisible to users

2.2.5.2 Why it is critical
If V2 is 50% slower in real production:
  User abandonment ↑
  Engagement ↓
  → Negative effect

With counterfactual logging this goes undetected:
  Users in both variants see the same latency
  → Latency difference invisible
  → Wrong decision (V2 turns out to be slow in production after launch)

2.2.5.3 Mitigation
1. Awareness:
   - Recognize the limits of counterfactual logging
   - Measure V2's production latency separately

2. Code-level timing:
   - Log the timing of V1 inference alone
   - Log the timing of V2 inference alone
   - The two can then be compared

3. A/A'/B Experiment:
   - A: original (no counterfactual)
   - A': original + counterfactual logging
   - B: V2 + counterfactual logging
   - A vs A' difference = cost of counterfactual logging
   - A' vs B difference = effect of V2's outputs, with the logging overhead equal on both sides
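
Item 2, code-level timing, can be sketched as a thin wrapper that times each model call separately so the per-model latencies survive even though users experience the combined cost. The helper names and the sleeping stub models are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict


def timed(model: Callable[[Any], Any], features: Any):
    """Run one model and return (output, elapsed milliseconds)."""
    start = time.perf_counter()
    output = model(features)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return output, elapsed_ms


def infer_both(features: Any, v1: Callable, v2: Callable) -> Dict[str, Any]:
    """Run V1 and V2 and keep a separate timing for each."""
    output_1, v1_ms = timed(v1, features)
    output_2, v2_ms = timed(v2, features)
    return {"output_v1": output_1, "output_v2": output_2, "v1_ms": v1_ms, "v2_ms": v2_ms}


# Hypothetical stubs: V2 is deliberately slower than V1.
def fast_v1(x: float) -> int:
    time.sleep(0.02)
    return int(x > 0)


def slow_v2(x: float) -> int:
    time.sleep(0.06)
    return int(x > 0)


record = infer_both(0.3, fast_v1, slow_v2)
print(f"V1: {record['v1_ms']:.0f}ms, V2: {record['v2_ms']:.0f}ms")
```

Aggregating `v1_ms` and `v2_ms` across requests gives a latency comparison that the engagement metrics of the experiment cannot show.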

The author states: “Run an A/A’/B experiment, where A is the original system (Control), A’ is the original system with counterfactual logging, and B is the new Treatment with counterfactual logging. If A and A’ are significantly different, you can raise an alert that counterfactual logging is making an impact.”

2.2.6 Limits on Shared Controls

The author stresses (cross-referencing Ch.12): “counterfactual logging makes it very hard to use shared controls (see Chapter 12 and Chapter 18).”

2.2.6.1 Mechanism
Standard shared control (F-KOH18-2):
  One Control common to several experiments
  Control users are the baseline for all of them

With counterfactual logging:
  The Control also runs V2 inference (counterfactual)
  → The Control itself costs more
  → It no longer matches the Control that other experiments need
  → Sharing loses much of its value

2.2.6.2 Alternatives
Alternative 1: Sample a subset
  Run counterfactual logging for only 1% of users
  Triggered analysis becomes a sample-based estimate
  Cost ↓, sensitivity ↓

Alternative 2: Other trigger detection
  Use a signal other than the difference in model outputs
  e.g. an input feature falling in a specific range
  Triggers without the counterfactual
  Suboptimal but acceptable

Alternative 3: Pre-compute
  Pre-compute V1 and V2 outputs in batch
  Only a lookup at experiment time
  More storage, less online compute

Each alternative has its own trade-off.
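
Alternative 1 needs a stable way to pick the 1% of users who get counterfactual logging. One common approach, shown here as a sketch rather than a prescribed method, is to hash the user ID with a salt and compare the result to the sampling rate. The salt value and function name are hypothetical.

```python
import hashlib


def in_logging_sample(user_id: str, salt: str = "cf-logging-v1", rate: float = 0.01) -> bool:
    """Deterministically select about `rate` of users for counterfactual logging."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


sampled = sum(in_logging_sample(f"user-{i}") for i in range(100_000))
print(f"sampled {sampled} of 100000 users")
```

The same user always gets the same answer, so the sampled group stays fixed for the whole experiment, and using a salt distinct from the one used for variant assignment keeps the sample independent of Treatment and Control.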

Assumption: ignoring the hidden cost of counterfactual logging

Assumption: the V2 model is slower than V1, but counterfactual logging makes the latency difference invisible.

2.2.6.3 Scenario
Lab test (counterfactual logging, sequential inference):
  V1 user: 250ms (V1 + V2 inference)
  V2 user: 250ms (V1 + V2 inference)
  → Same latency

  V1 vs V2 engagement:
    V2: +5% engagement (better recommendations)
    Decision: launch V2

Production (no counterfactual):
  V1 user: 100ms (V1 only)
  V2 user: 150ms (V2 only, 50% slower)
  → V2 users see 50% higher latency

  Real impact (illustrative numbers):
    Latency effect of V2 → engagement -3% (Ch.5, speed matters)
    Recommendation effect of V2 → engagement +5%
    Net: only about +2% (smaller than the +5% seen in the lab)
    or even negative (if latency dominates)

2.2.6.4 Outcome
Decision quality at risk:
  Lab: +5% lift expected
  Production: only about +2% (or negative)

  Why the gap:
    Counterfactual logging hides the latency difference
    User-perceived latency in production is different

2.2.6.5 Mitigation
A/A'/B experiment:
  A: production V1 (no counterfactual)
  A': production V1 + counterfactual
  B: V2 + counterfactual

  A vs A':
    A latency: 100ms
    A' latency: 250ms (sequential V1 + V2)
    Diff: latency cost of counterfactual logging = 150ms
    → Any engagement drop here is the cost of the logging itself

  A' vs B:
    Logging overhead is identical on both sides
    → Isolates the effect of V2's outputs (not V2's own latency)

  Final decision:
    Effect of V2's outputs (A' vs B)
    Production latency adjustment (V2 timed alone in a production setup)
    Decision: launch V2 with latency optimization
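
The two comparisons can be written as a small decomposition. The sketch below uses made-up engagement means chosen to reproduce a -3% logging cost and a +5% output effect; it shows only the arithmetic, and a real analysis would test each difference for statistical significance before acting on it.

```python
def decompose_aab(mean_a: float, mean_a_prime: float, mean_b: float):
    """Return (logging_effect, v2_output_effect) as relative changes."""
    logging_effect = (mean_a_prime - mean_a) / mean_a            # A vs A'
    v2_output_effect = (mean_b - mean_a_prime) / mean_a_prime    # A' vs B
    return logging_effect, v2_output_effect


# Hypothetical engagement means per variant.
logging_effect, v2_effect = decompose_aab(mean_a=100.0, mean_a_prime=97.0, mean_b=101.85)
print(f"logging cost (A vs A'): {logging_effect:+.1%}")
print(f"V2 output effect (A' vs B): {v2_effect:+.1%}")
```

Neither number includes V2's own latency in production, which is why the final decision above still calls for timing V2 alone.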

This A/A’/B design is what makes an ML A/B test trustworthy.

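3 Why It Matters

Without coverage-change and ML counterfactual triggering:

  • Effect dilution: the 80-95% of users with identical outputs add only noise
  • Insufficient sensitivity: ML A/B effects are hard to detect
  • Decision quality: false negatives

With them in place:

  • The measured effect in ML A/B tests is 5-20x larger on the triggered population (for 80-95% overlap)
  • An accurate effect estimate for coverage changes
  • Trustworthy ML decisions

Advanced triggering of this kind is central to the sensitivity of a modern ML platform.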

4 Application: Netflix Recommender A/B

A hypothetical reconstruction of an ML A/B test at Netflix (all numbers are illustrative):

Setup:
  V1: Existing recommender
  V2: New deep learning recommender

Counterfactual logging:
  Run both V1 and V2 inference for users in both variants
  Log the difference in recommendations

Triggered:
  ~30% of users get different recommendations
  The remaining 70% get the same recommendations

Triggered analysis:
  Effect of V2: +8% engagement on triggered users

Naive analysis (if used instead):
  Diluted effect: 0.30 × 8% = 2.4%
  → Detectable, but weak

Decision:
  Strong evidence from the triggered analysis
  Production launch (after latency validation)

This is a plausible operating pattern for evolving a recommender, not a documented Netflix procedure.

5 Code Example: Counterfactual ML Triggering

A simulation of the V1 and V2 models.

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical ML models: V1 (older) vs V2 (newer)
n_users = 5000

# User features
user_features = rng.normal(0, 1, (n_users, 5))

# V1 model (linear, simple)
def v1_model(features):
    weights = np.array([0.5, 0.3, 0.2, 0.1, 0.0])
    score = features @ weights
    return (score > 0).astype(int)  # binary recommendation

# V2 model (slightly different: shifted weights plus a nonlinear interaction term)
def v2_model(features):
    weights = np.array([0.5, 0.3, 0.2, 0.0, 0.2])  # changed weights
    score = features @ weights + 0.1 * features[:, 0] * features[:, 1]  # interaction
    return (score > 0).astype(int)

# Counterfactual logging
v1_recommendations = v1_model(user_features)
v2_recommendations = v2_model(user_features)

# Triggered: V1 ≠ V2
triggered_mask = v1_recommendations != v2_recommendations
n_triggered = triggered_mask.sum()
print("=== Counterfactual ML Triggering ===")
print(f"Total users: {n_users}")
print(f"Triggered (V1 ≠ V2): {n_triggered} ({n_triggered/n_users*100:.1f}%)")
print(f"Non-triggered (V1 == V2): {n_users - n_triggered} ({(n_users-n_triggered)/n_users*100:.1f}%)")

# Treatment assignment
treatment = rng.choice([0, 1], n_users, p=[0.5, 0.5])

# Engagement (simulated)
# Assumption: V2 recommends better than V1 → +10% engagement for triggered users
baseline_engagement = rng.normal(50, 15, n_users)

# Treatment effect: only for triggered users assigned to Treatment (vectorized)
treatment_effect = np.where(
    triggered_mask & (treatment == 1),
    baseline_engagement * 0.10,  # +10%
    0.0,
)

final_engagement = baseline_engagement + treatment_effect

# === Naive analysis ===
naive_t = final_engagement[treatment == 1]
naive_c = final_engagement[treatment == 0]
naive_lift = (naive_t.mean() - naive_c.mean()) / naive_c.mean() * 100
_, p_naive = stats.ttest_ind(naive_t, naive_c)
print(f"\n=== Naive Analysis ===")
print(f"T mean: {naive_t.mean():.2f}, C mean: {naive_c.mean():.2f}")
print(f"Lift: {naive_lift:.2f}%, p-value: {p_naive:.4f}")

# === Triggered analysis ===
triggered_t = final_engagement[triggered_mask & (treatment == 1)]
triggered_c = final_engagement[triggered_mask & (treatment == 0)]
triggered_lift = (triggered_t.mean() - triggered_c.mean()) / triggered_c.mean() * 100
_, p_triggered = stats.ttest_ind(triggered_t, triggered_c)
print(f"\n=== Triggered Analysis ===")
print(f"Triggered N (T): {len(triggered_t)}")
print(f"Triggered N (C): {len(triggered_c)}")
print(f"T mean: {triggered_t.mean():.2f}, C mean: {triggered_c.mean():.2f}")
print(f"Lift: {triggered_lift:.2f}%, p-value: {p_triggered:.4f}")

# === Diluted Impact ===
print("\n=== Diluted Impact ===")
print(f"Triggered effect: {triggered_lift:.2f}%")
print(f"Trigger rate: {n_triggered/n_users*100:.1f}%")
diluted = triggered_lift * (n_triggered/n_users)
print(f"Diluted estimate (triggered lift x trigger rate): {diluted:.2f}% (rough)")
print(f"Actual naive lift: {naive_lift:.2f}% (measured)")
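
Intuition: the message of ML triggering

5.0.0.1 The dramatic difference in sensitivity
Naive (all users):
  Effect: about 10% × trigger rate (e.g. ~3% at a 30% trigger rate; the script prints the actual rate for these two models)
  P-value: marginal (the effect is diluted)

Triggered (V1 ≠ V2 only):
  Effect: ~10% (full magnitude)
  P-value: strongly significant

5.0.0.2 The value of counterfactual logging
Without counterfactual:
  Triggers cannot be identified
  → Only naive analysis is possible
  → Insufficient sensitivity

With counterfactual:
  Triggers can be identified
  → Triggered analysis
  → Measured effect about 5-10x larger (at 10-20% trigger rates)

5.0.0.3 Cost-benefit
Cost:
  - 2x model inference compute
  - Possible latency impact

Benefit:
  - Measured effect about 5-10x larger (at 10-20% trigger rates)
  - Smaller sample size
  - Faster decision

ROI:
  Positive for most ML A/B tests, as long as
  the compute cost is smaller than the value of the added sensitivity

5.0.0.4 Industry practice
Modern ML platform:
  - Counterfactual logging for ML A/B tests
  - Treated as part of the computational budget
  - Triggered analysis as the default
  - Latency validated with A/A'/B

This is how a mature ML experimentation practice operates.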
