1 Definition
Advanced triggering patterns from Kohavi (2020), Ch.20.4.
1.0.0.1 Pattern 4: Coverage Change
Coverage is not simply expanded (T ⊃ C); the eligibility condition itself changes, e.g. T = (cart > $25) ∧ (no return in 60d).
Set notation:
Control coverage = A (also written C)
Treatment coverage = B (also written T)
Coverage differs (T ≠ C, and T ⊅ C)
Triggered = (A \ B) ∪ (B \ A) (symmetric difference)
1.0.0.2 Pattern 5: Counterfactual ML
A/B test of an ML model. For the same user, compare the outputs of model V1 and model V2.
Trigger:
V1(input) ≠ V2(input) for the user
Implementation:
Run both V1 and V2 inference for every user
→ Identifies the counterfactual
→ 2x compute cost
Quote from the original (Ch.20.4): “The key observation is that if the new model overlaps the old model for most users, as when making the same classifications or recommendations for the same inputs, then the Treatment effect is zero for those users.”
Key insight: sensitivity is critical in ML A/B tests. If V2 makes the same prediction as V1 for 90% of users, the effect is zero for those users, so the population-average effect shrinks to about 10% of the triggered effect. Counterfactual logging is the tool for avoiding this dilution, at the cost of 2x inference compute.
2 Concepts and Principles
2.1 Pattern 4: Coverage Change (changed condition)
Stated by the authors (Ch.20.4, Example 4).
2.1.1 Scenario: Changed Free Shipping Condition
Original (Control):
Free shipping if cart > $35
Treatment:
Free shipping if (cart > $25) AND (no return in 60 days)
Coverage 분석:
Control = {users | cart > $35}
Treatment = {users | cart > $25 ∧ no_return_60d}
2.1.2 The five regions of the sets
Region 1: cart > $35 + return (within 60d)
Control: free (cart > $35)
Treatment: paid (has a return → condition not met)
→ Different experience → Trigger
Region 2: cart > $35 + no return
Control: free
Treatment: free
→ Same experience → No trigger
Region 3: $25 < cart ≤ $35 + no return
Control: paid (cart ≤ $35)
Treatment: free (cart > $25 ∧ no return)
→ Different experience → Trigger
Region 4: $25 < cart ≤ $35 + return
Control: paid
Treatment: paid (has a return)
→ Same experience → No trigger
Region 5: cart ≤ $25
Control: paid
Treatment: paid
→ Same experience → No trigger
2.1.2.1 Triggered subset
Triggered = Region 1 ∪ Region 3
= {(cart > $35) ∧ (return in 60d)} ∪ {($25 < cart ≤ $35) ∧ (no return)}
Symmetric difference:
(Control \ Treatment) ∪ (Treatment \ Control)
= {free only under Control} ∪ {free only under Treatment}
2.1.2.2 Venn diagram (the authors' Figure 20.2)
[Figure: Venn diagram of Control coverage and Treatment coverage. The two non-overlapping regions, C \ T and T \ C, are triggered; the overlap C ∩ T gets the same experience and is not triggered.]
2.1.3 Why Counterfactual Evaluation Is Needed
The authors emphasize: “Both Control and Treatment must evaluate the ‘other’ condition, that is, the counterfactual, and mark users as triggered only if there is a difference between the two variants.”
2.1.3.1 Mechanism
Handling a Control user:
Step 1: Evaluate the Control rule (cart > $35)
→ True/False
Step 2: Evaluate the Treatment rule (counterfactual)
→ cart > $25 ∧ no return → True/False
Step 3: Compare the results
Same → No trigger
Different → Trigger
Handling a Treatment user:
Step 1: Evaluate the Treatment rule
Step 2: Evaluate the Control rule (counterfactual)
Step 3: Compare the results
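The three steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Shopper` class, the function names, and the returned dictionary are hypothetical, and the thresholds are taken from the free shipping scenario in 2.1.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shopper:
    cart_total: float       # cart value in dollars
    returned_in_60d: bool   # any return in the 60 days before the experiment


def control_free_shipping(s: Shopper) -> bool:
    # Control rule: free shipping if cart > $35
    return s.cart_total > 35


def treatment_free_shipping(s: Shopper) -> bool:
    # Treatment rule: free shipping if cart > $25 AND no return in 60 days
    return s.cart_total > 25 and not s.returned_in_60d


def evaluate(s: Shopper, variant: str) -> dict:
    """Evaluate both rules, expose only the assigned variant's result."""
    control = control_free_shipping(s)
    treatment = treatment_free_shipping(s)
    shown = control if variant == "control" else treatment
    # The trigger flag does not depend on the assigned variant.
    return {"variant": variant, "free_shipping": shown, "triggered": control != treatment}


# One representative shopper per region (hypothetical values).
regions = {
    1: Shopper(40, True),
    2: Shopper(40, False),
    3: Shopper(30, False),
    4: Shopper(30, True),
    5: Shopper(20, False),
}
triggered_regions = [r for r, s in regions.items() if evaluate(s, "control")["triggered"]]
print(triggered_regions)
```

Because both rules are evaluated regardless of assignment, a Control user and a Treatment user with the same cart and return history always receive the same trigger flag.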
2.1.3.2 Implementation cost
Standard implementation:
Control: only evaluate Control rule
Treatment: only evaluate Treatment rule
→ 1 rule per user
Counterfactual implementation:
Both variants: evaluate Both rules
→ 2 rules per user
→ ~2x compute (depending on the cost of rule evaluation)
2.1.3.3 Free shipping: cheap evaluation
Free shipping rule:
Cart amount check (simple)
Return history check (DB lookup)
Cost: minimal (microseconds)
Counterfactual cost: 2x but still minimal
→ Good ROI for counterfactual logging
2.1.4 Generalizing Pattern 4
Generalization:
Applies when Treatment is not a simple expansion of Control
The symmetric difference is the trigger condition
Use cases:
- Pricing rule change (added conditions)
- Promotion eligibility change
- Change to an algorithm's branching condition
- Change to a user segmentation boundary
Each use case requires counterfactual evaluation.
2.1.4.1 Key idea: Symmetric Difference
Symmetric difference (Δ):
A Δ B = (A ∪ B) \ (A ∩ B)
= (A \ B) ∪ (B \ A)
Triggered:
The regions where coverage differs
Users with the same experience are not triggered
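The two equivalent forms of the symmetric difference can be checked with Python's built-in `set` operators. The user IDs below are made up for illustration.

```python
# Hypothetical user IDs covered by each variant's rule.
control_eligible = {"u1", "u2", "u3", "u4"}
treatment_eligible = {"u3", "u4", "u5", "u6"}

only_control = control_eligible - treatment_eligible     # A \ B
only_treatment = treatment_eligible - control_eligible   # B \ A
triggered = control_eligible ^ treatment_eligible        # A Δ B

# Both textbook forms of the symmetric difference agree.
assert triggered == only_control | only_treatment
assert triggered == (control_eligible | treatment_eligible) - (control_eligible & treatment_eligible)

print(sorted(triggered))  # ['u1', 'u2', 'u5', 'u6']
```

Users `u3` and `u4` sit in the overlap, get the same experience in both variants, and are therefore excluded from the triggered set.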
2.1.4.2 Visual analogy
Pattern 3 (Coverage Increase):
T is a superset of C (T ⊃ C)
T \ C = the additional region
Pattern 4 (Coverage Change):
T and C overlap, and each has its own unique region
T Δ C = symmetric difference
2.1.4.3 Implementation differences
Pattern 3:
Trigger condition: in T but not in C
→ "is_treatment_eligible AND NOT is_control_eligible"
→ Single boolean
Pattern 4:
Trigger condition: difference in eligibility
→ "is_treatment_eligible XOR is_control_eligible"
→ Both rules evaluated
Cost difference:
Pattern 3: 1 rule (since T ⊃ C, only membership in the region newly added by Treatment is checked)
Pattern 4: 2 rules (both Treatment and Control)
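A small truth-table check makes the relationship between the two trigger conditions concrete. The function names are hypothetical; the point is only that the Pattern 4 condition (XOR) reduces to the Pattern 3 condition whenever no user is eligible under Control but ineligible under Treatment, that is, whenever T ⊇ C.

```python
from itertools import product


def trigger_pattern3(control_eligible: bool, treatment_eligible: bool) -> bool:
    # Coverage increase: in T but not in C.
    return treatment_eligible and not control_eligible


def trigger_pattern4(control_eligible: bool, treatment_eligible: bool) -> bool:
    # Coverage change: eligibility differs in either direction (XOR).
    return control_eligible != treatment_eligible


all_cases = list(product([False, True], repeat=2))

# Cases where the two conditions disagree.
disagreements = [
    (c, t) for c, t in all_cases if trigger_pattern3(c, t) != trigger_pattern4(c, t)
]
print(disagreements)  # [(True, False)]
```

The only disagreement is the case "eligible under Control, not under Treatment", which cannot occur under Pattern 3 and is exactly the C \ T region that Pattern 4 adds.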
2.1.4.4 Connection to Pattern 5 (ML)
Generalization of Pattern 4:
Rule-based coverage → Function-based output
Pattern 5 (ML):
Function = ML model
Coverage = a specific value of the model's output
Same logic as Pattern 4:
V1 output ≠ V2 output → Trigger
This unified view is the mathematical foundation of advanced triggering.
2.2 Pattern 5: Counterfactual ML Triggering
Stated by the authors (Ch.20.4, Example 5).
2.2.1 Scenario: ML Model A/B
2.2.1.1 Recommender System
Setup:
V1 recommender: existing model (Control)
V2 recommender: new model (Treatment)
When a user visits a product page:
V1: recommends [A, B, C]
V2: recommends [A, D, E]
User classification:
V1 == V2 (same recommendations): no trigger
V1 ≠ V2 (different recommendations): trigger
2.2.1.2 Classifier
Setup:
V1 classifier: user → one of [promo_A, promo_B, promo_C]
V2 classifier: new model
When a user visits the site:
V1 classification: promo_A
V2 classification: promo_B
User classification:
Same promo recommended: no trigger
Different promo recommended: trigger
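For a classifier the comparison is a plain inequality, but for a recommender "same output" needs a definition. The sketch below shows two possible definitions; both function names are hypothetical, and which one is appropriate is an assumption that depends on whether item order is visible to the user.

```python
from typing import Sequence


def triggered_exact(v1_items: Sequence[str], v2_items: Sequence[str]) -> bool:
    """Trigger when the ranked lists differ in content or in order."""
    return list(v1_items) != list(v2_items)


def triggered_unordered(v1_items: Sequence[str], v2_items: Sequence[str]) -> bool:
    """Trigger only when the sets of recommended items differ."""
    return set(v1_items) != set(v2_items)


# Recommender example from the text.
print(triggered_exact(["A", "B", "C"], ["A", "D", "E"]))      # True
# Same items, different order: the two definitions disagree.
print(triggered_exact(["A", "B", "C"], ["C", "B", "A"]))      # True
print(triggered_unordered(["A", "B", "C"], ["C", "B", "A"]))  # False
# Classifier example: a single label per user.
print("promo_A" != "promo_B")                                 # True
```

The stricter definition triggers more users; if order does not change what the user sees, the looser one keeps users with an identical experience out of the triggered set.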
2.2.2 Why Counterfactual?
The authors emphasize: “if the new model overlaps the old model for most users, as when making the same classifications or recommendations for the same inputs, then the Treatment effect is zero for those users.”
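2.2.2.1 Degree of overlap
Scenario:
V1 and V2 give the same recommendation to 90% of users
Effect dilution:
Naive analysis:
- 90% of users: 0 effect
- 10% of users: full effect
- Average: 10% × full_effect
Triggered analysis:
- Analyze only the 10% of users
- Effect: full magnitude
- Effect estimate 10x larger than the naive average; with equal per-user variance, roughly 10x fewer enrolled users are needed for the same power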
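The dilution arithmetic can be written out explicitly. This is a back-of-the-envelope sketch under assumptions that are not in the source: equal per-user variance in triggered and non-triggered users, an even split, and a normal approximation. With trigger rate r, triggered effect δ, and n users per arm, the naive z-statistic is about rδ / (σ√(2/n)) and the triggered one is about δ / (σ√(2/(rn))), so their ratio is 1/√r. The function names are hypothetical.

```python
import math


def diluted_effect(triggered_effect: float, trigger_rate: float) -> float:
    """Average effect over all users when only a fraction is affected."""
    return trigger_rate * triggered_effect


def z_gain_from_triggering(trigger_rate: float) -> float:
    """Ratio of triggered to naive z-statistics, assuming equal variance."""
    return 1.0 / math.sqrt(trigger_rate)


def sample_size_reduction(trigger_rate: float) -> float:
    """Ratio of enrolled users needed (naive / triggered) for equal power."""
    return 1.0 / trigger_rate


r = 0.10  # 90% overlap between V1 and V2
print(f"diluted effect: {diluted_effect(1.0, r):.2f} x full effect")
print(f"z-statistic gain: {z_gain_from_triggering(r):.2f}x")
print(f"enrolled users needed: {sample_size_reduction(r):.0f}x fewer")
```

Under these assumptions the effect estimate grows by 1/r, the test statistic by 1/√r, and the required enrollment shrinks by 1/r, which is why a "10x" claim should state which of these quantities it refers to.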
2.2.2.2 Why overlap is high
Typical evolution of an ML model:
Changes from V1 → V2:
- Feature addition/removal (incremental)
- Hyperparameter tuning (minor)
- Architecture change (major)
Most changes:
- Affect only some edge-case users
- Mainstream users get the same prediction
- Overlap is often 80~95%
Triggered analysis is therefore the standard approach for ML A/B tests.
2.2.3 Mechanism of Counterfactual Logging
Stated by the authors.
2.2.3.1 Procedure
Handling a Control user:
Step 1: V1 inference → output_1
Step 2: V2 inference (counterfactual) → output_2
Step 3: Expose output_1 to the user (because the user is in Control)
Step 4: Log: user_id, output_1, output_2
→ Trigger if output_1 ≠ output_2
Handling a Treatment user:
Step 1: V1 inference (counterfactual) → output_1
Step 2: V2 inference → output_2
Step 3: Expose output_2 to the user (because the user is in Treatment)
Step 4: Log: user_id, output_1, output_2
→ Trigger if output_1 ≠ output_2
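The four steps map directly onto a small serving function. The sketch below is illustrative only: `serve`, the log record layout, and the toy models are hypothetical stand-ins, not a real serving API.

```python
from typing import Any, Callable, Dict, List

Model = Callable[[Dict[str, Any]], Any]


def serve(user_id: str, features: Dict[str, Any], variant: str,
          v1: Model, v2: Model, log: List[Dict[str, Any]]) -> Any:
    """Run both models, expose one output, and log both with the trigger flag."""
    output_1 = v1(features)  # factual for Control, counterfactual for Treatment
    output_2 = v2(features)  # factual for Treatment, counterfactual for Control
    exposed = output_1 if variant == "control" else output_2
    log.append({
        "user_id": user_id,
        "variant": variant,
        "output_1": output_1,
        "output_2": output_2,
        "triggered": output_1 != output_2,
    })
    return exposed


# Toy stand-in models (hypothetical): they disagree only for high-spend users.
def toy_v1(features: Dict[str, Any]) -> str:
    return "promo_A"


def toy_v2(features: Dict[str, Any]) -> str:
    return "promo_B" if features["spend"] > 100 else "promo_A"


log: List[Dict[str, Any]] = []
shown_control = serve("u1", {"spend": 150}, "control", toy_v1, toy_v2, log)
shown_treatment = serve("u2", {"spend": 150}, "treatment", toy_v1, toy_v2, log)
shown_overlap = serve("u3", {"spend": 20}, "treatment", toy_v1, toy_v2, log)

triggered_users = [row["user_id"] for row in log if row["triggered"]]
print(triggered_users)  # ['u1', 'u2']
```

Users `u1` and `u2` have the same features and are both triggered even though they see different promos; `u3` is in Treatment but sees the same output either model would give, so it is excluded.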
2.2.3.2 Symmetry
Both variants:
- Run both V1 and V2 inference
- Only what is exposed to the user differs
- The logged trigger information is the same
Triggered set:
{user | output_V1(user) ≠ output_V2(user)}
Defined identically in Treatment and Control
→ Trustworthy comparison
2.2.4 Compute Cost
The authors emphasize: “the computational cost in this scenario rises (e.g., the model inference cost doubles with one Treatment) as both machine learning models must be executed.”
2.2.4.1 Cost analysis
Single inference:
Cost: C (CPU + memory + time)
Counterfactual inference:
V1 cost: C
V2 cost: C
Total: 2C
Total compute:
Standard A/B: C per user
Counterfactual: 2C per user
→ 100% additional compute
2.2.4.2 Latency
The authors emphasize: “Latency could also be impacted if the two models are not run concurrently and the controlled experiment cannot expose differences in the model’s execution (e.g., if one is faster or takes less memory) as both executed.”
2.2.4.3 Sequential execution
Sequential:
Time: t_V1 + t_V2
Issues:
- Total latency about 2x (when t_V1 ≈ t_V2)
- Affects user experience (slower pages)
- Critical for real-time applications
2.2.4.4 Parallel execution
Parallel (separate threads or service):
Time: max(t_V1, t_V2) ≈ t_V_slower
Issues:
- 2x compute resource (CPU)
- Memory 2x
- More complex implementation (thread pool or separate service)
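The latency difference between the two execution modes can be demonstrated with the standard library. In this sketch the two models are hypothetical stand-ins that only sleep, which mimics a call to a remote inference service; the sleep durations are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def v1_model(features: dict) -> str:
    time.sleep(0.05)  # stand-in for V1 inference latency
    return "promo_A"


def v2_model(features: dict) -> str:
    time.sleep(0.08)  # stand-in for V2 inference latency
    return "promo_B"


def infer_sequential(features: dict) -> tuple:
    # Latency is roughly t_V1 + t_V2.
    return v1_model(features), v2_model(features)


def infer_parallel(features: dict, pool: ThreadPoolExecutor) -> tuple:
    # Latency is roughly max(t_V1, t_V2).
    future_1 = pool.submit(v1_model, features)
    future_2 = pool.submit(v2_model, features)
    return future_1.result(), future_2.result()


with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.perf_counter()
    sequential_outputs = infer_sequential({})
    t_sequential = time.perf_counter() - start

    start = time.perf_counter()
    parallel_outputs = infer_parallel({}, pool)
    t_parallel = time.perf_counter() - start

print(f"sequential: {t_sequential:.3f}s, parallel: {t_parallel:.3f}s")
```

Threads help here because the stand-in models wait rather than compute. For CPU-bound inference running in the same Python process, threads would likely not give this speedup, and a separate process or service is the more realistic option.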
2.2.4.5 Deciding the trade-off
Sequential vs Parallel:
Sequential: simple but slow
Parallel: fast but resource-intensive
Decision factors:
- Compute budget
- Real-time requirement
- User-facing latency tolerance
- Multi-tenant cost (cloud)
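3 Why It Is Needed
Without coverage-change and ML counterfactual triggering:
- Effect dilution: noise from the 80~95% of users whose experience is the same
- Insufficient sensitivity: effects in ML A/B tests are hard to detect
- Decision quality: false negatives
With them enabled:
- Higher ML A/B sensitivity: the effect estimate is larger by about 1 / trigger rate (5~20x at 80~95% overlap)
- An accurate effect estimate for coverage changes
- Trustworthy ML decisions
Advanced triggering is central to the sensitivity of a modern ML platform.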
4 Application Example: Netflix Recommender A/B
Netflix ML A/B (hypothetical reconstruction; the numbers are illustrative, not reported results):
Setup:
V1: Existing recommender
V2: New deep learning recommender
Counterfactual logging:
Run both V1 and V2 inference for users in both variants
Log the difference in recommendations
Triggered:
~30% of users get different recommendations
The remaining 70% get the same recommendations
Triggered analysis:
Effect of V2: +8% engagement on triggered users
Naive analysis (if used instead):
Diluted effect: 0.30 × 8% = 2.4%
→ Detectable but weak
Decision:
Strong evidence from the triggered analysis
Production launch (after latency validation)
In this reconstruction, this style of ML A/B operation is the standard path for evolving the recommender.
5 Code Example: Counterfactual ML Triggering
A simulation of the V1 and V2 models.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical ML models: V1 (older) vs V2 (newer)
n_users = 5000

# User features
user_features = rng.normal(0, 1, (n_users, 5))

# V1 model (linear, simple)
def v1_model(features):
    weights = np.array([0.5, 0.3, 0.2, 0.1, 0.0])
    score = features @ weights
    return (score > 0).astype(int)  # binary recommendation

# V2 model (slightly different: changed weights plus a nonlinear term)
def v2_model(features):
    weights = np.array([0.5, 0.3, 0.2, 0.0, 0.2])  # changed weights
    score = features @ weights + 0.1 * features[:, 0] * features[:, 1]  # interaction
    return (score > 0).astype(int)

# Counterfactual logging: run both models for every user
v1_recommendations = v1_model(user_features)
v2_recommendations = v2_model(user_features)

# Triggered: V1 != V2
triggered_mask = v1_recommendations != v2_recommendations
n_triggered = triggered_mask.sum()
print("=== Counterfactual ML Triggering ===")
print(f"Total users: {n_users}")
print(f"Triggered (V1 != V2): {n_triggered} ({n_triggered/n_users*100:.1f}%)")
print(f"Non-triggered (V1 == V2): {n_users - n_triggered} ({(n_users-n_triggered)/n_users*100:.1f}%)")

# Treatment assignment (50/50)
treatment = rng.choice([0, 1], n_users, p=[0.5, 0.5])

# Engagement (simulated)
# Assumption: V2 recommends better than V1 → +10% engagement on triggered users
baseline_engagement = rng.normal(50, 15, n_users)

# Treatment effect: only for triggered users assigned to Treatment
treatment_effect = np.where(
    triggered_mask & (treatment == 1),
    baseline_engagement * 0.10,  # +10%
    0.0,
)
final_engagement = baseline_engagement + treatment_effect

# === Naive analysis (all users) ===
naive_t = final_engagement[treatment == 1]
naive_c = final_engagement[treatment == 0]
naive_lift = (naive_t.mean() - naive_c.mean()) / naive_c.mean() * 100
_, p_naive = stats.ttest_ind(naive_t, naive_c)
print("\n=== Naive Analysis ===")
print(f"T mean: {naive_t.mean():.2f}, C mean: {naive_c.mean():.2f}")
print(f"Lift: {naive_lift:.2f}%, p-value: {p_naive:.4f}")

# === Triggered analysis (V1 != V2 only) ===
triggered_t = final_engagement[triggered_mask & (treatment == 1)]
triggered_c = final_engagement[triggered_mask & (treatment == 0)]
triggered_lift = (triggered_t.mean() - triggered_c.mean()) / triggered_c.mean() * 100
_, p_triggered = stats.ttest_ind(triggered_t, triggered_c)
print("\n=== Triggered Analysis ===")
print(f"Triggered N (T): {len(triggered_t)}")
print(f"Triggered N (C): {len(triggered_c)}")
print(f"T mean: {triggered_t.mean():.2f}, C mean: {triggered_c.mean():.2f}")
print(f"Lift: {triggered_lift:.2f}%, p-value: {p_triggered:.4f}")

# === Diluted impact ===
print("\n=== Diluted Impact ===")
print(f"Triggered effect: {triggered_lift:.2f}%")
print(f"Trigger rate: {n_triggered/n_users*100:.1f}%")
diluted = triggered_lift * (n_triggered / n_users)
print(f"Diluted estimate (triggered lift x trigger rate): {diluted:.2f}% (rough)")
print(f"Actual naive lift: {naive_lift:.2f}% (real measurement)")

5.0.0.1 The dramatic difference in sensitivity
Naive (all users):
Effect: about 10% × trigger rate (e.g. ~3% at a 30% trigger rate; the simulation prints its own trigger rate)
P-value: weaker, because the effect is diluted
Triggered (V1 ≠ V2 only):
Effect: ~10% (full magnitude)
P-value: typically much smaller
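5.0.0.2 The value of counterfactual logging
Without counterfactual:
Triggered users cannot be identified
→ Only naive analysis is possible
→ Insufficient sensitivity
With counterfactual:
Triggered users can be identified
→ Triggered analysis
→ Effect estimate larger by about 1 / trigger rate (5~10x at 10~20% trigger rates)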
5.0.0.3 Cost-benefit
Cost:
- 2x model inference compute
- Possible latency impact
Benefit:
- Effect estimate 5~10x larger (at 10~20% trigger rates)
- Smaller sample size
- Faster decision
ROI:
Positive for most ML A/B tests
The compute cost is typically smaller than the value of the added sensitivity
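One way to make the cost-benefit comparison concrete is to count how many users and how many model inferences each design needs for the same power. The sketch below uses the standard two-sample normal approximation; the standard deviation, effect, and trigger rate are assumed values chosen to resemble the simulation above, and equal variance across triggered and non-triggered users is an additional assumption.

```python
import math
from statistics import NormalDist


def users_per_arm(effect: float, sigma: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * ((z_{1-alpha/2} + z_power) * sigma / effect)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z * sigma / effect) ** 2)


sigma = 15.0             # assumed per-user standard deviation
triggered_effect = 5.0   # assumed absolute effect among triggered users
trigger_rate = 0.10      # assumed share of users with V1 != V2

# Naive design: one inference per user, but the effect is diluted.
naive_users = users_per_arm(trigger_rate * triggered_effect, sigma)
naive_inferences = naive_users * 1

# Counterfactual design: two inferences per enrolled user,
# but only triggered users are analyzed, at full effect size.
triggered_needed = users_per_arm(triggered_effect, sigma)
cf_users = math.ceil(triggered_needed / trigger_rate)
cf_inferences = cf_users * 2

print(f"naive:          {naive_users} users/arm, {naive_inferences} inferences/arm")
print(f"counterfactual: {cf_users} users/arm, {cf_inferences} inferences/arm")
```

Under these assumptions the naive design needs about 1/(2 × trigger rate) times as many inferences as the counterfactual design, so doubling per-user compute still lowers total compute whenever the trigger rate is below 50%.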
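5.0.0.4 Industry practice
Modern ML platform:
- Counterfactual logging on all ML A/B tests
- Treated as part of the computational budget
- Triggered analysis by default
- Latency validated with A/A'/B (A' = Control with counterfactual logging enabled)
This is the operating standard for mature ML organizations.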