Kwangmin Kim - Other Trust-Related Guardrails — Telemetry · Cache · Cookie

1 Definition

Definition: Four Additional Trust Guardrails

The trust checks beyond SRM described in Kohavi (2020), Ch.21.5. Each one detects hidden bugs through a different mechanism.

Guardrail	What it checks	Possible issue
Telemetry Fidelity	Loss rate of click tracking	Web beacon lossiness
Cache Hit Rate	Balance of a shared cache	SUTVA violation
Cookie Clobbering	Cookie write rate	Distortion caused by browser bugs
Quick Queries	Sub-second query rate	Unexplained anomaly

Original quote (Ch.21.5, Dmitriev et al. 2017): “There are other metrics besides the SRM to indicate that something is wrong.”

Key insight: SRM is the most critical trust check, but it is not the only one. The four additional guardrails detect hidden bugs that SRM cannot catch. Mature platforms layer these trust checks on top of each other.

2 Concepts and Principles

2.1 Guardrail 1 — Telemetry Fidelity

Stated by the authors (Ch.21.5, citing Kohavi, Messner et al. 2010).

2.1.1 Lossiness of Click Beacons

Web beacon mechanism (F-KOH13-1):
  User click → JavaScript event handler
  → Tracking pixel request → server log

Sources of loss:
  - Async beacon: some beacons lose the race against page navigation
  - Network failure
  - Browser bug
  - Ad blocker

Typical loss rate: roughly 5-15%

2.1.2 How the Treatment Affects the Loss Rate

The authors emphasize: “If the Treatment impacts the loss rate, the results may appear better or worse than the actual user experience.”

2.1.2.1 Scenario: Worse Apparent Result

Treatment change:
  Faster page navigation (good)

Beacon loss rate:
  Faster navigation → the page changes before the beacon completes
  → Beacon loss rate ↑

Result:
  Treatment click count ↓ (more beacons lost)
  Apparent metric:
    "Treatment has fewer clicks → engagement ↓ → bad"
  Reality:
    Treatment navigates faster → users are more satisfied
    But lost beacons mean the clicks are not recorded

→ False negative (the Treatment's advantage is invisible)

2.1.2.2 Scenario: Better Apparent Result

Treatment change:
  Slower page (e.g., heavy JavaScript)

Beacon loss rate:
  Slower navigation → the beacon has enough time to complete
  → Beacon loss rate ↓

Result:
  Treatment click count ↑ (fewer beacons lost)
  Apparent metric:
    "Treatment has more clicks → engagement ↑ → good"
  Reality:
    Treatment is slower (negative)
    But more complete beacon logging inflates the click count

→ False positive (the Treatment's disadvantage is invisible)
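The two scenarios above can be reproduced numerically. The following is a minimal sketch, not a model of any real system: it assumes both variants have exactly the same true click count and differ only in beacon loss rate. The loss rates (10%, 13%, 7%) and the helper `observed_clicks` are hypothetical values and names chosen for illustration.

```python
import numpy as np

def observed_clicks(true_clicks: int, loss_rate: float, rng: np.random.Generator) -> int:
    """Each true click reaches the log independently with probability 1 - loss_rate."""
    return int(rng.binomial(true_clicks, 1.0 - loss_rate))

rng = np.random.default_rng(0)
true_clicks = 100_000          # assumption: identical true behavior in every variant
control_loss = 0.10            # hypothetical baseline beacon loss rate
fast_treatment_loss = 0.13     # faster navigation -> more beacons dropped
slow_treatment_loss = 0.07     # slower navigation -> fewer beacons dropped

c = observed_clicks(true_clicks, control_loss, rng)
t_fast = observed_clicks(true_clicks, fast_treatment_loss, rng)
t_slow = observed_clicks(true_clicks, slow_treatment_loss, rng)

# Relative difference in *logged* clicks, although true clicks are identical.
print(f"control logged clicks:        {c}")
print(f"faster Treatment logged:      {t_fast} ({t_fast / c - 1:+.1%})")
print(f"slower Treatment logged:      {t_slow} ({t_slow / c - 1:+.1%})")
```

Because the true click count is held fixed, any difference in the printed numbers comes from telemetry loss alone, which is exactly the distortion the fidelity guardrail is meant to expose.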

2.1.3 Detection — Fidelity Metric

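Use internal referrers:
  A click that leads to another page on the same site appears in the server log of that next request (through its referrer)
  Logged server-side, so it does not depend on the beacon
  Gives a reference estimate of the true click rate

Or use dual logging:
  Two paths for each click:
    Path 1: Async beacon (lossy)
    Path 2: Server-side log (high fidelity)

  Comparison:
    Count from Path 1
    Count from Path 2
    Ratio: fidelity rate

  Does fidelity differ between Treatment and Control?
  Alert if it does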
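One way to turn the dual-logging comparison into an alert is a two-proportion z-test on the fidelity rate. This is a sketch under stated assumptions: every server-logged click is treated as an independent trial that the beacon either captured or missed, and the function name `fidelity_gap_test` and all counts are hypothetical.

```python
import math
from scipy import stats

def fidelity_gap_test(t_beacon, t_server, c_beacon, c_server):
    """Two-proportion z-test on fidelity = beacon-logged clicks / server-logged clicks.

    Returns (treatment fidelity, control fidelity, two-sided p-value).
    """
    t_fid = t_beacon / t_server
    c_fid = c_beacon / c_server
    pooled = (t_beacon + c_beacon) / (t_server + c_server)
    se = math.sqrt(pooled * (1 - pooled) * (1 / t_server + 1 / c_server))
    z = (t_fid - c_fid) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return t_fid, c_fid, p_value

# Hypothetical counts: similar fidelity in both variants.
_, _, p_healthy = fidelity_gap_test(92_000, 100_000, 92_500, 100_500)
# Hypothetical counts: the Treatment loses noticeably more beacons.
_, _, p_degraded = fidelity_gap_test(89_000, 100_000, 92_500, 100_500)

print(f"healthy p-value:  {p_healthy:.3f}")
print(f"degraded p-value: {p_degraded:.3g}")
```

A small p-value says only that fidelity differs between variants; which metrics are affected and by how much still needs investigation.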

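2.1.3.1 Why Ad Clicks Use Synchronous Logging

The authors emphasize: “ad clicks, which require high fidelity.”

What makes ad clicks special:
  Compliance (advertisers are billed per click)
  Lost clicks mean direct financial damage

  Synchronous beacon:
    Higher latency, but loss is close to 0
    Compliance takes priority

  Result:
    Ad click metric fidelity is close to 100%
    The Treatment's effect on ad clicks is measured accurately

This dual approach is common in the industry.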

2.2 Guardrail 2 — Cache Hit Rates

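Stated by the authors (Ch.21.5, citing Kohavi and Longbotham 2010).

2.2.1 How SUTVA Is Violated

SUTVA (Stable Unit Treatment Value Assumption):
  A unit's outcome depends only on its own variant assignment, not on the assignment of other units

Violated by a shared cache:
  Treatment and Control use the same cache
  Cache hits and misses in one variant affect the other
  → Cache pressure from the Treatment changes the Control's hit rate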

2.2.2 The LRU Cache Case from F-KOH19

Scenario (Example 4 in F-KOH19-2):
  Treatment: 10%, Control: 90% (unequal split)
  Shared cache (LRU)

Cache:
  Limited cache size
  LRU eviction (least recently used)

Effect:
  Control generates 90% of accesses
  Treatment generates 10% of accesses
  Cache entries mostly reflect the Control's access pattern
  Treatment entries are evicted often (small minority)

  Cache hit rate (illustrative numbers):
    Control: 90%+ (its own pattern dominates)
    Treatment: 50% (often evicted)

  Latency:
    Control: fast
    Treatment: slow
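The imbalance can be illustrated with a small simulation. This is a sketch under explicit assumptions, not a reconstruction of the cited example: the two variants are assumed to cache different content for the same item (so the cache key is the pair of variant and item), item popularity is Zipf-like, and the cache size, item count, and request count are arbitrary hypothetical values.

```python
import random
from collections import OrderedDict

def simulate_shared_lru(n_requests=200_000, capacity=500, n_items=2_000,
                        treatment_share=0.10, seed=0):
    """Return per-variant hit rates for one LRU cache shared by both variants."""
    rng = random.Random(seed)
    cache = OrderedDict()                      # insertion order doubles as recency order
    hits = {"control": 0, "treatment": 0}
    total = {"control": 0, "treatment": 0}
    weights = [1.0 / (rank + 1) for rank in range(n_items)]   # Zipf-like popularity
    items = rng.choices(range(n_items), weights=weights, k=n_requests)
    for item in items:
        variant = "treatment" if rng.random() < treatment_share else "control"
        key = (variant, item)                  # assumption: variants cache different content
        total[variant] += 1
        if key in cache:
            hits[variant] += 1
            cache.move_to_end(key)             # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict the least recently used entry
    return {v: hits[v] / total[v] for v in total}

rates = simulate_shared_lru()
print(f"control hit rate:   {rates['control']:.2%}")
print(f"treatment hit rate: {rates['treatment']:.2%}")
```

Under these assumptions the Treatment's entries are touched about nine times less often than the Control's, so they age out of the cache sooner and the Treatment sees a lower hit rate even though nothing about the feature itself is slower.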

2.2.3 Detection — Cache Hit Rate Metric

Per-variant cache hit rate:
  Treatment cache hits / Treatment requests
  Control cache hits / Control requests

  Alert on a significant difference:
    Possible SUTVA violation
    Investigation required

2.2.3.1 Solution

Option 1: Add the experiment ID to the cache key:
  Separate cache entries for each variant
  Avoids cross-variant pollution

Option 2: Equal split (50/50):
  Balanced cache pressure
  Similar hit rates in both variants

Option 3: Separate caches:
  Dedicated cache for each of Treatment and Control
  Compute cost ↑, but isolation is guaranteed

This kind of cache analysis is the standard check for shared resources.

2.3 Guardrail 3 — Cookie Clobbering

Stated by the authors (Ch.21.5, citing Dmitriev et al. 2016): “cookie clobbering can cause severe distortion to other metrics due to browser bugs.”

2.3.1 The Problem with Cookie Write Rate

Standard cookie write:
  Cookie is created on the user's first visit
  Cookie is read and updated on subsequent visits

Frequent writes in the Treatment:
  Cookie written on every search response
  Value is a random number or a timestamp

Browser bug:
  When cookies are written frequently:
    Some cookies become corrupted
    User identification breaks
    → Users are effectively reassigned at random
    → A user may receive a different variant (even within a session)

2.3.2 A Real Case at Bing

Quoted from the authors: “One experiment at Bing wrote a cookie that was not used anywhere and set it to a random number with every search response page. The results showed massive user degradations in all key metrics.”

2.3.2.1 Setup

The Bing experiment:
  Treatment change:
    Cookie written on every search response page
    Random number (not used anywhere)
    Original intent: testing a bug

Result:
  Sessions-per-user: dramatic ↓
  Queries-per-user: dramatic ↓
  Revenue-per-user: dramatic ↓
  → Massive negative movement in every key metric

2.3.2.2 Root Cause

Browser bug:
  When cookies are written frequently
  Some cookies become corrupted
  User identification is partially lost

  Effect:
    The user's cookie changes (random reassignment)
    The same user gets a different ID across sessions
    "Sessions-per-user" is actually "sessions-per-cookie"
    Cookie fragmentation → metric ↓

  Variant assignment:
    Corrupted cookie → variant reassignment
    Some Treatment users are misclassified as Control
    → The analysis is contaminated
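The fragmentation effect can be made concrete with a short simulation. This is a sketch with invented numbers: every user is assumed to have exactly 10 sessions in both variants, and `clobber_prob` is a hypothetical per-session-boundary probability that the identifying cookie is lost and replaced by a new ID.

```python
import numpy as np

def sessions_per_cookie(n_users, sessions_per_user, clobber_prob, rng):
    """Measured 'sessions per user' when users are identified by cookie ID.

    A clobber between two consecutive sessions starts a new cookie ID, so one
    real user is counted as several smaller 'users'.
    """
    total_sessions = n_users * sessions_per_user
    clobbers = rng.binomial(sessions_per_user - 1, clobber_prob, size=n_users)
    n_cookies = n_users + int(clobbers.sum())
    return total_sessions / n_cookies

rng = np.random.default_rng(0)
control = sessions_per_cookie(10_000, 10, clobber_prob=0.0, rng=rng)
treatment = sessions_per_cookie(10_000, 10, clobber_prob=0.2, rng=rng)

print(f"control   sessions per cookie: {control:.2f}")
print(f"treatment sessions per cookie: {treatment:.2f}")
```

True behavior is identical in both variants here, yet the Treatment's measured sessions-per-user drops sharply because the denominator (distinct cookies) is inflated.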

2.3.2.3 Lesson

The authors emphasize: “the results showed massive user degradations in all key metrics.”

What makes this case dramatic:
  An innocent random-number write (never used)
  → Triggers a browser bug
  → Every metric turns negative
  → False conclusion: "Treatment is bad"
  Reality: the Treatment only added a meaningless cookie write

Remedy:
  Cookie write rate metric:
    Monitor per variant
    Alert when a threshold is exceeded

  Best practice:
    Minimize cookie writes
    Review any frequent writes
    Stay aware of browser bugs

This case is evidence of the hidden danger of cookie clobbering.

2.4 Guardrail 4 — Quick Queries

Stated by the authors (Ch.21.5).

2.4.1 Definition

Quick queries:
  Multiple search queries from the same user within one second
  → Sub-second queries
  → Possibly a bot, a browser bug, or an accidental submission
2.4.2 Open Mystery

The authors emphasize: “Google and Bing have both observed this phenomenon, but to date have been unable to explain their cause.”

Observation:
  Share of quick queries:
    Can differ between Treatment and Control
    Cause unknown

  When the Treatment changes the quick-query share:
    Treat the result as untrustworthy
    The cause is unknown, but the result is suspect

2.4.3 Use as Guardrail

Change in the Treatment's quick-query share:
  Alert on a significant change
  → Result is suspect
  → Investigation

Practice at Bing and Google, as described in these notes:
  Quick-query metric
  Monitored per variant
  Quarantine when a threshold is exceeded

This anomaly remains an unresolved question in the industry.

Intuition: the guardrails as a layered defense

2.4.3.1 Division of Labor Among the Guardrails

SRM (most important):
  Verifies sample composition

Telemetry Fidelity:
  Accurate click measurement

Cache Hit Rate:
  Isolation of shared resources

Cookie Clobbering:
  Detection of browser bugs

Quick Queries:
  Awareness of unexplained anomalies

2.4.3.2 Layered Detection

Layer 1: SRM
  → User-level mismatch

Layer 2: Telemetry, Cache, Cookie, Query
  → Hidden implementation and infrastructure issues

Layer 3: Domain-specific (e.g., revenue anomaly)
  → Business-specific checks

2.4.3.3 What Pass and Fail Mean

All checks pass (SRM plus the four guardrails):
  → Strong trust
  → A decision can be made

SRM fails:
  → Major issue (most metrics are suspect)

Another guardrail fails:
  → Specific issue (the metrics tied to that guardrail are suspect)
  → Trust in part of the scorecard weakens

Multiple failures:
  → Broad issue
  → Investigation takes priority

2.4.3.4 Becoming an Industry Standard

Dashboard on a mature platform:
  Trust section:
    SRM (overall)
    SRM (segments)
    A/A check
    Telemetry fidelity
    Cache hit rate (if the cache is shared)
    Cookie write rate
    Quick query rate

  Pass:
    Scorecard shown normally

  Fail:
    Warning on the specific metric
    Guidance for the investigation

This layered defense is the backbone of trust in production.

2.5 추가 — Domain-Specific Guardrails

Quoted from the authors (Ch.21.5): “Sometimes these follow deep investigations and relate to software bugs.”

2.5.0.1 Common Additional Guardrails

Domain-specific:
  - Revenue anomaly (e-commerce)
  - Search query language distribution
  - Change in user language preference
  - Change in geographic distribution
  - Change in user device distribution
  - Change in subscription tier

What they have in common:
  Metrics the Treatment should not affect
  A change signals a systematic issue

2.5.0.2 Example: Geographic Distribution

Treatment change: text size

Expected:
  Geographic distribution: unchanged
  Same effect in all countries

Anomaly:
  The share of Treatment users from certain countries ↑
  The country distribution shifts

Possible causes:
  - Country-specific browser bug
  - Regional CDN differences
  - Localization implementation issue

→ Triggers an investigation
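A distribution check like this can be sketched as a chi-square test of homogeneity on per-country user counts, using `scipy.stats.chi2_contingency`. The helper name `distribution_shift_p_value` and all counts are hypothetical, and the choice of test is an assumption for illustration rather than something the source prescribes.

```python
import numpy as np
from scipy import stats

def distribution_shift_p_value(control_counts, treatment_counts):
    """Chi-square test of homogeneity: do both variants share one country mix?"""
    table = np.array([control_counts, treatment_counts])
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    return p_value

# Hypothetical user counts for three countries.
control = [50_000, 30_000, 20_000]
healthy_treatment = [50_200, 29_900, 19_900]
skewed_treatment = [50_000, 30_000, 23_000]   # third country over-represented

p_healthy = distribution_shift_p_value(control, healthy_treatment)
p_skewed = distribution_shift_p_value(control, skewed_treatment)
print(f"healthy p-value: {p_healthy:.3f}")
print(f"skewed p-value:  {p_skewed:.3g}")
```

The same function applies to device, language, or subscription-tier counts, since each is a categorical attribute the Treatment is not expected to change.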

These domain-specific checks form the advanced layer of a mature platform.

3 Why Multi-Guardrail

3.0.1 Why SRM Alone Is Not Enough

Limits of SRM:
  Verifies sample composition only
  Cannot catch other implementation and infrastructure issues

Examples:
  - Cookie clobbering: SRM passes, but metrics are distorted
  - Cache hit rate: SRM passes, but SUTVA is violated
  - Telemetry fidelity: SRM passes, but click counts are distorted

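3.0.1.1 The Bing Cookie Case

The Bing case quoted by the authors, walked through as an illustration.

SRM check:
  Sample looks normal (mostly fine even if some cookies are corrupted)
  → SRM passes

Metric check:
  Dramatic drop in every metric
  → Real degradation?
  Or an instrumentation bug?

Investigation:
  Cookie write rate metric:
    Treatment writes far more often (e.g., 1000x)
    → Triggers the browser bug
    → Users are fragmented across cookie IDs

→ SRM passes, but the metrics cannot be trusted
→ This is the value of multiple guardrails

This case is the evidence for a multi-guardrail approach.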

4 Why It Is Needed

Without the four additional guardrails:

SRM-pass false trust: looking at SRM alone misses hidden bugs
Metric distortion: silent damage from telemetry, cache, and cookie issues
Open anomaly: unresolved issues such as quick queries go unnoticed

With them enabled:

Layered trust check: SRM + the four additional guardrails
Specific bug detection: domain-specific issues
Decision quality: confidence backed by multiple layers

This layered defense is standard on mature platforms.

5 Application: A Trust Dashboard in the Style of Microsoft Bing

Trust monitoring at Bing (hypothetical reconstruction):

Per-experiment dashboard:
  SRM (overall): pass / warning / fail
  SRM (browser segment): status per browser
  SRM (country segment): status per country
  A/A test status
  Telemetry fidelity: Treatment loss rate vs Control loss rate
  Cache hit rate: Treatment vs Control
  Cookie write rate: Treatment vs Control write count per session
  Quick query rate: Treatment vs Control sub-second queries

Trust score:
  All checks pass → green
  1-2 fail → yellow (warning)
  3+ fail → red (do not use)

Decision:
  Green: scorecard shown automatically
  Yellow: scorecard shown with a warning banner
  Red: scorecard hidden, investigation required
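The trust-score rule above maps directly to a small function. This is a sketch of the rule as written in these notes; the function name `trust_status` and the check names are hypothetical. Note that it weights every check equally, whereas a real platform might reasonably treat an SRM failure alone as red.

```python
def trust_status(check_results):
    """Map {check name: passed?} to a traffic-light status.

    0 failures -> "green", 1-2 failures -> "yellow", 3 or more -> "red".
    """
    failed = sum(1 for passed in check_results.values() if not passed)
    if failed == 0:
        return "green"
    if failed <= 2:
        return "yellow"
    return "red"

example = {
    "SRM": True,
    "Telemetry fidelity": True,
    "Cache hit rate": True,
    "Cookie write rate": False,   # one failing guardrail
    "Quick query rate": True,
}
status = trust_status(example)
print(status)
```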

Engineer's view:
  Detailed diagnostics
  Investigation tools
  Resolution playbook

A dashboard of this kind would serve as the backbone of day-to-day experiment operations.

6 Wrapping Up the Ch.21 Series

Four parts complete:

F21-0: SRM definition, plus a map of the 5 causes, 6 debugging steps, and 4 additional guardrails
F21-1: SRM scenarios (1, 2 + a real Bing scorecard)
F21-2: The 5 causes and 6 debugging steps in detail
F21-3: The additional guardrails: telemetry, cache, cookie, quick queries

Next: Ch.22 (Leakage and Interference, 4 parts).

7 Code Example: Multi-Guardrail Trust Check

An implementation of automated trust checks. The thresholds are illustrative.

from scipy import stats

def srm_check(t, c, alpha=0.001):
    """Chi-square SRM test; assumes the design calls for a 50/50 split."""
    n = t + c
    expected = n / 2
    chi2 = (t - expected)**2 / expected + (c - expected)**2 / expected
    p = stats.chi2.sf(chi2, df=1)  # survival function, equivalent to 1 - cdf
    return {"name": "SRM", "p_value": p, "passed": p > alpha}

def telemetry_fidelity_check(t_clicks_logged, t_clicks_internal_referrer,
                              c_clicks_logged, c_clicks_internal_referrer,
                              max_diff=0.02):
    # Loss rate per variant: share of reference clicks missing from the beacon log
    t_loss = 1 - t_clicks_logged / max(t_clicks_internal_referrer, 1)
    c_loss = 1 - c_clicks_logged / max(c_clicks_internal_referrer, 1)
    diff = abs(t_loss - c_loss)
    return {
        "name": "Telemetry Fidelity",
        "t_loss": t_loss,
        "c_loss": c_loss,
        "diff": diff,
        "passed": diff < max_diff  # default threshold: 2 percentage points
    }

def cache_hit_rate_check(t_hit_rate, c_hit_rate, max_diff=0.05):
    diff = abs(t_hit_rate - c_hit_rate)
    return {
        "name": "Cache Hit Rate",
        "t_rate": t_hit_rate,
        "c_rate": c_hit_rate,
        "diff": diff,
        "passed": diff < max_diff  # default threshold: 5 percentage points
    }

def cookie_write_check(t_writes_per_session, c_writes_per_session):
    if c_writes_per_session > 0:
        ratio = t_writes_per_session / c_writes_per_session
    elif t_writes_per_session == 0:
        ratio = 1.0  # neither variant writes cookies: nothing to flag
    else:
        ratio = float('inf')
    return {
        "name": "Cookie Write Rate",
        "t_rate": t_writes_per_session,
        "c_rate": c_writes_per_session,
        "ratio": ratio,
        "passed": 0.5 < ratio < 2.0  # within 2x in either direction
    }

def quick_query_check(t_quick_rate, c_quick_rate, max_diff=0.02):
    diff = abs(t_quick_rate - c_quick_rate)
    return {
        "name": "Quick Query Rate",
        "t_rate": t_quick_rate,
        "c_rate": c_quick_rate,
        "diff": diff,
        "passed": diff < max_diff  # default threshold: 2 percentage points
    }

def report(checks):
    """Print each check's status and return True only if every check passed."""
    for check in checks:
        status = "PASS" if check["passed"] else "FAIL"
        print(f"  {check['name']}: [{status}]")
    all_passed = all(check["passed"] for check in checks)
    print(f"\n  Overall Trust: {'PASS' if all_passed else 'FAIL'}")
    return all_passed

# === Scenario 1: Healthy experiment ===
print("=== Scenario 1: Healthy Experiment ===")
checks = [
    srm_check(t=500_000, c=502_000),
    telemetry_fidelity_check(t_clicks_logged=92_000, t_clicks_internal_referrer=100_000,
                              c_clicks_logged=92_500, c_clicks_internal_referrer=100_500),
    cache_hit_rate_check(t_hit_rate=0.85, c_hit_rate=0.86),
    cookie_write_check(t_writes_per_session=2.0, c_writes_per_session=2.1),
    quick_query_check(t_quick_rate=0.05, c_quick_rate=0.051),
]
report(checks)
print()

# === Scenario 2: Bing-style cookie clobbering ===
print("=== Scenario 2: Cookie Clobbering Bug ===")
cookie = cookie_write_check(t_writes_per_session=50.0, c_writes_per_session=2.1)  # about 24x
checks = [
    srm_check(t=500_000, c=502_000),  # SRM still passes
    telemetry_fidelity_check(t_clicks_logged=92_000, t_clicks_internal_referrer=100_000,
                              c_clicks_logged=92_500, c_clicks_internal_referrer=100_500),
    cache_hit_rate_check(t_hit_rate=0.85, c_hit_rate=0.86),
    cookie,
    quick_query_check(t_quick_rate=0.05, c_quick_rate=0.051),
]
report(checks)
print(f"  *** Cookie write rate {cookie['ratio']:.0f}x Control ***")
print("  → Investigate cookie clobbering")

Intuition: what the multi-guardrail check catches

7.0.0.1 Scenario 1 (healthy)

All checks pass:
  → Strong trust
  → A decision can be made

This is what a normal experiment looks like.

7.0.0.2 Scenario 2 (cookie clobbering)

SRM pass (sample composition is normal)
Telemetry pass
Cache pass
Cookie FAIL (about 24x write rate)
Quick query pass

→ Invisible if you look at SRM alone
→ The cookie guardrail catches it
→ Triggers an investigation

7.0.0.3 Industry Standard

Modern platforms:
  - Run every guardrail automatically
  - Visualize pass and fail
  - Provide specific investigation guidance on failure
  - Surface a trust signal on the scorecard

This layered defense is standard on mature platforms.

8 Related Topics

Prerequisites

End of the Ch.21 series: four parts complete. Next is Ch.22 (Leakage).

Related chapters

Links to other categories

Engineering — Cache Architecture
Engineering — Cookie Management — Browser quirks
Statistics — SUTVA