Kwangmin Kim - Kohavi Ch.22 Overview: Leakage and Interference between Variants

1 Definition

Definition: Interference and Leakage

In an A/B experiment, interference occurs when one unit's behavior is affected by the variant assignment of other units. It is a violation of SUTVA (Stable Unit Treatment Value Assumption) and is also called spillover or leakage (Kohavi, Tang, Xu, 2020, Ch.22).

1.0.0.1 Formal statement of SUTVA

\[Y_i(\mathbf{z}) = Y_i(z_i) \tag{22.1}\]

\(\mathbf{z} = (z_1, z_2, \dots, z_n)\): vector of variant assignments for all \(n\) units
\(Y_i(\mathbf{z})\): potential outcome of unit \(i\) (depends on the full assignment vector)
\(Y_i(z_i)\): potential outcome of unit \(i\) (depends only on its own assignment)

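1.0.0.2 Consequences of a violation

Biased estimate: the ATE estimator \(\hat{\tau}\) is biased, and the direction (under- or overestimate) depends on the leakage channel
Changed interpretation: \(\hat{\tau}\) measured during the experiment differs from the effect after launching to the whole user universe
Threat to decisions: under a budget constraint, as in an ad campaign, an effect can look positive during the experiment yet be neutral after launch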
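
A sketch of the two quantities involved, written in the notation of Eq. (22.1). The symbols \(\tau_{\text{launch}}\), \(\mathbf{1}\) (everyone in Treatment), \(\mathbf{0}\) (everyone in Control), \(n_T\), and \(n_C\) are introduced here for illustration and do not come from the source:

\[\tau_{\text{launch}} = \frac{1}{n}\sum_{i=1}^{n}\left[Y_i(\mathbf{1}) - Y_i(\mathbf{0})\right], \qquad \hat{\tau} = \frac{1}{n_T}\sum_{i:\, z_i = 1} Y_i(\mathbf{z}) - \frac{1}{n_C}\sum_{i:\, z_i = 0} Y_i(\mathbf{z})\]

Under Eq. (22.1), \(Y_i(\mathbf{z})\) reduces to \(Y_i(z_i)\), so with random assignment \(\hat{\tau}\) is unbiased for \(\tau_{\text{launch}}\). Without it, \(\hat{\tau}\) compares outcomes under the mixed assignment \(\mathbf{z}\), which need not match either \(Y_i(\mathbf{1})\) or \(Y_i(\mathbf{0})\).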

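Intuition

The mental model for an A/B experiment is two parallel universes: in one, everyone is in Treatment; in the other, everyone is in Control. The difference in the metric between the two universes is the true effect.

When SUTVA breaks, the two universes are no longer separate: Treatment users change the behavior of Control users. The delta we observe is then a difference within a mixed universe, not the true difference.

Lesson: check whether the experiment units are truly independent. Social networks, marketplaces, shared resources, and some instrumentation are not independent by default.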

2 Why it matters

Practical risks of SUTVA violations

2.0.0.1 1. Asymmetry in the direction of bias

Scenario	Leakage direction	Bias direction
Social network (Facebook messages)	T → C propagation	underestimate
Marketplace (Airbnb inventory)	T absorbs C's resources	overestimate
Ad campaign budget	T exhausts C's budget	overestimate
Relevance model with shared training	T → C learning leakage	underestimate
CPU contention	T bug degrades C response time	underestimate
Sub-user unit (page)	same user sees a mix of fast and slow pages	underestimate

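2.0.0.2 2. Gap between the experiment and the post-launch effect

Ad campaign budget example:

During the experiment:
  Treatment: more clicks → revenue +5%
  Control: budget as usual → baseline revenue
  delta: +5% (statistically significant)

After launch (all users in Treatment):
  Total budget: unchanged (monthly cap)
  All clicks arrive at the Treatment rate → marginal revenue of the extra clicks is 0
  Actual effect: ~0%

Decision trap: launching on the experiment result alone yields an ROI of about zero.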
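
The arithmetic behind this scenario can be sketched in a few lines. This is a minimal illustration, not the source's model: the `spend` helper, the pro-rata rationing rule, and all dollar figures are hypothetical assumptions, chosen so that the budget is already exhausted under Control.

```python
def spend(demand, budget_cap):
    """Scale each arm's demanded spend down pro rata once total demand exceeds the cap.

    Pro-rata rationing is an assumption of this sketch, not something the source specifies.
    """
    total = sum(demand.values())
    scale = min(1.0, budget_cap / total)
    return {arm: value * scale for arm, value in demand.items()}


BUDGET_CAP = 1000.0  # hypothetical monthly budget, already exhausted under Control

# Experiment: 50/50 traffic, Treatment demands 5% more spend than Control.
experiment = spend({"T": 525.0, "C": 500.0}, BUDGET_CAP)
observed_delta = experiment["T"] / experiment["C"] - 1

# Launch: compare the all-Treatment universe with the all-Control universe.
all_t = spend({"T": 1050.0}, BUDGET_CAP)["T"]
all_c = spend({"C": 1000.0}, BUDGET_CAP)["C"]
launch_delta = all_t / all_c - 1

print(f"observed delta: {observed_delta:+.1%}")
print(f"launch delta:   {launch_delta:+.1%}")
```

Under these assumptions the experiment still shows a delta of about 5%, because Treatment takes a larger share of the same capped budget, while the two single-variant universes both spend exactly the cap.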

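2.0.0.3 3. Channels you are not aware of

The most dangerous leakage flows through channels nobody is thinking about:

shared CPU and memory (contention)
shared cache (hot keys)
shared rate limiter
shared experience (page-level randomization within the same user)

Finding this kind of leakage requires knowing the architecture of the code.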

3 Two connection channels

Definition: Direct vs Indirect Connection

Kohavi et al. classify leakage into two branches (Kohavi, Tang, Xu, 2020, Ch.22).

3.0.0.1 Direct connection

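Two units are directly connected: friendship, being in the same place at the same time, exchanging messages
Examples: Facebook friends, Skype call partners, LinkedIn connections
Channel: social interaction (social engagement)

3.0.0.2 Indirect connection

Two units are connected through a latent variable or a shared resource
Examples: shared Airbnb inventory, shared ad budget, shared training data for a relevance model
Channel: resource competition, algorithm training, infrastructure (CPU)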

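3.0.0.3 Implications of the difference

	Direct	Indirect
Visibility	High (there is a graph)	Low (latent)
Measurability	Edge-level analysis is possible	Requires system dependency analysis
Solutions	network-cluster, ego-centric	resource splitting, geo, time
Does default user-level randomization work?	Partially (with clusters)	Rarely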

Intuition: identifying the medium

Leakage always follows the same pattern:

A → [medium] → B

What is the medium?

friendship graph (social network)
shared budget (ad campaign)
shared inventory (marketplace)
shared training data (relevance model)
shared hardware (CPU)
shared user (sub-user randomization)

Once the medium is identified, the solution follows: either split the medium, or contain it within a cluster and change the randomization unit accordingly.

4 Leakage scenarios (2 direct, 6 indirect)

Direct connection: 2 cases

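4.0.0.1 Facebook · LinkedIn (social engagement)

The value of features such as “video chat”, “message”, and “post” grows as more friends use them
Treatment improves an algorithm → Treatment users send more invitations
Control friends receive those invitations → Control connections and messages also increase
Result: the increase in total invitations is underestimated (Control rises too)

4.0.0.2 Skype calls (two-way communication)

Every call involves at least 2 participants
Treatment improves call quality → Treatment users make more calls
Treatment users call Control friends → Control calls also increase
Result: the Treatment effect is underestimated

4.0.0.3 Intuition

With a direct connection, part of the Treatment effect propagates to Control. The denser the graph, the more it propagates, and the larger the underestimate.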
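
A minimal simulation of this propagation, under strong simplifying assumptions that are not from the source: every user has exactly one friend, outcomes are additive, and the parameters `lift` (own-treatment effect) and `spill` (effect of a treated friend) are hypothetical.

```python
import random
from statistics import mean


def simulate_spillover(n_pairs=20000, lift=1.0, spill=0.5, seed=0):
    """User-level coin-flip assignment where a treated friend also raises my outcome."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_pairs):
        z_a, z_b = rng.randint(0, 1), rng.randint(0, 1)  # independent assignment
        for own, friend in ((z_a, z_b), (z_b, z_a)):
            y = 10.0 + lift * own + spill * friend + rng.gauss(0, 1)
            (treated if own else control).append(y)
    observed = mean(treated) - mean(control)
    launch = lift + spill  # everyone treated vs nobody treated, by construction
    return observed, launch


observed, launch = simulate_spillover()
print(f"observed delta: {observed:.2f}")
print(f"launch effect:  {launch:.2f}")
```

Because a friend's assignment is independent of one's own, both arms receive the same average spillover and it cancels in the difference. The observed delta therefore recovers roughly `lift` alone and misses the `spill` component that a full launch would deliver.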

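Indirect connection: 6 cases

4.0.0.4 Airbnb (marketplace inventory)

Treatment improves conversion → Treatment users book more
More bookings from the same inventory → less inventory left for Control users to see
Control revenue falls → the delta is overestimated

4.0.0.5 Uber·Lyft (two-sided market)

Treatment improves the “surge price” algorithm → more Treatment riders opt in
The number of drivers on the road is unchanged → prices rise and rides fall for Control riders
The delta is overestimated

4.0.0.6 eBay (auction)

Treatment encourages bidding → winning bids in Treatment increase
Both variants compete for the same items → Control users are less likely to win
The delta is overestimated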

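4.0.0.7 Ad campaign budget

Treatment improves CTR → Treatment clicks increase
The shared budget is spent faster → Control clicks run short of budget
The delta is overestimated, plus an end-of-month effect

4.0.0.8 Relevance model with shared training

Treatment predicts clicks better → Treatment produces higher-quality click data
Training data is shared → the Control model also learns from the higher-quality data
Control improves over time → the delta is underestimated

4.0.0.9 CPU contention

A Treatment bug hogs the CPU → Control response time on the same machine also degrades
Treatment's negative latency effect is underestimated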

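Intuition: what determines the bias direction

The bias direction depends on the nature of the channel:

Contested resources (inventory, budget): T absorbs C's resources → overestimate
Social propagation (friendship, communication): the T effect propagates to C → underestimate
Shared training (model training): T data also improves the C model → underestimate
Shared infra (CPU): a T bug also affects C → underestimate (of the negative effect)

Rule: when variants compete for a scarce resource, the delta is overestimated; when the Treatment effect spills over to Control in the same direction (a shared externality, positive or negative), it is underestimated.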

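5 Four families of practical solutions

1. Rule-of-thumb (ecosystem value)

The lightest-weight approach.

Keep Bernoulli randomization (the typical user-level assignment)
Measure the delta of the first-order action (e.g., messages sent)
Measure downstream metrics alongside it (e.g., replies and sessions of the message recipients)
Estimate an ecosystem multiplier from historical experiments (e.g., 1 message → 0.7 ecosystem value)

Advantage: easy to apply after a one-time calibration. Disadvantage: it is an average, so it is inaccurate for an experiment that differs from the average.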
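
One way the calibration step could look, as a sketch only. The `fit_multiplier` helper, the choice of a least-squares slope through the origin, and every number in `history` are hypothetical; the source does not say how the multiplier is estimated.

```python
def fit_multiplier(history):
    """Least-squares slope through the origin: downstream_delta ≈ m * first_order_delta."""
    numerator = sum(first * downstream for first, downstream in history)
    denominator = sum(first * first for first, _ in history)
    return numerator / denominator


# Hypothetical past experiments: (first-order delta, downstream delta), e.g.
# extra messages sent vs. extra downstream value measured for the recipients.
history = [(100.0, 70.0), (200.0, 140.0), (50.0, 35.0)]
multiplier = fit_multiplier(history)

# Apply the calibrated multiplier to a new experiment's first-order delta.
new_first_order_delta = 120.0
ecosystem_value = multiplier * new_first_order_delta

print(f"multiplier:      {multiplier:.2f}")
print(f"ecosystem value: {ecosystem_value:.1f}")
```

The made-up history lies exactly on a line, so the fit returns a multiplier of 0.7; real historical data would scatter around the fitted line, which is the averaging weakness noted above.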

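2. Isolation (5 approaches)

Separate the medium.

5.0.0.1 Splitting shared resources

Ad budget: split the budget in proportion to traffic (50/50 traffic gets a 50/50 budget)
Training data: train separately for each variant
Limitation: splitting heterogeneous machines between variants can introduce confounding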

5.0.0.2 Geo-based randomization

Randomize at the level of geographic regions (Vaver and Koehler 2011, 2012)
Applies to: hotel, taxi, and ride marketplaces
Limitation: the sample size is capped by the number of regions → higher variance
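
The variance cost can be seen in a small sketch. Everything here is an illustrative assumption rather than the cited method: the `geo_experiment` function, the normal geo-level baseline, and the parameter values.

```python
import random
from statistics import mean, stdev


def geo_experiment(n_geos=40, users_per_geo=500, lift=0.5, seed=1):
    """Randomize whole geos, then analyze one aggregated value per geo."""
    rng = random.Random(seed)
    geos = list(range(n_geos))
    rng.shuffle(geos)
    treated = set(geos[: n_geos // 2])  # half of the geos get Treatment
    geo_means = {0: [], 1: []}
    for geo in range(n_geos):
        z = int(geo in treated)
        geo_baseline = rng.gauss(0, 1)  # geos differ from one another
        outcomes = [10 + geo_baseline + lift * z + rng.gauss(0, 1)
                    for _ in range(users_per_geo)]
        geo_means[z].append(mean(outcomes))
    delta = mean(geo_means[1]) - mean(geo_means[0])
    # Standard error computed from geo-level means: n is the number of geos.
    se = sum(stdev(v) ** 2 / len(v) for v in geo_means.values()) ** 0.5
    return delta, se


delta, se = geo_experiment()
print(f"delta={delta:.2f}  geo-level SE={se:.2f}")
```

Although 20,000 simulated users take part, the standard error is driven by the 40 geos. Under this model's variances, a naive user-level analysis would report a standard error of roughly 0.02, which ignores the within-geo correlation.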

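5.0.0.3 Time-based randomization

In each time unit (minute, hour, day), everyone is in T or everyone is in C
Applies when: the same user does not carry leakage from one time unit to the next
Limitation: time effects (day of week, time of day) are strong, so a paired t-test is needed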
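
A sketch of why pairing helps, under assumptions that are not from the source: each day is split into two adjacent slots, a coin flip decides which slot gets Treatment, and a large day-level effect (including a hypothetical weekend bump) is shared by both slots.

```python
import random
from statistics import mean, stdev


def time_based_experiment(n_days=56, lift=2.0, seed=7):
    """Each day has two adjacent time slots; a coin flip decides which one gets Treatment."""
    rng = random.Random(seed)
    y_t, y_c = [], []
    for day in range(n_days):
        weekend_bump = 20.0 if day % 7 >= 5 else 0.0
        day_level = 100.0 + weekend_bump + rng.gauss(0, 5)  # shared by both slots
        slots = [day_level + rng.gauss(0, 1), day_level + rng.gauss(0, 1)]
        t_slot = rng.randint(0, 1)
        y_t.append(slots[t_slot] + lift)
        y_c.append(slots[1 - t_slot])
    diffs = [t - c for t, c in zip(y_t, y_c)]
    delta = mean(diffs)
    paired_se = stdev(diffs) / n_days ** 0.5  # day effect cancels within a pair
    unpaired_se = (stdev(y_t) ** 2 / n_days + stdev(y_c) ** 2 / n_days) ** 0.5
    return delta, paired_se, unpaired_se


delta, paired_se, unpaired_se = time_based_experiment()
print(f"delta={delta:.2f}  paired t={delta / paired_se:.1f}  "
      f"unpaired t={delta / unpaired_se:.1f}")
```

The paired standard error uses within-day differences, so the day-of-week swing drops out; the unpaired one carries the full day-to-day variance and is far less sensitive in this setup.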

5.0.0.4 Network-cluster randomization

Randomize at the level of clusters in the social network
Applies to: Facebook, LinkedIn
Limitation: a dense graph cannot be perfectly isolated (at LinkedIn, 80%+ of edges remain between clusters)
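
A toy sketch of cluster-level assignment and of how much leakage survives it. The graph, the cluster map, and both helper functions are hypothetical; producing the clusters themselves (graph partitioning) is outside this sketch.

```python
import random


def assign_by_cluster(cluster_of, seed=0):
    """Flip one coin per cluster; every user inherits the cluster's variant."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_of.values()))
    variant_of_cluster = {c: rng.randint(0, 1) for c in clusters}
    return {user: variant_of_cluster[c] for user, c in cluster_of.items()}


def cross_variant_share(edges, variant_of):
    """Fraction of edges whose endpoints ended up in different variants."""
    crossing = sum(variant_of[a] != variant_of[b] for a, b in edges)
    return crossing / len(edges)


# Hypothetical graph: two triangles joined by a single bridge edge (c, d).
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("c", "d")]

variant_of = assign_by_cluster(cluster_of)
inter_cluster_share = sum(cluster_of[a] != cluster_of[b] for a, b in edges) / len(edges)
print(f"inter-cluster edges:  {inter_cluster_share:.0%}")
print(f"cross-variant edges:  {cross_variant_share(edges, variant_of):.0%}")
```

Only inter-cluster edges can connect different variants, so their share is an upper bound on the remaining leakage paths. When that share is high, as in the dense-graph limitation above, isolation is only partial.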

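5.0.0.5 Network ego-centric randomization

The unit is a bundle of an ego (focal node) and its alters (adjacent nodes)
Only the ego is randomized to a variant; the alters' assignment is set deliberately (controlled)
Advantage: small clusters preserve sample size, and first-order and downstream effects can be separated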

3. Edge-Level Analysis

Bernoulli randomization on users, plus edge labeling.

Classify each interaction edge into one of 4 types:

T → T (Treatment to Treatment)
T → C (Treatment to Control)
C → C (Control to Control)
C → T (Control to Treatment)

5.0.0.6 Analysis patterns

Unbiased delta: contrast T → T edges with C → C edges
Treatment affinity: do T users message other T users more often?
Response rate: do new actions by T users receive more replies?
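
The labeling step can be sketched as follows. The `summarize_edges` helper, the user names, and the message log are hypothetical and only illustrate the four edge types; a real analysis would also need to normalize by the number of candidate pairs of each type.

```python
from collections import Counter, defaultdict


def summarize_edges(messages, variant_of):
    """Label each (sender, receiver, replied) message by variant pair; count and reply rate."""
    names = {1: "T", 0: "C"}
    counts = Counter()
    replies = defaultdict(int)
    for sender, receiver, replied in messages:
        label = f"{names[variant_of[sender]]}->{names[variant_of[receiver]]}"
        counts[label] += 1
        replies[label] += int(replied)
    reply_rate = {label: replies[label] / counts[label] for label in counts}
    return counts, reply_rate


# Hypothetical assignment (1 = Treatment, 0 = Control) and message log.
variant_of = {"u1": 1, "u2": 1, "u3": 0, "u4": 0}
messages = [
    ("u1", "u2", True),   # T->T
    ("u2", "u1", True),   # T->T
    ("u1", "u3", False),  # T->C
    ("u3", "u4", True),   # C->C
    ("u4", "u1", False),  # C->T
]

counts, reply_rate = summarize_edges(messages, variant_of)
for label in ("T->T", "T->C", "C->C", "C->T"):
    print(label, counts[label], reply_rate[label])
```

With the labels in place, the T->T versus C->C contrast and the T->T versus T->C comparison (Treatment affinity) become simple group-by operations on the edge table.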

5.0.0.7 Limitations

Only works when an edge can be clearly defined (message, like, visit)
Lower power than ego-centric randomization (every user is in a single variant)

4. Detection & Monitoring

Even when leakage is not measured, monitoring is essential.

Detect outliers at each ramp stage (employees → small datacenter → 1% → 10% → 50%)
Monitor budget-constrained and unconstrained segments separately
Platform-wide alerts: CPU spikes, upper quantiles of latency

The 4 ramp phases (Pre-MPR / MPR / Post-MPR / Long-term) are the first line of defense for finding leakage (see Ch.15).

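Intuition: trade-offs among the solutions

Approach	Where it applies	Sample size	Bias removal	Implementation cost
Rule-of-thumb	social engagement	large (Bernoulli)	weak	low
Resource split	budget, training	large	strong	medium
Geo	marketplace	small (number of regions)	strong	medium
Time	short-term transactional	small (number of time units)	strong	low
Network-cluster	social network	small (number of clusters)	partial	high
Ego-centric	social network	medium	strong	high
Edge-level	social network	large	strong	medium

Selection principle: choose the isolation that matches the nature of the medium, and combine it with Bernoulli randomization where possible to recover sample size.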

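6 Python simulation (simple marketplace leakage)

import numpy as np

rng = np.random.default_rng(42)

def book(intent, n_items):
    # Serve users who want to book in random arrival order until inventory runs out
    booked = np.zeros(intent.size, dtype=bool)
    arrivals = rng.permutation(np.flatnonzero(intent))
    booked[arrivals[:n_items]] = True
    return booked

def simulate_marketplace(n_users=10000, n_items=100, base_p=0.05, treatment_lift=0.20):
    # Experiment: 50/50 user-level assignment, Treatment lifts booking intent by 20%,
    # and both variants draw on the same inventory
    assignment = rng.integers(0, 2, size=n_users)  # 0 = C, 1 = T
    p = np.where(assignment == 1, base_p * (1 + treatment_lift), base_p)
    booked = book(rng.binomial(1, p), n_items)
    rate_t = booked[assignment == 1].mean()
    rate_c = booked[assignment == 0].mean()

    # Parallel universes: everyone in T vs everyone in C, same inventory
    launch_t = book(rng.binomial(1, base_p * (1 + treatment_lift), size=n_users), n_items).mean()
    launch_c = book(rng.binomial(1, base_p, size=n_users), n_items).mean()
    return rate_t, rate_c, launch_t, launch_c

# Average over replications: a single run is noisy with only 100 bookings
runs = np.array([simulate_marketplace() for _ in range(500)])
rate_t, rate_c, launch_t, launch_c = runs.mean(axis=0)
print(f"Treatment rate: {rate_t:.4f}")
print(f"Control rate:   {rate_c:.4f}")
print(f"Observed delta: {rate_t / rate_c - 1:.2%}")
print(f"Launch delta:   {launch_t / launch_c - 1:.2%}")
print("Bias direction: overestimate (Treatment wins inventory that Control would otherwise book)")

Interpreting the simulation

Inventory is scarce (n_items=100, against roughly 550 expected booking intents), so all 100 units are booked whether everyone is in Treatment or everyone is in Control: the launch effect on bookings is about 0%. In the mixed experiment, Treatment users generate 20% more booking intents and therefore win a larger share of the same 100 units, which lowers the Control booking rate. The observed delta stays near +20% on average, so the experiment overestimates the launch effect.

Remedy: geo-based randomization (isolate inventory by city) or time-based randomization (in a given time window, everyone is in T or everyone is in C).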

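7 Comparison

Dimension	Ordinary experiment (SUTVA holds)	Experiment with leakage
Assumption	units are independent	units are dependent
Estimator	\(\hat{\tau} = E[Y \mid T] - E[Y \mid C]\)	biased estimator
Identification	the causal effect itself	a “mixed universe” effect
Decision	launch effect ≈ experiment effect	the two can diverge
Remedy	none needed	isolation, rule-of-thumb, edge analysis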

8 Applications

Social networks (Facebook, LinkedIn, Twitter): network-cluster + ego-centric
Two-sided marketplaces (Airbnb, Uber, eBay): geo-based + time-based
Ad platforms (Google, Bing): budget split + ramp monitoring
Search engines: separately trained relevance models + holdback experiment
Content platforms (YouTube, TikTok): network-cluster + ecosystem multiplier

9 Follow-up posts in Phase F

F22-1: Direct + Indirect Connections, with mechanism details of the 6 cases
F22-2: Practical Solutions + Ecosystem Value, with implementation patterns for the rule-of-thumb
F22-3: Isolation + Edge-Level + Detection, with trade-offs of the 4 isolation approaches

10 Related topics

Ch.14 Randomization Unit (F-KOH14): user vs page vs cluster
Ch.15 Ramping (F-KOH15): ramping is the first line of defense against leakage
Ch.18 Variance (F-KOH18): compensating for the sample size lost to isolation
Ch.21 SRM (F-KOH21): an SRM check is mandatory after isolation
D-18 (temporal variation): closely related to time-based randomization
J-SWITCH (Phase J): switchback designs generalize time-based randomization

Sources

Kohavi, Tang, Xu (2020). Trustworthy Online Controlled Experiments. Cambridge University Press. Ch.22 (Leakage and Interference between Variants).
Imbens & Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences.
Eckles, Karrer, Ugander (2017). “Design and Analysis of Experiments in Networks.” Journal of Causal Inference.
Holtz (2018). “Limiting Bias from Test-Control Interference In Online Marketplace Experiments.” MIT Thesis.
Saint-Jacques et al. (2018). LinkedIn ego-centric randomization.
Vaver and Koehler (2011, 2012). Geo-based experiments.
Bojinov and Shephard (2017). Time series experiments.