Kwangmin Kim - How to Construct an OEC, with Case Studies: Amazon E-mail

1 Definition

Definition: the mathematical form of the OEC

The OEC is a weighted sum of normalized metrics (Roy 2001).

\[\text{OEC} = \sum_{i=1}^{k} w_i \cdot \tilde{m}_i\]

$\tilde{m}_i$: normalized value of the $i$-th metric (usually in the range 0 to 1)
$w_i$: weight, with $\sum_i w_i = 1$
$k$: number of metrics combined (5 or fewer recommended)
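
As a minimal sketch of the definition above, the weighted sum can be computed directly. The function name `oec` and the example values are hypothetical, and the metrics are assumed to be already normalized to the 0 to 1 range.

```python
import numpy as np

def oec(normalized_metrics, weights):
    """Weighted sum of normalized metrics; the weights must sum to 1."""
    m = np.asarray(normalized_metrics, dtype=float)
    w = np.asarray(weights, dtype=float)
    if m.shape != w.shape:
        raise ValueError("need exactly one weight per metric")
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    return float(np.dot(w, m))

# Hypothetical example: three metrics already normalized to [0, 1]
score = oec([0.6, 0.4, 0.9], [0.5, 0.3, 0.2])
print(round(score, 3))  # 0.5*0.6 + 0.3*0.4 + 0.2*0.9 = 0.6
```

The explicit check on the weights is a design choice for the sketch: it fails loudly when a weight is added or changed without renormalizing.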

Three key design decisions.

Which metrics to combine (Ch.7.1)
How to normalize them (Ch.7.2)
How to set the weights (Ch.7.3)

Each decision affects how trustworthy the OEC is and how easily it can be gamed.

2 Concepts and Principles

2.1 Normalization

Normalization converts metrics with different units and scales so that they can be summed.

2.1.1 Method 1: Min-Max Normalization

\[\tilde{m}_i = \frac{m_i - m_i^{\min}}{m_i^{\max} - m_i^{\min}}\]

Pros: simple. Cons: sensitive to outliers, and the normalized values shift whenever $m_i^{\min}$ or $m_i^{\max}$ changes.

2.1.2 Method 2: Z-score Normalization

\[\tilde{m}_i = \frac{m_i - \mu_i}{\sigma_i}\]

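Pros: less sensitive to outliers than min-max, and it has a statistical interpretation (distance from the mean in standard deviations). Cons: values can be negative, so a plain sum is harder to interpret.

2.1.3 Method 3: Relative Lift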

\[\tilde{m}_i = \frac{m_i^{\text{Treatment}} - m_i^{\text{Control}}}{m_i^{\text{Control}}}\]

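Pros: the natural way to express an experiment result, since it measures the change rather than the metric itself. Cons: unstable when the baseline is close to 0.

The authors' recommendation: relative lift is the most natural choice for an experiment OEC. For business reporting, show absolute values alongside it.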
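
The three normalization methods above can be sketched side by side. This is an illustrative sketch only: the function names and the `history` values are hypothetical, and min-max and z-score are applied here to a small series of observed metric values.

```python
import numpy as np

def min_max(values):
    """Min-max normalization: maps the smallest value to 0 and the largest to 1."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

def z_score(values):
    """Z-score normalization: distance from the mean in standard deviations."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

def relative_lift(treatment, control):
    """Relative lift of Treatment over Control; unstable when control is near 0."""
    return (treatment - control) / control

history = [0.070, 0.080, 0.082, 0.090]  # hypothetical revenue-per-user values
print(min_max(history))              # values in [0, 1]
print(z_score(history))              # mean 0, standard deviation 1
print(relative_lift(0.082, 0.080))   # about 0.025, i.e. +2.5%
```

Note that adding one extreme value to `history` would change every min-max output, while the relative lift of a given Treatment/Control pair would stay the same.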

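Intuition: why relative lift is natural for an OEC

The essence of an A/B test is the difference between Treatment and Control, so the rate of change is more meaningful than a comparison of absolute values.

Example: revenue per user goes from 0.080 to 0.082. The absolute difference of 0.002 looks small, but as a +2.5% change the business impact is clear.

Relative lift also resolves unit mismatches when metrics are combined: revenue ($), clicks (count), and latency (ms) can all be expressed as a % change.

Mathematically, a first-order Taylor expansion gives $f(x_0 + \Delta x) \approx f(x_0) (1 + f'(x_0)/f(x_0) \cdot \Delta x)$. Applied to the logarithm, $\ln(m^{\text{Treatment}}/m^{\text{Control}}) \approx (m^{\text{Treatment}} - m^{\text{Control}})/m^{\text{Control}}$, so relative change is the first-order approximation of the change in the log-transformed metric, and the approximation is accurate when the change is small.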
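
A quick numeric check of the first-order approximation, using the revenue-per-user example from the text (0.080 to 0.082). The variable names are illustrative.

```python
import math

control, treatment = 0.080, 0.082

relative_lift = (treatment - control) / control
log_ratio = math.log(treatment / control)

print(f"relative lift: {relative_lift:.4f}")  # 0.0250
print(f"log ratio:     {log_ratio:.4f}")      # 0.0247
# The gap is about half the squared lift, the leading term the approximation drops
print(f"gap:           {relative_lift - log_ratio:.6f}")
```

For a lift of a few percent the two quantities agree to within a few hundredths of a percentage point; for large lifts (tens of percent) they diverge noticeably.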

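2.2 Setting the Weights

This is the hardest part. There are two approaches.

2.2.1 Approach 1: Top-down (intuition-based)

Leadership sets the weights in advance, for example “revenue 50%, engagement 30%, latency 20%”.

Pros: fast, and the intent is explicit. Cons: arbitrary and hard to validate.

2.2.2 Approach 2: Bottom-up (accumulated decisions)

Accumulate human decisions using the four-scenario classification, then extract the weights with logistic regression.

Step 1: Collect about 100 scenario-4 (mixed-result) cases
Step 2: For each case, record the metric changes and the human decision (ship/no-ship)
Step 3: Fit P(ship | metric changes) = sigmoid(w1·Δm1 + w2·Δm2 + ...)
Step 4: Adopt the learned weights, rescaled to a common scale, as the OEC weights

Pros: based on revealed preference, so it reflects the real trade-offs. Cons: takes time to accumulate the data (several quarters).

The authors' recommendation: combine the two. Start top-down, then recalibrate bottom-up.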

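2.3 Four-Scenario Classification

In three of the four scenarios, the decision can be automated even without explicit weights.

Scenario	Results across all key metrics	Decision
1	All flat or +, at least one +	Ship
2	All flat or -, at least one -	Don’t ship
3	All flat	Don’t ship; review power or pivot
4	Mix of + and -	Trade-off; human judgment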
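
The table above maps directly to a small classifier. This is a sketch under an assumption: each key metric has already been labeled `"+"`, `"-"`, or `"flat"` by a significance test upstream. The function name and labels are hypothetical.

```python
def classify(results):
    """Map per-metric labels ('+', '-', 'flat') to (scenario number, decision)."""
    has_positive = "+" in results
    has_negative = "-" in results
    if has_positive and has_negative:
        return 4, "trade-off: human judgment"
    if has_positive:
        return 1, "ship"
    if has_negative:
        return 2, "don't ship"
    return 3, "don't ship; review power or pivot"

print(classify(["+", "flat", "flat"]))  # scenario 1
print(classify(["flat", "-", "flat"]))  # scenario 2
print(classify(["flat", "flat"]))       # scenario 3
print(classify(["+", "-", "flat"]))     # scenario 4
```

Only the scenario-4 branch needs weights or a human; the other three branches never look at magnitudes.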

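Intuition: why scenarios 1, 2, and 3 can be decided automatically

Scenario 1: every metric is flat or +, so the launch is net positive. Scenario 2: every metric is flat or -, so the launch is net negative. Scenario 3: every metric is flat, so there is not enough information to decide; more data or a different hypothesis is needed.

These three scenarios can be decided regardless of the weights. As long as every weight is positive (with each metric oriented so that an increase is good), the weighted sum is positive in scenario 1 and negative in scenario 2. Only the signs of the weights determine the sign of the decision; their magnitudes do not matter.

This is the general principle of a Pareto improvement: if a change is equal or better on every dimension, it is adopted regardless of how the dimensions are weighted. It is a well-known result in economics.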
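
The weight-independence argument can be written out as a short derivation. It rests on an assumption stated here rather than in the source: every metric is oriented so that an increase is good, and every weight is strictly positive. $\Delta\tilde{m}_i$ denotes the Treatment-minus-Control change in normalized metric $i$.

\[\Delta\text{OEC} = \sum_{i=1}^{k} w_i \, \Delta\tilde{m}_i, \qquad w_i > 0\]

\[\text{Scenario 1: } \Delta\tilde{m}_i \ge 0 \ \forall i, \ \Delta\tilde{m}_j > 0 \text{ for some } j \;\Rightarrow\; \Delta\text{OEC} \ge w_j \, \Delta\tilde{m}_j > 0\]

\[\text{Scenario 2: } \Delta\tilde{m}_i \le 0 \ \forall i, \ \Delta\tilde{m}_j < 0 \text{ for some } j \;\Rightarrow\; \Delta\text{OEC} \le w_j \, \Delta\tilde{m}_j < 0\]

With mixed signs the conclusion no longer follows: for $\Delta\tilde{m} = (+1, -1)$, $\Delta\text{OEC} = w_1 - w_2$, whose sign depends on which weight is larger.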

Only in scenario 4 does the trade-off depend on the weights. In other words, weight information is truly needed only for scenario 4, which makes setting the weights less daunting.

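2.4 Otis Redding Problem

A pitfall the authors cite (Pfeffer & Sutton 1999), named after a lyric from “(Sittin' On) The Dock of the Bay”: “Can’t do what ten people tell me to do, so I guess I’ll remain the same.”

Too many key metrics → cognitive overload → decision paralysis, or the metrics get ignored.

Remedy: limit key metrics to five. If there are more, reduce them by combining into an OEC.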

2.4.1 Statistical Rationale

Multiple testing. Consider $k$ independent metrics, each tested at a threshold of $\alpha = 0.05$.

\[P(\text{at least one false positive}) = 1 - (1 - 0.05)^k\]

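$k$	Chance of at least one false positive
5	23%
10	40%
20	64%

With five metrics, the chance of at least one false positive under the independence assumption is already 23%, and it rises quickly beyond that. The rule of thumb treats five as the most that is tolerable without a multiple-testing correction.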
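
The table values follow directly from the formula. A minimal sketch, with a hypothetical function name and the same independence assumption as the text:

```python
def familywise_false_positive_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (5, 10, 20):
    print(f"k={k:2d}: {familywise_false_positive_rate(k):.0%}")
# k= 5: 23%
# k=10: 40%
# k=20: 64%
```

Real metrics are usually correlated rather than independent, in which case the actual rate is lower than this formula suggests.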

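3 Amazon E-mail OEC Case

3.1 Background

Amazon's programmatic e-mail campaigns (Kohavi & Longbotham 2010).

Notifications of a new book release by an author
E-mails based on recommendation algorithms
Cross-pollination (recommendations across categories)

Each campaign needs an OEC for launch and operating decisions.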

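3.2 Initial OEC: Click-through Revenue

OEC₁ = Σᵢ revenue_from_email_clickthroughᵢ

$i$: e-mail recipient
The sum of revenue generated by clicks from the campaign's e-mails

3.2.1 The Pitfall

This OEC increases monotonically with e-mail volume: more campaigns → more revenue (in the short term). The result:

Users start getting spammed
Users complain
Eventually users unsubscribe from all Amazon e-mail

The same pitfall applies to Treatment vs Control comparisons: a Treatment that sends more e-mail always shows the higher OEC.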

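3.3 Stopgap: Traffic Cop

Add a constraint of “one e-mail per X days”.

Problem: it is hard to decide which campaign gets sent first. Competition among campaigns becomes a new optimization problem.

3.4 Evolved OEC: Subtracting LTV Loss

Key insight: subtract the lifetime revenue lost to unsubscribes.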

\[\text{OEC}_2 = \frac{\sum_i \text{Rev}_i - s \cdot \text{unsubscribe\_lifetime\_loss}}{n}\]

$i$: e-mail recipient
$s$: number of unsubscribes
$\text{unsubscribe\_lifetime\_loss}$: lower bound on the lifetime e-mail revenue lost per unsubscribe
$n$: number of users
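
A minimal sketch of the OEC₂ formula, contrasted with the click-through-revenue sum. All numbers are hypothetical, including the per-unsubscribe loss of 3.0; the function name is illustrative.

```python
def email_oec(revenues, n_unsubscribes, unsubscribe_lifetime_loss, n_users):
    """Per-user click-through revenue net of the lifetime loss from unsubscribes."""
    return (sum(revenues) - n_unsubscribes * unsubscribe_lifetime_loss) / n_users

# Hypothetical variants of 1000 users each; Treatment sends more e-mail
control_rev = [2.0] * 60 + [0.0] * 940     # total click-through revenue 120
treatment_rev = [2.0] * 100 + [0.0] * 900  # total click-through revenue 200

control_oec = email_oec(control_rev, 20, 3.0, 1000)
treatment_oec = email_oec(treatment_rev, 80, 3.0, 1000)

print(sum(treatment_rev) > sum(control_rev))  # True: Treatment wins on raw revenue
print(control_oec, treatment_oec)             # about 0.06 vs -0.04: Treatment loses on OEC2
```

In this made-up example the variant that sends more e-mail earns more click-through revenue yet ends up with a negative OEC₂ once unsubscribes are priced in.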

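Intuition: what the LTV loss means

The original OEC looks only at income from current transactions. An unsubscribe blocks future transactions, and the present value of that blocked revenue is a kind of liability.

In accounting terms: looking only at revenue is like reading the income line of the P&L and ignoring the balance sheet. Net value = income - liability, and every unsubscribe adds to the liability.

OEC₂ measures this net value: sustainable income rather than income alone.

Mathematically, $\text{unsubscribe\_lifetime\_loss}$ is the net present value of future revenue.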

\[\text{LTV} = \sum_{t=0}^{\infty} \frac{\mathbb{E}[\text{rev}_t]}{(1 + r)^t}\]

It can be estimated from a discount rate $r$ and the average retention. Amazon used a conservative lower bound.
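
One way to make the LTV sum concrete is to assume a constant expected revenue $c$ per period and a constant per-period retention rate $\rho$, so that $\mathbb{E}[\text{rev}_t] = c \rho^t$. These are simplifying assumptions for illustration, not Amazon's actual model. The sum is then a geometric series with closed form $c(1+r)/(1+r-\rho)$.

```python
def ltv_truncated(rev_per_period, retention, discount_rate, horizon):
    """Direct evaluation of the discounted sum, truncated at `horizon` periods."""
    return sum(
        rev_per_period * retention**t / (1 + discount_rate) ** t
        for t in range(horizon)
    )

def ltv_closed_form(rev_per_period, retention, discount_rate):
    """Geometric-series closed form; valid when retention < 1 + discount_rate."""
    return rev_per_period * (1 + discount_rate) / (1 + discount_rate - retention)

# Hypothetical inputs: 1.0 revenue per period, 90% retention, 10% discount rate
print(ltv_closed_form(1.0, 0.9, 0.1))     # about 5.5
print(ltv_truncated(1.0, 0.9, 0.1, 500))  # converges to the same value
```

A conservative lower bound, as the text describes, could be obtained by plugging in pessimistic values for retention or revenue per period.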

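3.5 Results

Under OEC₂, more than half of the campaigns came out negative; that is, short-term income was smaller than the long-term loss from unsubscribes.

Side effect: the unsubscribe page was redesigned. The default changed from “all Amazon emails” to “this campaign family” → lower cost per unsubscribe → more campaigns became net-positive again.

General lesson from this case: the OEC is itself a discovery tool. When the OEC is consistently negative in some area, re-examine the business assumptions behind that area. The OEC is not only an evaluation tool but also a strategy tool.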

4 Bing Search OEC Case

4.1 Two Key Metrics

Bing's organizational metrics (Kohavi et al. 2012).

Query share = Bing distinct queries / queries across all search engines
Revenue per user

4.2 A Puzzling Incident

A bug in the Bing ranker → very poor results shown to users.

Distinct queries per user: +10%
Revenue per user: +30%

If the OEC were a simple sum of the two metrics, the bug would be the winner.

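4.3 Mechanism: Why Poor Results Raise the Metrics

Poor results → users keep trying.

They enter more queries (because they cannot find the information)
They click more ads (because the organic results are irrelevant)
Both show up as gains in short-term metrics

Long-term mechanism.

Users leave (for a better search engine)
Revenue falls
But a short-term experiment cannot detect this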

4.4 Decomposing Distinct Queries (Equation 7.1)

The authors' key decomposition.

\[\frac{\text{distinct queries}}{\text{month}} = \frac{\text{users}}{\text{month}} \times \frac{\text{sessions}}{\text{user}} \times \frac{\text{distinct queries}}{\text{session}}\]

The three factors.

Users per month: cannot be used in an experiment (fixed by design, e.g. a 50/50 split)
Distinct queries per session: to be minimized (users reach their goal faster), but must be distinguished from abandonment
Sessions per user: to be maximized (satisfied users come back more often)
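
The decomposition can be illustrated with made-up numbers that mimic the ranker-bug pattern: total distinct queries rise, but the entire gain comes from queries per session while sessions per user does not move. The function name and all values are hypothetical.

```python
def distinct_queries_per_month(users, sessions_per_user, queries_per_session):
    """Equation 7.1: the product of the three factors."""
    return users * sessions_per_user * queries_per_session

# Hypothetical variants with the same number of users (fixed by the 50/50 split)
control = {"users": 1000, "sessions_per_user": 10.0, "queries_per_session": 2.0}
buggy = {"users": 1000, "sessions_per_user": 10.0, "queries_per_session": 2.2}

total_lift = distinct_queries_per_month(**buggy) / distinct_queries_per_month(**control) - 1
session_lift = buggy["sessions_per_user"] / control["sessions_per_user"] - 1

print(f"distinct queries lift:  {total_lift:+.1%}")
print(f"sessions per user lift: {session_lift:+.1%}")
```

Looking only at the total would reward the bug; splitting it into factors shows that the increase sits in the factor that should be going down, not up.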

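4.5 The Key OEC: Sessions per User

Bing's conclusion: sessions per user is the key driver. Queries per session is minimized only after verifying task completion.

Intuition: why sessions per user is robust

Other metrics can be gamed.

Total queries: gamed by poor results (the Bing bug above)
Revenue per user: gamed by flooding the page with ads
CTR: gamed by clickbait

Sessions per user is hard to game. Sessions rise only when users choose to come back, which is hard to bring about by coercion or deception.

This is the general robustness of voluntary-action metrics: a user's voluntary behavior is a strong signal of satisfaction and value, whereas forced actions can be gamed.

Sessions per user has limits too: it can rise when short sessions pile up (for example, users coming back to search again after irrelevant results). It therefore needs to be checked alongside session-level quality metrics (queries per session, time-to-answer, and so on).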

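4.6 Constraining Revenue

Revenue can be gamed as well (by flooding the page with ad space).

Remedy: constrained optimization. Increase revenue subject to a cap on the share of pixels given to ads.

\[\max \ \text{revenue per user} \quad \text{s.t.} \quad \text{ad pixels per user} \leq \text{threshold}\]

This is a general pattern for OECs: maximize under constraints. An unconstrained metric can almost always be gamed.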
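
In an experiment setting, the constrained problem above often reduces to filtering variants by the constraint and then picking the best of what remains. A sketch with hypothetical variants, field names, and threshold:

```python
def best_feasible(variants, max_ad_pixels):
    """Return the highest-revenue variant that respects the ad-pixel cap, or None."""
    feasible = [v for v in variants if v["ad_pixels_per_user"] <= max_ad_pixels]
    if not feasible:
        return None
    return max(feasible, key=lambda v: v["revenue_per_user"])

# Hypothetical experiment results
variants = [
    {"name": "A", "revenue_per_user": 1.00, "ad_pixels_per_user": 800},
    {"name": "B", "revenue_per_user": 1.10, "ad_pixels_per_user": 950},
    {"name": "C", "revenue_per_user": 1.40, "ad_pixels_per_user": 1600},
]

unconstrained = max(variants, key=lambda v: v["revenue_per_user"])
constrained = best_feasible(variants, max_ad_pixels=1000)
print(unconstrained["name"], constrained["name"])  # C B
```

Variant C wins on raw revenue only by nearly doubling ad space; under the cap, B is the winner.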

5 Case Comparison

Company	OEC evolution	Key lesson
Amazon	Revenue → Revenue - LTV loss	Subtract the liability
Bing	Queries + Revenue → Sessions per user (constrained)	Voluntary metric + constraint
YouTube	Watchtime → Satisfied watchtime	Quality weighting
Netflix	Total hours → Bucketized hours	Interpretable bucketing

Common pattern.

The initial OEC is a simple output metric
Gaming is discovered
Evolution: add net value, voluntary action, quality, constraints
Discovery is self-reinforcing: a new OEC leads to new discoveries

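6 Why It Is Needed

What happens when an OEC is run statically, without evolving.

Gaming accumulates: gaps in the initial OEC are exploited and grow over time
Stakeholder trust falls: the perception spreads that the OEC does not reflect true value
Decision consistency falls: people ignore the OEC and revert to intuition

An evolving system.

Regular OEC updates (quarterly or yearly)
Monitoring for gaming attempts
Integrating new value dimensions (e.g., privacy, safety)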

7 Applications

7.1 LinkedIn's OEC Evolution

(From prior knowledge)

Phase 1: Engagement (sessions, time)
Phase 2: Quality engagement (meaningful interaction)
Phase 3: Connection-aware OEC (incorporating network effects)

Each step followed the discovery of a limitation in the previous OEC (e.g., engagement gamed by passive scrolling).

7.2 Spotify's OEC

(From prior knowledge)

Phase 1: Listening time
Phase 2: Listening time + retention
Phase 3: Deliberate listening (weighted by skip rate)

8 Example: OEC Composition and Weight Learning

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n_decisions = 200

# Synthetic scenario-4 (mixed-result) decision data
# Each row: the metric changes of one experiment; the human ship/no-ship decision is added below
historical = pd.DataFrame({
    "delta_rpv": rng.normal(0.005, 0.015, n_decisions),         # fractional lift: mean +0.5%, sd 1.5%
    "delta_engagement": rng.normal(0.003, 0.010, n_decisions),  # fractional lift: mean +0.3%, sd 1.0%
    "delta_latency_ms": rng.normal(20, 40, n_decisions),        # absolute change in ms
})

# Hypothetical true weights (applied implicitly by the humans), per % lift and per ms
true_weights = {"rpv": 5.0, "engagement": 3.0, "latency": -0.05}
historical["score"] = (
    true_weights["rpv"] * historical["delta_rpv"] * 100 +
    true_weights["engagement"] * historical["delta_engagement"] * 100 +
    true_weights["latency"] * historical["delta_latency_ms"]
)
# Simulated human decision: ship when the score plus judgment noise is positive
historical["ship_decision"] = (historical["score"] + rng.normal(0, 0.5, n_decisions) > 0).astype(int)

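# Learn the weights with logistic regression
X = historical[["delta_rpv", "delta_engagement", "delta_latency_ms"]].to_numpy(copy=True)
X[:, :2] *= 100  # rpv and engagement lifts -> percent; latency stays in ms
y = historical["ship_decision"].to_numpy()

# Large C means almost no L2 penalty; the default C=1.0 would shrink the coefficients
model = LogisticRegression(C=1e6, max_iter=5000)
model.fit(X, y)
coef = model.coef_[0]

# Coefficients are identified only up to a common scale (set by the decision noise),
# so rescale them to make the rpv weight equal its true value before comparing
rescaled = coef * true_weights["rpv"] / coef[0]

names = ["rpv (%)", "engagement (%)", "latency (ms)"]
print("=== Learned OEC Weights (raw / rescaled to rpv = +5) ===")
for name, raw, scaled in zip(names, coef, rescaled):
    print(f"  {name:15s}: {raw:+.3f} / {scaled:+.3f}")
print("\n=== True Weights (for comparison) ===")
for name, key in zip(names, true_weights):
    print(f"  {name:15s}: {true_weights[key]:+.3f}")

Output (seed 42). The learned values are not listed because they depend on the solver and library version; the true-weight block prints:

=== True Weights (for comparison) ===
  rpv (%)        : +5.000
  engagement (%) : +3.000
  latency (ms)   : -0.050

Intuition: the learned weights reveal the human decision rule

Logistic regression identifies the weights only up to a common scale factor, which is set by the noise in the decisions, and a default L2 penalty would shrink them further. The raw coefficients therefore need not match the true weights; what should match are the ratios between them, which is why the comparison is made after rescaling to rpv = +5. With 200 decisions the rescaled weights should land near the true values, though the exact numbers depend on the run. This shows that implicit weights exist in the human decisions.

RPV +1% ≈ Engagement +1.7% (ratio 5/3)
RPV +1% is worth about the same as a 100 ms latency reduction (5 / 0.05 = 100)

These quantitative exchange rates become the explicit OEC weights, which can then be used for automated decisions.

Mathematical implication: the individual decisions are noisy, but the weights behind them are consistent. Logistic regression models that noise explicitly, which is what makes it possible to extract the weights. The standard errors shrink as decisions accumulate; a few hundred decisions (roughly 200 to 500) is the ballpark suggested here.

This pattern is the general principle of revealed preference: what people do (revealed preference) reflects their true values better than what they say (stated preference). The OEC quantifies this.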

9 Related Topics

Preceding: Ch.7 series

F7-0: Ch.7 overview

Following: Ch.7 series

F7-2: Goodhart·Campbell·Lucas

Related chapters

F6-*: Organizational metrics (Ch.6)
F22-*: Ratio Metrics (Ch.22), with a caution on variance computation

Other categories

Statistics: Logistic Regression (weight learning)
Statistics: Multiple Testing (statistics behind the limit of five)