Kwangmin Kim - Kohavi Ch.13 Overview: Instrumentation

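1 Definition

Definition: Instrumentation

The infrastructure that records every meaningful action and state of the system (website, app) and its users in a persistent, structured form (Kohavi, Tang, Xu, 2020, Ch.13).

The authors use instrument, track, and log interchangeably. The core idea is to preserve what happened in a form that can be analyzed after the fact.

1.0.0.1 Three instrumentation layers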

Layer	Target	Examples
User Action	What the user does	Click, hover, scroll, time-to-click
System Performance	How the system performs	Server latency, p99, cache hit rate
System State	What state the system is in	Error count, exception, retry, cluster info

Quoted from the book (Marcus Aurelius): “Everything that happens happens as it should, and if you observe carefully, you will find this to be so.”

Key insight: without instrumentation there are no experiments. An A/B result is a comparison of instrumented metrics, so if the instrumentation is missing, the comparison itself is impossible. Instrumentation is therefore the most basic layer of an experimentation platform and a prerequisite for every advanced capability (CUPED, triggered analysis, SRM detection).

2 Concepts and Principles

2.1 Why Care: a prerequisite for experiments, the OEC, and meta-analysis

The authors stress this in the introduction: “Before you can run any experiments, you must have instrumentation in place to log what is happening to the users and the system.”

2.1.0.1 Four things instrumentation supports

1. Day-to-day operations (Live Site Health)
   - Is the system working normally?
   - Which paths do users take through the service?
   - Baseline metrics

2. Experimentation
   - Treatment vs Control metric comparison
   - Triggered analysis (Ch.20)
   - SRM detection (Ch.21)

3. Evolving the OEC (Ch.6, Ch.7)
   - Evaluating metric sensitivity
   - Validating new candidate metrics
   - Detecting Goodhart effects

4. Meta-analysis (the institutional memory of Ch.8)
   - Cumulative analysis of 1000+ experiments
   - Identifying best practices
   - Empirical research

All four depend on instrumentation. If it is missing or incomplete, all four areas are crippled.

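2.1.0.2 Analogy: the evolution of medical imaging

19th century: stethoscope + palpation
  → Low diagnostic accuracy
  → High surgical risk

Early 1900s: X-ray
  → Bones and lungs become diagnosable
  → Some diseases caught

1970s-1980s: CT scan
  → Precise diagnosis from cross-sectional images
  → Tumors that were previously invisible are found

2020s: MRI + AI analysis
  → Fine structure + automated anomaly detection
  → Diagnostic accuracy ↑↑

Data science tooling has followed the same path:

1990s: simple page view counters
  → Almost no fine-grained user behavior visible

2000s: web beacons, click tracking
  → Some user behavior caught

2010s: rich, integrated client/server instrumentation
  → Both user intent and system internals visible

2020s: real-time + ML-driven
  → Automated anomaly detection, automated causal inference

The essence of this curve: the more you can see, the better you decide. Instrumentation sets the ceiling on decision quality.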

2.2 Client and Server: two complementary views

The first key distinction the authors draw (Ch.13.1).

2.2.1 The client-side instrumentation view

What the user sees and how they behave:
  - What is actually rendered on the client screen
  - The exact timing of user clicks, hovers, and scrolls
  - Client-only actions (no server roundtrip)
  - JavaScript errors, app crashes
  - Page rendering performance

2.2.1.1 Core value: what only the client can see

Quoted by the authors (Kohavi et al. 2014): “client-side malware can overwrite what the server sends and this is only discoverable using client-side instrumentation.”

Client malware scenario:
  Server: responds with normal HTML
  Client malware: alters the HTML (ad injection, content manipulation)
  User: sees the altered HTML

Server-only instrumentation:
  - Records only the HTML the server sent
  - Does not know what the user saw
  → Malware cannot be detected

Client instrumentation:
  - Records what was actually rendered for the user
  - Compare with the server response → detect the difference
  → Malware becomes visible

This example captures the unique value of client instrumentation: with the server alone, you do not know the user's actual experience.

2.2.1.2 Other client-only signals

1. User hover (no server roundtrip):
   - A signal of which element the user is interested in
   - Indicates intent before a click

2. Slideshow click patterns:
   - Which slide the user lingered on
   - Navigation between content already shipped to the client

3. Form field validation:
   - Which field showed the user an error
   - Client-side errors that never reach the server

4. Page time-to-interactive:
   - The moment the user can actually interact
   - Differs from the server's "response sent" time

2.2.2 The server-side instrumentation view

What happens inside the system:
  - Server response latency (p50, p95, p99)
  - Processing time per component
  - Which server responded (load balancing)
  - Cache hit rate
  - Internal scoring (e.g., search ranking score)

2.2.2.1 Core value: system internals

Search engine example:
  - Internal ranking scores can be logged
  - Why this result was ranked first
  - Enables algorithm debugging and tuning

Debugging an A/B experiment:
  - Which server served which variant?
  - Does the latency distribution differ by variant?
  - Does the cache hit rate differ by variant?

2.2.2.2 Difference in variance

The authors emphasize: “the data tend to have lower variance, allowing for more sensitive metrics.”

Client metrics:
  - Include external noise such as the user's network and device
  - High variance

Server metrics:
  - Measure only system internals (no outside network effects)
  - Low variance
  - Sensitivity ↑
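
The link between lower variance and higher sensitivity can be sketched with a small simulation. Everything here is an assumption made for illustration: the distributions, the 2 ms treatment effect, and the helper names `simulate` and `z_score` are not from the book.

```python
from typing import Tuple

import numpy as np

rng = np.random.default_rng(0)
n = 5_000  # users per variant (hypothetical)


def simulate(effect_ms: float) -> Tuple[np.ndarray, np.ndarray]:
    """Return (server_ms, client_ms) timings for one variant."""
    # Server-side: time to generate the page, unaffected by the network.
    server = rng.normal(loc=100.0 + effect_ms, scale=10.0, size=n)
    # Client-side: the same server time plus noisy network/device delay.
    client = server + rng.lognormal(mean=5.0, sigma=0.8, size=n)
    return server, client


def z_score(treatment: np.ndarray, control: np.ndarray) -> float:
    """Two-sample z statistic for the difference in means."""
    se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
    return float((treatment.mean() - control.mean()) / se)


control_server, control_client = simulate(effect_ms=0.0)
treat_server, treat_client = simulate(effect_ms=2.0)  # true 2 ms regression

z_server = z_score(treat_server, control_server)
z_client = z_score(treat_client, control_client)
print(f"server-side z = {z_server:.2f}")
print(f"client-side z = {z_client:.2f}")
```

The same 2 ms shift is present in both metrics, but the added network noise inflates the standard error of the client-side metric, so its z statistic is far smaller. That is the sense in which the server-side metric is more sensitive.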

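Intuition: the two views are inherently complementary

Client and server are complements, not competitors. For a single action by a single user, the two views provide different information.

2.2.2.3 Two views of one user click

The user clicks the "Purchase" button:

Client instrumentation:
  - Click time: 14:23:45.123
  - Element ID: btn_purchase
  - User device: iPhone 15
  - Preceding hover: 250ms
  - Browser: Safari

Server instrumentation:
  - Request arrival: 14:23:45.456 (an apparent 333ms gap, which mixes network delay with any clock skew)
  - Response latency: 120ms
  - Server: us-west-2-server-7
  - Cache: miss
  - DB query: 80ms

Only together do they give the complete picture.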

2.2.2.4 Sample analysis scenario

Q: "Why did purchase conversion drop?"

Client only:
  - Click rate dropped → a change in user behavior?
  - But the root cause is unknown

Server only:
  - Latency increased → a system problem?
  - But how users reacted is unknown

Client + Server:
  - Latency increase (server) → click hesitation (client)
  - Users abandon because of the slow response
  - Root cause: server bottleneck
  - Action: scale up the servers

Joint analysis is what makes root-cause identification possible.

2.2.2.5 Additional point: cross-validation

A mismatch between client and server is itself a signal.

Server says "response sent" but client says "not received":
  → Network problem or client crash
  → Ghost user session

Client says "click happened" but server has no record of the request:
  → Network loss or client malware
  → Data integrity issue

This mismatch detection is the silent value of two-sided instrumentation. With only one side, the mismatch itself is not visible.
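
A minimal sketch of mismatch detection, assuming both logs share an `event_id` key. The two tiny logs and the `match_status` labels are hypothetical. An outer join with pandas' `indicator=True` option marks which side each row came from.

```python
import pandas as pd

# Hypothetical event logs keyed by a shared event_id.
client_log = pd.DataFrame({
    "event_id": ["e1", "e2", "e3", "e4"],
    "action": ["click", "click", "hover", "click"],
})
server_log = pd.DataFrame({
    "event_id": ["e1", "e3", "e4", "e5"],
    "status": [200, 200, 500, 200],
})

# An outer join keeps events seen by only one side; indicator=True adds a
# "_merge" column recording where each row was found.
joined = client_log.merge(server_log, on="event_id", how="outer", indicator=True)

labels = {
    "both": "matched",
    "left_only": "client_only",   # client logged it, server never saw it
    "right_only": "server_only",  # server logged it, client never reported it
}
joined["match_status"] = joined["_merge"].astype(str).map(labels)

print(joined[["event_id", "match_status"]])
print(joined["match_status"].value_counts())
```

Tracking the share of `client_only` and `server_only` rows over time turns the mismatch into a metric that can be alerted on.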

2.2.3 Four challenges of client-side instrumentation

Stated by the authors (Ch.13.1, for JavaScript-based clients).

2.2.3.1 Challenge 1: Performance Cost

Cost of instrumentation code:
  - JavaScript snippet load time ↑
  - CPU cycles ↑
  - Network bandwidth ↑
  - Battery drain (mobile)

Implication: instrumentation itself degrades the user experience. Too much instrumentation is self-defeating.

2.2.3.2 Challenge 2: Web Beacon Lossiness

Cited by the authors (Kohavi, Longbotham, Walker 2010).

Web beacon mechanism:
  - User click → tracking pixel request → reaches the server
  - The next page load can start before the beacon finishes sending

Three scenarios:
  a. Async beacon (default):
     - Page navigation takes priority → the beacon may be lost (loss rate varies by browser)
  b. Sync beacon:
     - Wait until the beacon has been sent → latency ↑
     - User click abandonment ↑
  c. Choose per application:
     - Critical events such as ad clicks (compliance) → prefer sync
     - General user actions → async, accepting some loss
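
The async/sync trade-off can be made concrete with a toy model. This is a sketch under loudly stated assumptions: beacon delivery time and time-to-page-unload are both drawn from exponential distributions with made-up scales (80 ms and 300 ms), which are not measurements from any browser.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clicks = 100_000

# Assumed (not measured) timings, in milliseconds.
beacon_ms = rng.exponential(scale=80.0, size=n_clicks)   # time to deliver the beacon
unload_ms = rng.exponential(scale=300.0, size=n_clicks)  # time until the page is torn down

# Asynchronous: navigation proceeds immediately; the beacon is lost if the
# page unloads before delivery finishes.
async_loss_rate = float(np.mean(beacon_ms > unload_ms))
async_added_latency_ms = 0.0

# Synchronous: navigation waits for the beacon, so nothing is lost, but
# every click pays the delivery time.
sync_loss_rate = 0.0
sync_added_latency_ms = float(beacon_ms.mean())

print(f"async: loss rate {async_loss_rate:.3f}, added latency {async_added_latency_ms:.0f} ms")
print(f"sync:  loss rate {sync_loss_rate:.3f}, added latency {sync_added_latency_ms:.0f} ms")
```

Under these assumed scales the async variant loses roughly a fifth of beacons while adding no latency, and the sync variant loses none but delays every navigation. Which side of that trade is acceptable depends on how critical the event is.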

2.2.3.3 Challenge 3: Unreliable Client Clocks

The authors emphasize: “Client clock can be changed, manually or automatically. This means that the actual timing from the client may not be fully synchronized with server time.”

Client clock issues:
  - The user sets it manually (e.g., changing the timezone)
  - The OS has not synced via NTP
  - The user changes it deliberately (date hacks)

Implications for analysis:
  - Computing "client time - server time" is dangerous
  - Use only server time when comparing timestamps across machines
  - Base time-ordered analysis of user behavior on server logs

The authors' maxim: “never subtract client and server times.”
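
A small worked example of why the subtraction fails, assuming a device clock that runs a constant 7 minutes ahead (all values are hypothetical). Mixing the two clocks lets the skew leak into the result, while a difference between two timestamps from the same clock cancels a constant offset.

```python
from datetime import datetime, timedelta

# Hypothetical scenario: the device clock runs 7 minutes ahead of the server.
clock_skew = timedelta(minutes=7)
true_click = datetime(2026, 5, 8, 10, 0, 0)
true_network_delay = timedelta(milliseconds=50)
true_dwell = timedelta(seconds=3)  # time between page render and click

# What each side records.
client_render_ts = true_click - true_dwell + clock_skew
client_click_ts = true_click + clock_skew
server_receive_ts = true_click + true_network_delay

# Mixing clocks: the skew dominates the 50 ms network delay.
mixed = server_receive_ts - client_click_ts
print("server - client:", mixed.total_seconds(), "s")

# Same-clock difference: the constant offset cancels out.
dwell = client_click_ts - client_render_ts
print("client - client:", dwell.total_seconds(), "s")
```

The cancellation only holds while the offset stays constant between the two readings, so it suits short durations measured on one device and not long sessions during which the clock may be resynced.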

2.2.3.4 Challenge 4: Data Accuracy

Causes of lower accuracy in client instrumentation:
  - Browser plugins (ad blockers) block tracking
  - Cookies differ in privacy (incognito) mode
  - User agent spoofing

2.2.3.5 The complementary value of server-side instrumentation

The authors emphasize: “Server-side instrumentation suffers less from these concerns. It offers a less clear view of what the user is actually doing but can provide more granularity of what is happening inside your system and why.”

Trade-off summary:

Dimension	Client	Server
User experience visibility	Strong	Weak
System internal visibility	Weak	Strong
Data accuracy	Medium (lossy)	Strong
Variance	High (network noise)	Low
Cost (to the user)	Some cost	None
Privacy concern	High	Low
Malware detection	Possible	Not possible

Instrumenting both sides at once is therefore the standard. One side alone cannot support a complete analysis.

2.3 Integrating logs from multiple sources

The second key point the authors make (Ch.13.2; covered in detail in F-KOH13-2).

2.3.0.1 A variety of log sources

1. Per-client logs:
   - Web browser (JavaScript)
   - Mobile app (iOS, Android)
   - Desktop client (Office, Adobe)
   - Each with its own format and timestamp precision

2. Server logs:
   - Web server (Apache, Nginx)
   - Application server (Java, Python)
   - Database (queries, exceptions)
   - Each with its own schema

3. User state logs:
   - Sign-in, sign-out
   - Opt-in, opt-out preferences
   - Subscription state
   - Stored in separate systems

2.3.0.2 Integration challenges

1. No shared join key:
   - Cookie ID on the client
   - User ID on the server (after sign-in)
   - Device ID on mobile
   - The same user holds several IDs

2. Schema mismatch:
   - Client log: JSON
   - Server log: structured text
   - DB log: SQL audit format
   - Conversion cost

3. Timestamp synchronization:
   - Even server clocks can differ from each other by milliseconds
   - Client clocks cannot be trusted
   - Hard to decide which event happened first
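
The join-key problem can be illustrated with a simple identity-stitching sketch. It assumes a hypothetical log in which anonymous events carry only a `cookie_id` and signed-in events carry both IDs; the column names and the `resolved_id` fallback rule are illustrative choices, not the book's method.

```python
import pandas as pd

# Hypothetical event log: anonymous events carry only a cookie_id; events
# after sign-in also carry a user_id.
events = pd.DataFrame({
    "event_id": ["e1", "e2", "e3", "e4", "e5"],
    "cookie_id": ["c1", "c1", "c2", "c2", "c3"],
    "user_id": [None, "u42", None, "u42", None],
})

# Build a cookie -> user mapping from rows where both IDs were observed.
id_map = (
    events.dropna(subset=["user_id"])
    .drop_duplicates("cookie_id")
    .set_index("cookie_id")["user_id"]
)

# Resolve every event to one analysis key: the linked user_id if the cookie
# was ever tied to one, otherwise fall back to the cookie itself.
events["resolved_id"] = events["cookie_id"].map(id_map).fillna(events["cookie_id"])
print(events[["event_id", "cookie_id", "resolved_id"]])
```

Cookies `c1` and `c2` both resolve to `u42`, so events from two browsers are attributed to one user, while `c3` stays anonymous. A real pipeline would also need a policy for shared devices, where one cookie maps to several users.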

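2.3.0.3 Solution: standardization

The authors' recommendations:

1. Common identifiers:
   - Pseudonymous user ID (before sign-in: device-based)
   - User ID (after sign-in)
   - Event ID (unique identifier for each event)

2. Common schema:
   - Standard fields: timestamp, country, language, platform
   - Custom fields: domain-specific

3. Server time first:
   - Order all events, including client events, by their arrival time at the server
   - Keep the client timestamp for reference only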
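
One way to picture these recommendations is a common event record with standard fields, a free-form custom section, and a server-stamped ordering key. The `Event` class, its field names, and the `normalize` helper below are hypothetical and are not a schema taken from the book.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional

STANDARD_FIELDS = ("event_id", "user_key", "platform", "country", "language")


@dataclass
class Event:
    """Common event record. Field names are illustrative assumptions."""
    event_id: str
    user_key: str                  # pseudonymous or signed-in ID
    platform: str
    country: str
    language: str
    server_received_at: datetime   # ordering key, stamped by the server
    client_timestamp: Optional[datetime] = None  # kept for reference only
    custom: Dict[str, Any] = field(default_factory=dict)


def normalize(raw: Dict[str, Any], received_at: datetime) -> Event:
    """Map a raw client payload onto the common schema."""
    client_ts = raw.get("client_ts")
    reserved = set(STANDARD_FIELDS) | {"client_ts"}
    return Event(
        **{name: raw[name] for name in STANDARD_FIELDS},
        server_received_at=received_at,
        client_timestamp=datetime.fromisoformat(client_ts) if client_ts else None,
        custom={k: v for k, v in raw.items() if k not in reserved},
    )


raw_payload = {
    "event_id": "evt_00001",
    "user_key": "anon_device_7f3a",
    "platform": "web",
    "country": "KR",
    "language": "ko",
    "client_ts": "2026-05-08T09:58:00+00:00",
    "element_id": "btn_buy",  # domain-specific, ends up in custom
}
event = normalize(raw_payload, received_at=datetime(2026, 5, 8, 10, 0, 1, tzinfo=timezone.utc))
print(event)
```

Sorting by `server_received_at` gives a single ordering across clients, and anything outside the standard fields stays available under `custom` without changing the schema.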

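2.4 Instrumentation Culture: the airplane instrument panel analogy

The third key point the authors make (Ch.13.3; covered in detail in F-KOH13-2).

2.4.0.1 The airplane analogy

Quoted from the authors: “Imagine flying a plane with broken instruments in the panel. It is clearly unsafe, yet teams may claim that there is no user impact to having broken instrumentation. How can they know? Those teams do not have the information to know whether this is a correct assumption because, without proper instrumentation, they are flying blind.”

Airplane instrument panel:
  - Airspeed indicator broken → cannot tell if you are flying too fast
  - Altimeter broken → cannot detect the risk of a crash
  - Fuel gauge broken → cannot respond in an emergency
  - All working: safe flight

Service instrumentation:
  - Engagement metric broken → user satisfaction unknown
  - Latency metric broken → performance degradation goes undetected
  - Error rate broken → incidents go unnoticed
  - All working: safe operation

The message of the analogy: broken instrumentation does not directly cause an accident, but the ability to detect and prevent accidents disappears.

2.4.0.2 Cultural Norm: “no instrumentation, no ship”

The core norm stated by the authors: “nothing ships without instrumentation.”

Operating principles:
  - Include instrumentation in the feature spec
  - Reject features without instrumentation at review
  - Broken instrumentation = broken feature (same priority)
  - Instrumentation tests are part of the dev cycle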
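
Treating instrumentation tests as part of the dev cycle could look like the sketch below: a feature takes a logger as a dependency, and a unit test asserts that the expected event is emitted. `InMemoryLogger`, `add_to_cart`, and the event fields are all hypothetical names invented for this example.

```python
from typing import Any, Dict, List


class InMemoryLogger:
    """Test double that records events instead of sending them."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def log(self, name: str, **fields: Any) -> None:
        self.events.append({"name": name, **fields})


def add_to_cart(user_id: str, sku: str, logger: InMemoryLogger) -> Dict[str, str]:
    """Hypothetical feature: add an item and emit the matching event."""
    cart_line = {"user_id": user_id, "sku": sku}
    logger.log("add_to_cart", user_id=user_id, sku=sku)
    return cart_line


def test_add_to_cart_is_instrumented() -> None:
    logger = InMemoryLogger()
    add_to_cart("user_001", "sku_123", logger)
    # Exactly one event, with the agreed name and required fields.
    assert len(logger.events) == 1
    event = logger.events[0]
    assert event["name"] == "add_to_cart"
    assert {"user_id", "sku"} <= event.keys()


test_add_to_cart_is_instrumented()
print("instrumentation test passed")
```

If someone later removes or renames the `logger.log` call, this test fails in the same run as the feature's other tests, which is what makes broken instrumentation visible before ship.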

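2.4.0.3 Difficulties: time lag + functional separation

The authors' analysis:

1. Time lag:
   Code written → first user → first analysis → result → awareness
   = days to weeks

2. Functional separation:
   - The engineer who built the feature ≠ the person analyzing the logs
   - Engineers do not know the quality of their own instrumentation
   - By the time an analyst finds buggy instrumentation, the engineer has moved on to other work

2.4.0.4 Solutions (three stated by the authors)

1. Cultural norm:
   "Nothing ships without instrumentation"
   Instrumentation = part of the feature

2. Invest in testing:
   - Test instrumentation during the dev cycle
   - Engineers check instrumentation output locally
   - Reviewers check it in code review

3. Raw log monitoring:
   - Event count by dimension
   - Invariant checks (timestamp ranges, etc.)
   - Anomaly detection
   - Fix immediately on discovery (everyone's priority)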
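
Raw log monitoring can start with very plain checks. The sketch below runs an event count by dimension and two invariant checks on a hypothetical one-hour log; the column names, the window, and the specific invariants are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw log for one hour of traffic.
window_start = pd.Timestamp("2026-05-08 10:00:00")
window_end = pd.Timestamp("2026-05-08 11:00:00")
raw_log = pd.DataFrame({
    "event_id": ["e1", "e2", "e3", "e3", "e4"],
    "platform": ["web", "ios", "web", "web", "android"],
    "server_timestamp": pd.to_datetime([
        "2026-05-08 10:05:00", "2026-05-08 10:20:00", "2026-05-08 10:40:00",
        "2026-05-08 10:40:00", "2026-05-09 23:00:00",
    ]),
})

# 1. Event count by dimension: a sudden drop for one platform often means
#    its instrumentation broke.
counts = raw_log.groupby("platform")["event_id"].count()

# 2. Invariant checks: each should find zero offending rows.
out_of_window = raw_log[~raw_log["server_timestamp"].between(window_start, window_end)]
duplicate_ids = raw_log[raw_log["event_id"].duplicated(keep=False)]

violations = {
    "timestamp_out_of_window": len(out_of_window),
    "duplicate_event_id": len(duplicate_ids),
}
print(counts)
print(violations)
```

Here one event falls outside the hour and `e3` appears twice, so both invariants report violations. Running such checks on every batch shortens the lag between shipping buggy instrumentation and noticing it.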

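Assumption: when an instrumentation culture is absent

Assumption: “Instrumentation is the data team's responsibility; engineers only build features.”

Consequences:

Silent damage from buggy instrumentation: broken instrumentation is discovered only at analysis time, after the feature ships. Days to weeks of data are lost and the analysis is unreliable.
Little will to fix: the engineer is on other work and defers with “the data team can fix instrumentation bugs.” It stays broken for a long time.
Loss of analyst trust: analysts doubt their own results. “Is the data broken again?” becomes the default suspicion.
Crisis of trust in experiment results: A/B results are doubted, putting the experimentation culture itself at risk.

Solution: apply all three of the authors' recommendations together. One alone is not enough (e.g., testing without the cultural norm means the test results get ignored).

2.4.0.5 Analogy: airplane instrument panel vs. car dashboard

Airplane: a broken instrument panel grounds the plane immediately; the flight itself is refused. Car: most people will still drive with a broken dashboard.

Service instrumentation: the airplane mindset is recommended, with the force of “broken instrumentation = stop operating the service.” Most companies, however, have the car mindset → silent damage accumulates.

This mindset shift is the essence of instrumentation culture.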

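3 Why It Is Needed

Without instrumentation:

No experiments: Treatment vs Control metrics cannot be measured at all
OEC neutralized: metrics are defined but never measured
Blind live site: incidents are hard to detect when they occur
No post-hoc analysis: with no record of past events, root causes cannot be traced
No meta-analysis: no accumulated data

With instrumentation in place:

Every experiment becomes possible: trustworthy metric comparison
Live site visibility: anomalies detected immediately
Causal inference capability: analysis of hidden confounders
Meta-analysis possible: cumulative analysis across 1000 experiments

This gap is the most fundamental dimension of any platform. Instrumentation quality sets the ceiling on platform maturity.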

4 Applications: industry tooling for instrumentation

4.0.0.1 Instrumentation systems by company (background knowledge)

Company	Client instrumentation	Server instrumentation	Integration
Google	gTM (custom)	Borgmon, Stackdriver	Dremel + BigQuery
Facebook	XHP custom logger	Scribe + Hadoop	Scuba
Microsoft	A.B.S (custom)	ETW, Azure Monitor	Cosmos + Azure Data Lake
Netflix	Mantis	Atlas	Keystone Pipeline
Uber	Heatpipe	M3, Jaeger	uReplicator

Each company has built its own client and server instrumentation plus an integration platform. Instrumentation is one of the largest platform investment areas in a company.

4.0.0.2 Off-the-shelf tools (open-source and commercial)

Client instrumentation:
  - Google Analytics (tags)
  - Mixpanel
  - Amplitude
  - Segment (routing)

Server instrumentation:
  - Prometheus + Grafana (metrics)
  - Jaeger, Zipkin (tracing)
  - ELK Stack (logs)

Integration:
  - OpenTelemetry (CNCF project)
  - Datadog, New Relic (commercial)

The growth of this ecosystem is a product of modern observability culture.

5 Next posts in the Ch.13 series

Post	Topic	KOH lines
F13-1	Client-Side vs Server-Side Instrumentation	L:2540~2565
F13-2	Processing Logs from Multiple Sources + Culture	L:2566~2585

6 Code example: joint analysis of two-sided instrumentation

Join the client and server logs to reconstruct the full picture of user behavior.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

rng = np.random.default_rng(42)

# Simulated client log
n_events = 1000
client_log = pd.DataFrame({
    "event_id": [f"evt_{i:05d}" for i in range(n_events)],
    "client_timestamp": [datetime(2026, 5, 8, 10, 0) + timedelta(seconds=int(s))
                          for s in rng.uniform(0, 3600, n_events)],
    "user_id": rng.choice([f"user_{i:03d}" for i in range(100)], n_events),
    "action": rng.choice(["click", "hover", "scroll"], n_events, p=[0.5, 0.3, 0.2]),
    "element_id": rng.choice(["btn_buy", "btn_cart", "img_product", "link_help"], n_events),
})

# Simulated server log (covers only a subset of the client events)
server_event_count = int(n_events * 0.7)  # only 70% reach the server (lossy beacon)
server_indices = rng.choice(n_events, server_event_count, replace=False)
server_log = pd.DataFrame({
    "event_id": [f"evt_{i:05d}" for i in server_indices],
    "server_timestamp": [client_log.iloc[i]["client_timestamp"] +
                          timedelta(milliseconds=int(rng.normal(50, 20)))
                          for i in server_indices],
    "user_id": [client_log.iloc[i]["user_id"] for i in server_indices],
    "latency_ms": rng.lognormal(4, 0.5, server_event_count),
    "server_id": rng.choice([f"srv_{i:02d}" for i in range(10)], server_event_count),
    "cache_hit": rng.choice([True, False], server_event_count, p=[0.7, 0.3]),
})

# Left join keeps every client event; unmatched rows get NaN in the server columns
joined = client_log.merge(server_log, on=["event_id", "user_id"], how="left",
                           suffixes=("_c", "_s"))

print("=== Joint analysis ===")
print(f"Total client events: {len(client_log)}")
print(f"Matched in server log: {joined['server_timestamp'].notna().sum()}")
print(f"Loss rate: {(1 - joined['server_timestamp'].notna().mean())*100:.1f}%")

# Loss pattern by action: share of events with no server-side match
print("\n=== Loss rate by action ===")
loss_by_action = joined.groupby("action")["server_timestamp"].apply(
    lambda s: s.isna().mean() * 100
)
print(loss_by_action.round(1))

# Server-only information
print("\n=== Server-side information ===")
# .copy() so that adding a column below does not write into a slice of `joined`
matched = joined[joined["server_timestamp"].notna()].copy()
print(f"Average latency: {matched['latency_ms'].mean():.1f}ms")
print(f"Cache hit rate: {matched['cache_hit'].mean()*100:.1f}%")

# Client-server timestamp difference. This is meaningful here only because both
# timestamps are simulated from one clock; with real logs, clock skew makes this
# subtraction unreliable.
matched["timestamp_diff_ms"] = (
    matched["server_timestamp"] - matched["client_timestamp"]
).dt.total_seconds() * 1000
print(f"Client-Server timestamp diff (mean): {matched['timestamp_diff_ms'].mean():.1f}ms")
print(f"Client-Server timestamp diff (max): {matched['timestamp_diff_ms'].max():.1f}ms")

Expected output (seed 42). The match counts (700 of 1000, 30.0% loss) are fixed by construction; treat the remaining figures as indicative, since they may vary.

=== Joint analysis ===
Total client events: 1000
Matched in server log: 700
Loss rate: 30.0%

=== Loss rate by action ===
action
click     30.7
hover     27.4
scroll    32.7
Name: server_timestamp, dtype: float64

=== Server-side information ===
Average latency: 60.6ms
Cache hit rate: 70.6%
Client-Server timestamp diff (mean): 49.7ms
Client-Server timestamp diff (max): 130.4ms

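Intuition: the value of joint analysis

Four takeaways from this simulation.

1. What the 30% loss rate means

The 30% is a parameter of the simulation that stands in for web beacon lossiness: 30% of client events never reach the server.

Implication: in this setup a server-only analysis sees only 70% of user behavior, while joining client and server logs restores visibility of every recorded event.

2. Loss rate differs by action

The loss rates for click, hover, and scroll differ slightly. In this simulation events are dropped uniformly at random, so the differences are sampling noise. In a real system they can differ systematically: clicks are usually more critical (they trigger navigation), so a sync beacon may be used for them, which changes their loss rate.

These differences feed the analyst's mental model of which actions' data can be trusted more.

3. The unique value of server information

Server-only data:

Latency (60.6ms average)
Cache hit rate (70.6%)
Server ID (which server handled the request)

The client alone cannot see these dimensions. Visibility into system internals is the essence of server instrumentation.

4. What the timestamp difference means

Client-server timestamp diff: about 50ms on average, about 130ms at most.

Reading these differences:

50ms: normal network latency
130ms+: slow connections for some users
Negative (fine if absent): an anomaly where the client clock is ahead of the server

If negative differences appear, treat them as a sign that the client clock cannot be trusted and analyze on server time. In real logs this difference also contains clock skew, which is why the authors advise never subtracting client and server times.

6.0.0.1 Additional: a deeper use case for joint analysis

Q: "Does latency affect the user click rate?"

Client only: only click counts are visible
Server only: only the latency distribution is visible
Joined: latency mapped to the exact moment of each click

Analysis (hypothetical figures for illustration):
  Latency p99 > 200ms → click rate 5% ↓ (speed matters, Ch.5)
  Latency p99 > 500ms → click rate 15% ↓
  Real-time alerting becomes possible

This joint analysis is the standard way to measure the user impact of server-side changes. The server alone shows only latency, and the client alone shows only clicks. Combining the two views is what makes causal analysis possible, ideally confirmed with a controlled experiment.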
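
The latency-versus-click question can be sketched as a bucketed summary over a joined log. The data below is synthetic, and the click probability is deliberately constructed to fall with latency, so the resulting numbers illustrate the method only and say nothing about real users. Column names and bucket edges are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 20_000

# Synthetic joined log. The click probability is made to fall with latency
# on purpose, so the output only demonstrates the analysis steps.
latency_ms = rng.lognormal(mean=5.0, sigma=0.6, size=n)
click_prob = np.clip(0.30 - 0.0002 * latency_ms, 0.02, None)
joined = pd.DataFrame({
    "latency_ms": latency_ms,               # would come from the server log
    "clicked": rng.random(n) < click_prob,  # would come from the client log
})

# Bucket requests by latency and compare click rates across buckets.
bins = [0, 100, 200, 500, np.inf]
labels = ["<=100", "100-200", "200-500", ">500"]
joined["latency_bucket"] = pd.cut(joined["latency_ms"], bins=bins, labels=labels)

summary = (
    joined.groupby("latency_bucket", observed=True)["clicked"]
    .agg(["mean", "size"])
    .rename(columns={"mean": "click_rate", "size": "events"})
)
print(summary.round(3))
```

On observational logs a table like this shows association only, because slow requests may come from different users or pages than fast ones. A controlled slowdown experiment is the way to attribute the drop to latency itself.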

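7 Related Topics

Prerequisites

F4-3 - Infrastructure and tools - instrumentation as one of the 4 components
F8-* - Ch.8 Institutional memory - input to meta-analysis
F9-* - Ch.9 Ethics - privacy and instrumentation

Next posts

Related chapters

F5-* - Ch.5 Speed Matters - latency measurement
F6-* - Ch.6 Organizational metrics - metric definition
F12-* - Ch.12 Client-side - mobile instrumentation

Links to other categories

Engineering - Observability (Logs, Metrics, Traces) - the 3 pillars
Engineering - OpenTelemetry, Prometheus, Jaeger - standard tools
Data_Science - Data collection pipelines - ETL, ELT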