Kwangmin Kim - Multi-source Log Processing and the Culture of Instrumentation

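1 Definition

Definition: the two axes, multi-source log integration and a culture of instrumentation

Two operational dimensions set out in Kohavi et al. (2020), Ch. 13.2-13.3.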

1.0.0.1 Axis 1: Multi-source Log Processing

The procedure for combining logs produced by several systems into a single form that can be analyzed.

Source type	Examples	Main identifiers
Client logs	Browser JavaScript, mobile app	Cookie ID, device ID
Server logs	Web server, app server	Request ID, server timestamp
User state	Sign-in, opt-in, subscription	User ID
External	Email service, push provider	Varies by provider

1.0.0.2 Axis 2: Culture of Instrumentation

Three operating principles for making instrumentation part of company culture.

Principle	Core idea
Cultural Norm	“Nothing ships without instrumentation”
Testing Investment	Instrumentation tests built into the dev cycle
Raw Log Monitoring	Anomaly detection plus an immediate fix

Original quote (Ch. 13.3): “Imagine flying a plane with broken instruments in the panel. It is clearly unsafe, yet teams may claim that there is no user impact to having broken instrumentation. How can they know? Those teams do not have the information to know whether this is a correct assumption because, without proper instrumentation, they are flying blind.”

Key insight: if the two axes are not combined, the platform is crippled. Without integration there is nothing to analyze; without the culture there is no guarantee of quality. A mature company has both.

2 Concepts and Principles

2.1 Multi-source Log Processing: the mechanics of integration

The authors state (Ch. 13.2): “It is likely you will have multiple logs from different instrumentation streams.”

2.1.1 How log sources differ in kind

2.1.1.1 Five source dimensions

1. Format:
   - Client (browser): JSON emitted from JavaScript
   - Mobile app: Protocol Buffers (binary)
   - Server: structured text or JSON
   - DB: SQL audit format
   - External: differs by provider

2. Schema:
   - Client: timestamp, event_type, properties
   - Server: timestamp, request_id, latency
   - User state: user_id, state, transition_time
   - The same event can carry different fields in each source

3. Volume (rough orders of magnitude):
   - Client: every click and hover (hundreds per user per day)
   - Server: request count (tens per user per day)
   - User state: state changes (a few per user per week)
   - The gap between sources can exceed 1000x

4. Latency (time to reach the server):
   - Server: immediate
   - Client web: minutes to hours (beacon, batch)
   - Mobile app: hours to days (for example, Wi-Fi-only upload)
   - User state: immediate

5. Reliability (share of events that arrive; illustrative figures):
   - Server: 99.9%+
   - Client: 70-85%
   - Mobile: 80-90%

Differences along these five dimensions are what make the integration process hard.

2.1.2 Common Join Key: the core of integration

The authors stress: “First, there must be a way to join logs. The ideal case is to have a common identifier in all logs to serve as a join key.”

2.1.2.1 Types of join key

1. User-level join key:
   - Groups all events from the same user
   - Signed-in users: user_id
   - Anonymous users: cookie_id, device_id
   - Cross-platform: user_id (mapped after sign-in)

2. Session-level join key:
   - Groups the events within one session
   - Usually session_id (a UUID generated at session start)
   - Ends when the session ends (timeout, sign-out)

3. Event-level join key:
   - Matches the client and server logs of the same event
   - event_id (generated by the client, passed to the server)
   - Enables multi-layer analysis of a single event

4. Randomization unit join key:
   - The experiment's randomization unit (Ch. 14)
   - Usually user_id or device_id
   - The basis of experiment analysis

2.1.2.2 An event-level join key in practice

The authors state: “there can be a client-side event that indicates a user has seen a particular screen and a corresponding server-side event that explains why the user saw that particular screen and its elements.”

Scenario: a search results page

Client event:
  event_id: evt_abc123
  user_id: user_456
  type: search_result_view
  timestamp: 14:23:45.123
  result_count: 10

Server event (the backend side of that client event):
  event_id: evt_abc123  (same event_id)
  user_id: user_456
  type: search_query_processed
  timestamp: 14:23:45.456
  query: "machine learning"
  results: [...]  # details of the 10 results
  ranking_scores: [0.95, 0.87, 0.78, ...]
  cache_hit: false

After the join, the analysis can answer:
  - Which query the user issued (server)
  - Which results they were shown (server: ranking)
  - How they behaved (client: click, hover)

This event-level join is what makes causal analysis possible. Answering "why did the user behave that way" requires the server-side reasoning.
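The join described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's implementation: the record layout follows the scenario, while `join_on_event_id` and the second, unmatched client event (`evt_zzz999`) are hypothetical additions made for the example.

```python
# Hypothetical client and server events; evt_abc123 mirrors the scenario above.
client_events = [
    {"event_id": "evt_abc123", "user_id": "user_456",
     "type": "search_result_view", "result_count": 10},
    {"event_id": "evt_zzz999", "user_id": "user_456",
     "type": "search_result_view", "result_count": 3},  # no server counterpart
]
server_events = [
    {"event_id": "evt_abc123", "type": "search_query_processed",
     "query": "machine learning", "cache_hit": False},
]


def join_on_event_id(client, server):
    """Left-join client events to server events on the shared event_id."""
    server_by_id = {e["event_id"]: e for e in server}
    return [
        {"event_id": c["event_id"], "client": c,
         "server": server_by_id.get(c["event_id"])}
        for c in client
    ]


joined = join_on_event_id(client_events, server_events)
for row in joined:
    # The server side supplies the "why" (the query); it is None when unmatched.
    query = row["server"]["query"] if row["server"] else None
    print(row["event_id"], "->", query)
```

A left join keeps unmatched client events visible, so that missing server records show up as a data-quality signal rather than disappearing silently.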

2.1.2.3 How user-level and event-level joins complement each other

User-level:
  - All actions of one user
  - Long-term retention and lifetime value analysis
  - Cross-event patterns

Event-level:
  - Multi-source information about one event
  - Root cause of a single moment
  - System ↔ user interaction

Most analyses use both: the user-level join to segment, the event-level join for detail.

2.1.3 Standard Schema: making integration efficient

The authors state: “have some shared format to make downstream processing easier. This shared format can be common fields (e.g., timestamp, country, language, platform) and customized fields.”

2.1.3.1 The layered structure of a standard schema

Layer 1: Common fields (every event):
  - event_id (unique)
  - user_id or device_id
  - timestamp (server time as the reference)
  - source (client/server/external)
  - platform (web/iOS/Android/desktop)

Layer 2: Context fields (most events):
  - country, language
  - device type (mobile/desktop/tablet)
  - browser/OS (web/mobile)
  - app_version, build_number
  - session_id

Layer 3: Type-specific fields (per event type):
  - search event: query, ranking_scores
  - click event: element_id, x, y
  - performance event: latency, memory
  - error event: stack_trace, error_code
2.1.3.2 The dual purpose of common fields

The authors stress: “Common fields are often the basis for segments used for analysis and targeting.”

Common fields are used in two ways:

1. As the basis for joins:
   - Matching the same event across sources
   - Ordering by timestamp
   - Mapping user identity

2. As the basis for segments:
   - Country: country-specific analysis
   - Language: localization
   - Device: mobile vs desktop comparison
   - Platform: cross-platform interaction

This dual purpose is where the return on a standard schema comes from. Define the fields once and both analysis and targeting can use them.

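2.1.3.3 The challenge of schema evolution

Early schema (small scale):
  - 5-10 core fields
  - Every team knows the definitions

After growth (hundreds of teams, thousands of event types):
  - Field count: hundreds to thousands
  - Type-specific field definitions scattered across teams
  - Consistency is hard to maintain

Problems:
  - Inconsistent field naming (user_id vs userId vs uid)
  - Different fields with the same meaning
  - Inconsistent field semantics (timestamp in milliseconds vs seconds)

Remedies:
  - A schema registry (for example, Confluent Schema Registry holding Avro schemas)
  - Explicit field definitions with versions
  - A review process for schema changes

This schema governance is the platform investment that mature companies make.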
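The naming and unit problems listed above can be illustrated with a small normalization step at ingestion. This is a sketch under stated assumptions: the alias map `ALIASES` is hypothetical, and the rule "epoch values below 1e11 are seconds" is a heuristic chosen for the example, not a standard.

```python
# Hypothetical alias map from team-specific names to the canonical field name.
ALIASES = {"userId": "user_id", "uid": "user_id", "ts": "timestamp"}


def normalize(record):
    """Rename aliased fields and convert epoch-second timestamps to milliseconds."""
    out = {ALIASES.get(key, key): value for key, value in record.items()}
    ts = out.get("timestamp")
    # Heuristic (assumption): present-day epoch seconds are ~1.7e9 and epoch
    # milliseconds are ~1.7e12, so 1e11 separates the two units.
    if isinstance(ts, (int, float)) and ts < 1e11:
        out["timestamp"] = int(ts * 1000)
    return out


team_a = {"uid": "user_456", "ts": 1778234400}                   # seconds
team_b = {"userId": "user_456", "timestamp": 1778234400000}      # milliseconds

print(normalize(team_a))
print(normalize(team_b))
```

Normalization of this kind is a stopgap. A registry with versioned definitions prevents the divergence in the first place, which is why the remedies above focus on governance rather than cleanup.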

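Intuition: the silent damage of a missing join key

2.1.3.4 Scenario: analysis without a join key

Hypothetical analysis task: "the ranking distribution of the results users click after a search"

Client log:
  - User click event (which result)
  - Time of the search

Server log:
  - Search query
  - Ranking scores

Without a join key:
  - There is no way to know which click belongs to which search
  - Only time-based guessing is possible (a search and a click within about 1 minute)
  - Accuracy around 70% (illustrative)
  - The analysis result is unreliable

2.1.3.5 Scenario: analysis with a join key

If search_id is present in both the client and the server log:
  - Search-click matching is exact
  - Every pair whose records both arrived is matched correctly
  - The analyses built on it are reliable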

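2.1.3.6 The pitfall of time-based guessing

A user runs several searches within 1 minute:
  Search 1: "iPhone"
  Search 2: "iPhone case"
  Search 3: "iPhone case red"

Click events:
  - Click on an "iPhone case red" result
  - Time: 30 seconds after search 3

Time-based matching:
  - Matched to search 3 → correct
  - But this clean case is not the general one
  - If the user goes back to the results of search 1 and clicks there after search 3 has run, the click is wrongly attributed to search 3

Accuracy (illustrative):
  - Simple cases: 90%+
  - Complex cases: 60-70%

This accuracy gap is a gap in analysis quality, and it is why the join key matters more than anything else.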
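The pitfall can be reproduced with a toy example. The sketch below compares a time-window heuristic with a key-based match. The timestamps (seconds), the 60-second window, and the second click made from an older results tab are all assumptions invented for the illustration.

```python
searches = [
    {"search_id": "s1", "query": "iPhone", "t": 0},
    {"search_id": "s2", "query": "iPhone case", "t": 20},
    {"search_id": "s3", "query": "iPhone case red", "t": 40},
]
# search_id on the click is the join key; it also serves as ground truth here.
clicks = [
    {"click_id": "c1", "t": 70, "search_id": "s3"},
    {"click_id": "c2", "t": 75, "search_id": "s1"},  # clicked in an older results tab
]


def match_by_time(click, searches, window_s=60):
    """Guess the originating search: the latest search within the time window."""
    candidates = [s for s in searches if 0 <= click["t"] - s["t"] <= window_s]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["t"])["search_id"]


time_guess = {c["click_id"]: match_by_time(c, searches) for c in clicks}
truth = {c["click_id"]: c["search_id"] for c in clicks}
accuracy = sum(time_guess[k] == truth[k] for k in truth) / len(truth)

print("time-based guess:", time_guess)
print("key-based match: ", truth)
print("time-based accuracy:", accuracy)
```

The heuristic attributes both clicks to the most recent search, so the click that really belongs to search 1 is misattributed. With `search_id` logged on both sides there is nothing to guess.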

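2.1.3.7 Implications

When a company invests in its platform:

Defining the schema once: days to a few weeks of cost
Introducing the join key once: days to a few weeks of cost
Afterwards: the quality of every analysis goes up

The return is clear. Early-stage companies (Crawl, Walk), however, tend to start with ad-hoc analysis, and adding a join key later carries a very large backfill cost (a schema migration across years of data).

Therefore, design the join key and the schema from the early stage. At the Crawl stage this is the area of platform investment with the highest return.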

2.2 Culture of Instrumentation: what the airplane instrument panel stands for

2.2.1 Unpacking the analogy

The authors' quote in full: “Imagine flying a plane with broken instruments in the panel. It is clearly unsafe, yet teams may claim that there is no user impact to having broken instrumentation. How can they know? Those teams do not have the information to know whether this is a correct assumption because, without proper instrumentation, they are flying blind.”

2.2.1.1 Four roles of an airplane instrument panel

1. Current state (speed, altitude):
   - What state the plane is in right now
   - Whether it is within the normal range

2. Tracking change (climb rate, speed change):
   - How the state is changing
   - Trend analysis

3. Accident prevention (stall warning, terrain alert):
   - Detect danger → alert immediately
   - The pilot can act

4. After-the-fact analysis (black box):
   - Find the cause after an accident
   - Prevent future accidents

2.2.1.2 The parallel in service instrumentation

1. Current state:
   - Active users, request rate
   - Server latency, error rate
   - Whether operation is normal

2. Tracking change:
   - DAU trend
   - Latency trend
   - Anomaly detection

3. Accident prevention:
   - SRM detection (Ch. 21)
   - Crash rate spike alert
   - Data integrity issue alert

4. After-the-fact analysis:
   - Incident postmortem
   - Root cause analysis
   - Future prevention

2.2.1.3 The silent damage of “Flying Blind”

The authors stress: “yet teams may claim that there is no user impact to having broken instrumentation.”

The team's faulty reasoning:
  "The instrumentation is broken, but there are no user complaints → no problem"

In reality:
  - User complaints are themselves a metric (measured through instrumentation)
  - When the instrumentation is broken, the complaints cannot be detected either
  - "No user complaints" is therefore not sound proof

This self-referential trap is the fundamental challenge for an instrumentation culture.

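2.2.2 The difficulties: time lag and functional dissociation

The authors' analysis (Ch. 13.3).

2.2.2.1 How the time lag arises

Timeline:
  t=0:        Engineer writes the feature code and its instrumentation
  t=1 day:    Code review passes, deploy
  t=2 days:   First users use the feature
  t=7 days:   Sample size is sufficient
  t=10 days:  Analyst attempts the analysis
  t=11 days:  Instrumentation found to be buggy
  t=12 days:  Fix requested from the engineer
  t=15 days:  Engineer fixes it (while busy with other work)
  t=16 days:  Re-deploy
  t=17 days:  Data collection starts again

15 days lost: from first use (t=2 days) until collection restarts (t=17 days), the user data is unreliable.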

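2.2.2.2 How functional dissociation arises

The authors stress: “the engineer creating the feature is often not the one analyzing the logs to see how it performs.”

Separation of roles:
  Engineer A: writes the feature and its instrumentation
  Analyst B: analyzes the data
  PM C: makes the decision

Communication steps:
  A → B: writes the "instrumentation document" (often skipped)
  B → C: reports the "analysis result"
  C → A: feeds back the "feature decision"

When a bug is found:
  B finds it → reports to A
  A: "I'm on other work right now. I'll look in a few days."
  Result: the bug fix is delayed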

2.2.2.3 Solution 1: Cultural Norm

The authors state: “Establish a cultural norm: nothing ships without instrumentation. Include instrumentation as part of the specification. Ensure that broken instrumentation has the same priority as a broken feature. It is too risky to fly a plane if the gas gauge or altimeter is broken, even if it can still fly.”

2.2.2.4 Operating principles

1. Part of the spec:
   - An "instrumentation definition" section in the feature spec
   - Which events, which fields, which metrics
   - Instrumentation is designed when the spec is written

2. Same priority:
   - Feature bug = Severity X
   - Instrumentation bug = Severity X
   - Same SLA, same escalation

3. Ship blocking:
   - Instrumentation missing → ship is refused
   - Instrumentation buggy → ship is refused
   - The reviewer checks this in code review

2.2.2.5 Applied example

Example of the instrumentation section of a spec:

Feature: new search ranking algorithm

Required instrumentation:
1. Search query event:
   - query_text, query_id, user_id, timestamp
   - ranking_algorithm_version (new vs old)

2. Click event:
   - clicked_result_id, position
   - dwell_time (time-on-result)

3. Performance:
   - search_latency_ms
   - ranking_compute_time_ms

4. Errors:
   - ranking_failure (fallback used)

Reviewer checks:
  - Is the source data for every metric instrumented?
  - Can the events be joined?
  - Has the privacy review passed?
  → If all pass, the feature can ship
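The reviewer's "can the events be joined?" check lends itself to automation. The sketch below encodes the example spec as data and flags event types that do not carry a join key. Treating `query_id` as the intended join key is an assumption for the illustration, and `missing_join_key` is a hypothetical helper.

```python
# The example spec above, written as data (field lists copied from the spec).
spec = {
    "search_query": ["query_text", "query_id", "user_id", "timestamp",
                     "ranking_algorithm_version"],
    "click": ["clicked_result_id", "position", "dwell_time"],
    "performance": ["search_latency_ms", "ranking_compute_time_ms"],
    "errors": ["ranking_failure"],
}


def missing_join_key(spec, join_key="query_id"):
    """Return the event types whose field list lacks the join key."""
    return sorted(name for name, fields in spec.items() if join_key not in fields)


not_joinable = missing_join_key(spec)
if not_joinable:
    print("Block ship, join key missing on:", not_joinable)
else:
    print("All event types are joinable")
```

Under that assumption the check flags the click, performance, and error events. A reviewer would ask for `query_id` on each before approval, since a click that cannot be tied back to its search cannot be attributed to a ranking algorithm version.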

2.2.2.6 Solution 2: Testing Investment

The authors state: “Invest in testing instrumentation during development. Engineers creating features can add any necessary instrumentation and can see the resulting instrumentation in tests prior to submitting their code (and code reviewers check!).”

2.2.2.7 Stages of testing

1. Unit test (single event):
   - Which event fires on this user action?
   - Are the event's field values correct?
   - Verified with a mock

2. Integration test (event flow):
   - Do the client event and the server event match?
   - Is the join key consistent?
   - Does the schema agree?

3. End-to-end test (full pipeline):
   - Simulate an actual user flow
   - Are the logs from every source correct?
   - Do they pass through the aggregation pipeline?

4. Synthetic monitoring (continuous):
   - Automatic simulation in production every hour
   - Verify that a fired event reaches the server
   - Alert on anomaly
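Stage 1, the single-event unit test, might look like the sketch below. `FakeLogger` stands in for whatever logging client a codebase actually uses, and `on_result_click` is a hypothetical feature handler. Both names and the required-field set are assumptions made for the example.

```python
class FakeLogger:
    """In-memory stand-in for the real logging client (hypothetical)."""

    def __init__(self):
        self.events = []

    def log(self, event):
        self.events.append(event)


def on_result_click(logger, user_id, query_id, result_id, position):
    """Hypothetical feature handler; it must emit exactly one click event."""
    logger.log({
        "event_type": "click",
        "user_id": user_id,
        "query_id": query_id,
        "clicked_result_id": result_id,
        "position": position,
    })


REQUIRED_CLICK_FIELDS = {"event_type", "user_id", "query_id",
                         "clicked_result_id", "position"}


def test_click_emits_one_complete_event():
    logger = FakeLogger()
    on_result_click(logger, "user_456", "q_001", "res_7", position=3)
    # One user action should produce exactly one event.
    assert len(logger.events) == 1
    event = logger.events[0]
    # Every required field is present and carries the value that was passed in.
    assert REQUIRED_CLICK_FIELDS.issubset(event)
    assert event["position"] == 3
    assert event["user_id"] == "user_456"


test_click_emits_one_complete_event()
print("instrumentation unit test passed")
```

Because the test runs before the code is submitted, a dropped field is caught at review time rather than days later by the analyst.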

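2.2.2.8 The engineer's learning curve

How an engineer's view evolves:

Stage 1: instrumentation seen as a separate task
  - "Extra work after the code is written"
  - Often skipped, often wrong

Stage 2: instrumentation seen as part of the spec
  - Instrumented while the code is being written
  - Verified by tests
  - 80%+ of instrumentation correct (illustrative)

Stage 3: instrumentation as first-class
  - Design starts from "how will this metric be measured?"
  - Top-down design
  - 95%+ of instrumentation correct (illustrative)

This progression amounts to an instrumentation maturity model, and it is part of the company's platform.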

2.2.2.9 Solution 3: Raw Log Monitoring

The authors state: “Monitor the raw logs for quality. This includes things such as the number of events by key dimensions or invariants that should be true (i.e., timestamps fall within a particular range). Ensure that there are tools to detect outliers on key observations and metrics.”

2.2.2.10 Monitoring dimensions

1. Volume monitoring:
   - Event count per hour
   - Rate of change versus yesterday
   - Sudden drop = anomaly

2. Schema monitoring:
   - New fields appearing
   - Shifts in the distribution of field values
   - Type mismatches

3. Invariant monitoring:
   - Is the timestamp within the expected range?
   - Is user_id in a valid format?
   - Are all required fields filled in?

4. Cross-source monitoring:
   - Client event count vs server event count
   - Matching rate (share of events that join on the key)
   - Lag distribution
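The invariant checks in item 3 can be sketched as a pure function applied to each raw event. The `user_\d+` ID format, the 7-day maximum age, and the 5-minute clock-skew allowance are assumptions chosen for the illustration, and `check_invariants` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

USER_ID_RE = re.compile(r"^user_\d+$")       # hypothetical ID format
REQUIRED = ("event_id", "user_id", "timestamp")


def check_invariants(event, now,
                     max_age=timedelta(days=7),
                     max_skew=timedelta(minutes=5)):
    """Return a list of invariant violations for one raw event (empty if clean)."""
    problems = []
    for name in REQUIRED:
        if event.get(name) in (None, ""):
            problems.append(f"missing:{name}")
    uid = event.get("user_id")
    if uid and not USER_ID_RE.match(uid):
        problems.append("bad_user_id")
    ts = event.get("timestamp")
    # The timestamp must be neither too old nor implausibly far in the future.
    if ts is not None and not (now - max_age <= ts <= now + max_skew):
        problems.append("timestamp_out_of_range")
    return problems


now = datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
good = {"event_id": "e1", "user_id": "user_001", "timestamp": now - timedelta(minutes=1)}
no_user = {"event_id": "e2", "user_id": "", "timestamp": now + timedelta(days=1)}
odd_id = {"event_id": "e3", "user_id": "abc", "timestamp": now}

for ev in (good, no_user, odd_id):
    print(ev["event_id"], check_invariants(ev, now))
```

Counting violations per hour and per event type turns these per-event checks into monitoring series that can be alerted on.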

2.2.2.11 Anomaly response

The authors stress: “When a problem is detected in instrumentation, developers across your organization should fix it right away.”

An SLA for anomaly response:
  - Severity 1 (data loss):
    - Fix within 1 hour of detection
    - Active page-out
  - Severity 2 (data quality):
    - Fix within 24 hours
    - Standard ticket
  - Severity 3 (informational):
    - Fix within 1 week
    - Backlog

This SLA is how the cultural norm is enforced. It is what “Instrumentation = production code” means in practice.
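The SLA table translates directly into a deadline calculation that a ticketing or paging integration could call. The function names below are hypothetical; only the three time limits come from the list above.

```python
from datetime import datetime, timedelta

# Time-to-fix per severity, taken from the SLA list above.
FIX_WITHIN = {
    1: timedelta(hours=1),    # data loss
    2: timedelta(hours=24),   # data quality
    3: timedelta(weeks=1),    # informational
}


def fix_deadline(detected_at, severity):
    """Latest acceptable fix time for an anomaly detected at `detected_at`."""
    return detected_at + FIX_WITHIN[severity]


def is_breached(detected_at, severity, now):
    """True once the SLA window has passed without a fix."""
    return now > fix_deadline(detected_at, severity)


detected = datetime(2026, 5, 8, 10, 0)
print(fix_deadline(detected, 1))
print(is_breached(detected, 2, now=datetime(2026, 5, 9, 11, 0)))
```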

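Assumption: a scenario where the instrumentation culture is absent

Assumption: “Instrumentation is the data team's responsibility; engineers only build features”

2.2.2.12 Scenario 1: the silent damage of buggy instrumentation

Day 1: Engineer A ships a new feature.
       No instrumentation spec; ad-hoc event logging is added.
       Event: "feature_used" (has a timestamp but no user_id)

Day 2-7: Users use the feature. Data accumulates.

Day 8: Analyst B attempts the analysis.
       Finds that the "feature_used" event has no user_id.
       User segmentation analysis is impossible.

Day 9: Reported to Engineer A.
       "I'm on other work right now. I'll fix it next week."

Day 14: Engineer A ships the fix.
        Everything logged before the fix (Day 1 to Day 14) still has no user_id.

Data loss: about two weeks of data that cannot be segmented by user.
Experiment results: unreliable for the same period.
The loss is hard to recover (changing the schema of historical data is expensive).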

2.2.2.13 Scenario 2: the self-referential trap

PM C: "I want to see the user impact of this feature."
Analyst B: "No user complaints, so it looks OK."

In reality:
  The complaint instrumentation is broken
  → Complaints themselves are not measured
  → "No complaints" is not sound proof

Decision:
  C decides to launch
  Many user complaints actually occur
  But they arrive only through informal channels such as Slack and email
  Aggregate analysis is impossible

2.2.2.14 Solution: combining the three principles

Cultural norm:
  - The reviewer catches the problem at Engineer A's spec review
  - "No user_id, no ship"

Testing:
  - A local test detects the missing user_id
  - Automatic lint in code review
  - Schema check in CI/CD

Raw log monitoring:
  - Automatic schema validation every day
  - Alert on "new field" or "missing required field"
  - Fix within 24 hours

Combining the three principles is what platform maturity looks like. One principle alone is not enough; all three have to be applied together.

2.2.3 Cultural Maturity Model

A maturity ladder drawn from general industry practice (background knowledge, not from the book).

Level 0: No instrumentation:
  - Page view counter only
  - Analysis is almost impossible
  - The Web 1.0 era

Level 1: Ad-hoc instrumentation:
  - Some events are tracked
  - Definitions differ from engineer to engineer
  - Integrated analysis is difficult

Level 2: Standardized instrumentation:
  - Instrumentation included in the spec
  - Standard schema
  - Consistent join key

Level 3: Quality assurance:
  - Testing integrated
  - Raw log monitoring
  - Anomaly alerts

Level 4: Continuous evolution:
  - Automatic schema validation
  - ML-driven anomaly detection
  - A/B testing of the instrumentation itself

Level 5: Self-healing:
  - Automatic fix of common issues
  - Schema drift detection with automatic migration
  - Zero-touch operation

Most companies sit at Level 1-2. Mature companies (Google, Microsoft, Netflix) are at Level 3-4. Almost no company has reached self-healing.

This maturity ladder serves as the long-term roadmap for platform investment.

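3 Why It Is Needed

When multi-source log integration and the culture are absent:

No analysis: logs are scattered by source and cannot be joined.
Schema mismatch: the same event has different field names, so integration is costly.
Accumulating bugs: buggy instrumentation goes undetected and the data is unreliable.
Self-referential trap: "no complaints" is a by-product of broken instrumentation, so incidents go unnoticed.
Experiment quality crisis: the trustworthiness of every A/B result is weakened.

When both are in place:

Integrated analysis: the join key combines multiple sources, which makes causal analysis possible.
Standard schema: field definitions are consistent and analysis cost goes down.
Quality assurance: bugs are fixed immediately and the data is reliable.
Cultural alignment: engineers, analysts, and PMs all treat instrumentation as first-class.
Experiment trust: A/B results can be trusted.

This gap sets the ceiling on platform maturity. Instrumentation culture is one of the largest parts of company culture.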

4 Applied Cases: instrumentation platforms by company

4.0.0.1 Microsoft: Cosmos + ExP

Cosmos: Microsoft's internal data lake
  - Collects all client and server logs
  - Petabyte scale
  - Standard schema (Common Schema)

ExP: experimentation platform
  - Analysis layer on top of Cosmos
  - Automatic user-level join
  - Automatic triggered analysis

Cultural norm:
  - "Nothing ships without ExP integration"
  - ExP integration verified in code review
  - Instrumentation section in the spec template

4.0.0.2 LinkedIn: Pinot + Concourse

Pinot: real-time OLAP
  - Instant queries over all user events

Concourse: experiment dashboard

Cultural norm:
  - Weekly instrumentation quality report
  - A champion monitors each team's instrumentation quality
  - Part of auto-ramp (instrumentation quality is a ramp condition)

4.0.0.3 Netflix: Keystone Pipeline

Keystone: a pipeline handling about 1 trillion events/day
  - Mantis (real-time streaming)
  - Standard protobuf schema
  - Self-service event registration

Cultural norm:
  - Every service uses the standard library
  - Explicit review of schema changes
  - Automatic anomaly detection

All of these companies are at Level 3+ maturity. The value of the platform comes from accumulated investment.

5 Code Example: Multi-source Log Join + Quality Check

Combine logs from different sources on a join key, then run quality monitoring on the result.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

rng = np.random.default_rng(42)

# Synthetic client log (browser)
n_browser_events = 500
browser_log = pd.DataFrame({
    "event_id": [f"evt_{i:05d}" for i in range(n_browser_events)],
    "user_id": rng.choice([f"user_{i:03d}" for i in range(50)], n_browser_events),
    "session_id": rng.choice([f"sess_{i:03d}" for i in range(80)], n_browser_events),
    "timestamp": [datetime(2026, 5, 8, 10, 0) + timedelta(seconds=int(s))
                  for s in rng.uniform(0, 3600, n_browser_events)],
    "event_type": rng.choice(["click", "view", "scroll"], n_browser_events),
    "platform": "web",
    "country": rng.choice(["US", "KR", "JP", "DE"], n_browser_events),
})

# Synthetic mobile log (event IDs start at 1000 so they never collide with browser IDs)
n_mobile_events = 300
mobile_log = pd.DataFrame({
    "event_id": [f"evt_{i+1000:05d}" for i in range(n_mobile_events)],
    "user_id": rng.choice([f"user_{i:03d}" for i in range(50)], n_mobile_events),
    "session_id": rng.choice([f"sess_{i+100:03d}" for i in range(40)], n_mobile_events),
    "timestamp": [datetime(2026, 5, 8, 10, 0) + timedelta(seconds=int(s))
                  for s in rng.uniform(0, 3600, n_mobile_events)],
    "event_type": rng.choice(["tap", "view", "swipe"], n_mobile_events),
    "platform": "ios",
    "country": rng.choice(["US", "KR", "JP", "DE"], n_mobile_events),
})

# Synthetic server log. Event IDs are sampled from 2,000 candidates, of which
# only 800 exist on the client side, so each client event has a 30% chance
# (600 of 2,000) of having a server record.
n_server_events = 600
server_log = pd.DataFrame({
    "event_id": [f"evt_{i:05d}" for i in
                  rng.choice(range(2000), n_server_events, replace=False)],
    "user_id": rng.choice([f"user_{i:03d}" for i in range(50)], n_server_events),
    "timestamp": [datetime(2026, 5, 8, 10, 0) + timedelta(seconds=int(s))
                  for s in rng.uniform(0, 3600, n_server_events)],
    "request_type": rng.choice(["api_call", "page_render"], n_server_events),
    "latency_ms": rng.lognormal(4, 0.5, n_server_events),
})

# Synthetic user state log
user_state_log = pd.DataFrame({
    "user_id": [f"user_{i:03d}" for i in range(50)],
    "subscription_tier": rng.choice(["free", "premium", "enterprise"], 50,
                                    p=[0.7, 0.25, 0.05]),
    "country": rng.choice(["US", "KR", "JP", "DE"], 50),
})

# === 1. Combined analysis ===
# Client + server join on event_id. The synthetic server user_id is drawn
# independently of the client one, so it is dropped instead of being used as a
# second key (joining on both would match almost nothing). In a real pipeline
# the two IDs should agree and can be cross-checked.
client_combined = pd.concat([browser_log, mobile_log], ignore_index=True)
client_server_joined = client_combined.merge(
    server_log.drop(columns="user_id"),
    on="event_id",
    how="left",
    suffixes=("_client", "_server"),
)

print("=== Multi-source Join Result ===")
print(f"Browser events: {len(browser_log)}")
print(f"Mobile events: {len(mobile_log)}")
print(f"Server events: {len(server_log)}")
print(f"Client + Server matched: {client_server_joined['timestamp_server'].notna().sum()}")

# User state join (the user-state country arrives as country_us)
fully_joined = client_server_joined.merge(
    user_state_log,
    on="user_id",
    how="left",
    suffixes=("", "_us"),
)

print("\n=== Analysis by Tier ===")
tier_analysis = fully_joined.groupby("subscription_tier").agg({
    "event_id": "count",     # all client events in the tier
    "latency_ms": "mean",    # mean over events that matched a server record
})
print(tier_analysis.round(2))

# === 2. Quality Monitoring ===
print("\n=== Quality Monitoring ===")

# Volume by source
volume = client_combined["platform"].value_counts()
print("\nVolume by platform:")
print(volume)

# Volume per 5-minute bucket (anomaly detection). All synthetic timestamps fall
# within a single hour, so hourly buckets would leave one data point and an
# undefined standard deviation.
client_combined["bucket"] = client_combined["timestamp"].dt.floor("5min")
volume_by_bucket = client_combined.groupby("bucket").size()
mean_volume = volume_by_bucket.mean()
std_volume = volume_by_bucket.std()
print(f"\n5-minute volume - mean: {mean_volume:.1f}, std: {std_volume:.1f}")
anomalies = volume_by_bucket[(volume_by_bucket - mean_volume).abs() > 2 * std_volume]
if len(anomalies) > 0:
    print(f"*** Anomaly buckets: {[str(b) for b in anomalies.index]} ***")
else:
    print("No volume anomalies detected.")

# Schema validation: required fields must not be null
print("\n=== Schema Validation ===")
required_fields = ["event_id", "user_id", "timestamp", "event_type", "platform"]
for field in required_fields:
    null_count = client_combined[field].isnull().sum()
    if null_count > 0:
        print(f"*** {field}: {null_count} null values ***")
    else:
        print(f"{field}: OK")

# Cross-source matching rate
matching_rate = (client_server_joined["timestamp_server"].notna().sum() /
                 len(client_server_joined) * 100)
print(f"\nClient-Server matching rate: {matching_rate:.1f}%")
if matching_rate < 50:
    print("*** WARNING: Low matching rate (< 50%). Investigate join key. ***")

# Country distribution cross-validation: client-reported country vs the
# country recorded in the user state log (country_us after the join)
print("\nCountry distribution (client vs user state):")
client_country = client_combined["country"].value_counts(normalize=True).round(3)
state_country = fully_joined["country_us"].value_counts(normalize=True).round(3)
print(f"Client: {client_country.to_dict()}")
print(f"User state: {state_country.to_dict()}")

Illustrative output (seed 42). The figures are examples and will vary with the NumPy and pandas versions; by construction the matching rate lands near 30%. Values shown as ... are omitted.

=== Multi-source Join Result ===
Browser events: 500
Mobile events: 300
Server events: 600
Client + Server matched: 248

=== Analysis by Tier ===
                    event_id  latency_ms
subscription_tier
enterprise                40       62.96
free                     587       58.72
premium                  173       55.91

=== Quality Monitoring ===

Volume by platform:
platform
web    500
ios    300
Name: count, dtype: int64

5-minute volume - mean: 66.7, std: ...
...

=== Schema Validation ===
event_id: OK
user_id: OK
timestamp: OK
event_type: OK
platform: OK

Client-Server matching rate: 31.0%
*** WARNING: Low matching rate (< 50%). Investigate join key. ***

Country distribution (client vs user state):
Client: {'US': 0.279, 'JP': 0.255, 'DE': 0.238, 'KR': 0.229}
User state: {...}

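Intuition: five messages from multi-source quality monitoring

Five things this simulation shows.

1. The diagnostic power of the cross-source matching rate

A matching rate of about 31%. It is below 50%, so the alert fires.

Possible causes:

A mismatch in the join key (event_id)
Different IDs in use (event_id vs request_id)
Sampling differences (client and server sample differently)
The server ignoring some events (filtering)

Verify each possibility individually to reach the root cause.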
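A first diagnostic step is to break the matching rate down by a dimension such as platform and to list server records that have no client counterpart. The sketch below does both on toy data. `match_report`, the events, and the stray `req_9` ID are hypothetical and exist only to illustrate the technique.

```python
from collections import defaultdict


def match_report(client_events, server_event_ids):
    """Per-platform match rate, plus server IDs that no client event carries."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    client_ids = set()
    for event in client_events:
        client_ids.add(event["event_id"])
        totals[event["platform"]] += 1
        if event["event_id"] in server_event_ids:
            matched[event["platform"]] += 1
    rate_by_platform = {p: matched[p] / totals[p] for p in totals}
    server_only = sorted(set(server_event_ids) - client_ids)
    return rate_by_platform, server_only


client_events = [
    {"event_id": "evt_1", "platform": "web"},
    {"event_id": "evt_2", "platform": "web"},
    {"event_id": "evt_3", "platform": "ios"},
    {"event_id": "evt_4", "platform": "ios"},
]
server_event_ids = {"evt_1", "evt_2", "evt_3", "req_9"}

rates, server_only = match_report(client_events, server_event_ids)
print("match rate by platform:", rates)
print("server-only ids:", server_only)
```

A rate that is low on one platform only points toward that client's logging or upload path. Server-only IDs with a different prefix, like `req_9` here, suggest that a different identifier is being written into the join column.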

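2. Automating schema validation

The null count of every required field is checked automatically, so schema bugs are detected right away.

Extension: the type, range, and format of each field can be validated as well.

3. Volume anomaly detection over time buckets

Outliers are detected with a standard-deviation rule over time-bucketed event counts (for example, hourly). This catches a sudden drop (system down) or a spike (bot attack).

In real operation: ML-driven detection that accounts for seasonality and day of week.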
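A simple step toward seasonality-aware detection, short of an ML model, is to compare the current bucket with the same hour on the same weekday in previous weeks. The sketch below is such a seasonal-naive baseline rather than the ML approach mentioned above. The history values and the 3-sigma threshold are assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it deviates from the same-slot history by > threshold sigmas.

    `history` holds counts for the same hour and weekday in previous weeks
    (at least two values are needed to estimate the spread).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical event counts for Fridays 10:00-11:00 over the last five weeks.
same_slot_history = [1000, 1040, 980, 1010, 990]

print(is_anomalous(same_slot_history, 1020))  # within normal weekly variation
print(is_anomalous(same_slot_history, 400))   # sudden drop
```

Because the baseline is built per hour-of-week slot, a quiet Sunday night is not compared with a busy weekday morning, which removes the most common source of false alarms in a global mean-and-std rule.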

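4. Cross-tier analysis

Differences in latency and event count between free and premium users. The subscription tier comes in through the join with the user state log, so this analysis is impossible from a single source.

5. Cross-validating the country distribution

Compare the country distribution reported by the client with the one recorded in a second source (a server or user state log). A large difference may mean:

Inaccurate geolocation
Missing logs for some countries in one of the sources
VPN users

This cross-validation is itself part of quality monitoring.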

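5.0.0.1 Summary: the value of quality monitoring

An automated dashboard checks all of the metrics above every hour. When an anomaly alert fires, it is fixed immediately.

This is how an instrumentation culture is enforced: trust by verification.

Return on the investment (illustrative orders of magnitude):

Mean time to detect a quality bug: days → minutes
Mean time to fix a bug: weeks → hours
Data reliability: 70% → 95%+

This return is central to the mature stage of a platform.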

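6 Wrapping Up the Ch. 13 Series

Three parts complete:

F13-0: Ch. 13 overview, why care, the two-sided nature of instrumentation
F13-1: How the client and server views differ, web beacon lossiness, server internal value
F13-2: Multi-source integration, join key, schema, and the three principles of the culture

Next: Ch. 14 (Randomization Unit, 2 parts), on choosing the unit of random assignment and what it implies for analysis.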

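7 Related Topics

Prerequisites

Next chapter

F14-*: Ch. 14 Randomization Unit, user-level random assignment

Related chapters

F4-3: Infrastructure and tools, the foundation of the platform
F8-*: Ch. 8 Institutional memory, data assets for meta-analysis
F9-*: Ch. 9 Ethics, privacy and instrumentation

Links to other categories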