Kwangmin Kim - Providing Choice and Data Collection: Switching Cost

1 Definition

Definition: Choice and Data Collection as areas of ethical review

The Respect for Persons principle of the Belmont Report is put into practice along two dimensions (Kohavi, Tang, Xu, 2020, Ch.9.4-9.5).

Choice: does the user have an alternative, that is, a way to decline taking part in the experiment?
Data Collection: what is collected, why, how much, and for how long?

Choice spectrum.

Zero switching cost  ────────────────────  High switching cost
Search engine → SNS → Mobile OS → Enterprise SaaS → Medical RCT
   (many alternatives)        (almost no alternatives)

Quoted from the source (Ch.9.4): “In medical clinical trials testing new drugs for cancer, the main choice most participants face is death, making it allowable for the risk to be quite high, given informed consent.”

Key insight: the higher the switching cost, the higher the ethical review bar. In a cancer RCT the only alternative is death, so the ceiling on acceptable risk is the highest. At the other end, a user who dislikes a search engine can leave immediately.

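2 Concepts and Principles

2.1 Provide Choices: Switching Cost Spectrum

Points along the spectrum the authors describe, from one end to the other.

2.1.1 Zero Switching Cost: Search Engines

Characteristics:

Many alternatives (Bing, Google, DuckDuckGo)
Switching is immediate (type one URL)
Cost: about 0 to 1 second of time

Ethical implication: a user who dislikes an experiment can leave immediately, so the review bar can be lower. User departures also show up as a metric loss for the company, which acts as a built-in quality control.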

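2.1.3 High Switching Cost: Enterprise SaaS, Banks

Characteristics:

Few alternatives (Salesforce, AWS, a primary bank)
Switching cost: data migration, retraining, business interruption
Cost: months to years

Ethical implication: users are effectively locked in. The company holds most of the power, so its ethical responsibility is very large. Imposing even a temporary risk is a problem.

2.1.4 Highest: Medical RCT (cancer trial)

Characteristics:

Alternative: standard care or death
Switching: a participant can withdraw from the trial and return to standard care
Cost: the highest (life)

Ethical implication: the ceiling on acceptable risk is the highest, because the alternative is death. That ceiling applies only with individual informed consent and mandatory IRB review, which exist because of the power asymmetry.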

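Intuition: switching cost and the ethics bar

Intuition 1: “High switching cost means users have little power and the company has a lot, so stronger ethical review is needed.”

Intuition 2: “High switching cost means there is no good alternative; if the alternative is as bad as death, even a high-risk treatment can be justified (given informed consent).”

The two intuitions look contradictory, but both hold because they concern different things: the first is about how strict the review process must be, the second about how much risk an approved study may carry. In a medical RCT:

Users have little power (alternative = death)
So IRB review and individual consent are mandatory (stricter ethics process)
Once those conditions are met, a high-risk treatment can be justified (the alternative is worse)

Online SaaS is similar:

Users have little power (lock-in)
Ethical review is needed
The ceiling on acceptable risk is lower than in medicine (the alternative is not death)

This spectrum is the calibration input for ethical review: assess the switching cost of your own platform, then set the review bar accordingly.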

2.1.5 Extension: Offline Impact of Online Services

The authors emphasize (Ch.9.4): “As online services start impacting offline experiences, such as with shipping physical packages, ride sharing, and so on, the risk and consequentiality can increase.”

Service	Online-only	Offline impact
Search engine	Information	Almost none
SNS	Content, relationships	Some (offline meetups)
E-commerce	Purchase decisions	Delivery (a delay affects daily life)
Ride sharing	Matching	Driver safety, rider safety
Telemedicine	Diagnosis	Treatment decisions, life

More offline impact → higher risk profile → higher ethical review bar.
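
The calibration above can be sketched as a small lookup. This is a minimal illustration, not the authors' procedure: the tier names, the `SwitchingCost` and `OfflineImpact` levels, and the thresholds inside `review_bar` are all assumptions chosen for the example.

```python
from enum import IntEnum


class SwitchingCost(IntEnum):
    ZERO = 0    # e.g. search engine
    LOW = 1     # e.g. social network
    MEDIUM = 2  # e.g. mobile OS
    HIGH = 3    # e.g. enterprise SaaS, primary bank


class OfflineImpact(IntEnum):
    NONE = 0    # information only
    SOME = 1    # e.g. delivery delays
    SAFETY = 2  # e.g. rider or patient safety


# Hypothetical review tiers, ordered from lightest to strictest.
REVIEW_TIERS = ["self-checklist", "lightweight review", "full ethics review"]


def review_bar(cost: SwitchingCost, impact: OfflineImpact) -> str:
    """Pick a review tier; either axis alone can raise the bar."""
    if cost >= SwitchingCost.HIGH or impact >= OfflineImpact.SAFETY:
        return REVIEW_TIERS[2]
    if cost >= SwitchingCost.LOW or impact >= OfflineImpact.SOME:
        return REVIEW_TIERS[1]
    return REVIEW_TIERS[0]


print(review_bar(SwitchingCost.ZERO, OfflineImpact.NONE))
print(review_bar(SwitchingCost.ZERO, OfflineImpact.SAFETY))
print(review_bar(SwitchingCost.HIGH, OfflineImpact.NONE))
```

Taking the stricter of the two axes reflects the point that a zero-switching-cost service can still need a full review once it affects physical safety.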

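2.2 Data Collection: Checking Six Areas

Six questions the authors list (Ch.9.5).

2.2.1 Area 1: What Is Collected

Core ideas of Privacy by Design (Cavoukian):
  - Default: privacy protection is the default (opt-in, not opt-out)
  - End-to-end: from collection through deletion
  - Visibility: users can find out what is collected

Questions:

Are the collected items listed explicitly (location, browsing, device ID, and so on)?
Can users learn about them from the ToS and privacy policy?
“Data minimization”: is only what is actually needed collected?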

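2.2.2 Area 2: Data Sensitivity

Sensitivity classification, following the authors.

Level	Examples	Protection level
Low	Device OS, general browsing	Standard protection
Medium	Location, search history	Enhanced protection (encryption, access log)
High	Financial (accounts, payments)	Additional audit, regulatory compliance
Critical	Health (HIPAA), identity	Individual consent + legal framework

The authors ask: “Could the data be used to discriminate against users in ways that infringe human rights?” The potential for discrimination is part of sensitivity.

Assumption: what happens when sensitivity is misclassified

Broken assumption: location is misclassified as “low risk”.

Consequences:

Sensitive location data (visits to home, workplace, religious or medical facilities) receives inadequate protection
Stalking, discrimination, and blackmail become possible
Possible violations of regulations such as GDPR and HIPAA

Fix: treat location as medium or higher. Locations tied to religion, health, or politics are sensitive (parallel to GDPR's special category data).

The classification is not binary (sensitive vs. not) but context-dependent. The same location signal is medium in an ordinary context and critical when it reveals a visit to a gay bar.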
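
A context-dependent classifier along these lines could look like the sketch below. The baseline levels follow the table above; the data type keys, the context tag names in `SPECIAL_CONTEXTS`, and the fail-closed default for unknown types are assumptions made for illustration.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Baseline level per data type, following the table above.
BASELINE = {
    "device_os": Sensitivity.LOW,
    "browsing": Sensitivity.LOW,
    "location": Sensitivity.MEDIUM,
    "search_history": Sensitivity.MEDIUM,
    "financial": Sensitivity.HIGH,
    "health": Sensitivity.CRITICAL,
}

# Hypothetical context tags that escalate any record to CRITICAL.
SPECIAL_CONTEXTS = {"religious_site", "medical_facility", "political_event"}


def classify(data_type: str, context_tags: set) -> Sensitivity:
    """Return the baseline level, escalated when the context is sensitive."""
    # Unknown data types default to HIGH rather than LOW (fail closed).
    level = BASELINE.get(data_type, Sensitivity.HIGH)
    if context_tags & SPECIAL_CONTEXTS:
        level = Sensitivity.CRITICAL
    return level


print(classify("location", set()).name)
print(classify("location", {"medical_facility"}).name)
```

The same `location` record comes out as MEDIUM with no context and CRITICAL once a special-context tag is attached, which is the context dependence described above.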

2.2.3 Area 3: Personally Identifiable

PII (Personally Identifiable Information) categories:

Identified: name, resident registration number, phone number (direct identification)
Pseudonymous: cookie ID, device ID (linkable, but not directly identifying)
Anonymous: cannot be identified
Anonymized: processed to lower re-identification risk (safe harbor, k-anonymity, differential privacy)

For details, see the User Identifiers sidebar in F9-3.
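
As one concrete instance of the anonymization techniques listed above, a k-anonymity check can be sketched with pandas: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. The column names (`zip3`, `age_band`, `query_count`) and the sample rows are made up for the example.

```python
import pandas as pd


def min_group_size(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Smallest number of rows sharing one combination of quasi-identifiers."""
    return int(df.groupby(quasi_identifiers).size().min())


def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    return min_group_size(df, quasi_identifiers) >= k


# Hypothetical released table: coarse ZIP prefix and age band per row.
records = pd.DataFrame({
    "zip3": ["941", "941", "941", "100", "100"],
    "age_band": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "query_count": [12, 7, 3, 22, 5],
})

print(is_k_anonymous(records, ["zip3", "age_band"], k=2))  # True
print(is_k_anonymous(records, ["zip3", "age_band"], k=3))  # False
```

The smaller group has only two rows, so the table passes for k=2 and fails for k=3. A check like this measures only group sizes; it says nothing about whether the sensitive values inside a group are diverse.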

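2.2.4 Area 4: Purpose of Collection

Questions:

For what purpose is the data collected?
Which use cases may it be used for?
Who (which teams, which companies) may access it?

Principle: purpose limitation. Data must not be used beyond the purpose it was collected for. This is a core GDPR principle.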
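
Purpose limitation can be enforced mechanically if each dataset carries its allowed purposes and teams. The sketch below assumes a hypothetical `Dataset` record and `may_access` check; the dataset, team, and purpose names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    allowed_purposes: frozenset
    allowed_teams: frozenset


def may_access(ds: Dataset, team: str, purpose: str) -> bool:
    """Allow access only when both the team and the stated purpose are registered."""
    return team in ds.allowed_teams and purpose in ds.allowed_purposes


# Hypothetical registration made at collection time.
search_logs = Dataset(
    name="search_logs",
    allowed_purposes=frozenset({"ranking_evaluation"}),
    allowed_teams=frozenset({"ranking"}),
)

print(may_access(search_logs, "ranking", "ranking_evaluation"))  # True
print(may_access(search_logs, "ranking", "ad_targeting"))        # False
print(may_access(search_logs, "ads", "ranking_evaluation"))      # False
```

Requiring both conditions means the right team with the wrong purpose is still refused, which is the case purpose limitation is meant to catch.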

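2.2.5 Area 5: Necessity and Deletion

Questions:

“Collect only what is needed”: only items directly required for the purpose
“Aggregate or delete as soon as possible”: minimize how long individual records are retained

The authors ask: “How soon can the data be aggregated or deleted to protect individual users?”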
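
A retention job that aggregates and then drops individual records might be sketched as follows. The function name `apply_retention`, the `user_id` and `ts` columns, and the 90-day window used in the usage example are assumptions for illustration.

```python
import pandas as pd


def apply_retention(raw: pd.DataFrame, today: pd.Timestamp, raw_days: int):
    """Return (rows still inside the raw window, daily counts of expired rows)."""
    cutoff = today - pd.Timedelta(days=raw_days)
    expired = raw[raw["ts"] < cutoff]
    kept = raw[raw["ts"] >= cutoff]
    # The aggregate keeps only per-day counts; user_id does not survive.
    daily = (
        expired.assign(day=expired["ts"].dt.floor("D"))
        .groupby("day")
        .size()
        .rename("events")
        .reset_index()
    )
    return kept, daily


raw = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "ts": pd.to_datetime(["2026-01-05", "2026-01-05", "2026-01-06", "2026-04-30"]),
})
kept, daily = apply_retention(raw, today=pd.Timestamp("2026-05-08"), raw_days=90)
print(len(kept))                      # 1
print(daily["events"].tolist())       # [2, 1]
print("user_id" in daily.columns)     # False
```

Only the April row is still inside the 90-day window; the three January rows survive solely as per-day counts with no user identifier.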

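2.2.6 Area 6: Possible Harm Scenarios

The authors ask: “What harm would befall users if that data or some subset be made public?”

Harm type	Scenario
Health	Medical information leaked → insurance or employment discrimination
Psychological	Sensitive search history (suicide, depression) exposed → trauma
Emotional	Private messages exposed → damaged relationships
Social	Religious or political information → damaged reputation and relationships
Financial	Payment information → fraud, identity theft

This scenario thought experiment is the core of the review. A reviewer must be able to answer: “If this data leaks, what harm do users suffer?”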

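2.3 Privacy by Design: Operational Framework

Summary of the 7 principles (Cavoukian), as listed on Wikipedia (background knowledge).

#	Principle	Application
1	Proactive not Reactive	Protect before an incident occurs
2	Privacy as Default	Opt-in, minimal collection
3	Privacy Embedded into Design	Integrated at the architecture stage
4	Full Functionality	Privacy and functionality together (positive-sum, not zero-sum)
5	End-to-End Security	From collection through deletion
6	Visibility and Transparency	Disclosed to users
7	Respect for User Privacy	User-centric

This framework operationalizes the six-area check of Ch.9.5.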

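2.4 User Expectations: Privacy and Confidentiality

Two concepts the authors distinguish.

Concept	Definition	Example
Privacy	“The right not to be observed”	Behavior inside one's home
Confidentiality	“The right not to have observed information disclosed”	A doctor keeping a consultation private

2.4.1 Privacy Expectation Spectrum

As the authors describe (Ch.9.5):

Public setting (low expectation)  ──────  Private setting (high expectation)
Football stadium → park → cafe → home → bedroom

Implication: public data (for example, a public Twitter post) carries a low privacy expectation; private data (direct messages) carries a high one.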

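2.4.2 Mechanisms for Ensuring Confidentiality

Four items the authors list:

Level of confidentiality: what level users can expect (for example, “accessible to company employees only” vs. “public”)
Internal Safeguards: access control, audit log, encryption
Breach Detection: procedures to discover, report, and manage incidents
User Redress: notifying and compensating users after an incident

Intuition: the access log is where everything starts

If data access is unrestricted even inside the company, confidentiality is broken.

How access log + audit works:

For every data access:
  - Who
  - When
  - What (which data)
  - Why (which purpose)
  → written to the log
  → audited regularly
  → anomalous access detected (for example, late-night access, or access to unrelated user data)

This log is the enforcement tool for confidentiality. Without it, internal abuse cannot be detected.

This is standard practice at companies such as LinkedIn, Microsoft, and Google. A company adopting an experimentation platform for the first time should also treat the access log as mandatory from the adoption stage.

The log itself is sensitive (it records who looked at what), so protecting the log needs its own framework. The problem is recursive, but standard solutions exist (separate audit permissions, immutable logs).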
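
One way to approximate an immutable log is a hash chain, where each record stores the hash of the previous one so that later edits become detectable. The sketch below is an illustration under that assumption, not a description of any particular company's system; the helper names `append` and `verify` and the record layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(entry: dict, prev_hash: str) -> str:
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: list, entry: dict) -> None:
    """Append an access record linked to the hash of the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"entry": entry, "prev": prev_hash, "hash": _digest(entry, prev_hash)})


def verify(chain: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash or record["hash"] != _digest(record["entry"], prev_hash):
            return False
        prev_hash = record["hash"]
    return True


chain = []
append(chain, {"who": "alice", "when": "2026-05-08T10:00", "what": "search_logs", "why": "analysis"})
append(chain, {"who": "bob", "when": "2026-05-08T11:00", "what": "search_logs", "why": "monitoring"})
print(verify(chain))  # True

chain[0]["entry"]["who"] = "mallory"  # simulate tampering with an old record
print(verify(chain))  # False
```

A hash chain makes tampering detectable rather than impossible, so it still needs to be paired with the separate audit permissions mentioned above.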

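3 Why It Is Needed

Without a Choice and Data Collection framework:

Ignoring switching cost → high-risk experiments imposed on locked-in SaaS users
No sensitivity classification → inadequate protection for location and health data
No purpose limitation → use beyond the collection purpose (for example, data collected for analysis used for advertising)
No access log → internal abuse goes undetected
No breach procedure → delayed user notification after an incident → additional harm

With the framework in place:

Risk-proportional review: a bar that adapts to switching cost
Privacy by Default: minimal collection and retention lower the inherent risk
Internal Trust: access log + audit deter and detect internal abuse
Breach Resilience: fast notification and redress after an incident
Regulatory Compliance: most of the groundwork for GDPR, HIPAA, and CCPA is already in place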

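4 Applied Example: Six-Area Data Collection Matrix

Hypothetical experiment: “analyze users' search history to evaluate ad personalization”

Area	Check	Answer
What	Items collected	Search queries, clicks, impressions
Sensitivity	Level	Medium (some search queries are sensitive)
Identifiable	PII category	Pseudonymous (cookie ID)
Purpose	Purpose	Training and evaluating the ad ranking model
Necessity and deletion	Retention	Aggregate after 90 days, delete raw data after 1 year
Harm	If leaked	Sensitive parts of search queries (medical, financial, relationships) → psychological and social harm

Review decision: medium risk → lightweight IRB review. If sensitivity handling and retention can be tightened, downgrade to a self-checklist.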
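
The matrix can also be kept as a structured record so that an unanswered area is caught before review. This is a minimal sketch; the class name `DataCollectionReview`, its field names, and the `missing` helper are assumptions for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class DataCollectionReview:
    what: str = ""
    sensitivity: str = ""
    identifiability: str = ""
    purpose: str = ""
    retention: str = ""
    harm_if_leaked: str = ""

    def missing(self) -> list:
        """Names of the areas that have not been answered yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


review = DataCollectionReview(
    what="search queries, clicks, impressions",
    sensitivity="medium",
    identifiability="pseudonymous (cookie ID)",
    purpose="train and evaluate ad ranking model",
    retention="",  # left blank on purpose to show the check
    harm_if_leaked="psychological and social harm from sensitive queries",
)
print(review.missing())  # ['retention']
```

A review request with a non-empty `missing()` list would simply be sent back before anyone discusses the risk level.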

5 Code Example: Automating the Access Log Audit

A simplified implementation of anomaly detection on internal access.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

rng = np.random.default_rng(42)

# Build a synthetic access log (normal + anomalous rows)
n_normal = 1000
n_anomaly = 5

# Normal access: 09:00-18:00 on a weekday, always with a stated purpose.
# About a third of these rows read the "search" team's data: legitimate
# cross-team reads that raise exactly one flag.
normal_users = ["alice", "bob", "carol", "dave", "eve"]
normal_log = pd.DataFrame({
    "user": rng.choice(normal_users, n_normal),
    "timestamp": [datetime(2026, 5, 8, 9, 0) + timedelta(minutes=int(m))
                  for m in rng.uniform(0, 9 * 60, n_normal)],
    "data_owner_team": rng.choice(["ranking", "ranking", "search"], n_normal),
    "user_team": "ranking",
    "purpose": rng.choice(["model_training", "analysis", "monitoring"], n_normal),
})

# Anomalous access: late night + another team's data + no stated purpose
anomaly_log = pd.DataFrame({
    "user": ["mallory"] * n_anomaly,
    "timestamp": [datetime(2026, 5, 8, 2, 0) + timedelta(minutes=int(m))
                  for m in rng.uniform(0, 60, n_anomaly)],
    "data_owner_team": ["finance"] * n_anomaly,
    "user_team": ["ranking"] * n_anomaly,
    "purpose": ["unknown"] * n_anomaly,
})

log = pd.concat([normal_log, anomaly_log], ignore_index=True)

# Escalate only when at least this many rules fire on the same row;
# a single flag (such as a cross-team read) is too common to alert on.
MIN_FLAGS = 2

def flag_access(row):
    """Return the list of rule names that fire for one log row."""
    flags = []
    # 1. Late-night access (00:00-05:59)
    if 0 <= row["timestamp"].hour < 6:
        flags.append("late_night")
    # 2. Access to another team's data
    if row["data_owner_team"] != row["user_team"]:
        flags.append("cross_team")
    # 3. No stated purpose
    if row["purpose"] == "unknown":
        flags.append("no_purpose")
    return flags

flag_lists = [flag_access(row) for _, row in log.iterrows()]
log["anomaly_flags"] = [", ".join(f) if f else None for f in flag_lists]
log["n_flags"] = [len(f) for f in flag_lists]
anomalies = log[log["n_flags"] >= MIN_FLAGS]

print(f"Total accesses: {len(log)}")
print(f"Anomalies found: {len(anomalies)}")
print("\nAnomalous accesses:")
print(anomalies[["user", "timestamp", "data_owner_team", "anomaly_flags"]].to_string(index=False))

Expected output (seed 42). The counts follow from the rules above; the exact minute values depend on the random draws.

Total accesses: 1005
Anomalies found: 5

Anomalous accesses:
    user           timestamp data_owner_team                          anomaly_flags
mallory 2026-05-08 02:18:00         finance late_night, cross_team, no_purpose
mallory 2026-05-08 02:35:00         finance late_night, cross_team, no_purpose
mallory 2026-05-08 02:48:00         finance late_night, cross_team, no_purpose
mallory 2026-05-08 02:09:00         finance late_night, cross_team, no_purpose
mallory 2026-05-08 02:55:00         finance late_night, cross_team, no_purpose

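Intuition: multiple flags are the core of anomaly detection

When all 3 flags fire together, the probability of a real anomaly is very high.

Single flag (for example, late-night access):
  - May well be legitimate (actual overtime, urgent work)
  - Many false positives

All 3 flags together (late night + another team + no purpose):
  - Almost no legitimate scenario
  - Escalate immediately

This multi-flag approach is a common starting point for access log audits. Alerting on single flags floods reviewers with false positives, and the resulting alert fatigue gets the audit ignored.

Further development:

ML-based anomaly score (learning behavior patterns)
User-specific baseline (modeling each user's normal pattern)
Temporal pattern (for example, a user who usually makes 5 accesses suddenly makes 50)

This refinement evolves with platform maturity. Start with simple multi-flag rules.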
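
The user-specific baseline and temporal pattern ideas can be combined in a small sketch: compare each user's latest daily access count with that user's own history. The column names, the floor of 1.0 on the standard deviation, and the threshold of 3.0 are assumptions chosen for the example, not recommended values.

```python
import pandas as pd


def spike_scores(counts: pd.DataFrame) -> pd.Series:
    """Z-score of each user's latest daily count against their own history."""
    scores = {}
    for user, grp in counts.sort_values("day").groupby("user"):
        history = grp["n_access"].iloc[:-1]
        latest = grp["n_access"].iloc[-1]
        # Floor the spread at 1.0 so very flat histories do not explode the score.
        spread = max(history.std(ddof=0), 1.0)
        scores[user] = (latest - history.mean()) / spread
    return pd.Series(scores, name="z")


# Hypothetical daily counts: alice jumps from about 5 accesses to 50.
counts = pd.DataFrame({
    "user": ["alice"] * 5 + ["bob"] * 5,
    "day": list(range(5)) * 2,
    "n_access": [5, 6, 4, 5, 50, 5, 4, 6, 5, 6],
})

scores = spike_scores(counts)
flagged = scores[scores > 3.0].index.tolist()
print(flagged)  # ['alice']
```

Because each user is compared with their own past, a heavy but steady user is not flagged, while a sudden tenfold jump is.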

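6 Related Topics

Prerequisite: Ch.9 series

Next post

F9-3: Ethics Culture and Identifiers

Related chapters

F8-0: Ch.8 Overview: Institutional Memory (ethics of captured data)

Links to other categories

Governance: Data Governance (operating Privacy by Design)
Engineering: Access Control and Audit (internal safeguards)
Surveillance: HIPAA and GDPR (medical and EU regulation)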