Kwangmin Kim - Klein Ch.1 § 1.17~1.18 In Depth: Marijuana (Doubly Censored)

1 Introduction: Two Data Sets with Non-Standard Censoring

Klein series roadmap:

Part	Topic
Ch.1 Overview (01)	Catalog of the 19 examples
… (earlier parts)	…
§ 1.15~1.16 (01-8)	Psychiatric + Channing (Left Truncation)
§ 1.17~1.18 (this part)	Marijuana (Doubly Censored) + Breast Cancer (Interval Censored)
§ 1.19 (planned, or skipped)	AIDS (Right Truncation)

Five questions this part answers

How does doubly censored data (left + right) differ from standard right-censored data?
Why does interval censoring arise naturally in periodic follow-up data, and how does the general likelihood change?
How does Turnbull's self-consistency algorithm find the NPMLE?
How is the EM algorithm applied to interval-censored data?
How do the censoring mechanisms differ between § 1.17 (self-report) and § 1.18 (clinical observation)?

2 Non-Standard Censoring: The Unifying Theme of This Part

2.1 Three Forms of Censoring (Klein Ch.3.3)

True event time T (unknown)           Observation window
       |                                  |

Right Censoring (most common):
  Study start                           End of observation
    |─────────────────────────────────|
    (T lies beyond this window)        c_i (right-censoring time)
    Likelihood: S(c_i)

Left Censoring (less common):
    ●(event in the past)               Observation starts
    ──T (occurred before L_i)─────────|─────────
                                       L_i
    Likelihood: 1 - S(L_i) = F(L_i)

Interval Censoring (periodic exams):
  Exam 1                  Exam 2
    |─────T(somewhere)────|
    L_i                   R_i
    Likelihood: S(L_i) - S(R_i)

Intuition: the form of the likelihood

Right censored at \(c_i\): \(S(c_i)\).

Left censored at \(L_i\): \(1 - S(L_i)\).

Interval censored in \((L_i, R_i]\): \(S(L_i) - S(R_i)\).

Exact event at \(T_i\): \(f(T_i)\), or for a discrete distribution the jump \(S(T_i-) - S(T_i)\).

Each type contributes a different function of \(S\), so a single model for \(S\) handles all of them.

Combined likelihood (mixed censoring):

\[ \mathcal{L} = \prod_i \begin{cases} f(T_i) & \text{exact} \\ S(c_i) & \text{right cens} \\ 1 - S(L_i) & \text{left cens} \\ S(L_i) - S(R_i) & \text{interval cens} \end{cases} \]

→ Even when several censoring types coexist, one likelihood covers them all.
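As a minimal sketch of the mixed likelihood above, the following evaluates the log-likelihood of a toy data set under an exponential model. The exponential form \(S(t) = e^{-\lambda t}\), the function names, and the toy observations are all assumptions made for illustration; they do not come from Klein.

```python
import math

def exp_survival(t, rate):
    """S(t) = exp(-rate * t) for an exponential model (illustrative choice)."""
    return math.exp(-rate * t)

def log_likelihood(observations, rate):
    """Sum of log contributions for a mix of censoring types.

    Each observation is (kind, a, b):
      "exact":    event at a            -> f(a) = rate * S(a)
      "right":    event after a         -> S(a)
      "left":     event at or before a  -> 1 - S(a)
      "interval": event in (a, b]       -> S(a) - S(b)
    """
    total = 0.0
    for kind, a, b in observations:
        if kind == "exact":
            contrib = rate * exp_survival(a, rate)
        elif kind == "right":
            contrib = exp_survival(a, rate)
        elif kind == "left":
            contrib = 1.0 - exp_survival(a, rate)
        elif kind == "interval":
            contrib = exp_survival(a, rate) - exp_survival(b, rate)
        else:
            raise ValueError(f"unknown censoring type: {kind}")
        total += math.log(contrib)
    return total

# Hypothetical toy data: one observation of each type
toy = [("exact", 2.0, None), ("right", 5.0, None),
       ("left", 1.0, None), ("interval", 3.0, 4.0)]
print(log_likelihood(toy, rate=0.3))
```

Every branch uses only \(S\) (and \(f = \lambda S\) for this model), which is the sense in which one model handles all censoring types.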

3 § 1.17 Marijuana — Doubly Censored Data

3.1 Study Background: Self-Reported Survey

Stanford-Palo Alto Peer Counseling Program (Hamburg et al. 1975); data as reported by Turnbull and Weiss (1978).

n: 191 California high school boys.
Question: “When did you first use marijuana?”
Three kinds of answer:
1. An exact age (e.g., “first at 14”): exact observation.
2. “I have never used it”: right-censored at current age.
3. “I have used it but cannot recall when”: left-censored at current age.

→ Doubly censored data (a mix of left-censored, right-censored, and exact observations).

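Intuition: why this censoring arises naturally

Why this pattern?

Exact: a clearly remembered first experience (a distinct event).
Never: the event has not happened yet (right-censored at current age).
Forgot when: the event happened, but its timing is unknown (left-censored).

Features of a self-reported survey:

Recall bias (timing is hard to remember).
Social desirability bias (drug use may be under-reported).
The doubly censored structure itself, however, arises naturally.

Contrast with typical clinical data:

Clinical: direct clinical observation → typically exact or right-censored times.
Survey: relies on respondents' recall → left censoring can occur.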

3.2 Data: Table 1.8

Age	Exact	Yet to Smoke (Right Cens)	Started Earlier (Left Cens)
10	4	0	0
11	12	0	0
12	19	2	0
13	24	15	1
14	20	24	2
15	13	18	3
16	3	14	2
17	1	6	3
18	0	0	1
>18	4	0	0

Total exact: 100.
Total right-censored: 79 (no use up to current age).
Total left-censored: 12 (used before current age, time of first use unknown).
n: 191.
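The totals above can be checked directly from the table. A short sketch, with the three columns typed in from Table 1.8 (the variable names are illustrative):

```python
# Columns of Table 1.8, ages 10..18 followed by the ">18" row
ages = ["10", "11", "12", "13", "14", "15", "16", "17", "18", ">18"]
exact = [4, 12, 19, 24, 20, 13, 3, 1, 0, 4]
right_cens = [0, 0, 2, 15, 24, 18, 14, 6, 0, 0]   # yet to smoke
left_cens = [0, 0, 0, 1, 2, 3, 2, 3, 1, 0]        # started earlier

totals = {
    "exact": sum(exact),
    "right": sum(right_cens),
    "left": sum(left_cens),
}
n = sum(totals.values())
print(totals, n)

# Exact first-use ages concentrated at 13-15
peak = sum(e for a, e in zip(ages, exact) if a in ("13", "14", "15"))
print(peak)
```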

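Intuition: reading the table

Age 13 row (24 / 15 / 1):

24 boys: first use at exactly age 13.
15 boys: interviewed at age 13, had not yet used.
1 boy: interviewed at age 13, had already used but could not say when.

Age 17 row (1 / 6 / 3):

Only 1 boy first used at exactly age 17 (most exact first-use ages are younger).
6 boys: had not used by age 17.
3 boys: left-censored at age 17.

Interpretation:

First use concentrates at ages 13-15 (24 + 20 + 13 = 57 of the 100 exact observations).
Age 17 row: 6/(1+6+3) = 60% had not yet used. This is a raw row proportion, not an estimate of \(S(17)\); a proper estimate needs the NPMLE of § 3.4.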

3.3 Where Klein Uses It

Ch.5.2: estimating the survival function from doubly censored data.
Turnbull's self-consistency NPMLE algorithm.

3.4 Doubly Censored NPMLE: Turnbull (1974) Algorithm

3.4.1 Likelihood

Contribution of subject \(i\):

\[ \mathcal{L}_i = \begin{cases} f(T_i) & \text{exact at } T_i \\ 1 - S(T_i) & \text{left-censored at } T_i \\ S(T_i) & \text{right-censored at } T_i \end{cases} \]

3.4.2 Self-Consistency Algorithm

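Initial guess: \(\hat S(t)\) (e.g., the Kaplan-Meier estimate computed while ignoring the left-censored observations).
Estimate weights:
- For left-censored at \(T_i\): distribute one unit of weight over the support points \(t_j \le T_i\), proportional to the current mass \(\hat p_j = \hat S(t_{j-1}) - \hat S(t_j)\).
- For right-censored at \(T_i\): distribute one unit of weight over the support points \(t_j > T_i\), again proportional to \(\hat p_j\).
Update \(\hat S\): one minus the empirical CDF of the pseudo-complete data.
Repeat until convergence.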
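The steps above can be sketched on the Table 1.8 counts. This is a plain EM/self-consistency iteration on the discrete ages, not Klein's exact worked procedure. Three assumptions are made here: the ">18" row is coded as age 19, "left-censored at age a" means first use at or before a, and "right-censored at age a" means first use after a.

```python
import numpy as np

# Table 1.8 counts; age code 19 is a stand-in for the ">18" row (assumption)
ages = np.arange(10, 20)
exact = np.array([4, 12, 19, 24, 20, 13, 3, 1, 0, 4])
right_cens = np.array([0, 0, 2, 15, 24, 18, 14, 6, 0, 0])
left_cens = np.array([0, 0, 0, 1, 2, 3, 2, 3, 1, 0])
K = len(ages)

# One row of alpha per (type, age) cell: 1 marks ages where the event could lie
rows, weights = [], []
for k in range(K):
    if exact[k] > 0:                 # event at exactly ages[k]
        a = np.zeros(K)
        a[k] = 1
        rows.append(a)
        weights.append(exact[k])
    if right_cens[k] > 0:            # event after ages[k]
        a = np.zeros(K)
        a[k + 1:] = 1
        rows.append(a)
        weights.append(right_cens[k])
    if left_cens[k] > 0:             # event at or before ages[k] (assumed)
        a = np.zeros(K)
        a[:k + 1] = 1
        rows.append(a)
        weights.append(left_cens[k])
alpha = np.array(rows)
w = np.array(weights, dtype=float)

def em_step(p):
    """E-step: split each cell's weight in proportion to p; M-step: renormalize."""
    post = alpha * p
    post = post / post.sum(axis=1, keepdims=True)
    return (w[:, None] * post).sum(axis=0) / w.sum()

p = np.full(K, 1.0 / K)              # uniform starting masses
for _ in range(5000):
    p_new = em_step(p)
    converged = np.max(np.abs(p_new - p)) < 1e-10
    p = p_new
    if converged:
        break

S = 1.0 - np.cumsum(p)               # S(age) = P(first use after that age)
for age, s in zip(ages, S):
    print(age, round(float(s), 3))
```

At convergence, applying `em_step` once more leaves `p` essentially unchanged, which is the self-consistency property in code form.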

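Intuition: self-consistency as EM

In EM form:

E-step: estimate the event-time distribution of each censored observation (based on the current \(\hat S\)).
M-step: update \(\hat S\) from the weighted empirical CDF.

Self-consistency:

“The iteration has converged when \(\hat S\) agrees with itself.”
That is, re-weighting with the estimated \(\hat S\) returns the same \(\hat S\).

Analogy:

“Self-expression is genuine when what I say matches what I think.”
Iteration drives the estimate to self-consistency.

The NPMLE has good theoretical properties and reduces to the Kaplan-Meier estimator when the data are only right-censored.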

4 § 1.18 Breast Cancer Cosmetic Deterioration (Beadle 1984)

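4.1 Medical Background: Cosmetic Outcome of Breast-Conserving Therapy

4.1.1 Mastectomy vs Breast-Conserving Therapy

Mastectomy (removal of the whole breast): the traditional standard, with a poor cosmetic result.
Excision biopsy + radiation (conserving therapy): preserves breast shape with comparable survival.
+ Adjuvant chemotherapy: prevents recurrence but may damage normal tissue → suspected to worsen the cosmetic result.

4.1.2 Study Hypothesis (Beadle 1984)

“Does radiation + chemotherapy cause cosmetic deterioration (breast retraction) sooner than radiation alone?”

Why cosmesis matters: it directly affects the patient's quality of life.
Trade-off: chemotherapy prevents recurrence but may worsen cosmesis.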

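Intuition: why a cosmetic endpoint matters

Survival outcome:

Time to death.
Unambiguous definition.

Cosmetic outcome:

“Breast retraction”.
3-point scale (none, moderate, severe).
Event: first appearance of moderate or severe retraction.
Subjective (clinician judgment) → consistent grading is required.

Importance:

The main advantage of conserving therapy is cosmetic.
If chemotherapy undermines it → conserving therapy loses value.
This informs a patient's choice between mastectomy and conserving therapy.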

4.2 데이터 구조 — Interval-Censored

4.2.1 Visit Schedule

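Initial:    visit 1 → visit 2 → visit 3 → ...
Interval:    4-6 mo    4-6 mo    longer (as recovery progresses)
                                  ↑ e.g., 8-12 mo, 1 year or more

→ The event can occur between visits → interval censoring.

4.2.2 Data Format

Each patient's record:

(a, b]: the event occurred between the visit at month a and the visit at month b.
≥ a: the event had not occurred by the last visit at month a (right-censored).

4.2.3 Table 1.9 Data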

Radiotherapy only (n = 46):

(0, 7]; (0, 8]; (0, 5]; (4, 11]; (5, 12]; (5, 11]; (6, 10]; (7, 16]; (7, 14]; (11, 15]; (11, 18]; ≥15; ≥17; (17, 25]; (17, 25]; ≥18; (19, 35]; (18, 26]; ≥22; ≥24; ≥24; (25, 37]; (26, 40]; (27, 34]; ≥32; ≥33; ≥34; (36, 44]; (36, 48]; ≥36; ≥36; (37, 44]; ≥37; ≥37; ≥37; ≥38; ≥40; ≥45; ≥46 (×8)

Radiotherapy + Chemotherapy (n = 48):

(0, 22]; (0, 5]; (4, 9]; (4, 8]; (5, 8]; (8, 12]; (8, 21]; (10, 35]; (10, 17]; (11, 13]; ≥11; (11, 17]; ≥11; (11, 20]; (12, 20]; ≥13; (13, 39]; ≥13; ≥13; (14, 17]; (14, 19]; (15, 22]; (16, 24]; (16, 20]; (16, 24]; (16, 60]; (17, 27]; (17, 23]; (17, 26]; (18, 25]; (18, 24]; (19, 32]; ≥21; (22, 32]; ≥23; (24, 31]; (24, 30]; (30, 34]; (30, 36]; ≥31; ≥32; (33, 40]; ≥34; ≥34; ≥35; (35, 39]; (44, 48]; ≥48

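Intuition: what interval width means

Narrow interval (e.g., (5, 8]):

The event occurred between the visits at months 5 and 8.
The exact time is unknown but lies within a 3-month window.
Highly informative.

Wide interval (e.g., (16, 60]):

Between the visits at months 16 and 60.
A 44-month window, so much less informative.
The patient made no visit during the long gap.

Right-censored (≥48):

No event by the last visit at month 48.
The event time is known only to exceed 48 months.

Observed proportions:

Radiation only: 21/46 (46%) have an interval-observed event.
Radiation + chemo: 35/48 (73%), which hints at faster deterioration; raw proportions ignore the censoring times, so this is only suggestive.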
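To make the record format and the counts concrete, here is a small sketch that parses the radiotherapy + chemotherapy records as listed above and tallies events, censored cases, and interval widths. The parser `parse_record` is a hypothetical helper written for this note.

```python
import math
import re

def parse_record(text):
    """Parse '(a, b]' or '≥a' into (left, right); right = inf if right-censored."""
    text = text.strip()
    m = re.fullmatch(r"\((\d+),\s*(\d+)\]", text)
    if m:
        return float(m.group(1)), float(m.group(2))
    m = re.fullmatch(r"(?:≥|>=)\s*(\d+)", text)
    if m:
        return float(m.group(1)), math.inf
    raise ValueError(f"unrecognized record: {text!r}")

rad_chemo = """(0, 22]; (0, 5]; (4, 9]; (4, 8]; (5, 8]; (8, 12]; (8, 21];
(10, 35]; (10, 17]; (11, 13]; ≥11; (11, 17]; ≥11; (11, 20]; (12, 20]; ≥13;
(13, 39]; ≥13; ≥13; (14, 17]; (14, 19]; (15, 22]; (16, 24]; (16, 20];
(16, 24]; (16, 60]; (17, 27]; (17, 23]; (17, 26]; (18, 25]; (18, 24];
(19, 32]; ≥21; (22, 32]; ≥23; (24, 31]; (24, 30]; (30, 34]; (30, 36]; ≥31;
≥32; (33, 40]; ≥34; ≥34; ≥35; (35, 39]; (44, 48]; ≥48"""

records = [parse_record(r) for r in rad_chemo.split(";")]
events = [(l, r) for l, r in records if math.isfinite(r)]
censored = [(l, r) for l, r in records if math.isinf(r)]
widths = [r - l for l, r in events]

print(len(records), len(events), len(censored))
print(min(widths), max(widths))   # narrowest and widest event interval, in months
```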

4.3 Where Klein Uses It

Ch.5.2: estimating the survival function from interval-censored data.
Turnbull NPMLE.

4.4 Interval-Censored NPMLE — Turnbull Algorithm

4.4.1 Likelihood

\[ \mathcal{L}_i = S(L_i) - S(R_i) \]

(The probability that the event occurs within \((L_i, R_i]\).)

4.4.2 Turnbull (1976) Algorithm

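Innermost intervals: from all observed endpoints, find the disjoint intervals \((q_j, p_j]\) whose left end is some \(L_i\), whose right end is some \(R_i\), and which contain no other endpoint in between. The NPMLE places mass only on these.
Initial weights: spread mass \(\hat\pi_j\) uniformly over the innermost intervals.
Update:

\[ \hat\pi_j^{(k+1)} = \frac{1}{n} \sum_i \frac{\hat\pi_j^{(k)} \alpha_{ij}}{\sum_l \hat\pi_l^{(k)} \alpha_{il}} \]

\(\alpha_{ij} = 1\) if innermost interval \(j\) lies inside patient \(i\)'s \((L_i, R_i]\),
and 0 otherwise.

Repeat until convergence.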
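The first step, finding the innermost intervals, can be sketched as a sort over endpoints. The function name `innermost_intervals` is hypothetical, and the half-open convention \((L, R]\) is assumed, so at a tied value a right endpoint is ordered before a left endpoint.

```python
def innermost_intervals(left, right):
    """Return the innermost intervals (q, p] for observations (L_i, R_i].

    An innermost interval starts at some left endpoint, ends at some right
    endpoint, and has no other endpoint strictly in between.
    """
    # kind 0 = right endpoint, kind 1 = left endpoint; at equal values the
    # right endpoint sorts first because (a, b] and (b, c] do not overlap.
    points = [(l, 1) for l in left] + [(r, 0) for r in right]
    points.sort()
    found = set()
    for (a, kind_a), (b, kind_b) in zip(points, points[1:]):
        if kind_a == 1 and kind_b == 0:   # a left end directly followed by a right end
            found.add((a, b))
    return sorted(found)

# First four radiotherapy-only records: (0, 7], (0, 8], (0, 5], (4, 11]
print(innermost_intervals([0, 0, 0, 4], [7, 8, 5, 11]))
```

All four of these records overlap on \((4, 5]\), so that is their only innermost interval, and an NPMLE fitted to just these four would place all of its mass there.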

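Intuition: where is the event time?

Patient \(i\) with \((5, 12]\):

The event lies somewhere in (5, 6], (6, 7], …, (11, 12] (unit sub-intervals used here for illustration; the algorithm works on innermost intervals).
Each sub-interval's probability is proportional to the current \(\hat\pi\).

Self-consistency:

Distribute each patient's contribution → update \(\hat\pi\).
Redistribute with the updated \(\hat\pi\).
Iterating converges to the NPMLE.

EM interpretation:

E-step: the expected indicator of which innermost interval contains each patient's event time.
M-step: update \(\hat\pi\).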
5 R + Python EDA — Marijuana

5.1 R: `icenReg` + `interval` Packages

library(icenReg)
library(interval)

# Klein Table 1.8: doubly censored
# Each row is coded as an interval (left, right]
# Exact: (age - 0.5, age + 0.5]
# Right-censored (at current age): (age, +Inf)
# Left-censored (time unknown): (0, age]; ages are positive, so 0 is the lower bound

# Marijuana data rebuilt from the Table 1.8 counts (not simulated)
marijuana <- data.frame(
  # exact observations; the ">18" row is coded as (18.5, Inf)
  left = c(rep(c(9.5, 10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5),
                c(4, 12, 19, 24, 20, 13, 3, 1, 0, 4)),
            rep(c(12, 13, 14, 15, 16, 17), c(2, 15, 24, 18, 14, 6)),
            rep(0, 12)),
  right = c(rep(c(10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, Inf),
                 c(4, 12, 19, 24, 20, 13, 3, 1, 0, 4)),
            rep(Inf, 79),
            rep(c(13, 14, 15, 16, 17, 18), c(1, 2, 3, 2, 3, 1)))
)

# Turnbull NPMLE for doubly censored
fit <- ic_np(cbind(left, right) ~ 0, data = marijuana)
# The plotted curve is the survival function, i.e. P(no use yet by a given age)
plot(fit, xlab = "Age (years)", ylab = "Probability of no marijuana use yet")

# Alternatively, icfit from the interval package
# library(interval)
# fit2 <- icfit(Surv(left, right, type = "interval2") ~ 1, data = marijuana)
# plot(fit2)

5.2 Python: Interval-Censoring Support in `lifelines`

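import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Data: same (left, right] layout as in R
# (left = 0 means left-censored, right = inf means right-censored).
# lifelines provides KaplanMeierFitter.fit_interval_censoring(lower_bound, upper_bound),
# an NPMLE fit for interval-censored data; R's icenReg / interval remain the
# richer toolset (two-sample tests, semiparametric regression).

# Quick alternative: midpoint imputation (loses information, biased)
def midpoint_impute(left, right):
    """Return (duration, event) arrays for a standard right-censored KM fit."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    event = np.isfinite(right)                      # right = inf -> right-censored
    lower = np.where(np.isfinite(left), left, 0.0)  # treat -inf as 0
    # Events are placed at the interval midpoint; censored cases keep `left`
    duration = np.where(event, (lower + right) / 2, lower)
    return duration, event.astype(int)

# Then standard KM on the imputed data, e.g.:
# duration, event = midpoint_impute(left, right)
# KaplanMeierFitter().fit(duration, event_observed=event).plot_survival_function()

# (R icenReg is recommended for the actual analysis)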

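Intuition: Python's limits

Python (lifelines):

Built mainly around right censoring.
Interval censoring is supported through the `fit_interval_censoring` methods (an NPMLE on `KaplanMeierFitter`, plus parametric fitters), but the tooling is narrower than R's.
Midpoint imputation remains a quick workaround (with information loss).

R (icenReg, interval):

Rich tooling for interval and doubly censored data.
Turnbull NPMLE, two-sample tests, and semiparametric regression.

Recommendation:

Interval/doubly censored data → R for the full workflow.
Python users can call R (rpy2),
or implement the EM directly (numpy/scipy).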

6 R + Python EDA — Breast Cancer Cosmetic

6.1 R: Interval-Censoring Analysis

library(icenReg)
library(survival)

# Klein Table 1.9 — interval censored
beadle <- data.frame(
  group = c(rep("rad_only", 46), rep("rad_chemo", 48)),
  left = c(
    # rad_only (46)
    0, 0, 0, 4, 5, 5, 6, 7, 7, 11, 11, 15, 17, 17, 17, 18, 19, 18, 22, 24, 24,
    25, 26, 27, 32, 33, 34, 36, 36, 36, 36, 37, 37, 37, 37, 38, 40, 45, 46, 46,
    46, 46, 46, 46, 46, 46,  # >=46 entered 8 times so that n = 46; verify against Table 1.9
    # rad_chemo (48)
    0, 0, 4, 4, 5, 8, 8, 10, 10, 11, 11, 11, 11, 11, 12, 13, 13, 13, 13, 14, 14,
    15, 16, 16, 16, 16, 17, 17, 17, 18, 18, 19, 21, 22, 23, 24, 24, 30, 30, 31,
    32, 33, 34, 34, 35, 35, 44, 48
  ),
  right = c(
    # rad_only
    7, 8, 5, 11, 12, 11, 10, 16, 14, 15, 18, Inf, Inf, 25, 25, Inf, 35, 26, Inf,
    Inf, Inf, 37, 40, 34, Inf, Inf, Inf, 44, 48, Inf, Inf, 44, Inf, Inf, Inf,
    Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf,
    # rad_chemo
    22, 5, 9, 8, 8, 12, 21, 35, 17, 13, Inf, 17, Inf, 20, 20, Inf, 39, Inf, Inf,
    17, 19, 22, 24, 20, 24, 60, 27, 23, 26, 25, 24, 32, Inf, 32, Inf, 31, 30,
    34, 36, Inf, Inf, 40, Inf, Inf, Inf, 39, 48, Inf
  )
)

# Turnbull NPMLE per group
fit_rad <- ic_np(cbind(left, right) ~ 0,
                 data = beadle[beadle$group == "rad_only", ])
fit_chemo <- ic_np(cbind(left, right) ~ 0,
                   data = beadle[beadle$group == "rad_chemo", ])

par(mfrow = c(1, 1))
plot(fit_rad, col = "blue", xlab = "Months",
     ylab = "Cosmetic deterioration-free probability")
lines(fit_chemo, col = "red")
legend("topright", legend = c("Radiation only", "Radiation + chemo"),
       col = c("blue", "red"), lty = 1)

# Two-sample test (interval censored)
library(interval)
ictest_obj <- ictest(Surv(left, right, type = "interval2") ~ group, data = beadle)
print(ictest_obj)

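# Or PH model (Cox-like for interval censored)
# Note: with the default bs_samples = 0 no bootstrap standard errors are reported;
# set bs_samples (e.g., 100) to obtain them.
fit_ph <- ic_sp(cbind(left, right) ~ group, data = beadle)
summary(fit_ph)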

6.2 Python — Manual EM Implementation

import numpy as np
import matplotlib.pyplot as plt

# Beadle data (same as in R)
# (omitted) `beadle` is assumed to be a pandas DataFrame with columns
# `left` and `right`, using np.inf for right-censored upper bounds.

def turnbull_npmle(left, right, max_iter=1000, tol=1e-7):
    """Turnbull self-consistency NPMLE for interval-censored data."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    # 1. Candidate support: intervals (t_j, t_{j+1}] between consecutive
    #    observed endpoints. This is a superset of Turnbull's innermost
    #    intervals; EM drives the mass of the extra intervals toward 0.
    times = np.unique(np.concatenate([left, right]))
    n_intervals = len(times) - 1

    # 2. Initial weights (uniform)
    pi = np.ones(n_intervals) / n_intervals

    # Indicator matrix: alpha[i, j] = 1 if (t_j, t_{j+1}] is in subject i's (L_i, R_i]
    alpha = ((left[:, None] <= times[None, :-1]) &
             (times[None, 1:] <= right[:, None])).astype(float)

    # 3. EM iterations
    for _ in range(max_iter):
        # E-step: probability of interval j for subject i
        contributions = alpha * pi  # broadcasting
        denom = contributions.sum(axis=1, keepdims=True)
        denom[denom == 0] = 1  # safeguard
        e_step = contributions / denom

        # M-step: update pi
        pi_new = e_step.sum(axis=0) / len(left)
        converged = np.max(np.abs(pi_new - pi)) < tol
        pi = pi_new  # keep the latest estimate even on the final iteration
        if converged:
            break

    # Survival evaluated at each interval's right endpoint
    S = 1 - np.cumsum(pi)
    return times[1:], S


# Apply (pooled over both groups; filter `beadle` by group for per-arm curves)
left_arr = beadle["left"].to_numpy()
right_arr = beadle["right"].to_numpy()

times, S = turnbull_npmle(left_arr, right_arr)
finite = np.isfinite(times)  # drop the open-ended tail interval when plotting
plt.step(times[finite], S[finite], where="post")
plt.xlabel("Months")
plt.ylabel("Survival (deterioration-free)")
plt.show()

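Intuition: the core of the Turnbull NPMLE

Key idea:

Spread each patient's event over the innermost intervals where it could lie.
The split is proportional to the current estimate of \(\pi\) (E-step).
After splitting, update \(\pi\) (M-step).

Alternative: midpoint imputation

Assume the event occurred at the midpoint of each interval → standard KM.
Loses information and introduces bias.
Suitable only for quick visualization.

Advantages of the Turnbull NPMLE:

Uses all the information in the intervals.
Asymptotic statistical properties (consistency).
Pairs with a PH model (icenReg::ic_sp).

For this data set: it gives a sound basis for comparing rad_only vs rad_chemo.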

7 Pedagogical Synthesis of the Two Data Sets

Aspect	§ 1.17 Marijuana	§ 1.18 Breast Cancer
n	191	94
Event	First marijuana use	First moderate/severe retraction
Censoring	Doubly (left + right + exact)	Interval + right
Data source	Self-reported survey	Clinical observation
Used in Klein	Ch.5.2	Ch.5.2
NPMLE	Turnbull self-consistency	Turnbull self-consistency

Intuition: how the two data sets complement each other

§ 1.17 Marijuana:

Self-reported → recall bias + social desirability bias.
Double censoring arises naturally (from the answer types).
Public health application (preventing drug-use onset).

§ 1.18 Breast Cancer:

Clinical observation → objective but limited by the visit schedule.
Interval censoring results from the gaps between visits.
Clinical decision application (the cosmetic trade-off of chemotherapy).

In common:

The same NPMLE algorithm (Turnbull self-consistency).
The same tools from Klein Ch.5.2.
Different domains, same statistical challenge.

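8 Key Intuitions in One Place

Right censoring = the most common form.
Left censoring = the event occurred before observation began.
Interval censoring = the event falls between two visits (periodic exams).
Doubly censored = a mix of left-censored, right-censored, and exact observations.
Likelihood: a different function of \(S\) for each type.
Turnbull NPMLE: self-consistency / EM algorithm.
Self-reported vs clinical: different censoring mechanisms, same tools.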

9 Practical Checklist: § 1.17~1.18

§ 1.17 Marijuana

Separate the three answer types (exact / right / left).
Recognize the likelihood for doubly censored data.
Apply the Turnbull NPMLE.
Keep self-report bias in mind.

§ 1.18 Breast Cancer

Understand the effect of the visit schedule.
Interval-censored records take the form (a, b].
Right-censored records take the form ≥a.
Turnbull NPMLE per group.
2-sample test (ictest in R).
PH model for interval-censored (ic_sp).

EDA

Counts by censoring type.
NPMLE survival curve.
Group comparison (per-group NPMLE or PH model).
Comparison with midpoint imputation (sanity check).

Next steps

Continue with § 1.19 (AIDS, right truncation) or move on to Ch.2.

10 Related Topics

Klein series

Ch.1 Overview
(previous) § 1.15~1.16: Psychiatric · Channing
(next) § 1.19 (planned, or skipped)

11 References

Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.), Ch.1 § 1.17~1.18. Springer.
Turnbull, B. W., & Weiss, L. (1978). A Likelihood Ratio Statistic for Testing Goodness of Fit with Randomly Censored Data. Biometrics, 34(3), 367-375.
Hamburg, B. A., Kraemer, H. C., & Jahnke, W. (1975). A Hierarchy of Drug Use in Adolescence: Behavioral and Attitudinal Correlates of Substantial Drug Use. American Journal of Psychiatry, 132(11), 1155-1163.
Beadle, G. F., Silver, B., Botnick, L., Hellman, S., & Harris, J. R. (1984a). Cosmetic Results Following Primary Radiation Therapy for Early Breast Cancer. Cancer, 54(12), 2911-2918.
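Beadle, G. F., Come, S., Henderson, C., Silver, B., Hellman, S., & Harris, J. R. (1984b). The Effect of Adjuvant Chemotherapy on the Cosmetic Results after Primary Radiation Treatment for Early Stage Breast Cancer. International Journal of Radiation Oncology, Biology, Physics, 10(11), 2131-2137.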
Turnbull, B. W. (1974). Nonparametric Estimation of a Survivorship Function with Doubly Censored Data. JASA, 69(345), 169-173.
Turnbull, B. W. (1976). The Empirical Distribution Function with Arbitrarily Grouped, Censored, and Truncated Data. JRSS B, 38(3), 290-295.
Sun, J. (2006). The Statistical Analysis of Interval-Censored Failure Time Data. Springer.
Anderson-Bergman, C. (2017). icenReg: Regression Models for Interval Censored Data in R. Journal of Statistical Software, 81(12), 1-23.
Fay, M. P., & Shaw, P. A. (2010). Exact and Asymptotic Weighted Logrank Tests for Interval Censored Data: The interval R Package. Journal of Statistical Software, 36(2), 1-34.
Davidson-Pilon, C. (2019). lifelines. JOSS, 4(40), 1317.