Kwangmin Kim - Klein Ch.3 Overview — Censoring and Truncation

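1 Introduction: from the four functions of Ch.2 to the inferential basis of Ch.3

Ch.2 defined the distribution itself through four functions (S, h, H, m). In real clinical data, however, censoring and truncation mean the distribution cannot be estimated directly from complete observations. Ch.3 builds the framework that handles censoring and truncation correctly at the level of the likelihood.

Ch.3 in one line

“Censoring gives partial information about the event time; truncation is a conditional constraint on whether a subject can be observed at all. Both are handled in a unified way by two tools: the master likelihood (3.5.1) and the counting-process pair N(t), Y(t).”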

Section	Topic	Canonical examples
§ 3.1	Introduction: the five sections that follow and where counting processes fit	(none)
§ 3.2	Right censoring, five types	§ 1.2 Leukemia, § 1.3 BMT, § 1.4 Dialysis
§ 3.3	Left and interval censoring	§ 1.17 Marijuana (doubly censored), § 1.18 Breast (interval censored)
§ 3.4	Truncation, three types	§ 1.16 Channing (left), § 1.19 AIDS (right)
§ 3.5	Master likelihood	Origin of every parametric MLE
§ 3.6	Counting processes	Aalen 1975; unified derivation of NA and KM

This post surveys all six sections above; the deep-dive posts previewed in Section 9 then take up each part in detail.

2 § 3.1 — Introduction

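Censoring vs. truncation

Censoring: the subject is observed, but the event time is only partially known. Example: “patient A had still not experienced the event at week 12”, so we know the event time exceeds 12 weeks.
Truncation: a subject is observed only if it satisfies the condition for entering the sample. Example: “only residents of a retirement community are followed”, so people who died before moving in never appear in the sample.

Aspect	Censoring	Truncation
Information	Partial (a lower or upper bound)	None at all for excluded subjects
Risk set	All subjects	Only those meeting the condition
Likelihood	$f$, $S$, or an interval probability	Conditional distribution (adds a denominator)
Intuition	“Outcome not observed”	“The sampling itself is biased”

This distinction is where the likelihood of § 3.5 branches.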

3 § 3.2: Right Censoring (five types)

Notation: subject $i$ has true event time $X_i$ and right-censoring time $C_{r,i}$. What we observe is

\[ T_i = \min(X_i, C_{r,i}), \quad \delta_i = I(X_i \leq C_{r,i}) = \begin{cases} 1 & \text{event} \\ 0 & \text{censored} \end{cases} \]
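The observation rule above can be sketched in a few lines of Python. This is only an illustration: the exponential event times, the rate 0.1, and the common censoring time of 10 are assumptions chosen for the example, not values from Klein.

```python
import random


def observe_right_censored(x, c):
    """Return (T, delta) for a true event time x and a censoring time c."""
    t = min(x, c)
    delta = 1 if x <= c else 0  # 1 = event observed, 0 = right censored
    return t, delta


random.seed(0)
# Hypothetical setup: exponential event times (rate 0.1) and the same
# fixed censoring time for every subject, i.e. Type I censoring.
CENSOR_TIME = 10.0
sample = [observe_right_censored(random.expovariate(0.1), CENSOR_TIME)
          for _ in range(8)]

for t, delta in sample:
    print(f"T = {t:6.2f}, delta = {delta}")
```

Every censored row shows T equal to the censoring time, which is the defining feature of Type I censoring discussed next.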

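3.1 Type I: a predetermined end time

The simplest form. Every subject still event-free at the same predetermined end time $C_r$ is sacrificed or censored.

Example: the NCTR carcinogen animal experiments, where every animal still alive a fixed number of days after the start is sacrificed.

Features of Type I

The censoring time $C_r$ is fixed, not random.
All censored observations share the same time.
It gives the simplest likelihood: the exponential case of Klein Example 3.9.

3.2 Progressive Type I: multi-stage sacrifice

A predetermined fraction is sacrificed at each of several times $C_{r}^{(1)} < C_{r}^{(2)} < \cdots$.

Example: in a 200-mouse experiment, some animals are sacrificed at week 42 (early information) and the rest at week 104. The design yields natural-history information on non-lethal disease and reduces cost.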

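3.3 Generalized Type I: subject-specific start times

Patients enroll at different calendar times, but the study end date is common, so follow-up length differs by subject.

Example: the § 1.5 breast cancer trial (Sedmak 1989), where enrollment dates are spread out while the end date (for example, 1985-12-31) is shared.

Lexis Diagram (Keiding 1990)

x-axis: calendar time
y-axis: time on study
45° line: a subject's progress through time
A subject who reaches the end date is censored (open dot); one who has the event before the end date is marked as an event (filled dot).

This diagram is the standard way to visualize trials in which subjects start at different times.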

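3.4 Type II: until the first r events

Fix an integer $r < n$ in advance. The experiment stops when the $r$-th event occurs.

Example: a reliability test of 100 light bulbs that stops once the first 60 have failed. The test duration $T_{(r)}$ is random.

Statistical features of Type II

$r$ and $n - r$ are fixed integers, while the censoring time $T_{(r)}$ is random.
The data are the first $r$ order statistics, so order-statistics theory applies directly.
Klein Theoretical Note 1, Eq. (3.5.7):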

\[ L_{II,1} = \frac{n!}{(n-r)!} \prod_{i=1}^r f(x_{(i)}) \left[S(x_{(r)})\right]^{n-r} \]

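3.5 Random censoring / competing risks: independence is required

Each subject has its own random censoring time $C_{r,i}$ (with distribution $G$). The assumption $X_i \perp C_{r,i}$ is essential.

Example: accidental death, migration, or loss to follow-up, where follow-up stops for reasons unrelated to the event under study.

The independence assumption cannot be tested

(Same structure as Tsiatis 1975 in Klein § 2.7.) From $(T_i, \delta_i)$ alone, $X \perp C_r$ is not identifiable.

Typically independent: accidental death, migration, death from natural causes (when the endpoint is something other than death).

Suspect cases: dropout because the patient's condition worsened, or stopping the drug because of side effects.

The latter are informative censoring, under which the standard KM and NA estimators are biased. The handling of dropout in the § 1.13 NLSY pneumonia data is the canonical example of this pitfall.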

4 § 3.3: Left and Interval Censoring

4.1 Left Censoring

The subject's event time $X$ occurred before the censoring time $C_l$, but the exact time is unknown.

\[ T = \max(X, C_l), \quad \varepsilon = I(X \geq C_l) = \begin{cases} 1 & \text{exact} \\ 0 & \text{left censored} \end{cases} \]

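Example (Klein Example 3.3): first marijuana use among California high school students (§ 1.17). “I have used it but cannot remember when” means the event occurred somewhere before the interview time $C_l$.

Example (Klein Example 3.4): infant cognitive development. A child who can already perform the task when the study starts learned it at some unknown time before the start.

4.2 Double Censoring

Left and right censoring are both present.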

\[ T = \max[\min(X, C_r), C_l], \quad \delta = \begin{cases} 1 & \text{exact} \\ 0 & \text{right censored} \\ -1 & \text{left censored} \end{cases} \]

Example: the § 1.17 marijuana data contain all three kinds: “never used” (right censored), “used, but time unknown” (left censored), and “exact time known” (exact).

Estimation with doubly censored data

The standard KM estimator handles right censoring only. Doubly censored data require Turnbull's (1974) self-consistency algorithm or an EM algorithm.

R: icenReg::ic_np(), the interval package. Python: lifelines (limited support).

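4.3 Interval Censoring

The most general form. All that is known is that the event occurred somewhere in the interval $(L_i, R_i]$.

Example (Klein Example 3.5): first onset of angina pectoris in the Framingham study, which occurs between clinical examinations held two years apart.

Example (Klein Example 3.6): cosmetic deterioration after breast cancer treatment (§ 1.18, Beadle 1984), where the change occurs between visits 4 to 6 months apart.

Generality of interval censoring

Left censoring = interval $(0, C_l]$, with left endpoint 0.
Right censoring = interval $(C_r, \infty)$, with right endpoint ∞.
Exact event = the degenerate case $L_i = R_i = t$, which contributes the density $f(t)$ rather than an interval probability.

Interval censoring is therefore the superclass of all censoring types, and the likelihood of § 3.5 expresses this fact in a single line.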
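One way to make the "everything is an interval" view concrete is a small helper that maps each observation type to an (L, R) pair. The function name `to_interval` and the string labels are hypothetical, introduced only for this sketch.

```python
import math


def to_interval(kind, t=None, left=None, right=None):
    """Map one observation to an (L, R) pair; L == R marks an exact event."""
    if kind == "exact":       # event observed exactly at t
        return (t, t)
    if kind == "right":       # event some time after t
        return (t, math.inf)
    if kind == "left":        # event at or before t
        return (0.0, t)
    if kind == "interval":    # event in (left, right]
        return (left, right)
    raise ValueError(f"unknown observation kind: {kind}")


# Hypothetical observations, one of each type
print(to_interval("exact", t=4.0))                      # (4.0, 4.0)
print(to_interval("right", t=12.0))                     # (12.0, inf)
print(to_interval("left", t=3.0))                       # (0.0, 3.0)
print(to_interval("interval", left=2.0, right=4.0))     # (2.0, 4.0)
```

Storing data in this form is what lets a single likelihood routine treat all four cases uniformly.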

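5 § 3.4: Truncation (three types)

5.1 Left Truncation (Delayed Entry)

$Y_R = \infty$: a subject is observed only if its event time $X$ exceeds the truncation time $Y_L$.

Example (Klein Example 3.7): the Channing House retirement community (§ 1.16). A person enters the sample only by moving in, that is, by surviving to the age of entry.

Example: microscopic particle size. Only particles larger than the microscope's resolution are observed; smaller ones never enter the sample.

Relation to length-biased sampling

As in Channing House, the longer a person lives, the more likely they are to be sampled. Without a correction, survival is systematically overestimated.

Conditional likelihood (Klein § 3.5):

\[ \text{contribution of subject } i = \frac{f(x_i)}{S(Y_{L,i})} \]

The denominator $S(Y_{L,i})$ is the probability of surviving beyond $Y_{L,i}$, which is the event being conditioned on.

In counting-process terms: the risk set $Y(t)$ at time $t$ includes a subject only after its entry time $Y_L$, which makes the correction automatically.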
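A small simulation may help show the size of the bias. The sketch below assumes exponential lifetimes (rate 0.1) and uniform entry times on (0, 20); both are illustrative choices. For the exponential, maximizing the product of $f(x_i)/S(Y_{L,i})$ gives $\hat{\lambda} = n / \sum (x_i - Y_{L,i})$, which is compared with the naive $n / \sum x_i$ that ignores truncation.

```python
import random

random.seed(1)
TRUE_RATE = 0.1   # assumed hazard of the lifetime distribution
N_OBSERVED = 5000

entry_times, event_times = [], []
while len(event_times) < N_OBSERVED:
    x = random.expovariate(TRUE_RATE)   # lifetime
    y = random.uniform(0.0, 20.0)       # delayed-entry (left truncation) time
    if x > y:                           # only survivors to entry are sampled
        entry_times.append(y)
        event_times.append(x)

# Naive MLE: treats the sample as if it were untruncated
naive_rate = N_OBSERVED / sum(event_times)

# Conditional MLE: each subject contributes f(x) / S(y) = rate * exp(-rate * (x - y))
adjusted_rate = N_OBSERVED / sum(x - y for x, y in zip(event_times, entry_times))

print(f"naive    = {naive_rate:.4f}")
print(f"adjusted = {adjusted_rate:.4f}  (true rate {TRUE_RATE})")
```

The naive rate is too small, meaning survival looks better than it is, which is the overestimation described above.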

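5.2 Right Truncation

$Y_L = 0$: a subject is observed only if its event time $X$ is smaller than $Y_R$.

Example (Klein Example 3.8): the AIDS data of § 1.19. Sampling on 1986-06-30 captures only those who had developed AIDS by then; patients who developed it later are missed.

Example: stellar distances. Stars that are too far away are invisible, so only nearby stars are sampled.

Example: mortality studies based on death records, which list only people who have died.

Conditional likelihood:

\[ \text{contribution of subject } i = \frac{f(x_i)}{F(Y_{R,i})} = \frac{f(x_i)}{1 - S(Y_{R,i})} \]

(The denominator is the probability that the event occurs before $Y_{R,i}$.)

Reverse-Time KM

Right-truncated data become left-truncated when time is run backwards. The “reverse-time Kaplan-Meier” analysis of the § 1.19 AIDS data (Lagakos 1988) is the canonical example of this transformation.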
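As a parametric counterpart to the reverse-time idea, the conditional likelihood $f(x_i)/[1 - S(Y_R)]$ can be maximized directly. The sketch below assumes exponential event times and a single common truncation time `TAU`; both are simplifications for illustration, and the bisection solver is just one convenient way to find the root of the score.

```python
import math
import random

random.seed(2)
TRUE_RATE, TAU, N_OBSERVED = 0.1, 15.0, 5000   # illustrative values

events = []
while len(events) < N_OBSERVED:
    x = random.expovariate(TRUE_RATE)
    if x <= TAU:                 # right truncation: later events are never seen
        events.append(x)
total_time = sum(events)


def score(rate):
    """Derivative in rate of sum_i [log f(x_i) - log(1 - S(TAU))], exponential model."""
    denom = -math.expm1(-rate * TAU)            # 1 - exp(-rate * TAU)
    truncation_term = TAU * math.exp(-rate * TAU) / denom
    return N_OBSERVED / rate - total_time - N_OBSERVED * truncation_term


# The log-likelihood is concave here, so the score has a single root.
lo, hi = 1e-4, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
adjusted_rate = 0.5 * (lo + hi)
naive_rate = N_OBSERVED / total_time            # ignores the truncation

print(f"naive    = {naive_rate:.4f}")
print(f"adjusted = {adjusted_rate:.4f}  (true rate {TRUE_RATE})")
```

Here the naive estimate errs in the opposite direction from the left-truncated case: only short event times are sampled, so the hazard looks too high.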

5.3 Censoring vs. truncation at a glance

Difference	Right censoring	Right truncation
Information	The lower bound “$X > C_r$”	Observed only when $X \leq Y_R$; otherwise absent from the sample
Likelihood	$S(C_r)$	$f(X) / [1 - S(Y_R)]$
Risk set	Subjects censored before time $t$ are excluded	Subjects truncated before time $t$ are excluded

Both effects in Channing House

The § 1.16 Channing House data have both left truncation (people who died before entry are absent) and right censoring (the study closed on 1975-07-01). Handling both at once is where the counting-process framework shows its value.

6 § 3.5: Likelihood Construction

6.1 Master equation (3.5.1)

Klein Eq. 3.5.1: the unified likelihood for all censoring types

\[ L \propto \underbrace{\prod_{i \in D} f(x_i)}_{\text{exact events}} \cdot \underbrace{\prod_{i \in R} S(C_{r,i})}_{\text{right censored}} \cdot \underbrace{\prod_{i \in L} [1 - S(C_{l,i})]}_{\text{left censored}} \cdot \underbrace{\prod_{i \in I} [S(L_i) - S(R_i)]}_{\text{interval censored}} \]

$D$: subjects with exact event times
$R$: right-censored subjects
$L$: left-censored subjects
$I$: interval-censored subjects

Each subject contributes to the likelihood according to the information it carries:

exact: the density $f$ (the strongest information)
right-censored: $S(C_r)$ (“alive at least until $C_r$”)
left-censored: $1 - S(C_l) = F(C_l)$ (“event occurred before $C_l$”)
interval: $S(L) - S(R)$ (“event occurred in $(L, R]$”)
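The four contribution types translate almost line for line into code. The sketch below is a minimal illustration: the tuple layout `(kind, a, b)`, the helper names, the ternary-search maximizer, and the five data points are all assumptions made for the example, and the exponential model is used only because its log-likelihood is concave in the rate.

```python
import math


def master_loglik(observations, f, S):
    """Log of Eq. (3.5.1) for observations given as (kind, a, b) tuples."""
    total = 0.0
    for kind, a, b in observations:
        if kind == "exact":          # density at the event time
            total += math.log(f(a))
        elif kind == "right":        # survived past a
            total += math.log(S(a))
        elif kind == "left":         # event before a
            total += math.log(1.0 - S(a))
        elif kind == "interval":     # event in (a, b]
            total += math.log(S(a) - S(b))
        else:
            raise ValueError(f"unknown observation kind: {kind}")
    return total


def exponential_loglik(rate, observations):
    f = lambda t: rate * math.exp(-rate * t)
    S = lambda t: math.exp(-rate * t)
    return master_loglik(observations, f, S)


def maximize_concave(func, lo, hi, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# Hypothetical data with all four observation types
data = [("exact", 2.0, None), ("exact", 5.0, None), ("right", 6.0, None),
        ("left", 1.0, None), ("interval", 3.0, 4.0)]
rate_hat = maximize_concave(lambda lam: exponential_loglik(lam, data), 1e-3, 5.0)
print(f"exponential MLE from mixed data: {rate_hat:.4f}")
```

With only exact and right-censored observations the same routine reproduces the closed form "events divided by total time", which is a quick sanity check on the implementation.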

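6.2 Adding truncation

With truncation interval $(Y_{L,i}, Y_{R,i})$, each term of (3.5.1) is divided by the probability of falling in that interval:

$f(x_i) \to f(x_i) / [S(Y_{L,i}) - S(Y_{R,i})]$
$S(C_i) \to S(C_i) / [S(Y_{L,i}) - S(Y_{R,i})]$

For left truncation only, $Y_{R,i} = \infty$ and the denominator reduces to $S(Y_{L,i})$.

Right truncation only (only events are observed, each with $x_i \leq Y_{R,i}$):

\[ L \propto \prod_i \frac{f(x_i)}{1 - S(Y_{R,i})} \]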

6.3 Deriving the Type I likelihood

Working through Klein Example 3.9:

$\delta = 0$: $\Pr(T, \delta=0) = \Pr(X > C_r) = S(C_r)$.
$\delta = 1$: $\Pr(T, \delta=1) = f(t)$.

Combined:

\[ \Pr(t, \delta) = [f(t)]^\delta [S(t)]^{1-\delta} \]

Full likelihood:

\[ L = \prod_{i=1}^n [f(t_i)]^{\delta_i} [S(t_i)]^{1-\delta_i} = \prod_{i=1}^n [h(t_i)]^{\delta_i} \exp[-H(t_i)] \]

The second equality uses $f = hS$ and $S = \exp(-H)$.

Simplification for the exponential

With $f(x) = \lambda e^{-\lambda x}$ and $S(x) = e^{-\lambda x}$:

\[ L = \lambda^r \exp[-\lambda S_T] \]

Here $r = \sum \delta_i$ is the number of observed events and $S_T = \sum t_i$ is the total observed time.

MLE: $\hat{\lambda} = r / S_T$, that is, events divided by total time at risk, the most intuitive hazard estimator.
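The maximization step can be written out explicitly. The variance approximation at the end is the usual observed-information argument, added here as a standard result rather than something taken from Klein's example.

\[ \log L(\lambda) = r \log \lambda - \lambda S_T, \qquad \frac{d \log L}{d\lambda} = \frac{r}{\lambda} - S_T = 0 \;\Rightarrow\; \hat{\lambda} = \frac{r}{S_T} \]

\[ \frac{d^2 \log L}{d\lambda^2} = -\frac{r}{\lambda^2} < 0, \qquad \widehat{\mathrm{Var}}(\hat{\lambda}) \approx \left[\frac{r}{\hat{\lambda}^2}\right]^{-1} = \frac{\hat{\lambda}^2}{r} \]

The negative second derivative confirms that $\hat{\lambda}$ is a maximum, and the variance depends on the number of events $r$, not on the sample size $n$.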

6.4 Type II Censoring (Eq. 3.5.7)

Joint density of the first $r$ order statistics:

\[ L_{II,1} = \frac{n!}{(n-r)!} \prod_{i=1}^r f(x_{(i)}) [S(x_{(r)})]^{n-r} \]

The constant $n!/(n-r)!$ does not affect inference, so up to proportionality this has the same form as Eq. (3.5.1).
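For the exponential model the Type II likelihood is proportional to $\lambda^r \exp\{-\lambda[\sum_{i \le r} x_{(i)} + (n-r)\,x_{(r)}]\}$, so the MLE is again events over total time on test. A minimal sketch follows; the function name is hypothetical, and the simulated bulb lifetimes with rate 0.01 are an assumption that only echoes the 100-bulb example above.

```python
import random


def type2_exponential_mle(failures, n):
    """Exponential-rate MLE from the first r failure times out of n units on test."""
    r = len(failures)
    x_r = max(failures)                       # time of the r-th failure; test stops here
    total_time = sum(failures) + (n - r) * x_r  # the n - r survivors each ran for x_r
    return r / total_time


# Hand-checkable case: 3 failures out of 5 units, total time = 6 + 2 * 3 = 12
print(type2_exponential_mle([1.0, 2.0, 3.0], n=5))   # 0.25

# Simulated version of the bulb test: stop at the 60th failure out of 100
random.seed(5)
lifetimes = sorted(random.expovariate(0.01) for _ in range(100))
first_60 = lifetimes[:60]
bulb_rate = type2_exponential_mle(first_60, n=100)
print(f"estimated failure rate: {bulb_rate:.4f}")
```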

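6.5 Random Censoring (Klein Example 3.10)

Assume $X \perp C_r$, and let the censoring time have density $g$ and survival function $G(t) = \Pr(C_r > t)$.

\[ L = \prod_i [f(t_i) G(t_i)]^{\delta_i} [g(t_i) S(t_i)]^{1-\delta_i} \]

Key point: if $g$ and $G$ do not involve the parameters of interest (those of the lifetime distribution), the censoring terms are constants, which gives Eq. (3.5.6):

\[ L \propto \prod_i [f(t_i)]^{\delta_i} [S(t_i)]^{1-\delta_i} \]

Definition of “non-informative censoring”

The censoring distribution $G$ does not depend on the parameters of $f$, and $X \perp C_r$.

When both hold, the censoring mechanism can be ignored and Eq. (3.5.1) used directly. When they fail (informative censoring), the standard analyses are biased.

Testability: none (as with Tsiatis in Klein § 2.7). Whether the assumption is reasonable has to be judged from domain knowledge (why did the subject drop out?).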
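The claim that the censoring terms drop out can be checked by simulation. In the sketch below both the event and censoring times are exponential and independent; the rates 0.1 and 0.05 are arbitrary illustrative values. The estimator uses only $(t_i, \delta_i)$ and never needs the censoring rate.

```python
import random

random.seed(3)
EVENT_RATE, CENSOR_RATE, N = 0.1, 0.05, 5000   # illustrative values

times, deltas = [], []
for _ in range(N):
    x = random.expovariate(EVENT_RATE)    # event time
    c = random.expovariate(CENSOR_RATE)   # independent censoring time
    times.append(min(x, c))
    deltas.append(1 if x <= c else 0)

r = sum(deltas)
rate_hat = r / sum(times)   # MLE from Eq. (3.5.6); g and G play no role

print(f"events observed: {r} of {N}")
print(f"rate_hat = {rate_hat:.4f}  (true rate {EVENT_RATE})")
```

If censoring were made to depend on the event time (for instance, sicker subjects dropping out earlier), the same formula would no longer recover the true rate, which is the informative-censoring problem described above.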

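6.6 Subject-specific distributions in regression (Eq. 3.5.2)

Each subject has its own distribution $f_i, S_i$ (depending on covariates $Z_i$):

\[ L = \prod_{i \in D} f_i(x_i) \prod_{i \in R} S_i(C_{r,i}) \prod_{i \in L} [1 - S_i(C_{l,i})] \prod_{i \in I} [S_i(L_i) - S_i(R_i)] \]

This is the starting point for the likelihoods of the regression models that follow: Cox PH (Klein Ch.8, via the partial likelihood), AFT (Ch.12), and the Aalen additive model (Ch.10).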

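7 § 3.6: Counting Processes (Aalen 1975)

7.1 Why counting processes are needed

The likelihood of § 3.5 is powerful for parametric estimation. It is less suited to deriving nonparametric (KM, NA) and semiparametric (Cox) estimators in a unified way, or to obtaining their asymptotic properties and confidence intervals.

Aalen (1975, 1978) combined stochastic processes, martingale theory, and counting processes into a unified framework for nonparametric and semiparametric methods.

Four core tools

$N(t)$, the counting process: number of events up to time $t$.
$Y(t)$, the at-risk process: size of the risk set at time $t$.
$\Lambda(t)$, the compensator: the “predictable” part of $N$.
$M(t) = N(t) - \Lambda(t)$, the martingale: noise with mean 0.

From these four, Nelson-Aalen, Kaplan-Meier, the log-rank test, and the Cox partial likelihood are all derived in a unified way.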

7.2 Counting Process $N(t)$

Definition

A stochastic process $N(t), t \geq 0$ is a counting process if:

$N(0) = 0$
$N(t) < \infty$ a.s.
Its sample paths are right-continuous and piecewise constant, with jumps of size +1 only.

For right-censored data:

\[ N_i(t) = I[T_i \leq t, \delta_i = 1] = \begin{cases} 0 & \text{no event observed for subject } i \text{ by time } t \\ 1 & \text{event observed for subject } i \text{ by time } t \end{cases} \]

\[ N(t) = \sum_{i=1}^n N_i(t) = \sum_{T_i \leq t} \delta_i \]

This is the cumulative number of events observed up to time $t$.

7.3 History (Filtration) $\mathcal{F}_t$

The collection of all information known up to time $t$. If $s \leq t$, then $\mathcal{F}_s \subset \mathcal{F}_t$.

For right-censored data:

\[ \mathcal{F}_t = \sigma(\{(T_i, \delta_i) : T_i \leq t\} \cup \{T_i > t\}) \]

7.4 Intensity Process $\lambda(t)$

Eq. (3.6.2):

\[ E[dN(t) \mid \mathcal{F}_{t^-}] = Y(t) h(t)\, dt = \lambda(t)\, dt \]

Meaning

$Y(t)$: size of the risk set at time $t$ (random, but known given $\mathcal{F}_{t^-}$).
$h(t)$: the population hazard.
$\lambda(t) = Y(t) h(t)$: the intensity of events at time $t$ (expected number of events per unit time).

Intuition: with 100 subjects at risk and a hazard of 0.05/year, the event intensity at the next instant is $5$/year.

7.5 Compensator $\Lambda(t)$

\[ \Lambda(t) = \int_0^t \lambda(s)\, ds = \int_0^t Y(s) h(s)\, ds \]

The predictable part of $N$. Its increments satisfy $E[dN(t) \mid \mathcal{F}_{t^-}] = d\Lambda(t)$, and hence $E[N(t)] = E[\Lambda(t)]$.

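7.6 Martingale $M(t) = N(t) - \Lambda(t)$

Core definition

\[ M(t) = N(t) - \Lambda(t) \]

Properties:

$E[dM(t) \mid \mathcal{F}_{t^-}] = 0$: mean-zero noise.
$E[M(t) \mid \mathcal{F}_s] = M(s)$ for $s < t$: the martingale property.

Interpretation: $M(t)$ is the residual left after subtracting the predicted events ($\Lambda$) from the observed events ($N$), that is, the random noise of the model.

This decomposition is the starting point for the inference developed in the rest of the section.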
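The mean-zero property can be checked by Monte Carlo. The sketch assumes a constant hazard `H_RATE` and no censoring, so that $\Lambda(t) = h \sum_i \min(X_i, t)$; the sample size, time point, and replicate count are arbitrary. It also compares the empirical variance of $M(t)$ with the average of $\Lambda(t)$, anticipating the predictable-variation result.

```python
import random

random.seed(4)
H_RATE, N_SUBJECTS, T_EVAL, REPS = 0.1, 50, 5.0, 2000   # illustrative values

m_values, lam_values = [], []
for _ in range(REPS):
    xs = [random.expovariate(H_RATE) for _ in range(N_SUBJECTS)]
    n_t = sum(1 for x in xs if x <= T_EVAL)              # N(t): events by time t
    lam_t = H_RATE * sum(min(x, T_EVAL) for x in xs)     # Lambda(t) = int_0^t Y(s) h ds
    m_values.append(n_t - lam_t)                         # M(t) = N(t) - Lambda(t)
    lam_values.append(lam_t)

mean_m = sum(m_values) / REPS
var_m = sum((m - mean_m) ** 2 for m in m_values) / (REPS - 1)
mean_lam = sum(lam_values) / REPS

print(f"mean of M(t)      = {mean_m:.3f}  (theory: 0)")
print(f"variance of M(t)  = {var_m:.2f}")
print(f"mean of Lambda(t) = {mean_lam:.2f}  (theory: equals Var M(t))")
```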

7.7 Predictable Variation $\langle M \rangle(t)$

The compensator of $M^2(t)$; it carries the variance information of the martingale.

\[ \langle M \rangle(t) = \int_0^t \lambda(s)\, ds = \Lambda(t) \]

(This holds when the data have no simultaneous events. With ties, a Bernoulli-type variance is used.)

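7.8 Deriving Nelson-Aalen as a stochastic integral

Eq. (3.6.5):

\[ \frac{dN(t)}{Y(t)} = h(t)\, dt + \frac{dM(t)}{Y(t)} \]

(Left side: observed. Right side: model plus noise.)

Integrating both sides from 0 to $t$:

\[ \hat{H}(t) = \int_0^t \frac{J(u)}{Y(u)}\, dN(u) = \int_0^t J(u) h(u)\, du + \int_0^t \frac{J(u)}{Y(u)}\, dM(u) \]

where $J(u) = I[Y(u) > 0]$, with the convention $0/0 = 0$.

Nelson-Aalen falls out naturally

$\hat{H}(t) = \sum_{t_i \leq t} \frac{d_i}{n_i}$ (for a discrete sample, with $d_i$ = number of events at $t_i$ and $n_i = Y(t_i)$).
It is the stochastic integral of the predictable process $J/Y$ with respect to the counting process $N$.
Bias: $E[\hat{H}(t)] = E[H^*(t)]$ where $H^*(t) = \int_0^t J(u) h(u)\, du$, which coincides with $H(t)$ over the range where data exist.
Variance: obtained from the martingale CLT. Eq. (3.6.6), for the limit $Z^{(\infty)}$ of $\sqrt{n}\,[\hat{H} - H^*]$, with $y(u)$ the limit of $Y(u)/n$:

\[ \langle Z^{(\infty)} \rangle(t) = \int_0^t \frac{h(u)}{y(u)}\, du \]

Estimator: $n \int_0^t dN(u) / Y(u)^2$, so the variance of $\hat{H}(t)$ itself is estimated by $\int_0^t dN(u) / Y(u)^2$.

This is the source of the Nelson-Aalen confidence intervals in Klein Ch.4.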
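The estimator and its variance are both running sums over the observed times, which a short function makes explicit. The five-subject data set is hypothetical and chosen so the numbers can be checked by hand; the variance uses the $\sum d_i / n_i^2$ form given above.

```python
def nelson_aalen(times, status):
    """Return [(t, H_hat, var_hat)] at each distinct observed time.

    H_hat accumulates d/Y and var_hat accumulates d/Y^2, the discrete
    versions of int dN/Y and int dN/Y^2.
    """
    rows = []
    h_hat = 0.0
    var_hat = 0.0
    for t in sorted(set(times)):
        d = sum(1 for ti, di in zip(times, status) if ti == t and di == 1)
        y = sum(1 for ti in times if ti >= t)   # at risk just before t
        h_hat += d / y
        var_hat += d / y ** 2
        rows.append((t, h_hat, var_hat))
    return rows


# Hypothetical data: events at 1, 2, 3; censored at 2 and 4
toy_times = [1, 2, 2, 3, 4]
toy_status = [1, 1, 0, 1, 0]

for t, h_hat, var_hat in nelson_aalen(toy_times, toy_status):
    print(f"t = {t}: H = {h_hat:.4f}, Var = {var_hat:.4f}")
```

By hand: the increments are 1/5, 1/4, and 1/2, giving $\hat{H} = 0.95$ after the last event, with variance 1/25 + 1/16 + 1/4 = 0.3525.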

7.9 Kaplan-Meier as a product integral

Continuous distribution: $S(t) = \exp[-H(t)]$.

Discrete: $S(t) = \prod_{s \leq t} [1 - dH(s)]$.

Nonparametric estimate:

\[ \hat{S}(t) = \prod_{s \leq t} [1 - d\hat{H}(s)] = \prod_{t_i \leq t}\left[1 - \frac{dN(t_i)}{Y(t_i)}\right] = \prod_{t_i \leq t}\left[1 - \frac{d_i}{n_i}\right] \]

The origin of KM

This is the Kaplan-Meier estimator (Klein Ch.4). Within the counting-process framework, the definition of the estimator, its asymptotic properties, and its confidence intervals and bands all follow from the same machinery.

$\hat{S}(t)/S(t) - 1$ is also a stochastic integral, hence a martingale, hence covered by the CLT.
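The two routes from $\hat{H}$ to a survival estimate, the product integral and the exponential, can be compared side by side. The sketch below reuses a hypothetical five-subject data set; the function name is illustrative.

```python
import math


def km_and_exp_na(times, status):
    """Return [(t, KM, exp(-NA))] at each distinct observed time."""
    rows = []
    s_hat = 1.0   # product-limit (Kaplan-Meier) estimate
    h_hat = 0.0   # Nelson-Aalen cumulative hazard
    for t in sorted(set(times)):
        d = sum(1 for ti, di in zip(times, status) if ti == t and di == 1)
        y = sum(1 for ti in times if ti >= t)
        s_hat *= 1.0 - d / y      # product integral: prod [1 - dN/Y]
        h_hat += d / y            # sum of dN/Y
        rows.append((t, s_hat, math.exp(-h_hat)))
    return rows


# Hypothetical data: events at 1, 2, 3; censored at 2 and 4
toy_times = [1, 2, 2, 3, 4]
toy_status = [1, 1, 0, 1, 0]

for t, km, exp_na in km_and_exp_na(toy_times, toy_status):
    print(f"t = {t}: KM = {km:.4f}, exp(-NA) = {exp_na:.4f}")
```

Because $1 - x \leq e^{-x}$, the Kaplan-Meier value never exceeds $\exp(-\hat{H})$; the gap is largest where the risk set is small.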

7.10 Counting Process Likelihood

The final equation of Klein § 3.6:

\[ L = \prod_{j=1}^n \left[\prod_{t} \lambda_j(t)^{dN_j(t)}\right] \exp\!\left[-\int_0^\tau \lambda_j(u)\, du\right] \]

For right-censored data, $\lambda_j(t) = Y_j(t) h(t)$:

\[ L \propto \left[\prod_{j} h(t_j)^{\delta_j}\right] \exp\!\left[-\sum_j H(t_j)\right] \]

This matches the right-censored form of § 3.5, $\prod_i [h(t_i)]^{\delta_i} \exp[-H(t_i)]$. The counting-process approach derives the same likelihood with different, more powerful tools.
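The reduction can be spelled out term by term, assuming the usual individual at-risk indicator $Y_j(u) = I[t_j \geq u]$ and a horizon $\tau$ no smaller than any observed time. $N_j$ jumps only at $t_j$, and only when $\delta_j = 1$, so the product over time collapses to a single factor:

\[ \prod_{t} \lambda_j(t)^{dN_j(t)} = [Y_j(t_j)\, h(t_j)]^{\delta_j} = h(t_j)^{\delta_j} \]

The integral is cut off at $t_j$ because subject $j$ leaves the risk set there:

\[ \int_0^\tau Y_j(u)\, h(u)\, du = \int_0^{t_j} h(u)\, du = H(t_j) \]

Multiplying over subjects gives

\[ L = \prod_{j=1}^n h(t_j)^{\delta_j} \exp[-H(t_j)] \]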

7.11 The unification of Andersen-Borgan-Gill-Keiding (1993)

Uses of the counting-process framework

Method	Counting-process interpretation
Nelson-Aalen	$\int J/Y\, dN$ (Klein Ch.4)
Kaplan-Meier	$\prod [1 - dN/Y]$ (Klein Ch.4)
Smoothed hazard	kernel × $dN/Y$ (Klein Ch.6)
Log-rank test	$\int K(u)\, [dN_1/Y_1 - dN_2/Y_2]$, a stochastic integral (Klein Ch.7)
Cox partial likelihood	$\prod e^{\beta'z_i}/\sum_{j \in R(t_i)} e^{\beta'z_j}$ (Klein Ch.8)
Aalen additive	$\hat{B}(t) = \int Y^- dN$, with $Y^-$ a generalized inverse (Klein Ch.10)
Schoenfeld residual	increment of the Cox score process, itself a stochastic integral with respect to the martingale (Klein Ch.11)

Each of these can be written as a stochastic integral with respect to the counting-process martingale, so their asymptotic properties are all proved through the martingale CLT.

This is the essence of the “modern survival analysis” revolution of 1975 to 1995.

8 R + Python: implementing the likelihood (3.5.1) directly

We compute the right-censored exponential MLE directly on the § 1.2 Leukemia data.

8.1 R: direct likelihood code

library(survival)

leukemia <- data.frame(
  time = c(1, 22, 3, 12, 8, 17, 2, 11, 8, 12, 2, 5, 4, 15, 8, 23, 5, 11, 4, 1, 8,
           10, 7, 32, 23, 22, 6, 16, 34, 32, 25, 11, 20, 19, 6, 17, 35, 6, 13, 9, 6, 10),
  status = c(rep(1, 21), 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0)
)

# Exponential negative log-likelihood (Eq. 3.5.1 with f = λ exp(-λt), S = exp(-λt))
neg_loglik <- function(lambda, t, delta) {
  -sum(delta * log(lambda) - lambda * t)
}

# Closed-form MLE: λ̂ = r / Σ t
r <- sum(leukemia$status)
S_T <- sum(leukemia$time)
lambda_hat <- r / S_T
cat(sprintf("Closed-form MLE: λ̂ = %d / %d = %.4f\n", r, S_T, lambda_hat))

# Check by numerical optimization (Brent requires finite lower/upper bounds)
opt <- optim(0.01, neg_loglik, t = leukemia$time, delta = leukemia$status,
             method = "Brent", lower = 0.001, upper = 1)
cat(sprintf("Numerical MLE: λ̂ = %.4f\n", opt$par))

# Counting process N(t) and at-risk process Y(t) on the grid of observed times
times <- sort(unique(leukemia$time))
N <- sapply(times, function(t) sum(leukemia$status[leukemia$time <= t]))
Y <- sapply(times, function(t) sum(leukemia$time >= t))

# Nelson-Aalen H(t): running sum of dN/Y (the stochastic integral of § 3.6)
H_NA <- cumsum(sapply(times, function(t) {
  d <- sum(leukemia$status[leukemia$time == t])
  y <- sum(leukemia$time >= t)
  if (y > 0) d/y else 0
}))

# Compare with the survival package.
# ctype = 1 requests the Nelson-Aalen cumulative hazard (no tie correction),
# which is what the manual sum above computes.
fit <- survfit(Surv(time, status) ~ 1, data = leukemia, ctype = 1)
plot(times, H_NA, type = "s", col = "red", lwd = 2,
     xlab = "Time (weeks)", ylab = "H(t)",
     main = "Nelson-Aalen — manual vs survival package")
lines(fit, fun = "cumhaz", col = "blue", lty = 2, conf.int = FALSE)
legend("topleft", c("Manual ∫ dN/Y", "survival::survfit"),
       col = c("red", "blue"), lwd = 2, lty = c(1, 2))

8.2 Python: visualizing the counting process

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, NelsonAalenFitter

leukemia = pd.DataFrame({
    "time": [1, 22, 3, 12, 8, 17, 2, 11, 8, 12, 2, 5, 4, 15, 8, 23, 5, 11, 4, 1, 8,
             10, 7, 32, 23, 22, 6, 16, 34, 32, 25, 11, 20, 19, 6, 17, 35, 6, 13, 9, 6, 10],
    "status": [1]*21 + [1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
})

# Counting process N(t) and at-risk process Y(t) on the grid of observed times
times = np.sort(leukemia["time"].unique())
N_t = np.array([leukemia[(leukemia["time"] <= t) & (leukemia["status"] == 1)].shape[0]
                for t in times])
Y_t = np.array([leukemia[leukemia["time"] >= t].shape[0] for t in times])

# Manual Nelson-Aalen — stochastic integral
NA_increments = []
for t in times:
    d_t = leukemia[(leukemia["time"] == t) & (leukemia["status"] == 1)].shape[0]
    y_t = leukemia[leukemia["time"] >= t].shape[0]
    NA_increments.append(d_t / y_t if y_t > 0 else 0)
H_NA = np.cumsum(NA_increments)

# Plot: the four counting-process quantities
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].step(times, N_t, where="post", color="black")
axes[0, 0].set_title("N(t) — Counting process (cumulative events)")
axes[0, 0].set_xlabel("Time (weeks)")

# Y(t) counts subjects with T >= t, so it is left-continuous: draw with where="pre"
axes[0, 1].step(times, Y_t, where="pre", color="blue")
axes[0, 1].set_title("Y(t) — At-risk process (decreasing)")
axes[0, 1].set_xlabel("Time (weeks)")

axes[1, 0].step(times, H_NA, where="post", color="red")
# nelson_aalen_smoothing=False gives plain d/n increments, matching the manual sum
naf = NelsonAalenFitter(nelson_aalen_smoothing=False).fit(leukemia["time"], leukemia["status"])
naf.cumulative_hazard_.plot(ax=axes[1, 0], style="--", color="green")
axes[1, 0].set_title("Ĥ(t) — Nelson-Aalen (manual vs lifelines)")
axes[1, 0].set_xlabel("Time (weeks)")

# Martingale residuals (for plotting)
# M_i = N_i(T_i) - integral of Y_i(u) h(u) du
# approximated here with the constant-hazard (exponential) estimate of lambda
lam_hat = leukemia["status"].sum() / leukemia["time"].sum()
M_residuals = leukemia["status"] - lam_hat * leukemia["time"]
axes[1, 1].scatter(leukemia["time"], M_residuals, alpha=0.6)
axes[1, 1].axhline(0, color="red", linestyle="--")
axes[1, 1].set_title("Martingale residuals — exponential fit")
axes[1, 1].set_xlabel("Time (weeks)")
axes[1, 1].set_ylabel("M_i = δ_i - λ̂ t_i")

plt.tight_layout()
plt.savefig("klein_ch3_overview.png", dpi=100)

9 Preview of the Ch.3 deep dives

The deep-dive series that follows this post

Post	Scope	Topic
03-1	§ 3.1 to 3.2 + canonical clinical cases	Likelihood derivations for six kinds of right censoring (Type I, Generalized, Progressive, Type II, Progressive II, Random), a mapping to the 19 data sets of Ch.1, and R/Python simulations
03-2	§ 3.3 to 3.4	Left and interval censoring and the three kinds of truncation: Turnbull NPMLE and reverse-time KM demonstrated on the Channing, AIDS, marijuana, and breast cosmetic data
03-3	§ 3.5 to 3.6	The master likelihood and a careful derivation of counting processes: Aalen's (1975) martingale CLT and the unified KM/NA framework
03-4	§ 3.7	Complete solutions to the 9 exercises: Group A identification, B likelihood, C theoretical

10 Practical checklist: Ch.3 Overview

Censoring

Identify the five kinds of right censoring (Type I, Type II, progressive, generalized, random).
Write down left, interval, and double censoring.
Every censoring type is a special case of interval censoring.

Truncation

Definitions of left, right, and interval truncation.
Censoring vs. truncation: partial information vs. no information.

Likelihood

Master equation (3.5.1): $L \propto \prod f^\delta S^{1-\delta} \cdots$.
Truncation: divide by the conditioning probability.
How the likelihood differs across Type I, Type II, random, and progressive censoring.
Subject-specific distributions in regression (Eq. 3.5.2).

Counting Process

The four core definitions $N(t), Y(t), \Lambda(t), M(t)$.
Intensity $\lambda(t) = Y(t) h(t)$.
Martingale $M = N - \Lambda$, mean 0, predictable variation $\langle M \rangle = \Lambda$.
Nelson-Aalen $\hat{H}(t) = \int J/Y\, dN$: a stochastic integral.
KM $\hat{S}(t) = \prod[1 - dN/Y]$: a product integral.
The martingale CLT gives confidence intervals and asymptotics for KM, NA, log-rank, and Cox.

11 Related topics

Klein series

Ch.1 series: canonical censoring/truncation examples

12 References

Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.), Ch.3, pp. 63-90. Springer.
Aalen, O. O. (1975). Statistical inference for a family of counting processes. PhD thesis, University of California, Berkeley.
Aalen, O. O. (1978). Nonparametric inference for a family of counting processes. Annals of Statistics, 6(4), 701-726.
Andersen, P. K., Borgan, Ø., Gill, R. D., & Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer.
Fleming, T. R., & Harrington, D. P. (1991). Counting Processes and Survival Analysis. Wiley.
Turnbull, B. W. (1974). Nonparametric estimation of a survivorship function with doubly censored data. JASA, 69(345), 169-173.
Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped censored and truncated data. JRSS B, 38(3), 290-295.
David, H. A. (1981). Order Statistics, 2nd ed. Wiley. (Likelihood for Type II censoring)
David, H. A., & Moeschberger, M. L. (1978). The Theory of Competing Risks. Griffin.
Lagakos, S. W., Barraj, L. M., & De Gruttola, V. (1988). Nonparametric analysis of truncated survival data, with application to AIDS. Biometrika, 75(3), 515-523.
Keiding, N. (1990). Statistical inference in the Lexis diagram. Philosophical Transactions of the Royal Society A, 332, 487-509.
Tsiatis, A. (1975). A nonidentifiability aspect of the problem of competing risks. PNAS, 72(1), 20-22.
Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.