Kwangmin Kim - Klein § 2.6~2.7 — Regression Models and Competing Risks

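1 Introduction: The Last Two Pieces of Ch.2

So far, Ch.2 has assumed a homogeneous population (every individual follows the same distribution). § 2.6 drops this assumption and introduces covariates $Z$ (treatment, age, biomarkers, ...). § 2.7 extends the setting to competing risks, where the event itself can arise from one of $K$ causes.

§ 2.6~2.7 in one line

“§ 2.6: two frameworks for modeling covariate effects, AFT (rescaling the time axis) vs PH (multiplying the hazard). § 2.7: when a patient can experience the event from one of several causes, naive KM gives the wrong answer.”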

Section	Key function	Additional tools
§ 2.6 AFT	$S(x \mid Z) = S_0[x \exp(-\gamma'Z)]$	log-linear model, MLE
§ 2.6 PH	$h(x \mid z) = h_0(x) \exp(\beta'z)$	partial likelihood (Ch.8)
§ 2.6 Additive	$h(x \mid z) = h_0(x) + \sum_j z_j(x) \beta_j(x)$	Aalen weighted least squares (Ch.10)
§ 2.7 Cause-specific	$h_i(t)$ for each cause	Aalen-Johansen estimator (Ch.4)
§ 2.7 CIF	$F_i(t) = \Pr[T \leq t, \delta = i]$	Fine-Gray (1999)

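This note covers both sections: definitions, intuition, examples, pitfalls, and hands-on R/Python.

2 § 2.6 — Regression Models for Survival Data

2.1 Why a regression framework is needed

As the catalog of 19 data sets in Ch.1 showed, almost every question asked of clinical data is “how does this covariate affect survival?”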

§ 1.2 Leukemia: 6-MP vs placebo (treatment effect)
§ 1.4 Dialysis: surgical vs percutaneous (procedure effect)
§ 1.6 Burn: chlorhexidine vs povidone (antiseptic effect)
§ 1.8 Laryngeal: TNM stage I~IV (ordinal severity)
§ 1.13 Pneumonia: breastfeeding (causal protection)
§ 1.16 Channing: gender (life expectancy)

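Regression unifies three tasks: comparison (group A vs B), prediction (the survival curve of a specific patient), and risk-factor identification (HR > 1).

Time-dependent covariate

$Z$ can change over time: $Z(x) = [Z_1(x), \ldots, Z_p(x)]$. Examples:

Onset of GVHD (BMT § 1.3)
Whether excision has been performed (Burn § 1.6)
Change of medication after relapse

In this case $Z$ is a process rather than a baseline variable. The counting-process format Surv(start, stop, event) (Klein § 9.2, 11.5) is the natural way to handle it.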

2.2 Approach 1 — AFT (Accelerated Failure-Time)

2.2.1 Definition

$Y = \ln X$ is a linear function of the covariates plus an error term:

\[ Y = \mu + \gamma'Z + \sigma W \tag{2.6.1} \]

$\gamma$: vector of regression coefficients
$\sigma$: scale parameter
$W$: error distribution
- $W \sim$ extreme value → Weibull AFT
- $W \sim$ normal → Log-normal AFT
- $W \sim$ logistic → Log-logistic AFT

2.2.2 Time acceleration/deceleration: derivation

Let $S_0(x)$ be the survival function of $X = e^Y$ at baseline ($Z=0$), that is, of $e^{\mu + \sigma W}$. The covariate effect is then:

\[ \begin{aligned} \Pr[X > x \mid Z] &= \Pr[Y > \ln x \mid Z] \\ &= \Pr[\mu + \sigma W > \ln x - \gamma'Z \mid Z] \\ &= \Pr[e^{\mu + \sigma W} > x \exp(-\gamma'Z) \mid Z] \\ &= S_0[x \exp(-\gamma'Z)] \end{aligned} \]

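That is,

\[ \boxed{S(x \mid Z) = S_0[x \exp(-\gamma'Z)]} \]

Meaning of “acceleration/deceleration”

$\gamma'Z > 0$: $\exp(-\gamma'Z) < 1$, so the subject moves along the baseline time axis more slowly; $\ln X$ shifts up by $\gamma'Z$ and survival is decelerated (events occur later).
$\gamma'Z < 0$: $\exp(-\gamma'Z) > 1$, so the subject moves along the baseline time axis faster; survival is accelerated (events occur sooner).

Example: if the treatment effect is $\gamma = 0.69$, then $\exp(\gamma) \approx 2$ and every survival-time quantile in the treated group is about twice its baseline value. The reading “the median survival time under treatment is about twice that under placebo” is direct and intuitive, which is one reason AFT results are easy to communicate in clinical reports.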
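The time-scaling reading can be checked numerically. The sketch below assumes a Weibull baseline with made-up parameters (`lam`, `alpha`) and a single binary covariate with $\gamma = \ln 2$; these values are illustrative assumptions, not numbers from Klein.

```python
import numpy as np

# Hypothetical baseline: Weibull survival S0(x) = exp(-lam * x**alpha)
lam, alpha = 0.05, 1.5
gamma = np.log(2.0)  # illustrative AFT coefficient, exp(gamma) = 2


def S0(x):
    return np.exp(-lam * x**alpha)


def S_aft(x, z):
    # AFT model: S(x | z) = S0(x * exp(-gamma * z))
    return S0(x * np.exp(-gamma * z))


def quantile(p, z):
    # Solve S(x | z) = 1 - p in closed form for the Weibull baseline
    x0 = (-np.log(1.0 - p) / lam) ** (1.0 / alpha)  # baseline quantile
    return x0 * np.exp(gamma * z)


median_ratio = quantile(0.5, 1) / quantile(0.5, 0)
q90_ratio = quantile(0.9, 1) / quantile(0.9, 0)
print(median_ratio, q90_ratio)  # both ratios equal exp(gamma)
```

Every quantile is stretched by the same factor $\exp(\gamma)$, which is what "time is scaled" means in the AFT model.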

2.2.3 Hazard 형태로의 변환

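Equation (2.6.2):

\[ h(x \mid Z) = h_0[x \exp(-\gamma'Z)] \exp(-\gamma'Z) \]

Obtained by the chain rule, differentiating $-\ln S_0[x \exp(-\gamma'Z)]$ with respect to $x$. The two effects, the rescaled time argument and the Jacobian factor, both enter multiplicatively.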
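Written out step by step, as a sketch that uses only the definitions above (with $h_0 = -S_0'/S_0$ the baseline hazard):

\[ \begin{aligned} h(x \mid Z) &= -\frac{d}{dx} \ln S(x \mid Z) = -\frac{d}{dx} \ln S_0[x \exp(-\gamma'Z)] \\ &= -\frac{S_0'[x \exp(-\gamma'Z)]}{S_0[x \exp(-\gamma'Z)]} \cdot \exp(-\gamma'Z) \\ &= h_0[x \exp(-\gamma'Z)] \exp(-\gamma'Z) \end{aligned} \]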

2.2.4 Weibull AFT (Klein Example 2.5)

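For a Weibull $X$ with $S(x) = \exp(-\lambda x^\alpha)$, we have $Y = \ln X = \mu + \sigma W$ with $W \sim$ standard extreme value, $\mu = -\ln \lambda / \alpha$, $\sigma = 1/\alpha$. Adding covariates:

\[ Y = \mu + \gamma'Z + \sigma W \]

\[ S_X(x \mid Z) = \exp\!\left\{-\lambda\left[x \exp(-\gamma'Z)\right]^\alpha\right\} \]

Interpretation

The covariates rescale the time axis of the baseline Weibull survival function by the factor $\exp(-\gamma'Z)$, stretching or compressing survival times. This formula is the AFT idea in its most direct form.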

2.3 Approach 2 — Multiplicative Hazards (PH)

2.3.1 Definition

The covariates act multiplicatively on the hazard:

\[ h(x \mid z) = h_0(x) c(\beta' z) \tag{2.6.3} \]

$h_0(x)$: baseline hazard (parametric or nonparametric)
$c(\cdot)$: link function, non-negative
- Link adopted by Cox (1972): $c(\beta'z) = \exp(\beta'z)$, which is positive for every real $\beta'z$.

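2.3.2 Proportional Hazards property

Hazard ratio of two individuals with covariates $z_1, z_2$:

\[ \frac{h(x \mid z_1)}{h(x \mid z_2)} = \frac{h_0(x) c(\beta'z_1)}{h_0(x) c(\beta'z_2)} = \frac{c(\beta'z_1)}{c(\beta'z_2)} \]

Key property of PH

The hazard ratio does not depend on time $x$. This is where the name “proportional hazards” comes from.

Advantage: $\beta$ has a very simple interpretation, $\exp(\beta_j)$ = HR (hazard ratio).
- Cox link: $\exp(\beta_j)$ = “the factor by which the hazard is multiplied when covariate $z_j$ increases by one unit, with the other covariates held fixed”, which is intuitive in clinical settings.
Disadvantage: when the assumption is violated (crossing hazards, time-dependent effects), the results are distorted.

§ 1.4 Dialysis and § 1.6 Burn are canonical examples in which PH is violated; the diagnostic tools of Klein Ch.11 (Schoenfeld residuals, etc.) detect this.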

2.3.3 Survival form: Lehmann Alternative

Equation (2.6.4):

\[ S(x \mid z) = S_0(x)^{c(\beta'z)} \]

Derivation:

\[ \begin{aligned} S(x \mid z) &= \exp\!\left[-\int_0^x h_0(t) c(\beta'z)\, dt\right] \\ &= \left\{\exp\!\left[-\int_0^x h_0(t)\, dt\right]\right\}^{c(\beta'z)} \\ &= \left[S_0(x)\right]^{c(\beta'z)} \end{aligned} \]

In nonparametric statistics this family of distributions is known as the “Lehmann alternative”.

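2.3.4 PH diagnostics: Log-Log parallelism

Equation (2.6.7), under the Cox link $c(\beta'z) = \exp(\beta'z)$:

\[ \ln[-\ln S(x \mid z)] = \beta'z + \ln[-\ln S_0(x)] \]

Log-Log plot diagnostic

For each covariate group, plot $\ln[-\ln \hat{S}(x)]$ of the KM curve against time.

PH holds: the curves are parallel (vertical shift of $\beta'z$).
PH violated: the curves cross or diverge.

A standard diagnostic tool in Klein Ch.11. Schoenfeld residuals are more precise, but the log-log plot can be checked visually at once.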
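The parallelism claim can be verified on exact survival curves before looking at noisy KM estimates. The sketch below assumes a Weibull baseline and a Cox coefficient with made-up values (`lam`, `alpha`, `beta`); it only illustrates the identity above.

```python
import numpy as np

# Hypothetical Weibull baseline and Cox coefficient (illustrative values only)
lam, alpha, beta = 0.02, 1.3, 0.7
x = np.linspace(0.5, 30.0, 60)

S0 = np.exp(-lam * x**alpha)   # baseline survival, z = 0
S1 = S0 ** np.exp(beta)        # Lehmann form S0^{exp(beta)} for z = 1

cloglog0 = np.log(-np.log(S0))
cloglog1 = np.log(-np.log(S1))
gap = cloglog1 - cloglog0      # vertical distance between the two curves

print(gap.min(), gap.max())    # the gap is constant in x and equals beta
```

With real data the KM-based curves are step functions, so "parallel" is judged by eye and small wiggles are expected.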

2.3.5 Weibull PH (Klein Example 2.5 continued)

Weibull baseline $h_0(x) = \alpha\lambda x^{\alpha-1}$ with the Cox link:

\[ h(x \mid z) = \alpha\lambda x^{\alpha-1} \exp(\beta'z) \]

\[ S(x \mid z) = \left\{\exp[-\lambda x^\alpha]\right\}^{\exp(\beta'z)} = \exp[-\lambda x^\alpha \exp(\beta'z)] = \exp\!\left\{-\lambda\left[x \exp(\beta'z/\alpha)\right]^\alpha\right\} \]

The last form is identical to the AFT form $S_0(x \exp(-\gamma'z))$ (with $\gamma = -\beta/\alpha$).

Klein's result: uniqueness of the Weibull

The Weibull is the only continuous distribution that is closed under both the AFT and the PH frameworks.

In other words, for the Weibull:

Modeling it as AFT also satisfies PH.
Modeling it as PH also satisfies AFT.

Conversion between $\beta$ (PH coefficient) and $\gamma$ (AFT coefficient): $\gamma = -\beta/\alpha$.

This is why the Weibull AFT model is a natural parametric alternative to Cox regression. For a Weibull model, a test on the AFT coefficients is also a test on the corresponding PH coefficients.
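A quick numerical check of the conversion $\gamma = -\beta/\alpha$, as a sketch with made-up parameter values. The log-logistic contrast at the end is an added illustration (not from the text): it shows that the same conversion fails for a non-Weibull baseline.

```python
import numpy as np

lam, alpha, beta = 0.02, 1.3, 0.7   # hypothetical values
gamma = -beta / alpha               # AFT coefficient implied by the PH coefficient
x = np.linspace(0.1, 40.0, 200)


def S0_weibull(x):
    return np.exp(-lam * x**alpha)


def S0_loglogistic(x):
    return 1.0 / (1.0 + lam * x**alpha)


# Weibull: the PH form S0^{exp(beta)} and the AFT form S0(x * exp(-gamma)) coincide
S_ph = S0_weibull(x) ** np.exp(beta)
S_aft = S0_weibull(x * np.exp(-gamma))
weibull_diff = np.max(np.abs(S_ph - S_aft))

# Log-logistic baseline: the same conversion no longer matches the PH form
ll_ph = S0_loglogistic(x) ** np.exp(beta)
ll_aft = S0_loglogistic(x * np.exp(-gamma))
loglogistic_diff = np.max(np.abs(ll_ph - ll_aft))

print(weibull_diff, loglogistic_diff)
```

The first difference is zero up to floating-point error, while the second is clearly positive for these values.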

2.4 Approach 3 — Additive Hazards (Aalen)

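2.4.1 Definition

Equation (2.6.5):

\[ h(x \mid z) = h_0(x) + \sum_{j=1}^p z_j(x) \beta_j(x) \]

$\beta_j(x)$ is a function of time, so time-varying effects are expressed naturally.
Estimation: nonparametric weighted least squares (Klein Ch.10).
Applications: excess mortality (§ 1.5 Breast cancer Aalen Sedmak 1989), comparison with an external standard (§ 6.3).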

PH vs Additive

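Aspect	PH (Cox)	Additive (Aalen)
Effect structure	multiplicative	additive
Interpretation	“how many times the risk” (HR)	“excess events per unit time”
Time-dependent effects	need separate handling	natural
Baseline	$h_0$ can be estimated but is usually ignored	useful when $h_0$ is known from an external standard
Clinical familiarity	very high	low

PH is the standard, but for modeling excess mortality (observed hazard minus reference hazard) the Aalen model is the natural choice.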

3 § 2.7 — Models for Competing Risks

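3.1 Why a new framework is needed

So far we have implicitly assumed a single event type (“death”, “relapse”). Clinical reality is different:

§ 1.3 BMT: treatment failure = relapse or death in remission. Each patient experiences only one of the two.
Population mortality: cause = heart disease / cancer / accident / other.
§ 1.16 Channing: many causes of death (heart disease, cancer, pneumonia, ...).

The core difficulty of competing risks

Once one event occurs, the others can no longer be observed (they are censored). Simply “treating every event other than the one of interest as censoring” does not work: KM then targets the net probability in a hypothetical world without the other causes, and if the causes are dependent the censoring is informative, so even that target is estimated with bias.

Example: take heart-disease death as the event of interest and treat cancer death as censoring → 1 - KM overestimates the cumulative probability of heart-disease death. The Cumulative Incidence Function (CIF) is the right answer.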

3.2 Latent Failure Time Framework

Assume that each cause $i$ ($i = 1, \ldots, K$) has a latent (unobservable) event time $X_i$. What is observed:

\[ T = \min(X_1, \ldots, X_K), \quad \delta = i \text{ if } T = X_i \]

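$T$: time of the first event from any cause
$\delta$: which cause (1, 2, …, K)

Meaning of “latent”

$X_i$ is “the time at which the event from cause $i$ would have occurred if all other causes were removed”, a counterfactual. In practice only one of them is observed. This is the root of every difficulty in § 2.7.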

3.3 Cause-Specific Hazard

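Definition: Cause-Specific Hazard

\[ h_i(t) = \lim_{\Delta t \to 0} \frac{P[t \leq T < t + \Delta t,\, \delta = i \mid T \geq t]}{\Delta t} \tag{2.7.1} \]

“The intensity with which a person who has had no event from any cause up to time $t$ experiences an event from cause $i$ in the next instant.”

Total hazard:

\[ h_T(t) = \sum_{i=1}^K h_i(t) \]

Intuition: “sum of the cause-wise intensities”

The total risk is the sum of the risks from each cause. “Intensity of dying from any cause” = “intensity of heart-disease death + intensity of cancer death + …”.

This decomposition is the starting point of competing-risks analysis.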

3.3.1 Derivation from the joint survival function

Equation (2.7.2):

\[ h_i(t) = \frac{-\partial S(t_1, \ldots, t_K) / \partial t_i \big|_{t_1 = \cdots = t_K = t}}{S(t, \ldots, t)} \]

Here $S(t_1, \ldots, t_K) = \Pr[X_1 > t_1, \ldots, X_K > t_K]$ is the $K$-variate joint survival function.

3.3.2 Independent vs dependent: two canonical examples

Klein Example 2.6: independent causes

If the $X_i$ are independent, then $S(t_1, \ldots, t_K) = \prod_i S_i(t_i)$, and

\[ h_i(t) = \frac{-S'_i(t)}{S_i(t)} = h_i^{\text{marginal}}(t) \]

cause-specific hazard = marginal hazard. In other words, the other causes act like independent censoring.

Klein Example 2.7: dependent causes

$S(t_1, t_2) = [1 + \theta(\lambda_1 t_1 + \lambda_2 t_2)]^{-1/\theta}$ (Kendall’s $\tau = \theta/(\theta+2)$, Clayton copula). Cause-specific hazard:

\[ h_i(t) = \frac{\lambda_i}{1 + \theta t (\lambda_1 + \lambda_2)} \]

Marginal hazard (looking at $X_1$ alone):

\[ h_1^{\text{marginal}}(t) = \frac{\lambda_1}{1 + \theta \lambda_1 t} \]

The two differ. Within the model, this gap is the signature of dependence.
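Both closed forms in Example 2.7 can be recovered from the joint survival function by numerical differentiation of Eq. (2.7.2). The parameter values below (`lam1`, `lam2`, `theta`) are made up for illustration.

```python
import numpy as np

lam1, lam2, theta = 0.10, 0.05, 2.0   # hypothetical parameters


def S_joint(t1, t2):
    # Joint survival of Example 2.7
    return (1.0 + theta * (lam1 * t1 + lam2 * t2)) ** (-1.0 / theta)


def cause_specific_h1(t, eps=1e-6):
    # Eq. (2.7.2): -dS/dt1 evaluated on the diagonal, divided by S(t, t)
    dS = (S_joint(t + eps, t) - S_joint(t - eps, t)) / (2 * eps)
    return -dS / S_joint(t, t)


def marginal_h1(t, eps=1e-6):
    # The marginal survival of X1 is S(t, 0)
    dS = (S_joint(t + eps, 0.0) - S_joint(t - eps, 0.0)) / (2 * eps)
    return -dS / S_joint(t, 0.0)


t = 5.0
h1_cs = cause_specific_h1(t)
h1_marg = marginal_h1(t)
h1_cs_formula = lam1 / (1 + theta * t * (lam1 + lam2))
h1_marg_formula = lam1 / (1 + theta * lam1 * t)
kendall_tau = theta / (theta + 2)

print(h1_cs, h1_cs_formula)      # cause-specific hazard, two ways
print(h1_marg, h1_marg_formula)  # marginal hazard, two ways
```

At $t = 5$ the cause-specific hazard is smaller than the marginal hazard, because with positive dependence survivors of both risks are a more selected group.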

3.4 Identifiability Dilemma — Tsiatis 1975

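Canonical result: non-identifiability

From the observed data $(T, \delta)$ alone, it is impossible to decide whether the causes are independent or dependent.

Reason (see Klein Example 2.7): for any dependent model, one can always construct an independent model with exactly the same cause-specific hazards.

Hence:

The cause-specific hazard $h_i(t)$ can be estimated directly (no assumption about dependence is needed).
The marginal hazard $h_i^{\text{marginal}}(t)$ cannot be estimated (it requires an independence assumption, and that assumption cannot be verified).
“The survival probability if all causes other than cause $i$ were removed” is a question that cannot be answered without causal assumptions.

This is the intrinsic difficulty of competing risks: Daniel Bernoulli (1760) asked about “the effect of smallpox inoculation” on mortality, and more than 250 years later the question is still not resolved by data alone.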

3.5 세 종류의 Probability

Crude / Net / Partial Crude

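Type	Definition	Clinical question	Estimability
Crude	“Probability of dying from cause $i$ in the real world, where all causes operate”	“What is the probability that this patient dies of heart disease before age 50?”	CIF $F_i(t)$, directly estimable
Net	“Probability of dying in a hypothetical world where only cause $i$ operates”	“What is the probability of heart-disease death in a world where cancer has been eliminated?”	marginal $S_i(t)$, requires an independence assumption
Partial crude	“Probability of dying from cause $i$ in a world where only a subset of causes (the set $J$) operates”	“What is the share of heart disease when only heart disease and cancer exist?”	$F_i^J(t)$, requires assumptions

What clinical reports use: almost always the crude probability (CIF). It is an observable probability, not the answer to a causal question.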

3.6 Cumulative Incidence Function (CIF)

Definition

\[ F_i(t) = \Pr[T \leq t, \delta = i] = \int_0^t h_i(u) \exp\!\left[-H_T(u)\right]\, du \tag{2.7.3} \]

Here $H_T(t) = \sum_j \int_0^t h_j(u)\, du$.

Intuition behind the formula

$F_i(t)$ = ∫ (intensity of cause $i$ at time $u$) × (probability of no event from any cause up to time $u$) du.

$h_i \exp(-H_T)$ = “probability density of an event from cause $i$ occurring exactly at time $u$”. Integrating it over 0~$t$ gives “the probability that an event from cause $i$ occurs in [0, t]”.
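For constant cause-specific hazards, Eq. (2.7.3) has the closed form $F_i(t) = (h_i/h_T)(1 - e^{-h_T t})$, which makes a convenient check of the integral. The sketch below uses made-up hazards and a hand-written trapezoidal rule; it also compares the CIF with the naive net quantity $1 - e^{-h_i t}$.

```python
import numpy as np

h = np.array([0.03, 0.01])   # hypothetical constant cause-specific hazards
hT = h.sum()                 # total hazard


def cif_numeric(i, t, n=20001):
    # Trapezoidal approximation of the integral of h_i(u) * exp(-H_T(u)) over [0, t]
    u = np.linspace(0.0, t, n)
    integrand = h[i] * np.exp(-hT * u)
    du = u[1] - u[0]
    return float(np.sum((integrand[:-1] + integrand[1:]) * 0.5) * du)


def cif_closed(i, t):
    return h[i] / hT * (1.0 - np.exp(-hT * t))


t = 20.0
F1, F2 = cif_numeric(0, t), cif_numeric(1, t)
naive_1 = 1.0 - np.exp(-h[0] * t)   # "1 - net survival" for cause 1

print(F1, cif_closed(0, t))         # numerical vs closed form
print(F1 + F2, 1 - np.exp(-hT * t)) # the CIFs add up to 1 - S_T(t)
print(naive_1)                      # larger than F1
```

As $t \to \infty$ the CIF of cause 1 tends to $h_1/h_T = 0.75$ rather than 1, which anticipates the sub-distribution property discussed next.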

3.7 Sub-distribution 성질

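$F_i$ is not a proper distribution function:

$F_i(0) = 0$
$F_i$ is non-decreasing.
$F_i(\infty) = \Pr[\delta = i] < 1$, the probability of eventually experiencing the event from cause $i$.

These properties define a “sub-distribution”. In general $1 - F_i(t) \neq S_i^{\text{marginal}}(t)$, so net survival cannot be obtained by a simple transformation.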

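3.8 1 - KM ≠ CIF: the canonical pitfall

The most common analysis error

To estimate the cumulative probability of cause $i$, other causes are treated as censoring, KM is computed, and $1 - \hat{S}_{KM}(t)$ is used as the CIF estimator.

This overestimates systematically. Reasons:

KM estimates net survival in “a hypothetical world without the other causes” (under an independence assumption).
The CIF is the crude probability in “the real world, where all causes operate”.
The presence of other causes necessarily reduces the number of cause $i$ events that can occur within any given time.

Correct estimator: the Aalen-Johansen estimator (Klein Ch.4):

\[ \hat{F}_i(t) = \sum_{t_j \leq t} \frac{d_{ij}}{n_j} \cdot \hat{S}_{KM}(t_{j-1}) \]

$d_{ij}$: number of cause $i$ events at time $t_j$.
$n_j$: number at risk at time $t_j$.
$\hat{S}_{KM}(t_{j-1})$: all-cause KM (at time $t_{j-1}$).

In R: cmprsk::cuminc(), or survival::survfit() with a multi-state Surv() response (status coded as a factor whose first level is censoring). In Python: AalenJohansenFitter in lifelines.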
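The estimator is short enough to write by hand. The sketch below is a minimal NumPy version on a made-up toy data set (the function name `aalen_johansen` and the data are illustrative, not from any library); it returns the naive 1 - KM alongside so the two can be compared directly.

```python
import numpy as np

# Toy data (made up): cause 0 = censored, 1 = cause of interest, 2 = competing cause
time = np.array([2, 3, 3, 5, 6, 7, 8, 9, 11, 12], dtype=float)
cause = np.array([1, 2, 0, 1, 2, 1, 0, 2, 1, 0])


def aalen_johansen(time, cause, k):
    """Event times, Aalen-Johansen CIF for cause k, and naive 1 - KM (other causes censored)."""
    times = np.unique(time[cause > 0])
    S_all = 1.0     # all-cause KM just before the current event time
    S_naive = 1.0   # KM that treats other causes as censoring
    F = 0.0
    cif, naive = [], []
    for tj in times:
        n_j = np.sum(time >= tj)                    # number at risk
        d_k = np.sum((time == tj) & (cause == k))   # cause-k events at tj
        d_all = np.sum((time == tj) & (cause > 0))  # events from any cause at tj
        F += S_all * d_k / n_j                      # increment uses S_KM(t_{j-1})
        S_all *= 1.0 - d_all / n_j
        S_naive *= 1.0 - d_k / n_j
        cif.append(F)
        naive.append(1.0 - S_naive)
    return times, np.array(cif), np.array(naive)


times, cif1, naive1 = aalen_johansen(time, cause, 1)
_, cif2, _ = aalen_johansen(time, cause, 2)
for tj, f, g in zip(times, cif1, naive1):
    print(f"t={tj:4.0f}  AJ CIF={f:.3f}  1-KM={g:.3f}")
```

The two estimates agree until the first competing event and separate afterwards, with 1 - KM always on top.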

3.9 Net Survival 의 Bound — Peterson 1976

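The net survival of dependent causes cannot be estimated, but it can be bounded:

\[ S_T(t) \leq S_i(t) \leq 1 - F_i(t) \]

lower bound: perfect positive dependence (when one cause is late, so is the other).
upper bound: perfect negative dependence.

Practical value

These bounds are usually very wide (uninformative). Narrower bounds (Klein-Moeschberger 1988, Zheng-Klein 1994) are derived by placing assumptions on the dependence structure (copula family), and those assumptions are themselves unverifiable.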
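How wide the bounds are can be seen in the dependent model of Example 2.7, where everything is available in closed form. The sketch below uses made-up parameters; the expression for $F_1$ comes from integrating $h_1(u) S_T(u)$ in Eq. (2.7.3) for this model, which gives $F_1(t) = (\lambda_1/\Lambda)[1 - S_T(t)]$ with $\Lambda = \lambda_1 + \lambda_2$.

```python
import numpy as np

lam1, lam2, theta = 0.10, 0.05, 2.0   # hypothetical parameters (form of Example 2.7)
Lam = lam1 + lam2
t = np.linspace(0.0, 30.0, 61)

S_T = (1 + theta * Lam * t) ** (-1 / theta)    # P[min(X1, X2) > t], observable
F_1 = lam1 / Lam * (1 - S_T)                   # CIF of cause 1, observable
S_1 = (1 + theta * lam1 * t) ** (-1 / theta)   # true net survival of X1, not observable

lower, upper = S_T, 1 - F_1                    # Peterson bounds for S_1
width = upper - lower                          # equals the CIF of cause 2

print(lower[-1], S_1[-1], upper[-1])
print(width[-1])
```

The width of the interval equals $F_2(t)$, so the bounds are tight only when the competing cause is rare.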

3.10 Pepe-Mori Conditional Probability

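The CIF is a probability over the whole population. Clinically, “the probability of cause $i$ among patients who have not experienced the other cause” is often more useful. Pepe-Mori (1991, 1993), with $X$ the time to the cause of interest $i$ and $Y$ the time to the competing cause:

\[ P(X \leq t, X < Y \mid Y \geq t \text{ or } X \leq Y) = \frac{F_i(t)}{1 - F_{j \neq i}(t)} \]

Example

Look at the cumulative probability of heart-disease death “among patients who have not died of cancer”. The CIF is an absolute proportion of the population, while the conditional probability is a proportion relative to those who have not failed from the competing cause.

When a clinician explains it to a patient: “Among patients like you who have not died of cancer within 5 years, the probability of dying of heart disease within 5 years is $F_{\text{heart}}(5) / [1 - F_{\text{cancer}}(5)]$.”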
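A worked number makes the difference between the CIF and the conditional probability concrete. The hazards below are made-up constants (per year), chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

h_heart, h_cancer = 0.02, 0.03   # hypothetical constant cause-specific hazards per year
hT = h_heart + h_cancer


def cif(h_i, t):
    # CIF for constant hazards: (h_i / h_T) * (1 - exp(-h_T * t))
    return h_i / hT * (1.0 - np.exp(-hT * t))


t = 5.0
F_heart = cif(h_heart, t)
F_cancer = cif(h_cancer, t)
cp_heart = F_heart / (1.0 - F_cancer)   # Pepe-Mori conditional probability

print(round(F_heart, 4), round(F_cancer, 4), round(cp_heart, 4))
```

The conditional probability is never smaller than the CIF, since it divides by the fraction of patients who have not failed from the competing cause.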

3.11 Fine-Gray Subdistribution Hazard

To apply a regression model to the CIF itself, use the subdistribution hazard of Fine & Gray (1999):

\[ \lambda_i(t) = \lim_{\Delta t \to 0} \frac{\Pr[t \leq T < t + \Delta t, \delta = i \mid T \geq t \text{ or } (T < t \text{ and } \delta \neq i)]}{\Delta t} \]

The risk set differs from that of an ordinary hazard: people who have already experienced an event from another cause remain in the risk set (counterfactually still at risk for cause $i$).
Direct relation to the CIF: $F_i(t) = 1 - \exp[-\Lambda_i(t)]$, $\Lambda_i = \int \lambda_i$.
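The relation between the subdistribution hazard and the CIF can be checked numerically. The sketch below assumes constant cause-specific hazards (made-up values) and uses the equivalent form $\lambda_i(t) = F_i'(t) / [1 - F_i(t)]$ of the definition above.

```python
import numpy as np

h1, h2 = 0.03, 0.01              # hypothetical constant cause-specific hazards
hT = h1 + h2
t = np.linspace(0.0, 50.0, 5001)

F1 = h1 / hT * (1 - np.exp(-hT * t))   # CIF of cause 1
f1 = h1 * np.exp(-hT * t)              # sub-density dF1/dt
sub_haz = f1 / (1 - F1)                # subdistribution hazard lambda_1(t)

# Cumulative subdistribution hazard by the trapezoidal rule
steps = (sub_haz[:-1] + sub_haz[1:]) * 0.5 * np.diff(t)
Lam1 = np.concatenate([[0.0], np.cumsum(steps)])
F1_back = 1 - np.exp(-Lam1)            # should reproduce F1

print(np.max(np.abs(F1_back - F1)))    # numerical integration error only
print(sub_haz[0], sub_haz[-1], h1)     # starts at h1, then falls below it
```

In this example the subdistribution hazard drifts below the cause-specific hazard over time, because subjects removed by the competing cause stay in its risk set.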

Cause-specific vs Subdistribution

	Cause-specific $h_i$	Subdistribution $\lambda_i$
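Risk set	subjects with no event from any cause	event-free subjects + subjects who already failed from another cause
Regression target	intensity (instantaneous risk of one cause)	CIF (cumulative incidence of one cause)
Interpretation	biological mechanism	clinical cumulative risk
Use	mechanism analysis	patient counseling, policy decisions

Current clinical practice: report both models. The cause-specific hazard speaks to mechanism, the subdistribution hazard to cumulative impact.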

4 R + Python: Two Worked Cases

4.1 Case 1: Leukemia (PH + AFT)

Compare Cox PH and Weibull AFT on the § 1.2 Freireich 1963 data.

4.1.1 R — `survival` + `flexsurv`

library(survival)
library(flexsurv)

leukemia <- data.frame(
  time = c(1, 22, 3, 12, 8, 17, 2, 11, 8, 12, 2, 5, 4, 15, 8, 23, 5, 11, 4, 1, 8,
           10, 7, 32, 23, 22, 6, 16, 34, 32, 25, 11, 20, 19, 6, 17, 35, 6, 13, 9, 6, 10),
  status = c(rep(1, 21), 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0),
  group = factor(c(rep("placebo", 21), rep("6-MP", 21)),
                 levels = c("placebo", "6-MP"))
)

# 1. Cox PH
cox_fit <- coxph(Surv(time, status) ~ group, data = leukemia)
summary(cox_fit)
# coef = log HR, exp(coef) = HR

# 2. Weibull AFT
aft_fit <- flexsurvreg(Surv(time, status) ~ group, data = leukemia, dist = "weibull")
print(aft_fit)
# coefficients on log time scale

# 3. PH diagnostic: log-log plot
fit_km <- survfit(Surv(time, status) ~ group, data = leukemia)
plot(fit_km, fun = "cloglog", col = c("red", "blue"), lwd = 2,
     xlab = "log time", ylab = "log(-log S)",
     main = "PH check — should be parallel")
legend("bottomright", legend = c("placebo", "6-MP"),
       col = c("red", "blue"), lwd = 2)

# 4. Schoenfeld residual test
ph_test <- cox.zph(cox_fit)
print(ph_test)
plot(ph_test)

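# 5. Check the Weibull AFT/PH conversion
# flexsurv's "weibull" reports the shape (alpha) directly; covariates act on log(scale)
alpha <- aft_fit$res["shape", "est"]
gamma_hat <- aft_fit$res["group6-MP", "est"]   # AFT coefficient on the log-time scale
beta_implied <- -alpha * gamma_hat
beta_cox <- coef(cox_fit)
cat(sprintf("AFT gamma = %.3f -> implied PH beta = %.3f\n", gamma_hat, beta_implied))
cat(sprintf("Cox beta (direct) = %.3f\n", beta_cox))
# If the two values are close, a Weibull baseline is plausible for these data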

4.1.2 Python — `lifelines`

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter, WeibullAFTFitter, KaplanMeierFitter

leukemia = pd.DataFrame({
    "time": [1, 22, 3, 12, 8, 17, 2, 11, 8, 12, 2, 5, 4, 15, 8, 23, 5, 11, 4, 1, 8,
             10, 7, 32, 23, 22, 6, 16, 34, 32, 25, 11, 20, 19, 6, 17, 35, 6, 13, 9, 6, 10],
    "status": [1]*21 + [1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
    "six_mp": [0]*21 + [1]*21,
})

# 1. Cox PH
cph = CoxPHFitter().fit(leukemia, duration_col="time", event_col="status")
cph.print_summary()

# 2. Weibull AFT
aft = WeibullAFTFitter().fit(leukemia, duration_col="time", event_col="status")
aft.print_summary()

# 3. PH check — log(-log S) plot
fig, ax = plt.subplots(figsize=(8, 5))
for grp, color, label in [(0, "red", "placebo"), (1, "blue", "6-MP")]:
    sub = leukemia[leukemia["six_mp"] == grp]
    kmf = KaplanMeierFitter().fit(sub["time"], sub["status"])
    times = kmf.survival_function_.index.values
    S = kmf.survival_function_.values.flatten()
    mask = (S > 0) & (S < 1) & (times > 0)
    ax.plot(np.log(times[mask]), np.log(-np.log(S[mask])),
            "o-", color=color, label=label)
ax.set_xlabel("log time")
ax.set_ylabel("log(-log S)")
ax.set_title("PH check")
ax.legend()

plt.tight_layout()
plt.savefig("klein_2_6_ph_aft.png", dpi=100)

# 4. Check of the PH assumption based on scaled Schoenfeld residuals
cph.check_assumptions(leukemia, p_value_threshold=0.05)

4.2 Case 2: § 1.3 BMT (Competing Risks)

The 137-patient BMT data of Copelan 1991: relapse vs death in remission as competing risks.

4.2.1 R — `cmprsk` + `survival`

library(KMsurv)  # data
library(cmprsk)
library(survival)

# bmt data: included in the KMsurv package
data(bmt)
# t2 = disease-free survival time (time to relapse, death, or end of study)
# d3 = disease-free survival indicator (1 = relapsed or died, 0 = censored)
# d2 = relapse indicator; d1 = death indicator (paired with t1, overall survival)
# Build a single cause variable for the competing-risks analysis:
bmt$cause <- with(bmt, ifelse(d3 == 0, 0,        # censored
                       ifelse(d2 == 1, 1, 2)))   # 1 = relapse, 2 = death in remission

# 1. Cumulative incidence functions (Aalen-Johansen type), all disease groups pooled
#    (add group = bmt$group for per-group curves)
ci_fit <- cuminc(ftime = bmt$t2, fstatus = bmt$cause)
plot(ci_fit, lty = 1:2, color = c("black", "black"),
     xlab = "Time (days)", ylab = "Cumulative incidence",
     main = "BMT: Relapse vs Death CIF")

# 2. Comparison with 1 - KM for relapse (the incorrect estimate)
km_relapse <- survfit(Surv(t2, d2) ~ 1, data = bmt)
lines(km_relapse$time, 1 - km_relapse$surv, type = "s", lwd = 2, col = "red")
legend("topleft",
       legend = c("CIF relapse", "CIF death in remission", "1 - KM relapse (overestimates)"),
       col = c("black", "black", "red"), lty = c(1, 2, 1), lwd = c(1, 1, 2))

# 3. Cause-specific Cox (relapse only; deaths in remission are censored)
cox_relapse <- coxph(Surv(t2, d2) ~ z1 + z3 + z5, data = bmt)  # example covariates
summary(cox_relapse)

# 4. Fine-Gray subdistribution hazard
fg_relapse <- crr(ftime = bmt$t2, fstatus = bmt$cause,
                  cov1 = as.matrix(bmt[, c("z1", "z3", "z5")]),
                  failcode = 1, cencode = 0)
summary(fg_relapse)

# Compare the two models: the cause-specific HR and the subdistribution HR have different meanings

4.2.2 Python — `lifelines`

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import AalenJohansenFitter, KaplanMeierFitter
# Note: lifelines has no Fine-Gray implementation; the R call cmprsk::crr above is one option

# Synthetic data (replace with the real BMT data)
n = 137
np.random.seed(42)
df = pd.DataFrame({
    "time": np.random.exponential(500, n),
    "cause": np.random.choice([0, 1, 2], n, p=[0.4, 0.3, 0.3]),  # 0=cens, 1=relapse, 2=death
})

# 1. CIF via Aalen-Johansen
fig, ax = plt.subplots(figsize=(8, 5))
for cause, color, label in [(1, "red", "Relapse"), (2, "blue", "Death")]:
    aj = AalenJohansenFitter()
    aj.fit(df["time"], df["cause"], event_of_interest=cause)
    aj.plot(ax=ax, color=color, label=f"CIF {label}")

# 2. 1 - KM (the overestimation pitfall): competing deaths treated as censoring
df_relapse = df.copy()
df_relapse["status"] = (df_relapse["cause"] == 1).astype(int)
kmf = KaplanMeierFitter().fit(df_relapse["time"], df_relapse["status"])
ax.step(kmf.survival_function_.index, 1 - kmf.survival_function_.values.flatten(),
        where="post", color="orange", linestyle="--", label="1 - KM (incorrect)")
ax.set_title("BMT competing risks: CIF vs 1-KM")
ax.set_xlabel("Time")
ax.set_ylabel("Cumulative probability")
ax.legend()
plt.tight_layout()
plt.savefig("klein_2_7_competing_risks.png", dpi=100)

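Interpreting the results

The CIF (Aalen-Johansen) is always less than or equal to 1 - KM, because competing events “block” the occurrence of cause $i$ events.
The larger the gap between the two curves, the stronger the influence of the competing risks.
This is why clinical reports should use the CIF.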

5 Putting the Intuition Together: Wrapping Up Ch.2

§ 2.6~2.7 one-page summary

§ 2.6: regression frameworks:

Framework	Formula	Interpretation	Main use
AFT	$S(x \mid Z) = S_0[x e^{-\gamma'Z}]$	time acceleration/deceleration	clinical reporting (parametric)
PH (Cox)	$h(x \mid z) = h_0(x) e^{\beta'z}$	hazard ratio	standard (Klein Ch.8)
Additive (Aalen)	$h = h_0 + \sum z_j(t) \beta_j(t)$	excess hazard	excess mortality (Ch.10)

Uniqueness of the Weibull: the only distribution that satisfies AFT and PH simultaneously. $\gamma = -\beta/\alpha$.

§ 2.7: competing risks:

Latent times $X_1, \ldots, X_K$, observed $(T = \min, \delta)$.
Cause-specific hazard $h_i$: directly estimable.
Marginal/net $h_i^{\text{marginal}}$: requires an independence assumption that cannot be verified (identifiability dilemma).
CIF $F_i(t) = \int h_i \exp(-H_T)$: crude probability, the standard for clinical reporting.
1 - KM ≠ CIF: the most common analysis pitfall.
Report both cause-specific Cox (mechanism) and Fine-Gray (cumulative impact).

6 Practical Checklist: § 2.6~2.7

§ 2.6 — Regression

AFT $Y = \mu + \gamma'Z + \sigma W$, $S(x \mid Z) = S_0[x e^{-\gamma'Z}]$.
PH $h(x \mid z) = h_0(x) c(\beta'z)$, Cox link $c = \exp$.
Time-invariant ratio under PH: $h(z_1)/h(z_2) = c(\beta'z_1)/c(\beta'z_2)$.
Lehmann alternative $S(x \mid z) = S_0(x)^{c(\beta'z)}$.
Parallel curves in the log-log plot = PH diagnostic.
Schoenfeld residuals for a more precise test (Klein Ch.11).
Additive Aalen $h = h_0 + \sum z_j(t) \beta_j(t)$: time-varying effects come naturally.
Uniqueness of the Weibull: AFT ∩ PH, $\gamma = -\beta/\alpha$.

§ 2.7 — Competing Risks

Latent $(X_1, \ldots, X_K)$ → observed $(T = \min, \delta)$.
Cause-specific $h_i(t)$ and total hazard $h_T = \sum h_i$.
Under independence cause-specific = marginal; under dependence they differ.
Identifiability dilemma (Tsiatis 1975): dependence cannot be identified from $(T, \delta)$ alone.
Crude (CIF) vs Net (counterfactual) vs Partial crude.
CIF $F_i(t) = \int_0^t h_i(u) \exp[-H_T(u)] du$, estimated by Aalen-Johansen.
1 - KM ≠ CIF: avoid the pitfall.
Both cause-specific Cox (mechanism) and Fine-Gray subdistribution (cumulative).
Pepe-Mori conditional probability for explaining risk to patients.

7 Related Topics

Klein series

(Previous) Ch.2 overview: a survey of the 7 sections
(Previous) § 2.2~2.3 in depth: Survival + Hazard
(Previous) § 2.4~2.5 in depth: MRL + 9 Parametric Models
(Next chapter) Ch.3: Censoring and Truncation (precise definition of the likelihood)

Ch.1 series: canonical competing-risks examples

Related concepts (cross-category)

Cox Proportional Hazards
Accelerated Failure Time
Aalen Additive Hazards
Fine-Gray Subdistribution Hazard
Cumulative Incidence Function
Causal Inference: the structure of counterfactual assumptions

8 References

Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.), Ch.2 § 2.6~2.7, pp. 45-57. Springer.
Cox, D. R. (1972). Regression Models and Life-Tables. JRSS B, 34(2), 187-220.
Aalen, O. O. (1980). A model for nonparametric regression analysis of counting processes. In Lecture Notes in Statistics, 2, 1-25.
Tsiatis, A. (1975). A nonidentifiability aspect of the problem of competing risks. PNAS, 72(1), 20-22.
Peterson, A. V. (1976). Bounds for a joint distribution function with fixed sub-distribution functions: Application to competing risks. PNAS, 73(1), 11-13.
Fine, J. P., & Gray, R. J. (1999). A Proportional Hazards Model for the Subdistribution of a Competing Risk. JASA, 94(446), 496-509.
Pepe, M. S., & Mori, M. (1993). Kaplan-Meier, marginal or conditional probability curves in summarizing competing risks failure time data? Statistics in Medicine, 12(8), 737-751.
Heckman, J. J., & Honoré, B. E. (1989). The identifiability of the competing risks model. Biometrika, 76(2), 325-330.
Klein, J. P., & Moeschberger, M. L. (1988). Bounds on net survival probabilities for dependent competing risks. Biometrics, 44(2), 529-538.
Putter, H., Fiocco, M., & Geskus, R. B. (2007). Tutorial in Biostatistics: Competing Risks and Multi-State Models. Statistics in Medicine, 26(11), 2389-2430.
Andersen, P. K., Geskus, R. B., de Witte, T., & Putter, H. (2012). Competing Risks in Epidemiology: Possibilities and Pitfalls. International Journal of Epidemiology, 41(3), 861-870.
Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.
Kalbfleisch, J. D., & Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd ed. Wiley.