Kwangmin Kim - Klein § 8.1~8.2: Introducing the Cox Model

1 Introduction: the first Ch.8 deep-dive

Part	Topic
Ch.8 Overview	Survey of the 9 sections
§ 8.1~8.2 (this part)	Introducing the Cox model + Coding Covariates
§ 8.3~8.4 (planned)	Partial Likelihood + Ties
§ 8.5 (planned)	Local Tests
§ 8.6~8.8 (planned)	Discretizing + Model Building + Survival Estimation
§ 8.9 (planned)	Exercises

§ 8.1~8.2 in one line

“§ 8.1 introduces the Cox PH model $h(t|Z) = h_0(t) \exp(\beta'Z)$, which is semiparametric (nonparametric baseline + parametric covariate effect). The hazard ratio $\exp(\beta'(Z-Z^*))$ does not depend on time (the PH assumption). Lehmann alternative: $S(t|Z) = S_0(t)^{\exp(\beta'Z)}$. § 8.2 covers coding: dichotomous (the two codings are equivalent), K-level qualitative (K-1 dummies recommended; a single ordinal score is a trap), continuous (effect of changing units), interaction (product term). Klein Examples: 8.1 immunoperoxidase, RR=2.67; 8.2 larynx, Stage IV vs I RR=5.60; 8.3 kidney transplant, equivalence of two codings (1.17, 1.28, 1.93).”

2 § 8.1: The Cox Proportional Hazards Model

2.1 Model definition

Definition: Cox PH model (eqs. 8.1.1 and 8.1.2)

A sample of size $n$ with data $(T_j, \delta_j, Z_j)$, $j = 1, \ldots, n$.

$T_j$: study time.
$\delta_j$: event indicator.
$Z_j = (Z_{j1}, \ldots, Z_{jp})'$: $p$-dimensional covariate vector.

General form (eq. 8.1.1):

\[ h(t \mid Z) = h_0(t) \cdot c(\beta'Z) \]

Standard form (eq. 8.1.2):

\[ h(t \mid Z) = h_0(t) \cdot \exp(\beta'Z) = h_0(t) \exp\left(\sum_{k=1}^p \beta_k Z_k\right) \]

$h_0(t)$: baseline hazard, an arbitrary function (nonparametric).
$\beta = (\beta_1, \ldots, \beta_p)'$: regression coefficients (parametric).
$\exp(\beta'Z)$: multiplicative covariate effect.

→ Semiparametric: nonparametric baseline + parametric effect.

Intuition: why exp(β'Z)?

Constraint: $h(t|Z) > 0$ (a hazard is always positive).

Solution: $\exp(\cdot)$ is always positive → the sign constraint holds automatically.

Linear-model form:

\[ \log h(t \mid Z) - \log h_0(t) = \log\left(\frac{h(t|Z)}{h_0(t)}\right) = \beta'Z \]

→ “log hazard ratio = linear combination of the covariates”.

Consistency with other regressions:

OLS: $E[Y|X] = \beta'X$.
Logistic: $\text{logit}[P(Y=1|X)] = \beta'X$.
Cox: $\log[h(t|Z)/h_0(t)] = \beta'Z$.

→ The same linear formulation, so the coding rules (dummy, interaction) carry over unchanged.

Lehmann Alternative (the PH alternative of Ch.7 § 7.3):

\[ S(t \mid Z) = \exp\left[-\int_0^t h_0(u) \exp(\beta'Z) du\right] = S_0(t)^{\exp(\beta'Z)} \]

→ The survival curve for any covariate value is the baseline survival curve raised to a power.
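The Lehmann identity can be checked numerically. The sketch below assumes a Weibull baseline hazard purely for illustration (the shape and scale values are made up; the Cox model itself leaves $h_0$ unspecified) and compares the integral form of $S(t|Z)$ with $S_0(t)^{\exp(\beta'Z)}$.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # hypothetical Weibull baseline, chosen only for illustration


def baseline_hazard(t):
    """h0(t) for the assumed Weibull baseline."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def baseline_survival(t):
    """Closed-form S0(t) = exp(-(t/scale)^shape) for the same baseline."""
    return math.exp(-((t / SCALE) ** SHAPE))


def survival_by_integration(t, lin_pred, steps=20000):
    """S(t|Z) = exp(-integral_0^t h0(u) exp(beta'Z) du), using the midpoint rule."""
    width = t / steps
    cum_hazard = sum(baseline_hazard((i + 0.5) * width) for i in range(steps)) * width
    return math.exp(-cum_hazard * math.exp(lin_pred))


lin_pred = 0.9802  # beta'Z for an IH+ patient, using b from Example 8.1
t = 5.0
lehmann = baseline_survival(t) ** math.exp(lin_pred)
integrated = survival_by_integration(t, lin_pred)
print(lehmann, integrated)  # the two values should agree to several decimals
```

Any other valid baseline could be substituted; the agreement does not depend on the Weibull choice.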

2.2 Hazard Ratio: eq. 8.1.3

Definition: Hazard Ratio (eq. 8.1.3)

For two individuals with covariate vectors $Z$ and $Z^*$:

\[ \frac{h(t \mid Z)}{h(t \mid Z^*)} = \frac{h_0(t) \exp(\beta'Z)}{h_0(t) \exp(\beta'Z^*)} = \exp\left[\beta'(Z - Z^*)\right] = \exp\left[\sum_{k=1}^p \beta_k (Z_k - Z_k^*)\right] \]

The ratio does not depend on time, because $h_0(t)$ cancels.

→ This is what “Proportional Hazards” means: the ratio of the two individuals' hazards stays constant (proportional) over time.

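Intuition: what the PH assumption looks like

Survival curves:

The shape of the baseline $S_0(t)$ is arbitrary (any distribution is allowed).
The curve for any other covariate value is a power of the baseline: $S(t|Z) = S_0(t)^{\exp(\beta'Z)}$.
If $\exp(\beta'Z) > 1$, the curve falls faster (higher risk).

Log-log plot:

$\log[-\log S(t|Z)] = \beta'Z + \log[-\log S_0(t)]$.
The log-log curves for two covariate values are vertical shifts of each other (parallel), separated by $\beta'(Z - Z^*)$.
PH holds ↔ the log-log curves are parallel.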
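The constant vertical gap can be verified with a few lines of arithmetic. This sketch again assumes a hypothetical baseline survival function (any valid one would do) and uses the Stage IV and Stage II coefficients from Example 8.2 as the two linear predictors.

```python
import math


def cloglog(s):
    """Complementary log-log transform log(-log(s)) of a survival probability."""
    return math.log(-math.log(s))


def baseline_survival(t):
    """Hypothetical baseline S0(t); the Cox model does not specify its shape."""
    return math.exp(-((t / 10.0) ** 1.5))


def survival(t, lin_pred):
    """Lehmann form: S(t|Z) = S0(t) ** exp(beta'Z)."""
    return baseline_survival(t) ** math.exp(lin_pred)


lp_stage4 = 1.7228  # beta'Z for Stage IV (Example 8.2, stage-only model)
lp_stage2 = 0.0658  # beta'Z for Stage II

# Vertical gap between the two log-log curves at several time points.
gaps = [cloglog(survival(t, lp_stage4)) - cloglog(survival(t, lp_stage2))
        for t in (1.0, 3.0, 6.0, 9.0)]
print(gaps)  # every entry should equal lp_stage4 - lp_stage2
```

With real data the curves come from estimated survival functions, so the gap is only approximately constant; a clear trend in the gap is the visual sign of a PH violation.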

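When PH is violated:

The hazard ratio changes over time.
The curves cross or are not parallel.
The Ch.9 tools are needed: time-varying coefficients or a stratified Cox model.
Check with the Ch.11 diagnostics (Schoenfeld residuals).

→ Checking the PH assumption is the first step of a Cox analysis. If it fails, turn to the alternatives in Ch.9·10.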

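2.3 Clinical interpretation of the hazard ratio

Intuition: reading an RR in clinical terms

Treatment effect ($Z_1 = 1$ vs $Z_1 = 0$):

\[ \text{RR} = \exp(\beta_1) \]

$\text{RR} = 2$: at every time point the treated group's event rate (hazard) is twice that of control ($\beta_1 = \log 2 = 0.693$). This does not mean the event time is halved.
$\text{RR} = 0.5$: the treated group has half the hazard ($\beta_1 = -0.693$).
$\text{RR} = 1$: no difference ($\beta_1 = 0$).

Continuous covariate (e.g. age):

$\text{RR per 1 unit} = \exp(\beta)$. $\text{RR per 10 units} = \exp(10\beta)$.

→ The choice of unit changes the reported number, not the meaning.

95% CI for RR:

\[ \exp(b \pm 1.96 \cdot \text{SE}(b)) \]

(The Wald CI for $\beta$, exponentiated; it relies on the asymptotic normality of the estimator $b$.)

→ Standard clinical reporting: “$\text{RR} = 2.67$ (95% CI: 1.14, 6.25)”.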
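The reported interval can be reproduced from the estimate and its standard error. A minimal sketch, using $b = 0.9802$ from Example 8.1 and the standard error 0.4349 quoted in the code section of this note; the function name is illustrative.

```python
import math


def wald_ci_for_rr(b, se, z=1.96):
    """Return (RR, lower, upper): the exponentiated Wald interval for a Cox coefficient."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


rr, lower, upper = wald_ci_for_rr(0.9802, 0.4349)
print(f"RR = {rr:.3f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```

With the four-digit coefficient the point estimate evaluates to about 2.665, which the text reports as 2.67; the interval limits round to 1.14 and 6.25. Note that the interval is symmetric on the log scale, not on the RR scale.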

3 § 8.2: Coding Covariates

3.1 Quantitative vs qualitative

Intuition: how the two kinds of variable are handled

Quantitative: continuous or discrete numeric values.

Age, blood pressure, biomarker level, dose.
Use a single covariate $Z_k$. A 1-unit increase multiplies the hazard by $e^{\beta_k}$.

Qualitative: categories.

Gender, race, treatment, stage, smoking status.
$K$ categories → $K-1$ dummy variables are needed.

→ The coding rules are the same as in ordinary regression (OLS, logistic). Nothing is specific to Cox PH.

Fixed time vs Time-dependent:

Fixed time (Ch.8): $Z$ is fixed at the start of the study (age at entry, treatment assignment).
Time-dependent (Ch.9): $Z(t)$ changes over time (current dose, biomarker over time).

→ This part (Ch.8) covers fixed-time covariates only.

3.2 Dichotomous Variable (Z = 0/1)

Intuition: the two codings are equivalent

Coding A: $Z_1 = 1$ if male, 0 if female.

Male hazard: $h(t|Z_1=1) = h_0(t) \exp(\beta_1)$.
Female hazard: $h(t|Z_1=0) = h_0(t) \exp(0) = h_0(t)$.
RR(male/female) = $\exp(\beta_1)$.

Coding B: $Z_2 = 1$ if female, 0 if male (reversed).

RR(female/male) = $\exp(\beta_2)$.

Relationship:

\[ \beta_2 = -\beta_1, \quad \exp(\beta_2) = 1/\exp(\beta_1) \]

→ Only the sign flips. The statistical conclusion is the same.
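The sign flip can be seen by fitting both codings. The sketch below is a bare-bones Newton-Raphson fit of a one-covariate Cox partial likelihood (valid only when there are no tied event times); the eight observations are made-up toy data, not from Klein.

```python
import math


def fit_cox_one_covariate(times, events, z, iterations=25):
    """Maximise the one-covariate Cox partial likelihood by Newton-Raphson (no ties)."""
    beta = 0.0
    for _ in range(iterations):
        score, info = 0.0, 0.0
        for i, (t_i, d_i) in enumerate(zip(times, events)):
            if not d_i:
                continue  # censored observations contribute only through risk sets
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            s0 = sum(math.exp(beta * z[j]) for j in risk)
            s1 = sum(z[j] * math.exp(beta * z[j]) for j in risk)
            s2 = sum(z[j] ** 2 * math.exp(beta * z[j]) for j in risk)
            score += z[i] - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2
        beta += score / info
    return beta


# Hypothetical toy data: 8 subjects, distinct times, 2 censored.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 0, 1, 1, 1, 0]
male = [1, 0, 1, 1, 0, 0, 1, 0]        # Coding A: 1 = male
female = [1 - m for m in male]          # Coding B: 1 = female

b_male = fit_cox_one_covariate(times, events, male)
b_female = fit_cox_one_covariate(times, events, female)
print(b_male, b_female)  # equal magnitude, opposite sign
```

The two fits describe the same model, so the standard errors and test statistics agree as well; only the direction in which the RR is reported changes.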

Practical guide:

Make the most common or the normal (control) category the reference group (= 0).
Then $\beta_1 > 0$ reads naturally as “the hazard is higher when the covariate is 1”.

3.3 K-Level Qualitative: K-1 Dummy Variables

Definition: K-1 Dummy Coding (eq. 8.2.1)

$K$ categories → $K-1$ dummy variables. One category is the reference (all dummies = 0).

3-level race example (eq. 8.2.2):

$Z_1 = 1$ if black, 0 otherwise.
$Z_2 = 1$ if white, 0 otherwise.
Reference = Hispanic ($Z_1 = Z_2 = 0$).

Hazard rates:

Black: $h(t|Z_1=1, Z_2=0) = h_0(t) \exp(\beta_1)$.
White: $h(t|Z_1=0, Z_2=1) = h_0(t) \exp(\beta_2)$.
Hispanic: $h_0(t)$ (reference).

Hazard ratios:

RR(Black/Hispanic) = $\exp(\beta_1)$.
RR(White/Hispanic) = $\exp(\beta_2)$.
RR(Black/White) = $\exp(\beta_1 - \beta_2)$.

Trap: do not use a single ordinal Z = 1, 2, …, K

Wrong coding: a 4-stage cancer entered as a single covariate $Z = 1, 2, 3, 4$.

Problem: under the Cox PH model this forces

\[ \frac{h(t|Z=2)}{h(t|Z=1)} = \frac{h(t|Z=3)}{h(t|Z=2)} = \frac{h(t|Z=4)}{h(t|Z=3)} = e^\beta \]

→ A strong assumption: “the RR between adjacent categories is the same everywhere”.

Example (race, with black = 1, white = 2, Hispanic = 3 as the single score):

\[ \text{RR(white/black)} = \text{RR(Hispanic/white)} = e^\beta, \quad \text{RR(Hispanic/black)} = e^{2\beta} \]

This assumption is almost never tenable: race has no natural order, and the mortality ratios between races are not equally spaced.

Correct handling: $K-1$ dummy variables → each RR is estimated separately.

Exception: a single covariate is acceptable only for a genuinely ordinal variable whose steps can be taken as equally spaced on the log-hazard scale (e.g. stage, if equal-step progression is guaranteed). Even then, verify it with diagnostics.

→ K-1 dummies are the standard (safe, flexible, and clear to interpret).
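As a concrete picture of K-1 dummy coding, the following sketch builds the indicator columns by hand. The function name and the five example stage labels are illustrative only; in practice a statistics package does this when the variable is declared categorical.

```python
def dummy_code(levels, reference):
    """Return K-1 indicator columns, one per non-reference level of a categorical variable."""
    categories = [c for c in sorted(set(levels)) if c != reference]
    return {c: [1 if value == c else 0 for value in levels] for c in categories}


# Hypothetical stage labels for five patients.
stages = ["I", "III", "II", "IV", "I"]
dummies = dummy_code(stages, reference="I")

for name, column in dummies.items():
    print(name, column)
# Stage I patients have 0 in every column, which makes Stage I the reference.
```

Choosing a different `reference` changes which comparisons are read off directly as $\exp(\beta_k)$, but not the fitted model.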

3.4 Continuous Covariate: changing the unit

Intuition: the effect of the unit of age

$Z = $ age in years.

RR per year:

\[ \frac{h(t \mid \text{age}=51)}{h(t \mid \text{age}=50)} = \exp(\beta) = \text{RR per year} \]

RR per decade:

\[ \frac{h(t \mid \text{age}=60)}{h(t \mid \text{age}=50)} = \exp(10\beta) = \text{RR per decade} \]

Relationship: $\exp(10\beta) = [\exp(\beta)]^{10}$.

Example (Klein Example 8.2 larynx): $b_4 = 0.0189$.

RR per year = $e^{0.0189} = 1.019$ (looks small).
RR per decade = $e^{0.189} = 1.21$ (a 21% increase in hazard per 10 years, which is meaningful).

→ When reporting clinically, pick the unit with care: a comparison such as age 50 vs age 40 is more intuitive.
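The rescaling is a one-line computation. A small sketch using the age coefficient quoted above; the helper name is illustrative.

```python
import math


def rr_for_increment(b, units):
    """Relative risk for an increase of `units` in a continuous covariate with coefficient b."""
    return math.exp(b * units)


b_age = 0.0189  # Klein Example 8.2, age in years

rr_per_year = rr_for_increment(b_age, 1)
rr_per_decade = rr_for_increment(b_age, 10)
print(round(rr_per_year, 3), round(rr_per_decade, 2))  # 1.019 1.21
```

Equivalently, age could be entered in decades before fitting; the coefficient would then be $10 b$ and its standard error 10 times larger, leaving the test statistic unchanged.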

Assumption behind a continuous covariate:

Linear effect on the log scale: $\log h$ is a straight line in $Z$.
If violated (e.g. U-shape, threshold) → discretization (§ 8.6) or splines.
Verify with the Ch.11 diagnostics (martingale residuals).

3.5 Interaction: Product Term

Definition: Interaction Term (eq. 8.2.6)

Example: does the treatment effect differ by sex?

$Z_1 = 1$ if treatment 1, 0 if treatment 2.
$Z_2 = 1$ if male, 0 if female.
$Z_3 = Z_1 \times Z_2$ (interaction).

Model:

\[ h(t \mid Z) = h_0(t) \exp(\beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3) \]

RR of treatment 1 vs 2:

Female ($Z_2 = 0$): $\exp(\beta_1)$.
Male ($Z_2 = 1$): $\exp(\beta_1 + \beta_3)$.

$\beta_3 = 0$: the treatment effect does not depend on sex (no interaction).

$\beta_3 \neq 0$: interaction is present.

→ Interaction test: $H_0: \beta_3 = 0$ → the local test of § 8.5.

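Intuition: when to include an interaction

Driven by a clinical hypothesis:

Does immunotherapy work differently by sex? → treatment × sex.
Does the drug effect differ by age group? → treatment × age.
Does the treatment effect differ by stage? → treatment × stage (a variant of the stage × age interaction in Klein Example 8.2).

Exploratory analysis:

Visualise subgroup RRs with a forest plot.
Assess statistical significance with an interaction test.

Cautions:

Multiple interactions need a large sample (avoid sparse cells).
High-order interactions (3-way and above) are hard to interpret.
Weigh both clinical relevance and statistical significance.

→ An interaction trades model refinement against interpretive burden.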

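4 Klein Example 8.1: Breast Cancer Immunoperoxidase

Data (Klein § 1.5)

45 lymph-node-negative breast cancer patients from the OSU Cancer Registry (10+ years of follow-up).

$Z = 1$ if IH+ (immunoperoxidase positive, 9 patients).
$Z = 0$ if IH- (36 patients).

Model and interpretation

Model:

\[ h(t \mid Z) = h_0(t) \exp(\beta Z) \]

RR(IH+/IH-) = $\exp(\beta)$.

Estimate from § 8.3: $b = 0.9802$.

→ $\text{RR} = e^{0.9802} = 2.67$.

Clinical interpretation:

“At any given time, an IH+ patient's hazard of death is 2.67 times that of an IH- patient.”

Significance:

Patients are diagnosed as lymph-node negative by standard light microscopy (SLM), which is assumed to mean a good prognosis.
However, the IH stain can detect micrometastases that SLM misses.
IH+ patients are hidden positives whose mortality differs from that of true negatives.
→ This is the statistical evidence for the clinical importance of IH testing.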

5 Klein Example 8.2: Larynx, 4 Stages

Data (Klein § 1.8)

90 male patients with larynx cancer.

Outcome: time to death.
Covariates: age (years), stage (I, II, III, IV).

5.1 Stage-only model: eq. 8.2.3

K-1 = 3 dummy variables for the K = 4 stages

Eq. 8.2.3:

$Z_1 = 1$ if Stage II, 0 otherwise.
$Z_2 = 1$ if Stage III, 0 otherwise.
$Z_3 = 1$ if Stage IV, 0 otherwise.
Stage I = reference ($Z_1 = Z_2 = Z_3 = 0$).

Model:

\[ h(t \mid Z) = h_0(t) \exp(\beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3) \]

Results from § 8.4 (Breslow handling of ties):

\[ b_1 = 0.0658, \quad b_2 = 0.6121, \quad b_3 = 1.7228 \]

Hazard Ratios:

Comparison	RR	Interpretation
Stage II vs I	$e^{0.0658} = 1.07$	nearly identical
Stage III vs I	$e^{0.6121} = 1.84$	1.84 times the hazard
Stage IV vs I	$e^{1.7228} = 5.60$	5.60 times the hazard
Stage III vs II	$e^{0.6121-0.0658} = e^{0.5463} = 1.73$	1.73 times the hazard

Comparison with the trap: a single ordinal score

If stage were entered as a single covariate $Z = 1, 2, 3, 4$:

RR(II/I) = RR(III/II) = RR(IV/III) = $e^\beta$ (all forced to be equal).
In the data, RR(II/I) = 1.07, RR(III/II) = 1.73 and RR(IV/III) = 5.60/1.84 = 3.04, so the adjacent ratios differ by almost a factor of 3.

→ The equal-spacing assumption is violated, so a single ordinal score misrepresents the stage effect.

→ The 3 dummies are the right answer.
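The adjacent-stage ratios follow directly from eq. 8.1.3 as differences of coefficients. A short sketch using the stage-only estimates quoted above, with the reference stage given coefficient 0:

```python
import math

# Stage-only model (eq. 8.2.3); Stage I is the reference, so its coefficient is 0.
b_stage = {"I": 0.0, "II": 0.0658, "III": 0.6121, "IV": 1.7228}


def relative_risk(stage, reference):
    """RR of `stage` versus `reference`: exp of the difference in coefficients."""
    return math.exp(b_stage[stage] - b_stage[reference])


adjacent = {
    "II/I": relative_risk("II", "I"),
    "III/II": relative_risk("III", "II"),
    "IV/III": relative_risk("IV", "III"),
}
for label, rr in adjacent.items():
    print(label, round(rr, 2))
# A single ordinal score would force these three ratios to be equal.
```

The same function gives any other pairwise comparison, for example Stage IV vs II, without refitting the model.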

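Score test comparison (Klein Example 8.2):

Single ordinal score: $\chi^2 = 13.64$ (1 df).
3 dummies (score test of $\beta_1 = \beta_2 = \beta_3 = 0$): the multivariate test of § 8.5 gives $\chi^2 = 18.95$ (3 df).

→ The 3-dummy model is more faithful to the data but spends more degrees of freedom. A single ordinal score has more power when its assumption holds, but leads to wrong conclusions when it does not.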
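To compare statistics with different degrees of freedom it helps to put them on the p-value scale. The sketch below uses the closed-form chi-square tail probabilities for 1 and 3 df (both expressible through the complementary error function) and applies them to the two statistics quoted above; the p-values are derived here, not taken from Klein.

```python
import math


def chi2_sf_1df(x):
    """Upper tail P(X > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))


def chi2_sf_3df(x):
    """Upper tail P(X > x) for a chi-square variable with 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_ordinal = chi2_sf_1df(13.64)   # single ordinal score, 1 df
p_dummies = chi2_sf_3df(18.95)   # three dummies, 3 df
print(f"{p_ordinal:.5f} {p_dummies:.5f}")
```

Both p-values come out well below 0.001, so either coding detects a stage effect here; the codings differ in what they assume about its shape, not in whether it exists.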

5.2 Stage + Age model: eq. 8.2.4

Adding continuous age

Eq. 8.2.4:

\[ h(t \mid Z) = h_0(t) \exp(\beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3 + \beta_4 Z_4) \]

with $Z_4 = $ age in years.

Results from § 8.5 (full model):

\[ \mathbf{b} = (0.1386, 0.6383, 1.6931, 0.0189) \]

Hazard Ratios:

Comparison	RR
Stage II vs I (age-adjusted)	$e^{0.1386} = 1.15$
Stage III vs I (age-adjusted)	$e^{0.6383} = 1.89$
Stage IV vs I (age-adjusted)	$e^{1.6931} = 5.44$
Age 50 vs age 40 (same stage)	$e^{10 \cdot 0.0189} = e^{0.189} = 1.21$

Interpretation:

“A 50-year-old patient has 1.21 times the hazard of death of a 40-year-old patient at the same stage.”
“A Stage IV patient has 5.44 times the hazard of a Stage I patient of the same age.”
After adjusting for age the stage effects are almost unchanged (1.15, 1.89, 5.44 vs 1.07, 1.84, 5.60 without age).

→ This suggests that age and stage are nearly unrelated in this sample (weak correlation), so adjusting for age barely changes the stage effects.
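With several covariates, any two patient profiles can be compared by exponentiating the difference of their linear predictors (eq. 8.1.3). A minimal sketch with the full-model estimates above; the function names and the profile tuples are illustrative.

```python
import math

# Full model (eq. 8.2.4): three stage dummies plus age in years.
COEF = {"stage_II": 0.1386, "stage_III": 0.6383, "stage_IV": 1.6931, "age": 0.0189}


def linear_predictor(stage, age):
    """beta'Z for a patient; Stage I is the reference and contributes 0."""
    lp = COEF["age"] * age
    if stage != "I":
        lp += COEF[f"stage_{stage}"]
    return lp


def relative_risk(profile, reference):
    """RR between two (stage, age) profiles; the baseline hazard cancels."""
    return math.exp(linear_predictor(*profile) - linear_predictor(*reference))


print(round(relative_risk(("IV", 50), ("I", 50)), 2))  # Stage IV vs I, same age
print(round(relative_risk(("I", 50), ("I", 40)), 2))   # age 50 vs 40, same stage
print(relative_risk(("IV", 60), ("I", 50)))            # both covariates differ
```

The last line shows that effects multiply: the RR for an older Stage IV patient versus a younger Stage I patient is the product of the stage RR and the age RR.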

5.3 Age × Stage Interaction: eq. 8.2.7

Introducing the interaction

Hypothesis: does the age effect differ by stage? (e.g. synergy between old age and high stage.)

Product terms:

$Z_5 = Z_1 \cdot Z_4$ (Stage II × age).
$Z_6 = Z_2 \cdot Z_4$ (Stage III × age).
$Z_7 = Z_3 \cdot Z_4$ (Stage IV × age).

Model (eq. 8.2.7):

\[ h(t \mid Z) = h_0(t) \exp\left(\sum_{k=1}^7 \beta_k Z_k\right) \]

Example: a 50-year-old male with Stage II disease:

$Z_1 = 1, Z_2 = Z_3 = 0$.
$Z_4 = 50$.
$Z_5 = Z_1 \cdot Z_4 = 1 \cdot 50 = 50$.
$Z_6 = Z_7 = 0$.

→ Hazard $= h_0(t) \exp(\beta_1 + 50 \beta_4 + 50 \beta_5)$.
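The seven covariates of eq. 8.2.7 can be generated mechanically from stage and age. A small sketch (the function name is illustrative) that reproduces the worked example for a 50-year-old Stage II patient:

```python
def design_row(stage, age):
    """Covariate vector (Z1..Z7): stage dummies, age, and stage x age product terms."""
    z1, z2, z3 = (int(stage == s) for s in ("II", "III", "IV"))
    z4 = age
    return [z1, z2, z3, z4, z1 * z4, z2 * z4, z3 * z4]


row = design_row("II", 50)
print(row)  # [1, 0, 0, 50, 50, 0, 0]

# Only the non-zero entries enter the linear predictor: beta_1 + 50*beta_4 + 50*beta_5.
active = [f"{z}*beta_{k}" for k, z in enumerate(row, start=1) if z != 0]
print(" + ".join(active))
```

For a Stage I patient all product terms are zero, so $\beta_4$ alone is the age effect in the reference stage and $\beta_5, \beta_6, \beta_7$ are the departures from it.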

Interaction test:

$H_0$: $\beta_5 = \beta_6 = \beta_7 = 0$ (no interaction).

→ A 3-dimensional subset test from § 8.5 (Wald, LR, Score). This part covers only the coding; the test results are left to the § 8.5 deep-dive.

6 Klein Example 8.3: Kidney Transplant (4 Groups)

Data (Klein § 1.7)

863 kidney transplant patients in 4 race × gender groups.

White male: 432.
Black male: 92.
White female: 280.
Black female: 59.

→ Reference = white female (the second-largest group; white males, with 432, are the largest).

6.1 Coding A: 4-Group Dummies

Direct 4-group coding

Eq. 8.2.x:

$Z_1 = 1$ if black male, 0 otherwise.
$Z_2 = 1$ if white male, 0 otherwise.
$Z_3 = 1$ if black female, 0 otherwise.
Reference: white female.

Results from § 8.5:

\[ b_1 = 0.1596, \quad b_2 = 0.2484, \quad b_3 = 0.6567 \]

Hazard Ratios (vs white female):

Group	RR
Black male	$e^{0.1596} = 1.17$
White male	$e^{0.2484} = 1.28$
Black female	$e^{0.6567} = 1.93$

Interpretation:

Black females have the highest mortality (RR 1.93 vs white females).
Both male groups (black and white) are slightly above white females.
This coding shows only each group's effect relative to the reference; it does not separate the race and gender main effects from their interaction.

6.2 Coding B: Main Effects + Interaction

$2 \times 2$ Factorial Coding

Eq. 8.2.x (alternative):

$Z_1 = 1$ if female, 0 if male.
$Z_2 = 1$ if black, 0 if white.
$Z_3 = Z_1 \cdot Z_2$ (female × black, i.e. the black female indicator).

Results from § 8.5:

\[ b_1 = -0.2484, \quad b_2 = -0.0888, \quad b_3 = 0.7455 \]

Key point: $\exp(\beta_3) = e^{0.7455} = 2.11$ is the factor by which the black-vs-white RR among females exceeds the black-vs-white RR among males.

RRs of the 4 groups under Coding B (its baseline is white male, where $Z_1 = Z_2 = Z_3 = 0$):

Group	$Z_1, Z_2, Z_3$	Linear combination	RR vs white male
White male (baseline)	0, 0, 0	0	1
White female	1, 0, 0	$\beta_1 = -0.2484$	$e^{-0.2484} = 0.78$
Black male	0, 1, 0	$\beta_2 = -0.0888$	$e^{-0.0888} = 0.92$
Black female	1, 1, 1	$\beta_1 + \beta_2 + \beta_3 = 0.4083$	$e^{0.4083} = 1.50$

Transforming to white female as the reference:

Divide each group's RR by the white female RR, i.e. subtract $\beta_1$ in the exponent:

Black male / white female: $e^{-0.0888 - (-0.2484)} = e^{0.1596} = 1.17$ ✓.
White male / white female: $e^{0 - (-0.2484)} = e^{0.2484} = 1.28$ ✓.
Black female / white female: $e^{-0.2484-0.0888+0.7455 - (-0.2484)} = e^{0.6567} \approx 1.93$ ✓.

→ The RRs from Coding A and Coding B agree exactly. The likelihoods of the two codings are also identical.
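The agreement is no coincidence: the two codings are linear reparametrisations of the same four-group model. The sketch below writes the mapping from the Coding B coefficients to the Coding A coefficients explicitly and checks it against the estimates quoted above; the function name is illustrative.

```python
import math


def coding_b_to_a(beta_female, beta_black, beta_interaction):
    """Map Coding B (female, black, female x black) to Coding A coefficients
    (black male, white male, black female), all relative to white female."""
    black_male = beta_black - beta_female
    white_male = -beta_female
    black_female = beta_black + beta_interaction
    return black_male, white_male, black_female


b_a = coding_b_to_a(-0.2484, -0.0888, 0.7455)
print([round(b, 4) for b in b_a])            # [0.1596, 0.2484, 0.6567]
print([round(math.exp(b), 2) for b in b_a])  # [1.17, 1.28, 1.93]
```

Each Coding A coefficient is the Coding B linear predictor of that group minus the linear predictor of white females ($\beta_1$), which is exactly the subtraction carried out in the three lines above.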

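Criteria for choosing a coding:

Coding A (4-group dummies): RRs can be reported directly; natural for subgroup analysis.
Coding B (main effects + interaction): allows a test of the interaction effect ($H_0: \beta_3 = 0$).

→ Choose according to the clinical question. The statistical conclusions are the same.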

7 Practical Notes

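Practical Note 1: a guide to choosing a coding

Choosing the reference group:

The largest group (power): the group with the largest sample size.
The most normal / control group (interpretation): the baseline for the RRs.
The lowest-risk group (positive β): gives an intuitive interpretation.

Handling categorical variables:

K=2: a single dummy (either direction).
K≥3: K-1 dummies recommended (avoids the single-ordinal trap).
Genuinely ordinal (e.g. dose levels): a single score plus diagnostics (check linearity).

Handling continuous variables:

Standardise the unit (e.g. age in decades, for intuition).
Check the linearity assumption (martingale residuals).
If violated: log transform, spline, or discretize.

Interaction:

Include only when there is a clear clinical hypothesis (exploratory interactions fall into the multiple-comparison trap).
Watch for sparse cells (subgroup estimates are unstable in small samples).

Practical Note 2: the likelihood does not depend on the coding

Coding A (4-group dummies) and Coding B (main effects + interaction) give the same maximised partial likelihood.

→ The global test statistics (Wald, LR, Score) for the full set of group coefficients are therefore also identical. Tests of individual coefficients differ, because each coding's coefficients answer different questions.

→ Different codings are different lenses on the same data: only the interpretation and the sub-questions change.

This can be confirmed by fitting the different codings with R survival::coxph and comparing the results.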
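The invariance can also be illustrated without any fitting. In the sketch below, a hand-written Breslow-form partial log-likelihood is evaluated under both codings at matching coefficient values; the eight observations are made-up toy data, and the coefficients are the Example 8.3 estimates used only as convenient matching values.

```python
import math


def partial_loglik(times, events, rows, beta):
    """Cox partial log-likelihood (Breslow form) for covariate rows and coefficients beta."""
    lp = [sum(b * z for b, z in zip(beta, row)) for row in rows]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i:
            risk_sum = sum(math.exp(lp[j]) for j, t_j in enumerate(times) if t_j >= t_i)
            total += lp[i] - math.log(risk_sum)
    return total


# Hypothetical toy data (not the kidney transplant data).
times = [2, 3, 5, 7, 8, 11, 13, 14]
events = [1, 1, 0, 1, 1, 0, 1, 1]
black = [1, 0, 1, 0, 0, 1, 1, 0]
female = [0, 0, 1, 1, 0, 0, 1, 1]

# Coding A: black male, white male, black female (white female = reference).
rows_a = [[b * (1 - f), (1 - b) * (1 - f), b * f] for b, f in zip(black, female)]
# Coding B: female, black, female x black (white male = baseline).
rows_b = [[f, b, f * b] for b, f in zip(black, female)]

beta_b = [-0.2484, -0.0888, 0.7455]
beta_a = [beta_b[1] - beta_b[0], -beta_b[0], beta_b[1] + beta_b[2]]

ll_a = partial_loglik(times, events, rows_a, beta_a)
ll_b = partial_loglik(times, events, rows_b, beta_b)
print(ll_a, ll_b)  # identical up to floating-point error
```

The two linear predictors differ by the same constant for every subject, and a constant cancels in each ratio of the partial likelihood. Because this holds for every pair of matching coefficient vectors, the two maximised likelihoods coincide as well.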

8 Application areas

Field	Main covariates	Coding
Clinical trials	treatment, sex, age	dummy + continuous
Oncology	stage, grade, biomarker	K-1 dummy + continuous
Cardiovascular epidemiology	BP, cholesterol, smoking	continuous + dummy
Phase I dose finding	dose level	single ordinal (equal-spacing assumption) or dummy
Drug development	treatment × biomarker	interaction
Demography	age, sex, race, education	dummy + continuous + interaction

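9 Code examples

9.1 Step 1: R `coxph`, Dichotomous

library(survival)
library(KMsurv)  # assumption: its btrial data set holds the Klein § 1.5 data

# Klein Example 8.1: breast cancer immunoperoxidase
data(btrial)     # assumed columns: time, death, im (1 = IH-, 2 = IH+)
btrial$ih <- as.integer(btrial$im == 2)  # 1 = IH+, 0 = IH-
fit <- coxph(Surv(time, death) ~ ih, data = btrial)
summary(fit)

# Values reported in the text:
# coef = 0.9802
# exp(coef) = 2.67 (RR)
# se(coef) = 0.4349
# 95% CI for RR: (1.14, 6.25)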

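9.2 Step 2: K-1 Dummy + Continuous

# Klein Example 8.2: larynx, 4 stages + age
library(survival)
library(KMsurv)  # provides the larynx data (stage, time, age, diagyr, delta)
data(larynx)

# Make stage a factor (K-1 dummies are created automatically, stage 1 = reference)
larynx$stage <- as.factor(larynx$stage)

# ties = "breslow" matches the tie handling quoted in the text (coxph defaults to Efron)
fit <- coxph(Surv(time, delta) ~ stage + age, data = larynx, ties = "breslow")
summary(fit)
# Values reported in the text:
# stage2: b=0.1386, RR=1.15
# stage3: b=0.6383, RR=1.89
# stage4: b=1.6931, RR=5.44
# age: b=0.0189, RR per yr=1.02

# RR per decade (10 years)
exp(10 * 0.0189)  # 1.21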

9.3 Step 3: Interaction (R)

# Klein Example 8.2: age × stage interaction
fit_int <- coxph(Surv(time, delta) ~ stage * age, data = larynx, ties = "breslow")
summary(fit_int)
# adds stage2:age, stage3:age, stage4:age

# Interaction test (likelihood ratio test, 3 df)
fit_no_int <- coxph(Surv(time, delta) ~ stage + age, data = larynx, ties = "breslow")
anova(fit_no_int, fit_int)  # anova() on nested coxph fits reports the LR chi-square

9.4 Step 4: Two Coding Equivalence (Python)

import pandas as pd
from lifelines import CoxPHFitter

# Klein Example 8.3: kidney transplant, two codings.
# The caller supplies df with columns:
#   time, event (1 = death), race ('black' or 'white'), gender ('male' or 'female')

def fit_two_codings(df):
    base = df[['time', 'event']].copy()
    black = (df['race'] == 'black').astype(int)
    female = (df['gender'] == 'female').astype(int)

    # Coding A: 4-group dummies, white female as the reference
    df_a = base.assign(
        black_male=black * (1 - female),
        white_male=(1 - black) * (1 - female),
        black_female=black * female,
    )
    fit_a = CoxPHFitter().fit(df_a, duration_col='time', event_col='event')

    # Coding B: main effects + interaction
    df_b = base.assign(female=female, black=black, black_female=female * black)
    fit_b = CoxPHFitter().fit(df_b, duration_col='time', event_col='event')
    return fit_a, fit_b

# Usage, once df has been loaded:
# fit_a, fit_b = fit_two_codings(df)
# print(fit_a.log_likelihood_, fit_b.log_likelihood_)  # should agree up to numerical tolerance
# print(fit_a.summary)  # the exp(coef) column holds the RRs
# print(fit_b.summary)

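10 Key takeaways

Five lessons from § 8.1~8.2

Cox PH model, eq. 8.1.2: $h(t|Z) = h_0(t) \exp(\beta'Z)$. Semiparametric (nonparametric baseline + parametric effect). The hazard ratio of eq. 8.1.3 does not depend on time. Lehmann alternative: $S(t|Z) = S_0(t)^{\exp(\beta'Z)}$.
Coding is consistent with ordinary regression: dummy variables, continuous covariates, and interactions all follow the OLS and logistic rules. Nothing is specific to Cox PH.
K-1 dummies recommended: the standard treatment of a K-level qualitative variable. It avoids the trap of a single ordinal $Z = 1, 2, \ldots, K$ (which assumes equal RRs between adjacent levels). Klein Example 8.2 larynx: with 3 dummies, RR(II/I)=1.07 and RR(IV/I)=5.60 (not equally spaced).
Effect of the unit of a continuous covariate: Klein Example 8.2 age, $b = 0.0189$ → RR per year = 1.019 (looks small) vs RR per decade = 1.21 (clearly meaningful). Choose the reporting unit with care.
Equivalence of Coding A and B (Klein Example 8.3): 4-group dummies and main effects + interaction give exactly the same likelihood and RRs. A coding is a choice of lens; pick the one that matches the clinical question.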

11 Related topics

Prerequisites

Follow-up topics

§ 8.3~8.4: Partial Likelihood + Ties
§ 8.5: Local Tests (stage effect and interaction tests for Klein Example 8.2)
Ch.9: Time-Varying Coefficients (handling PH violations)
Ch.11: Schoenfeld Residuals (diagnosing the PH assumption)

Related concepts

Dummy and interaction coding in OLS and logistic regression (same rules)
Lehmann alternative: what Cox PH implies for survival curves
Linear model formulation: the linearity assumption on $\log h$
Clinical guidance on choosing the reference group