Kwangmin Kim - Klein Ch.1 § 1.11~1.12 In Depth: Tongue Cancer (DNA Ploidy)

1 Introduction: Additional Examples from Ch.1

Where this post sits in the Klein series:

Post	Topic
Ch.1 Overview (01)	Catalog of 19 examples
§ 1.1~1.2 (01-1)	6-MP Leukemia
§ 1.3~1.4 (01-2)	BMT + Dialysis
§ 1.5~1.6 (01-3)	Breast Cancer + Burn
§ 1.7~1.8 (01-4)	Kidney Transplant + Laryngeal
§ 1.9~1.10 (01-5)	Auto/Allo BMT + Lymphoma BMT
§ 1.11~1.12 (this post)	Tongue Cancer (DNA Ploidy) + STD Reinfection
§ 1.13~1.19 (planned or skipped)	7 additional examples

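The two layers of Ch.1

Core 9 examples (§ 1.2~1.10): used in the main text of the book to demonstrate the methods.

Standard examples throughout Klein's 13 chapters.
A single dataset appears in multiple chapters (BMT in 6 chapters, etc.).

Exercise 9 examples (§ 1.11~1.19): introduced briefly in the main text and used in the exercises.

Left to the student for independent analysis.
Data characteristics differ from the core sets (small samples, large samples, varied covariates).
§ 1.11~1.12 are the two extremes: small (80) vs. large (877).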

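2 § 1.11 Tongue Cancer: Prognostic Effect of DNA Ploidy

2.1 Medical Background: DNA Ploidy and Tumors

2.1.1 What DNA Content Means

Normal human cell: 2n (diploid, 46 chromosomes).

Variation in tumor cells:

Diploid: retains 2n DNA content. Usually less aggressive.
Aneuploid: abnormal DNA content other than 2n (3n, 4n, or a non-integer multiple). Usually more aggressive.
Hyperdiploid: slightly more than 2n (a form of aneuploidy).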

2.1.2 Measurement by Flow Cytometry

Sickle-Santanello et al. (1988):

Tumor tissue → single-cell suspension → DNA staining (DAPI)
    ↓
Pass through the flow cytometer
    ↓
Measure the DNA content of each cell
    ↓
Histogram analysis → ploidy classification

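Intuition: Prognostic Meaning of Aneuploidy

Why aneuploid tumors tend to be aggressive:

Abnormal DNA content = genomic instability.
Changes in chromosome number → loss of tumor suppressor genes.
Faster accumulation of mutations → faster progression and metastasis.

Clinical significance:

Aneuploid patients: more aggressive treatment (chemotherapy + radiation).
Diploid patients: conservative treatment.
Ploidy serves as a prognostic biomarker.

This dataset (tongue cancer): used to test the effect of ploidy.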

2.2 Data: Table 1.6

2.2.1 Aneuploid Tumors (52 patients)

Death times (31): 1, 3, 3, 4, 10, 13, 13, 16, 16, 24, 26, 27, 28, 30, 30, 32, 41, 51, 65, 67, 70, 72, 73, 77, 91, 93, 96, 100, 104, 157, 167.
Censored (21): 61+, 74+, 79+, 80+, 81+, 87+ (×2), 88+, 89+, 93+, 97+, 101+, 104+, 108+, 109+, 120+, 131+, 150+, 231+, 240+, 400+.

2.2.2 Diploid Tumors (28 patients)

Death times (22): 1, 3, 4, 5, 5, 8, 12, 13, 18, 23, 26, 27, 30, 42, 56, 62, 69, 104, 104, 112, 129, 181.
Censored (6): 8+, 67+, 76+, 104+, 176+, 231+.
Time unit: weeks.
Total: 80 patients.

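Intuition: First Look at the Data

Aneuploid (n=52):

Event proportion: 31/52 = 60%.
Median of the 31 observed death times (censored cases ignored): 32 weeks.
Long-term survivors (231+, 240+, 400+) are present → heterogeneity.

Diploid (n=28):

Event proportion: 22/28 = 79%.
Median of the 22 observed death times (censored cases ignored): 26.5 weeks.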
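An apparent paradox:

The crude death proportion is lower for aneuploid (60%) and higher for diploid (79%), which seems to contradict the medical intuition.

Possible explanations:

Different censoring proportions (aneuploid 40% vs. diploid 21%).
Aneuploid patients have longer follow-up.
Death proportions alone are not enough: the time distributions must be compared with KM curves and a log-rank test.

→ A crude rate can lead to a wrong conclusion, so a time-to-event analysis is essential.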
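
To make the crude comparison concrete, the following Python sketch computes, for each group in Table 1.6, the death proportion and a simple person-time rate (deaths per patient-week of follow-up). The helper name `crude_summary` is ours and not from the book, and the person-time rate implicitly assumes a roughly constant hazard, so it is only a first step toward a proper KM and log-rank analysis.

```python
# Klein Table 1.6 (weeks): death and censoring times by ploidy group
aneuploid_death = [1, 3, 3, 4, 10, 13, 13, 16, 16, 24, 26, 27, 28, 30, 30,
                   32, 41, 51, 65, 67, 70, 72, 73, 77, 91, 93, 96, 100,
                   104, 157, 167]
aneuploid_cens = [61, 74, 79, 80, 81, 87, 87, 88, 89, 93, 97, 101, 104,
                  108, 109, 120, 131, 150, 231, 240, 400]
diploid_death = [1, 3, 4, 5, 5, 8, 12, 13, 18, 23, 26, 27, 30, 42, 56,
                 62, 69, 104, 104, 112, 129, 181]
diploid_cens = [8, 67, 76, 104, 176, 231]


def crude_summary(deaths, censored):
    """Return (death proportion, deaths per person-week of follow-up)."""
    n = len(deaths) + len(censored)
    person_weeks = sum(deaths) + sum(censored)  # total observed follow-up
    return len(deaths) / n, len(deaths) / person_weeks


prop_a, rate_a = crude_summary(aneuploid_death, aneuploid_cens)
prop_d, rate_d = crude_summary(diploid_death, diploid_cens)

print(f"aneuploid: proportion = {prop_a:.2f}, rate = {rate_a:.4f} per week")
print(f"diploid:   proportion = {prop_d:.2f}, rate = {rate_d:.4f} per week")
print(f"rate ratio (diploid / aneuploid) = {rate_d / rate_a:.2f}")
```

In this dataset the person-time rate is also higher in the diploid group, so the longer follow-up of the aneuploid patients does not by itself reverse the crude ordering.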

2.3 Use in Klein's Book

Exercises only (not analyzed in the main text).
Independent analysis by the student:
- KM by ploidy.
- Log-rank test.
- Cox PH.
- Comparison of median survival.
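
As a minimal sketch of what the log-rank item in the list above involves, the following pure-Python code accumulates observed and expected deaths in the aneuploid group over the distinct death times and forms the usual chi-square statistic with the hypergeometric variance. The function name `logrank_two_groups` is ours. It is meant for checking intuition, not as a replacement for `survdiff` or `lifelines`.

```python
from collections import Counter

aneuploid_death = [1, 3, 3, 4, 10, 13, 13, 16, 16, 24, 26, 27, 28, 30, 30,
                   32, 41, 51, 65, 67, 70, 72, 73, 77, 91, 93, 96, 100,
                   104, 157, 167]
aneuploid_cens = [61, 74, 79, 80, 81, 87, 87, 88, 89, 93, 97, 101, 104,
                  108, 109, 120, 131, 150, 231, 240, 400]
diploid_death = [1, 3, 4, 5, 5, 8, 12, 13, 18, 23, 26, 27, 30, 42, 56,
                 62, 69, 104, 104, 112, 129, 181]
diploid_cens = [8, 67, 76, 104, 176, 231]

a_times = aneuploid_death + aneuploid_cens
a_events = [1] * len(aneuploid_death) + [0] * len(aneuploid_cens)
d_times = diploid_death + diploid_cens
d_events = [1] * len(diploid_death) + [0] * len(diploid_cens)


def logrank_two_groups(times1, events1, times2, events2):
    """Two-sample log-rank test: returns (O1, E1, variance, chi-square)."""
    d1 = Counter(t for t, e in zip(times1, events1) if e)
    d2 = Counter(t for t, e in zip(times2, events2) if e)
    obs1 = exp1 = var = 0.0
    for t in sorted(set(d1) | set(d2)):
        n1 = sum(1 for x in times1 if x >= t)  # at risk in group 1 just before t
        n2 = sum(1 for x in times2 if x >= t)  # at risk in group 2 just before t
        n, d = n1 + n2, d1[t] + d2[t]
        obs1 += d1[t]
        exp1 += d * n1 / n                     # expected deaths under H0
        if n > 1:
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return obs1, exp1, var, (obs1 - exp1) ** 2 / var


obs1, exp1, var, chi2 = logrank_two_groups(a_times, a_events, d_times, d_events)
print(f"aneuploid: observed = {obs1:.0f}, expected = {exp1:.1f}")
print(f"chi-square (1 df) = {chi2:.2f}")
```

The statistic is referred to a chi-square distribution with 1 degree of freedom. Observed deaths below the expected count in a group indicate better-than-pooled survival for that group.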

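3 § 1.12 STD Reinfection: The Core Group Hypothesis

3.1 Public Health Background

3.1.1 Gonorrhea & Chlamydia

Gonorrhea: caused by the bacterium Neisseria gonorrhoeae.
Chlamydia: caused by the bacterium Chlamydia trachomatis.
Common features:
- Sexually transmitted.
- Often asymptomatic in women → delayed diagnosis.
- If untreated: PID (pelvic inflammatory disease), infertility.
- Easily treated with antibiotics.
- Reinfection is nevertheless common.

3.1.2 The Core Group Hypothesis

Intuition: Statistical Implications of the Core Group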

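Hypothesis (Yorke et al. 1978; May & Anderson 1988):

“A small fraction of the population (5~10%) transmits 60~80% of STDs: the core group.”

Characteristics of this group:

Frequent partner change.
Low condom use.
Asymptomatic infection → delayed treatment.
Repeated reinfection.

Public health implications:

Campaigns aimed at the general population = inefficient.
Targeted intervention on the core group = efficient.
“Eradicating STDs” hinges on identifying and treating the core group.

Statistical challenge:

Identifying the core group = identifying “the subgroup at high risk of reinfection”.
Multivariate Cox + risk score model.
Extract the social and behavioral characteristics of high-risk patients.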

3.2 Data Structure

n: 877 (large sample).
Event: first reinfection (gonorrhea or chlamydia).
Time: from diagnosis to reinfection (or censoring).

3.2.1 Demographic Variables

Variable	Distribution
Race	33% white, 67% black
Marital status	7% divorced/separated, 3% married, 90% single
Age	mean 20.6 (range 13-48)
Schooling	mean 11.4 years (range 6-18)
Initial infection	16% gonorrhea, 45% chlamydia, 39% both

3.2.2 Behavioral Variables

Variable	Distribution
Partners (last 30 days)	mean 1.27 (range 0-19)
Oral sex (past 12 months)	33%
Rectal sex (past 12 months)	6%
Condom use	6% always, 58% sometimes, 36% never

3.2.3 Symptoms at Initial Diagnosis

Symptom	Frequency
Abdominal pain	14%
Discharge	46%
Dysuria (painful urination)	13%
Itch	19%
Lesion	3%
Rash	3%
Lymph involvement	1%

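Intuition: Three Categories of Variables

Demographic (5):

High measurement reliability (objective).
Used to control for confounding.

Behavioral (4):

Self-reported → risk of bias.
Recall bias, social desirability bias.
Nevertheless the most direct drivers of reinfection.

Symptoms (7):

Recorded objectively at the time of diagnosis.
Some act through a different mechanism than reinfection risk (individual immunity).

16 variables in total (more once binary indicators are counted) → multivariate analysis.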

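3.3 Multivariate Cox: Variable Selection

3.3.1 The Challenge

877 subjects, so a large number of events is available (large sample).
16 variables + categorical levels ≈ 25 covariates.
Variable selection is needed.

3.3.2 Standard Methods

Forward stepwise:
- Empty model → add the most significant variable → repeat.
- Fast, but ignores interactions between variables.
Backward stepwise:
- Full model → drop the least significant variable → repeat.
- More stable, but slower.
AIC/BIC based:
- Evaluate all candidate models (feasible when the number of variables is small).
- Choose the best model by information criterion.
LASSO Cox (Tibshirani 1997), where \(\ell(\beta)\) is the Cox log partial likelihood:

\[ \widehat\beta = \arg\min \Bigl\{ -\ell(\beta) + \lambda \sum_j |\beta_j| \Bigr\} \]
- L1 penalty → some \(\beta_j = 0\) (automatic selection).
- \(\lambda\) is chosen by cross-validation.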
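
The LASSO objective above can be written out directly. The sketch below evaluates the negative Cox log partial likelihood (Breslow form for ties) plus the L1 penalty in pure Python. The four-subject dataset, the coefficient values, and the function names are hypothetical and chosen only so the numbers are easy to check by hand. Real fits should use `glmnet` or `scikit-survival`.

```python
import math


def neg_log_partial_likelihood(beta, times, events, Z):
    """Negative Cox log partial likelihood, Breslow form for tied times."""
    eta = [sum(b * z for b, z in zip(beta, row)) for row in Z]  # linear predictors
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i:
            # Risk set: everyone still under observation at time t_i
            risk = sum(math.exp(eta[k]) for k, t_k in enumerate(times) if t_k >= t_i)
            total += eta[i] - math.log(risk)
    return -total


def lasso_objective(beta, times, events, Z, lam):
    """-l(beta) + lambda * sum_j |beta_j|."""
    penalty = lam * sum(abs(b) for b in beta)
    return neg_log_partial_likelihood(beta, times, events, Z) + penalty


# Hypothetical toy data: 4 subjects, 2 covariates
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

beta = [0.5, -0.2]
for lam in (0.0, 0.5, 1.0):
    print(f"lambda = {lam}: objective = {lasso_objective(beta, times, events, Z, lam):.4f}")
```

At \(\beta = 0\) the unpenalized value reduces to the sum of log risk-set sizes over the event times, here \(\log 4 + \log 2 + \log 1 = \log 8\), which is a convenient sanity check.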

Intuition: Pitfalls of Variable Selection

Problems with stepwise selection:

P-value inflation: no correction for multiple testing.
Sensitivity to data: a slightly different sample → different variables selected.
Invalid confidence intervals: post-selection inference is difficult.
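
A small calculation illustrates the p-value inflation point. Assuming, for simplicity, k independent covariates that are all truly unrelated to the outcome, the chance that at least one of them passes a 5% test is \(1 - 0.95^k\). The independence assumption and the function name are ours. Real covariates are correlated, so this is only a rough guide.

```python
def prob_any_false_positive(k, alpha=0.05):
    """P(at least one of k independent null tests has p < alpha)."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 10, 20):
    print(f"k = {k:2d} null covariates: P(any p < 0.05) = {prob_any_false_positive(k):.3f}")
```

With about 20 candidate covariates, a forward step that adds "the most significant variable" would find a nominally significant one well over half the time even if none mattered.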

Advantages of LASSO:

Automatic sparsity → cleaner model.
Tuned by cross-validation.
However, standard errors are hard to obtain (post-LASSO inference).

Modern recommendations:

LASSO + post-selection inference (Berk et al. 2013).
Stability selection (Meinshausen-Bühlmann 2010).
Bayesian variable selection (spike-and-slab).

LASSO is a natural fit for this dataset (n=877, p=20+).

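3.3.3 Building a Risk Score

\[ \text{Risk Score}_i = \sum_j \widehat\beta_j Z_{ij} \]

Here \(Z_{ij}\) is covariate \(j\) for patient \(i\) and \(\widehat\beta_j\) is its fitted Cox coefficient.
A risk score is computed for each patient.
The top 5~10% are candidates for the core group.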
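
The risk score formula and the top-fraction rule can be sketched as follows. The coefficient values, the ten hypothetical patients, and the helper names `risk_scores` and `flag_top_fraction` are illustrative assumptions, not estimates from the STD data.

```python
def risk_scores(beta, Z):
    """Linear predictor sum_j beta_j * Z_ij for every patient i."""
    return [sum(b * z for b, z in zip(beta, row)) for row in Z]


def flag_top_fraction(scores, frac=0.10):
    """True for patients whose score is at or above the top-`frac` cutoff."""
    k = max(1, round(frac * len(scores)))
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [s >= cutoff for s in scores]  # ties at the cutoff are all flagged


# Hypothetical coefficients for (partners, condom_never, age)
beta = [0.30, 0.50, -0.05]
# Hypothetical covariate rows for 10 patients
Z = [(0, 0, 25), (1, 0, 22), (1, 1, 19), (2, 0, 30), (3, 1, 18),
     (1, 0, 35), (0, 1, 21), (5, 1, 17), (2, 1, 24), (1, 1, 28)]

scores = risk_scores(beta, Z)
high_risk = flag_top_fraction(scores, frac=0.10)
for row, s, flag in zip(Z, scores, high_risk):
    print(row, f"score = {s:+.2f}", "HIGH" if flag else "")
```

Because the score is a linear predictor, only the ranking matters for flagging, so adding a constant to all scores leaves the flagged set unchanged.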

3.3.4 Targeted Intervention

After the core group has been identified:

More frequent follow-up.
Behavioral counseling.
Partner notification.
Free condoms / education.

→ The statistical analysis feeds directly into public health policy.

3.4 Use in Klein's Book

Exercises only.
Independent analysis by the student:
- Multivariate Cox.
- Variable selection (any method).
- Risk score construction.
- High-risk subgroup analysis.

4 R + Python EDA — Tongue Cancer

4.1 R — `survival`

library(survival)
library(survminer)

# Klein Table 1.6
tongue <- data.frame(
  ploidy = c(rep("aneuploid", 52), rep("diploid", 28)),
  time = c(
    # Aneuploid: 31 deaths + 21 censored
    1, 3, 3, 4, 10, 13, 13, 16, 16, 24, 26, 27, 28, 30, 30, 32,
    41, 51, 65, 67, 70, 72, 73, 77, 91, 93, 96, 100, 104, 157, 167,
    61, 74, 79, 80, 81, 87, 87, 88, 89, 93, 97, 101, 104, 108, 109,
    120, 131, 150, 231, 240, 400,
    # Diploid: 22 deaths + 6 censored
    1, 3, 4, 5, 5, 8, 12, 13, 18, 23, 26, 27, 30, 42, 56, 62, 69,
    104, 104, 112, 129, 181,
    8, 67, 76, 104, 176, 231
  ),
  status = c(
    rep(1, 31), rep(0, 21),  # aneuploid
    rep(1, 22), rep(0, 6)    # diploid
  )
)

# Basic counts: censored (0) vs. death (1) by ploidy
table(tongue$ploidy, tongue$status)
#           0  1
# aneuploid 21 31
# diploid    6 22

# KM
fit <- survfit(Surv(time, status) ~ ploidy, data = tongue)
ggsurvplot(fit, data = tongue, pval = TRUE, conf.int = TRUE,
           palette = c("red", "blue"),
           xlab = "Weeks", ylab = "Survival probability",
           legend.labs = c("Aneuploid", "Diploid"))

# Log-rank
survdiff(Surv(time, status) ~ ploidy, data = tongue)

# Cox PH
cox_fit <- coxph(Surv(time, status) ~ ploidy, data = tongue)
summary(cox_fit)
# Reported HR is diploid relative to aneuploid (aneuploid is the reference level)

# Median survival
print(fit)
# aneuploid median: 93 weeks
# diploid median: 42 weeks

4.2 Python — `lifelines`

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

# Data (same as in the R block)
aneuploid_death = [1, 3, 3, 4, 10, 13, 13, 16, 16, 24, 26, 27, 28, 30, 30,
                    32, 41, 51, 65, 67, 70, 72, 73, 77, 91, 93, 96, 100,
                    104, 157, 167]
aneuploid_cens = [61, 74, 79, 80, 81, 87, 87, 88, 89, 93, 97, 101, 104,
                   108, 109, 120, 131, 150, 231, 240, 400]
diploid_death = [1, 3, 4, 5, 5, 8, 12, 13, 18, 23, 26, 27, 30, 42, 56,
                  62, 69, 104, 104, 112, 129, 181]
diploid_cens = [8, 67, 76, 104, 176, 231]

tongue = pd.DataFrame({
    "ploidy": (["aneuploid"] * (len(aneuploid_death) + len(aneuploid_cens))
               + ["diploid"] * (len(diploid_death) + len(diploid_cens))),
    "time": aneuploid_death + aneuploid_cens + diploid_death + diploid_cens,
    "status": ([1] * len(aneuploid_death) + [0] * len(aneuploid_cens)
               + [1] * len(diploid_death) + [0] * len(diploid_cens)),
})

# KM
fig, ax = plt.subplots(figsize=(9, 6))
for grp, color in [("aneuploid", "red"), ("diploid", "blue")]:
    sub = tongue[tongue["ploidy"] == grp]
    kmf = KaplanMeierFitter()
    kmf.fit(sub["time"], sub["status"], label=grp)
    kmf.plot_survival_function(ax=ax, color=color)
    print(f"{grp}: median = {kmf.median_survival_time_:.0f} weeks")
ax.set_xlabel("Weeks")
ax.set_ylabel("Survival probability")
plt.tight_layout()

# Log-rank
result = logrank_test(
    tongue[tongue["ploidy"] == "aneuploid"]["time"],
    tongue[tongue["ploidy"] == "diploid"]["time"],
    tongue[tongue["ploidy"] == "aneuploid"]["status"],
    tongue[tongue["ploidy"] == "diploid"]["status"]
)
print(f"Log-rank: p = {result.p_value:.4f}")

# Cox
tongue["aneuploid_ind"] = (tongue["ploidy"] == "aneuploid").astype(int)
cph = CoxPHFitter()
cph.fit(tongue[["time", "status", "aneuploid_ind"]],
        duration_col="time", event_col="status")
print(cph.summary)

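Intuition: Reading the KM Results Correctly

Median survival (KM estimate):

Aneuploid: 93 weeks.
Diploid: 42 weeks.

Paradoxical: aneuploid patients survive longer, contrary to the medical intuition.

Possible explanations:

Selection bias: aneuploid patients may have received more aggressive treatment.
Stage difference: diploid patients may have presented at a more advanced stage.
Confounding: age, comorbidity, and similar factors are not controlled.

Remedy:

Control for confounders with a multivariate Cox model.
This requires additional covariates such as stage → a limitation of this dataset.

Lesson from this dataset: interpret crude comparisons with care. When the result contradicts the medical intuition, suspect confounding.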
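
The KM medians can be checked without any library. The sketch below applies the product-limit formula to Table 1.6 and returns the first death time at which the estimated survival drops to 0.5 or below. The function name `km_median` is ours, and the rule used here ignores the special case where the estimate equals 0.5 exactly, which does not occur in these data.

```python
aneuploid_death = [1, 3, 3, 4, 10, 13, 13, 16, 16, 24, 26, 27, 28, 30, 30,
                   32, 41, 51, 65, 67, 70, 72, 73, 77, 91, 93, 96, 100,
                   104, 157, 167]
aneuploid_cens = [61, 74, 79, 80, 81, 87, 87, 88, 89, 93, 97, 101, 104,
                  108, 109, 120, 131, 150, 231, 240, 400]
diploid_death = [1, 3, 4, 5, 5, 8, 12, 13, 18, 23, 26, 27, 30, 42, 56,
                 62, 69, 104, 104, 112, 129, 181]
diploid_cens = [8, 67, 76, 104, 176, 231]


def km_median(times, events):
    """Smallest death time t with KM estimate S(t) <= 0.5, or None if never reached."""
    surv = 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for x in times if x >= t)  # still under observation at t
        deaths = sum(1 for x, e in zip(times, events) if e and x == t)
        surv *= 1 - deaths / at_risk               # product-limit update
        if surv <= 0.5:
            return t
    return None


a_median = km_median(aneuploid_death + aneuploid_cens,
                     [1] * len(aneuploid_death) + [0] * len(aneuploid_cens))
d_median = km_median(diploid_death + diploid_cens,
                     [1] * len(diploid_death) + [0] * len(diploid_cens))
print(f"aneuploid KM median: {a_median} weeks")  # 93
print(f"diploid KM median:   {d_median} weeks")  # 42
```

Censored patients contribute to the at-risk count up to their censoring time, which is why these medians are much larger than the medians of the death times alone.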

5 R + Python EDA — STD Reinfection

5.1 R — Variable Selection

library(survival)
library(glmnet)
library(dplyr)

# STD data (simulated here; the real data are available from Klein's website)
set.seed(42)
n <- 877
std <- data.frame(
  race_white = rbinom(n, 1, 0.33),
  married = rbinom(n, 1, 0.03),
  divorced = rbinom(n, 1, 0.07),
  age = pmin(48, pmax(13, rnorm(n, 20.6, 5))),
  schooling = pmin(18, pmax(6, rnorm(n, 11.4, 2))),
  initial_gono = rbinom(n, 1, 0.16),
  initial_chla = rbinom(n, 1, 0.45),
  partners = pmin(19, rpois(n, 1.27)),
  oral_sex = rbinom(n, 1, 0.33),
  rectal_sex = rbinom(n, 1, 0.06),
  condom_always = rbinom(n, 1, 0.06),
  condom_never = rbinom(n, 1, 0.36),
  abdominal_pain = rbinom(n, 1, 0.14),
  discharge = rbinom(n, 1, 0.46),
  dysuria = rbinom(n, 1, 0.13),
  itch = rbinom(n, 1, 0.19),
  lesion = rbinom(n, 1, 0.03),
  rash = rbinom(n, 1, 0.03),
  lymph = rbinom(n, 1, 0.01),
  time = rexp(n, rate = 0.005),
  status = rbinom(n, 1, 0.4)
)

# Multivariate Cox (full model)
cox_full <- coxph(Surv(time, status) ~ ., data = std)
summary(cox_full)

# Backward stepwise (AIC)
cox_back <- step(cox_full, direction = "backward", trace = 0)
summary(cox_back)

# LASSO Cox (glmnet)
X <- model.matrix(Surv(time, status) ~ ., data = std)[, -1]
y <- with(std, Surv(time, status))

lasso_fit <- cv.glmnet(X, y, family = "cox", alpha = 1)
plot(lasso_fit)
print(coef(lasso_fit, s = "lambda.min"))
print(coef(lasso_fit, s = "lambda.1se"))

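# Risk score (linear predictor); as.numeric() drops the n x 1 matrix shape
risk_score <- as.numeric(predict(lasso_fit, newx = X, s = "lambda.1se"))

# High-risk subgroup (top 10%)
# Note: the covariates above are simulated noise, so the LASSO may shrink every
# coefficient to zero; all scores then tie and nobody exceeds the cutoff.
threshold <- quantile(risk_score, 0.9)
std$high_risk <- risk_score > threshold

# KM by risk group (survminer is not attached in this block, hence the prefix)
fit_risk <- survfit(Surv(time, status) ~ high_risk, data = std)
survminer::ggsurvplot(fit_risk, data = std, pval = TRUE,
           legend.labs = c("Low risk", "High risk (core group?)"))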

5.2 Python — `lifelines` + `scikit-survival`

from lifelines import CoxPHFitter
from sksurv.linear_model import CoxnetSurvivalAnalysis  # LASSO Cox
from sksurv.util import Surv

# Data: `std` is assumed to be a pandas DataFrame with the same columns as the R block
# (generation omitted)

# Multivariate Cox (lifelines)
cph = CoxPHFitter()
cph.fit(std, duration_col="time", event_col="status")
print(cph.summary)
# Coefficient and p-value for each variable

# LASSO Cox (scikit-survival)
y = Surv.from_dataframe("status", "time", std)
X = std.drop(columns=["time", "status"])

coxnet = CoxnetSurvivalAnalysis(
    l1_ratio=1.0,  # pure LASSO
    alpha_min_ratio=0.01,
    n_alphas=100,
    fit_baseline_model=True
)
coxnet.fit(X, y)

# Select the best alpha by cross-validation
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(
    CoxnetSurvivalAnalysis(l1_ratio=1.0),
    {"alphas": [[a] for a in coxnet.alphas_]},
    cv=5, n_jobs=-1
).fit(X, y)
best_alpha = gscv.best_params_["alphas"][0]

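# Risk score from the cross-validated model
# (coxnet.predict(X) without `alpha` would use the last alpha on the fitted path)
risk_score = gscv.best_estimator_.predict(X)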

Intuition: Model Building with 877 Subjects and 20 Variables

Events per Variable (EPV):

877 × 0.4 (event rate) ≈ 350 events.
350 / 20 ≈ 17.5 EPV.
Common rule of thumb: EPV ≥ 10 → sufficient.
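
The EPV arithmetic above is easy to repeat for other covariate counts. In the sketch below the 40% event rate is the value assumed in this post, and the function name is ours.

```python
def events_per_variable(n, event_rate, n_covariates):
    """Expected number of events divided by the number of candidate covariates."""
    return n * event_rate / n_covariates


for p in (16, 20, 25):
    epv = events_per_variable(877, 0.4, p)
    print(f"p = {p}: EPV = {epv:.1f} ({'OK' if epv >= 10 else 'too low'} by the EPV >= 10 rule)")
```

Even with about 25 covariates after expanding categorical levels, the EPV stays above the rule-of-thumb threshold under this assumed event rate.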

Suggested order for variable selection:

Univariate screening: a univariate Cox model for each variable.
Domain pre-selection: clinically meaningful variables (e.g., partners, condom).
Multivariate baseline: fit a Cox model with the variables above.
Stepwise / LASSO: select additional variables.
Validation: cross-validation or bootstrap.

Alternatives to stepwise:

LASSO: automatic sparsity.
Elastic net: LASSO + ridge.
Random survival forest: feature importance.
Bayesian variable selection.

With this dataset (n=877, p=20) every method is feasible, and the results can be compared.

6 Contrasting the Two Datasets

Aspect	§ 1.11 Tongue	§ 1.12 STD
n	80	877
Groups	2 (ploidy)	continuous risk
Event	death	reinfection
Covariates	1 (ploidy)	20+ (demographic, behavioral, symptoms)
Challenge	Crude vs adjusted comparison	Variable selection
Use in Klein	Exercises	Exercises

Intuition: The Pedagogy of Two Extremes

§ 1.11 Tongue (small and simple):

80 patients, a single grouping variable.
Teaches "interpret simple comparisons with care".
Crude rate ≠ time-adjusted analysis.

§ 1.12 STD (large and rich):

877 subjects, 20+ covariates.
Teaches variable selection.
Risk score → public health intervention.

Complementarity:

Small data: the pitfalls of simple analysis.
Large data: the challenges of complex analysis.
The student learns from both extremes.

7 Key Intuitions in One Place

DNA ploidy = genomic instability of the tumor → prognostic biomarker.
Crude rate vs time-adjusted = a simple proportion can give a wrong conclusion (tongue).
Core group hypothesis = a minority transmits the majority of STDs → targeted intervention.
Variable selection challenge = 877 subjects + 20 variables → LASSO, stepwise, post-selection inference.
Risk score = multivariate Cox → high-risk subgroup → public health policy.
Pedagogy of two extremes = small + simple (tongue) vs large + rich (STD).

8 Practical Checklist: § 1.11~1.12

§ 1.11 Tongue Cancer

Clinical meaning of DNA ploidy (aneuploid = aggressive).
Measurement principle of flow cytometry.
Enter the 80-patient dataset accurately.
KM by ploidy + log-rank.
Cox PH + interpretation of the HR.
Care in interpreting crude vs adjusted comparisons.

§ 1.12 STD Reinfection

Public health meaning of the core group hypothesis.
Categorization of the 20+ covariates (demographic, behavioral, symptoms).
Awareness of self-report bias.
Multivariate Cox + variable selection.
LASSO Cox vs stepwise comparison.
Risk score construction + identification of the high-risk subgroup.
Statistical basis for targeted intervention.

EDA

Events, n, and means by group and by variable.
KM curve + univariate Cox.
Multivariate Cox + variable selection.
Validation (cross-validation, bootstrap).

Next steps

§ 1.13~1.19 (7 additional examples, optional).
Move on to Ch.2 (Basic Quantities and Models): precise definitions of \(S(t), h(t)\).

9 Related Topics

Klein series

Ch.1 Overview
§ 1.1~1.2: Introduction · Leukemia
§ 1.3~1.4: BMT · Dialysis
§ 1.5~1.6: Breast Cancer · Burn
§ 1.7~1.8: Kidney Transplant · Laryngeal
§ 1.9~1.10: Auto/Allo BMT · Lymphoma BMT
(next) § 1.13~1.19 (planned or skipped)


10 References

Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.), Ch.1 § 1.11~1.12. Springer.
Sickle-Santanello, B. J., Farrar, W. B., DeCenzo, J. F., et al. (1988). Technical and Statistical Improvements for Flow Cytometric DNA Analysis of Paraffin-Embedded Tissue. Cytometry, 9(6), 594-599.
May, R. M., & Anderson, R. M. (1988). The Transmission Dynamics of Human Immunodeficiency Virus (HIV). Philosophical Transactions of the Royal Society B, 321(1207), 565-607.
Yorke, J. A., Hethcote, H. W., & Nold, A. (1978). Dynamics and Control of the Transmission of Gonorrhea. Sexually Transmitted Diseases, 5(2), 51-56. (Original source of the core group concept)
Tibshirani, R. (1997). The LASSO Method for Variable Selection in the Cox Model. Statistics in Medicine, 16(4), 385-395.
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39(5), 1-13. (R glmnet Cox)
Meinshausen, N., & Bühlmann, P. (2010). Stability Selection. JRSS B, 72(4), 417-473.
Berk, R., Brown, L., Buja, A., Zhang, K., & Zhao, L. (2013). Valid Post-Selection Inference. Annals of Statistics, 41(2), 802-837.
Pölsterl, S. (2020). scikit-survival. JMLR, 21(212), 1-6.
Davidson-Pilon, C. (2019). lifelines. JOSS, 4(40), 1317.