Kwangmin Kim - Klein Ch.1 § 1.13~1.14 In Depth: Hospitalized Pneumonia

1 Introduction: The Two NLSY Sister Datasets

Where this post sits in the Klein series:

Post	Topic
Ch.1 Overview (01)	Catalog of the 19 examples
§ 1.1~1.2 (01-1)	Leukemia 6-MP
§ 1.3~1.4 (01-2)	BMT + Dialysis
§ 1.5~1.6 (01-3)	Breast + Burn
§ 1.7~1.8 (01-4)	Kidney + Laryngeal
§ 1.9~1.10 (01-5)	Auto/Allo + Lymphoma BMT
§ 1.11~1.12 (01-6)	Tongue + STD
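§ 1.13~1.14 (this post)	Pneumonia + Weaning (NLSY)
§ 1.15~1.19 (planned or skipped)	5 additional examples

Five questions this post answers

Why did the NLSY (National Longitudinal Survey of Youth) become a standard data source for social epidemiology and public health research?
How do we test the protective effect of breast feeding with time-to-event analysis?
How does the cohort life table (actuarial estimator) differ from KM, and when is it the natural choice?
Predictive model building vs causal inference: what changes when the same Cox model serves both purposes?
How do the two NLSY sister datasets (Pneumonia + Weaning) illustrate different statistical challenges from the same source?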

1.1 About the NLSY

Intuition: why the NLSY became a standard

NLSY79: started in 1979 as a longitudinal panel of youth aged 14-21.

Stratified random sample of US youth.
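Interviews: annual 1979-1994, biennial thereafter (1979-2018+).
Comprehensive variables: education, employment, marriage, fertility, income, health, behavior.
Mother-Child files: child records matched to the women of the NLSY79 (NLSY79 Children).

Survival analysis applications:

§ 1.13 Pneumonia: time to the child's hospitalization for pneumonia.
§ 1.14 Weaning: time until breast feeding of the child ends.
Same source → rich covariates + comparatively reliable measurement.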

Strengths:

Large sample (thousands of subjects).
Diverse covariates.
Results can inform public health policy.

Limitations:

Self-reported (recall bias).
Selection bias (particular population groups).
Accuracy of recalled timing around pregnancy.

2 § 1.13 Time to Hospitalized Pneumonia

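2.1 Public health background: the protective effect of breast feeding

2.1.1 Immunological value of breast milk

Breast Milk
    ├── IgA (immunoglobulin A): mucosal immunity
    ├── Lactoferrin: antimicrobial protein
    ├── Lysozyme: breaks down bacterial cell walls
    ├── Bifidobacteria: gut microbiota balance
    └── Macrophages: active immune cells

→ Supports infant immune development + protects against respiratory/GI infection.

2.1.2 Why pneumonia matters

A severe form of respiratory infection.
A leading cause of infant death (especially under age 1).
Hospitalization is an objective marker (avoids the variability of outpatient diagnosis).

2.1.3 Hypothesis (Klein § 1.13)

“Does breast feeding (vs never breast fed) protect against hospitalization for pneumonia in the first year of life?”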

Intuition: statistical implications of the hypothesis

Treatment:

“Ever breast fed” vs “Never breast fed”.
Binary indicator.

Event:

First hospitalization for pneumonia.
“Time” = age since birth.

Censoring:

Censored on reaching age 1 (end of study).
Lost to follow-up.

Hypothesis test:

\(H_0\): \(S_{\text{breast}}(t) = S_{\text{never}}(t)\) for all \(t\) (no protective effect).
\(H_1\): \(S_{\text{breast}}(t) > S_{\text{never}}(t)\) for some \(t\) (breast feeding is protective).
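
The null hypothesis above is usually tested with a log-rank test. The following is a minimal sketch in plain Python, assuming two groups coded 0/1; the function name `logrank_two_groups` and the toy data are made up for illustration and are not from Klein. The statistic is two-sided, so the direction of the effect is read from the sign of observed minus expected events.

```python
import math

def logrank_two_groups(times, events, groups):
    """Two-sample log-rank test.

    events: 1 = event, 0 = censored; groups: 0 or 1.
    Returns (chi-square statistic with 1 df, two-sided p-value).
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Risk sets just before time t
        n = sum(1 for ti in times if ti >= t)
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        # Events at time t, overall and in group 1
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the margins
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Toy data in months: group 1 = ever breast fed, group 0 = never (hypothetical)
toy_times = [2, 4, 6, 12, 12, 5, 12, 12, 12, 12]
toy_events = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
toy_groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
chi2, p = logrank_two_groups(toy_times, toy_events, toy_groups)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

In practice a library routine such as `survdiff` in R would replace this hand-rolled version; the sketch only makes the observed-minus-expected logic visible.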

Cox model:

\[ h(t \mid Z) = h_0(t) \exp(\beta_1 \text{breast} + \beta_2 \text{birthweight} + \cdots) \]

\(\beta_1 < 0\) → protective.
\(\exp(\beta_1)\) = “hazard ratio of ever breast fed vs never”.
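
To make the reading of \(\beta_1\) concrete, here is a small sketch that converts a coefficient and its standard error into a hazard ratio with a Wald 95% interval. The values of `beta` and `se` are hypothetical, not estimates from the NLSY data.

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Return (HR, lower, upper): exp(beta) with a Wald interval on the log scale."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient for "ever breast fed"
hr, lo, hi = hazard_ratio_ci(beta=-0.5, se=0.2)
print(f"HR = {hr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# prints: HR = 0.61, 95% CI (0.41, 0.90)
```

With these made-up numbers the interval lies entirely below 1, which is what a protective association would look like in the Cox output.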

2.2 Data structure

n: 3,470 (NLSY 1979-1986).
Event: First hospitalized pneumonia (in first year).
Time: months since birth (or days).
Censoring: reaching 1 year of age or end of follow-up.
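
The censoring rule above can be sketched as a small helper that turns raw ages into a `(time, status)` pair. The function and argument names (`to_time_status`, `age_at_hosp`, `last_contact`) are hypothetical and do not correspond to NLSY variable names.

```python
def to_time_status(age_at_hosp, last_contact, horizon=12.0):
    """Return (time, status) in months for one child.

    age_at_hosp: age at first pneumonia hospitalization, or None if none observed
    last_contact: age at the last interview that covers this child
    horizon: administrative end of follow-up (12 months = first year of life)
    """
    follow_up_end = min(last_contact, horizon)
    if age_at_hosp is not None and age_at_hosp <= follow_up_end:
        return age_at_hosp, 1      # event observed within follow-up
    return follow_up_end, 0        # censored at 1 year or at loss to follow-up

print(to_time_status(3, 24))       # hospitalized at 3 months
print(to_time_status(None, 24))    # event-free through the first year
print(to_time_status(None, 7))     # lost to follow-up at 7 months
print(to_time_status(14, 24))      # hospitalized after the first year
```

A hospitalization after the horizon counts as censored at 12 months, because the hypothesis concerns the first year only.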

2.2.1 Child Variables

Variable	Distribution
Normal birthweight (≥5.5 lb)	36%
Race	56% white, 28% black, 16% other
Number of siblings	0-6
Age at hospitalization	(at the event)

2.2.2 Mother Variables

Variable	Distribution
Mother’s age	mean 21.64 (range 14-29)
Years of schooling	mean 11.4
Region	15% NE, 25% NC, 40% S, 20% W
Poverty	92%
Urban	76%

2.2.3 Health Behavior During Pregnancy

Variable	Distribution
Alcohol use	36%
Cigarette use	34%

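Intuition: pregnancy health behaviors

Why it matters:

Alcohol and tobacco use during pregnancy = impaired development of the infant's immune system and lungs.
Direct effect on pneumonia risk.

Confounder:

Breast feeding is correlated with health behaviors (a health-conscious mother → more careful).
A multivariate Cox model is needed to control for confounding.

Self-reported bias:

Alcohol and tobacco use = subject to social desirability bias (under-reporting).
True exposure is likely higher than the measured value → effect size attenuation.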
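
The attenuation argument can be checked with simple arithmetic. The sketch below uses a risk ratio rather than a hazard ratio to keep the calculation transparent, and it assumes that under-reporting is the only error (no mother over-reports exposure). All prevalences and risks are hypothetical.

```python
def observed_risk_ratio(p_exposed, risk_exposed, risk_unexposed, under_report):
    """Risk ratio seen when a fraction of truly exposed mothers report 'unexposed'."""
    misclassified = p_exposed * under_report            # exposed, reported as unexposed
    reported_unexposed = (1 - p_exposed) + misclassified
    # The reported-unexposed group is a mix of truly unexposed and hidden exposed
    risk_in_reported_unexposed = (
        (1 - p_exposed) * risk_unexposed + misclassified * risk_exposed
    ) / reported_unexposed
    # Everyone who reports exposure is truly exposed under this assumption
    return risk_exposed / risk_in_reported_unexposed

true_rr = observed_risk_ratio(0.4, 0.10, 0.05, under_report=0.0)
seen_rr = observed_risk_ratio(0.4, 0.10, 0.05, under_report=0.25)
print(f"true RR = {true_rr:.2f}, observed RR = {seen_rr:.2f}")
# prints: true RR = 2.00, observed RR = 1.75
```

Hidden smokers raise the risk in the reported-unexposed group, which pulls the observed ratio toward 1.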

2.3 How Klein uses the data

Exercises only.
Left for students to analyze on their own:
- KM by breast feeding status.
- Multivariate Cox (breast feeding effect, controlling for the other covariates).
- Subgroup analysis (by race, poverty).

3 § 1.14 Times to Weaning

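3.1 Background: determinants of breast feeding duration

3.1.1 Weaning

Definition: end of breast feeding (switch to formula or solid food).
WHO recommendation: 6 months exclusive + continued to 24 months or beyond.
In practice: breast feeding ends early for many reasons.

3.1.2 Research question

“Which factors determine the duration of breast feeding?”

Intuition: predictive vs causal

Predictive modeling (the purpose of Klein § 1.14):

“Predict the time of weaning from the mother's characteristics.”
No causal interpretation of covariate effects.
Predictive accuracy assessed by cross-validation.
Policy: identify mothers at risk of early weaning → targeted support.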

Causal inference:

“Does smoking causally shorten the time to weaning?”
Confounding control is essential.
Counterfactual analysis (DAG, instrumental variable).
Causal estimand: ATE (Average Treatment Effect).

Same Cox model, different interpretation:

Predictive: \(\beta\) only needs to be useful for prediction.
Causal: \(\beta\) is read as the true causal effect.

This dataset (§ 1.14): explicitly predictive in purpose (it demonstrates model building in Klein Ch.8).

3.2 Data structure

n: 927 (NLSY first-born).
Event: weaning (end of breast feeding).
Response: Duration of breast feeding (weeks).
Restriction:
- Born from 1978 onward (limits recall bias).
- Gestation 20-45 weeks (both preterm and term births included).
- Only mothers who attempted breast feeding (a source of selection bias).

3.2.1 Variables

Variable	Meaning
Race of mother	1 white, 2 black, 3 other
Poverty status	1 if mother in poverty
Smoking at birth	1 if smoking
Alcohol drinking at birth	1 if drinking
Age of mother at child’s birth	continuous
Year of child’s birth	continuous
Education of mother	years
Lack of prenatal care	1 if late or no prenatal care

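Intuition: recall bias and cohort restriction

Problem:

The NLSY uses annual interviews → past events are reported from memory.
The time breast feeding ended is hard to recall accurately (especially for events long ago).

Klein's solution:

Analyze only births from 1978 onward.
Short recall window at interview.
Higher recall accuracy.

Trade-off:

Less recall bias (more accurate).
Smaller sample size (births before 1978 excluded).
Limited generalization (later cohort only).

This reflects a common principle in survey data analysis: a larger sample shrinks variance but not bias, so controlling recall bias often takes priority over sample size.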
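
The trade-off can be written as mean squared error = bias² + variance. The sketch below compares a restricted, unbiased sample with a larger sample that carries recall bias; the standard deviation, the bias, and the unrestricted sample size are hypothetical numbers chosen only to show the mechanism.

```python
def mse_of_mean(bias, sd, n):
    """MSE of a sample mean: squared bias plus sampling variance sd^2 / n."""
    return bias ** 2 + sd ** 2 / n

# Hypothetical: durations with sd = 20 weeks
restricted = mse_of_mean(bias=0.0, sd=20.0, n=927)     # recent births only
unrestricted = mse_of_mean(bias=2.0, sd=20.0, n=3000)  # all births, 2-week recall bias
print(f"restricted MSE = {restricted:.3f}, unrestricted MSE = {unrestricted:.3f}")
```

The variance term falls as `n` grows, but the squared bias does not, so the larger sample is worse here despite having more than three times as many subjects.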

3.3 Mapping to Klein chapters

Chapter	Use of this dataset
Ch.5.4	Cohort life table (actuarial estimator)
Ch.8	Predictive Cox model building

3.4 Cohort Life Table — Actuarial Estimator (Ch.5.4)

3.4.1 KM vs Life Table

KM (product-limit):

Uses each exact event time.
Continuous time.
Suited to small samples.

Life Table (actuarial):

Groups time into fixed intervals.
Assumes censoring is uniform within each interval.
Suited to large samples / interval data.
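
For contrast with the interval-based life table, here is a minimal product-limit (KM) sketch in plain Python. It steps at every distinct event time rather than at interval boundaries. The function name `kaplan_meier` and the toy data are made up for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    surv = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)   # still under observation at t
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Toy data: durations in weeks, 1 = weaned, 0 = censored
km_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in km_curve:
    print(f"t = {t}: S = {s:.2f}")
# prints S = 0.80 at t = 2, S = 0.60 at t = 3, S = 0.30 at t = 5
```

A subject censored at the same time as an event (week 3 here) is still counted in the risk set at that time, which is the usual KM convention.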

3.4.2 Actuarial Survival Estimate

Interval \(j = (t_{j-1}, t_j]\):

\(n_j\): number at risk at the start of the interval.
\(d_j\): events within the interval.
\(w_j\): censored within the interval.
Effective at risk: \(n'_j = n_j - w_j/2\) (uniform censoring assumption).

Conditional survival:

\[ \hat p_j = 1 - \frac{d_j}{n'_j} \]

Cumulative survival:

\[ \hat S(t_j) = \prod_{k=1}^j \hat p_k \]
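
The two formulas above translate directly into a short loop. This is a sketch in plain Python with made-up counts; the function name `actuarial_life_table` is hypothetical and the numbers are not from the weaning data.

```python
def actuarial_life_table(n_start, deaths, withdrawals):
    """Actuarial survival estimates at the end of each interval.

    n_start: number at risk at the start of the first interval
    deaths[j], withdrawals[j]: events d_j and censored w_j in interval j
    """
    n_j = n_start
    surv = 1.0
    rows = []
    for d_j, w_j in zip(deaths, withdrawals):
        n_eff = n_j - w_j / 2          # effective number at risk n'_j
        p_j = 1 - d_j / n_eff          # conditional survival through the interval
        surv *= p_j                    # cumulative survival S(t_j)
        rows.append({"n": n_j, "d": d_j, "w": w_j, "n_eff": n_eff, "p": p_j, "S": surv})
        n_j -= d_j + w_j               # survivors entering the next interval
    return rows

table = actuarial_life_table(100, deaths=[20, 15, 10], withdrawals=[10, 10, 0])
for row in table:
    print(row["n"], row["n_eff"], round(row["S"], 3))
# prints: 100 95.0 0.789 / 70 65.0 0.607 / 45 45.0 0.472
```

Note that \(n_j\) for the next interval is the previous \(n_j\) minus both events and censored subjects, not the count of subjects whose time falls inside the interval.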

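Intuition: assumptions of the actuarial method

Uniform censoring assumption:

Censoring times are uniformly distributed between the start and the end of the interval.
Subtracting \(w_j/2\) credits each censored subject with half an interval of exposure on average.

Why actuarial rather than KM?

Interval data: event times are known only up to an interval, not exactly.
Large sample: with many events, tabulating every exact event time is unwieldy.
History: the actuarial tradition (life expectancy, mortality tables).

Why it suits the weaning data:

Recall: monthly intervals are more trustworthy than the exact week.
Public health policy: monthly weaning rates are useful (timing of interventions).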

3.5 Predictive Model Building (Ch.8)

3.5.1 Procedure

Univariate screening: a univariate Cox model for each covariate.
Multivariate baseline: include clinically meaningful variables.
Variable selection: stepwise / LASSO.
Interactions: test suspected interactions.
Validation:
- Internal: bootstrap, k-fold CV.
- External: apply the model to a different cohort.
Performance metric:
- C-index (concordance).
- Time-dependent AUC.
- Brier score.
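
Of the metrics listed, the Brier score is the simplest to write down. The sketch below assumes that every subject is either followed past the horizon or has the event before it, so no censoring weights (IPCW) are needed; with real censored data a weighted version would be required. The function name and toy predictions are hypothetical.

```python
def brier_score(pred_surv, times, horizon):
    """Mean squared error between predicted S(horizon) and observed status.

    pred_surv[i]: model's predicted probability of still breast feeding at `horizon`
    times[i]: observed weaning time (assumed uncensored before `horizon`)
    """
    total = 0.0
    for p, t in zip(pred_surv, times):
        still_event_free = 1.0 if t > horizon else 0.0
        total += (still_event_free - p) ** 2
    return total / len(times)

# Toy example: horizon of 26 weeks
bs = brier_score([0.9, 0.2, 0.6, 0.4], [30, 5, 10, 40], horizon=26)
print(f"Brier score at 26 weeks: {bs:.4f}")
# prints: Brier score at 26 weeks: 0.1925
```

A model that predicts 0.5 for everyone scores 0.25, so values well below that indicate useful predictions at the chosen horizon.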

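Intuition: evaluating a predictive model

Evaluating causal inference:

Effect size + confidence interval.
Degree of confounding control.
Sensitivity analysis.

Evaluating predictive modeling:

Discrimination: how well high-risk and low-risk subjects are separated.
Calibration: do predicted probabilities match observed frequencies?
Prediction error (MSE, Brier).

C-index (Concordance):

For a randomly chosen comparable pair of subjects (i, j), does the one whose event occurs earlier have the higher risk score?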
C = 0.5: random.
C > 0.7: useful.
C > 0.8: strong.
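
The pairwise definition can be spelled out in a few lines. This is a sketch of Harrell's C in plain Python with an O(n²) loop; library implementations differ in how they treat tied times, so small discrepancies are possible. The toy data are made up.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event strictly before j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1        # earlier event has the higher risk score
                elif risk[i] == risk[j]:
                    concordant += 0.5      # tied scores count as half
    return concordant / comparable

toy_c = harrell_c(times=[5, 10, 15, 20], events=[1, 1, 0, 1],
                  risk=[0.9, 0.4, 0.5, 0.1])
print(f"C-index = {toy_c:.2f}")
# prints: C-index = 0.80
```

Of the 5 comparable pairs in the toy data, 4 are ordered correctly; the censored subject at time 15 can only serve as the later member of a pair.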

This weaning model: predict which mothers will wean early → evaluate with the C-index.

4 R + Python EDA

4.1 R: Pneumonia data (simulated)

library(survival)
library(survminer)

# Simulated stand-in for the NLSY pneumonia data (the real data are on the Klein website)
set.seed(42)
n <- 3470
pneumonia <- data.frame(
  id = 1:n,
  breast_fed = rbinom(n, 1, 0.5),
  birthweight_normal = rbinom(n, 1, 0.36),
  race = sample(c("white", "black", "other"),
                n, prob = c(0.56, 0.28, 0.16), replace = TRUE),
  siblings = sample(0:6, n, prob = c(0.3, 0.25, 0.2, 0.15, 0.05, 0.03, 0.02),
                    replace = TRUE),
  mother_age = pmin(29, pmax(14, rnorm(n, 21.64, 4))),
  mother_schooling = pmin(20, pmax(6, rnorm(n, 11.4, 2.5))),
  poverty = rbinom(n, 1, 0.92),
  urban = rbinom(n, 1, 0.76),
  alcohol = rbinom(n, 1, 0.36),
  cigarette = rbinom(n, 1, 0.34),
  time = pmin(12, rexp(n, rate = 0.05)),  # months, capped at the 12-month horizon
  status = rbinom(n, 1, 0.05)             # placeholder event flag, independent of time and covariates
)

# KM by breast feeding
fit <- survfit(Surv(time, status) ~ breast_fed, data = pneumonia)
ggsurvplot(
  fit, data = pneumonia,
  pval = TRUE, conf.int = TRUE,
  palette = c("red", "blue"),
  xlab = "Months", ylab = "Pneumonia-free probability",
  legend.labs = c("Never breast fed", "Ever breast fed")
)

# Multivariate Cox
cox_fit <- coxph(Surv(time, status) ~ breast_fed + birthweight_normal +
                 race + siblings + mother_age + mother_schooling +
                 poverty + urban + alcohol + cigarette,
                 data = pneumonia)
summary(cox_fit)
# breast_fed coefficient: HR < 1 = protective?

# C-index for predictive accuracy (apparent, in-sample);
# concordance() is provided by the survival package
c_index <- concordance(cox_fit)
print(c_index)

4.2 R: Weaning data + cohort life table

library(survival)

# Simulated stand-in for the NLSY weaning data
set.seed(42)
n <- 927
weaning <- data.frame(
  race = sample(1:3, n, prob = c(0.6, 0.25, 0.15), replace = TRUE),
  poverty = rbinom(n, 1, 0.3),
  smoking = rbinom(n, 1, 0.25),
  drinking = rbinom(n, 1, 0.15),
  age = pmin(35, pmax(15, rnorm(n, 24, 5))),
  birth_year = sample(1979:1988, n, replace = TRUE),
  education = pmin(20, pmax(6, rnorm(n, 12, 2.5))),
  no_prenatal = rbinom(n, 1, 0.15),
  duration = pmin(80, rexp(n, rate = 0.05)),  # weeks
  weaned = rbinom(n, 1, 0.85)
)

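# Cohort life table (actuarial)
# Group durations into fixed intervals; survfit() itself gives the
# product-limit (KM) estimate, so the actuarial table is built by hand below
intervals <- seq(0, 80, by = 4)  # 4-week intervals
weaning$interval <- cut(weaning$duration, breaks = intervals,
                         include.lowest = TRUE, right = FALSE)

# Alternatively, lifetab() from the KMsurv package
# library(KMsurv)
# lifetab(intervals, ...)

# Actuarial estimate computed directly
library(dplyr)  # %>%, group_by(), summarise(), mutate(), lag()
life_table <- weaning %>%
  group_by(interval) %>%
  summarise(
    d_events = sum(weaned),
    w_censored = sum(1 - weaned)
  ) %>%
  mutate(
    # n_j: subjects still under observation at the start of interval j
    n_at_risk = nrow(weaning) - lag(cumsum(d_events + w_censored), default = 0),
    n_eff = n_at_risk - w_censored / 2,
    p_survive = 1 - d_events / n_eff,
    S_t = cumprod(p_survive)
  )
print(life_table)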

# Predictive Cox
cox_pred <- coxph(Surv(duration, weaned) ~ factor(race) + poverty +
                  smoking + drinking + age + education + no_prenatal,
                  data = weaning)
summary(cox_pred)

# Bootstrap C-index
library(boot)
boot_c <- function(data, indices) {
  sub <- data[indices, ]
  cox_b <- coxph(Surv(duration, weaned) ~ factor(race) + smoking + age,
                 data = sub)
  concordance(cox_b)$concordance
}
boot_result <- boot(weaning, boot_c, R = 200)
print(boot_result)
quantile(boot_result$t, c(0.025, 0.975))

4.3 Python — `lifelines` + `scikit-survival`

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.utils import concordance_index
from sksurv.metrics import concordance_index_censored

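# Pneumonia data: simulated stand-in mirroring the R block above
# (placeholder values, not the real NLSY records)
rng = np.random.default_rng(42)
n = 3470
pneumonia = pd.DataFrame({
    "id": np.arange(1, n + 1),
    "breast_fed": rng.binomial(1, 0.5, n),
    "birthweight_normal": rng.binomial(1, 0.36, n),
    "race": rng.choice(["white", "black", "other"], n, p=[0.56, 0.28, 0.16]),
    "siblings": rng.choice(np.arange(7), n,
                           p=[0.3, 0.25, 0.2, 0.15, 0.05, 0.03, 0.02]),
    "mother_age": np.clip(rng.normal(21.64, 4, n), 14, 29),
    "mother_schooling": np.clip(rng.normal(11.4, 2.5, n), 6, 20),
    "poverty": rng.binomial(1, 0.92, n),
    "urban": rng.binomial(1, 0.76, n),
    "alcohol": rng.binomial(1, 0.36, n),
    "cigarette": rng.binomial(1, 0.34, n),
    "time": np.minimum(12, rng.exponential(1 / 0.05, n)),  # months
    "status": rng.binomial(1, 0.05, n),
})

# KM by breast_fed
fig, ax = plt.subplots(figsize=(9, 6))
for bf, color in [(0, "red"), (1, "blue")]:
    sub = pneumonia[pneumonia["breast_fed"] == bf]
    kmf = KaplanMeierFitter()
    kmf.fit(sub["time"], sub["status"],
            label="Never" if bf == 0 else "Ever")
    kmf.plot_survival_function(ax=ax, color=color)
ax.set_xlabel("Months")
ax.set_ylabel("Pneumonia-free probability")
plt.tight_layout()

# Multivariate Cox: dummy-code race as integers ("other" is the reference)
design = (pneumonia.drop(columns=["id", "race"])
          .assign(race_white=(pneumonia["race"] == "white").astype(int),
                  race_black=(pneumonia["race"] == "black").astype(int)))
cph = CoxPHFitter()
cph.fit(design, duration_col="time", event_col="status")
print(cph.summary)

# C-index: lifelines expects higher scores to mean longer survival,
# so the partial hazard (a risk score) is negated
risk = cph.predict_partial_hazard(design)
c_index = concordance_index(pneumonia["time"], -risk, pneumonia["status"])
print(f"C-index: {c_index:.3f}")

# Cross-check with scikit-survival, where higher scores mean higher risk
c_sksurv = concordance_index_censored(
    pneumonia["status"].astype(bool).to_numpy(),
    pneumonia["time"].to_numpy(), risk.to_numpy())[0]
print(f"C-index (sksurv): {c_sksurv:.3f}")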

5 Contrast between the two datasets

Aspect	§ 1.13 Pneumonia	§ 1.14 Weaning
n	3,470	927
Event	First pneumonia hospitalization	Weaning
Time scale	Months	Weeks
Key exposure	Breast feeding (vs never)	Various mother variables
Analysis goal	Causal (effect of breast feeding)	Predictive (prediction from mother factors)
Klein usage	Exercises	Ch.5.4 + Ch.8
Statistical challenge	Confounding control	Model building + validation

Intuition: same source, different tools

§ 1.13 Pneumonia (3,470):

Large sample → room to adjust for many confounders.
Hypothesis: “breast feeding has a protective effect”.
Causal perspective.

§ 1.14 Weaning (927):

Smaller sample, but rich mother variables.
“Which factors predict weaning?”
Predictive perspective.

Pedagogy:

Same NLSY → same data issues to handle (recall bias, self-report).
Different statistical goals → different tools (causal Cox vs predictive Cox + life table).

6 Key intuitions, consolidated

NLSY = a standard longitudinal panel for social epidemiology (thousands of subjects, dozens of variables).
Protective effect of breast feeding = a classic causal inference hypothesis.
Self-report bias + recall bias = inherent limitations of survey data such as the NLSY.
Cohort life table (actuarial) = suited to interval data + large samples.
Predictive vs causal = same Cox model, different evaluation criteria.
C-index = measures the discrimination of a predictive model.
Validation (bootstrap, CV) = checks how well a predictive model generalizes.

7 Practical checklist: § 1.13~1.14

§ 1.13 Pneumonia

Know the NLSY as the data source.
Hypothesis: protective effect of breast feeding.
Confounding control (multivariate Cox).
Be aware of self-report bias + recall bias.
KM + log-rank + Cox PH.

§ 1.14 Weaning

Reason for the first-born + born-from-1978 restriction (recall bias).
Compute the cohort life table (actuarial).
Compare KM with the life table.
Fit a predictive Cox model.
C-index + bootstrap CI.
Differences in evaluation criteria: predictive vs causal.

EDA

Events, censored, and n by group.
Univariate screening.
Multivariate Cox + selection.
Validation (bootstrap, CV).

Next steps

§ 1.15~1.19 (5 additional examples, optional).
On to Ch.2 (Basic Quantities and Models): precise definitions of \(S(t), h(t)\).

8 References

Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.), Ch.1 § 1.13~1.14. Springer.
Bureau of Labor Statistics (2018). NLSY79 User’s Guide. https://www.bls.gov/nls/.
Center for Human Resource Research (1995). NLSY79 Children Documentation. Ohio State University.
World Health Organization (2003). Global Strategy for Infant and Young Child Feeding. Geneva: WHO.
Quigley, M. A., Kelly, Y. J., & Sacker, A. (2007). Breastfeeding and Hospitalization for Diarrheal and Respiratory Infection in the United Kingdom Millennium Cohort Study. Pediatrics, 119(4), e837-e842.
Cutler, S. J., & Ederer, F. (1958). Maximum Utilization of the Life Table Method in Analyzing Survival. Journal of Chronic Diseases, 8(6), 699-712. (Classic reference for the actuarial life table)
Berkson, J., & Gage, R. P. (1950). Calculation of Survival Rates for Cancer. Proceedings of the Mayo Clinic, 25, 270-286.
Steyerberg, E. W. (2009). Clinical Prediction Models: A Practical Approach. Springer. (Classic reference for predictive model building)
Harrell, F. E., Lee, K. L., & Mark, D. B. (1996). Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, 15(4), 361-387. (C-index)
Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data. Springer.
Davidson-Pilon, C. (2019). lifelines. JOSS, 4(40), 1317.
Pölsterl, S. (2020). scikit-survival. JMLR, 21(212), 1-6.