Kwangmin Kim - Recalibration: A Tool for Model Correction

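1 Why Recalibration

Assumption violation: applying a model to a different population

Hypothesis: applying the Framingham model (fitted to a white US population) to a Korean population over- or under-predicts risk.

Causes:
- Differences in baseline risk (ethnicity, environment, lifestyle).
- Differences in effect sizes.
- Differences in measurement instruments.

Intuition in three steps:

Abstract definition: a model is tuned to the distribution of the population it was fitted on. When another population has a different distribution, recalibration is needed.
Everyday analogy: a weather forecast model trained on one city's data does not account for another city's climate.
Counterfactual scenario: applied without recalibration, the model misclassifies risk, and clinical decisions go wrong.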

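2 Three-Level Stepwise Recalibration

Definition: Steyerberg's three levels (2009)

Level 1, Intercept Update:
- Re-estimate only \(\hat\beta_0\).
- Leave all other parameters unchanged.
- Corrects mean calibration.

Level 2, Slope Update:
- Re-estimate \(\hat\beta_0\) and a global slope.
- Scales every \(\hat\beta_j\) by the same factor.
- Corrects the spread of the predictions.

Level 3, Full Re-estimation:
- Re-estimate all parameters.
- The most flexible option.
- Requires a sufficiently large sample from the new population.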

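2.1 Level 1: Intercept Update

Definition: re-estimate the intercept only

On the new data, re-estimate only the intercept of the logistic regression:

\[\text{logit}(\hat r_{\text{new}}) = \beta_0^{\text{new}} + \hat\eta^{\text{old}}\]

Here \(\hat\eta^{\text{old}}\) is the linear predictor of the original model. It already contains the original intercept and enters with its coefficient fixed at 1.

\(\beta_0^{\text{new}}\) is the mean-calibration correction: the shift added on top of the original linear predictor.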

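Intuition in three steps: the meaning of the intercept

Abstract definition: corrects for the difference in baseline risk in the new population. The effect sizes (slopes) are assumed unchanged.
Everyday analogy: two cities with different average temperatures; the mean differs, but the pattern of change is similar.
Counterfactual scenario: if effect sizes are constant across populations, the intercept alone is enough. If they differ, a slope update is needed.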

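2.2 Level 2: Slope Update

Definition: re-estimate the slope as well

\[\text{logit}(\hat r_{\text{new}}) = \beta_0^{\text{new}} + \beta_1^{\text{new}} \cdot \hat\eta^{\text{old}}\]

\(\beta_1^{\text{new}}\) is the global slope correction.

If \(\beta_1^{\text{new}} = 1\), there is no over- or under-fitting. A value \(< 1\) indicates over-fitting (the original predictions are too extreme), and a value \(> 1\) indicates under-fitting (the original predictions are too narrow).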
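Expanding the linear predictor shows why a single slope rescales every coefficient. Assuming the original model has the usual form \(\hat\eta^{\text{old}} = \hat\beta_0 + \sum_j \hat\beta_j x_j\):

\[\text{logit}(\hat r_{\text{new}}) = \beta_0^{\text{new}} + \beta_1^{\text{new}} \Big(\hat\beta_0 + \sum_j \hat\beta_j x_j\Big) = \big(\beta_0^{\text{new}} + \beta_1^{\text{new}} \hat\beta_0\big) + \sum_j \big(\beta_1^{\text{new}} \hat\beta_j\big) x_j\]

For example, with \(\beta_1^{\text{new}} = 0.8\), an original coefficient of \(0.5\) becomes \(0.8 \times 0.5 = 0.4\). Every effect shrinks by the same 20%, and the ratios between coefficients stay as they were.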

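Intuition in three steps: the meaning of the slope

Abstract definition: scales the effects of all covariates by a common factor. This is the standard correction for over-fitting.
Everyday analogy: exam scores are spread too widely, so every score is adjusted by the same proportion.
Counterfactual scenario: if the effect of a single covariate differs between populations, a slope update is not enough, and Level 3 is needed.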

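2.3 Level 3: Full Re-estimation

Definition: Full Re-estimation

Re-estimate every \(\hat\beta\) on the new data. In effect, this fits a new model.

Condition: the new data must be large enough (at least 100 events recommended).

Intuition in three steps: the trade-off of full re-estimation

Abstract definition: the most flexible option, since the effect of each covariate is free. The price is a higher risk of over-fitting.
Everyday analogy: fitting the new data completely; accurate on those data, but possibly less generalizable.
Counterfactual scenario: if the new data set is small, full re-estimation over-fits. An intercept or slope update is the safer choice.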

3 The Recalibration Procedure

[Step 1] Inspect the calibration plot from external validation.
   ↓
[Step 2] Compute mean calibration (α) and the calibration slope (β).
   ↓
[Step 3] α ≈ 0, β ≈ 1: no recalibration needed.
   ↓
[Step 4] α ≠ 0, β ≈ 1: intercept update.
   ↓
[Step 5] β ≠ 1: slope update.
   ↓
[Step 6] Complex pattern: full re-estimation (if enough data are available).
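The decision steps above can be sketched as a small function. The function name, the tolerances that define "≈", and the `complex_pattern` flag are hypothetical choices for illustration, not published cut-offs. The 100-event threshold follows the recommendation given for Level 3.

```python
def suggest_recalibration_level(alpha, beta, n_events,
                                complex_pattern=False,
                                tol_alpha=0.1, tol_beta=0.1,
                                min_events_full=100):
    """Map calibration diagnostics to a recalibration level.

    alpha: mean-calibration intercept (0 is ideal).
    beta: calibration slope (1 is ideal).
    n_events: number of events in the new data.
    complex_pattern: set by the analyst after looking at the calibration plot.
    The tolerances are illustrative assumptions.
    """
    # Step 6: full re-estimation only when the data can support it
    if complex_pattern and n_events >= min_events_full:
        return "full re-estimation"
    # Step 5: slope clearly different from 1
    if abs(beta - 1.0) > tol_beta:
        return "slope update"
    # Step 4: slope fine, intercept off
    if abs(alpha) > tol_alpha:
        return "intercept update"
    # Step 3: nothing to correct
    return "no recalibration"


print(suggest_recalibration_level(alpha=-0.4, beta=1.05, n_events=60))
print(suggest_recalibration_level(alpha=-0.4, beta=0.70, n_events=60))
print(suggest_recalibration_level(alpha=-0.4, beta=0.70, n_events=300,
                                  complex_pattern=True))
```

When the pattern is complex but events are scarce, the sketch falls back to the simpler updates, in line with the advice that small samples should not be fully re-estimated.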

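4 The Case of Reichle et al.

Case: European recalibration of Framingham

Applying the US Framingham model to European populations (the setting of SCORE) overestimates risk.

Remedy:
- Intercept update: corrects for the European baseline risk.
- Slope update: accounts for the different distribution of effect sizes in Europe.

Result: SCORE (Systematic Coronary Risk Evaluation), the European counterpart to Framingham, calibrated to European cohort data.

Intuition in three steps:

Abstract definition: portability of the model is what recalibration is about.
Everyday analogy: applying an exchange rate in a different place; after the conversion, the same tool can be used.
Counterfactual scenario: applied without recalibration, the model leads to wrong decisions (for example, recommending statins to every patient).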

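5 Recalibration in A/B Testing

Case: a conversion model in a new market

A conversion model used for A/B testing is trained on US data and applied to the Korean market.

Model calibrated on US data: \[\text{logit}(\hat r) = -2 + 0.5 \cdot \text{engagement} + 0.3 \cdot \text{paid}\]

What the calibration plot on Korean data may show:
- The baseline conversion rate in Korea differs, which calls for an intercept update.
- Effect sizes may differ with Korean user behavior, which calls for a slope update.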

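# Intercept update (Level 1): the US linear predictor is a fixed offset, so only the intercept is estimated.
# Assumes us_model is a fitted statsmodels Logit result (constant first) and korea_data is a DataFrame.
import numpy as np
import statsmodels.api as sm

korea_X = sm.add_constant(korea_data[["engagement", "paid"]], has_constant="add")
korea_eta = korea_X.to_numpy() @ np.asarray(us_model.params)
intercept_model = sm.GLM(korea_data["conversion"], np.ones((len(korea_data), 1)),
                         family=sm.families.Binomial(), offset=korea_eta).fit()
new_intercept = np.asarray(intercept_model.params)[0]  # correction added to the US linear predictor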

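Intuition in three steps:

Abstract definition: correcting for baseline differences between markets gives a more accurate lift estimate.
Everyday analogy: a sales forecast model for another city, applied after a market-specific correction.
Counterfactual scenario: applied without recalibration, the model leads to wrong pricing and campaign decisions.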

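6 Calibration Drift: Recalibration over Time

Assumption violation: temporal stability of the model

Hypothesis: an ASCVD model fitted on 2010 data is applied in 2025. In the meantime, wider statin use, dietary change, and revised diagnostic criteria have shifted the true risk distribution.

Symptoms:
- The calibration plot moves away from the diagonal over time.
- Over-prediction increases in the high-risk range.

Intuition in three steps:

Abstract definition: \(P(Y \mid X)\) is a function of calendar time. The distribution at fitting time has changed by the time the model is applied.
Everyday analogy: a currency conversion table from 10 years ago is inaccurate today. It needs regular updates.
Counterfactual scenario: without updates, miscalibration accumulates. Periodic recalibration (for example, every 5 years) is the standard remedy; an alternative is a model with a time-dependent baseline hazard.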

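6.1 Detecting Drift

# Calibration by time stratum.
# Assumes `data` is a pandas DataFrame with columns "year_group", "Y" (0/1 outcome), and "pred_risk".
for year_group in ["2015-2019", "2020-2024"]:
    sub = data[data["year_group"] == year_group]
    obs = sub["Y"].mean()
    pred = sub["pred_risk"].mean()
    print(f"{year_group}: observed={obs:.3f}, predicted={pred:.3f}, ratio={obs/pred:.2f}")

If the observed/predicted ratio changes over time, the model is drifting.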
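To make the check concrete, here is a small self-contained sketch on simulated data. The data, the column names, and the drift size (true risk falling to 70% of the predicted risk in the later period) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 4000

# Hypothetical data: two calendar periods with the same predicted-risk distribution
data = pd.DataFrame({
    "year_group": np.repeat(["2015-2019", "2020-2024"], n // 2),
    "pred_risk": rng.uniform(0.02, 0.40, n),
})

# Simulated drift: predictions are correct early on, too high later
scale = np.where(data["year_group"] == "2015-2019", 1.0, 0.7)
data["Y"] = rng.binomial(1, data["pred_risk"].to_numpy() * scale)

# Observed rate, mean predicted risk, and their ratio per period
summary = data.groupby("year_group").agg(
    observed=("Y", "mean"),
    predicted=("pred_risk", "mean"),
)
summary["ratio"] = summary["observed"] / summary["predicted"]
print(summary.round(3))
```

A ratio that falls from one period to the next, as in this simulation, is the pattern that signals growing over-prediction.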

7 Subgroup-Specific Recalibration

Definition: recalibration by subgroup

If calibration differs across subgroups (sex, ethnicity, age group), apply a subgroup-specific intercept update.

\[\text{logit}(\hat r_g) = \beta_0^g + \hat\eta^{\text{old}}\]

A separate \(\beta_0^g\) is estimated for each subgroup \(g\).

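Intuition in three steps: the value of subgroup recalibration

Abstract definition: the model's systematic bias differs by subgroup, so the correction is made per group.
Everyday analogy: exam grading that follows a different standard in each school year needs a correction per year.
Counterfactual scenario: a single pooled intercept update hides the differences between subgroups. Subgroup-specific correction is more precise.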

Example (illustrative values): race-specific correction of Framingham:
- White: \(\beta_0\) (reference).
- Black: \(\beta_0 + 0.3\) (correction for a higher baseline).
- Asian: \(\beta_0 - 0.2\).
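To see what such shifts mean on the probability scale, the following sketch applies them to a hypothetical predicted risk of 10%. The function name and the 10% starting point are assumptions for illustration.

```python
import math


def shift_risk(risk, delta):
    """Apply an intercept shift `delta` (logit scale) to a probability."""
    logit = math.log(risk / (1.0 - risk))
    return 1.0 / (1.0 + math.exp(-(logit + delta)))


base = 0.10  # hypothetical predicted risk under the reference intercept
for label, delta in [("reference", 0.0), ("+0.3", 0.3), ("-0.2", -0.2)]:
    print(f"{label:>9}: {shift_risk(base, delta):.3f}")
```

A shift of +0.3 moves a 10% risk to roughly 13%, and a shift of -0.2 moves it to roughly 8%. The same shift changes the probability by a different amount at other starting risks, because the correction is additive on the logit scale, not on the probability scale.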

8 Code Example: The Three Levels of Recalibration

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Hypothetical: a model trained on US data, applied to Korean data
us_model_params = {"intercept": -3.5, "smoke": 0.7, "age": 0.04}

# Korean data (simulated; Y is drawn independently of the covariates, for illustration only)
n = 500
korea = pd.DataFrame({
    "smoke": rng.binomial(1, 0.45, n),
    "age": rng.normal(55, 12, n),
    "Y": rng.binomial(1, 0.15, n),
})

# Linear predictor of the original model (applied to Korea)
korea["eta_old"] = (us_model_params["intercept"]
                    + us_model_params["smoke"] * korea["smoke"]
                    + us_model_params["age"] * korea["age"])

# Level 1: intercept update (eta_old is an offset, so its slope stays fixed at 1)
const = pd.DataFrame({"const": np.ones(n)}, index=korea.index)
m1 = sm.GLM(korea["Y"], const, family=sm.families.Binomial(),
            offset=korea["eta_old"].to_numpy()).fit()
alpha_l1 = m1.params["const"]
print(f"Level 1 (intercept only): α_new = {alpha_l1:.3f}")

# Level 2: slope update (intercept and global slope both free)
m2 = sm.Logit(korea["Y"], sm.add_constant(korea["eta_old"], has_constant="add")).fit(disp=0)
new_intercept = m2.params["const"]
new_slope = m2.params["eta_old"]
print(f"Level 2 (intercept + slope): α = {new_intercept:.3f}, β = {new_slope:.3f}")
print("  β = 1 → no over/under-fit. β < 1 → over-fit.")

# Level 3: full re-estimation
X_full = sm.add_constant(korea[["smoke", "age"]])
m3 = sm.Logit(korea["Y"], X_full).fit(disp=0)
print("\nLevel 3 (full re-estimation):")
print(m3.params)

Interpretation: compare the results of the three levels and choose the most appropriate level of recalibration.

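9 Q&A: Common Misconceptions

Q1: Does AUC change after recalibration?

A: No. An intercept or slope update preserves the ranking of the predictions (as long as the updated slope is positive), so AUC is unchanged.

Intuition in three steps:

Abstract definition: AUC is a rank-based measure. After a monotone increasing transformation (intercept, positive slope) the ranks are the same, so AUC is the same.
Everyday analogy: adding the same adjustment to every student's score leaves the class ranking unchanged.
Counterfactual scenario: full re-estimation changes the relative weights of the variables, so AUC can change.

Conclusion: only Level 3 can change AUC. Levels 1 and 2 affect calibration only.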
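The rank argument can be checked numerically. The sketch below computes AUC from ranks (the Mann-Whitney formula) on simulated data; the recalibration values -0.4 and 0.6 are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def auc(y, score):
    """AUC via the Mann-Whitney rank formula (ties receive average ranks)."""
    y = np.asarray(y)
    ranks = rankdata(score)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(1)
eta_old = rng.normal(-1.0, 1.2, 1000)       # hypothetical original linear predictor
y = rng.binomial(1, expit(0.6 * eta_old))   # simulated outcomes

risk_old = expit(eta_old)
risk_l1 = expit(-0.4 + eta_old)             # Level 1: intercept update
risk_l2 = expit(-0.4 + 0.6 * eta_old)       # Level 2: positive slope
risk_neg = expit(-0.4 - 0.6 * eta_old)      # negative slope reverses the ranking

for name, r in [("original", risk_old), ("level 1", risk_l1),
                ("level 2", risk_l2), ("negative slope", risk_neg)]:
    print(f"{name:>15}: AUC = {auc(y, r):.4f}")
```

The first three AUC values coincide. The negative-slope case is included only to show why the positivity condition matters: it flips the ranking and gives one minus the original AUC.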

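Q2: Can Level 3 be applied to a small sample?

A: Not recommended. With fewer than 100 events in the new data, there is a risk of over-fitting.

Intuition in three steps:

Abstract definition: the variance of the \(\beta\) estimates is roughly inversely proportional to the number of events, so small data sets give noisy estimates.
Everyday analogy: setting a new grading standard from the exam scores of 5 students is largely arbitrary.
Counterfactual scenario: only the intercept update (Level 1) is safe. Move to Level 2 and then Level 3 as more data accumulate.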
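A worked special case shows the inverse dependence on events. It assumes an intercept-only logistic model, which is a simplification and not the general multi-covariate result. With \(n\) observations, event probability \(p\), and \(E = np\) expected events, the Fisher information for the intercept is \(np(1-p)\), so

\[\operatorname{Var}(\hat\beta_0) \approx \frac{1}{n p (1-p)} = \frac{1}{E (1-p)}\]

With \(p = 0.1\), \(E = 20\) events give a standard error of about \(0.24\) on the logit scale, while \(E = 100\) events give about \(0.11\). Each additional free parameter in Level 2 or Level 3 has to be estimated from the same limited information.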

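Q3: Calibration is poor after external validation. Should the model be discarded?

A: No. It can be corrected by recalibration.

Intuition in three steps:

Abstract definition: if the model's ranking (discrimination) is preserved, recalibration corrects calibration alone.
Everyday analogy: applying one school's grading standard at another school; only the grading scale is adjusted, and the ranking is kept.
Counterfactual scenario: if discrimination (AUC) is also poor, the model itself is inadequate, and full re-estimation or a new model is needed.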

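10 Clinical Cases: Recalibration across Diverse Populations

Case: race-specific calibration of ASCVD risk

The AHA/ACC Pooled Cohort Equations (US):
- Separate models for the four race-sex combinations (White vs Black, Male vs Female).
- Intercept and slopes are calibrated separately for each model.

Intuition in three steps:

Abstract definition: baseline risk and effect sizes differ by race and sex, so separate models with separate calibration are used.
Everyday analogy: separate assessment tools for four kinds of schools (boys/girls × selective/general).
Counterfactual scenario: with a single pooled model, accuracy rises for one group and falls for another. Separate models serve equity.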

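Case: European calibration with SCORE

SCORE, from the ESC (European Society of Cardiology), is the European counterpart to Framingham, calibrated to European data.

Recalibration procedure:
1. Fit a logistic model on US data.
2. Estimate intercept and slope on European cohort data.
3. In addition: separate versions for low-risk and high-risk European countries.

Result: SCORE Low (Belgium, France, …) vs SCORE High (Russia, Romania, …).

Intuition in three steps:

Abstract definition: baseline risk varies widely even within Europe, so the correction is made per subgroup.
Everyday analogy: applying the same exam in different countries, with the pass mark adjusted per country.
Counterfactual scenario: a single European model would underestimate risk for patients in high-risk countries. Subgroup-specific correction is the standard.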

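11 Multi-Country Recalibration in A/B Testing

Case: market-specific correction of a global A/B model

An A/B model at a large IT company:
- Trained on US data.
- Applied to the Korean, Japanese, Indonesian, and Brazilian markets, with calibration checked in each market.

Differences between markets:
- Baseline conversion rate.
- Treatment effect.
- User behavior patterns.

Response:
- Level 1: intercept update per market.
- Level 2: adjust the treatment slope (proportional adjustment of effects per market).
- Level 3: full re-estimation for large markets (for example, the US and Korea).

Intuition in three steps:

Abstract definition: a global model plus market-specific correction combines the global and the local.
Everyday analogy: a global film rating plus reviews from each country; an average plus a regional adjustment.
Counterfactual scenario: a single model applied globally loses accuracy in some markets. Market-specific correction is more precise.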

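12 Conclusion

The three levels of recalibration (intercept → slope → full) are the tools that make a model portable. The calibration plot from external validation is the first-line diagnostic. The appropriate level depends on the size of the new data set. Subgroup- and market-specific correction is the standard for global use. Regular checks for drift over time are also essential.

The next post (H-WOO13-7) covers the Brier score and extraneous variables.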
