Kwangmin Kim - Time-Varying G-Formula and IPW MSM

1 Definitions

Definition: Time-Varying G-Formula

Under sequential exchangeability, positivity, and consistency:

\[\mathrm{E}[Y^{\bar{a}}] = \sum_{\bar{l}} \mathrm{E}[Y | \bar{A}=\bar{a}, \bar{L}=\bar{l}] \prod_{k=0}^{K} f(l_k | \bar{a}_{k-1}, \bar{l}_{k-1})\]

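The distribution of the covariate $L_k$ at each time point is conditional on past treatment and past covariates, hence sequential modeling.

→ This is the time-varying form of the g-formula of Robins (1986), and the core tool of Part III.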
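As a worked special case (a sketch assuming two time points, $K = 1$, and no baseline covariate $L_0$, which is the setting of Table 21.1), the general sum reduces to a single sum over $l_1$:

\[\mathrm{E}[Y^{a_0, a_1}] = \sum_{l_1} \mathrm{E}[Y | A_0 = a_0, A_1 = a_1, L_1 = l_1] \, \Pr(L_1 = l_1 | A_0 = a_0)\]

The outcome mean is taken within the full treatment and covariate history, while the weight on each $l_1$ depends only on the earlier treatment $a_0$.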

Definition: Time-Varying IPW MSM

Unstabilized weights $W^{\bar{A}}$ and stabilized weights $SW^{\bar{A}}$:

\[W^{\bar{A}} = \prod_{k=0}^{K} \frac{1}{f(A_k | \bar{A}_{k-1}, \bar{L}_k)}, \qquad SW^{\bar{A}} = \prod_{k=0}^{K} \frac{f(A_k | \bar{A}_{k-1})}{f(A_k | \bar{A}_{k-1}, \bar{L}_k)}\]

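In the pseudo-population, each $A_k$ is independent of the measured covariate history $\bar{L}_k$ given past treatment $\bar{A}_{k-1}$: an artificial recovery of a sequentially randomized experiment.

Marginal Structural Model: \[\mathrm{E}[Y^{\bar{a}}] = g(\bar{a}; \boldsymbol{\beta})\]

$g$ is a functional form in the treatment at each time point. $\boldsymbol{\beta}$ is estimated by weighted regression.

Intuition (the essential difference between the two tools): the g-formula relies on models for the outcome and the covariate distribution (sequential). IPW MSM relies on models for treatment (sequential). Because the two depend on different models, agreement between their results is a sign of robustness.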

2 21.1 Time-Varying G-Formula

2.1 Simple Example: Applying the G-Formula to Table 21.1

Quantitative analysis from Hernan Ch. 21.1

In Table 21.1, for the strategy never treat ($a_0 = a_1 = 0$):

Step 1: Distribution of $L_1$ (given $A_0 = 0$):
- $\Pr(L_1 = 0 | A_0 = 0) = (2400 + 1600) / 16000 = 0.25$.
- $\Pr(L_1 = 1 | A_0 = 0) = (2400 + 9600) / 16000 = 0.75$.

Step 2: Mean outcome (given $A_0 = 0, A_1 = 0$):
- $\mathrm{E}[Y | A_0=0, A_1=0, L_1=0] = 84$.
- $\mathrm{E}[Y | A_0=0, A_1=0, L_1=1] = 52$.

Step 3: G-formula: \[\widehat{\mathrm{E}}[Y^{a_0=0, a_1=0}] = 84 \times 0.25 + 52 \times 0.75 = 60\]

By the same procedure, $\widehat{\mathrm{E}}[Y^{a_0=1, a_1=1}] = 76 \times 0.5 + 44 \times 0.5 = 60$.

ATE = 0, which is exactly right (the traditional tools of Ch. 20 give -8).
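The three steps above can be reproduced directly from the cell counts. The following is a minimal sketch, not the book's code: the `TABLE` dictionary and the helper name `g_formula` are illustrative choices, and the numbers are the Table 21.1 counts and means quoted in this section.

```python
# Cell counts and mean outcomes from Table 21.1, keyed by (A0, L1, A1).
TABLE = {
    (0, 0, 0): (2400, 84), (0, 0, 1): (1600, 84),
    (0, 1, 0): (2400, 52), (0, 1, 1): (9600, 52),
    (1, 0, 0): (4800, 76), (1, 0, 1): (3200, 76),
    (1, 1, 0): (1600, 44), (1, 1, 1): (6400, 44),
}


def g_formula(a0, a1, table=TABLE):
    """Nonparametric g-formula for two time points:
    sum over l1 of E[Y | a0, l1, a1] * Pr(L1 = l1 | A0 = a0)."""
    n_a0 = sum(n for (A0, _, _), (n, _) in table.items() if A0 == a0)
    total = 0.0
    for l1 in (0, 1):
        n_l1 = sum(n for (A0, L1, _), (n, _) in table.items()
                   if A0 == a0 and L1 == l1)
        pr_l1 = n_l1 / n_a0              # Step 1: Pr(L1 = l1 | A0 = a0)
        mean_y = table[(a0, l1, a1)][1]  # Step 2: E[Y | A0 = a0, L1 = l1, A1 = a1]
        total += mean_y * pr_l1          # Step 3: weighted sum
    return total


never = g_formula(0, 0)
always = g_formula(1, 1)
print(never, always, always - never)
```

Because the estimate uses only observed proportions and cell means, no parametric model is involved at this stage.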

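Intuition (why the g-formula works): $L_1$ is drawn from its distribution given $A_0=0$. In other words, the outcome is averaged in “the pseudo-population in which $A_0=0$”. This avoids the collider trap of a conditional analysis (adjusting for $L_1$).

Intuition (the decisive difference from the stratification of Ch. 20): stratification compares within each stratum, which conditions on a collider. The g-formula takes a weighted average of the stratum-specific means, which converts them to a marginal quantity. Same data, different frame, different result.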
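To make the contrast concrete, the sketch below computes both quantities from the same Table 21.1 numbers. It assumes the stratified analysis compares $A_0 = 1$ with $A_0 = 0$ within levels of $L_1$ (in these data the mean of $Y$ does not vary with $A_1$); the names `MEAN_Y`, `PR_L1`, and `standardized_mean` are illustrative.

```python
# Mean outcome by (A0, L1) in Table 21.1; Y does not vary with A1 in these data.
MEAN_Y = {(0, 0): 84, (0, 1): 52, (1, 0): 76, (1, 1): 44}
# Pr(L1 = 1 | A0), from the counts: 12000/16000 and 8000/16000.
PR_L1 = {0: 0.75, 1: 0.50}

# Stratified (conditional) contrast: A0 = 1 vs A0 = 0 within each L1 stratum.
stratum_diff = {l1: MEAN_Y[(1, l1)] - MEAN_Y[(0, l1)] for l1 in (0, 1)}


def standardized_mean(a0):
    """Average the stratum means over that arm's own L1 distribution."""
    p1 = PR_L1[a0]
    return MEAN_Y[(a0, 0)] * (1 - p1) + MEAN_Y[(a0, 1)] * p1


# G-formula (marginal) contrast.
marginal_diff = standardized_mean(1) - standardized_mean(0)

print(stratum_diff)   # stratum-specific differences
print(marginal_diff)  # marginal difference
```

Both strata show a difference of -8, yet the marginal contrast is 0, because the two arms are averaged over different $L_1$ distributions.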

2.2 Causally Interpreted Structured Tree Graph (Figure 21.1, 21.2)

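Intuition Behind the Tree Graph

Figure 21.1, the tree for the observed data:
- Branches at each time point: $A_0 \to L_1 \to A_1$.
- Each leaf shows its N and mean Y.
- Treatment-decision probabilities (e.g., 0.5, 0.4, 0.8) are shown.

Figure 21.2, the tree for the counterfactual data (always treat):
- Force $A_0 = 1$ → only the $L_1$ distribution is sampled (using $\Pr(L_1 | A_0=1)$ from the observed data).
- Force $A_1 = 1$ → every simulated patient has $A_1=1$.
- Mean outcome: the weighted average over the simulated distribution.

→ The g-formula is the procedure that converts the observed tree into the counterfactual tree.

Intuition (what the tree transformation means): only the treatment-decision branches of the observed tree are replaced by the hypothetical ones (all 1). The covariate distribution is sampled as is from the observed data. The tree is a visual representation of sequential simulation.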
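The tree transformation can be sketched as a function that returns the leaves of the counterfactual tree for a static strategy. This is an illustrative sketch only: the function name `counterfactual_tree` and the dictionary layout are assumptions, and the probabilities and means are the Table 21.1 values used earlier.

```python
N_TOTAL = 32000
# Pr(L1 = l1 | A0 = a0) taken from the observed tree.
PR_L1_GIVEN_A0 = {0: {0: 0.25, 1: 0.75}, 1: {0: 0.50, 1: 0.50}}
# Mean outcome keyed by (A0, L1); it does not vary with A1 in Table 21.1.
MEAN_Y = {(0, 0): 84, (0, 1): 52, (1, 0): 76, (1, 1): 44}


def counterfactual_tree(a0, a1, n=N_TOTAL):
    """Leaves of the tree when every subject is assigned (a0, a1).

    Treatment branches are forced; the L1 branch keeps its observed
    conditional distribution."""
    return [
        {"A0": a0, "L1": l1, "A1": a1, "N": n * pr, "mean_Y": MEAN_Y[(a0, l1)]}
        for l1, pr in PR_L1_GIVEN_A0[a0].items()
    ]


tree = counterfactual_tree(1, 1)  # always treat
mean_y = sum(leaf["N"] * leaf["mean_Y"] for leaf in tree) / N_TOTAL
for leaf in tree:
    print(leaf)
print(mean_y)
```

Under always treat the tree has two leaves of 16,000 subjects each, and their size-weighted mean outcome is the g-formula estimate.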

2.3 Monte Carlo Procedure for the Parametric G-Formula

Parametric G-Formula Algorithm

With high-dimensional data (multiple confounders and multiple time points), the parametric g-formula proceeds as follows:

Step 1: Fit the outcome model: \[\mathrm{E}[Y | \bar{A}, \bar{L}] = \alpha_0 + \alpha_1 \bar{A} + \alpha_2 \bar{L} + \cdots\]

Step 2: Fit a covariate model at each time point $k$: \[f(L_k | \bar{A}_{k-1}, \bar{L}_{k-1}) = \text{logistic or linear regression}\]

Step 3: Monte Carlo simulation:
- For each simulated patient: force $A_0$ → sample $L_1$ from its model → force $A_1$ → sample $L_2$ → … → predict $Y$.
- Simulate 1000 hypothetical patients.

Step 4: Average: \[\widehat{\mathrm{E}}[Y^{\bar{a}}] = \frac{1}{n_\text{sim}} \sum_i \widehat{Y}_i^{\bar{a}}\]

Use the bootstrap to obtain a 95% CI.
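A percentile bootstrap for the g-formula contrast might look like the sketch below. It is a simplified illustration under stated assumptions: it uses the nonparametric (saturated) g-formula on the Table 21.1 cells rather than refitting parametric models, and it draws bootstrap cell counts from a multinomial, which is equivalent to resampling the 32,000 rows because rows within a cell are identical. The function name `ate_g_formula`, the seed, and the 500 replicates are arbitrary choices.

```python
import numpy as np

# Table 21.1 cells; Y is the mean outcome in each (A0, L1, A1) cell.
A0 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
L1 = np.array([0, 0, 1, 1, 0, 0, 1, 1])
A1 = np.array([0, 1, 0, 1, 0, 1, 0, 1])
N = np.array([2400, 1600, 2400, 9600, 4800, 3200, 1600, 6400])
Y = np.array([84, 84, 52, 52, 76, 76, 44, 44], dtype=float)


def ate_g_formula(counts):
    """Always-treat minus never-treat mean via the nonparametric g-formula."""
    means = []
    for a in (0, 1):  # never treat (0, 0), then always treat (1, 1)
        arm = A0 == a
        total = 0.0
        for l in (0, 1):
            pr_l1 = counts[arm & (L1 == l)].sum() / counts[arm].sum()
            mean_y = Y[arm & (L1 == l) & (A1 == a)][0]
            total += mean_y * pr_l1
        means.append(total)
    return means[1] - means[0]


rng = np.random.default_rng(0)
n = int(N.sum())
# One multinomial draw of the 8 cell counts = one bootstrap resample of rows.
boot = np.array([ate_g_formula(rng.multinomial(n, N / n)) for _ in range(500)])
lo, hi = np.percentile(boot, [2.5, 97.5])
point = ate_g_formula(N)
print(f"ATE = {point:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

With a parametric g-formula, each bootstrap replicate would instead refit every covariate and outcome model and rerun the simulation, which is why the bootstrap dominates the running time in practice.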

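Intuition (the essence of Monte Carlo): the analytic form (sum and product) cannot be computed in high dimensions. Simulation is the workaround: numerical computation replaces the analytic expression.

Intuition (software support): the gfoRmula R package (Lin et al. 2019) and the GFORMULA SAS macro are the standard implementations of the time-varying g-formula. The computational cost is high: on large data an analysis can take on the order of minutes.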

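2.4 Pitfalls of the G-Formula

Misspecification Risks of the G-Formula

Misspecified outcome model: a wrong functional form for $\mathrm{E}[Y | \bar{A}, \bar{L}]$ → biased results under every hypothetical scenario.
Misspecified covariate distribution model: a wrong distribution for $f(L_k | \cdot)$ → biased simulated distribution.
Bias accumulates with longer follow-up: a small misspecification at each time point compounds through the product over time points.
Positivity violation: treatment probability of 0 or 1 for some covariate histories → sampling from the distribution requires extrapolation.

→ A doubly robust (DR) estimator (Ch.21.3) or a sensitivity analysis across multiple specifications is essential.

Intuition (the cumulative risk of sequential modeling): misspecifying one model at one time point affects the covariate distribution at the next time point, which affects the outcome after that. The longer the follow-up, the more the risk is amplified. This is the central challenge of the time-varying g-formula.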
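A stylized calculation shows how quickly small errors compound in a product. The sketch assumes, purely for illustration, that every factor of the product is off by the same relative amount (2% here); real misspecification does not behave this uniformly, and the function name is hypothetical.

```python
def compounded_relative_error(per_step_error, n_timepoints):
    """Relative error of a product of n factors, each off by the same
    relative amount (a deliberately simplified error model)."""
    return (1 + per_step_error) ** n_timepoints - 1


for n_timepoints in (2, 10, 60):
    err = compounded_relative_error(0.02, n_timepoints)
    print(f"{n_timepoints:>2} time points: relative error {err:.1%}")
```

Under this simplified model, a 2% error per factor is negligible over 2 time points but more than triples the product over 60.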

3 21.2 Time-Varying IPW MSM

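3.1 Time-Varying Generalization of the IPW Weights

Deriving the Weights (Hernan Ch.21.2)

Treatment model at each time point $k$: \[\Pr(A_k = a_k | \bar{A}_{k-1}, \bar{L}_k)\]

Unstabilized weight = product of the inverse probabilities at each time point: \[W_i^{\bar{A}} = \prod_{k=0}^{K} \frac{1}{\widehat{\Pr}(A_{k,i} = a_{k,i} | \bar{A}_{k-1,i}, \bar{L}_{k,i})}\]

Meaning of the pseudo-population:
- Each patient is replicated as many times as their weight.
- In the pseudo-population, treatment at each time point is independent of the measured covariates.
- This is an artificial recovery of a sequentially randomized experiment.

Intuition (product of time-specific IP weights): if the probability of receiving treatment at time 0 is 0.5, the weight is 2. If it is 0.4 at time 1, the weight is 2.5. A patient treated at both times has cumulative weight 2 × 2.5 = 5. The cumulative product is at risk of exploding.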
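The running product in this example can be written in a few lines. This is a minimal sketch; the helper name `cumulative_ip_weights` is illustrative, and its input is assumed to be the probability of the treatment each subject actually received at each time point.

```python
from itertools import accumulate
from operator import mul


def cumulative_ip_weights(probs_of_observed_treatment):
    """Running product of 1 / Pr(A_k = observed a_k | past) over time points."""
    return list(accumulate((1 / p for p in probs_of_observed_treatment), mul))


weights = cumulative_ip_weights([0.5, 0.4])
print(weights)  # weight after time 0, then after time 1
```

The last element is the subject's weight $W^{\bar{A}}$; the intermediate values show how the weight grows with each additional time point.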

3.2 IPW Analysis of Hernan Table 21.1

Quantitative Analysis of Figure 21.3

Applying IPW to the Table 21.1 data:

Weight calculation: $A_0$ is randomized 50:50, and $A_1$ depends on $L_1$ (treated with probability 40% if $L_1=0$, 80% if $L_1=1$).

$A_0$	$L_1$	$A_1$	$\Pr(A_0)$	$\Pr(A_1 | L_1)$	$W^{\bar{A}}$
0	0	0	0.5	0.6 (1-0.4)	$1/(0.5 \times 0.6) = 3.33$
0	0	1	0.5	0.4	$1/(0.5 \times 0.4) = 5$
0	1	0	0.5	0.2 (1-0.8)	$1/(0.5 \times 0.2) = 10$
0	1	1	0.5	0.8	$1/(0.5 \times 0.8) = 2.5$
1	0	0	0.5	0.6	$3.33$
1	0	1	0.5	0.4	$5$
1	1	0	0.5	0.2	$10$
1	1	1	0.5	0.8	$2.5$

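Pseudo-population (each patient counted $W^{\bar{A}}$ times):
- Total size: $32000 \times 4 = 128,000$ (4 = number of possible treatment histories $(A_0, A_1)$).
- Each $(A_0, A_1)$ combination contains 32,000 people, and within each $A_0$ arm the $L_1$ distribution is the same for $A_1=0$ and $A_1=1$.

Results:
- $\mathrm{E}_{ps}[Y | A_0=0, A_1=0] = $ weighted mean = 60.
- $\mathrm{E}_{ps}[Y | A_0=1, A_1=1] = $ 60.
- ATE = 0 ✓.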
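The pseudo-population sizes and weighted means can be checked with a short script. This is a sketch under the assumptions stated in the text (known treatment probabilities of 0.5 for $A_0$ and 0.4 or 0.8 for $A_1$); the variable names are illustrative.

```python
# (A0, L1, A1, N, mean Y) for each row of Table 21.1.
ROWS = [
    (0, 0, 0, 2400, 84), (0, 0, 1, 1600, 84),
    (0, 1, 0, 2400, 52), (0, 1, 1, 9600, 52),
    (1, 0, 0, 4800, 76), (1, 0, 1, 3200, 76),
    (1, 1, 0, 1600, 44), (1, 1, 1, 6400, 44),
]
PR_A0 = 0.5                          # A0 randomized 50:50, same for both values
PR_A1_GIVEN_L1 = {0: 0.4, 1: 0.8}    # Pr(A1 = 1 | L1)


def weight(a0, l1, a1):
    """Unstabilized weight: inverse probability of the observed (A0, A1)."""
    p1 = PR_A1_GIVEN_L1[l1] if a1 == 1 else 1 - PR_A1_GIVEN_L1[l1]
    return 1 / (PR_A0 * p1)


pseudo_n = {}    # pseudo-population size per (A0, A1)
pseudo_sum = {}  # weighted sum of Y per (A0, A1)
for a0, l1, a1, n, y in ROWS:
    w = weight(a0, l1, a1)
    pseudo_n[(a0, a1)] = pseudo_n.get((a0, a1), 0.0) + n * w
    pseudo_sum[(a0, a1)] = pseudo_sum.get((a0, a1), 0.0) + n * w * y

pseudo_mean = {k: pseudo_sum[k] / pseudo_n[k] for k in pseudo_n}
total_size = sum(pseudo_n.values())
print(total_size, pseudo_n, pseudo_mean)
```

In this data set the weighted mean is 60 for all four treatment histories, not only for never treat and always treat.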

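Intuition (the effect of the weights): patients with a history that made their observed treatment unlikely (large weight) are represented more heavily in the pseudo-population. In the pseudo-population every treatment history is equally represented, which removes confounding.

Intuition (exactly the same result as the g-formula): two different tools both arrive at 0. They depend on different models, so their agreement is a signal of robustness. Hernan's emphasis: “both tools give the correct answer”.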

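3.3 The Critical Importance of Stabilized Weights

Why Stabilization Is Essential in the Time-Varying Setting

Unstabilized $W^{\bar{A}}$:
- The product over time points makes the variance explode.
- With 60 binary treatment time points (NHEFS-like), the mean weight is $2^{60}$, the number of possible treatment histories: effectively infinite.
- The variance of the effect estimate becomes unrealistic.

Stabilized $SW^{\bar{A}}$:
- The numerator $\prod_k f(A_k | \bar{A}_{k-1})$ absorbs the variance.
- The mean weight is $\approx 1$ and the variance is reasonable.
- In the time-varying setting, stabilization is essential.

In an NHEFS-like analysis, unstabilized weights make the variance explode and the results useless. With stabilization the 95% CI is reasonable.

Intuition (meaning of the stabilizing numerator): $f(A_k | \bar{A}_{k-1})$ is “the treatment probability when only past treatment is known”, the baseline when the covariates are unknown. The denominator is “the exact probability when the covariates are known”. Expressing the weight as a ratio of the two, rather than as a pure inverse, stabilizes the variance.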
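For the two-time-point Table 21.1 data the difference in mean weight can be computed exactly. The sketch below assumes no baseline covariate $L_0$, so the time-0 factor of the stabilized weight is $\Pr(A_0)/\Pr(A_0) = 1$; the helper `pr_a1` is an illustrative name, and all probabilities are computed from the cell counts.

```python
# (A0, L1, A1, N) for each row of Table 21.1.
ROWS = [
    (0, 0, 0, 2400), (0, 0, 1, 1600), (0, 1, 0, 2400), (0, 1, 1, 9600),
    (1, 0, 0, 4800), (1, 0, 1, 3200), (1, 1, 0, 1600), (1, 1, 1, 6400),
]


def pr_a1(rows, a0, l1=None):
    """Pr(A1 = 1 | A0 = a0), or Pr(A1 = 1 | A0 = a0, L1 = l1) if l1 is given."""
    sel = [(a1, n) for A0, L1, a1, n in rows
           if A0 == a0 and (l1 is None or L1 == l1)]
    return sum(n for a1, n in sel if a1 == 1) / sum(n for _, n in sel)


total_n = sum(n for *_, n in ROWS)
sum_w = sum_sw = 0.0
for a0, l1, a1, n in ROWS:
    p_den = pr_a1(ROWS, a0, l1)   # denominator model: uses L1
    p_num = pr_a1(ROWS, a0)       # numerator model: past treatment only
    denom = p_den if a1 == 1 else 1 - p_den
    numer = p_num if a1 == 1 else 1 - p_num
    w = 1 / (0.5 * denom)         # unstabilized: time-0 factor 1 / Pr(A0) = 2
    sw = numer / denom            # stabilized: time-0 factor is 1
    sum_w += n * w
    sum_sw += n * sw

mean_w = sum_w / total_n
mean_sw = sum_sw / total_n
print(mean_w, mean_sw)
```

The unstabilized mean equals the number of treatment histories ($2^2 = 4$ here), while the stabilized mean is 1, so the stabilized pseudo-population keeps the original sample size.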

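Intuition (Hernan's recommendation): always stabilize time-varying IP weights. Unstabilized weights are advisable only for a single time point. The two are equivalent in theory but differ in practice.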

3.4 IPW Weighted Logistic Regression

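Fitting the MSM by Weighted Regression

The MSM in an NHEFS-like analysis: \[\mathrm{E}[Y^{\bar{a}}] = \beta_0 + \beta_1 \cdot \text{cum}(\bar{a}) + \beta_2 \cdot a_K\]

Weighted regression: \[\mathrm{E}[Y | \bar{A}] = \beta_0 + \beta_1 \cdot \text{cum}(\bar{A}) + \beta_2 \cdot A_K\] with weights $\widehat{SW}_i^{\bar{A}}$.

Estimate the variance with GEE or robust SEs (weighting breaks the IID assumption).

The bootstrap is also recommended, to account for the accumulated variance from estimating the weights at each time point.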
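A weighted least-squares fit of this MSM with a sandwich (robust) variance can be sketched in plain NumPy. This is an illustration on the Table 21.1 data, not an NHEFS analysis: it assumes two time points, so $\text{cum}(\bar{A}) = A_0 + A_1$ and $A_K = A_1$, and the variance formula treats the estimated weights as known, which is why the bootstrap remains advisable.

```python
import numpy as np

# Table 21.1 cells.
A0 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
L1 = np.array([0, 0, 1, 1, 0, 0, 1, 1])
A1 = np.array([0, 1, 0, 1, 0, 1, 0, 1])
N = np.array([2400, 1600, 2400, 9600, 4800, 3200, 1600, 6400])
Y = np.array([84, 84, 52, 52, 76, 76, 44, 44], dtype=float)

# Stabilized weight per cell: Pr(A1 | A0) / Pr(A1 | A0, L1); time-0 factor is 1.
sw_cell = np.empty(len(N))
for i in range(len(N)):
    same_a0 = A0 == A0[i]
    same_a0_l1 = same_a0 & (L1 == L1[i])
    numer = N[same_a0 & (A1 == A1[i])].sum() / N[same_a0].sum()
    denom = N[same_a0_l1 & (A1 == A1[i])].sum() / N[same_a0_l1].sum()
    sw_cell[i] = numer / denom

# Expand to one row per subject.
cum_a = np.repeat(A0 + A1, N).astype(float)  # cumulative treatment
last_a = np.repeat(A1, N).astype(float)      # treatment at the last time point
y = np.repeat(Y, N)
w = np.repeat(sw_cell, N)
X = np.column_stack([np.ones_like(cum_a), cum_a, last_a])

# Weighted least squares: beta = (X'WX)^{-1} X'Wy.
XtWX = X.T @ (w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (w * y))

# Sandwich variance: bread * meat * bread, with weights treated as fixed.
resid = y - X @ beta
score = X * (w * resid)[:, None]
bread = np.linalg.inv(XtWX)
se = np.sqrt(np.diag(bread @ (score.T @ score) @ bread))

print("beta:", np.round(beta, 4))
print("robust SE:", np.round(se, 4))
```

Here the intercept estimates the mean under never treat and the two slopes are zero, consistent with the null effect found by the other analyses of this table.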

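Intuition (choosing the functional form of the MSM): cumulative dose, last treatment, treatment duration, and other summaries are all possible. Choose using domain knowledge and sensitivity analysis. The functional form is itself an assumption that can be misspecified, so the estimator is singly robust.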

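4 Comparison of the Two Tools

Aspect	G-Formula	IPW MSM
Core models	Outcome + covariate distribution	Treatment
Number of models ($K+1$ time points)	$K+2$ ($K+1$ covariate models + 1 outcome model)	$K+1$
Misspecification risk	Outcome or covariate model	Treatment model
Variance explosion	Monte Carlo noise	Product of weights
Stabilization	Built in	Essential
Implementation cost	High (simulation)	Moderate (weighted regression)
Standard software	gfoRmula, GFORMULA	Standard packages (gee, lme4)
Strength	Can compare a wide range of strategies	Simple to implement
Weakness	Heavy model-specification burden	Exploding weights

Intuition (the trade-off in choosing a tool): the g-formula wins on flexibility, since it can compare arbitrary strategies. IPW MSM wins on simplicity, since it needs one functional-form assumption. The standard practice is to try both and check that the results agree.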

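5 Application Areas

HIV/AIDS cohort analysis: g-formula for the effect of ART initiation timing and discontinuation
Oncology SMART: IPW MSM for the choice of anticancer drug at each time point
Drug dosing in chronic disease: g-formula simulation of time-varying dose
Marketing attribution: IPW MSM for a user's sequence of campaigns
Online recommendation: g-formula for a user's exposure history

6 Code: Time-Varying G-Formula Implementation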

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# === Hernan Table 21.1 data ===
data_summary = pd.DataFrame({
    "A0": [0, 0, 0, 0, 1, 1, 1, 1],
    "L1": [0, 0, 1, 1, 0, 0, 1, 1],
    "A1": [0, 1, 0, 1, 0, 1, 0, 1],
    "N":  [2400, 1600, 2400, 9600, 4800, 3200, 1600, 6400],
    "Y":  [84, 84, 52, 52, 76, 76, 44, 44]
})
# Expand the cell counts into one row per subject (32,000 rows in total)
df = data_summary.loc[
    data_summary.index.repeat(data_summary["N"]), ["A0", "L1", "A1", "Y"]
].reset_index(drop=True)

# === G-formula (parametric, simulation-based) ===
def parametric_g_formula(df, strategy_a0, strategy_a1, n_sim=10000, seed=0):
    """Parametric g-formula for Hernan Table 21.1."""
    # Fixed seed so the Monte Carlo draw is reproducible
    rng = np.random.default_rng(seed)

    # Step 1: model for L_1 (given A_0)
    l1_model = smf.logit("L1 ~ A0", data=df).fit(disp=False)

    # Step 2: model for Y (given A_0, A_1, L_1)
    y_model = smf.ols("Y ~ A0 + A1 + L1 + A0:L1", data=df).fit()

    # Step 3: Monte Carlo simulation
    sim_data = pd.DataFrame({"A0": [strategy_a0] * n_sim})
    # Sample L_1 from the fitted model
    pr_l1 = l1_model.predict(sim_data)
    sim_data["L1"] = rng.binomial(1, pr_l1)
    # Force A_1 to the strategy value
    sim_data["A1"] = strategy_a1
    # Predict Y and average over the simulated patients
    sim_data["Y_pred"] = y_model.predict(sim_data)
    return sim_data["Y_pred"].mean()

# Always treat vs never treat
e_y_always = parametric_g_formula(df, 1, 1)
e_y_never = parametric_g_formula(df, 0, 0)
print("=== Parametric G-Formula ===")
print(f"E[Y^(1,1)] = {e_y_always:.1f}")
print(f"E[Y^(0,0)] = {e_y_never:.1f}")
# The estimate differs slightly from 0 because of Monte Carlo noise
print(f"ATE = {e_y_always - e_y_never:.2f} (true value: 0)")

# === IPW MSM ===
print("\n=== IPW MSM ===")
# Treatment models (weight denominators)
a0_model = smf.logit("A0 ~ 1", data=df).fit(disp=False)   # marginal: there is no L_0
a1_model = smf.logit("A1 ~ A0 + L1", data=df).fit(disp=False)

# Unstabilized weights: inverse probability of the treatment actually received
df["pa0"] = a0_model.predict(df)
df["pa1"] = a1_model.predict(df)
df["w0"] = np.where(df.A0 == 1, 1/df.pa0, 1/(1-df.pa0))
df["w1"] = np.where(df.A1 == 1, 1/df.pa1, 1/(1-df.pa1))
df["w"] = df["w0"] * df["w1"]

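# Stabilized weights: the numerator is f(A_1 | A_0), fitted without L_1
a1_num_model = smf.logit("A1 ~ A0", data=df).fit(disp=False)
df["stab_pa1"] = a1_num_model.predict(df)
df["sw1"] = np.where(df.A1 == 1, df.stab_pa1/df.pa1, (1-df.stab_pa1)/(1-df.pa1))
# At time 0 the numerator and denominator are both Pr(A_0), so that factor is 1
df["sw"] = df["sw1"]

print(f"Mean weight (unstabilized): {df.w.mean():.2f}")
print(f"Mean weight (stabilized):   {df.sw.mean():.2f}")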

# Weighted mean of Y among subjects who followed the strategy (a0_val, a1_val)
def weighted_mean_ipw(df, a0_val, a1_val, weight_col):
    sub = df[(df.A0 == a0_val) & (df.A1 == a1_val)]
    return (sub.Y * sub[weight_col]).sum() / sub[weight_col].sum()

e_y_never_ipw = weighted_mean_ipw(df, 0, 0, "sw")
e_y_always_ipw = weighted_mean_ipw(df, 1, 1, "sw")
print(f"\nE[Y^(1,1)] (IPW) = {e_y_always_ipw:.1f}")
print(f"E[Y^(0,0)] (IPW) = {e_y_never_ipw:.1f}")
print(f"IPW MSM ATE = {e_y_always_ipw - e_y_never_ipw:.2f} (true value: 0)")

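7 One-Line Summary

The time-varying g-formula combines a covariate distribution model at each time point, an outcome model, and Monte Carlo simulation. Time-varying IPW MSM combines a product of time-specific weights with a weighted regression for the MSM. In the 32,000-person example of Hernan Table 21.1, both tools estimate exactly 0, in contrast to the -8 from the traditional tools. Stabilization is essential in the time-varying setting (to avoid variance explosion). The g-formula depends on outcome and covariate models, while IPW MSM depends on treatment models, so the two face different misspecification risks. Trying both tools and checking that the results agree is the robust standard.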
