Kwangmin Kim - Meta-Learners

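Source

This post is based on prior knowledge (the textbook was not checked; the content relies on the agent's pretraining). Key results are attributed explicitly only to the original PNAS paper by Künzel, Sekhon, Bickel, and Yu (2019). The textbook (Causal Machine Learning) is not available in docs/book.

This is the 10th post in the Phase J series and the second in the J-MLHTE series. It covers meta-learners, the simplest and most flexible approach to CATE estimation.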

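1 Entry Intuition: "Reusing Existing ML Tools"

The previous post covered the motivation for ML-based HTE estimation. The simplest approach follows.

The meta-learner principle in one line: reuse existing ML regression algorithms (Random Forest, XGBoost, Neural Network, and so on) to estimate the CATE.

Instead of a specialized algorithm such as Causal Forest, general-purpose ML models are combined in a suitable way.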

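1.1 Three Main Meta-learners

S-learner (Single): a single model that includes the treatment as a feature
T-learner (Two): separate models for the treated and control groups
X-learner: a refined variant of the T-learner (Künzel et al. 2019)

Analogy (cooking): the S-learner puts all ingredients into one pot and boils them together. The T-learner cooks each ingredient separately and then combines them. The X-learner starts from the T-learner and then adjusts the proportions by cross-imputing between the two groups.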

2 S-Learner (Single Model)

2.1 Mechanism

Train a single ML model that includes the treatment \(A\) as a feature:

\[ \mu(x, a) = \mathbb{E}[Y | X=x, A=a] \]

Then estimate the CATE:

\[ \hat{\tau}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0) \]

2.2 Algorithm

Input: data (X, A, Y)
1. Train an ML model f: f(X, A) → Y
2. Estimate the CATE:
   τ̂(x) = f(x, 1) - f(x, 0)

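2.3 Advantages

The simplest option: only one model
Uses every covariate and the full sample in one fit (the treatment is just another feature)
Works with any existing ML regression algorithm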

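2.4 Limitations

Risk of bias: the ML algorithm may ignore the treatment \(A\). To the learner, \(A\) looks like just one of many covariates, so its influence can be learned as small and the estimated CATE shrinks toward zero.

Example: a Random Forest with depth 10, fitted on 100 covariates plus the treatment. The forest rarely picks the treatment variable as a splitting feature, so the CATE is underestimated.

Remedy: feature engineering that emphasizes the treatment (for example, explicitly adding interaction terms between the treatment and the covariates).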
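The remedy above can be illustrated with a small sketch. To make the effect easy to see, it uses a linear base learner rather than the Random Forest of the example: without interactions, a linear S-learner can only return a constant CATE, while explicit `A * X` columns let it recover heterogeneity. The data-generating process, the helper names `design` and `s_learner_cate`, and the randomized treatment are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)             # randomized treatment (assumption)
tau = 0.5 + 0.3 * X[:, 0]                  # hypothetical true CATE
Y = 1.0 + 0.5 * X[:, 1] + A * tau + rng.normal(size=n)

def design(X, a, interactions):
    """Build S-learner features [X, A], optionally with [A * X] appended."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    cols = [X, a]
    if interactions:
        cols.append(a * X)                 # explicit treatment-covariate interactions
    return np.hstack(cols)

def s_learner_cate(interactions):
    """Fit one model on (X, A) and return f(x, 1) - f(x, 0) for every row."""
    model = LinearRegression().fit(design(X, A, interactions), Y)
    pred_1 = model.predict(design(X, np.ones(n), interactions))
    pred_0 = model.predict(design(X, np.zeros(n), interactions))
    return pred_1 - pred_0

cate_plain = s_learner_cate(interactions=False)  # same value for every unit
cate_inter = s_learner_cate(interactions=True)   # varies with X

print("std of CATE without interactions:", np.std(cate_plain))
print("corr with true CATE, with interactions:", np.corrcoef(cate_inter, tau)[0, 1])
```

With a tree ensemble the failure mode is softer (the treatment is split on rarely rather than never interacting), but the same idea applies: giving the treatment more prominence in the feature set makes it harder for the learner to ignore.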

When to use: when the treatment has a strong main effect, or as a simple baseline.

3 T-Learner (Two Models)

3.1 Mechanism

Fit separate models for the treated and control groups:

\[ \mu_1(x) = \mathbb{E}[Y | X=x, A=1], \quad \mu_0(x) = \mathbb{E}[Y | X=x, A=0] \]

CATE:

\[ \hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x) \]

3.2 Algorithm

Input: (X, A, Y)
1. Train the treated-group model f1: f1(X) → Y | A=1
2. Train the control-group model f0: f0(X) → Y | A=0
3. CATE:
   τ̂(x) = f1(x) - f0(x)

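3.3 Advantages

Forces a split on the treatment, which avoids the S-learner's trap of ignoring it
Each group can get the model that suits it (different algorithms are fine)
Intuitive: it compares the predicted outcomes of the two groups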

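3.4 Limitation 1: Imbalanced Data

If the treated and control sample sizes differ, one model is more accurate than the other. The estimated difference may then reflect the gap in model accuracy rather than true effect heterogeneity.

Example: 1000 treated, 9000 controls. The control model is very accurate and the treated model is not. The noise in the CATE estimate is dominated by the noise of the treated model.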
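A rough back-of-the-envelope calculation shows how lopsided this is. Assume, as a simplification, that each model's estimation variance at a point \(x\) scales like \(\sigma^2 / n_a\) (exact for a plain sample mean, only a heuristic for flexible ML models) and that the two fits are independent:

\[ \mathrm{Var}[\hat{\tau}(x)] \approx \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_0} = \sigma^2 \left( \frac{1}{1000} + \frac{1}{9000} \right) = \frac{\sigma^2}{900} \]

\[ \frac{\sigma^2 / 1000}{\sigma^2 / 900} = 0.9 \]

Under these assumptions about 90% of the variance of \(\hat{\tau}(x)\) comes from the treated model, even though the treated group is only 10% of the sample.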

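3.5 Limitation 2: Asymmetric Regularization

If the two models are regularized differently, the estimated CATE can be an artifact. Even identical hyperparameters act differently on groups with different sizes or distributions.

Example: both groups use a Random Forest with depth 10. On the treated group (1000), depth 10 overfits. On the control group (9000), depth 10 is appropriate. The CATE estimate then reflects the difference between the two models.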

3.6 When to Use

When the sample sizes are reasonably balanced and the model hyperparameters are tuned with care.
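One way to act on the tuning advice is to select hyperparameters separately for each group instead of sharing one setting. The sketch below does this with scikit-learn's `GridSearchCV`. The synthetic data, the 15% treatment rate, the grid over `max_depth`, and the helper name `tune_arm` are illustrative assumptions, and per-arm cross-validation is only one possible remedy.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 4))
A = (rng.random(n) < 0.15).astype(int)     # roughly 15% treated (hypothetical imbalance)
Y = 1.0 + 0.5 * X[:, 0] + A * (0.5 + 0.3 * X[:, 1]) + rng.normal(size=n)

param_grid = {"max_depth": [2, 4, 8]}      # hypothetical candidate depths

def tune_arm(X_arm, Y_arm):
    """Cross-validate the forest depth on one treatment arm only."""
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=30, random_state=0),
        param_grid,
        cv=3,
        scoring="neg_mean_squared_error",
    )
    search.fit(X_arm, Y_arm)               # refits the best model on the full arm
    return search

search_1 = tune_arm(X[A == 1], Y[A == 1])  # small treated group
search_0 = tune_arm(X[A == 0], Y[A == 0])  # large control group

# T-learner CATE from the two separately tuned models
cate_t = search_1.predict(X) - search_0.predict(X)
print("treated depth:", search_1.best_params_, "control depth:", search_0.best_params_)
```

The two groups may end up with different depths. That is the point: each model is regularized to fit its own sample size, although tuning each outcome model for its own prediction error does not guarantee the best CATE estimate.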

4 X-Learner (Künzel et al. 2019)

4.1 Motivation

The X-learner addresses the T-learner's imbalanced-data problem by combining cross-imputed pseudo-outcomes with propensity-score weighting.

4.2 Algorithm

X-Learner Algorithm (Künzel et al. 2019)

Input: (X, A, Y), propensity score e(x) = P(A=1 | X=x)

Stage 1: T-learner
  - Treated-group model μ̂_1(x), an estimate of E[Y | X=x, A=1]
  - Control-group model μ̂_0(x), an estimate of E[Y | X=x, A=0]

Stage 2: Cross pseudo-outcomes
  - For treated units (i with A_i=1):
    D̃_i = Y_i - μ̂_0(X_i)   # "actual treated" - "predicted control"
  - For control units (i with A_i=0):
    D̃_i = μ̂_1(X_i) - Y_i   # "predicted treated" - "actual control"

Stage 3: Train two ML models
  - τ̂_1(x), an estimate of E[D̃ | X=x, A=1]   # regress D̃ on X among treated units
  - τ̂_0(x), an estimate of E[D̃ | X=x, A=0]   # regress D̃ on X among control units

Stage 4: Propensity-score-weighted combination
  τ̂(x) = e(x) * τ̂_0(x) + (1 - e(x)) * τ̂_1(x)

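4.3 Intuition for the Mechanism

Stage 2 pseudo-outcomes: for each unit, "observed minus estimated counterfactual". For a treated unit this is the difference between \(Y_i\) (observed) and \(\hat{\mu}_0(X_i)\) (the predicted counterfactual).

Stage 3, two models: the CATE is learned separately from the treated units and from the control units, so each group contributes its own observed outcomes while borrowing the counterfactual from the other group's model.

Stage 4, weighted combination: the two estimates are combined using the propensity score. Where \(e(x)\) is small (few treated units), most weight goes to \(\hat{\tau}_1\), whose pseudo-outcomes rely on \(\hat{\mu}_0\), the model fitted on the larger control group. This is how the imbalance is compensated.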
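A toy calculation makes Stages 2 and 4 concrete. All numbers below are hypothetical values for four units and one evaluation point, chosen only so the arithmetic is easy to follow; the helper name `combine` is also an assumption.

```python
import numpy as np

# Hypothetical toy values for four units (not taken from any real data).
A = np.array([1, 1, 0, 0])                 # treatment indicator
Y = np.array([3.0, 2.0, 1.0, 0.5])         # observed outcomes
mu_0 = np.array([1.5, 1.0, 1.2, 0.4])      # Stage 1 control-model predictions
mu_1 = np.array([2.8, 2.1, 2.0, 1.5])      # Stage 1 treated-model predictions

# Stage 2: treated units use Y - mu_0, control units use mu_1 - Y.
D_tilde = np.where(A == 1, Y - mu_0, mu_1 - Y)
print(D_tilde)                             # 1.5, 1.0, 1.0, 1.0

def combine(e, tau_0, tau_1):
    """Stage 4: propensity-weighted combination at a single point x."""
    return e * tau_0 + (1 - e) * tau_1

# Same two effect estimates, different propensity scores.
tau_rare = combine(0.1, tau_0=0.8, tau_1=1.2)    # 0.08 + 1.08 = 1.16
tau_common = combine(0.9, tau_0=0.8, tau_1=1.2)  # 0.72 + 0.12 = 0.84
print(tau_rare, tau_common)
```

When treatment is rare at \(x\) (propensity 0.1), the combined estimate sits close to \(\hat{\tau}_1 = 1.2\). When treatment is common (propensity 0.9), it sits close to \(\hat{\tau}_0 = 0.8\).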

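4.4 Advantages

Addresses the T-learner's imbalanced-data problem
Allows a separate effect estimate from each group
Theoretical guarantees: Künzel et al. (2019) give convergence-rate results for settings where one group is much larger than the other or where the CATE has simpler structure than the outcome functions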

4.5 Limitations

Requires an additional propensity score estimate
The algorithm is more complex (4 stages)
Unstable on small datasets

4.6 When to Use

Imbalanced treatment groups combined with enough data.

5 DR-Learner (Doubly Robust)

A doubly robust meta-learner that combines outcome regression with the propensity score.

5.1 Algorithm

DR pseudo-outcome:
  ψ_i = (A_i - e(X_i)) / (e(X_i)(1 - e(X_i))) * (Y_i - μ̂(X_i, A_i))
        + μ̂(X_i, 1) - μ̂(X_i, 0)

ML model: τ̂(x), an estimate of E[ψ | X=x], obtained by regressing ψ on X
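The pseudo-outcome regression above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not a reference implementation: it uses per-arm models so that \(\hat{\mu}(x, 1)\) and \(\hat{\mu}(x, 0)\) come from two separate fits, adds 2-fold sample splitting (so nuisance models are never evaluated on their own training rows), and clips the estimated propensity to [0.05, 0.95]. The linear base learners, the clipping bounds, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 3))
e_true = 1 / (1 + np.exp(-0.5 * X[:, 0]))        # hypothetical true propensity
A = (rng.random(n) < e_true).astype(int)
tau = 0.5 + 0.3 * X[:, 0] - 0.2 * X[:, 1]        # hypothetical true CATE
Y = 1.0 + 0.5 * X[:, 0] + A * tau + rng.normal(size=n)

psi = np.zeros(n)
folds = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X):
    X_tr, A_tr, Y_tr = X[train_idx], A[train_idx], Y[train_idx]
    X_te, A_te, Y_te = X[test_idx], A[test_idx], Y[test_idx]

    # Nuisance models fitted on the training fold only
    mu1 = LinearRegression().fit(X_tr[A_tr == 1], Y_tr[A_tr == 1])
    mu0 = LinearRegression().fit(X_tr[A_tr == 0], Y_tr[A_tr == 0])
    ps = LogisticRegression().fit(X_tr, A_tr)

    # Predictions on the held-out fold
    e_hat = np.clip(ps.predict_proba(X_te)[:, 1], 0.05, 0.95)  # clipping is an assumption
    m1, m0 = mu1.predict(X_te), mu0.predict(X_te)
    m_obs = np.where(A_te == 1, m1, m0)          # mu_hat(X_i, A_i)

    # DR pseudo-outcome, term by term as in the formula above
    psi[test_idx] = (A_te - e_hat) / (e_hat * (1 - e_hat)) * (Y_te - m_obs) + m1 - m0

# Final stage: regress the pseudo-outcome on X
final_model = LinearRegression().fit(X, psi)
cate_dr = final_model.predict(X)
print("corr with true CATE:", np.corrcoef(cate_dr, tau)[0, 1])
```

Any regressor can replace the linear models. They are used here only because the simulated outcome and CATE are linear, which keeps the sketch fast and its behavior predictable.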

5.2 Advantages

Doubly robust: the estimator stays consistent if either the outcome model or the propensity model is correct, even when the other is not.
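A short derivation shows where this property comes from. It treats \(\hat{\mu}\) as a fixed function, assumes unconfoundedness and overlap, and writes \(\mu_a(x) = \mathbb{E}[Y | X=x, A=a]\). The weight in the pseudo-outcome can be split as

\[ \frac{A - e(X)}{e(X)(1 - e(X))} = \frac{A}{e(X)} - \frac{1 - A}{1 - e(X)} \]

Case 1, correct propensity \(e\) but possibly wrong \(\hat{\mu}\):

\[ \mathbb{E}[\psi | X=x] = \big(\mu_1(x) - \hat{\mu}(x, 1)\big) - \big(\mu_0(x) - \hat{\mu}(x, 0)\big) + \hat{\mu}(x, 1) - \hat{\mu}(x, 0) = \mu_1(x) - \mu_0(x) = \tau(x) \]

Case 2, correct \(\hat{\mu}\) but possibly wrong \(e\): since \(\mathbb{E}[Y - \hat{\mu}(X, A) | X, A] = 0\), the weighted residual term has conditional mean zero for any weights, and

\[ \mathbb{E}[\psi | X=x] = \hat{\mu}(x, 1) - \hat{\mu}(x, 0) = \tau(x) \]

In both cases the regression target in the final stage is \(\tau(x)\).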

5.3 Applications

EconML's DR-Learner is the standard implementation. The method is closely related to the DML of Chernozhukov et al. (2018).

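6 Comparison of the Four Meta-Learners

Learner	Number of models	Imbalanced data	Complexity	Recommended use
S-learner	1	Little affected	Simple	Baseline, strong treatment effect
T-learner	2	Vulnerable	Simple	Balanced samples
X-learner	4 + propensity	Robust	Medium	Imbalanced samples
DR-learner	2 + propensity	Robust	Medium	When double robustness is needed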

7 Case Study: Drug Effect Heterogeneity

7.1 Scenario

5000 diabetes patients. Drug X (2000 treated, 3000 controls). 50 covariates.

7.2 Analysis

Use the X-learner (the sample is slightly imbalanced):

Fit \(\hat{\mu}_1, \hat{\mu}_0\) as in the T-learner (Random Forest or XGBoost)

Compute the cross pseudo-outcomes

Train the two CATE models

Combine them with the propensity score (logistic regression on the 50 covariates)

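7.3 Results

\(\hat{\tau}(x)\): the predicted treatment effect for each patient. Physicians can use it to decide on prescribing to patients with a large predicted effect.

Checking effect heterogeneity: is the variance of \(\hat{\tau}(x)\) statistically significant? (This requires a separate test.)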

8 Simulation: S vs T vs X

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

np.random.seed(42)

# Scenario: 5 covariates, 3 of which are effect modifiers
n = 3000
d = 5
X = np.random.randn(n, d)

# True CATE
def true_cate(X):
    return 0.5 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.1 * X[:, 2] ** 2

# Covariate-dependent treatment assignment
# (roughly 45% treated on average, so only mildly imbalanced)
prob_A = 1 / (1 + np.exp(-(0.3 * X[:, 0] - 0.2)))
A = (np.random.random(n) < prob_A).astype(int)
print(f"Treatment ratio: {A.mean():.2f}")

# Outcome: control outcome Y0 plus the treatment effect for treated units
te = true_cate(X)
Y0 = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + np.random.normal(0, 1, n)
Y = Y0 + A * te
# S-learner
print("\n[S-Learner]")
X_aug = np.column_stack([X, A])
rf_s = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_s.fit(X_aug, Y)
X_aug_1 = np.column_stack([X, np.ones(n)])
X_aug_0 = np.column_stack([X, np.zeros(n)])
cate_s = rf_s.predict(X_aug_1) - rf_s.predict(X_aug_0)
mse_s = np.mean((cate_s - te) ** 2)
corr_s = np.corrcoef(cate_s, te)[0, 1]
print(f"  CATE MSE: {mse_s:.3f}, Correlation: {corr_s:.3f}")

# T-learner
print("\n[T-Learner]")
rf_T = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_C = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_T.fit(X[A == 1], Y[A == 1])
rf_C.fit(X[A == 0], Y[A == 0])
cate_t = rf_T.predict(X) - rf_C.predict(X)
mse_t = np.mean((cate_t - te) ** 2)
corr_t = np.corrcoef(cate_t, te)[0, 1]
print(f"  CATE MSE: {mse_t:.3f}, Correlation: {corr_t:.3f}")

# X-learner
print("\n[X-Learner]")
# Stage 1: T-learner (already fitted above)
mu_1 = rf_T.predict(X)
mu_0 = rf_C.predict(X)

# Stage 2: cross pseudo-outcomes
D_tilde = np.where(A == 1, Y - mu_0, mu_1 - Y)

# Stage 3: two ML models, one per group
rf_tau_T = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_tau_C = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_tau_T.fit(X[A == 1], D_tilde[A == 1])
rf_tau_C.fit(X[A == 0], D_tilde[A == 0])
tau_T = rf_tau_T.predict(X)
tau_C = rf_tau_C.predict(X)

# Stage 4: Propensity-weighted combination
ps_model = LogisticRegression()
ps_model.fit(X, A)
ps = ps_model.predict_proba(X)[:, 1]
cate_x = ps * tau_C + (1 - ps) * tau_T

mse_x = np.mean((cate_x - te) ** 2)
corr_x = np.corrcoef(cate_x, te)[0, 1]
print(f"  CATE MSE: {mse_x:.3f}, Correlation: {corr_x:.3f}")

# Comparison
print("\n[Comparison]")
print(f"  S-learner: MSE={mse_s:.3f}, Corr={corr_s:.3f}")
print(f"  T-learner: MSE={mse_t:.3f}, Corr={corr_t:.3f}")
print(f"  X-learner: MSE={mse_x:.3f}, Corr={corr_x:.3f}")
print("\n→ With imbalanced samples the X-learner is usually slightly better")
print("→ Exact results depend on the data and the choice of models")

9 Conclusion

Meta-learners reuse existing ML tools. They form a hierarchy: S (simple) → T (separate) → X (refined). The choice depends on how imbalanced the data are and on the sample size.

Key messages:

S-learner: single model, simple, risk of ignoring the treatment
T-learner: separate fits, imbalanced-data trap
X-learner: Künzel's refinement, propensity weighting
DR-learner: doubly robust variant
CausalML / EconML: software packages
Sample size, imbalance, and model choice are decisive

Next post: Causal Forest, a specialized algorithm.

10 Related Topics

Prerequisites

Later posts in Phase J

Causal Forest (placeholder)
Double/Debiased ML (placeholder)

11 References

Künzel, S. R., Sekhon, J. S., Bickel, P. J., Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects. PNAS 116, 4156-4165.
Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects. arXiv:2004.14497.
Nie, X. & Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108, 299-319.
Foster, J. C., Taylor, J. M. G., Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Stat. Med. 30, 2867-2880.
Microsoft Research. EconML library: https://github.com/microsoft/EconML
Uber Engineering. CausalML library: https://github.com/uber/causalml