Kwangmin Kim - Causal Forest

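Sources

This post is written from prior knowledge (no textbook was checked; it relies on the agent's pretraining). Key results are attributed only to the original papers: Athey & Imbens (2016, PNAS) and Wager & Athey (2018, JASA).

This is the 11th post in the Phase J series. It covers Causal Forest, the causal variant of Random Forest and a specialized algorithm distinct from the Meta-learners.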

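1 Intuition: "Split the tree on the causal effect"

The Meta-learners from the previous post reuse existing ML algorithms. Causal Forest is an algorithm designed for causal inference from the start.

The decisive difference: the Random Forest splitting criterion reduces the variance of the outcome \(Y\) (maximizing predictive accuracy), while the Causal Forest splitting criterion maximizes the heterogeneity of the treatment effect \(\tau(x)\).

In other words, the tree automatically discovers subgroups whose treatment effects differ.

A medical analogy: an ordinary ML tree "groups patients by symptoms" (diagnosis). A causal forest tree "groups patients by how they respond to treatment" (personalized prescription).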

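2 Quick review of Random Forest

2.1 Algorithm

A bagging ensemble of many decision trees. For each tree:

1. Draw a bootstrap sample.

2. At each node, search for the best split among a random subset of features.

3. Adopt the split that gives the largest reduction in the variance of the outcome \(Y\).

2.2 Splitting Criterion

Minimize MSE (regression) or Gini impurity (classification). Either way, the target is accuracy in predicting the outcome.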
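The regression criterion can be made concrete with a small sketch. The helper `best_variance_split` and the toy data below are hypothetical, written only to illustrate an exhaustive search for the single threshold that most reduces the sum of squared errors on one feature. Real Random Forest implementations are far more optimized.

```python
import numpy as np

def best_variance_split(x, y):
    """Return (threshold, sse_reduction) for the best single split on feature x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    parent_sse = np.sum((y - y.mean()) ** 2)
    best_thr, best_gain = None, 0.0
    for i in range(1, len(y)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y_sorted[:i], y_sorted[i:]
        child_sse = (np.sum((left - left.mean()) ** 2)
                     + np.sum((right - right.mean()) ** 2))
        gain = parent_sse - child_sse
        if gain > best_gain:
            best_thr = 0.5 * (x_sorted[i - 1] + x_sorted[i])
            best_gain = gain
    return best_thr, best_gain

# Toy data (assumption): the outcome jumps by 2 when x crosses 0.2
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.where(x > 0.2, 2.0, 0.0) + rng.normal(0, 0.1, 200)

thr, gain = best_variance_split(x, y)
print(f"best threshold = {thr:.3f}, SSE reduction = {gain:.1f}")
```

The search recovers a threshold close to the jump because that split leaves the least outcome variance inside each child.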

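3 The core modification in Causal Forest

3.1 A new splitting criterion

Maximize the heterogeneity of the treatment effect \(\tau(x)\). Following Athey & Imbens (2016), a candidate split of a node \(S\) into children \(S_L\) and \(S_R\) is scored by the average squared effect estimate over the units in the node, which weights each child by its size:

\[ \text{Goodness} = \frac{1}{|S|} \left( \sum_{i \in S_L} \hat{\tau}_L^2 + \sum_{i \in S_R} \hat{\tau}_R^2 \right) = \frac{|S_L|}{|S|} \hat{\tau}_L^2 + \frac{|S_R|}{|S|} \hat{\tau}_R^2 \]

Here \(\hat{\tau}_L\) and \(\hat{\tau}_R\) are the treatment effect estimates in the left and right child. Splits whose children differ most in estimated effect score highest. The honest version of the criterion in the paper additionally subtracts a penalty for the variance of the leaf estimates.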

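Intuition for the formula: Random Forest looks for splits that reduce the variance of \(Y\) inside each child, which is the same as pushing the child means of \(Y\) apart. Causal Forest applies that idea to a different target: it pushes the child effect estimates \(\hat{\tau}\) apart (heterogeneity discovery). The mechanics are parallel; what changes is the quantity being separated.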
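The contrast between the two criteria can be sketched on toy data. Everything below is an illustrative assumption rather than the grf or EconML implementation: a randomized treatment `W`, a covariate `x0` that only shifts the outcome, a covariate `x1` that only modifies the effect, a coarse quantile grid of thresholds, and a size-weighted squared-effect score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
X = rng.uniform(-1, 1, size=(n, 2))
W = rng.integers(0, 2, size=n)                        # randomized treatment
tau = (X[:, 1] > 0).astype(float)                     # x1 modifies the effect
Y = 1.0 * X[:, 0] + W * tau + rng.normal(0, 0.3, n)   # x0 is only prognostic

def effect(y, w):
    """Difference in mean outcome between treated and control units."""
    return y[w == 1].mean() - y[w == 0].mean()

def causal_goodness(left):
    """Size-weighted average of squared child effect estimates."""
    right = ~left
    return (left.mean() * effect(Y[left], W[left]) ** 2
            + right.mean() * effect(Y[right], W[right]) ** 2)

def variance_reduction(left):
    """Drop in outcome variance achieved by the split (regression-tree criterion)."""
    right = ~left
    return Y.var() - (left.mean() * Y[left].var() + right.mean() * Y[right].var())

def best_split(score_fn):
    """Search both features over a quantile grid; return (feature, threshold)."""
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        for thr in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
            score = score_fn(X[:, j] <= thr)
            if score > best[2]:
                best = (j, thr, score)
    return best[0], best[1]

rf_feat, rf_thr = best_split(variance_reduction)
cf_feat, cf_thr = best_split(causal_goodness)
print(f"variance criterion splits on x{rf_feat} at {rf_thr:.2f}")
print(f"causal criterion splits on x{cf_feat} at {cf_thr:.2f}")
```

In this setup the outcome-variance criterion is drawn to the prognostic covariate, while the effect-based criterion is drawn to the covariate along which the effect actually changes.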

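3.2 Intuition

The tree finds two subgroups whose treatment effects differ sharply. That split identifies an effect modifier.

Example: age 50 is the best split. The effect is +30% in the under-50 group and -10% in the 50-and-over group, so age is a genuine effect modifier.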
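As a worked check of these numbers, assume (the example does not state group sizes) that the two age groups are equally large and have the same treated share, and score a split by the size-weighted average of squared effect estimates, with effects written as proportions:

\[ \text{Goodness}_{\text{split}} = \tfrac{1}{2}(0.30)^2 + \tfrac{1}{2}(-0.10)^2 = 0.045 + 0.005 = 0.050 \]

\[ \hat{\tau}_{\text{parent}} = \tfrac{1}{2}(0.30) + \tfrac{1}{2}(-0.10) = 0.10, \qquad \text{Goodness}_{\text{no split}} = (0.10)^2 = 0.010 \]

\[ \text{Gain} = 0.050 - 0.010 = 0.040 = \tfrac{1}{2} \cdot \tfrac{1}{2} \, \bigl(0.30 - (-0.10)\bigr)^2 \]

Under these assumptions the gain from splitting is proportional to the squared gap between the two child effects, which is why a split separating +30% from -10% scores well.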

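4 Honest Estimation: Wager & Athey (2018)

4.1 Motivation

A standard RF uses the same data both to choose splits and to estimate leaf values, which risks overfitting.

This matters especially for causal effects: when a split is found because of noise and the effect is then estimated on that same split, the estimated effect is inflated.

4.2 Honest Tree

Split the data into two parts:

Splitting subsample: used only to decide the tree's splits

Estimation subsample: used only to estimate the effect within each leaf

Because splitting and estimation use separate data, the leaf estimates are not biased by the split search, which is what makes statistically valid estimation possible.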
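The inflation that honesty removes can be seen in a small simulation. This is a sketch under stated assumptions, not the procedure from the paper: the true effect is zero everywhere, the split search is restricted to three quantile thresholds per feature, and the helper names are hypothetical. The "adaptive" gap is measured on the same half that chose the split; the "honest" gap is measured on the other half.

```python
import numpy as np

rng = np.random.default_rng(0)

def effect(y, w):
    """Difference in mean outcome between treated and control units."""
    return y[w == 1].mean() - y[w == 0].mean()

def child_gap(X, y, w, j, thr):
    """Absolute difference between the effect estimates of the two children."""
    left = X[:, j] <= thr
    return abs(effect(y[left], w[left]) - effect(y[~left], w[~left]))

def best_gap_split(X, y, w):
    """Pick the (feature, threshold) whose children differ most in estimated effect."""
    best = (0, 0.0, -np.inf)
    for j in range(X.shape[1]):
        for thr in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            gap = child_gap(X, y, w, j, thr)
            if gap > best[2]:
                best = (j, thr, gap)
    return best

adaptive, honest = [], []
for _ in range(200):
    n, d = 400, 10
    X = rng.normal(size=(n, d))
    w = rng.integers(0, 2, size=n)
    y = rng.normal(size=n)              # pure noise: the true effect is 0 everywhere
    half = n // 2
    j, thr, gap_same_data = best_gap_split(X[:half], y[:half], w[:half])
    adaptive.append(gap_same_data)      # estimated on the data that chose the split
    honest.append(child_gap(X[half:], y[half:], w[half:], j, thr))  # fresh data

print(f"mean gap, same data : {np.mean(adaptive):.3f}")
print(f"mean gap, fresh data: {np.mean(honest):.3f}")
```

Both gaps estimate a true difference of zero, but the same-data gap is the maximum over many noisy candidates and is therefore systematically larger.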

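4.3 Result

Theorem (Wager & Athey 2018): under regularity conditions, with honest trees grown on subsamples, the CATE estimator of an honest causal forest is asymptotically normal. Confidence intervals and p-values can therefore be computed.

That is, classical statistical inference can be applied to an ML estimator, which is unusual for a method this flexible.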
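Once a point estimate and a standard error are available, the normal approximation turns them into an interval and a p-value. The sketch below uses only the Python standard library; the function name and the numbers 0.30 and 0.10 are hypothetical, chosen for illustration.

```python
from statistics import NormalDist

def normal_inference(tau_hat, std_err, alpha=0.05):
    """Two-sided CI and p-value for H0: tau = 0 under a normal approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (tau_hat - z_crit * std_err, tau_hat + z_crit * std_err)
    z_stat = tau_hat / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return ci, p_value

# Hypothetical CATE estimate at one point x, with its estimated standard error
ci, p_value = normal_inference(tau_hat=0.30, std_err=0.10)
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.4f}")
```

The interval is only as good as the standard error fed into it, so the variance estimator of the forest carries the real weight here.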

5 The Causal Forest algorithm (sketch)

Causal Forest Algorithm

Input: (X, A, Y)

For b = 1 to B (e.g., 1000 trees):
  1. Draw a random subsample I_b (without replacement)
  2. Honest split: divide I_b into two parts
     - Splitting set: I_b^split
     - Estimation set: I_b^est
  3. Grow the tree:
     - At each node, consider a random subset of features
     - Best split = the one maximizing CATE heterogeneity (using the splitting set)
  4. CATE estimate in each leaf = mean(Y | treated) - mean(Y | control), computed on the estimation set

Output: the ensemble of B trees
   τ̂(x) = average over trees of the CATE of the leaf that contains x
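The loop above can be sketched in a few dozen lines if each tree is reduced to a single split. This is a deliberately simplified illustration, not the grf or EconML implementation: the trees are depth-1 stumps, thresholds come from a quantile grid, the split score is the size-weighted squared child effect, treatment is assumed randomized, and all function names are hypothetical.

```python
import numpy as np

def effect(y, a):
    """Difference in mean outcome between treated (a == 1) and control (a == 0)."""
    return y[a == 1].mean() - y[a == 0].mean()

def fit_honest_stump(X, A, Y, rng, n_feat=2):
    """Depth-1 honest tree: pick the split on one half, estimate leaf effects on the other."""
    idx = rng.permutation(len(Y))
    split_idx, est_idx = idx[: len(idx) // 2], idx[len(idx) // 2:]
    Xs, As, Ys = X[split_idx], A[split_idx], Y[split_idx]
    best = (-np.inf, None, None)
    for j in rng.choice(X.shape[1], size=n_feat, replace=False):  # random feature subset
        for thr in np.quantile(Xs[:, j], np.linspace(0.1, 0.9, 9)):
            left = Xs[:, j] <= thr
            score = (left.mean() * effect(Ys[left], As[left]) ** 2
                     + (~left).mean() * effect(Ys[~left], As[~left]) ** 2)
            if score > best[0]:
                best = (score, j, thr)
    _, j, thr = best
    Xe, Ae, Ye = X[est_idx], A[est_idx], Y[est_idx]
    left = Xe[:, j] <= thr                                        # estimation set only
    return j, thr, effect(Ye[left], Ae[left]), effect(Ye[~left], Ae[~left])

def fit_forest(X, A, Y, n_trees=200, subsample=0.5, seed=0):
    """Grow each stump on a subsample drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    trees = []
    for _ in range(n_trees):
        sub = rng.choice(n, size=int(subsample * n), replace=False)
        trees.append(fit_honest_stump(X[sub], A[sub], Y[sub], rng))
    return trees

def predict_cate(trees, X_new):
    """Average, over trees, the effect of the leaf each point falls into."""
    preds = [np.where(X_new[:, j] <= thr, tl, tr) for j, thr, tl, tr in trees]
    return np.mean(preds, axis=0)

# Toy data (assumption): the effect is 1 when x0 > 0 and 0 otherwise
rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(-1, 1, size=(n, 3))
A = rng.integers(0, 2, size=n)
tau = (X[:, 0] > 0).astype(float)
Y = X[:, 1] + A * tau + rng.normal(0, 0.5, n)

trees = fit_forest(X, A, Y)
tau_hat = predict_cate(trees, X)
hi_group = tau_hat[X[:, 0] > 0.2].mean()
lo_group = tau_hat[X[:, 0] < -0.2].mean()
print(f"mean estimated CATE: x0 > 0.2 -> {hi_group:.2f}, x0 < -0.2 -> {lo_group:.2f}")
```

Because stumps that happen not to see `x0` contribute a flat average effect, the ensemble shrinks the two groups toward each other; deeper trees, as in real implementations, reduce that shrinkage.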

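5.1 Confidence interval estimation

The variance is estimated with the infinitesimal jackknife (the method in Wager & Athey 2018) or with a bootstrap over groups of subsampled trees (the "bootstrap of little bags" used in grf; Athey, Tibshirani & Wager 2019).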

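6 EconML example code

6.1 CausalForestDML

EconML's CausalForestDML combines a Causal Forest with Double Machine Learning: nuisance models for the outcome and the treatment are fit first, and the forest is then grown on the residuals. This helps when the covariates confound treatment and outcome.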

from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Data: X (covariates), T (binary treatment), Y (outcome); X_test holds the points to predict at.
# All four are assumed to be NumPy arrays that are already defined.
# Assumption: X contains the confounders of the T -> Y relationship.

est = CausalForestDML(
    model_y=RandomForestRegressor(),       # outcome model
    model_t=RandomForestClassifier(),      # propensity model
    n_estimators=1000,
    discrete_treatment=True,
)

est.fit(Y, T, X=X)

# Predict the CATE
cate_pred = est.effect(X_test)

# 95% confidence interval
cate_ci_low, cate_ci_high = est.effect_interval(X_test, alpha=0.05)

Takeaway: a few lines give an honest causal forest, DML, and confidence intervals, bringing ML and causal inference together.

7 Causal Forest vs Meta-learners

Aspect	Meta-learner	Causal Forest
Algorithm	Reuses existing ML	Specialized
Splitting	Standard (predict Y)	CATE heterogeneity
Honesty	Optional	Standard
Confidence intervals	Difficult	Asymptotically normal
Implementation	By hand	Library (EconML, grf)
Flexibility	Very high	Medium (RF-based)

Recommendation: use a Meta-learner (T- or X-learner) with Random Forest as the baseline, and a Causal Forest (with honest splitting) for the formal analysis.

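8 Variable Importance

8.1 What it does

Causal Forest also provides variable importance: which covariates matter most as effect modifiers.

8.2 Application

It is a data-driven tool for discovering effect modifiers. This differs from the approach in Hernan Ch. 4, where modifiers are specified in advance.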
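One common way to score importance in a forest whose splits target effect heterogeneity is to count how often each covariate is split on, giving more weight to splits near the root. The sketch below is in the spirit of the `variable_importance` function in grf, but the exact weighting, the defaults `max_depth=4` and `decay=2.0`, and the input format are assumptions made for illustration. The list of splits is hypothetical.

```python
from collections import defaultdict

def split_frequency_importance(splits, n_features, max_depth=4, decay=2.0):
    """Depth-weighted split frequency.

    splits: iterable of (feature_index, depth) pairs collected from all trees,
            with depth = 1 at the root.
    """
    counts = defaultdict(lambda: [0] * n_features)
    for feature, depth in splits:
        if depth <= max_depth:
            counts[depth][feature] += 1
    scores = [0.0] * n_features
    for depth, per_feature in counts.items():
        total = sum(per_feature)
        weight = depth ** -decay          # splits near the root count more
        for j, c in enumerate(per_feature):
            scores[j] += weight * c / total
    norm = sum(scores)
    return [s / norm for s in scores]

# Hypothetical splits gathered from a small forest: (feature, depth)
splits = [(0, 1), (0, 1), (1, 1), (1, 2), (0, 2), (2, 2), (1, 2)]
importance = split_frequency_importance(splits, n_features=3)
for j, value in enumerate(importance):
    print(f"X{j}: {value:.3f}")
```

A covariate that is chosen often and early ranks high. Because the underlying splits are chosen for effect heterogeneity, this ranking speaks to effect modification rather than to outcome prediction.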

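9 R package: grf

grf (Generalized Random Forests; Athey, Tibshirani, Wager) is the standard causal forest package in R.

library(grf)
# X: covariate matrix, Y: outcome, W: treatment
cf <- causal_forest(X, Y, W, num.trees = 2000)
# predict() returns a data frame; the CATE estimates are in its predictions column
tau.hat <- predict(cf, X.test)$predictions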

10 Simulation: Causal Forest performance

import numpy as np
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Simulation: 5 covariates, 2 of them effect modifiers
n = 3000
d = 5
X = np.random.randn(n, d)

# True CATE
def true_cate(X):
    return 0.5 + 0.5 * X[:, 0] - 0.3 * X[:, 1]

A = np.random.choice([0, 1], n, p=[0.5, 0.5])
te = true_cate(X)
Y0 = 1.0 + 0.5 * X[:, 0] + np.random.normal(0, 1, n)
Y = Y0 + A * te

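# A hand-rolled stand-in for Causal Forest (conceptual only; use grf or EconML in practice).
# Two T-learners are compared: a plain one and one with honest leaf re-estimation.

# T-learner
rf_T = RandomForestRegressor(n_estimators=500, max_depth=10, random_state=42)
rf_C = RandomForestRegressor(n_estimators=500, max_depth=10, random_state=42)
rf_T.fit(X[A == 1], Y[A == 1])
rf_C.fit(X[A == 0], Y[A == 0])
cate_t = rf_T.predict(X) - rf_C.predict(X)

# Honest variant: the tree structure is learned on one half of the data
# and the leaf means are re-estimated on the other half.
# The CATE-heterogeneity splitting criterion of a real causal forest is NOT reproduced here.

mask_train = np.random.random(n) < 0.5
X_train, X_est = X[mask_train], X[~mask_train]
A_train, A_est = A[mask_train], A[~mask_train]
Y_train, Y_est = Y[mask_train], Y[~mask_train]

# Train half: decide the tree structure (simply a T-learner)
rf_T_train = RandomForestRegressor(n_estimators=500, max_depth=10, random_state=42)
rf_C_train = RandomForestRegressor(n_estimators=500, max_depth=10, random_state=42)
rf_T_train.fit(X_train[A_train == 1], Y_train[A_train == 1])
rf_C_train.fit(X_train[A_train == 0], Y_train[A_train == 0])

def honest_predict(rf, X_e, Y_e, X_new):
    """Average over trees of leaf means recomputed on the estimation half."""
    preds = np.empty((len(rf.estimators_), len(X_new)))
    for b, tree in enumerate(rf.estimators_):
        n_nodes = tree.tree_.node_count
        leaf_e = tree.apply(X_e)
        sums = np.bincount(leaf_e, weights=Y_e, minlength=n_nodes)
        counts = np.bincount(leaf_e, minlength=n_nodes)
        leaf_new = tree.apply(X_new)
        # Leaves that receive no estimation points keep the tree's own prediction
        preds[b] = np.where(counts[leaf_new] > 0,
                            sums[leaf_new] / np.maximum(counts[leaf_new], 1),
                            tree.predict(X_new))
    return preds.mean(axis=0)

# Estimation half: recompute the leaf means separately for each arm
cate_honest_train = (
    honest_predict(rf_T_train, X_est[A_est == 1], Y_est[A_est == 1], X)
    - honest_predict(rf_C_train, X_est[A_est == 0], Y_est[A_est == 0], X)
)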
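# Compare
mse_t = np.mean((cate_t - te) ** 2)
mse_honest = np.mean((cate_honest_train - te) ** 2)

print("[Causal Forest stand-in: T-learner with honest split]\n")
print("  True CATE: 0.5 + 0.5*X0 - 0.3*X1")
print(f"\n  T-learner (no honest):    MSE = {mse_t:.3f}")
print(f"  T-learner (honest split): MSE = {mse_honest:.3f}")
print("\n→ EconML CausalForestDML is recommended in practice: proper honest splitting + confidence intervals")
print("→ This simulation is for concept illustration only")

# Variable importance (T-learner stand-in).
# rf_T.feature_importances_ measures importance for predicting Y among the treated,
# which mixes prognostic and modifying roles (X0 plays both here).
# Regressing the estimated CATE on X targets effect modification instead.
rf_imp = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=42)
rf_imp.fit(X, cate_t)
importance = rf_imp.feature_importances_
print("\n[Variable Importance: RF fit to the estimated CATE]")
print(f"  X0 (true modifier): {importance[0]:.3f}")
print(f"  X1 (true modifier): {importance[1]:.3f}")
print(f"  X2 (irrelevant): {importance[2]:.3f}")
print(f"  X3 (irrelevant): {importance[3]:.3f}")
print(f"  X4 (irrelevant): {importance[4]:.3f}")
print("\n→ The true modifiers (X0, X1) are expected to rank highest")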

11 Conclusion

Causal Forest is the causal variant of Random Forest. Its splitting criterion targets treatment effect heterogeneity, and honest estimation makes statistically valid confidence intervals possible. EconML and grf are the standard tools.

Key messages:

Splitting criterion change: predicting Y → CATE heterogeneity
Honest estimation: separate splitting and estimation samples, which guards against overfitting
Wager-Athey theorem: asymptotic normality + confidence intervals
EconML / grf: production tools
Variable importance: data-driven discovery of effect modifiers
Comparison with Meta-learners: specialized vs general

Next post: Double/Debiased ML (Neyman orthogonality + cross-fitting).

12 Related topics

Prerequisites

Follow-up posts in Phase J

Double/Debiased ML (placeholder)

Links to other categories

Machine_Learning: Random Forest in depth (placeholder)

13 References

Athey, S. & Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. PNAS 113, 7353-7360.
Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Statist. Assoc. 113, 1228-1242.
Athey, S., Tibshirani, J., Wager, S. (2019). Generalized random forests. Annals of Statistics 47, 1148-1178.
Breiman, L. (2001). Random forests. Machine Learning 45, 5-32.
Lu, M., Sadiq, S., Feaster, D. J., Ishwaran, H. (2018). Estimating individual treatment effect in observational data using random forest methods. J. Comput. Graph. Stat. 27, 209-219.
Microsoft Research. EconML CausalForestDML: https://econml.azurewebsites.net/