Kwangmin Kim - Local Linear Regression + Bandwidth

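Sources

This post is written from prior knowledge (no textbook was checked; it relies on the agent's pretraining). Key references: Imbens & Lemieux (2008), Imbens & Kalyanaraman (2012), Calonico, Cattaneo, Titiunik (2014), Gelman & Imbens (2019).

This is the third post in the J-RDD series. It covers the standard approach to RDD estimation: local linear regression with bandwidth selection.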

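1 Intuition: “Regression Near the Cutoff”

The core of RDD estimation: estimate the outcome trend near the cutoff separately on each side, and read the jump at the cutoff as the treatment effect.

Key choices:

How far from the cutoff should data be used? (bandwidth)

Which model should fit the trend? (local linear vs. higher-order)

How should observations be weighted? (kernel weighting: more weight closer to the cutoff)

A poor choice inflates either the bias or the variance.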

2 Local Linear Regression — Imbens-Lemieux (2008)

2.1 Formula

Fit a separate linear regression on each side of the cutoff \(c\):

Left (\(X < c\)): \[ Y_i = \alpha_- + \beta_- (X_i - c) + \varepsilon_i \]

Right (\(X \geq c\)): \[ Y_i = \alpha_+ + \beta_+ (X_i - c) + \varepsilon_i \]

RDD estimator: \[ \hat{\tau}_{RDD} = \hat{\alpha}_+ - \hat{\alpha}_- \]

That is, the difference between the two intercepts at the cutoff.
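The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration with uniform weights inside a window of half-width h; the function name `local_linear_rd` and the simulated data are hypothetical, not part of any package.

```python
import numpy as np

def local_linear_rd(x, y, cutoff, h):
    """Difference of the two fitted intercepts at the cutoff (uniform weights)."""
    xc = x - cutoff                      # center the running variable at the cutoff
    left = (xc < 0) & (xc >= -h)
    right = (xc >= 0) & (xc <= h)
    # np.polyfit with degree 1 returns [slope, intercept]
    _, alpha_minus = np.polyfit(xc[left], y[left], 1)
    _, alpha_plus = np.polyfit(xc[right], y[right], 1)
    return alpha_plus - alpha_minus

# Simulated example: linear trend plus a jump of 2 at the cutoff
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 2000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + rng.normal(0.0, 0.1, x.size)
tau_hat = local_linear_rd(x, y, cutoff=0.0, h=0.5)
print(f"estimated jump: {tau_hat:.3f}")
```

Because the running variable is centered at the cutoff, each fitted intercept is directly the predicted outcome at the cutoff from that side.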

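2.2 Intuition

Draw a trend line on each side of the cutoff; the jump between the two lines at the cutoff is the treatment effect.

2.3 Sample Used

Only observations inside the bandwidth are used, i.e. \(|X_i - c| \leq h\).

The choice of bandwidth is therefore decisive.

2.4 Why Linear?

Near the cutoff a linear approximation is adequate (a first-order Taylor expansion). Higher-order terms risk overfitting.

Gelman & Imbens (2019) recommend against high-order polynomials; local linear or local quadratic is appropriate.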

3 Kernel Weighting

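3.1 Motivation

Observations closer to the cutoff get more weight and those farther away get less, which reduces bias.

3.2 Options

Kernel	Function	Notes
Uniform	\(\mathbf{1}\{|x| \leq 1\}\)	Equal weight for every point in the window
Triangular	\((1 - |x|)_+\)	Weight peaks at the cutoff (RDD standard)
Epanechnikov	\(\frac{3}{4}(1 - x^2)_+\)	Efficient at interior points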
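The three kernels in the table translate directly into NumPy. The function names below are illustrative; the argument u stands for the scaled distance (X - c) / h.

```python
import numpy as np

def uniform_kernel(u):
    # 1 inside the window |u| <= 1, 0 outside
    return (np.abs(u) <= 1).astype(float)

def triangular_kernel(u):
    # (1 - |u|)_+ : linear decline from 1 at u = 0 to 0 at |u| = 1
    return np.clip(1.0 - np.abs(u), 0.0, None)

def epanechnikov_kernel(u):
    # (3/4)(1 - u^2)_+
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

u = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0])
for name, kernel in [("uniform", uniform_kernel),
                     ("triangular", triangular_kernel),
                     ("epanechnikov", epanechnikov_kernel)]:
    print(name, kernel(u))
```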

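3.3 The Standard: Triangular

Much of the RDD literature (e.g., Imbens & Kalyanaraman 2012; Calonico, Cattaneo, Titiunik 2014) favors the triangular kernel, whose weights decline smoothly to 0 at the edge of the bandwidth. Imbens & Lemieux (2008) themselves suggest that a simple rectangular kernel is adequate in practice, provided robustness to the bandwidth is checked.

3.4 Applying the Weights

Weighted least squares (WLS), fit separately on each side of the cutoff:

\[ \min_{\alpha, \beta} \sum_i K\left(\frac{X_i - c}{h}\right) (Y_i - \alpha - \beta(X_i - c))^2 \]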
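A minimal sketch of this kernel-weighted fit, assuming a triangular kernel and solving the WLS problem by rescaling rows with the square root of the weights. The helper names are hypothetical; the fit is run once per side and the two intercepts are differenced.

```python
import numpy as np

def kernel_weighted_intercept(xc, y, h):
    """Intercept of a triangular-kernel WLS line; xc holds X - c for ONE side."""
    w = np.clip(1.0 - np.abs(xc) / h, 0.0, None)   # triangular weights
    keep = w > 0                                   # zero-weight points drop out
    design = np.column_stack([np.ones(keep.sum()), xc[keep]])
    sw = np.sqrt(w[keep])
    # Multiplying rows by sqrt(w) turns the WLS problem into ordinary least squares
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y[keep] * sw, rcond=None)
    return coef[0]

def rd_triangular(x, y, cutoff, h):
    xc = x - cutoff
    right = xc >= 0
    alpha_plus = kernel_weighted_intercept(xc[right], y[right], h)
    alpha_minus = kernel_weighted_intercept(xc[~right], y[~right], h)
    return alpha_plus - alpha_minus

# Noise-free check: a linear trend with a jump of 3 is recovered exactly
x = np.linspace(-1.0, 1.0, 401)
y = 2.0 + x + 3.0 * (x >= 0)
tau_wls = rd_triangular(x, y, cutoff=0.0, h=0.4)
print(f"estimated jump: {tau_wls:.6f}")
```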

4 Bandwidth — Bias-Variance Trade-off

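4.1 Small h

Low bias: only points near the cutoff are used, so the local linear approximation is accurate
High variance: few observations, so the SE is large

4.2 Large h

High bias: points far from the cutoff are used, so the local linear approximation is less accurate
Low variance: many observations, so the SE is small

4.3 Optimal h

Choose h to minimize the mean squared error (MSE), which trades off Bias² against Variance.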
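To make the trade-off concrete, a common first-order approximation for a local linear estimator is bias ≈ B·h² and variance ≈ V/(n·h), so that MSE(h) ≈ B²h⁴ + V/(nh). Setting the derivative to zero gives h* = (V / (4B²n))^(1/5). The sketch below checks that closed form against a grid search; the constants B and V are hypothetical values chosen only for illustration.

```python
import numpy as np

def amse(h, B, V, n):
    """Approximate MSE: squared bias (B h^2)^2 plus variance V / (n h)."""
    return (B * h**2) ** 2 + V / (n * h)

def h_star(B, V, n):
    """Minimizer of amse: solves 4 B^2 h^3 = V / (n h^2)."""
    return (V / (4.0 * B**2 * n)) ** 0.2

B, V, n = 1.5, 4.0, 3000            # hypothetical bias and variance constants
grid = np.linspace(0.01, 1.0, 10_000)
h_grid = grid[np.argmin(amse(grid, B, V, n))]
print(f"closed form: {h_star(B, V, n):.4f}, grid search: {h_grid:.4f}")
```

The n^(-1/5) rate means the optimal bandwidth shrinks slowly: multiplying the sample size by 32 only halves h*.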

5 Imbens-Kalyanaraman (2012) — IK Bandwidth

5.1 Motivation

A data-driven choice of the MSE-optimal bandwidth, replacing earlier ad hoc choices (e.g., h = sample SD).

5.2 Formula (Simplified)

\[ h_{IK} = C_K \cdot \left(\frac{\hat{\sigma}^2}{\hat{m}''^2}\right)^{1/5} \cdot n^{-1/5} \]

\(\hat{\sigma}^2\): residual variance
\(\hat{m}''\): second derivative (curvature) of the outcome regression function at the cutoff
\(C_K\): kernel-specific constant
\(n\): sample size
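As a rough illustration of the simplified formula, the sketch below plugs in crude estimates of the residual variance and curvature from a single pooled quadratic fit in a pilot window. This is not the IK procedure, which uses side-specific variances and curvatures, the density at the cutoff, and a regularization term. The function name and the pilot window are hypothetical, and the default c_k = 3.4375 is the value usually quoted for the triangular kernel, treated here as an assumption.

```python
import numpy as np

def simplified_plugin_bandwidth(x, y, cutoff, pilot, c_k=3.4375):
    """h = C_K * (sigma^2 / m''^2)^(1/5) * n^(-1/5) with crude plug-in estimates."""
    xc = x - cutoff
    keep = np.abs(xc) <= pilot
    xs, ys = xc[keep], y[keep]
    d = (xs >= 0).astype(float)                  # allows a jump at the cutoff
    design = np.column_stack([np.ones(xs.size), d, xs, xs**2])
    coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
    resid = ys - design @ coef
    sigma2 = resid @ resid / (xs.size - design.shape[1])   # residual variance
    m2 = 2.0 * coef[3]                           # second derivative of the quadratic
    return c_k * (sigma2 / m2**2) ** 0.2 * x.size ** (-0.2)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 20_000)
y = 1.0 + x + 2.0 * x**2 + 3.0 * (x >= 0) + rng.normal(0.0, 1.0, x.size)
h_plugin = simplified_plugin_bandwidth(x, y, cutoff=0.0, pilot=1.0)
print(f"plug-in bandwidth: {h_plugin:.3f}")
```

The curvature estimate sits in the denominator, so a nearly linear regression function yields a very large bandwidth; this is one reason the actual procedure adds regularization.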

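5.3 Significance

The first widely adopted principled tool for RDD bandwidth selection. In R, data-driven bandwidth selection is available through rdrobust::rdbwselect; current versions default to the CCT-style MSE-optimal selector rather than the original IK rule.

5.4 Limitations

It does not correct the bias for inference. CCT (next section) is the follow-up.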

6 Calonico-Cattaneo-Titiunik (2014) — CCT Bandwidth

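6.1 Motivation

At an MSE-optimal bandwidth such as IK, the leading bias term of the local linear estimator is not negligible, so conventional confidence intervals are miscentered. CCT combine bias correction with robust SEs.

6.2 Mechanism

Bias-corrected estimator: estimate the bias with a local quadratic fit and subtract it
Robust SE: accounts for the bias-estimation step, so inference remains valid at larger bandwidths such as the MSE-optimal one

6.3 Formula

\[ \hat{\tau}^{bc} = \hat{\tau} - \hat{\text{bias}} \]

Here \(\hat{\text{bias}}\) is estimated from the second-order term of a local quadratic fit.

Robust SE: includes the additional variance introduced by the bias correction.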
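The bias-correction step can be sketched as follows. This is an illustration of the idea, not the rdrobust implementation, and it omits the robust SE. On each side, a local quadratic fit with a pilot bandwidth b estimates the curvature; the bias of the local linear intercept is then that curvature times the intercept a linear fit would return if the outcome were exactly (X - c)². All function names are hypothetical.

```python
import numpy as np

def _tri(u):
    return np.clip(1.0 - np.abs(u), 0.0, None)

def _wls(design, y, w):
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef

def side_fit(xc, y, h, b):
    """Return (local linear intercept, estimated bias of it) for one side."""
    wh, wb = _tri(xc / h), _tri(xc / b)
    kh, kb = wh > 0, wb > 0
    lin = np.column_stack([np.ones(kh.sum()), xc[kh]])
    alpha = _wls(lin, y[kh], wh[kh])[0]
    # Local quadratic fit with pilot bandwidth b: coefficient on x^2 is m''/2
    quad = np.column_stack([np.ones(kb.sum()), xc[kb], xc[kb] ** 2])
    half_m2 = _wls(quad, y[kb], wb[kb])[2]
    # Intercept returned by the linear fit when the outcome is exactly x^2
    e = _wls(lin, xc[kh] ** 2, wh[kh])[0]
    return alpha, half_m2 * e

def rd_bias_corrected(x, y, cutoff, h, b):
    xc = x - cutoff
    r = xc >= 0
    a_plus, bias_plus = side_fit(xc[r], y[r], h, b)
    a_minus, bias_minus = side_fit(xc[~r], y[~r], h, b)
    tau = a_plus - a_minus
    return tau, tau - (bias_plus - bias_minus)

# Noise-free data with different curvature on each side and a jump of 5
x = np.linspace(-1.0, 1.0, 2001)
y = np.where(x >= 0, 5.0 + x + 3.0 * x**2, x - x**2)
tau_conv, tau_bc = rd_bias_corrected(x, y, cutoff=0.0, h=0.5, b=0.8)
print(f"conventional: {tau_conv:.4f}, bias-corrected: {tau_bc:.4f}")
```

In this noise-free quadratic example the correction removes the bias entirely; with real data the curvature estimate is itself noisy, which is the extra variance the robust SE is designed to capture.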

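6.4 Significance

The standard for modern RDD inference, and what the R package rdrobust reports by default.

6.5 Influence

Followed by further work from Imbens, Cattaneo, and others. Widely used as the standard tool in applied RDD work.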

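7 `rdrobust` Package: The Standard Tool

7.1 Features

rdrobust(y, x, c = 0) applies the following by default:

Triangular kernel

CCT bandwidth (data-driven, MSE-optimal)

Bias-corrected estimator

Robust SE

Plots are produced separately with rdplot.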

7.2 Usage

library(rdrobust)
# Sharp RDD
result <- rdrobust(y = data$income, x = data$gpa, c = 3.5)
summary(result)
rdplot(y = data$income, x = data$gpa, c = 3.5)

# Fuzzy RDD
result_fuzzy <- rdrobust(
  y = data$income,
  x = data$age,
  c = 65,
  fuzzy = data$medicare
)

7.3 Python: `rdrobust`

A Python port with the same functionality. Install with pip install rdrobust.

8 The Risks of Higher-Order Polynomials

8.1 The Problem

Fitting a 3rd- or 4th-order polynomial lets the curve bend freely near the cutoff, which risks oscillation similar to the Runge phenomenon.

8.2 Gelman & Imbens (2019)

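Title: “Why High-order Polynomials Should Not Be Used in Regression Discontinuity Designs.”

Reasons:

Noisy estimates: the fit can oscillate at the boundary and put large implicit weight on observations far from the cutoff, so a jump at the cutoff can be an artifact

Poor inference: standard errors are misleading and confidence intervals can have poor coverage

Sensitivity: the coefficients, and hence the estimate, depend on the chosen polynomial degree, which makes them hard to interpret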
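One way to see the instability is to note that the fitted intercept of a polynomial regression is a weighted sum of the outcomes, with weights that depend only on the running variable. The sketch below computes those implicit weights for one side of the cutoff; the function name is hypothetical. Under homoskedastic noise the variance of the intercept is proportional to the sum of squared weights, which grows quickly with the polynomial order.

```python
import numpy as np

def intercept_weights(xc, order):
    """Weights w such that the fitted intercept equals w @ y."""
    design = np.vander(xc, order + 1, increasing=True)   # columns 1, x, x^2, ...
    # First row of the pseudo-inverse maps y to the intercept coefficient
    return np.linalg.pinv(design)[0]

xc = np.linspace(0.0, 1.0, 201)      # right side of the cutoff, X - c in [0, 1]
sum_sq = {}
for order in [1, 2, 4, 6]:
    w = intercept_weights(xc, order)
    sum_sq[order] = float(w @ w)
    print(f"order {order}: sum of weights = {w.sum():.3f}, "
          f"sum of squared weights = {sum_sq[order]:.4f}")
```

The weights always sum to one, but for higher orders they alternate in sign and take large magnitudes, so the estimate at the cutoff becomes increasingly sensitive to individual observations.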

8.3 Recommendation

Use only local linear (1st-order) or local quadratic (2nd-order) fits, and restrict the sample with a bandwidth.

9 RDD Inference

9.1 Robust SE — CCT (2014)

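Includes the additional uncertainty from the bias correction. The resulting CIs are centered at the bias-corrected estimate and are typically wider (more conservative) than conventional CIs.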

9.2 Permutation Inference — Cattaneo, Frandsen, Titiunik (2015)

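Randomly reassign treatment among the observations near the cutoff to build the randomization distribution of a test statistic under the sharp null (zero effect for every unit).

Useful in small samples.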
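A minimal sketch of this idea, assuming treatment is as good as randomly assigned inside a small window around the cutoff. The test statistic is the difference in means, treatment labels are permuted with the number of treated units held fixed, and the function name, window, and number of permutations are illustrative choices.

```python
import numpy as np

def randomization_pvalue(x, y, cutoff, window, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the sharp null of no effect."""
    rng = np.random.default_rng(seed)
    keep = np.abs(x - cutoff) <= window
    yw = y[keep]
    treated = x[keep] >= cutoff
    observed = yw[treated].mean() - yw[~treated].mean()
    draws = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(treated)          # reshuffle labels, same group sizes
        draws[i] = yw[perm].mean() - yw[~perm].mean()
    p_value = float((np.abs(draws) >= abs(observed)).mean())
    return observed, p_value

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 400)
y = 2.0 * (x >= 0) + rng.normal(0.0, 1.0, x.size)
diff, p_value = randomization_pvalue(x, y, cutoff=0.0, window=0.2)
print(f"difference in means: {diff:.2f}, p-value: {p_value:.4f}")
```

The choice of window matters: it must be narrow enough for the local randomization assumption to be plausible, which is a substantive judgment rather than something this code can verify.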

10 RDD Plot: Visual Check

10.1 Binned Scatter

Split the running variable into bins and plot the mean outcome within each bin.
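The binning step can be sketched with NumPy alone. The cutoff is forced to be a bin edge so that no bin mixes treated and untreated observations; the function name and the equal-width bins are illustrative assumptions.

```python
import numpy as np

def binned_means(x, y, cutoff, bins_per_side):
    """Bin centers and mean outcomes, with the cutoff as a bin edge."""
    edges = np.concatenate([
        np.linspace(x.min(), cutoff, bins_per_side + 1),
        np.linspace(cutoff, x.max(), bins_per_side + 1)[1:],
    ])
    # Bin k covers [edges[k], edges[k+1]); the maximum is folded into the last bin
    idx = np.clip(np.digitize(x, edges) - 1, 0, edges.size - 2)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.array([y[idx == k].mean() if np.any(idx == k) else np.nan
                      for k in range(edges.size - 1)])
    return centers, means

x = np.linspace(-1.0, 1.0, 2000)
y = x + 2.0 * (x >= 0)
centers, means = binned_means(x, y, cutoff=0.0, bins_per_side=10)
print(np.round(means[8:12], 3))      # the two bins on each side of the cutoff
```

Plotting `centers` against `means` with any plotting library gives the binned scatter.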

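10.2 Fitted Line

Overlay a fitted regression line on each side of the cutoff.

10.3 Visual Inspection

Is the jump at the cutoff visually clear? Is the trend away from the cutoff smooth, with no other jumps?

This is standard reporting for RDD. rdplot (R/Python) generates it automatically.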

11 Simulation: Bandwidth Sensitivity

import numpy as np

np.random.seed(42)

n = 3000
X = np.random.uniform(2.0, 4.0, n)
cutoff = 3.5
A = (X >= cutoff).astype(int)

# Outcome: nonlinear trend + treatment effect of 5
true_effect = 5.0
Y = 30 + 5 * X + 2 * (X - 3) ** 2 + true_effect * A + np.random.normal(0, 3, n)

print("[Bandwidth sensitivity simulation]\n")
print(f"True effect: {true_effect}")
print("(Nonlinear trend: 2 * (X - 3)^2)\n")

# Local linear with various bandwidths
for h in [0.05, 0.1, 0.2, 0.3, 0.5, 1.0]:
    mask = np.abs(X - cutoff) <= h
    n_band = mask.sum()
    if n_band < 30:
        print(f"  h = {h}: too few samples")
        continue

    X_band = X[mask] - cutoff
    Y_band = Y[mask]
    A_band = A[mask]

    # Left
    left = X_band < 0
    if left.sum() < 5 or (~left).sum() < 5:
        continue

    # Linear fit
    p_left = np.polyfit(X_band[left], Y_band[left], 1)
    p_right = np.polyfit(X_band[~left], Y_band[~left], 1)

    # Intercept at cutoff (X - c = 0)
    alpha_left = p_left[1]
    alpha_right = p_right[1]
    rdd = alpha_right - alpha_left

    # Error of this single draw (bias plus sampling noise, not bias alone)
    err = rdd - true_effect

    print(f"  h = {h:.2f}, n = {n_band}: RDD = {rdd:.2f}, error = {err:+.2f}")

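print("\n[Observations]")
print("  - Small h: high variance (noisy RDD estimates)")
print("  - Large h: bias can grow as the linear approximation worsens;")
print("    here the curvature is equal on both sides, so much of it cancels")
print("  - X stops at 4.0, so for h > 0.5 the right-hand window is truncated")
print("  - The best h balances bias and variance; prefer a data-driven selector")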

# Risks of higher-order polynomials
print("\n[Higher-order polynomial risk]")
h = 1.0   # large bandwidth
mask = np.abs(X - cutoff) <= h
X_band = X[mask] - cutoff
Y_band = Y[mask]
left = X_band < 0

for order in [1, 2, 3, 4]:
    if left.sum() < order + 2:
        continue
    p_left = np.polyfit(X_band[left], Y_band[left], order)
    p_right = np.polyfit(X_band[~left], Y_band[~left], order)
    alpha_left = p_left[-1]
    alpha_right = p_right[-1]
    rdd_p = alpha_right - alpha_left
    print(f"  Order {order}: RDD = {rdd_p:.2f}")

print("\n  → Higher orders tend to give noisier, less stable RDD estimates")
print("  → Gelman & Imbens (2019): use only linear/quadratic")

12 Conclusion

Local linear regression + triangular kernel + CCT bandwidth + bias correction is the standard approach to RDD estimation. The rdrobust package is the standard tool. Higher-order polynomials are not recommended (Gelman-Imbens 2019).

Key messages:

Local linear: separate linear fits on each side of the cutoff
Triangular kernel: more weight closer to the cutoff
Bandwidth bias-variance trade-off: small h ↔︎ large h
IK (Imbens-Kalyanaraman 2012): MSE-optimal bandwidth
CCT (Calonico-Cattaneo-Titiunik 2014): bias correction + robust SE
rdrobust: the standard package
Avoid higher-order polynomials (Gelman-Imbens 2019)

Next post: McCrary Density Test + diagnostics.

13 Related Topics

Prerequisites

Later posts in Phase J

McCrary Density Test + diagnostics (placeholder)

14 References

Imbens, G. W. & Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics 142, 615-635.
Imbens, G. W. & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies 79, 933-959.
Calonico, S., Cattaneo, M. D., Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82, 2295-2326.
Cattaneo, M. D., Frandsen, B. R., Titiunik, R. (2015). Randomization inference in the regression discontinuity design. Journal of Causal Inference 3, 1-24.
Calonico, S., Cattaneo, M. D., Farrell, M. H., Titiunik, R. (2017). rdrobust: Software for regression-discontinuity designs. Stata Journal 17, 372-404.
Gelman, A. & Imbens, G. (2019). Why high-order polynomials should not be used in regression discontinuity designs. Journal of Business & Economic Statistics 37, 447-456.
