Kwangmin Kim - Regression Discontinuity Design (RDD)

Sources

This post is written from prior knowledge (no textbook was checked; it draws on the agent's pretraining). Key citations: Thistlethwaite & Campbell (1960), Imbens & Lemieux (2008), Hahn, Todd, Van der Klaauw (2001), Imbens & Kalyanaraman (2012), Calonico, Cattaneo, Titiunik (2014), McCrary (2008).

This is the 17th post in the Phase J series and the first of the four-post J-RDD series. It covers Regression Discontinuity Design, often regarded as one of the cleanest forms of quasi-experimental causal inference.

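1 Entry intuition: “people on either side of the cutoff are almost the same”

The DiD posts in the previous series exploited the time dimension. RDD instead exploits people who sit just on either side of a cutoff and are nearly identical.

RDD in one line: treatment is assigned according to whether a continuous variable (the running variable) crosses a cutoff. People just above and just below the cutoff should be nearly identical in every respect except treatment.

1.1 Example

University scholarship: students with a GPA of 3.5 or higher receive a scholarship. A student with 3.49 and a student with 3.51 differ by only 0.02 points, so they are nearly identical.

Compare the post-graduation income of the two groups. The difference is the causal effect of the scholarship.

Analogy (a pass mark): an exam is passed with 60 points or more. A student with 59 and a student with 61 have almost the same ability, so the difference in their later outcomes is the effect of passing.

The crucial condition: the cutoff is exogenous, meaning people either do not know exactly where it is or cannot precisely manipulate their score to land on a chosen side of it.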

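2 The first application of RDD: Thistlethwaite & Campbell (1960)

2.1 Scenario

The 1960 study measured the effect of the U.S. National Merit Scholarship program. Awards were granted based on a test-score cutoff.

2.2 Analysis

Compare the later academic and career outcomes of students on either side of the cutoff. Because students just above and just below it have nearly the same ability, the difference is attributed to the causal effect of the award.

2.3 Significance

The original example of this quasi-experimental design. It enables causal inference from a natural experiment in settings where an RCT is difficult.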

3 Definition: Running Variable + Cutoff

Definition: Regression Discontinuity Design

Running variable (also forcing variable or score) \(X\): a continuous variable (e.g., GPA, age, test score, population)
Cutoff \(c\): the threshold (e.g., 3.5, age 65, 60 points)
Treatment \(A\): \(A = 1\) if \(X \geq c\), else \(A = 0\) (Sharp); or the probability of treatment jumps at \(c\) (Fuzzy)

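3.1 Identification

Near the cutoff, \(A\) changes abruptly, and that change is exogenous. Therefore:

\[ \tau_{RDD} = \lim_{x \downarrow c} \mathbb{E}[Y | X=x] - \lim_{x \uparrow c} \mathbb{E}[Y | X=x] \]

That is, the expected outcome just above the cutoff minus the expected outcome just below it. In a sharp design this is the average treatment effect for units at the cutoff: a local effect, not a population-wide ATE.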

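3.2 Assumptions

Continuity: absent treatment, the expected outcome is continuous in \(X\) at the cutoff, and the distribution of every other variable is continuous there as well (no discontinuous jump other than treatment)
No manipulation: units cannot precisely sort themselves onto one side of the cutoff (e.g., it is hard to push a score to exactly just above the cutoff)

Intuition for the formula: the cutoff acts as a randomizer, assigning similar people to two groups as if at random. This is the local randomization interpretation.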
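A short derivation, sketched under the continuity assumption above, shows why the difference of one-sided limits identifies a causal effect in the sharp design. The potential-outcome notation \(Y(1)\), \(Y(0)\) is an assumption introduced here for illustration; it does not appear elsewhere in this post.

```latex
\begin{aligned}
\lim_{x \downarrow c} \mathbb{E}[Y \mid X = x]
  &= \lim_{x \downarrow c} \mathbb{E}[Y(1) \mid X = x]
   = \mathbb{E}[Y(1) \mid X = c], \\
\lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]
  &= \lim_{x \uparrow c} \mathbb{E}[Y(0) \mid X = x]
   = \mathbb{E}[Y(0) \mid X = c], \\
\tau_{RDD}
  &= \mathbb{E}[Y(1) - Y(0) \mid X = c].
\end{aligned}
```

The first equality in each of the first two lines uses sharp assignment (\(Y = Y(1)\) above the cutoff and \(Y = Y(0)\) below); the second uses continuity of \(\mathbb{E}[Y(a) \mid X = x]\) at \(x = c\).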

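4 Sharp RDD vs Fuzzy RDD

4.1 Sharp RDD

If \(X \geq c\), the unit is always treated, and if \(X < c\), it is never treated. Treatment assignment is deterministic.

Example: every student with a GPA of 3.5 or higher receives the scholarship, and no one below does.

4.2 Fuzzy RDD

If \(X \geq c\), the probability of treatment is higher, but not 100%. Assignment is probabilistic, with a jump in the treatment probability at the cutoff.

Example: 80% of students with a GPA of 3.5 or higher receive the scholarship (additional factors such as a separate review apply), and 10% of students below 3.5 also receive it.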

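4.3 Difference in estimands

Aspect	Sharp	Fuzzy
Treatment assignment	Deterministic	Probabilistic
Estimand	\(\tau = \mu_+ - \mu_-\)	\(\tau = \frac{\mu_+ - \mu_-}{P_+ - P_-}\)
Interpretation	Average treatment effect at the cutoff	LATE for compliers at the cutoff

Here \(\mu_+\) and \(\mu_-\) are the limits of \(\mathbb{E}[Y | X=x]\) as \(x\) approaches \(c\) from above and below, and \(P_+\) and \(P_-\) are the corresponding limits of \(P(A=1 | X=x)\).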

4.4 Fuzzy RDD = a special case of IV

Fuzzy RDD is a special case of Instrumental Variables in which the indicator for crossing the cutoff, \(\mathbb{1}[X \geq c]\), serves as the instrument. It estimates a Local Average Treatment Effect (LATE).
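The ratio form of the fuzzy estimand can be illustrated with a small simulation. This is a minimal sketch, not a production estimator: the 80% / 10% treatment probabilities come from the example in 4.2, while the outcome model, the sample size, the bandwidth `h = 0.3`, and the helper name `intercept_at_cutoff` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20000
cutoff = 3.5
gpa = rng.uniform(2.0, 4.0, n)
above = gpa >= cutoff

# Fuzzy assignment: 80% treated above the cutoff, 10% below (example in 4.2)
treated = rng.random(n) < np.where(above, 0.8, 0.1)

# Assumed outcome model: linear in GPA plus a constant treatment effect of 5
true_effect = 5.0
income = 30 + 10 * gpa + true_effect * treated + rng.normal(0, 5, n)

def intercept_at_cutoff(x, y):
    """Fitted value at the cutoff from a linear fit on centered x."""
    slope, intercept = np.polyfit(x - cutoff, y, 1)
    return intercept

h = 0.3  # assumed bandwidth
window = np.abs(gpa - cutoff) <= h
left, right = window & ~above, window & above

# Numerator: jump in the outcome; denominator: jump in treatment probability
jump_outcome = (intercept_at_cutoff(gpa[right], income[right])
                - intercept_at_cutoff(gpa[left], income[left]))
jump_treatment = (intercept_at_cutoff(gpa[right], treated[right].astype(float))
                  - intercept_at_cutoff(gpa[left], treated[left].astype(float)))

fuzzy_estimate = jump_outcome / jump_treatment
print(f"jump in outcome:   {jump_outcome:.2f}")
print(f"jump in treatment: {jump_treatment:.2f}")
print(f"fuzzy RDD (Wald) estimate: {fuzzy_estimate:.2f}")
```

Because the simulated effect is the same for everyone, the LATE for compliers coincides with the overall effect here; with heterogeneous effects the ratio would recover only the complier effect at the cutoff.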

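5 Local Linear Regression: Imbens & Lemieux (2008)

5.1 Motivation

The goal is to estimate the mean outcome on each side of the cutoff, at the cutoff itself. Simple sample means on either side of the cutoff are biased whenever the outcome has a slope in the running variable; local linear regression removes that first-order bias by fitting the slope.

5.2 Algorithm

Fit a separate linear regression on each side of the cutoff \(c\):

Left (\(X < c\)): \[ Y_i = \alpha_- + \beta_- (X_i - c) + \varepsilon_i \]

Right (\(X \geq c\)): \[ Y_i = \alpha_+ + \beta_+ (X_i - c) + \varepsilon_i \]

\(\hat{\tau}_{RDD} = \hat{\alpha}_+ - \hat{\alpha}_-\): the jump at the cutoff. Because \(X\) is centered at \(c\), each intercept is the fitted value at the cutoff.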
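The two one-sided regressions can also be written as a single pooled regression with an interaction term, which returns the same jump. The sketch below additionally weights observations with a triangular kernel, a common choice in practice but an assumption here, since the text above does not specify a kernel. The function name `local_linear_rdd` and the simulated data are hypothetical.

```python
import numpy as np

def local_linear_rdd(x, y, cutoff, h):
    """Jump at the cutoff from a kernel-weighted local linear fit.

    Pooled model: y = a + tau*d + b*(x - c) + g*d*(x - c), with d = 1[x >= c].
    The coefficient tau equals alpha_plus - alpha_minus.
    """
    xc = x - cutoff
    w = np.clip(1 - np.abs(xc) / h, 0, None)  # triangular kernel weights
    keep = w > 0                               # drops points at or beyond h
    xc, y, w = xc[keep], y[keep], w[keep]
    d = (xc >= 0).astype(float)
    design = np.column_stack([np.ones_like(xc), d, xc, d * xc])
    sw = np.sqrt(w)                            # weighted least squares via rescaling
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef[1]

# Usage on simulated data with an assumed true jump of 5
rng = np.random.default_rng(0)
gpa = rng.uniform(2.0, 4.0, 20000)
income = 30 + 10 * gpa + 5.0 * (gpa >= 3.5) + rng.normal(0, 5, gpa.size)
tau_hat = local_linear_rdd(gpa, income, cutoff=3.5, h=0.3)
print(f"estimated jump: {tau_hat:.2f}")
```

Allowing separate slopes on each side (the interaction term) is what makes the pooled fit equivalent to two separate regressions; dropping it would force a common slope.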

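5.3 Bandwidth

Only observations near the cutoff are used. The bandwidth \(h\) keeps observations within distance \(h\) of the cutoff, i.e., \(|X_i - c| \leq h\).

The choice of bandwidth is decisive:

Small h: low bias, high variance

Large h: high bias, low variance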
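The trade-off can be made visible with a small Monte Carlo sketch. The outcome model below adds a cubic term in GPA, an assumption introduced only so that the linear fit is misspecified away from the cutoff; the replication count, sample size, and the helper name `jump_estimate` are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
cutoff, true_effect = 3.5, 5.0
bandwidths = [0.1, 0.3, 0.5]

def jump_estimate(x, y, h):
    """Difference of intercepts from separate linear fits within distance h."""
    xc = x - cutoff
    left = (xc < 0) & (xc >= -h)
    right = (xc >= 0) & (xc <= h)
    a_left = np.polyfit(xc[left], y[left], 1)[1]
    a_right = np.polyfit(xc[right], y[right], 1)[1]
    return a_right - a_left

estimates = {h: [] for h in bandwidths}
for _ in range(200):
    gpa = rng.uniform(2.0, 4.0, 2000)
    treated = gpa >= cutoff
    # Assumed curvature: the cubic term makes a straight line a poor fit far from c
    income = (30 + 10 * gpa + 40 * (gpa - cutoff) ** 3
              + true_effect * treated + rng.normal(0, 5, gpa.size))
    for h in bandwidths:
        estimates[h].append(jump_estimate(gpa, income, h))

bias = {h: np.mean(v) - true_effect for h, v in estimates.items()}
spread = {h: np.std(v) for h, v in estimates.items()}
for h in bandwidths:
    print(f"h = {h}: bias = {bias[h]:+.2f}, sd = {spread[h]:.2f}")
```

Under these assumptions the narrow window should show the larger spread and the wide window the larger absolute bias, which is the pattern that data-driven bandwidth selectors try to balance.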

5.4 Bandwidth Selection

IK (Imbens-Kalyanaraman 2012): MSE-optimal bandwidth.

CCT (Calonico-Cattaneo-Titiunik 2014): bias-corrected estimates with robust standard errors and confidence intervals.

R package: rdrobust (Calonico et al.), the standard tool.

Covered in depth in a follow-up post.

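6 McCrary Density Test (2008)

6.1 Motivation

A check of the no-manipulation assumption of RDD.

6.2 Mechanism

If people can manipulate their score to get above the cutoff, observations pile up just above it. The density of the running variable then jumps at the cutoff.

6.3 Test

Estimate the density of the running variable separately on each side of the cutoff, then test whether the two estimates are equal at the cutoff.

A significant jump suggests manipulation, i.e., a violation of the RDD assumptions.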
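The idea behind the test can be illustrated with a much cruder check: count observations in a narrow window just below and just above the cutoff and ask whether the split is compatible with 50/50. This is explicitly not McCrary's estimator, which fits local linear regressions to a finely binned histogram; it is a simplified stand-in that assumes the density is roughly flat within the window. The function name `density_jump_check`, the window width, and the simulated manipulation are all hypothetical.

```python
import numpy as np
from math import erf, sqrt

def density_jump_check(x, cutoff, h):
    """Normal-approximation binomial check for bunching around the cutoff."""
    n_left = int(np.sum((x >= cutoff - h) & (x < cutoff)))
    n_right = int(np.sum((x >= cutoff) & (x < cutoff + h)))
    n = n_left + n_right
    # Under no manipulation (and a locally flat density), n_right ~ Binomial(n, 0.5)
    z = (n_right - n / 2) / sqrt(n / 4)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return n_left, n_right, z, p_value

rng = np.random.default_rng(3)
cutoff = 3.5
scores = rng.uniform(2.0, 4.0, 5000)

# Simulated manipulation: 80% of units within 0.05 below the cutoff move just above it
just_below = (scores >= cutoff - 0.05) & (scores < cutoff)
bumped = just_below & (rng.random(scores.size) < 0.8)
manipulated = scores.copy()
manipulated[bumped] = cutoff + rng.uniform(0, 0.05, bumped.sum())

for label, data in [("clean", scores), ("manipulated", manipulated)]:
    n_left, n_right, z, p = density_jump_check(data, cutoff, h=0.1)
    print(f"{label}: left = {n_left}, right = {n_right}, z = {z:.2f}, p = {p:.4f}")

n_left_m, n_right_m, z_m, p_m = density_jump_check(manipulated, cutoff, h=0.1)
```

For real analyses a dedicated density-test implementation is preferable, since a flat-density assumption rarely holds exactly.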

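6.4 Example: Test Score

A hypothetical illustration: an SAT-style exam with a cutoff at 1300 points. Scores bunch at 1300 and just above (e.g., few scores fall between 1295 and 1299). The McCrary test detects a jump, pointing to possible score manipulation.

Covered in depth in a follow-up post.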

7 Visual Inspection: standard practice in RDD

7.1 Density Plot

A histogram of the running variable with the cutoff marked. Look for a spike or a gap around the cutoff.

7.2 Outcome Plot

X axis: running variable; Y axis: outcome. Plot local averages (a binned scatter) with the fitted regression on each side to visualize the jump at the cutoff.
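The local averages behind such a plot can be computed as follows. This is a minimal sketch with a hypothetical helper `binned_means`; the one design choice worth noting is that bin edges are aligned to the cutoff so that no bin mixes treated and untreated units. The simulated data and bin width are assumptions.

```python
import numpy as np

def binned_means(x, y, cutoff, bin_width):
    """Mean of y within equal-width bins of x whose edges are aligned to the cutoff."""
    idx = np.floor((x - cutoff) / bin_width).astype(int)  # bin 0 starts at the cutoff
    centers, means = [], []
    for k in np.unique(idx):                               # sorted ascending
        in_bin = idx == k
        centers.append(cutoff + (k + 0.5) * bin_width)
        means.append(y[in_bin].mean())
    return np.array(centers), np.array(means)

rng = np.random.default_rng(4)
cutoff = 3.5
gpa = rng.uniform(2.0, 4.0, 20000)
income = 30 + 10 * gpa + 5.0 * (gpa >= cutoff) + rng.normal(0, 5, gpa.size)

centers, means = binned_means(gpa, income, cutoff, bin_width=0.1)
last_left = means[centers < cutoff][-1]
first_right = means[centers > cutoff][0]
print(f"last bin below cutoff:  {last_left:.2f}")
print(f"first bin above cutoff: {first_right:.2f}")
```

Plotting `centers` against `means` (for example as a scatter, with the one-sided fitted lines overlaid) gives the usual RD outcome plot. The gap between adjacent bin means overstates the jump slightly because it also includes the slope across one bin width.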

7.3 Other Covariates Plot

Check that other baseline variables are continuous across the cutoff. A jump in a baseline variable casts doubt on the RDD assumptions.
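Numerically, the same check amounts to running the RDD estimator with a baseline covariate in place of the outcome and expecting a jump near zero. The sketch below uses a hypothetical covariate, `parent_income`, simulated to be smooth in GPA; the bandwidth and helper name `jump_at_cutoff` are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
cutoff, h = 3.5, 0.3
gpa = rng.uniform(2.0, 4.0, 20000)

# Hypothetical baseline covariate: fixed before treatment, smooth in GPA
parent_income = 50 + 5 * gpa + rng.normal(0, 10, gpa.size)

def jump_at_cutoff(x, y):
    """Difference of intercepts from separate linear fits within distance h."""
    xc = x - cutoff
    left = (xc < 0) & (xc >= -h)
    right = (xc >= 0) & (xc <= h)
    a_left = np.polyfit(xc[left], y[left], 1)[1]
    a_right = np.polyfit(xc[right], y[right], 1)[1]
    return a_right - a_left

covariate_jump = jump_at_cutoff(gpa, parent_income)
print(f"jump in baseline covariate: {covariate_jump:.2f}")
```

A jump that is large relative to its sampling noise would suggest that units on the two sides differ in more than treatment.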

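8 Application areas

8.1 Education

Scholarship cutoffs (GPA, test scores). Effects on later academic achievement and careers.

Example: Thistlethwaite & Campbell (1960), the effect of merit awards.

8.2 Politics

The cutoff for winning an election (50% vote share). Political effects of a party narrowly winning vs narrowly losing.

Examples: Lee (2008), the effect of incumbency on re-election; Lee, Moretti, Butler (2004), the effect of electing a Democratic representative on policy.

8.3 Health care

Birth-weight cutoff (the very-low-birth-weight threshold of 1500 g). Differences in intensive treatment just above vs just below.

Example: Almond, Doyle, Kowalski, Williams (2010), the effect of intensive care.

8.4 Social policy

Age cutoffs (health insurance at 65), income cutoffs (eligibility for assistance).

Example: Card, Dobkin, Maestas (2009), the effect of Medicare eligibility on health care use and mortality.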

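9 RDD vs RCT: Trade-off

Aspect	RCT	RDD
Data	Random assignment	Natural experiment (observational)
Identification	Exchangeability	Local randomization at cutoff
Effect	ATE (whole study population)	Average treatment effect at the cutoff (local)
External validity	Generalizes to the study population	Near the cutoff only
Assumptions	Weak (guaranteed by randomization)	Strong (continuity, no manipulation)

9.1 Key limitation

An RDD effect estimate is valid only for people near the cutoff. It is hard to generalize to the population far from it.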

10 Guide to the three follow-up posts

10.1 J-RDD-1: Sharp vs Fuzzy RDD

An in-depth comparison of the two designs. Fuzzy RDD as a special case of IV. LATE estimation.

10.2 J-RDD-2: Local Linear Regression + Bandwidth

A systematic treatment of Imbens-Lemieux (2008). IK and CCT bandwidth selection. An rdrobust example.

10.3 J-RDD-3: McCrary Density Test + Diagnostics

The manipulation test in depth. Other diagnostics: covariate balance, placebo cutoffs. A summary of robustness checks for RDD.

11 Simulation: Sharp RDD

import numpy as np

np.random.seed(42)

# Scenario: GPA cutoff at 3.5, treatment = scholarship
n = 2000
GPA = np.random.uniform(2.0, 4.0, n)
cutoff = 3.5

# Sharp design: GPA >= 3.5 always receives the scholarship
A = (GPA >= cutoff).astype(int)

# Outcome (post-graduation income): linear in GPA + scholarship effect (= 5)
true_effect = 5.0
income = 30 + 10 * GPA + true_effect * A + np.random.normal(0, 5, n)

# Naive comparison (recipients vs non-recipients): confounded by GPA
naive = income[A == 1].mean() - income[A == 0].mean()
print("[Sharp RDD simulation]\n")
print(f"True treatment effect: {true_effect}")
print(f"\nNaive (simple comparison): {naive:.2f}")
print("  -> biased: mixes the GPA effect with the scholarship effect")

# RDD: difference in means on either side of the cutoff, within bandwidth h.
# The slope in GPA (10) still leaks in, adding roughly 10 * h to the estimate.
bandwidth_options = [0.1, 0.2, 0.3, 0.5]

print("\n[RDD with different bandwidths]")
for h in bandwidth_options:
    mask = np.abs(GPA - cutoff) <= h
    n_band = mask.sum()
    if n_band < 20:
        continue  # too few observations for a meaningful comparison
    Y_left = income[mask & (GPA < cutoff)].mean()
    Y_right = income[mask & (GPA >= cutoff)].mean()
    rdd_simple = Y_right - Y_left
    print(f"  h = {h}: n = {n_band}, RDD = {rdd_simple:.2f}")

# Local linear regression (simple implementation)
print("\n[Local Linear Regression, h = 0.3]")
h = 0.3
mask = np.abs(GPA - cutoff) <= h
gpa_local = GPA[mask]
income_local = income[mask]

# Split the local sample into left and right of the cutoff
left = gpa_local < cutoff
right = gpa_local >= cutoff

# Separate linear fit on each side, with GPA centered at the cutoff.
# np.polyfit returns [slope, intercept], so index 1 is the fitted value at the cutoff.
y_left = income_local[left]
beta_left = np.polyfit(gpa_local[left] - cutoff, y_left, 1)
alpha_left = beta_left[1]   # intercept at cutoff

y_right = income_local[right]
beta_right = np.polyfit(gpa_local[right] - cutoff, y_right, 1)
alpha_right = beta_right[1]  # intercept at cutoff

rdd_local_linear = alpha_right - alpha_left
print(f"  Left intercept: {alpha_left:.2f}")
print(f"  Right intercept: {alpha_right:.2f}")
print(f"  RDD estimate: {rdd_local_linear:.2f}")
print(f"  -> should be close to the true effect {true_effect}, up to sampling noise")

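12 Conclusion

RDD performs causal inference by exploiting local randomization at a cutoff. It combines Sharp/Fuzzy designs, local linear regression, and the McCrary density test. It is among the cleanest natural-experiment designs, but it estimates the effect only near the cutoff.

Key messages:

Running variable + Cutoff: exogenous treatment assignment
Sharp vs Fuzzy RDD: deterministic vs probabilistic
Local linear regression (Imbens-Lemieux 2008)
Bandwidth selection (IK, CCT): decisive
McCrary density test: diagnosing manipulation
Applications: education, politics, health care, social policy
Limitation: valid near the cutoff only; weak external validity

Covered in depth in the three follow-up posts.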

13 Related topics

Prerequisites

(Phase D) Hernan Ch.7: Confounding
DiD series

Follow-up posts in Phase J

Sharp vs Fuzzy RDD (placeholder)
Local Linear Regression + Bandwidth (placeholder)
McCrary Density Test + 진단 (placeholder)

14 References

Thistlethwaite, D. L. & Campbell, D. T. (1960). Regression-discontinuity analysis. J. Educational Psychology 51, 309-317.
Hahn, J., Todd, P., Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica 69, 201-209.
Imbens, G. W. & Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. J. Econometrics 142, 615-635.
Imbens, G. W. & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies 79, 933-959.
Calonico, S., Cattaneo, M. D., Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82, 2295-2326.
McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. J. Econometrics 142, 698-714.
Lee, D. S. (2008). Randomized experiments from non-random selection in U.S. House elections. J. Econometrics 142, 675-697.
Almond, D., Doyle, J. J., Kowalski, A. E., Williams, H. (2010). Estimating marginal returns to medical care. QJE 125, 591-634.
Card, D., Dobkin, C., Maestas, N. (2009). Does Medicare save lives? QJE 124, 597-636.