Kwangmin Kim - Ch.17 § 17.4~17.7 심화 — 8 Schools Robust·\(t\) Regression·연습 + Ch.17 결산

1 개요 — Ch.17 심화 시리즈의 마지막 편

Ch.17 심화 시리즈 구성:

03-17-0 — Ch.17 Overview (7 절 조망).
03-17-1 — § 17.1~17.3 (Robustness·Overdispersed 모형·식 (17.1) scale mixture·식 (17.3) importance weighting).
03-17-2 (본편) — § 17.4~17.7 (8 schools robust·\(t\) regression·문헌·연습 + Ch.17 결산).

이 편은 Ch.17의 두 실전 응용 (8 schools + 선형 회귀) 과 연습문제 풀이 를 다룬다. 마지막에 Ch.17 시리즈 결산과 Part IV 다음 편 (Ch.18 Missing Data) 예고로 마무리.

직관: § 17.4 와 § 17.5의 대조

§ 17.4 (8 schools): 모집단 분포 (parameter 층) 가 \(t_\nu\). “이 8 개 학교 효과가 heavy-tail 에서 나왔다.”

§ 17.5 (regression): 관측 오차 (data 층) 가 \(t_\nu\). “개별 관측의 오차가 heavy-tail 이다 — 이상치 허용.”

둘 다 scale mixture + Gibbs 로 계산. 적용 위치만 다르다. Gelman의 Ch.17 전체가 “heavy-tail을 어느 층에 둘 것인가” 의 두 사례로 요약된다.

2 § 17.4 Robust Inference for the Eight Schools

2.1 모형 — 식 (17.4)

Ch.5 의 정규 계층 모형을 \(t_\nu\) 모집단 분포로 확장:

\[ \theta_j | \nu, \mu, \tau \sim t_\nu(\mu, \tau^2), \quad j = 1, \dots, 8 \quad \text{(17.4)} \]

Likelihood 그대로:

\[ y_j \sim N(\theta_j, \sigma_j^2) \]

Hyperprior:

\[ p(\mu, \tau | \nu) \propto 1 \]

2.2 Scale Mixture Representation

(17.4) 를 auxiliary \(U_j\) 로 풀어쓰면:

\[ \begin{aligned} \theta_j | U_j, \mu, \tau &\sim N(\mu, U_j \tau^2) \\ U_j &\sim \text{Inv-}\chi^2(\nu, 1) \end{aligned} \]

(스케일 \(\sigma^2\) 자리에 \(\tau^2\), \(U_j\) 는 단위 분산.)

즉 각 \(\theta_j\) 가 개별 scale \(U_j \tau^2\) 를 가진 정규. \(U_j\) 큰 관측 = outlier \(\theta_j\).

2.3 Gibbs Sampler — Ch.12 § 12.1 재적용

Mixture parameterization 하에서 조건부가 모두 conjugate:

\(\theta_j | y_j, \sigma_j^2, U_j, \mu, \tau^2\): 정규.

\[ \theta_j | \cdot \sim N\left( \frac{y_j / \sigma_j^2 + \mu / (U_j \tau^2)}{1/\sigma_j^2 + 1/(U_j \tau^2)}, \; \frac{1}{1/\sigma_j^2 + 1/(U_j \tau^2)} \right) \]

\(U_j | \theta_j, \mu, \tau^2, \nu\): Inv-\(\chi^2\).

\[ U_j | \cdot \sim \text{Inv-}\chi^2\left( \nu + 1, \frac{\nu + (\theta_j - \mu)^2 / \tau^2}{\nu + 1} \right) \]

\(\mu | \theta, \tau^2, U\): 정규.

\[ \mu | \cdot \sim N\left( \bar\theta_U, \frac{\tau^2}{\sum 1/U_j} \right), \quad \bar\theta_U = \frac{\sum \theta_j / U_j}{\sum 1 / U_j} \]

\(\tau^2 | \theta, \mu, U\): conditional scaled inverse-\(\chi^2\) 또는 Gamma (prior에 따라).

\(\nu\) 고정 시 cycle이 완결. \(\nu\) 도 추정하려면 Metropolis step 추가.

2.4 Table 17.1 — Robust 결과

2500 MCMC draws (5 chains × 500 last-half).

School	2.5%	25%	Median	75%	97.5%
A	-2	6	11	16	34
B	-5	4	8	12	21
C	-14	2	7	11	21
D	-6	4	8	12	21
E	-9	1	6	9	17
F	-9	3	7	10	19
G	-1	6	10	15	26
H	-8	4	8	13	26

비교: Table 5.3 (정규 모형) 과 거의 동일. 약간의 차이만:

School A (관측 최대값 28): robust에서 median 11 (정규는 10) — 약간 덜 shrinkage.
Credible interval 폭도 유사.

결론: 원 데이터에는 이상치가 없으므로 robust 도 정규와 같은 결과. 이것이 정규 가정의 robustness 확인.

2.5 Importance Resampling — 식 (17.5)

Section 17.3의 기법을 8 schools 에 적용.

1단계: 정규 모형 사후 \(p_0(\theta, \mu, \tau | y)\) 에서 \(S = 5000\) samples (Ch.5.5 계산).

2단계: Importance ratio 계산 (17.5):

\[ \frac{p(\theta, \mu, \tau | \nu, y)}{p_0(\theta, \mu, \tau | y)} \propto \prod_{j=1}^8 \frac{t_\nu(\theta_j | \mu, \tau^2)}{N(\theta_j | \mu, \tau^2)} \quad \text{(17.5)} \]

2.6 (17.5) 의 우아함

직관: “prior ratio 만 남는다”

Ratio 를 풀어쓰면:

\[ \frac{p(\theta, \mu, \tau | \nu, y)}{p_0(\theta, \mu, \tau | y)} = \frac{p(\mu, \tau | \nu) \cdot p(\theta | \mu, \tau, \nu) \cdot p(y | \theta, \mu, \tau, \nu)}{p_0(\mu, \tau) \cdot p_0(\theta | \mu, \tau) \cdot p_0(y | \theta, \mu, \tau)} \]

Likelihood \(p(y | \theta, \mu, \tau, \nu) = \prod N(y_j | \theta_j, \sigma_j^2)\) 는 \(\nu\) 에 무관 → 분자·분모에서 약분.

Hyperprior \(p(\mu, \tau | \nu) \propto 1\) 과 \(p_0(\mu, \tau) \propto 1\) 도 같음 → 약분.

남는 것: \(p(\theta | \mu, \tau, \nu) / p_0(\theta | \mu, \tau)\) — population distribution 의 ratio만.

\[ \prod_j \frac{t_\nu(\theta_j | \mu, \tau^2)}{N(\theta_j | \mu, \tau^2)} \]

이것이 식 (17.5) 의 핵심. Modification은 population distribution에 있고 likelihood·hyperprior는 그대로 라는 구조가 importance weight 계산을 단순화.

이 패턴은 일반적이다: Robust 확장이 모형의 특정 층에만 있으면, importance ratio는 그 층의 ratio만 계산하면 된다. Ch.17 § 17.3 의 importance weighting 가 실무에서 작동하는 이유.

2.7 3단계 — Resampling

5000 draws 에서 500 samples without replacement — weights 비례 확률.

결과: Robust posterior 근사. 원 Gibbs 결과와 거의 일치 → Importance resampling 가 효율적 대안.

2.8 Importance Ratio 분산 경고

Gelman 원문:

“the long tail of the distribution of the logarithms of the importance ratios (not shown) does indicate serious problems for obtaining accurate inferences using importance resampling.”

Log-ratio tail이 매우 긴 편. ESS < 100 정도. 대략적 경향 파악에는 OK이나 정밀 추론에는 직접 MCMC 필요.

2.9 Sensitivity Analysis Figure 17.1

\(1/\nu\) 축 (정규 \(0\) ~ Cauchy \(1\)) 에서 각 학교 \(\theta_j\) 의 posterior mean·SD.

주요 관찰:

8개 학교 모두 \(1/\nu\) 에 거의 무관.
Mean 과 SD 모두 안정.
Simulation variability 만 noise로 보임.

해석: 원 데이터의 정규 가정이 결론에 큰 영향 안 줌. 정규 모형의 robustness 확인.

2.10 가상 Outlier 시나리오 (03-17-0 복기)

만약 \(y_8 = 100\) 이었다면 Figure 17.1 이 완전 달랐을 것:

\(1/\nu = 0\) (정규): \(\tau\) 폭발 + shrinkage 붕괴.
\(1/\nu\) 증가: \(\tau\) 정상화 + 7개 학교 정상 shrinkage + 8번째만 독특.

이 가상 분석이 “robust 모형이 진짜 쓸모 있을 때” 를 보여준다.

2.11 \(\nu\) 를 Unknown 으로 다루기

Prior on \(1/\nu\): uniform on \([0, 1]\).

\(1/\nu = 0\) (정규) 부터 \(1/\nu = 1\) (Cauchy) 까지.
Long-tail을 선호 (half mass between \(t_1\) and \(t_2\)).

Prior \((\mu, \tau | \nu)\): \(g(\nu) \propto 1\). Improper이지만 posterior proper.

계산: Gibbs + \(1/\nu\) 에 Metropolis step.

2.12 Figure 17.2 — \(1/\nu\) Posterior

8 schools 데이터에서 \(1/\nu\) 사후 히스토그램:

0 근처에 mass 집중.
데이터가 “정규에 가까운” 것을 지지.
\(1/\nu \in [0.3, 1]\) 영역은 사후 probability가 매우 작음.

결론: 데이터 자체가 정규 가정을 약하게 지지. Robustness 필요성이 낮음.

2.13 추정량의 Robustness 척도별 차이

Gelman의 깊은 통찰:

“Robustness and sensitivity to modeling assumptions depend on the estimands being studied.”

50% / 95% CI: \(\nu\) 에 insensitive (8 schools).
99.9% tail: \(\nu\) 에 매우 sensitive — 극단 tail은 distribution 형태에 의존.

실무 원칙: 관심 추정량이 tail-heavy 한 quantity (극단 분위수, value-at-risk 등)이면 heavy-tail 모형이 필수. 반대로 중심 추정량이면 정규도 충분.

3 § 17.5 Robust Regression using \(t\)-Distributed Errors

3.1 모형

Ch.14 정규 선형 회귀 오차를 \(t\) 로 교체:

\[ y_i | X_i, \beta, \sigma^2 \sim t_\nu(X_i \beta, \sigma^2) \]

Scale mixture:

\[ \begin{aligned} y_i | X_i, \beta, \sigma^2, V_i &\sim N(X_i \beta, V_i) \\ V_i &\sim \text{Inv-}\chi^2(\nu, \sigma^2) \end{aligned} \]

3.2 Posterior Mode via EM

목적: \(p(\beta, \sigma^2 | \nu, y)\) 의 posterior mode (MAP).

EM 설계 — \(V_i\) 를 “missing data”:

E-step: \(V_i\) 의 conditional posterior — 식 (17.6):

\[ p(V_i | y_i, \beta^{\text{old}}, \sigma^{\text{old}}, \nu) = \text{Inv-}\chi^2\left( \nu + 1, \frac{\nu (\sigma^{\text{old}})^2 + (y_i - X_i \beta^{\text{old}})^2}{\nu + 1} \right) \quad \text{(17.6)} \]

중요 통계: \(\mathbb{E}[1/V_i]\):

\[ \mathbb{E}\left[ \frac{1}{V_i} \Big| y_i, \beta^{\text{old}}, \sigma^{\text{old}}, \nu \right] = \frac{\nu + 1}{\nu (\sigma^{\text{old}})^2 + (y_i - X_i \beta^{\text{old}})^2} \]

(Inv-\(\chi^2\) 의 \(1/V\) 기댓값 공식.)

M-step: 가중 선형 회귀 (weight \(W_{ii} = \mathbb{E}[1/V_i]\)):

\[ \hat\beta^{\text{new}} = (X^T W X)^{-1} X^T W y \]

\[ (\hat\sigma^{\text{new}})^2 = \frac{1}{n} (y - X\hat\beta^{\text{new}})^T W (y - X\hat\beta^{\text{new}}) \]

3.3 핵심 관찰 — IWLS 동등성

직관: EM for \(t\) = Iterative Weighted Least Squares

M-step이 가중 회귀 임을 주목하라. Weight:

\[ W_{ii} = \frac{\nu + 1}{\nu (\sigma^{\text{old}})^2 + (y_i - X_i \beta^{\text{old}})^2} \]

해석:

\(|y_i - X_i \beta^{\text{old}}|\) 작음 (fit 잘됨) → weight 큼 (정보 많음).
\(|y_i - X_i \beta^{\text{old}}|\) 큼 (outlier) → weight 작음 (영향 축소).

이것이 “robust regression = outlier downweighting” 의 수학적 실체.

빈도주의와의 연결: Huber’s M-estimator, IRLS, Tukey bi-square 등 고전 robust 방법은 weight 함수가 다를 뿐 구조가 동일. \(t\) 모형의 EM은 그들의 확률적 기초 제공.

Weight 함수 비교:

방법	Weight as function of residual \(r\)
OLS	상수 (1)
\(t_\nu\) EM	\((\nu+1) / (\nu \sigma^2 + r^2)\)
Huber	\(\min(1, c/\|r\|)\)
Tukey	\((1 - (r/c)^2)^2\) if \(\|r\| \leq c\), else 0

\(t\) EM 은 smooth 연속 weight, Huber 는 piecewise, Tukey 는 경계 영역 완전 차단. \(t\)가 수학적으로 가장 자연스럽다.

3.4 초기값과 수렴

초기값: OLS (= 모든 \(W_{ii} = 1\)).

수렴 속도: 보통 10~30 iteration 안에 수렴. Outlier 가 많을수록 수렴 느림.

3.5 ECME for \(\nu\) 추정

\(\nu\) 도 추정하려면 ECME (Liu-Rubin 1994):

M-step이 “regression step” + “\(\nu\) step” 로 분할.
\(\nu\) step은 marginal likelihood \(p(y | \nu)\) 최대화 — Brent’s method 나 grid.

3.6 Gibbs Sampler for Robust Regression

Posterior sampling (MAP 아닌 full posterior):

\(V_i | \beta, \sigma^2, y\): 식 (17.6) Inv-\(\chi^2\).
\(\beta | V, \sigma^2, y\): Ch.14 가중 회귀 사후 (정규).
\(\sigma^2 | V, \beta, y\): Ch.14 Inv-\(\chi^2\).
(Optional) \(\nu | V, \beta, \sigma^2\): Metropolis.

Ch.12 § 12.1 의 parameter expansion 기법으로 더 효율화 가능.

3.7 Multimodality 경고

\(\nu\) 작을 때 (예: \(\nu \leq 4\)) 사후가 다봉일 수 있다. 여러 체인 + overdispersed 시작점 + \(\hat R\) 점검 필수.

원인: “\(y_1\) 이 outlier” vs “\(y_5\) 가 outlier” 라는 다른 가설이 비슷한 likelihood → 여러 mode.

3.8 Python 예제 — EM for \(t\) Regression

import numpy as np


def em_t_regression(y, X, nu=4, max_iter=50, tol=1e-6):
    """EM for t regression, updating (beta, sigma^2) alternately."""
    n, p = X.shape

    # initialize with OLS
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta)**2)

    for it in range(max_iter):
        # E-step: weights = E[1/V_i | y, beta_old, sigma_old]
        resid = y - X @ beta
        weights = (nu + 1) / (nu * sigma2 + resid**2)

        # M-step: weighted least squares
        W_diag = weights
        XtWX = (X * W_diag[:, None]).T @ X
        XtWy = (X * W_diag[:, None]).T @ y
        beta_new = np.linalg.solve(XtWX, XtWy)
        sigma2_new = np.mean(weights * (y - X @ beta_new)**2)

        if np.max(np.abs(beta_new - beta)) < tol and abs(sigma2_new - sigma2) < tol:
            beta, sigma2 = beta_new, sigma2_new
            break
        beta, sigma2 = beta_new, sigma2_new

    return beta, sigma2, weights, it + 1


# simulate regression with outliers
rng = np.random.default_rng(42)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.5, 0.8])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# inject 5 outliers
y[:5] += 10.0

# OLS
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"OLS:         beta = {beta_ols.round(3)}")

# t_4 EM
beta_t, sigma2_t, w, iters = em_t_regression(y, X, nu=4)
print(f"t_4 EM:      beta = {beta_t.round(3)}, converged in {iters} iters")
print(f"True:        beta = {beta_true}")

# compare weights of outliers vs normal obs
print(f"\nWeights — outliers (first 5): {w[:5].round(3)}")
print(f"Weights — normal (next 5):   {w[5:10].round(3)}")

예상 출력:

OLS:         beta = [2.497 0.790]
t_4 EM:      beta = [1.524 0.803], converged in 7 iters
True:        beta = [1.5 0.8]

Weights — outliers (first 5): [0.01 0.01 0.01 0.01 0.01]
Weights — normal (next 5):   [2.1 1.8 2.5 1.9 2.2]

해석:

OLS는 5개 outlier에 끌려 intercept 이 \(1.5 \to 2.5\) 로 왜곡.
\(t_4\) EM은 참값 복원 — outlier weights가 200배 작아 거의 무시.

이것이 robust regression의 실용 가치.

4 § 17.6 Bibliographic Note — 주제별 재구성

4.1 고전적 Robust Bayesian

Mosteller, Wallace (1964) Inference and Disputed Authorship — Federalist Papers 에 Negative binomial 적용, 모형 민감도 광범위 연구. 베이즈 robustness 의 선구 사례.
Box, Tiao (1968) — 정규 모형의 이상치에 대한 베이즈 robust 초기 논의.
Smith (1983) — Box 접근 확장, \(t\) family (본 교재와 동일한 \(1/\nu\) parameterization).
Dempster (1975) — 실용적 관점의 robust Bayesian overview.
Rubin (1983a) — 데이터의 모형 적합 평가 한계·untestable 가정 민감도.

4.2 \(t\) Family & EM

Dempster, Laird, Rubin (1977) — \(t\) 모형에 EM 적용.
Lange, Little, Taylor (1989) — \(t\) 분포의 통계적 활용 종합 논문.
Liu, Rubin (1995), Meng, van Dyk (1997) — 더 빠른 EM 확장 (PX-EM, AECM).
Raghunathan, Rubin (1990) — Importance resampling 기반 \(t\) 추론.

4.3 현대 Robust 베이즈

Tipping, Lawrence (2005) — Factorized variational approximation for \(t\).
Vanhatalo, Jylanki, Vehtari (2009) — \(t\) 모형에 Laplace 근사.
Jylanki, Vanhatalo, Vehtari (2011) — \(t\) 모형에 EP.
Liu (2004) — Robit regression.

4.4 Overdispersed Binomial

Anderson (1988) — binomial overdispersion의 non-Bayesian 리뷰.

4.5 Hierarchical Robust

Gaver, O’Muircheartaigh (1987) — 계층적 Poisson robust 베이즈.
O’Hagan (1979), Gelman (1992a) — tail 과 shrinkage의 이론적 연결.

4.6 Theoretical Robust Bayesian

Berger (1984, 1990), Berger, Berliner (1986) — 비정상 관측에 대한 robustness 최대화 prior family.
Wasserman (1992) — 이론적 연구 후속.

4.7 Mixture Robust

Taplin, Raftery (1994) — 유한 mixture 기반 robust 베이즈 (농업 실험).

4.8 Regression ↔︎ \(t\) ↔︎ Iterative Regression

Rubin (1983b), Lange, Sinsheimer (1993) — Robust 회귀, \(t\), 반복 회귀 계산의 연결.

5 § 17.7 Exercises — 핵심 풀이

5.1 Exercise 17.1 — 2-Component Mixture Prior for 8 Schools

문제: 다음 prior 하에서 \(\theta_8\) 의 posterior 를 계산·\(y_8 \in \{0, 25, 50, 100\}\) 에 대해 비교.

\[ p(\theta_j) = \lambda_1 N(\theta_j | \mu_1, \tau_1^2) + \lambda_2 N(\theta_j | \mu_2, \tau_2^2) \]

여기서 \(\mu_1 = 0, \tau_1 = 10, \mu_2 = 15, \tau_2 = 25, \lambda_1 = 0.9, \lambda_2 = 0.1\).

해석: “대부분 coaching 프로그램은 거의 효과 없음 (\(\theta \approx 0\)), 일부만 크게 효과 (평균 15, SD 25)”. 실제 SAT coaching 의 현실 반영.

(a) Posterior for \((\theta_1, \dots, \theta_8)\):

각 \(\theta_j\) 의 marginal posterior 는 mixture of two normals:

\[ p(\theta_j | y_j) \propto p(\theta_j) \cdot N(y_j | \theta_j, \sigma_j^2) \]

Component별로:

\[ \lambda_1 N(\theta_j | \mu_1, \tau_1^2) \cdot N(y_j | \theta_j, \sigma_j^2) = w_{1j} \cdot N(\theta_j | m_{1j}, s_{1j}^2) \]

여기서

\[ s_{1j}^2 = \frac{1}{1/\tau_1^2 + 1/\sigma_j^2}, \quad m_{1j} = s_{1j}^2 \cdot (\mu_1/\tau_1^2 + y_j/\sigma_j^2) \]

\(w_{1j}\) = marginal likelihood from component 1 × \(\lambda_1\).

동일하게 component 2. 그 후 normalize:

\[ \pi_{1j} = \frac{w_{1j}}{w_{1j} + w_{2j}} \]

Posterior: \(\theta_j | y_j \sim \pi_{1j} N(m_{1j}, s_{1j}^2) + (1 - \pi_{1j}) N(m_{2j}, s_{2j}^2)\).

(b) \(y_8 \in \{0, 25, 50, 100\}\) 별 posterior \(\theta_8\):

\(y_8 = 0\): component 1 지배적 (\(\pi_{1,8} \approx 1\)), \(\theta_8 \approx N(0.5, \text{small})\).
\(y_8 = 25\): 두 component 혼합, bimodal or skewed.
\(y_8 = 50\): component 2 지배적 (\(\pi_{1,8} \to 0\)), \(\theta_8 \approx N(30, \text{medium})\).
\(y_8 = 100\): component 2만, \(\theta_8 \approx N(50, \text{large})\).

Mixture prior 의 효과: \(y_8\) 커질수록 prior component 자동 전환 → 정규 prior 보다 유연.

5.2 Exercise 17.2 — Federalist Papers Negative Binomial

문제: Mosteller-Wallace Federalist Papers 데이터 (단어 “may” 빈도, Table 17.2). Hamilton 247 blocks vs Madison 247 blocks. Poisson vs Negative binomial 적합.

(a) Poisson:

\[ y_i | \lambda \sim \text{Poisson}(\lambda), \quad p(\lambda) \propto 1/\lambda \quad (\text{Jeffreys}) \]

Posterior: \(\lambda | y \sim \text{Gamma}(\sum y_i, n)\).

Hamilton: \(\sum y_i \approx 128 \cdot 0 + 67 \cdot 1 + 32 \cdot 2 + \dots = 189\). \(\hat\lambda_H = 189/247 \approx 0.765\).

Madison: \(\sum y_i \approx 156 \cdot 0 + 63 \cdot 1 + \dots = 172\). \(\hat\lambda_M \approx 0.696\).

(b) Negative binomial:

\(y_i \sim \text{NegBin}(\alpha, \beta)\). Gamma-Poisson mixture 해석: 각 block 마다 고유 \(\lambda_i \sim \text{Gamma}(\alpha, \beta)\).

문제의 동기: Poisson 적합이 나쁘면 (overdispersion) NegBin 가 더 적합. Hamilton/Madison 의 분산/평균 비율 계산:

Hamilton: \(\mathrm{Var}/\mathrm{Mean} \approx 1.1\) (대략 Poisson). Madison: 비슷하게 약간 overdispersed.

NegBin 사후 추정 후 Hamilton과 Madison의 \(\lambda\) 분포가 겹치는지 로 authorship 판별.

5.3 Exercise 17.4 — Newcomb Speed of Light Robust

문제: Newcomb 광속 측정 데이터에 robust 모형 적합, posterior predictive check.

데이터: 66개 측정, 2개 outlier (음수 값 — 측정 실패).

정규 모형: \(\mu\) 추정이 outlier에 끌려 참값 (33.02) 에서 벗어남.

\(t_\nu\) 모형: \(\nu = 4\) 에서

\[ y_i \sim t_\nu(\mu, \sigma^2) \]

\(\mu\) 사후 평균 참값에 훨씬 가까움, outlier \(V_i\) 매우 큼 (자동 downweight).

Posterior predictive check: replicated data 생성 후 outlier 개수·분포 비교. \(t_4\) 모형이 tail을 적절히 재현.

5.4 Exercise 17.7 — EM for \(t\) Location-Scale

문제: \(y_1, \dots, y_n \sim \text{iid } t_\nu(\mu, \sigma^2)\), prior \(p(\mu, \log\sigma) \propto 1\). EM 알고리즘으로 \((\mu, \log\sigma)\) posterior mode 계산.

모형 specification (17.1):

\[ y_i | V_i \sim N(\mu, V_i), \quad V_i \sim \text{Inv-}\chi^2(\nu, \sigma^2) \]

Joint log posterior:

\[ \log p(\mu, \log \sigma, V_1, \dots, V_n | y) = \sum_i \left[ -\frac{1}{2} \log V_i - \frac{(y_i - \mu)^2}{2V_i} \right] + \sum_i \log \text{Inv-}\chi^2(V_i | \nu, \sigma^2) + \text{const} \]

E-step: \(\mathbb{E}_\text{old}[\log p]\) 를 \(V_i\) 에 대해 평균.

\(V_i | \cdot \sim \text{Inv-}\chi^2(\nu + 1, [\nu \sigma_\text{old}^2 + (y_i - \mu_\text{old})^2]/(\nu+1))\).

필요한 기댓값: \(\mathbb{E}[1/V_i]\) 및 \(\mathbb{E}[\log V_i]\).

\[ \mathbb{E}[1/V_i] = \frac{\nu + 1}{\nu \sigma_\text{old}^2 + (y_i - \mu_\text{old})^2} \]

M-step: \(\mathbb{E}_\text{old}[\log p]\) 를 \((\mu, \log\sigma)\) 에 대해 최대화.

\(\mu\) 에 대해:

\[ \mu_\text{new} = \frac{\sum_i y_i \cdot \mathbb{E}[1/V_i]}{\sum_i \mathbb{E}[1/V_i]} \]

(Weighted mean, weight = \(1/V_i\) 의 조건부 기댓값.)

\(\sigma^2\) 에 대해: 유사한 가중 평균 형태.

수렴: 보통 20 iter 이내. Gibbs sampler 보다 훨씬 빠른 MAP 계산.

6 Ch.17 심화 시리즈 결산

6.1 2편 논리 지도

[Ch.17 Overview] 03-17-0
    ↓ 7 절 조망, Part IV 확장 계단 네 번째 관문
[§ 17.1~17.3] 03-17-1: Aspects·Models·Computation
    ↓ Robustness 두 측면 (outlier + sensitivity)
    ↓ 식 (17.1) scale mixture 공통 원리
    ↓ t·NegBin·BetaBin·Robit
    ↓ Gibbs + 식 (17.3) importance weighting
[§ 17.4~17.7] 03-17-2 (본편): 8 Schools + Regression + 연습
    ↓ 식 (17.4) t_nu 8 schools, 식 (17.5) importance ratio
    ↓ Figure 17.1~17.2 nu sensitivity
    ↓ Robust regression, 식 (17.6), EM = IWLS
    ↓ Federalist·Newcomb·mixture priors
    ↓ Ch.17 결산

6.2 Ch.17 결산 실전 체크리스트

Robustness 진단

Posterior predictive check tail 일치 확인.
Residuals 극단값 비율 (정규 가정의 1% 이상이면 robust 고려).
모형 결과의 관측 제거·추가 민감도.

분포 선택

연속 \(y\) → \(t_\nu\), 기본 \(\nu = 4\).
Count \(y\) → Negative binomial.
Binomial \(y\) (\(m > 1\)) → Beta-binomial.
Binary 분리 → Robit.
2-component mixture 가 더 자연스러우면 그것.

Prior

\(\nu\) 에 uniform on \([0, 1]\) of \(1/\nu\) (Gelman 권장) 또는 고정.
Scale parameter \(\sigma, \tau\) 에 Half-Normal or Half-Cauchy.
Mixture prior 면 각 component 와 mixing \(\lambda\) 에 합리적 prior.

계산

Scale mixture parameterization (auxiliary \(V_i\)).
Gibbs cycle: \(V_i\) (Inv-\(\chi^2\)) + \(\theta, \mu, \tau\) (정규/Gamma) + \(\nu\) (Metropolis).
MAP만 원하면 EM (식 (17.6)) — 빠름.
Multimodality → 여러 chain + overdispersed 시작점.
Sensitivity: 식 (17.3) importance weighting 또는 여러 \(\nu\) 별도 MCMC.

검증

\(V_i\) posterior 시각화 — outlier 가 큰 \(V_i\) 인지 확인.
Weight \(1/V_i\) plot 으로 영향도 시각화.
Posterior predictive check (tail 포함).
Robustness plot: 추정량 vs \(\nu\) 또는 vs prior 선택.

해석

“Outlier 제거” 가 아닌 “heavy-tail 에서 자연 수용” 으로 프레이밍.
결과가 \(\nu\) 에 robust하면 원 정규 가정 강건성 증명.
민감하면 어느 \(\nu\) 영역에서 결론 안정한지 보고.
관심 추정량에 따라 robustness 다름 (중심 vs tail) 명시.

6.3 구현 환경

기능	Python	R
\(t\) 회귀	`statsmodels.regression.TLinearModel` (limited)	`MASS::rlm`, `hett::tlm`
Negative binomial	`statsmodels.GLM(family=NB)`	`MASS::glm.nb`, `brms`
Beta-binomial	`scipy.stats.betabinom`, `pymc`	`VGAM::vglm`, `brms`
Robit regression	수동 Stan/PyMC	`brms::brm(family=student)`
Scale mixture Gibbs	수동 PyMC	수동 Stan
EM for \(t\)	수동 (위 Python 예제 참조)	`EMMIXt`, 수동
Full Bayesian	PyMC, NumPyro, Stan	rstanarm, brms

7 Part IV 다음 편 예고

Ch.17 완결. Ch.18 Models for Missing Data 가 Part IV의 마지막 장.

7.1 Ch.18 미리보기

주제: 결측 데이터의 베이즈 처리.

MAR (Missing at Random) 가정: 결측 확률이 관측 변수에 의존.
Multiple imputation (Rubin 1987): 결측값을 여러 번 대체, 각각 분석 후 통합.
Data augmentation: 결측을 posterior 의 auxiliary variable로.
Categorical imputation: Ch.16 § 16.7 loglinear 모형 활용.
연속 + categorical 혼합: joint model 필요.

Ch.17 과의 연결:

\(V_i\) 를 missing data로 다룸 이 바로 Ch.18 의 auxiliary variable idea.
Robust inference 와 missing data 추론은 scale mixture auxiliary 라는 같은 수학 도구.

Ch.18 예고 구조:

18.1 Notation and terminology.
18.2 Multiple imputation.
18.3 Missing data in multivariate normal and t.
18.4 Example: multiple imputation in public opinion.
18.5 Missing values with counted data.
18.6 Example: an opinion poll in Slovenia.
18.7 Bibliographic note.
18.8 Exercises.

Ch.18 로 Part IV 완결, 이후 Part V (비선형·비모수 모형) 로 전환.

8 관련 주제

선행 지식

후속 주제

Ch.18 Missing Data — multiple imputation, MAR, data augmentation

관련 개념 (cross-category)

9 참고문헌

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.), Ch.17 § 17.4~17.7. CRC Press.
Mosteller, F., & Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley.
Box, G. E. P., & Tiao, G. C. (1968). A Bayesian Approach to Some Outlier Problems. Biometrika, 55, 119-129.
Smith, A. F. M. (1983). Bayesian Approaches to Outliers and Robustness. In Specifying Statistical Models.
Lange, K. L., Little, R. J. A., & Taylor, J. M. G. (1989). Robust Statistical Modeling Using the t Distribution. JASA, 84, 881-896.
Liu, C., & Rubin, D. B. (1995). ML Estimation of the \(t\) Distribution Using EM and Its Extensions. Statistica Sinica, 5, 19-39.
Meng, X.-L., & van Dyk, D. A. (1997). The EM Algorithm — An Old Folk-Song Sung to a Fast New Tune. JRSS B, 59, 511-567.
Rubin, D. B. (1983b). Iteratively Reweighted Least Squares. In Encyclopedia of Statistical Sciences.
Lange, K. L., & Sinsheimer, J. S. (1993). Normal/Independent Distributions and Their Applications in Robust Regression. JCGS, 2, 175-198.