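Kwangmin Kim - Ch.22 § 22.4~22.7 In Depth: Unspecified H · Classification · Regression

1 Introduction: where this part sits in the Ch.22 series

The last rung of the Ch.22 ladder:

Part	Topic	Key content
Overview (04-22-0)	The big picture of Ch.22	Survey of the five sections
§ 22.1~22.3 (04-22-1)	Definition, fitting, identification	Eqs. (22.1)~(22.7), Gibbs, Eqs. (22.10)~(22.12), label switching
§ 22.4~22.7 (this part)	Unknown H, applications, exercises	Truncated + sparse Dirichlet, classification, regression, exercises, Ch.22 wrap-up

Five questions this part answers

When $H$ is unknown, what is a simple way to select the "right $H$" automatically while avoiding the complexity of RJMCMC? (§ 22.4)
Why does the Dirichlet prior with $a = n_0/H$ favor a small $H_n$ even under a large upper bound $H$? (the Ishwaran-Zarepour justification)
How do mixtures act as nonparametric tools in classification (discriminant analysis) and regression (mixture of experts)? (§ 22.5)
What is the trade-off between the joint modeling of Eq. (22.13) and conditional modeling?
Which eight aspects of mixtures do the eight exercises of § 22.7 check?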

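2 § 22.4 Unspecified Number of Mixture Components: Truncated Upper Bound

2.1 Motivation: an alternative to RJMCMC

Of the four approaches to choosing $H$ seen at the end of § 22.1, this section looks closely at the simplest one: a truncated upper bound combined with a sparse Dirichlet prior.

Problems with the existing methods:

Comparing several values of $H$ (WAIC, LOO): ignores posterior uncertainty in $H$; all inference is conditional on a single $\widehat H$.
Hierarchical prior on $H$ + RJMCMC (Richardson-Green 1997): the dimension change has to be handled whenever $H$ moves, which makes the implementation complex.

Intuition: why the truncated upper bound is clever

Core idea: fix $H$ at a sufficiently large value (e.g., $H = 20$ or $50$) and let the prior automatically turn the unneeded components into empty clusters.

It sidesteps the question "is $H$ really that large?".
It avoids dimension-changing moves such as RJMCMC.
The standard Gibbs sampler works unchanged.

In exchange, the prior hyperparameter $a$ has to be scaled with $H$, which is the rule $a = n_0/H$.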

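2.2 Eq. (22.10) revisited: the effect of $a$

$\pi = (\pi_1, \ldots, \pi_H) \sim \text{Dirichlet}(a, \ldots, a)$.

$a = 1$ (uniform on the simplex):

No weight is pushed toward 0, so the $\pi_h$ tend to be of comparable size.
Each $\pi_h$ has prior mean $1/H$, which shrinks as $H$ grows.
The data are spread over many components → $H_n$ drifts toward $H$ → the cluster count is no longer informative.

$a = n_0/H$ (sparse, with $n_0$ fixed):

The larger $H$ is, the smaller $a$ becomes.
A Dirichlet with small $a$ concentrates near the corners of the simplex.
Result: a few $\pi_h$ are large and the rest are nearly 0 → automatic sparsity.

Intuition: how the Dirichlet behaves when $a$ is small

Picture the simplex for three component weights (a triangle, i.e. a 2-dimensional simplex).

$a = 5$: mass concentrates near the center (1/3, 1/3, 1/3), i.e. roughly equal weights.
$a = 1$: flat over the whole simplex.
$a = 0.1$: mass concentrates near the three vertices (1, 0, 0), (0, 1, 0), (0, 0, 1), i.e. almost a single component.
$a = 0.01$: even stronger concentration at the vertices.

So the smaller $a$ is, the more the Dirichlet favors piling almost all probability onto one or two components.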
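
The vertex-concentration effect can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `mean_largest_weight` is hypothetical and introduced only for this illustration. It estimates the prior mean of the largest weight for the four values of $a$ listed above.

```python
import numpy as np

def mean_largest_weight(a, H=3, draws=20_000, seed=0):
    """Monte Carlo mean of max_h pi_h under pi ~ Dirichlet(a, ..., a)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.full(H, a), size=draws)  # shape: draws x H
    return pi.max(axis=1).mean()

# Same values of a as in the three-component picture above.
results = {a: mean_largest_weight(a) for a in (5.0, 1.0, 0.1, 0.01)}
for a, m in results.items():
    print(f"a = {a:<5} mean largest weight = {m:.3f}")
```

The mean largest weight should rise toward 1 as $a$ shrinks, which is the numerical counterpart of mass moving from the center of the triangle to its vertices.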

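2.3 $a = n_0/H$ through the normalized-Gamma and stick-breaking views

Normalized-Gamma representation of the Dirichlet:

\[ \lambda_h \sim \text{Gamma}(a, 1), \qquad \pi_h = \frac{\lambda_h}{\sum_l \lambda_l} \]

When $a$ is small (in particular $a < 1$), the $\text{Gamma}(a, 1)$ density is unbounded at 0 and puts most of its mass on very small values.
Most $\lambda_h$ are therefore near 0, and only a few fall in the right tail.
After normalization → a few large $\pi_h$ and many tiny ones.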
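
As a hedged illustration of the representation above, the sketch below draws $\lambda_h \sim \text{Gamma}(a, 1)$ with $a = n_0/H$, normalizes, and counts how many weights are non-negligible. The values $H = 50$, $n_0 = 1$ and the threshold 0.01 are assumptions chosen for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
H, n0, draws = 50, 1.0, 5_000
a = n0 / H  # sparse choice a = n0 / H

# Normalized independent Gamma(a, 1) draws are Dirichlet(a, ..., a).
lam = rng.gamma(shape=a, scale=1.0, size=(draws, H))
pi = lam / lam.sum(axis=1, keepdims=True)

n_big = (pi > 0.01).sum(axis=1)  # weights above an (arbitrary) 1% threshold
print("mean number of weights above 0.01:", n_big.mean())
print("mean largest weight:", pi.max(axis=1).mean())
```

Only a handful of the 50 weights should exceed the threshold in a typical draw, even though all 50 components are present in the model.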

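Intuition: the limit $H \to \infty$ under $a = n_0/H$

When $H$ is very large, $a = n_0/H \to 0$:

Almost every $\lambda_h$ is near 0 → almost every $\pi_h$ is near 0.
Only the few draws in the right tail carry meaningful weight.

This is the regime described by the stick-breaking construction of the Dirichlet process (Sethuraman 1994). In the limit $H \to \infty$ the truncated finite mixture converges to a DP mixture, which is the motivation for Ch.23.

The truncated model of Ch.22 = a finite approximation to the DP of Ch.23.

So the choice $a = n_0/H$ in this section is not a "magic number" but a natural truncation of the DP prior.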

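2.4 Theoretical justification: Ishwaran-Zarepour, Rousseau-Mengersen

Ishwaran & Zarepour (2002): with a Dirichlet $a = \alpha/H$ prior and a large truncation level $H$, the posterior is nearly identical to that of a DP mixture.

Rousseau & Mengersen (2011): in an overfitted mixture (true $H_0 < H$), the weights of the redundant components converge to 0 a posteriori when $a < d/2$, where $d$ is the dimension of the component-specific parameter.

The mechanism behind automatic sparsity

Marginal likelihood (the allocations $z$ are discrete, so they are summed out):

\[ p(y \mid H) = \sum_z \int p(y, z, \theta, \pi \mid H)\, d\theta\, d\pi \]

When $H$ is large, integrating over the extra dimensions (the additional component parameters) lowers the marginal likelihood, which acts as a complexity penalty. A smaller $H_n$ is therefore preferred a posteriori.

This is the mixture version of the Bayesian Occam's razor.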

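2.5 $H_n$: the number of occupied components

\[ H_n = \sum_{h=1}^H 1_{n_h > 0}, \quad n_h = \sum_i z_{ih} \]

$H_n$ is the number of components the data actually use at a given MCMC iteration. $H$ is the upper bound; $H_n$ is the effective number.

Estimation procedure:

Run Gibbs with $H = 20$ (or another large value the domain allows).
Record $H_n$ at every iteration.
Summarize the posterior of $H_n$ → its mode or mean is the effective number of clusters.

Ideally the posterior of $H_n$ should not depend on the upper bound $H$ once $H$ is sufficiently large. Scaling the prior as $a = n_0/H$ is what promotes this stability.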
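
The stability claim can be illustrated at the prior level. The sketch below is a prior-only simulation, not a posterior computation, and the function name `prior_occupied` is hypothetical. It draws $\pi$, allocates $n = 82$ points (the Galaxy sample size, used here only as a convenient $n$), and averages $H_n$ for several upper bounds under $a = 1/H$ and under $a = 1$.

```python
import numpy as np

def prior_occupied(H, a, n=82, draws=2_000, seed=0):
    """Prior mean of H_n = #{h : n_h > 0} when z_i ~ Categorical(pi), pi ~ Dirichlet(a)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.full(H, a), size=draws)            # draws x H
    counts = np.array([rng.multinomial(n, p) for p in pi])   # n_h for each draw
    return (counts > 0).sum(axis=1).mean()

sparse = {H: prior_occupied(H, 1.0 / H) for H in (10, 20, 50)}  # a = n0 / H, n0 = 1
flat = {H: prior_occupied(H, 1.0) for H in (10, 20, 50)}        # a = 1

for H in (10, 20, 50):
    print(f"H = {H:2d}  a = 1/H: {sparse[H]:.2f}   a = 1: {flat[H]:.2f}")
```

Under $a = 1/H$ the prior mean of $H_n$ should change little as $H$ grows, while under $a = 1$ it keeps increasing with $H$. The posterior behavior additionally depends on the data and the component priors.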

2.6 Galaxy / Acidity / Iris examples

Data	$n$	Dim.	Upper $H$	$\widehat H_n$	Notes
Galaxy	82	1	5	~3	Recession velocities, suspected multimodality
Acidity	155	1	5	2 or 3	Acidity of Wisconsin lakes
Iris	150	4	6	~3	Agrees with the 3 species

2.6.1 Galaxy (Table 22.2)

Posterior means for the 5 components:

$h$	1	2	3	4	5
$\pi_h$	0.66	0.16	0.06	0.09	0.03
$\mu_h$	0.10	0.20	1.89	-2.35	0.02

There is a dominant component with $\pi_1 = 0.66$ (near the origin); the others are outliers or small clusters.

Intuition: how far to trust the Galaxy result

The component with $\pi_5 = 0.03$ (3%) absorbs only 2~3 points ($0.03 \times 82 \approx 2.5$), so it risks being a cluster built on nearly a single observation.

Interpretation guide:

$\pi_h \cdot n > 5$: a trustworthy cluster.
$\pi_h \cdot n \in [1, 5]$: a suspicious cluster (possible overfitting).
$\pi_h \cdot n < 1$: practically negligible (effectively an empty component).
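
Applying the guide to the Galaxy table above is simple arithmetic. The sketch below hard-codes the posterior mean weights from the table and $n = 82$; the function `label` and its category names are hypothetical, introduced only to mirror the three rules.

```python
import numpy as np

n = 82
pi_hat = np.array([0.66, 0.16, 0.06, 0.09, 0.03])  # posterior means from the table above
expected_size = pi_hat * n                         # expected points per component

def label(size):
    """Map an expected cluster size to the three categories of the guide."""
    if size > 5:
        return "reliable"
    if size >= 1:
        return "suspicious"
    return "negligible"

labels = [label(s) for s in expected_size]
for h, (s, lab) in enumerate(zip(expected_size, labels), start=1):
    print(f"h = {h}: pi*n = {s:.2f} -> {lab}")
```

Components 1, 2 and 4 pass the size threshold, while components 3 and 5 fall in the suspicious band.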

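2.6.2 Acidity (Table 22.3)

The 5-component fit gives effectively 2~3 clusters (one cluster is skewed, so with normal kernels it is split into 2 components).

2.6.3 Iris (Table 22.4)

4 dimensions, upper bound 6. Result: the 3 components with large $\pi$ agree almost exactly with setosa, versicolor and virginica. The additional low-weight components make up for the parts of the species distributions that are not exactly normal.

Intuition: statistical cluster ≠ biological species

Iris has 3 true species, yet the number of occupied components can come out at 4~5 even though three of them dominate. Why?

Versicolor and virginica overlap slightly in the 4-dimensional space.
When a species is not exactly multivariate normal, the mixture compensates with extra components.
Modeling a 4-dimensional anisotropic distribution with spherical normals → a single species can be split into 2 components.

Warning: whenever the number of clusters is read as "ground truth", always check that the component family ($f$) is adequate.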

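2.7 Default hyperparameters (practical recommendations)

After standardizing the data:

$a = n_0/H = 1/H$: minimally informative ($n_0 = 1$).
$\mu_0 = 0, \kappa = 1$: prior on the cluster means.
$a_\tau = 3, b_\tau = 1$: variance prior (a good default in real examples; recommended in BDA).

Multivariate case:

$H = $ the largest plausible number of clusters (from domain knowledge) plus some slack.
Assumption on $\Sigma_h$: diagonal (simple) vs full (flexible but at risk of overfitting); choose according to the data dimension and $n$.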

3 § 22.5 Mixture for Classification and Regression

3.1 Bayesian Discriminant Analysis: the Dirichlet update

Classification: $y \in \{1, \ldots, C\}$, $x \in \mathbb{R}^p$.

3.1.1 Separating the pieces of Bayes' rule

\[ \Pr(y_i = c \mid x_i) = \frac{\psi_c f_c(x_i)}{\sum_{c'} \psi_{c'} f_{c'}(x_i)} \]

$\psi_c = \Pr(y_i = c)$: class prior.
$f_c(x_i) = f(x_i \mid y_i = c)$: class-conditional density.

Intuition: discriminative vs generative

Discriminative (logistic, GP classification): model $\Pr(y \mid x)$ directly.
Generative (Bayesian discriminant): factor as $\Pr(y) \cdot p(x \mid y)$, then apply Bayes' rule.

Advantages (generative):

Missing $y$ (semi-supervised) is handled naturally.
Missing predictors $x$ are handled naturally.
Class imbalance is represented explicitly.

Disadvantage (generative):

$f_c(x)$ has to be modeled accurately; if it is misspecified, classification performance degrades.

3.1.2 Conjugate update of $\psi$

Prior: $\psi \sim \text{Dirichlet}(a\psi_{01}, \ldots, a\psi_{0C})$.

$\psi_0 = E(\psi)$: prior mean.
$a$ = prior sample size.

Posterior in the fully supervised case:

\[ \psi \mid y, X \sim \text{Dirichlet}\Bigl(a\psi_{01} + n_1, \ldots, a\psi_{0C} + n_C\Bigr) \]

Here $n_c = \sum_i 1_{y_i = c}$. This is a plain conjugate update with an analytic closed form.
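
The update is just "prior pseudo-counts plus class counts". A minimal sketch, assuming classes are coded $0, \ldots, C-1$ and using a hypothetical helper `update_class_probs` with made-up labels:

```python
import numpy as np

def update_class_probs(y, C, a, psi0):
    """Posterior Dirichlet parameters a * psi0_c + n_c for labels y in {0, ..., C-1}."""
    counts = np.bincount(y, minlength=C)  # n_c
    return a * np.asarray(psi0, dtype=float) + counts

y = np.array([0, 0, 1, 2, 2, 2])          # toy labels (assumed data)
post = update_class_probs(y, C=3, a=3.0, psi0=[1 / 3, 1 / 3, 1 / 3])
post_mean = post / post.sum()             # E(psi | y)

print("posterior parameters:", post)
print("posterior mean:", post_mean)
```

With $a = 3$ and a uniform $\psi_0$, each class starts with one pseudo-observation, so the posterior parameters are the class counts plus one.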

3.2 Mixtures for the class-conditional density

Assuming a single normal for $f_c(x_i)$ imposes the strong assumption that each class is "unimodal and normal". Real class distributions can be multimodal or skewed → represent them with a mixture:

\[ f_c(x_i) = \sum_{h=1}^H \pi_{ch} N_p(x_i \mid \mu_{ch}^*, \Sigma_{ch}^*) \]

Intuition: a two-level mixture

Outer level: each class $c$ has its own mixture distribution.
Inner level: that mixture consists of $H$ multivariate normals.

Applications:

Speech recognition: even for the same phoneme (class), acoustic features are multimodal across speakers and intonation → mixture.
Image objects: even for the same object class (dog), the distribution is multimodal across breed, angle and lighting.
Medicine: even for the same disease class, the distribution of clinical features differs across patient subsets.

3.2.1 Simplification: common weights

For computational tractability set $\pi_{ch} = \pi_h$ (class-independent weights):

All classes share the same weight vector $\pi$.
The classes differ through the class-specific values of $(\mu_{ch}^*, \Sigma_{ch}^*)$.
$\pi \sim \text{Dirichlet}(1/H, \ldots, 1/H)$: sparse.
$(\mu_{ch}^*, \Sigma_{ch}^*) \sim$ conjugate normal-inverse-Wishart.

3.3 Gibbs sampler for discriminant analysis

At each iteration:

3.3.1 Step 1: $z_i$ multinomial

\[ \Pr(z_i = h \mid y_i = c, \cdots) = \frac{\pi_h N_p(x_i \mid \mu_{ch}^*, \Sigma_{ch}^*)}{\sum_{h'} \pi_{h'} N_p(x_i \mid \mu_{ch'}^*, \Sigma_{ch'}^*)} \]

3.3.2 Step 2: $\psi$ Dirichlet

$\psi \mid y \sim \text{Dirichlet}(a\psi_{01} + n_1, \ldots, a\psi_{0C} + n_C)$.

3.3.3 Step 3: $\pi$ Dirichlet

$\pi \mid z \sim \text{Dirichlet}(1/H + m_1, \ldots, 1/H + m_H)$, where $m_h = \sum_{c} \sum_{i} 1_{y_i = c,\, z_i = h}$ counts the observations allocated to component $h$ over all classes.

3.3.4 Step 4: $(\mu_{ch}^*, \Sigma_{ch}^*)$ normal-inverse-Wishart

Standard conjugate update using the data in the same $(c, h)$ group.

Semi-supervised extension

The labels $y_i$ of unlabeled observations $i$ are sampled as well:

\[ \Pr(y_i = c \mid x_i, \cdots) \propto \psi_c f_c(x_i) = \psi_c \sum_h \pi_h N_p(x_i \mid \mu_{ch}^*, \Sigma_{ch}^*) \]

→ one additional Gibbs step. The cluster information in the unlabeled data contributes to the estimates for the labeled data → better classification performance.

The effect is largest when labeled data are scarce and unlabeled data are plentiful (common in real medical data).
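
The class-probability formula used for unlabeled points can be sketched as follows. This is an illustration under assumed parameter values, not the full Gibbs sampler: `log_mvn` and `class_probs` are hypothetical helpers, the weights are shared across classes ($\pi_{ch} = \pi_h$), and the toy means and covariances are invented for the example.

```python
import numpy as np

def log_mvn(x, mu, Sigma):
    """Log density of N_p(x | mu, Sigma), computed through a Cholesky factor."""
    p = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    r = np.linalg.solve(L, x - mu)
    return -0.5 * (p * np.log(2 * np.pi) + r @ r) - np.log(np.diag(L)).sum()

def class_probs(x, psi, pi, mu, Sigma):
    """Pr(y = c | x) proportional to psi_c * sum_h pi_h N_p(x | mu[c, h], Sigma[c, h])."""
    C, H = mu.shape[:2]
    log_f = np.empty(C)
    for c in range(C):
        terms = np.array([np.log(pi[h]) + log_mvn(x, mu[c, h], Sigma[c, h])
                          for h in range(H)])
        log_f[c] = np.logaddexp.reduce(terms)          # log f_c(x)
    log_post = np.log(psi) + log_f
    return np.exp(log_post - np.logaddexp.reduce(log_post))

# Toy setting: C = 2 classes, H = 2 components each, p = 2 predictors.
psi = np.array([0.5, 0.5])
pi = np.array([0.5, 0.5])
mu = np.array([[[-2.0, 0.0], [-2.0, 2.0]],
               [[2.0, 0.0], [2.0, 2.0]]])
Sigma = np.tile(np.eye(2), (2, 2, 1, 1))
probs = class_probs(np.array([-2.0, 1.0]), psi, pi, mu, Sigma)
print(probs)
```

In a semi-supervised sampler, a draw of $y_i$ from these probabilities would replace the fixed label for each unlabeled observation at every iteration.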

3.4 Product kernel: mixed-type predictors

When $x_i$ mixes categorical and continuous predictors:

\[ f(x_i \mid \theta_i) = \prod_{j=1}^p \mathcal{K}_j(x_{ij} \mid \theta_{ij}) \]

$\mathcal{K}_j$ = a kernel appropriate for predictor $j$:
- continuous: normal.
- binary: Bernoulli.
- count: Poisson.
- multi-category: multinomial.

Problem: the product kernel assumes conditional independence (given $\theta_i$). What about the dependence actually present among predictors?

Solution: let $\theta_i$ itself follow a mixture ($\theta_i \sim \sum_h \pi_h \delta_{\Theta_h}$). Marginal dependence is then induced through the mixture indicator.

3.5 Eq. (22.13): joint modeling for regression

Regression: $y_i \in \mathbb{R}$, $x_i \in \mathbb{R}^p$. A joint mixture for $w_i = (y_i, x_i) \in \mathbb{R}^{p+1}$:

\[ f(w_i) = \sum_{h=1}^H \pi_h N_{p+1}(w_i \mid \mu_h, \Sigma_h) \quad (22.13) \]

3.5.1 Eqs. (22.14)~(22.15): deriving the conditional density

Computing the conditional of the joint mixture gives:

\[ f(y_i \mid x_i) = \sum_{h=1}^H \pi_h(x_i) N(y_i \mid \beta_{0h} + x_i \beta_{1h}, \sigma_h^2) \quad (22.14) \]

with predictor-dependent weights:

\[ \pi_h(x_i) = \frac{\pi_h N_p(x_i \mid \mu_h^{(x)}, \Sigma_h^{(x)})}{\sum_{h'} \pi_{h'} N_p(x_i \mid \mu_{h'}^{(x)}, \Sigma_{h'}^{(x)})} \quad (22.15) \]

$\beta_{0h}, \beta_{1h}$ = conditional regression coefficients of the $h$-th component.
$\sigma_h^2$ = conditional variance of the $h$-th component.
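
The coefficients in (22.14) follow from the standard conditional distribution of a multivariate normal. As a sketch, partition each component's parameters as below; the superscript labels $(y)$, $(yy)$, $(yx)$, $(xy)$ are notation assumed here for illustration, chosen to match $\mu_h^{(x)}, \Sigma_h^{(x)}$ in (22.15), and $x_i$ is treated as a column vector so that $x_i \beta_{1h}$ reads $x_i^\top \beta_{1h}$:

\[ \mu_h = \begin{pmatrix} \mu_h^{(y)} \\ \mu_h^{(x)} \end{pmatrix}, \qquad \Sigma_h = \begin{pmatrix} \sigma_h^{(yy)} & \Sigma_h^{(yx)} \\ \Sigma_h^{(xy)} & \Sigma_h^{(x)} \end{pmatrix} \]

\[ \beta_{1h} = \bigl(\Sigma_h^{(x)}\bigr)^{-1} \Sigma_h^{(xy)}, \qquad \beta_{0h} = \mu_h^{(y)} - \mu_h^{(x)\top} \beta_{1h}, \qquad \sigma_h^2 = \sigma_h^{(yy)} - \Sigma_h^{(yx)} \bigl(\Sigma_h^{(x)}\bigr)^{-1} \Sigma_h^{(xy)} \]

Each component therefore contributes an ordinary linear regression of $y$ on $x$, and the nonlinearity of the overall fit comes entirely from the weights $\pi_h(x_i)$.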

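Intuition: mixture of experts

What Eq. (22.14) says:

"At each $x_i$ the conditional distribution is a weighted sum of $H$ linear regressions. The weights depend on how close $x_i$ is to each component's $x$-distribution."

This is in the same spirit as the mixture of experts (Jacobs et al. 1991, Jordan-Jacobs 1994):

The input space is "softly partitioned", with a different expert (linear regression) for each region.
Taken together, the result is a nonlinear regression.

Difference: a mixture of experts learns the gating function explicitly (logistic, neural net), whereas Eqs. (22.13)~(22.15) obtain it as a natural consequence of the joint $N_{p+1}$ model.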
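
A minimal numerical sketch of this reading of (22.14)~(22.15), assuming scalar $y$ stored as the first coordinate of $w$. The function `conditional_from_joint` is hypothetical and the two-component parameters are invented: one component has positive slope around $x = -2$, the other negative slope around $x = 2$.

```python
import numpy as np

def conditional_from_joint(x, pi, mu, Sigma):
    """Gating weights pi_h(x) and E[y | x] implied by a joint normal mixture on w = (y, x)."""
    H, p = len(pi), len(x)
    log_w, means = np.empty(H), np.empty(H)
    for h in range(H):
        mu_y, mu_x = mu[h][0], mu[h][1:]
        S_xy, S_xx = Sigma[h][1:, 0], Sigma[h][1:, 1:]
        r = x - mu_x
        beta1 = np.linalg.solve(S_xx, S_xy)        # conditional slope of component h
        means[h] = mu_y + r @ beta1                # beta_0h + x' beta_1h
        _, logdet = np.linalg.slogdet(S_xx)
        quad = r @ np.linalg.solve(S_xx, r)
        log_w[h] = np.log(pi[h]) - 0.5 * (p * np.log(2 * np.pi) + logdet + quad)
    w = np.exp(log_w - np.logaddexp.reduce(log_w))  # Eq. (22.15)
    return w, w @ means

pi = np.array([0.5, 0.5])
mu = np.array([[0.0, -2.0], [0.0, 2.0]])            # (mu_y, mu_x) per component
Sigma = np.array([[[1.5, 1.0], [1.0, 1.0]],
                  [[1.5, -1.0], [-1.0, 1.0]]])
w, Ey = conditional_from_joint(np.array([-1.0]), pi, mu, Sigma)
print("gating weights:", w, " E[y | x = -1]:", Ey)
```

At $x = -1$ the first component dominates the gate, so the conditional mean stays close to that component's regression line.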

3.5.2 Four limitations of joint modeling

Fixed predictors: when $x_i$ is set by design (designed experiments), placing a distribution on $x$ is unnatural.
Categorical predictors: if some of $x$ is categorical, a multivariate normal is inappropriate → a product kernel or Gaussian copula is needed.
High-dimensional $x$: when $p$ is large, huge resources go into estimating the marginal $f(x)$ even though only the conditional is needed.
Simple conditional, complex joint: if a single normal suffices for $y \mid x$ but the joint is multimodal, the joint mixture adds spurious noise to the conditional.

Practical recommendations

Density estimation is itself the goal: joint mixture (Eq. 22.13).
Only the conditional is needed: model the conditional directly (e.g., a conditional mixture, GP regression from Ch.21, or Bayesian additive regression trees).
Mixed-type $x$: product kernel + mixture indicator.
High-dimensional $x$: feature selection followed by a conditional model, or a sparse joint model.

4 § 22.6 Bibliographic Note

4.1 EM · VB · EP

Dempster, Laird, Rubin (1977): application of EM to mixtures.
Bishop (2006): Variational Bayes for mixtures (chapter 10 of PRML).
Minka (2001): Expectation Propagation for mixtures.

4.2 MCMC for Mixtures

Diebolt & Robert (1994): the original Gibbs sampler reference.
Richardson & Green (1997): Reversible Jump for unknown $H$.
Stephens (2000a, 2000b): unspecified $H$ and postprocessing for label switching.
Jasra, Holmes, Stephens (2005): label switching survey.
Papaspiliopoulos & Roberts (2008): label-switching moves within MCMC.

4.3 Sparse Dirichlet Theory

Ishwaran & Zarepour (2002): DP approximation by the $a = \alpha/H$ Dirichlet.
Rousseau & Mengersen (2011): asymptotics of the zeroing-out of redundant components in overfitted mixtures.

4.4 Applications

Belin & Rubin (1990, 1995a, 1995b): schizophrenia.
Rubin & Wu (1997): extension of the schizophrenia analysis.
Gelman & King (1990b): election mixture (informative prior).
Roeder & Wasserman (1997): Galaxy density.
Fraley & Raftery (2002): model-based clustering (the mclust package).
Dunson (2010a): conditional density mixtures (epidemiology).
Dunson & Bhattacharya (2010): joint product kernel for classification/regression.

4.5 Surveys

McLachlan & Peel (2000): Finite Mixture Models, the standard text.
Frühwirth-Schnatter (2006): Finite Mixture and Markov Switching Models, from a Bayesian perspective.
Titterington, Smith, Makov (1985): comprehensive non-Bayesian treatment.
West (1992): brief Bayesian review.

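5 § 22.7 Exercises: solutions to the 8 problems (summary)

The derivations, simulation code and deeper intuition for each problem are covered in § 22.7 In Depth (04-22-3). This section only summarizes the key points of each solution.

5.1 Exercise 1: cluster point estimate (mean vs median vs mode)

Problem: in a 3-component mixture, identify the component of each data point. Which of the pointwise marginal mean, median or mode should be used?

Solution:

Mean: the marginal mean of the latent indicator $z_i \in \{1, 2, 3\}$ is $\sum_h h \cdot \Pr(z_i = h \mid y_i)$. It is generally not an integer and has no sensible meaning, because component labels are categorical.
Median: for the same reason, the median is ill-defined for a categorical variable.
Mode: $\arg\max_h \Pr(z_i = h \mid y_i)$ = MAP classification. Recommended.

Intuition: point estimates for categorical variables

Continuous or ordinal variables: the mean and median are meaningful.

Nominal (categorical) variables: only the mode is a meaningful point estimate. This ties in with the 0-1 loss of classification problems, under which the Bayes optimal classifier is the posterior mode.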
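
A small sketch of the contrast between the marginal mean and the mode of $z_i$. The mixture parameters are assumed known here (equal weights, centers $-2, 0, 2$, unit scale), which is a simplification relative to the exercise, and `responsibilities` is a hypothetical helper.

```python
import numpy as np

def responsibilities(y, pi, mu, sigma):
    """Pr(z_i = h | y_i) for a univariate normal mixture with known parameters."""
    y = np.asarray(y, dtype=float)[:, None]
    log_p = np.log(pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2
    log_p -= log_p.max(axis=1, keepdims=True)      # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

pi = np.array([1 / 3, 1 / 3, 1 / 3])
mu = np.array([-2.0, 0.0, 2.0])
sigma = np.ones(3)

resp = responsibilities([-0.9], pi, mu, sigma)
labels = np.arange(1, 4)
post_mean = resp @ labels                     # mean of the label: not a valid label
post_mode = labels[resp.argmax(axis=1)]       # MAP allocation

print("Pr(z = h | y):", np.round(resp[0], 3))
print("marginal mean of z:", post_mean[0], " mode:", post_mode[0])
```

For a point between the first two centers the mean of $z_i$ lands between 1 and 2, which names no component, while the mode returns a usable label.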

5.2 Exercise 2: overfitted mixture (3 → 2/3/4/unspecified)

Problem: draw 500 points from a normal mixture with true $H_0 = 3$ (centers $-2, 0, 2$, scale 1). Fit Bayesian mixtures with $H = 2, 3, 4$ and with unspecified $H \in [1, 6]$.

Expected results:

$H = 2$: two clusters are merged (center 0, larger scale). Underfit.
$H = 3$: accurate recovery.
$H = 4$: four components are fitted but one has a very small weight (overfit). With a sparse Dirichlet it is zeroed out automatically.
Unspecified: the posterior of $H_n$ concentrates on 3.

Intuition: the elegance of overfitted mixtures

Setting a large upper bound $H$ together with $a = n_0/H$ avoids overfitting automatically.

Rousseau-Mengersen (2011): the posterior of $\pi_h$ for redundant components → 0. When the evidence from the data is weak, that component degenerates into an empty cluster.
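
The "unspecified $H$" fit can be sketched with a short Gibbs sampler. This is a minimal illustration, not the book's code: it assumes the conjugate prior $\mu_h \mid \tau_h^2 \sim N(\mu_0, \kappa \tau_h^2)$, $\tau_h^2 \sim \text{Inv-Gamma}(a_\tau, b_\tau)$ with the default values from § 2.7, uses upper bound $H = 6$, and the function name `gibbs_sparse_mixture` is hypothetical. No particular output is asserted, since with scale 1 the three true components overlap considerably.

```python
import numpy as np

def gibbs_sparse_mixture(y, H=6, n0=1.0, mu0=0.0, kappa=1.0,
                         a_tau=3.0, b_tau=1.0, iters=300, seed=0):
    """Gibbs sampler for a univariate normal mixture with Dirichlet(n0/H) weights.

    Returns the trace of H_n, the number of occupied components.
    """
    rng = np.random.default_rng(seed)
    n, a = len(y), n0 / H
    z = rng.integers(H, size=n)                  # random initial allocation
    mu, tau2 = np.zeros(H), np.ones(H)
    Hn_trace = np.empty(iters, dtype=int)
    for t in range(iters):
        counts = np.bincount(z, minlength=H)
        # 1. weights: pi | z ~ Dirichlet(a + n_1, ..., a + n_H)
        pi = rng.dirichlet(a + counts)
        # 2. component parameters: conjugate normal-inverse-gamma update
        for h in range(H):
            yh = y[z == h]
            nh = len(yh)
            ybar = yh.mean() if nh > 0 else 0.0
            ss = ((yh - ybar) ** 2).sum()
            kappa_n = 1.0 / (1.0 / kappa + nh)
            mu_n = kappa_n * (mu0 / kappa + nh * ybar)
            a_n = a_tau + 0.5 * nh
            b_n = (b_tau + 0.5 * ss
                   + 0.5 * nh / (1.0 + kappa * nh) * (ybar - mu0) ** 2)
            tau2[h] = b_n / rng.gamma(a_n)       # Inv-Gamma(a_n, b_n) draw
            mu[h] = rng.normal(mu_n, np.sqrt(kappa_n * tau2[h]))
        # 3. allocations: Pr(z_i = h) proportional to pi_h N(y_i | mu_h, tau_h^2)
        log_p = (np.log(pi + 1e-300) - 0.5 * np.log(tau2)
                 - 0.5 * (y[:, None] - mu) ** 2 / tau2)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        cdf = np.cumsum(p / p.sum(axis=1, keepdims=True), axis=1)
        z = (rng.random(n)[:, None] > cdf).sum(axis=1).clip(max=H - 1)
        Hn_trace[t] = len(np.unique(z))
    return Hn_trace

# Simulated data as in the exercise: centers -2, 0, 2, scale 1, 500 points.
rng = np.random.default_rng(42)
y = rng.normal(rng.choice([-2.0, 0.0, 2.0], size=500), 1.0)
y = (y - y.mean()) / y.std()                     # standardize before using the defaults
Hn = gibbs_sparse_mixture(y)
print("H_n frequencies for H_n = 1..6 after burn-in:",
      np.bincount(Hn[100:], minlength=7)[1:])
```

The recorded trace of $H_n$ is the quantity whose posterior distribution the exercise asks about; a longer run and several seeds would be needed before drawing conclusions from it.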

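5.3 Exercise 3: long-tailed data with a normal mixture

Problem: draw data from a mixture of three $t_4$ distributions and fit a normal mixture.

Expected result:

Because of the heavy tails of $t_4$, the normal mixture represents the tails with several components sharing the same center. That is, 5~7 components instead of 3.

Intuition: a mismatched distribution family inflates the number of clusters

$t_4$ = a scale mixture of normals (Inv-$\chi^2$ mixing).

One $t_4$ distribution ≈ two normals with the same center (one with small variance, one with large variance). For a normal mixture to mimic $t_4$ it needs about 2 components per true cluster.

Solution: use a mixture of $t$ components ($f$ = $t_\nu$).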
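
The scale-mixture statement can be checked by simulation. The sketch below assumes the usual construction $V \sim \text{Inv-}\chi^2(\nu, 1)$, drawn as $\nu / \chi^2_\nu$, followed by $y \mid V \sim N(0, V)$, and compares tail frequencies with a standard normal; the cutoff 3 is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, n = 4, 200_000

# t_nu as a scale mixture of normals: V = nu / chi^2_nu, then y | V ~ N(0, V).
V = nu / rng.chisquare(nu, size=n)
t_draws = rng.normal(0.0, np.sqrt(V))
normal_draws = rng.normal(size=n)

tail_t = np.mean(np.abs(t_draws) > 3)
tail_normal = np.mean(np.abs(normal_draws) > 3)
print(f"P(|y| > 3): t_4 via scale mixture = {tail_t:.4f}, normal = {tail_normal:.4f}")
```

The $t_4$ draws should exceed the cutoff roughly 4% of the time versus roughly 0.3% for the normal, which is the extra tail mass a normal mixture has to cover with additional wide components.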

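5.4 Exercise 4: Galaxy density estimation

Problem: the 82-point galaxy data, with an $\alpha$ Dirichlet prior and a normal-inverse-gamma prior. Study the effect of decreasing $\alpha$, increasing $k$, and increasing the prior variance.

(a) $\alpha \to 0$: stronger sparsity → fewer clusters. If $\alpha$ is too small the fit collapses to a single cluster.

(b) Increasing $k$: the upper bound grows. With $a = \alpha/k$, $H_n$ stays stable. With $a$ held fixed, the number of clusters is inflated.

(c) Increasing the prior variance $\kappa$: the prior on the cluster means becomes very wide → cluster means can sit outside the range of the data → inappropriate (most of the data end up in a single cluster).

Intuition: the trap of a diffuse $P_0$

Diffuse priors are usually regarded as "objective", but in mixtures they have the opposite effect:

Large $\kappa$ → cluster means far from the data.
The likelihood of data being mapped to such a cluster is small.
→ All the data pile into one dominant cluster.

It is therefore recommended that $P_0$ in a mixture have a scale comparable to the range of the data.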

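5.5 Exercise 5: football point spread mixture

Problem: fit the football data of § 1.6 (score differential - point spread) with a finite mixture of normals instead of a single normal. Use the $a = 1/k$ Dirichlet.

Expected results:

The single normal fit is already nearly perfect (Section 1.6) → the mixture falls back to a single dominant component.
$H_n \approx 1$.
The normal assumption is therefore justified.

Intuition: what a single-component result means

"The mixture model selects a single component" = "the normal assumption is sufficient for these data".

The mixture serves as a data-driven check of the null hypothesis (normality). It works like a Bayes factor but with a more natural (continuous) prior.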

5.6 Exercise 6: kidney cancer, mixture vs Gamma

Problem: $y_j \sim \text{Poisson}(10 n_j \theta_j)$, with the prior on $\theta_j$ taken as (a) a single Gamma vs (b) $\sum_h \pi_h \delta_{\theta_h^*}$ with a sparse Dirichlet.

Comparison:

(a) Single Gamma: $\theta_j$ is continuous, with a smooth posterior.
(b) Mixture of point masses: $\theta_j$ is one of $H$ points. Counties sharing the same $\theta^*$ form a cluster.

Difference in results:

(a) gives every county its own $\theta_j$ (Bayesian shrinkage).
(b) groups the counties → strong sharing of information within a group → stronger shrinkage.

Intuition: discrete vs continuous mixing distribution

Continuous: smooth, every $\theta_j$ different.
Discrete (DP-like): units are classified into "types", and units of the same type share the same $\theta$.

Applications:

Regional data (kidney cancer by county): grouping similar counties is meaningful → discrete.
Individual patient effects: everyone differs → continuous.

5.7 Exercise 7: the danger of improper priors

Problem: what goes wrong when a noninformative prior is placed on the component-specific parameters?

Solution:

The degenerate mode seen in § 22.1:

In a normal mixture, one component fits a single observation exactly while $\sigma_h^2 \to 0$.
Likelihood $\to \infty$.
With an improper prior, $\int_0^\epsilon 1/\sigma^2 d\sigma^2$ diverges → improper posterior.
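
The unbounded likelihood is easy to see numerically. The sketch below uses an assumed toy data set (21 equally spaced points) and a hypothetical helper `mixture_loglik`: the first component is centered exactly on $y_1$ and its scale $\sigma_1$ is shrunk, while the second component is a fixed wide normal.

```python
import numpy as np

def mixture_loglik(y, sigma1, pi=(0.5, 0.5)):
    """Log-likelihood of a 2-component normal mixture whose first component sits on y[0]."""
    mu = np.array([y[0], np.mean(y)])
    sigma = np.array([sigma1, np.std(y)])
    dens = (np.array(pi) * np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2)
            / (sigma * np.sqrt(2 * np.pi)))
    return np.log(dens.sum(axis=1)).sum()

y = np.linspace(-2.0, 2.0, 21)                  # toy data (assumed)
scales = (1.0, 1e-2, 1e-4, 1e-6)
logliks = [mixture_loglik(y, s) for s in scales]
for s, ll in zip(scales, logliks):
    print(f"sigma_1 = {s:g}: log-likelihood = {ll:.2f}")
```

Once $\sigma_1$ is small enough that the first component covers only $y_1$, each hundredfold reduction in $\sigma_1$ adds about $\log 100$ to the log-likelihood, with no upper limit.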

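Remedies:

A proper prior on $\sigma_h^2$ (Inverse Gamma).
Or fix the variance ratio $\sigma_2/\sigma_1$ (an improper prior on the single remaining variance is then acceptable).
Or an informative prior together with standardized data.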

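5.8 Exercise 8: asymptotics of Dirichlet sparsity

Problem: draw 1000 samples from $\pi \sim \text{Dirichlet}(1/k, \ldots, 1/k)$ for $k = 5, 10, 25, 50, 100, 1000$. Report the sample means of the sorted weights (order statistics).

Expected results:

As $k$ grows, the distribution of the sorted $\pi$ converges to that of the ranked weights of a Dirichlet process with total mass $\alpha = k \cdot (1/k) = 1$. In size-biased (not sorted) order those weights have the stick-breaking form:

\[ \pi_h = V_h \prod_{l < h}(1 - V_l), \qquad V_h \sim \text{Beta}(1, \alpha) \]

With $\alpha = 1$ the first stick has mean 0.5 and the means decay geometrically.
The decay is very fast: the top 5 weights sum to roughly 0.95 or more.

For comparison: under $\text{Dirichlet}(1, \ldots, 1)$ all elements are of similar size ($\approx 1/k$).

Intuition: visualizing stick-breaking

An illustrative pattern for a Dirichlet draw with $k = 100, a = 0.01$ (individual draws vary widely):

$\pi_{(1)} \approx 0.4$ (the largest).
$\pi_{(2)} \approx 0.25$.
$\pi_{(3)} \approx 0.15$.
$\pi_{(4)} \approx 0.1$.
$\pi_{(5)} \approx 0.05$.
The remaining 95 weights sum to $\approx 0.05$.

The top 5 carry 95% of the weight. Automatic sparsity.

This is a direct visualization of the stick-breaking representation of the DP in Ch.23.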
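
A minimal sketch of the simulation the exercise describes, assuming NumPy and a hypothetical helper `sorted_weight_means`. The case $k = 1000$ is left out to keep the sketch quick, and the numbers printed are Monte Carlo estimates that depend on the seed.

```python
import numpy as np

def sorted_weight_means(k, a, draws=1_000, seed=0, top=5):
    """Monte Carlo means of the `top` largest weights under Dirichlet(a, ..., a) with k cells."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.full(k, a), size=draws)
    pi_sorted = np.sort(pi, axis=1)[:, ::-1]     # descending within each draw
    return pi_sorted[:, :top].mean(axis=0)

for k in (5, 10, 25, 50, 100):
    print(k, np.round(sorted_weight_means(k, 1.0 / k), 3))

sparse = sorted_weight_means(100, 0.01)          # a = 1/k
flat = sorted_weight_means(100, 1.0)             # a = 1 for comparison
print("top-5 mass, a = 1/k:", sparse.sum(), "  a = 1:", flat.sum())
```

Comparing the rows for increasing $k$ shows how quickly the means of the sorted weights stabilize, and the last line contrasts the mass carried by the five largest weights under the sparse and the flat prior.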

6 Ch.22 series wrap-up

6.1 Key points of the four parts

Part	One-line summary
Overview (04-22-0)	"A prior on the distribution itself: a survey of the five sections on finite mixtures"
§ 22.1~22.3 (04-22-1)	"ECM/Gibbs with Eqs. (22.5)~(22.7), handling label switching"
§ 22.4~22.7 (this part)	"Automatic choice of $H$ + classification, regression, exercises + wrap-up"

6.2 The key equations of Ch.22 in one place

No.	Equation	Meaning
(22.1)	$p(y_i) = \sum_h \lambda_h f(y_i \mid \theta_h)$	Finite mixture
(22.2)	$p(y, z) = \prod_i \prod_h (\lambda_h f(y_i \mid \theta_h))^{z_{ih}}$	Joint with indicator
(22.5)	Expected indicator (indexed by $ij$) = Bayes-rule ratio	E-step
(22.6)	Random-effect update (indexed by $j$) = conjugate weighted average	M-step random effect
(22.7)	$\mu, \beta$ group mean	M-step group
(22.10)	$\pi \sim \text{Dir}(a, \ldots, a),\ \theta_h \sim P_0$	Exchangeable prior
(22.11)	$y_i \mid z_i \sim N(\mu_{z_i}, \tau_{z_i}^2)$	Location-scale
(22.12)	$\mu_h, \tau_h^2 \sim N \cdot \text{Inv-Gamma}$	Conjugate
-	$a = n_0/H$	Sparse Dirichlet
(22.13)	$f(w_i) = \sum_h \pi_h N_{p+1}(w_i \mid \mu_h, \Sigma_h)$	Joint mixture
(22.14)	$f(y \mid x) = \sum_h \pi_h(x) N(y \mid \beta_{0h} + x\beta_{1h}, \sigma_h^2)$	Mixture of experts
(22.15)	$\pi_h(x) = $ predictor-dependent weights	Gating function

6.3 The Ch.22 sequence: progressively broader generalization

Ch.22 § 22.1 Setup
  → Eq. (22.1) finite mixture, latent indicator

Ch.22 § 22.2 Schizophrenia
  → Closed-form computations for ECM/Gibbs

Ch.22 § 22.3 Label switching
  → Identifiability and postprocessing

Ch.22 § 22.4 Unspecified H
  → Truncated upper bound + sparse Dirichlet
  → Bridge to the DP (Ch.23)

Ch.22 § 22.5 Classification/Regression
  → Discriminant analysis, mixture of experts
  → Eq. (22.14) predictor-dependent

Ch.23 Dirichlet Process
  → The limit H → ∞
  → Completes nonparametric Bayes

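6.4 Legacy and limitations of Ch.22

Legacy:

A standard tool for latent variable augmentation.
A model treatment of identifiability (label switching).
Automatic choice of $H$ (sparse Dirichlet): simple but powerful.
A unified view of robust statistics ($t$ = scale mixture).
A Bayesian framework for discriminant analysis and mixture of experts.

Limitations:

$H$ is truncated, which falls short if there truly are infinitely many clusters (Ch.23).
Cluster results are sensitive to the assumed component family ($f$).
The number of clusters is sensitive to the multivariate covariance assumption.
Label switching is inherently difficult (postprocessing is essential in the multivariate case).

Next chapter (Ch.23):

DP = the $H \to \infty$ limit of Ch.22.
Stick-breaking = the natural representation of $a = n_0/H \to 0$.
Polya urn / Chinese restaurant process = the marginal sequential distribution of $z_i$.
HDP = sharing of components across group-specific mixtures.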

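7 Consolidated checklist for the Ch.22 series

Model choice

Is a mixture really needed (can the conditioning variable be observed)?
Component family ($f$): normal by default, $t$ for robustness, and a decision on the multivariate $\Sigma$.
Choosing $H$: domain knowledge vs WAIC comparison vs truncated upper bound + sparse prior.

Prior

A proper prior is required (an improper prior leads to a degenerate posterior).
Dirichlet with $a = n_0/H$ (default $n_0 = 1$) for sparsity.
Scale of $P_0$ = range of the data ($\mu_0 = 0, \kappa = 1$ after standardization).
Identifiability: an order constraint or an informative component prior.

Computation

Crude estimates from 100 starting points, then 100 ECM iterations for mode finding.
Mode → $t_4$ approximation → importance resampling → starting points for Gibbs.
Use the 6-step Gibbs sampler (or PyMC pm.Mixture).
NUTS for nonconjugate hyperparameters.

Checking

PPC test: quantities other than the sufficient statistics (extremes, quantiles).
WAIC or LOO-CV for comparing $H$ or models (DIC is not recommended).
$\widehat R$ on the density or on other switching-invariant quantities.
KL postprocessing when making component-specific inferences.
Trust only clusters with size $\pi_h \cdot n > 5$.

Interpretation

Density estimation: label switching can be ignored.
Cluster ≠ true subpopulation; results are sensitive to the kernel assumption.
Extrapolation: examine whether the cluster assumption generalizes externally.
Compare with the GP of Ch.21 and the DP of Ch.23: which model is most appropriate?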

8 Related topics

Follow-up topics

Ch.23 Dirichlet Process Models Overview (planned)

9 References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.), Ch.22 § 22.4~22.7. CRC Press.
Ishwaran, H., & Zarepour, M. (2002). Exact and Approximate Sum-Representations for the Dirichlet Process. Canadian Journal of Statistics, 30(2), 269-283.
Rousseau, J., & Mengersen, K. (2011). Asymptotic Behaviour of the Posterior Distribution in Overfitted Mixture Models. JRSS B, 73(5), 689-710.
Sethuraman, J. (1994). A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4, 639-650. (Definition of stick-breaking)
Richardson, S., & Green, P. J. (1997). On Bayesian Analysis of Mixtures with an Unknown Number of Components. JRSS B, 59(4), 731-792.
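Stephens, M. (2000a). Bayesian Analysis of Mixture Models with an Unknown Number of Components: An Alternative to Reversible Jump Methods. Annals of Statistics, 28(1), 40-74.
Stephens, M. (2000b). Dealing with Label Switching in Mixture Models. JRSS B, 62(4), 795-809.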
Jasra, A., Holmes, C. C., & Stephens, D. A. (2005). Markov Chain Monte Carlo Methods and the Label Switching Problem in Bayesian Mixture Modeling. Statistical Science, 20(1), 50-67.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3(1), 79-87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6(2), 181-214.
Diebolt, J., & Robert, C. P. (1994). Estimation of Finite Mixture Distributions through Bayesian Sampling. JRSS B, 56(2), 363-375.
Roeder, K., & Wasserman, L. (1997). Practical Bayesian Density Estimation Using Mixtures of Normals. JASA, 92(439), 894-902.
Fraley, C., & Raftery, A. E. (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. JASA, 97(458), 611-631.
Dunson, D. B. (2010a). Nonparametric Bayes Applications to Biostatistics. In Bayesian Nonparametrics (Hjort et al., eds.), Cambridge.
Dunson, D. B., & Bhattacharya, A. (2010). Nonparametric Bayes Regression and Classification through Mixtures of Product Kernels. In Bayesian Statistics 9, Oxford.
Belin, T. R., & Rubin, D. B. (1995a, 1995b). Inference for Finite Mixture Models. Statistica Sinica.
McLachlan, G. J., & Peel, D. (2000). Finite Mixture Models. Wiley.
Fruhwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (Ch.10 — VB for mixtures)
Minka, T. (2001). Expectation Propagation for Approximate Bayesian Inference. UAI.
