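Kwangmin Kim - Ch.23 § 23.4~23.7 In Depth: HDP · NDP · Density Regression + Ch.23 Wrap-Up + Part V Wrap-Up

1 Introduction: The Last Installment of the Ch.23 Series

The Ch.23 ladder:

Part	Topic
Overview (04-23-0)	The big picture of Ch.23
§ 23.1~23.3 (04-23-1)	DP, stick-breaking, DPM
§ 23.4~23.7 (this part)	HDP, NDP, density regression, Ch.23 wrap-up, Part V wrap-up

Five questions this part answers

How does the DPM work as a building block of hierarchical models (regression residuals, random effects, FDA basis coefficients, and so on)?
What distinguishes the two ways of inducing dependence among the distributions of several groups: HDP (shared atoms) and NDP (clustering of distributions)?
By what mechanism is the finite mixture of experts in Eq. (23.21) made infinite through DDP, kernel, and probit stick-breaking?
What principle unifies the five applications in Ch.23 (residual, ANOVA, FDA, HDP, density regression)?
What progressive generalization of functions and distributions does the ladder of Part V's five chapters (Ch.19~23) represent?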

2 § 23.4 Beyond Density Estimation

So far the DPM has been used to estimate a single distribution. This section shows its power as a building block of hierarchical models: regression residuals, random effects, functional data, and more.

2.1 Eq. (23.10): Nonparametric Residual

Linear regression with DP scale-mixture residuals:

\[ y_i = X_i \beta + \epsilon_i, \quad \epsilon_i \sim N(0, \phi_i^{-1}), \quad \phi_i \sim P, \quad P \sim \text{DP}(\alpha P_0) \quad (23.10) \]

With $P_0 = \text{Gamma}(\nu/2, \nu/2)$, the prior is centered on $t_\nu$ residuals.
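
As a worked step behind the centering claim (a standard scale-mixture identity, written here with the precision $\phi$ integrated out under the base measure), the marginal residual density at $P = P_0$ is:

\[ \int_0^\infty N(\epsilon \mid 0, \phi^{-1}) \, \text{Gamma}(\phi \mid \nu/2, \nu/2) \, d\phi = t_\nu(\epsilon \mid 0, 1) \]

Since $E(P) = P_0$ under $P \sim \text{DP}(\alpha P_0)$, the prior expectation of the residual density is exactly this $t_\nu$ density, and $\alpha$ governs how far a realized $P$ may stray from it.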

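Intuition: DP scale mixture vs. t distribution

$t_\nu$ distribution (Ch.17 robust): every observation shares the same mixing distribution.

$\phi_i \sim \text{Gamma}(\nu/2, \nu/2)$: continuous mixing.
Marginal residual = $t_\nu$: heavy-tailed, but unimodal and symmetric.

DP scale mixture: $P$ itself is nonparametric.

The $\phi_i$ form discrete clusters, with members sharing the same $\phi^*$.
A subset of outliers forms its own cluster (large variance) and is isolated automatically.
The residual density nonetheless stays unimodal and symmetric about 0, because every component is a zero-mean Gaussian and only the scale is mixed.

The DP scale mixture thus generalizes the continuous mixing of $t_\nu$ to discrete mixing, and it is more flexible when the variances in the data have cluster structure.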
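
To make the discrete clustering of precisions concrete, here is a minimal simulation sketch of Eq. (23.10) with $X_i \beta$ dropped. It assumes a truncation of the DP at `H` atoms; the helper names `stick_breaking` and `dp_scale_mixture_residuals` and all numeric settings are illustrative choices, not part of the source.

```python
import numpy as np


def stick_breaking(alpha, H, rng):
    """Truncated stick-breaking weights: pi_h = V_h * prod_{l<h} (1 - V_l)."""
    V = rng.beta(1.0, alpha, size=H)
    V[-1] = 1.0  # close the stick at the truncation level H (an approximation)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return pi / pi.sum()


def dp_scale_mixture_residuals(n, alpha, nu, H=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = stick_breaking(alpha, H, rng)
    # Atoms phi*_h ~ P_0 = Gamma(nu/2, rate = nu/2); NumPy takes scale = 1/rate.
    phi_star = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=H)
    S = rng.choice(H, size=n, p=pi)  # cluster label of each residual
    eps = rng.normal(0.0, 1.0 / np.sqrt(phi_star[S]))
    return eps, S, phi_star, pi


eps, S, phi_star, pi = dp_scale_mixture_residuals(n=500, alpha=1.0, nu=4.0)
for h in np.unique(S):
    print(f"cluster {h}: n = {(S == h).sum():3d}, sd = {phi_star[h] ** -0.5:.2f}")
```

Residuals in the same cluster share one precision $\phi^*_h$, so a low-precision cluster plays the role of an "outlier" group, while every component remains centered at 0.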

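2.2 Alternative: Location Mixture for Skewed/Multimodal Residuals

Limitation of the scale mixture: it can represent only unimodal, symmetric densities.

Location-mixture residuals:

\[ \epsilon_i \sim N(\mu_i, \tau^{-1}), \quad \mu_i \sim P, \quad P \sim \text{DP}(\alpha P_0), \quad \tau \sim \text{Gamma}(a_\tau, b_\tau) \]

The base measure $P_0 = N(0, \tau^{-1})$ is centered at 0.

Intuition: what the location mixture captures

The $\mu_i$ can form several modes, so the residual density can be multimodal or skewed.

Applications:

Financial data: the residual distribution has two modes, one for normal times and one for crises.
Genetics: residuals have mean 0, but some points are shifted systematically in the positive (or negative) direction.
Heterogeneous groups: hidden subsets of the data differ in their residual distributions.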

2.3 Eqs. (23.11)~(23.13): ANOVA + DP

One-factor ANOVA:

\[ y_{ij} = \mu_i + \epsilon_{ij}, \quad \mu_i \sim f, \quad \epsilon_{ij} \sim g \quad (23.11) \]

Traditional choice: $f = N(\mu, \psi^{-1})$, $g = N(0, \sigma^2)$ (the Ch.5 hierarchical model).

DPM generalization:

\[ \mu_i \sim P, \quad P \sim \text{DP}(\alpha P_0) \quad (23.12) \]

with $\epsilon_{ij} \sim N(0, \sigma^2)$ assumed.

→ Subjects form latent classes:

\[ \mu_i = \mu_{S_i}^*, \quad \Pr(S_i = h) = \pi_h, \quad h = 1, 2, \ldots \quad (23.13) \]
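
The latent-class structure of Eqs. (23.12)~(23.13) can be sketched by simulation. The snippet below assumes a truncated stick-breaking approximation, a normal base measure $P_0 = N(0, 3^2)$, and $\sigma = 1$; the function name `latent_class_means` and these settings are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def latent_class_means(n_subjects, alpha, base_sd=3.0, H=100):
    """Draw mu_i = mu*_{S_i} with Pr(S_i = h) = pi_h from a truncated DP."""
    V = rng.beta(1.0, alpha, size=H)
    V[-1] = 1.0  # close the stick at the truncation level
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    pi = pi / pi.sum()
    mu_star = rng.normal(0.0, base_sd, size=H)  # atoms mu*_h ~ P_0 (assumed normal)
    S = rng.choice(H, size=n_subjects, p=pi)    # latent class of each subject
    return mu_star[S], S


mu, S = latent_class_means(n_subjects=30, alpha=1.0)
# y_ij = mu_i + eps_ij with 5 replicates per subject and sigma = 1
y = mu[:, None] + rng.normal(0.0, 1.0, size=(30, 5))
print("distinct subject means:", np.unique(mu).size, "out of", mu.size)
```

Subjects in the same class have identical $\mu_i$, which is the discrete grouping that separates this model from the continuous shrinkage of a normal random-effects prior.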

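Intuition: Hierarchical (Ch.5) vs. Latent Class (DPM)

Ch.5 hierarchical ($\mu_i \sim N$): every subject has a different $\mu_i$; shrinkage is continuous.

DPM ($\mu_i \sim P$, $P \sim$ DP): subjects cluster into types; grouping is discrete.

The essential difference:

Aspect	Hierarchical	DPM
Distribution of $\mu_i$	Continuous (Normal)	Discrete (atomic)
Shrinkage	Toward common mean	Within cluster
Interpretation	Emphasizes individual differences	Discovers types

When to use which?

If domain knowledge says that subsets are clearly different types, use the DPM.
If only individual differences need to be expressed, use the hierarchical model.
If neither is known, compare the two by WAIC.

Applications:

Hidden disease subtypes: patients with the same diagnosis actually fall into several subtypes.
Customer segments: automatic discovery of segments in marketing data.
Gene expression: samples under the same condition form hidden clusters.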

2.3.1 A Note on Identifiability

If both $P$ and $g$ are left unknown (a DP for $\mu_i$ and a DP for $\epsilon_{ij}$), the mean cannot be cleanly separated between the two.

Remedy: make only one of them nonparametric (e.g., $\mu_i \sim$ DP with $\epsilon \sim N$ fixed), or center the mean by post-processing.

2.4 Functional Data Analysis (FDA) + DP

An alternative to the GP-FDA of Ch.21. Subject-specific functions:

\[ y_{ij} \sim N(f_i(t_{ij}), \sigma^2), \quad f_i(t) = \sum_{h=1}^H \theta_{ih} b_h(t) \quad (23.14) \]

$\theta_i \in \mathbb{R}^H$ = subject-specific basis coefficients.

2.4.1 DP on Coefficients

\[ \theta_i \sim P, \quad P \sim \text{DP}(\alpha P_0) \]

→ functional clustering: subjects in the same cluster share the same function.

\[ f_i(t) = f_{S_i}^*(t), \quad f_h^*(t) = b(t) \theta_h^*, \quad \theta_h^* \sim P_0 \]

2.4.2 Variable Selection $P_0$

$P_0 = \bigotimes_h P_{0h}$, with spike-and-slab marginals:

\[ P_{0h}(\cdot) = \pi_{0h} \delta_0(\cdot) + (1 - \pi_{0h}) N(\cdot \mid 0, \psi_h^{-1}) \]

→ Each cluster activates a different subset of basis functions, so each cluster's function can take a different shape.
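
A small sketch of drawing cluster-level coefficients $\theta_h^*$ from the spike-and-slab $P_0$ and turning them into functions $f_h^*(t) = b(t)\theta_h^*$. The Gaussian-bump basis, the common values `pi0 = 0.6` and `psi = 1.0`, and the function names are assumptions for illustration; the source does not fix a particular basis.

```python
import numpy as np


def sample_spike_slab_coefs(n_clusters, H, pi0=0.6, psi=1.0, seed=2):
    """theta*_{ch} = 0 with prob. pi0 (spike), else N(0, 1/psi) (slab)."""
    rng = np.random.default_rng(seed)
    is_spike = rng.random((n_clusters, H)) < pi0
    slab = rng.normal(0.0, 1.0 / np.sqrt(psi), size=(n_clusters, H))
    return np.where(is_spike, 0.0, slab)


def gaussian_basis(t, H, width=0.15):
    """Hypothetical basis b_h(t): Gaussian bumps on an even grid over [0, 1]."""
    centers = np.linspace(0.0, 1.0, H)
    return np.exp(-0.5 * ((t[:, None] - centers[None, :]) / width) ** 2)


theta_star = sample_spike_slab_coefs(n_clusters=3, H=8)
t = np.linspace(0.0, 1.0, 50)
f_star = gaussian_basis(t, 8) @ theta_star.T  # column c holds f*_c on the grid

for c in range(3):
    print(f"cluster {c}: active basis functions = {np.flatnonzero(theta_star[c])}")
```

Each row of `theta_star` has its own pattern of exact zeros, so the clusters differ not only in coefficient values but in which basis functions are switched on.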

2.4.3 Heavy-Tail Shrinkage Alternative

\[ \theta_{ch}^* \sim N(0, \psi_{ch}^{-1}), \quad \psi_{ch} \sim \text{Gamma}(\nu/2, \nu/2) \]

$\nu = 1$ gives a Cauchy marginal. Block updating gives fast mixing.

Intuition: DP-FDA vs. GP-FDA

	GP-FDA (Ch.21)	DP-FDA (Ch.23)
Subject functions	All different, smooth	Cluster-level (identical within a cluster)
Smoothness	Kernel length scale	Smoothness of the basis
Applications	Gradual differences in patient trajectories	Different trajectories by patient type
Computation	$O(n^3)$ Cholesky	Blocked Gibbs $O(N \cdot n)$

Selection criteria:

Patient trajectories differ smoothly (e.g., normal aging): use GP.
Patients fall into clear subtypes (e.g., diverse disease progression patterns): use DP.
A mixture of the two: use HDP-FDA (see § 23.5 next).

3 § 23.5 Hierarchical Dependence

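The goal is to induce dependence among the distributions of several groups; a single DP is not enough.

3.1 Motivation: Comet Assay (Genotoxicity)

Data: $T$ dose groups × subjects × DNA damage measurements.

Group $t$: hydrogen peroxide at treatment concentration $x_t \in \{0, 5, 20, 50, 100\}$ μM.
$y_{ti}$ = DNA damage level of cell $i$.
The distribution $f_t$ of each group changes with dose.

Problem: an independent DP per group shares no information, while a single DP ignores group differences.

Solution: DPs with hierarchical, nested, or convex structure, which induce an appropriate amount of dependence between groups.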

3.2 Hierarchical DP (HDP): Sharing Atoms Across Groups

The group distributions $P_j$ share the same set of atoms and differ only in their weights:

\[ P_j = \sum_{h=1}^\infty \pi_{jh} \delta_{\theta_h^*}, \quad \theta_h^* \sim P_{00}, \quad \pi_j \sim \text{HDP-stick}(\alpha, \beta) \]

3.2.1 Two-Level Structure

Global level: $G_0 \sim \text{DP}(\beta P_{00})$, the global atom set.
Group level: $G_j \mid G_0 \sim \text{DP}(\alpha G_0)$, the distribution of group $j$ (same atoms, different weights).

Intuition: the atom-sharing mechanism of the HDP

$G_j$ is a DP draw whose base measure is $G_0$. Because $G_0$ is itself a DP draw and therefore discrete, the atoms of $G_j$ are atoms of $G_0$ (a subset).

Therefore:

All groups draw from a common atom pool.
The same atom $\theta_h^*$ can carry a different weight $\pi_{jh}$ in each group.
A new group uses the same atoms, which gives cross-group information sharing.

Comparison:

Independent DPs: each group has its own atoms. No information is shared.
HDP: shared atoms with free weights. Borrowing of strength plus group flexibility.

Applications:

Document topic models (HDP-LDA): each document has its own topic proportions, while the topics themselves are common to the corpus.
Speech: phoneme proportions differ by speaker, while the phoneme clusters are common.
Comet assay: proportions of DNA damage clusters differ by dose, while the cluster types are common.

3.2.2 Limiting Cases

$\alpha \to 0$: all subjects in a group fall into a single cluster (each $G_j$ collapses to one atom drawn from $G_0$).
$\alpha \to \infty$: $G_j \equiv G_0$, so all groups have the same distribution.

→ $\alpha$ controls the degree of between-group difference, and $\beta$ controls the total number of clusters.
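
The two limits can be checked numerically. When $G_0$ has finitely many atoms with weights $\pi_0$, the weights that $G_j \sim \text{DP}(\alpha G_0)$ places on them are $\text{Dirichlet}(\alpha \pi_0)$, which follows from the partition definition (23.1). The sketch below assumes a hypothetical four-atom $G_0$; the weights and sample sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical global weights of a 4-atom G_0 (illustration only).
pi_0 = np.array([0.4, 0.3, 0.2, 0.1])


def group_weights(alpha, n_groups):
    """Weights of G_j ~ DP(alpha * G_0) on the atoms of a finite G_0."""
    return rng.dirichlet(alpha * pi_0, size=n_groups)


small = group_weights(alpha=0.01, n_groups=200)  # near the alpha -> 0 limit
large = group_weights(alpha=1e4, n_groups=200)   # near the alpha -> infinity limit

mean_max_small = small.max(axis=1).mean()   # close to 1: one atom per group
max_dev_large = np.abs(large - pi_0).max()  # close to 0: G_j matches G_0

print(f"alpha = 0.01: average largest weight = {mean_max_small:.3f}")
print(f"alpha = 1e4 : largest deviation from pi_0 = {max_dev_large:.4f}")
```

With small $\alpha$ each group concentrates almost all of its mass on a single atom (which atom varies by group), and with large $\alpha$ every group reproduces the global weights.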

3.2.3 HDP-Hierarchical Clustering

State $i$, hospital $j$, quality score $y_{ij}$ (the states are the groups, so the weights are indexed by $i$):

\[ y_{ij} \sim N(\mu_{S_{ij}}^*, \phi_{S_{ij}}^{*-1}), \quad \Pr(S_{ij} = h) = \pi_{ih}, \quad (\mu_h^*, \phi_h^{*-1}) \sim P_{00} \]

→ Hospitals within a state cluster together preferentially, and clusters can also be shared across states. This is a soft hierarchical clustering.

Hyperprior: $\alpha, \beta \sim \text{Gamma}(1, 1)$ by default.

3.3 Eqs. (23.17)~(23.18): Nested DP (NDP)

Whereas the HDP shares atoms and varies the weights, the NDP clusters the distributions themselves.

\[ P_j \sim P, \quad P \sim \text{DP}(\alpha P_0), \quad P_0 \equiv \text{DP}(\beta P_{00}) \quad (23.17) \]

3.3.1 Stick-Breaking Representation

\[ P_j \sim P = \sum_{h=1}^\infty \pi_h \delta_{P_h^*}, \quad \pi \sim \text{stick}(\alpha), \quad P_h^* \sim \text{DP}(\beta P_{00}) \quad (23.18) \]

The atoms of $P$ are themselves distributions $P_h^*$ (random measures).
Each $P_j$ is mapped to one of them.

3.3.2 Distribution-Level Cluster

\[ \Pr(P_j = P_{j'}) = \frac{1}{1 + \alpha} \]

→ The prior probability that two groups have exactly the same distribution is $1/(1+\alpha)$.
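
A quick Monte Carlo check of this probability: given the weights $\pi$, two groups pick the same atom $P_h^*$ with probability $\sum_h \pi_h^2$, and averaging over $\pi \sim \text{stick}(\alpha)$ should give $1/(1+\alpha)$. The sketch assumes a truncation at `H = 100` sticks; the function name and replicate count are illustrative.

```python
import numpy as np


def prob_same_distribution(alpha, n_rep=5000, H=100, seed=4):
    """Monte Carlo estimate of Pr(P_j = P_j') = E[sum_h pi_h^2] under stick(alpha)."""
    rng = np.random.default_rng(seed)
    V = rng.beta(1.0, alpha, size=(n_rep, H))
    stick_left = np.cumprod(1.0 - V, axis=1)  # mass remaining after each break
    pi = V * np.hstack([np.ones((n_rep, 1)), stick_left[:, :-1]])
    return (pi ** 2).sum(axis=1).mean()


for alpha in (0.5, 1.0, 4.0):
    est = prob_same_distribution(alpha)
    print(f"alpha = {alpha}: simulated = {est:.3f}, 1/(1+alpha) = {1 / (1 + alpha):.3f}")
```

Smaller $\alpha$ therefore expresses a stronger prior belief that groups share an identical distribution.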

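HDP vs. NDP: the essential difference

Aspect	HDP	NDP
Shared across groups	Same atoms, different weights	Same distribution (both atoms and weights)
$\Pr(P_j = P_{j'})$	0	$1/(1+\alpha)$
Applications	Information sharing, weight differences	Testing for “same distribution”
Interpretation	Soft clustering	Hard distribution clustering

HDP is suited to:

Topic models: topic proportions differ by document, while the topics themselves are common.
Cross-population genetics: allele frequencies differ by population, while the alleles themselves are common.

NDP is suited to:

Multi-treatment comparison: testing whether two doses have the same effect.
Center comparison: finding which hospitals have the same quality distribution.

The NDP can be used as a tool for Bayesian hypothesis testing.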

3.4 Eq. (23.19): Convex Mixture

The group distribution is a convex combination of a global part and a group-specific part:

\[ P_c = \pi G_0 + (1 - \pi) G_c, \quad G_c \sim \text{DP}(\alpha G_0), \quad \pi \sim \text{Beta}(a, b) \quad (23.19) \]

$G_0$ = distribution common to all groups.
$G_c$ = group-specific deviation.
$\pi$ = degree of commonality (1 = fully common, 0 = fully group-specific).

Intuition: why convex rather than additive

Gaussian hierarchical model: $\mu_c = \alpha + \beta_c$, which is additive (overall + deviation).

Probability measures cannot be combined additively in this way: an overall measure plus a signed deviation could become negative, and the total mass would not stay at 1. A convex combination is the natural choice:

$\pi G_0 + (1 - \pi) G_c \geq 0$ ✓ (both terms are non-negative).
$\int (\pi G_0 + (1 - \pi) G_c) = \pi + (1 - \pi) = 1$ ✓ (total probability is preserved).

Indicator $z_{ci} \sim \text{Bernoulli}(\pi)$:

$z_{ci} = 1$: the subject is drawn from the global $G_0$.
$z_{ci} = 0$: the subject is drawn from the group-specific $G_c$.

So $\pi$ is the proportion of subjects that follow the global pattern.
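
The indicator representation translates directly into a sampler. This is a minimal sketch that, purely for illustration, fixes $G_0 = N(0, 1)$, fixes $\pi = 0.7$ instead of drawing it from its Beta prior, and truncates $G_c$ at `H` atoms; the function name `sample_convex_group` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)


def sample_convex_group(n, pi_common, alpha, H=50):
    """Draw n subject-level parameters from P_c = pi * G_0 + (1 - pi) * G_c."""
    # G_c ~ DP(alpha * G_0), truncated at H atoms; G_0 = N(0, 1) is an assumption.
    V = rng.beta(1.0, alpha, size=H)
    V[-1] = 1.0
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    w = w / w.sum()
    atoms = rng.normal(0.0, 1.0, size=H)

    z = rng.random(n) < pi_common                   # z_ci = 1: global component
    from_global = rng.normal(0.0, 1.0, size=n)      # draws from G_0
    from_group = atoms[rng.choice(H, size=n, p=w)]  # draws from G_c
    return np.where(z, from_global, from_group), z


theta, z = sample_convex_group(n=2000, pi_common=0.7, alpha=1.0)
print("share drawn from G_0:", z.mean())
print("distinct values among group-specific draws:", np.unique(theta[~z]).size)
```

Draws with $z_{ci} = 0$ repeat a small number of group-specific atoms, whereas draws with $z_{ci} = 1$ come from the continuous global component.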

3.5 Eq. (23.20): Dynamic AR Mixture

A first-order AR structure for time-ordered groups:

\[ P_t = (1 - \pi) P_{t-1} + \pi G_t, \quad P_0 = G_0, \quad G_t \sim \text{DP}(\alpha P_0) \quad (23.20) \]

$P_t$ = distribution at time $t$.
$G_t$ = new deviation (innovation) at time $t$.
$\pi$ = strength of the innovation.

Intuition: atoms accumulate in the dynamic mixture

New atoms are added at every time $t$. The weights of earlier atoms shrink gradually (by a factor of $1 - \pi$ per step) but never vanish.

→ “Markov chain in measure space”.
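
Unrolling the recursion in Eq. (23.20) makes the geometric decay explicit (a direct algebraic consequence of the equation, with the numbers below chosen only as an example):

\[ P_t = (1 - \pi)^t G_0 + \pi \sum_{s=1}^{t} (1 - \pi)^{t-s} G_s \]

The coefficients sum to $(1 - \pi)^t + \pi \sum_{k=0}^{t-1} (1 - \pi)^k = 1$, so $P_t$ is a proper mixture. For instance, with $\pi = 0.2$ and $t = 10$, the initial component $G_0$ still carries weight $0.8^{10} \approx 0.107$, while the newest innovation $G_{10}$ carries weight $0.2$.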

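Applications:

Ordered dose response: adjacent doses are similar, and doses differ more as they move apart.
Music data: new musical styles emerge over time while older styles persist.
Disease progression: stage-by-stage transition distributions.

Limitation: atoms only accumulate and never disappear, so the set of atoms keeps growing in very long time series.

Remedy: take $P_0 \sim$ HDP so that all atoms come from a shared global pool, which allows atom reuse.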

4 § 23.6 Density Regression

This section covers the case where the conditional density $p(y \mid x)$ changes with a continuous predictor $x$.

4.1 Eq. (23.21): Mixture of Experts (Finite)

\[ p(y \mid x) = \sum_{h=1}^H \pi_h(x) N(y \mid x \beta_h, \tau_h^{-1}) \quad (23.21) \]

$H$ linear regression “experts”.
$\pi_h(x)$: gating function (predictor-dependent).

Intuition: Mixture of Experts (Jordan & Jacobs 1994)

Hard assignment: $\pi_h(x) \in \{0, 1\}$, one expert per region of $x$.

Soft assignment: $\pi_h(x) \in [0, 1]$, with the gating function interpolating smoothly between experts.

Forms of the gating function:

Logistic: $\pi_h(x) = e^{w_h x} / \sum_l e^{w_l x}$.
Tree-based: a decision tree.
Probabilistic: HME (Hierarchical Mixture of Experts).

Key point: each expert is simple (a linear regression) and the gating function combines them nonlinearly, giving a smooth generalization of piecewise-linear regression.
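
Eq. (23.21) with the logistic gating function can be evaluated in a few lines. The sketch below assumes a scalar predictor and three experts; the parameter values in `w`, `beta`, and `tau` are made up for illustration.

```python
import numpy as np


def softmax_gating(x, w):
    """pi_h(x) = exp(w_h x) / sum_l exp(w_l x) for a scalar predictor x."""
    a = w * x
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()


def moe_density(y, x, w, beta, tau):
    """p(y | x) = sum_h pi_h(x) N(y | x * beta_h, 1 / tau_h) on a grid of y values."""
    pi = softmax_gating(x, w)
    sd = 1.0 / np.sqrt(tau)
    z = (y[:, None] - x * beta[None, :]) / sd[None, :]
    comp = np.exp(-0.5 * z ** 2) / (sd[None, :] * np.sqrt(2.0 * np.pi))
    return comp @ pi


# Hypothetical parameters for H = 3 experts.
w = np.array([-2.0, 0.0, 2.0])
beta = np.array([-1.0, 0.5, 2.0])
tau = np.array([4.0, 1.0, 4.0])

y_grid = np.linspace(-15.0, 15.0, 3001)
dens = moe_density(y_grid, x=1.5, w=w, beta=beta, tau=tau)
area = dens.sum() * (y_grid[1] - y_grid[0])  # Riemann-sum check of normalization

print("gating weights at x = 1.5:", softmax_gating(1.5, w).round(3))
print("integral of p(y | x = 1.5) over the grid:", round(area, 4))
```

Moving `x` changes both the expert means $x\beta_h$ and the gating weights, so the conditional density can shift from one expert to another as the predictor varies.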

4.2 Dependent Stick-Breaking (DDP)

$\pi_h(x)$ is built by stick-breaking, but differs for each $x$:

\[ P_x = \sum_{h=1}^\infty \pi_h(x) \delta_{\theta_h^*(x)}, \qquad \pi_h(x) = V_h(x) \prod_{l < h}(1 - V_l(x)) \]

$V_h(x)$ = $x$-dependent stick proportion.
$\theta_h^*(x)$ = $x$-dependent atom (simplification: fix $\theta_h^*(x) = \theta_h^*$).

4.2.1 MacEachern (1999) DDP

$V_h \sim Q$ is a stochastic process with marginal $V_h(x) \sim \text{Beta}(1, \alpha)$ for all $x$.

→ The marginal at each $x$ is a DP, while the distributions are dependent across $x$.

4.3 Kernel Stick-Breaking (Dunson-Pillai-Park 2007)

\[ V_h(x) = K_{\psi_h}(x, \Gamma_h) V_h, \quad V_h \sim \text{Beta}(1, \alpha), \quad \Gamma_h \sim G, \quad \psi_h \sim H \]

$K_\psi(x, \Gamma)$ = bounded kernel (e.g., Gaussian).
$\Gamma_h$ = location of the atom, $\psi_h$ = bandwidth.

Intuition: spatial locality

$K(x, \Gamma_h) = 1$ at $x = \Gamma_h$ → $V_h(x) = V_h$ (full weight).
$K(x, \Gamma_h) \to 0$ as $|x - \Gamma_h| \to \infty$ → $V_h(x) \to 0$ (no weight).

Interpretation:

A different set of clusters is active in each region of $x$.
Where $x \approx \Gamma_h$, cluster $h$ tends to dominate.
As the distance grows, other clusters take over.

Application: spatially heterogeneous densities, i.e., distributions that differ geographically (e.g., urban vs. rural income distributions).

If the kernel is flat ($K \approx 1$ for all $x$), the model reduces to the standard DP.
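
The locality property can be seen by computing the weights $\pi_h(x)$ at a few predictor values. The sketch assumes a Gaussian kernel $K_\psi(x, \Gamma) = \exp\{-\psi (x - \Gamma)^2\}$, a truncation at `H = 30` sticks, and uniform kernel locations on $[0, 10]$; all of these are illustrative choices rather than settings from the source.

```python
import numpy as np


def kernel_stick_weights(x, V, Gamma, psi):
    """pi_h(x) = V_h(x) * prod_{l<h} (1 - V_l(x)) with V_h(x) = K(x, Gamma_h) * V_h."""
    K = np.exp(-psi * (x - Gamma) ** 2)  # Gaussian kernel, equal to 1 at x = Gamma_h
    Vx = K * V
    return Vx * np.concatenate(([1.0], np.cumprod(1.0 - Vx[:-1])))


rng = np.random.default_rng(6)
H = 30
V = rng.beta(1.0, 1.0, size=H)            # V_h ~ Beta(1, alpha) with alpha = 1
Gamma = rng.uniform(0.0, 10.0, size=H)    # kernel locations
psi = np.full(H, 0.5)                     # common kernel precision (assumed)

w_near = kernel_stick_weights(Gamma[0], V, Gamma, psi)     # at the first location
w_far = kernel_stick_weights(1e3, V, Gamma, psi)           # far from every location
w_flat = kernel_stick_weights(3.0, V, Gamma, np.zeros(H))  # flat kernel, K = 1

print("weight of cluster 0 at x = Gamma_0:", round(w_near[0], 3), "vs V_0 =", round(V[0], 3))
print("total weight far from all locations:", w_far.sum())
print("total weight with flat kernel:", round(w_flat.sum(), 4))
```

With a finite truncation the weights sum to less than 1; the shortfall is the mass left for the sticks beyond `H`, and it is large wherever no kernel location is nearby.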

4.4 Probit Stick-Breaking (Chung-Dunson 2009)

\[ \pi_h(x) = V_h(x) \prod_{l<h}(1 - V_l(x)), \quad V_h(x) = \Phi(\alpha_h + \mu_h(x)), \quad \alpha_h \sim N(\mu, 1) \]

$\Phi$ = standard normal CDF.
$\mu_h(x)$ = arbitrary regression model (linear, GP, etc.).

Intuition: combining PSBP with GP

If $\mu_h(x)$ is a GP draw, the GP controls how the stick-breaking weights vary with $x$.

$\mu = 0$, no $x$: $V_h = \Phi(\alpha_h) \sim \text{Beta}(1, 1) = U(0, 1)$, which is equivalent to a DP with $\alpha = 1$.
$\mu_h(x)$ a GP: $V_h(x)$ varies smoothly with $x$, so the density depends smoothly on $x$.
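
The uniformity claim can be verified by simulation, and the same code yields predictor-dependent weights once $\mu_h(x)$ is non-zero. The sketch assumes a linear $\mu_h(x) = b_h x$ with made-up slopes, a truncation at 20 sticks, and a hand-written normal CDF via `math.erf`; none of these come from the source.

```python
import math

import numpy as np


def std_normal_cdf(z):
    """Phi(z) computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def psbp_weights(x, a, slopes):
    """pi_h(x) with V_h(x) = Phi(a_h + b_h * x); a linear mu_h(x) is an assumption."""
    V = np.array([std_normal_cdf(ai + bi * x) for ai, bi in zip(a, slopes)])
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))


rng = np.random.default_rng(7)

# Check: alpha_h ~ N(0, 1) implies V_h = Phi(alpha_h) ~ U(0, 1).
a_big = rng.normal(0.0, 1.0, size=20000)
V_big = np.array([std_normal_cdf(z) for z in a_big])
print("mean of V:", round(V_big.mean(), 3), " variance of V:", round(V_big.var(), 4))

# Predictor-dependent weights with hypothetical slopes b_h.
a = rng.normal(0.0, 1.0, size=20)
slopes = rng.normal(0.0, 1.0, size=20)
w_left = psbp_weights(-2.0, a, slopes)
w_right = psbp_weights(2.0, a, slopes)
print("largest cluster at x = -2:", w_left.argmax(), " at x = 2:", w_right.argmax())
```

A uniform variable has mean $1/2$ and variance $1/12 \approx 0.083$, which the first printout should approximately reproduce.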

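Advantages:

Conjugate Gibbs sampling via probit data augmentation.
Combines the expressiveness of GPs with the sparsity of the DP.
Variable selection (by learning the GP length scale for each dimension).

These properties make it one of the leading approaches to Bayesian density regression.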

4.5 Glucose Tolerance Example (Brief)

Diabetes epidemiology: patient characteristics (age, BMI, family history, etc.) → distribution of glucose tolerance.

Applying PSBP:

In the region of healthy subjects, the distribution is unimodal.
In the diabetic region, it is multimodal (normal, prediabetic, diabetic).
The shape of the density itself changes as the predictors change.

4.6 Density Regression Comparison (GP vs Mixture of Experts vs DDP)

Approach	Strengths	Weaknesses	Applications
GP regression (Ch.21)	Smooth, principled	Single mode	Smooth regression
LGP density regression (Ch.21 § 21.5)	Models the density itself	Normalizing integral	Univariate density
Finite mixture of experts (Ch.22)	Simple, fast	Must choose $H$	Standard applications
DDP / kernel SBP / probit SBP (Ch.23)	Nonparametric + dependence	Computationally complex	Advanced density regression

Practical recommendations

Simple regression + smooth density: GP regression.
Regression + changing density shape: probit stick-breaking + GP gating.
Regression + spatial heterogeneity: kernel stick-breaking.
Quick prototype: finite mixture of experts (Ch.22).

5 § 23.7 Bibliographic Note

5.1 DP and DPM Canon

Ferguson (1973): original DP paper.
Antoniak (1974): introduces the DPM.
Sethuraman (1994): stick-breaking theorem.
Escobar & West (1995), MacEachern (1994): DPM Gibbs sampler.
Neal (2000): MCMC algorithms (Algorithm 8 marginal Gibbs).
Ishwaran & James (2001): blocked Gibbs.

5.2 Hierarchical / Nested / Dependent

Teh, Jordan, Beal, Blei (2006): HDP.
Rodriguez, Dunson, Gelfand (2008): NDP.
MacEachern (1999, 2000): DDP framework.
De Iorio, Müller, Rosner, MacEachern (2004): ANOVA-DDP.
Müller, Quintana, Rosner (2004): convex mixtures.

5.3 Density Regression

Dunson, Pillai, Park (2007): kernel stick-breaking.
Chung, Dunson (2009): probit stick-breaking.
Griffin, Steel (2006): order-based DDP.
Dunson (2010a): conditional density review.

5.4 Computation

Jain, Neal (2004): split-merge MCMC.
Blei, Jordan (2006): variational DPM.
Walker (2007): slice sampling for stick-breaking.
Kalli, Griffin, Walker (2011): generalization of slice sampling.

5.5 Surveys

Hjort, Holmes, Müller, Walker (eds., 2010): Bayesian Nonparametrics (Cambridge).
Ghosal, van der Vaart (2017): Fundamentals of Nonparametric Bayesian Inference.
Müller, Quintana, Jara, Hanson (2015): Bayesian Nonparametric Data Analysis.

6 Ch.23 Series Wrap-Up

6.1 Key Points of the Three Parts

Part	One-line summary
Overview (04-23-0)	“DP as the infinite generalization of the finite mixture; limits of the Ch.22 sparse Dirichlet”
§ 23.1~23.3 (04-23-1)	“Stick-breaking Eq. (23.3); Polya urn / CRP; marginal vs. blocked Gibbs”
§ 23.4~23.7 (this part)	“DPM as a hierarchical building block; HDP; NDP; density regression”

6.2 Unified Table of Ch.23 Equations

No.	Equation	Meaning
(23.1)	$P(B_1), \ldots, P(B_k) \sim \text{Dir}(\alpha P_0(B_1), \ldots)$	DP partition definition
(23.2)	$E(P(B) \mid y) = (\alpha P_0 + n \hat F_n)/(\alpha + n)$	Posterior weighted average
(23.3)	$\pi_h = V_h \prod_{l<h} (1 - V_l), V_h \sim \text{Beta}(1, \alpha)$	Stick-breaking
(23.4)	$f(y) = \int \mathcal{K}(y \mid \theta) dP(\theta)$	Kernel mixture
(23.5)	$f(y) = \sum_h \pi_h \mathcal{K}(y \mid \theta_h^*)$	DPM
(23.6)	Polya urn predictive	Sequential
(23.7)	Conditional predictive	Basis of marginal Gibbs
(23.10)	$\epsilon \sim N(0, \phi_i^{-1}), \phi_i \sim P, P \sim \text{DP}$	Nonparametric residual
(23.13)	$\mu_i = \mu_{S_i}^*$	Latent class via DP
(23.17)	$P_j \sim P, P \sim \text{DP}, P_0 \equiv \text{DP}$	Nested DP
(23.19)	$P_c = \pi G_0 + (1 - \pi) G_c$	Convex mixture
(23.20)	$P_t = (1 - \pi) P_{t-1} + \pi G_t$	Dynamic AR
(23.21)	$p(y \mid x) = \sum_h \pi_h(x) N(y \mid x\beta_h, \tau_h^{-1})$	Mixture of experts

6.3 The Sequence of Ch.23

§ 23.1 Bayesian histogram (Dirichlet conjugate)
  → § 23.2 DP: partitions integrated out + stick-breaking construction
  → § 23.3 DPM: kernel mixture for continuous data
  → § 23.4 hierarchical building block (residual / random effect / FDA)
  → § 23.5 multi-group dependence (HDP, NDP, convex)
  → § 23.6 density regression (DDP, kernel/probit stick-breaking)

Each step is a natural generalization of the one before.

7 Part V (Ch.19~23) Overall Wrap-Up

7.1 The Ladder of Five Chapters

Ch.	Title	Key tools	Applications
Ch.19	Parametric Nonlinear	Domain ODEs, 4PL	PBPK, pharmacokinetics
Ch.20	Basis Function	$\sum \beta_h b_h(x)$, B-spline, shrinkage	smooth regression
Ch.21	Gaussian Process	$\mu \sim \text{GP}(m, k)$, kernel algebra	smooth surface, FDA
Ch.22	Finite Mixture	$\sum \lambda_h f_h$, latent indicator	density estimation, classification
Ch.23	Dirichlet Process	$P \sim \text{DP}$, stick-breaking	nonparametric Bayes

7.2 The Two Axes of Part V: Functions vs. Distributions

Function models (prior on $\mu(x)$):
  Ch.19 (parametric nonlinear)
    → Ch.20 (basis function, finite)
      → Ch.21 (GP, infinite)

Distribution models (prior on $P$):
  Ch.22 (finite mixture, $H$)
    → Ch.23 (DP, $\infty$)

Unifying principle of Part V

Progressive generalization:

Ch.19: parametric, domain-dependent.
Ch.20: weighted sum of basis functions, finite-dimensional nonparametric.
Ch.21: infinite-dimensional prior on the function itself.
Ch.22: prior on distributions with finitely many components.
Ch.23: prior on distributions with infinitely many components.

Selection guide:

Situation	Recommendation
Clear domain theory	Ch.19
Smooth function, 1-3D	Ch.20 (B-spline)
Smooth function + uncertainty	Ch.21 (GP)
Heterogeneous distribution, domain-defined clusters	Ch.22 (mixture)
Heterogeneous + unknown $H$ + flexibility	Ch.23 (DP)

In practice several tools are combined, e.g., Ch.21 GP regression + Ch.23 DP residuals (Eq. 23.10).

7.3 The Evolution of Computation in Part V

Ch.	Computational core
19	HMC + ODE solver
20	Conjugate Gibbs (basis selection)
21	Cholesky $O(n^3)$, sparse approximations
22	EM/ECM, finite Gibbs
23	Polya urn / blocked Gibbs (stick-breaking)

→ MCMC is central in every chapter, tailored to each model structure.

8 Integrated Checklist: § 23.4~23.7

§ 23.4 Beyond Density Estimation

Nonparametric residual: scale mixture (unimodal) vs. location mixture (multimodal).
ANOVA random effect: if domain clusters are credible, use the DPM; for individual differences, use the Ch.5 hierarchical model.
FDA basis coefficients: variable selection $P_0$ vs. heavy-tail shrinkage.
Identifiability: center the mean when both $P$ and $g$ are nonparametric.

§ 23.5 Hierarchical Dependence

Choosing between HDP and NDP:
- Atoms are shared in the domain (topics, alleles) → HDP.
- Testing for distribution clusters (treatment equivalence) → NDP.
HDP hyperprior: $\alpha \sim \text{Gamma}(1, 1), \beta \sim \text{Gamma}(1, 1)$.
Convex mixture (Eq. 23.19): a clear global vs. local structure across groups.
Dynamic mixture (Eq. 23.20): time-ordered groups; an HDP base is recommended for atom reuse.

§ 23.6 Density Regression

Smooth dependence + unimodal: GP regression (Ch.21).
Changing density shape: probit stick-breaking + GP gating.
Spatial heterogeneity: kernel stick-breaking.
Simple prototype: finite mixture of experts (Ch.22).

§ 23.7 + Ch.23 Wrap-Up

The principle unifying the five applications (residual, ANOVA, FDA, HDP, density regression): the DPM as a hierarchical building block.
Recognize the ladder of Part V's five chapters (Ch.19~23).
Domain ↔ model mapping (table above).
Combining several tools is recommended in practice.

9 Code: HDP Simulation (Brief)

import numpy as np

rng = np.random.default_rng(0)


def hdp_sample(J, n_per_group, alpha, beta, base_sampler, N_top=20):
    """Truncated stick-breaking simulation of an HDP (illustrative sketch)."""
    # Global level: G_0 ~ DP(beta * P_00), truncated at N_top atoms.
    V_top = rng.beta(1, beta, N_top)
    V_top[-1] = 1.0  # last stick takes the remaining mass, so weights sum to 1
    pi_top = np.zeros(N_top)
    remaining = 1.0
    for h in range(N_top):
        pi_top[h] = V_top[h] * remaining
        remaining *= (1 - V_top[h])
    theta_top = np.array([base_sampler() for _ in range(N_top)])

    # Group level: G_j | G_0 ~ DP(alpha * G_0).
    # The truncated G_0 has finitely many atoms, so the weights that G_j puts
    # on those atoms are Dirichlet(alpha * pi_top).
    groups = []
    for j in range(J):
        pi_j = rng.dirichlet(alpha * pi_top)
        # Draw the latent atom of each of the n_per_group subjects from G_j.
        labels = rng.choice(N_top, size=n_per_group, p=pi_j)
        groups.append((pi_j, theta_top[labels]))
    return theta_top, pi_top, groups


# Example: 3 groups, normal base measure P_00 = N(0, 2^2)
theta_top, pi_top, groups = hdp_sample(
    J=3, n_per_group=100, alpha=2, beta=3,
    base_sampler=lambda: rng.normal(0, 2),
)

print(f"Global atoms: {theta_top[:5].round(2)}")
print(f"Global weights: {pi_top[:5].round(3)}")
for j, (pi_j, theta_j) in enumerate(groups):
    print(f"Group {j}: weights on first 5 atoms = {pi_j[:5].round(3)}, "
          f"distinct atoms used = {np.unique(theta_j).size}")

Code guide

Stick-breaking representation of the HDP in two levels (global: $G_0$, group: $G_j$).
Every $G_j$ uses the same atom pool, which makes the atom sharing visible.
Modeling an HDP directly in PyMC, Edward2, NumPyro, and similar libraries is more involved; variational inference or slice sampling (Walker 2007) is recommended.

Specialized packages: gensim (HDP-LDA), bayesianbnp (Bayesian Nonparametrics in Python).

11 References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.), Ch.23 § 23.4~23.7. CRC Press.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet Processes. JASA, 101(476), 1566-1581. (HDP)
Rodriguez, A., Dunson, D. B., & Gelfand, A. E. (2008). The Nested Dirichlet Process. JASA, 103(483), 1131-1154. (NDP)
MacEachern, S. N. (1999). Dependent Nonparametric Processes. Proc. Bayesian Stat. Sci. Sect., ASA. (DDP)
MacEachern, S. N. (2000). Dependent Dirichlet Processes. Technical Report, Ohio State.
De Iorio, M., Müller, P., Rosner, G. L., & MacEachern, S. N. (2004). An ANOVA Model for Dependent Random Measures. JASA, 99(465), 205-215.
Müller, P., Quintana, F., & Rosner, G. (2004). A Method for Combining Inference across Related Nonparametric Bayesian Models. JRSS B, 66(3), 735-749.
Dunson, D. B., Pillai, N., & Park, J. H. (2007). Bayesian Density Regression. JRSS B, 69(2), 163-183.
Chung, Y., & Dunson, D. B. (2009). Nonparametric Bayes Conditional Distribution Modeling with Variable Selection. JASA, 104(488), 1646-1660. (Probit stick-breaking)
Griffin, J. E., & Steel, M. F. J. (2006). Order-Based Dependent Dirichlet Processes. JASA, 101(473), 179-194.
Dunson, D. B. (2010a). Nonparametric Bayes Applications to Biostatistics. In Hjort et al. (eds.), Bayesian Nonparametrics. Cambridge.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6(2), 181-214.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3(1), 79-87.
Blei, D. M., & Jordan, M. I. (2006). Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis, 1(1), 121-143.
Jain, S., & Neal, R. M. (2004). A Split-Merge MCMC Procedure for the DPM. JCGS, 13(1), 158-182.
Walker, S. G. (2007). Sampling the Dirichlet Mixture Model with Slices. Communications in Statistics, 36(1), 45-54.
Kalli, M., Griffin, J. E., & Walker, S. G. (2011). Slice Sampling Mixture Models. Statistics and Computing, 21(1), 93-105.
Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, 1(2), 209-230.
Sethuraman, J. (1994). A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4, 639-650.
Hjort, N. L., Holmes, C. C., Müller, P., & Walker, S. G. (eds.) (2010). Bayesian Nonparametrics. Cambridge University Press.
Ghosal, S., & van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
Müller, P., Quintana, F. A., Jara, A., & Hanson, T. (2015). Bayesian Nonparametric Data Analysis. Springer.