Linear models

Linear regression models are fundamental in statistics and data science. When seeking to understand how covariates are associated with outcomes, linear models are among the first, best options. Although other regression approaches are possible, the flexibility and interpretability of linear models make them essential.

This content assumes some familiarity with linear models, and focuses on the implementation of models in R rather than on the theory or interpretation of the models themselves.

This is the first module in the Linear Models topic.

Overview

Learning Objectives

Review fundamentals of linear and generalized linear models, fit models in R, and tidy results for further analysis.

Slide Deck

Linear Models from Jeff Goldsmith.

Video Lecture

Example

I’ll write code for today’s content in a new R Markdown document called linear_models.Rmd in a linear_models directory / repo. The code chunk below loads some usual packages and sets a seed for reproducibility.

library(tidyverse)
library(p8105.datasets)

set.seed(1)

Model fitting

The code below loads and cleans the Airbnb data, which we’ll use as a primary example for fitting linear models.

data("nyc_airbnb")

nyc_airbnb = 
  nyc_airbnb |> 
  mutate(stars = review_scores_location / 2) |> 
  rename(
    borough = neighbourhood_group,
    neighborhood = neighbourhood) |> 
  filter(borough != "Staten Island") |> 
  select(price, stars, borough, neighborhood, room_type)

A good place to start is to consider price as an outcome that may depend on rating and borough. We fit that initial model in the following code.

fit = lm(price ~ stars + borough, data = nyc_airbnb)

The lm function begins with the formula specification – outcome on the left of the ~ and predictors separated by + on the right. As we’ll see shortly, interactions between variables can be specified using *. You can also specify an intercept-only model (outcome ~ 1), a model with no intercept (outcome ~ 0 + …), and a model using all available predictors (outcome ~ .).
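The formula variants described above can be sketched as follows. To keep the example self-contained it uses the built-in mtcars data rather than the Airbnb data; that choice, and the object names, are assumptions made for illustration.

```r
# Formula variants on the built-in mtcars data (illustrative stand-in for nyc_airbnb)
fit_main   = lm(mpg ~ wt + hp, data = mtcars)   # additive predictors
fit_inter  = lm(mpg ~ wt * hp, data = mtcars)   # main effects plus their interaction
fit_null   = lm(mpg ~ 1, data = mtcars)         # intercept only
fit_no_int = lm(mpg ~ 0 + wt, data = mtcars)    # no intercept
fit_all    = lm(mpg ~ ., data = mtcars)         # every other column as a predictor

names(coef(fit_inter))    # the interaction appears as "wt:hp"
coef(fit_null)            # the intercept-only estimate is the sample mean of mpg
length(coef(fit_all))     # intercept plus one coefficient per remaining column
```

Note that `wt * hp` expands to `wt + hp + wt:hp`, so the main effects are included automatically.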

R will treat categorical (factor) covariates appropriately and predictably: indicator variables are created for each non-reference category and included in your model, and the first factor level is treated as the reference. As with ggplot, being careful with factors is therefore critical!

nyc_airbnb = 
  nyc_airbnb |> 
  mutate(
    borough = fct_infreq(borough),
    room_type = fct_infreq(room_type))

fit = lm(price ~ stars + borough, data = nyc_airbnb)

It’s important to note that changing reference categories won’t change “fit” or statistical significance, but can affect ease of interpretation.
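A quick way to check this for yourself is to refit a model after changing the reference level. The sketch below uses the built-in mtcars data and base R's relevel() as a self-contained stand-in for the Airbnb example; the object names are illustrative.

```r
# Treat cylinder count as a factor; levels are "4", "6", "8", so "4" is the reference
cars = transform(mtcars, cyl = factor(cyl))
fit_ref4 = lm(mpg ~ wt + cyl, data = cars)

# Same model with "8" as the reference level
cars_ref8 = transform(cars, cyl = relevel(cyl, ref = "8"))
fit_ref8 = lm(mpg ~ wt + cyl, data = cars_ref8)

names(coef(fit_ref4))   # indicators for the non-reference levels 6 and 8
names(coef(fit_ref8))   # indicators for the non-reference levels 4 and 6

# Fitted values and R-squared are unchanged; only the coefficient labels
# and the comparisons they encode differ
all.equal(fitted(fit_ref4), fitted(fit_ref8))
c(summary(fit_ref4)$r.squared, summary(fit_ref8)$r.squared)
```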

Tidying output

The output of a lm is an object of class lm – a very specific list that isn’t a dataframe but that can be manipulated using other functions. Some common functions for interacting with lm fits are below, although we omit the output.

summary(fit)
summary(fit)$coef
coef(fit)
fitted.values(fit)

The reason that we omit the output is that it’s a huge pain to deal with. summary produces an object of class summary.lm, which is also a list – that’s how we extracted the coefficients using summary(fit)$coef. coef produces a vector of coefficient values, and fitted.values is a vector of fitted values. None of this is tidy.
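To see what “not tidy” means here, it can help to look at the types involved. This is a small sketch on the built-in mtcars data (an assumption, chosen so the code runs without the Airbnb data).

```r
fit_demo = lm(mpg ~ wt + hp, data = mtcars)

class(fit_demo)                    # "lm"
is.list(fit_demo)                  # TRUE: an lm fit is a list ...
is.data.frame(fit_demo)            # FALSE: ... but not a data frame
class(summary(fit_demo))           # "summary.lm"
dim(summary(fit_demo)$coef)        # a numeric matrix, one row per coefficient
is.vector(coef(fit_demo))          # TRUE: a named numeric vector
length(fitted.values(fit_demo))    # one fitted value per observation used in the fit
```

Each accessor returns a different kind of object (list, matrix, named vector), which is why combining them in a pipeline takes extra work.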

It’s helpful to know about the products of lm and to know there are a range of ways to interact with models in base R. That said, for the most part it’s easiest to use tidy tools.

The broom package has functions for obtaining a quick summary of the model and for cleaning up the coefficient table.

fit |> 
  broom::glance()

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df   logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>    <dbl>  <dbl>  <dbl>
## 1    0.0342        0.0341  182.      271. 6.73e-229     4 -202113. 4.04e5 4.04e5
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

fit |> 
  broom::tidy()

## # A tibble: 5 × 5
##   term            estimate std.error statistic   p.value
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         19.8     12.2       1.63 1.04e-  1
## 2 stars               32.0      2.53     12.7  1.27e- 36
## 3 boroughBrooklyn    -49.8      2.23    -22.3  6.32e-109
## 4 boroughQueens      -77.0      3.73    -20.7  2.58e- 94
## 5 boroughBronx       -90.3      8.57    -10.5  6.64e- 26

Both of these functions produce data frames, which makes it straightforward to include the results in subsequent steps.

fit |> 
  broom::tidy() |> 
  select(term, estimate, p.value) |> 
  mutate(term = str_replace(term, "^borough", "Borough: ")) |> 
  knitr::kable(digits = 3)

term	estimate	p.value
(Intercept)	19.839	0.104
stars	31.990	0.000
Borough: Brooklyn	-49.754	0.000
Borough: Queens	-77.048	0.000
Borough: Bronx	-90.254	0.000

As an aside, broom::tidy works with lots of things, including most of the functions for model fitting you’re likely to run into (survival, mixed models, additive models, …).

Diagnostics

Regression diagnostics can identify issues in model fit, especially related to certain failures in model assumptions. Examining residuals and fitted values is therefore an important component of any modeling exercise.

The modelr package can be used to add residuals and fitted values to a dataframe.

modelr::add_residuals(nyc_airbnb, fit)

## # A tibble: 40,492 × 6
##   price stars borough neighborhood room_type        resid
##   <dbl> <dbl> <fct>   <chr>        <fct>            <dbl>
## 1    99     5 Bronx   City Island  Private room      9.47
## 2   200    NA Bronx   City Island  Private room     NA   
## 3   300    NA Bronx   City Island  Entire home/apt  NA   
## 4   125     5 Bronx   City Island  Entire home/apt  35.5 
## 5    69     5 Bronx   City Island  Private room    -20.5 
## 6   125     5 Bronx   City Island  Entire home/apt  35.5 
## # ℹ 40,486 more rows

modelr::add_predictions(nyc_airbnb, fit)

## # A tibble: 40,492 × 6
##   price stars borough neighborhood room_type        pred
##   <dbl> <dbl> <fct>   <chr>        <fct>           <dbl>
## 1    99     5 Bronx   City Island  Private room     89.5
## 2   200    NA Bronx   City Island  Private room     NA  
## 3   300    NA Bronx   City Island  Entire home/apt  NA  
## 4   125     5 Bronx   City Island  Entire home/apt  89.5
## 5    69     5 Bronx   City Island  Private room     89.5
## 6   125     5 Bronx   City Island  Entire home/apt  89.5
## # ℹ 40,486 more rows

Like many things in the tidyverse, the first argument is a dataframe. That makes it easy to include steps adding residuals or predictions in a pipeline of commands to conduct inspections and perform diagnostics.
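For intuition about what these columns contain, here is a base-R sketch that builds comparable pred and resid columns by hand. It uses mtcars with two artificially missing predictor values (both assumptions for illustration) and is not meant to reproduce modelr's internals.

```r
cars = mtcars
cars$hp[c(2, 5)] = NA                            # introduce missing predictor values
fit_demo = lm(mpg ~ wt + hp, data = cars)        # lm drops incomplete rows by default

cars$pred  = predict(fit_demo, newdata = cars)   # NA wherever a predictor is missing
cars$resid = cars$mpg - cars$pred                # observed minus predicted

nrow(cars)                  # every original row is kept
sum(is.na(cars$resid))      # rows with a missing predictor get NA, as in the output above
```

Keeping all rows, with NA for rows the model could not use, is what makes it safe to add these columns to the original dataframe.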

nyc_airbnb |> 
  modelr::add_residuals(fit) |> 
  ggplot(aes(x = borough, y = resid)) + geom_violin()

nyc_airbnb |> 
  modelr::add_residuals(fit) |> 
  ggplot(aes(x = stars, y = resid)) + geom_point()

This example has some obvious issues, most notably the presence of extremely large outliers in price and a generally skewed residual distribution. There are a few things we might try to do here – including creating a formal rule for the exclusion of outliers, transforming the price variable (e.g. using a log transformation), or fitting a model that is robust to outliers. Dealing with these issues isn’t really the purpose of this class, though, so we’ll note the issues and move on; shortly we’ll look at using the bootstrap for inference in cases like this, where standard approaches to inference may fail.

(For what it’s worth, I’d probably use a combination of median regression, which is less sensitive to outliers than OLS, and maybe bootstrapping for inference. If that’s not feasible, I’d omit rentals with price over $1000 (< 0.5% of the sample) from the primary analysis and examine these separately. I usually avoid transforming the outcome, because the resulting model is difficult to interpret.)

Hypothesis testing

We’ll comment briefly on hypothesis testing. Model summaries include results of t-tests for single coefficients, and are the standard way of assessing statistical significance.

Testing multiple coefficients is somewhat more complicated. A useful approach is to use nested models, meaning that the terms in a simple “null” model are a subset of the terms in a more complex “alternative” model. There are formal tests for comparing the null and alternative models, even when several coefficients are added in the alternative model. Tests of this kind are required to assess the significance of a categorical predictor with more than two levels, as in the example below.

fit_null = lm(price ~ stars + borough, data = nyc_airbnb)
fit_alt = lm(price ~ stars + borough + room_type, data = nyc_airbnb)

The test of interest is implemented in the anova() function which, of course, can be summarized using broom::tidy().

anova(fit_null, fit_alt) |> 
  broom::tidy()

## # A tibble: 2 × 7
##   term                        df.residual    rss    df   sumsq statistic p.value
##   <chr>                             <dbl>  <dbl> <dbl>   <dbl>     <dbl>   <dbl>
## 1 price ~ stars + borough           30525 1.01e9    NA NA            NA       NA
## 2 price ~ stars + borough + …       30523 9.21e8     2  8.42e7     1394.       0

Note that this works for nested models only. Comparing non-nested models is a common problem that requires other methods; we’ll see one approach in cross validation.
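For two nested linear models, the statistic reported by anova() is the usual F statistic, which compares the drop in residual sum of squares to the residual variance of the larger model. The sketch below recomputes it by hand on the built-in mtcars data; the data and object names are illustrative assumptions.

```r
cars = transform(mtcars, cyl = factor(cyl))
fit_small = lm(mpg ~ wt, data = cars)
fit_large = lm(mpg ~ wt + cyl, data = cars)     # adds two indicator coefficients

rss_small = sum(residuals(fit_small)^2)
rss_large = sum(residuals(fit_large)^2)
n_added   = df.residual(fit_small) - df.residual(fit_large)   # coefficients added

# F = [(RSS_small - RSS_large) / n_added] / [RSS_large / residual df of larger model]
f_stat = ((rss_small - rss_large) / n_added) / (rss_large / df.residual(fit_large))
p_val  = pf(f_stat, n_added, df.residual(fit_large), lower.tail = FALSE)

anova_tab = anova(fit_small, fit_large)
c(by_hand = f_stat, from_anova = anova_tab$F[2])
```

The two numbers agree, and a single test covers both added indicators, which is what makes this approach suitable for categorical predictors with more than two levels.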

Nesting data

We’ll now turn our attention to fitting models to datasets nested within variables – meaning, essentially, that we’ll use nest() to create a list column containing datasets and fit separate models to each. This is very different from fitting nested models, even though the terminology is similar.

In the airbnb data, we might think that star ratings and room type affect price differently in each borough. One way to allow this kind of effect modification is through interaction terms:

nyc_airbnb |> 
  lm(price ~ stars * borough + room_type * borough, data = _) |> 
  broom::tidy() |> 
  knitr::kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	95.694	19.184	4.988	0.000
stars	27.110	3.965	6.838	0.000
boroughBrooklyn	-26.066	25.080	-1.039	0.299
boroughQueens	-4.118	40.674	-0.101	0.919
boroughBronx	-5.627	77.808	-0.072	0.942
room_typePrivate room	-124.188	2.996	-41.457	0.000
room_typeShared room	-153.635	8.692	-17.676	0.000
stars:boroughBrooklyn	-6.139	5.237	-1.172	0.241
stars:boroughQueens	-17.455	8.539	-2.044	0.041
stars:boroughBronx	-22.664	17.099	-1.325	0.185
boroughBrooklyn:room_typePrivate room	31.965	4.328	7.386	0.000
boroughQueens:room_typePrivate room	54.933	7.459	7.365	0.000
boroughBronx:room_typePrivate room	71.273	18.002	3.959	0.000
boroughBrooklyn:room_typeShared room	47.797	13.895	3.440	0.001
boroughQueens:room_typeShared room	58.662	17.897	3.278	0.001
boroughBronx:room_typeShared room	83.089	42.451	1.957	0.050

This works, but the output takes time to think through – the expected change in price comparing an entire apartment to a private room in Queens, for example, involves the main effect of room type and the Queens / private room interaction.

Alternatively, we can nest within boroughs and fit borough-specific models associating price with rating and room type:

nest_lm_res =
  nyc_airbnb |> 
  nest(data = -borough) |> 
  mutate(
    models = map(data, \(df) lm(price ~ stars + room_type, data = df)),
    results = map(models, broom::tidy)) |> 
  select(-data, -models) |> 
  unnest(results)

The results of this approach are given in the table below.

nest_lm_res |> 
  select(borough, term, estimate) |> 
  mutate(term = fct_inorder(term)) |> 
  pivot_wider(
    names_from = term, values_from = estimate) |> 
  knitr::kable(digits = 3)

borough	(Intercept)	stars	room_typePrivate room	room_typeShared room
Bronx	90.067	4.446	-52.915	-70.547
Queens	91.575	9.654	-69.255	-94.973
Brooklyn	69.627	20.971	-92.223	-105.839
Manhattan	95.694	27.110	-124.188	-153.635

The estimates here are the same as those in the model containing interactions, but are easier to extract from the output.

Fitting models to nested datasets is a way of performing stratified analyses. These have a tradeoff: stratified models make it easy to interpret covariate effects in each stratum, but don’t provide a mechanism for assessing the significance of differences across strata.
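The correspondence between stratified fits and a model in which every term interacts with the stratum can be checked directly. The sketch below does this in base R on the built-in mtcars data, with transmission type standing in for borough; these choices are assumptions for illustration.

```r
cars = transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# One model with everything interacting with the stratum variable
fit_inter = lm(mpg ~ wt * am, data = cars)
b = coef(fit_inter)

# Separate (stratified) models
fit_auto   = lm(mpg ~ wt, data = subset(cars, am == "automatic"))
fit_manual = lm(mpg ~ wt, data = subset(cars, am == "manual"))

# Reference stratum: main effects. Other stratum: main effect + interaction.
from_interaction = rbind(
  automatic = c(b["(Intercept)"], b["wt"]),
  manual    = c(b["(Intercept)"] + b["ammanual"], b["wt"] + b["wt:ammanual"]))
from_strata = rbind(automatic = coef(fit_auto), manual = coef(fit_manual))

from_interaction
from_strata
```

The point estimates match, but standard errors generally do not, because the interaction model pools a single residual variance across strata.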

An even more extreme example is the assessment of neighborhood effects in Manhattan. The code chunk below fits neighborhood-specific models:

manhattan_airbnb =
  nyc_airbnb |> 
  filter(borough == "Manhattan")

manhattan_nest_lm_res =
  manhattan_airbnb |> 
  nest(data = -neighborhood) |> 
  mutate(
    models = map(data, \(df) lm(price ~ stars + room_type, data = df)),
    results = map(models, broom::tidy)) |> 
  select(-data, -models) |> 
  unnest(results)

And the chunk below shows neighborhood-specific estimates for the coefficients related to room type.

manhattan_nest_lm_res |> 
  filter(str_detect(term, "room_type")) |> 
  ggplot(aes(x = neighborhood, y = estimate)) + 
  geom_point() + 
  facet_wrap(~term) + 
  theme(axis.text.x = element_text(angle = 80, hjust = 1))

There is, generally speaking, a reduction in room price for a private room or a shared room compared to an entire apartment, but this varies quite a bit across neighborhoods.

With this many factor levels, it really isn’t a good idea to fit models with main effects or interactions for each. Instead, you’d be best-off using a mixed model, with random intercepts and slopes for each neighborhood. Although it’s well beyond the scope of this class, code to fit a mixed model with neighborhood-level random intercepts and random slopes for room type is below. And, of course, we can tidy the results using a mixed-model spinoff of the broom package.

manhattan_airbnb |> 
  lme4::lmer(price ~ stars + room_type + (1 + room_type | neighborhood), data = _) |> 
  broom.mixed::tidy()

## # A tibble: 11 × 6
##    effect   group        term                       estimate std.error statistic
##    <chr>    <chr>        <chr>                         <dbl>     <dbl>     <dbl>
##  1 fixed    <NA>         (Intercept)                 250.        26.6      9.41 
##  2 fixed    <NA>         stars                        -3.16       5.00    -0.631
##  3 fixed    <NA>         room_typePrivate room      -124.         7.80   -15.9  
##  4 fixed    <NA>         room_typeShared room       -157.        12.9    -12.2  
##  5 ran_pars neighborhood sd__(Intercept)              59.3       NA       NA    
##  6 ran_pars neighborhood cor__(Intercept).room_typ…   -0.987     NA       NA    
##  7 ran_pars neighborhood cor__(Intercept).room_typ…   -1.000     NA       NA    
##  8 ran_pars neighborhood sd__room_typePrivate room    36.7       NA       NA    
##  9 ran_pars neighborhood cor__room_typePrivate roo…    0.992     NA       NA    
## 10 ran_pars neighborhood sd__room_typeShared room     43.6       NA       NA    
## 11 ran_pars Residual     sd__Observation             198.        NA       NA

Mixed models are pretty great!

Binary outcomes

Linear models are appropriate for outcomes that follow a continuous distribution, but binary outcomes are common. In these cases, logistic regression is a useful analytic framework.

The Washington Post has gathered data on homicides in 50 large U.S. cities and made the data available through a GitHub repository; the final CSV is here. You can read their accompanying article here. We’ll use data on unresolved murders in Baltimore, MD to illustrate logistic regression in R. The code below imports, cleans, and generally wrangles the data for analysis.

baltimore_df = 
  read_csv("data/homicide-data.csv") |> 
  filter(city == "Baltimore") |> 
  mutate(
    resolved = as.numeric(disposition == "Closed by arrest"),
    victim_age = as.numeric(victim_age),
    victim_race = fct_relevel(victim_race, "White")) |> 
  select(resolved, victim_age, victim_race, victim_sex)

## Rows: 52179 Columns: 12
## ── Column specification ───────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): uid, victim_last, victim_first, victim_race, victim_age, victim_sex...
## dbl (3): reported_date, lat, lon
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Using these data, we can fit a logistic regression for the binary “resolved” outcome and victim demographics as predictors. This uses the glm function with the family specified to account for the non-Gaussian outcome distribution.

fit_logistic = 
  baltimore_df |> 
  glm(resolved ~ victim_age + victim_race + victim_sex, data = _, family = binomial())

Many of the same tools we used to work with lm fits can be used for glm fits. The table below summarizes the coefficients from the model fit; because logistic model estimates are log odds ratios, we include a step to compute odds ratios as well.

fit_logistic |> 
  broom::tidy() |> 
  mutate(OR = exp(estimate)) |>
  select(term, log_OR = estimate, OR, p.value) |> 
  knitr::kable(digits = 3)

term	log_OR	OR	p.value
(Intercept)	1.190	3.287	0.000
victim_age	-0.007	0.993	0.027
victim_raceAsian	0.296	1.345	0.653
victim_raceBlack	-0.842	0.431	0.000
victim_raceHispanic	-0.265	0.767	0.402
victim_raceOther	-0.768	0.464	0.385
victim_sexMale	-0.880	0.415	0.000

Homicides in which the victim is Black are substantially less likely to be resolved than those in which the victim is white; for other races the effects are not significant, possibly due to small sample sizes. Homicides in which the victim is male are significantly less likely to be resolved than those in which the victim is female. The effect of age is statistically significant, but careful data inspections should be conducted before interpreting too deeply.
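When reporting odds ratios it is often useful to show an interval alongside the point estimate. One common option is to exponentiate a Wald interval for the log odds ratio. The sketch below does this in base R on the built-in mtcars data (a stand-in chosen so the code runs without the homicide CSV); the column names ci_low and ci_high are illustrative.

```r
# Logistic regression: is the car a manual (am = 1) as a function of weight?
fit_demo = glm(am ~ wt, data = mtcars, family = binomial())

ci = exp(confint.default(fit_demo))      # Wald interval on the log-odds scale, exponentiated

or_table = data.frame(
  term    = names(coef(fit_demo)),
  log_OR  = unname(coef(fit_demo)),
  OR      = unname(exp(coef(fit_demo))),
  ci_low  = unname(ci[, 1]),
  ci_high = unname(ci[, 2]))

or_table
```

An odds ratio below 1 with an interval that excludes 1 corresponds to a negative log odds ratio that is significant at the matching level.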

We can also compute fitted values; similarly to the estimates in the model summary, these are expressed as log odds and can be transformed to produce probabilities for each subject.

baltimore_df |> 
  modelr::add_predictions(fit_logistic) |> 
  mutate(fitted_prob = boot::inv.logit(pred))

## # A tibble: 2,827 × 6
##   resolved victim_age victim_race victim_sex   pred fitted_prob
##      <dbl>      <dbl> <fct>       <chr>       <dbl>       <dbl>
## 1        0         17 Black       Male       -0.654       0.342
## 2        0         26 Black       Male       -0.720       0.327
## 3        0         21 Black       Male       -0.683       0.335
## 4        1         61 White       Male       -0.131       0.467
## 5        1         46 Black       Male       -0.864       0.296
## 6        1         27 Black       Male       -0.727       0.326
## # ℹ 2,821 more rows
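The inverse logit used above is the map from log odds x to the probability 1 / (1 + exp(-x)). In base R this is plogis(), and predict() can also return probabilities directly. The sketch below compares the two on the built-in mtcars data, used here only as a self-contained stand-in.

```r
fit_demo = glm(am ~ wt, data = mtcars, family = binomial())

log_odds    = predict(fit_demo, newdata = mtcars)                     # default type = "link"
prob_manual = plogis(log_odds)                                        # inverse logit by hand
prob_direct = predict(fit_demo, newdata = mtcars, type = "response")  # probabilities directly

all.equal(unname(prob_manual), unname(prob_direct))
range(prob_manual)      # probabilities stay strictly between 0 and 1
```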

Other materials

This page touches on ideas that arise in several chapters on modeling in R for Data Science. These tend to assume that this is your first exposure to linear models, but they are good reading.
The modelr package also has a website.

The code that I produced working through examples in lecture is here.

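In the context of this document, here is why mixed models are “great”:

0.1 Limitations of the earlier approaches

The problem: we want to fit price models for each neighborhood in Manhattan.

Interaction approach (price ~ stars * borough + room_type * borough): works, but the output is hard to interpret.
Stratified approach (separate models per neighborhood via nest()): easy to interpret within each stratum, but with two problems:
- there are dozens of neighborhoods, so estimating separate coefficients for each risks overfitting
- there is no way to test whether differences across strata are statistically significant

0.2 What a mixed model solves

lme4::lmer(price ~ stars + room_type + (1 + room_type | neighborhood), data = manhattan_airbnb)

Component	Meaning
`stars + room_type`	Fixed effects: the average effects across all neighborhoods
`(1 | neighborhood)`	Random intercept per neighborhood: the baseline price differs by neighborhood
`(room_type | neighborhood)`	Random slope per neighborhood: the room_type effect differs by neighborhood

Key advantage: partial pooling
- Neighborhoods with little data are shrunk toward the overall mean, which guards against overfitting
- Neighborhoods with a lot of data rely more on their own data
- The resulting estimates are more stable than those from stratified models

In short, mixed models are great because they address the weaknesses of both approaches once there are too many neighborhoods for the interaction or stratified approach to handle well.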
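Partial pooling can be illustrated with the textbook shrinkage formula for a random-intercept model with known variances: a group's estimate is the overall mean plus a weight n * tau2 / (n * tau2 + sigma2) times the gap between the group's raw mean and the overall mean. The sketch below is base R only. The variance components, overall mean, and group sizes are made-up values for illustration, not estimates from the Airbnb data, and lmer estimates these quantities jointly rather than taking them as given.

```r
# Assumed (not estimated) quantities, for illustration only
tau2   = 50^2      # between-neighborhood variance of intercepts
sigma2 = 200^2     # residual variance
mu     = 180       # overall mean price

groups = data.frame(
  n_obs    = c(5, 50, 5000),         # listings in three hypothetical neighborhoods
  raw_mean = c(300, 300, 300))       # same raw mean price in each

# Shrinkage weight: close to 0 for small groups, close to 1 for large groups
groups$weight   = groups$n_obs * tau2 / (groups$n_obs * tau2 + sigma2)
groups$shrunken = mu + groups$weight * (groups$raw_mean - mu)

groups
```

With identical raw means, the smallest group is pulled furthest toward the overall mean, while the largest group keeps an estimate close to its own data.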

0.3 Breaking down `(1 + room_type | neighborhood)`

This expression is not an interaction term. It is lme4's random-effects syntax.

0.3.1 Syntax

(what | by which group)

That is, “for each neighborhood, estimate [what] as a random effect.”

0.3.2 Meaning of each part

Part	Meaning
`1`	Random intercept: the baseline price (intercept) differs by neighborhood
`room_type`	Random slope: the size of the room_type effect differs by neighborhood
`| neighborhood`	Both of the above vary at the neighborhood level

0.3.3 A concrete example

With fixed effects only (no mixed model): \[\text{price} = \beta_0 + \beta_1 \cdot \text{room\_type}\]

The same $\beta_0$ and $\beta_1$ apply to every neighborhood.

After adding (1 + room_type | neighborhood): \[\text{price} = (\beta_0 + b_{0,j}) + (\beta_1 + b_{1,j}) \cdot \text{room\_type}\]

$b_{0,j}$: the intercept adjustment for neighborhood $j$ (random intercept)
$b_{1,j}$: the room_type effect adjustment for neighborhood $j$ (random slope)

For example:
- Upper West Side: a private room is $150 cheaper than an entire apartment
- Harlem: a private room is $80 cheaper than an entire apartment

A random slope models exactly this: the same room_type can have a different effect in each neighborhood.

0.3.4 How this differs from an interaction

	Interaction (`room_type * neighborhood`)	Random slope (`room_type | neighborhood`)
Estimation	A separate coefficient is estimated independently for each neighborhood	A common distribution ($\sim N(0, \sigma^2)$) is assumed and estimates are partially pooled
With many neighborhoods	Overfitting, unstable coefficients	Groups with little data shrink toward the mean, giving stable estimates
Number of parameters	Grows with the number of neighborhoods	Only the variance parameters of the distribution are estimated

In other words, room_type | neighborhood is a probabilistic assumption that the room_type effect varies randomly across neighborhoods, whereas an interaction estimates each neighborhood's effect separately as a fixed value.

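Yes, the parentheses of a random-effects term can contain several variables.

0.3.5 Basic syntax

(variable1 + variable2 + ... | grouping_variable)

0.3.6 Examples

# random intercept only
(1 | neighborhood)

# random intercept + room_type slope
(1 + room_type | neighborhood)

# random intercept + slopes for several variables
(1 + room_type + stars | neighborhood)

# slope only, no intercept
(0 + room_type | neighborhood)

0.3.7 A caution

Each added variable makes the variance-covariance matrix that has to be estimated grow quickly.

With $k$ random-effect terms, the covariance matrix has $\frac{k(k+1)}{2}$ parameters. Here $k$ counts the columns of the random-effects design matrix, so a three-level factor such as room_type contributes two indicator columns. This matches the lmer output above, which reports three standard deviations and three correlations for neighborhood.

Random-effects specification	Number of parameters
`1` ($k = 1$)	1 (one variance)
`1 + room_type` ($k = 3$)	6 (3 variances + 3 covariances)
`1 + room_type + stars` ($k = 4$)	10 (4 variances + 6 covariances)

If there is not enough data, the model may fail to converge or become unstable, so in practice it is common to include only the random effects that are really needed.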
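The parameter count can be checked mechanically. The key detail is that k is the number of columns the random-effects formula expands to, which base R's model.matrix() can report. The sketch below uses a tiny made-up data frame with the same three room types; the helper name n_cov_params is hypothetical.

```r
# Number of free parameters in a k x k covariance matrix
n_cov_params = function(k) k * (k + 1) / 2

# Tiny illustrative data frame: one row per room type
demo = data.frame(
  room_type = factor(c("Entire home/apt", "Private room", "Shared room")),
  stars     = c(4, 4.5, 5))

# Columns generated by each random-effects specification
k_int  = ncol(model.matrix(~ 1, demo))                       # intercept only
k_room = ncol(model.matrix(~ 1 + room_type, demo))           # intercept + 2 indicators
k_both = ncol(model.matrix(~ 1 + room_type + stars, demo))   # ... + stars

data.frame(
  spec     = c("1", "1 + room_type", "1 + room_type + stars"),
  k        = c(k_int, k_room, k_both),
  n_params = n_cov_params(c(k_int, k_room, k_both)))
```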

0.3.8 Several grouping variables are also possible

# random intercept per neighborhood + random intercept per hospital
(1 | neighborhood) + (1 | hospital)

By changing the grouping variable after the |, several random-effect terms can be included in the same model.

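Yes, that is the right direction. To be a bit more precise:

0.4 “Unstable estimation” rather than overfitting

In mixed models, the first problem to appear is usually not classical overfitting but a lack of data to estimate the parameters reliably.

0.4.1 What actually happens

Suppose there are 50 neighborhoods and we include (1 + room_type + stars | neighborhood):

- Covariance matrix parameters to estimate: 10 (room_type has three levels, so $k = 4$)
- Some neighborhoods may have only 5 to 10 observations
- These parameters describe variation across neighborhoods, so when groups are few or small there is little information about them and the fit can fail to converge

0.4.2 Symptoms

Symptom	Meaning
`Model failed to converge` warning	The optimizer could not finish estimating the parameters
`singular fit` warning	A variance is estimated near 0 or a correlation near ±1, a sign that the random-effects structure is more complex than the data support
Extremely large or small estimates	Unstable estimation

0.4.3 Relationship to overfitting

Strictly speaking:
- Overfitting of fixed effects: the parameters are fit too closely to the training data
- The random-effects problem: the variance-covariance estimates themselves are impossible or unstable

Still, adding too many random slopes lets each group's slope vary too freely, which ends up looking a lot like overfitting. Partial pooling in a mixed model limits this to some extent, but with too little data that protection runs out.

0.4.4 A practical rule

“Keep it maximal” (Barr et al., 2013) vs. a practical approach

In theory, including as many random effects as possible is preferred, but in practice the realistic criterion is the most complex model that still converges.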

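Good question. Mixed models can look like the peak of traditional statistics, but they have clear limitations, and there are methods that address them.

0.5 Limitations of mixed models

0.5.1 1. Sensitivity to distributional assumptions

Random effects are assumed to follow a normal distribution: \[b_j \sim N(0, \sigma^2)\] If real data violate this assumption, the estimates can be distorted.

0.5.2 2. Difficulty with nonlinear relationships

If the relationship between variables is a complex curve, the linearity assumption breaks down.

0.5.3 3. Weakness with high-dimensional data

With hundreds or thousands of variables, variable selection itself becomes a problem.

0.5.4 4. No causal inference on its own

With observational data, a mixed model cannot fully resolve confounding.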

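0.6 Models beyond the mixed model

0.6.1 Traditional statistics (formula-based)

Model	What it adds relative to a mixed model
GAM (Generalized Additive Model)	Flexible modeling of nonlinear relationships with splines
GAMM (GAM + Mixed)	GAM combined with random effects
Quantile Regression	Predicts quantiles (such as the median) instead of the mean; robust to outliers
Survival Model (Cox)	Time-to-event data (death, relapse, and so on)
Bayesian Hierarchical Model	Bayesian version of the mixed model; handles uncertainty explicitly through prior distributions

0.6.2 Machine learning (pattern-based)

Model	Characteristics
Regularized Regression (Lasso, Ridge, Elastic Net)	Shrinks coefficients in high-dimensional data; Lasso and Elastic Net also select variables automatically
Random Forest / XGBoost	Automatically handles nonlinear relationships and complex interactions between variables
Neural Network	Can learn extremely complex patterns

0.7 The big picture

Linear regression
    ↓ (binary or count outcomes)
GLM (logistic, Poisson...)
    ↓ (clustered / repeated-measures data)
Mixed Model / GEE
    ↓ (nonlinear relationships)
GAM / GAMM
    ↓ (explicit uncertainty)
Bayesian Hierarchical Model

Regularized regression (Lasso and similar) sits on the boundary between traditional statistics and machine learning, and that is the direction taken later in this course in cross_validation.html and stat_learning.html.

In conclusion, the mixed model is close to the peak of “interpretable traditional statistics”, but problems involving nonlinearity, high dimensionality, or causal inference call for GAMs, Bayesian models, or ML methods.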

0.8 Mapping the structure of a mixed model onto deep learning (DL)

A mixed model handles two core structures:
1. Group-level variation (effects that differ by neighborhood)
2. Partial pooling (groups with little data shrink toward the overall mean)

In DL these are handled as follows.

0.9 Corresponding techniques

0.9.1 1. Entity Embedding (the most direct counterpart)

# Keras example (n_neighborhoods and embed_dim are illustrative values)
from tensorflow.keras.layers import Input, Embedding
n_neighborhoods, embed_dim = 32, 4
neighborhood_input = Input(shape=(1,))
neighborhood_embed = Embedding(n_neighborhoods, embed_dim)(neighborhood_input)

Mixed Model	Entity Embedding
Random intercept per neighborhood	Embedding vector per neighborhood
Partial pooling	Weight regularization (L2)
`(1 | neighborhood)`	`Embedding(n_neighborhoods, 1)`
`(1 + room_type | neighborhood)`	`Embedding(n_neighborhoods, k)`

An embedding is, in effect, a learnable random effect.

0.9.2 2. Neural Mixed Effects Model

Work that explicitly integrates mixed models into DL:

NLME-Net: a neural network for the nonlinear fixed effects, with the random effects kept
DeepMind's Neural Process: models group-level uncertainty as a latent variable

Traditional mixed model:  y = Xβ + Zb + ε
Neural version:           y = NN(X) + Zb + ε
                              ↑ the fixed-effects part Xβ is replaced by a neural network

0.9.3 3. Hierarchical / Multi-task Learning

A natural fit for data with group structure:

                 [shared layers]
                /      |      \
        [Bronx]  [Manhattan]  [Brooklyn]
         head       head         head

Shared layers = fixed effects (common to all groups)
Group-specific heads = random effects (group-level adjustments)

0.9.4 4. Attention / Transformer (time-series / panel data)

For repeated-measures data (the same subject measured several times):

Mixed Model	Transformer
Subject-level random effect	Self-attention links observations within the same subject
Correlation structure over time	Positional encoding + attention

0.9.5 5. Bayesian Neural Network (the most complete in theory)

Extends the Bayesian interpretation of the mixed model to DL:

\[\text{Mixed Model} \subset \text{Bayesian Hierarchical} \subset \text{Bayesian NN}\]

Random effects = placing a prior distribution on weights
Implemented with Pyro, TensorFlow Probability, NumPyro, and similar libraries

0.10 Summary

Situation	Recommended DL technique
Categorical group variable (neighborhood, etc.)	Entity Embedding
Different patterns per group	Multi-task Learning
Repeated measures, time series	Transformer / LSTM
Quantifying uncertainty matters	Bayesian Neural Network
Only the fixed effects should become nonlinear	Neural Mixed Effects

In practice, Entity Embedding is the most widely used, and it is close to a standard for handling categorical variables in tabular data in competitions such as Kaggle. Recommender systems such as those at Netflix and Spotify have this same structure (user/item embedding = random intercept).

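0.11 DL models for personalization

The core problem of personalization is how to model responses that differ from person to person.

0.12 Approaches by data type

0.12.1 1. User-item interaction data (clicks, purchases, views, and so on)

Matrix Factorization → Neural CF

Traditional: User embedding · Item embedding = preference score
DL:          NN(User embedding, Item embedding) = preference score

Model	Characteristics
Neural Collaborative Filtering (NCF)	Replaces MF with a neural network; learns nonlinear interactions
Two-Tower Model	Separate user tower and item tower; a standard for large-scale services (YouTube, Google)
DeepFM	Combines FM (feature interaction) with a DNN

0.12.2 2. Sequential data (when the order of actions matters)

“If this person viewed A→B→C, they will view D next”

Model	Characteristics
SASRec	Self-attention based; handles long sequences
BERT4Rec	Applies BERT to recommendation; masked item prediction
GRU4Rec	RNN-based session recommendation

0.12.3 3. Rich features (age, sex, region, context, and so on)

Model	Characteristics
Wide & Deep (Google, 2016)	Combines Wide (memorization) and Deep (generalization); used for app store recommendations
DCN (Deep & Cross Network)	Learns feature interactions automatically
xDeepFM	Explicit + implicit feature interactions

0.12.4 4. The cold-start problem (new users / items)

When there is no existing interaction data:

Model	Characteristics
Meta-Learning (MAML)	Fast personalization from little data; “few-shot personalization”
DropoutNet	Deliberately drops inputs during training to be robust to cold start
LLM + RAG	Recommendations from text-based user profiles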

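0.12.5 5. Recent trend: LLM-based personalization

[user behavior logs] → converted into an LLM prompt → personalized recommendation / response

Approach	Example
Prompt-based	Pass the user's history as text and ask GPT for a recommendation
Fine-tuned LLM	Further train the LLM on personal data
LLM + Retrieval	Retrieve relevant information from a user-profile database and inject it into the LLM

0.13 Models used by actual services

Service	Model
YouTube	Two-Tower + Sequential
Netflix	Neural CF + Contextual Bandits
Spotify	Session-based (RNN/Transformer)
TikTok	Multi-task DNN (jointly optimizes click-through rate and watch time)
Amazon	BERT4Rec + Graph Neural Network

0.14 Which one to choose

Lots of interaction data + order matters?          → Sequential (SASRec, BERT4Rec)
Lots of interaction data + order does not matter?  → Two-Tower / NCF
Rich features?                                     → Wide & Deep / DCN
Many new users?                                    → Meta-Learning / LLM
Unstructured data such as text or images?          → Multi-modal + LLM

The current industry standard is a two-stage structure: Two-Tower narrows down the candidates, then DCN/xDeepFM performs fine-grained ranking.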

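0.15 The honest reality

0.15.1 What most companies actually do

Top 1%    (Google, Netflix, TikTok)   → Two-Tower, BERT4Rec, in-house research
Top 5%    (mid-sized tech companies)  → roughly Wide & Deep, NCF
Top 20%   (growth-stage startups)     → simple MF, LightFM
Other 80% (most companies)            → SQL aggregation, rule-based logic, A/B tests

0.15.2 Why advanced methods go unused

1. There is no data
- Personalization DL usually needs millions of interaction records or more
- Most companies do not reach that scale

2. ROI
- Simple rule-based system → about one week to build
- Two-Tower model → months to build and operate, plus an MLOps team
- The performance gap is often smaller than expected

3. Operational complexity
- Model retraining, monitoring, and feature pipelines have to be maintained
- For a small team, maintenance is the bigger burden

4. Explainability requirements
- In medicine, finance, and law, regulation requires explaining why a decision was made
- Black-box models are hard to use there

0.15.3 Reality by field

Field	What is actually used most
Large tech / platforms	Advanced DL recommender systems
Small and mid-sized e-commerce	Collaborative filtering, simple embeddings
Medicine / biotech	Linear/Logistic and Mixed Models are still mainstream
Public sector / policy	Regression and survival analysis are standard
Finance	Logistic + XGBoost (interpretability matters)
Early-stage startups	Rule-based first, then ML once data accumulates

0.15.4 The key perspective

“A complex model is hard to justify until it clearly beats the simplest one.”

Well-known rules of thumb from industry:

Netflix: the winning model of its US$1 million recommendation contest (the Netflix Prize) was never put into production, because the operational complexity outweighed the gain in accuracy
Google: “Most of our gains come from better data, not better models”

0.15.5 Conclusion

For the Linear Models and Mixed Models covered here:
- They are still the standard in medicine, public health, and the social sciences
- They are preferable to DL when interpretability, small data, or a regulated setting matters
- Advanced DL pays off only when the data and infrastructure are there to support it

That this course (p8105) goes as far as mixed models reflects standard practice in public health.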

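0.16 What hyperpersonalization is

If ordinary personalization says “this group of users likes this”, hyperpersonalization asks “what does this person want right now, in this context?”

0.17 Models by core component

0.17.1 1. Reflecting real-time context

Uses the current state, not just past data:

Context	Method
Current session behavior	Session-based Transformer (SASRec)
Time / weather / location	Added to the embedding as contextual features
Real-time emotion / intent	NLP analysis of the current query

0.17.2 2. Multi-task Learning

Optimizing click-through rate alone promotes clickbait, so several objectives are optimized at once:

                    [shared layers]
                   /      |      \
        click-through   watch time   satisfaction
        prediction      prediction   prediction

This is the structure that TikTok and YouTube actually use.

0.17.3 3. Causal ML (causality-based personalization)

“Does giving this person a coupon really have an effect?”

Personalization based on causation rather than correlation:

Model	Role
Uplift Model	Estimates the actual per-person effect of an intervention (coupon / notification)
Double ML	Estimates the treatment effect after removing confounding
Meta-learner (T/S/X-learner)	Estimates the CATE (conditional average treatment effect, the effect for people with given characteristics)

Ordinary recommendation:  "this person is likely to buy"
Causal:                   "this person buys only when given a coupon (a coupon is wasted on someone who would buy anyway)"

0.17.4 4. LLM-based hyperpersonalization (currently the hottest area)

[user profile]  +  [real-time context]  +  [item information]
        ↓
    LLM prompt
        ↓
  personalized copy / recommendation / response

Technique	Description
RAG + personal profile	Store user history in a vector DB, retrieve it on request, and inject it into the LLM
Fine-tuned LLM	Further training on personal data
LoRA per user	Train a lightweight adapter layer per user (research stage)

0.17.5 5. Reinforcement Learning (long-term personalization)

Optimizes long-term satisfaction rather than short-term clicks:

Model	Description
Contextual Bandit	Chooses the action suited to each user and situation; balances exploration and exploitation
Deep RL (DQN, PPO)	Optimizes long-term reward (retention, LTV)

Netflix's thumbnail A/B testing and Spotify's automatic playlist generation belong to the bandit family.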

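0.18 Overall architecture (as deployed in production)

[Data collection]
real-time behavior logs + historical data + external context
        ↓
[Feature Store]
user embeddings, item embeddings, real-time features
        ↓
[Retrieval stage]  ← Two-Tower / ANN search
millions of candidates → hundreds
        ↓
[Ranking stage]    ← DCN / Multi-task DNN
hundreds → top 10
        ↓
[Re-ranking]       ← Causal ML / RL / LLM
accounts for diversity, fairness, and long-term value
        ↓
[Content generation]  ← LLM
generates personalized copy, explanations, and UI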
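The retrieval-then-ranking funnel is easier to see at toy scale. In the sketch below a dot product over made-up 2-dimensional embeddings stands in for Two-Tower + ANN search, and a weighted sum of two made-up predictions stands in for a multi-task ranking model; all item names and numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_vec, item_vecs, k):
    """Stage 1: cheap similarity score over the full catalog, keep the top k."""
    by_similarity = sorted(item_vecs, key=lambda item: dot(user_vec, item_vecs[item]),
                           reverse=True)
    return by_similarity[:k]

def rank(candidates, p_click, p_watch, top_n, w_click=0.5, w_watch=0.5):
    """Stage 2: a costlier score, evaluated on the shortlist only."""
    def score(item):
        return w_click * p_click[item] + w_watch * p_watch[item]
    return sorted(candidates, key=score, reverse=True)[:top_n]

# Hypothetical embeddings and model predictions.
user_vec = [1.0, 0.0]
item_vecs = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9], "d": [0.7, 0.2]}
p_click = {"a": 0.9, "b": 0.4, "c": 0.9, "d": 0.6}
p_watch = {"a": 0.1, "b": 0.8, "c": 0.9, "d": 0.7}

shortlist = retrieve(user_vec, item_vecs, k=3)
final = rank(shortlist, p_click, p_watch, top_n=2)
```

Item `c` has the best ranking scores but never reaches the ranker because retrieval filtered it out, which is the main trade-off of a staged design: the expensive model only sees what the cheap stage lets through.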

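0.19 Practical barriers to entry

Requirement	Scale
User data	Hundreds of thousands of users or more
Real-time pipeline	Kafka, Flink, etc.
Model-serving infrastructure	Responses required within tens of ms
ML engineers	Dedicated team required
Data privacy	Compliance with GDPR and the Personal Information Protection Act

In short, hyperpersonalization is not a single model but a system in which Retrieval → Ranking → Causal → LLM generation are chained as a pipeline, and the leading edge is currently integrating LLMs into the final stage of that pipeline (copy / UI generation).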

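0.20 Where Functional Data Analysis (FDA) fits

FDA is a paradigm of its own, sitting between traditional statistics and ML. It belongs to a different category altogether.

0.21 Why it is a different category

Conventional statistics and ML both assume that each data point is a vector:

\[\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ip})\]

FDA treats each data point as a function (a curve):

\[x_i = f_i(t), \quad t \in [0, T]\]

Conventional statistics:  data for patient i = [age, blood pressure, weight, ...]   ← vector
FDA:                      data for patient i = the entire blood-pressure curve over time ← function

0.22 Relationship to the traditional statistics family

FDA extends traditional statistics, but its mathematical foundation differs:

	Traditional statistics	FDA
Data space	$\mathbb{R}^p$ (finite-dimensional)	Function space (infinite-dimensional)
Mean	Scalar / vector mean	Mean function $\mu(t)$
Variance	Variance-covariance matrix	Covariance function $K(s,t)$
Regression	$Y = X\beta$	$Y = \int \beta(t)X(t)\,dt$
Mathematical basis	Linear algebra	Functional analysis, Hilbert space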
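The regression row is the easiest one to check numerically: the functional linear predictor $\int \beta(t)X(t)\,dt$ is the continuous analogue of the dot product $\sum_j \beta_j x_j$. The sketch below approximates the integral with the trapezoidal rule on a grid; the choices $\beta(t) = t$ and $X(t) = 2t$ on $[0, 1]$ are hypothetical, picked because the exact answer is $2/3$.

```python
def trapezoid(values, grid):
    """Trapezoidal rule for the integral of a function sampled on grid."""
    return sum((grid[i + 1] - grid[i]) * (values[i] + values[i + 1]) / 2.0
               for i in range(len(grid) - 1))

def scalar_on_function_predictor(beta_values, curve_values, grid):
    """Approximates the integral of beta(t) * X(t) dt over the grid."""
    products = [b * x for b, x in zip(beta_values, curve_values)]
    return trapezoid(products, grid)

grid = [i / 1000 for i in range(1001)]     # 1001 equally spaced points on [0, 1]
beta_values = [t for t in grid]            # beta(t) = t
curve_values = [2.0 * t for t in grid]     # X(t) = 2t
eta = scalar_on_function_predictor(beta_values, curve_values, grid)
```

In practice the curve is only observed at a finite set of times, so software works with exactly this kind of discretized or basis-expanded version of the integral.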

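0.23 Placing FDA in the structure discussed earlier

Linear regression
    ↓
GLM
    ↓
Mixed Model / GEE
    ↓
GAM ──────────────────── FDA ←── adjacent to this point, but a separate branch
    ↓                     │
Bayesian Hierarchical     │ (paradigm shift: functions are the data)
    ↓                     ↓
ML (RF, XGBoost)    Functional Regression
    ↓                Functional PCA (FPCA)
DL                   Functional Mixed Model

FDA sits closest to GAM because both estimate functions, but:
- GAM: the function describes the covariate → outcome relationship
- FDA: the function is the data itself

0.24 Core FDA tools

Method	Role	Traditional counterpart
FPCA (Functional PCA)	Extracts the main modes of variation in the curves	PCA
Functional Regression	Predicts an outcome from curves	Linear Regression
Functional Mixed Model	Per-subject curves + group effects	Mixed Model
Basis expansion (spline, Fourier)	Represents a continuous curve with a finite set of coefficients	(none)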
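Basis expansion is the step that turns a curve back into something finite. The sketch below projects a made-up curve, sampled at 100 equally spaced times, onto the first three functions of an orthonormal Fourier basis on $[0, 1]$. The curve, the grid, and the helper names are assumptions for illustration; R packages such as fda provide the real tooling.

```python
import math

def fourier_basis(k, t):
    """k-th orthonormal Fourier basis function on [0, 1]."""
    if k == 0:
        return 1.0
    frequency = (k + 1) // 2
    trig = math.sin if k % 2 == 1 else math.cos
    return math.sqrt(2.0) * trig(2.0 * math.pi * frequency * t)

def basis_coefficients(values, grid, n_basis):
    """Approximates c_k = integral of f(t) * phi_k(t) dt by an average over an equally spaced grid."""
    n = len(grid)
    return [sum(v * fourier_basis(k, t) for v, t in zip(values, grid)) / n
            for k in range(n_basis)]

def reconstruct(coefficients, t):
    """Evaluates the basis expansion sum_k c_k * phi_k(t)."""
    return sum(c * fourier_basis(k, t) for k, c in enumerate(coefficients))

# A made-up smooth curve observed at 100 equally spaced times on [0, 1).
grid = [i / 100 for i in range(100)]
curve = [2.0 + 3.0 * math.sin(2.0 * math.pi * t) for t in grid]

coefs = basis_coefficients(curve, grid, n_basis=3)
```

One hundred observed values are summarized by three coefficients, and `reconstruct(coefs, t)` recovers the curve at any time point. Methods such as FPCA and functional regression then operate on these coefficient vectors.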

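0.25 Application areas

Data that FDA fits naturally:

Field	Example data
Public health / medicine	Growth curves, activity monitoring (accelerometers), EEG, fMRI
Meteorology	Annual temperature curves
Finance	Intraday price curves for stocks
Sports science	Motion analysis (joint angle as a function of time)

In a public health course such as p8105, wearable-device data (heart rate and activity measured every minute) is the typical FDA use case.

0.26 Conclusion

FDA generalizes the mathematical language of traditional statistics (regression, variance, testing) to function spaces, which makes it a paradigm of its own that is neither traditional statistics nor ML. That said, it is closer to traditional statistics, and in R it is implemented with the fda and refund packages.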

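1 The unified GLM framework (the core content promised in the title)

The P8105 HTML above moves naturally from lm to glm; this section spells out why these are all the same framework.

1.1 The three components of a GLM

Every GLM is defined by three components:

Component	Role	Example
Random component	Distribution of the outcome	Normal, Binomial, Poisson
Systematic component	Linear predictor $\eta = X\beta$	Design matrix × coefficients
Link function	Connects $\mu$ and $\eta$	identity, logit, log

\[g(\mu) = \eta = X\beta\]

$\mu = E[Y \mid X]$: expected value of the outcome
$g(\cdot)$: link function
$\eta$: linear predictor

1.2 From the t-test to Poisson regression: all GLMs

Method	Distribution	Link	Code
t-test (equal variances) / linear regression	Gaussian	identity ($\mu = X\beta$)	`lm()` or `glm(family=gaussian)`
Logistic regression	Binomial	logit ($\log\frac{\mu}{1-\mu} = X\beta$)	`glm(family=binomial)`
Poisson regression	Poisson	log ($\log\mu = X\beta$)	`glm(family=poisson)`
Gamma regression	Gamma	inverse or log	`glm(family=Gamma)`

Intuition: changing only the outcome distribution (and its link) lets the same glm() function handle every case. The framework is the same.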
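One way to see that the framework really is shared is that a single fitting routine, iteratively reweighted least squares (IRLS), covers all three families once the inverse link, the link derivative, and the variance function are swapped in. The sketch below is a teaching toy under stated assumptions: it handles only an intercept plus one covariate, has no safeguards (for example against perfect separation), and is not how `glm()` or statsmodels are implemented internally. The small data sets are made up so the maximum likelihood estimates are known in closed form.

```python
import math

# Each family: (inverse link, derivative of the link g'(mu), variance function V(mu)).
FAMILIES = {
    "gaussian": (lambda eta: eta, lambda mu: 1.0, lambda mu: 1.0),
    "poisson": (math.exp, lambda mu: 1.0 / mu, lambda mu: mu),
    "binomial": (lambda eta: 1.0 / (1.0 + math.exp(-eta)),
                 lambda mu: 1.0 / (mu * (1.0 - mu)),
                 lambda mu: mu * (1.0 - mu)),
}

def fit_glm(x, y, family, n_iter=50, tol=1e-10):
    """IRLS for a GLM with an intercept and one covariate; returns (b0, b1)."""
    inv_link, dlink, variance = FAMILIES[family]
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = inv_link(eta)
            g = dlink(mu)
            w = 1.0 / (variance(mu) * g * g)   # working weight
            z = eta + (yi - mu) * g            # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Solve the 2x2 weighted normal equations in closed form.
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

gaussian_fit = fit_glm([0, 1, 2, 3], [1, 3, 5, 7], "gaussian")
poisson_fit = fit_glm([0, 0, 1, 1], [1, 3, 4, 8], "poisson")
binomial_fit = fit_glm([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 0, 1, 1, 1], "binomial")
```

Only the `FAMILIES` entry changes between the three calls; the loop that builds working weights and solves weighted least squares is identical. For the Gaussian family the weights are constant and the working response equals `y`, so the first iteration is ordinary least squares.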

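1.3 Design Matrix

Fitting price ~ stars + borough in R builds the following matrix internally:

\[X = \begin{bmatrix} 1 & \text{stars}_1 & \mathbb{1}[\text{Brooklyn}]_1 & \mathbb{1}[\text{Queens}]_1 & \mathbb{1}[\text{Bronx}]_1 \\ 1 & \text{stars}_2 & \mathbb{1}[\text{Brooklyn}]_2 & \mathbb{1}[\text{Queens}]_2 & \mathbb{1}[\text{Bronx}]_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix}\]

Key points:
- The intercept column ($\mathbf{1}$) is included automatically.
- Categorical variables are converted automatically via dummy coding (reference category = Manhattan).
- Changing the reference category leaves the model fit unchanged; only the interpretation of the coefficients changes.

# R: inspect the internal design matrix
model.matrix(~ stars + borough, data = nyc_airbnb) |> head(3)

# Python: build a comparable design matrix with pandas
# (df is assumed to hold the stars and borough columns; with string labels,
#  drop_first drops the alphabetically first level, Bronx, so the reference
#  category differs from R's Manhattan unless borough is an ordered Categorical)
import pandas as pd
X = pd.get_dummies(df[['stars', 'borough']], drop_first=True, dtype=float)
X.insert(0, 'intercept', 1.0)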
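To make the dummy-coding rule explicit, the sketch below builds the same kind of matrix by hand with the reference category passed in as an argument. The three listings are made up, and the column names only imitate R's naming; this is an illustration of the encoding, not a replacement for `model.matrix()`.

```python
def design_matrix(rows, levels, reference):
    """Intercept, the numeric covariate, then one 0/1 column per non-reference level."""
    kept = [level for level in levels if level != reference]
    columns = ["(Intercept)", "stars"] + [f"borough{level}" for level in kept]
    matrix = []
    for stars, borough in rows:
        dummies = [1.0 if borough == level else 0.0 for level in kept]
        matrix.append([1.0, stars] + dummies)
    return columns, matrix

levels = ["Manhattan", "Brooklyn", "Queens", "Bronx"]
rows = [(4.5, "Manhattan"), (5.0, "Brooklyn"), (4.0, "Bronx")]   # made-up listings
columns, X = design_matrix(rows, levels, reference="Manhattan")
```

A Manhattan listing has zeros in every borough column, so its expected price is carried by the intercept and `stars` alone; each borough coefficient is then a difference from Manhattan. Passing a different `reference` reshuffles which differences the coefficients express without changing the fitted values.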

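1.4 Link function intuition

Link	Formula	When to use
identity	$\mu = X\beta$	Outcome ranges over the whole real line (continuous)
logit	$\log\frac{\mu}{1-\mu} = X\beta$	Outcome mean is a probability in (0, 1) (binary)
log	$\log\mu = X\beta$	Outcome mean is positive (counts, rates)
inverse	$1/\mu = X\beta$	Outcome is positive with a right tail (e.g. survival times)

Logit link intuition: it stretches a probability $p \in (0,1)$ onto the whole real line. \[\text{log-odds} = \log\frac{p}{1-p} \in (-\infty, +\infty)\] So whatever value $X\beta$ takes, transforming it back to a probability always lands inside $(0,1)$.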
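The round trip between the probability scale and the log-odds scale takes two lines of code. The sketch below uses a few arbitrary values of the linear predictor to show that the inverse logit always returns a value strictly between 0 and 1.

```python
import math

def logit(p):
    """Probability in (0, 1) -> log-odds on the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Log-odds on the real line -> probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

etas = [-30.0, -2.0, 0.0, 2.0, 30.0]      # arbitrary linear-predictor values
probs = [inv_logit(eta) for eta in etas]
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and increasingly extreme values approach 0 or 1. In floating point, very large values of `eta` eventually round to exactly 1.0, which is a numerical artifact rather than a property of the link.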

1.5 Python statsmodels code

Python counterparts of R's lm / glm:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Data (NYC Airbnb); df and baltimore_df are assumed to be loaded already
# df = pd.read_csv(...)

# 1. Linear regression (Gaussian + identity link)
fit_lm = smf.ols('price ~ stars + C(borough)', data=df).fit()
print(fit_lm.summary())

# 2. Logistic regression (Binomial + logit link)
fit_logistic = smf.glm(
    'resolved ~ victim_age + C(victim_race) + C(victim_sex)',
    data=baltimore_df,
    family=sm.families.Binomial()
).fit()

# Odds ratios: exponentiate the log-odds coefficients
OR = np.exp(fit_logistic.params)
print(OR)

# 3. Poisson regression (Poisson + log link)
# (assumes df has a count-valued column named count)
fit_poisson = smf.glm(
    'count ~ stars + C(borough)',
    data=df,
    family=sm.families.Poisson()
).fit()

# 4. Equivalent of broom::tidy(): params + conf_int
ci = fit_lm.conf_int()  # DataFrame with columns 0 (lower) and 1 (upper)
summary_df = pd.DataFrame({
    'estimate': fit_lm.params,
    'std_error': fit_lm.bse,
    'p_value': fit_lm.pvalues,
    'conf_low': ci[0],
    'conf_high': ci[1]
})
print(summary_df)
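For a logistic model, the confidence interval for an odds ratio is built on the log-odds scale and then exponentiated, which is why it is not symmetric around the estimate. The sketch below shows that arithmetic for a single coefficient; the coefficient and standard error are hypothetical numbers, not output from any model fitted above, and 1.96 assumes a 95% Wald interval.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and Wald interval: computed on the log-odds scale, then exponentiated."""
    return {
        "odds_ratio": math.exp(beta),
        "conf_low": math.exp(beta - z * se),
        "conf_high": math.exp(beta + z * se),
    }

# Hypothetical coefficient and standard error, for illustration only.
result = odds_ratio_with_ci(beta=0.5, se=0.1)
```

A coefficient of 0 maps to an odds ratio of 1 (no association), so an interval that excludes 1 on the odds-ratio scale corresponds to one that excludes 0 on the coefficient scale.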

1.6 Practical example: AI Agent

Scenario: an AI Agent personalization experiment, estimating the effect of personalization on user satisfaction (5-point scale)

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Data layout: user_id, session_id, personalized(0/1), segment, satisfaction(1-5)
# df = pd.read_csv('agent_experiment.csv')

# GLM: satisfaction ~ personalization + segment (assumes independent observations)
fit_glm = smf.glm(
    'satisfaction ~ personalized + C(segment)',
    data=df,
    family=sm.families.Gaussian()  # treated as continuous → Gaussian
).fit()

print(fit_glm.summary())

# Interpretation: the personalized coefficient = change in mean satisfaction under personalization
# e.g. β = 0.42 → satisfaction rises by 0.42 points with personalization (5-point scale)

# R version
library(tidyverse)

fit_glm <- glm(
  satisfaction ~ personalized + segment,
  data = agent_df,
  family = gaussian()
)

broom::tidy(fit_glm)
# example output:
# term          estimate std.error statistic p.value
# personalized  0.42     0.08      5.25      <0.001
# segmentMIEP   0.31     0.10      3.10       0.002

Why a Gaussian GLM here: a 5-point satisfaction score is strictly ordinal, but approximating it as continuous is practical and easy to interpret. Ordinal logistic regression would be the more rigorous choice.

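1.7 The natural extension from GLM to Mixed Model

Suppose that in the AI Agent example above the same user is measured repeatedly across several sessions:

Problem: observations sharing a user_id are not independent → the GLM independence assumption is violated
Solution: include per-user baseline differences in the model as a random effect

# statsmodels MixedLM (linear mixed model)
import statsmodels.formula.api as smf

fit_lmm = smf.mixedlm(
    'satisfaction ~ personalized + C(segment)',
    data=df,
    groups=df['user_id']  # random intercept per user_id
).fit()

print(fit_lmm.summary())

# R version (lme4)
library(lme4)

fit_lmm <- lmer(
  satisfaction ~ personalized + segment + (1 | user_id),
  data = agent_df
)

broom.mixed::tidy(fit_lmm)

Model	Assumption	When it fits
GLM	All observations are independent	Cross-sectional data (one measurement per person)
LMM	Observations within the same user are correlated	Repeated measures / longitudinal data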
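Whether the independence assumption is plausible can be checked before choosing between the two models by estimating how much of the variance sits between users. The sketch below simulates repeated sessions with and without a per-user baseline and computes the intraclass correlation (ICC) with the one-way ANOVA estimator. The sample sizes, standard deviations, and seed are arbitrary assumptions, and the simulation uses only the standard library rather than a fitted mixed model.

```python
import random

def simulate_sessions(n_users, n_sessions, sd_user, sd_noise, seed=1):
    """Satisfaction-like scores: a per-user baseline plus independent session noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_users):
        baseline = rng.gauss(0.0, sd_user)   # shared by all of this user's sessions
        data.append([3.0 + baseline + rng.gauss(0.0, sd_noise) for _ in range(n_sessions)])
    return data

def intraclass_correlation(groups):
    """One-way ANOVA estimate of the ICC for equally sized groups."""
    n, k = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

clustered = simulate_sessions(n_users=300, n_sessions=5, sd_user=1.0, sd_noise=1.0)
independent = simulate_sessions(n_users=300, n_sessions=5, sd_user=0.0, sd_noise=1.0)
icc_clustered = intraclass_correlation(clustered)
icc_independent = intraclass_correlation(independent)
```

With equal user and noise variances the true ICC is 0.5, and with no user baseline it is 0, so the two estimates should land near those values. An ICC well above 0 in real data is the signal that a random intercept per user, as in the mixed models above, is needed.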

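1.8 Summary of the overall flow

t-test
  ↓ (generalize to regression)
Linear regression (lm)
  ↓ (non-normal outcome)
GLM (glm: logistic, Poisson, Gamma)
  ↓ (repeated measures, clustered data)
Mixed Model / GEE
  ↓ (nonlinear covariate relationships)
GAM / GAMM
  ↓ (the outcome itself is a function)
FDA

Core principle: changing only the distribution (and its link) keeps everything inside the same framework (glm()). Once the independence assumption breaks, move on to Mixed Model / GEE.

Step	New complication	Solution
Non-normal distribution	Gaussian assumption violated	GLM (choose a link function)
Repeated measures	Independence assumption violated	LMM / GEE (random effect / working correlation)
Nonlinear relationship	Linearity assumption violated	GAM (spline)
Functional data	Vector → function space	FDA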