Kwangmin Kim - MINERVA Phase C-5: Intent Classification and Topic Clustering

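1 Intent vs. Topic: Two Different Questions

Each query text gets two analyses at the same time.

	Intent classification (Intent)	Topic clustering (Topic)
Paradigm	Supervised	Unsupervised
Labels	Predefined by people (e.g. “search”, “summarize”, “code generation”)	Discovered automatically from the data
Stability	High (labels are fixed)	Low (shifts on retraining)
Discovering new categories	Hard	Natural
Operational fit	Routing, tool permissions, dashboards	Exploring new use cases, content gaps

The two approaches complement each other: classification handles known intents, and clustering handles unknown patterns. A production system runs both, and uses the clustering results to analyze only the queries that no known intent captured.

The two fields query_intent_class and query_topic_cluster in the C20 conversation-logging design reflect exactly this split.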

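2 Intent Classification: Defining Labels

Labels are the starting point for everything. If they are too coarse, they do not help routing; if they are too fine-grained, each label has too little data and labeling consistency breaks down.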

# config/intents.yaml
intents:
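  knowledge_lookup:                 # most common: facts, definitions, references
    description: Asking for facts, definitions, or document locations
    examples:
      - "What is the difference between a vector DB and ES?"
      - "What was last quarter's revenue?"

  task_assist:                      # task help: code, summaries, translation
    description: Assisting with the user's task
    examples:
      - "Convert this function to typer"
      - "Summarize these meeting minutes in English"

  troubleshoot:                     # problem solving
    description: Errors, symptoms, reproduction steps
    examples:
      - "The embedding model keeps running out of memory (OOM)"

  decision_support:                 # comparison, recommendation, decisions
    description: Comparing options and giving the rationale
    examples:
      - "PG vs Mongo, which one is better?"

  small_talk:                       # chit-chat, greetings
    description: Conversation unrelated to work

  out_of_scope:                     # requests the system cannot help with
    description: Out-of-domain questions (weather, politics, etc.)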

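Recommended number of labels: 5 to 12. Beyond that, the classifier, the labelers, and operations all struggle.

3 Intent Classification Algorithms

3.1 LLM zero/few-shot: A Quick Start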

# app/intent/llm_classifier.py
INTENT_PROMPT = """Classify the following user query into one of the 6 intents below:
- knowledge_lookup
- task_assist
- troubleshoot
- decision_support
- small_talk
- out_of_scope

Query: {query}

Answer (label only): """


def classify_llm(query: str) -> str:
    # llm_call is the project's LLM client helper, defined elsewhere
    response = llm_call(INTENT_PROMPT.format(query=query))
    return response.strip()

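Strengths: you can start with no labeled data, and adding or changing a label takes effect immediately. Weaknesses: cost. Every query needs an LLM call, which strains latency and throughput.

In production, use sampling plus caching, because running LLM classification on every query is unrealistic. Restrict it to building training data, evaluation, and detecting new query types.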

3.2 Embeddings + Classifier: The Production Standard

# app/intent/embedding_classifier.py
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("BAAI/bge-m3")


class EmbeddingClassifier:
    def __init__(self):
        # multi_class is deprecated in recent scikit-learn; multinomial is already the default for 3+ classes
        self.clf = LogisticRegression(max_iter=1000)

    def fit(self, queries: list[str], labels: list[str]):
        X = model.encode(queries, normalize_embeddings=True)
        self.clf.fit(X, labels)

    def predict(self, queries: list[str]) -> list[str]:
        X = model.encode(queries, normalize_embeddings=True)
        return self.clf.predict(X).tolist()

    def predict_proba(self, queries: list[str]):
        X = model.encode(queries, normalize_embeddings=True)
        return self.clf.predict_proba(X)

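Strengths:
- No LLM call: one embedding pass plus a matrix multiplication (tens of thousands of queries per minute)
- The confidence score (predict_proba) drives the fallback decision
- Adding a new label needs only data and retraining

Weaknesses:
- Requires labeled data (100 to 500 examples per intent)
- Cannot discover new intents (the classifier only outputs labels it was trained on)

Branching on a confidence threshold: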

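def classify_with_fallback(query: str, threshold: float = 0.5) -> str:
    # embedding_classifier is a fitted EmbeddingClassifier instance
    probs = embedding_classifier.predict_proba([query])[0]
    top_idx = probs.argmax()
    if probs[top_idx] < threshold:
        return classify_llm(query)        # low confidence: fall back to the LLM
    # classes_ holds the label order that predict_proba columns follow
    return str(embedding_classifier.clf.classes_[top_idx])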

3.3 Active Learning

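This is the key to cutting labeling cost.

# One round of uncertainty sampling
import numpy as np

def select_for_labeling(unlabeled: list[str], n: int = 50) -> list[str]:
    probs = embedding_classifier.predict_proba(unlabeled)
    # The most uncertain samples are the ones with the highest entropy
    entropies = -np.sum(probs * np.log(probs + 1e-10), axis=1)
    top = np.argsort(entropies)[-n:]
    return [unlabeled[i] for i in top]

Out of tens of thousands of queries, this picks the 50 whose labels would help most and sends them to a person. The text puts the gain at roughly 5 to 10× over random labeling; the actual figure depends on the data and on how well the classifier's probabilities are calibrated.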
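To make the entropy criterion concrete, here is a small dependency-free sketch with three hand-made probability rows. The helper names `entropy` and `most_uncertain` and the toy numbers are assumptions for illustration only.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(prob_rows: list[list[float]], n: int) -> list[int]:
    """Indices of the n rows with the highest entropy, most uncertain first."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]), reverse=True)
    return ranked[:n]


rows = [
    [0.98, 0.01, 0.01],   # confident prediction
    [0.34, 0.33, 0.33],   # nearly uniform: the classifier has no idea
    [0.50, 0.49, 0.01],   # torn between two intents
]
picked = most_uncertain(rows, n=2)   # the near-uniform row, then the two-way tie
```

The confident row is never selected, which is the point: labeling it would teach the classifier almost nothing.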

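3.4 Evaluation

from sklearn.metrics import classification_report, confusion_matrix

# Stratified k-fold (keeps each intent's share balanced across folds)
from sklearn.model_selection import StratifiedKFold

# X: embedding matrix (np.ndarray), y: np.ndarray of labels, clf: any scikit-learn classifier
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    print(classification_report(y[test_idx], pred))
    print(confusion_matrix(y[test_idx], pred))

Key metrics:
- Macro-F1: the unweighted mean of per-intent F1, so minority intents count as much as majority ones
- Confusion matrix: shows which intents get confused with each other (a signal that label definitions need work)
- Per-class precision: misclassifications involving out_of_scope are costly, so precision matters most there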
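The following sketch computes macro-F1 by hand on a toy imbalanced set to show why it is preferred over accuracy. The function names and the 9:1 toy data are assumptions; in practice `sklearn.metrics.f1_score(..., average="macro")` does the same job.

```python
from collections import Counter


def per_class_f1(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted label gains a false positive
            fn[t] += 1   # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# A classifier that always predicts the majority intent.
y_true = ["knowledge_lookup"] * 9 + ["out_of_scope"]
y_pred = ["knowledge_lookup"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Accuracy looks healthy at 0.9, while macro-F1 falls below 0.5 because the out_of_scope class is never predicted.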

4 Topic Clustering: Discovering New Patterns

4.1 K-means + Embeddings

from sklearn.cluster import KMeans

X = model.encode(queries, normalize_embeddings=True)
km = KMeans(n_clusters=20, random_state=42, n_init=10).fit(X)
topic_id = km.labels_

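Strengths: simple, fast, and the number of clusters is under your control. Weaknesses: K must be chosen in advance, and it handles non-spherical clusters and noise poorly.

4.2 HDBSCAN: Automatic Cluster Count and Noise Separation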

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=30, metric="euclidean")
labels = clusterer.fit_predict(X)
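# labels == -1 marks noise points

Strengths: the number of clusters is found automatically, and noise is set aside as its own group. Weaknesses: the hyperparameter (min_cluster_size) needs tuning.

4.3 BERTopic: An End-to-End Pipeline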

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

topic_model = BERTopic(
    embedding_model=model,
    min_topic_size=30,
    nr_topics="auto",                          # automatic topic reduction (uses HDBSCAN)
    representation_model=KeyBERTInspired(),    # per-topic keywords
)

topics, probs = topic_model.fit_transform(queries)
print(topic_model.get_topic_info())

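Strengths: embeddings, dimensionality reduction (UMAP), clustering (HDBSCAN), and keyword extraction come in one package. Weaknesses: you depend on the package, and components are hard to isolate when debugging.

Operational recommendation: start quickly with BERTopic, then break it down into your own pipeline once things stabilize.

4.4 Topic Labeling

Automatic labeling uses either the top TF-IDF keywords or an LLM summary: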

def label_topic(cluster_queries: list[str]) -> str:
    sample = "\n".join(cluster_queries[:30])
    prompt = f"Summarize the common theme of the following queries in one word (in Korean):\n{sample}"
    return llm_call(prompt).strip()

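LLM labels are volatile: they differ slightly on every refresh. In production:
- Have a person confirm each label once after the first labeling pass
- When clusters are retrained, map them onto the existing labels with Hungarian matching (the same pattern as C17 segmentation)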
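The label-mapping step can be sketched as an assignment problem over cluster overlaps. Note the substitution: this example uses exhaustive search over permutations rather than the Hungarian algorithm. It finds the same optimal assignment but is only feasible for a handful of clusters; with SciPy available, `scipy.optimize.linear_sum_assignment` is the usual choice. The function names are hypothetical, and the sketch assumes the same queries were clustered in both runs.

```python
from itertools import permutations


def overlap_counts(old_labels: list[int], new_labels: list[int]):
    """counts[o][n] = number of queries in old cluster o and new cluster n."""
    old_ids, new_ids = sorted(set(old_labels)), sorted(set(new_labels))
    counts = {o: {n: 0 for n in new_ids} for o in old_ids}
    for o, n in zip(old_labels, new_labels):
        counts[o][n] += 1
    return old_ids, new_ids, counts


def match_clusters(old_labels: list[int], new_labels: list[int]) -> dict[int, int]:
    """Map new cluster id -> old cluster id, maximizing total overlap."""
    old_ids, new_ids, counts = overlap_counts(old_labels, new_labels)
    if len(old_ids) > len(new_ids):
        raise ValueError("sketch assumes at least as many new clusters as old ones")
    best_score, best_map = -1, {}
    for perm in permutations(new_ids, len(old_ids)):
        score = sum(counts[o][n] for o, n in zip(old_ids, perm))
        if score > best_score:
            best_score, best_map = score, dict(zip(perm, old_ids))
    return best_map


old = [0, 0, 0, 1, 1, 2, 2, 2]   # cluster ids from the previous run
new = [5, 5, 5, 3, 3, 4, 4, 5]   # same queries, ids reshuffled by retraining
mapping = match_clusters(old, new)
```

New clusters left out of the mapping have no counterpart in the previous run and are candidates for a fresh label.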

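5 Combining Intent + Topic: Hierarchical Analysis

intent: knowledge_lookup
├── topic_cluster_3  : "Vector DB, search"
├── topic_cluster_7  : "Product pricing, licensing"
└── topic_cluster_12 : "Internal processes, policies"

intent: task_assist
├── topic_cluster_5  : "Code conversion, typer migration"
└── topic_cluster_19 : "Meeting-minutes summarization, translation"

This combination makes it possible to:
- Analyze whether response quality differs by topic even within the same intent
- Treat a new topic appearing inside a known intent as a data-gap signal
- Treat unknown-intent (out_of_scope) queries that group into a topic as candidates for a new category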
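A minimal sketch of the intent × topic breakdown, assuming each logged row carries an intent, a topic cluster id, and a binary thumbs_up value. The function names, the row shape, and the `min_count` cutoff are assumptions for illustration.

```python
from collections import defaultdict


def intent_topic_matrix(rows):
    """rows: iterable of (intent, topic_cluster, thumbs_up) with thumbs_up in {0, 1}.

    Returns {(intent, topic_cluster): (query_count, thumbs_up_rate)}.
    """
    counts = defaultdict(int)
    ups = defaultdict(int)
    for intent, topic, thumbs_up in rows:
        counts[(intent, topic)] += 1
        ups[(intent, topic)] += thumbs_up
    return {cell: (n, ups[cell] / n) for cell, n in counts.items()}


def weakest_cells(matrix, min_count=2, k=1):
    """The k cells with the lowest thumbs_up rate among cells with enough traffic."""
    eligible = [(rate, cell) for cell, (n, rate) in matrix.items() if n >= min_count]
    return [cell for _, cell in sorted(eligible)[:k]]


rows = [
    ("knowledge_lookup", 3, 1),
    ("knowledge_lookup", 3, 1),
    ("knowledge_lookup", 7, 0),
    ("knowledge_lookup", 7, 1),
    ("task_assist", 5, 0),        # only one query: too little data to judge
]
matrix = intent_topic_matrix(rows)
weak = weakest_cells(matrix)
```

The minimum-count filter keeps a single unlucky query from topping the list of weak areas.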

6 Drift Detection

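Detecting distribution shift:

# Intent distribution over time
from scipy.stats import chi2_contingency

def detect_intent_drift(current_counts: dict, baseline_counts: dict) -> float:
    """Return the p-value of a chi-square test; inputs must be raw counts, not proportions."""
    # Align both windows on the same sorted label keys
    keys = sorted(set(current_counts) | set(baseline_counts))
    # Drop labels absent from both windows: an all-zero column makes chi2_contingency raise
    keys = [k for k in keys if current_counts.get(k, 0) + baseline_counts.get(k, 0) > 0]
    current = [current_counts.get(k, 0) for k in keys]
    baseline = [baseline_counts.get(k, 0) for k in keys]
    chi2, p, _, _ = chi2_contingency([current, baseline])
    return p

If p < 0.001, treat the distribution as shifted and raise an alert. Possible causes:
- A new user group has arrived
- A new use case has emerged
- The system changed (e.g. user behavior shifting after a new model is introduced)

Topic clustering is handled similarly: check stability every quarter with ARI (Adjusted Rand Index).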
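For reference, ARI can be computed directly from pair counts. The sketch below is a dependency-free version under the assumption that both label lists cover the same queries in the same order; `sklearn.metrics.adjusted_rand_score` is the standard implementation. Returning 1.0 for degenerate inputs is a choice made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a: list, labels_b: list) -> float:
    """ARI between two clusterings of the same items (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items that share a cluster in both runs, in run A, and in run B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)   # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0   # degenerate case, e.g. every item in one cluster in both runs
    return (together_both - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 9, 9])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because ARI compares partitions rather than ids, a retraining run that only renumbers clusters still scores 1.0, while a run that regroups queries scores lower.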

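7 Integration with C17 Segmentation

# Per-user intent distribution -> behavioral features
from collections import Counter

def user_intent_profile(user_id: str, lookback_days: int = 30) -> dict:
    queries = load_user_queries(user_id, lookback_days)
    if not queries:
        return {}                       # no activity in the window; avoids division by zero
    intents = embedding_classifier.predict([q.text for q in queries])
    counts = Counter(intents)
    total = sum(counts.values())
    # Values are fractions in [0, 1], one key per observed intent
    return {f"intent_{k}_pct": v / total for k, v in counts.items()}

This profile is added as a behavioral dimension in C17 segmentation. A user whose queries are 80% task_assist naturally belongs in a segment of their own.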

8 Input to the C23 Feedback Loop

# Monitor thumbs_up_rate per intent -> surface weak spots automatically
def intent_quality_report():
    df = clickhouse.query("""
        SELECT query_intent_class, count(), avg(thumbs_up) as score
        FROM structured_query
        WHERE thumbs_up IS NOT NULL
          AND timestamp >= now() - INTERVAL 7 DAY
        GROUP BY query_intent_class
        HAVING count() > 100
    """)
    return df.sort_values("score").head(3)         # the 3 weakest intents

This result feeds the C23 feedback loop: for the weak intents, improve the prompt, the few-shot examples, and the retrieval scope.

9 Applying This to MINERVA

app/analysis/intent/
├── llm_classifier.py        # zero/few-shot
├── embedding_classifier.py  # production standard
├── train.py                 # active learning + retrain
├── eval.py                  # macro-F1, confusion matrix, per-class
└── drift.py                 # distribution-shift alerts

app/analysis/topic/
├── bertopic_pipeline.py     # quick start
├── kmeans_baseline.py       # small N
├── label.py                 # automatic labeling, confirmation, matching
└── stability.py             # ARI monitoring

scripts/
├── intent_train.py          # monthly retraining
├── intent_eval.py           # quarterly regression evaluation
├── topic_refresh.py         # quarterly cluster retraining
└── drift_check.py           # weekly distribution check

The results populate the feature layer of the C20 conversation log.

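10 Common Pitfalls

10.1 Label Noise

Different labelers give the same query different labels. Once noise in the classifier's training data exceeds 5%, it caps performance.

Remedies:
- Double labeling plus an agreement measure (Cohen’s kappa)
- A labeling guide with examples and counterexamples
- Excluding queries the labelers disagree on from the labeled set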
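A small sketch of the double-labeling check, assuming two labelers annotated the same queries in the same order. The function names are hypothetical; `sklearn.metrics.cohen_kappa_score` provides the same statistic.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two labelers, corrected for chance agreement."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both labelers pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0   # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)


def agreed_subset(queries: list[str], labels_a: list[str], labels_b: list[str]):
    """Keep only the queries both labelers agreed on, paired with the agreed label."""
    return [(q, a) for q, a, b in zip(queries, labels_a, labels_b) if a == b]


labeler_1 = ["knowledge_lookup", "knowledge_lookup", "task_assist", "task_assist"]
labeler_2 = ["knowledge_lookup", "knowledge_lookup", "task_assist", "knowledge_lookup"]
kappa = cohens_kappa(labeler_1, labeler_2)
clean = agreed_subset(["q1", "q2", "q3", "q4"], labeler_1, labeler_2)
```

Raw agreement here is 75%, but kappa is only 0.5 once chance agreement is removed, which is why kappa is the better gauge of guideline quality.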

10.2 Class Imbalance

With out_of_scope at 1% and knowledge_lookup at 60%, the classifier is biased toward the majority class.

Remedies:
- Stratified sampling plus class_weight
- Oversampling the minority classes (SMOTE) or a weighted loss
- Evaluating with macro-F1 (accuracy is not meaningful here)
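To show what class weighting does with a distribution like the one above, this sketch reproduces the formula scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * count). The 60/39/1 split is a toy assumption that extends the percentages in the text.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = (["knowledge_lookup"] * 60
          + ["task_assist"] * 39
          + ["out_of_scope"] * 1)
weights = balanced_class_weights(labels)
```

An out_of_scope error ends up weighted 60 times more heavily than a knowledge_lookup error, which offsets the 60:1 frequency gap. With a single example, though, the weight only amplifies one data point, so collecting more minority examples remains necessary.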

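10.3 Prompt Drift

An LLM classifier is sensitive to small changes in the system prompt. A “code generation” query can suddenly come back with a different label.

Remedies:
- Version-control the prompts (tracked in git)
- Rerun the same evaluation set every quarter as a regression check to catch changes
- Prefer embeddings + classifier over the LLM for production use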
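The regression check can be as simple as running two prompt versions over a frozen evaluation set and listing every query whose label flipped. In this sketch the two classifiers are hypothetical stubs backed by dictionaries; in practice they would wrap the old and new prompt.

```python
def regression_report(eval_set, classify_old, classify_new):
    """eval_set: list of (query, gold_label). Compare two classifier versions."""
    flips = []
    old_correct = new_correct = 0
    for query, gold in eval_set:
        old, new = classify_old(query), classify_new(query)
        old_correct += old == gold
        new_correct += new == gold
        if old != new:
            flips.append((query, old, new))   # label changed between versions
    n = len(eval_set)
    return {"old_acc": old_correct / n, "new_acc": new_correct / n, "flips": flips}


eval_set = [
    ("write a sort function", "task_assist"),
    ("hello", "small_talk"),
]
# Hypothetical stand-ins for the classifier under prompt v1 and prompt v2.
prompt_v1 = {"write a sort function": "task_assist", "hello": "small_talk"}.get
prompt_v2 = {"write a sort function": "knowledge_lookup", "hello": "small_talk"}.get
report = regression_report(eval_set, prompt_v1, prompt_v2)
```

Reviewing the flipped queries one by one usually explains a drop faster than the aggregate accuracy does.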

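10.4 Cluster Mismatch over Time

Topic cluster IDs and meanings change with every refresh, which breaks dashboards and routing.

Remedies:
- Hungarian matching (C17)
- Or fixed anchor topics: people define the core topics, and each refresh is mapped onto them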

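10.5 The “All Other” Explosion

If every query that fits no cluster or intent goes into other, then over time other becomes the largest cluster and carries no meaning.

Remedies:
- Monitor the other ratio; exceeding a threshold is a retraining signal
- Cluster separately inside other to find candidates for new intents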
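A minimal sketch of the ratio monitor, assuming labels are collected per week. The 15% threshold and the function names are assumptions; for HDBSCAN output, the noise label -1 could be passed as `other_label`.

```python
def other_ratio(labels: list, other_label="other") -> float:
    """Share of queries that fell into the catch-all bucket."""
    if not labels:
        return 0.0
    return sum(label == other_label for label in labels) / len(labels)


def weeks_over_threshold(weekly_labels: list[list], threshold: float = 0.15,
                         other_label="other") -> list[tuple[int, float]]:
    """Return (week_index, ratio) for every week whose catch-all share exceeds the threshold."""
    flagged = []
    for week, labels in enumerate(weekly_labels):
        ratio = other_ratio(labels, other_label)
        if ratio > threshold:
            flagged.append((week, ratio))
    return flagged


weekly = [
    ["knowledge_lookup"] * 9 + ["other"],        # 10% other
    ["knowledge_lookup"] * 7 + ["other"] * 3,    # 30% other
]
flagged = weeks_over_threshold(weekly)
```

A flagged week is the cue to pull that week's other queries and run clustering on them alone.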

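11 Summary

Area	Key points
Intent vs. topic	Supervised vs. unsupervised; known vs. newly discovered
Intent algorithms	LLM (quick start) → embeddings + classifier (production) → confidence-based fallback
Label count	5 to 12; too many adds noise, too few makes routing useless
Topic algorithms	BERTopic (start) → K-means or HDBSCAN (decomposed pipeline)
Evaluation	Macro-F1, confusion matrix, per-class precision
Combination	Intent × topic hierarchical analysis; surfaces weak areas automatically
Drift	Chi-square on the intent distribution (weekly), ARI on topics (quarterly)
Pitfalls	Label noise, class imbalance, prompt drift, cluster mismatch, `other` explosion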

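12 Applications

Scenario	Use
Routing (model and tool branching)	Intent classification (e.g. troubleshoot → attach debug tools)
Discovering new use cases	Topic clustering (clusters inside `other`)
Data-gap analysis	Cells with low thumbs_up in the intent × topic matrix
User behavior profiles	Per-user_id intent distribution → C17 segmentation
Per-intent quality regression	Weekly per-intent thumbs_up_rate
Content prioritization	Query volume + quality per topic → reinforcement priority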

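13 Related Topics

Prerequisites

C20 Conversation Logging Design: query_intent_class and query_topic_cluster in the feature layer
Part 03 RAG Pipeline: the foundation for using embedding models

Cross-references to the 18-LangGraph series

#10 Prompt Classification and Routing: theory of applying intent classification to routing
#18 Skill Routing Scalability: intent-to-skill mapping

Follow-ups (Phase C-5)

C22 Response Quality Evaluation: evaluating quality separately per intent
C23 Feedback Loop: automatically strengthening weak intents

Cross-reference

C17 User Segmentation: the intent profile serves as a behavioral dimension
C18 Personalization: branching response style by intent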