Kwangmin Kim - MINERVA Phase C-8: Knowledge Quality Monitoring (Coverage, Freshness, Accuracy, Drift)

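1 Why continuous monitoring

The C31 lifecycle and C32 chunking guarantee quality at the moment a document is introduced. As time passes, however, other kinds of quality loss set in.

Quality loss over time	Result
Coverage Gap: areas users ask about often but no document covers	Repeated refusals; "no information" responses pile up
Stale Knowledge: a source change goes undetected	Answers based on outdated policies or prices
Low-quality Doc: a document that gets cited but draws many thumbs_down	Answer quality degrades cumulatively
Citation Drift: the same query cites different documents over time	Consistency breaks
Topic Distribution Drift: the user query distribution shifts	New use cases go unnoticed

C33 detects these losses automatically through five signals and registers them as treatment candidates in the C23 feedback loop.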

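2 The five signals

[Coverage]        user intent vs. document distribution: automatic gap discovery
[Freshness]       source change detection, stale document tracking
[Accuracy]        citation quality (human review, LLM judge, thumbs)
[Citation Health] spread, consistency, citation depth
[Drift]           time-series change in intent, topic, and citation patterns

Each signal catches a different kind of loss, so monitoring a single signal is not enough.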

3 Signal 1: Coverage Gap

Automatically find areas that users ask about often but that lack documents.

# app/knowledge/monitor/coverage.py
def detect_coverage_gaps(period_days: int = 30) -> list[dict]:
    # 1. Query distribution from the C21 intent/topic classification.
    #    Assumes clickhouse.query returns a pandas DataFrame (as the other
    #    snippets do), so convert it to one dict per intent/topic cell.
    intent_topic_dist = clickhouse.query(f"""
        SELECT query_intent_class, query_topic_cluster, count() as n
        FROM feature_table
        WHERE timestamp >= now() - INTERVAL {int(period_days)} DAY
        GROUP BY query_intent_class, query_topic_cluster
        HAVING n > 50
    """).to_dict("records")

    # 2. Average retrieval count, satisfaction, and refusal rate per cell
    for cell in intent_topic_dist:
        cell["avg_retrieval"] = avg_retrieval_count(cell)       # documents retrieved per query
        cell["avg_thumbs_up"] = avg_thumbs_up(cell)
        cell["refusal_rate"] = refusal_rate(cell)
        cell["citation_diversity"] = unique_docs_cited(cell)

    # 3. Gap definition
    gaps = [
        c for c in intent_topic_dist
        if c["refusal_rate"] > 0.2          # often fails to answer
        or c["avg_retrieval"] < 2           # thin search results
        or c["avg_thumbs_up"] < 0.4         # low satisfaction even when answered
    ]
    return sorted(gaps, key=lambda c: -c["n"])   # largest impact first

The top 10 gaps feed the C23 feedback loop, which turns each one into a hypothesis automatically:

# C23 integration
for gap in detect_coverage_gaps()[:10]:
    hypothesis = {
        "target": "knowledge",
        "text": f"Not enough documents for intent {gap['query_intent_class']}, topic {gap['query_topic_cluster']}",
        "proposed_action": "add related sources, expand existing documents, or introduce a new collection",
        "priority": gap["n"],
    }
    c23.register_treatment_candidate(hypothesis)
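The helpers used above (`avg_retrieval_count`, `refusal_rate`, and so on) are not shown in this article. As a minimal sketch of the same gap rule, assume each log row is a plain dict with the hypothetical keys `intent`, `topic`, `retrieved`, `thumbs_up`, and `is_refusal`; the real feature table may be shaped differently.

```python
from collections import defaultdict

def find_gaps(rows, min_queries=50):
    """Group log rows by (intent, topic) and flag cells matching the gap rule.

    Row keys are illustrative assumptions, not the article's actual schema.
    """
    cells = defaultdict(list)
    for row in rows:
        cells[(row["intent"], row["topic"])].append(row)

    gaps = []
    for (intent, topic), group in cells.items():
        n = len(group)
        if n <= min_queries:              # mirrors HAVING n > 50
            continue
        stats = {
            "intent": intent,
            "topic": topic,
            "n": n,
            "avg_retrieval": sum(r["retrieved"] for r in group) / n,
            "avg_thumbs_up": sum(r["thumbs_up"] for r in group) / n,
            "refusal_rate": sum(r["is_refusal"] for r in group) / n,
        }
        # Same three conditions as the gap definition above, joined with OR
        if (stats["refusal_rate"] > 0.2
                or stats["avg_retrieval"] < 2
                or stats["avg_thumbs_up"] < 0.4):
            gaps.append(stats)
    return sorted(gaps, key=lambda c: -c["n"])   # largest impact first
```

Because the three conditions are joined with OR, a cell with satisfied users can still be flagged when retrieval is thin, which is usually the earliest sign of a gap.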

3.1 Coverage Heatmap

Visualize the gaps as an intent × topic matrix:

def coverage_heatmap():
    df = clickhouse.query("""
        SELECT query_intent_class, query_topic_cluster,
               avg(citation_count) as cit,
               avg(thumbs_up) as thumbs,
               sum(is_refusal) / count() as refusal
        FROM feature_table
        WHERE timestamp >= now() - INTERVAL 30 DAY
        GROUP BY query_intent_class, query_topic_cluster
    """)
    # plotly heatmap: red cells = coverage gap
    return df

The dashboard refreshes weekly, so the operations team can see at a glance where the knowledge base needs reinforcement.

4 Signal 2: Freshness

Whether documents keep up with changes in their sources.

from datetime import datetime, timedelta, timezone

def freshness_audit(staleness_days: int = 90) -> dict:
    # Timezone-aware UTC cutoff; assumes last_modified is stored in UTC
    cutoff = datetime.now(timezone.utc) - timedelta(days=staleness_days)

    return {
        "stale_active": db.indexed_docs.count(
            stage="active",
            metadata__last_modified__lt=cutoff,
        ),
        "stale_high_citation": db.indexed_docs.count(
            stage="active",
            metadata__last_modified__lt=cutoff,
            metadata__citation_count__gt=100,                    # cited often but old
        ),
        "source_unreachable": _count_unreachable_sources(),       # broken source URLs
        "drift_detected": _count_drift_alerts(period_days=30),
    }

4.1 Source drift detection

Compare the source's last_modified with the indexed document's last_modified:

async def detect_source_drift():
    for indexed in db.indexed_docs.find(stage="active"):
        try:
            source_modified = await fetch_source_modified(indexed.source_url)
            if source_modified > indexed.metadata["last_modified"]:
                alert(
                    f"{indexed.doc_id}: source updated {source_modified} > "
                    f"indexed {indexed.metadata['last_modified']}"
                )
                enqueue_reindex(indexed.doc_id)
        except SourceUnreachable:
            mark_unreachable(indexed.doc_id)

A daily cron job runs the full check in the early morning and then sends the alerts.
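`fetch_source_modified` is not defined in this article. For HTTP sources, one plausible ingredient is the `Last-Modified` response header. The sketch below covers only the parsing and comparison step, assuming the header value has already been fetched and that the indexed timestamp is timezone-aware; `parse_last_modified` and `source_is_newer` are hypothetical names.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_last_modified(header_value):
    """Parse an HTTP Last-Modified header into an aware UTC datetime, or None."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Malformed header: treat as "unknown" rather than failing the audit
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

def source_is_newer(header_value, indexed_last_modified):
    """True only when the source reports a timestamp later than the indexed copy."""
    source_modified = parse_last_modified(header_value)
    return source_modified is not None and source_modified > indexed_last_modified
```

A missing or malformed header returns `None` here, so such a source is never reported as changed. Sources that do not send the header at all would need another change signal, such as a content hash.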

4.2 Topic-specific Freshness

The staleness threshold differs by domain:

# config/freshness.yaml
default_staleness_days: 90

per_collection:
  pricing: 30                # track price changes quickly
  tax_policies: 60
  technical_specs: 180
  archived_minutes: -1       # meeting minutes never go stale (factual record)
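How such a config might be consumed can be sketched as follows. The dict mirrors the YAML above; reading `-1` as "never stale" follows the comment in the config. `collection_freshness` borrows the name used in the auto-reindex snippet later in this article, but its body here is an assumption, and `is_past_threshold` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# In-memory copy of config/freshness.yaml (values taken from the article)
FRESHNESS = {
    "default_staleness_days": 90,
    "per_collection": {
        "pricing": 30,
        "tax_policies": 60,
        "technical_specs": 180,
        "archived_minutes": -1,   # never stale
    },
}

def collection_freshness(collection, config=FRESHNESS):
    """Staleness threshold in days for a collection, falling back to the default."""
    return config["per_collection"].get(collection, config["default_staleness_days"])

def is_past_threshold(last_modified, collection, now=None, config=FRESHNESS):
    """True when the document is older than its collection's threshold."""
    days = collection_freshness(collection, config)
    if days < 0:                  # negative threshold disables the check
        return False
    now = now or datetime.now(timezone.utc)
    return now - last_modified > timedelta(days=days)
```

The same 45-day-old document is stale in `pricing` but still fresh in `tax_policies`, which is the point of per-collection thresholds.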

5 Signal 3: Accuracy

Are the answers that cite a document actually accurate? Break the C22 fused_score down to the document level:

def doc_quality_score(doc_id: str, period_days: int = 30) -> dict:
    """Quality statistics for the responses in which the document was cited."""
    row = clickhouse.query(f"""
        SELECT
            count() as citations,
            avg(thumbs_up) as thumbs_up,
            sum(is_refusal) / count() as refusal_assoc,
            avg(quality_fused_score) as fused,
            sum(hallucination_flag) as hallucination_count
        FROM feature_table
        WHERE has(retrieved_doc_ids, '{doc_id}')   -- ClickHouse array membership test
          AND timestamp >= now() - INTERVAL {int(period_days)} DAY
    """).iloc[0]
    return row.to_dict()   # match the declared dict return type

5.1 Bad doc detection

def bad_docs_report() -> list[dict]:
    df = clickhouse.query("""
        SELECT doc_id, citations, thumbs_up, hallucination_count
        FROM (
            SELECT arrayJoin(retrieved_doc_ids) as doc_id, ...
            FROM feature_table
            WHERE timestamp >= now() - INTERVAL 30 DAY
            GROUP BY doc_id
            HAVING citations > 20
        )
    """)

    bad = df[
        (df.thumbs_up < 0.3) |                    # low thumbs_up on responses that cite it
        (df.hallucination_count > 5)              # frequently accompanied by hallucination
    ]
    return bad.sort_values("citations", ascending=False).to_dict("records")

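The top bad docs go straight to the review queue and their owners are notified. A document whose responses are frequently accompanied by hallucination likely has a problem in its chunking, its metadata, or the content itself.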
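The inner aggregation of the report is elided in the query above. As a rough illustration of the idea only, and not a reconstruction of the elided SQL, the sketch below fans each response out to the documents it retrieved and applies the same thresholds. It assumes response rows are dicts with the keys `retrieved_doc_ids`, `thumbs_up`, and `hallucination_flag`, and it counts a document once per response; `per_doc_stats` and `flag_bad_docs` are hypothetical names.

```python
from collections import defaultdict

def per_doc_stats(rows):
    """Aggregate response-level signals per retrieved document."""
    acc = defaultdict(lambda: {"citations": 0, "thumbs_sum": 0.0, "hallucination_count": 0})
    for row in rows:
        for doc_id in set(row["retrieved_doc_ids"]):   # once per response
            a = acc[doc_id]
            a["citations"] += 1
            a["thumbs_sum"] += row["thumbs_up"]
            a["hallucination_count"] += row["hallucination_flag"]
    return {
        doc_id: {
            "citations": a["citations"],
            "thumbs_up": a["thumbs_sum"] / a["citations"],
            "hallucination_count": a["hallucination_count"],
        }
        for doc_id, a in acc.items()
    }

def flag_bad_docs(stats, min_citations=20):
    """Apply the article's thresholds; documents with too few citations are skipped."""
    bad = [
        {"doc_id": doc_id, **s}
        for doc_id, s in stats.items()
        if s["citations"] > min_citations
        and (s["thumbs_up"] < 0.3 or s["hallucination_count"] > 5)
    ]
    return sorted(bad, key=lambda d: -d["citations"])
```

The minimum-citation cut-off matters: without it, a document cited twice with two thumbs-down would be flagged on almost no evidence.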

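6 Signal 4: Citation Health

The quality of the citation pattern itself.

def citation_health() -> dict:
    return {
        # Spread: how many documents a single response cites on average
        "avg_citations_per_response": avg_citations_per_response(),

        # Concentration: Gini coefficient of citations across documents
        # (0 = evenly spread, close to 1 = a few documents take almost all citations)
        "concentration_gini": gini_coefficient_citations(),

        # Depth: ratio of chunk-level to document-level citations
        "chunk_level_ratio": chunk_level_citation_ratio(),

        # Reuse: average number of queries in which the same chunk is cited
        "avg_citation_reuse": avg_citation_reuse(),

        # No-citation responses: share of responses with zero citations
        "no_citation_rate": no_citation_rate(),
    }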
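Two of these metrics are easy to pin down in code. The sketch below computes a Gini coefficient over per-document citation counts using the standard sorted-rank formula, plus the share of responses with zero citations. The function names and input shapes are assumptions for illustration; the article's own helpers take no arguments and presumably query the log store.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = even; near 1 = concentrated)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n, ranks start at 1
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def zero_citation_share(citations_per_response):
    """Fraction of responses that cite nothing."""
    if not citations_per_response:
        return 0.0
    zero = sum(1 for c in citations_per_response if c == 0)
    return zero / len(citations_per_response)
```

With four documents, the maximum attainable value is 0.75 (one document takes every citation), so a fixed threshold such as 0.85 only becomes reachable once the corpus has more than a handful of documents.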

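6.1 Alert thresholds

Condition	Severity	Meaning
`avg_citations_per_response` < 1.5	warning	Too few citations: hallucination risk
`concentration_gini` > 0.85	warning	Over-reliance on a small set of documents
`no_citation_rate` > 10%	critical	Answers lack grounding
`chunk_level_ratio` < 0.5	info	Few chunk-level citations: room to improve chunking

Each threshold violation automatically triggers C23 hypothesis generation.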
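The threshold table can be expressed as data and evaluated against the output of `citation_health()`. This is a sketch under the assumption that metrics arrive as a flat dict keyed by the names used in `citation_health()`; `THRESHOLDS` and `evaluate_thresholds` are hypothetical names.

```python
import operator

# (metric, comparison, limit, severity), taken from the threshold table
THRESHOLDS = [
    ("avg_citations_per_response", operator.lt, 1.5, "warning"),
    ("concentration_gini", operator.gt, 0.85, "warning"),
    ("no_citation_rate", operator.gt, 0.10, "critical"),
    ("chunk_level_ratio", operator.lt, 0.5, "info"),
]

def evaluate_thresholds(metrics):
    """Return one violation record per breached threshold, in table order."""
    violations = []
    for name, compare, limit, severity in THRESHOLDS:
        value = metrics.get(name)
        if value is not None and compare(value, limit):   # skip missing metrics
            violations.append({"metric": name, "value": value, "severity": severity})
    return violations
```

Keeping the rules as data rather than as if-statements makes it straightforward to load them from a file such as config/alerts.yaml later.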

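7 Signal 5: Distribution Drift

Changes in distributions over time.

# scipy chi-square test plus KL divergence
from scipy.stats import chi2_contingency

def detect_distribution_drift(metric: str, period_days: int = 7,
                                baseline_days: int = 90):
    # distribution() is expected to return raw category counts in a fixed
    # category order; chi2_contingency needs counts, not proportions.
    recent = distribution(metric, days=period_days)
    baseline = distribution(metric, days=baseline_days)

    chi2, p, _, _ = chi2_contingency([recent, baseline])

    # KL divergence: how far the recent distribution has moved from the baseline.
    # Computed in both cases so a drift report also carries the magnitude.
    kl = kl_divergence(recent, baseline)
    return {"drift": bool(p < 0.001), "p": p, "kl": kl, "metric": metric}


# Metrics to check every week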
DRIFT_METRICS = [
    "intent_distribution",
    "topic_cluster_distribution",
    "citation_doc_distribution",
    "language_distribution",
    "segment_distribution",
]

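When drift is found, trace the cause:
- Did a new user group arrive?
- Did the system change (new model, new reranker)?
- Was there an external event (policy or market change)?
- Did the citation distribution shift after new documents were introduced?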
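To see which categories moved, the total KL divergence can be split into per-category terms. The sketch below is one way to do that, assuming category counts arrive as dicts and using additive smoothing so that categories missing from one window do not cause division by zero. `normalize` and `kl_contributions` are hypothetical names, not the article's `kl_divergence` helper.

```python
import math

def normalize(counts, smoothing=1.0):
    """Turn counts into probabilities with additive (Laplace) smoothing."""
    total = sum(counts.values()) + smoothing * len(counts)
    return {k: (v + smoothing) / total for k, v in counts.items()}

def kl_contributions(recent_counts, baseline_counts, smoothing=1.0):
    """Per-category terms of KL(recent || baseline); their sum is the KL divergence."""
    categories = set(recent_counts) | set(baseline_counts)
    recent = normalize({c: recent_counts.get(c, 0) for c in categories}, smoothing)
    baseline = normalize({c: baseline_counts.get(c, 0) for c in categories}, smoothing)
    return {c: recent[c] * math.log(recent[c] / baseline[c]) for c in categories}
```

Individual terms can be negative (a category that shrank), while the sum is never negative. Sorting the terms by value points at the categories that grew the most relative to the baseline, which is a useful starting point for the questions above.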

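8 Dashboard and alert design

8.1 Dashboard hierarchy

[Operations Dashboard]
├── KPI cards for the five signals
├── Coverage Heatmap (intent × topic)
├── Freshness Timeline (trend of stale documents)
├── Bad Docs Top 10
├── Citation Health Trend (4 weeks)
└── Drift Detection Alerts (last 7 days)

[Per-Collection Drilldown]
├── Five signals per collection
├── Distribution of per-document quality scores
└── Chunking quality metrics (C32)

[Per-Doc Detail]
├── Citation count and thumbs trend
├── Sample of queries that cited the document
└── Deprecation recommendation signal

The operations team looks only at the Operations Dashboard each day; the drilldowns are for when an alert fires.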

8.2 Alert priority

Level	Metric	Urgency	Channel
Critical	source_unreachable, hallucination_spike, permission_leak	Page immediately	Slack + email + on-call
Warning	coverage_gap, staleness above threshold (default 90d), bad_doc	Within 4 hours	Slack channel
Info	drift detected, citation health change	Weekly digest	Weekly report

Combined with the C25 Kill Switch, a critical alert can automatically trigger a collection disable.

9 Automatic reindexing and rechunking triggers

# app/knowledge/monitor/auto_reindex.py, run daily by cron
def auto_reindex_decisions():
    for doc in db.indexed_docs.find(stage="active"):
        # Sum up the signals
        score = 0
        if is_stale(doc, days_threshold=collection_freshness(doc.collection)):
            score += 3
        fused = doc_quality_score(doc.doc_id)["fused"]
        if fused is not None and fused < 0.5:    # no score when the doc was never cited
            score += 2
        if has_chunking_drift(doc):
            score += 1
        if source_modified(doc) > doc.indexed_at:
            score += 5                     # strongest signal

        if score >= 5:
            enqueue_reindex(doc.doc_id, priority="high")
        elif score >= 3:
            enqueue_reindex(doc.doc_id, priority="normal")
        elif score >= 1:
            schedule_review(doc.doc_id)    # human review queue
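The scoring rule is easier to reason about, and to unit-test, when it is separated from the database and queue calls. A minimal sketch with the same weights and cut-offs, where `reindex_action` is a hypothetical name and the four inputs are booleans computed elsewhere:

```python
def reindex_action(stale, low_quality, chunking_drift, source_changed):
    """Return (score, action) using the weights 3 / 2 / 1 / 5 from the article."""
    score = 3 * stale + 2 * low_quality + 1 * chunking_drift + 5 * source_changed
    if score >= 5:
        return score, "reindex_high"
    if score >= 3:
        return score, "reindex_normal"
    if score >= 1:
        return score, "review"
    return score, "none"
```

Worked through: a source change alone scores 5 and always lands in the high-priority queue. Staleness alone scores 3 (normal priority). Staleness plus low quality also reaches 5, so two weaker signals together are treated like one strong one.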

10 Integration with the C23 feedback loop

C23 in Phase C-5 registers treatments for weaknesses in response quality. C33 applies the same pattern at the knowledge level:

# C33 → C23 integration
def feedback_loop_for_knowledge():
    # 1. Weak cells (coverage gaps)
    gaps = detect_coverage_gaps()
    for gap in gaps[:5]:
        c23.register_treatment(target="knowledge", hypothesis=gap_to_hypothesis(gap))

    # 2. Poor documents
    bad_docs = bad_docs_report()
    for doc in bad_docs[:3]:
        c23.register_treatment(
            target="knowledge",
            # bad_docs_report() returns dicts, so index by key
            hypothesis={"action": "review_or_deprecate", "doc_id": doc["doc_id"]},
        )

    # 3. Drift
    drift = detect_distribution_drift("intent_distribution")
    if drift["drift"]:
        c23.register_alert(
            target="knowledge",
            text=f"intent distribution drift detected (p={drift['p']:.4f})",
        )

C23 registers the treatment, the C19 experiment pipeline validates it, and it ships after verification. Phase C-5 and Phase C-8 run on the same infrastructure.

11 Applying to MINERVA

app/knowledge/monitor/
├── coverage.py              # intent × topic gap detection
├── freshness.py              # stale documents, source drift
├── accuracy.py               # bad docs, hallucination association
├── citation_health.py        # spread, diversity, reuse
├── drift.py                  # distribution change via chi-square and KL
├── auto_reindex.py           # signal score → reindexing decision
└── alerts.py                 # thresholds, channels, escalation

scripts/
├── weekly_quality_report.py # combined report of the five signals
├── coverage_heatmap.py       # visual dashboard
└── bad_doc_review.py         # builds the human review queue

config/
├── freshness.yaml            # per-collection staleness thresholds
├── alerts.yaml               # threshold and channel matrix
└── reindex_triggers.yaml     # scoring rules for automatic reindexing

This module calls into C31 lifecycle, C32 chunking, C20 logging, C22 quality, and C23 feedback, which makes it a natural closure linking Phase C-5 and Phase C-8.

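12 Phase C-8 integration summary

[C31] Knowledge document lifecycle ─── six-stage gates for safe introduction, update, and retirement
              │
              ↓
[C32] Advanced chunking strategy ─── core technique of the indexed stage (5 axes + per-type)
              │
              ↓
[C33] Knowledge quality monitoring ─── five signals automatically detect loss over time (this article)
              │
              ↓
        back to [C31]: new collection introduction, reindexing

Once Phase C-8 is complete, the knowledge base becomes a system that improves over time: the automatic gate, detection, and treatment loop keeps working even though nobody reviews every document each week.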

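13 Common pitfalls

13.1 Vanity Metric

The dashboard shows only surface metrics such as total document count or indexing speed, which correlate weakly with actual usage quality.

Remedy:
- Base all five signals on user outcomes (thumbs, refusal, fused)
- Treat simple count metrics as auxiliary only; do not use them for decisions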

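13.2 Alert Fatigue

Frequent alerts lead the operations team to ignore them, and then critical alerts get missed too.

Remedy:
- Tune thresholds carefully (minimize false positives)
- Separate channels by level (page only on Critical)
- Send Info-level alerts as a weekly aggregated digest
- Track alert-handling statistics; when the ignore rate rises, re-evaluate the thresholds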

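13.3 Overcorrection

A coverage gap is found, 30 documents are added at once, some of them turn out to be noise, and retrieval scores for other queries get diluted.

Remedy:
- New documents must go through the C31 canary
- Do not add everything in one batch; add incrementally (5 to 10 documents at a time)
- Monitor the impact of each addition before starting the next batch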
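The incremental rollout can be sketched as a small loop. `impact_ok` is a hypothetical callback standing in for whatever check the C31 canary and the monitoring signals provide; nothing here reflects the actual C31 interface.

```python
def batches(items, size=5):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def staged_rollout(candidates, impact_ok, size=5):
    """Add documents batch by batch; stop as soon as the impact check fails.

    Returns everything added so far, including the batch that failed the
    check, so that batch can be investigated or rolled back.
    """
    added = []
    for batch in batches(candidates, size):
        added.extend(batch)
        if not impact_ok(added):
            break
    return added
```

Stopping at the first failing batch keeps the suspect set small: at most `size` documents need to be examined instead of all 30.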

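13.4 Drift Misinterpretation

Every distribution change gets read as a problem, although some changes are normal (seasonality, new users, new use cases).

Remedy:
- Treat a drift signal as a review trigger, not an alert
- Cross-check against recent external events (a human decision)
- For intended changes, reset the baseline and resume monitoring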

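13.5 Bad Doc Reflex Deletion

A document with low thumbs_up is deprecated immediately; if it was the only source of that information, a coverage gap appears.

Remedy:
- Send bad docs to the review queue first; deprecation comes later
- Confirm that a replacement document exists before deprecating
- Follow the C31 lifecycle as is: 30-day sunset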
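The "replacement exists" check can be made explicit before any sunset starts. The sketch assumes a mapping from topic to the set of active document ids covering it, which is an illustrative structure rather than something this article defines; `deprecation_decision` is a hypothetical name.

```python
def deprecation_decision(doc_id, topic_docs, bad_doc_ids):
    """Decide between review and sunset for a flagged document.

    topic_docs: dict mapping topic -> set of active doc ids covering it.
    A topic counts as covered only by documents that are not themselves flagged.
    """
    flagged = set(bad_doc_ids)
    sole_source_topics = sorted(
        topic for topic, docs in topic_docs.items()
        if doc_id in docs and not (docs - {doc_id} - flagged)
    )
    if sole_source_topics:
        # Removing this doc would open a coverage gap: keep it and review
        return {"action": "review", "topics": sole_source_topics}
    return {"action": "start_sunset", "sunset_days": 30, "topics": []}
```

Excluding other flagged documents from the replacement set avoids the case where two bad docs each justify the other's removal.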

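13.6 Cross-source Inconsistency

Source A holds old information and source B holds new information; both get cited at retrieval time and the answer contradicts itself.

Remedy:
- Set a source priority (declare a canonical source)
- Compare sensitivity and last_modified across multiple sources of the same information
- Raise a contradiction alert: when the LLM detects conflicting info while generating an answer, show it to the user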

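13.7 Self-fulfilling Prophecy

A document is classified as a bad doc because it is rarely cited, its retrieval score is lowered, it gets cited even less, and it is then rated even worse.

Remedy:
- Citation-count ranking must not affect retrieval (retrieval is embedding-based)
- Evaluate bad docs only on thumbs and hallucination at citation time; citation count itself is not a quality metric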

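14 Summary

Area	Key point
Five signals	Coverage, Freshness, Accuracy, Citation Health, Drift
Coverage	Gaps in the intent × topic matrix (refusal, thumbs, thin search results)
Freshness	Source last_modified vs. indexed last_modified, per-collection thresholds
Accuracy	Per-document thumbs and hallucination; bad docs go to an automatic review queue
Citation Health	Spread, diversity, depth, no-citation rate
Drift	chi-square and KL: distribution change over time
Automatic reindexing	Priority decided by a summed signal score
C23 integration	Every signal registers treatment candidates automatically
Pitfalls	vanity metric, alert fatigue, overcorrection, drift misinterpretation, reflex deletion, cross-source inconsistency, self-fulfilling prophecy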

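15 Applications

Scenario	Signals used
Coverage check after onboarding a new department	Coverage
Fast propagation of source changes	Freshness + auto reindex
Detecting quality regressions	Accuracy (per-document fused trend)
Checking retrieval diversity	Citation Health (Gini)
Discovering new use cases	Drift (intent distribution)
Quarterly operations retrospective	Combined weekly reports for all five signals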

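16 Related topics

Prerequisites (all of Phase C-8)

C31 Knowledge document lifecycle: six-stage gates
C32 Advanced chunking strategy: chunk quality metrics

Cross-reference (links with Phase C-5)

C20 Conversation logging: the data foundation for every signal
C21 Intent and topic: the dimensions of the Coverage Gap
C22 Response quality evaluation: fused_score and hallucination signals
C23 Feedback loop: unified treatment registration

Next (entering Phase C-9)

C34 Observability design: extends this article's dashboard and alert patterns (OpenTelemetry)
C35 Cost optimization: controlling reindexing and embedding costs
C36 Security and access control: sensitivity and permission monitoring

Cross-reference (operations)

C25 Kill Switch: critical alerts automatically disable a collection
C24 Harnessing: permission leak detection is based on the audit log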