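Kwangmin Kim - MINERVA Test Strategy Analysis

1 Overview

Testing an AI agent poses difficulties that ordinary software does not. LLM output is nondeterministic, external search indexes change, and network latency intervenes. MINERVA's test strategy isolates these nondeterministic external dependencies and concentrates on pure functions that can be verified deterministically.

2 Current Test Structure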

tests/
  agents/
    qna_chatbot/
      test_agent.py                # 70+ cases
    data_standardizer/
      test_agent_stream.py
      test_domain_auditor.py
      test_domain_classifier.py
      test_post_processing.py
  core/
    rag/
      test_metadata_loader.py
    test_llm.py

2.1 Case Distribution (based on qna_chatbot/test_agent.py)

Test class	Cases	What is verified
`TestConstructor`	6	Constructor input validation
`TestTruncateLeadTail`	7	History truncation boundaries
`TestFormatHistory`	5	Conversation formatting
`TestCleanResponseText`	5	Removal of the “참고:” (References) section
`TestExtractCitedIndices`	7	Cited-index extraction
`TestFilterCitations`	3	Citation filtering
`TestDocsToCitations`	5	Document → Citation conversion
`TestResolvePromptPath`	4	Prompt path precedence
`TestStreamErrorHandling`	4	stream() error handling
`TestBuildContextString`	7	Context assembly
`TestBuildChainInputs`	6	Chain inputs
`TestDocumentsMetadata`	6	Per-document metadata
`TestFilterDocsBySource`	4	Source filter
`TestRetrieveDocsSourceFilter`	4	Retrieve + filter integration

About 73 cases in total. All of them run without any external API dependency.

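2.2 Uncovered Areas

Area	Test file	Cost-effectiveness
Router (`routers/qna_chatbot.py`)	None	Possible right away with TestClient + mock
Experiment system (`core/experiments.py`)	None	Verifying sticky_hash determinism is the key point
Statistical tests (`core/stats.py`)	None	z-test/Welch/SRM χ²/lift: pure math functions, ideal for unit tests
Answer signals (`core/answer_features.py`)	None	Five regex patterns (citation/table/code/qna/principle): boundary tests give the best return
Relevance calibration (`core/relevance.py`)	None	display_score scaling: monotonicity, boundary values (0/1), Korean cosine distribution
Token pricing (`core/pricing.py`)	None	Per-model price table, env override, fallback for unregistered models
Config loader (`core/config.py`)	None	YAML `${VAR:default}` substitution, `_FALLBACK_DEFAULT` fallback
Retriever integration (`core/rag/retriever.py`)	None	Azure integration is expensive, but a `_FakeVectorStore` mock is feasible
Metrics logger (`core/metrics_logger.py`)	None	Partially covered by the PG migration regression tests described in Part 12-1
Frontend	None	Vitest + jsdom (lib/citationMarking and lib/conversations first)

Invest first in deterministic pure functions

stats.py, answer_features.py, and relevance.py are pure functions with no external dependencies, so they need no mocks at all. The df=1 erfc computation for the SRM χ² test, the Korean and § matching in the citation regex, and the piecewise-linear mapping of display_score are all examples: when one of these functions is wrong, A/B analysis, UI display, and observability break at the same time, yet they are the easiest code to unit test. This area should be filled in before router integration tests.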
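To illustrate how little such a test needs, the df=1 χ² tail probability reduces to a single `erfc` call. The sketch below is a hypothetical stand-in, not the code in `core/stats.py`; the name `srm_p_value` and its signature are assumptions made for this example.

```python
import math


def srm_p_value(control_n: int, treatment_n: int, expected_ratio: float = 0.5) -> float:
    """Sample-ratio-mismatch p-value for a two-arm split (chi-square, df=1).

    Hypothetical helper for illustration; not the actual core/stats.py API.
    """
    total = control_n + treatment_n
    if total <= 0:
        raise ValueError("at least one observation is required")
    if not 0.0 < expected_ratio < 1.0:
        raise ValueError("expected_ratio must be strictly between 0 and 1")
    expected_control = total * expected_ratio
    expected_treatment = total * (1.0 - expected_ratio)
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # For df=1 the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A perfectly balanced split gives a p-value of 1.0, and a 531/469 split lands very close to the familiar 0.05 threshold, which makes both convenient anchor cases for a unit test.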

3 Test Boundary Design

3.1 Core Principle: Keep the LLM/RAG Outside the Boundary

# tests/agents/qna_chatbot/test_agent.py

def _make_agent(**overrides) -> QnaChatbotAgent:
    """LLM and retrievers are lazy, so calling only the constructor touches no external dependency."""
    kwargs = {"document_path": "dummy.md"}
    kwargs.update(overrides)
    return QnaChatbotAgent(**kwargs)

QnaChatbotAgent initializes _prompt, _llm, and _retrievers lazily. Calling the constructor alone does not connect to Azure Search or OpenAI. Thanks to this design, most tests can run without the real services.
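The lazy-initialization idea can be sketched in a few lines. The class below is a hypothetical miniature, not the real QnaChatbotAgent; the `llm_factory` parameter and the `llm` property are assumptions introduced only to make the pattern observable.

```python
from typing import Callable, Optional


class LazyAgentSketch:
    """Hypothetical miniature of an agent with a lazily built LLM client."""

    def __init__(self, document_path: str, llm_factory: Callable[[], object]):
        self.document_path = document_path
        self._llm_factory = llm_factory
        self._llm: Optional[object] = None  # nothing is built in the constructor

    @property
    def llm(self) -> object:
        # Build the client on first access only, then reuse the same instance.
        if self._llm is None:
            self._llm = self._llm_factory()
        return self._llm


calls = []


def counting_factory() -> object:
    # Stand-in for an expensive client constructor; records each invocation.
    calls.append("built")
    return object()


agent_sketch = LazyAgentSketch("dummy.md", counting_factory)
calls_after_construction = len(calls)   # still zero: the constructor built nothing
first_client = agent_sketch.llm          # first access triggers the build
```

Because the factory is not invoked until `llm` is read, a test that only exercises pure helper methods never reaches the network.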

3.2 Document Stub

class _FakeDoc:
    """Stand-in for a langchain Document. Only page_content and metadata are used."""
    def __init__(self, page_content: str, metadata: dict | None = None):
        self.page_content = page_content
        self.metadata = metadata or {}

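Instead of LangChain Document objects, the tests use a stub that implements only the minimal interface. Citation parsing, context assembly, and the source filter logic are verified without real embeddings.

4 Verified Areas

4.1 Constructor Input Validation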

class TestConstructor:
    def test_both_document_path_and_documents_raises(self):
        with pytest.raises(ValueError, match="하나만 지정"):
            QnaChatbotAgent(document_path="a.md", documents={"x": "b.md"})

    def test_neither_document_path_nor_documents_raises(self):
        with pytest.raises(ValueError, match="필수"):
            QnaChatbotAgent()

    def test_default_document_not_in_documents_raises(self):
        with pytest.raises(ValueError, match="not in documents"):
            QnaChatbotAgent(
                documents={"표준화": "a.md"},
                default_document="없는키",
            )

These tests verify that creating an agent with invalid arguments raises a clear ValueError.

4.2 History Handling: `_truncate_with_lead_tail`

class TestTruncateLeadTail:
    def test_long_text_split_into_lead_tail(self):
        text = "A" * 200 + "B" * 100 + "C" * 200
        result = agent._truncate_with_lead_tail(text, lead_chars=200, tail_chars=150)
        assert result.startswith("A" * 200)
        assert result.endswith("C" * 150)
        assert "...(중략)..." in result

    def test_faq_section_stripped_before_truncation(self):
        text = "본문 내용입니다.\n\n관련 질문\n- 추가 Q1\n- 추가 Q2"
        result = agent._truncate_with_lead_tail(text)
        assert "관련 질문" not in result

    def test_max_chars_boundary_keeps_full_text(self):
        # lead (200) + tail (150) + ellipsis marker (15) = text up to 365 chars is kept whole
        text = "X" * 365
        assert agent._truncate_with_lead_tail(text) == text

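The tests verify the boundary values (365 and 366 characters) and the removal of the FAQ section. They are a good example of confirming the accuracy of a regex pattern through boundary values.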
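The behavior these tests pin down can be sketched as follows. This is a hypothetical re-implementation for illustration, not MINERVA's code: the separator `MARKER`, the `FAQ_HEADING` pattern, and the function name are assumptions, and only the 200/150 defaults and the 15-character marker allowance come from the test comment above.

```python
import re

MARKER = "\n...(중략)...\n"      # assumed separator; only the inner text appears in the tests
MARKER_BUDGET = 15               # marker allowance quoted in the test comment
FAQ_HEADING = re.compile(r"\n+관련 질문\s*\n")  # assumed "related questions" heading pattern


def truncate_with_lead_tail(text: str, lead_chars: int = 200, tail_chars: int = 150) -> str:
    """Keep the head and tail of a long message and mark the omitted middle."""
    # Drop the trailing "관련 질문" (related questions) section before measuring length.
    body = FAQ_HEADING.split(text, maxsplit=1)[0]
    if len(body) <= lead_chars + tail_chars + MARKER_BUDGET:
        return body
    # Slice from an explicit start index so tail_chars=0 yields an empty tail.
    return body[:lead_chars] + MARKER + body[len(body) - tail_chars:]
```

With these assumptions a 365-character input comes back unchanged, while a 366-character input is cut to lead + marker + tail, which is exactly the pair of cases a boundary test should cover.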

4.3 Citation Parsing Pipeline

Each of the three pipeline stages is tested independently.

# Stage 1: strip the trailing '참고:' (References) section
class TestCleanResponseText:
    def test_trailing_references_section_stripped(self):
        text = "본문 [1] 입니다.\n참고:\n[1] doc-a.md\n[2] doc-b.md"
        result = agent._clean_response_text(text)
        assert "참고:" not in result
        assert "본문 [1] 입니다" in result

    def test_inline_citation_marker_preserved(self):
        text = "본문 [1, §7.2.4] 입니다.\n참고:\n[1] doc.md"
        result = agent._clean_response_text(text)
        assert "[1, §7.2.4]" in result  # the inline marker is preserved

# Stage 2: extract cited indices
class TestExtractCitedIndices:
    def test_section_marker_citation(self):
        assert agent._extract_cited_indices("본문 [3, §7.2.4] 입니다.") == {3}

    def test_duplicate_citations_collapsed(self):
        assert agent._extract_cited_indices("[1] 본문 [1] 다시 [1]") == {1}

# Stage 3: keep only the Citations that were actually cited
class TestFilterCitations:
    def test_no_marker_keeps_all(self):
        """With no markers, keep every source (PoC parity)."""
        citations = [Citation(index=i, content=f"doc-{i}") for i in range(1, 4)]
        result = agent._filter_citations("인용 없는 본문", citations)
        assert {c.index for c in result} == {1, 2, 3}

    def test_cited_index_out_of_range_filtered_out(self):
        """If the answer cites [99] but it is not in the actual list, drop it."""
        citations = [Citation(index=i, content=f"doc-{i}") for i in range(1, 4)]
        result = agent._filter_citations("[99] [1]", citations)
        assert {c.index for c in result} == {1}
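Stages 2 and 3 can be sketched with one regex and one filter. The pattern and the `Citation` dataclass below are assumptions made for illustration (the real model and regex in MINERVA may differ); the sketch only aims to satisfy the behaviors asserted in the tests above.

```python
import re
from dataclasses import dataclass

# Matches "[3]" and "[3, §7.2.4]"; the exact production pattern is an assumption.
CITATION_MARKER = re.compile(r"\[(\d+)(?:\s*,\s*§[^\]]*)?\]")


@dataclass
class Citation:
    """Minimal stand-in for the project's Citation model."""
    index: int
    content: str


def extract_cited_indices(text: str) -> set[int]:
    """Collect the distinct document indices cited in the answer text."""
    return {int(m.group(1)) for m in CITATION_MARKER.finditer(text)}


def filter_citations(text: str, citations: list[Citation]) -> list[Citation]:
    """Keep only cited entries; with no markers at all, keep everything."""
    cited = extract_cited_indices(text)
    if not cited:
        return list(citations)
    return [c for c in citations if c.index in cited]
```

Returning a set makes duplicate markers collapse naturally, and filtering against the real citation list is what discards an out-of-range index such as [99].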

4.4 Streaming Error Handling: `_FakeChain` mock

class _FakeChain:
    """Mock of LCEL chain.stream(): simulates a token sequence or an exception."""
    def __init__(self, chunks=None, *, raise_after=None, raise_exc=None):
        self._chunks = chunks or []
        self._raise_after = raise_after
        self._raise_exc = raise_exc or RuntimeError("LLM 통신 실패 가정")

    def stream(self, _inputs):
        for i, chunk in enumerate(self._chunks):
            if self._raise_after is not None and i >= self._raise_after:
                raise self._raise_exc
            yield chunk

class TestStreamErrorHandling:
    def test_stream_failure_after_partial_tokens(self, monkeypatch):
        """Failure after the first tokens are yielded: token frames plus an error frame, no done frame."""
        fake_chain = _FakeChain(
            chunks=["안", "녕", "하"],
            raise_after=2,
            raise_exc=TimeoutError("LLM 응답 timeout"),
        )
        monkeypatch.setattr(
            agent, "_prepare",
            lambda _q: ([], fake_chain, {}, "gpt-4.1"),
        )
        events = list(agent.stream(self._query()))
        token_events = [e for e in events if e.type == "token"]
        error_events = [e for e in events if e.type == "error"]
        done_events  = [e for e in events if e.type == "done"]
        assert len(token_events) == 2          # raise_after=2: only tokens 0 and 1 are emitted
        assert len(error_events) == 1
        assert len(done_events) == 0           # verifies termination without a done frame

monkeypatch intercepts _prepare(), so the streaming error path is verified without real RAG.
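The event ordering this test expects (tokens, then a single error, and no done frame) follows from a generator shaped roughly like the one below. `StreamEvent` and `stream_events` are hypothetical names for illustration; the real stream() presumably carries more fields and also handles retrieval and logging.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamEvent:
    """Hypothetical event frame; the real event model may carry more fields."""
    type: str          # "token", "error", or "done"
    data: str = ""


def stream_events(chunks: Iterable[str]) -> Iterator[StreamEvent]:
    """Forward chunks as token events; on failure emit one error and stop."""
    try:
        for chunk in chunks:
            yield StreamEvent("token", chunk)
    except Exception as exc:
        yield StreamEvent("error", str(exc))
        return  # terminate without emitting "done"
    yield StreamEvent("done")


def failing_chunks() -> Iterator[str]:
    # Two tokens succeed, then the upstream call times out.
    yield "안"
    yield "녕"
    raise TimeoutError("LLM 응답 timeout")


failed_run = list(stream_events(failing_chunks()))
clean_run = list(stream_events(["a", "b"]))
```

Placing the `done` frame after the try/except block, rather than in a `finally`, is what guarantees a client never sees both an error and a done frame for the same run.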

4.5 Context Assembly: `_build_context_string`

class TestBuildContextString:
    def test_source_name_appears_as_attribution(self):
        """If metadata.source_name is present, an (출처: ...) source attribution is added."""
        docs = [_FakeDoc("DAMA 본문", metadata={"source_name": "DAMA-DMBOK"})]
        result = agent._build_context_string(docs)
        assert "[Document 1] (출처: DAMA-DMBOK)" in result

    def test_no_source_name_keeps_legacy_format(self):
        """Without source_name, keep the existing format (regression guard)."""
        docs = [_FakeDoc("본문", metadata={"section": "§7.1.4"})]
        assert agent._build_context_string(docs) == "[Document 1]\n본문"

    def test_mixed_attribution_per_doc(self):
        docs = [
            _FakeDoc("외부", metadata={"source_name": "DAMA-DMBOK"}),
            _FakeDoc("사내", metadata={}),
            _FakeDoc("사전", metadata={"source_name": "단어 사전"}),
        ]
        result = agent._build_context_string(docs)
        assert "[Document 2]\n사내" in result   # legacy format
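A minimal sketch of the formatting rule these tests describe is shown below. The function is a hypothetical reconstruction; in particular, joining documents with a blank line is an assumption, since the tests only fix the header line and the first body line.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def build_context_string(docs: Sequence[Any]) -> str:
    """Number each document and add a source attribution when one is known."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        header = f"[Document {i}]"
        source_name = (getattr(doc, "metadata", None) or {}).get("source_name")
        if source_name:
            header += f" (출처: {source_name})"  # "출처" means "source"
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n".join(blocks)  # separator between documents is an assumption


docs = [
    SimpleNamespace(page_content="외부", metadata={"source_name": "DAMA-DMBOK"}),
    SimpleNamespace(page_content="사내", metadata={}),
]
context = build_context_string(docs)
```

Documents without `source_name` fall through to the bare `[Document N]` header, which is the legacy format the regression test protects.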

4.6 Conversation History Slicing: `history_turns`

class TestBuildChainInputs:
    def test_history_turns_limits_recent_pairs(self):
        """history_turns=1 → only the most recent turn (2 messages) is included."""
        agent.config.conversation.history_turns = 1
        query = Query(
            text="후속",
            history=[
                ConversationTurn(role="user", content="첫 질문"),
                ConversationTurn(role="assistant", content="첫 답"),
                ConversationTurn(role="user", content="두번째 질문"),
                ConversationTurn(role="assistant", content="두번째 답"),
            ],
        )
        result = agent._build_chain_inputs(query, [])
        assert "두번째 질문" in result["chat_history"]
        assert "첫 질문" not in result["chat_history"]

This verifies that config.conversation.history_turns actually affects the length of the history placed in the prompt.
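The slicing itself is a one-liner, but it hides a classic edge case worth a dedicated test. The helper below is a hypothetical sketch; treating `history_turns=0` as "no history" is an assumption about the intended behavior.

```python
def slice_recent_turns(history: list, history_turns: int) -> list:
    """Keep the last `history_turns` user/assistant pairs (2 messages per turn)."""
    if history_turns <= 0:
        # history[-0:] would return the whole list, so zero is handled explicitly.
        return []
    return history[-2 * history_turns:]


messages = ["첫 질문", "첫 답", "두번째 질문", "두번째 답"]
recent = slice_recent_turns(messages, 1)
```

A boundary test for `history_turns=0` catches the `[-0:]` pitfall, which otherwise silently sends the full history to the prompt.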

4.7 Source Filter: post-retrieval metadata filter

class TestFilterDocsBySource:
    def test_filter_by_source_type(self):
        docs = [
            _FakeDoc("A", metadata={"source_type": "external_reference", "source_name": "DAMA-DMBOK"}),
            _FakeDoc("B", metadata={"source_type": "internal_standard"}),
            _FakeDoc("C", metadata={"source_type": "external_reference", "source_name": "DAMA-DMBOK"}),
        ]
        filtered = agent._filter_docs_by_source(docs, "external_reference")
        assert len(filtered) == 2

    def test_filter_handles_missing_metadata(self):
        """Docs without metadata are handled safely."""
        docs = [_FakeDoc("no meta")]
        filtered = agent._filter_docs_by_source(docs, "DAMA-DMBOK")
        assert filtered == []
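One way to satisfy both tests above is a filter that tolerates missing metadata. The sketch is hypothetical: matching the filter value against both `source_type` and `source_name` is an assumption inferred from the two test inputs, not a confirmed rule.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def filter_docs_by_source(docs: Sequence[Any], source: str) -> list:
    """Keep docs whose source_type or source_name equals `source` (assumed rule)."""
    kept = []
    for doc in docs:
        # Treat a missing or None metadata attribute as an empty dict.
        meta = getattr(doc, "metadata", None) or {}
        if source in (meta.get("source_type"), meta.get("source_name")):
            kept.append(doc)
    return kept


sample_docs = [
    SimpleNamespace(page_content="A", metadata={"source_type": "external_reference", "source_name": "DAMA-DMBOK"}),
    SimpleNamespace(page_content="B", metadata={"source_type": "internal_standard"}),
    SimpleNamespace(page_content="C", metadata=None),
]
external_only = filter_docs_by_source(sample_docs, "external_reference")
```

Using `.get()` on a guaranteed dict keeps the filter from raising on documents that were indexed before the metadata fields existed.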

5 Unverified Areas

5.1 Router Layer

There are no router tests under tests/.

tests/agents/qna_chatbot/test_agent.py  ← exists
# /run and /stream tests using FastAPI TestClient  ← missing

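Router behavior that is currently unverified:

Correctness of the double-checked locking in _build_agent()
Whether the /stream response actually has the text/event-stream Content-Type
Whether log_run() is called after the RunResponse is returned
The HTTP 500 response format (no error handler is applied, so FastAPI's default response is returned)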
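The first item, double-checked locking, can be exercised with a thread pool and a build counter. The code below is a generic sketch of the pattern, not the router's `_build_agent()`; all names in it are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_agent = None
_agent_lock = threading.Lock()
build_count = 0


def _expensive_build() -> object:
    """Stand-in for agent construction; counts how often it runs."""
    global build_count
    build_count += 1
    time.sleep(0.01)  # widen the window in which threads could race
    return object()


def get_agent() -> object:
    global _agent
    if _agent is None:              # first check, without the lock
        with _agent_lock:
            if _agent is None:      # second check, under the lock
                _agent = _expensive_build()
    return _agent


# Hit the accessor from many threads at once and collect what each one saw.
with ThreadPoolExecutor(max_workers=16) as pool:
    agents = list(pool.map(lambda _: get_agent(), range(200)))
```

The build only ever happens while the lock is held and after the second check, so every thread should observe the same instance and the counter should end at one.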

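5.2 Experiment System

# Loading tests for data/experiments/*.yaml       ← missing
# Determinism test for sticky_hash assignment     ← missing
# Type-mismatch detection in apply_overrides      ← missing

When apply_overrides() receives a dotted key that does not exist, it raises AttributeError. No test covers this boundary.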
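A dotted-key override routine of this kind might look like the sketch below. It is a hypothetical stand-in rather than the real `apply_overrides()`: the config dataclasses are invented for the example, and the explicit type check is an assumption about what "type-mismatch detection" could mean.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalSketch:
    top_k: int = 5


@dataclass
class ConfigSketch:
    retrieval: RetrievalSketch = field(default_factory=RetrievalSketch)


def apply_overrides_sketch(config: object, overrides: dict) -> None:
    """Set dotted keys such as 'retrieval.top_k' on a nested config, in place."""
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        target = config
        for name in parents:
            target = getattr(target, name)  # AttributeError on an unknown section
        if not hasattr(target, leaf):
            raise AttributeError(f"unknown config key: {dotted_key}")
        current = getattr(target, leaf)
        # Assumed policy: reject values whose type differs from the current one.
        if current is not None and not isinstance(value, type(current)):
            raise TypeError(f"{dotted_key}: expected {type(current).__name__}")
        setattr(target, leaf, value)
```

Checking `hasattr` before `setattr` matters because plain dataclasses accept new attributes silently, so a typo in an experiment YAML would otherwise go unnoticed.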

5.3 Config System

# Fallback test for get_config() YAML load failure   ← missing
# Accuracy test for ${VAR:default} substitution      ← missing
# Validation of _FALLBACK_DEFAULT values             ← missing

The fallback to _FALLBACK_DEFAULT on a YAML parsing error is not tested.
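The `${VAR:default}` substitution is another pure function that is easy to test in isolation. The sketch below is a hypothetical implementation, not the loader in `core/config.py`; raising `KeyError` for an unset variable without a default is an assumed policy.

```python
import os
import re
from typing import Mapping, Optional

# ${NAME} or ${NAME:default}; the default may be empty.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${VAR:default} placeholders using `env` (os.environ by default)."""
    env = os.environ if env is None else env

    def _replace(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumed policy: an unset variable with no default is an error.
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _VAR_PATTERN.sub(_replace, text)
```

Passing the environment as a parameter keeps the test hermetic: the cases can be driven by plain dicts instead of mutating `os.environ`.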

5.4 Retriever Integration

ParentChunkRetriever.hybrid_search() talks directly to Azure AI Search. Unit tests that use a mock retriever exist today, but there is no integration test against a real index.

5.5 Frontend

frontend/src/
  __tests__/   ← missing

There are no tests at all for the React components, handleSend(), or the localStorage save and restore logic.

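6 Test Reinforcement Priorities

Priority	Target	Reason
High	The four tests in `core/stats.py`	Directly tied to the reliability of A/B results, zero external dependencies, can be added immediately
High	The five regex patterns in `core/answer_features.py`	The automatic signals feed every operational metric; boundaries around Korean text, `§`, and compound markers
High	Router `/run`, `/stream`	The missing error handler leaves P1 and P2 vulnerabilities
High	`apply_overrides()` error cases	Nonexistent keys, type mismatches
Medium	`core/relevance.py` display_score calibration	Monotonicity and the 0/1 boundaries, for consistent UI display
Medium	`core/pricing.py` cost calculation	Fallback for unregistered models, env override
Medium	`resolve_config_for_user()` fallback	When no experiment exists, or when force_arm does not exist
Medium	`assign_arm()` determinism	Same user_id → always the same arm
Low	localStorage save and restore	Vitest + jsdom, `lib/conversations.ts`
Low	`_active_config` thread safety	Detect race conditions with concurrent.futures

6.1 Router Test Example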

from fastapi.testclient import TestClient
from unittest.mock import MagicMock, patch

from services.api.main import app

client = TestClient(app)

def test_run_returns_200_with_mocked_agent():
    mock_response = MagicMock()
    mock_response.text = "답변"
    mock_response.citations = []
    mock_response.run_id = "test-run-id"
    mock_response.latency_ms = 100
    mock_response.ttft_ms = 100
    mock_response.arm_id = None
    mock_response.model = "gpt-4.1"

    with patch("services.api.routers.qna_chatbot._build_agent") as mock_build:
        mock_agent = MagicMock()
        mock_agent.run.return_value = mock_response
        mock_build.return_value = (mock_agent, None, None, {})

        resp = client.post("/agents/qna_chatbot/run", json={"text": "질문"})
        assert resp.status_code == 200
        data = resp.json()
        assert data["response"]["text"] == "답변"

6.2 Experiment System Test Examples

def test_apply_overrides_invalid_key_raises():
    config = RAGConfig()
    with pytest.raises(AttributeError):
        apply_overrides(config, {"nonexistent.key": "value"})

def test_sticky_hash_is_deterministic():
    """Same user_id + experiment → always the same arm."""
    exp = Experiment(
        name="test_exp", agent="qna_chatbot",
        arms={"control": Arm(traffic=0.5), "treatment": Arm(traffic=0.5)},
    )
    first  = assign_arm(exp, user_id="user-42")
    second = assign_arm(exp, user_id="user-42")
    assert first == second

def test_sticky_hash_different_users_may_differ():
    """Different user_ids may be assigned to different arms (not always)."""
    exp = Experiment(
        name="test_exp", agent="qna_chatbot",
        arms={"control": Arm(traffic=0.5), "treatment": Arm(traffic=0.5)},
    )
    arms_assigned = {assign_arm(exp, user_id=f"user-{i}") for i in range(100)}
    assert len(arms_assigned) > 1  # confirms that both arms are assigned among 100 users
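For context, a sticky assignment of this kind is usually built on a stable hash rather than Python's built-in `hash()`, which is salted per process for strings. The function below is a hypothetical sketch, not MINERVA's `assign_arm()`; the hash input format and the 10,000-bucket granularity are assumptions.

```python
import hashlib


def assign_arm_sketch(experiment_name: str, user_id: str, arms: dict[str, float]) -> str:
    """Map (experiment, user) to an arm; `arms` maps arm name to traffic share."""
    if not arms:
        raise ValueError("at least one arm is required")
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode("utf-8")).digest()
    # Reduce the first 8 bytes to one of 10,000 buckets, giving a point in [0, 1).
    point = (int.from_bytes(digest[:8], "big") % 10_000) / 10_000
    cumulative = 0.0
    chosen = next(iter(arms))
    for name, share in arms.items():
        chosen = name
        cumulative += share
        if point < cumulative:
            break
    # If shares sum to slightly under 1, the last arm absorbs the remainder.
    return chosen
```

Including the experiment name in the hash input keeps assignments independent across experiments, so a user who lands in treatment for one experiment is not systematically in treatment for the next.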

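Advanced test patterns: split into a separate part

The six advanced test patterns that build on this article's diagnosis and priorities (property-based, snapshot, concurrency, contract, mutation, and nondeterminism handling), together with the CI separation strategy and the PG migration regression tests, have been moved to Part 12-1, Advanced Test Patterns. This article focuses on diagnosing how far verification currently reaches and where reinforcement should begin.

7 Summary

MINERVA's tests are designed around the principle "do not trust the LLM/RAG; test its boundary instead." The parts that can be verified deterministically, such as constructor validation, citation parsing, context assembly, and error event ordering, are well covered. The unverified areas are concentrated in the router, the experiment system, and the frontend. Based on the Phase C-1 analysis, these weaknesses will be addressed in sequence during Phase C-2.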