Kwangmin Kim - MINERVA Phase C-8 — 청킹 전략 고도화 (문서 유형별 최적 분할 + 메타데이터)

1 왜 청킹이 RAG 품질을 결정하는가

retrieval은 “임베딩이 가까운 청크 top-k를 가져오는” 단순한 메커니즘이다. 그 단순함이 청킹의 책임을 키운다.

청킹 결정	영향
너무 작게 (50~100 토큰)	컨텍스트 손실, 임베딩이 표면적 단어에 끌림
너무 크게 (2000+ 토큰)	임베딩이 평균화되어 미세 의미 못 잡음, 한 청크 = 여러 주제
임의 위치 분할	문장·표·코드 블록 가운데 — 의미 단절
메타데이터 누락	검색 시 “이게 어느 문서·어느 섹션인지” 모름
overlap 없음	경계 정보 누락 — query가 경계에 걸치면 못 잡힘

해법은 문서 유형별 패턴. PDF 청킹과 코드 청킹은 다른 알고리즘이 필요하다.

2 청킹 5축

1. 단위 (Granularity)
   token · sentence · paragraph · section · document

2. 크기 (Size)
   typical 200~800 tokens, max·min 한도

3. Overlap
   인접 청크 간 공유 (10~20% 권장)

4. 메타데이터
   source·section·heading·page·sensitivity·언어·author 등

5. 계층 (Hierarchy)
   parent·child 관계, 요약 청크, hypothetical query

각 축이 독립적으로 조정 가능. 문서 유형별로 다른 조합이 적합.

3 기본 전략 5종

3.1 Fixed-size — 가장 단순

def fixed_size_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

적합: 빠른 prototype. 다른 전략 비교 baseline. 약점: 문장·문단 가운데 자름. 의미 단절.

3.2 Recursive Character — 일반 표준

LangChain·LlamaIndex 기본. 우선순위 separator로 단계적 분할:

SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def recursive_chunk(text: str, max_size: int = 800, overlap: int = 100) -> list[str]:
    if len(text) <= max_size:
        return [text]

    for sep in SEPARATORS:
        if sep in text:
            parts = text.split(sep)
            chunks = []
            current = ""
            for p in parts:
                if len(current) + len(p) <= max_size:
                    current = (current + sep + p).lstrip(sep)
                else:
                    if current:
                        chunks.append(current)
                    current = p
            if current:
                chunks.append(current)

            # overlap 추가 — 인접 청크 끝부분을 다음 시작에
            return _add_overlap(chunks, overlap)

    return [text[i:i+max_size] for i in range(0, len(text), max_size - overlap)]

적합: 일반 산문 (블로그·문서·매뉴얼). 약점: 표·코드·헤딩 같은 구조 인식 못 함.

3.3 Sentence-based — NLP 기반

nltk·spaCy로 문장 분리 후 그룹:

import spacy
nlp = spacy.load("ko_core_news_sm")


def sentence_chunk(text: str, max_size: int = 800) -> list[str]:
    doc = nlp(text)
    sentences = [s.text for s in doc.sents]

    chunks, current = [], ""
    for s in sentences:
        if len(current) + len(s) <= max_size:
            current += " " + s
        else:
            chunks.append(current.strip())
            current = s
    if current:
        chunks.append(current.strip())
    return chunks

적합: 한국어·일본어처럼 문장 구분이 명확한 텍스트. 약점: NLP 모델 의존. 외국어 mix 문서에서 정확도 ↓.

3.4 Markdown-aware — 헤딩 보존

import re

def markdown_aware_chunk(text: str) -> list[dict]:
    # 헤딩으로 분할
    sections = re.split(r"^(#{1,6} .+)$", text, flags=re.MULTILINE)

    chunks = []
    current_path = []                    # ["# Top", "## Section A"]
    body = ""

    for piece in sections:
        if re.match(r"^#{1,6} ", piece):
            if body.strip():
                chunks.append({"path": list(current_path), "content": body.strip()})
            level = piece.count("#", 0, piece.index(" "))
            current_path = current_path[:level - 1] + [piece]
            body = ""
        else:
            body += piece

    if body.strip():
        chunks.append({"path": list(current_path), "content": body.strip()})

    return chunks

각 청크에 path (heading 경로) 메타데이터. 검색 결과에서 “어느 섹션인지” 즉시 표시.

적합: README·정책 문서·블로그·기술 매뉴얼.

3.5 Semantic — 임베딩 기반 분할

문장 embedding의 cosine similarity가 임계 미만으로 떨어지는 지점에서 분할:

def semantic_chunk(text: str, threshold: float = 0.5) -> list[str]:
    sentences = split_sentences(text)
    embeddings = embed_batch(sentences)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine(embeddings[i-1], embeddings[i])
        if sim < threshold:                # 의미 단절 → 새 청크
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    if current:
        chunks.append(" ".join(current))
    return chunks

적합: 주제 전환이 자주 일어나는 긴 문서. 약점: 임베딩 비용·threshold tuning 필요.

4 문서 유형별 권장

유형	권장 전략	이유
Markdown 문서	Markdown-aware + recursive	헤딩 = 자연 경계
PDF·DOCX	Layout-aware + sentence	표·이미지 분리, OCR 후 sentence
코드	AST 기반 (function·class 단위)	function 가운데 자름 = 무용
표·CSV	row 단위 또는 cell 단위	행은 자연 record
대화 로그	turn 단위 (user·assistant 짝)	한 turn = 한 단위 의미
법률·정책	조항 단위 + heading	“제3조 (…)” 자연 분리
학술 논문	section 기반 (Abstract·Methods·Results 등)	section = self-contained
메일·메신저	메일 단위 또는 thread	한 메일 = 한 컨텍스트
위키·FAQ	Q&A 짝	한 Q&A = 자족적

5 코드 청킹 — AST 기반

# app/knowledge/chunkers/code.py
import ast


def chunk_python_by_ast(source: str, file_path: str) -> list[dict]:
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "type": type(node).__name__,
                "name": node.name,
                "content": ast.get_source_segment(source, node),
                "line_start": node.lineno,
                "line_end": node.end_lineno,
                "metadata": {"file": file_path, "imports": _extract_imports(tree)},
            })
    return chunks

각 함수·클래스가 한 청크. 메타데이터에 import 목록도 포함 — 의존성까지 검색 가능.

다른 언어 (Go·TypeScript·Rust 등) — tree-sitter가 통일 인터페이스.

6 표 청킹

# Markdown table 또는 CSV
def chunk_table(table: list[list[str]], header_row: bool = True) -> list[dict]:
    if header_row:
        headers = table[0]
        rows = table[1:]
    else:
        headers = [f"col_{i}" for i in range(len(table[0]))]
        rows = table

    chunks = []
    for i, row in enumerate(rows):
        # 각 행을 자연어로 변환 (검색 친화)
        content = ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append({
            "row_index": i,
            "content": content,
            "raw_row": row,
            "metadata": {"headers": headers},
        })
    return chunks

행을 “field: value, field: value” 형태로 변환 — 자연어 query와 임베딩 매칭이 자연.

7 Parent-Child Chunking

작은 청크로 retrieval, 큰 청크로 컨텍스트 제공.

class HierarchicalChunker:
    def chunk(self, text: str) -> dict:
        # 작은 청크 (검색용 — 정확도)
        small_chunks = self._chunk(text, size=300)

        # 큰 청크 (생성용 — 컨텍스트)
        large_chunks = self._chunk(text, size=1500)

        # 매핑 — 작은 청크가 어느 큰 청크에 속하는지
        mapping = self._build_parent_map(small_chunks, large_chunks)

        return {
            "small": small_chunks,
            "large": large_chunks,
            "parent_of": mapping,
        }


# 검색 시
small_hits = vector_store.search(query, top_k=5)
large_contexts = [chunks["large"][mapping[s.id]] for s in small_hits]
# → small이 정확하게 매칭, large가 충분한 컨텍스트 제공

작은 청크는 임베딩 정밀, 큰 청크는 LLM에 제공. retrieval 정확도와 답변 품질 모두 ↑.

8 Hypothetical Query Embedding

청크 자체 대신 그 청크가 답할 만한 질문을 임베딩.

def add_hypothetical_queries(chunk: str, n: int = 3) -> list[str]:
    prompt = f"""다음 청크가 답할 만한 사용자 질문 {n}개를 작성:

청크:
{chunk}

질문 (한 줄씩):"""
    questions = llm_call(prompt).split("\n")
    return questions[:n]


# 인덱싱 시
for chunk in chunks:
    questions = add_hypothetical_queries(chunk.text)
    chunk_embedding = embed(chunk.text)
    question_embeddings = embed_batch(questions)

    # 모든 임베딩을 같은 청크 ID로 저장
    for emb in [chunk_embedding] + question_embeddings:
        vector_store.add(chunk_id=chunk.id, embedding=emb)

사용자 query는 보통 질문 형태 — 청크 본문보다 질문 임베딩과 더 가까움. retrieval 정확도 향상.

비용 — 청크당 LLM 호출 N번. indexing 시점에 한 번만 (검색 시 추가 비용 X).

9 Summary Embedding

긴 청크의 요약을 별도 임베딩 — 두 종류 검색.

def add_summary(chunk: str) -> str:
    return llm_call(f"다음을 한 문장으로 요약: {chunk}")


for chunk in chunks:
    summary = add_summary(chunk.text)
    vector_store.add(chunk_id=chunk.id, embedding=embed(chunk.text), kind="content")
    vector_store.add(chunk_id=chunk.id, embedding=embed(summary), kind="summary")

요약 임베딩은 high-level query (예: “보안 정책 개요”)에 적합, 본문 임베딩은 specific query (예: “GDPR 14조 통지 기한”)에 적합. 두 임베딩 모두 검색 후 RRF (Reciprocal Rank Fusion)로 결합.

10 메타데이터 주입

검색 결과에 도움될 모든 정보를 청크와 함께 저장:

class ChunkMetadata(BaseModel):
    doc_id: str
    chunk_id: str
    source: str                          # "confluence", "github_wiki" 등
    source_url: str
    title: str
    section_path: list[str] = []         # markdown 경로
    page_number: int | None = None
    line_range: tuple[int, int] | None = None
    language: str
    sensitivity: str
    owners: list[str]
    indexed_at: datetime
    last_modified: datetime
    chunk_type: str                       # "text" | "code" | "table" | "list"
    siblings: list[str] = []             # 같은 섹션 다른 청크 ID

이 메타데이터로 검색 시 필터 + 응답 생성 시 인용 + drift 감지 모두 가능.

10.1 메타데이터를 임베딩에 포함

청크 텍스트에 메타를 prefix:

def format_for_embedding(chunk: str, meta: ChunkMetadata) -> str:
    prefix = f"[{meta.title}] {' > '.join(meta.section_path)}\n\n"
    return prefix + chunk


embedding = embed(format_for_embedding(chunk.text, chunk.metadata))

질의가 “보안 정책의 이중인증 부분”이면 — 텍스트만으로는 못 잡고 heading “보안 정책 > 인증” 같은 prefix가 매칭.

11 청크 품질 평가

def chunk_quality_metrics(chunks: list) -> dict:
    return {
        "avg_size": np.mean([len(c.text) for c in chunks]),
        "size_std": np.std([len(c.text) for c in chunks]),
        "min_size": min(len(c.text) for c in chunks),
        "max_size": max(len(c.text) for c in chunks),
        "broken_sentences": sum(1 for c in chunks if not c.text.strip().endswith(".")),
        "broken_code_blocks": sum(1 for c in chunks if c.text.count("```") % 2 != 0),
        "missing_metadata": sum(1 for c in chunks if not c.metadata.section_path),
        "duplicate_rate": _compute_duplicate_rate(chunks),
    }

자동 게이트 — broken_code_blocks > 0 또는 min_size < 50이면 indexed 단계 진입 차단.

12 C31·C33과의 결합

청킹 결정	C31 lifecycle	C33 모니터링
청킹 알고리즘 선택	indexed 단계 게이트	drift 신호 (분포 변화)
청크 크기 분포	quality validation	weekly 분포 monitoring
메타데이터 일관성	indexed validation	missing metadata 비율
임베딩 모델	변경 시 전체 재인덱싱	모델 deprecation 추적

청킹 변경 자체도 C26 lifecycle 패턴 — 새 알고리즘 도입 시 canary로 점진 검증.

13 MINERVA 적용

app/knowledge/chunkers/
├── base.py                  # BaseChunker contract
├── recursive.py              # 일반 표준
├── markdown.py               # heading-aware
├── code.py                   # AST 기반
├── table.py                  # row·cell 단위
├── pdf.py                    # layout-aware (pdfplumber)
├── dialog.py                 # turn 단위
├── semantic.py               # 임베딩 기반 분할
└── hierarchical.py           # parent-child

app/knowledge/enhance/
├── hypothetical_queries.py   # LLM 기반 가상 질문
├── summary.py                 # 청크 요약
└── metadata.py                # 메타 주입·검증

scripts/
├── chunk_eval.py             # 청크 품질 메트릭
├── chunking_ab.py             # 새 알고리즘 vs baseline
└── chunk_cost.py              # LLM enhancement 비용 추적

C31 indexed 단계의 핵심 단계 — 본 편이 그 알고리즘 카탈로그.

14 자주 발생하는 함정

14.1 Small Chunk Problem

너무 작게 자르면 청크가 표면적 키워드 매칭만 — 의미 검색 무력화.

# 검출
if np.mean([len(c.text) for c in chunks]) < 100:
    alert("chunks too small")

해법: 최소 200 tokens, 일반적 권장 300~800.

14.2 Context Loss at Boundaries

문장·표·코드 가운데 자름 → 의미 단절.

해법: - recursive separator priority - 코드는 AST 단위 - 표는 row 단위 + header 메타에 포함

14.3 Metadata Bloat

메타데이터가 청크 텍스트보다 길어지면 — 임베딩이 메타에 dominated.

해법: - 임베딩에는 짧은 prefix만 (title + section path) - 나머지 메타는 vector store payload로 (검색 결과 반환만)

14.4 Hypothetical Query Drift

LLM이 생성한 가상 질문이 실제 사용자 query 분포와 다름 → 비용 들였는데 검색 정확도 안 오름.

해법: - 실제 query log에서 학습 (C20 raw·structured) - 분기 평가 — hypothetical vs real query embedding 매칭률 - 작동 안 하면 제거 (sunk cost 경계)

14.5 Embedding Pollution

청크에 noise (광고 footer·navigation·HTML tag 잔재)가 들어가면 임베딩 의미 깨짐.

해법: - preprocessing 단계 cleaning 강화 (boilerplate removal, html2text) - 청크 자동 검증 — 비-content 비율 (예: HTML tag 비율 > 5% alert)

14.6 Strategy Mismatch

PDF에 fixed-size 적용 → 표·이미지 가운데 잘림 + OCR error 분산. 결과 무용.

해법: - 문서 유형 감지 (extension·MIME) → 자동 chunker 라우팅 - new source 추가 시 sample 100 문서로 chunker 후보 비교

14.7 Re-chunking Cost

청킹 알고리즘 변경 → 전체 재청킹 + 재임베딩. 비용 폭증.

해법: - 변경은 canary로 점진 (C31) - 새 컬렉션에 별도 인덱싱 → 점진 전환 - delta indexing — 변경된 문서만

15 정리

영역	핵심
5축	단위·크기·overlap·메타·계층
기본 5종	Fixed·Recursive·Sentence·Markdown-aware·Semantic
유형별	Markdown=heading, Code=AST, Table=row, Dialog=turn, 정책=조항
고급	Parent-Child(검색·생성 분리), Hypothetical query, Summary
메타	검색 필터·인용·drift 감지의 토대
품질 메트릭	size 분포·broken·missing·duplicate
함정	small chunk·context loss·metadata bloat·drift·pollution·mismatch·re-chunking 비용

16 응용 분야

시나리오	권장
사내 정책 문서 (Markdown)	Markdown-aware + section path 메타
코드베이스 검색	AST chunker + import 메타
분기 보고서 (PDF + 표)	Layout-aware + table chunker 결합
회의록·메일	turn 단위 + thread ID 메타
학술 논문	section 기반 + abstract summary 별도
FAQ·매뉴얼	Q&A 짝 + Hypothetical query 추가
다국어 문서	sentence-based (NLP) + language 메타

17 관련 주제

선행 학습 (선수)

C31 지식 문서 생명주기 — indexed 단계의 핵심 기술
03편 RAG 파이프라인 — 검색 단계 토대
C24 하네싱 — sensitivity·collection 메타

후속 (Phase C-8)

C33 지식 품질 모니터링 — 청크 품질 drift 신호 (Phase C-8 클로저)

Cross-reference

C22 응답 품질 평가 — citation 품질이 청크 품질과 직접 연결
C28 스킬 레지스트리 — collection 의존성에서 chunker 정보 포함
Engineering: JSON Schema — 메타데이터 schema validation