Kwangmin Kim - MINERVA Phase C-9: Cost Optimization (Token Budget, Caching, Model Tiering, Batch)

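1 Why Costs Explode

Early in operation, LLM costs are small: with 100 users and 10,000 queries per month, the bill is roughly $50 to $100. By the end of the first year, with 5,000 users and 1 million queries per month, costs grow by at least 5 to 10×, and often far more, because query volume alone has grown 100× and the drivers below compound it.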

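Growth pattern	Effect on cost
Users 10×	Direct queries 10×
Average tokens per query 1.5×	Users lean on the system more deeply
Multi-step plans introduced	LLM calls multiply with each step
Tool call results appended to the prompt	More input tokens
Switching to a more expensive model	Higher cost per token
More frequent embedding re-indexing	More embedding calls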

Cost optimization has to cut spend without degrading the user experience. Simply “switching to a cheaper model” is not the answer.

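2 Five-Axis Decomposition

Total Cost = Input Tokens + Output Tokens + Embedding + External API + Storage/Compute

Each axis takes a different share of the total and calls for different optimization techniques:

Axis	Share (typical)	Main optimizations
Input tokens	40~60%	prompt cache, prompt compression, smaller context
Output tokens	20~30%	response cache, max_tokens limit, summarization
Embedding	5~15%	delta indexing, reuse, lighter model
External API (search, tool)	5~10%	cache, rate limit
Storage/Compute	5~10%	tiered storage, sampling

Plot the share of each axis on a dashboard and use it to decide where to optimize first.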
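
A minimal sketch of the calculation behind such a dashboard. The axis names and the monthly figures are illustrative assumptions, not measurements from MINERVA; the point is only to show how shares are derived and how the largest axis is picked.

```python
from typing import Dict


def axis_shares(costs_usd: Dict[str, float]) -> Dict[str, float]:
    """Return each axis's share of the total cost as a fraction in [0, 1]."""
    total = sum(costs_usd.values())
    if total <= 0:
        return {axis: 0.0 for axis in costs_usd}
    return {axis: cost / total for axis, cost in costs_usd.items()}


def largest_axis(costs_usd: Dict[str, float]) -> str:
    """Name of the axis with the highest spend, i.e. the first optimization target."""
    return max(costs_usd, key=costs_usd.get)


# Hypothetical monthly spend per axis, in USD (illustrative numbers only).
monthly_costs = {
    "input_tokens": 520.0,
    "output_tokens": 250.0,
    "embedding": 100.0,
    "external_api": 70.0,
    "storage_compute": 60.0,
}

shares = axis_shares(monthly_costs)
for axis, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{axis:16s} {share:6.1%}")
print("optimize first:", largest_axis(monthly_costs))
```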

3 Technique 1: Token Budget

The token budget defined in C24 Resource Quota, viewed from a cost perspective:

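from datetime import date

# `redis` is assumed to be an app-level Redis client defined elsewhere.


class CostBudget:
    """Cost limits per user, segment, or experiment."""

    def __init__(self, max_per_query_usd: float, daily_max_per_user_usd: float):
        self.max_per_query = max_per_query_usd
        self.daily_max_per_user = daily_max_per_user_usd

    def check_query(self, estimated_cost: float) -> bool:
        return estimated_cost <= self.max_per_query

    def check_daily(self, user_id: str, requested: float) -> bool:
        key = f"daily_cost:{user_id}:{date.today().isoformat()}"
        # A missing key means nothing has been spent today; Redis returns str/bytes.
        used = float(redis.get(key) or 0.0)
        return (used + requested) <= self.daily_max_per_user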

3.1 Per-Segment Limits

# config/cost.yaml
default:
  max_per_query_usd: 0.05
  daily_max_per_user_usd: 5.0

per_segment:
  rnd_engineer:
    max_per_query_usd: 0.20    # allow deep analysis
    daily_max_per_user_usd: 50.0
  sales_default:
    max_per_query_usd: 0.02
    daily_max_per_user_usd: 1.0

Limits differ by segment, following the C18 personalization pattern.
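
One way to resolve the limits for a given segment is to merge the segment's overrides on top of the default block. The sketch below inlines the values from config/cost.yaml as a dict so it runs without a YAML parser; `Limits` and `limits_for` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Limits:
    max_per_query_usd: float
    daily_max_per_user_usd: float


# Same values as config/cost.yaml, inlined so the sketch is self-contained.
COST_CONFIG: Dict[str, Any] = {
    "default": {"max_per_query_usd": 0.05, "daily_max_per_user_usd": 5.0},
    "per_segment": {
        "rnd_engineer": {"max_per_query_usd": 0.20, "daily_max_per_user_usd": 50.0},
        "sales_default": {"max_per_query_usd": 0.02, "daily_max_per_user_usd": 1.0},
    },
}


def limits_for(segment: Optional[str], config: Dict[str, Any] = COST_CONFIG) -> Limits:
    """Segment overrides win; anything missing falls back to the default block."""
    overrides = config.get("per_segment", {}).get(segment, {})
    return Limits(**{**config["default"], **overrides})


print(limits_for("rnd_engineer"))
print(limits_for("unknown_segment"))  # falls back to the defaults
```

Falling back to the default for unknown segments keeps a misconfigured or new segment on the conservative limit rather than on no limit at all.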

4 Technique 2: Prompt Cache (Provider)

Providers such as OpenAI and Anthropic can cache the prompt prefix, typically the system prompt:

# Anthropic prompt caching
response = await client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,            # several thousand tokens
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL
        }
    ],
    messages=[{"role": "user", "content": query}],
)

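Effect:
- First call: the cache is created, billed at the normal input rate (Anthropic adds a premium for cache writes)
- Later calls (within 5 minutes): cached input is discounted by about 90%
- When many users share the same system prompt, the savings accumulate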

# Cost calculation example
# system_prompt: 5000 tokens
# user_query: 200 tokens
# cache hit rate: 80%

# Without cache: (5000 + 200) × 1000 calls × $0.003/1K = $15.6
# With cache: (5000 × 0.2 + 5000 × 0.8 × 0.1 + 200) × 1000 × $0.003/1K = $4.8
# → about 70% savings (ignoring any cache-write premium)
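
The same arithmetic as a small function, so other hit rates and prices can be tried. This is a sketch under the assumptions stated in the comments: `write_multiplier` is a hypothetical knob for providers that charge extra for cache writes (for example 1.25), and the default of 1.0 reproduces the figures in the example.

```python
def prompt_cost_usd(
    system_tokens: int,
    query_tokens: int,
    calls: int,
    price_per_1k: float,
    hit_rate: float = 0.0,
    read_discount: float = 0.9,
    write_multiplier: float = 1.0,
) -> float:
    """Input-token cost for `calls` requests sharing one system prompt.

    hit_rate:         fraction of calls that read the system prompt from cache
    read_discount:    discount applied to cached system tokens (0.9 = 90% off)
    write_multiplier: price factor on cache misses (assumption; 1.0 = no premium)
    """
    miss_tokens = system_tokens * (1.0 - hit_rate) * write_multiplier
    hit_tokens = system_tokens * hit_rate * (1.0 - read_discount)
    per_call_tokens = miss_tokens + hit_tokens + query_tokens
    return per_call_tokens * calls * price_per_1k / 1000.0


no_cache = prompt_cost_usd(5000, 200, 1000, 0.003)
with_cache = prompt_cost_usd(5000, 200, 1000, 0.003, hit_rate=0.8)
with_premium = prompt_cost_usd(5000, 200, 1000, 0.003, hit_rate=0.8, write_multiplier=1.25)

print(f"no cache:     ${no_cache:.2f}")
print(f"with cache:   ${with_cache:.2f} ({1 - with_cache / no_cache:.0%} saved)")
print(f"with premium: ${with_premium:.2f} ({1 - with_premium / no_cache:.0%} saved)")
```

With a 25% write premium the saving shrinks somewhat, which is why the hit rate matters more than the headline discount.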

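4.1 Cache-Friendly Prompt Design

System prompt structure (cacheable)
    ↓
Few-shot examples (cacheable)
    ↓
User-specific context (cacheable if the same user repeats)
    ↓
Per-query content (not cached)

Order matters: providers match the cache on the prompt prefix, so the unchanging parts must come first for the cache to be effective.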
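
The ordering above can be sketched as a request builder that always emits the stable blocks first and the per-query text last. The block layout follows the Anthropic `system` list format used in the example in section 4; `build_cacheable_request` is a hypothetical helper name, and where to place the cache breakpoints is an assumption to tune per workload.

```python
from typing import Any, Dict, List, Optional

EPHEMERAL = {"type": "ephemeral"}


def build_cacheable_request(
    system_prompt: str,
    few_shot: str,
    user_context: Optional[str],
    query: str,
) -> Dict[str, Any]:
    """Order blocks from most stable to least stable so the prefix stays identical."""
    system_blocks: List[Dict[str, Any]] = [
        {"type": "text", "text": system_prompt},
        # Breakpoint 1: everything up to here is shared by all users.
        {"type": "text", "text": few_shot, "cache_control": EPHEMERAL},
    ]
    if user_context:
        # Breakpoint 2: reusable only while the same user keeps asking.
        system_blocks.append(
            {"type": "text", "text": user_context, "cache_control": EPHEMERAL}
        )
    # Per-query content goes last and is never part of the cached prefix.
    return {
        "system": system_blocks,
        "messages": [{"role": "user", "content": query}],
    }


req_a = build_cacheable_request("SYSTEM", "EXAMPLES", "USER CONTEXT", "first question")
req_b = build_cacheable_request("SYSTEM", "EXAMPLES", "USER CONTEXT", "second question")
print(req_a["system"] == req_b["system"])  # identical prefix across queries
```

The returned dict can be unpacked into `client.messages.create(model=..., max_tokens=..., **request)`.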

5 Technique 3: Response Cache (Full Response)

The same query with the same context yields the same response, so the cache exploits determinism.

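# app/cache/response_cache.py
import hashlib
import json


def cache_key(query: str, context: dict) -> str:
    parts = [
        query,
        context.get("system_prompt_version"),
        context.get("retrieved_doc_ids"),
        context.get("model"),
        context.get("temperature", 0),
    ]
    # MD5 serves only as a fast fingerprint here, not as a security measure.
    raw = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


async def with_response_cache(query: str, context: dict, ttl_sec: int = 3600):
    # redis, llm, prompt, and metric_cache_hit are app-level objects defined elsewhere.
    key = cache_key(query, context)
    if cached := redis.get(f"resp:{key}"):
        metric_cache_hit.add(1)
        return cached  # serialized JSON string

    response = await llm.complete(prompt(query, context))
    payload = response.json()
    redis.setex(f"resp:{key}", ttl_sec, payload)
    return payload  # same serialized form as a cache hit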

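Good fit: FAQs, policy queries, repeated fact lookups. Poor fit: time-sensitive information (prices, exchange rates, news).

5.1 Choosing the TTL

# config/cache_ttl.yaml
default: 3600                          # 1 hour

by_intent:
  knowledge_lookup: 86400              # 24 hours (facts rarely change)
  decision_support: 3600
  troubleshoot: 1800                    # new fixes appear often
  small_talk: 0                         # do not cache

This trades off against C18 personalization: if responses differ by persona, include the segment in the cache key.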
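
A minimal sketch of both points: looking up the TTL by intent, and adding the segment to the cache key so that personas never share an entry. The TTL values mirror config/cache_ttl.yaml; `ttl_for`, `should_cache`, and `segmented_key` are hypothetical helper names, and SHA-256 is used here simply as a stable fingerprint.

```python
import hashlib
import json
from typing import Optional

DEFAULT_TTL_SEC = 3600
TTL_BY_INTENT = {
    "knowledge_lookup": 86400,
    "decision_support": 3600,
    "troubleshoot": 1800,
    "small_talk": 0,
}


def ttl_for(intent: str) -> int:
    """TTL in seconds for an intent; unknown intents use the default."""
    return TTL_BY_INTENT.get(intent, DEFAULT_TTL_SEC)


def should_cache(intent: str) -> bool:
    """A TTL of 0 means the intent is never cached."""
    return ttl_for(intent) > 0


def segmented_key(query: str, segment: Optional[str], prompt_version: str, model: str) -> str:
    """Cache key that differs per segment, so personas do not share responses."""
    raw = json.dumps([query, segment, prompt_version, model], ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


key_rnd = segmented_key("How do I set up the VPN?", "rnd_engineer", "v3", "gpt-4o")
key_sales = segmented_key("How do I set up the VPN?", "sales_default", "v3", "gpt-4o")
print(ttl_for("troubleshoot"), should_cache("small_talk"), key_rnd != key_sales)
```

Adding the segment lowers the hit rate, since each segment warms its own entries; that is the trade-off the paragraph above refers to.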

6 Technique 4: Semantic Cache

Exactly identical queries are rare, but semantically equivalent queries occur often.

# User A: "How do I set up the VPN?"
# User B: "What is the way to configure the VPN?"
# → same answer


from uuid import uuid4


class SemanticCache:
    # vector_store, embed, and redis are app-level objects defined elsewhere;
    # _context_compatible is a method this class must also provide.
    def __init__(self, similarity_threshold: float = 0.9):
        self.threshold = similarity_threshold
        self.index = vector_store.create_collection("semantic_cache")

    async def get_or_generate(self, query: str, context: dict, generator):
        embedding = await embed(query)
        hits = self.index.search(embedding, top_k=1)

        if hits and hits[0].score > self.threshold:
            cached_response = redis.get(f"sem_resp:{hits[0].id}")
            if cached_response and self._context_compatible(hits[0], context):
                return cached_response  # serialized JSON string

        response = await generator()
        payload = response.json()

        # Store in the cache
        cache_id = str(uuid4())
        self.index.add(id=cache_id, embedding=embedding, metadata=context)
        redis.setex(f"sem_resp:{cache_id}", 3600, payload)
        return payload  # same serialized form as a cache hit

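Strength: a higher hit rate than an exact cache (for FAQs, from 5~30% to 30~60%).
Weaknesses:
- Embedding cost (small, but incurred on every call)
- Threshold tuning: too low a threshold returns wrong responses
- Context compatibility (user permissions, segment) must be verified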
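
The compatibility check is left undefined in the class above. One possible shape for it is sketched below; the metadata fields (`segment`, `system_prompt_version`, `required_permissions`, `permissions`) are assumptions made for illustration, not fields the source specifies.

```python
from typing import Any, Dict


def context_compatible(cached_meta: Dict[str, Any], request_ctx: Dict[str, Any]) -> bool:
    """Return True only if a cached response is safe to serve for this request.

    Assumed rules (hypothetical):
      1. same segment, so persona-specific wording is not leaked across segments
      2. same system prompt version, so stale instructions are not reused
      3. the requester holds every permission the cached answer depended on
    """
    if cached_meta.get("segment") != request_ctx.get("segment"):
        return False
    if cached_meta.get("system_prompt_version") != request_ctx.get("system_prompt_version"):
        return False
    required = set(cached_meta.get("required_permissions", []))
    granted = set(request_ctx.get("permissions", []))
    return required <= granted


cached = {
    "segment": "rnd_engineer",
    "system_prompt_version": "v3",
    "required_permissions": ["internal_docs"],
}
same_rights = {
    "segment": "rnd_engineer",
    "system_prompt_version": "v3",
    "permissions": ["internal_docs", "vpn_admin"],
}
fewer_rights = {**same_rights, "permissions": []}

print(context_compatible(cached, same_rights))   # compatible
print(context_compatible(cached, fewer_rights))  # missing a required permission
```

Rejecting on any mismatch costs some hit rate but avoids serving an answer built from documents the requester is not allowed to see.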

6.1 Threshold Tuning

# Decide by A/B test
# threshold 0.85 → hit rate 50%, user thumbs_up_rate 0.45 (low: wrong matches)
# threshold 0.90 → hit rate 30%, user thumbs_up_rate 0.55 (normal)
# threshold 0.95 → hit rate 10%, user thumbs_up_rate 0.58
# → 0.90 balances cost and quality
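
The selection rule implied above can be made explicit: among the thresholds whose quality clears a floor, take the one with the highest hit rate. The sketch uses the three A/B rows listed above; the quality floor of 0.55 is an assumption, standing in for whatever baseline thumbs_up_rate the uncached system achieves.

```python
from typing import List, NamedTuple, Optional


class ThresholdResult(NamedTuple):
    threshold: float
    hit_rate: float
    thumbs_up_rate: float


# The three A/B arms from the comments above.
AB_RESULTS = [
    ThresholdResult(0.85, 0.50, 0.45),
    ThresholdResult(0.90, 0.30, 0.55),
    ThresholdResult(0.95, 0.10, 0.58),
]


def pick_threshold(
    results: List[ThresholdResult], min_thumbs_up_rate: float
) -> Optional[ThresholdResult]:
    """Highest hit rate among arms that meet the quality floor; None if no arm does."""
    eligible = [r for r in results if r.thumbs_up_rate >= min_thumbs_up_rate]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.hit_rate)


best = pick_threshold(AB_RESULTS, min_thumbs_up_rate=0.55)
print(best)
```

Raising the floor pushes the choice toward 0.95, and if no arm qualifies the function returns None, which would mean running without the semantic cache.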

7 Technique 5: Model Tiering

Escalate step by step from cheap to expensive models:

# Cascading
async def cascading_router(query: str, intent: str):
    # 1. Simple intents go to the small model
    if intent in ["small_talk", "simple_lookup"]:
        return await llm_small.complete(query, model="gpt-4o-mini")

    # 2. Everything else: try the medium model first
    response = await llm_medium.complete(query, model="gpt-4o")

    # 3. Fall back to a larger model when confidence is low
    #    (confidence is assumed to come from the app's own scoring, not the provider API)
    if response.confidence < 0.6:
        response = await llm_large.complete(query, model="o1-preview")

    return response

7.1 Price Comparison (2025 reference figures)

gpt-4o-mini    : $0.15 / 1M input,  $0.60 / 1M output
gpt-4o         : $2.50 / 1M input,  $10.00 / 1M output
o1-mini        : $3.00 / 1M input,  $12.00 / 1M output
o1-preview     : $15.00 / 1M input, $60.00 / 1M output
claude-3.5-sonnet: $3.00 / 1M input, $15.00 / 1M output

The gap between mini and preview is 100×. Most queries are handled well enough by mini, so tiering can cut the average cost by 5 to 10×.
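
To see how the traffic mix drives the average, the sketch below computes a blended per-query cost from the listed prices. The traffic mix and the token counts per query are hypothetical, and the calculation ignores the extra medium-tier call that a cascade pays before escalating, so treat the result as a rough lower bound.

```python
from typing import Dict, Tuple

# (input $/1M tokens, output $/1M tokens), copied from the price list above.
PRICES: Dict[str, Tuple[float, float]] = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "o1-preview": (15.00, 60.00),
}


def query_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single call on the given model."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def blended_cost_usd(mix: Dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Average cost per query when traffic is split across tiers according to `mix`."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(
        share * query_cost_usd(model, input_tokens, output_tokens)
        for model, share in mix.items()
    )


# Hypothetical routing outcome: most traffic stays on the small tier.
TRAFFIC_MIX = {"gpt-4o-mini": 0.80, "gpt-4o": 0.18, "o1-preview": 0.02}

blended = blended_cost_usd(TRAFFIC_MIX, input_tokens=1000, output_tokens=300)
all_medium = query_cost_usd("gpt-4o", 1000, 300)
all_large = query_cost_usd("o1-preview", 1000, 300)
print(f"blended ${blended:.6f} | all gpt-4o ${all_medium:.6f} | all o1-preview ${all_large:.6f}")
```

The saving depends heavily on the baseline: compared with sending everything to the largest tier the reduction is large, while compared with an all-medium baseline it is more modest.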

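7.2 Tiering Decision Criteria

Criterion	Route
Simple intent	small (mini)
Externally exposed response (accuracy critical)	medium or above
Multi-step reasoning required	large (o1, deep think)
Code generation	medium or above
Refusal possible	small first → escalate

Learn this with C16 Bandit arms, so that the tier suited to each query type is chosen automatically.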

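8 Technique 6: Batch Processing

Offline work (re-indexing, daily reports, evaluation) should go through a batch API:

# OpenAI Batch API: 50% discount, processed within 24h
batch = await client.batches.create(
    input_file_id=upload_jsonl_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# → 50% of the normal cost

Good fit: anything that does not need immediate results, such as re-indexing, summaries, golden-set evaluation, and embedding recomputation. Poor fit: user queries.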
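
The batch call takes an uploaded JSONL file in which each line is one request. A minimal sketch of building those lines follows; the `custom_id`, `method`, `url`, and `body` fields follow the OpenAI Batch input format, while `build_batch_lines`, the `id_prefix`, and the `max_tokens` value are assumptions for illustration.

```python
import json
from typing import Iterable, List


def build_batch_lines(
    prompts: Iterable[str],
    model: str = "gpt-4o-mini",
    id_prefix: str = "eval",
    max_tokens: int = 256,
) -> List[str]:
    """One JSON object per line, each describing a /v1/chat/completions request."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"{id_prefix}-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


golden_set = ["Summarize the VPN policy.", "List the password rules."]
batch_lines = build_batch_lines(golden_set)
jsonl_payload = "\n".join(batch_lines) + "\n"
print(jsonl_payload)
```

The payload would then be written to a file and uploaded with purpose "batch" to obtain the `input_file_id` used in the call above.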

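9 Additional Technique: Prompt Compression

Automatically summarize and compress long system prompts and few-shot examples:

# LLMLingua, AutoCompressors
from llmlingua import PromptCompressor

compressor = PromptCompressor()
result = compressor.compress_prompt(
    long_prompt,
    target_token=500,        # 5000 → 500
)
compressed = result["compressed_prompt"]  # compress_prompt returns a dict

Effect: 70~90% fewer input tokens, with a small loss in response quality (within 5%). Risk: loss of meaning, so evaluate before rolling out.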

10 Additional Technique: RAG Efficiency

# 1. Reduce top-k: 5~10 is typical; dropping to 3~5 shrinks the LLM input
docs = retriever.search(query, top_k=5)    # 10 → 5

# 2. Cut off with a rerank threshold
docs = reranker.rerank_with_threshold(query, docs, min_score=0.5)

# 3. Optimize chunk size (combined with C32)
# Small chunks: retrieve more of them; large chunks: retrieve fewer

The C32 chunking strategy and Parent-Child pattern balance cost and quality.
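
The first two steps can be combined into one selection function: rank by rerank score, stop at the score cutoff or the document cap, and skip documents that would overflow a token budget. `ScoredDoc`, `select_docs`, and the `token_budget` parameter are hypothetical; the source only names a top-k limit and a `min_score` cutoff.

```python
from typing import List, NamedTuple


class ScoredDoc(NamedTuple):
    doc_id: str
    score: float  # rerank score, higher is better
    tokens: int   # token length of the chunk


def select_docs(
    docs: List[ScoredDoc],
    min_score: float = 0.5,
    max_docs: int = 5,
    token_budget: int = 2000,
) -> List[ScoredDoc]:
    """Keep the best-scoring docs that fit the score cutoff, count cap, and token budget."""
    ranked = sorted(docs, key=lambda d: d.score, reverse=True)
    selected: List[ScoredDoc] = []
    used_tokens = 0
    for doc in ranked:
        if doc.score < min_score or len(selected) >= max_docs:
            break  # everything after this point scores lower or exceeds the cap
        if used_tokens + doc.tokens > token_budget:
            continue  # too large for the remaining budget; a smaller doc may still fit
        selected.append(doc)
        used_tokens += doc.tokens
    return selected


candidates = [
    ScoredDoc("a", 0.90, 800),
    ScoredDoc("b", 0.80, 900),
    ScoredDoc("c", 0.70, 500),
    ScoredDoc("d", 0.40, 100),
]
chosen = select_docs(candidates)
print([d.doc_id for d in chosen], sum(d.tokens for d in chosen))
```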

11 Cost SLO and Alerts

# config/cost_slo.yaml
slos:
  - name: avg_cost_per_query
    target: 0.025                       # $0.025 average
    window: 7d
    alert_threshold: 0.04               # alert at 60% over target

  - name: daily_cost_total
    target: 500                          # $500/day
    window: 1d
    alert_threshold: 750

When the burn rate exceeds the threshold, either fall back automatically (force the cheap tier) or raise an alert:

async def cost_burn_monitor():
    burn = recent_cost_burn_rate()
    if burn > 1.5:
        notify_oncall("cost burn more than 50% over budget")
        # Force the cheap path automatically: the Bandit favors cheap arms (epsilon raised temporarily)
        bandit_router.force_explore_cheap_arms(duration_min=60)
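
The monitor relies on a burn-rate figure that is not defined above. One plausible definition, sketched here as an assumption, is actual spend divided by the spend expected at this point of the day if the daily target were consumed evenly; a value of 1.0 then means exactly on budget.

```python
def burn_rate(spent_usd: float, daily_target_usd: float, elapsed_hours: float) -> float:
    """Spend so far relative to an evenly paced daily budget (1.0 = on target)."""
    if daily_target_usd <= 0:
        raise ValueError("daily_target_usd must be positive")
    if not 0 < elapsed_hours <= 24:
        raise ValueError("elapsed_hours must be in (0, 24]")
    expected_so_far = daily_target_usd * (elapsed_hours / 24.0)
    return spent_usd / expected_so_far


def burn_action(rate: float, alert_at: float = 1.5) -> str:
    """Map a burn rate to the action the monitor would take."""
    return "force_cheap_tier" if rate > alert_at else "ok"


# $400 spent by noon against a $500/day target: expected $250, so the rate is 1.6.
rate_at_noon = burn_rate(spent_usd=400.0, daily_target_usd=500.0, elapsed_hours=12.0)
print(rate_at_noon, burn_action(rate_at_noon))
```

Real traffic is rarely flat across the day, so a production version would likely compare against an hourly traffic profile rather than a linear pace.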

12 Combining with C16 and C19

12.1 Reflecting Cost in the Bandit Reward

def reward_with_cost(thumbs_up: int, cost_usd: float, lambda_cost: float = 50.0) -> float:
    """thumbs_up - λ·cost: a cost-weighted reward."""
    return thumbs_up - lambda_cost * cost_usd

# With equal thumbs_up, the cheaper arm is favored
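
A short worked example of the reward, repeating the function so the block runs on its own. The per-query costs for the two arms are hypothetical. With λ = 50, a $0.02 query loses a full point of reward, so an expensive arm has to earn its thumbs-up clearly more often to stay competitive.

```python
def reward_with_cost(thumbs_up: int, cost_usd: float, lambda_cost: float = 50.0) -> float:
    """thumbs_up - lambda * cost: a cost-weighted reward."""
    return thumbs_up - lambda_cost * cost_usd


# Hypothetical per-query costs for two arms that both received a thumbs-up.
arms = {
    "small_tier": {"thumbs_up": 1, "cost_usd": 0.002},
    "large_tier": {"thumbs_up": 1, "cost_usd": 0.020},
}

rewards = {
    name: reward_with_cost(arm["thumbs_up"], arm["cost_usd"])
    for name, arm in arms.items()
}
preferred = max(rewards, key=rewards.get)
print(rewards, "preferred:", preferred)
```

Choosing λ is itself a product decision: λ = 50 says one thumbs-up is worth $0.02, and a smaller λ weights quality more heavily.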

12.2 Cost Guardrail in Experiments

Include cost_per_query in the C19 experiment spec guardrails:

metrics:
  primary: thumbs_up_rate
  guardrail:
    - p95_latency_ms
    - cost_per_query_usd                # prevents cost regressions

13 Applying to MINERVA

app/cost/
├── budget.py                  # per-query, per-user, per-segment
├── prompt_cache.py            # structures prompts to use the provider cache
├── response_cache.py          # exact cache (Redis)
├── semantic_cache.py          # embedding-based cache
├── tiering.py                  # cascading router (small→medium→large)
├── batch.py                    # OpenAI Batch API helper
├── compression.py              # LLMLingua, custom summarization
└── slo.py                      # cost burn rate monitor

scripts/
├── cost_dashboard.py          # five-axis decomposition dashboard
├── tier_eval.py                # quality and cost comparison per tier
├── cache_audit.py              # cache hit rate and invalidation analysis
└── batch_run.py                # offline batch jobs

config/
├── cost.yaml                   # limits, SLOs
├── cache_ttl.yaml              # TTL per intent
└── tiering.yaml                # model prices, routing rules

This integrates naturally with C16 Bandit, C19 experiments, C24 quota, and C34 metrics.

14 Common Pitfalls

14.1 Cache Poisoning

A malicious or wrong response enters the cache and is served to every user for days.

Fixes:
- Include prompt, model, and system_prompt_version in the cache key
- Cache only responses that pass the C22 Output Guard
- Invalidate immediately on thumbs_down
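
The last two fixes can be sketched as a thin wrapper around the cache: writes are refused unless the output guard passed, and negative feedback evicts the entry. The in-memory dict stands in for Redis, and `GuardedCache` with its method names is a hypothetical illustration, not the MINERVA implementation.

```python
from typing import Dict, Optional


class GuardedCache:
    """In-memory stand-in for a response cache with poisoning safeguards."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def put(self, key: str, response: str, passed_output_guard: bool) -> bool:
        """Store only responses that passed the output guard; report whether stored."""
        if not passed_output_guard:
            return False
        self._store[key] = response
        return True

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def on_feedback(self, key: str, thumbs_up: bool) -> None:
        """Evict the entry as soon as any user marks the response as bad."""
        if not thumbs_up:
            self._store.pop(key, None)


cache = GuardedCache()
stored_bad = cache.put("k1", "unsafe answer", passed_output_guard=False)
stored_ok = cache.put("k2", "reviewed answer", passed_output_guard=True)
cache.on_feedback("k2", thumbs_up=False)  # one thumbs_down removes the entry
print(stored_bad, stored_ok, cache.get("k2"))
```

Evicting on a single thumbs_down is deliberately conservative; a busier system might require several negative votes before invalidating.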

14.2 Tier Mismatch

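A query is classified as a simple intent and sent to the small model, but the user wanted an in-depth answer, so satisfaction drops.

Fixes:
- Include confidence in the tier decision (cascading)
- Learn tier selection with the Bandit
- Compare thumbs_up_rate per tier every quarter to catch regressions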

14.3 Premature Optimization

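A 200% increase in code complexity to save 5% in cost only adds debugging and operational burden.

Fixes:
- Optimize the large cost items (40%+) on the five-axis dashboard first
- Decide on small items only after a trade-off analysis
- “A 10% cost saving for 50% more code complexity” is usually not worth it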

14.4 Cache TTL Drift

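The TTL is set to 24h but the information changes every 6h, so users receive stale answers.

Fixes:
- State the TTL explicitly per domain (cache_ttl.yaml)
- Invalidate related cache entries when a source-change webhook fires
- Give users an explicit refresh button (UI)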
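
Webhook-driven invalidation needs a way to find every cached response that depended on a changed source. A minimal sketch is a reverse index from source id to cache keys, shown here in memory; `SourceTaggedCache` and its methods are hypothetical names, and a Redis version would typically keep one set per source id.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Optional, Set


class SourceTaggedCache:
    """Response cache that remembers which sources each entry was built from."""

    def __init__(self) -> None:
        self._responses: Dict[str, str] = {}
        self._keys_by_source: DefaultDict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, response: str, source_ids: Iterable[str]) -> None:
        self._responses[key] = response
        for source_id in source_ids:
            self._keys_by_source[source_id].add(key)

    def get(self, key: str) -> Optional[str]:
        return self._responses.get(key)

    def invalidate_source(self, source_id: str) -> int:
        """Called from the source-change webhook; returns how many entries were removed."""
        keys = self._keys_by_source.pop(source_id, set())
        return sum(1 for key in keys if self._responses.pop(key, None) is not None)


tagged = SourceTaggedCache()
tagged.put("q:vpn", "VPN answer", source_ids=["doc-vpn", "doc-network"])
tagged.put("q:leave", "Leave policy answer", source_ids=["doc-hr"])
removed = tagged.invalidate_source("doc-vpn")  # the VPN guide was edited
print(removed, tagged.get("q:vpn"), tagged.get("q:leave"))
```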

14.5 Batch Latency Trap

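Offline batch has a 24h SLA, yet operations tries to use the batch results immediately and fails.

Fixes:
- Reserve batch for explicitly offline jobs (re-indexing, evaluation), never for user queries
- Trigger follow-up automation from the batch completion notification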

14.6 Provider Lock-in

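All cost and architecture depend on OpenAI, leaving no negotiating leverage when prices rise.

Fixes:
- Multi-provider abstraction (02-1 BaseAgent v2)
- Compare provider prices and quality every quarter
- Consider self-hosting for cost-critical workloads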

14.7 Embedding Re-cost

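Changing the embedding model forces a full re-index, and the cost spikes.

Fixes:
- Treat a model change as a critical decision, reviewed in advance at the quarterly meeting
- Roll out gradually with a canary (C31)
- Quantify the cost and quality of the new model against the existing one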

14.8 Cost Visibility Lag

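The spike is first noticed on the month-end invoice, by which point the cost has already exploded.

Fixes:
- Real-time cost dashboard (per minute, per hour)
- Daily burn rate alert
- Per-user cost tracking (personalized limits)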

15 Summary

Area	Key point
Five-axis decomposition	input, output, embedding, external API, storage
Token Budget	per-query, per-user, per-segment limits
Prompt Cache	about 90% provider-side discount (system prompt ordering matters)
Response Cache	exact match + per-intent TTL
Semantic Cache	embedding-based; threshold 0.9 recommended
Model Tiering	mini → 4o → o1 cascade, learned with the Bandit
Batch	50% discount on offline work
Compression	LLMLingua, 70~90% fewer input tokens
Integration	C16 Bandit reward, C19 guardrail, C34 cost SLO
Pitfalls	poisoning, tier mismatch, premature optimization, TTL drift, batch latency, lock-in, re-cost, visibility lag

16 Applications

Scenario	Key techniques
FAQ cost reduction	response cache + semantic cache
Developer R&D queries	allow large models + higher per-segment limits
Daily regression evaluation	batch API at 50%
Multi-step plan cost	per-step tier decision + token budget
New model rollout	A/B + cost guardrail (C19)
Cost spike response	burn rate alert + automatic cheap fallback
User surge	raise the semantic cache hit rate

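17 Related Topics

Prerequisites

C24 Harnessing quota: the foundation for the token budget
C16 Intelligent routing: the learning layer for model tiering
C34 Observability: cost metrics and dashboard

Next (Phase C-9)

C36 Security and access control: closes Phase C-9

Cross-reference

C18 Personalization: model tier routing per segment
C19 Experiment pipeline: cost guardrail integration
C22 Response quality: cached responses are quality-evaluated too
C31 Knowledge lifecycle: embedding cost tracking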