Kwangmin Kim - MINERVA Phase C-6: Execution Control (Timeout, Retry, Circuit Breaker, Kill Switch)

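1 Why Execution Control

Where C24 (harnessing) asks “which actions are allowed?”, C25 asks “how do we recover when an allowed action fails because of an external dependency?” An LLM agent has many external dependencies.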

External dependency	Common failure modes
LLM API (OpenAI, Anthropic, Azure)	rate limit, timeout, content filter, intermittent 5xx
Vector DB, search engine	indexing lag, query timeout, connection pool exhaustion
Internal tool APIs	expired credentials, downtime, schema drift
Database	lock, deadlock, replication lag
External web search, translation	network, rate limit, local policy

Each failure mode calls for a different defensive pattern; no single defense is enough.

2 Overview of the Five Patterns

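Timeout         "give up if this call takes longer than N seconds"
Retry           "on failure, retry up to N times with backoff"
Circuit Breaker "if a dependency keeps failing, bypass it entirely for a while"
Bulkhead        "isolate a dependency so its load cannot spread to other parts"
Kill Switch     "stop immediately in an emergency (human or automatic)"

Each pattern operates on a different time scale:
- Timeout: a single call (hundreds of ms to tens of seconds)
- Retry: within a single query (a few seconds)
- Circuit Breaker: minutes to hours
- Bulkhead: system design (continuous)
- Kill Switch: immediate, a human decision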

3 Timeout: The Baseline

# app/control/timeout.py
import asyncio


async def with_timeout(coro, seconds: float, on_timeout=None):
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        if on_timeout:
            return on_timeout()
        raise

3.1 Per-Call vs. Overall Timeout

# Per-call: each external call
response = await with_timeout(llm.acomplete(prompt), 30)
docs = await with_timeout(retriever.asearch(query), 5)

# Overall: latency cap for one whole query
async def run_with_overall_timeout(query, deadline_sec=60):
    return await with_timeout(agent.run(query), deadline_sec)

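Both are needed:
- Per-call only: LLM 30 s + retriever 5 s + reranker 5 s + LLM 30 s = a response that can take up to 70 s
- Overall only: if the LLM hangs, the whole query waits the full 60 s before anything fails

Per-call timeouts protect each stage; the overall timeout protects the user experience.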
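One way to make the two limits cooperate is to derive each per-call timeout from whatever is left of the overall deadline. The following is a minimal sketch under assumptions: `Deadline`, `fake_step`, and `run_query` are hypothetical names used for illustration, not part of the MINERVA code.

```python
import asyncio
import time


class Deadline:
    """Tracks the time left for one query (hypothetical helper)."""

    def __init__(self, total_sec: float):
        self._expires_at = time.monotonic() + total_sec

    def remaining(self) -> float:
        return max(0.0, self._expires_at - time.monotonic())

    def clamp(self, per_call_sec: float) -> float:
        # A step never gets more time than the query has left.
        return min(per_call_sec, self.remaining())


async def fake_step(duration: float) -> str:
    # Stand-in for an LLM or retriever call.
    await asyncio.sleep(duration)
    return "ok"


async def run_query(deadline: Deadline) -> list:
    results = []
    # (per-call limit, simulated duration) for two consecutive steps
    for per_call_limit, duration in [(1.0, 0.01), (1.0, 0.01)]:
        timeout = deadline.clamp(per_call_limit)
        results.append(await asyncio.wait_for(fake_step(duration), timeout=timeout))
    return results
```

With this shape, a slow first step automatically shrinks the time available to later steps, so the sum of per-call timeouts can no longer exceed the overall limit.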

3.2 Connection vs Read Timeout

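An HTTP client has separate timeouts for separate phases, most importantly connect (establishing the connection to the server) and read (waiting for response data to arrive).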

import httpx

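client = httpx.AsyncClient(timeout=httpx.Timeout(
    connect=5.0,         # establish the connection (TCP + TLS handshake)
    read=30.0,           # max wait for each chunk of response data
    write=5.0,           # max wait for each chunk of request data to be sent
    pool=10.0,           # wait to acquire a connection from the pool
))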

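LLM streaming needs a generous read timeout: tokens arrive chunk by chunk, and in httpx the read timeout bounds the wait for each chunk, so it has to cover the longest gap between chunks (often the time to the first token).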
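A per-chunk limit can also be enforced at the application level. The sketch below wraps any async iterator with an idle timeout, so a stream may run for a long time overall but fails if no chunk arrives within `idle_sec`. `with_idle_timeout`, `fake_stream`, and `collect` are hypothetical names; a real LLM client stream is assumed to behave as an async iterator.

```python
import asyncio
from typing import AsyncIterator, List


async def with_idle_timeout(stream: AsyncIterator[str], idle_sec: float):
    """Yield chunks from `stream`, failing if the gap between chunks exceeds idle_sec."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timeout restarts for every chunk, so it bounds the gap, not the total.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_sec)
        except StopAsyncIteration:
            return
        yield chunk


async def fake_stream(gaps: List[float]):
    # Stand-in for a token stream: waits `gap` seconds before each chunk.
    for index, gap in enumerate(gaps):
        await asyncio.sleep(gap)
        yield f"tok{index}"


async def collect(gaps: List[float], idle_sec: float) -> List[str]:
    return [chunk async for chunk in with_idle_timeout(fake_stream(gaps), idle_sec)]
```

An idle timeout like this does not cap total generation time, so it is meant to sit alongside an overall deadline rather than replace it.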

4 Retry: Recoverable Failures

4.1 Retry Only the Right Failures

import asyncio
import random

import httpx
import openai

RETRYABLE_ERRORS = (
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
    openai.RateLimitError,
    openai.APIConnectionError,
    asyncio.TimeoutError,
)

NON_RETRYABLE_ERRORS = (
    openai.AuthenticationError,
    openai.BadRequestError,                  # bad input: a retry returns the same error
    openai.PermissionDeniedError,
)


async def retry_call(fn, max_attempts=3, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return await fn()
        except NON_RETRYABLE_ERRORS:
            raise                            # fail immediately
        except RETRYABLE_ERRORS:
            if attempt == max_attempts - 1:
                raise                        # attempts exhausted
            # exponential backoff plus a little jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)

4.2 Exponential Backoff + Jitter

import random


def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 30.0) -> float:
    """attempt: 0, 1, 2, ... → 1s, 2s, 4s, 8s, ... + jitter."""
    exp = base * (2 ** attempt)
    jitter = random.uniform(0, exp / 4)
    return min(exp + jitter, max_delay)

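Jitter matters: if many clients fail at the same moment and retry at the same moment, the result is a thundering herd and the dependency cannot recover. Random delays spread the retries out.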
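To see the spreading effect, the sketch below compares 100 simulated clients retrying with and without jitter. It uses the "full jitter" variant (a uniform draw between 0 and the exponential cap), which is a swapped-in alternative to the additive jitter of `backoff_delay`; the function names and the seed are assumptions made for the illustration.

```python
import random


def backoff_no_jitter(attempt: int, base: float = 1.0) -> float:
    return base * (2 ** attempt)


def backoff_full_jitter(attempt: int, rng: random.Random, base: float = 1.0) -> float:
    # "Full jitter": pick uniformly between 0 and the exponential cap.
    return rng.uniform(0, base * (2 ** attempt))


def retry_times(clients: int, attempt: int, seed: int = 0):
    """Return the distinct retry delays chosen by `clients` simulated clients."""
    rng = random.Random(seed)
    fixed = {backoff_no_jitter(attempt) for _ in range(clients)}
    spread = {round(backoff_full_jitter(attempt, rng), 3) for _ in range(clients)}
    return fixed, spread
```

Without jitter every client lands on the same instant (a single distinct delay); with jitter the retries are scattered across the whole interval.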

4.3 Idempotency

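For a call to be safely retried it must be idempotent (making the same call N times has the same effect as making it once).

# safe to retry
search(query="...")          # same result every time
get_user(id=42)              # changes nothing

# unsafe: retrying is risky
send_email(...)              # duplicate sends
charge_credit(...)           # duplicate charges
db.insert({"id": uuid()})    # the same data in N rows

Solution: an idempotency key. The client generates a UUID, and the server rejects any key it has already seen.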

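async def send_email_safe(to, body, idempotency_key):
    key = f"sent:{idempotency_key}"
    # SET NX claims the key atomically, so concurrent retries cannot both send
    if not redis.set(key, "1", nx=True, ex=86400):
        return                                # already processed
    try:
        await send_email(to, body)
    except Exception:
        redis.delete(key)                     # release the claim so a retry can send
        raise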

4.4 Retry Budget

Unbounded retries cause cascading failures. Cap them by time, by attempt count, or both:

import time


class RetryBudget:
    def __init__(self, max_attempts=3, total_deadline_sec=60):
        self.max_attempts = max_attempts
        # monotonic clock: immune to wall-clock adjustments
        self.deadline = time.monotonic() + total_deadline_sec
        self.attempts = 0

    def should_retry(self) -> bool:
        return self.attempts < self.max_attempts and time.monotonic() < self.deadline

    def consume(self):
        self.attempts += 1

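This combines with the token and time budgets of C24 harnessing: every retry consumes budget.

5 Circuit Breaker: System Protection

When a dependency keeps failing, stop calling it and go straight to the fallback. This also gives the dependency time to recover.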

# app/control/circuit_breaker.py
from enum import Enum
import time


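class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback is provided."""


class State(Enum):
    CLOSED = "closed"          # normal
    OPEN = "open"               # tripped: fail fast
    HALF_OPEN = "half_open"     # probing: allow up to half_open_max trial calls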


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60, half_open_max=3):
        self.state = State.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.opened_at = None
        self.half_open_max = half_open_max
        self.half_open_calls = 0

    async def call(self, fn, fallback=None):
        if self.state == State.OPEN:
            if time.time() - self.opened_at > self.reset_timeout:
                self.state = State.HALF_OPEN
                self.half_open_calls = 0
            else:
                return await self._reject(fallback)

        if self.state == State.HALF_OPEN:
            if self.half_open_calls >= self.half_open_max:
                return await self._reject(fallback)
            self.half_open_calls += 1

        try:
            result = await fn()
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    async def _reject(self, fallback):
        # Short-circuit: serve the fallback (sync or async) or fail fast
        if fallback is None:
            raise CircuitOpenError()
        result = fallback()
        return await result if hasattr(result, "__await__") else result

    def _on_success(self):
        if self.state == State.HALF_OPEN:
            self.state = State.CLOSED
        self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        # A failed probe in HALF_OPEN reopens the circuit immediately
        if self.state == State.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = time.time()

5.1 State Transitions

CLOSED → (N consecutive failures) → OPEN
OPEN  → (after reset_timeout) → HALF_OPEN
HALF_OPEN → (success) → CLOSED
HALF_OPEN → (failure) → OPEN
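The four transitions can also be written as a lookup table, which makes them easy to test in isolation. This is a sketch only: the event names (`failure_threshold_reached` and so on) are invented for the illustration, and the `State` enum is repeated so the block stands alone.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state; event names are illustrative.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "reset_timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    # Any pair not listed leaves the state unchanged.
    return TRANSITIONS.get((state, event), state)
```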

5.2 Fallback Strategies

breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60)

async def llm_call_with_breaker(prompt):
    return await breaker.call(
        lambda: openai_client.complete(prompt),
        fallback=lambda: anthropic_client.complete(prompt)    # fall back to another model
    )

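Fallback when a dependency fails	Example
Another provider	OpenAI down → Anthropic
Cached result	retriever down → cache
Simpler response	reranker down → return top-k as is
Refusal	everything down → a “please try again shortly” message

Combine with C24's Output Guard: fallback responses must pass the guard too.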

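6 Bulkhead: Isolation

Like the bulkheads in a ship's hull: if one compartment floods, the others stay dry. The load on one dependency must not spread to other parts of the system.

6.1 Separate Thread/Connection Pools

# a separate pool per dependency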
llm_client = httpx.AsyncClient(limits=httpx.Limits(max_connections=20))
db_pool = await asyncpg.create_pool(min_size=5, max_size=20)
search_client = httpx.AsyncClient(limits=httpx.Limits(max_connections=10))

# Even if the LLM hangs and all 20 of its connections are busy, the DB and search are unaffected

6.2 Semaphore

# per-dependency concurrency limits
llm_sem = asyncio.Semaphore(50)
db_sem = asyncio.Semaphore(20)

async def llm_call_isolated(prompt):
    async with llm_sem:
        return await llm.complete(prompt)

This separates 50 concurrent LLM calls from 20 concurrent DB calls. Even when LLM traffic spikes, DB queries can still run.
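The bound a semaphore provides can be checked directly. The sketch below launches more tasks than the limit and records the highest number in flight at once; `measure_peak` is a hypothetical helper, and the short sleep stands in for a real dependency call.

```python
import asyncio


async def measure_peak(limit: int, tasks: int) -> int:
    """Run `tasks` simulated calls behind a semaphore and return peak concurrency."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def call():
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the dependency call
            in_flight -= 1

    await asyncio.gather(*(call() for _ in range(tasks)))
    return peak
```

For example, `asyncio.run(measure_peak(limit=3, tasks=10))` never observes more than 3 calls in flight, no matter how many tasks are queued behind the semaphore.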

6.3 Priority Queues

High-priority queries are handled separately:

# layout: a normal queue plus a VIP queue
async def submit_query(query: Query, priority: str = "normal"):
    queue = vip_queue if priority == "vip" else normal_queue
    await queue.put(query)

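VIP queries are unaffected by load on the normal queue, so VIP response times hold up when ordinary traffic spikes.

7 Kill Switch: Emergency Stop

For crises that rule-based recovery cannot catch, stop immediately on a human decision.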

# app/control/kill_switch.py
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class KillSwitch:
    """Global, per-segment, and per-feature stop."""
    is_globally_off: bool = False
    disabled_segments: set[str] = field(default_factory=set)
    disabled_features: set[str] = field(default_factory=set)
    disabled_experiments: set[str] = field(default_factory=set)
    reason: str = ""
    activated_by: str = ""
    activated_at: datetime | None = None


switch = KillSwitch()


def check_allowed(query: Query) -> tuple[bool, str | None]:
    if switch.is_globally_off:
        return False, "system temporarily disabled"
    seg = query.segment.get("department")
    if seg in switch.disabled_segments:
        return False, f"segment {seg} disabled: {switch.reason}"
    if query.intent in switch.disabled_features:
        return False, f"feature {query.intent} disabled"
    for exp_id in query.experiments:
        if exp_id in switch.disabled_experiments:
            return False, f"experiment {exp_id} disabled"
    return True, None

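7.1 Activation

# CLI: a human decision
python -m scripts.kill_switch enable --segment finance --reason "data leak suspected"
python -m scripts.kill_switch disable_global --reason "incident i-123"

# Automatic: when the C19 monitor sees an SRM or guardrail threshold exceeded

Every activation must produce an audit log entry and a notification, with everyone involved informed via Slack and email. This keeps an active switch from being forgotten as time passes.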

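7.2 Automatic vs. Human

Trigger	Gate
C19 monitor detects an SRM violation	Automatic: disable only that experiment
External LLM provider outage	Automatic: disable that segment or fall back
Suspected data leak	Human: fast escalation
Reported policy violation	Human: governance review
Security incident	Human: immediate global stop

7.3 Re-enable Procedure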

def re_enable(switch_id, approver, justification):
    audit("kill_switch_disable_revert", switch_id, approver, justification)
    notify_team(f"{switch_id} re-enabled by {approver}")
    switch.disabled_segments.discard(...)
    # or do it in stages: restart at 5% and monitor

Going straight back to 100% after a global disable is risky. Use a canary: restart in stages, 5% → 25% → 100%, and verify at each stage.
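A staged restart needs a stable way to decide which users are in the current stage. One common approach, sketched here as an assumption rather than as MINERVA's actual mechanism, is to hash the user id into a bucket from 0 to 99 and compare it with the rollout percentage. `bucket`, `is_enabled`, and `RAMP_STAGES` are hypothetical names.

```python
import hashlib

RAMP_STAGES = (5, 25, 100)  # percent of traffic per stage, as in the text


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, rollout_percent: int) -> bool:
    # Buckets are stable, so a user enabled at 5% stays enabled at 25% and 100%.
    return bucket(user_id) < rollout_percent
```

Because the bucket depends only on the user id, moving from one stage to the next only adds users; nobody who was already re-enabled gets switched off again.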

8 Failure Mode Classification

Transient
- network blip, rate limit, timeout
- Response: Retry (backoff, jitter)

Persistent
- API change, expired credentials, schema drift
- Response: Circuit Breaker → alert a human → fix the code

Cascading
- A fails → load on B rises → B fails → everything is down
- Response: isolate with Bulkhead and Circuit Breaker

Catastrophic
- data leak, security incident, policy violation
- Response: Kill Switch (human) + postmortem

9 Operations: Dashboard, SLO, Alert

9.1 Key Metrics

Metric	Threshold	Alert
p95 e2e latency	> 5s	warning
Timeout rate	> 1%	warning
Retry rate	> 10%	info
Circuit Breaker trip	any	warning
Kill Switch active	any	critical
Fallback invocation rate	> 5%	warning
Concurrent users / cap	> 80%	info

9.2 SLO Definitions

# config/slo.yaml
availability: 99.9%                      # monthly downtime ≤ 43 min
p95_latency: 3s
p99_latency: 8s
error_rate: < 1%

burn_rate_alerts:
  - "1h burn rate > 14"                  # fast burn: page immediately
  - "6h burn rate > 6"                   # slow burn: respond within hours

Combine with the C19 monitor: sharing the same alert infrastructure keeps things consistent.
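The burn-rate thresholds in the SLO config are easier to reason about with the arithmetic spelled out. The sketch below assumes the usual definition (observed error rate divided by the error budget, which is 1 minus the SLO target) and a 30-day budget period; the function names are hypothetical.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests (or time) allowed to fail, e.g. 0.001 for 99.9%."""
    return 1.0 - slo_target


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    # A burn rate of 1 spends the budget exactly over the full period.
    return observed_error_rate / error_budget(slo_target)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Share of the period's error budget spent by burning at `rate` for the window."""
    return rate * window_hours / period_hours


def allowed_downtime_minutes(slo_target: float, period_days: int = 30) -> float:
    return period_days * 24 * 60 * error_budget(slo_target)
```

Under these assumptions, 99.9% availability allows about 43.2 minutes of downtime in 30 days, and one hour at a burn rate of 14.4 spends 2% of the monthly budget, which is why a fast-burn alert is treated as page-worthy.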

10 LangGraph Implementation Patterns

# app/control/decorators.py
import asyncio

from app.control.retry import RETRYABLE_ERRORS, retry_call


def with_timeout(seconds):
    def decorator(node_fn):
        async def wrapped(state):
            return await asyncio.wait_for(node_fn(state), timeout=seconds)
        return wrapped
    return decorator


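def with_retry(max_attempts=3, retryable=RETRYABLE_ERRORS):
    # NOTE: `retryable` is accepted but not forwarded; retry_call as defined
    # in 4.1 always uses the module-level RETRYABLE_ERRORS.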
    def decorator(node_fn):
        async def wrapped(state):
            return await retry_call(lambda: node_fn(state),
                                      max_attempts=max_attempts)
        return wrapped
    return decorator


def with_breaker(breaker, fallback=None):
    def decorator(node_fn):
        async def wrapped(state):
            return await breaker.call(lambda: node_fn(state), fallback=fallback)
        return wrapped
    return decorator


# Usage: the decorators compose
graph.add_node(
    "llm_call",
    with_timeout(30)(
        with_retry(3)(
            with_breaker(llm_breaker, fallback=cached_response)(
                llm_node
            )
        )
    )
)

The decorators are orthogonal, so adding or removing one is a one-line change. They chain naturally with the C24 guard decorators.

11 Applying to MINERVA

app/control/
├── timeout.py              # async timeout helper
├── retry.py                 # retry_call + RETRYABLE_ERRORS
├── circuit_breaker.py        # State machine
├── bulkhead.py               # semaphore, pool
├── kill_switch.py            # global, segment, feature, experiment
├── decorators.py             # LangGraph integration
└── slo.py                    # metric collection, burn rate

scripts/
├── kill_switch.py            # CLI
└── breaker_status.py         # operational checks

config/
├── retry.yaml                # max_attempts per endpoint
├── breaker.yaml              # threshold and timeout per endpoint
├── bulkhead.yaml             # pool sizes
└── slo.yaml                  # metrics, thresholds

Every tool call in 02-1 BaseAgent v2 is wrapped with the decorators from this article. Execution-control decisions are recorded in the C24 harnessing audit log as well.

12 Common Pitfalls

12.1 Silent Retries

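When retries do not show up in audit logs or metrics, problems with the external dependency stay hidden. Users just see longer latency.

Solution:
- Increment a metric on every retry (retry_count{endpoint=...})
- Log retry-budget exhaustion separately
- Flag explicitly when a timeout was caused by retries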
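The first two points can be sketched as a retry loop that counts every retry and records exhaustion under a separate key. A `collections.Counter` stands in for a real metrics client here, and `retry_with_metrics`, the `exhausted:` key prefix, and the use of `ConnectionError` as the retryable error are all assumptions made for the example.

```python
import asyncio
from collections import Counter

retry_count = Counter()  # stand-in for a metrics client, keyed by endpoint


async def retry_with_metrics(fn, endpoint: str, max_attempts: int = 3, base_delay: float = 0.0):
    for attempt in range(max_attempts):
        try:
            return await fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                # Exhaustion is recorded separately from ordinary retries.
                retry_count[f"exhausted:{endpoint}"] += 1
                raise
            retry_count[endpoint] += 1
            await asyncio.sleep(base_delay * (2 ** attempt))
```

With counters like these on a dashboard, a dependency that only succeeds on the second or third attempt becomes visible even though every user request eventually returns.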

12.2 Thundering Herd

The circuit breaker goes OPEN, then after reset_timeout every client retries at once and the dependency goes down again.

Solution:
- Limit probe calls in HALF_OPEN with half_open_max
- Add jitter to reset_timeout
- Use independent jitter per client
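Jitter on reset_timeout can be as simple as scaling the base value by a random factor each time the breaker opens. The sketch below assumes a symmetric spread of ±20% and a per-client random generator; `jittered_reset_timeout` and `sample_timeouts` are hypothetical names.

```python
import random
from typing import List


def jittered_reset_timeout(base_sec: float, spread: float, rng: random.Random) -> float:
    """Return base_sec scaled by a random factor in [1 - spread, 1 + spread]."""
    return base_sec * rng.uniform(1.0 - spread, 1.0 + spread)


def sample_timeouts(clients: int, base_sec: float = 60.0, spread: float = 0.2) -> List[float]:
    # Each simulated client gets its own generator, seeded differently.
    return [
        jittered_reset_timeout(base_sec, spread, random.Random(client_id))
        for client_id in range(clients)
    ]
```

With a 60 s base and a 20% spread, clients probe again somewhere between 48 s and 72 s after the trip instead of all at the 60 s mark.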

12.3 Circuit Breaker Too Sensitive

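With an overly sensitive setting such as failure_threshold=2, normal variation trips the breaker to OPEN, and a healthy dependency gets bypassed.

Solution:
- Percentage-based threshold: “30% of the last 50 calls failed”
- Time window (sliding window of failures)
- Tune thresholds in staging against real traffic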
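A percentage-based threshold over the most recent calls can be sketched with a bounded deque. The window size and threshold follow the "30% of the last 50 calls" example above; `FailureWindow` and the `min_calls` guard (which avoids tripping on a handful of early samples) are assumptions added for the illustration.

```python
from collections import deque


class FailureWindow:
    """Tracks the outcome of the last `size` calls (hypothetical helper)."""

    def __init__(self, size: int = 50, threshold: float = 0.3, min_calls: int = 10):
        self.results = deque(maxlen=size)  # True = success, False = failure
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, ok: bool) -> None:
        self.results.append(ok)  # the oldest entry drops out automatically

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def should_open(self) -> bool:
        # Too few samples: do not trip on noise.
        if len(self.results) < self.min_calls:
            return False
        return self.failure_rate() >= self.threshold
```

Two failures in a row no longer open the circuit by themselves; the breaker trips only when failures make up a sustained share of recent traffic.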

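12.4 Sloppy Kill Switch

Someone activates the switch in an emergency and then forgets it, leaving users disabled for days. Or the reason is so vague that nobody dares to re-enable.

Solution:
- Require an expiration when activating a Kill Switch (24h, 1 week, permanent)
- Notify automatically when the expiration is reached
- Report active switches every week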
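A mandatory expiration can be modeled by making the time-to-live a required field, with "permanent" as an explicit choice rather than a default. The sketch below is an assumption about shape only: `Activation`, `expires_at`, and `expired` are hypothetical names and are not fields of the `KillSwitch` dataclass shown earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Activation:
    target: str
    reason: str
    activated_at: datetime
    ttl: Optional[timedelta]  # no default: None ("permanent") must be chosen explicitly

    def expires_at(self) -> Optional[datetime]:
        return None if self.ttl is None else self.activated_at + self.ttl


def expired(activations: List[Activation], now: datetime) -> List[Activation]:
    """Return activations whose expiration has passed and that need a reminder."""
    return [
        a for a in activations
        if a.expires_at() is not None and now >= a.expires_at()
    ]
```

A scheduled job could call `expired(...)` and send the notification for each result; the same list of activations is also the raw material for the weekly report.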

12.5 Cascading Retries

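Layer 1 retries 3 times → Layer 2 retries 3 times → Layer 3 retries 3 times → up to 27 backend calls per query (3 × 3 × 3). Traffic explodes.

Solution:
- Define the retry budget for the system as a whole
- Limit retries by depth (outer 0, middle 1, inner 3)
- Header propagation: a header tells inner layers not to retry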
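The amplification is a plain product of the attempts at each layer, and header propagation is a way to force inner factors down to 1. The sketch below treats the numbers as total attempts per layer (an assumption about how to read "retry 3"); the header name `x-retry-disabled` is invented for the example and is not a standard header.

```python
from math import prod
from typing import Dict, List


def worst_case_backend_calls(attempts_per_layer: List[int]) -> int:
    """Worst-case calls reaching the backend when every layer exhausts its attempts."""
    return prod(attempts_per_layer)


def attempts_for(headers: Dict[str, str], default_attempts: int) -> int:
    # Hypothetical header: an outer layer that already retries asks inner layers not to.
    if headers.get("x-retry-disabled") == "1":
        return 1
    return default_attempts
```

Three layers with 3 attempts each give 27 calls in the worst case, while restricting retries to a single layer brings that down to 3.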

12.6 Bulkhead Wrong Sizing

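A pool that is too small runs out of connections under normal load. One that is too large provides no isolation.

Solution:
- Size pools from load tests
- Review utilization every quarter
- Be careful with auto-scaling: a sudden expansion can cause a thundering herd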

12.7 Fallback Quality

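If the fallback is too weak, users are left wondering “is this really the answer?” If it is too close to the real thing, there is no incentive to restore the dependency.

Solution:
- Label fallback output explicitly (e.g. a “partial response” label)
- Track fallback usage as an SLI and investigate root cause when it crosses the threshold
- Give fallbacks their own category in the C22 quality evaluation

13 Summary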

Area	Key points
Timeout	per-call, overall, connect/read kept separate; longer read for streaming
Retry	RETRYABLE only, exponential + jitter, idempotency required
Circuit Breaker	CLOSED→OPEN→HALF_OPEN transitions, fallback mapping
Bulkhead	per-dependency isolation via pool, semaphore, priority
Kill Switch	separate global, segment, feature, experiment; expiration required
Failure Mode	transient, persistent, cascading, catastrophic
Operations	SLO, burn rate, dashboard, alert
Pitfalls	silent retries, thundering herd, sensitive breaker, sloppy kill, cascading retries, wrong sizing, fallback quality

14 Use Cases

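Scenario	Key patterns
LLM provider rate limit	Retry (backoff) + Circuit Breaker (fallback to another provider)
Latency spike while the vector DB is indexing	Timeout + Bulkhead (separate pool for search)
Expired external API credentials	NON_RETRYABLE → alert immediately (no retry)
Regression found after a new model deployment	Kill Switch (disable that experiment + canary restart)
One user floods the system with 1000 queries	Bulkhead (per-user concurrency) + Retry budget
Suspected data leak	Immediate global Kill Switch + postmortem

15 Related Topics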

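Prerequisites

C24 Harnessing Architecture: combines with Tool Guard and the audit log
Part 10 Error Propagation: the foundation for how failures propagate
02-1 BaseAgent v2: decorators applied to every tool call

Next (Phase C-6)

C26 Agent Lifecycle: Kill Switch combined with retirement and rollback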

Cross-reference

C19 Experiment Pipeline: automatic Kill Switch triggers (SRM, guardrail)
C22 Response Quality: fallback quality monitoring
Engineering: Python async (foundations: asyncio.wait_for, gather)
Engineering: Docker Compose (pool and healthcheck infrastructure)