AI Agent Debugging & Testing Guide — Best Practices 2026

TL;DR

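Debugging AI agents poses challenges that conventional software does not: non-determinism, state management, and external dependencies. This article walks through practical debugging techniques built on deterministic tests, observability by design, and current tooling. It is aimed at LangChain, AutoGen, and crewAI users.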

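Introduction: Why Debugging AI Agents Is Hard

By 2026, AI agents have moved from experimental projects to production systems. Many teams, however, are hitting a wall when it comes to debugging them.

"Debugging an AI agent is like chasing a dream. Give it the same input and you get a different result every time." - Harrison Chase, founder of LangChain^1

Typical failure stories (episodes from production):

Succeeds in staging, fails in production
- Cause: an external API's rate limit kicked in asynchronously
- Lesson: the deterministic mocks did not cover enough cases
A bug that occurred once cannot be reproduced
- Cause: LLM non-determinism
- Lesson: fix seeds and keep logs
The agent loops and never terminates
- Cause: unbounded recursion in tool selection
- Lesson: execution time must be capped

This article offers practical solutions to these problems.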

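Chapter 1: Challenges Specific to AI Agents

1. Non-Determinism

Problem: the same input produces a different output on each run

Causes:

The LLM's temperature parameter
The sampling strategy
External API response times

Solution: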

# Bad: non-deterministic test
def test_agent_response():
    agent = CodeAgent()
    response = agent.run("Write a function")
    assert response.status == "success"  # fails intermittently

# Good: deterministic mock
def test_agent_response_deterministic():
    agent = CodeAgent(llm=MockLLM(seed=42))
    response = agent.run("Write a function")
    assert response.status == "success"
    assert "function" in response.content
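The tests above rely on a `MockLLM` that is not defined in the article. A minimal sketch of one way to build such a fake follows, assuming the agent only needs a `complete(prompt)` method (a hypothetical interface, not taken from any framework). Responses are scripted up front, so every run returns the same text in the same order, and the recorded prompts can be asserted on afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptedLLM:
    """Hypothetical stand-in for a real LLM client: replays canned responses in order."""

    responses: list[str]
    prompts: list[str] = field(default_factory=list)  # every prompt seen, for assertions

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        index = len(self.prompts) - 1
        if index >= len(self.responses):
            # Fail loudly instead of inventing output when the script runs out.
            raise RuntimeError(f"no scripted response for call #{index + 1}")
        return self.responses[index]


llm = ScriptedLLM(responses=["def add(a, b):\n    return a + b"])
first = llm.complete("Write a function")
```

Raising when the script is exhausted is a deliberate choice: an unexpected extra LLM call then shows up as a test failure rather than as silently repeated output.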

2. State Management

Problem: state changes in complex ways across multi-step execution

Causes:

The agent's internal state
Side effects of tools
Context window limits

Solution:

from dataclasses import asdict, dataclass, field
from typing import Optional

import structlog

@dataclass
class AgentState:
    """Explicitly managed agent state."""
    step: int = 0
    history: list[str] = field(default_factory=list)  # a None default would break appends
    current_tool: Optional[str] = None
    error_count: int = 0

    def to_dict(self) -> dict:
        """Serialize for logging."""
        return asdict(self)

class ObservableAgent:
    def __init__(self):
        self.state = AgentState()
        self.logger = structlog.get_logger()

    def run(self, task: str):
        self.logger.info("agent_started", state=self.state.to_dict())

        while self.state.step < 10:
            self.state.step += 1
            # ... agent logic ...

            self.logger.info(
                "agent_step",
                step=self.state.step,
                tool=self.state.current_tool,
                state=self.state.to_dict()
            )
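Logging the full state at every step is most useful when consecutive snapshots can be compared. The following sketch uses a trimmed-down `AgentState` and a hypothetical `diff_states` helper (neither is part of any library) to report only the fields that changed between two steps.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class AgentState:
    step: int = 0
    history: list[str] = field(default_factory=list)
    current_tool: Optional[str] = None
    error_count: int = 0


def diff_states(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple]:
    """Return {field: (old, new)} for every field whose value changed."""
    return {
        key: (before.get(key), after[key])
        for key in after
        if before.get(key) != after[key]
    }


state = AgentState()
snapshot = asdict(state)  # asdict copies nested lists, so later mutation does not alter it

state.step += 1
state.current_tool = "search"
state.history.append("search('agents')")

changes = diff_states(snapshot, asdict(state))
```

Attaching `changes` to each step's log entry keeps the log compact while still showing exactly which transition preceded a failure.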

3. External Dependencies

Problem: the agent depends on external APIs and databases

Solution: dependency injection plus mocks

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolResult:
    """Minimal result type; the original leaves it undefined."""
    success: bool
    data: Any = None

# Define the interface
class ToolExecutor(ABC):
    @abstractmethod
    def execute(self, tool_name: str, **kwargs) -> ToolResult:
        pass

# Production implementation
class RealToolExecutor(ToolExecutor):
    def execute(self, tool_name: str, **kwargs) -> ToolResult:
        # Run the actual tool
        pass

# Test mock
class MockToolExecutor(ToolExecutor):
    def __init__(self, responses: dict):
        self.responses = responses

    def execute(self, tool_name: str, **kwargs) -> ToolResult:
        return self.responses.get(tool_name, ToolResult(success=False))

# Dependency injection
class Agent:
    def __init__(self, executor: ToolExecutor):
        self.executor = executor

# Test
def test_agent_with_mock():
    mock = MockToolExecutor({
        "search": ToolResult(success=True, data=["result1"])
    })
    agent = Agent(executor=mock)
    # Test logic...
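The test body above is left open. One possible completion is sketched below, under the assumption that the agent exposes a `run` method that calls a single `search` tool and summarizes the outcome. `ToolResult`, `Agent.run`, the `calls` list, and the returned strings are illustrative, not the article's actual implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolResult:
    success: bool
    data: Any = None


class ToolExecutor(ABC):
    @abstractmethod
    def execute(self, tool_name: str, **kwargs) -> ToolResult: ...


class MockToolExecutor(ToolExecutor):
    def __init__(self, responses: dict[str, ToolResult]):
        self.responses = responses
        self.calls: list[tuple[str, dict]] = []  # recorded for assertions

    def execute(self, tool_name: str, **kwargs) -> ToolResult:
        self.calls.append((tool_name, kwargs))
        return self.responses.get(tool_name, ToolResult(success=False))


class Agent:
    def __init__(self, executor: ToolExecutor):
        self.executor = executor

    def run(self, task: str) -> str:
        # Assumed behavior: one search call, then a short summary string.
        result = self.executor.execute("search", query=task)
        if not result.success:
            return "search failed"
        return f"found {len(result.data)} result(s)"


def test_agent_with_mock():
    mock = MockToolExecutor({"search": ToolResult(success=True, data=["result1"])})
    assert Agent(executor=mock).run("agents") == "found 1 result(s)"
    assert mock.calls == [("search", {"query": "agents"})]


def test_agent_handles_unknown_tool():
    # An empty response table makes every tool call fail deterministically.
    assert Agent(executor=MockToolExecutor({})).run("agents") == "search failed"
```

Recording the calls on the mock lets a test check not only what the agent returned but also which tool it chose and with which arguments.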

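Chapter 2: Testing Strategy

The pyramid: unit → integration → E2E

           /\
          /E2E\              ← few: critical use cases
         /------\
        /Integration\        ← some: API/DB integration
       /--------------\
      /   Unit tests   \     ← many: individual functions
     /------------------\

Unit Tests

Purpose: guarantee the correctness of individual functions

Targets:

Tool functions
Prompt templates
State transition logic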

import pytest
from unittest.mock import Mock, patch

from langchain_core.prompts import PromptTemplate  # assumed: LangChain's PromptTemplate

def test_tool_executor_search():
    """Unit test for tool execution."""
    executor = RealToolExecutor()

    with patch('requests.get') as mock_get:
        mock_get.return_value.json.return_value = {
            "results": ["item1", "item2"]
        }

        result = executor.execute("search", query="test")

        assert result.success is True
        assert len(result.data) == 2
        mock_get.assert_called_once()

def test_prompt_template():
    """Test for the prompt template."""
    template = PromptTemplate(
        template="Summarize: {text}",
        input_variables=["text"]
    )

    prompt = template.format(text="Hello world")

    assert "Summarize: Hello world" in prompt
    assert "{text}" not in prompt  # placeholder was substituted

def test_state_transition():
    """Test for state transitions."""
    state = AgentState()

    # Initial state
    assert state.step == 0
    assert state.error_count == 0

    # An error occurs
    state.error_count += 1
    assert state.error_count == 1

    # Reset
    state.error_count = 0
    assert state.error_count == 0

Integration Tests

Purpose: verify that components work together

Targets:

Agent and LLM
Agent and tools
Multi-agent systems

from langchain_openai import ChatOpenAI  # assumed import; the original does not show it

@pytest.mark.integration
def test_agent_with_llm():
    """Integration test for the agent and the LLM."""
    # Test LLM; temperature=0 reduces sampling variation but does not guarantee identical output
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        temperature=0,
        api_key="test-key"
    )

    agent = CodeAgent(llm=llm)

    # Mock the API response.
    # 'openai.ChatCompletion.create' is the legacy (pre-1.0) SDK entry point;
    # patch whichever call your client version actually makes.
    with patch('openai.ChatCompletion.create') as mock_create:
        mock_create.return_value = {
            "choices": [{"message": {"content": "def hello(): pass"}}]
        }

        result = agent.run("Write a hello function")

        assert "def hello" in result.content
        assert result.success is True

@pytest.mark.integration
def test_multi_agent_collaboration():
    """Integration test for multiple agents."""
    researcher = Agent(role="researcher")
    writer = Agent(role="writer")

    # Mock LLM; the responses are consumed in call order
    mock_llm = MockLLM(responses=[
        "Research complete: AI agents are trending",
        "Article written: The Future of AI Agents"
    ])

    researcher.llm = mock_llm
    writer.llm = mock_llm

    # Collaboration
    research = researcher.run("Research AI agents")
    article = writer.run(f"Write article based on: {research.content}")

    assert "AI agents" in research.content
    assert "Article" in article.content

End-to-End (E2E) Tests

Purpose: verify critical use cases from start to finish

@pytest.mark.e2e
@pytest.mark.slow
def test_customer_support_agent_flow():
    """E2E test for the customer support agent."""

    # 1. Setup
    agent = CustomerSupportAgent()
    db = TestDatabase()

    # 2. Run the scenario
    user_input = "Check the status of order #12345"

    response = agent.handle_message(
        user_id="test-user",
        message=user_input,
        context={"db": db}
    )

    # 3. Assertions
    assert response.status == "success"
    assert "order #12345" in response.content
    assert "shipped" in response.content or "processing" in response.content

    # 4. Verify side effects
    db.assert_query_executed("SELECT * FROM orders WHERE id = 12345")
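The tests in this chapter use custom markers (`integration`, `e2e`, `slow`), and the performance tests later in the article add `performance`. pytest warns about unregistered markers and rejects them under `--strict-markers`, so it helps to declare them once. A minimal sketch of a `conftest.py` using pytest's `pytest_configure` hook follows; the marker descriptions are placeholders.

```python
# conftest.py
MARKERS = {
    "integration": "exercises several components together",
    "e2e": "runs a full user scenario",
    "slow": "takes noticeably longer than a unit test",
    "performance": "checks latency or throughput",
}


def pytest_configure(config):
    """Register the custom markers so pytest recognizes them."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, `pytest -m "not slow and not e2e"` runs only the fast layers of the pyramid.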

Chapter 3: Observability

"What cannot be measured cannot be improved." - commonly attributed to Peter Drucker

1. Structured Logging

import logging
import traceback

import structlog

# Logger configuration
logging.basicConfig(format="%(message)s", level=logging.INFO)
structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.stdlib.add_logger_name,  # needs a stdlib logger, hence logger_factory below
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
)

logger = structlog.get_logger()

# Log around agent execution
class Agent:
    def run(self, task: str):
        logger.info("agent_execution_started", task=task)

        try:
            result = self._execute(task)
            logger.info(
                "agent_execution_completed",
                task=task,
                result_status=result.status,
                result_tokens=result.token_count,
                duration_ms=result.duration
            )
            return result
        except Exception as e:
            logger.error(
                "agent_execution_failed",
                task=task,
                error=str(e),
                error_type=type(e).__name__,
                traceback=traceback.format_exc()
            )
            raise
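The log calls above assume the result object already carries a `duration` field. Where it does not, the timing can be captured around the call instead. The sketch below uses only the standard library (`logging`, `json`, `time`) rather than structlog, and `log_step` is a hypothetical helper name. It emits one JSON line when a step finishes or fails, including the elapsed milliseconds.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("agent")


@contextmanager
def log_step(event: str, **fields):
    """Log `<event>_completed` or `<event>_failed` as one JSON line with duration_ms."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        elapsed = round((time.perf_counter() - start) * 1000, 1)
        logger.error(json.dumps({
            "event": f"{event}_failed",
            "error_type": type(exc).__name__,
            "error": str(exc),
            "duration_ms": elapsed,
            **fields,
        }))
        raise  # the caller still sees the original exception
    else:
        elapsed = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps({
            "event": f"{event}_completed",
            "duration_ms": elapsed,
            **fields,
        }))


logging.basicConfig(level=logging.INFO, format="%(message)s")

with log_step("agent_execution", task="Summarize this"):
    time.sleep(0.01)  # stand-in for the real agent call
```

Measuring with `time.perf_counter()` avoids jumps from system clock adjustments, which matters when durations feed into alerts.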

2. Tracing

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Tracer setup
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

class Agent:
    @tracer.start_as_current_span("agent.execute")
    def run(self, task: str):
        # Attributes on the outer span
        span = trace.get_current_span()
        span.set_attribute("task", task)
        span.set_attribute("agent.type", self.__class__.__name__)

        # Run the tool inside a nested span and record the count on that child span
        with tracer.start_as_current_span("tool.search") as tool_span:
            result = self.tools.search(task)
            tool_span.set_attribute("tool.result_count", len(result))

        return result

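3. Using Debuggers

Inspect AI^2: described as a debugger for LangChain/LlamaIndex agents

# Install
pip install inspect-ai

# Start a debug session (command as given in the source; confirm the subcommand with `inspect --help`)
inspect run --debug agent.py

# Set a breakpoint inside agent.py.
# The standard-library `inspect` module has no breakpoint() function,
# so use Python's built-in breakpoint(), which opens pdb:
breakpoint()

# Inspect variables at the prompt
(Pdb) print(state)
(Pdb) print(last_llm_response)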

DebuggerAI^3: a real-time LLM debugger

from debuggerai import DebugContext

# Create a debug context
debugger = DebugContext()

@debugger.watch  # monitor calls to this function
def agent_step(state: AgentState):
    response = llm.predict(state.prompt)

    # Inspect the state at run time
    debugger.inspect(
        prompt=state.prompt,
        response=response,
        tokens_used=response.usage.total_tokens
    )

    return response

# Run
debugger.start()
agent.run("Test task")
debugger.stop()

# Review the execution trace
print(debugger.get_trace())

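Chapter 4: Best Practices (Lessons from Production)

1. Prioritize Deterministic Tests

Bad: a test that includes randomness

# The result changes on every run
def test_agent_random():
    agent = Agent(temperature=0.7)  # non-deterministic
    result = agent.run("Summarize this")
    assert len(result) > 100  # fails intermittently

Good: fixed seed and mocks

# Same result on every run, provided the LLM behind Agent is mocked or replayed
def test_agent_deterministic():
    agent = Agent(temperature=0.0, seed=42)  # as deterministic as the backend allows
    result = agent.run("Summarize this")
    assert len(result) == 127  # fixed value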

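2. Managing Test Data

Problem: test data is scattered and hard to maintain

Solution: centralize it in fixtures

# tests/fixtures/agent_data.py
# (register this module via pytest_plugins in the root conftest.py so pytest discovers the fixtures)
import pytest

@pytest.fixture
def sample_task():
    return "Summarize the following article in 3 bullets..."

@pytest.fixture
def expected_summary():
    return """• Point 1
• Point 2
• Point 3"""

# tests/test_agent.py
def test_agent_summarization(sample_task, expected_summary):
    agent = Agent()  # needs a mocked or replayed LLM for the exact match below to be stable
    result = agent.run(sample_task)

    assert result.strip() == expected_summary.strip()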

3. Test Edge Cases

Common failure cases:

from unittest.mock import patch

# 1. Empty input
def test_agent_empty_input():
    agent = Agent()
    result = agent.run("")
    assert result.status == "error"
    assert "empty" in result.message.lower()

# 2. Very long input
def test_agent_long_input():
    agent = Agent()
    long_text = "word " * 100000  # exceeds the context window
    result = agent.run(long_text)
    assert result.status == "error"  # or success, if the agent trims the input

# 3. Special characters
def test_agent_special_chars():
    agent = Agent()
    special = "Test: <>\"'&\n\t\x00"
    result = agent.run(special)
    assert result.status == "success"

# 4. External API outage
def test_agent_api_failure():
    agent = Agent()
    with patch('requests.get') as mock_get:
        mock_get.side_effect = ConnectionError("API down")
        result = agent.run("Search for something")
        assert result.status == "error"
        assert "API" in result.message or "connection" in result.message.lower()
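The first two tests imply that the agent validates its input before any LLM call is made. A minimal sketch of such a guard follows, assuming a character budget as a stand-in for the real token limit. `Result`, `validate_task`, and `MAX_CHARS` are hypothetical names chosen to match the assertions above.

```python
from dataclasses import dataclass

MAX_CHARS = 20_000  # assumed budget; a real agent would count tokens instead


@dataclass
class Result:
    status: str
    message: str = ""


def validate_task(task: str, max_chars: int = MAX_CHARS) -> Result:
    """Reject inputs the agent should never forward to the LLM."""
    if not task.strip():
        return Result(status="error", message="Task is empty")
    if len(task) > max_chars:
        return Result(
            status="error",
            message=f"Task has {len(task)} characters; the limit is {max_chars}",
        )
    return Result(status="success")


empty = validate_task("")
too_long = validate_task("word " * 100_000)
special = validate_task("Test: <>\"'&\n\t\x00")
```

Rejecting over-long input is only one option. Trimming to the budget and returning success is equally valid, as long as the tests pin down which behavior the agent promises.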

4. Performance Tests

import time
from concurrent.futures import ThreadPoolExecutor

import pytest

@pytest.mark.performance
def test_agent_latency():
    """Test the agent's response time."""
    agent = Agent()

    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    result = agent.run("Quick task")
    duration = time.perf_counter() - start

    assert result.status == "success"
    assert duration < 5.0  # within 5 seconds

@pytest.mark.performance
def test_agent_concurrent():
    """Test concurrent execution."""
    agent = Agent()

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [
            executor.submit(agent.run, f"Task {i}")
            for i in range(10)
        ]
        results = [f.result() for f in futures]

    # All runs succeed
    assert all(r.status == "success" for r in results)
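A single timed run is sensitive to noise, so a latency assertion tends to be steadier when it is made on a percentile over several runs. The sketch below shows one way to do that with the standard library. `measure_latencies` and `percentile` are hypothetical helpers, and the nearest-rank percentile definition is an assumption.

```python
import math
import time
from typing import Callable


def measure_latencies(fn: Callable[[], object], runs: int = 20) -> list[float]:
    """Call fn `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# A cheap stand-in workload; in a real test this would be `lambda: agent.run("Quick task")`.
latencies = measure_latencies(lambda: sum(range(1000)), runs=20)
p95 = percentile(latencies, 95)
```

A test would then assert `p95 < 5.0` instead of timing a single call, which tolerates one slow outlier without hiding a genuine regression.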

5. Regression Tests

Problem: bugs from the past come back

Solution: turn past failures into tests

# tests/test_regression.py
# Issue #42: the agent enters an infinite loop
def test_regression_infinite_loop():
    agent = Agent(max_steps=10)  # safety net
    result = agent.run("Recursively search forever")

    assert agent.state.step <= 10  # loop limit holds
    assert result.status == "error"

# Issue #78: a specific prompt crashes the agent
def test_regression_prompt_crash():
    agent = Agent()
    # Prompt that crashed the agent in the past
    crash_prompt = "Translate to emojis only: 🚀🔥💻"
    result = agent.run(crash_prompt)

    assert result.status in ["success", "error"]  # does not crash
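The first regression test presupposes that `max_steps` is actually enforced inside the agent loop. A minimal sketch of such a guard follows, assuming a hypothetical `choose_tool` callback that returns the next tool call, or `None` when the task is done. It stops on either the step budget or an immediate repeat of the same call, which is one common signature of a tool-selection loop.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ToolCall = tuple[str, str]  # (tool name, argument), simplified for the sketch


@dataclass
class RunResult:
    status: str
    steps: int
    message: str = ""


def run_with_guard(
    choose_tool: Callable[[int], Optional[ToolCall]],
    max_steps: int = 10,
) -> RunResult:
    """Drive the agent loop, stopping on completion, a repeated call, or the step budget."""
    previous: Optional[ToolCall] = None
    for step in range(1, max_steps + 1):
        call = choose_tool(step)
        if call is None:  # the agent reports that it is finished
            return RunResult(status="success", steps=step)
        if call == previous:  # same tool and argument twice in a row
            return RunResult(status="error", steps=step, message=f"repeated call {call}")
        previous = call
    return RunResult(status="error", steps=max_steps, message="step budget exhausted")


stuck = run_with_guard(lambda step: ("search", "same query"))
wandering = run_with_guard(lambda step: ("search", f"query {step}"), max_steps=10)
done = run_with_guard(lambda step: None if step == 3 else ("search", f"q{step}"))
```

The repeat check catches the tight loop early, while the step budget remains the backstop for loops that vary their arguments on every iteration.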

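Chapter 5: Current Tools

1. Inspect AI

Overview: described as a visual debugger for LangChain/LlamaIndex

Features:

Step-by-step execution
Variable inspection
Visualization of LLM calls

Installation:

pip install inspect-ai

Example:

# Interface as presented in the source; confirm the class names and the
# debug flag against the documentation of the installed inspect_ai version.
from inspect_ai import Agent, Tool

agent = Agent(
    name="researcher",
    tools=[Tool.search, Tool.read],
    debug=True  # enable debug mode
)

# Running the agent opens the debugger in the browser
result = agent.run("Research latest AI trends")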

2. DebuggerAI

Overview: a real-time LLM debugger

Features:

Breakpoints
Variable watches
Execution traces

Example:

from debuggerai import debug

@debug.watch
def my_agent(prompt):
    response = llm.predict(prompt)
    # Breakpoint here
    debug.breakpoint()
    return response

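3. LangSmith

Overview: the observability platform from the LangChain team

Features:

Execution traces
Performance analysis
A/B testing

Example:

from langsmith import traceable

@traceable(name="my_agent")
def my_agent(task):
    # Agent logic goes here; a placeholder result keeps the example runnable
    result = f"Handled: {task}"
    return result

# Run data is sent to the LangSmith dashboard when tracing is enabled and an API key is configured
result = my_agent("Test task")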

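4. TruLens

Overview: an evaluation framework for LLM applications

Features:

RAG quality evaluation
Answer relevance scoring
Hallucination detection

Example:

from trulens_eval import Tru, Feedback

# Define the evaluation metric.
# `relevance` must be a feedback function (for example one supplied by a TruLens provider); it is not defined here.
f_relevance = Feedback(relevance).on_input_output()

# Evaluate the agent.
# `tru.run(...)` is reproduced from the source; TruLens documentation wraps the app
# in a recorder such as TruBasicApp instead, so check the API of the installed version.
tru = Tru()
recorder = tru.run(
    agent=my_agent,
    feedbacks=[f_relevance]
)

# Check the results
print(recorder.feedback)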

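Chapter 6: A Production Workflow

Development Cycle

1. Write unit tests (TDD)
   ↓
2. Implement the agent
   ↓
3. Integration tests
   ↓
4. Manual debugging (Inspect AI)
   ↓
5. E2E tests
   ↓
6. Deploy to production
   ↓
7. Collect observability data
   ↓
8. Feedback → back to step 1

Checklist

Before development:

[ ] Decide on a testing strategy
[ ] Prepare mocks and fixtures
[ ] Set up logging and tracing

During implementation:

[ ] Write unit tests first
[ ] Run tests automatically in CI/CD
[ ] Check test coverage in code review

Before deployment:

[ ] E2E tests pass
[ ] Performance tests pass
[ ] Security review complete

In production:

[ ] Monitor logs
[ ] Analyze error trends
[ ] Run regression tests regularly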

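Conclusion

Debugging AI agents raises challenges that conventional software does not, but they can be overcome with deterministic tests, observability by design, and current tooling.

Key points:

Deterministic mocks: make tests reproducible
Observability: make problems visible through logging and tracing
Layered testing: unit → integration → E2E
Current tools: Inspect AI, DebuggerAI, LangSmith

"Debugging is not the work of finding bugs; it is the process of understanding the system." - Brian Kernighan

In AI agent development, debugging skill is the strongest weapon you have. We hope this article helps speed up your own agent development.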

This article was researched and written by Pengu Press AI.