Why Testing Matters for AI Agents

The Challenge with Traditional AI Agents

Most AI agent frameworks are hard to test: behavior lives in one monolithic prompt and outputs are non-deterministic, so there is no stable unit to write assertions against. Systematic testing gives you:

Reliability

Ensure your agent behaves consistently across different inputs and scenarios

Regression Prevention

Catch when changes to your agent break existing functionality

Quality Assurance

Validate agent behavior before deploying to production

Compliance

Meet enterprise requirements for auditable and testable systems

NOMOS Testing Approach

NOMOS enables comprehensive testing through its step-based architecture:

Unit and End-to-End Testing

Each step in your agent can be tested independently with unit tests, while complete user scenarios can be validated with end-to-end tests.

Testing Architecture

NOMOS supports two primary approaches for writing tests:

# YAML Test Configuration: Define tests declaratively using YAML files
llm:
  provider: openai
  model: gpt-4o-mini

unit:
  test_greeting_response:
    input: ""
    expectation: "Greets the user warmly and asks how to help"

  test_order_taking_with_context:
    context:
      current_step_id: "take_order"
      history:
        - type: summary
          summary:
            - "Customer expressed interest in ordering coffee"
            - "Agent moved to order-taking step"
    input: "I'd like a large latte"
    expectation: "Acknowledges the order and asks for any additional items"

  test_invalid_transition:
    context:
      current_step_id: "greeting"
    input: "Process my payment"
    expectation: "Explains that payment processing comes after order confirmation"
    invalid: true  # This test expects the agent to NOT transition inappropriately

Choosing Your Testing Approach

Configuration-Based

Best for: Simple test cases, quick setup, non-developers

  • Declarative YAML syntax
  • No programming required
  • Built-in test runner
  • Easy to maintain

Pythonic (pytest)

Best for: Complex logic, custom assertions, developers

  • Full Python programming capabilities
  • Custom test fixtures and utilities
  • Advanced assertions with smart_assert
  • Integration with existing Python test suites

Pytest Features for NOMOS

When using the Pythonic approach, NOMOS provides special pytest features:

# AI-Powered Test Validation: Use smart_assert for natural language test validation
import pytest

from nomos import Agent
from nomos.testing import smart_assert  # import paths may vary across NOMOS versions

def test_tool_call_validation(agent: Agent):
    """Test that agent makes correct tool calls."""
    decision, _, _ = agent.next("I want to calculate my budget for $5000 income")

    # Traditional assertion
    assert decision.action.value == "TOOL_CALL"
    assert decision.tool_call.tool_name == "calculate_budget"

    # Smart assertion using natural language
    smart_assert(
        decision,
        "Calls the calculate_budget tool with monthly income of 5000",
        agent.llm
    )

    # You can also check negative cases
    with pytest.raises(AssertionError):
        smart_assert(
            decision,
            "Responds with text instead of calling a tool",
            agent.llm
        )
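
The `agent` fixture these tests rely on isn't shown above. A minimal sketch is below; `Agent.from_config` is a placeholder rather than a confirmed NOMOS API, so swap in however your project actually constructs its agent:

# Agent fixture (sketch): provide the agent under test to each test
import pytest

from nomos import Agent  # import path assumed; adjust to your setup

@pytest.fixture
def agent() -> Agent:
    # Placeholder construction: replace with however your project actually
    # builds its agent (e.g., from config.agent.yaml).
    return Agent.from_config("config.agent.yaml")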

Test Configuration

Basic Test Structure

Each test case pairs an input with an expectation, plus an optional context and invalid flag:

input: "I want to order a coffee"         # The message or query to send to the agent
context: {current_step_id: "take_order"}  # Optional: the state the agent starts from
expectation: "Acknowledges the order"     # Expected behavior, in natural language
invalid: false                            # Optional: true when the behavior should NOT occur

Advanced Test Scenarios

Testing Tool Usage:

test_tool_integration:
  context:
    current_step_id: "check_inventory"
  input: "Do you have medium lattes available?"
  expectation: "Uses get_available_coffee_options tool and provides accurate availability"

Testing Step Transitions:

test_step_routing:
  context:
    current_step_id: "order_complete"
  input: "Thank you, goodbye"
  expectation: "Transitions to farewell step and thanks customer"

Testing Error Handling:

test_invalid_input:
  context:
    current_step_id: "payment"
  input: "banana helicopter"
  expectation: "Asks for clarification about payment method"
  invalid: true

Running Tests

Command Line Interface

# YAML Configuration Testing: Use the NOMOS CLI to run YAML-defined tests

# Run all tests for an agent
nomos test --config config.agent.yaml --tests tests.agent.yaml

# Run specific test cases
nomos test --config config.agent.yaml --tests tests.agent.yaml --filter "test_greeting"

# Run tests with verbose output
nomos test --config config.agent.yaml --tests tests.agent.yaml --verbose

# Generate test coverage report
nomos test --config config.agent.yaml --tests tests.agent.yaml --coverage

Testing Best Practices

1. Test Each Step Independently

Step-Level Testing

Create tests for each step’s specific behavior and available tools

# Test greeting step
test_greeting:
  context:
    current_step_id: "greeting"
  input: "Hello"
  expectation: "Warm greeting and explanation of available services"

# Test order step
test_order_taking:
  context:
    current_step_id: "take_order"
  input: "I want a latte"
  expectation: "Uses menu tools and confirms order details"

2. Test Transitions and Routing

Flow Testing

Verify that your agent transitions correctly between steps

test_order_to_payment:
  context:
    current_step_id: "confirm_order"
  input: "Yes, proceed with payment"
  expectation: "Transitions to payment step and requests payment details"

3. Test Edge Cases and Error Handling

Robustness Testing

Ensure your agent handles unexpected inputs gracefully

test_unclear_input:
  context:
    current_step_id: "take_order"
  input: "Maybe something warm"
  expectation: "Asks clarifying questions about coffee preferences"

4. Test Tool Integration

Tool Testing

Verify that tools are called correctly with proper parameters

test_tool_parameters:
  context:
    current_step_id: "add_item"
  input: "Add a large cappuccino to my order"
  expectation: "Calls add_to_cart with coffee_type='Cappuccino', size='Large'"

End-to-End Testing

While unit testing validates individual steps, end-to-end (E2E) testing validates complete user scenarios from start to finish. NOMOS provides scenario-based testing to simulate real user interactions.

Scenario Testing

E2E tests use scenarios that describe complete user journeys:

# YAML Scenario Definition: Define scenarios declaratively
llm:
  provider: openai
  model: gpt-4o-mini

scenarios:
  complete_coffee_order:
    scenario: "New customer wants to order a medium cappuccino with an extra shot and pay by card"
    expectation: "Agent should greet, show menu, take order, confirm details, and process payment"
    max_turns: 15

  handle_unavailable_item:
    scenario: "Customer orders an item that's not available and needs an alternative"
    expectation: "Agent politely explains unavailability and suggests alternatives"
    max_turns: 8

Scenario Configuration

Scenarios can live in the same test file as your unit tests or in a dedicated file; the block below continues tests.agent.yaml. More involved scenarios follow the same shape:

# tests.agent.yaml (continued)
scenarios:
  complex_multi_item_order:
    scenario: "Customer orders multiple different drinks with modifications for a group"
    expectation: "Agent accurately captures all items and modifications, confirms total"
    max_turns: 20
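
Scenarios are normally executed by the NOMOS test runner (next section), but quick E2E-style checks can also be scripted from pytest. One loud assumption in this sketch: it presumes `agent.next` carries conversational state across calls; if your NOMOS version requires explicit session handling, thread that state through instead:

# Scripted multi-turn check (sketch; statefulness of agent.next is assumed)
def test_scripted_order_flow(agent: Agent):
    turns = [
        ("Hi there", "Greets the customer and offers to help"),
        ("A medium cappuccino, please", "Captures the order and confirms details"),
        ("That's all, I'll pay by card", "Moves toward card payment"),
    ]
    for user_message, expected in turns:
        decision, _, _ = agent.next(user_message)  # state carry-over assumed
        smart_assert(decision, expected, agent.llm)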

Running E2E Tests

Execute end-to-end tests using the NOMOS CLI:

# Run all E2E scenarios
nomos test --e2e ./e2e_tests.yaml

# Run specific scenario
nomos test --e2e ./e2e_tests.yaml --scenario complete_coffee_order

# Run with detailed output
nomos test --e2e ./e2e_tests.yaml --verbose

When to Use E2E Testing

  • Validate complete user workflows
  • Test complex multi-step interactions
  • Verify agent behavior across step transitions
  • Ensure agents handle edge cases gracefully
  • Test integration with external systems

E2E Best Practices

Effective E2E Testing

  1. Representative Scenarios: Test real user journeys, not just happy paths
  2. Edge Cases: Include scenarios for errors, unavailable items, and unusual requests
  3. Conversation Limits: Set reasonable max_turns to prevent infinite loops
  4. Clear Expectations: Write specific, measurable success criteria
  5. Incremental Testing: Start with simple scenarios, add complexity gradually

Continuous Integration

GitHub Actions Example

name: NOMOS Agent Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install nomos
          pip install -r requirements.txt

      - name: Run agent tests
        run: |
          nomos test --config config.agent.yaml --tests tests.agent.yaml --coverage
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Testing Strategy

  1. Start with Happy Path: Test the main user journey through your agent with typical inputs
  2. Add Edge Cases: Include tests for unusual inputs, error conditions, and boundary cases
  3. Test Tool Integration: Verify that each tool is called correctly with proper parameters
  4. Validate Transitions: Ensure step routing works correctly under different conditions
  5. Performance Testing: Test response times and resource usage under load

Start Small, Scale Up

Begin with a few critical test cases and gradually expand your test suite as your agent becomes more complex.

Real-World Example

Here’s how the barista agent might be tested using both approaches:

# YAML Test Configuration: Declarative test definitions
llm:
  provider: openai
  model: gpt-4o-mini

unit:
  test_greeting_new_customer:
    input: "Hi there"
    expectation: "Greets warmly and offers to show menu or take order"

  test_menu_inquiry:
    context:
      current_step_id: "start"
    input: "What drinks do you have?"
    expectation: "Uses get_available_coffee_options and lists available drinks with prices"

  test_add_to_cart:
    context:
      current_step_id: "take_coffee_order"
    input: "I'll have a large latte"
    expectation: "Calls add_to_cart with correct parameters and confirms addition"

  test_invalid_payment_method:
    context:
      current_step_id: "finalize_order"
    input: "I'll pay with bitcoin"
    expectation: "Explains accepted payment methods (Card or Cash)"
    invalid: true

scenarios:
  complete_order_flow:
    scenario: "Customer orders a medium cappuccino and pays with card"
    expectation: "Successfully processes order from greeting to payment completion"
    max_turns: 12
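
And here is the Pythonic side of the same suite, under the same assumptions as the earlier sketches (the `agent` fixture and the `nomos.testing` import path):

# Pythonic test definitions (sketch)
import pytest

from nomos import Agent  # import paths assumed; adjust to your setup
from nomos.testing import smart_assert

def test_greeting_new_customer(agent: Agent):
    decision, _, _ = agent.next("Hi there")
    smart_assert(decision, "Greets warmly and offers to show menu or take order", agent.llm)

def test_menu_inquiry(agent: Agent):
    decision, _, _ = agent.next("What drinks do you have?")
    # Mirrors the YAML expectation: the menu tool should be consulted.
    assert decision.action.value == "TOOL_CALL"
    assert decision.tool_call.tool_name == "get_available_coffee_options"

def test_invalid_payment_method(agent: Agent):
    decision, _, _ = agent.next("I'll pay with bitcoin")
    # Negative case: the agent should NOT accept bitcoin.
    with pytest.raises(AssertionError):
        smart_assert(decision, "Accepts bitcoin as a payment method", agent.llm)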

This comprehensive testing approach ensures your NOMOS agents are reliable, predictable, and ready for production deployment.