NOMOS provides comprehensive observability features that give you deep insights into your agent’s behavior, performance, and decision-making processes. Unlike traditional AI frameworks that operate as black boxes, NOMOS offers granular visibility into every step, tool call, and transition.

Why Observability Matters for AI Agents

The Black Box Problem: Traditional AI agents are often impossible to debug or monitor effectively, making it difficult to understand why they behaved in unexpected ways or how to optimize their performance.

Debug Agent Behavior

Understand exactly what your agent is thinking and why it made specific decisions

Performance Monitoring

Track response times, error rates, and resource utilization across steps and tools

Production Insights

Monitor agent behavior in production to identify issues and optimization opportunities

Compliance & Auditing

Maintain detailed audit trails for enterprise compliance and regulatory requirements

NOMOS Observability Stack

NOMOS provides multiple layers of observability:
Multi-Layer Monitoring: From high-level session tracking to granular tool execution monitoring, NOMOS gives you complete visibility into your agent’s operation.

1. Structured Logging

Built-in logging with configurable levels and output formats:
# config.agent.yaml
logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}"
    - type: "file"
      level: "DEBUG"
      format: "{time} | {level} | {name} | {message}"

2. OpenTelemetry Tracing

Distributed tracing for deep insight into agent execution:
# Enable tracing
import os
os.environ["ENABLE_TRACING"] = "true"
os.environ["SERVICE_NAME"] = "my-agent"
os.environ["SERVICE_VERSION"] = "1.0.0"

# Automatic instrumentation
from nomos import Agent
agent = Agent.from_config(config, llm)
session = agent.create_session()  # Automatically traced

3. Elastic APM Integration

Production-ready monitoring with Elastic APM:
# Environment variables for Elastic APM
export ELASTIC_APM_SERVER_URL="http://localhost:8200"
export ELASTIC_APM_TOKEN="your-apm-token"
export ENABLE_TRACING="true"

Logging Configuration

Basic Logging Setup

Enable structured logging in your agent configuration:
# config.agent.yaml
name: my_agent
# ... other config ...

logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}"

Advanced Logging Configuration

Configure multiple handlers with different levels:
logging:
  enable: true
  handlers:
    # Console output for development
    - type: "stderr"
      level: "INFO"
      format: "{time} | {level:<8} | {message}"

    # Detailed file logging for debugging
    - type: "file"
      level: "DEBUG"
      format: "{time:YYYY-MM-DD HH:mm:ss.SSS} | {level:<8} | {name} | {function}:{line} | {message}"

    # Error-only logging for alerts
    - type: "file"
      level: "ERROR"
      format: "{time} | ERROR | {message} | {extra}"

Environment-Based Logging

Control logging through environment variables:
# Enable logging
export NOMOS_ENABLE_LOGGING="true"
export NOMOS_LOG_LEVEL="DEBUG"

# Run your agent
nomos run --config config.agent.yaml

Programmatic Logging Control

from nomos.utils.logging import log_info, log_debug, log_error

# In your tools or custom code
def my_tool(query: str) -> str:
    log_info(f"Processing query: {query}")

    try:
        result = process_query(query)
        log_debug(f"Query result: {result}")
        return result
    except Exception as e:
        log_error(f"Query processing failed: {str(e)}")
        raise

OpenTelemetry Tracing

Automatic Instrumentation

NOMOS automatically instruments key operations when tracing is enabled:
  • Session Creation
  • Step Execution
  • Tool Calls
  • LLM Calls
For example, Agent.create_session() produces a span like:
Span: Agent.create_session
Attributes:
- agent.name: "my_agent"
- agent.class: "Agent"
- session.id: "uuid-123"
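
If you want to see these spans locally without an APM backend, a minimal sketch using the OpenTelemetry SDK’s console exporter works (the exporter wiring here is illustrative and not part of NOMOS itself; it assumes the opentelemetry-sdk package is installed):
# Print every span to stdout for local debugging
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
# Spans created after this point (including NOMOS's) are printed to the console.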

Custom Tracing

Add custom spans to your tools:
from opentelemetry import trace

def complex_calculation(data: str) -> str:
    """Tool with custom tracing."""
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("data_processing") as span:
        span.set_attribute("data.size", len(data))
        span.set_attribute("operation", "calculation")

        try:
            # Your processing logic
            result = expensive_operation(data)
            span.set_attribute("result.size", len(result))
            span.set_attribute("processing.success", True)
            return result

        except Exception as e:
            span.record_exception(e)
            span.set_attribute("processing.success", False)
            raise

Elastic APM Integration

Setup and Configuration

  1. Install Elastic APM Server (or use Elastic Cloud)
  2. Configure NOMOS for Elastic APM:
# Environment variables
export ENABLE_TRACING="true"
export ELASTIC_APM_SERVER_URL="http://localhost:8200"
export ELASTIC_APM_TOKEN="your-secret-token"
export SERVICE_NAME="nomos-agent"
export SERVICE_VERSION="1.0.0"
  3. Start your agent:
nomos run --config config.agent.yaml

Elastic APM Features

End-to-End Visibility: Track requests across your entire agent workflow:
  • Session creation and lifecycle
  • Step-by-step execution flow
  • Tool call dependencies
  • LLM API interactions
  • External service calls
Response Time Analysis: Monitor performance metrics:
  • Average response times per step
  • Tool execution duration
  • LLM API latency
  • Error rates and patterns
  • Throughput and capacity metrics
Exception Management: Automatic error capture and analysis:
  • Tool execution failures
  • LLM API errors
  • Step transition issues
  • Custom exception tracking
  • Error correlation across services
Dependency Visualization: A visual representation of your agent’s architecture:
  • Agent components and their relationships
  • External service dependencies
  • Tool usage patterns
  • Performance bottleneck identification

Elastic APM Dashboard Examples

Agent Performance Dashboard:
{
  "visualization": "line_chart",
  "metric": "transaction.duration.avg",
  "filters": {
    "service.name": "nomos-agent",
    "transaction.type": "session"
  },
  "group_by": "step.id"
}
Tool Usage Analytics:
{
  "visualization": "bar_chart",
  "metric": "span.count",
  "filters": {
    "span.type": "tool_call"
  },
  "group_by": "tool.name"
}
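
If you prefer to pull these numbers ad hoc instead of building a dashboard, here is a minimal sketch that queries Elasticsearch’s APM indices directly (the traces-apm-* index pattern and the field names are assumptions; adjust them to your APM version and mapping):
# Count tool_call spans per tool with an Elasticsearch terms aggregation.
# Index pattern and field names are assumptions; adjust for your APM version.
import json
import urllib.request

query = {
    "size": 0,
    "query": {"term": {"span.type": "tool_call"}},
    "aggs": {"tools": {"terms": {"field": "span.name"}}},
}
req = urllib.request.Request(
    "http://localhost:9200/traces-apm-*/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    buckets = json.load(resp)["aggregations"]["tools"]["buckets"]

for bucket in buckets:
    print(bucket["key"], bucket["doc_count"])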

Production Monitoring

Docker Deployment with Observability

# docker-compose.yml
version: '3.8'
services:
  nomos-agent:
    image: nomos:latest
    environment:
      - ENABLE_TRACING=true
      - ELASTIC_APM_SERVER_URL=http://apm-server:8200
      - ELASTIC_APM_TOKEN=${APM_TOKEN}
      - SERVICE_NAME=nomos-production
      - NOMOS_ENABLE_LOGGING=true
      - NOMOS_LOG_LEVEL=INFO
    volumes:
      - ./config.agent.yaml:/app/config.agent.yaml
      - ./logs:/app/logs
    ports:
      - "8000:8000"
    depends_on:
      - apm-server

  apm-server:
    image: docker.elastic.co/apm/apm-server:8.8.0
    # Minimal APM Server wiring; adjust output settings for your stack
    command: >
      apm-server -e
      -E output.elasticsearch.hosts=["elasticsearch:9200"]
    ports:
      - "8200:8200"
    depends_on:
      - elasticsearch

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.8.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

Kubernetes Monitoring

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nomos-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nomos-agent
  template:
    metadata:
      labels:
        app: nomos-agent
    spec:
      containers:
      - name: nomos-agent
        image: nomos:latest
        env:
        - name: ENABLE_TRACING
          value: "true"
        - name: ELASTIC_APM_SERVER_URL
          valueFrom:
            secretKeyRef:
              name: elastic-config
              key: apm-server-url
        - name: ELASTIC_APM_TOKEN
          valueFrom:
            secretKeyRef:
              name: elastic-config
              key: apm-token
        - name: SERVICE_NAME
          value: "nomos-k8s"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Monitoring Best Practices

1. Structured Data Collection

Consistent Attributes

Use consistent attribute naming across your traces and logs:
# Good: Consistent attribute naming
span.set_attribute("user.id", user_id)
span.set_attribute("session.id", session_id)
span.set_attribute("step.id", current_step)

# Avoid: Inconsistent naming
span.set_attribute("userId", user_id)
span.set_attribute("sessionID", session_id)
span.set_attribute("current_step", current_step)

2. Performance Monitoring

Key Metrics

Track essential performance indicators:
# Custom metrics in tools
import time

from opentelemetry import trace

def expensive_tool(query: str) -> str:
    start_time = time.time()
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("expensive_operation") as span:
        span.set_attribute("operation.type", "data_processing")
        span.set_attribute("query.length", len(query))

        result = process_data(query)

        duration = time.time() - start_time
        span.set_attribute("operation.duration_ms", duration * 1000)
        span.set_attribute("result.status", "success")

        return result

3. Error Context Collection

Rich Error Information

Capture comprehensive error context for debugging:
from opentelemetry import trace

def error_prone_tool(data: str) -> str:
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("risky_operation") as span:
        try:
            span.set_attribute("input.data_type", type(data).__name__)
            span.set_attribute("input.length", len(data))

            result = risky_operation(data)
            return result

        except ValueError as e:
            span.record_exception(e)
            span.set_attribute("error.type", "validation_error")
            span.set_attribute("error.input", data[:100])  # First 100 chars
            raise
        except Exception as e:
            span.record_exception(e)
            span.set_attribute("error.type", "unexpected_error")
            raise

4. Security Considerations

Data Redaction

Protect sensitive information in logs and traces:
from opentelemetry import trace

def secure_tool(user_data: str) -> str:
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("secure_operation") as span:
        # Redact sensitive data
        redacted_data = redact_sensitive_info(user_data)
        span.set_attribute("input.redacted", redacted_data)
        span.set_attribute("input.length", len(user_data))

        # Process without logging actual sensitive data
        result = process_user_data(user_data)

        span.set_attribute("result.status", "processed")
        return result

def redact_sensitive_info(data: str) -> str:
    """Redact sensitive information from data."""
    import re
    # Remove email addresses
    data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', data)
    # Remove phone numbers
    data = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', '[PHONE]', data)
    # Remove SSNs
    data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
    return data

Observability in Action

Real-World Example: Barista Agent

Here’s how observability looks for the barista agent:
# config.agent.yaml with observability
name: barista
persona: "Friendly coffee shop assistant..."

logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time} | {level} | Barista | {message}"

steps:
  - step_id: start
    description: "Greet customer and show menu"
    available_tools:
      - get_available_coffee_options
Trace Output:
Session.create_session [duration: 15ms]
├── agent.name: "barista"
├── session.id: "uuid-456"
└── session.start_time: "2025-07-02T10:30:00Z"

Session.next [duration: 1.2s]
├── current_step: "start"
├── decision.action: "TOOL_CALL"
├── tool.name: "get_available_coffee_options"
└── Session._run_tool [duration: 50ms]
    ├── tool.kwargs: "{}"
    ├── tool.result: "Available Coffee Options:\nLatte: Small ($3.00)..."
    └── tool.success: true

Session.next [duration: 800ms]
├── current_step: "start"
├── decision.action: "RESPOND"
├── llm._get_output [duration: 750ms]
│   ├── llm.provider: "openai"
│   ├── llm.model: "gpt-4o-mini"
│   └── llm.success: true
└── decision.response: "Welcome! Here are our coffee options..."
Log Output:
2025-07-02 10:30:00.123 | INFO | Barista | Session created for user interaction
2025-07-02 10:30:00.138 | INFO | Barista | Executing step: start
2025-07-02 10:30:00.145 | DEBUG | Barista | Calling tool: get_available_coffee_options
2025-07-02 10:30:00.195 | INFO | Barista | Tool execution successful: get_available_coffee_options
2025-07-02 10:30:00.945 | INFO | Barista | LLM decision: RESPOND
2025-07-02 10:30:00.946 | INFO | Barista | Response sent to user

Troubleshooting Observability

Common Issues

Problem: No traces appearing in APM
Solutions:
  • Verify ENABLE_TRACING=true is set
  • Check that ELASTIC_APM_SERVER_URL is correct
  • Ensure ELASTIC_APM_TOKEN has proper permissions
  • Check network connectivity to the APM server, as in the sketch below
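To rule out basic connectivity problems quickly, a minimal standard-library sketch (the fallback URL is illustrative; use whatever you configured):
# Quick reachability check against the APM server
import os
import urllib.request

url = os.environ.get("ELASTIC_APM_SERVER_URL", "http://localhost:8200")
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(f"APM server reachable: HTTP {resp.status}")
except OSError as exc:
    print(f"APM server unreachable: {exc}")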
Problem: Tool calls not showing in traces
Solutions:
  • Ensure tools are called through the NOMOS framework
  • Check that tool functions are properly decorated
  • Verify tool execution doesn’t bypass tracing
Problem: Tracing causing performance issues
Solutions:
  • Reduce the trace sampling rate, as in the sketch below
  • Filter out verbose operations
  • Use async span export
  • Optimize span attribute collection
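If overhead is the problem, a minimal sketch that lowers the sampling rate with the OpenTelemetry SDK (the 10% ratio is illustrative, and NOMOS’s own provider setup may differ):
# Keep roughly 10% of traces to reduce tracing overhead
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
trace.set_tracer_provider(provider)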
Start Simple, Scale Up: Begin with basic logging, then add tracing as needed. Enable detailed debugging only when investigating specific issues to minimize performance impact.
NOMOS observability gives you the insights needed to build, debug, and optimize reliable AI agents for production use. The combination of structured logging, distributed tracing, and APM integration provides comprehensive visibility into your agent’s behavior and performance.