NOMOS provides comprehensive observability features that give you deep insights into your agent’s behavior, performance, and decision-making processes. Unlike traditional AI frameworks that operate as black boxes, NOMOS offers granular visibility into every step, tool call, and transition.

Why Observability Matters for AI Agents

The Black Box Problem: Traditional AI agents are often impossible to debug or monitor effectively, making it difficult to understand why they behaved in unexpected ways or how to optimize their performance.

Debug Agent Behavior

Understand exactly what your agent is thinking and why it made specific decisions

Performance Monitoring

Track response times, error rates, and resource utilization across steps and tools

Production Insights

Monitor agent behavior in production to identify issues and optimization opportunities

Compliance & Auditing

Maintain detailed audit trails for enterprise compliance and regulatory requirements

NOMOS Observability Stack

NOMOS provides multiple layers of observability:
Multi-Layer Monitoring: From high-level session tracking to granular tool execution monitoring, NOMOS gives you complete visibility into your agent’s operation.

1. Structured Logging

Built-in logging with configurable levels and output formats:
# config.agent.yaml
logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}"
    - type: "file"
      level: "DEBUG"
      format: "{time} | {level} | {name} | {message}"

2. OpenTelemetry Tracing

Distributed tracing for deep insight into agent execution:
# Enable tracing
import os
os.environ["ENABLE_TRACING"] = "true"
os.environ["SERVICE_NAME"] = "my-agent"
os.environ["SERVICE_VERSION"] = "1.0.0"

# Automatic instrumentation
from nomos import Agent
agent = Agent.from_config(config, llm)
session = agent.create_session()  # Automatically traced

3. Elastic APM Integration

Production-ready monitoring with Elastic APM:
# Environment variables for Elastic APM
export ELASTIC_APM_SERVER_URL="http://localhost:8200"
export ELASTIC_APM_TOKEN="your-apm-token"
export ENABLE_TRACING="true"

Logging Configuration

Basic Logging Setup

Enable structured logging in your agent configuration:
# config.agent.yaml
name: my_agent
# ... other config ...

logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}"

Advanced Logging Configuration

Configure multiple handlers with different levels:
logging:
  enable: true
  handlers:
    # Console output for development
    - type: "stderr"
      level: "INFO"
      format: "{time} | {level:<8} | {message}"

    # Detailed file logging for debugging
    - type: "file"
      level: "DEBUG"
      format: "{time:YYYY-MM-DD HH:mm:ss.SSS} | {level:<8} | {name} | {function}:{line} | {message}"

    # Error-only logging for alerts
    - type: "file"
      level: "ERROR"
      format: "{time} | ERROR | {message} | {extra}"

Environment-Based Logging

Control logging through environment variables:
# Enable logging
export NOMOS_ENABLE_LOGGING="true"
export NOMOS_LOG_LEVEL="DEBUG"

# Run your agent
nomos run --config config.agent.yaml

Programmatic Logging Control

from nomos.utils.logging import log_info, log_debug, log_error

# In your tools or custom code
def my_tool(query: str) -> str:
    log_info(f"Processing query: {query}")

    try:
        result = process_query(query)
        log_debug(f"Query result: {result}")
        return result
    except Exception as e:
        log_error(f"Query processing failed: {str(e)}")
        raise

OpenTelemetry Tracing

Automatic Instrumentation

NOMOS automatically instruments key operations when tracing is enabled:
  • Session Creation
  • Step Execution
  • Tool Calls
  • LLM Calls
For example, Agent.create_session() produces a span like:
Span: Agent.create_session
Attributes:
- agent.name: "my_agent"
- agent.class: "Agent"
- session.id: "uuid-123"
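
If you want to see these spans locally without an APM backend, a minimal sketch using the OpenTelemetry SDK’s console exporter works (the exporter wiring here is illustrative and not part of NOMOS itself; it assumes the opentelemetry-sdk package is installed):
# Print every span to stdout for local debugging
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
# Spans created after this point (including NOMOS's) are printed to the console.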

Custom Tracing

Add custom spans to your tools:
from opentelemetry import trace

def complex_calculation(data: str) -> str:
    """Tool with custom tracing."""
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("data_processing") as span:
        span.set_attribute("data.size", len(data))
        span.set_attribute("operation", "calculation")

        try:
            # Your processing logic
            result = expensive_operation(data)
            span.set_attribute("result.size", len(result))
            span.set_attribute("processing.success", True)
            return result

        except Exception as e:
            span.record_exception(e)
            span.set_attribute("processing.success", False)
            raise

Elastic APM Integration

Setup and Configuration

  1. Install Elastic APM Server (or use Elastic Cloud)
  2. Configure NOMOS for Elastic APM:
# Environment variables
export ENABLE_TRACING="true"
export ELASTIC_APM_SERVER_URL="http://localhost:8200"
export ELASTIC_APM_TOKEN="your-secret-token"
export SERVICE_NAME="nomos-agent"
export SERVICE_VERSION="1.0.0"
  3. Start your agent:
nomos run --config config.agent.yaml

Elastic APM Features

End-to-End Visibility: Track requests across your entire agent workflow:
  • Session creation and lifecycle
  • Step-by-step execution flow
  • Tool call dependencies
  • LLM API interactions
  • External service calls
Response Time Analysis: Monitor performance metrics:
  • Average response times per step
  • Tool execution duration
  • LLM API latency
  • Error rates and patterns
  • Throughput and capacity metrics
Exception Management: Automatic error capture and analysis:
  • Tool execution failures
  • LLM API errors
  • Step transition issues
  • Custom exception tracking
  • Error correlation across services
Dependency Visualization: A visual representation of your agent’s architecture:
  • Agent components and their relationships
  • External service dependencies
  • Tool usage patterns
  • Performance bottleneck identification

Elastic APM Dashboard Examples

Agent Performance Dashboard:
{
  "visualization": "line_chart",
  "metric": "transaction.duration.avg",
  "filters": {
    "service.name": "nomos-agent",
    "transaction.type": "session"
  },
  "group_by": "step.id"
}
Tool Usage Analytics:
{
  "visualization": "bar_chart",
  "metric": "span.count",
  "filters": {
    "span.type": "tool_call"
  },
  "group_by": "tool.name"
}
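
If you prefer to pull these numbers ad hoc instead of building a dashboard, here is a minimal sketch that queries Elasticsearch’s APM indices directly (the traces-apm-* index pattern and the field names are assumptions; adjust them to your APM version and mapping):
# Count tool_call spans per tool with an Elasticsearch terms aggregation.
# Index pattern and field names are assumptions; adjust for your APM version.
import json
import urllib.request

query = {
    "size": 0,
    "query": {"term": {"span.type": "tool_call"}},
    "aggs": {"tools": {"terms": {"field": "span.name"}}},
}
req = urllib.request.Request(
    "http://localhost:9200/traces-apm-*/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    buckets = json.load(resp)["aggregations"]["tools"]["buckets"]

for bucket in buckets:
    print(bucket["key"], bucket["doc_count"])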

Production Monitoring

Docker Deployment with Observability

# docker-compose.yml
version: '3.8'
services:
  nomos-agent:
    image: nomos:latest
    environment:
      - ENABLE_TRACING=true
      - ELASTIC_APM_SERVER_URL=http://apm-server:8200
      - ELASTIC_APM_TOKEN=${APM_TOKEN}
      - SERVICE_NAME=nomos-production
      - NOMOS_ENABLE_LOGGING=true
      - NOMOS_LOG_LEVEL=INFO
    volumes:
      - ./config.agent.yaml:/app/config.agent.yaml
      - ./logs:/app/logs
    ports:
      - "8000:8000"
    depends_on:
      - apm-server

  apm-server:
    image: docker.elastic.co/apm/apm-server:8.8.0
    # Minimal APM Server wiring; adjust output settings for your stack
    command: >
      apm-server -e
      -E output.elasticsearch.hosts=["elasticsearch:9200"]
    ports:
      - "8200:8200"
    depends_on:
      - elasticsearch

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.8.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

Kubernetes Monitoring

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nomos-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nomos-agent
  template:
    metadata:
      labels:
        app: nomos-agent
    spec:
      containers:
      - name: nomos-agent
        image: nomos:latest
        env:
        - name: ENABLE_TRACING
          value: "true"
        - name: ELASTIC_APM_SERVER_URL
          valueFrom:
            secretKeyRef:
              name: elastic-config
              key: apm-server-url
        - name: ELASTIC_APM_TOKEN
          valueFrom:
            secretKeyRef:
              name: elastic-config
              key: apm-token
        - name: SERVICE_NAME
          value: "nomos-k8s"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Monitoring Best Practices

1. Structured Data Collection

Consistent Attributes

Use consistent attribute naming across your traces and logs:
# Good: Consistent attribute naming
span.set_attribute("user.id", user_id)
span.set_attribute("session.id", session_id)
span.set_attribute("step.id", current_step)

# Avoid: Inconsistent naming
span.set_attribute("userId", user_id)
span.set_attribute("sessionID", session_id)
span.set_attribute("current_step", current_step)

2. Performance Monitoring

Key Metrics

Track essential performance indicators:
# Custom metrics in tools
import time

from opentelemetry import trace

def expensive_tool(query: str) -> str:
    start_time = time.time()
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("expensive_operation") as span:
        span.set_attribute("operation.type", "data_processing")
        span.set_attribute("query.length", len(query))

        result = process_data(query)

        duration = time.time() - start_time
        span.set_attribute("operation.duration_ms", duration * 1000)
        span.set_attribute("result.status", "success")

        return result

3. Error Context Collection

Rich Error Information

Capture comprehensive error context for debugging:
from opentelemetry import trace

def error_prone_tool(data: str) -> str:
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("risky_operation") as span:
        try:
            span.set_attribute("input.data_type", type(data).__name__)
            span.set_attribute("input.length", len(data))

            result = risky_operation(data)
            return result

        except ValueError as e:
            span.record_exception(e)
            span.set_attribute("error.type", "validation_error")
            span.set_attribute("error.input", data[:100])  # First 100 chars
            raise
        except Exception as e:
            span.record_exception(e)
            span.set_attribute("error.type", "unexpected_error")
            raise

4. Security Considerations

Data Redaction

Protect sensitive information in logs and traces:
from opentelemetry import trace

def secure_tool(user_data: str) -> str:
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("secure_operation") as span:
        # Redact sensitive data
        redacted_data = redact_sensitive_info(user_data)
        span.set_attribute("input.redacted", redacted_data)
        span.set_attribute("input.length", len(user_data))

        # Process without logging actual sensitive data
        result = process_user_data(user_data)

        span.set_attribute("result.status", "processed")
        return result

def redact_sensitive_info(data: str) -> str:
    """Redact sensitive information from data."""
    import re
    # Remove email addresses
    data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', data)
    # Remove phone numbers
    data = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', '[PHONE]', data)
    # Remove SSNs
    data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
    return data

Observability in Action

Real-World Example: Barista Agent

Here’s how observability looks for the barista agent:
# config.agent.yaml with observability
name: barista
persona: "Friendly coffee shop assistant..."

logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time} | {level} | Barista | {message}"

steps:
  - step_id: start
    description: "Greet customer and show menu"
    available_tools:
      - get_available_coffee_options
Trace Output:
Session.create_session [duration: 15ms]
├── agent.name: "barista"
├── session.id: "uuid-456"
└── session.start_time: "2025-07-02T10:30:00Z"

Session.next [duration: 1.2s]
├── current_step: "start"
├── decision.action: "TOOL_CALL"
├── tool.name: "get_available_coffee_options"
└── Session._run_tool [duration: 50ms]
    ├── tool.kwargs: "{}"
    ├── tool.result: "Available Coffee Options:\nLatte: Small ($3.00)..."
    └── tool.success: true

Session.next [duration: 800ms]
├── current_step: "start"
├── decision.action: "RESPOND"
├── llm._get_output [duration: 750ms]
│   ├── llm.provider: "openai"
│   ├── llm.model: "gpt-4o-mini"
│   └── llm.success: true
└── decision.response: "Welcome! Here are our coffee options..."
Log Output:
2025-07-02 10:30:00.123 | INFO | Barista | Session created for user interaction
2025-07-02 10:30:00.138 | INFO | Barista | Executing step: start
2025-07-02 10:30:00.145 | DEBUG | Barista | Calling tool: get_available_coffee_options
2025-07-02 10:30:00.195 | INFO | Barista | Tool execution successful: get_available_coffee_options
2025-07-02 10:30:00.945 | INFO | Barista | LLM decision: RESPOND
2025-07-02 10:30:00.946 | INFO | Barista | Response sent to user

Troubleshooting Observability

Common Issues

Problem: No traces appearing in APM
Solutions:
  • Verify ENABLE_TRACING=true is set
  • Check that ELASTIC_APM_SERVER_URL is correct
  • Ensure ELASTIC_APM_TOKEN has proper permissions
  • Check network connectivity to the APM server, as in the sketch below
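To rule out basic connectivity problems quickly, a minimal standard-library sketch (the fallback URL is illustrative; use whatever you configured):
# Quick reachability check against the APM server
import os
import urllib.request

url = os.environ.get("ELASTIC_APM_SERVER_URL", "http://localhost:8200")
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(f"APM server reachable: HTTP {resp.status}")
except OSError as exc:
    print(f"APM server unreachable: {exc}")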
Problem: Tool calls not showing in traces
Solutions:
  • Ensure tools are called through the NOMOS framework
  • Check that tool functions are properly decorated
  • Verify tool execution doesn’t bypass tracing
Problem: Tracing causing performance issues
Solutions:
  • Reduce the trace sampling rate, as in the sketch below
  • Filter out verbose operations
  • Use async span export
  • Optimize span attribute collection
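If overhead is the problem, a minimal sketch that lowers the sampling rate with the OpenTelemetry SDK (the 10% ratio is illustrative, and NOMOS’s own provider setup may differ):
# Keep roughly 10% of traces to reduce tracing overhead
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
trace.set_tracer_provider(provider)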
Start Simple, Scale Up: Begin with basic logging, then add tracing as needed. Enable detailed debugging only when investigating specific issues to minimize performance impact.
NOMOS observability gives you the insights needed to build, debug, and optimize reliable AI agents for production use. The combination of structured logging, distributed tracing, and APM integration provides comprehensive visibility into your agent’s behavior and performance.