NOMOS provides comprehensive observability features that give you deep insights into your agent’s behavior, performance, and decision-making processes. Unlike traditional AI frameworks that operate as black boxes, NOMOS offers granular visibility into every step, tool call, and transition.
Why Observability Matters for AI Agents
The Black Box Problem: Traditional AI agents are often hard to debug or monitor effectively, which makes it difficult to understand why they behaved in unexpected ways or to optimize their performance.
Debug Agent Behavior: Understand exactly what your agent is doing and why it made specific decisions.
Performance Monitoring: Track response times, error rates, and resource utilization across steps and tools.
Production Insights: Monitor agent behavior in production to identify issues and optimization opportunities.
Compliance & Auditing: Maintain detailed audit trails for enterprise compliance and regulatory requirements.
NOMOS Observability Stack
NOMOS provides multiple layers of observability:
Multi-Layer Monitoring: From high-level session tracking to granular tool execution monitoring, NOMOS gives you visibility into each layer of your agent’s operation.
1. Structured Logging
Built-in logging with configurable levels and output formats:
# config.agent.yaml
logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}"
    - type: "file"
      level: "DEBUG"
      format: "{time} | {level} | {name} | {message}"
2. OpenTelemetry Tracing
Distributed tracing for deep insight into agent execution:
# Enable tracing
import os

os.environ["ENABLE_TRACING"] = "true"
os.environ["SERVICE_NAME"] = "my-agent"
os.environ["SERVICE_VERSION"] = "1.0.0"

# Automatic instrumentation
from nomos import Agent

# `config` and `llm` are assumed to be defined elsewhere in your application
agent = Agent.from_config(config, llm)
session = agent.create_session()  # Automatically traced
3. Elastic APM Integration
Production-ready monitoring with Elastic APM:
# Environment variables for Elastic APM
export ELASTIC_APM_SERVER_URL="http://localhost:8200"
export ELASTIC_APM_TOKEN="your-apm-token"
export ENABLE_TRACING="true"
Logging Configuration
Basic Logging Setup
Enable structured logging in your agent configuration:
# config.agent.yaml
name: my_agent
# ... other config ...
logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}"
Advanced Logging Configuration
Configure multiple handlers with different levels:
logging:
  enable: true
  handlers:
    # Console output for development
    - type: "stderr"
      level: "INFO"
      format: "{time} | {level:<8} | {message}"
    # Detailed file logging for debugging
    - type: "file"
      level: "DEBUG"
      format: "{time:YYYY-MM-DD HH:mm:ss.SSS} | {level:<8} | {name} | {function}:{line} | {message}"
    # Error-only logging for alerts
    - type: "file"
      level: "ERROR"
      format: "{time} | ERROR | {message} | {extra}"
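To see how the per-handler levels in a configuration like this interact, here is a minimal sketch. It assumes each handler applies the conventional minimum-severity threshold (DEBUG < INFO < WARNING < ERROR < CRITICAL); the handler labels and the `handlers_receiving` helper are hypothetical and not part of NOMOS.

```python
# Conventional severity ordering; assumed here, not taken from the NOMOS source.
SEVERITY = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}

# Mirrors a three-handler setup: console, debug file, error file.
# The labels are hypothetical and only used for display.
HANDLERS = [
    {"label": "console", "type": "stderr", "level": "INFO"},
    {"label": "debug-file", "type": "file", "level": "DEBUG"},
    {"label": "error-file", "type": "file", "level": "ERROR"},
]


def handlers_receiving(message_level: str, handlers=HANDLERS) -> list[str]:
    """Return the labels of handlers whose threshold admits the message."""
    severity = SEVERITY[message_level]
    return [h["label"] for h in handlers if severity >= SEVERITY[h["level"]]]


if __name__ == "__main__":
    for level in ("DEBUG", "INFO", "ERROR"):
        print(level, "->", handlers_receiving(level))
```

Under that assumption, a DEBUG message reaches only the debug file, while an ERROR message reaches all three handlers, which is why the error-only handler is useful for alerting.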
Environment-Based Logging
Control logging through environment variables:
# Enable logging
export NOMOS_ENABLE_LOGGING="true"
export NOMOS_LOG_LEVEL="DEBUG"
# Run your agent
nomos run --config config.agent.yaml
Programmatic Logging Control
from nomos.utils.logging import log_info, log_debug, log_error

# In your tools or custom code
def my_tool(query: str) -> str:
    log_info(f"Processing query: {query}")
    try:
        result = process_query(query)  # process_query: your own logic
        log_debug(f"Query result: {result}")
        return result
    except Exception as e:
        log_error(f"Query processing failed: {e}")
        raise
OpenTelemetry Tracing
Automatic Instrumentation
NOMOS automatically instruments key operations when tracing is enabled: session creation, step execution, tool calls, and LLM calls. Session creation, for example, produces the following span:
Agent.create_session()
Span: Agent.create_session
Attributes:
  - agent.name: "my_agent"
  - agent.class: "Agent"
  - session.id: "uuid-123"
Custom Tracing
Add custom spans to your tools:
from opentelemetry import trace

def complex_calculation(data: str) -> str:
    """Tool with custom tracing."""
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("data_processing") as span:
        span.set_attribute("data.size", len(data))
        span.set_attribute("operation", "calculation")
        try:
            # Your processing logic
            result = expensive_operation(data)
            span.set_attribute("result.size", len(result))
            span.set_attribute("processing.success", True)
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_attribute("processing.success", False)
            raise
Elastic APM Integration
Setup and Configuration
1. Install Elastic APM Server (or use Elastic Cloud).
2. Configure NOMOS for Elastic APM:
# Environment variables
export ENABLE_TRACING="true"
export ELASTIC_APM_SERVER_URL="http://localhost:8200"
export ELASTIC_APM_TOKEN="your-secret-token"
export SERVICE_NAME="nomos-agent"
export SERVICE_VERSION="1.0.0"
3. Start your agent:
nomos run --config config.agent.yaml
Elastic APM Features
End-to-End Visibility: Track requests across your entire agent workflow:
- Session creation and lifecycle
- Step-by-step execution flow
- Tool call dependencies
- LLM API interactions
- External service calls
Exception Management: Automatic error capture and analysis:
- Tool execution failures
- LLM API errors
- Step transition issues
- Custom exception tracking
- Error correlation across services
Dependency Visualization: Visual representation of your agent’s architecture:
- Agent components and their relationships
- External service dependencies
- Tool usage patterns
- Identification of performance bottlenecks
Elastic APM Dashboard Examples
Agent Performance Dashboard:
{
  "visualization": "line_chart",
  "metric": "transaction.duration.avg",
  "filters": {
    "service.name": "nomos-agent",
    "transaction.type": "session"
  },
  "group_by": "step.id"
}
Tool Usage Analytics:
{
  "visualization": "bar_chart",
  "metric": "span.count",
  "filters": {
    "span.type": "tool_call"
  },
  "group_by": "tool.name"
}
Production Monitoring
Docker Deployment with Observability
# docker-compose.yml
version: '3.8'
services:
  nomos-agent:
    image: nomos:latest
    environment:
      - ENABLE_TRACING=true
      # NOTE: this URL must reach an APM Server endpoint (default port 8200).
      # Elasticsearch itself listens on 9200 and this file defines no APM Server
      # service, so adjust the host to wherever your APM Server runs.
      - ELASTIC_APM_SERVER_URL=http://elasticsearch:8200
      - ELASTIC_APM_TOKEN=${APM_TOKEN}
      - SERVICE_NAME=nomos-production
      - NOMOS_ENABLE_LOGGING=true
      - NOMOS_LOG_LEVEL=INFO
    volumes:
      - ./config.agent.yaml:/app/config.agent.yaml
      - ./logs:/app/logs
    ports:
      - "8000:8000"
    depends_on:
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
  kibana:
    image: docker.elastic.co/kibana/kibana:8.8.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
Kubernetes Monitoring
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nomos-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nomos-agent
  template:
    metadata:
      labels:
        app: nomos-agent
    spec:
      containers:
        - name: nomos-agent
          image: nomos:latest
          env:
            - name: ENABLE_TRACING
              value: "true"
            - name: ELASTIC_APM_SERVER_URL
              valueFrom:
                secretKeyRef:
                  name: elastic-config
                  key: apm-server-url
            - name: ELASTIC_APM_TOKEN
              valueFrom:
                secretKeyRef:
                  name: elastic-config
                  key: apm-token
            - name: SERVICE_NAME
              value: "nomos-k8s"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
Monitoring Best Practices
1. Structured Data Collection
Consistent Attributes: Use consistent attribute naming across your traces and logs.
# Good: Consistent attribute naming
span.set_attribute("user.id", user_id)
span.set_attribute("session.id", session_id)
span.set_attribute("step.id", current_step)
# Avoid: Inconsistent naming
span.set_attribute("userId", user_id)
span.set_attribute("sessionID", session_id)
span.set_attribute("current_step", current_step)
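One way to keep naming consistent is to check keys before they are attached to a span. The sketch below assumes the convention implied by the "good" examples (lowercase, dot-separated `namespace.name` keys); the pattern and both helper functions are illustrative, not a NOMOS or OpenTelemetry API.

```python
import re

# Assumed convention: lowercase dot-separated segments,
# e.g. "user.id" or "operation.duration_ms".
_KEY_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")


def is_consistent_key(key: str) -> bool:
    """True when the key follows the assumed namespace.name convention."""
    return _KEY_PATTERN.fullmatch(key) is not None


def inconsistent_keys(attributes: dict) -> list[str]:
    """Return the attribute keys that break the convention, sorted."""
    return sorted(k for k in attributes if not is_consistent_key(k))


if __name__ == "__main__":
    attrs = {"user.id": "u1", "sessionID": "s1", "current_step": "start"}
    print(inconsistent_keys(attrs))
```

A check like this can run in a unit test over the attribute names your tools emit, so drift is caught before it reaches dashboards.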
Key Metrics: Track essential performance indicators.
# Custom metrics in tools
import time

from opentelemetry import trace

def expensive_tool(query: str) -> str:
    start_time = time.time()
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("expensive_operation") as span:
        span.set_attribute("operation.type", "data_processing")
        span.set_attribute("query.length", len(query))
        result = process_data(query)  # process_data: your own processing function
        duration = time.time() - start_time
        span.set_attribute("operation.duration_ms", duration * 1000)
        span.set_attribute("result.status", "success")
        return result
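The tool above records a duration only when processing succeeds. As a minimal sketch of one alternative, the hypothetical `timed` context manager below measures elapsed time with `time.perf_counter()` and stores it in a plain dictionary even when the body raises; the values could then be copied onto a span with `span.set_attribute`. The helper name and the dictionary-based design are assumptions for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(attributes: dict, key: str = "operation.duration_ms"):
    """Record elapsed wall-clock milliseconds into attributes[key]."""
    start = time.perf_counter()
    try:
        yield attributes
    finally:
        # Runs on success and on exceptions, so failed calls are timed too.
        attributes[key] = (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    attrs = {"operation.type": "data_processing"}
    with timed(attrs):
        sum(range(100_000))  # stand-in for real work
    print(attrs)
```

`time.perf_counter()` is a monotonic clock, so the measurement is not affected by system clock adjustments the way `time.time()` can be.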
3. Error Context Collection
Rich Error Information: Capture comprehensive error context for debugging.
from opentelemetry import trace

def error_prone_tool(data: str) -> str:
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("risky_operation") as span:
        try:
            span.set_attribute("input.data_type", type(data).__name__)
            span.set_attribute("input.length", len(data))
            result = risky_operation(data)  # risky_operation: your own logic
            return result
        except ValueError as e:
            span.record_exception(e)
            span.set_attribute("error.type", "validation_error")
            span.set_attribute("error.input", data[:100])  # First 100 chars
            raise
        except Exception as e:
            span.record_exception(e)
            span.set_attribute("error.type", "unexpected_error")
            raise
4. Security Considerations
Data Redaction: Protect sensitive information in logs and traces.
import re

from opentelemetry import trace

def secure_tool(user_data: str) -> str:
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("secure_operation") as span:
        # Redact sensitive data before attaching it to the span
        redacted_data = redact_sensitive_info(user_data)
        span.set_attribute("input.redacted", redacted_data)
        span.set_attribute("input.length", len(user_data))
        # Process without logging actual sensitive data
        result = process_user_data(user_data)  # process_user_data: your own logic
        span.set_attribute("result.status", "processed")
        return result

def redact_sensitive_info(data: str) -> str:
    """Redact sensitive information from data."""
    # Remove email addresses
    data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', data)
    # Remove phone numbers (hyphenated NNN-NNN-NNNN form)
    data = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', '[PHONE]', data)
    # Remove SSNs (hyphenated NNN-NN-NNNN form)
    data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
    return data
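To make the effect of these rules concrete, here is a small standalone sketch that applies the same three patterns from a rule table and prints a redacted sample. The `REDACTION_RULES` table, the `redact` function, and the sample string are illustrative names and data, not part of NOMOS.

```python
import re

# Pattern/placeholder pairs equivalent to the three redaction rules above.
REDACTION_RULES = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def redact(text: str) -> str:
    """Apply every redaction rule in order and return the cleaned text."""
    for pattern, placeholder in REDACTION_RULES:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or 555-123-4567; SSN 123-45-6789."
    print(redact(sample))  # Contact [EMAIL] or [PHONE]; SSN [SSN].
```

These patterns only cover hyphenated US-style phone numbers and SSNs; a number written as `555.123.4567` passes through unchanged, so treat regex redaction as a baseline rather than a guarantee.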
Observability in Action
Real-World Example: Barista Agent
Here’s how observability looks for the barista agent:
# config.agent.yaml with observability
name: barista
persona: "Friendly coffee shop assistant..."
logging:
  enable: true
  handlers:
    - type: "stderr"
      level: "INFO"
      format: "{time} | {level} | Barista | {message}"
steps:
  - step_id: start
    description: "Greet customer and show menu"
    available_tools:
      - get_available_coffee_options
Trace Output:
Session.create_session [duration: 15ms]
├── agent.name: "barista"
├── session.id: "uuid-456"
└── session.start_time: "2025-07-02T10:30:00Z"
Session.next [duration: 1.2s]
├── current_step: "start"
├── decision.action: "TOOL_CALL"
├── tool.name: "get_available_coffee_options"
└── Session._run_tool [duration: 50ms]
├── tool.kwargs: "{}"
├── tool.result: "Available Coffee Options:\nLatte: Small ($3.00)..."
└── tool.success: true
Session.next [duration: 800ms]
├── current_step: "start"
├── decision.action: "RESPOND"
├── llm._get_output [duration: 750ms]
│ ├── llm.provider: "openai"
│ ├── llm.model: "gpt-4o-mini"
│ └── llm.success: true
└── decision.response: "Welcome! Here are our coffee options..."
Log Output:
2025-07-02 10:30:00.123 | INFO | Barista | Session created for user interaction
2025-07-02 10:30:00.138 | INFO | Barista | Executing step: start
2025-07-02 10:30:00.145 | DEBUG | Barista | Calling tool: get_available_coffee_options
2025-07-02 10:30:00.195 | INFO | Barista | Tool execution successful: get_available_coffee_options
2025-07-02 10:30:00.945 | INFO | Barista | LLM decision: RESPOND
2025-07-02 10:30:00.946 | INFO | Barista | Response sent to user
Troubleshooting Observability
Common Issues
Problem: No traces appearing in APM
Solutions:
- Verify ENABLE_TRACING=true is set
- Check ELASTIC_APM_SERVER_URL is correct
- Ensure ELASTIC_APM_TOKEN has proper permissions
- Check network connectivity to APM server
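The first and last of these checks can be scripted. The following is a minimal preflight sketch using only the Python standard library: it reports which of the three environment variables are unset and tries a plain TCP connection to the host and port in `ELASTIC_APM_SERVER_URL`. The helper names are hypothetical, the script only checks that variables are non-empty (not how NOMOS interprets their values), and it cannot verify token permissions.

```python
import os
import socket
from urllib.parse import urlsplit

REQUIRED_VARS = ("ENABLE_TRACING", "ELASTIC_APM_SERVER_URL", "ELASTIC_APM_TOKEN")


def missing_vars(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def apm_endpoint(url: str) -> tuple[str, int]:
    """Extract (host, port) from the APM server URL."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"Not a valid URL: {url!r}")
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; True if the port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        host, port = apm_endpoint(os.environ["ELASTIC_APM_SERVER_URL"])
        print(f"{host}:{port} reachable:", can_connect(host, port))
```

Run it in the same shell or container as the agent so it sees the same environment and network.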
Problem: Tracing causing performance issues
Solutions:
- Reduce trace sampling rate
- Filter out verbose operations
- Use async span export
- Optimize span attribute collection
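To illustrate what reducing the sampling rate means, here is a concept sketch of trace-ID ratio sampling in plain Python. The OpenTelemetry Python SDK ships a ratio sampler (`TraceIdRatioBased` in `opentelemetry.sdk.trace.sampling`); how NOMOS exposes sampler configuration is not covered here, so `should_sample` below is a hypothetical stand-in that only demonstrates the idea.

```python
import random

TRACE_ID_BITS = 64  # this sketch compares the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when its (truncated) ID falls below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << TRACE_ID_BITS))
    return (trace_id & ((1 << TRACE_ID_BITS) - 1)) < bound


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same IDs
    ids = [rng.getrandbits(128) for _ in range(10_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, every span in a trace gets the same answer, so sampled traces stay complete instead of losing individual spans.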
Start Simple, Scale Up: Begin with basic logging, then add tracing as needed. Enable detailed debugging only when investigating specific issues to minimize performance impact.
NOMOS observability gives you the insights needed to build, debug, and optimize reliable AI agents for production use. The combination of structured logging, distributed tracing, and APM integration provides comprehensive visibility into your agent’s behavior and performance.