Eval Framework & Simulation
Benchmark agent performance, A/B test models, and run safe sandbox simulations before deploying to production.
The Eval Framework gives you a systematic way to measure, compare, and improve agent performance before you deploy to production. Combined with the Simulation sandbox, it lets you check both how well an agent performs and whether its workflow logic behaves as intended.
What is the Eval Framework?
The Eval Framework lets you run standardized evaluations on any agent or workflow. Each eval measures:
- Accuracy — How often the agent produces the correct output for known test cases
- Latency — Response time from prompt to completion
- Cost Efficiency — Average cost per execution across test runs
- Consistency — Output variance across multiple identical prompts
- Safety Score — How often the agent refuses harmful or out-of-scope requests
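To make these five measurements concrete, here is a minimal Python sketch of how they could be computed from raw test-run records. It is not the framework's actual implementation: the `RunResult` record, its field names, and the `summarize` helper are all hypothetical, and consistency is approximated here as the share of repeated runs that agree with the most common output for the same prompt.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """One execution of one test prompt (hypothetical record shape)."""
    prompt_id: str            # repeated runs of the same prompt share an id
    output: str
    expected: Optional[str]   # None when there is no single correct answer
    latency_s: float
    cost_usd: float
    should_refuse: bool = False   # True for harmful or out-of-scope prompts
    refused: bool = False


def ratio(hits, total):
    """Return hits / total, or None when there is nothing to measure."""
    return hits / total if total else None


def summarize(results):
    answerable = [r for r in results if not r.should_refuse]
    unsafe = [r for r in results if r.should_refuse]

    # Consistency (assumed definition): per prompt, the fraction of runs
    # that match the most common output, averaged over all prompts.
    outputs_by_prompt = defaultdict(list)
    for r in results:
        outputs_by_prompt[r.prompt_id].append(r.output)
    agreement = [
        Counter(outs).most_common(1)[0][1] / len(outs)
        for outs in outputs_by_prompt.values()
    ]

    return {
        "accuracy": ratio(sum(r.output == r.expected for r in answerable), len(answerable)),
        "avg_latency_s": ratio(sum(r.latency_s for r in results), len(results)),
        "avg_cost_usd": ratio(sum(r.cost_usd for r in results), len(results)),
        "consistency": ratio(sum(agreement), len(agreement)),
        "safety_score": ratio(sum(r.refused for r in unsafe), len(unsafe)),
    }


# Made-up sample data: two prompts run twice each, plus one unsafe prompt.
results = [
    RunResult("q1", "4", "4", 0.8, 0.002),
    RunResult("q1", "4", "4", 1.0, 0.002),
    RunResult("q2", "Paris", "Paris", 1.2, 0.003),
    RunResult("q2", "Lyon", "Paris", 1.0, 0.003),
    RunResult("s1", "I can't help with that.", None, 0.5, 0.001,
              should_refuse=True, refused=True),
]
summary = summarize(results)
```

Exact string matching is the simplest possible accuracy check; free-form outputs would usually need a looser comparison or a grader.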
Running an Evaluation
A/B Testing Models
A/B testing compares models side by side on the same inputs:
- Run the same test prompt against GPT-4o and Claude Sonnet simultaneously
- Compare output quality, response time, and cost side by side
- See which model is best for each specific task type
- Save results and compare across evaluation runs over time
This helps you make data-driven decisions about model allocation: use cheaper models for simple tasks and premium models for complex reasoning.
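One way to turn comparison results into an allocation rule is sketched below in Python. Everything in it is an assumption for illustration: the model names are placeholders, the scores and costs are made up, and the rule (choose the cheapest model whose quality is within a tolerance of the best) is just one reasonable policy, not the product's built-in behavior.

```python
from collections import defaultdict

# (model, task_type, quality 0-1, latency_s, cost_usd); illustrative values only
ab_results = [
    ("cheap-model", "summarize", 0.90, 0.6, 0.001),
    ("premium-model", "summarize", 0.92, 1.4, 0.010),
    ("cheap-model", "reasoning", 0.55, 0.7, 0.001),
    ("premium-model", "reasoning", 0.88, 1.9, 0.012),
]


def pick_models(rows, quality_tolerance=0.05):
    """For each task type, pick the cheapest model that is close to the best quality."""
    by_task = defaultdict(list)
    for model, task, quality, _latency, cost in rows:
        by_task[task].append((model, quality, cost))

    choice = {}
    for task, candidates in by_task.items():
        best_quality = max(quality for _, quality, _ in candidates)
        good_enough = [c for c in candidates if c[1] >= best_quality - quality_tolerance]
        # Among the models that are good enough, cost decides.
        choice[task] = min(good_enough, key=lambda c: c[2])[0]
    return choice


allocation = pick_models(ab_results)
```

With this sample data the cheaper model is selected for summarization, where quality is nearly identical, and the premium model for reasoning, where the quality gap exceeds the tolerance. In practice each row would be an average over many test prompts rather than a single run.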
Sandbox Simulation
Simulation mode runs workflows in an isolated sandbox environment. Unlike evaluations, which measure performance, simulations validate workflow logic:
- Step through each node of a workflow with test data
- Verify condition routing works as expected
- Check output formatting before connecting integrations
- Estimate total execution cost without spending real credits
Simulations don't count toward your monthly execution quota. Run as many as you need during development.
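The idea of stepping through nodes, checking condition routing, and estimating cost can be illustrated with a small Python sketch. The workflow representation here (a dict of nodes with `run`, `next`, and `est_cost_usd` keys) and the `simulate` function are hypothetical stand-ins, not the sandbox's real format or API; the node handlers are plain functions, so nothing external is called.

```python
def simulate(workflow, start, payload, max_steps=50):
    """Walk a workflow with test data, recording the route and an estimated cost."""
    trace, est_cost = [], 0.0
    node_id = start
    while node_id is not None:
        if len(trace) >= max_steps:
            raise RuntimeError("possible routing loop: step limit reached")
        node = workflow[node_id]
        payload = node["run"](payload)           # stubbed node logic, no real calls
        est_cost += node.get("est_cost_usd", 0.0)
        trace.append(node_id)
        node_id = node["next"](payload)          # condition routing picks the next node
    return trace, payload, est_cost


# Hypothetical three-node support workflow with one condition.
workflow = {
    "classify": {
        "run": lambda p: {**p, "urgent": "outage" in p["text"].lower()},
        "est_cost_usd": 0.002,
        "next": lambda p: "escalate" if p["urgent"] else "auto_reply",
    },
    "escalate": {
        "run": lambda p: {**p, "reply": "Routed to on-call."},
        "est_cost_usd": 0.0,
        "next": lambda p: None,
    },
    "auto_reply": {
        "run": lambda p: {**p, "reply": "Thanks, we'll get back to you."},
        "est_cost_usd": 0.004,
        "next": lambda p: None,
    },
}

urgent_trace, urgent_out, urgent_cost = simulate(
    workflow, "classify", {"text": "Production outage!"})
normal_trace, normal_out, normal_cost = simulate(
    workflow, "classify", {"text": "How do I reset my password?"})
```

Running one test input per branch, as above, is a quick way to confirm that every route through a condition is reachable and produces output in the expected shape.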
Best Practices
- Create a representative test dataset that covers edge cases, not just happy paths
- Run evaluations after any prompt or model change to catch regressions
- Use A/B testing when upgrading to a new model version
- Compare eval results across dates to track performance trends
- Always simulate complex workflows before deployment, especially those with condition routing
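As a rough illustration of catching regressions between two evaluation runs, the Python sketch below compares a current result against a saved baseline. The metric names, the sample numbers, and the 2% relative tolerance are assumptions chosen for the example, not values defined by the framework.

```python
def find_regressions(baseline, current, higher_is_better, tolerance=0.02):
    """Return metrics that got worse by more than `tolerance` relative to baseline."""
    regressions = {}
    for metric, better_high in higher_is_better.items():
        delta = current[metric] - baseline[metric]
        worse_by = -delta if better_high else delta   # positive means it got worse
        if worse_by > tolerance * abs(baseline[metric]):
            regressions[metric] = (baseline[metric], current[metric])
    return regressions


# Made-up results from two eval runs, before and after a prompt change.
baseline = {"accuracy": 0.91, "avg_latency_s": 1.20, "avg_cost_usd": 0.0040}
current = {"accuracy": 0.84, "avg_latency_s": 1.21, "avg_cost_usd": 0.0031}

# Accuracy should go up; latency and cost should go down.
DIRECTION = {"accuracy": True, "avg_latency_s": False, "avg_cost_usd": False}

regressions = find_regressions(baseline, current, DIRECTION)
```

Here only accuracy is flagged: latency moved by less than the tolerance and cost improved. A check like this could gate a deployment on an empty result.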
