# DeepEval GEval Evaluator Example

This example demonstrates how to extend the agent-control `Evaluator` base class to create custom evaluators using external libraries such as DeepEval.
## Overview

DeepEval's GEval is an LLM-as-a-judge metric that uses chain-of-thought (CoT) reasoning to evaluate LLM outputs against custom criteria. This example shows how to:

- **Extend the base `Evaluator` class** - Create a custom evaluator by implementing the required interface
- **Configure evaluation criteria** - Define custom quality metrics (coherence, relevance, correctness, etc.)
- **Register via entry points** - Make the evaluator discoverable by the agent-control server
- **Integrate with agent-control** - Use the evaluator in controls to enforce quality standards
## Architecture

- Uses a flat layout with Python files at the root (configured via `packages = ["."]` in `pyproject.toml`)
- Modules use absolute imports (e.g., `from config import X`) rather than relative imports
- The entry point `evaluator:DeepEvalEvaluator` references the module directly
- Install with `uv pip install -e .` to register the entry point for server discovery
## Key Components

- **`DeepEvalEvaluatorConfig`** (`config.py`)
  - Pydantic model defining configuration options
  - Based on DeepEval's GEval API parameters
  - Validates that either `criteria` or `evaluation_steps` is provided
- **`DeepEvalEvaluator`** (`evaluator.py`)
  - Extends `Evaluator[DeepEvalEvaluatorConfig]`
  - Implements the `evaluate()` method
  - Registered with the `@register_evaluator` decorator
  - Handles `LLMTestCase` creation and metric execution
- **Q&A Agent Demo** (`qa_agent.py`)
  - Complete working agent with DeepEval quality controls
  - Uses the `@control()` decorator for automatic evaluation
  - Demonstrates handling `ControlViolationError`
- **Setup Script** (`setup_controls.py`)
  - Creates the agent and registers it with the server
  - Configures DeepEval-based controls
  - Creates 3 quality controls (coherence, relevance, correctness)
- **Entry Point Registration** (`pyproject.toml`)
  - Registers the evaluator with the server via `project.entry-points`
  - Depends on `agent-control-evaluators>=5.0.0`, `agent-control-models>=5.0.0`, and `agent-control-sdk>=5.0.0`
  - In the monorepo: uses workspace dependencies (editable installs); third parties can use published PyPI packages
  - Enables automatic discovery when the server starts
## How It Works

### 1. Extending the Evaluator Base Class

The evaluator follows the standard pattern for all agent-control evaluators.

### 2. Entry Point Registration
The evaluator is registered via `pyproject.toml`:
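The registration might look like the following sketch. The entry-point group name is an assumption (use whatever group your agent-control server scans); the `deepeval-geval = "evaluator:DeepEvalEvaluator"` value is taken from this example:

```toml
# Sketch: the group name "agent_control.evaluators" is assumed.
[project.entry-points."agent_control.evaluators"]
deepeval-geval = "evaluator:DeepEvalEvaluator"
```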
### 3. Configuration

DeepEval's GEval supports two modes: **with criteria** (a plain-language description from which GEval auto-generates evaluation steps) or **with explicit evaluation steps**.

### 4. Using in Control Definitions
Once registered, the evaluator can be used in control definitions:

- `execution: "server"` - Required field
- `scope: {"stages": ["post"]}` - Apply to all function calls at the post stage
- `condition.selector.path: "*"` - Pass full data so the evaluator gets both input and output
- `evaluation_params: ["input", "actual_output"]` - Both fields are required for relevance checks
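Putting those fields together, a control definition might look like this sketch. The exact schema and the registered evaluator name are assumptions based on the fields described above, not a verified format:

```yaml
# Sketch only -- field layout follows the bullets above.
controls:
  - name: relevance-check
    execution: "server"            # required: evaluator runs on the server
    scope: { stages: ["post"] }    # evaluate after the function call
    condition:
      selector: { path: "*" }      # pass full data (input + output)
    evaluator: deepeval-geval      # assumed registered evaluator name
    config:
      criteria: "Determine whether the response is relevant to the question."
      evaluation_params: ["input", "actual_output"]
      threshold: 0.5
```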
## Getting Started from Fresh Clone

This example demonstrates custom evaluator development within the agent-control monorepo. It uses workspace dependencies (editable installs) to work with the latest development versions of:

- `agent-control-models` - Base evaluator classes and types
- `agent-control-sdk` - Agent Control SDK for integration
- `deepeval` - DeepEval evaluation framework
### 1. Clone the Repository

### 2. Start the Database and Server

The server runs at `http://localhost:8000`.
### 3. Install the DeepEval Example

Install with `uv pip install -e .`. This installs:

- Dependencies: `deepeval>=1.0.0`, `openai>=1.0.0`, `pydantic>=2.0.0`, etc.
- Workspace packages (as editable installs): `agent-control-models`, `agent-control-sdk`
- This evaluator package in editable mode, which registers the entry point for server discovery

The entry point `deepeval-geval = "evaluator:DeepEvalEvaluator"` makes the evaluator discoverable by the server.
### 4. Set Environment Variables

### 5. Restart the Server

After installing the DeepEval example, restart the server so it can discover the new evaluator.

### 6. Set Up the Agent and Controls

### 7. Run the Q&A Agent
## Testing the Agent

### Interactive Commands

Once the agent is running, try a few questions. The controls will:

- Accept questions with coherent, relevant responses
- Block questions that produce incoherent or irrelevant responses
- Show which control triggered when quality checks fail
### What to Expect

Good quality responses pass the controls; poor quality responses are blocked with a `ControlViolationError`.

## Evaluation Parameters

DeepEval supports multiple test case parameters:

- `input` - The user query or prompt
- `actual_output` - The LLM's generated response
- `expected_output` - Reference/ground-truth answer
- `context` - Additional context for evaluation
- `retrieval_context` - Retrieved documents (for RAG)
- `tools_called` - Tools invoked by the agent
- `expected_tools` - Expected tool usage
- Plus MCP-related parameters

Select which parameters to evaluate via the `evaluation_params` config field.

**Important:** For relevance checks, always include both `input` and `actual_output` so the evaluator can compare the question with the answer.
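For example, a correctness check could compare against a reference answer by also selecting `expected_output`. This is a sketch; the surrounding config schema is an assumption, and only the parameter names come from the list above:

```yaml
config:
  criteria: "Check whether the actual output agrees with the expected output."
  evaluation_params: ["input", "actual_output", "expected_output"]
  threshold: 0.5
```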
## For Third-Party Developers

This example shows the evaluator architecture for extending agent-control. While this specific example is set up for monorepo development, the same pattern works for third-party evaluators using published packages. To create your own evaluator:

- Extend the `Evaluator` base class from `agent-control-evaluators` (published on PyPI)
- Define a configuration model using Pydantic
- Register via entry points in your `pyproject.toml`
- Install your package so the server can discover the entry point
- Restart the server to load the new evaluator
## Production Deployment

For production deployments, build your evaluator as a Python wheel and install it on your agent-control server. Two deployment options:

1. **Self-Hosted Server (Full Control)**
   - Deploy your own agent-control server instance
   - Install custom evaluator packages (wheel, source, or private PyPI)
   - Your agents connect to this server via the SDK
   - Complete control over evaluators and policies
2. **Managed Service (If Available)**
   - Use a hosted agent-control service
   - May require coordination to install custom evaluators
   - Or use only built-in/approved evaluators

Evaluators run on the server (`execution: "server"`), so your agent applications only need the lightweight SDK installed. The evaluator package must be installed where the agent-control server runs, not in your agent application.
## Extending This Example

### Creating Your Own Custom Evaluator

Follow this pattern to create evaluators for other libraries:

1. Define a config model
2. Implement the evaluator
3. Register it via an entry point
4. Install and use it
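The steps above can be sketched with a toy evaluator that needs no external library, so the template runs as-is. In real code the class would extend agent-control's `Evaluator` base class and carry `@register_evaluator`; both are stubbed out here:

```python
from dataclasses import dataclass, field


@dataclass
class BannedWordsConfig:
    """Step 1: the config model (Pydantic in real code)."""
    banned: list[str] = field(default_factory=list)


class BannedWordsEvaluator:
    """Step 2: the evaluator. This toy flags responses containing banned words."""

    def __init__(self, config: BannedWordsConfig) -> None:
        self.config = config

    def evaluate(self, data: dict) -> dict:
        text = (data.get("actual_output") or "").lower()
        hits = [w for w in self.config.banned if w.lower() in text]
        # matched=True triggers the control's action (e.g. deny).
        return {"matched": bool(hits), "details": {"hits": hits}}
```

Steps 3 and 4 mirror this example: add the class to `project.entry-points` in your `pyproject.toml` and install the package so the server can discover it.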
### Adding More GEval Metrics

You can create specialized evaluators for specific use cases:

- **Bias Detection**: Evaluate responses for bias or fairness
- **Safety**: Check for harmful or unsafe content
- **Style Compliance**: Ensure responses match brand guidelines
- **Technical Accuracy**: Validate technical correctness
- **Tone Assessment**: Evaluate emotional tone and sentiment
## Resources
- DeepEval Documentation: https://deepeval.com/docs/metrics-llm-evals
- G-Eval Guide: https://www.confident-ai.com/blog/g-eval-the-definitive-guide
- Agent Control Evaluators: Base evaluator class
- CrewAI Example: Using agent-control as a consumer
## Key Takeaways

- **Entry points are critical**: The server discovers evaluators via `project.entry-points`, not `PYTHONPATH`
- **Extensibility**: The `Evaluator` base class makes it easy to integrate any evaluation library
- **Configuration**: Pydantic models provide type-safe, validated configuration
- **Registration**: The `@register_evaluator` decorator handles registration automatically
- **Integration**: Evaluators work seamlessly with agent-control's policy system
- **Control logic**: `matched=True` triggers the action (deny/allow), so invert it when quality passes
## Troubleshooting

### Controls not triggering

- Check that `execution: "server"` is in the control definition
- Use `scope: {"stages": ["post"]}` instead of `step_types`
- Use an empty selector `{}` to pass full data (input + output)
- Restart the server after evaluator code changes
### Evaluator not found

The server couldn't discover the evaluator. Check:

- Entry point registration in `pyproject.toml`
- The package is installed
- The server was restarted after package installation
- Verify registration: check server logs for evaluator discovery messages during startup
### Wrong evaluation results

- For relevance: include both `input` and `actual_output` in `evaluation_params`
- Check that the `matched` logic is inverted (trigger when quality fails)
- Raise the threshold to be stricter (e.g., 0.7 instead of 0.5)
### Import errors: "cannot import name 'X'"

If you see import errors like `ImportError: cannot import name 'AgentRef'`:

- **Stale editable install**: reinstall the package
- For `agent-control-models` specifically: reinstall that package too
- Clear the Python cache if issues persist
- Verify the installation
### Package not discoverable: "attempted relative import"

If you see `attempted relative import with no known parent package`:

- Ensure the package is installed
- Verify the entry point registration
- Check that `pyproject.toml` has `packages = ["."]`
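The table that the `packages = ["."]` setting lives under depends on your build backend; the hatchling table below is an assumption, while the value itself comes from this example's Architecture notes:

```toml
# Assumed hatchling backend -- adapt the table name to your backend.
[tool.hatch.build.targets.wheel]
packages = ["."]
```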
### DeepEval telemetry files

- DeepEval creates a `.deepeval/` directory with telemetry files in the working directory
- When the evaluator runs on the server, these files appear in `server/.deepeval/`
- They don't need to be committed (add `.deepeval/` to `.gitignore`)
- To disable telemetry, set the environment variable `DEEPEVAL_TELEMETRY_OPT_OUT="true"`
## License

This example is part of the agent-control project.

## Source Code

View the complete example with all scripts and setup instructions: DeepEval Custom Evaluator Example