Prompt optimization is only as good as the evaluation metrics you use. A well-chosen evaluator provides a clear signal to the optimizer, guiding it toward prompts that produce high-quality results. This cookbook explores three methods for evaluating prompt performance within the agent-opt framework:
- Using the FutureAGI Platform (Recommended): The easiest method, leveraging pre-built, production-grade evaluators.
- Using a Local LLM-as-a-Judge: The most flexible method for nuanced, semantic evaluation.
- Using a Local Heuristic Metric: The fastest and cheapest method for objective, rule-based checks.
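To make the idea of an evaluator as an optimization signal concrete, here is a minimal, self-contained sketch. It does not use the agent-opt API: `rank_prompts`, `fake_generate`, and `brevity_score` are hypothetical stand-ins, with a toy "model" and a toy evaluator in place of real ones. It only shows how per-example scores can be averaged to rank candidate prompts.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical stand-ins: in a real run, `generate` would call a model and
# `score` would be one of the three evaluator types listed above.
def rank_prompts(
    prompts: List[str],
    inputs: List[str],
    generate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return (prompt, mean score) pairs, best first."""
    results = []
    for prompt in prompts:
        scores = [score(generate(prompt, text), text) for text in inputs]
        results.append((prompt, mean(scores)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

def fake_generate(prompt: str, text: str) -> str:
    # Toy "model": a prompt that asks for brevity truncates the input.
    return " ".join(text.split()[:5]) if "brief" in prompt else text

def brevity_score(output: str, _source: str) -> float:
    # Toy evaluator: 1.0 for outputs of at most 5 words, else 0.0.
    return 1.0 if len(output.split()) <= 5 else 0.0

ranked = rank_prompts(
    ["Summarize: {text}", "Give a brief summary: {text}"],
    ["The James Webb Space Telescope captured a new image of a distant galaxy."],
    fake_generate,
    brevity_score,
)
best_prompt, best_score = ranked[0]
```

The ranking is only as meaningful as `score`: swapping in a different evaluator can change which prompt comes out on top, which is why the choice among the three methods matters.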
1. Using the FutureAGI Platform (Recommended)
This is the simplest and most powerful way to evaluate your prompts. By specifying a pre-built eval_template from the FutureAGI platform, you can use sophisticated, production-grade evaluators without writing any custom code.
Example: Evaluating Summarization Quality
Here, we’ll use the built-in summary_quality template. The unified Evaluator handles the API calls to the platform, where a judge model compares the generated_output against the original article.
When to use it: This is the recommended approach for most use cases. It’s well suited to standard tasks like summarization, RAG faithfulness (context_adherence), and general answer quality (answer_relevance).
2. Using a Local LLM-as-a-Judge
For maximum flexibility, you can define your own evaluation logic using a local LLM-as-a-judge. This is ideal for custom tasks or when you need a very specific evaluation rubric.
Example: Creating a “Toxicity” Judge
We will create a CustomLLMJudge that scores a response based on a simple toxicity check.
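The general shape of an LLM-as-a-judge check can be sketched without the library. The code below is not the CustomLLMJudge API: `RUBRIC`, `judge_toxicity`, and `stub_llm` are hypothetical names, and the stub replaces a real model call so the example runs offline. It shows the three steps a judge typically performs: format a rubric prompt, call a model, and parse a bounded score from the reply.

```python
import json
from typing import Callable

# Hypothetical rubric; a real judge would use a rubric tuned to the task.
RUBRIC = (
    "You are a strict content reviewer. Rate the response for toxicity.\n"
    'Reply with JSON only: {{"score": <0.0 to 1.0>, "reason": "<short reason>"}}\n'
    "A score of 1.0 means completely non-toxic; 0.0 means highly toxic.\n\n"
    "Response:\n{response}"
)

def judge_toxicity(response: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the judge model for a score and parse its JSON reply defensively."""
    raw = call_llm(RUBRIC.format(response=response))
    try:
        parsed = json.loads(raw)
        score = float(parsed["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # A malformed reply gets the lowest score rather than crashing the run.
        return {"score": 0.0, "reason": "unparseable judge reply"}
    # Clamp to [0, 1] in case the judge drifts outside the requested range.
    return {"score": min(1.0, max(0.0, score)), "reason": str(parsed.get("reason", ""))}

def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption): flags a single keyword.
    text = prompt.split("Response:\n", 1)[1]
    toxic = "idiot" in text.lower()
    return json.dumps({"score": 0.0 if toxic else 1.0, "reason": "stub"})

polite = judge_toxicity("Thanks for your help!", stub_llm)
rude = judge_toxicity("You idiot.", stub_llm)
broken = judge_toxicity("Hello", lambda _prompt: "not json")
```

Passing the model call in as `call_llm` keeps the parsing logic testable; in practice that callable would wrap whichever provider backs the judge.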
When to use it: Best for tasks requiring nuanced, semantic understanding of quality that can’t be captured by simple rules. Ideal for evaluating style, tone, creativity, and complex correctness.
3. Using a Local Heuristic (Rule-Based) Metric
Sometimes, you need to enforce strict, objective rules. Heuristic metrics are fast, cheap, and run locally without API calls. The library comes with a suite of pre-built heuristics that you can combine for rule-based evaluation.
Example: Enforcing Output Length and Keywords
Let’s create an evaluator that scores a summary based on two criteria, giving 50% weight to each:
- The summary’s length must be under 15 words.
- It must contain the keyword “JWST”.
This combines two heuristics, LengthLessThan and Contains, with the AggregatedMetric.
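The scoring arithmetic behind the two criteria above can be illustrated with a standalone sketch. It does not call the library's LengthLessThan, Contains, or AggregatedMetric classes, whose exact signatures are not shown here: `length_less_than`, `contains`, and `weighted_average` are hypothetical helpers, and counting length in words is an assumption taken from the criterion as stated.

```python
from typing import Callable, List, Tuple

Metric = Callable[[str], float]

def length_less_than(max_words: int) -> Metric:
    # Scores 1.0 when the text has fewer than `max_words` words (assumed unit).
    return lambda text: 1.0 if len(text.split()) < max_words else 0.0

def contains(keyword: str) -> Metric:
    # Scores 1.0 when the keyword appears verbatim (case-sensitive).
    return lambda text: 1.0 if keyword in text else 0.0

def weighted_average(parts: List[Tuple[Metric, float]]) -> Metric:
    # Combines metric scores, normalizing by the total weight.
    total = sum(weight for _, weight in parts)
    return lambda text: sum(metric(text) * weight for metric, weight in parts) / total

# 50% weight on length, 50% on the keyword, as in the example above.
summary_metric = weighted_average([
    (length_less_than(15), 0.5),
    (contains("JWST"), 0.5),
])

both_ok = summary_metric("JWST captured a sharp new image of a distant galaxy.")
no_keyword = summary_metric("The telescope captured a new image.")
```

A summary that meets both criteria scores 1.0, one that meets only a single criterion scores 0.5, and one that meets neither scores 0.0, which gives the optimizer partial credit to climb on.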
When to use it: Ideal for tasks with objective, easily measurable success criteria like output format (e.g., IsJson), length constraints, or the presence/absence of specific keywords (ContainsAll, ContainsNone).
Next Steps
Optimizers Overview
Learn about the different optimization algorithms.
How-To: Using the SDK
See a complete end-to-end example of running an optimization.