Pass eval_config={"k": 3} to see how recall drops when only the top 3 chunks are considered.
| Input | | |
|---|---|---|
| Required Input | Type | Description |
| hypothesis | string | JSON-serialized list of retrieved chunks in ranked order |
| reference | string | JSON-serialized list of ground-truth relevant chunks |
| Output | |
|---|---|
| Field | Description |
| Result | Returns a score between 0 and 1, where 1 means all relevant chunks were found in the top K results |
| Reason | Short summary string of the score, e.g. Recall@3: 0.5 |
| Parameter | | |
|---|---|---|
| Name | Type | Description |
| eval_config (evalConfig in JS/TS) | dict / Record<string, any> | Optional. Pass {"k": N} to limit evaluation to the top N retrieved chunks. Defaults to using the full list. |
Batch evaluation
To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs, where each element represents one retrieval evaluation.
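As a rough illustration of what batch inputs could look like, the sketch below builds one JSON string per query with the standard `json` module. It assumes the batch form takes parallel lists for hypothesis and reference, aligned by index. The chunk IDs and the variable names `queries`, `hypotheses`, and `references` are hypothetical, and the evaluator call itself is left out because this page does not show it.

```python
import json

# Hypothetical retrieval results for two queries (illustrative data only).
queries = [
    {
        "retrieved": ["chunk_a", "chunk_b", "chunk_c"],  # ranked order, best first
        "relevant": ["chunk_a", "chunk_d"],              # ground-truth chunks
    },
    {
        "retrieved": ["chunk_x", "chunk_y"],
        "relevant": ["chunk_y"],
    },
]

# One JSON string per query, matching the string type of the inputs.
hypotheses = [json.dumps(q["retrieved"]) for q in queries]
references = [json.dumps(q["relevant"]) for q in queries]

# Assumption: element i of one list is scored against element i of the other.
print(hypotheses[0])
print(references[0])
```

Keeping both lists in the same query order matters under this assumption, since a misaligned pair would score one query's results against another query's ground truth.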
How it works
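Recall@K answers the question: “Of all the chunks that should have been retrieved, how many actually appear in the top K results?”
Formula:
$$
\text{Recall@K} = \frac{|\text{relevant chunks} \cap \text{top-}K\text{ retrieved chunks}|}{|\text{relevant chunks}|}
$$
When K is not specified (no eval_config), the evaluator uses the full retrieved list. Pass eval_config={"k": N} to limit evaluation to the top N chunks.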
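The definition above can be sketched in a few lines of Python. This is a minimal local re-implementation for intuition, not the evaluator's actual code: the function name `recall_at_k` is hypothetical, chunks are assumed to match by exact string equality, and returning 0.0 for an empty reference list is an arbitrary choice made here.

```python
import json
from typing import Optional


def recall_at_k(hypothesis: str, reference: str, k: Optional[int] = None) -> float:
    """Fraction of ground-truth chunks that appear in the top-k retrieved chunks."""
    retrieved = json.loads(hypothesis)      # ranked list, best first
    relevant = set(json.loads(reference))   # ground-truth chunks
    if not relevant:
        return 0.0  # assumption: nothing to find is scored as 0
    # k=None mirrors the documented default of using the full retrieved list.
    top_k = retrieved if k is None else retrieved[:k]
    found = relevant.intersection(top_k)    # assumption: exact string matching
    return len(found) / len(relevant)


hypothesis = json.dumps(["c1", "c7", "c3", "c2"])
reference = json.dumps(["c1", "c2"])

print(recall_at_k(hypothesis, reference))       # 1.0 (both relevant chunks retrieved)
print(recall_at_k(hypothesis, reference, k=3))  # 0.5 (c2 sits at rank 4, outside top 3)
```

In this example, restricting to the top 3 halves the score because one of the two relevant chunks is ranked fourth.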
What to do when Recall@K is Low
If recall is low, the retriever is missing relevant context:
- Increase the number of chunks retrieved (higher K) to capture more relevant results
- Improve the embedding model or chunking strategy so relevant content ranks higher
- Check if ground-truth chunks are being split across multiple smaller chunks, causing partial matches
- Ensure the query is being embedded with the same model used for document embeddings
- Consider hybrid retrieval (combining dense and sparse methods) to catch different types of relevance
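One way to decide between these remedies is to compute recall at several values of K and see where the curve flattens. The sketch below is a hypothetical local helper (`recall_curve` is not part of the product) that assumes exact-match chunk IDs and uses made-up data.

```python
def recall_curve(retrieved, relevant, ks):
    """Return {k: recall@k} for a ranked list of retrieved chunk IDs."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant must contain at least one chunk")
    return {
        k: len(relevant_set.intersection(retrieved[:k])) / len(relevant_set)
        for k in ks
    }


# Illustrative data: "c3" is relevant but never retrieved.
retrieved = ["c9", "c1", "c4", "c8", "c2", "c5"]
relevant = ["c1", "c2", "c3"]

curve = recall_curve(retrieved, relevant, ks=[1, 3, 5, 6])
for k, score in curve.items():
    print(f"Recall@{k}: {score:.2f}")
```

Here recall climbs from 0 at K=1 to about 0.67 at K=5 and then stops improving. A curve that is still rising suggests a larger K may help, while a plateau below 1 means some relevant chunks are never retrieved at all, which points to the embedding, chunking, or hybrid-retrieval items in the list.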
Differentiating Recall@K from Similar Evals
- Precision@K: Recall@K measures how many relevant chunks were found, while Precision@K measures how many retrieved chunks are actually relevant. High recall with low precision means the retriever finds everything but also returns noise.
- NDCG@K: NDCG@K goes beyond recall by also considering ranking order, giving more credit when relevant chunks appear earlier in results.
- Hit Rate: Hit Rate only checks if at least one relevant chunk was retrieved, while Recall@K measures the fraction of all relevant chunks found.
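To make the contrast concrete, the following sketch computes recall, precision, and hit rate for the same ranked list. It is an illustration under stated assumptions rather than the evaluators' implementation: `retrieval_metrics` is a hypothetical helper, chunks match by exact ID, the retrieved list has no duplicates, and precision is divided by the number of chunks actually returned in the top K. NDCG@K is omitted to keep the example short.

```python
def retrieval_metrics(retrieved, relevant, k):
    """Compute recall, precision, and hit rate over the top-k retrieved chunks."""
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    if not top_k or not relevant_set:
        raise ValueError("need at least one retrieved and one relevant chunk")
    hits = sum(1 for chunk in top_k if chunk in relevant_set)
    return {
        "recall": hits / len(relevant_set),   # share of relevant chunks found
        "precision": hits / len(top_k),       # share of returned chunks that are relevant
        "hit_rate": 1.0 if hits > 0 else 0.0, # at least one relevant chunk found
    }


# Illustrative data: 2 of 4 relevant chunks appear among 5 retrieved chunks.
retrieved = ["c1", "c9", "c8", "c2", "c7"]
relevant = ["c1", "c2", "c3", "c4"]

metrics = retrieval_metrics(retrieved, relevant, k=5)
print(metrics)  # {'recall': 0.5, 'precision': 0.4, 'hit_rate': 1.0}
```

The same retrieval scores a perfect hit rate, half recall, and lower precision, which is why the three metrics are worth reading together.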