| Input | | |
|---|---|---|
| Required Input | Type | Description |
| hypothesis | string | JSON-serialized list of retrieved chunks in ranked order |
| reference | string | JSON-serialized list of ground-truth relevant chunks |

| Output | |
|---|---|
| Field | Description |
| Result | Returns a score between 0 and 1, where 1 means every chunk in the top K is relevant |
| Reason | Short summary string of the score, e.g. Precision@3: 0.333 |

| Parameter | | |
|---|---|---|
| Name | Type | Description |
| eval_config (evalConfig in JS/TS) | dict / Record<string, any> | Optional. Pass {"k": N} to limit evaluation to the top N retrieved chunks. Defaults to using the full list. |

Batch evaluation
To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation.
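As a rough illustration of the input shape only, the batch inputs could be prepared along these lines. The query data and the variable names `queries`, `hypotheses`, and `references` are hypothetical, and the SDK call that consumes these lists is deliberately not shown here.

```python
import json

# Hypothetical retrieval results for two queries (illustrative data only).
queries = [
    {
        "retrieved": ["chunk_a", "chunk_b", "chunk_c"],  # ranked order
        "relevant": ["chunk_a", "chunk_d"],              # ground truth
    },
    {
        "retrieved": ["chunk_x", "chunk_y"],
        "relevant": ["chunk_y"],
    },
]

# One JSON-serialized string per query, for each input field.
hypotheses = [json.dumps(q["retrieved"]) for q in queries]
references = [json.dumps(q["relevant"]) for q in queries]
```

Element i of `hypotheses` pairs with element i of `references`, so both lists need the same length and order.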
How it works
Precision@K answers the question: “Of the top K chunks the retriever returned, how many are actually relevant?” Formula:

$$
\text{Precision@K} = \frac{\text{number of relevant chunks in the top } K}{K}
$$

If K is not specified (via eval_config), the evaluator uses the full retrieved list. Pass eval_config={"k": N} to limit evaluation to the top N chunks.
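The calculation can be sketched in plain Python as follows. This is a minimal illustration and not the evaluator's actual implementation: it assumes chunks are strings matched by exact equality, and that when fewer than K chunks were retrieved the score is divided by the number of chunks actually considered. The function name `precision_at_k` is hypothetical.

```python
import json
from typing import Optional


def precision_at_k(hypothesis: str, reference: str, k: Optional[int] = None) -> float:
    """Fraction of the top-k retrieved chunks that appear in the reference set."""
    retrieved = json.loads(hypothesis)       # ranked list of retrieved chunks
    relevant = set(json.loads(reference))    # ground-truth relevant chunks

    # With no k, use the full retrieved list (mirrors the documented default).
    top_k = retrieved if k is None else retrieved[:k]
    if not top_k:
        return 0.0

    # Assumption: a chunk counts as relevant only on an exact string match.
    hits = sum(1 for chunk in top_k if chunk in relevant)
    return hits / len(top_k)


hypothesis = json.dumps(["chunk_a", "chunk_b", "chunk_c"])
reference = json.dumps(["chunk_a", "chunk_d"])

score_at_3 = precision_at_k(hypothesis, reference, k=3)  # 1 relevant out of 3
score_at_1 = precision_at_k(hypothesis, reference, k=1)  # 1 relevant out of 1
```

Here only `chunk_a` is relevant, so the score is 1/3 at K=3 and 1.0 at K=1, which shows how a smaller K can raise precision when the top-ranked result is a good match.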
What to do when Precision@K is Low
If precision is low, the retriever is returning too much irrelevant content:
- Reduce the number of chunks retrieved (lower K) to keep only the most confident matches
- Improve the embedding model to better distinguish relevant from irrelevant content
- Apply a similarity threshold to filter out low-confidence results before passing to the LLM
- Review your chunking strategy: chunks that are too large may contain a mix of relevant and irrelevant content
- Consider re-ranking retrieved results with a cross-encoder before passing them to the generator
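As one example of these remedies, a similarity threshold can be applied with a small filter like the sketch below. The helper `filter_by_similarity`, the scores, and the threshold of 0.6 are all hypothetical; a suitable cutoff depends on the embedding model and has to be tuned on your own data.

```python
from typing import List, Tuple


def filter_by_similarity(
    scored_chunks: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Keep chunks whose similarity score is at least `threshold`, preserving rank order."""
    return [chunk for chunk, score in scored_chunks if score >= threshold]


# Hypothetical (chunk, similarity score) pairs in ranked order.
scored = [("chunk_a", 0.91), ("chunk_b", 0.62), ("chunk_c", 0.48)]

# Drop low-confidence results before they reach the LLM.
kept = filter_by_similarity(scored, threshold=0.6)
```

With these values, `chunk_c` falls below the cutoff and is dropped, leaving two chunks to pass to the generator.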
Differentiating Precision@K from Similar Evals
- Recall@K: Precision@K measures retrieval quality (how clean the results are), while Recall@K measures retrieval coverage (how many relevant items were found). Optimizing one often trades off against the other.
- NDCG@K: NDCG@K considers both relevance and ranking position, while Precision@K treats all positions equally within the top K.
- Chunk Utilization: Precision@K evaluates retrieval quality before generation, while Chunk Utilization measures how well the generator actually uses the retrieved chunks.
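To make the contrast with Recall@K and NDCG@K concrete, the sketch below scores two rankings that contain the same chunks in a different order. It uses the common binary-relevance textbook definitions, which are an assumption here and may differ from how the corresponding evaluators are implemented. All function names and data are hypothetical, and the retrieved chunks are assumed to be unique.

```python
import math
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    top_k = ranked[:k]
    return sum(c in relevant for c in top_k) / len(top_k) if top_k else 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(c in relevant for c in ranked[:k]) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Binary relevance: a hit at 0-based index i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, c in enumerate(ranked[:k]) if c in relevant)
    # Ideal ranking places all relevant chunks first.
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"chunk_a", "chunk_b"}
good_order = ["chunk_a", "chunk_b", "chunk_x"]  # relevant chunks ranked first
poor_order = ["chunk_x", "chunk_a", "chunk_b"]  # same chunks, irrelevant one first

for name, ranking in [("good_order", good_order), ("poor_order", poor_order)]:
    print(
        name,
        round(precision_at_k(ranking, relevant, 3), 3),
        round(recall_at_k(ranking, relevant, 3), 3),
        round(ndcg_at_k(ranking, relevant, 3), 3),
    )
```

Both rankings get the same Precision@3 and Recall@3 because they contain the same chunks, while NDCG@3 is lower for `poor_order` because its relevant chunks sit further down the list.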