Precision@K

Python

import json
from fi.evals import Evaluator

evaluator = Evaluator()

result = evaluator.evaluate(
    eval_templates="precision_at_k",
    inputs={
        "hypothesis": json.dumps([
            "Paris is the capital of France.",
            "France is in Europe.",
            "The Eiffel Tower was built in 1889.",
            "Napoleon was born in Corsica.",
            "The Louvre is in Paris."
        ]),
        "reference": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ])
    },
    eval_config={"k": 5}
)

print(result.eval_results[0].output)   # 0.6
print(result.eval_results[0].reason)

JS/TS

import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
  "precision_at_k",
  {
    hypothesis: JSON.stringify([
      "Paris is the capital of France.",
      "France is in Europe.",
      "The Eiffel Tower was built in 1889.",
      "Napoleon was born in Corsica.",
      "The Louvre is in Paris."
    ]),
    reference: JSON.stringify([
      "Paris is the capital of France.",
      "The Eiffel Tower was built in 1889.",
      "The Louvre is in Paris."
    ])
  },
  {
    evalConfig: { k: 5 },
  }
);

console.log(result.eval_results[0]?.output);   // 0.6
console.log(result.eval_results[0]?.reason);

In this example, 5 chunks are retrieved. Of those 5, 3 are in the ground truth (“Paris is the capital…”, “The Eiffel Tower…”, and “The Louvre is in Paris.”), giving a precision of 3/5 = 0.6.


Required Input
	Name	Type	Description
	`hypothesis`	`string`	JSON-serialized list of retrieved chunks in ranked order
	`reference`	`string`	JSON-serialized list of ground-truth relevant chunks

Output
	Field	Description
	Result	Returns a score between 0 and 1, where 1 means every chunk in the top K is relevant
	Reason	Short summary string of the score, e.g. `Precision@3: 0.333`

Parameter
	Name	Type	Description
	`eval_config` (`evalConfig` in JS/TS)	`dict` / `Record<string, any>`	Optional. Pass `{"k": N}` to limit evaluation to the top N retrieved chunks. Defaults to using the full list.

Batch evaluation

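To evaluate multiple queries in a single call, pass a list of JSON-serialized strings for each input key. Each position in the lists represents one retrieval evaluation, so the first `hypothesis` entry is scored against the first `reference` entry, and so on: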

Python

results = evaluator.evaluate(
    eval_templates="precision_at_k",
    inputs={
        "hypothesis": [
            json.dumps(["Paris is the capital of France.", "France is in Europe.", "Napoleon was born in Corsica."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["Unrelated 1.", "Unrelated 2.", "Unrelated 3.", "The Louvre is in Paris."]),
        ],
        "reference": [
            json.dumps(["Paris is the capital of France.", "The Eiffel Tower was built in 1889."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["The Louvre is in Paris."]),
        ],
    },
    eval_config={"k": 3},
)

for i, r in enumerate(results.eval_results):
    print(f"Query {i+1}: {r.output}")
# Query 1: 0.333   (1 relevant in top 3 / 3)
# Query 2: 0.667   (2 relevant / 3; only 2 chunks were retrieved, but the denominator is still K)
# Query 3: 0.0     (0 relevant in top 3 / 3)
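
When the retrieved and ground-truth chunks already live in ordinary Python lists, the parallel JSON-string lists can be built with a small helper. The sketch below is only an illustration: `build_batch_inputs` is a hypothetical name, not part of the library, and it assumes the batch format shown above (one JSON string per query under each key).

```python
import json
from typing import Dict, List, Sequence, Tuple


def build_batch_inputs(
    queries: Sequence[Tuple[Sequence[str], Sequence[str]]],
) -> Dict[str, List[str]]:
    """Turn (retrieved, relevant) pairs into parallel lists of JSON strings.

    Hypothetical helper for illustration; not part of the library.
    """
    hypothesis = [json.dumps(list(retrieved)) for retrieved, _ in queries]
    reference = [json.dumps(list(relevant)) for _, relevant in queries]
    return {"hypothesis": hypothesis, "reference": reference}


queries = [
    (["The sky is blue.", "Water is wet."], ["The sky is blue.", "Water is wet."]),
    (["Unrelated 1.", "The Louvre is in Paris."], ["The Louvre is in Paris."]),
]
inputs = build_batch_inputs(queries)
# `inputs` has the same shape as the `inputs=` argument in the batch call above.
```

Keeping each query as a single pair makes it harder for the two lists to drift out of alignment.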

How it works

Precision@K answers the question: “Of the top K chunks the retriever returned, how many are actually relevant?” Formula:

Precision@K = (number of relevant items in top K) / K

The denominator is always K, even if fewer than K items were retrieved. Matching is based on exact string equality between retrieved chunks and ground-truth chunks.

By default (without `eval_config`), the evaluator uses the full retrieved list. Pass `eval_config={"k": N}` to evaluate only the top N retrieved chunks. For example, `eval_config={"k": 3}` checks precision within the first 3 results only.

A precision of 1.0 means every retrieved chunk is relevant; a precision of 0.5 means half the results are noise. Low precision means your LLM receives irrelevant context, which increases cost (more tokens) and can in some cases cause the model to hallucinate based on misleading information.
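
The rules in this section can be sketched in a few lines of plain Python. This is a minimal illustration under the stated rules (exact string matching, denominator fixed at K, K defaulting to the length of the retrieved list), not the evaluator's actual source. The function name and signature are assumptions made for this example.

```python
from typing import Optional, Sequence


def precision_at_k(
    retrieved: Sequence[str],
    relevant: Sequence[str],
    k: Optional[int] = None,
) -> float:
    """Illustrative re-implementation; not the library's source code."""
    # Assumption: with no k, the whole retrieved list is scored.
    if k is None:
        k = len(retrieved)
    if k <= 0:
        return 0.0
    relevant_set = set(relevant)  # membership test = exact string equality
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant_set)
    # The denominator stays K even when fewer than K chunks were retrieved.
    return hits / k


retrieved = [
    "Paris is the capital of France.",
    "France is in Europe.",
    "The Eiffel Tower was built in 1889.",
    "Napoleon was born in Corsica.",
    "The Louvre is in Paris.",
]
relevant = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "The Louvre is in Paris.",
]

print(precision_at_k(retrieved, relevant, k=5))  # 0.6
# Two retrieved chunks, both relevant, but K = 3, so the score is 2/3.
short_list = ["The sky is blue.", "Water is wet."]
print(round(precision_at_k(short_list, short_list, k=3), 3))  # 0.667
```

Because matching is exact, a chunk that differs from the ground truth only in whitespace or punctuation would not count as a hit in this sketch.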

What to do when Precision@K is Low

If precision is low, the retriever is returning too much irrelevant content:

- Reduce the number of chunks retrieved (lower K) to keep only the most confident matches.
- Improve the embedding model to better distinguish relevant from irrelevant content.
- Apply a similarity threshold to filter out low-confidence results before passing them to the LLM.
- Review your chunking strategy: chunks that are too large may contain a mix of relevant and irrelevant content.
- Consider re-ranking retrieved results with a cross-encoder before passing them to the generator.
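
As one illustration of the similarity-threshold suggestion, the sketch below filters scored chunks before computing precision over what remains. Everything here is hypothetical: the `(text, score)` tuple shape, the helper names, the scores, and the 0.7 threshold are made up for the example and do not come from any particular retriever.

```python
from typing import List, Sequence, Tuple

# Hypothetical shape: (chunk text, similarity score from the retriever).
ScoredChunk = Tuple[str, float]


def filter_by_score(chunks: Sequence[ScoredChunk], threshold: float) -> List[str]:
    """Keep chunks scoring at or above the threshold, preserving rank order."""
    return [text for text, score in chunks if score >= threshold]


def precision(retrieved: Sequence[str], relevant: Sequence[str]) -> float:
    """Precision over the full retrieved list (K = number of retrieved chunks)."""
    if not retrieved:
        return 0.0
    relevant_set = set(relevant)
    return sum(chunk in relevant_set for chunk in retrieved) / len(retrieved)


# Made-up scores, for illustration only.
scored = [
    ("Paris is the capital of France.", 0.91),
    ("The Louvre is in Paris.", 0.84),
    ("France is in Europe.", 0.55),
    ("Napoleon was born in Corsica.", 0.42),
]
relevant = ["Paris is the capital of France.", "The Louvre is in Paris."]

before = precision([text for text, _ in scored], relevant)  # 2 of 4 relevant
after = precision(filter_by_score(scored, 0.7), relevant)   # 2 of 2 relevant
print(before, after)  # 0.5 1.0
```

A threshold set too high can also discard relevant chunks, so it is worth checking a coverage metric such as Recall@K alongside precision when tuning it.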

Differentiating Precision@K from Similar Evals

Recall@K: Precision@K measures retrieval quality (how clean the results are), while Recall@K measures retrieval coverage (how many relevant items were found). Optimizing one often trades off against the other.
NDCG@K: NDCG@K considers both relevance and ranking position, while Precision@K treats all positions equally within the top K.
Chunk Utilization: Precision@K evaluates retrieval quality before generation, while Chunk Utilization measures how well the generator actually uses the retrieved chunks.
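
The contrast with Recall@K, and the fact that Precision@K ignores position within the top K, can be seen in a small sketch. Both functions below are illustrative re-implementations using the textbook definitions (hits divided by K for precision, hits divided by the number of relevant chunks for recall); the library's own Recall@K evaluator may differ in details.

```python
from typing import Sequence


def hits_in_top_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> int:
    """Count retrieved chunks in the top k that exactly match a relevant chunk."""
    relevant_set = set(relevant)
    return sum(1 for chunk in retrieved[:k] if chunk in relevant_set)


def precision_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    return hits_in_top_k(retrieved, relevant, k) / k if k > 0 else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    total = len(set(relevant))
    return hits_in_top_k(retrieved, relevant, k) / total if total else 0.0


retrieved = [
    "Paris is the capital of France.",
    "France is in Europe.",
    "The Eiffel Tower was built in 1889.",
    "Napoleon was born in Corsica.",
    "The Louvre is in Paris.",
]
relevant = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "The Louvre is in Paris.",
]

# Small K: clean results, poor coverage. Large K: full coverage, more noise.
for k in (1, 3, 5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.3f} recall={r:.3f}")

# Reordering the same five chunks leaves Precision@5 unchanged, which is the
# position-insensitivity that a rank-aware metric such as NDCG@K addresses.
reordered = list(reversed(retrieved))
same = precision_at_k(reordered, relevant, 5) == precision_at_k(retrieved, relevant, 5)
```

On this data, precision falls from 1.0 at K=1 to 0.6 at K=5 while recall rises from 1/3 to 1.0, which is the trade-off described above.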
