Recall@K

Python

import json
from fi.evals import Evaluator

evaluator = Evaluator()

result = evaluator.evaluate(
    eval_templates="recall_at_k",
    inputs={
        "hypothesis": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "France is in Europe.",
            "The Louvre is in Paris.",
            "Napoleon was born in Corsica."
        ]),
        "reference": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ])
    },
    eval_config={"k": 5}
)

print(result.eval_results[0].output)   # 1.0
print(result.eval_results[0].reason)

JS/TS

import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
  "recall_at_k",
  {
    hypothesis: JSON.stringify([
      "Paris is the capital of France.",
      "The Eiffel Tower was built in 1889.",
      "France is in Europe.",
      "The Louvre is in Paris.",
      "Napoleon was born in Corsica."
    ]),
    reference: JSON.stringify([
      "Paris is the capital of France.",
      "The Eiffel Tower was built in 1889.",
      "The Louvre is in Paris."
    ])
  },
  {
    evalConfig: { k: 5 },
  }
);

console.log(result.eval_results[0]?.output);   // 1.0
console.log(result.eval_results[0]?.reason);

In this example, 5 chunks are retrieved and 3 are in the ground truth. With K set to 5 (the full list), all 3 relevant chunks appear in the retrieved results, giving a recall of 3/3 = 1.0. Try setting eval_config={"k": 3}: only 2 of the 3 relevant chunks fall within the top 3, so recall drops to 2/3 ≈ 0.67.
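The arithmetic can be checked locally without calling the evaluator. The snippet below is a plain-Python sketch of the same count, assuming exact string matching; it is an illustration, not the library's implementation.

```python
retrieved = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "France is in Europe.",
    "The Louvre is in Paris.",
    "Napoleon was born in Corsica.",
]
relevant = {
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "The Louvre is in Paris.",
}

recall_by_k = {}
for k in (5, 3):
    # Count ground-truth chunks that appear among the first k retrieved chunks.
    hits = len(relevant & set(retrieved[:k]))
    recall_by_k[k] = hits / len(relevant)
    print(f"k={k}: {hits}/{len(relevant)} = {recall_by_k[k]:.3f}")
# k=5: 3/3 = 1.000
# k=3: 2/3 = 0.667
```

At k=3 the chunk "The Louvre is in Paris." sits at position 4, so it falls outside the cutoff.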


Input
	Required Input	Type	Description
	`hypothesis`	`string`	JSON-serialized list of retrieved chunks in ranked order
	`reference`	`string`	JSON-serialized list of ground-truth relevant chunks

Output
	Field	Description
	Result	Returns a score between 0 and 1, where 1 means all relevant chunks were found in the top K results
	Reason	Short summary string of the score, e.g. `Recall@3: 0.5`

Parameter
	Name	Type	Description
	`eval_config` (`evalConfig` in JS/TS)	`dict` / `Record<string, any>`	Optional. Pass `{"k": N}` to limit evaluation to the top N retrieved chunks. Defaults to using the full list.

Batch evaluation

To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation:

Python

results = evaluator.evaluate(
    eval_templates="recall_at_k",
    inputs={
        "hypothesis": [
            json.dumps(["Paris is the capital of France.", "France is in Europe.", "Napoleon was born in Corsica."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["Unrelated 1.", "Unrelated 2.", "Unrelated 3.", "The Louvre is in Paris."]),
        ],
        "reference": [
            json.dumps(["Paris is the capital of France.", "The Eiffel Tower was built in 1889."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["The Louvre is in Paris."]),
        ],
    },
    eval_config={"k": 3},
)

for i, r in enumerate(results.eval_results):
    print(f"Query {i+1}: {r.output}")
# Query 1: 0.5   (1 of 2 relevant found in top 3)
# Query 2: 1.0   (2 of 2 relevant found)
# Query 3: 0.0   (relevant chunk at position 4, outside top 3)
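When the retrieved and ground-truth chunks already live in Python lists, the batch payload can be assembled with a small helper. The sketch below is hypothetical (`build_batch_inputs` is not part of the SDK); it only reproduces the input shape shown above, with one JSON string per query under each key.

```python
import json

def build_batch_inputs(queries):
    """Serialize (retrieved, relevant) pairs into the batch input shape.

    `queries` is a list of (retrieved_chunks, relevant_chunks) tuples.
    Hypothetical helper for illustration; not provided by the SDK.
    """
    return {
        "hypothesis": [json.dumps(retrieved) for retrieved, _ in queries],
        "reference": [json.dumps(relevant) for _, relevant in queries],
    }

queries = [
    (["The sky is blue.", "Water is wet."], ["The sky is blue.", "Water is wet."]),
    (["Unrelated 1.", "The Louvre is in Paris."], ["The Louvre is in Paris."]),
]
inputs = build_batch_inputs(queries)
print(len(inputs["hypothesis"]), len(inputs["reference"]))  # 2 2
```

The resulting dictionary can then be passed as `inputs` in the batch call. Both lists stay aligned because they are built from the same tuples.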

How it works

Recall@K answers the question: “Of all the chunks that should have been retrieved, how many actually appear in the top K results?” Formula:

Recall@K = (number of relevant items in top K) / (total number of relevant items)

Matching is based on exact string equality between retrieved chunks and ground-truth chunks, so a chunk that differs only in casing, whitespace, or punctuation does not count as a match. A recall of 1.0 means every relevant chunk appears in the top K results; a recall of 0.5 means half of the relevant chunks are missing from the top K. By default (without eval_config), the evaluator uses the full retrieved list. Pass eval_config={"k": N} to evaluate only the top N retrieved chunks. For example, eval_config={"k": 3} checks whether relevant chunks appear in the first 3 results.
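As a rough reference for the formula, here is a minimal sketch in plain Python. It assumes exact string equality, treats duplicate ground-truth chunks as one item, and returns 0.0 when the ground truth is empty; those last two choices are assumptions of this sketch, and the evaluator may handle them differently.

```python
from typing import Optional, Sequence

def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str],
                k: Optional[int] = None) -> float:
    """Fraction of distinct relevant chunks found in the first k retrieved chunks."""
    relevant_set = set(relevant)
    if not relevant_set:
        # Assumption: with no ground truth, report 0.0 rather than raising.
        return 0.0
    # k=None mirrors the documented default of using the full retrieved list.
    top_k = retrieved if k is None else retrieved[:k]
    hits = relevant_set & set(top_k)
    return len(hits) / len(relevant_set)

retrieved = ["Unrelated 1.", "Unrelated 2.", "Unrelated 3.", "The Louvre is in Paris."]
relevant = ["The Louvre is in Paris."]

print(recall_at_k(retrieved, relevant))        # 1.0 (full list)
print(recall_at_k(retrieved, relevant, k=3))   # 0.0 (relevant chunk is at position 4)
print(recall_at_k(["the louvre is in paris."], relevant))  # 0.0 (casing differs)
```

The last call shows why exact matching matters: a chunk that is semantically identical but cased differently contributes nothing to the score.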

What to do when Recall@K is Low

If recall is low, the retriever is missing relevant context:

Increase the number of chunks retrieved (higher K) to capture more relevant results
Improve the embedding model or chunking strategy so relevant content ranks higher
Check whether ground-truth chunks are being split across multiple smaller chunks; because matching is exact, partial matches do not count as hits
Ensure the query is being embedded with the same model used for document embeddings
Consider hybrid retrieval (combining dense and sparse methods) to catch different types of relevance
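Before changing the retriever, it can help to check whether low recall comes from chunks that are truly missing or from near misses that exact matching rejects. The sketch below is a hypothetical diagnostic (`diagnose_misses` and its labels are made up for illustration); it classifies each ground-truth chunk by comparing it against the retrieved list.

```python
def diagnose_misses(retrieved, relevant):
    """Label each ground-truth chunk: exact, normalized, partial, or missing."""
    def norm(text):
        # Lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    # Ignore empty chunks so they cannot match everything as substrings.
    normalized = [norm(chunk) for chunk in retrieved if norm(chunk)]
    report = {}
    for ref in relevant:
        ref_norm = norm(ref)
        if ref in retrieved:
            report[ref] = "exact"
        elif ref_norm in normalized:
            report[ref] = "normalized"   # differs only in case or whitespace
        elif any(c in ref_norm or ref_norm in c for c in normalized):
            report[ref] = "partial"      # substring overlap, possible chunk split
        else:
            report[ref] = "missing"
    return report

retrieved = [
    "Paris is the capital of France.",
    "The Eiffel Tower",
    "was built in 1889.",
    "the louvre  is in Paris.",
]
relevant = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "The Louvre is in Paris.",
    "Napoleon was born in Corsica.",
]
report = diagnose_misses(retrieved, relevant)
for chunk, status in report.items():
    print(f"{status:10s} {chunk}")
```

Chunks labeled "normalized" or "partial" point to a chunking or preprocessing mismatch between the index and the ground truth, while "missing" points to the retriever itself.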

Differentiating Recall@K from Similar Evals

Precision@K: Recall@K measures how many relevant chunks were found, while Precision@K measures how many retrieved chunks are actually relevant. High recall with low precision means the retriever finds everything but also returns noise.
NDCG@K: NDCG@K goes beyond recall by also considering ranking order, giving more credit when relevant chunks appear earlier in results.
Hit Rate: Hit Rate only checks if at least one relevant chunk was retrieved, while Recall@K measures the fraction of all relevant chunks found.
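To make the contrast concrete, the sketch below computes all four metrics on the same ranked list using common textbook definitions with binary relevance. These definitions are assumptions of this sketch (for instance, precision is divided by k and NDCG uses a log2 discount) and may not match the corresponding evaluators exactly. It also assumes the retrieved chunks are distinct.

```python
import math

def metrics_at_k(retrieved, relevant, k):
    """Recall, precision, hit rate, and NDCG at cutoff k (binary relevance)."""
    relevant_set = set(relevant)
    if not relevant_set or k <= 0:
        raise ValueError("need at least one relevant chunk and k >= 1")
    # One boolean per rank: is the chunk at this position relevant?
    flags = [chunk in relevant_set for chunk in retrieved[:k]]
    hits = sum(flags)
    # DCG rewards hits at earlier ranks; IDCG is the best achievable DCG.
    dcg = sum(1.0 / math.log2(rank + 2) for rank, hit in enumerate(flags) if hit)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_set), k)))
    return {
        "recall": hits / len(relevant_set),
        "precision": hits / k,
        "hit_rate": 1.0 if hits else 0.0,
        "ndcg": dcg / idcg,
    }

retrieved = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "France is in Europe.",
    "The Louvre is in Paris.",
    "Napoleon was born in Corsica.",
]
relevant = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "The Louvre is in Paris.",
]
scores = metrics_at_k(retrieved, relevant, k=5)
print({name: round(value, 3) for name, value in scores.items()})
# {'recall': 1.0, 'precision': 0.6, 'hit_rate': 1.0, 'ndcg': 0.967}
```

Here recall is perfect, but precision is 0.6 because two of the five retrieved chunks are noise, and NDCG falls just short of 1.0 because one relevant chunk ranks fourth instead of third.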
