NDCG@K - Future AGI Documentation

Python

import json
from fi.evals import Evaluator

evaluator = Evaluator()

result = evaluator.evaluate(
    eval_templates="ndcg_at_k",
    inputs={
        "hypothesis": json.dumps([
            "France is in Europe.",
            "Paris is the capital of France.",
            "Napoleon was born in Corsica.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ]),
        "reference": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ])
    },
    eval_config={"k": 5}
)

print(result.eval_results[0].output)   # Score reflecting ranking quality
print(result.eval_results[0].reason)

JS/TS

import { Evaluator } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();

const result = await evaluator.evaluate(
  "ndcg_at_k",
  {
    hypothesis: JSON.stringify([
      "France is in Europe.",
      "Paris is the capital of France.",
      "Napoleon was born in Corsica.",
      "The Eiffel Tower was built in 1889.",
      "The Louvre is in Paris."
    ]),
    reference: JSON.stringify([
      "Paris is the capital of France.",
      "The Eiffel Tower was built in 1889.",
      "The Louvre is in Paris."
    ])
  },
  {
    evalConfig: { k: 5 },
  }
);

console.log(result.eval_results[0]?.output);   // Score reflecting ranking quality
console.log(result.eval_results[0]?.reason);

In this example, 3 relevant chunks are scattered across positions 2, 4, and 5 instead of being at the top. NDCG penalizes this because a perfect retriever would place all 3 relevant chunks at positions 1, 2, and 3.
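
To make the penalty concrete, the score for this example can be worked out by hand from the DCG formula given under "How it works". The sketch below assumes binary relevance and an ideal ranking that puts the 3 relevant chunks at positions 1 to 3; the value the evaluator reports may differ in rounding.

```python
import math

# 1-based positions of the relevant chunks in the retrieved list above.
actual_positions = [2, 4, 5]
# Positions the same chunks would occupy under a perfect ranking.
ideal_positions = [1, 2, 3]


def dcg(positions):
    """Sum 1 / log2(position + 1) over the positions that hold relevant chunks."""
    return sum(1.0 / math.log2(p + 1) for p in positions)


ndcg = dcg(actual_positions) / dcg(ideal_positions)
print(f"DCG={dcg(actual_positions):.3f}  IDCG={dcg(ideal_positions):.3f}  NDCG@5={ndcg:.3f}")
```

Under these assumptions the ratio comes out at roughly 0.68, so the scattered ranking loses about a third of the achievable score.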


Required Input	Type	Description
`hypothesis`	`string`	JSON-serialized list of retrieved chunks in ranked order
`reference`	`string`	JSON-serialized list of ground-truth relevant chunks

Output
	Field	Description
	Result	Returns a score between 0 and 1, where 1 means all relevant chunks appear at the top of the ranked list in ideal order
	Reason	Short summary string of the score, e.g. `NDCG@3: 0.469`

Parameter
	Name	Type	Description
	`eval_config` (`evalConfig` in JS/TS)	`dict` / `Record<string, any>`	Optional. Pass `{"k": N}` to limit evaluation to the top N retrieved chunks. Defaults to using the full list.

Batch evaluation

To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation:

Python

results = evaluator.evaluate(
    eval_templates="ndcg_at_k",
    inputs={
        "hypothesis": [
            json.dumps(["Paris is the capital of France.", "France is in Europe.", "Napoleon was born in Corsica."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["Unrelated 1.", "Unrelated 2.", "Unrelated 3.", "The Louvre is in Paris."]),
        ],
        "reference": [
            json.dumps(["Paris is the capital of France.", "The Eiffel Tower was built in 1889."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["The Louvre is in Paris."]),
        ],
    },
    eval_config={"k": 3},
)

for i, r in enumerate(results.eval_results):
    print(f"Query {i+1}: {r.output}")
# Query 1: below 1.0 (only 1 of the 2 relevant chunks is retrieved, at position 1)
# Query 2: 1.0 (both relevant chunks at top positions)
# Query 3: 0.0 (relevant chunk at position 4, outside top 3)
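
Building the two parallel lists by hand becomes error-prone with many queries. The following is a minimal sketch of a helper that produces the same `inputs` shape as the call above; `build_batch_inputs` is a hypothetical name used for illustration and is not part of the SDK.

```python
import json
from typing import Dict, List, Sequence, Tuple


def build_batch_inputs(
    queries: Sequence[Tuple[List[str], List[str]]]
) -> Dict[str, List[str]]:
    """Turn (retrieved, relevant) pairs into parallel lists of JSON strings."""
    return {
        # One JSON-serialized ranked list per query, in query order.
        "hypothesis": [json.dumps(retrieved) for retrieved, _ in queries],
        # The matching ground-truth list for each query, at the same index.
        "reference": [json.dumps(relevant) for _, relevant in queries],
    }


queries = [
    (["The sky is blue.", "Water is wet."], ["The sky is blue.", "Water is wet."]),
    (["Unrelated 1.", "The Louvre is in Paris."], ["The Louvre is in Paris."]),
]
inputs = build_batch_inputs(queries)
print(len(inputs["hypothesis"]), len(inputs["reference"]))
```

The resulting dictionary can then be passed as `inputs=inputs` in the same way as the hand-written one.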

How it works

NDCG@K measures not just whether relevant chunks were retrieved, but whether they appear early in the ranked results. It applies a logarithmic discount to lower-ranked positions, so a relevant chunk at position 1 contributes much more to the score than the same chunk at position 5. Formula:

DCG@K  = Σ  relevance(i) / log₂(i + 1)     for i = 1 to K
NDCG@K = DCG@K / IDCG@K

Where:

relevance(i) is 1 if the item at position i is in the ground truth, 0 otherwise
IDCG@K (Ideal DCG) is the best possible DCG if all relevant items were ranked first
Duplicate items in the retrieved list are only credited once

A score of 1.0 means the retriever placed all relevant chunks at the very top in the best possible order. A lower score means relevant chunks are buried below irrelevant ones or missing from the evaluated results. Matching is based on exact string equality, so a retrieved chunk counts as relevant only if it is identical to a ground-truth chunk.

By default (without eval_config), the evaluator uses the full retrieved list. Pass eval_config={"k": N} to evaluate only the top N retrieved chunks. For example, eval_config={"k": 3} measures ranking quality within the first 3 results only.
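
For checking scores offline, the definitions above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the evaluator's actual code: the function name `ndcg_at_k` and the choice to normalize by an ideal list holding min(number of relevant chunks, K) hits are assumptions.

```python
import math
from typing import Optional, Sequence


def ndcg_at_k(retrieved: Sequence[str], relevant: Sequence[str],
              k: Optional[int] = None) -> float:
    """Binary-relevance NDCG@K using exact string matching."""
    cutoff = len(retrieved) if k is None else k  # default: score the full list
    relevant_set = set(relevant)
    if cutoff <= 0 or not relevant_set:
        return 0.0

    credited = set()
    dcg = 0.0
    for position, chunk in enumerate(retrieved[:cutoff], start=1):
        if chunk in relevant_set and chunk not in credited:
            credited.add(chunk)  # a later duplicate of this chunk earns nothing
            dcg += 1.0 / math.log2(position + 1)

    # Assumption: the ideal list fills positions 1..min(#relevant, K) with hits.
    ideal_hits = min(len(relevant_set), cutoff)
    idcg = sum(1.0 / math.log2(p + 1) for p in range(1, ideal_hits + 1))
    return dcg / idcg


print(ndcg_at_k(["x", "a", "b"], ["a", "b"]))            # relevant chunks at positions 2 and 3
print(ndcg_at_k(["a", "a", "b"], ["a", "b"]))            # duplicate "a" is credited once
print(ndcg_at_k(["x", "y", "z", "a"], ["a"], k=3))       # relevant chunk falls outside the top 3
```

With this normalization, a single hit at position 1 against three relevant chunks and k=3 gives about 0.469, which matches the sample reason string shown in the Output table.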

What to do when NDCG@K is Low

If NDCG@K is low, relevant chunks are usually being retrieved but ranked too low, or they fall outside the top K:

Apply a re-ranking model (cross-encoder) to reorder results by relevance after initial retrieval
Fine-tune the embedding model on domain-specific data to improve ranking accuracy
Check if your similarity metric (cosine, dot product) is appropriate for your embedding model
Consider using a hybrid retrieval approach where sparse (BM25) and dense scores are combined for better ranking
Review query preprocessing: adding context to short queries can improve ranking quality
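
As one illustration of the hybrid approach in the list above, sparse and dense scores can be normalized to a common range and blended with a weight. The sketch below is only one possible fusion scheme: the function names, the min-max normalization, the `alpha` weight, and the example scores are all assumptions made for illustration.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: Dict[str, float], dense: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank documents by alpha * dense + (1 - alpha) * sparse, after normalization."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse) | set(dense)
    combined = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in docs
    }
    # Highest combined score first; ties broken by document id for determinism.
    return sorted(docs, key=lambda d: (-combined[d], d))


bm25_scores = {"a": 12.0, "b": 3.0, "c": 9.0}   # made-up sparse scores
dense_scores = {"a": 0.2, "b": 0.9, "c": 0.8}   # made-up cosine similarities
print(hybrid_rank(bm25_scores, dense_scores, alpha=0.7))
```

Here document "c" ranks first because it scores reasonably well on both signals, even though it tops neither list. Re-running NDCG@K on the fused ranking shows whether the blend actually helps.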

Differentiating NDCG@K from Similar Evals

Recall@K: Recall@K only checks if relevant chunks appear in the top K, regardless of position. NDCG@K also rewards placing them higher in the ranking.
Precision@K: Precision@K measures the fraction of the top K results that are relevant, without considering their order, while NDCG@K penalizes relevant results that appear late.
MRR: MRR only cares about where the first relevant chunk appears, while NDCG@K evaluates the ranking quality across all relevant chunks.
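
The differences are easiest to see by scoring several orderings of the same chunks with all four metrics. The sketch below uses simplified, self-written versions of each metric (binary relevance, exact matching, no duplicate handling); the function names are illustrative and do not correspond to the SDK's templates.

```python
import math


def hit_flags(retrieved, relevant, k):
    """1 for each top-k position holding a relevant chunk, else 0."""
    rel = set(relevant)
    return [1 if chunk in rel else 0 for chunk in retrieved[:k]]


def recall_at_k(retrieved, relevant, k):
    return sum(hit_flags(retrieved, relevant, k)) / len(set(relevant))


def precision_at_k(retrieved, relevant, k):
    return sum(hit_flags(retrieved, relevant, k)) / k


def mrr(retrieved, relevant, k):
    for position, hit in enumerate(hit_flags(retrieved, relevant, k), start=1):
        if hit:
            return 1.0 / position  # only the first relevant chunk matters
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    flags = hit_flags(retrieved, relevant, k)
    dcg = sum(h / math.log2(i + 1) for i, h in enumerate(flags, start=1))
    ideal_hits = min(len(set(relevant)), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg


relevant = ["A", "B", "C"]
rankings = {
    "ideal":       ["A", "B", "C", "x", "y"],
    "first-only":  ["A", "x", "y", "B", "C"],
    "scattered":   ["x", "A", "y", "B", "C"],
}
for name, ranked in rankings.items():
    scores = [f(ranked, relevant, 5) for f in (recall_at_k, precision_at_k, mrr, ndcg_at_k)]
    print(name, [round(s, 3) for s in scores])
```

Recall@5 and Precision@5 are identical for all three orderings, and MRR cannot tell "ideal" from "first-only"; only NDCG@5 assigns a different score to each.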
