Input

| Required Input | Type | Description |
|---|---|---|
| hypothesis | string | JSON-serialized list of retrieved chunks in ranked order |
| reference | string | JSON-serialized list of ground-truth relevant chunks |

Output

| Field | Description |
|---|---|
| Result | Returns a score between 0 and 1, where 1 means all relevant chunks appear at the top of the ranked list in ideal order |
| Reason | Short summary string of the score, e.g. NDCG@3: 0.469 |

Parameter

| Name | Type | Description |
|---|---|---|
| eval_config (evalConfig in JS/TS) | dict / Record<string, any> | Optional. Pass {"k": N} to limit evaluation to the top N retrieved chunks. Defaults to using the full list. |

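
As a minimal sketch of how the inputs described in the tables above might be prepared, the snippet below serializes two Python lists with the standard json module and builds the optional parameter. The chunk texts are hypothetical, and the evaluator call itself is left out because its exact signature is not shown in this section.

```python
import json

# Hypothetical chunk texts, for illustration only.
retrieved_chunks = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "France is in Western Europe.",
]
relevant_chunks = [
    "Paris is the capital of France.",
    "France is in Western Europe.",
]

# Both inputs are strings that each contain a JSON list.
# json.dumps keeps list order, so the ranking of retrieved chunks is preserved.
hypothesis = json.dumps(retrieved_chunks)
reference = json.dumps(relevant_chunks)

# Optional parameter: score only the top 2 retrieved chunks.
eval_config = {"k": 2}
```

Because matching relies on exact string equality, the chunk strings in reference should be byte-for-byte identical to the ones the retriever returns.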
Batch evaluation
To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation.
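
A sketch of how batch inputs could be assembled is shown here. It assumes the batch form takes two parallel lists, one of hypothesis strings and one of reference strings; that layout, the variable names, and the sample data are assumptions for illustration, and the evaluator call is omitted.

```python
import json

# Hypothetical data: one (retrieved, relevant) pair per query.
queries = [
    (["chunk A", "chunk B", "chunk C"], ["chunk A"]),
    (["chunk D", "chunk E"], ["chunk E", "chunk F"]),
]

# One JSON string per query. The two lists stay aligned,
# so index i in both lists refers to the same query.
hypotheses = [json.dumps(retrieved) for retrieved, _ in queries]
references = [json.dumps(relevant) for _, relevant in queries]
```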
How it works
NDCG@K measures not just whether relevant chunks were retrieved, but whether they appear early in the ranked results. It applies a logarithmic discount to lower-ranked positions, so a relevant chunk at position 1 contributes much more to the score than the same chunk at position 5.

Formula (standard binary-relevance form with a base-2 logarithmic discount):

$$\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{\mathrm{relevance}(i)}{\log_2(i + 1)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$

- relevance(i) is 1 if the item at position i is in the ground truth, 0 otherwise
- IDCG@K (Ideal DCG) is the best possible DCG if all relevant items were ranked first
- Duplicate items in the retrieved list are only credited once

When K is not specified (no eval_config), the evaluator uses the full retrieved list. Pass eval_config={"k": N} to limit evaluation to the top N chunks. Matching is based on exact string equality.
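
The scoring rules described above (binary relevance, a logarithmic position discount, duplicates credited once, exact string matching, optional k) can be sketched in plain Python. This is an illustrative reference, not the evaluator's actual implementation: the function name ndcg_at_k, the base-2 logarithm, and the choice to cap the ideal ranking at min(K, number of relevant chunks) are assumptions.

```python
import json
import math
from typing import Optional


def ndcg_at_k(hypothesis: str, reference: str, k: Optional[int] = None) -> float:
    """Binary-relevance NDCG@K over JSON-serialized chunk lists (sketch)."""
    retrieved = json.loads(hypothesis)
    relevant = set(json.loads(reference))

    # Without k, the full retrieved list is scored.
    cutoff = k if k is not None else len(retrieved)
    retrieved = retrieved[:cutoff]
    if not relevant or not retrieved:
        return 0.0

    dcg = 0.0
    credited = set()
    for position, chunk in enumerate(retrieved, start=1):
        # Exact string equality; a repeated chunk earns credit only once.
        if chunk in relevant and chunk not in credited:
            credited.add(chunk)
            dcg += 1.0 / math.log2(position + 1)

    # Ideal DCG: every relevant chunk ranked first (assumed cap at the cutoff).
    ideal_hits = min(len(relevant), cutoff)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg


score = ndcg_at_k(json.dumps(["a", "x", "y"]), json.dumps(["a", "b", "c"]))
print(f"NDCG@3: {score:.3f}")
```

In this example only one of three relevant chunks is retrieved, at position 1, so DCG is 1 while the ideal DCG is 1 + 1/log2(3) + 1/log2(4), giving a score of about 0.469.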
What to do when NDCG@K is Low
If NDCG@K is low, relevant chunks are being retrieved but ranked poorly:
- Apply a re-ranking model (cross-encoder) to reorder results by relevance after initial retrieval
- Fine-tune the embedding model on domain-specific data to improve ranking accuracy
- Check if your similarity metric (cosine, dot product) is appropriate for your embedding model
- Consider using a hybrid retrieval approach where sparse (BM25) and dense scores are combined for better ranking
- Review query preprocessing: adding context to short queries can improve ranking quality
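
One way to combine sparse and dense scores, as suggested in the list above, is a weighted sum of min-max normalized scores. The sketch below is only one possible fusion scheme; the function names, the alpha weight, and the sample scores are hypothetical, and other approaches (for example rank-based fusion) may suit a given system better.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: Dict[str, float], dense: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank chunk ids by alpha * dense + (1 - alpha) * sparse (normalized)."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse) | set(dense)
    combined = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in docs
    }
    # Highest combined score first; ties broken by chunk id for stable output.
    return sorted(docs, key=lambda d: (-combined[d], d))


# Hypothetical scores: BM25 is unbounded, cosine similarity is in [-1, 1].
bm25_scores = {"a": 12.0, "b": 3.0, "c": 7.0}
cosine_scores = {"a": 0.2, "b": 0.9, "c": 0.8}
print(hybrid_rank(bm25_scores, cosine_scores))
```

Here chunk c ranks first because it scores reasonably well under both retrievers, whereas a and b each do well under only one.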
Differentiating NDCG@K from Similar Evals
- Recall@K: Recall@K only checks if relevant chunks appear in the top K, regardless of position. NDCG@K also rewards placing them higher in the ranking.
- Precision@K: Precision@K measures the fraction of the top K results that are relevant, without considering their order, while NDCG@K penalizes relevant results that appear late.
- MRR: MRR only cares about where the first relevant chunk appears, while NDCG@K evaluates the ranking quality across all relevant chunks.
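
To make these differences concrete, the sketch below scores three rankings of the same four chunks with simplified versions of each metric. The helper functions use standard textbook definitions and are assumptions for illustration, not the evaluators' actual code.

```python
import math


def hits(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant))


def recall_at_k(retrieved, relevant, k):
    return hits(retrieved, relevant, k) / len(set(relevant))


def precision_at_k(retrieved, relevant, k):
    return hits(retrieved, relevant, k) / k


def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant chunk only.
    for position, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    seen, dcg = set(), 0.0
    for position, chunk in enumerate(retrieved[:k], start=1):
        if chunk in relevant and chunk not in seen:
            seen.add(chunk)
            dcg += 1.0 / math.log2(position + 1)
    ideal = min(len(set(relevant)), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal + 1))
    return dcg / idcg if idcg else 0.0


relevant = ["a", "b"]
rankings = {
    "early": ["a", "b", "x", "y"],  # both relevant chunks first
    "split": ["a", "x", "y", "b"],  # second relevant chunk arrives late
    "late": ["x", "y", "a", "b"],   # both relevant chunks last
}
for name, retrieved in rankings.items():
    print(
        name,
        f"recall={recall_at_k(retrieved, relevant, 4):.2f}",
        f"precision={precision_at_k(retrieved, relevant, 4):.2f}",
        f"mrr={mrr(retrieved, relevant):.2f}",
        f"ndcg={ndcg_at_k(retrieved, relevant, 4):.2f}",
    )
```

Recall@4 and Precision@4 are identical for all three rankings, and MRR cannot tell "early" from "split" because both place a relevant chunk first. Only NDCG@4 decreases steadily from "early" to "split" to "late".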