RAG Evaluation Framework

This framework allows for automated retrieval evaluation of the History RAG system. It is designed to verify the quality of the generated topic indices and embeddings by running standardized queries against a “golden” evaluation set.

How it Works

The evaluation follows these steps:

In-Memory Ingestion: The provided ZIP file (containing embeddings.npy, index.pkl, and topics/*.json) is ingested into a fast InMemoryTopicStore. This avoids the need for a live Spanner instance.
Query Execution: Each query in the evaluation set is converted into an embedding vector using the Gemini API.
Retrieval: The system searches for the top 5 most similar chunks in the memory store.
Metric Calculation: The retrieved topic names are compared against the expected names to calculate Recall@5 and MRR.

Metrics Explained

Recall@5 (Coverage)

Recall@5 measures if the “correct” topic was found anywhere in the top 5 results.

Formula: (Number of relevant topics found in top 5) / (Total number of relevant topics)
Significance: If Recall@5 is low, it means the LLM is never seeing the relevant information because it wasn't retrieved in the first place. For a high-quality index, this should be > 0.80.

MRR - Mean Reciprocal Rank (Ranking Quality)

MRR measures how high up the first correct topic appears in the results.

Formula: 1 / Rank of the first relevant result (e.g., 1.0 if at rank 1, 0.5 if at rank 2).
Significance: A high MRR indicates that the system is not only finding the right data but ranking it at the very top. High ranking accuracy leads to better summaries because LLMs often prioritize the first pieces of context they read. Aim for > 0.70.

How to Run the Evaluation

Prepare your Evaluation Set: Create a JSON file (e.g., eval_set.json) with the following structure:

```json
{
  "test_cases": [
    {
      "query": "How do I handle authentication?",
      "expected_topic_names": ["AuthMiddleware Implementation"]
    }
  ]
}
```

Set your API Key:
```
export GEMINI_API_KEY="your-api-key"
```

Execute the Tool:

bazelisk run //rag/go/eval/eval_tool --
  --zip_path=/path/to/data.zip
  --eval_set_path=./eval_set.json
  --config_path=./rag/configs/demo.json

Interpreting Results

Recall@5 = 1.0, MRR = 1.0: Perfect retrieval. The top result is exactly what was expected.
Recall@5 = 1.0, MRR = 0.2: The system found the right data, but it was buried at the 5th position. This suggests that while the embeddings are somewhat relevant, the ranking needs improvement (possibly due to noisy chunks).
Recall@5 = 0.0: Total failure. The embedding for the query is not matching the topic chunks at all. This usually indicates a mismatch in the embedding model or poor chunking strategy.