# RAG Evaluation Framework

This framework allows for automated retrieval evaluation of the History RAG system. It is designed
to verify the quality of the generated topic indices and embeddings by running standardized queries
against a "golden" evaluation set.

## How it Works

The evaluation follows these steps:

1. **In-Memory Ingestion**: The provided ZIP file (containing `embeddings.npy`, `index.pkl`, and
   `topics/*.json`) is ingested into a fast `InMemoryTopicStore`. This avoids the need for a live
   Spanner instance.
2. **Query Execution**: Each query in the evaluation set is converted into an embedding vector
   using the Gemini API.
3. **Retrieval**: The system searches for the top 5 most similar chunks in the in-memory store (a
   minimal sketch of this search appears after this list).
4. **Metric Calculation**: The retrieved topic names are compared against the expected names to
   calculate **Recall@5** and **MRR**.
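
As a rough illustration of the retrieval step, the sketch below performs a brute-force top-5 search
by cosine similarity over in-memory embeddings. The package, type, and function names are
hypothetical, and the real `InMemoryTopicStore` may use a different similarity metric or data
layout.

```go
package eval

import (
	"math"
	"sort"
)

// scoredChunk pairs a topic name with its similarity score against the query.
type scoredChunk struct {
	TopicName string
	Score     float64
}

// cosine returns the cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every stored chunk against the query embedding and returns the k best matches.
func topK(query []float64, embeddings [][]float64, topicNames []string, k int) []scoredChunk {
	scored := make([]scoredChunk, len(embeddings))
	for i, emb := range embeddings {
		scored[i] = scoredChunk{TopicName: topicNames[i], Score: cosine(query, emb)}
	}
	sort.Slice(scored, func(i, j int) bool { return scored[i].Score > scored[j].Score })
	if k > len(scored) {
		k = len(scored)
	}
	return scored[:k]
}
```

A brute-force scan like this is typically adequate for an offline evaluation run.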

## Metrics Explained

### Recall@5 (Coverage)

Recall@5 measures whether the "correct" topic was found anywhere in the top 5 results.

- **Formula**: `(Number of relevant topics found in top 5) / (Total number of relevant topics)`
- **Significance**: If Recall@5 is low, the LLM never sees the relevant information, because it
  was not retrieved in the first place. For a high-quality index, this should be **> 0.80**.
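
A minimal sketch of the per-query recall calculation, continuing the hypothetical `eval` package
above and assuming the retrieved list is already truncated to the top 5 and that topics are matched
by exact name (the tool's actual matching logic may differ):

```go
// recallAtK returns the fraction of expected topic names that appear anywhere in the
// retrieved list (assumed to already be truncated to the top k results).
func recallAtK(retrieved, expected []string) float64 {
	if len(expected) == 0 {
		return 0
	}
	retrievedSet := make(map[string]bool, len(retrieved))
	for _, name := range retrieved {
		retrievedSet[name] = true
	}
	found := 0
	for _, name := range expected {
		if retrievedSet[name] {
			found++
		}
	}
	return float64(found) / float64(len(expected))
}
```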

### MRR - Mean Reciprocal Rank (Ranking Quality)

MRR measures how high in the results the first correct topic appears, averaged across all queries
in the evaluation set.

- **Formula**: mean over all queries of `1 / Rank of the first relevant result` (e.g., 1.0 if at
  rank 1, 0.5 if at rank 2, and 0 if no relevant result is retrieved).
- **Significance**: A high MRR indicates that the system is not only finding the right data but
  ranking it at the very top. High ranking accuracy leads to better summaries, because LLMs often
  prioritize the first pieces of context they read. Aim for **> 0.70**.
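
Continuing the same hypothetical sketch, the per-query reciprocal rank and its mean across queries
could be computed as follows (names are illustrative only):

```go
// reciprocalRank returns 1/rank of the first retrieved topic that matches an expected
// topic, or 0 if none of the retrieved results match.
func reciprocalRank(retrieved, expected []string) float64 {
	expectedSet := make(map[string]bool, len(expected))
	for _, name := range expected {
		expectedSet[name] = true
	}
	for i, name := range retrieved {
		if expectedSet[name] {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}

// meanReciprocalRank averages the per-query reciprocal ranks into the final MRR.
func meanReciprocalRank(reciprocalRanks []float64) float64 {
	if len(reciprocalRanks) == 0 {
		return 0
	}
	var sum float64
	for _, rr := range reciprocalRanks {
		sum += rr
	}
	return sum / float64(len(reciprocalRanks))
}
```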

## How to Run the Evaluation

1. **Prepare your Evaluation Set**: Create a JSON file (e.g., `eval_set.json`) with the following
   structure (a Go sketch of the corresponding structs appears after this list):

   ```json
   {
     "test_cases": [
       {
         "query": "How do I handle authentication?",
         "expected_topic_names": ["AuthMiddleware Implementation"]
       }
     ]
   }
   ```

2. **Set your API Key**:

   ```bash
   export GEMINI_API_KEY="your-api-key"
   ```

3. **Execute the Tool**:

   ```bash
   bazelisk run //rag/go/eval/eval_tool -- \
     --zip_path=/path/to/data.zip \
     --eval_set_path=./eval_set.json \
     --config_path=./rag/configs/demo.json
   ```
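
For reference, the evaluation set from step 1 maps naturally onto a pair of Go structs. The type
and field names below are a sketch based on the JSON keys shown above, not necessarily what
`eval_tool` uses internally:

```go
package eval

import (
	"encoding/json"
	"os"
)

// EvalSet mirrors the top-level structure of eval_set.json.
type EvalSet struct {
	TestCases []TestCase `json:"test_cases"`
}

// TestCase pairs a natural-language query with the topic names the retriever is
// expected to surface for it.
type TestCase struct {
	Query              string   `json:"query"`
	ExpectedTopicNames []string `json:"expected_topic_names"`
}

// loadEvalSet reads and parses an evaluation set file from disk.
func loadEvalSet(path string) (*EvalSet, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var set EvalSet
	if err := json.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	return &set, nil
}
```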

## Interpreting Results

- **Recall@5 = 1.0, MRR = 1.0**: Perfect retrieval. The top result is exactly what was expected.
- **Recall@5 = 1.0, MRR = 0.2**: The system found the right data, but it was buried at the 5th
  position. This suggests that while the embeddings are somewhat relevant, the ranking needs
  improvement (possibly due to noisy chunks).
- **Recall@5 = 0.0**: Total failure. The query embedding does not match the topic chunks at all.
  This usually indicates a mismatch in the embedding model or a poor chunking strategy.