When you have a RAG-based system in place, it becomes necessary to have methods of evaluating it, much like a machine learning model is evaluated against various criteria.
Some metrics for evaluation are:
- Search precision
- Contextual relevance
- Response accuracy
<aside>
💡
During the generation phase the LLM may overlook the retrieved context and hallucinate, i.e. fabricate information. Such responses are not grounded in reality.
</aside>
- It becomes necessary to enrich the context so that the retrieved documents have a smaller contextual gap with the query.
- Precision matters in retrieval because not all of the documents retrieved are relevant to the query.
- Recall matters because not all of the relevant documents may be retrieved (see the sketch after this list).
- Another failure mode is called Lost in the Middle, where LLMs struggle to use crucial information positioned in the middle of the context, especially with longer contexts. This leads to incomplete, less useful results.
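As a concrete illustration of retrieval precision and recall, here is a minimal sketch for a single query. The function name, document IDs and ground-truth set are hypothetical placeholders:

```python
def precision_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> tuple[float, float]:
    """Compute precision@k and recall@k for one query.

    retrieved_ids: document IDs returned by the retriever, ranked best-first.
    relevant_ids:  document IDs known to be relevant to the query (ground truth).
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0                          # share of retrieved docs that are relevant
    recall = hits / len(relevant_ids) if relevant_ids else 0.0  # share of relevant docs that were retrieved
    return precision, recall


# Hypothetical example: the retriever returned d1..d5, but only d1, d4 and d7 are truly relevant.
p, r = precision_recall_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4", "d7"}, k=5)
print(f"precision@5 = {p:.2f}, recall@5 = {r:.2f}")  # precision@5 = 0.40, recall@5 = 0.67
```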
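For the Lost in the Middle effect just described, one commonly used mitigation (an assumption here, not something the notes above prescribe) is to reorder the ranked documents so the strongest matches sit at the beginning and end of the prompt while the weakest end up in the middle:

```python
def reorder_for_long_context(docs_ranked_best_first: list[str]) -> list[str]:
    """Alternate ranked docs between the front and back of the output so the
    least relevant documents land in the middle of the context window."""
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_ranked_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


# doc1 is the most relevant, doc5 the least.
print(reorder_for_long_context(["doc1", "doc2", "doc3", "doc4", "doc5"]))
# ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']  -> best docs at the edges, weakest in the middle
```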
Solution
As with any evaluation, we need a set of metrics. Understanding these metrics tells us what to change in each phase of the RAG pipeline.
Frameworks
Ragas
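Ragas provides metrics such as faithfulness, answer relevancy, context precision and context recall. Below is a minimal sketch of an evaluation run, assuming the Ragas 0.1-style API, a Hugging Face Dataset with the column names Ragas expects (question, answer, contexts, ground_truth), and an API key for the judge LLM that Ragas calls under the hood; exact imports and column names vary between versions:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One toy sample: the question asked, the generated answer,
# the retrieved chunks, and a reference (ground-truth) answer.
samples = {
    "question": ["What does RAG stand for?"],
    "answer": ["RAG stands for Retrieval-Augmented Generation."],
    "contexts": [["Retrieval-Augmented Generation (RAG) pairs a retriever with a generator."]],
    "ground_truth": ["Retrieval-Augmented Generation"],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores for the sample
```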