Chunking documents is good for RAG: it lets you pass only the relevant sections of your documents to the LLM, potentially saving tokens and ensuring the model gets the relevant context.
One approach is to chunk documents twice: you have large parent chunks and smaller child chunks. During the RAG workflow, your query is matched against the small child chunks, and the retriever then returns the larger parent chunks they belong to. Matching against the more concise child chunks makes retrieval more precise, while returning the parent chunks gives the LLM fuller context.
There are variations on this idea, such as LangChain's MultiVectorRetriever, which can use an LLM to generate summaries or hypothetical questions to serve as the smaller chunks. But let's see how we can implement the parent/child approach using LangChain.
First, define your LLM and embedding model:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from dotenv import load_dotenv
from os import getenv

load_dotenv()

llm = AzureChatOpenAI(
    deployment_name=getenv("GPT4O_NAME"), max_tokens=4000, temperature=0
)

openai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment=getenv("EMBEDDINGS_NAME"),
    api_key=getenv("OPENAI_API_KEY"),
    azure_endpoint=getenv("AZURE_OPENAI_ENDPOINT"),
)

# Cache embeddings on disk so the same documents and queries are never re-embedded
docs_store = LocalFileStore("./static/cache/docs_cache")
query_store = LocalFileStore("./static/cache/query_cache")

embeddings = CacheBackedEmbeddings.from_bytes_store(
    openai_embeddings,
    document_embedding_cache=docs_store,
    query_embedding_cache=query_store,
    namespace=openai_embeddings.model,
)
Next, define the parent and child text splitters:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Large parent chunks carry context; small child chunks are what actually gets embedded
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=12000,
    length_function=len,
    is_separator_regex=False,
)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=4000)
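To see how the two splitters nest (a purely illustrative example using made-up filler text), split a sample document with the parent splitter and then re-split the result with the child splitter:

from langchain_core.documents import Document

sample = Document(page_content="lorem ipsum " * 3000)  # ~36k characters of filler
parent_chunks = parent_splitter.split_documents([sample])
child_chunks = child_splitter.split_documents(parent_chunks)
print(len(parent_chunks), len(child_chunks))  # a handful of parents, roughly 3x as many children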
Import the ParentDocumentRetriever:
from langchain.retrievers import ParentDocumentRetriever
Now, we need a vector store for the child-chunk embeddings as well as a docstore to hold the parent chunks:
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma

store = InMemoryStore()  # docstore that will hold the full parent chunks
I am using Chroma here; you could swap in another vector store such as FAISS.
vector_store = Chroma(
    collection_name="full_documents",
    embedding_function=embeddings,
    persist_directory="./static/chroma",
)
Define the retriever:
retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
Finally, create your documents and add them through the retriever; it splits them into parent and child chunks and populates both stores:

retriever.add_documents(docs)  # docs is a list of Document objects
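For instance (a minimal sketch; the file path and query are hypothetical), you can load a text file, index it, and query the retriever. Note the asymmetry: searching the vector store directly returns the small child chunks, while the retriever maps those hits back to their larger parent chunks:

from langchain_community.document_loaders import TextLoader

docs = TextLoader("./static/data/report.txt").load()  # hypothetical source file
retriever.add_documents(docs)

# Similarity search on the vector store hits the small child chunks...
sub_docs = vector_store.similarity_search("key findings")
print(len(sub_docs[0].page_content))  # at most ~4000 characters

# ...while the retriever returns the parent chunks those children came from
parents = retriever.invoke("key findings")
print(len(parents[0].page_content))  # up to ~12000 characters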