Natural Language Processing (NLP) has seen remarkable advances in recent years, largely due to the development of powerful language models like GPT, BERT, and T5. While these models are incredibly adept at generating human-like text, they still face challenges when required to generate real-time responses based on information outside their training data. This is where Retrieval-Augmented Generation (RAG) comes into play. In this blog post, we’ll explore what RAG is, how it works, and why it is an essential tool in modern NLP.
What is RAG?
RAG, or Retrieval-Augmented Generation, is an advanced NLP framework that combines two crucial components: retrieval and generation. Unlike traditional language models that rely solely on their internal knowledge, RAG uses an external knowledge base to retrieve relevant information and then synthesizes a response using a large language model (LLM). This approach is particularly beneficial when generating text in real time or when the required context falls outside the LLM’s training data.
For example, let’s say you want to use a language model trained on general English text to answer a query related to a specific medical topic. Since the LLM might not have sufficient domain-specific knowledge, RAG can incorporate external information to fill this gap, providing a more relevant and accurate response.
Why Use RAG Over Fine-Tuning?
While fine-tuning or transfer learning is often used to adapt a model to new data, RAG offers a more flexible and dynamic approach. Fine-tuning requires large amounts of domain-specific data and time to train the model, while RAG can access external knowledge bases and retrieve the necessary information on the fly. This makes RAG ideal for scenarios requiring real-time information retrieval and responses, particularly when the needed context is not included in the LLM’s internal dataset.
Components of RAG
RAG consists of three main components:
- Vector Embeddings
- Retriever
- Generator
Let’s go through each of these components in detail.
1. Vector Embeddings
Vector embeddings are numerical representations of text data. They encode the semantic meaning of words or phrases into fixed-length vectors, where the length depends on the embedding model used. This numerical representation allows relevant text to be compared and retrieved.
For example, the word “python” can be converted into a vector of numerical values, which makes it possible to measure how closely it relates to other text. Embeddings vary from model to model: one model might produce vectors of length 384, while another might generate vectors of length 512 or more. Beyond the length, the numerical values themselves also differ between models.
Below is the vector embedding obtained for the word “python”, generated with the all-MiniLM-L6-v2 model.
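As a minimal sketch (assuming the sentence-transformers package, which provides this model, is installed), such an embedding can be produced like this:

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model (produces 384-dimensional vectors).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the word "python" into a fixed-length float32 vector.
embedding = model.encode("python")

print(embedding.shape)  # (384,)
print(embedding)
```

Printing the resulting array gives output along these lines: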
array([-5.61514683e-02, 1.77425984e-02, -5.91333769e-02, 4.02669609e-02,
-4.38507907e-02, -1.47820100e-01, 5.41596226e-02, 5.14107198e-02,
-4.56306674e-02, -4.09573242e-02, 2.96872910e-02, 1.86908059e-02,
6.04560189e-02, 1.35979680e-02, -6.41614664e-03, -7.10446993e-03,
………………………… TRUNCATED DATA ………………………………………
7.99969397e-03, -5.72937261e-03, 7.71107376e-02, 4.30024341e-02,
-3.11650913e-02, 3.87991290e-03, -7.41218105e-02, 4.61037159e-02,
3.26175652e-02, 1.46994665e-01, 1.28343910e-01, -1.50316618e-02],
dtype=float32)
2. Retriever
In RAG, the retriever’s role is to find the most relevant documents or pieces of information from a large collection of text data. To accomplish this, RAG uses vector stores, which store the embeddings generated earlier and facilitate efficient retrieval. Depending on the design, a vector store can also hold additional details such as the raw text, document metadata (e.g., page numbers, document IDs), and more.
Some well-known vector stores include:
- FAISS (Facebook AI Similarity Search)
- ChromaDB
These vector stores compare the query’s embedding with the stored embeddings and return the most relevant documents. If you choose not to use a vector store, you will need to handle both storage and retrieval yourself, for example by computing cosine similarity between the query embedding and every stored embedding, which can be cumbersome and less efficient. For ease, accuracy, and efficiency, using a vector store is recommended.
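For illustration, here is a minimal sketch of that manual, store-free approach; the documents and the model choice are placeholders rather than part of any particular library’s API:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A toy "knowledge base" kept in a plain Python list (no vector store).
documents = [
    "Python was created by Guido van Rossum and first released in 1991.",
    "To install Python, download the installer from python.org and run it.",
    "PEP 8 is the official style guide for Python code.",
]
doc_embeddings = model.encode(documents)

query_embedding = model.encode("How do I install Python?")

# Cosine similarity between the query and every stored embedding.
similarities = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Pick the most relevant document.
best = int(np.argmax(similarities))
print(documents[best])
```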
3. Generator (LLM)
The final component of RAG is the generator, which is typically a large language model such as GPT or T5. The LLM synthesizes an answer based on the retrieved documents, allowing for a more contextually accurate and relevant response. This is where RAG truly shines, as it enables the LLM to generate answers even for queries that require knowledge outside its original training data.
The RAG Process Explained
Let’s walk through the RAG process step-by-step to understand how it works:
1. Text Data Preparation
First, we need a collection of raw text data. For instance, consider text data related to Python programming, which includes information like Python’s history, installation steps, object-oriented programming (OOP) concepts, and PEP8 guidelines.
2. Converting Text to Embeddings
The next step is to convert this raw text data into embeddings using a pre-trained model. Each word, phrase, or document chunk is represented as a vector of numerical values that encapsulates its semantic meaning. These embeddings will then be stored in a vector store for later retrieval.
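A rough sketch of this step, assuming sentence-transformers and using a few placeholder chunks in place of the full Python corpus:

```python
from sentence_transformers import SentenceTransformer

# Raw text chunks about Python (placeholders standing in for a real corpus).
chunks = [
    "Python is a high-level programming language created by Guido van Rossum.",
    "To install Python, download the installer from python.org and run it.",
    "Python supports object-oriented programming with classes and inheritance.",
    "PEP 8 describes the recommended style conventions for Python code.",
]

# Convert every chunk into a fixed-length embedding in one batch.
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks)

print(chunk_embeddings.shape)  # (4, 384) for all-MiniLM-L6-v2
```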
3. Storing the Embeddings
The embeddings generated in the previous step are stored in a vector store like FAISS or ChromaDB. These vector stores not only store the embeddings but also enable efficient retrieval of relevant documents when a query is made.
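Here is a hedged sketch of storing embeddings in ChromaDB; the collection name, IDs, and metadata are illustrative placeholders:

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "To install Python, download the installer from python.org and run it.",
    "PEP 8 describes the recommended style conventions for Python code.",
]
embeddings = model.encode(chunks).tolist()

# An in-memory Chroma client; a persistent client could be used instead.
client = chromadb.Client()
collection = client.create_collection(name="python_docs")

# Store the embeddings together with the raw text and some metadata.
collection.add(
    ids=["chunk-0", "chunk-1"],
    embeddings=embeddings,
    documents=chunks,
    metadatas=[{"topic": "installation"}, {"topic": "style"}],
)
```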
4. User Query and Retrieval
When a user submits a query such as “How do I install Python?”, the query is converted into an embedding using the same model that embedded the raw text data. The vector store then compares the query embedding with the stored embeddings and identifies the most relevant documents.
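A sketch of the retrieval step using FAISS (assuming the faiss-cpu and sentence-transformers packages; the documents are placeholders):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Python was created by Guido van Rossum and first released in 1991.",
    "To install Python, download the installer from python.org and run it.",
    "PEP 8 describes the recommended style conventions for Python code.",
]

# Build the index; normalised embeddings make inner product equal to cosine similarity.
doc_embeddings = model.encode(documents, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Embed the user query with the SAME model and retrieve the top matches.
query = "How do I install Python?"
query_embedding = model.encode([query], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_embedding, 2)

retrieved = [documents[i] for i in ids[0]]
print(retrieved)
```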
5. Answer Generation
The retrieved documents are then passed to the LLM, which synthesizes a detailed answer based on the information provided. For instance, the LLM might generate a step-by-step guide on Python installation using the relevant retrieved documents.
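A sketch of this final step, assuming the chunks retrieved above and access to the OpenAI chat API; the model name is only an example, and any instruction-tuned LLM could be substituted:

```python
from openai import OpenAI

# Chunks returned by the retriever in the previous step (placeholders here).
retrieved = [
    "To install Python, download the installer from python.org and run it.",
    "PEP 8 describes the recommended style conventions for Python code.",
]
query = "How do I install Python?"

# Stuff the retrieved context into the prompt so the LLM can ground its answer.
context = "\n".join(f"- {chunk}" for chunk in retrieved)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; use whichever LLM you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Instructing the model to answer “using only the context below” is what keeps the generated answer grounded in the retrieved documents rather than in the LLM’s internal knowledge alone.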
Advantages of RAG
- Real-Time Information: RAG allows for real-time information retrieval, making it possible to handle queries with up-to-date or domain-specific content.
- Dynamic Context: By using external knowledge bases, RAG can go beyond the limitations of an LLM’s internal training data.
- Efficiency: Vector stores enable quick retrieval, and using embeddings ensures the retrieval process is both relevant and contextually accurate.
Conclusion
Retrieval-Augmented Generation (RAG) is a powerful approach that extends the capabilities of traditional language models. By incorporating retrieval mechanisms and vector embeddings, RAG provides a way to generate real-time responses based on both internal and external knowledge. This flexibility makes it particularly useful in situations where the context required for text generation is beyond the scope of the LLM’s training data.
If you’re working on NLP applications that require real-time and domain-specific knowledge retrieval, RAG might be the solution you’re looking for.
Further Reading and Resources
For more detailed code examples and practical applications of RAG, check out the resources shared below.
- https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval/
- https://python.langchain.com/v0.2/docs/tutorials/rag/
- https://huggingface.co/learn/cookbook/en/advanced_rag
- https://huggingface.co/blog/ray-rag
- https://wildestimagination.dev/unlock-postgresql-data-with-llamaindex/
A video on the theory is available at: