Retrieval-augmented generation (RAG) helps large language models (LLMs) deal with large amounts of input data by preselecting relevant information from the source material, giving the model a smaller, more specific chunk of text to base its answers on. However, the larger context windows of newer, more powerful LLMs allow significant amounts of text to be added as context to a query (up to 700,000 words in some cases), so it might seem that the technologically more complex RAG is no longer necessary. The question arises: Is that really so?
What is RAG?
Retrieval-augmented generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by dynamically incorporating external data during the text generation process. In simple terms, a RAG system searches the source material for fragments relevant to the query and passes only those fragments to the LLM, so the model can answer without wading through everything that is irrelevant. We have written about RAG as a tool to curb LLM hallucinations, but more generally, this technique allows AI systems to provide more accurate, up-to-date, and contextually relevant responses by combining the full processing power of pretrained language models with real-time information retrieval.
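To make this concrete, here is a minimal sketch of the retrieval step in Python. It assumes the sentence-transformers library for embeddings, and the `ask_llm` helper is a hypothetical stand-in for whichever LLM API you actually use; a production RAG system would add document chunking, a vector database, and prompt templates on top of this.

```python
# Minimal RAG sketch: retrieve the most relevant chunks, then pass them to an LLM.
# Assumes the sentence-transformers package; `ask_llm` is a hypothetical stand-in
# for whatever LLM client you actually use.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our premium plan includes 24/7 support and a 99.9% uptime SLA.",
    "Invoices are issued on the first business day of each month.",
    "The API rate limit is 1,000 requests per minute per key.",
]

# Embed the document chunks once; at query time only the query is embedded.
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k document chunks most similar to the query."""
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ query_embedding  # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

query = "What is the API rate limit?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# answer = ask_llm(prompt)  # hypothetical LLM call
print(prompt)
```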
In practical terms, RAG enables the creation of a broader range of systems where users can upload large sets of documents and ask natural-language questions about their content. This capability is particularly valuable for businesses dealing with vast, dispersed datasets, as it allows for precise answer retrieval even for complex queries. By combining the strengths of information retrieval and language generation, RAG represents a significant advancement in AI-powered information processing and question-answering systems.
What is an LLM context window?
Before diving into the comparison, it is essential to understand what a context window in a large language model (LLM) actually is. An LLM is trained on a certain amount of text and can, of course, answer queries by drawing on its training data—hence its usefulness for answering general knowledge questions. However, an LLM can also provide answers based on information supplied as part of the query, which is arguably more useful in most professional applications. This information provided alongside the query is called the context, and the context window is the maximum amount of text that an LLM can process and consider at one time when generating responses or performing tasks.
The size of this window is typically measured in tokens. A token is a unit of text that the model processes, which can be as short as a single character or as long as a word. On average, one token corresponds to about four characters of English text, or roughly three-quarters of a word. The context window size determines how much information the model can “see” and use to inform its responses. For instance, a model with a 128,000-token context window can process approximately 96,000 words at once—equivalent to a sizable novel. In contrast, a model with an 8,000-token context window—a limit typical of even advanced models such as the original GPT-4 or Llama 2—may struggle to answer questions about longer passages (information from the middle of the text is often lost, as shown by the “needle-in-a-needlestack” test).
Thus, a larger context window allows an LLM to maintain coherence over long passages, understand complex narratives, and draw connections between distant parts of a text. This capability is particularly valuable for tasks such as document summarization, long-form content generation, and in-depth analysis of extensive datasets. This is especially important in professional applications where an LLM is used to retrieve information from a user’s proprietary dataset (such as product documentation, internal document base, etc.).
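To ground the token arithmetic above (roughly four characters, or three-quarters of a word, per token), the snippet below compares that rule of thumb with an exact count from OpenAI’s tiktoken tokenizer. Other model families use their own tokenizers, so the exact numbers will vary.

```python
# Rough token estimate vs. an exact count with OpenAI's tiktoken tokenizer.
import tiktoken

text = "Retrieval-augmented generation supplements an LLM's prompt with retrieved passages."

# Heuristic: ~4 characters per token in English text.
estimated_tokens = len(text) / 4

# Exact count for OpenAI-style models using the cl100k_base encoding.
encoding = tiktoken.get_encoding("cl100k_base")
exact_tokens = len(encoding.encode(text))

print(f"Heuristic estimate: ~{estimated_tokens:.0f} tokens")
print(f"tiktoken count:      {exact_tokens} tokens")
```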
Large context windows in LLMs
Recent advancements in large language model technology have led to a significant expansion of context window sizes. Let us review some of the latest developments in publicly available LLMs:
- Llama 3 (Meta) currently offers a context window of only 8,000 tokens. However, Meta has announced plans to expand this in the near future.
- Mistral AI models offer various context window sizes, with some versions supporting up to 32,000 tokens. Mistral has been making waves in the AI community thanks to its efficient and powerful models.
- GPT-4 (OpenAI) originally had a context window of 8,000 tokens (later expanded to 32,000), which was considered large when it was first introduced but has since been surpassed by newer models.
- GPT-4o (OpenAI) provides a context window of 128,000 tokens, a significant increase from its predecessors. This expanded context allows for more comprehensive document analysis and generation tasks.
- Claude 3.5 (Anthropic) offers a context window of up to 200,000 tokens in its developer preview version. This massive context window allows the model to process and understand entire books or large collections of documents in a single query.
- Gemini 1.5 Pro (Google) boasts a context window of one million tokens in its limited preview, equivalent to around 700,000 words. It is the large language model with the biggest context window size available.
As scaling LLMs to ever larger sizes becomes increasingly untenable, expanding the context window is another way of increasing the usefulness of generative AI, alongside solutions such as small language models or Meta AI’s extended-training-time approach.
However, it is worth noting that larger context windows are often available only in premium or limited-access versions of the models, as is the case with the current leader, Gemini 1.5 Pro. Many publicly available versions, such as Llama 3, GPT-4, and GPT-3.5, still operate with a smaller context window, which may limit their practical applicability for certain tasks. Moreover, simply having a larger context window does not guarantee optimal performance: the model must be trained to use this expanded context effectively, which presents its own set of challenges in terms of computational resources and training methodologies.
Large context window vs. RAG: comparison
In principle, a larger context window in a large language model (LLM) can serve as an alternative to retrieval-augmented generation (RAG) by allowing the model to retain and process more information at once. While RAG retrieves relevant context from external sources to supplement a model’s knowledge, a larger context window enables an LLM to hold more text within its immediate attention span. This expanded capacity allows the model to directly access and use a greater amount of relevant information without an external retrieval mechanism. As a result, it can potentially handle more complex tasks, maintain coherence over longer sequences, and draw connections between distantly related pieces of information more effectively. However, this does not come without costs. So, the question remains whether the development of larger context windows will indeed make RAG obsolete. Let us look at some specific factors.
Cost
Large context windows: Processing large amounts of text through expansive context windows can be costly. LLM providers typically charge based on the number of tokens in both the input (including context) and the generated output. For applications requiring frequent queries with large context windows, this can lead to significant expenses. For a business that analyzes thousands of documents daily, costs rise sharply when full-size documents are sent as prompt input to a large context window.
RAG: While setting up a RAG pipeline requires initial investment in infrastructure and data preparation, it can be more cost-effective in the long run. RAG allows for targeted retrieval of relevant information, potentially reducing the overall number of tokens processed by an LLM. The cost-effectiveness of RAG becomes apparent in scenarios where specific information needs to be retrieved frequently. Instead of processing an entire document each time, RAG can pinpoint the relevant sections, significantly reducing the number of tokens processed and, consequently, reducing the cost.
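A rough back-of-envelope calculation illustrates the difference. The figures below (price per million input tokens, document sizes, query volumes) are placeholder assumptions rather than vendor quotes, so plug in your own numbers.

```python
# Back-of-envelope comparison of monthly input-token costs:
# sending whole documents vs. sending only RAG-retrieved chunks.
# The price per million tokens below is a placeholder, not a real vendor quote.
PRICE_PER_MILLION_INPUT_TOKENS = 5.00   # USD, illustrative assumption
DOCS_PER_DAY = 1_000
TOKENS_PER_FULL_DOC = 50_000            # e.g., a long report pasted into the prompt
TOKENS_PER_RAG_CONTEXT = 2_000          # a handful of retrieved chunks
DAYS_PER_MONTH = 22

def monthly_cost(tokens_per_query: int) -> float:
    """Estimate the monthly input-token bill for one query per document."""
    total_tokens = tokens_per_query * DOCS_PER_DAY * DAYS_PER_MONTH
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"Full documents in the prompt: ${monthly_cost(TOKENS_PER_FULL_DOC):,.0f}/month")
print(f"RAG-retrieved chunks only:    ${monthly_cost(TOKENS_PER_RAG_CONTEXT):,.0f}/month")
```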
Latency
Large context windows: Loading a large context into the model can introduce latency, especially for real-time applications. However, once loaded, the model can quickly access any part of the context. The latency issue becomes particularly noticeable in interactive applications where users expect quick responses. For instance, a customer service chatbot using a large context window might introduce a noticeable delay when switching between topics that require loading new context.
RAG: The retrieval step in RAG introduces some latency, but modern vector databases and efficient indexing can minimize this. For smaller queries, RAG may offer lower latency compared to loading a large context window. RAG systems can be optimized for speed by using efficient retrieval algorithms and preprocessing techniques. This makes them particularly suitable for applications requiring real-time or near-real-time responses, such as live customer support or interactive data analysis tools.
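To see why the retrieval step adds so little overhead once the heavy lifting has been done at ingestion time, here is a small timing sketch. The corpus embeddings are random stand-ins; in a real system they would come from an embedding model and live in a vector database or a dedicated index.

```python
# Illustration of retrieval latency once embeddings are precomputed.
# The corpus embeddings are random stand-ins here; in a real system they come
# from an embedding model at ingestion time and live in a vector database.
import time
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 384)).astype(np.float32)   # 100k chunk embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.standard_normal(384).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = corpus @ query                      # cosine similarity against every chunk
top_k = np.argpartition(scores, -5)[-5:]     # indices of the 5 best-matching chunks
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Searched 100,000 chunks in {elapsed_ms:.1f} ms; top chunks: {sorted(top_k.tolist())}")
```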
Context understanding when generating answers
Large context windows: LLMs with large context windows can maintain a better understanding of the entire input, potentially leading to more coherent and contextually relevant responses across a long document or multiple documents. This capability is particularly valuable for tasks that require a holistic understanding of extensive content, such as summarizing long research papers, analyzing complex legal documents, or generating comprehensive reports based on multiple sources. How much of the conversation history the LLM remembers likewise depends on how many tokens fit into its context window.
RAG: RAG excels at providing prompt and precise answers based on the most relevant retrieved information. However, it may sometimes lack the broader context that a large context window provides. The strength of RAG lies in its ability to quickly pinpoint and utilize the most relevant data for a given query. This makes it particularly effective for question-answering systems, where direct, prompt, and accurate responses are prioritized over broader contextual understanding.
Data security and confidentiality
Large context windows: Sending large amounts of potentially sensitive data to an external LLM raises security concerns. While providers offer security measures, the risk of data exposure remains a consideration. For businesses dealing with confidential information, such as proprietary research or sensitive customer data, the need to send entire documents to an external LLM service can be a significant drawback. Even with robust security measures in place, the mere act of transmitting this data externally may violate compliance requirements or internal security policies. The looming data wall (i.e., AI developers running out of training data) also raises the concern that proprietary user data could be used to train commercial LLMs (even if their developers promise not to).
RAG: RAG offers better control over data security, as sensitive information can be kept in-house. This makes it a preferred choice for organizations with strict data privacy requirements. With RAG, they can keep their sensitive data within their own secure systems, be that on-premises or (more likely) in a secure cloud environment. The retrieval component can be implemented within this more secure domain, ensuring that only the most relevant and nonsensitive portions of data are sent to the external LLM for processing. This approach aligns well with data protection regulations and can help mitigate the risks associated with data breaches or unauthorized access.
Customization possibilities
Large context windows: Customization is limited to prompt engineering and the fine-tuning options provided by the LLM service. While larger context windows offer impressive out-of-the-box performance, the ability to tailor the system in response to specific domain knowledge or unique business requirements is somewhat limited. Users are largely dependent on the pretrained knowledge and capabilities of the model.
RAG: With RAG, there are extensive customization possibilities. Organizations can curate their knowledge bases, implement custom retrieval algorithms, and fine-tune the entire pipeline to their specific needs. The flexibility of RAG allows businesses to incorporate their proprietary data, industry-specific knowledge, and unique processes into the AI system. This level of customization can lead to more accurate and relevant outputs, especially in specialized domains or for company-specific use cases.
Transparency and debugging
Large context windows: It can be challenging to understand why a model has produced a particular output when working with very large contexts. Debugging and tracing the source of information become more complex, and the “black box” nature of large language models becomes even more pronounced. If an error or unexpected output occurs, it can be nearly impossible to pinpoint which part of the large input context influenced the problematic response.
RAG: In contrast, RAG provides clear traceability of information sources. It is easier to understand which documents influenced the model’s response, making debugging and improving the system more straightforward. This transparency is crucial for building trust in AI systems, especially in regulated industries or high-stakes decision-making processes. With RAG, it is possible to audit the exact pieces of information that led to a particular output, facilitating easier error correction and system refinement.
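A simple way to get this traceability is to label each retrieved chunk with its source before building the prompt, and to log those labels alongside the answer. The chunk IDs and the `ask_llm` helper in the sketch below are illustrative assumptions, not part of any specific product.

```python
# Sketch of source traceability in a RAG prompt: each retrieved chunk is labeled
# with its origin, and the model is asked to cite those labels in its answer.
# `retrieved` is illustrative; in practice it comes from your retrieval step.
retrieved = [
    {"id": "policy_handbook.pdf#p12", "text": "Refunds are issued within 14 days of a return."},
    {"id": "faq.md#refunds", "text": "Refunds are sent to the original payment method."},
]

context = "\n".join(f"[{chunk['id']}] {chunk['text']}" for chunk in retrieved)
question = "How long do refunds take and where are they sent?"

prompt = (
    "Answer the question using only the sources below and cite the source IDs "
    "in square brackets after each claim.\n\n"
    f"Sources:\n{context}\n\nQuestion: {question}"
)

# answer = ask_llm(prompt)  # hypothetical LLM call
print(prompt)  # the retrieved IDs can also be logged alongside the answer for auditing
```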
Conclusion
After careful consideration, it is clear that retrieval-augmented generation (RAG) is not becoming obsolete in the face of larger context windows. Instead, these technologies are likely to coexist and complement each other in the AI ecosystem. RAG continues to offer key advantages in cost-effectiveness, performance, data security, customization, transparency, and the ability to incorporate up-to-date information without retraining. Meanwhile, large context windows excel in scenarios requiring broad understanding of long documents.
If you need assistance while building your own RAG solution, we are here to help. At Fabrity, we have a range of generative AI projects in our portfolio, each at different stages—from proofs of concept (PoCs) and pilot tests to full deployments. Here is how we build custom AI-powered solutions:
- We analyze your business requirements.
- You gather the necessary data (technical and internal documentation, knowledge bases, customer support data, FAQs, etc.).
- We prepare the necessary Azure infrastructure.
- We build a custom PoC for you.
- We test and optimize it for performance.
Once the necessary data or documentation is prepared, we can develop your custom AI-powered solution within 2–3 weeks, provided there are no special requirements. Please send us an email at sales@fabrity.pl, and we will reach out to discuss all the details.