Retrieval-augmented generation (RAG) has become the de facto way to customize large language models (LLMs) with proprietary information. However, RAG comes with upfront technical costs and can be slow. Now, thanks to advances in long-context LLMs, organizations can bypass RAG by including all of their private information in the prompt.
A new study from National Chengchi University in Taiwan suggests that by using long-context LLMs and caching techniques, you can build custom applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and efficient alternative to RAG in enterprise settings where the knowledge corpus can fit within the model's context window.
The limitations of RAG
RAG is an effective way to handle open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents relevant to the request and adds them as context to enable the LLM to formulate more accurate responses.
However, RAG introduces several limitations for LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can harm the retrieval process.
Overall, RAG adds complexity to an LLM application, requiring the development, integration, and maintenance of additional components. This added overhead slows down the development process.
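Conceptually, a RAG pipeline adds a retrieval step in front of generation. The sketch below illustrates that flow in the abstract; the `retriever` and `llm` objects are hypothetical stand-ins rather than a specific framework.

```python
# A minimal illustration of the RAG flow described above.
# "retriever" and "llm" are hypothetical objects, not a specific library.
def rag_answer(question: str, documents: list[str], retriever, llm, k: int = 3) -> str:
    # 1. Retrieval step: score every document against the question and keep the top-k.
    top_docs = retriever.top_k(question, documents, k=k)
    # 2. Generation step: prepend the retrieved passages to the prompt as context.
    context = "\n\n".join(top_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.generate(prompt)
```

Every extra component in this chain (the retriever, the chunking logic, the ranking step) is something that must be built, tuned, and maintained.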
Cache-augmented generation
An alternative to developing a RAG pipeline is to insert the entire document corpus into the prompt and have the model select the bits that are relevant to the request. This approach eliminates the complexity of the RAG pipeline and the problems caused by retrieval errors.
However, there are three main challenges with front-loading all documents into the prompt. First, long prompts will slow down the model and increase inference costs. Second, the length of the LLM's context window sets a limit on the number of documents that fit into the prompt. Finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its answers. So, simply stuffing all of your documents into the prompt instead of choosing the most relevant ones can end up hurting the model's performance.
The proposed CAG approach makes use of three main trends to overcome these challenges.
First, advanced caching techniques are making it faster and cheaper to process prompt templates. The premise of CAG is that the knowledge documents will be included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens in advance instead of doing so when receiving requests. This upfront computation reduces the time it takes to process user requests.
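To make the pre-computation concrete, here is a minimal sketch of the idea using Hugging Face transformers: the knowledge documents are run through the model once, their key-value (attention) cache is saved, and each incoming question reuses that cache. The model name and document files are placeholders, and this illustrates the general technique rather than the authors' implementation.

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and documents; any decoder-only causal LM works the same way.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# The knowledge that would otherwise be fetched by a RAG retriever.
knowledge = "\n\n".join(open(p).read() for p in ["doc1.txt", "doc2.txt"])
knowledge_ids = tokenizer(knowledge, return_tensors="pt").input_ids.to(model.device)

# Pre-compute the attention (KV) cache for the knowledge tokens once, offline.
with torch.no_grad():
    knowledge_cache = model(knowledge_ids, use_cache=True).past_key_values


def answer(question: str, max_new_tokens: int = 128) -> str:
    """Greedy decoding that reuses the pre-computed knowledge cache."""
    # Copy the cache because the forward pass can extend it in place.
    past = copy.deepcopy(knowledge_cache)
    next_ids = tokenizer(
        f"\n\nQuestion: {question}\nAnswer:",
        return_tensors="pt",
        add_special_tokens=False,
    ).input_ids.to(model.device)

    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(next_ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            if token.item() == tokenizer.eos_token_id:
                break
            generated.append(token.item())
            next_ids = token
    return tokenizer.decode(generated, skip_special_tokens=True)
```

At request time, only the question tokens need to be processed, because the expensive part of the prompt has already been encoded into the cache.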
Leading LLM providers such as OpenAI, Anthropic, and Google offer prompt-caching features for the repetitive parts of your prompt, which can include the knowledge documents and instructions that you insert at the beginning of the prompt. With Anthropic, you can reduce costs by up to 90% and latency by 85% on the cached parts of your prompt. Equivalent caching features have been developed for open-source LLM-hosting platforms.
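As an illustration, here is roughly what prompt caching looks like with Anthropic's Python SDK. The model ID and document are placeholders, and the exact parameters can vary between SDK versions, so treat this as a sketch and check the current documentation.

```python
import anthropic

client = anthropic.Anthropic()
knowledge = open("policy_manual.txt").read()  # placeholder knowledge document

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model ID
    max_tokens=512,
    system=[
        {"type": "text", "text": "Answer questions using the manual below."},
        # Mark the long, repeated knowledge block as cacheable so follow-up
        # requests reuse its processed form instead of re-reading it.
        {"type": "text", "text": knowledge, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "What is the refund window?"}],
)
print(response.content[0].text)
```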
Second, long-context LLMs are making it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, while GPT-4o supports 128,000 tokens and Gemini up to 2 million tokens. This makes it possible to include multiple documents or entire books in the prompt.
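Before committing to this approach, it is worth counting tokens to see whether your document set actually fits. The sketch below uses OpenAI's tiktoken tokenizer and GPT-4o's 128,000-token window as an example; the file names are placeholders.

```python
import tiktoken

# o200k_base is the encoding used by GPT-4o; other models use other tokenizers.
enc = tiktoken.get_encoding("o200k_base")

docs = [open(p).read() for p in ["handbook.md", "faq.md"]]  # placeholder files
total_tokens = sum(len(enc.encode(d)) for d in docs)
print(f"{total_tokens} tokens; fits in GPT-4o's 128k window: {total_tokens < 128_000}")
```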
Finally, advanced training methods are enabling models to do better retrieval, reasoning, and question answering over very long sequences. In the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench, and RULER. These benchmarks test LLMs on hard problems such as multiple retrieval and multi-hop question answering. There is still room for improvement in this area, but AI labs continue to make progress.
As new generations of models continue to expand their context windows, they will be able to address larger bodies of knowledge. Furthermore, we can expect models to continue to improve their abilities to extract and use relevant information from long contexts.
“These two trends will significantly extend the usability of our approach, enabling it to handle more complex and diverse applications,” the researchers write. “Consequently, our methodology is well positioned to become a robust and versatile solution for knowledge-intensive tasks, leveraging the growing capabilities of next-generation LLMs.”
RAG vs. CAG
To compare RAG and CAG, the researchers ran experiments on two widely recognized question-answering benchmarks: SQuAD, which focuses on context-aware Q&A over single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents.
They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model itself determine which passages to use to answer the question. Their experiments show that CAG outperformed both RAG systems in most situations.
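The sketch below illustrates the difference between the two setups (it is not the authors' code): a BM25 retriever that keeps only the top-scoring passages versus a CAG-style prompt that preloads every passage. The rank_bm25 package and the placeholder passages are assumptions for illustration.

```python
from rank_bm25 import BM25Okapi

passages = ["passage one ...", "passage two ...", "passage three ..."]  # placeholder benchmark passages
question = "example question?"                                          # placeholder benchmark question

# RAG-style: retrieve only the top-3 passages for the question.
bm25 = BM25Okapi([p.lower().split() for p in passages])
top_passages = bm25.get_top_n(question.lower().split(), passages, n=3)
rag_prompt = "\n\n".join(top_passages) + f"\n\nQuestion: {question}\nAnswer:"

# CAG-style: preload (and cache) every passage and let the model pick what it needs.
cag_prompt = "\n\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"
```

In the RAG setup, answer quality hinges on whether the retriever surfaces the right passages; in the CAG setup, everything is already in context, at the cost of a longer (but cacheable) prompt.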

“By pre-loading the entire context from the test set, our system eliminates retrieval errors and ensures comprehensive consideration of all relevant information,” the researchers wrote. “This advantage is particularly evident in scenarios where RAG systems may retrieve incomplete or irrelevant paragraphs, generating suboptimal answers.”
CAG also significantly reduces the time needed to create an answer, especially as the length of the reference text increases.

However, CAG is not a panacea and should be used with caution. It is well suited to settings where the knowledge base does not change often and is small enough to fit into the model's context window. Companies should also be wary of cases where their documents contain conflicting facts depending on their context, which could confuse the model during inference.
The best way to determine whether CAG is a good fit for your use case is to run a few experiments. Fortunately, CAG is very easy to implement and should always be considered as a first step before investing in more development-intensive RAG solutions.