Why is it Relevant?
Retrieval-Augmented Generation (RAG) is foundational for using LLMs in contexts that require specific, up-to-date information, and it is essential whenever answers must be grounded in large, private data sets.
It's the technology behind solutions like Bing Chat and expert systems such as our own KalAI.
Despite ever-increasing context window sizes, RAG remains crucial for deriving insights from large data sets and for answer quality. Mega-prompts often lead to LLMs losing track of details, especially in the middle of the context (see Lost in the Middle: How Language Models Use Long Contexts). By narrowing the prompt to the relevant context, RAG mitigates this problem.
What is it?
RAG involves performing a lookup for relevant information from a larger data set before prompting an LLM.
The most common approach uses vector embeddings to index the data: in an initial indexing step, or as an ongoing process while the data changes, all text from the relevant data set is converted into embedding vectors.
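As a minimal sketch of this indexing step, the snippet below chunks documents and embeds each chunk. The embed function here is a deliberately crude stand-in (a hashed bag of words) so the example runs on its own; in practice it would call one of the embedding models listed further down, and chunking would typically split on sentences or tokens with some overlap.

```python
import hashlib


def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for a real embedding model: a crude hashed bag-of-words vector.
    A production system would call a dedicated embedding model instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def chunk(document: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; real pipelines usually split on sentences,
    paragraphs, or tokens and add some overlap between chunks."""
    return [document[i:i + size] for i in range(0, len(document), size)]


# Index: embed every chunk of every document once, or incrementally as the data changes.
documents = ["..."]  # placeholder for the private data set
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]
```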
When the user enters a question, the prompt is converted into an embedding vector as well, which is then used to retrieve the k nearest neighbours via a similarity search.
There are both exhaustive (k-nearest neighbours, or kNN) and approximate, more performant search algorithms; the latter usually trade some retrieval quality for speed.
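Continuing the sketch above (and reusing its embed function and index), exhaustive kNN retrieval is simply a cosine-similarity scan over all stored vectors; approximate indexes such as HNSW or IVF replace this scan with faster data structures.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0


def knn(query: str, index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Exhaustive k-nearest-neighbour search: embed the query and rank every
    indexed chunk by cosine similarity to it."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```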
Finally, the results of the similarity search, i.e. the text chunks from the data set that are most relevant to the user's question, are provided to the LLM as context. The LLM then uses this information to answer the question.
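To close the loop, the retrieved chunks are pasted into the prompt. The sketch below reuses the knn function and index from above and assumes the OpenAI Python SDK (v1+) with an API key in the environment; the model name and prompt template are illustrative choices, and any chat-capable LLM works the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer(question: str) -> str:
    # Retrieve the most relevant chunks and hand them to the model as context.
    context = "\n\n".join(knn(question, index, k=3))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; substitute any chat model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```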
What Technical Solutions are There?
While there is a large number of text embedding models, the most commonly used today include the following (an example call is sketched after the list):
OpenAI’s ada-002 and text-embedding-3
Cohere's embed-english-v3.0
Google's text-embedding-gecko
Open source models such as E5
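For example, a hosted model such as OpenAI's text-embedding-3 can replace the toy embed function from the sketch above with a single API call (this assumes the OpenAI Python SDK; other providers expose very similar endpoints):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(text: str) -> list[float]:
    """Embed a single piece of text with text-embedding-3-small (1536 dimensions);
    batching several inputs per call is usually cheaper and faster."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding
```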
Another crucial building block is the efficient storage and retrieval of the vector embeddings, usually handled by a vector database or a vector-capable search engine. Some of the more popular options are (in no particular order; a pgvector query is sketched after the list):
Pinecone
Pgvector for Postgres
Redis
AWS OpenSearch
Azure AI Search
Elasticsearch
Milvus
Qdrant
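As one concrete example, with Pgvector the index lives in an ordinary Postgres table and retrieval is a SQL query. The sketch below is a minimal illustration, assuming psycopg 3, a local database named ragdb with the pgvector extension available, and 1536-dimensional embeddings.

```python
import psycopg  # psycopg 3; assumes a reachable Postgres instance with pgvector installed


def top_k_chunks(query_embedding: list[float], k: int = 5) -> list[str]:
    """Fetch the k stored chunks whose embeddings are closest to the query
    embedding by cosine distance (pgvector's <=> operator)."""
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg.connect("dbname=ragdb") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks "
            "(id bigserial PRIMARY KEY, text text, embedding vector(1536))"
        )
        rows = conn.execute(
            "SELECT text FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (literal, k),
        ).fetchall()
    return [text for (text,) in rows]
```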
Limitations
Naïve embedding search only returns passages that are semantically similar to the input; it cannot answer high-level questions such as summarizing the data set or identifying its common topics, because the relevant information is spread across many chunks rather than concentrated in a few.
Additionally, turning text into an embedding is a form of lossy compression: a fixed-size vector, commonly 1536 dimensions, limits how much semantic meaning it can capture.
Conclusion
Retrieval-Augmented Generation is a pivotal technology for leveraging large language models in environments where specific, current information is vital. By focusing on relevant context and using advanced embedding models and retrieval solutions, RAG overcomes the limitations of traditional prompting methods. However, it has its own set of constraints, such as the inability to handle high-level questions and the semantic limitations of embeddings.
Outlook
In an upcoming article, we will explore more sophisticated approaches that address these limitations. Stay tuned to learn how these innovations are pushing the boundaries of what's possible with AI-driven information retrieval.