Developing a Virtual Assistant with RAG: Challenges, Biases, and Lessons Learned
A practical account of integrating the RAG method and language models into a virtual assistant, overcoming technical and ethical challenges to provide accurate and relevant responses.
Alejandro López Correa
10/9/2024 · 5 min read


A year ago, I became interested in the RAG method to enhance the capabilities of LLMs. To learn by doing, I developed a small website featuring a virtual assistant designed to answer questions about a very specific subject: the doctrine of the Catholic Church as recorded in the Catechism. This project, although short in duration (two weeks), presented interesting challenges due to its specific domain, which demands a high level of accuracy in its responses.
For those unfamiliar, RAG stands for Retrieval Augmented Generation. The basic idea is to divide a series of documents into small text fragments, which are indexed by converting them into numerical representations (vectors) that preserve their semantic meaning. This step is carried out with a specific AI model, and these numerical representations are called embeddings. Once a document is described by a set of numbers, two documents can be compared mathematically to obtain a quantitative measure of their similarity. Both the text fragment and its associated vector are stored in a database. When a query is received, it is encoded in the same way and compared with the embeddings stored in the database. High similarity implies a likely relationship between the two texts: the user query and the stored fragment. Using this procedure, the fragments most similar to the user’s question are found, and everything (question and fragments) is sent to the LLM to generate a response that draws on the reference material.
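To make the loop concrete, here is a minimal sketch of retrieve-then-generate, assuming the OpenAI Python client (v1.x) and cosine similarity over in-memory vectors. The model names, prompts, and structure are illustrative, not the exact code used in the project.

```python
# Minimal RAG sketch: embed fragments, retrieve by cosine similarity, ask the LLM.
# Illustrative only; model names and prompts are assumptions, not the project's exact code.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-ada-002"):
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

# 1) Offline: encode the reference fragments once and keep the vectors.
fragments = ["Fragment 1 of the reference text...", "Fragment 2..."]
fragment_vectors = embed(fragments)

def retrieve(question, k=3):
    """Return the k fragments most similar to the question."""
    q = embed([question])[0]
    sims = fragment_vectors @ q / (
        np.linalg.norm(fragment_vectors, axis=1) * np.linalg.norm(q)
    )
    return [fragments[i] for i in np.argsort(sims)[::-1][:k]]

def answer(question):
    """2) Online: send the question plus the retrieved fragments to the LLM."""
    context = "\n\n".join(retrieve(question))
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the reference fragments provided."},
            {"role": "user", "content": f"Fragments:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```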
This project was interesting for several reasons:
It was necessary to determine how to divide the document into meaningful fragments for better interpretation.
The bias of both the LLM and the embedding model needed to be controlled.
User query relevance needed verification.
All of this had to be done while keeping the operational cost acceptable.
Regarding the last point, to balance the cost per query, I decided to use GPT-3.5 for its good balance between capability and cost, and Ada-002 for embeddings. Today, these models are obsolete, but they were good options a year ago. One lesson learned from this is that the rapid advancement of models should be factored into the design of any project, for instance by making it easy to re-encode all texts in the database with a newer, more powerful or cheaper embedding model. OpenAI tends to raise the cost of using models marked as obsolete, so updating the project periodically may be beneficial. Any project that depends on an external API should consider these aspects.
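One way to keep that flexibility is to record which model produced each vector and to funnel all encoding through a single function, so that re-encoding the whole database with a newer or cheaper model is a one-parameter change. The sketch below assumes a hypothetical storage interface; it is not the project's actual code.

```python
# Sketch of keeping the embedding model swappable; the storage layer is hypothetical.
from openai import OpenAI

client = OpenAI()
EMBEDDING_MODEL = "text-embedding-ada-002"  # single place to change the model

def encode(texts, model=EMBEDDING_MODEL):
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

def reencode_all(store, new_model):
    """Re-embed every stored fragment with a newer (or cheaper) model."""
    for row in store.all_fragments():          # hypothetical storage interface
        vector = encode([row.text], model=new_model)[0]
        store.update_vector(row.id, vector, model=new_model)
```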
Verifying the relevance of the user’s query is crucial for at least two reasons. First, it protects the image of the service: answers outside the defined scope are avoided, and problematic queries are discarded in a controlled manner rather than being left to the LLM, which might respond unpredictably in such cases. The scope of a project like this is closed, and all behavior is designed and tested to work well within that scope, so irrelevant queries must be discarded. Second, early discarding also helps manage operational costs by ending unwanted queries quickly. I solved this with an initial filter that used only the embedding model, which is much cheaper per token than the LLM. The filter compares the user query with a wide range of predefined queries and only accepts it if it is similar to one of them. Thanks to the power of embeddings, a wide variety of different expressions retain similar semantic characteristics in terms of the similarity of their numerical vectors. This technique can be refined in various ways, and it is even possible to develop a useful (albeit primitive) chatbot using only embeddings, without an LLM.
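A sketch of such a filter is shown below, assuming a hand-written list of in-scope example queries and a similarity threshold tuned during testing; the example queries and the threshold value are placeholders, not the ones used in the project.

```python
# Sketch of an embedding-only relevance filter; example queries and threshold are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

IN_SCOPE_EXAMPLES = [
    "What does the Catechism say about baptism?",
    "How is the Eucharist explained?",
    "What is the Church's teaching on prayer?",
]
THRESHOLD = 0.85  # tuned by hand during testing; this value is only a placeholder

def _embed(texts):
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

EXAMPLE_VECTORS = _embed(IN_SCOPE_EXAMPLES)

def is_relevant(user_query):
    """Accept the query only if it is close enough to some predefined in-scope query."""
    q = _embed([user_query])[0]
    sims = EXAMPLE_VECTORS @ q / (
        np.linalg.norm(EXAMPLE_VECTORS, axis=1) * np.linalg.norm(q)
    )
    return float(sims.max()) >= THRESHOLD
```

Because this step calls only the embedding model, an out-of-scope query can be rejected before any LLM tokens are spent.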
The preprocessing of reference material (in this case, the text of the Catechism of the Catholic Church, although adding similar texts like the Denzinger was also considered) is a key part of the RAG project. If very large fragments are encoded, specific semantic details are lost, and more tokens are used in the LLM query. On the other hand, very small fragments lose context and thus do not provide enough information to give a good response. Another consideration is that to improve response quality, it is essential that the input text is of high quality and does not include elements like section titles, which can “confuse” the LLM. Some approaches to preprocessing input material divide the text into fragments of identical size or by other criteria without considering the semantic aspect. I chose to use the natural division of the text into “numbers,” which may contain one or several paragraphs and heterogeneous content (like biblical quotes or poems). One goal of the project was to provide literal quotes from the reference text, so arbitrarily cutting texts was unacceptable. The process of obtaining and preprocessing the text required considerable effort. Once it was clean and well-formatted, I encoded each fragment with Ada-002 and stored everything in a database. This stage only needs to be done once, after which it is ready to provide material for user queries.
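As an illustration of that one-off indexing stage, the sketch below splits a cleaned plain-text file into its numbered units with a regular expression, embeds each unit, and stores everything in SQLite. The file name, the numbering pattern, and the schema are assumptions about the cleaned source material, not the project's actual pipeline.

```python
# One-off indexing sketch: split the cleaned text into numbered units, embed, store.
# File name, numbering pattern, and schema are assumptions about the source material.
import json
import re
import sqlite3
from openai import OpenAI

client = OpenAI()

text = open("catechism_clean.txt", encoding="utf-8").read()
# Assume each unit starts with its number at the beginning of a line, e.g. "1746 ..."
units = re.split(r"\n(?=\d{1,4}\s)", text)

db = sqlite3.connect("fragments.db")
db.execute("CREATE TABLE IF NOT EXISTS fragments (id INTEGER PRIMARY KEY, text TEXT, vector TEXT)")

for i in range(0, len(units), 100):          # embed in batches to limit request size
    batch = units[i:i + 100]
    response = client.embeddings.create(model="text-embedding-ada-002", input=batch)
    for unit, item in zip(batch, response.data):
        db.execute("INSERT INTO fragments (text, vector) VALUES (?, ?)",
                   (unit, json.dumps(item.embedding)))
db.commit()
```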
Another critical aspect was addressing the bias of the LLM (GPT-3.5). There is much talk about using LLMs to analyze documents and answer questions based on them, but the inherent bias that LLMs receive during training and alignment is not mentioned as often. Though it may be tempting, one cannot assume their objectivity, especially on sensitive topics. In reality, an LLM behaves more like a human assistant with its own set of preconceptions and idiosyncrasies. It has been trained with a massive amount of material collected from the internet, and it tends to offer an average of what it has learned. After this, the alignment stage directs the system to comply with the set of guidelines chosen by the training team. The challenge lies in imposing superficial corrections (alignment) on deep-seated impulses (the result of base training), which can be problematic. These deep impulses may often be contradictory, as they are derived from texts written by millions of different people. This can lead to inconsistencies in the model’s behavior, similar to internal conflicts, as it tries to reconcile various influences while adapting to specific guidelines during the alignment stage. Perhaps this is one of the more subtle causes of the so-called “hallucinations.” After all, as far as I have read, fine-tuning can also degrade model quality. In any case, recognizing and mitigating this is essential to maximize the utility of the responses: despite all their flaws, LLMs are incredibly useful tools with a potential that we are only beginning to explore.
Various techniques can be used to mitigate bias. Some are more complex, such as evaluating the quality of responses using a model specifically trained for it or fine-tuning a base LLM to align it with the reference material and project goals. Other more direct approaches include iteratively refining prompts to gradually correct the biases found during development and testing or querying the LLM to evaluate the quality of the response it has already given. Each technique involves different costs both in the development phase and in production. In this project, I decided to refine the LLM’s prompt until the quality of the responses was satisfactory when faced with sensitive questions.
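In practice this amounted to iterating on the system prompt. The wording below is only an example of the kind of constraint that gets tightened over successive tests with sensitive questions; it is not the prompt actually used in the project.

```python
# Illustrative generation call with a system prompt that constrains answers to the
# reference material; the exact wording used in the project is not reproduced here.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an assistant that answers questions about the Catechism of the "
    "Catholic Church. Base your answer strictly on the fragments provided, "
    "quote them literally when relevant, and do not add opinions or doctrine "
    "that does not appear in them. If the fragments do not cover the question, "
    "say so instead of guessing."
)

def generate(question, fragments):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Fragments:\n" + "\n\n".join(fragments)
                                        + "\n\nQuestion: " + question},
        ],
        temperature=0,  # lower temperature reduces variability on sensitive questions
    )
    return completion.choices[0].message.content
```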
This was a very interesting project. Acquiring theoretical knowledge is an important step, but only when it is applied in a real project does one gain experience that can be transferred to similar projects. The world of ideas is separate from the real world, and the latter presents its own challenges that require finding pragmatic solutions to create a functioning product that meets economic requirements and provides a useful service. As a multidisciplinary developer, I have personally tackled numerous software projects in very diverse areas throughout my life. I abandon most of them once I have obtained the practical knowledge I was interested in, as if to say, “and the rest is left as an exercise for the reader.” It is not feasible to dedicate the resources needed to turn every project into a product ready to offer a service to third parties, but it is undeniable that completing a project, addressing all its details, provides a deeper and more comprehensive perspective that cannot be achieved otherwise.