Enhancing Agentic RAG: Optimizing Hybrid Retrieval for an Autonomous Academic Advising Agent

Author

Cheyanne Allred-Lopez (Advisor: Dr. Cohen)


Introduction

Universities are increasingly offering more diverse courses and specialized degree programs, allowing students to tailor their coursework to their interests or career goals. This subsequently increases the complexity of students’ advising needs, increasing the load on the university’s academic advisers. Advising demand peaks during specific periods, including registration, drop/add periods, and in the days following final grade issuance. This creates a bottleneck, as there are a fixed number of advisers available to address an increasing number of students’ needs. While many universities publish academic policies and degree requirements online, students may wish to verify the applicability of the information to their specific academic situation.

While chatbots have long existed to help serve customers’ requests, their performance has been limited by reliance on static information, an inability to synthesize information across many sources, a lack of context-specific reasoning, and the inability to generate natural responses. Recent advancements in large language models (LLMs), including the ability to dynamically access and synthesize information across multiple sources, understand context, autonomously adjust behavior, and generate conversational responses, present an opportunity to develop an intelligent academic advising assistant. This assistant could augment existing advisers by responding to routine questions and providing personalized guidance, consequently allowing human advisers to focus on higher-complexity requests.

Off-the-shelf LLMs have increasingly large parametric knowledge bases, though their knowledge is strictly limited to the data they were exposed to during training. These models do not have access to any information published after their training cutoff date. Despite this, models may attempt to answer a question about information that they were not trained upon, resulting in hallucinated responses (Ji et al. 2023). Furthermore, generated responses do not have source information, leaving users unable to verify the accuracy of responses. These limitations are unacceptable in an academic setting. Academic policies, degree requirements, and program offerings change regularly. An intelligent advising assistant must be able to access the most up-to-date information at all times, providing grounded, verifiable responses with clear citations.

Retrieval-augmented generation (RAG) addresses these limitations by enabling access to non-parametric knowledge, or information retrieved from external knowledge stores at the time of the user’s query rather than from within the model’s parametric knowledge (Lewis et al. 2020). This expands the model’s memory and significantly reduces hallucinations (Ji et al. 2023). The process is as follows: A pre-trained retrieval model encodes the user’s query and uses Maximum Inner Product Search (MIPS) to identify the top-K most relevant documents from the document index. The pre-trained generation model then crafts its response based on both the user’s original query and the retrieved documents. This architecture overcomes the knowledge-cutoff date problem present in LLMs. The model’s non-parametric knowledge can be updated simply by encoding and upserting the updated information to the document index. This is a more practical option, as it is computationally less expensive, less time-consuming, and less technically complex than retraining the generation model.

It is also a more effective architecture, as Lewis et al. (2020) demonstrate that the RAG architecture outperforms state-of-the-art models on knowledge-intensive tasks by supplementing the generation model’s parametric knowledge. Additionally, this architecture provides transparency. Whereas parametric knowledge is stored within the model’s training weights, information obtained using a retriever can be directly cited, improving the verifiability of responses.

Since 2020, the RAG architecture has evolved significantly. Gao et al. (2024) systematically reviewed over 100 RAG-related studies, identifying three technological paradigms: Naive RAG, Advanced RAG, and Modular RAG. While they document the numerous optimization techniques that can improve RAG system performance, this review focuses on techniques relevant to the development of an agentic academic adviser.

Naive RAG is identical to the architecture championed by Lewis et al. (2020), though Gao et al. (2024) note that naive RAG systems have three key limitations: retrieval of irrelevant documents, reliance on parametric memory leading to hallucinated answers, and inefficient information synthesis across multiple sources. Advanced RAG implements both pre-retrieval and post-retrieval optimization techniques to overcome the shortcomings of naive RAG. Pre-retrieval optimizations include query optimization strategies, document-index improvements, and embedding optimizations. Query optimization strategies are employed to refine the user’s intent before retrieval and include techniques such as utilizing an LLM to break down, expand, or rewrite a user’s query. Indexing optimization focuses on how data is structured. Strategies include recursively splitting a document using its structure and overlapping context between chunks, preventing context truncation. Additionally, indexes can utilize hierarchical structures, knowledge graphs, or embedded metadata to capture relationships between data chunks, increasing context to improve the relevancy of retrieved information. The authors’ comprehensive review identifies multiple embedding optimization techniques, including hybrid indexing and embedding model fine-tuning. Fine-tuning a model can be computationally expensive and technically rigorous, limiting its applicability. Hybrid indexes, however, are a more accessible alternative, utilizing both sparse and dense vector representations to capture both exact keyword matching and contextual meaning, respectively.

Sawarkar, Mangal, and Solanki (2024) provide empirical evidence of hybrid indexes’ superior performance when compared to state-of-the-art models. The authors explore indexing techniques to improve the retrieval effectiveness of RAG systems, including standard sparse indexing, sparse encoder-based vector models, and dense indexing. Standard sparse indexing, colloquially known as keyword-matching, assigns relevance based upon how many times a keyword appears in a document relative to the entire collection. Sparse-encoder based indexes improve upon this by assigning relevancy based upon learned contextual importance, utilizing an underlying neural network. Some models, including the ELSER model used by Sawarkar, Mangal, and Solanki (2024), enhance the document or query with semantically similar words prior to vectorization, improving the precision of keyword-matches. Finally, dense indexes compare the vectorized form of the user’s query to a collection of vectorized documents, selecting the top-K documents using cosine similarity. This method is efficient for capturing context. Benchmarking using the NDCG@10 metric, the authors demonstrated that semantic-based hybrid methods consistently outperform both state-of-the-art baselines and keyword-based hybrid methods. The Sparse Encoder-based semantic search, when combined with the ‘Best Fields’ query, demonstrated superior performance, even when compared to RAG pipelines specifically fine-tuned for those datasets. These findings are particularly relevant in an academic advising context, where a user’s query may require contextual understanding (benefiting from dense embeddings) or exact keyword matching (benefiting from sparse embeddings).

Gao et al. (2024) identify that robust RAG systems should be able to reject documents which are contextually-relevant but lack substantive information. They should also refrain from generating an answer when the returned results are not useful, thus minimizing hallucinations. RAG systems should also be able to synthesize information across multiple documents and identify known inaccuracies. To ensure systems possess these capabilities, Gao et al. (2024) emphasize that RAG systems inherently have two distinct objectives: retrieval and generation. Because the quality of the retrieved context fundamentally dictates the generation quality, the components should be evaluated separately.

To address the limitations of traditional end-to-end evaluation, Salemi and Zamani (2024) propose eRAG, a method for evaluating retrieval quality independently. eRAG uses an LLM to individually process each document returned by the retriever and generate a utility score based upon its ability to answer the user’s query using only that document’s content. The authors demonstrate that this method of evaluation is highly computationally efficient, reducing runtime complexity by avoiding the quadratic cost penalty applied by standard transformers to long inputs.

Upon evaluation, the authors found that eRAG is 2.5 times faster and significantly more memory-efficient than traditional end-to-end evaluation systems. eRAG also consistently outperformed more traditional systems, including the KILT benchmark and Relevance Annotation with LLM. This technique is directly applicable to the development of an intelligent advisor, as isolating the retriever enables greater visibility into how different indexing and chunking optimizations impact the system’s performance.

While Naive and Advanced RAG pipelines are linear in flow, Modular RAG introduces more flexible architectures capable of altering this flow. Building upon the foundational principles of Naive RAG and Advanced RAG, Modular RAG systems aim to improve context, flexibility, and overall efficiency by enabling recursive and adaptive loops. Recursive retrieval initializes with a complete iteration; however, it then allows the generator LLM to judge the relevance of the most recent results. The LLM may choose to recursively perform the RAG process, applying query improvement techniques as needed to improve the relevancy of results. Adaptive retrieval utilizes fine-tuned LLMs to intelligently decide if the RAG process is necessary for a given prompt, increasing computational efficiency (Gao et al. 2024).

Agentic AI systems possess autonomous planning and decision-making capabilities, making these systems perfect candidates for implementing both Modular and Advanced RAG techniques. Jaggavarapu (2025) highlights several core architectural components of agentic systems relevant to a RAG-enabled academic advisor. In the context of this study, an agent’s perception module maps directly to its ability to apply query optimization, where the agent enhances or decomposes a query to avoid ambiguous scenarios, which may impact retrieval performance. An accurate, fast, and easy-to-maintain knowledge representation system is enabled by access to a well-designed RAG knowledge base, utilizing Advanced RAG’s index and embedding optimization techniques. The strength of hierarchical design is echoed by both Gao et al. (2024) and Jaggavarapu (2025).

Optimizing an agentic system’s planning and reasoning framework can be achieved by selecting an underlying LLM which possesses hierarchical decomposition and strong adaptive capabilities. For an Agentic RAG system, the agent utilizes action execution mechanisms (tools) to interact with the knowledge base. Utilizing a ReAct framework (Yao et al. 2023), the agent can autonomously decide whether to execute the RAG pipeline by calling the RAG tool, depending upon the user’s query. Furthermore, the agent can recognize when retrieved results are non-substantive, opting to re-retrieve results with an enhanced or rewritten query. These capabilities are a natural extension of Modular RAG techniques. Finally, an agentic architecture adds an additional layer of error handling, applying reasoning to process any error messages returned by tool calls (Jaggavarapu 2025). The agent can choose to re-execute an action mechanism with the same or different arguments, call a different tool, or return the error to the user in natural language. These capabilities increase the system’s overall robustness.

The reviewed literature comprehensively documents the RAG architecture and its recent advancements, including Naive RAG, Advanced RAG, Modular RAG, and Agentic RAG. In these studies, performance was systematically evaluated using common benchmarking datasets and metrics; however, limited research evaluates the application of Advanced and Modular RAG techniques in agentic systems, specifically one employed in a domain-specific application.

This research aims to bridge that gap by implementing techniques from both Advanced RAG and Modular RAG in a single-agent context. Expanding upon the recommendations of both Gao et al. (2024) and Sawarkar, Mangal, and Solanki (2024), this study implements a hybrid index utilizing both dense and sparse-encoded embeddings. The documents are chunked first on structural hierarchy, then on chunk size, with a 20% context overlap. Rule-based and LLM-enabled intelligent filtering are applied to remove noisy and non-substantive chunks from the knowledge base. Finally, the content is enriched with metadata to improve retrieval precision. By isolating and evaluating the retriever independently, this study identifies the optimal data-preparation and indexing strategies required to build a reliable agentic academic advisor.

Methodology

A robust retrieval-augmented generation (RAG) system is composed of the following components: user interface, knowledge base, retriever, and generator. In this study, the user interfaces with the RAG system via a command-line interface (CLI) AI agent, built using the LangGraph framework. The knowledge base is a Pinecone hybrid index, which retrieves contextually relevant results using the dot product similarity metric. These retrieved contexts are then passed to the generation LLM, which processes the information and generates a response. The response is returned to the user through the agentic interface. Figure 1 illustrates the user’s end-to-end workflow at a high-level.

Figure 1: End-to-End RAG System Architecture and Query Execution Flow

The following methodology sections detail the system’s implementation, including design decisions, relevant libraries, API calls, and system improvements.

Knowledge Base Development

The RAG pipeline’s knowledge base is composed of content from the University of West Florida’s Public Knowledge Base and the UWF Department of Mathematics and Statistics’ 2024-2025 Graduate Student Handbook.

Data Ingestion Process

The University of West Florida’s Public Knowledge Base was scraped from the public Confluence Tree, whereas the Department of Mathematics and Statistics’ 2024-2025 Graduate Student Handbook was ingested as a PDF file. Both data sources were converted to Markdown files; however, the ingestion and pre-processing methods differ heavily between the two sources.

Confluence URL Tree to Markdown Conversion:

The UWF Public Knowledge Base hosts information relevant to students, faculty, and staff, covering topics such as course registration, academic advising, study abroad programs, and available software resources.

The content was scraped programmatically using the Confluence REST API, starting at the root page and traversing the tree using a depth-first search algorithm. Each page’s immediate children were obtained by calling the /rest/api/content/{id}/child/page endpoint, where id is the Confluence Page ID. The recursion continues until a leaf node is reached, beginning the page processing.
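The depth-first traversal described above can be sketched as a short recursive function. The `fetch_children` callable and the in-memory tree below are illustrative stand-ins for live calls to the `/rest/api/content/{id}/child/page` endpoint, not the study’s actual implementation:

```python
from typing import Callable, Iterator

def dfs_pages(page_id: str, fetch_children: Callable[[str], list]) -> Iterator[str]:
    """Depth-first traversal of a Confluence page tree.

    `fetch_children` wraps a GET to /rest/api/content/{id}/child/page
    and returns the immediate child page IDs. It is injected here so the
    traversal can be exercised without network access.
    """
    for child_id in fetch_children(page_id):
        yield from dfs_pages(child_id, fetch_children)
    # All children (if any) processed: emit this page for processing.
    yield page_id

# Illustrative in-memory tree standing in for the live API.
tree = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}
order = list(dfs_pages("root", lambda pid: tree.get(pid, [])))
print(order)  # ['a1', 'a', 'b', 'root'] — leaves processed first
```

Emitting pages post-order matches the description above, where processing begins once a leaf node is reached.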

Each page’s content is retrieved using an HTTP GET request, using the expand=body.storage,version parameter. The body.storage field returns Confluence’s internal XML storage format, rather than the rendered HTML. Rendered HTML is much more difficult to clean, as it includes display HTML unrelated to the page’s content. Confluence’s XML is heavily structured, increasing the ease with which the content can be parsed and cleansed. The version field provides additional metadata about the page, including the version number and last-modified timestamp. The payload also contains key metadata, including the Confluence ID and page title, which are extracted.

The BeautifulSoup Python library, with the lxml parser, was used to parse the storage-format XML into a simplified tree structure. Confluence macro components (e.g., <ac:parameter>), table-of-contents macros, images, and empty paragraphs were removed; text was unwrapped from layout wrapper tags (e.g., <ac:layout>). These elements are not relevant to the page’s content. Without removal, they act as noise, reducing RAG pipeline performance and inflating embedding, storage, and retrieval costs. Internal Confluence links were maintained, utilizing quote_plus to encode the URLs, and replacing the XML tags with standard HTML anchor tags. While internal links are not traversed, the URLs are kept as content. This allows the RAG generation module to process and return them to the user for more information. Finally, the cleansed HTML is converted by the Markdownify library into clean Markdown.

Each page’s ancestors are tracked throughout the recursion process. The relevant metadata and ancestor chain are prepended to the cleaned Markdown content as YAML frontmatter, and the Markdown file is saved to the disk. Figure 2 illustrates the final cleaned Markdown file. The content has been truncated for illustration purposes.

Figure 2: Truncated Confluence Page Markdown File
PDF to Markdown Conversion:

The UWF Department of Mathematics and Statistics’ 2024-2025 Graduate Student Handbook includes information related to the MS in Mathematical Sciences, MS in Data Science, and adjoining certificate programs. Information includes program requirements, available courses, course schedules, and faculty profiles.

The PDF document was converted to Markdown using LangChain’s DoclingLoader class. By specifying the export_type=ExportType.MARKDOWN parameter, the PDF file was parsed and converted to Markdown content in a single API call. The DoclingLoader document loader returns one Document per page of the PDF. The returned Documents are joined and saved as a Markdown file. YAML frontmatter is not appended. Any necessary metadata is attached within the text chunking process.

More information about the sourced data can be found in the Data Exploration and Visualization section. Once ingested and converted to Markdown files, both document sources utilized the same chunking and indexing strategies illustrated in Figure 3 and detailed below.

Figure 3: High Level Overview of Document Embedding Process

Chunking Strategy

The corpus utilizes a hierarchical-recursive chunking strategy. Each page was split on headers, using LangChain’s MarkdownHeaderTextSplitter method. While initial iterations split on Levels 1 through 4 headers, the strategy was refined to split on Level 1 through 2 headers following baseline testing, preventing overly-granular chunks. Headers were retained within each chunk’s content, rather than stripped. This retains critical labels which add semantic context, increasing the likelihood of retrieval.

Then, LangChain’s RecursiveCharacterTextSplitter method was used to split each chunk recursively, with a 2000-character limit and a 400-character (20%) sliding window overlap. This ensures each chunk has adequate context, without obscuring specific details which may be needed for factual queries. The sliding window also helps to preserve context across split texts. This strategy is in alignment with industry best practices (Tuychiev 2025). Each chunk was assigned a custom, deterministic ID, using either the Confluence Page ID for pages scraped from the knowledge base or the document title for ad hoc documents.
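The sliding-window step can be approximated in pure Python for illustration. The study uses LangChain’s RecursiveCharacterTextSplitter, which additionally prefers to break on natural separators such as paragraphs; this minimal sketch splits on raw character counts only, using the 2000/400 figures from the text:

```python
def sliding_window_chunks(text: str, size: int = 2000, overlap: int = 400) -> list:
    """Split text into windows of `size` characters, with `overlap`
    characters of shared context between consecutive chunks (a 20%
    overlap at these defaults)."""
    if len(text) <= size:
        return [text]
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text) - overlap, step)]

chunks = sliding_window_chunks("x" * 5000)
print(len(chunks), [len(c) for c in chunks])  # 3 [2000, 2000, 1800]
```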

Each chunk was then filtered for noise, using both deterministic rules and LLM judgement. Rule-based filtering was applied where there was little risk of removing semantically relevant content. An LLM was called for chunks matching specific patterns (e.g., “This page contains”) that contained semantically-relevant but non-substantive content. Concrete examples of filtering can be found in the RAG Retriever Optimization section below.
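The two-tier triage can be sketched as follows. Only the “This page contains” pattern is taken from the study; the remaining rules and function names are illustrative assumptions, and the real pipeline’s rule set is more extensive:

```python
import re

# Illustrative rules where removal carries little risk of losing content.
RULE_PATTERNS = [
    re.compile(r"^\s*$"),          # empty or whitespace-only chunks
    re.compile(r"^#+\s*\S+\s*$"),  # header-only chunks with no body text
]
# Patterns that flag a chunk for LLM review rather than outright removal.
LLM_REVIEW_PATTERNS = [re.compile(r"This page contains", re.IGNORECASE)]

def triage_chunk(text: str) -> str:
    """Return 'drop', 'review', or 'keep' for a candidate chunk."""
    if any(p.match(text) for p in RULE_PATTERNS):
        return "drop"
    if any(p.search(text) for p in LLM_REVIEW_PATTERNS):
        return "review"  # escalate to the LLM judge
    return "keep"

print(triage_chunk("   "))                           # drop
print(triage_chunk("This page contains links to:"))  # review
print(triage_chunk("## Registration\nDeadlines are posted each term."))  # keep
```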

The applicable metadata was extracted from each chunk. For ad hoc documents, this includes the source document’s file name, the chunk ID, and the header hierarchy. For documents scraped from UWF’s Confluence tree, metadata also included the page’s path (for nested documents), URL, version number, and last updated date. These were attached to each chunk as metadata prior to upsertion, enabling easy filtering if needed. The metadata was also prepended for content enrichment, prior to vectorization, providing additional context for the retriever. Figure 4 below contrasts the chunk’s content before and after enrichment. The addition of the source path and header hierarchy provides additional context, which can be used to assess similarity. The version number and last updated date help the generation LLM to determine relevancy if it is provided with conflicting information.

Figure 4: Content Enrichment

Indexing Strategy

Vector stores support three different indexing strategies: dense, sparse, and hybrid. This knowledge base utilizes a hybrid index, as recommended by Gao et al. (2024) and empirically supported by Sawarkar, Mangal, and Solanki (2024). The hybrid index is composed of dense vectors, which capture semantic and contextual information, and sparse-encoded vectors. It should be noted that the chosen sparse-encoder model (pinecone-sparse-english-v0) does not expand the document or query with semantically-relevant words, maintaining quick upsertion speed. The model does, however, utilize an underlying transformer to assign relevancy based upon contextual importance. All vectors are stored in a unified index structure, meaning a single retrieval may return results which are both contextually relevant and precise. The document embedding process can be viewed in Figure 5.

Figure 5: Document Embedding Generation Process

Embedding Process

The gemini-embedding-001 model was selected to generate dense embeddings. At the time of selection, the model ranked high on the MTEB Leaderboard (Hugging Face n.d.). Additionally, the model integrates natively with the LangChain framework. The model uses the Matryoshka Representation Learning (MRL) technique to truncate output vectors without significant loss of information, as demonstrated by the MTEB scores (68.16 for 2048-dim vs. 67.99 for 768-dim) (Google n.d.). Consequently, the densely embedded vectors generated for the UWF knowledge base have a dimensionality of 768, balancing storage frugality with precision. Chunks were embedded in batch sizes of 100, adhering to Gemini’s API limit. Each batch upload was buffered by a 0.5-second sleep to avoid becoming rate limited.

The pinecone-sparse-english-v0 model was utilized to generate semantically enhanced sparse vectors. The model outputs a SparseEmbedding object consisting of indices and values. The indices map to specific entries in the model’s vocabulary, while the values represent the learned importance of those terms relative to the document’s meaning (Pinecone 2025). The number of non-zero elements in the returned vector roughly corresponds to the number of meaningful words contained within the text chunk. Chunks were embedded in batch sizes of 96, adhering to Pinecone’s API limit. Each embedding batch was buffered by a 12-second sleep to avoid hitting Pinecone’s separate 250,000 token per minute rate limit. The extended sleep allows 5 batches to be processed per minute. While this is significantly slower than the Gemini embedding process, it helps to avoid becoming rate limited by Pinecone.

To optimize retrieval performance, the dense and sparse models are instantiated with task-specific embedding instructions. This was implemented via the RETRIEVAL_DOCUMENT/RETRIEVAL_QUERY parameter for Gemini and the passage/query parameter for Pinecone (Pinecone Systems 2025; Google n.d.). These parameters allow the respective models to differentiate between generating document chunk embeddings (during the upsertion process) and generating user query embeddings (during the retrieval process).

Prior to upsertion, the vectors were normalized via Euclidean normalization using Python’s NumPy library. As demonstrated in Figure 6, raw sparse values are much larger than dense values. Without normalization, the sparse vector has a magnitude of 14.55, approximately 25 times larger than the magnitude of the dense embedding. This difference in magnitude would unintentionally cause the retriever to prioritize exact keyword matching over semantically relevant results during the hybrid search process. Euclidean normalization scales the magnitude of all vectors to 1.0, maintaining equal weighting during the similarity comparison.

Figure 6: Generated Vectors: Comparison of Magnitude Before and After Euclidean Normalization
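The normalization step itself takes only a few lines of NumPy. The vector values below are illustrative, not the ones shown in Figure 6:

```python
import numpy as np

def l2_normalize(values: np.ndarray) -> np.ndarray:
    """Scale a vector to unit Euclidean (L2) length."""
    norm = np.linalg.norm(values)
    return values / norm if norm > 0 else values

# Illustrative values only: raw sparse weights are typically far
# larger in magnitude than dense embedding components.
sparse_values = np.array([4.2, 7.9, 3.1, 9.6])
dense_values = np.array([0.031, -0.044, 0.019, 0.052])

print(np.linalg.norm(sparse_values))                # large raw magnitude
print(np.linalg.norm(l2_normalize(sparse_values)))  # 1.0
print(np.linalg.norm(l2_normalize(dense_values)))   # 1.0
```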

Vector Upsertion and Index Management

The system utilizes a single Pinecone hybrid index, configured with the “dotproduct” similarity metric. At the time of the index’s creation, “dotproduct” was the only similarity metric supported by Pinecone’s hybrid indexes, as it is the sole metric capable of combining scores from both dense and sparse vectors. The dense embeddings generated by the gemini-embedding-001 model are optimized for similarity comparison using the cosine similarity metric, which divides the dot product by the product of the Euclidean norms. \[\textbf{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}\]

Because all vectors are normalized prior to upsertion, their Euclidean norms equal 1. Thus, for the KB’s normalized vectors, the dot product is mathematically equivalent to cosine similarity, ensuring alignment with industry best practices (Pinecone Systems Inc. 2023). \[\textbf{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{1 \cdot 1} = \mathbf{a} \cdot \mathbf{b}\]
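This equivalence is easy to verify numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)

# Cosine similarity of the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the L2-normalized vectors.
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = a_hat @ b_hat

print(np.isclose(cosine, dot))  # True: identical up to float error
```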

To prepare the vectors for upsertion, the text chunks are aggregated with their corresponding dense and sparse embeddings. A record is constructed for each chunk, identified by the unique ID generated during preprocessing. The ‘id’ field is consequently removed from the chunk’s metadata to avoid redundancy. The chunk’s original content, which includes no enriched metadata, is also attached to the record. This allows the RAG generation LLM to access the original content upon retrieval, resulting in cleaner content returned to the user and mitigating the need for a secondary database lookup. Figure 7 illustrates the full Pinecone record schema.

Figure 7: Pinecone Record Schema

The record’s text metadata significantly increases the size of the record; thus, the vectors were upserted in batch sizes of 50, ensuring compliance with the API’s 2 MB request body limit (Pinecone n.d.).
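The batching logic reduces to a simple generator; in the real pipeline each batch would be passed to Pinecone’s upsert call, which is omitted here so the sketch stays self-contained:

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(records: Iterable, size: int) -> Iterator[list]:
    """Yield successive batches of at most `size` records."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

# e.g., 130 records in batches of 50. In the real pipeline each batch
# would be sent via index.upsert(...), keeping each request under 2 MB.
sizes = [len(b) for b in batched(range(130), 50)]
print(sizes)  # [50, 50, 30]
```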

Query Processing and Embedding

The retrieval process begins by transforming the user’s natural language query into vector representations. The RAG pipeline utilizes the same embedding models used during ingestion, ensuring alignment between the query vectors and the document vectors. The dense gemini-embedding-001 model’s task_type parameter was set to RETRIEVAL_QUERY, which optimizes the resulting vector for finding semantic matches, rather than representing document content. The sparse pinecone-sparse-english-v0 model was initialized with input_type="query". This optimizes the vector for short, interrogative inputs rather than longer passages. Both embedding types are normalized using the same Euclidean Normalization algorithm as the document base.

Hybrid Search and Scoring

Once the query vectors are generated, Pinecone’s query API is called on the knowledge base index, passing the normalized dense and sparse embeddings. No additional weighting is applied, ensuring no partiality is placed on semantic or exact-keyword results. The top_k parameter determines the number of results returned by the Pinecone API. The hybrid index calculates relevancy scores for both vector types simultaneously. This differs from traditional hybrid architectures, which require separate dense and sparse indexes. The final relevance score for a query and document is computed, and the top k results are returned.
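Conceptually, the unweighted hybrid score reduces to the sum of the dense and sparse dot products. The sketch below performs this combination client-side for illustration (Pinecone computes it server-side within the index), representing a sparse vector’s non-zero entries as a `{vocab_index: value}` map, as described earlier:

```python
import numpy as np

def hybrid_score(q_dense, d_dense, q_sparse: dict, d_sparse: dict) -> float:
    """Unweighted hybrid relevance: dense dot product plus the dot
    product of the sparse vectors. Vocabulary indices absent from a
    sparse map are implicitly zero."""
    dense = float(np.dot(q_dense, d_dense))
    sparse = sum(v * d_sparse.get(i, 0.0) for i, v in q_sparse.items())
    return dense + sparse

# Toy 2-dimensional dense vectors and tiny sparse maps, for illustration.
score = hybrid_score(
    q_dense=np.array([0.6, 0.8]), d_dense=np.array([0.8, 0.6]),
    q_sparse={101: 0.5, 407: 0.5}, d_sparse={101: 0.9},
)
print(round(score, 2))  # 1.41 = 0.96 (dense) + 0.45 (sparse)
```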

Agentic Generation

The UWF agentic academic advisor is built using LangGraph, a graph-based orchestration framework built on top of LangChain. While LangChain pipelines follow a purely linear flow, LangGraph provides the ability to loop, persist state between tool executions, incorporate human responses, and chain multiple agents together. OpenAI’s gpt-5 was chosen as the underlying LLM for its advanced reasoning capabilities. The advantage of an agentic RAG setup over a naive pipeline is that the agent’s underlying model can autonomously apply Advanced or Modular RAG techniques without requiring explicit infrastructure.

A ReAct (Reason-Act-Observe) agentic architecture was used. During the Reasoning phase, the agent’s underlying LLM receives the user’s input and decides which tool to call. It then acts, autonomously choosing whether to enhance the input or to pass it to the tool unchanged. Finally, it observes the tool’s output and decides whether to call the tool again, call a different tool, or end the graph’s execution. Figure 8 illustrates this agentic workflow.

Figure 8: Basic ReAct Architecture

The AI agent greets the user upon script execution. The user’s subsequent input triggers the next execution of the LangGraph state machine; the model decides which actions to take, executes them, and returns a response to the user, finishing that specific graph execution. This process repeats until the user explicitly ends the conversation with an exit command.
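The Reason-Act-Observe cycle can be sketched without any framework. The stub model and tool below are illustrative placeholders; LangGraph implements the same loop as a state graph with persisted messages:

```python
def react_loop(user_input: str, model, tools: dict, max_steps: int = 5) -> str:
    """Minimal Reason-Act-Observe loop. `model(messages)` returns either
    ("call", tool_name, tool_arg) or ("finish", answer). State is kept
    in a plain message list rather than a LangGraph state object."""
    messages = [("user", user_input)]
    for _ in range(max_steps):
        decision = model(messages)                 # Reason
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, tool_arg = decision
        observation = tools[tool_name](tool_arg)   # Act
        messages.append(("tool", observation))     # Observe
    return "Step limit reached."

# Stub model: call the search tool once, then answer from the observation.
def stub_model(messages):
    if messages[-1][0] == "user":
        return ("call", "perform_rag_search", "class recordings UWF Canvas")
    return ("finish", f"Based on retrieval: {messages[-1][1]}")

stub_tools = {"perform_rag_search": lambda q: f"3 chunks matching '{q}'"}
print(react_loop("How do I access class recordings?", stub_model, stub_tools))
```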

The advantages of utilizing an autonomous agent rather than a linear pipeline are demonstrated by the following real-world interaction. A user asks the agent for assistance accessing class recordings. This begins the graph’s execution. The LLM chooses to call the perform_rag_search tool. Critically, rather than passing the user’s original query, the LLM autonomously rewrites and enhances the query. The tool is called with the following string: “How do I access class recordings on Canvas at UWF? Panopto/Zoom lecture recordings in Canvas, where to find them, troubleshooting if not visible”.

Figure 9: Autonomous LLM Query Enhancement

The RAG pipeline executes this query, returning five relevant results. This can be seen in Figure 10.

Figure 10: Pinecone Retrieval #1: Raw Results

These results are appended to the agent’s state via a ToolMessage. The graph then loops back to the agent, which reads the updated state containing the retrieved contexts. After processing these results, the model determines that more specific information is needed and decides to re-call the perform_rag_search tool, again with an altered query. Notably, the model focuses the second query on “Panopto”, likely due to the prevalence of that term in the retriever’s first batch of results. The tool call and the results from the second retrieval process are detailed in Figure 11.

Figure 11: Pinecone Retrieval #2: Enhanced Query + Raw Results

Once again, the model accesses the tool results via its updated state. It observes the results and concludes that no additional tool calls are necessary. The model aggregates and synthesizes the retrieved contexts and generates a response. The graph’s execution ends, and the final AIMessage containing the generated response is displayed to the user (Figure 12).

Figure 12: Model Generated Response

Critically, through this single ReAct loop, the agent autonomously employed both Advanced RAG query optimization techniques and Modular RAG’s recursive retrieval without human intervention. This dynamic capability significantly improves the system’s accuracy and the user’s overall experience.
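The loop described above can be sketched in pure Python. This is an illustrative stand-in, not the project’s actual LangGraph implementation: the scripted model, the tool registry, and the message tuples are all hypothetical simplifications of the real graph state.

```python
def react_loop(llm_step, tools, user_query, max_steps=5):
    """Minimal ReAct loop: at each step the model inspects the full state
    (the message list) and either calls a tool -- possibly with a rewritten
    query -- or emits a final answer, ending the graph execution."""
    messages = [("user", user_query)]
    for _ in range(max_steps):
        action = llm_step(messages)
        if action["type"] == "final":
            return action["text"], messages
        result = tools[action["tool"]](action["input"])
        messages.append(("tool", result))  # akin to appending a ToolMessage
    return "Step limit reached.", messages

# Scripted stand-in for the LLM: rewrite the query, refine it once after
# seeing the first batch of results, then answer (mirrors the Panopto example).
def scripted_llm(messages):
    tool_turns = sum(1 for role, _ in messages if role == "tool")
    if tool_turns == 0:
        return {"type": "tool", "tool": "perform_rag_search",
                "input": "How do I access class recordings on Canvas at UWF?"}
    if tool_turns == 1:
        return {"type": "tool", "tool": "perform_rag_search",
                "input": "Panopto recordings in Canvas"}
    return {"type": "final", "text": "Recordings are found under Panopto in Canvas."}

answer, state = react_loop(
    scripted_llm,
    {"perform_rag_search": lambda q: f"retrieved contexts for: {q}"},
    "How do I access class recordings?",
)
```

The scripted model performs two retrievals before answering, reproducing the recursive-retrieval behavior observed in the real interaction.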

RAG Pipeline Performance Testing and Optimization

As encouraged by Gao et al. (2024), effective evaluation of a RAG system requires rigorous testing of individual components. In this study, performance evaluation was strictly scoped to the RAG retriever. Because accurate generation is fundamentally dependent upon the retrieval of highly relevant contexts, optimizing the knowledge base’s data ingestion, chunking, and indexing strategies was prioritized. Formal quantitative evaluation of the generation module (e.g., assessing faithfulness and answer relevancy) and model fine-tuning were considered out of scope for this project.

RAG Retriever Testing

Improving the quality of content returned from the vector store subsequently improves the entire pipeline’s performance; thus, great care was taken to optimize the RAG retriever. Retriever testing was performed using two independent 50-question datasets. The first dataset (Development Set) was used iteratively to tune the indexing and chunking strategies. The second 50-question dataset (Holdout Test Set) was strictly reserved for final performance evaluation.

The Ragas Python library was used to evaluate the retriever’s performance on two critical metrics: Context Precision and Context Recall (Ragas 2025c). Both metrics employ LLM-as-a-judge; OpenAI’s gpt-4o model was utilized, balancing strong reasoning capability against lower cost compared to newer models such as gpt-5. The model was invoked with temperature = 0.0 to keep the results as deterministic and repeatable as possible.

Precision focuses on the relevancy of the returned results and is computed as the ratio of relevant chunks to total chunks. Retrievals that return a higher proportion of relevant chunks consequently receive a higher score. Context Precision also accounts for the order in which results are retrieved. For each chunk, it multiplies the running precision at chunk k by a binary relevance indicator, \(v_k\), where 1 marks a relevant chunk and 0 an irrelevant one. These terms are then summed and divided by the total number of relevant chunks: \[\text{Context Precision} = \frac{\sum_{k=1}^{K}(\text{Precision@k} * v_k)}{\text{Total Number of Relevant Items in K Results}}\] (Ragas 2025a). For example, consider a retriever that returns 5 total results, 3 of which were deemed relevant by the LLM. The relevant chunks were returned at positions 1, 3, and 5. Overall precision is \(\frac{3}{5}\). The Context Precision is then: \[\frac{(\frac{1}{1} * 1) + (\frac{1}{2} * 0) + (\frac{2}{3} * 1) + (\frac{2}{4} * 0) + (\frac{3}{5} * 1)}{3} = 0.76 \]
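This rank-aware computation can be reproduced in a few lines. The sketch below is self-contained; in the full pipeline the binary relevance flags are produced by the gpt-4o judge rather than supplied directly.

```python
def context_precision(relevance):
    """Rank-aware Context Precision from a list of binary relevance flags
    (1 = relevant chunk, 0 = irrelevant), ordered by retrieval rank."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, v in enumerate(relevance, start=1):
        hits += v
        score += (hits / k) * v  # Precision@k contributes only at relevant ranks
    return score / total_relevant

# Worked example from the text: relevant chunks at positions 1, 3, and 5.
example = context_precision([1, 0, 1, 0, 1])  # ≈ 0.76
```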

Context recall is simpler in that it evaluates whether all of the relevant content, broken into individual claims, was retrieved (Ragas 2025b). \[\text{Context Recall} = \frac{\text{Number of ground truth claims supported by the retrieved contexts}}{\text{Total number of claims in the ground truth}}\] A higher score indicates that more of the ground truth was retrieved.

The RAG retriever evaluation process begins by generating embeddings for each query using the pinecone-sparse-english-v0 model (sparse-encoded) and gemini-embedding-001 (dense). This is consistent with the models utilized to develop the knowledge base. Then, the top-k results are retrieved, where k is specified at the time of testing. For a dataset with 50 questions, where k = 5 results were retrieved, this produces a set of 50 question / context sets, where each context set contains the k = 5 retrieved chunks. To evaluate the retriever’s performance, Context Precision and Context Recall are computed for each question and associated context set. Context Precision calls the Ragas LLM (gpt-4o) k times, one for each retrieved chunk, whereas Context Recall calls the Ragas LLM once per question. These individual metrics are recorded per question, then aggregated across the entire dataset. This process is then repeated five times per RAG pipeline design, documenting cross-run metrics (average and standard deviation) for comparison.

Figure 13: RAG Retrieval Evaluation Pipeline

The pipeline utilizes the Ragas v0.4 Collections API and the asyncio Python library to perform asynchronous Ragas metric computation. Though the pipeline makes k API calls per question to compute Context Precision, the calls are issued nearly simultaneously, significantly reducing the pipeline’s execution time. Because both Context Precision and Context Recall utilize LLM-as-a-judge, repeating the pipeline across multiple iterations is important for ensuring that the results are consistent and repeatable.
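The concurrency pattern might look like the following sketch, where `judge_chunk` is a keyword-matching stand-in for the actual Ragas/OpenAI judge call; the question and chunks are hypothetical examples.

```python
import asyncio

async def judge_chunk(question, chunk):
    """Stand-in for one LLM-as-a-judge relevance call (one per chunk)."""
    await asyncio.sleep(0)  # the real version awaits an OpenAI API response
    return int(question.split()[0].lower() in chunk.lower())

async def precision_verdicts(question, chunks):
    # Fire all k judge calls concurrently rather than one at a time.
    return await asyncio.gather(*(judge_chunk(question, c) for c in chunks))

verdicts = asyncio.run(precision_verdicts(
    "Panopto recordings in Canvas",
    ["Panopto videos appear in the Canvas course menu.",
     "Parking permits are purchased through MyUWF."],
))
```

With real API latency, `asyncio.gather` overlaps the k judge requests, so the wall-clock cost per question approaches that of a single call.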

The testing process is computationally intensive. Testing a dataset of 50 questions, retrieving k=5 results per question, requires 400 total API calls:

Table 1: API Call Breakdown.

| Number of API Calls | Model | Purpose |
|---|---|---|
| 50 | pinecone-sparse-english-v0 | Generate sparse embeddings |
| 50 | gemini-embedding-001 | Generate dense embeddings |
| 250 | gpt-4o | Compute Context Precision (k calls per question) |
| 50 | gpt-4o | Compute Context Recall (1 call per question) |

Running the pipeline five times for a single pipeline design makes 2000 total API calls: 500 calls to the embedding models and 1500 calls to the generative LLM.
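These totals follow directly from the per-question breakdown in Table 1 and can be sanity-checked with a small helper (illustrative only, not part of the pipeline code):

```python
def api_calls_per_run(n_questions=50, k=5):
    """One run: 2 embedding calls per question (sparse + dense),
    k judge calls for Context Precision, and 1 for Context Recall."""
    embedding_calls = 2 * n_questions
    judge_calls = n_questions * (k + 1)
    return embedding_calls + judge_calls

single_run = api_calls_per_run()     # 400 calls for the 50-question set
five_runs = 5 * api_calls_per_run()  # 2000 calls across five iterations
```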

RAG Retriever Optimization

Hierarchy Optimization

The RAG retriever pipeline’s performance was evaluated following every meaningful design change. Baseline testing indicated that splitting on header Levels 1 through 4 produced overly granular chunks, negatively impacting retriever performance. This was particularly prominent in procedural documents, where each step of a procedure was allocated its own chunk. If a procedure had more than k steps, where k is the number of retrieved chunks, the retriever could not return the full procedure. Figure 14 below displays one such case, a procedure explaining how to purchase a parking permit. The procedure has 11 steps; however, querying the RAG knowledge base for “How do I purchase a parking permit?” with k=5 returned only 5 of the 11 steps. The evaluated Context Precision was 0.0 and the Context Recall was 0.5.

Figure 14: Baseline Chunking Strategy: Fragmented Procedure Steps

Consequently, the chunking strategy was revised to split on Levels 1 and 2, better preserving context throughout procedural documents. This reduced the total number of upserted chunks from 5386 to 4091. This strategy also increased the retriever’s average Context Recall by 7 percentage points on the development dataset. Using the example above, the 11 instructional steps are now split over two chunks, rather than 11 chunks, allowing the retriever to return all steps in the procedure. This increased both Context Precision and Context Recall to 1.00. Figure 15 below shows a single context, capturing Steps #1 through #8 of the procedure.
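The revised strategy can be sketched as a minimal header splitter. This is a simplified pure-Python illustration, not the project’s actual splitter; the sample document and chunk schema are hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_on_headers(markdown, max_level=2):
    """Split a Markdown document at headers of level <= max_level; deeper
    headers (e.g. ### procedure steps) remain inside their parent chunk."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # pop deeper headers off the running path
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = """# Parking
Permit overview.
## Purchase a Permit
### Step 1
Log in to MyUWF.
### Step 2
Select a permit and pay.
## Appeals
File an appeal online."""

chunks = split_on_headers(doc)  # Steps 1-2 stay together in one chunk
```

With `max_level=2` the two procedure steps land in a single chunk under the “Purchase a Permit” path; with `max_level=4` (the baseline behavior) each step would become its own chunk.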

Figure 15: H1/H2 Chunking Strategy: Single Chunk

Text Chunk Filtering

Analysis of the development set results following the chunking strategy revision revealed chunks which contained irrelevant or non-substantive text. These chunks were being returned by the retriever due to the relevance of the prepended metadata. Rule-based filtering was applied where there was little risk of excluding chunks with substantive content. Filtering was applied prior to prepending the metadata; all headers were stripped, leaving only the chunk’s body. Chunks containing no body text, or those which contained only leftover structural content or placeholder text, were removed. This automated filtering removed 267 non-semantic chunks (~ 6.5%) from the vector store. The breakdown can be seen in Table 2 below.

Table 2: Rule-Based Filtering.

| Filter | Count |
|---|---|
| Empty Body | 192 |
| FAQ Placeholder | 47 |
| Only Hyphens | 11 |
| CMS Placeholders | 11 |
| Related Articles Placeholder | 6 |
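A rule-based pass of this kind might look like the following sketch. The trigger patterns are illustrative stand-ins for the actual filter rules, and the function name is hypothetical.

```python
import re

# Illustrative placeholder patterns -- stand-ins for the real rules, which
# matched leftover structural text from the scraped Confluence pages.
PLACEHOLDER_RES = {
    "faq_placeholder": re.compile(r"^Frequently Asked Questions$", re.IGNORECASE),
    "only_hyphens": re.compile(r"^[-\s]+$"),
    "cms_placeholder": re.compile(r"lorem ipsum", re.IGNORECASE),
    "related_articles": re.compile(r"^Related Articles$", re.IGNORECASE),
}

def classify_noise(body):
    """Return the matched filter name, or None if the chunk looks substantive.
    Run on the chunk body BEFORE metadata (title/path) is prepended."""
    stripped = body.strip()
    if not stripped:
        return "empty_body"
    for name, pattern in PLACEHOLDER_RES.items():
        if pattern.search(stripped):
            return name
    return None
```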

Further analysis of the retriever’s behavior revealed that “Overview” chunks were frequently returned. Many UWF Public KB pages include a Level 1 “Overview” section. Due to the hierarchical splitting strategy and content enrichment (e.g., Title, Path, etc.), the retriever frequently returned these chunks with the highest relevancy scores. While semantically relevant, they often lack the detail required to sufficiently answer a user’s query. Consequently, these sections displace truly relevant information, reducing both Context Precision and Context Recall. However, total exclusion of the “Overview” sections could eliminate semantically relevant results. As seen in Figure 16 below, some “Overview” sections contain new or non-summary information, meaning that elimination should be assessed on a case-by-case basis.

Figure 16: Overview Section: Substantive vs. Stub

The subjective nature of relevancy lends itself well to LLM usage; Claude Haiku 4.5 was utilized to determine the relevancy of these stubs, again balancing reasoning capability with cost while ensuring a different LLM family from the RAG evaluation model. The LLM was called only for chunks matching specific patterns frequently found within overview sections (e.g., “This page contains…”, “Here you will find…”), reducing total upsertion cost. For a corpus containing 4091 chunks, the LLM was called only 79 times (~1.93% of total chunks). The LLM prompt was tuned prior to vector store re-indexing, ensuring that the LLM’s evaluation of relevancy aligned with human evaluation. The full prompt can be found in Figure 17 below.
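The pattern-gated judging step could be sketched as follows. The trigger phrases and the keyword-based demo judge are illustrative stand-ins for the real patterns and the Claude Haiku 4.5 call; the chunk strings are hypothetical.

```python
import re

# Illustrative trigger phrases; the LLM judge is invoked only for chunks
# matching one of these, keeping judge calls to a small slice of the corpus.
STUB_TRIGGERS = re.compile(r"this page contains|here you will find", re.IGNORECASE)

def filter_overview_stubs(chunks, is_substantive):
    """Keep every chunk unless it matches a stub pattern AND the injected
    LLM judge (`is_substantive`) says it adds no new information."""
    kept, judge_calls = [], 0
    for chunk in chunks:
        if STUB_TRIGGERS.search(chunk):
            judge_calls += 1
            if not is_substantive(chunk):
                continue  # drop the non-substantive stub
        kept.append(chunk)
    return kept, judge_calls

# Keyword stand-in for the Claude Haiku 4.5 relevancy judgment.
demo_judge = lambda chunk: "deadline" in chunk
kept, calls = filter_overview_stubs(
    ["This page contains links to related articles.",
     "Here you will find the refund deadline: September 5.",
     "Step 1: Log in to MyUWF."],
    demo_judge,
)
```

Only the two stub-like chunks trigger a judge call; the procedural chunk is kept without any LLM involvement, which is what keeps the judging cost near 2% of the corpus.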

Figure 17: LLM Stub Filter Prompt

The following figures illustrate the difference in retrieval results before and after filtering and re-indexing. In Figure 18, 3 of the 5 retrieved results are non-substantive “Overview” sections. Another result is a leftover FAQ placeholder from the scraped Confluence KB. Only one retrieved result contributes substantive information, though it is not relevant to the question.

Figure 18: Pre-Filter Retrieval Results

Figure 19 below demonstrates the retrieved results once the knowledge base was re-indexed with filtered chunks. The returned chunks are both more granular and more relevant to the question at hand.

Figure 19: Post-Filter Retrieval Results

Analysis and Results

Data Exploration and Visualization

RAG Knowledge Base Dataset

The entire document corpus contains 788 files. A total of 787 files were obtained from scraping the UWF Public Confluence Tree, which contains relevant information for students, faculty, and staff, including topics such as course registration, academic advising and more. UWF’s Department of Mathematics and Statistics’ 2024-2025 Graduate Student Handbook was also uploaded as a PDF file. It contains information pertinent to the department’s graduate degree programs and certifications.

The initial approach utilized a hierarchical-recursive chunking strategy, splitting on header Levels 1 through 4. This resulted in a total of 5386 chunks. Following testing, the chunking strategy was amended to reduce chunk granularity, splitting on header Levels 1 and 2. Each subsequent improvement reduced the number of chunks. The final strategy implemented additional filtering to remove noisy and/or non-substantive chunks from the corpus. The final Pinecone hybrid index contained 3764 vectorized text chunks.

Table 3: Knowledge Base Chunking Strategy - Chunking Breakdown.

| Strategy | Number of Chunks |
|---|---|
| Baseline | 5386 |
| H1/H2 | 4091 |
| H1/H2 with Stub Filtering | 3764 |

Performance Evaluation Dataset

Two independent 50-question datasets were generated from the available corpus of documents for RAG pipeline performance evaluation. The first dataset (the development set) was used to tune the RAG pipeline, while the second was retained for final performance evaluation. Dataset generation was automated and randomized.

For each dataset, all available documents were loaded and filtered for a minimum length to ensure sufficient content for question generation. Fifty documents were then randomly sampled without replacement using Python’s random module. Claude Haiku 4.5 was used to generate a single natural-language QA pair per document. Limiting generation to one pair per document ensures broad topical coverage of the corpus and prevents over-representing any single document. Figure 20 below shows the prompt used to generate the question and ground truth pairs.
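The sampling step might look like the sketch below; the function name and the `min_chars` threshold are assumptions, and the LLM QA-generation call that follows sampling is omitted.

```python
import random

def build_eval_sample(documents, n=50, min_chars=500, seed=None):
    """Filter out documents too short for question generation, then sample
    n documents without replacement; each sampled document later yields
    exactly one LLM-generated QA pair."""
    eligible = [doc for doc in documents if len(doc) >= min_chars]
    return random.Random(seed).sample(eligible, n)

# Hypothetical corpus: 20 long documents plus one too short to sample.
docs = [f"Document {i}: " + "content " * 100 for i in range(20)] + ["too short"]
sample = build_eval_sample(docs, n=5, seed=1)
```

`random.sample` draws without replacement, so no document can contribute more than one question per dataset.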

Figure 20: LLM Prompt: Question / Ground Truth Generation

The Anthropic Claude family was explicitly chosen as it belongs to a different LLM family than the Ragas LLM (OpenAI GPT), limiting bias between QA generation and performance evaluation. Claude Haiku 4.5 was chosen from Anthropic’s available Claude models, as it has similar reasoning capabilities to Sonnet-4 but is much cheaper. The QA pairs were aggregated and exported to a .csv file, ready for use within the performance evaluation pipeline. This process was then repeated one more time to generate the second, independent dataset. An example of a generated question and ground truth pair can be seen in Figure 21 below.

Figure 21: Generated Question / Ground Truth Pair

RAG Pipeline Performance Analysis

Development Dataset Evaluation

Table 4 reports the results of iterative optimizations to the chunking and indexing strategies, generated by running the evaluation pipeline on the development dataset. Average Context Recall improved by 11 percentage points (0.78 to 0.89) between the baseline and final strategies. Average Context Precision also improved, though more modestly, with a net gain of 3 points (0.74 to 0.77).

Table 4: Testing Results - Development Dataset.

| Pipeline Design Iteration | Average Context Recall | Average Context Precision |
|---|---|---|
| Baseline | 0.78 ± 0.00 | 0.74 ± 0.01 |
| H1/H2 Chunking Strategy | 0.85 ± 0.00 | 0.74 ± 0.01 |
| H1/H2 + Stub Filtering | 0.89 ± 0.00 | 0.77 ± 0.01 |

Holdout Test Set Evaluation

To validate the improvements observed, the system was evaluated against the holdout testing set. Because the pipeline’s design was strictly tuned using the development set, performance on the holdout set represents the system’s true expected performance.

Table 5 details the final average Context Recall and Context Precision metrics across five independent runs.

Table 5: Testing Results - Holdout Dataset.

| Pipeline Design Iteration | Average Context Recall | Average Context Precision |
|---|---|---|
| Optimized Pipeline (H1/H2 + Stub Filtering) | 0.97 ± 0.00 | 0.83 ± 0.01 |

Analysis

Comparing the evaluation results on the development and holdout datasets, the pipeline exhibited strong generalization. On the holdout dataset, Average Context Recall increased by 8 percentage points (0.89 to 0.97) and Average Context Precision by 6 points (0.77 to 0.83). These results highlight the strengths of the chunking and indexing strategy improvements.

Analysis of the precision scores indicates that the retriever struggles with semantically ambiguous terms. For example, for the query “How do I enroll in the UWF Student Orientation eLearning course?”, the retriever returns chunks related to Dual Enrollment Orientation and a generic eLearning Canvas Orientation. Because the registration process is nearly identical across the programs, the retriever struggles to differentiate between them. Additionally, for procedural queries, such as “What do I need to do before UWF can issue my Form I-20 as an international graduate student?”, the correct answer was retrieved; however, the remaining four slots were filled with tangentially related policies, such as transferring a Form I-20 or instructions on submitting proof of identity. Because the ground-truth context was not returned as the top result, the question’s precision score degraded.

Agentic RAG Advantage

The Agentic RAG pipeline holds a clear advantage over a Naive RAG pipeline. An average Context Precision of 0.83 indicates that a portion of the top-k results is irrelevant. Though this study evaluated only the retriever’s performance, the end-to-end experience must be considered. The agentically enhanced generation module is capable of intelligent filtering, improving the results returned to the user. Additionally, for semantically ambiguous queries such as the orientation example above, an intelligent agent can process the retrieved results, determine relevancy, and decide to re-call the RAG tool with an enhanced query. These capabilities provide an inherent advantage over Naive RAG, whose performance may suffer on semantically ambiguous topics.

Conclusion

Declaration of Generative AI Use

Generative AI was used to edit this paper for grammar, clarity, and conciseness. It was also used to evaluate general logic flow. All original ideas, including research methodology, data analysis, and technical content, presented in this paper are my own.

References

Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. “Retrieval-Augmented Generation for Large Language Models: A Survey.” https://arxiv.org/abs/2312.10997.
Google. n.d. “Embeddings | Gemini API.” https://ai.google.dev/gemini-api/docs/embeddings.
Hugging Face. n.d. “MTEB Leaderboard.” https://huggingface.co/spaces/mteb/leaderboard.
Jaggavarapu, Manoj Kumar Reddy. 2025. “The Evolution of Agentic AI: Architecture and Workflows for Autonomous Systems.” Sarcouncil Journal of Multidisciplinary 5 (7): 418–27. https://doi.org/10.5281/zenodo.15876888.
Ji, Ziwei, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. “Survey of Hallucination in Natural Language Generation.” ACM Computing Surveys 55 (12): 1–38. https://arxiv.org/pdf/2202.03629.
Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” In Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, 33:9459–74. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.
Pinecone. n.d. “Upsert Records: Upsert in Batches.” https://docs.pinecone.io/guides/index-data/upsert-data#upsert-in-batches.
Pinecone. 2025. “Generate Embeddings API Reference - Sparse Embeddings.” https://docs.pinecone.io/reference/api/2025-10/inference/generate-embeddings#sparse-embedding.
Pinecone Systems. 2025. “Unlock High-Precision Keyword Search with Pinecone-Sparse-English-V0.” https://www.pinecone.io/learn/learn-pinecone-sparse/.
Pinecone Systems Inc. 2023. “Vector Similarity Explained.” https://www.pinecone.io/learn/vector-similarity/.
Ragas. 2025a. “Context Precision.” 2025. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/.
———. 2025b. “Context Recall.” Ragas Documentation. 2025. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/.
———. 2025c. “Ragas API References.” 2025. https://docs.ragas.io/en/stable/references/.
Salemi, Alireza, and Hamed Zamani. 2024. “Evaluating Retrieval Quality in Retrieval-Augmented Generation.” https://arxiv.org/abs/2404.13781.
Sawarkar, Kunal, Abhilasha Mangal, and Shivam Raj Solanki. 2024. “Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers.” In 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 155–61. https://doi.org/10.1109/MIPR62202.2024.00031.
Tuychiev, Bex. 2025. “Best Chunking Strategies for RAG in 2025.” Firecrawl Blog. https://www.firecrawl.dev/blog/best-chunking-strategies-rag-2025#understanding-chunking-trade-offs.
Yao, Shunyu, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. “ReAct: Synergizing Reasoning and Acting in Language Models.” In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2210.03629.