Parallel Query (Fan Out) Retrieval

Introduction:

Previously, we learned what RAG is and why Query Translation matters. Now we will look at a popular Query Translation technique: Parallel Query Retrieval.

What is Parallel Query Retrieval?

Some Backstory:

We know that RAG works over some context (documents, the web, or anything else that holds relevant data). Suppose we have given the RAG a file of Node.js documentation as its context, and the user asks:
“What is fs?”

As humans, we understand that the user wants to know about the “File System” in Node.js. But what if the Node.js documentation never uses the word “fs” and writes “file system” everywhere instead? Will the RAG’s similarity search find a match? And will it perform well?

Probably not, right? So we need to handle this case.

We can solve this problem with the following process:

The user asks the question.
We prompt the LLM to generate a few similar questions. Here, the questions might be:
1. What is fs?
2. What is the file system?
3. What is a file in Node.js?
4. How to create a file in Node.js?
We search the vector store (or whichever database holds the embedded documentation) for each generated question.
We collect the matching documents for each question. Suppose:
1. We find nothing for the question “What is fs?”
2. We find two matches for the question “What is the file system?” (call them yellow and blue)
3. We find one match for the question “What is a file in Node.js?” (blue)
4. We find three matches for the question “How to create a file in Node.js?” (blue, yellow, and red)
We then filter the results and keep only the unique documents.
Finally, we give the three unique documents (yellow, blue, and red) to the LLM as context and answer the user’s query.
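
The filtering step above can be sketched in a few lines. This is only a minimal sketch, not the code from this post: the `results_per_query` dictionary and the `unique_documents` function are made up for illustration, and the color names stand in for real document IDs.

```python
# Hypothetical retrieval results, mirroring the yellow/blue/red example above.
results_per_query = {
    "What is fs?": [],
    "What is the file system?": ["yellow", "blue"],
    "What is a file in Node.js?": ["blue"],
    "How to create a file in Node.js?": ["blue", "yellow", "red"],
}


def unique_documents(results):
    """Return each document once, in the order it was first seen."""
    seen = set()
    unique = []
    for documents in results.values():
        for document in documents:
            if document not in seen:
                seen.add(document)
                unique.append(document)
    return unique


context_documents = unique_documents(results_per_query)
print(context_documents)  # ['yellow', 'blue', 'red']
```

Even though “What is fs?” matched nothing, the other three questions still supply all three documents.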

🤔

Did we solve our problem? Can the LLM answer our question now?

💡

Yes, the problem is solved for now. The search found nothing for “fs”, but it did find matches for “file system”, so the LLM now has the context it needs to answer the user’s question, right?

❓

Is this enough?

Definition:

Parallel Query Retrieval, also known as Fan Out Retrieval, is a method where multiple variants of the same user query are created and sent in parallel to different or the same retrieval systems. The goal is to maximize recall and diversify the retrieved documents, ultimately helping the LLM generate more informed and accurate answers.

Why Fan Out?

The term “Fan Out” comes from systems design and networking. It means:

🌐

Spreading a single input into multiple parallel paths or processes.

In Parallel Query Retrieval, we take a single user input, generate multiple queries from it, and spread them across multiple paths, just like “fanning out” the queries. It’s like having 4 or 5 experts answer the same question. Interesting, isn’t it?
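
To make the “parallel paths” idea concrete, here is a small sketch that sends several queries at the same time using Python’s standard `ThreadPoolExecutor`. It is an illustration under assumptions: `fake_search` and its tiny index are hypothetical stand-ins for a real vector store lookup, and `fan_out` is not a function from the full code.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_search(query):
    """Stand-in for a real vector store lookup (hypothetical data)."""
    index = {
        "What is the file system?": ["yellow", "blue"],
        "What is a file in Node.js?": ["blue"],
    }
    return index.get(query, [])


def fan_out(queries, search_fn, max_workers=4):
    """Run search_fn for every query concurrently, keeping results in query order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(search_fn, queries))


queries = ["What is fs?", "What is the file system?", "What is a file in Node.js?"]
for query, documents in zip(queries, fan_out(queries, fake_search)):
    print(query, "->", documents)
```

Threads suit this job because each search mostly waits on the network or the database, so the queries can overlap instead of running one after another.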

Some Examples:

Let’s divide the process into some parts to understand better:

🔴

I am running the vector store in a Docker container on my machine.

Parallel Query Generation:

 import json

 import google.generativeai as genai


 class ParallelQuery:
     def __init__(self, api_key):
         genai.configure(api_key=api_key)
         self.model = genai.GenerativeModel('gemini-1.5-flash-001')

     def generateParallelQuery(self, query, number_of_queries=3):
         """Ask the model for similar queries; return the parsed dict, or None on failure."""
         try:
             system_prompt = f"""
                 You are a helpful AI assistant who generates {number_of_queries} queries with similar topics of the given query={query}.

                 METHOD:
                 1. You get a query, analyze it and find the keywords in that.
                 2. You generate similar words based on the keywords.
                 3. You make similar queries like {query} using the newly generated keywords

                 EXAMPLE:
                 original: "What is fs in Node.js?"
                 generated:
                     1. "What is file system?"
                     2. "What are files in Node.js?"
                     3. "How to make files in Node.js?"

                 RETURN FORMAT
                 You only need to return the queries in this json format, with exactly {number_of_queries} items in "generated":
                 {{
                     "original": "{query}",
                     "generated": [
                         "generated_1",
                         "generated_2",
                         "generated_3"
                     ]
                 }}

                 Return ONLY valid JSON, no additional text.
             """

             response = self.model.generate_content(
                 system_prompt
             )

             if not response or not response.text:
                 print("No response from model")
                 return None

             # filter_response is a helper from the full code (not shown here);
             # it cleans the raw model output before JSON parsing.
             filtered_response = filter_response(response)

             try:
                 parsed_response = json.loads(filtered_response)
                 return parsed_response
             except json.JSONDecodeError as e:
                 print(f"JSON parsing error: {e}")
                 return None

         except Exception as e:
             print(f"Problem occurred while generating the response: {e}")
             return None

💡

Look closely at the system prompt and the idea will be clear. The rest of the code just sends the prompt and parses the model’s response.
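
The `filter_response` helper called in the code above is not shown in this post. One plausible job for such a helper is to strip the Markdown code fence that models often wrap around JSON output. The sketch below works under that assumption only: `extract_json_text` is a hypothetical name, and it takes the raw response text rather than the response object.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Matches an optional fenced wrapper (with or without a "json" tag) around the payload.
FENCED_PATTERN = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def extract_json_text(raw_text):
    """Return the text inside a fenced wrapper, or the stripped text if there is none."""
    text = raw_text.strip()
    match = FENCED_PATTERN.match(text)
    return match.group(1) if match else text


# Hypothetical model output wrapped in a fence.
raw = (
    FENCE + "json\n"
    '{"original": "What is fs?", "generated": ["What is the file system?"]}\n'
    + FENCE
)
parsed = json.loads(extract_json_text(raw))
print(parsed["generated"])
```

Text that arrives without a fence passes through unchanged, so the same helper is safe to call on every response.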

Searching References:

 # Main Parallel Search Function

 def perform_parallel_search(vector_store, queries, k_per_queries):
     """Perform search with multiple queries and combine results"""
     all_results = []

     for index, query in enumerate(queries, 1):
         print(f"Running search on query: {index}")
         # Each hit is a (document, score) pair.
         response = vector_store.search(query, k=k_per_queries)

         for (document, score) in response:
             all_results.append({
                 'query': query,
                 'document': document,
                 'score': score,
                 'content': document.page_content,
                 'page': document.metadata.get('page', 'N/A'),
                 'source': document.metadata.get('source', 'N/A')
             })

     # Ascending sort: this assumes a lower score means a closer match (a distance).
     # Reverse the sort if your store returns a similarity instead.
     all_results.sort(key=lambda x: x['score'])
     # Helper that removes duplicates; see the full code given below.
     unique_results = remove_duplicate_results(all_results)

     print(f"Total unique results: {len(unique_results)}")
     return unique_results

 # Function to search in Vector Store:

 def search(self, query, k=5):
     """Search the vector store for relevant data"""
     try:
         # Reuse the existing store if there is one; otherwise build a retriever.
         if hasattr(self, 'vector_store') and self.vector_store:
             store = self.vector_store
         else:
             print("Creating a new retriever...")
             store = self._retrieve()
             if not store:
                 raise ValueError("Failed to create a retriever....")

         results = store.similarity_search_with_score(query, k=k)
         print(f"Found {len(results)} results for the given query")
         return results
     except Exception as e:
         print(f"Failed to search on the store: {e}")
         return []

 # This method belongs to the VectorStore class defined in the full code.
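
The `remove_duplicate_results` helper lives in the full code and is not shown here. As a rough idea of what such a step can look like, the sketch below keeps the first result for each distinct chunk of text. It is an assumption, not the author’s implementation: `dedupe_by_content` is a hypothetical name, and comparing on the `content` field is just one reasonable choice.

```python
def dedupe_by_content(results):
    """Keep the first result for each distinct 'content' value."""
    seen = set()
    unique = []
    for result in results:
        key = result["content"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


# Hypothetical results, already sorted by score as in perform_parallel_search.
sorted_results = [
    {"query": "What is the file system?", "content": "fs intro", "score": 0.12},
    {"query": "What is a file in Node.js?", "content": "fs intro", "score": 0.20},
    {"query": "How to create a file in Node.js?", "content": "fs.writeFile", "score": 0.31},
]

for result in dedupe_by_content(sorted_results):
    print(result["score"], result["content"])
```

Because the list is sorted before deduplication, the copy that survives for each chunk is the one with the best score.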

Main Function:


 import os


 def main():
     request = input("Query> ")
     number_of_queries = int(input("Number of queries> ") or "3")

     gemini_api = os.getenv("GEMINI_API_KEY")

     try:
         gemini = ParallelQuery(api_key=gemini_api)
         vector_store = VectorStore("Lecture 3 - Polymorphism_250520_224757.pdf")
     except Exception as e:
         print(f"Error occurred while setting up API and vector store: {e}")
         return

     response = gemini.generateParallelQuery(request, number_of_queries)

     # Always search with the user's own query, even if generation failed
     # (indexing into a None response would raise a TypeError).
     total_queries = [request]

     if response:
         print(f"\nOriginal: {response['original']}")
         for index, query in enumerate(response.get('generated', []), 1):
             print(f"{index}: {query}")
             total_queries.append(query)
     else:
         print("No response returned; searching with the original query only\n")

     results = perform_parallel_search(vector_store, total_queries, 5)

     for index, result in enumerate(results, 1):
         print(f"{index}: {result['content']}")
         print(f"In page: {result['page']}")

🧠

These are the basics of parallel query: user question → parallel queries are generated → search on the vector store → more robust results.
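
The main function above stops at printing the retrieved chunks. The last step of the process, giving that context to the LLM, could look roughly like this. It is a sketch under assumptions: `build_answer_prompt`, the prompt wording, and the sample results are all hypothetical, and only the `content` and `page` keys come from the search code.

```python
def build_answer_prompt(question, results, max_chunks=5):
    """Join the top retrieved chunks into one prompt for the answering model."""
    context = "\n\n".join(
        f"[page {result['page']}] {result['content']}"
        for result in results[:max_chunks]
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical results in the shape returned by perform_parallel_search.
sample_results = [
    {"content": "The fs module lets you work with the file system.", "page": 3},
    {"content": "fs.writeFile creates a file if it does not exist.", "page": 7},
]
prompt = build_answer_prompt("What is fs?", sample_results)
print(prompt)
```

The resulting string can then be passed to the same model used for query generation, or to any other LLM.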

Full Code:

See the full code here.

Conclusion:

So, Parallel Query (Fan Out) Retrieval: a fancy name, huh? In practice, it is an optimization that improves what gets retrieved and, with it, the final answer. There are a lot of other techniques out there, and I will go through them one by one. For now, stay tuned.

❓

I could actually make the results more relevant. Can you tell me how?
