Parallel Query (Fan Out) Retrieval

Let's learn a query translation technique: Parallel Query Retrieval


Introduction:

Previously, we learnt what RAG is and why Query Translation is important. Now, we will learn about a popular technique of Query Translation: Parallel Query Retrieval.

What is Parallel Query Retrieval?

Some Backstory:

We know that RAG works over some context (documents, the web, or anything that holds relevant data). Suppose we have given the RAG a Node.js documentation file as context. Now, the user might ask this:
“What is fs?”

As humans, we understand that the user wants to know about the “File System” in Node.js. But what if the Node.js documentation never uses the word “fs” and has “file system” written everywhere instead? Will the similarity search find anything relevant? And will the RAG perform well?

No, right? So we need to take care of this.

Actually, we can solve this problem with the following process:

  • The user asks the question.

  • We prompt the LLM to generate some similar questions. Here, the questions might be:

    1. What is fs?

    2. What is the file system?

    3. What is a file in Node.js?

    4. How to create a file in Node.js?

  • We search the vector store (or whichever database holds the embedded documentation) for each generated question.

  • The vector store returns the files most similar to each question. Suppose:

    1. We find nothing for the question “What is fs?”

    2. We find two matches for the question “What is the file system?” (let's label them yellow and blue)

    3. We find one match for the question “What is a file in Node.js?” (blue)

    4. We find three matches for the question “How to create a file in Node.js?” (blue, yellow, and red)

  • We then filter the results and keep only the unique files.

  • Finally, we give the three unique files (yellow, blue, and red) to the LLM as context and let it answer the user’s query.
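The steps above can be sketched with toy data. The file names (yellow, blue, red) mirror the example; the `retrieved` dictionary is a hand-written stand-in for real vector store results, and `unique_files` is a name made up for this illustration.

```python
# Toy stand-in for vector store results: each generated question maps to
# the files it matched (names follow the colour example above).
retrieved = {
    "What is fs?": [],
    "What is the file system?": ["yellow", "blue"],
    "What is a file in Node.js?": ["blue"],
    "How to create a file in Node.js?": ["blue", "yellow", "red"],
}


def unique_files(results_per_query):
    """Merge per-query matches, keeping each file once in first-seen order."""
    seen = set()
    merged = []
    for matches in results_per_query.values():
        for name in matches:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


context_files = unique_files(retrieved)
print(context_files)  # ['yellow', 'blue', 'red']
```

Even though “What is fs?” matched nothing, the other three questions still bring back all three files.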

🤔
Did we solve our problem? Can the LLM answer our question now?
💡
Yes, the problem is solved for now. The retrieval step could not match “fs”, but it could match “file system”, so the LLM now receives the right context and can answer the user’s question, right?
Is this enough?

Definition:

Parallel Query Retrieval, also known as Fan Out Retrieval, is a method where multiple variants of the same user query are created and sent in parallel to different or the same retrieval systems. The goal is to maximize recall and diversify the retrieved documents, ultimately helping the LLM generate more informed and accurate answers.

Why Fan Out?

The term “Fan Out” comes from system design and networking. It means:

🌐
Spreading a single input into multiple parallel paths or processes.

In Parallel Query Retrieval, we take a single user input, generate multiple queries from it, and spread them across multiple paths, just like “fanning out” the queries. It’s like asking 4 or 5 experts the same question. Isn’t that interesting?
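To make the “fan out” idea concrete, here is a minimal sketch that sends several queries out at the same time using Python's standard `concurrent.futures` module. The `fake_search` function is a hypothetical stand-in for a real vector store lookup, and `fan_out` is a name invented for this example; the search code later in this post uses a plain loop instead.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_search(query):
    """Hypothetical stand-in for a vector store lookup."""
    return f"results for: {query}"


def fan_out(queries, search_fn, max_workers=4):
    """Run search_fn on every query concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search_fn, queries))


queries = ["What is fs?", "What is the file system?", "What is a file in Node.js?"]
for line in fan_out(queries, fake_search):
    print(line)
```

Threads fit here because a real search mostly waits on the network or the database, so the queries can overlap instead of running one after another.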

Some Examples:

Let’s divide the process into some parts to understand better:

🔴
I am using a Docker container in my system for the vector store.
  1. Parallel Query Generation:

     # Imports needed by this snippet
     import json

     import google.generativeai as genai


     class ParallelQuery:
         def __init__(self, api_key):
             genai.configure(api_key=api_key)
             self.model=genai.GenerativeModel('gemini-1.5-flash-001')
    
         def generateParallelQuery(self, query, number_of_queries = 3):
             try:
                 system_prompt = f"""
                     You are a helpful AI assistant who generates {number_of_queries} queries with similar topics of the given query={query}.
    
                     METHOD:
                     1. You get a query, analyze it and find the keywords in that.
                     2. You generate similar words based on the keywords.
                     3. You make similar query like {query} using the newly generated keywords
    
                     EXAMPLE:
                     original: "What is fs in Node.js?"
                     generated:
                         1. "What is file system?"
                         2. "What are files in Node.js?"
                         3. "How to make files in Node.js?"
    
                     RETURN FORMAT
                     You only need to return the queries in this json format:
                     {{
                         "original": "{query}",
                         "generated": {json.dumps([f'generated_{i}' for i in range(1, number_of_queries + 1)])}
                     }}
    
                     Return ONLY valid JSON, no additional text.
                 """
    
                 response = self.model.generate_content(
                     system_prompt
                 )
    
                 if not response or not response.text:
                     print("No response from model")
                     return None
    
                 # filter_response is a helper defined in the full code (not shown here).
                 filtered_response = filter_response(response)
    
                 try:
                     parsed_response = json.loads(filtered_response)
                     return parsed_response
                 except json.JSONDecodeError as e:
                     print(f"JSON parsing error: {e}")
                     return None
    
             except Exception as e:
                 print(f"Problem occurred while generating the response: {e}")
                 return None
    
💡
Look closely at the system prompt and you will see how the queries are generated. Everything else just cleans up the model's response and parses it as JSON.
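The `filter_response` helper is only in the full code, so the sketch below is a guess at the kind of cleanup such a helper might do, not the author's implementation. Models often wrap JSON in a Markdown code fence, which `json.loads` rejects. The name `strip_code_fences` is hypothetical, and unlike the original helper it takes plain text rather than the response object.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the Markdown code fence marker
FENCED_JSON = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL
)


def strip_code_fences(text):
    """Return the text inside a surrounding Markdown code fence, if any."""
    cleaned = text.strip()
    match = FENCED_JSON.match(cleaned)
    return match.group(1) if match else cleaned


# A made-up model reply that wraps its JSON in a fence.
raw = (
    FENCE
    + 'json\n{"original": "What is fs?", "generated": ["What is the file system?"]}\n'
    + FENCE
)
parsed = json.loads(strip_code_fences(raw))
print(parsed["generated"])  # ['What is the file system?']
```

Text that has no fence passes through unchanged apart from trimmed whitespace.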
  2. Searching References:

     # Main Parallel Search Function
    
     def perform_parallel_search(vector_store, queries, k_per_queries):
         """Perform search with multiple queries and combine results"""
         all_results = []
    
         for index, query in enumerate(queries, 1):
             print(f"Running search on query: {index}")
             response = vector_store.search(query, k=k_per_queries)
    
             for (document, score) in response:
                 all_results.append({
                     'query': query,
                     'document': document,
                     'score': score,
                     'content': document.page_content,
                     'page': document.metadata.get('page', 'N/A'),
                     'source': document.metadata.get('source', 'N/A')
                 })
    
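         # Ascending sort: assumes a lower score means a closer match (a distance).
         # If your store returns similarity scores instead, sort with reverse=True.
         all_results.sort(key=lambda x: x['score'])
         # remove_duplicate_results is a helper from the full code that drops duplicates.
         unique_results = remove_duplicate_results(all_results)
    
         print(f"Total unique results: {len(unique_results)}")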
         return unique_results
    
     # Search method of the VectorStore class (defined in the full code):
    
     def search(self, query, k=5):
         """Search the vector store for relevant data"""
         try:
             if hasattr(self, 'vector_store') and self.vector_store:
                 store = self.vector_store
             else:
                 print("Creating a new retriever...")
                 store = self._retrieve()
                 if not store:
                     raise ValueError("Failed to create a retriever")
    
             results = store.similarity_search_with_score(query, k=k)
             print(f"Found {len(results)} results for the given query")
             return results
         except Exception as e:
             print(f"Failed to search the store: {e}")
             return []
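The `remove_duplicate_results` helper is only in the full code. As a rough idea of what such a step could look like, here is a minimal sketch that removes duplicates by chunk text. The name `dedupe_by_content` and the sample scores are made up for illustration; the author's helper may work differently.

```python
def dedupe_by_content(results):
    """Keep the first result for each distinct chunk of text.

    Assumes `results` is already sorted best-first, so the first
    occurrence of a chunk is also its best-scoring one.
    """
    seen = set()
    unique = []
    for result in results:
        key = result["content"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


# Made-up results: two different queries returned the same chunk.
sample = [
    {"query": "What is the file system?", "content": "fs module overview", "score": 0.12},
    {"query": "What is a file in Node.js?", "content": "fs module overview", "score": 0.20},
    {"query": "How to create a file in Node.js?", "content": "fs.writeFile usage", "score": 0.31},
]
for item in dedupe_by_content(sample):
    print(item["score"], item["content"])
```

Because different query variants often hit the same chunk, this step keeps the LLM's context from filling up with repeats.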
    
  3. Main Function:

    
     import os


     def main():
         request = input("Query> ")
         number_of_queries = int(input("Number of queries> ") or "3")
    
         gemini_api = os.getenv("GEMINI_API_KEY")
    
         try:
             gemini = ParallelQuery(api_key=gemini_api)
             vector_store = VectorStore("Lecture 3 - Polymorphism_250520_224757.pdf")
         except Exception as e:
             print(f"Error occurred while setting up API and vector store: {e}")
             return
    
         response = gemini.generateParallelQuery(request, number_of_queries)
    
         # Always search with the user's own query, even if generation failed.
         total_queries = [request]
    
         if response:
             print(f"\nOriginal: {response['original']}")
             for index, query in enumerate(response['generated'], 1):
                 print(f"{index}: {query}")
                 total_queries.append(query)
         else:
             print("No response returned, searching with the original query only\n")
    
         results = perform_parallel_search(vector_store, total_queries, 5)
    
         for index, result in enumerate(results, 1):
             print(f"{index}: {result['content']}")
             print(f"In page: {result['page']}")
    
🧠
These are the basics of parallel query retrieval: user question → generate parallel queries → search the vector store → merge into more robust results.
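The main function above stops at printing the retrieved chunks. The last step of the walkthrough, handing the unique results to the LLM as context, could be sketched like this. `build_answer_prompt` and its prompt wording are assumptions for illustration; the result dictionaries use the same `content` and `page` keys as `perform_parallel_search`.

```python
def build_answer_prompt(question, results, max_chunks=5):
    """Combine the retrieved chunks and the user's question into one prompt."""
    context_blocks = [
        f"[page {r['page']}] {r['content']}" for r in results[:max_chunks]
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}"
    )


# Made-up retrieval results for demonstration.
results = [
    {"content": "The fs module lets you work with the file system.", "page": 3},
    {"content": "fs.writeFile creates a file if it does not exist.", "page": 7},
]
print(build_answer_prompt("What is fs?", results))
```

The resulting string could then be sent to the model with `generate_content`, the same call the query generation step uses.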

Full Code:

See the full code here.

Conclusion:

So, Parallel Query (Fan Out) Retrieval: a fancy name, huh? In practice, it is an optimization that improves what gets retrieved, and therefore the final answer. There are a lot of other techniques out there, and I will go through them one by one. For now, stay tuned.

I could actually make the results even more relevant. Can you tell me how?

Tour with GenAI

Part 7 of 12

This series explores how LLMs like ChatGPT go beyond chat, diving into automation, from sending requests to getting intelligent responses. Learn how real-world LLM-powered systems are built behind the scenes.
