Parallel Query (Fan Out) Retrieval
Let's learn the Parallel Query Retrieval technique

Introduction:
Previously, we learnt what RAG is and why Query Translation is important. Now we will look at a popular Query Translation technique: Parallel Query Retrieval.
What is Parallel Query Retrieval?
Some Backstory:
We know that RAG works over some context (documents, the web, or anything else that holds relevant data). Suppose we have given the RAG system a Node.js documentation file as context. The user might then ask:
“What is fs?”
As humans, we understand that the user wants to know about the “File System” in Node.js. But what if the Node.js documentation never uses the word “fs” and writes “file system” everywhere instead? When the RAG searches, will it find anything similar? And will it perform well?
Probably not, so we need to handle this case.
We can solve this problem with the following process:
1. The user asks the question.
2. We prompt the LLM to generate some similar questions. Here, the questions might be:
   - What is fs?
   - What is the file system?
   - What is a file in Node.js?
   - How to create a file in Node.js?
3. We search for each generated question in the vector store (or whichever database holds the embedded documentation).
4. We collect the similar files found in the vector store. Suppose:
   - We find nothing for the question “What is fs?”
   - We find two matches for “What is the file system?” (call them yellow and blue)
   - We find one match for “What is a file in Node.js?” (blue)
   - We find three matches for “How to create a file in Node.js?” (blue, yellow, and red)
5. We filter the results and keep only the unique files.
6. Finally, we give the three unique files (yellow, blue, and red) to the LLM as context and answer the user’s query.
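To make the steps above concrete, here is a minimal sketch that replays the yellow/blue/red example with a toy lookup table in place of a real vector store. The names `FAKE_INDEX`, `fake_search`, and `fan_out_retrieve` are made up for illustration; they are not part of any library.

```python
# Toy stand-in for a vector store: maps each query to the documents it matches.
# The colour names mirror the example above; a real store would return chunks.
FAKE_INDEX = {
    "What is fs?": [],
    "What is the file system?": ["yellow", "blue"],
    "What is a file in Node.js?": ["blue"],
    "How to create a file in Node.js?": ["blue", "yellow", "red"],
}


def fake_search(query):
    """Hypothetical retriever: look the query up in the toy index."""
    return FAKE_INDEX.get(query, [])


def fan_out_retrieve(queries, search_fn):
    """Run every query and keep each document once, in first-seen order."""
    seen = set()
    unique_docs = []
    for query in queries:
        for doc in search_fn(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


context_docs = fan_out_retrieve(list(FAKE_INDEX), fake_search)
print(context_docs)  # ['yellow', 'blue', 'red']
```

Even though “What is fs?” matched nothing by itself, the other variants still bring back all three files, which is the whole point of fanning out.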

Definition:
Parallel Query Retrieval, also known as Fan Out Retrieval, is a method where multiple variants of the same user query are created and sent in parallel to different or the same retrieval systems. The goal is to maximize recall and diversify the retrieved documents, ultimately helping the LLM generate more informed and accurate answers.
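The “in parallel” part of the definition can be sketched with a thread pool from Python's standard library. This is only one possible way to do it, and `search_all_in_parallel` and `demo_search` are hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all_in_parallel(queries, search_fn, max_workers=4):
    """Send every query variant to search_fn concurrently.

    Returns a dict mapping each query to its list of results.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input queries
        results = list(pool.map(search_fn, queries))
    return dict(zip(queries, results))


def demo_search(query):
    """Hypothetical retriever, used only to make this sketch runnable."""
    return [f"result for: {query}"]


hits = search_all_in_parallel(
    ["What is fs?", "What is the file system?"], demo_search
)
for query, docs in hits.items():
    print(query, "->", docs)
```

With a real store, `search_fn` could be something like `lambda q: vector_store.search(q, k=5)`, assuming the store is safe to call from several threads at once. Threads mainly help when each search waits on I/O, such as a remote vector database.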
Why Fan Out?
The term “fan out” comes from systems design and networking, where it means a single source sending its output to multiple destinations.
In Parallel Query Retrieval, we take a single user input, generate multiple queries from it, and spread them across multiple paths, just like “fanning out” the queries. It’s like asking four or five experts the same question. Interesting, isn’t it?

Some Examples:
Let’s divide the process into parts to understand it better:
Parallel Query Generation:
    import json

    import google.generativeai as genai


    class ParallelQuery:
        def __init__(self, api_key):
            genai.configure(api_key=api_key)
            self.model = genai.GenerativeModel('gemini-1.5-flash-001')

        def generateParallelQuery(self, query, number_of_queries=3):
            try:
                system_prompt = f"""
                You are a helpful AI assistant who generates {number_of_queries} queries with similar topics of the given query={query}.

                METHOD:
                1. You get a query, analyze it and find the keywords in that.
                2. You generate similar words based on the keywords.
                3. You make similar query like {query} using the newly generated keywords

                EXAMPLE:
                original: "What is fs in Node.js?"
                generated:
                1. "What is file system?"
                2. "What are files in Node.js?"
                3. "How to make files in Node.js?"

                RETURN FORMAT
                You only need to return the queries in this json format:
                {{
                    "original": "{query}",
                    "generated": [
                        "generated_1",
                        "generated_2",
                        "generated_3"
                    ]
                }}
                Return ONLY valid JSON, no additional text.
                """
                response = self.model.generate_content(system_prompt)
                if not response or not response.text:
                    print("No response from model")
                    return None
                # filter_response is a helper from the full code (not shown here)
                filtered_response = filter_response(response)
                try:
                    parsed_response = json.loads(filtered_response)
                    return parsed_response
                except json.JSONDecodeError as e:
                    print(f"JSON parsing error: {e}")
                    return None
            except Exception as e:
                print(f"Problem occurred while generating the response: {e}")
                return None
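The `filter_response` helper called above lives in the full code and is not shown in this section, so what it does exactly is not known here. One common reason to clean a model reply before `json.loads` is that models sometimes wrap JSON in a Markdown code fence. The sketch below is a hypothetical stand-in under that assumption: `strip_json_fences` is an invented name, and it takes the reply text (for example `response.text`) rather than the response object.

```python
import json
import re

# Three backticks, built this way so the listing itself stays easy to embed.
FENCE = "`" * 3


def strip_json_fences(text):
    """Hypothetical helper: drop a surrounding Markdown code fence, if any."""
    cleaned = text.strip()
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    match = re.match(pattern, cleaned, re.DOTALL)
    # Fall back to the trimmed text when no fence is present
    return match.group(1) if match else cleaned


# Made-up reply that imitates a fenced JSON answer
raw_reply = (
    FENCE + "json\n"
    '{"original": "What is fs?", "generated": ["What is the file system?"]}\n'
    + FENCE
)
parsed = json.loads(strip_json_fences(raw_reply))
print(parsed["generated"])  # ['What is the file system?']
```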
Searching References:
    # Main parallel search function
    def perform_parallel_search(vector_store, queries, k_per_queries):
        """Perform search with multiple queries and combine results"""
        all_results = []
        for index, query in enumerate(queries, 1):
            print(f"Running search on query: {index}")
            response = vector_store.search(query, k=k_per_queries)
            for (document, score) in response:
                all_results.append({
                    'query': query,
                    'document': document,
                    'score': score,
                    'content': document.page_content,
                    'page': document.metadata.get('page', 'N/A'),
                    'source': document.metadata.get('source', 'N/A')
                })
        all_results.sort(key=lambda x: x['score'])
        # Just some function to remove duplicates; see more in the full code given below.
        unique_results = remove_duplicate_results(all_results)
        print(f"Total result's length: {len(unique_results)}")
        return unique_results


    # Function to search in the vector store.
    # This is a method of the VectorStore class defined in the full code.
    def search(self, query, k=5):
        """Search the vector store for relevant data"""
        try:
            if hasattr(self, 'vector_store') and self.vector_store:
                store = self.vector_store
            else:
                print("Creating a new retriever...")
                store = self._retrieve()
            if not store:
                raise ValueError("Failed to create a retriever....")
            results = store.similarity_search_with_score(query, k=k)
            print(f"Found {len(results)} results for the given query")
            return results
        except Exception as e:
            print(f"Failed to search on the store: {e}")
            return []

Main Function:
    import os


    def main():
        request = input("Query> ")
        number_of_queries = int(input("Number of queries> ") or "3")
        gemini_api = os.getenv("GEMINI_API_KEY")
        try:
            gemini = ParallelQuery(api_key=gemini_api)
            vector_store = VectorStore("Lecture 3 - Polymorphism_250520_224757.pdf")
        except Exception as e:
            print(f"Error occurred while setting up API and vector store: {e}")
            return
        response = gemini.generateParallelQuery(request, number_of_queries)
        # Start from the user's own query so the search still runs
        # when query generation fails and response is None
        total_queries = [request]
        if response:
            print(f"\nOriginal: {response['original']}")
            for index, query in enumerate(response['generated']):
                print(f"{index+1}: {query}")
                total_queries.append(query)
        else:
            print("No response returned\n")
        results = perform_parallel_search(vector_store, total_queries, 5)
        for index, result in enumerate(results, 1):
            print(f"{index}: {result['content']}")
            print(f"In page: {result['page']}")
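The `remove_duplicate_results` helper used by `perform_parallel_search` is part of the full code and is not shown in this section. As a rough idea of what such a helper could look like, here is a minimal sketch that assumes two results are duplicates when their `content` text is identical, and that the list is already sorted best-first (as it is after the `sort` call in the search function). The real helper may well use a different rule.

```python
def remove_duplicate_results(results):
    """Keep the first result for each distinct content string.

    Assumes `results` is already sorted best-first, so the first copy
    of a chunk is also its best-scoring one.
    """
    seen = set()
    unique = []
    for result in results:
        key = result["content"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


# Made-up sample data in the same dict shape the search function builds
sample = [
    {"query": "q1", "content": "chunk A", "score": 0.21},
    {"query": "q2", "content": "chunk A", "score": 0.35},
    {"query": "q2", "content": "chunk B", "score": 0.40},
]
deduped = remove_duplicate_results(sample)
print([r["content"] for r in deduped])  # ['chunk A', 'chunk B']
```

Comparing on exact text is the simplest choice; comparing on `source` plus `page` metadata is another option when chunks overlap.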
Full Code:
Conclusion:
So, Parallel Query (Fan Out) Retrieval: quite a fancy name, huh? In practice, it is an optimization step for getting better output. There are many other techniques out there, and I will go through them one by one. For now, stay tuned.




