Tour with GenAI

Semantic Routing

Pritom Biswas — Sat, 14 Jun 2025 11:58:13 GMT

What is Semantic Routing?

We learnt about “Logical Routing” previously. If you haven’t read it yet, please read it first; today’s article builds heavily on it.

A Scenario:

One day, Jack went to a library and asked the librarian this question:
“I need help making my web pages more interactive.”

There were two librarians in the library: Logical Lara and Semantic Sara. Both managed their sections.

Logical Lara strongly maintained “Logic”. She pulled out her rulebook and started checking these fields:

Contains "web"? ✓ Check the Web Development section
Contains "interactive"? ✓ Check the JavaScript shelf
Contains "pages"? ✓ Look in the HTML documentation

Logical Lara hands you a stack of advanced JavaScript framework manuals, DOM manipulation guides, and complex API references. Technically correct, but overwhelming for someone who might just need to understand basic event handling.

On the other hand, Semantic Sara thought for a while. She didn’t bring out her rulebook. She analyzed your question and found only the phrases “web pages” and “interactive”, which point to a general context but give no specifics such as what kind of interactivity or which framework. From this analysis, she reached this conclusion:

"This person sounds like they're at the beginning of their interactive web development journey. They're not asking about specific frameworks or advanced concepts—they want to understand how to make things happen when users click, type, or interact with their web pages."

Sara walks you directly to a beginner-friendly section with interactive tutorials, starts with simple click handlers, and shows you a progression path from basic interactions to more complex features. She understood not just your words, but your intent, context, and level of expertise.

🤔

Now, tell me: which result seems more reasonable to you?

What Semantic Sara did was the foundation of Semantic Routing: Analysis.

Definition:

Semantic routing is an intelligent routing technique that makes decisions based on the meaning and context of queries rather than just keyword matching. Unlike logical routing, which uses predefined rules and patterns, semantic routing understands the intent, context, and semantic relationships within the text.

Deep Dive:

How does it work?

Let’s understand this through the question: “Help with web interactivity”

Query Preprocessing:
1. Pre-processing:
  - Text normalization (lowercase, remove special characters)
  - Tokenization: ["help", "with", "web", "interactivity"]
  - Stop word removal (remove common words like i, need, help, with):
    Result: ["web", "interactivity"]
    
    Stop words do not contribute much to semantic search
2. Embedding Generation:
  - Uses pre-trained language models (BERT, Sentence-BERT, OpenAI embeddings)
  - Converts text to a high-dimensional vector: [0.2, 0.8, -0.1, 0.4, ...]
  - Captures semantic meaning, not just keywords
3. Context Analysis:
  - Intent Detection: "Help-seeking" + "Learning-oriented"
  - Domain Analysis: Web development
  - Complexity Assessment: Beginner level (simple language, broad request)
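The pre-processing step above can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list is a tiny hypothetical one made up for this example, and the function name `preprocess` is not part of the article's code. Note that the article's own code later passes the stripped query straight to the embedding model, which does its own tokenization.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this example only.
STOP_WORDS = {"i", "need", "help", "with", "a", "the", "my", "to"}


def preprocess(query: str) -> list[str]:
    """Normalize, tokenize, and drop stop words from a query."""
    # Text normalization: lowercase, then replace special characters with spaces.
    normalized = re.sub(r"[^a-z0-9\s]", " ", query.lower())
    # Tokenization: split on whitespace.
    tokens = normalized.split()
    # Stop-word removal.
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("Help with web interactivity"))  # ['web', 'interactivity']
```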

Knowledge Base (KB) Representation:

Each knowledge base is represented as a collection of semantic vectors that capture the meaning and context of its content. It serves as the central repository against which incoming queries are semantically evaluated. Generally, these vectors are computed once, when the system starts. The following illustration makes this clearer:

 # HTML/CSS KB Vector: [0.4, 0.8, 0.3, ...]
 - High values for: "visual", "interaction", "beginner", "elements"
 - Represents: Basic web interactivity, styling, simple events

 # JavaScript KB Vector: [0.1, 0.9, 0.2, ...]  
 - High values for: "programming", "functions", "advanced", "logic"
 - Represents: Programming concepts, complex interactions

 # React KB Vector: [0.3, 0.7, 0.1, ...]
 - High values for: "components", "framework", "interactive", "state"
 - Represents: Framework-specific interactivity

Similarity Calculation:
- Cosine Similarity Calculation:
  
  $$\text{similarity} = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}| \times |\vec{B}|}$$
- Results:
  - HTML/CSS KB: 0.87 (highest - captures "basic web interactivity")
  - React KB: 0.62 (good match for "interactivity" but more advanced)
  - JavaScript KB: 0.45 (relevant but too programming-focused)
  - Node.js KB: 0.33 (server-side, not directly interactive)

(Dummy values for simulation.)
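To make the cosine similarity formula concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 2.0, 0.0]
kb_close = [2.0, 4.0, 0.0]  # points in the same direction as the query
kb_far = [0.0, 0.0, 3.0]    # orthogonal to the query

print(cosine_similarity(query, kb_close))  # very close to 1.0
print(cosine_similarity(query, kb_far))    # 0.0
```

Because only the direction of the vectors matters, `kb_close` scores about 1.0 even though it is twice as long as the query.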

Intelligent Decision Making:
- Ranking System:
  1. Sort by Similarity Scores
  2. Apply Confidence Thresholds (minimum 0.5)
  3. Check Score Gaps (clear winner vs ambiguous)
  4. Validate with Context (beginner vs advanced)
- Final Decision: Route to HTML/CSS KB because:
  - Highest semantic similarity (0.87)
  - Matches beginner intent
  - Covers basic web interactivity concepts
  - Appropriate starting point for the user's journey

So, the final decision would be “For the query ‘Help with web interactivity’, the semantic routing system would route to the HTML/CSS Knowledge Base because it has the highest semantic similarity (0.87) and best matches the beginner-level intent for learning basic web interactivity concepts.“
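The ranking steps above could be sketched roughly like this, using the same dummy scores. The function name `decide_route`, the `min_gap` value, and the `"general"` fallback are assumptions made for this example, and the context validation step (beginner vs advanced) is left out for brevity.

```python
def decide_route(scores, threshold=0.5, min_gap=0.1, fallback="general"):
    """Pick a knowledge base from similarity scores.

    threshold: minimum score the winner must reach.
    min_gap:   hypothetical margin below which the winner is flagged as ambiguous.
    """
    # 1. Sort by similarity score, highest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return {"routed_to": fallback, "confidence": 0.0, "ambiguous": False}

    best_name, best_score = ranked[0]

    # 2. Apply the confidence threshold.
    if best_score < threshold:
        return {"routed_to": fallback, "confidence": best_score, "ambiguous": False}

    # 3. Check the gap between the winner and the runner-up.
    runner_up = ranked[1][1] if len(ranked) > 1 else None
    ambiguous = runner_up is not None and (best_score - runner_up) < min_gap

    return {"routed_to": best_name, "confidence": best_score, "ambiguous": ambiguous}


scores = {"html_css": 0.87, "react": 0.62, "javascript": 0.45, "nodejs": 0.33}
print(decide_route(scores))
# {'routed_to': 'html_css', 'confidence': 0.87, 'ambiguous': False}
```

A caller could use the `ambiguous` flag to query the top two knowledge bases instead of one, or to ask the user a clarifying question.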

When to use Semantic Routing?

Semantic routing excels when the query is somewhat abstract and its intent needs to be interpreted. For example:

Natural Language Queries: “How do I make my website respond to user clicks?“
Ambiguous or Paraphrased Queries: "Fix broken authentication" vs "Login not working" vs "User verification issues"
Cross Domain Queries: "Best practices for securing user data in web apps"
Beginner-Friendly Routing: "Help me understand how websites work"
Intent Heavy Queries: "I'm struggling with responsive design on mobile."
Synonym and variation handling: "API endpoints" vs "REST services" vs "web services"

When not to?

When the query names highly specific technologies, or already states its main concept explicitly, semantic routing does not help much.
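One way to act on this is a cheap check in front of both routers: if the query explicitly names a known technology, use rule-based routing; otherwise, fall back to semantic routing. The sketch below is only an illustration of that idea; `KNOWN_TECH` and `pick_router` are hypothetical names, not part of the article's code.

```python
import re

# Hypothetical list of technology names that the logical router has rules for.
KNOWN_TECH = {"react", "django", "jwt", "express", "mongodb"}


def pick_router(query: str) -> str:
    """Return "logical" if the query names a known technology, else "semantic"."""
    # Keep characters that commonly appear in technology names (e.g. "node.js", "c++").
    words = {w.strip(".") for w in re.findall(r"[a-z0-9.+#]+", query.lower())}
    return "logical" if words & KNOWN_TECH else "semantic"


print(pick_router("How do I use JWT with Express?"))                    # logical
print(pick_router("How do I make my website respond to user clicks?"))  # semantic
```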

Let’s do some code:

Preprocessing:

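 # This is part of route_query (shown in full below).
 # For simplicity, the only pre-processing is stripping whitespace;
 # the model tokenizes the text internally and returns its embedding.
 query_vector = self.model.encode([query.strip()])[0]

 similarities = {}
 query_norm = np.linalg.norm(query_vector)

 # Guard against a zero vector, which would cause a division by zero
 # in the cosine similarity calculation.
 if query_norm == 0:
     return {
         "error": "Invalid query vector",
         "routed_to": "general",
         "confidence": 0.0
     }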

Knowledge Base setup:

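 # Imports used by the snippets in this article.
 import numpy as np
 from sentence_transformers import SentenceTransformer

 # The knowledge bases are set up when the router class is initialized.
 # In a real system, each knowledge base would be a large set of files stored in a database.
 # For simplicity, I have only used a dictionary of short descriptions.
 def __init__(self):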
         self.model = SentenceTransformer('all-MiniLM-L6-v2')
         self.knowledge_bases = {
             "html_css": "HTML CSS styling layout beginner web design interactive elements",
             "javascript": "JavaScript programming functions DOM events advanced coding",
             "react": "React components hooks state JSX frontend framework",
             "nodejs": "Node.js server backend API express database",
             "authentication": "login security JWT tokens password user auth"
         }

         print("Semantic Router Starting")
         self.kb_vectors = {}
         for name, description in self.knowledge_bases.items():
             vector = self.model.encode([description])[0]
             self.kb_vectors[name] = vector

         print("Semantic Router Ready")

Similarity Calculation:

 def route_query(self, query):
         """Route a query to the best knowledge base"""
         # Some error handling
         if not query or not query.strip():
             return {
                 "error": "Query cannot be empty",
                 "routed_to": "general",
                 "confidence": 0.0
             }

         if not self.kb_vectors:
             return {
                 "error": "No knowledge bases available",
                 "routed_to": "general",
                 "confidence": 0.0
             }

         try:
             # Embed the query (the pre-processing step shown earlier).
             query_vector = self.model.encode([query.strip()])[0]

             similarities = {}
             query_norm = np.linalg.norm(query_vector)

             if query_norm == 0:
                 return {
                     "error": "Invalid query vector",
                     "routed_to": "general",
                     "confidence": 0.0
                 }

             # Similarity search starts
             for kb_name, kb_vector in self.kb_vectors.items():
                 kb_norm = np.linalg.norm(kb_vector)

                 if kb_norm == 0:
                     similarities[kb_name] = 0.0
                 else:
                     # Cosine similarity: dot product divided by the product of the norms.
                     similarity = np.dot(query_vector, kb_vector) / (query_norm * kb_norm)
                     similarities[kb_name] = similarity

             if not similarities:
                 return {
                     "error": "No similarities calculated",
                     "routed_to": "general",
                     "confidence": 0.0
                 }

             best_kb = max(similarities, key=similarities.get)
             best_score = similarities[best_kb]

             # Fall back to the general knowledge base when no score clears the threshold.
             threshold = 0.3
             if best_score < threshold:
                 best_kb = "general"
             confidence = best_score

             return {
                 "query": query.strip(),
                 "routed_to": best_kb,
                 "confidence": confidence,
                 "all_scores": similarities
             }

         except Exception as e:
             return {
                 "error": f"Routing failed: {str(e)}",
                 "routed_to": "general",
                 "confidence": 0.0
             }
     # Similarity search ends

Decision Making (Demo Function):

 def demo(router, query):
     """Test the Semantic Routing"""
     result = router.route_query(query)
     print(result)

     # Error results carry no "query" or "all_scores" keys, so stop here.
     if "error" in result:
         print(f"\n⚠️ {result['error']}")
         return

     print(f"\n📰 Query: {result['query']}")
     print(f"➡️ Routed to: {result['routed_to']}")
     print(f"🙌 Confidence: {result['confidence']}")
     print("\nAll Scores: ")
     for name, score in result['all_scores'].items():
         print(f"{name}: score-> {score}")

Demo Input-Output:

 Semantic Router Starting
 Semantic Router Ready
 🧪Initiating Test: 
 > What is js?    
 {'query': 'What is js?', 'routed_to': 'javascript', 'confidence': np.float32(0.43704033), 'all_scores': {'html_css': np.float32(0.17255959), 'javascript': np.float32(0.43704033), 'react': np.float32(0.31851408), 'nodejs': np.float32(0.22914742), 'authentication': np.float32(0.18344694)}}

 📰 Query: What is js?
 ➡️ Routed to: javascript
 🙌 Confidence: 0.4370403289794922

 All Scores: 
 html_css: score-> 0.17255958914756775
 javascript: score-> 0.4370403289794922
 react: score-> 0.3185140788555145
 nodejs: score-> 0.22914741933345795
 authentication: score-> 0.1834469437599182
 > Javascript interactivity tutorial
 {'query': 'Javascript interactivity tutorial', 'routed_to': 'javascript', 'confidence': np.float32(0.5200579), 'all_scores': {'html_css': np.float32(0.40155196), 'javascript': np.float32(0.5200579), 'react': np.float32(0.1519815), 'nodejs': np.float32(0.0759646), 'authentication': np.float32(0.040414095)}}

 📰 Query: Javascript interactivity tutorial
 ➡️ Routed to: javascript
 🙌 Confidence: 0.5200579166412354

 All Scores:
 html_css: score-> 0.4015519618988037
 javascript: score-> 0.5200579166412354
 react: score-> 0.15198150277137756
 nodejs: score-> 0.07596459984779358
 authentication: score-> 0.04041409492492676
 > javascript authentication
 {'query': 'javascript authentication', 'routed_to': 'authentication', 'confidence': np.float32(0.51601857), 'all_scores': {'html_css': np.float32(0.16553222), 'javascript': np.float32(0.4025679), 'react': np.float32(0.14441179), 'nodejs': np.float32(0.20198642), 'authentication': np.float32(0.51601857)}}

 📰 Query: javascript authentication
 ➡️ Routed to: authentication
 🙌 Confidence: 0.5160185694694519

 All Scores:
 html_css: score-> 0.1655322164297104
 javascript: score-> 0.40256789326667786
 react: score-> 0.14441178739070892
 nodejs: score-> 0.2019864171743393
 authentication: score-> 0.5160185694694519
 > What is python?
 {'query': 'What is python?', 'routed_to': 'general', 'confidence': np.float32(0.07313205), 'all_scores': {'html_css': np.float32(0.06834164), 'javascript': np.float32(0.07313205), 'react': np.float32(0.043729357), 'nodejs': np.float32(0.04890592), 'authentication': np.float32(-0.04147682)}} 

 📰 Query: What is python?
 ➡️ Routed to: general
 🙌 Confidence: 0.07313205301761627

 All Scores:
 html_css: score-> 0.06834164261817932
 javascript: score-> 0.07313205301761627
 react: score-> 0.04372935742139816
 nodejs: score-> 0.04890592023730278
 authentication: score-> -0.04147681966423988

That’s it. This is “Semantic Routing”.

💡

I have left out a lot of the complicated parts here. You will need to implement those yourself.

Full Code:

See the full code here.

Conclusion:

Semantic Routing and Logical Routing both have their use cases. It greatly depends on the context.

Logical Routing

Pritom Biswas — Thu, 12 Jun 2025 19:48:20 GMT

What is Logical Routing?

Previously, we saw what “Routing” in RAG means. So, we’re going to start straight from the definition here.

Definition:

Logical routing is a query handling approach that uses explicit rules, patterns, and conditional logic to determine where to direct incoming queries in an information system. Unlike semantic approaches that analyze meaning, logical routing relies on precise, predefined criteria to make routing decisions.

💡

Simply speaking, “Logical Routing” guides queries based on specific rules that are set in advance.

Core Parts:

Logical routing mainly consists of these 4 parts:

- Rule Engine: The central component that evaluates conditions and executes routing decisions based on predefined logic.
- Pattern Matchers: Tools that identify specific patterns in queries (keywords, phrases, question types).
- Decision Trees: Structured flowcharts that guide the routing process through a series of yes/no questions.
- Routing Destinations: The various endpoints where queries can be directed (knowledge bases, specialized handlers, etc.).

How does it work?

Look at the following diagram:

💡

Previously, we learnt that “Query Analysis & Classification”, “Route Decision”, “Execution”, and “Response Generation” are the fundamental steps of the Routing workflow.

Let’s take an example and see what exactly happens here:

Query: “How do I implement authentication in a React app with Node.js backend?”

Pre-processing:
- Generally, the query is converted to lowercase.
- Then the query is tokenized: ["how", "do", "i", "implement", "authentication", "in", "a", "react", "app", "with", "node.js", "backend"]
Logical Routing:
1. Rule Matching:
  - System checks predefined rules in order of specificity
  - Matches rule: "IF query contains React AND Node.js AND authentication → route to full-stack authentication guides"
2. Pattern Matching:
  - Languages/Frameworks identified: "React", "Node.js"
  - Concepts identified: "authentication"
  - Operation type: "implementation" (how-to)
3. Decision Trees:
  
  Look at this:
4. Route Decisions:
  - Based on the rule match, the system selects "javascript_fullstack_auth" as the routing destination.
Response Phase:

The query is directed specifically to the "JavaScript Full-Stack Authentication Documentation" section, which contains:
- React frontend authentication patterns
- Node.js backend authentication implementations
- JWT/session management guides
- Security best practices

The system retrieves information specifically from this targeted section rather than searching the entire database, resulting in:

- More relevant results
- Faster response times
- Content specifically about implementing authentication in React + Node.js applications
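The rule-matching step in the walkthrough above can be sketched as a small ordered rule list. Only the first rule and its destination name `javascript_fullstack_auth` come from the example; the other rules, the `"general"` default, and the function name `match_rule` are hypothetical and exist just to show how specificity ordering works.

```python
# Each rule is (required terms, destination).
# Only the first rule comes from the walkthrough; the rest are hypothetical.
RULES = [
    ({"react", "node.js", "authentication"}, "javascript_fullstack_auth"),
    ({"react"}, "react"),
    ({"authentication"}, "authentication"),
]


def match_rule(query, rules=RULES, default="general"):
    """Return the destination of the most specific rule whose terms all appear."""
    # Pre-processing: lowercase and tokenize on whitespace.
    tokens = set(query.lower().replace("?", " ").split())
    # Check the most specific rules (most required terms) first.
    for required, destination in sorted(rules, key=lambda rule: len(rule[0]), reverse=True):
        if required <= tokens:  # every required term is present in the query
            return destination
    return default


print(match_rule("How do I implement authentication in a React app with Node.js backend?"))
# javascript_fullstack_auth
```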

❓

Quick question: Is it necessary to use “Rule Matching”, “Pattern Matching”, and “Decision Trees” together in all the queries? If yes, why? If not, why?

Is it always good?

I will answer it in the next article. Now, let’s do some code 🤓

Let’s Code:

In this part, we will code Logical Routing. We will use a dummy Qdrant DB and simulate what happens when we use this kind of “Routing”

Feed the context (Give Knowledge):

 # These should be valid files in your system
 def _init_knowledge_base(self):
         """Initialize the knowledge base"""
         # Let's work with a dummy knowledge base. In a real case, this might be website data, database data, and so on.
         self.kb_files = {
             "javascript": "javascript_docs.pdf",
             "python": "python_docs.pdf",
             "ruby": "ruby_docs.pdf",
             "react": "react_docs.pdf",
             "nodejs": "nodejs_docs.pdf", 
             "django": "django_docs.pdf",
             "api": "api_docs.pdf",
             "database": "database_docs.pdf",
             "authentication": "auth_docs.pdf",
             "general": "general_web_docs.pdf"
         }

         # This should be a proper knowledge base when implemented with real information
         self.knowledge_bases = {}

Define the patterns (Define Rules):

 def _init_patterns(self):
         """Initialize pattern matching rules"""
         # Ideally, these patterns would be generated in bulk using LLMs and
         # stored in the vector store beforehand.
         # Language patterns
         self.language_patterns = {
             "javascript": ["javascript", "js", "ecmascript", ".js"],
             "python": ["python", "py", ".py", "pip"],
             "ruby": ["ruby", "rails", "erb", "gem", ".rb"]
         }

         # Framework patterns
         self.framework_patterns = {
             "react": ["react", "jsx", "component", "hook", "props", "state"],
             "nodejs": ["node", "nodejs", "npm", "express", "package.json"],
             "django": ["django", "drf", "django-rest-framework"]
         }

         # Concept patterns
         self.concept_patterns = {
             "api": ["api", "rest", "endpoint", "http", "request", "response"],
             "database": ["database", "db", "sql", "query", "mongodb", "schema", "model"],
             "authentication": ["auth", "login", "jwt", "token", "session", "password"]
         }

         # Operation patterns
         self.operation_patterns = {
             "how_to": ["how to", "how do i", "steps to", "guide for", "tutorial", "implement"],
             "definition": ["what is", "define", "explain", "meaning of", "understand", "concept of"],
             "comparison": ["vs", "versus", "compare", "difference between", "better than", "pros and cons"],
             "troubleshooting": ["fix", "error", "bug", "issue", "problem", "not working", "debug"]
         }

Match the Patterns (Pattern Matching):

 # Match the patterns with pre-defined rules.
 def _match_patterns(self, query, pattern_dict):
         """Match query against a pattern dictionary and return the matched results"""
         query_lower = query.lower()
         matches = []
         matched_patterns = {}

         for category, patterns in pattern_dict.items():
             for pattern in patterns:
                 if pattern in query_lower:
                     if category not in matches:
                         matches.append(category)
                         matched_patterns[category] = [pattern]
                     else:
                         matched_patterns[category].append(pattern)

         return matches, matched_patterns
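Note that `pattern in query_lower` is a plain substring test, so short patterns can also match inside unrelated words. The sketch below compares it with a word-boundary check built on Python's `re` module. The helper names are hypothetical and this is not the article's implementation. There is a trade-off: with word boundaries, a short pattern like "auth" no longer matches "authentication", so the pattern lists would need to contain the full words.

```python
import re


def contains_substring(pattern, text):
    """Plain substring test, as used in _match_patterns."""
    return pattern in text.lower()


def contains_word(pattern, text):
    """Match the pattern only when it is not glued to other letters or digits."""
    regex = r"(?<![a-z0-9])" + re.escape(pattern.lower()) + r"(?![a-z0-9])"
    return re.search(regex, text.lower()) is not None


query = "Who is the author of this rapid prototype?"
print(contains_substring("api", query), contains_word("api", query))    # True False
print(contains_substring("auth", query), contains_word("auth", query))  # True False
print(contains_word("js", "What is js?"))                               # True
```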

Analyze Query (between rule matching and routing):

 def analyze_query(self, query):
         """Analyze the query and get the results"""

         languages, lang_patterns = self._match_patterns(query, self.language_patterns)
         frameworks, framework_patterns = self._match_patterns(query, self.framework_patterns)
         concepts, concept_patterns = self._match_patterns(query, self.concept_patterns)
         operations, op_patterns = self._match_patterns(query, self.operation_patterns)

         analysis = {
             "languages": languages,
             "frameworks": frameworks,
             "concepts": concepts,
             "operations": operations,
             "matched_patterns": {
                 "languages": lang_patterns,
                 "frameworks": framework_patterns,
                 "concepts": concept_patterns,
                 "operations": op_patterns
             },
             "original_query": query
         }

         return analysis


Route the query:

    # Routing started
    def route_query(self, query):

            """Apply logical routing rules to determine the best knowledge base available"""
            analysis = self.analyze_query(query=query)
            route_info = {
                "query": query,
                "analysis": analysis,
                "route_decision": None,
                "decision_path": [],
                "knowledge_base": None
            }

            if analysis["languages"]:
                primary_language = analysis["languages"][0]
                route_info["decision_path"].append(f"Language detected: {primary_language}")
                if analysis["frameworks"]:
                    framework = analysis["frameworks"][0]

                    compatible = False
                    if primary_language == "javascript" and framework in ["react", "nodejs"]:
                        compatible = True
                    elif primary_language == "python" and framework == "django":
                        compatible = True

                    if compatible:
                        route_info["decision_path"].append(f"Compatible framework found: {framework}")
                        route_info["route_decision"] = framework
                        route_info["knowledge_base"] = self._get_kb(framework)
                        return route_info

                route_info["route_decision"] = primary_language
                route_info["knowledge_base"] = self._get_kb(primary_language)
                return route_info

            elif analysis["frameworks"]:
                framework = analysis["frameworks"][0]
                route_info["decision_path"].append(f"Framework detected: {framework}")
                route_info["route_decision"] = framework
                route_info["knowledge_base"] = self._get_kb(framework)
                return route_info

            elif analysis["concepts"]:
                concept = analysis["concepts"][0]
                route_info["decision_path"].append(f"Concept detected: {concept}")
                route_info["route_decision"] = concept
                route_info["knowledge_base"] = self._get_kb(concept)
                return route_info

            route_info["decision_path"].append("No specific domain detected, using general knowledge base")
            route_info["route_decision"] = "general"
            route_info["knowledge_base"] = self._get_kb("general")
            return route_info


    # Routing done. Now search the chosen knowledge base in the vector store.
    def search(self, query, k=5):
            """Route the query and perform search"""

            route_info = self.route_query(query)
            kb = route_info["knowledge_base"]

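            if not kb:
                print("No valid knowledge base found for routing decision")
                # Return the same shape as the success case, with no results.
                return {
                    "routing": route_info,
                    "results": []
                }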

            results = kb.search(query, k=k)

            return {
                "routing": route_info,
                "results": results
            }

Dummy Checker Function:

 # Dummy checking function. In a real system, queries would be sent to the server and the backend
 # would validate and route them.
 def demonstration_logical_routing():
     """Show examples of the logical routing system with real queries"""
     router = LogicalRouter()

     example_queries = [
         "How do I create an array in JavaScript?",
         "What's the best way to connect Python to a SQL database?",
         "How to fix React component rendering error",
         "What is authentication in web applications?",
         "Explain the difference between Node.js and Django",
         "How to implement REST APIs?",
         "What is a closure in JavaScript?",
         "Django vs Flask - which should I use?"
     ]

     print("\n" + "="*80)
     print(" "*30 + "LOGICAL ROUTING DEMO")
     print("="*80 + '\n')

     for index, query in enumerate(example_queries, 1):
         print(f"Query no. {index}: {query}")
         print('\n')

         route_info = router.route_query(query)
         print("\nQUERY ANALYSIS:")

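         for category, items in route_info["analysis"].items():
             # Skip the entries that are not plain lists of category names.
             # (With this key misspelled, a "Matched_patterns" line is printed too,
             # as seen in the sample output below.)
             if category not in ["matched_patterns", "original_query"]:
                 print(f"-> {category.capitalize()}: {', '.join(items)}")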

         print("\n ➡️ Routing Decision Paths: ")
         for step in route_info["decision_path"]:
             print(f" -> {step}")

         print(f"\n 🤚 Final Decision Here: {route_info['route_decision']}")

         if route_info["knowledge_base"]:
             kb_name = route_info["route_decision"]
             print(f" Routed to Knowledge base: {kb_name}")
         else:
             print(" No valid knowledge base found")

         print("-"*80)

Demo Response:

Initializing the logical router with base path
Logical router base path initilalized successfully

================================================================================
                              LOGICAL ROUTING DEMO
================================================================================

Query no. 1: How do I create an array in JavaScript?


Initializing knowldege base: javascript
Creating new collection: 'javascript_docs'
Error creating new collection: {e}

QUERY ANALYSIS:
-> Languages: javascript
-> Frameworks:
-> Concepts:
-> Operations: how_to
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Language detected: javascript

 🤚 Final Decision Here: javascript
 Routed to Knowledge base: javascript
--------------------------------------------------------------------------------
Query no. 2: What's the best way to connect Python to a SQL database?


Initializing knowldege base: python
Creating new collection: 'python_docs'
Error creating new collection: {e}

QUERY ANALYSIS:
-> Languages: python
-> Frameworks:
-> Concepts: database
-> Operations:
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Language detected: python

 🤚 Final Decision Here: python
 Routed to Knowledge base: python
--------------------------------------------------------------------------------
Query no. 3: How to fix React component rendering error


Initializing knowldege base: react
Creating new collection: 'react_docs'
Error creating new collection: {e}

QUERY ANALYSIS:
-> Languages:
-> Frameworks: react
-> Concepts:
-> Operations: how_to, troubleshooting
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Framework detected: react

 🤚 Final Decision Here: react
 Routed to Knowledge base: react
--------------------------------------------------------------------------------
Query no. 4: What is authentication in web applications?


Initializing knowldege base: authentication
Creating new collection: 'auth_docs'
Error creating new collection: {e}

QUERY ANALYSIS:
-> Languages:
-> Frameworks:
-> Concepts: authentication
-> Operations: definition
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Concept detected: authentication

 🤚 Final Decision Here: authentication
 Routed to Knowledge base: authentication
--------------------------------------------------------------------------------
Query no. 5: Explain the difference between Node.js and Django


Initializing knowldege base: nodejs
Creating new collection: 'nodejs_docs'
Error creating new collection: {e}

QUERY ANALYSIS:
-> Languages: javascript
-> Frameworks: nodejs, django
-> Concepts:
-> Operations: definition, comparison
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Language detected: javascript
 -> Compatible framework found: nodejs

 🤚 Final Decision Here: nodejs
 Routed to Knowledge base: nodejs
--------------------------------------------------------------------------------
Query no. 6: How to implement REST APIs?


Initializing knowldege base: api
Creating new collection: 'api_docs'
Error creating new collection: {e}

QUERY ANALYSIS:
-> Languages:
-> Frameworks:
-> Concepts: api
-> Operations: how_to
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Concept detected: api

 🤚 Final Decision Here: api
 Routed to Knowledge base: api
--------------------------------------------------------------------------------
Query no. 7: What is a closure in JavaScript?



QUERY ANALYSIS:
-> Languages: javascript
-> Frameworks:
-> Concepts:
-> Operations: definition
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Language detected: javascript

 🤚 Final Decision Here: javascript
 Routed to Knowledge base: javascript
--------------------------------------------------------------------------------
Query no. 8: Django vs Flask - which should I use?


Initializing knowldege base: django
Creating new collection: 'django_docs'
Error creating new collection: {e}

QUERY ANALYSIS:
-> Languages:
-> Frameworks: django
-> Concepts:
-> Operations: comparison
-> Matched_patterns: languages, frameworks, concepts, operations

 ➡️ Routing Decision Paths:
 -> Framework detected: django

 🤚 Final Decision Here: django
 Routed to Knowledge base: django
--------------------------------------------------------------------------------

Full Code:

Get the full code here (Github Gist)

There is an implementation of Vector Store; you can ignore it. I am using Docker for running Qdrant Store locally.

Additional Resource:

llm-rag (github link)

Conclusion:

Next, we will look at Semantic Routing and make some comparisons.

Advanced RAG: Routing

Pritom Biswas — Tue, 10 Jun 2025 17:56:49 GMT

Introduction:

Previously, we learned about some “Query Translation” techniques: “How do we break down a query and get the gist of it?” In this article, we’re gonna work on the data that is stored in our Database.

Problem statement:

Suppose we have a huge database where we have information about JavaScript, Node.js, Python, Ruby, Rust, and all the stuff about modern Web Development. Now, play some Q&As.

When I search for something in there, is it searching the whole Database?
- Yes.

Does it help if I hugely upgrade the quality of my query and then search?
- No, actually. I have upgraded the query, but I have not made the search over the Database any more efficient.

🤔

Then what should we do?

Some approach:

Let’s think of an easy approach. First, we will mark our data chunks according to their domains. Then, when we search the Database, we will specifically search on the targeted domains. In this way we can reduce the cost of operation, right?

This is called “Routing”. Easy, right?

This is the basic process of “Routing”.
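Here is a toy sketch of that idea: each chunk carries a domain tag, and a search can be limited to one domain. The chunk texts and the function name `search` are made up for illustration, and a real system would use a vector store rather than a list and a substring test.

```python
# Toy data: every chunk is tagged with the domain it belongs to.
chunks = [
    {"domain": "javascript", "text": "Arrays are created with [] or new Array()."},
    {"domain": "python", "text": "Lists are created with [] or list()."},
    {"domain": "javascript", "text": "Closures capture variables from outer scopes."},
]


def search(query_term, domain=None):
    """Return matching chunk texts and the number of chunks that were scanned."""
    # Routing: keep only the chunks from the targeted domain (or all if none is given).
    candidates = [c for c in chunks if domain is None or c["domain"] == domain]
    hits = [c["text"] for c in candidates if query_term.lower() in c["text"].lower()]
    return hits, len(candidates)


print(search("created"))                # scans all 3 chunks
print(search("created", "javascript"))  # scans only the 2 JavaScript chunks
```

The second call does less work and also avoids returning the Python answer to a JavaScript question.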

Routing:

Definition:

Routing is a technique where the system intelligently decides which retrieval strategy, knowledge source, or processing path to use based on the characteristics of the incoming query.

Think of it like a smart receptionist:

Medical questions → Route to medical expert
Legal questions → Route to the legal department
Technical questions → Route to the engineering team
Simple questions → Route to the general information desk

How Routing Works:

Usually, Routing works in these 4 steps:

1. Query Analysis and Classification: The router examines the incoming query to determine:
  - Query Type: Question, comparison, how-to, definition, etc.
  - Domain: Technical, business, medical, legal, etc.
  - Complexity: Simple factual vs complex analytical
  - Intent: Information seeking, problem-solving, decision-making
2. Route Decision: Based on the analysis, the router decides:
  - Which retrieval method to use (vector search, keyword search, hybrid)
  - Which knowledge source to query (general docs, technical docs, specific databases)
  - Which processing strategy to apply (direct retrieval, decomposition, HyDE, parallel)
  - Which model/prompt to use for generation
3. Execution: The query is sent down the chosen path with appropriate configurations.
4. Response: Results are formatted according to the route's specifications.

Some Procedural Examples:

Let’s simulate this process with a question: “How do I deploy a React app?“

Query Analysis:
- Type: How-to/Procedural
- Domain: Web Development
- Complexity: Medium
- Intent: Problem-solving
Routing Decision:
- Knowledge Source: Development documentation
- Method: Keyword + semantic search
- Strategy: Step-by-step retrieval
- Response Format: Numbered instructions
Execution:

The query goes to the chunks where the development documentation is stored, runs the chosen retrieval techniques, and gets the data.
Response:

Give the extracted Data.
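The first two steps of this walkthrough can be sketched as follows. Everything here is an assumption for illustration: the `QueryAnalysis` and `Route` classes, the `ROUTES` table, and the keyword heuristics in `analyze` are hypothetical, and in practice the analysis step would usually be done by an LLM or a classifier.

```python
from dataclasses import dataclass


@dataclass
class QueryAnalysis:
    query_type: str
    domain: str


@dataclass
class Route:
    knowledge_source: str
    method: str
    response_format: str


# Hypothetical routing table: (query type, domain) -> route.
ROUTES = {
    ("how-to", "web-development"): Route(
        "development documentation", "keyword + semantic search", "numbered instructions"
    ),
}
DEFAULT_ROUTE = Route("general docs", "vector search", "short paragraph")


def analyze(query: str) -> QueryAnalysis:
    """Step 1: toy keyword heuristics standing in for an LLM classifier."""
    q = query.lower()
    query_type = "how-to" if q.startswith(("how do i", "how to")) else "question"
    web_terms = ("react", "javascript", "html", "css", "deploy")
    domain = "web-development" if any(term in q for term in web_terms) else "general"
    return QueryAnalysis(query_type, domain)


def decide(analysis: QueryAnalysis) -> Route:
    """Step 2: pick the route for this analysis, falling back to a default."""
    return ROUTES.get((analysis.query_type, analysis.domain), DEFAULT_ROUTE)


analysis = analyze("How do I deploy a React app?")
route = decide(analysis)
print(analysis)
print(route)
```

Execution and response formatting would then use the fields of `route` to pick the data source, the retrieval method, and the output format.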

Types of Routing:

Two types of Routing are widely followed in the industry:

Logical Routing
Semantic Routing

We will learn about them in detail in the later articles.

Conclusion:

Routing is essential when we need to handle a large amount of data. For smaller applications, though, it mostly adds complexity.

HyDE (Hypothetical Document Embeddings)

Pritom Biswas — Mon, 09 Jun 2025 19:04:19 GMT

Previous Context

We saw how the “Parallel Query Retrieval” and “Query Decomposition” work. Let’s just see a recap:

Parallel Query Retrieval:

We asked this question, “What is fs?” and we got some questions like this from the LLM:

What is fs?
What is the file system?
What is a file in Node.js?
How to create a file in Node.js?

Really a straightforward question and some variants of it, right?

Query Decomposition:

We had a complex question: “What are the advantages and disadvantages of React compared to Vue.js for building large-scale applications?“ and our LLM generated these questions:

Compare React and Vue.js for large-scale projects
What are the pros and cons of using React for building large applications?
Is React or Vue.js better for developing complex web applications?
What are the benefits of using Vue.js over React in large-scale projects?

Complex multi-topic queries made simple.

Now, let’s think about something complex…

A new scenario:

Some experimentation 🧑‍🔬:

Suppose we have a big “Academic Research Paper” on “LLM models - Transformers and NLPs“ in our hands, and we want to ask questions about it. Sample question:

"How does transformer architecture improve natural language understanding?"

Run Through Parallel Query Retrieval:
- "How do transformers help with NLP?"
- "What makes transformer architecture better for language?"
- "Why are transformers good for natural language processing?"

We got these questions. But “Research Papers” do not contain questions. They contain research findings, methodologies, and numerical results.

❓

We can get some response, but will it be “contextual” to us?

Run through Query Decomposition:
- "What is transformer architecture?"
- "How does the attention mechanism work?"
- "What are transformer benefits for NLP?"

We got some broken-down or “decomposed” queries. But these are still questions, and we need references so that we can search through the Paper and get our relevant information.

❓

Again, we will get some response, but will it be that helpful?

Let’s try a new approach:

So, we will generate a pseudo-answer based on the question, and then search our vector store using the references/keywords from that answer. Suppose the LLM generates this response:

“Transformer architecture revolutionizes natural language understanding through self-attention mechanisms that capture long-range dependencies more effectively than recurrent neural networks. The multi-head attention allows the model to focus on different representation subspaces simultaneously, enabling better contextual understanding…“

We have got keywords like:
- "self-attention mechanism"
- "long-range dependencies"
- "recurrent neural networks"
- "subspace" and
- "contextual understanding."
  
This gives us more relevant data to search through, which increases our chances of finding the information we need, right?

💡

This example might be quite overwhelming. Think of it like this: “We are generating a pseudo-answer before searching for our real answer in the document, instead of generating more questions.”

This method is called “HyDE (Hypothetical Document Embeddings)”.

Definition:

HyDE stands for Hypothetical Document Embeddings. Instead of searching directly with the user's question, HyDE first generates a hypothetical answer, then uses the embedding of that generated answer to search for relevant documents.
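Here is a minimal sketch of that definition. It is an illustration under loud assumptions: `fake_llm_answer` is a hypothetical stand-in for the LLM call, and `embed` is a toy bag-of-words vector rather than a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm_answer(query: str) -> str:
    """Hypothetical stand-in for the LLM call that writes the pseudo-answer."""
    return ("Transformers use self-attention to capture long-range dependencies "
            "more effectively than recurrent neural networks.")


def hyde_search(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    pseudo_answer = fake_llm_answer(query)   # 1. generate a hypothetical answer
    query_vec = embed(pseudo_answer)         # 2. embed the answer, not the question
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]                    # 3. return the closest real documents


question = "How does transformer architecture improve natural language understanding?"
docs = [
    "Self-attention lets each token attend to all others, modelling long-range dependencies.",
    "Convolutional networks are widely used for image classification.",
]
best = hyde_search(question, docs)
print(best[0])
```

In this toy setup the raw question shares no words with either document, so its similarity to both is zero. The pseudo-answer, on the other hand, overlaps with the first document on terms like “self-attention” and “long-range dependencies”, which is exactly the effect HyDE relies on.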

Why HyDE?

We have:

Parallel Query Retrieval
Reciprocal Rank Fusion and
Query Decomposition

- These are powerful techniques. So, why do we need HyDE?

HyDE adds something extra powerful: contextual richness.

Why does that matter?

Many technical documents spread information across sections.
Questions might miss key terms that appear only in answers.
Generating a pseudo-answer lets us "guess" what a good answer might look like, then search backwards to locate supporting material.

This is the core idea behind HyDE — generating hypothetical documents (or answer embeddings) to find real ones.

This is the basic workflow of HyDE:

❓

There is a slight drawback in this approach. Can you identify that?

When not to HyDE?

HyDE pays off mainly when the answer is scattered throughout the document and no passage states it directly. For simpler queries, HyDE is overkill. Each pseudo-response burns more tokens than generating a few questions, so it makes sense only when precision matters more than cost.

Plus, there is a drawback to this approach. Remember when we generated a pseudo-response to the question about the “Research Paper”? The LLM that generates the answer has to know at least something about the topic of the question to produce useful keywords, right?

😶

HyDE tends not to work well with smaller language models. It works best with a large, up-to-date language model that already knows the domain.

Ok, enough with the ideas and intuitions. Let’s discuss the implementation:

Let’s code 🥰:

Sequence:

Nice system prompt
Input requests, get a “Paragraph”
Decompose that for more clarification
Generate some “Parallel Query” for context-matching
Vector Search and get your answer.

Here, the first two steps are compulsory; the decomposition and parallel query steps are just optimizations so that we do not miss any context.

Code:

1 and 2:

def HyDE(self, query):
        print("HyDE Running 🧑‍🔬")
        try:
            system_prompt = f"""
               Generate a comprehensive, expert-level answer to this query as if you're writing documentation or academic content.

                Query: "{query}"

                REQUIREMENTS:
                1. Write in professional, authoritative tone (like a domain expert)
                2. Generate exactly one well-structured paragraph (4-6 sentences)
                3. Include technical terminology and key concepts relevant to the field
                4. Cover the main topic plus 2-3 closely related subtopics
                5. Use declarative statements, not questions
                6. Write as if explaining to a knowledgeable audience

                RETURN FORMAT:
                {{
                    "original": "{query}",
                    "generated": "your expert paragraph here"
                }}

                Return ONLY valid JSON, no additional text.
            """

            response = self.model.generate_content(system_prompt)

            if not response or not response.text:
                print("No response was generated. ")
                return None

            filtered_response = filter_response(response)

            try:
                parsed_response = json.loads(filtered_response)
                return parsed_response
            except json.JSONDecodeError as e:
                print(f"JSON parsing error: {e}")
                return None

        except Exception as e:
            print(f"Failed to run HyDE: {e}")
            return None
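The function above calls `filter_response`, which is not shown in this article. The following is only a guess at what such a helper might do, assuming the model sometimes wraps its JSON in a Markdown code fence and that the response object exposes a `.text` attribute (as the checks in the other functions suggest). The `SimpleNamespace` object is a fake response used for the demo.

```python
import json
import re
from types import SimpleNamespace

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def filter_response(response) -> str:
    """Hypothetical helper: return the JSON text inside response.text."""
    text = response.text.strip()
    match = FENCED_BLOCK.search(text)
    # If the model returned bare JSON, pass it through unchanged.
    return match.group(1) if match else text


fake_response = SimpleNamespace(
    text=FENCE + 'json\n{"original": "q", "generated": "a"}\n' + FENCE
)
parsed = json.loads(filter_response(fake_response))
print(parsed["generated"])
```

Check the full code for the real helper; this sketch is only meant to make the `json.loads` step above easier to follow.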

Remaining steps:

See the full code.

Full Code:

See the full code. (Mainly see the main function for clarification.)

Input and Output Testing:

Input:

 How does transformer architecture improve natural language understanding?

Output:

 HyDE Running 🧑‍🔬
 Transformer architecture significantly enhances natural language understanding (NLU) by leveraging
 self-attention mechanisms. Unlike recurrent neural networks (RNNs), transformers process all input 
 tokens simultaneously, enabling them to capture long-range dependencies and contextual information. 
 This allows for better representation of semantic relationships between words and phrases, resulting 
 in improved performance on tasks like machine translation, text summarization, and question answering. 
 Furthermore, transformers' parallel processing capabilities facilitate efficient training and inference, 
 making them a highly effective architecture for NLU tasks.

 Decomposing Query 🧠
 1: Transformers enhance natural language understanding using self-attention mechanisms.
 2: Transformers process all input tokens simultaneously, unlike recurrent neural networks.
 3: Transformers can capture long-range dependencies and contextual information.
 4: Transformers improve performance on tasks like machine translation, text summarization, and question answering.
 5: Transformers enable better representation of semantic relationships between words and phrases.
 6: Transformers facilitate efficient training and inference due to parallel processing capabilities.
 7: Transformers are a highly effective architecture for natural language understanding tasks.
 8: Transformers are an alternative to recurrent neural networks for natural language processing.
 9: Transformer architecture has advantages over recurrent neural networks for certain tasks.
 10: The use of self-attention in transformers is crucial for their effectiveness.

 # Parallel Query Generation also runs after this; its output is not shown here.

❓

When you follow the full code, you might see the decomposed queries as questions rather than single-line statements. Can you generate single-line statements like these?

Conclusion:

We have seen a bunch of Query Translation methods so far: Parallel Query (Fan Out) Retrieval, Reciprocal Rank Fusion, Query Decomposition, and HyDE (Hypothetical Document Embeddings). All of them use the LLM to reshape the query before retrieval. We are not doing anything fancy yet, just increasing our chances of finding the required data in our document so that we can do “further work” on it.

Before LLMs, we had to do this kind of similarity matching by hand, with code and conditionals. LLMs make this part much easier, though solid coding logic is still essential in engineering. We just need to build our logic around the response we get from the LLM, and one of the problems is solved.

In the later parts, we will see how to optimize the whole pipeline further.

Query Decomposition

Pritom Biswas — Sun, 08 Jun 2025 18:30:10 GMT

Previous context:

What we saw earlier

Remember, we did implement “Parallel Query Retrieval” and “Reciprocal Rank Fusion (RRF)”. There, we asked a question, “What is fs?” and the LLM generated some similar questions.

Now, in this article, let us ask this question: “What is React?” If we run it through the Parallel Query Retrieval system, what similar questions will we get?

What is React.js?
What is the React framework?
React JavaScript library explained
Introduction to React

Like these, right?

Let’s test something:

Nice, now let us test with another question: “ What are the advantages and disadvantages of React compared to Vue.js for building large-scale applications? “ I got these similar queries from my system:

Compare React and Vue.js for large-scale projects
What are the pros and cons of using React for building large applications?
Is React or Vue.js better for developing complex web applications?
What are the benefits of using Vue.js over React in large-scale projects?

Look closely, all the queries include React.js and Vue.js together; they never separate them so that the LLM can retrieve individual knowledge about each. This is ok if my supplied document in the RAG has them together and directly answers the question. But,

“What if the supplied document does not directly answer the question and covers React.js and Vue.js in separate places?”

🛑

Yeah, new problem marked. How would you solve this?

Interesting, right?

So, how can we get rid of this thing? Simple, divide the complex query into a simpler form. In a word, “DECOMPOSE THEM” !!!

What is Query Decomposition?

Let’s try something:

So, our previous “complex” query was this: “ What are the advantages and disadvantages of React compared to Vue.js for building large-scale applications? “ Can we break this into these queries?

What are React advantages for large applications?
What are React disadvantages for large applications?
What are Vue.js advantages for large applications?
What are Vue.js disadvantages for large applications?

Now, we have got React.js and Vue.js separately. So, even if they are in different places in the given context, we can apply vector search on each of them and get results. This is like breaking an abstract query into more concrete ones.

❓

Can you optimize this more? I mean, what if the given context does not have the exact query words in it?

💡

Of course, just generate some parallel queries of the decomposed queries !!!

The main workflow kind of looks like this:

Yeah, this is “Query Decomposition” !!!

Definition:

Query Decomposition is a technique where you break down a complex, multi-faceted user question into smaller, more focused sub-questions before performing retrieval in a RAG system.

Some diagrams:

Look at the following diagram for better understanding.

These are the intuitions and main mechanisms of “Query Decomposition”.

Why Query Decomposition:

These are the main reasons to decompose a query:

Vector Search Limitations:

When multiple distinct concepts are asked in a single query, vector embeddings struggle to search through the document and establish connections.
Improved Retrieval Coverage:

A single query might miss some concepts, whereas generating fragmented queries retrieves more subject-specific data, which finally generates more accurate results.
Reduced Semantic Confusion:

Complex queries sometimes include multiple concepts that are semantically close but differ in what they actually mean. Decomposing queries reduces this confusion and produces clear, unambiguous results.
Better Document Relevance:

Dividing the query into smaller sub-queries helps retrieve data from subject-specific fields, which keeps relevance with the context/document.

💡

Isn’t this like asking different specialists in their specialized fields to answer different questions?

Let’s code:

Flow:

Give a nice system prompt.
Take the user query
Generate Decomposed Query
Run Parallel Query Retrieval on them
Search on the Vector Store using the queries and Retrieve the response/docs
Use the Original Query and the Retrieved context to get the results
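Before looking at the real functions, here is a rough sketch of how steps 3 to 5 of this flow fit together. All three helper functions are hypothetical stubs written for this example: the real versions call the LLM and the vector store.

```python
def decompose(query: str) -> list[str]:
    """Hypothetical stub for the LLM decomposition step."""
    return [
        "What are React advantages for large applications?",
        "What are Vue.js advantages for large applications?",
    ]


def parallel_queries(query: str) -> list[str]:
    """Hypothetical stub for parallel query generation (one variant per query)."""
    return [query, query.replace("advantages", "benefits")]


def vector_search(query: str) -> list[str]:
    """Hypothetical stub for the vector store; returns matching chunk ids."""
    index = {"React": ["react_chunk_1"], "Vue.js": ["vue_chunk_1"]}
    return [cid for topic, ids in index.items() if topic in query for cid in ids]


def retrieve_context(user_query: str) -> list[str]:
    seen, context = set(), []
    for sub_query in decompose(user_query):            # step 3
        for variant in parallel_queries(sub_query):    # step 4
            for chunk_id in vector_search(variant):    # step 5
                if chunk_id not in seen:               # drop duplicates, keep order
                    seen.add(chunk_id)
                    context.append(chunk_id)
    return context  # step 6 would pass this, plus user_query, to the LLM


context = retrieve_context(
    "What are the advantages and disadvantages of React compared to Vue.js "
    "for building large-scale applications?"
)
print(context)
```

Several variants can hit the same chunk, so the sketch removes duplicates before the context is handed to the final generation step.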

Code:

🔴

I am using a Docker container for the vector store, so you will need to set that up as well.

1 to 3. Decompose Query Function

def decomposeQuery(self, query, number_of_queries = 3):
        print("Decomposing Query 🧠")
        try:
              # Nice prompt
            system_prompt = f"""
                You are a helpful AI assistant who decomposes the given complex {query} into simpler queries using its keywords at the given number = {number_of_queries}.

                METHOD:
                1. Firstly, analyze the complex query and extract its keywords and split the distinct keywords.
                2. Secondly, make new queries using the keywords. ALWAYS remember to keep one distinct topic in a single query.
                3. Thirdly, return the queries in the given format.
                4. ALWAYS remember that each query should only take one line.
                5. TRY to make them as straightforward as possible. Each query should contain the gist of the original query.

                EXAMPLE:
                "original": "What are the advantages and disadvantages of React compared to Vue.js for building large-scale applications?"
                "generated":
                    1. What are React advantages for large applications?
                    2. What are React disadvantages for large applications?
                    3. What are Vue.js advantages for large applications?
                    4. What are Vue.js disadvantages for large applications?

                RETURN FORMAT
                You only need to return the queries in this json format:
                {{
                    "original": "{query}",
                    "generated": [
                        "generated_1",
                        "generated_2",
                        "generated_3"
                    ]
                }}
                ONLY return in the given format.
            """

            response = self.model.generate_content(system_prompt)

            if not response or not response.text:
                print("No response from model")
                return None

            filtered_response = filter_response(response)

            try:
                parsed_response = json.loads(filtered_response)
                return parsed_response
            except json.JSONDecodeError as e:
                print(f"JSON parsing error: {e}")
                return None

        except Exception as e:
            print(f"Query Decomposition failed: {e}")
            return None

Run a Parallel Query on the decomposed queries:

 # Calling this multiple times to get the parallel queries:
 def hybridQuery(self, query, number_of_queries):
         print("Hybrid Query Initiating 🤓")
         try:
             response = self.generateParallelQuery(query=query, number_of_queries=number_of_queries)

             if not response:
                 print("Hybridization failed")
                 return None

             return response
         except Exception as e:
             print(f"Hybrid Query Generation Failed: {e}")
             return None

 # Parallel Query Generation Function:
 def generateParallelQuery(self, query, number_of_queries = 3):
         print("Generating Parallel Query 🤔")
         try:
             system_prompt = f"""
                 You are a helpful AI assistant who generates {number_of_queries} queries with similar topics of the given query={query}.

                 METHOD:
                 1. You get a query, analyze it and find the keywords in that.
                 2. You generate similar words based on the keywords. Extract the keywords from the whole {query} and then decide what to make.
                 3. You make similar query like {query} using the newly generated keywords
                 4. The generated queries will not exceed one line.
                 5. Keep them as straightforward as possible

                 EXAMPLE:
                 original: "What is fs in Node.js?"
                 generated:
                     1. "What is file system?"
                     2. "What are files in Node.js?"
                     3. "How to make files in Node.js?"

                 RETURN FORMAT
                 You only need to return the queries in this json format:
                 {{
                     "original": "{query}",
                     "generated": [
                         "generated_1",
                         "generated_2",
                         "generated_3"
                     ]
                 }}

                 Return ONLY valid JSON, no additional text.
             """

             response = self.model.generate_content(
                 system_prompt
             )

             if not response or not response.text:
                 print("No response from model")
                 return None

             filtered_response = filter_response(response)

             try:
                 parsed_response = json.loads(filtered_response)
                 return parsed_response
             except json.JSONDecodeError as e:
                 print(f"JSON parsing error: {e}")
                 return None

         except Exception as e:
             print(f"Problem occurred while generating the response: {e}")
             return None

5 and 6. See the full code.

Full Code:

See the full Code here

Conclusion

“Query Decomposition” is another optimization method to handle more complex queries and save the LLM from redundancy. For simpler queries, Parallel Query Retrieval was good enough. But Query Decomposition lets each sub-query specialize in one part of the retrieval.

❓

Can you tweak the Query Decomposition function code so that it uses different specialized characters (Doctors, Engineers, etc.) to get specialized answers?

💡

Hint: Agents

Reciprocal Rank Fusion (RRF)

Pritom Biswas — Sat, 07 Jun 2025 13:00:24 GMT

Remember:

Previously, we made a RAG model with Parallel Query Retrieval (Fan Out system). If you did not read it, just click on the name “Parallel Query Retrieval.” I will take some context from there.

Context:

You gave the Node.js documentation to the RAG model and asked, “What is fs?” But the documentation did not have the word “fs” in it. To solve this,

The LLM generated some similar questions (parallel query generation):

What is the file system?
What is a file in Node.js?
How to create a file in Node.js?

Then you searched these terms in the “Vector Store” and found this:

You could not find anything for the question “What is fs?”
You found 2 similarities for the question “What is the file system?” (just denoting: one yellow, one blue)
You found one similarity for the question “What is a file in Node.js?” (one blue)
You found three similarities for the question “How to create a file in Node.js?” (one blue, one yellow, one red)

The diagram was like this:

💡

Observe the rankings of the files, what positions they appear in for each query, and how many times they appear in total.

❓

Can you notice a major flaw in the response? The blue file appeared most often but ranked second. Will the response be dependable in this context?

Yeah, this response is somewhat optimized, but in a large context, it will lose its relevance and credibility, right? This problem gives birth to a new term, “Reciprocal Rank Fusion”.

What is Reciprocal Rank Fusion (RRF)?

Some insights:

Some hefty words, huh? Well, “Reciprocal Rank Fusion” just means “Rank Them”, simple!!!

Now, the million-dollar question: “How will you rank them?” There should be two criteria, right?

Sort the results on the total appearance in descending order.
If the total appearances for two results are the same, sort them in the order of appearance (1st, 2nd, 3rd).

This is the main idea. It‘s time to do a dry run. Look at the diagram below:

Now, notice the observations:

We found 2 similarities for the question “What is the file system?” (just denoting: one yellow, one blue). Yellow was 1st, Blue was 2nd.
We found two similarities for the question “What is a file in Node.js?” (one blue, one red). Blue 1st, and Red 2nd.
We found three similarities for the question “How to create a file in Node.js?” (one blue, one yellow, one red). Blue is 1st, Yellow is 2nd, and Red is 3rd.

So, the final observation:

Blue has 3 appearances. So, it will be placed in the first position.
Yellow and Red have both appeared twice. But Yellow appeared ahead of Red more often, so Yellow will be 2nd.
Red will be 3rd.
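The dry run above can be reproduced with a small script. This is a sketch of the intuitive two-criteria ranking only, not of the RRF formula itself. Using the average position as the tie-breaker is my assumption about what “order of appearance” means here.

```python
from collections import defaultdict

results_per_query = {
    "What is the file system?": ["yellow", "blue"],
    "What is a file in Node.js?": ["blue", "red"],
    "How to create a file in Node.js?": ["blue", "yellow", "red"],
}

# Collect the positions at which each document appears.
ranks = defaultdict(list)
for ranked_docs in results_per_query.values():
    for position, doc in enumerate(ranked_docs, start=1):
        ranks[doc].append(position)


def sort_key(doc: str):
    positions = ranks[doc]
    # More appearances first; ties broken by better (lower) average position.
    return (-len(positions), sum(positions) / len(positions))


final_order = sorted(ranks, key=sort_key)
print(final_order)  # ['blue', 'yellow', 'red']
```

Blue wins on appearances (3), and Yellow beats Red on the tie-break because its average position is 1.5 against 2.5.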

Wow, ranking done!!! Now, your result has more relevance than just retrieving data in parallel.

Intuition:

If a document ranks higher in multiple lists (even if it doesn't appear in all), it’s probably important and gets a better final score.

Definition:

Reciprocal Rank Fusion (RRF) is a simple yet powerful method used to combine multiple ranked lists of documents (or responses, answers, items, etc.) into a single fused ranking.

Formula:

$$\text{RRF\_score}(\text{document}) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_i(\text{document})}$$

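In words: for each of the n result lists in which the document appears, take its rank, add a constant k (typically 60), take the reciprocal, and sum these values over all lists. A list where the document does not appear contributes 0. Let’s take an example to understand this: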

## User query:
"What is polymorphism?"

## LLM Generated queries
queries = [
    "What is polymorphism?",
    "Types of polymorphism in OOP", 
    "How does polymorphism work?",
    "Polymorphism examples"
]


## Example results:
 Query 1 results: [Page15, Page16, Page18, Page20]
 Query 2 results: [Page16, Page15, Page17, Page19] 
 Query 3 results: [Page15, Page18, Page16, Page21]
 Query 4 results: [Page17, Page15, Page20, Page16]

**Here we score three of the pages that appeared: Page15, Page16 and Page18**

# RRF Calculations:
Page15_RRF = 1/(60+1) + 1/(60+2) + 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 + 0.0164 + 0.0161 = 0.0650
Page16_RRF = 1/(60+2) + 1/(60+1) + 1/(60+3) + 1/(60+4) = 0.0161 + 0.0164 + 0.0159 + 0.0156 = 0.0640
Page18_RRF = 1/(60+3) + 0 + 1/(60+2) + 0 = 0.0159 + 0 + 0.0161 + 0 = 0.0320

## Final ranking: Page15 (0.0650) > Page16 (0.0640) > Page18 (0.0320)
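To check the arithmetic, here is a small sketch that applies the formula to the four result lists from the example. The `rrf` function name is mine; the data and k=60 come from the example above.

```python
from collections import defaultdict

ranked_lists = [
    ["Page15", "Page16", "Page18", "Page20"],
    ["Page16", "Page15", "Page17", "Page19"],
    ["Page15", "Page18", "Page16", "Page21"],
    ["Page17", "Page15", "Page20", "Page16"],
]


def rrf(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) for every list in which a document appears."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return dict(scores)


scores = rrf(ranked_lists)
for doc, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{doc}: {score:.4f}")
```

The script scores every page, not only the three worked out by hand. Page15 (0.0650), Page16 (0.0640) and Page18 (0.0320) match the manual calculation, and Page17 comes out at 0.0323, just ahead of Page18, because it ranked 1st in one of its two lists.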

💡

Look closely, here a higher score means higher relevance, why?

🧠

The more often a result appears, and the closer to the top it ranks, the larger its sum. Take some input and test it yourself.

Why k=60?

Look at the examples below:

# Without k constant:
Rank 1: 1/1 = 1.0
Rank 2: 1/2 = 0.5    ← 50% drop! Too harsh!
Rank 3: 1/3 = 0.33   ← 67% drop from rank 1

# With k=60:
Rank 1: 1/(60+1) = 1/61 = 0.0164
Rank 2: 1/(60+2) = 1/62 = 0.0161  ← Only 2% drop
Rank 3: 1/(60+3) = 1/63 = 0.0159  ← Only 3% drop from rank 1

So, we need a constant to stop the score from dropping so sharply between neighbouring ranks, and “60” is an empirically chosen value that tends to work well in practice.
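If you want to play with other values of k, this small sketch prints the relative drop in contribution from rank 1 to ranks 2 and 3. The `relative_drop` helper is just a name I made up for the example.

```python
def relative_drop(k: int, rank: int) -> float:
    """Fractional drop in contribution going from rank 1 to `rank`."""
    top = 1 / (k + 1)
    return (top - 1 / (k + rank)) / top


for k in (0, 10, 60):
    drops = ", ".join(f"rank {r}: {relative_drop(k, r):.1%}" for r in (2, 3))
    print(f"k={k:>2} -> {drops}")
```

With k=0 the drop to rank 2 is 50%, while with k=60 it is about 1.6%, so a larger k flattens the difference between neighbouring ranks.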

🧠What does the word “Reciprocal” mean?

Great question!

When we were analyzing and combining query results, we needed a way to fairly rank documents that appeared across different queries. Instead of simply assigning points based on ranks and summing them up, we used the reciprocal of the rank values.

Why? Because taking the inverse (or reciprocal) of a rank gives higher weight to top-ranked results while still allowing lower-ranked results to contribute. This method helps stabilize the scoring and makes the system more robust.

In the RRF formula, we take the rank of a document (like Page15), add a constant k (usually 60), and then invert the result. We repeat this for each query where the document appears and sum them up. This "reciprocation" is where the name Reciprocal Rank Fusion comes from.

In short:

⬅

Reciprocal means “inverse,” and using the inverse rank helps us fuse results in a smart and balanced way.

Our general discussion is done. Let’s move to some code:

Some examples:

🔴

I am using a Docker container in my system for the vector store.

def perform_parallel_search_with_rrf(vector_store, queries, k_per_queries, rrf_k=60, min_similarity_threshold = 0.4):
    """Perform search with multiple queries and combine results"""
    all_documents = {}
    query_results = {}

    # Running parallel search on it.
    for index, query in enumerate(queries, 1):
        print(f"Running search on query {index}: {query}")
        query_results[query] = []
        response = vector_store.search(query, k=k_per_queries)

        # Checking relevance of the response. Here, the lower the score, the better.
        # Note: this is the score returned by the search function (Qdrant), not our RRF score.
        # Whether lower or higher means "more relevant" depends on the distance metric of the
        # collection, so adjust the comparison below to match your setup.
        # Check out this: https://qdrant.tech/documentation/concepts/search/
        relevant_results = []
        for rank, (document, score) in enumerate(response, 1):
            if score <= min_similarity_threshold:
                relevant_results.append((rank, document, score))
            else:
                print(f"The response no. {index} is not relevant enough")

        if not relevant_results:
            print("No relevant results found, sorry")
            continue

        # Ranking begins
        # We first calculate in which position the document appears for a query
        for original_rank, document, score in relevant_results:
            content_hash = hash(document.page_content[:100].strip())
            page_number = document.metadata.get('page', 'unknown')
            doc_id = f"page_{page_number}_{content_hash}"

            if doc_id not in all_documents:
                all_documents[doc_id] = {
                    'document': document,
                    'content': document.page_content,
                    'page': document.metadata.get('page', 'N/A'),
                    'source': document.metadata.get('source', 'N/A'),
                    'query_ranks': {},
                    'similarity_scores': {},
                    'rrf_contributions': {},
                    'queries_appeared': []
                }

            all_documents[doc_id]['query_ranks'][query] = original_rank
            all_documents[doc_id]['similarity_scores'][query] = score # We save the Qdrant search score in here
            all_documents[doc_id]['queries_appeared'].append(query)

            rrf_contribution = 1/(original_rank+rrf_k) # We calculate the RRF contribution for each document for each query
            all_documents[doc_id]['rrf_contributions'][query] = rrf_contribution

            query_results[query].append({
                'doc_id': doc_id,
                'rank': original_rank,
                'similarity_score': score,
                'rrf_contribution': rrf_contribution
            })

    rrf_results = []

    # We then arrange the RRF contribution wise results in here.
    # We then sort the results
    for doc_id, doc_data in all_documents.items():
        total_rrf_score = sum(doc_data['rrf_contributions'].values())
        num_of_queries_appeared = len(doc_data['queries_appeared'])
        avg_rank = sum(doc_data['query_ranks'].values())/num_of_queries_appeared
        avg_similarity = sum(doc_data['similarity_scores'].values())/num_of_queries_appeared
        best_rank = min(doc_data['query_ranks'].values())
        best_similarity = min(doc_data['similarity_scores'].values())

        consensus_score = num_of_queries_appeared/len(queries)

        rrf_results.append({
            'document': doc_data['document'],
            'content': doc_data['content'],
            'page': doc_data['page'],
            'source': doc_data['source'],
            'rrf_score': total_rrf_score,
            'consensus_score': consensus_score,
            'num_queries_appeared': num_of_queries_appeared,
            'avg_rank': avg_rank,
            'best_rank': best_rank,
            'avg_similarity': avg_similarity,
            'best_similarity': best_similarity,
            'query_ranks': doc_data['query_ranks'],
            'similarity_scores': doc_data['similarity_scores'],
            'rrf_contributions': doc_data['rrf_contributions'],
            'queries_appeared': doc_data['queries_appeared']
        })

    rrf_results.sort(key=lambda x:x['rrf_score'], reverse=True) # We are sorting here.

    return rrf_results

💡

Read the comments in the code. Hope you will understand.

Full Code:

See the full code here.

Some observations:

Did you notice that I applied the “RRF” technique on “Parallel Query Retrieval”? Is it necessary?
Is there any problem with this technique? Will it generate expected results every time? When will this technique “hallucinate”/“fail”?
Can you check this implementation by giving different inputs?

Conclusion:

“Reciprocal Rank Fusion” is just a fancy name. Under the hood, it just re-ranks the retrieved documents by combining their ranks across all the queries. Is it enough? I cannot say. There are a bunch of techniques that may work better than this one. In later articles, I will try to cover them. Stay tuned!

Parallel Query (Fan Out) Retrieval

Pritom Biswas — Fri, 06 Jun 2025 18:55:41 GMT

Introduction:

Previously, we learnt what RAG is and why Query Translation is important. Now, we will learn about a popular technique of Query Translation: Parallel Query Retrieval.

What is Parallel Query Retrieval?

Some Backstory:

We know that RAG works on some context (documents, the web, or anything that has relevant data). Now, suppose we have given the RAG a Node.js file as context, and the user asks this:
“What is fs?”

As humans, we can understand that the user wants to know about the “File System” in Node.js. But what if the Node.js documentation never uses the word “fs” and has “file system” written everywhere instead? When using RAG, will it find any similarity? Will it be able to perform well?

No, right? But we need to take care of this.

Actually, we can solve this problem by this process:

The user asks the question.
We prompt the LLM to generate some similar questions. Here, the questions might be:
1. What is fs?
2. What is the file system?
3. What is a file in Node.js?
4. How to create a file in Node.js?
We search for each generated question in the vector store (or whichever database holds the embedded documentation).
We collect the similar files found for each question. Suppose:
1. We could not find anything for the question “What is fs?”
2. We find two matches for the question “What is the file system?” (just for labeling: one yellow, one blue)
3. We find one match for the question “What is a file in Node.js?” (one blue)
4. We find three matches for the question “How to create a file in Node.js?” (one blue, one yellow, one red)
We then filter the results and keep only the unique files.
Finally, we give the three unique files (yellow, blue, and red) to the LLM as context and answer the user’s query.
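
The filtering step above can be sketched in a few lines. This is only an illustration: the dictionary below hard-codes the colored files from the example instead of querying a real vector store.

```python
# Hypothetical search results per generated question (labels stand in for files)
hits_per_query = {
    "What is fs?": [],
    "What is the file system?": ["yellow", "blue"],
    "What is a file in Node.js?": ["blue"],
    "How to create a file in Node.js?": ["blue", "yellow", "red"],
}

unique_files = []   # unique files in first-seen order
appearances = {}    # how many questions each file matched

for query, files in hits_per_query.items():
    for name in files:
        if name not in appearances:
            unique_files.append(name)
        appearances[name] = appearances.get(name, 0) + 1

print(unique_files)
print(appearances)
```

The appearance counts are not needed for plain fan-out retrieval, but they hint at which files several questions agree on.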

🤔

Did we solve our problem? Can the LLM answer our question now?

💡

Yes, the problem is solved for now. Look, the search could not match “fs”, but it could match “file system” for sure. Now, the LLM has the context it needs to answer the user’s question, right?

❓

Is this enough?

Definition:

Parallel Query Retrieval, also known as Fan Out Retrieval, is a method where multiple variants of the same user query are created and sent in parallel to different or the same retrieval systems. The goal is to maximize recall and diversify the retrieved documents, ultimately helping the LLM generate more informed and accurate answers.

Why Fan Out?

The term “Fan Out” actually comes from systems design and networking. It means:

🌐

Spreading a single input into multiple parallel paths or processes.

In Parallel Query Retrieval, we take a single user input, generate multiple queries from it, and spread them across multiple paths, just like “fanning out” the queries. It’s like four or five experts answering the same question. Interesting, isn’t it?
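
To make the “parallel paths” literal, the searches can run concurrently. The sketch below is an assumption-heavy illustration: `fake_search` is a made-up stand-in for a vector store search, and `fan_out` is a hypothetical helper built on the standard library thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_search(query, k=2):
    """Hypothetical stand-in for a vector store search."""
    return [f"result {i} for '{query}'" for i in range(1, k + 1)]

def fan_out(queries, search_fn, k=2, max_workers=4):
    """Run one search per query concurrently and collect the hits per query."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input queries
        results = list(pool.map(lambda q: search_fn(q, k), queries))
    return dict(zip(queries, results))

queries = [
    "What is fs?",
    "What is the file system?",
    "How to create a file in Node.js?",
]

fanned = fan_out(queries, fake_search)
for query, hits in fanned.items():
    print(query, "->", hits)
```

Threads suit this case because each search mostly waits on the database. With only a handful of queries, a plain loop is often fast enough.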

Some Examples:

Let’s divide the process into some parts to understand better:

🔴

I am using a Docker container in my system for the vector store.

Parallel Query Generation:

 import json
 import google.generativeai as genai

 class ParallelQuery:
     def __init__(self, api_key):
         genai.configure(api_key=api_key)
         self.model=genai.GenerativeModel('gemini-1.5-flash-001')

     def generateParallelQuery(self, query, number_of_queries = 3):
         try:
             system_prompt = f"""
                 You are a helpful AI assistant who generates {number_of_queries} queries with similar topics of the given query={query}.

                 METHOD:
                 1. You get a query, analyze it and find the keywords in that.
                 2. You generate similar words based on the keywords.
                 3. You make similar query like {query} using the newly generated keywords

                 EXAMPLE:
                 original: "What is fs in Node.js?"
                 generated:
                     1. "What is file system?"
                     2. "What are files in Node.js?"
                     3. "How to make files in Node.js?"

                 RETURN FORMAT
                 You only need to return the queries in this json format:
                 {{
                     "original": "{query}",
                     "generated": [
                         "generated_1",
                         "generated_2",
                         "generated_3"
                     ]
                 }}

                 Return ONLY valid JSON, no additional text.
             """

             response = self.model.generate_content(
                 system_prompt
             )

             if not response or not response.text:
                 print("No response from model")
                 return None

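             # filter_response is a helper from the full code; it cleans the raw model output before JSON parsing
             filtered_response = filter_response(response)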

             try:
                 parsed_response = json.loads(filtered_response)
                 return parsed_response
             except json.JSONDecodeError as e:
                 print(f"JSON parsing error: {e}")
                 return None

         except Exception as e:
             print(f"Problem occurred while generating the response: {e}")
             return None

💡

Look at the system prompt closely, and you will understand. Other than that, everything else is just cleaning and parsing the model’s response.
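
The cleaning step matters because models often wrap their JSON in Markdown code fences. Here is a minimal sketch of such a cleaner. The name `clean_json_text` is hypothetical and this is not the `filter_response` helper from the full code; it only shows one way the idea could work.

```python
import json

FENCE = "`" * 3  # the three-backtick Markdown fence

def clean_json_text(raw_text):
    """Strip a surrounding Markdown code fence, if any, from model output."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag like "json")
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith(FENCE):
            # Drop the closing fence
            text = text.rstrip()[:-3]
    return text.strip()

raw = FENCE + 'json\n{"original": "What is fs?", "generated": ["What is the file system?"]}\n' + FENCE
parsed = json.loads(clean_json_text(raw))
print(parsed["generated"])
```

Output that is already plain JSON passes through unchanged, so the cleaner is safe to call on every response.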

Searching References:

 # Main Parallel Search Function

 def perform_parallel_search(vector_store, queries, k_per_queries):
     """Perform search with multiple queries and combine results"""
     all_results = []

     for index, query in enumerate(queries, 1):
         print(f"Running search on query: {index}")
         response = vector_store.search(query, k=k_per_queries)

         for (document, score) in response:
             all_results.append({
                 'query': query,
                 'document': document,
                 'score': score,
                 'content': document.page_content,
                 'page': document.metadata.get('page', 'N/A'),
                 'source': document.metadata.get('source', 'N/A')
             })

     all_results.sort(key=lambda x:x['score'])
     # Helper that removes duplicate documents; see the full code linked below.
     unique_results = remove_duplicate_results(all_results)

     print(f"Total result's length: {len(unique_results)}")
     return unique_results

 # Function to search in Vector Store:

 def search(self, query, k=5):
     """Search the vector store for relevant data"""
     try:
         # Reuse the existing store if there is one, otherwise build a retriever
         if hasattr(self, 'vector_store') and self.vector_store:
             store = self.vector_store
         else:
             print("Creating a new retriever...")
             store = self._retrieve()
             if not store:
                 raise ValueError("Failed to create a retriever....")

         results = store.similarity_search_with_score(query, k=k)
         print(f"Found {len(results)} results for the given query")
         return results
     except Exception as e:
         print(f"Failed to search on the store: {e}")
         return []

 # This is in the VectorStore defined in the full code.
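
The duplicate removal used in `perform_parallel_search` lives in the full code. As a rough idea of what such a helper could look like, here is a hypothetical stand-in (`dedupe_results` is my own name, and keying on source, page, and content is an assumption):

```python
def dedupe_results(results):
    """Keep one entry per (source, page, content); the first occurrence wins."""
    seen = set()
    unique = []
    for item in results:
        key = (item["source"], item["page"], item["content"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

# Hypothetical hits from two different queries that found the same chunk
results = [
    {"content": "fs module basics", "page": 3, "source": "node.pdf", "score": 0.40},
    {"content": "creating files", "page": 7, "source": "node.pdf", "score": 0.35},
    {"content": "fs module basics", "page": 3, "source": "node.pdf", "score": 0.21},
]

results.sort(key=lambda r: r["score"])  # same ascending sort as the search function
unique = dedupe_results(results)
print([(r["content"], r["score"]) for r in unique])
```

Because the list is sorted before deduplication, the copy that survives is the one that comes first in that ordering.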

Main Function:


 import os

 def main():
     request = input("Query> ")
     number_of_queries = int(input("Number of queries> ") or "3")

     gemini_api = os.getenv("GEMINI_API_KEY")

     try:
         gemini = ParallelQuery(api_key=gemini_api)
         vector_store = VectorStore("Lecture 3 - Polymorphism_250520_224757.pdf")
     except Exception as e:
         print(f"Error occurred while setting up API and vector store: {e}")
         return

     response = gemini.generateParallelQuery(request, number_of_queries)

     # Stop early if the model returned nothing usable
     if not response:
         print("No response returned\n")
         return

     # Search with the original query plus every generated one
     total_queries = [response['original']]
     print(f"\nOriginal: {response['original']}")
     for index, query in enumerate(response['generated'], 1):
         print(f"{index}: {query}")
         total_queries.append(query)

     results = perform_parallel_search(vector_store, total_queries, 5)

     for index, result in enumerate(results, 1):
         print(f"{index}: {result['content']}")
         print(f"In page: {result['page']}")

🧠

These are the basics of Parallel Query Retrieval: user question → generate parallel queries → search the vector store for each one → merge into more robust results.

Full Code:

See the full code here.

Conclusion:

So, Parallel Query (Fan Out) Retrieval - some fancy name, huh? Actually, this is an optimization process for better output. There are a lot of other techniques out there, and I will go through them one by one. For now, stay tuned.

❓

I actually could make it more relevant. Can you tell me how?

Advanced RAG: Query Translation

Pritom Biswas — Thu, 05 Jun 2025 15:54:52 GMT

🤔Let’s think back

We learned about the “Basic RAG System” beforehand, right? If you did not read that, read from here. We know, a RAG system consists of these parts:

Indexing + Retrieval
Augmentation and
Generation

But these parts alone do not always give good results. So, we add some optimization layers around them, like this:

Query Translation
Routing
Query Construction
Indexing + Retrieval
Augmentation and
Generation

All these layers are added to get the best results from the LLMs. In this article, we will learn about “Query Translation” and why it is needed.

What is Query Translation?

📖General Discussion:

No, right?

Here comes the importance of “Remodeling the User Query”. We need to reshape/enhance/remodel the user’s query for better output. This method is called “Query Translation”.

➡️Definition:

In the context of Retrieval-Augmented Generation (RAG), query translation is the process of transforming a user’s query into a more optimized form to improve the retrieval of relevant information.

📃Methods of Query Translation:

There are a lot of methods in here. Let’s discuss the main ones:

Parallel Query (Fan Out) Retrieval
Reciprocal Rank Fusion
Query Decomposition
HyDE (Hypothetical Document Embeddings)

These methods are mainly used in the industry. I’m not gonna discuss them in depth here, but in separate articles. Good Luck.

Introduction to RAG: 101

Pritom Biswas — Thu, 05 Jun 2025 09:59:43 GMT

A common scenario:

You published a book on Generative AI on March 27th of this year. However, the Large Language Model (LLM) you’re using was last trained on March 26th. As we know, LLMs don’t have access to information beyond their last training cutoff.

So, from the model’s perspective, your book doesn’t exist—it’s invisible. You can't ask it questions about the book or expect it to summarize or reference its contents. This brings us to the key question:

“How can I teach the LLM about my book?”

There are several approaches to solve this problem:

Using Agents

One way is to use agents that retrieve information from your book and present it to the user in response to queries. This can be effective in many cases.

But is it feasible in all situations?
Not always. Here’s why:

If your book is extensive, the agent must search through the entire content, or at least across targeted indices, to find relevant information. This process can be resource-intensive and may not scale efficiently.

Using Fine-Tuning

Another approach is fine-tuning—training the LLM with your book's content so that it becomes familiar with the material and can respond to queries naturally.

Sounds ideal, right?

But what if your book is updated frequently?
Then this method becomes less efficient. Here’s why:

Fine-tuning is both time-consuming and costly. Every time you update the book, you’d need to retrain the model with the new content, which is not practical if updates are frequent. In such cases, fine-tuning becomes a resource-draining solution.

🧐

Therefore, using only agents will not suffice, and relying solely on fine-tuning will prove expensive in the long run. It would be much easier if we could use these things only where they are needed in our application, right? Here comes the concept of RAG (Retrieval-Augmented Generation).

What is RAG?

Retrieval-Augmented Generation is a hybrid framework that combines two key features of modern AI systems:

Information retrieval and
Text generation

It was first introduced by Facebook AI in 2020 to overcome the knowledge gaps of the LLMs. Let’s dissect the terms to understand better:

1. Retrieval:

At its core, it means fetching relevant information from an external knowledge source (e.g., databases, vector stores, documents, websites, etc) at the time of the query. Process:

Instead of relying on what the model “knows”, it performs a search on the given sources.
The sources can be a vector store, a database, or anything that has the relevant information.
It is like asking the model to “look something up” before answering.

🧠

Is it like “ChatGPT meeting Google”?

2. Augmentation:

“Augmented” means “enhanced with extra capabilities.” In this context:

The retrieved documents are injected into the model’s context (as prompts) before generating the answer.
Some extra operation is done on the retrieved data to help the model understand the context better.
This helps the model to augment/enhance its “knowledge base” in real time.

💡

The model gets smarter, not by “training” but by “giving it helpful context”.

3. Generation:

This step is easy. Process:

Now, the model has the “context” it needed. Based on the context, it generates meaningful answers.
This is the actual question and answer phase, based on both the input query and the fetched context.

🤔

Now, tell me if agents are used in this process? Do I need to train the model?

Yeah, this is the core process that happens in RAGs.

A simple application:

⬅

Remember the book you published earlier? Let’s make a simple chat application on that book.

1. Retrieval:

I will follow these procedures:

Fix the Data Source: Here, the data source is your book.
Fragmentation/Chunking: Divide the data into smaller fragments/chunks so that I can do operations on the data efficiently. (Chunking itself is an art. Will get back to it in some other article, stay tuned 🥰)
Embedding: Turn each chunk of the book into an embedding (a vector) so that I can easily search for similarity.
Store: Store the embeddings in a vector store (Qdrant, Pinecone, etc.).
User query embed: Get the user query and embed that too. (Need similar things to compare, right?)
Search: Finally, search the vector store for the chunks most similar to the user query’s embedding.

💡

Extras: Steps 1–4 (fixing the source, chunking, embedding, and storing) are known as “Indexing”, and steps 5–6 (embedding the query and searching) are called “Retrieval”.
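
The procedure above can be tried end to end without any external service. The sketch below is a toy under loud assumptions: the “embedding” is just a bag-of-words count, the “vector store” is a Python list, and the book text is invented. A real system would use a proper embedding model and a vector database.

```python
import math
import re
from collections import Counter

def chunk_text(text, size=8):
    """Split text into chunks of `size` words (real chunkers are smarter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Fix the data source (a made-up excerpt standing in for the book)
book = (
    "Retrieval fetches relevant chunks from a vector store. "
    "Augmentation injects those chunks into the prompt. "
    "Generation lets the model answer using that context."
)

# Chunk, embed, and store
store = [(chunk, embed(chunk)) for chunk in chunk_text(book)]

def search(query, k=1):
    """Embed the query and return the k most similar chunks."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(query_vector, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top_chunks = search("What does augmentation do to the prompt?")
print(top_chunks)
```

Notice that the fixed-size chunker cuts across sentence boundaries. That is one reason chunking deserves an article of its own.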

2. Augmentation:

I would like to follow this procedure:

Prompt the AI: Feed the retrieved chunks (the similar data) to the model as context.
Generate similar queries: You can skip this part, but it is good to give the AI more context. What if the user gives very vague queries🤔?

Now, the LLM knows the context of the query that the user has asked. 🙂
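
Feeding the context usually means building one prompt string out of the retrieved chunks and the question. A minimal sketch follows; the function name and the exact wording of the instructions are assumptions, and any prompt layout that clearly separates context from question would do.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into a single prompt."""
    context = "\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(retrieved_chunks, 1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_augmented_prompt(
    "What is RAG?",
    ["RAG combines retrieval with generation.", "It was introduced in 2020."],
)
print(prompt)
```

Numbering the chunks makes it easy to ask the model which piece of context it relied on.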

3. Generation:

Feed the user query together with the retrieved context to the LLM.
Get the result.

Yeah, this will happen in the RAG for your book’s chatbot. Here’s the whole picture:

Why RAG matters:

In this whole process, did I use any agents?

-Yes, retrieval part, right?

And how much did I use them? Just on some smaller parts (chunks), right? Isn’t that more efficient than traversing the whole dataset?

-Of course.

And Fine-Tuning? Did I fine-tune the model at all? No, right?

So, in short, RAG fits most “Real-World” applications and can work with live knowledge. It suits almost all situations nicely. 🙂

Conclusion:

RAG itself is very complex. I have only shown a very basic use case. In future articles, I will discuss more complex systems. Stay tuned. 🥰🥰🥰

Fine Tuning and more...

Pritom Biswas — Tue, 03 Jun 2025 18:31:10 GMT

What is Fine-Tuning?

First, let me create a scenario:

Suppose an LLM was trained on a dataset collected up to 25th March, and you started a business on 27th March of the same year. We all know that every model available now has a cut-off date, right? That means a pre-trained model only knows the data available until a fixed date; after that, it does not know anything. So, as you started later, the LLM does not know anything about your business. Now, you have a problem.

“How can the users get the latest/important data of your business???”

You can solve this problem by several methods:

Use AI Agents: You can use agents to scrape data from the internet. But this works on a very shallow level and cannot answer any query that is not on the internet.
Train the AI model: There is another approach. Train the LLM on your business data and let users query it, so that it can answer thoroughly about the business. This is better than just using some agents.

Here, you trained your model on your data and made a transformed model to meet your needs, “THIS IS CALLED FINE-TUNING”. Let’s see the formal definition…

Definition:

“Fine-tuning is the process of taking a pre-trained model (typically on a large, general dataset) and further training it on a smaller, task-specific dataset to adapt it to a particular problem.”

Why is this needed?

Simply put, to adapt the LLM to some specific needs. Compared with training a model from scratch, it also helps in these ways:

It can reduce computing costs and training time
Can work on smaller datasets
Can give better performance

Process of Fine-Tuning:

To Fine-Tune a model these steps are followed:

Methods of Fine-Tuning:

There are several methods of fine-tuning:

Full Fine-Tuning (also known as Full Parameter Fine-Tuning)
Partial/Layer-wise Fine-Tuning
LoRA Fine-Tuning
PEFT (Parameter-Efficient Fine-Tuning)

Now, let’s elaborate on some of these.

Full Fine-Tuning:

In full Fine-Tuning, you adjust the actual weights of the pre-trained LLM model through Forward Propagation, Loss Calculation, Back Propagation, and then Weight Update.
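
The four stages named above can be seen on a tiny model. This is a toy sketch, assuming a two-parameter model y = w * x + b and a mean squared error loss; a real LLM has billions of weights, but every update follows the same cycle.

```python
def train_step(w, b, data, lr=0.1):
    """One full-parameter update for the toy model y = w * x + b."""
    n = len(data)
    # Forward propagation: compute predictions with the current weights
    preds = [w * x + b for x, _ in data]
    # Loss calculation: mean squared error
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / n
    # Back propagation: gradients of the loss with respect to w and b
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / n
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, data)) / n
    # Weight update: every parameter of the model moves
    return w - lr * grad_w, b - lr * grad_b, loss

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points on the line y = 2x + 1
w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, data)
    losses.append(loss)

print(f"w={w:.3f}, b={b:.3f}, first loss={losses[0]:.3f}, last loss={losses[-1]:.6f}")
```

Every step touches every parameter, which is exactly why this approach gets expensive as the model grows.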

This method provides the most accurate solution, with a low risk of incorrect information. It works well for smaller models, but it's not as efficient for larger ones. Why?

Because you need to update the entire LLM, and training a whole model is very costly in terms of hardware and time. If you want to train a model often, it will use a lot of resources, which isn't practical.

LoRA (Low-Rank Adaptation):

Earlier, we saw that training the entire model (the actual LLM) is very expensive. So, what if, instead of training the whole model, we keep the original weights frozen and store only the differences (the adjustments we want) in a separate, much smaller space? Then, the next time we ask the model something, we add these differences on top of the original weights to get the desired answer. This is the idea behind the “Low-Rank Adaptation” method.

A little bit confusing, right?

Let’s answer this, “How do the LLM models generate responses???”

-Doesn’t it find the nearest values from its vector embeddings? Isn’t it just the next token prediction?

-Yes.

So, in the end, everything operates on numbers, right? So, if we learn how far the model’s numbers (its weights) deviate from what our desired responses need, and add that deviation on top of the original model at our next query, won’t we get our desired response? Yes, we will. This is the main idea behind this process. Let’s see diagrams:

The first diagram runs for the first time and trains the new LLM model with fine-tuned data. For each query, the second diagram then runs.

This process is very time-efficient. I mean you do not need to change the original LLM, but make a new temporary model and use its deviation, simple!!!

The trade-off is that the adapter is an extra set of weights to store and load next to the base model. And because the update is restricted to a low rank, it may not match full fine-tuning where the task needs large changes to the model.

I will not discuss the other two, will leave them to you!!!

Some insight on LoRA:

Let’s say we have a weight matrix in an LLM with dimensions m × n. Fine-tuning such large matrices directly can be computationally expensive and memory-intensive.

This is where LoRA (Low-Rank Adaptation) shines.

Instead of updating the full m × n matrix during fine-tuning, LoRA introduces a “delta” matrix: a learned adjustment to the original weights. For most tasks, this delta matrix tends to be low-rank, meaning that it can be described with far fewer numbers than a full m × n matrix.

Here’s the clever part:
Rather than modifying the full matrix, LoRA expresses the delta as the product of two smaller matrices of shapes m × r and r × n, where r ≪ min(m, n). Only these smaller matrices are trained. During inference, they are multiplied and added back to the original weights, reconstructing the adapted transformation.

This approach:

Preserves performance
Minimizes memory and compute overhead
Allows parameter-efficient fine-tuning even for very large models

That’s the core idea: train small, plug back smartly. LoRA makes large-scale model adaptation practical and scalable.
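
A small numeric sketch may help here. The matrix values and sizes below are made up, and the helper names are my own; the point is only to show the product of the two small matrices being added to the frozen weights, and how few numbers need training.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def lora_param_count(m, n, r):
    """Trainable numbers in the two small matrices (m x r and r x n)."""
    return r * (m + n)

m, n, r = 4, 3, 1
W = [[1.0] * n for _ in range(m)]   # frozen pre-trained weights (m x n)
B = [[0.5] for _ in range(m)]       # small trainable matrix (m x r)
A = [[0.1, 0.2, 0.3]]               # small trainable matrix (r x n)

delta = matmul(B, A)                # m x n adjustment with rank at most r
W_adapted = add(W, delta)           # what the model uses after adaptation

print(W_adapted[0])
print("full:", m * n, "lora:", lora_param_count(m, n, r))
print("4096 x 4096 with r=8:", 4096 * 4096, "vs", lora_param_count(4096, 4096, 8))
```

On the toy shapes the saving is small, but for a 4096 × 4096 layer with r = 8 the trainable count drops from about 16.8 million numbers to 65,536.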

Use cases of Fine-Tuning:

Heavily used in chatbot training
Code completion for specific languages
Image classification systems (e.g., in the medical sector)
etc.

Conclusion:

Fine-tuning is a way to make a model specific to a system’s data. There are many other methods (like Agents, RAG, etc.), but for certain needs, where an extra layer for a specific use case has to sit on top of an LLM for a while, Fine-Tuning works well.

Let's make our Agent

Pritom Biswas — Sat, 31 May 2025 06:06:58 GMT

What is an Agent?

An agent is something that can automatically perform tasks, reason, and generate results.

We have seen that LLM models and AIs are like brains that can think, reason, and answer questions. But this isn't very practical on its own. What would you do with just that? Create a chatbot and chat all day?

Here comes the concept of AI Agents, which is much broader than just LLMs! The main idea is to give some actions to the LLM models, like giving hands and legs to the AI so they can do their tasks.

So, the official definition of AI Agents is:
“An AI agent is a software system designed to interact with its environment, gather information, and perform tasks autonomously to achieve predetermined goals set by humans or other systems.”

How does it work?

Ok, we are done with the definition. Now comes the part of its mechanism. According to some resources, there are five core components of AI Agents:

Perception System: Agents receive input from users or sensors. (Generally, the user query)
Reasoning Engine: The LLM that processes information and makes decisions. (The AI models)
Tool Use: The ability to call external functions. (We will get back to it.)
Decision Framework: some structured workflow: plan → action → observe → output
1. Plan: Decides what to do based on the query
2. Action: Calls appropriate functions with specific parameters
3. Observe: Processes the results from function calls
4. Output: Provides final responses to users
Memory: The agent maintains conversation history to track context.

This is the basic workflow of an agent. Now, let’s make a simple weather agent of our own :)

Our First AI Agent:

I am gonna explain the code step by step. Don’t forget to read the comments.

Just some general imports:

import os
import json
import requests
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)
# Install these packages/dependencies in your environment before running the code

Perception system: Input from the user.
```
 user_query = input('> ')
```

Reasoning Engine: I am using the Gemini API here. (It’s kind of free :) )

 system_prompt = f"""
     You're a helpful AI assistant who is specialized in resolving user query.
     You work on plan, action, observe, output mode.

     Available tools: {list(available_functions.keys())}
     Tool descriptions:
     - get_weather(city: str): Returns weather information for a given city

     IMPORTANT RULES:
     - Return ONLY ONE step per response, not multiple steps
     - Start with "plan" step first
     - Wait for next input before proceeding to next step
     - When step is "action", you MUST specify function and input

     Output JSON Format (return only ONE):
     {{
         "step": "plan|action|observe|output",
         "content": "description of what you're doing",
         "function": "function name (only for action step)",
         "input": "function parameter (only for action step)"
     }}

     User Query: """

This is the system prompt, which tells the model which step to take and when. In short, it makes the model reason through the whole process based on the user query.

Tool Use: I am using a weather API, and the LLM will call the API when needed.

 ## This is the main part for the function call
 def get_weather(city: str):
     # wttr.in returns "<condition>:<temperature>" for this format string
     response = requests.get(f"https://wttr.in/{city}?format=%C:%t", timeout=10)

     if response.status_code == 200:
         data = response.text.split(':')
         situation = data[0]
         temp = data[1]
         return f"weather situation: {situation} and temparature: {temp}"
     else:
         # Return a message so the agent has something to observe on failure
         return "API failed to get weather data"

 ## This is the object for function listing. Notice that in the system_prompt, I list the available functions
 available_functions = {
     "get_weather": get_weather
 }

Decision Framework + Memorising:

Observe the code closely.

 while step_count < max_steps:
         ## Memorising the previous prompts
         full_prompt = system_prompt + user_query + "\n" + conversation_history

         response = client.models.generate_content(
             model="gemini-2.0-flash-001",
             contents=full_prompt
         )

         print(f"AI Response: {response.text}\n")

         parsed = parse_response(response.text) ## An additional function for parsing, will give it later.
         if not parsed:
             print("Failed to parse response")
             break

         ## Here, the main game begins. First, the loop checks the step name and its content.
         ## Based on the name and content, it decides what to do.
         ## If the step is "action", it calls the requested function.
         ## If the step is "output", it prints the answer and stops.
         ## "plan" and "observe" need no action; they are only printed and added to the history.

         step = parsed.get("step")
         content = parsed.get("content")

         if step == "action":
             function_name = parsed.get("function")
             function_input = parsed.get("input")

             if function_name in available_functions:
                 result = available_functions[function_name](function_input)
                 observation = f"Function {function_name} returned: {result}"
                 print(f"Function Call: {function_name} ('{function_input}')")
                 print(f"Result: {result}\n")

                 conversation_history += f"\nObservation: {observation}"

             else:
                 print(f"Function {function_name} not available")
                 break

         elif step == "output":
             print("=== FINAL ANSWER ===")
             print(content)
             break

         conversation_history += f"\nstep: {response.text}"
         step_count += 1

     if step_count >= max_steps:
         print("Maximum steps reached")

As the comments in the code explain, the loop carries out the decision framework step by step.
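
To study the control flow without an API key or network access, the same plan → action → observe → output dispatch can be driven by scripted responses. Everything in this sketch is a stand-in: `fake_get_weather`, `run_steps`, and the scripted JSON strings are hypothetical and replace the real tool and the real model.

```python
import json

def fake_get_weather(city):
    """Stand-in tool so the loop runs without network access."""
    return f"weather situation: Sunny and temperature: +30 C in {city}"

TOOLS = {"get_weather": fake_get_weather}

def run_steps(scripted_responses, tools, max_steps=10):
    """Replay scripted model responses through the step dispatch logic."""
    history = []
    for raw in scripted_responses[:max_steps]:
        step = json.loads(raw)
        if step["step"] == "action":
            tool = tools.get(step["function"])
            if tool is None:
                history.append(f"Function {step['function']} not available")
                break
            # Call the tool and remember what it returned
            history.append(f"Observation: {tool(step['input'])}")
        elif step["step"] == "output":
            return step["content"], history
        else:
            # "plan" and "observe" are only recorded
            history.append(f"{step['step']}: {step['content']}")
    return None, history

script = [
    json.dumps({"step": "plan", "content": "Look up the weather for Dhaka"}),
    json.dumps({"step": "action", "content": "Call the tool",
                "function": "get_weather", "input": "Dhaka"}),
    json.dumps({"step": "observe", "content": "Got the weather"}),
    json.dumps({"step": "output", "content": "It is sunny in Dhaka."}),
]

answer, history = run_steps(script, TOOLS)
print(answer)
print(history)
```

Swapping the scripted list for live model calls gives back the real agent loop, so this kind of harness is handy for checking the dispatch logic on its own.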

Yeah, this is the main workflow of an agent. Now, this is the whole code:

import os
import json
import requests
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

def get_weather(city: str):
    response = requests.get(f"https://wttr.in/{city}?format=%C:%t")

    if(response.status_code == 200):
        data = response.text.split(':')
        situation = data[0]
        temp = data[1]
        return f"weather situation: {situation} and temparature: {temp}"
    else:
        # Return a message so the agent has something to observe on failure
        return "API failed to get weather data"

available_functions = {
    "get_weather": get_weather
}


system_prompt = f"""
    You're a helpful AI assistant who is specialized in resolving user query.
    You work on plan, action, observe, output mode.

    Available tools: {list(available_functions.keys())}
    Tool descriptions:
    - get_weather(city: str): Returns weather information for a given city

    IMPORTANT RULES:
    - Return ONLY ONE step per response, not multiple steps
    - Start with "plan" step first
    - Wait for next input before proceeding to next step
    - When step is "action", you MUST specify function and input

    Output JSON Format (return only ONE):
    {{
        "step": "plan|action|observe|output",
        "content": "description of what you're doing",
        "function": "function name (only for action step)",
        "input": "function parameter (only for action step)"
    }}

    User Query: """


def parse_response(response_text):
    """Extract JSON from the response"""
    try:
        text = response_text.replace("```json", "").replace('```', "")
        lines = text.strip().split('\n')

        for line in lines:
            line = line.strip()
            if line.startswith('{') and line.endswith('}'):
                try:
                    return json.loads(line)
                except json.JSONDecodeError:
                    continue
        start = text.find('{')
        if start != -1:
            brace_count = 0
            for i, char in enumerate(text[start:], start):
                if char == '{':
                    brace_count += 1
                elif char == '}':
                    brace_count -= 1
                    if brace_count == 0:
                        json_str = text[start: i+1]
                        return json.loads(json_str)
    except Exception as e:
        print(f"Parse error: {e}")

    return None

def run_agent(user_query):
    conversation_history = ""
    step_count = 0
    max_steps = 100

    print(f"User Query: {user_query}\n")

    while step_count < max_steps:
        full_prompt = system_prompt + user_query + "\n" + conversation_history

        response = client.models.generate_content(
            model="gemini-2.0-flash-001",
            contents=full_prompt
        )

        print(f"AI Response: {response.text}\n")

        parsed = parse_response(response.text)
        if not parsed:
            print("Failed to parse response")
            break

        step = parsed.get("step")
        content = parsed.get("content")

        if step == "action":
            function_name = parsed.get("function")
            function_input = parsed.get("input")

            if function_name in available_functions:
                result = available_functions[function_name](function_input)
                observation = f"Function {function_name} returned: {result}"
                print(f"Function Call: {function_name} ('{function_input}')")
                print(f"Result: {result}\n")

                conversation_history += f"\nObservation: {observation}"

            else:
                print(f"Function {function_name} not available")
                break

        elif step == "output":
            print("=== FINAL ANSWER ===")
            print(content)
            break

        conversation_history += f"\nstep: {response.text}"
        step_count += 1

    if step_count >= max_steps:
        print("Maximum steps reached")

if __name__ == "__main__":
    # Read the query only when the file is run as a script
    user_query = input('> ')
    run_agent(user_query)

Sample Input:

> What is the weather in Satkhira?

Sample Output:

User Query: What is the weather in Satkhira?

AI Response: ```json
{
        "step": "plan",
        "content": "I need to get the weather information for Satkhira. I will use the get_weather tool to get the weather information.",   
        "function": null,
        "input": null
}
```


AI Response: ```json
{
        "step": "action",
        "content": "Get weather information for Satkhira",
        "function": "get_weather",
        "input": "Satkhira"
}
```

Function Call: get_weather ('Satkhira')
Result: weather situation: Overcast and temparature: +31°C

AI Response: step: ```json
{
        "step": "observe",
        "content": "The weather in Satkhira is Overcast and the temperature is +31°C.",
        "function": null,
        "input": null
}
```

AI Response: ```json
{
        "step": "output",
        "content": "The weather in Satkhira is Overcast and the temperature is +31°C.",
        "function": null,
        "input": null
}
```

=== FINAL ANSWER ===
The weather in Satkhira is Overcast and the temperature is +31°C.

The agent extracts the city name from the query, passes it to the get_weather function, reads the result returned by the API, and presents it as the final answer.
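The plan, action, observe, output cycle can be tried without an API key. This is only a minimal sketch: `SCRIPTED_REPLIES`, `fake_get_weather`, and `run_scripted_agent` are hypothetical names, and the scripted replies stand in for what the model would return.

```python
import json

def fake_get_weather(city):
    # Stand-in for the real weather tool: returns canned text, no network call.
    return f"weather situation: Overcast and temperature: +31°C in {city}"

# Scripted replies that mimic the plan -> action -> observe -> output steps.
SCRIPTED_REPLIES = [
    {"step": "plan", "content": "Use get_weather for Satkhira.", "function": None, "input": None},
    {"step": "action", "content": "Get weather", "function": "get_weather", "input": "Satkhira"},
    {"step": "observe", "content": "Got the weather.", "function": None, "input": None},
    {"step": "output", "content": "Satkhira is overcast at +31°C.", "function": None, "input": None},
]

def run_scripted_agent(replies, tools):
    """Walk through the replies, calling a tool on every 'action' step."""
    trace = []
    for reply in replies:
        # Round-trip through JSON text, the way a real model reply would be parsed.
        parsed = json.loads(json.dumps(reply))
        trace.append(parsed["step"])
        if parsed["step"] == "action":
            tool = tools.get(parsed["function"])
            if tool is None:
                return None, trace  # unknown tool: stop, like the loop above
            trace.append(f"observation: {tool(parsed['input'])}")
        elif parsed["step"] == "output":
            return parsed["content"], trace
    return None, trace

answer, trace = run_scripted_agent(SCRIPTED_REPLIES, {"get_weather": fake_get_weather})
print(answer)
print(trace)
```

Swapping the scripted list for real model calls gives back the loop shown earlier; the dispatch logic stays the same.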

Conclusion:

Seems pretty normal, right? This is just one example of how Agents work. Try giving more complex queries, like finding the average temperature of three districts or extracting the temperatures of several districts and displaying them in a table. Then you'll realize how much more powerful it is compared to a regular API call and query. Just imagine the possibilities if the query is done on a database or the entire Internet.

Different Prompting Styles

Pritom Biswas — Fri, 30 May 2025 09:34:14 GMT

What is Prompting?

Prompting, also known as prompt engineering, is the process of providing various inputs (such as text, images, or documents) to an AI model (or large language model, or LLM) to achieve the desired output.

We usually ask an AI model a question, and it gives us an answer. But the answer is often better when the question follows the right format, because the model only has the text we give it to work with. This is where "Prompt Engineering" comes in.

Different AI models on the market follow different prompting structures. Here are some examples:

OpenAI

{
    "role": "system",
    "content": "some system prompt" // eg. "You are a helpful assistant that answers in bullet points."
},
{
    "role": "user",
    "content": "some user prompt" // eg. "Explain how solar panels work."
}

Llama 2


[INST]
    <<SYS>>
        You are a helpful, concise assistant that answers technical questions. //system prompt
    <</SYS>>

    How does a binary search tree work? //user prompt
[/INST]

Grok and Gemini follow roughly the same structure as OpenAI.
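To make the difference concrete, here is a small sketch that builds the same system and user prompt in both shapes. The helper names `to_openai_messages` and `to_llama2_prompt` are made up for illustration. The Llama 2 chat format wraps the system prompt in `<<SYS>>` and `<</SYS>>` markers inside `[INST] ... [/INST]`; the leading `<s>` token is left out here on the assumption that the tokenizer adds it.

```python
def to_openai_messages(system_prompt, user_prompt):
    # OpenAI-style chat input: a list of role/content dictionaries.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

def to_llama2_prompt(system_prompt, user_prompt):
    # Llama 2 chat style: one string with the system prompt inside <<SYS>> markers.
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"

system = "You are a helpful, concise assistant that answers technical questions."
user = "How does a binary search tree work?"

print(to_openai_messages(system, user))
print(to_llama2_prompt(system, user))
```

The content is identical in both cases; only the wrapper the model expects changes.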

Types of Prompting Techniques

We've seen how different AI models need their inputs to be formatted. When we prompt something, we not only use the right format but also try to get the best answers from the LLMs. To do this, different prompting styles help make the LLMs give the most useful and user-friendly answers:

Direct Answer Prompting

Zero-shot prompting

Few-Shot Prompting

Instruction Prompting

Contextual Prompting

Persona-Based Prompting

Role-Playing Prompting

Chain-of-Thought (CoT) Prompting

Self-Consistency Prompting

Multimodal Prompting

Ok, now let me give a brief description of each, with short code examples.

Direct Answer Prompting:

Direct prompting means giving a model clear and specific instructions without including examples to guide its output. It is like “Just Ask”.

import os
from google import genai
from google.genai import types
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

direct_prompts = [
    "Explain What is direct prompting"
]

response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents=direct_prompts
)

print(response.text)

Zero-shot prompting:

It is much like direct prompting: no examples are given. The key difference is that zero-shot prompting explicitly names the task to perform, whereas in direct prompting the question is simply asked.

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

zero_shot_prompts = [
    "Classify this review as positive or negative: 'I absolutely loved this restaurant, the food was amazing!'",
    "Translate the following English text to French: 'Hello, how are you doing today?'",
    "Summarize this paragraph in one sentence: 'Artificial intelligence has made significant strides in recent years. Machine learning models can now perform tasks that were once thought to require human intelligence. This has led to breakthroughs in various fields including healthcare, finance, and transportation.'",
    "Extract the main entities from this sentence: 'Apple CEO Tim Cook announced the new iPhone at their headquarters in Cupertino last Tuesday.'",
    "Answer this question with yes or no: 'Is the sun larger than the earth?'"
]
## Here, Classify, Translate, Summarize, Extract, and Answer are the words that specify the task.

response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents=zero_shot_prompts
)

print(response.text)
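In the snippet above the whole list goes to the model in one call. As far as I understand the SDK, the strings are then sent together as one request, so a single reply covers all the tasks (treat this as an assumption and check the SDK docs). If you want one answer per task, a minimal sketch is to loop over the prompts. `run_each` and `fake_generate` are hypothetical names, and `fake_generate` only stands in for the real model call so the sketch runs offline.

```python
def run_each(prompts, generate):
    """Send every prompt on its own and pair it with its answer."""
    return [(prompt, generate(prompt)) for prompt in prompts]

def fake_generate(prompt):
    # Hypothetical stand-in for a real model call; echoes the task part of the prompt.
    task = prompt.split(":", 1)[0]
    return f"<model answer for: {task}>"

zero_shot_prompts = [
    "Classify this review as positive or negative: 'I absolutely loved this restaurant!'",
    "Translate the following English text to French: 'Hello, how are you doing today?'",
]

results = run_each(zero_shot_prompts, fake_generate)
for prompt, answer in results:
    print(answer)
```

With the real client, `generate` could be a small function that calls `client.models.generate_content(...)` with a single prompt and returns `response.text`.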

Few-Shot Prompting:

Unlike zero-shot prompting (where you only specify the task), few-shot prompting provides demonstration examples in the prompt itself. The model can then follow the pattern established by these examples when responding to new inputs.

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

few_shot_prompts = [
    """Classify the sentiment as positive, negative, or neutral:

    Example 1:
    Text: "This movie was absolutely terrible."
    Sentiment: Negative

    Example 2:
    Text: "I had a wonderful time at the restaurant."
    Sentiment: Positive

    Example 3:
    Text: "The weather is cloudy today."
    Sentiment: Neutral

    Now classify this:
    Text: "The service was slow but the food was delicious."
    Sentiment:""",

    """Translate English to French:

    English: Hello, how are you?
    French: Bonjour, comment allez-vous?

    English: I love artificial intelligence.
    French: J'aime l'intelligence artificielle.

    English: What time is the meeting tomorrow?
    French:"""
]
## Here, along with the task specifier, example inputs with their expected answers are given, so that the output can be more directed

response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents=few_shot_prompts
)

print(response.text)
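When the examples live in a list or a file, the few-shot prompt can be assembled in code instead of typed by hand. A minimal sketch, assuming the same "Text / Sentiment" layout as above; `build_few_shot_prompt` and its parameters are hypothetical names.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Text", output_label="Sentiment"):
    """Lay out the instruction, the worked examples, then the open query."""
    lines = [instruction, ""]
    for i, (text, label) in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(f'{input_label}: "{text}"')
        lines.append(f"{output_label}: {label}")
        lines.append("")
    lines.append("Now classify this:")
    lines.append(f'{input_label}: "{query}"')
    # Leave the last label empty so the model fills it in.
    lines.append(f"{output_label}:")
    return "\n".join(lines)

examples = [
    ("This movie was absolutely terrible.", "Negative"),
    ("I had a wonderful time at the restaurant.", "Positive"),
    ("The weather is cloudy today.", "Neutral"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive, negative, or neutral:",
    examples,
    "The service was slow but the food was delicious.",
)
print(prompt)
```

The resulting string can be passed as `contents` exactly like the hand-written prompt.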

Instruction Prompting:

Instruction prompting provides the model with specific guidelines about:

The task to perform

The exact steps to follow

The formatting of the output

Constraints and requirements

Evaluation criteria

Unlike zero-shot and few-shot prompting, it adds an extra criterion: the exact steps to follow to reach the conclusion.

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

instruction_prompts = [
    """Write a product description for a wireless headphone. Follow these instructions:
    1. Keep it under 100 words
    2. Highlight at least 3 key features
    3. Include battery life information
    4. Target audience is young professionals
    5. End with a call to action
    6. Do not mention price""",

    """Analyze the following customer feedback and do exactly as instructed:
    Feedback: "I've been using your app for 3 months. It's mostly good but crashes sometimes and the dark mode hurts my eyes."

    Instructions:
    1. Identify all issues mentioned
    2. Rate severity of each issue (Low/Medium/High)
    3. Suggest one specific solution for each issue
    4. Format your response as a table with columns: Issue, Severity, Solution
    5. Add a brief conclusion with exactly 2 sentences""",

    """Create a 5-day meal plan following these requirements:
    1. Each day must include breakfast, lunch, and dinner
    2. All meals must be vegetarian
    3. Include calorie count for each meal
    4. No meal should repeat during the 5 days
    5. Include at least one protein source in each meal
    6. Format in a clear, readable structure with days as headings"""
]
## Here, along with the task specifier, the exact steps, format, and constraints to follow are given, so that the output can be more directed

def get_response(prompt):
    # Send one prompt and return the prompt together with the model's reply
    response = client.models.generate_content(
        model='gemini-2.0-flash-001',
        contents=prompt
    )
    return f"Prompt: \n{prompt}\n\nResponse:\n{response.text}\n{'='*50}\n"

for prompt in instruction_prompts:
    print(get_response(prompt))
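The numbered rules are the heart of an instruction prompt, so it can help to keep them as a list and number them in code. A minimal sketch; `build_instruction_prompt` is a hypothetical helper, not part of any SDK.

```python
def build_instruction_prompt(task, instructions):
    """Put the task first, then the rules as a numbered list."""
    numbered = [f"{i}. {rule}" for i, rule in enumerate(instructions, start=1)]
    return "\n".join([task, "Follow these instructions:", *numbered])

prompt = build_instruction_prompt(
    "Write a product description for a wireless headphone.",
    [
        "Keep it under 100 words",
        "Highlight at least 3 key features",
        "Do not mention price",
    ],
)
print(prompt)
```

Adding or removing a rule then never leaves the numbering out of order.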

Contextual Prompting:

Contextual prompting is much like instruction prompting, but here the clear context of a situation is given. Let me give an example:
Question: Which is greater, 9.8 or 9.11?
Context 1: General number system: Of course, 9.80 > 9.11.
Context 2: A book's table of contents: if you have read any book, you will have noticed that 9.8 means the 8th lesson of chapter 9 and 9.11 means the 11th lesson of chapter 9. So, of course, 9.11 is greater!

So the definite answer can change with the context, and this is where contextual prompting helps.
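The 9.8 versus 9.11 question can be checked in plain Python, which shows that both answers are right under their own context. This is just an illustration; `greater_as_number` and `greater_as_section` are names I made up.

```python
from decimal import Decimal

def greater_as_number(a, b):
    # Context 1: treat the strings as decimal numbers.
    return a if Decimal(a) > Decimal(b) else b

def greater_as_section(a, b):
    # Context 2: treat them as chapter.lesson pairs, compared part by part.
    def key(s):
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b

print(greater_as_number("9.8", "9.11"))   # 9.8
print(greater_as_section("9.8", "9.11"))  # 9.11
```

The model faces the same ambiguity, and the context in the prompt is what tells it which comparison you mean.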

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

contextual_prompts = [
    """Context: You are reviewing code for a junior developer who is learning Python. They have just submitted their first attempt at writing a function that calculates the factorial of a number.

    Code:
    def factorial(n):
        if n == 0:
            return 1
        else:
            return n * factorial(n-1)

    Question: What feedback would you give this developer about their factorial function?"""
]
## Here, background information is provided before asking the question

response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents=contextual_prompts[0]
)

print(response.text)

Persona-Based Prompting:

This basically follows the structure of contextual prompting, but adds an extra layer: someone's tone, role, character, or viewpoint.

In general, in persona-based prompting, you:

Define a specific role or character for the AI to embody

Specify characteristics, expertise, or background of this persona

Frame questions that the persona should answer from their perspective

Get responses that reflect the knowledge and communication style of that persona

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

persona_prompts = [
    """Persona: You are a cybersecurity expert with 15 years of experience in network security and ethical hacking. You specialize in explaining complex security concepts in simple terms. Your answer starts with, "Hey There, whatcha? I am here to help. No Worries, 'kay..."

    Question: What are the most important steps a small business should take to protect themselves from ransomware attacks?""",

    """Persona: You are a professional chef who specializes in Italian cuisine. You've worked in 5-star restaurants in Rome and have published several cookbooks on authentic Italian cooking. You are hot-tempered, and if anybody asks an unnecessary question, you boil over.

    Question: What's your secret to making the perfect homemade pasta dough?""",

    """Persona: You are a quantum physicist working at a leading research institution. You have a knack for explaining complicated physics concepts to non-scientists. You are a lovely person and a romanticist. You try to seduce female co-workers.

    Question: How would you explain quantum entanglement to someone with no background in physics?"""
]
## Here, a specific role/character is defined for the AI to adopt when answering

def get_response(prompt):
    # Send one prompt and return the prompt together with the model's reply
    response = client.models.generate_content(
        model='gemini-2.0-flash-001',
        contents=prompt
    )
    return f"Prompt: \n{prompt}\n\nResponse:\n{response.text}\n{'='*50}\n"

for prompt in persona_prompts:
    print(get_response(prompt))
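Since every persona prompt has the same three parts (role, traits, question), the same question can be asked through different personas with a small helper. A minimal sketch; `build_persona_prompt` is a hypothetical name.

```python
def build_persona_prompt(role, traits, question):
    """Combine a role, its traits, and the question into one persona prompt."""
    persona = f"Persona: You are {role}. " + " ".join(traits)
    return f"{persona}\n\nQuestion: {question}"

question = "How would you explain quantum entanglement to someone with no background in physics?"

physicist = build_persona_prompt(
    "a quantum physicist working at a leading research institution",
    ["You have a knack for explaining complicated physics concepts to non-scientists."],
    question,
)
chef = build_persona_prompt(
    "a professional chef who specializes in Italian cuisine",
    ["You explain everything through cooking metaphors."],
    question,
)
print(physicist)
print(chef)
```

Sending both prompts to the model is a quick way to see how much the persona alone changes the tone of the answer.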

Role-Playing Prompting:

Role-playing prompting involves placing the AI in a specific scenario and asking it to respond as if it were a character within that scenario. Unlike persona-based prompting (which focuses on expertise and traits), role-playing emphasizes interactive scenarios and situational responses.

In role-playing prompting, you:

Create a specific scenario or situation

Cast the AI in a particular role within that scenario

Often include other characters or elements for interaction

Ask the AI to respond as if the scenario were real

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

roleplay_prompts = [
    """Role-play: You are a medieval blacksmith in a fantasy kingdom. A young adventurer has entered your shop looking for their first sword but doesn't have much money. They're asking about the different types of weapons you sell.

    Respond as the blacksmith would in this scenario.""",

    """Role-play: You are a time traveler from the year 2300 who has just arrived in 2025. You're speaking with someone who is curious about what the future is like. You're trying not to reveal too much to avoid changing the timeline.

    How do you respond to their questions about future technology?""",

    """Role-play: You are the captain of a spaceship that has just received a distress signal from a nearby planet known to be dangerous. Your crew is divided on whether to investigate or ignore it. You need to make a decision and explain it to your crew.

    What do you say to your crew?"""
]
## Here, the AI is placed in a specific scenario with contextual details

def get_response(prompt):
    response = client.models.generate_content(
        model='gemini-2.0-flash-001',
        contents=prompt
    )
    return f"Prompt: \n{prompt}\n\nResponse:\n{response.text}\n{'='*50}\n"

for prompt in roleplay_prompts:
    print(get_response(prompt))
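Because role-playing is about interaction, the prompt often carries what the other characters have said so far. A minimal sketch of building such a prompt; `build_roleplay_prompt` and the sample dialogue are made up for illustration.

```python
def build_roleplay_prompt(scenario, role, dialogue):
    """Set the scene, replay what other characters said, then ask for the reply."""
    lines = [f"Role-play: You are {role}. {scenario}", ""]
    for speaker, text in dialogue:
        lines.append(f"{speaker}: {text}")
    lines.append(f"Respond as {role} would in this scenario.")
    return "\n".join(lines)

prompt = build_roleplay_prompt(
    "A young adventurer has entered your shop looking for their first sword but doesn't have much money.",
    "a medieval blacksmith in a fantasy kingdom",
    [("Adventurer", "What kinds of weapons do you sell?")],
)
print(prompt)
```

For a longer scene, each new line from the user and each reply from the model can be appended to `dialogue` before building the next prompt.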

Chain-of-Thought (CoT) Prompting:

Chain-of-thought prompting is a technique that encourages the AI to show its reasoning process step by step before providing a final answer. This approach is particularly effective for complex problems requiring multi-step reasoning. It is much like instruction prompting, but here the reasoning in each step builds on the reasoning of the previous step (you can see this in some of OpenAI's models). Its main goal is to expose the reasoning process.

In Chain-of-Thought prompting, you:

Ask the model to "think step by step" before answering

Encourage showing intermediate reasoning and calculations

Break down complex problems into logical sequences

Follow the reasoning process from start to conclusion

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)

cot_prompts = [
    """Solve this math problem. Think step by step before giving your final answer.

    Problem: If a store is selling a shirt for $45 after applying a 25% discount, what was the original price of the shirt?""",

    """Consider this logical puzzle. Think step by step through the reasoning process. Use each step's reasoning as the basis for the next step.

    Puzzle: Jack is looking at Anne, and Anne is looking at George. Jack is married, George is unmarried. Is a married person looking at an unmarried person? Explain your reasoning.""",

    """Analyze whether this argument is valid. Think step by step through your analysis.

    Argument: All mammals are warm-blooded. All whales are mammals. Therefore, all whales are warm-blooded."""
]
## Here, the AI is explicitly asked to show its reasoning process step by step and to build each step on the previous one

def get_response(prompt):
    # Send one prompt and return the prompt together with the model's reply
    response = client.models.generate_content(
        model='gemini-2.0-flash-001',
        contents=prompt
    )
    return f"Prompt: \n{prompt}\n\nResponse:\n{response.text}\n{'='*50}\n"

for prompt in cot_prompts:
    print(get_response(prompt))
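It helps to know the expected answers before judging the model's chain of thought. Here is a small check for the first two problems, written as the same steps a good answer should walk through. The function names are my own.

```python
from fractions import Fraction

def original_price(sale_price, discount_percent):
    # Step 1: a 25% discount means 75% of the original price remains.
    remaining = 1 - Fraction(discount_percent, 100)
    # Step 2: sale = original * remaining, so original = sale / remaining.
    return Fraction(sale_price) / remaining

def married_looks_at_unmarried(anne_married):
    # Anne's status is unknown, so it is passed in and both cases are tried.
    married = {"Jack": True, "Anne": anne_married, "George": False}
    looks_at = [("Jack", "Anne"), ("Anne", "George")]
    return any(married[a] and not married[b] for a, b in looks_at)

print(original_price(45, 25))  # 60
print(all(married_looks_at_unmarried(status) for status in (True, False)))  # True
```

So the shirt originally cost $60, and the puzzle's answer is "yes" whichever way Anne's status falls. A step-by-step response that lands anywhere else has gone wrong in one of its steps.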

Self-Consistency Prompting and Multimodal Prompting:

I will discuss these topics in another article.

Conclusion:

There are quite a few techniques available in the industry. Each one has its own use case, and a single application can combine several of them depending on the need. All of them are structured ways to get the best out of any LLM out there.