January 26, 2026

Building AI Support Agents: Question Extraction as a Retrieval Strategy

Liza Katz | CEO at Neradot

Follow along using the tutorial on our GitHub.

Developers like to use standardized approaches when preparing data for RAG applications: take a document, slice it into fixed-size chunks, embed them, and dump them into a vector database. Simple enough.

For many use cases, this works fine. But for a support knowledge base - where the goal is to solve specific user problems - this approach often yields sub-optimal results.

When you chunk a troubleshooting guide using naive chunking strategies, you introduce noise. A single chunk might contain a symptom description, a safety warning, and half a warranty disclaimer. Then, when a user asks, "Why is the red light blinking?", the semantic signal of that chunk is diluted by the irrelevant text surrounding the answer. The vector engine struggles to match the user's specific intent with your generic content block.

There is a better strategy for this problem space: Question Extraction.

The Concept: Semantic Pointers

Instead of embedding the answer (the text), we embed the questions that the text answers.

We reverse-engineer the document. We treat every paragraph in a support article as a potential answer to a user’s inquiry. By generating the questions that a specific piece of text answers, we create precise "semantic pointers" that link directly to the relevant content.

When a user searches for "my screen is black," we don't compare it to a technical manual's paragraph about power supply voltages. We compare it to a pre-generated question like "What should I do if the device won't turn on?", which is semantically much closer.

Defining "Good" Questions

Generating these questions isn't just summarization. It is closer to the academic challenge of Automatic Question Generation (AQG), used for automatically building exams. The quality of your retrieval depends entirely on the quality of these questions.

To make this work, we need strict criteria.

Micro Level (The Question): 

Each question must be:

  • Fluent: Grammatically correct and natural.
  • Clear: Unambiguous.
  • Concise: No fluff.
  • Answerable: The source text must provide the complete answer.

Macro Level (The Set): 

When generating questions for a full document, the set must be:

  • Minimal: No duplicates. We don't need "Why is it broken?" and "Why does it not work?" in the same index.
  • Complete: Every major troubleshooting scenario in the text must be represented.
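
The "Minimal" criterion can also be enforced programmatically after generation. Here is a minimal sketch of embedding-based deduplication; the bag-of-words `bow_vector` is a toy stand-in for a real embedding model, and the 0.8 threshold is an illustrative assumption you would tune:

```python
import math
from collections import Counter

def bow_vector(text):
    # Toy bag-of-words "embedding"; a real system would call an
    # embedding model here. This stand-in keeps the dedup logic runnable.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_questions(questions, threshold=0.8):
    """Keep a question only if it is not too similar to one already kept."""
    kept = []
    for q in questions:
        v = bow_vector(q)
        if all(cosine(v, bow_vector(k)) < threshold for k in kept):
            kept.append(q)
    return kept
```
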

Implementation with Python and Pinecone

We can use an LLM to generate this dataset, then index it using Pinecone.

Step 1: The Extraction Prompt

We ask the LLM to act as the extractor.

import json

import openai

def extract_questions(text_chunk):
    prompt = f"""
    You are a question extraction specialist. Your task is to generate representative questions that users might ask when facing the issue described in this support document.

    # Principles for Question Extraction
    1. Questions must be self-contained and clear.
    2. Do not generate duplicate questions.
    3. Ensure all troubleshooting scenarios are covered.

    Text: {text_chunk}

    Output format: a JSON list of strings, with no surrounding prose or code fences.
    """

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the reply into a Python list; production code should also
    # handle malformed JSON here.
    return json.loads(response.choices[0].message.content)
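
To process a full article, you would run the extractor over each paragraph while recording which paragraph every question points to. A hypothetical driver loop (the `build_question_dataset` name and paragraph-level granularity are assumptions; the extractor is injected so any chunking or model can be swapped in):

```python
def build_question_dataset(paragraphs, extractor):
    """Collect questions per paragraph, remembering which text each points to.

    'extractor' is any callable returning a list of questions for a chunk,
    e.g. the extract_questions function above.
    """
    dataset = []
    for i, para in enumerate(paragraphs):
        for q in extractor(para):
            dataset.append({"question": q, "text_id": i, "text": para})
    return dataset
```
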

Step 2: Indexing

Instead of embedding the text chunk, we embed the generated questions. Crucially, we store the original text (or a pointer to it) in the metadata.

import uuid

import openai
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("support-kb")

def get_embedding(text):
    # Use the same model at index time and at query time
    response = openai.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Assume 'questions' is the list we got from the LLM
# and 'original_text_id' points to the source article
vectors = []

for q in questions:
    # Embed the QUESTION, not the answer
    embedding = get_embedding(q)

    vectors.append({
        "id": str(uuid.uuid4()),  # any unique id scheme works
        "values": embedding,
        "metadata": {
            "text_id": original_text_id,
            "question": q,
            "type": "generated_question"
        }
    })

index.upsert(vectors)
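
For a large knowledge base, upserting everything in a single call can exceed per-request payload limits, so it is safer to batch. A small helper sketch (the batch size of 100 is a conservative assumption; check your client's limits):

```python
def batched_upsert(index, vectors, batch_size=100):
    # Upsert in fixed-size batches to stay under per-request payload limits.
    for start in range(0, len(vectors), batch_size):
        index.upsert(vectors[start:start + batch_size])
```
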

Now, when a user queries the system, we search against these clear, focused vectors. The match score will be significantly higher because we are comparing "intent to intent" rather than "intent to content."

Step 3: Using the Index

Here is how we perform the retrieval. The key difference from a standard RAG pipeline is the indirect mapping: we match the user's question to a known valid question, but we return the article associated with it.

from pinecone import Pinecone
import openai

# Initialize connection
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("support-kb")

def search_knowledge_base(user_query, top_k=3):
    # 1. Embed the User's Query
    # We use the same model used for indexing the questions
    response = openai.embeddings.create(
        input=user_query,
        model="text-embedding-3-small"
    )
    query_vector = response.data[0].embedding

    # 2. Search against the "Questions" Index
    search_results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )

    results = []
    for match in search_results['matches']:
        # 3. The "Double Hop"
        # The vector match identifies the Semantic Question (Intent).
        # The metadata provides the Pointer to the Content (Answer).
        
        result = {
            "score": match['score'],
            "matched_question": match['metadata']['question'],
            "source_article_id": match['metadata']['text_id'],
            # If you stored the actual answer text in metadata, retrieve it here:
            "answer_text": match['metadata'].get('content_text')
        }
        results.append(result)

    return results

This approach also offers superior transparency. In a standard chunk-based system, debugging why a specific paragraph was retrieved can be cryptic. Here, the system tells you exactly which "question" it thought the user was asking.
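
That transparency also makes confidence gating straightforward: if the best matched question scores low, you can hand off rather than guess. A sketch assuming the result shape returned by `search_knowledge_base` above (the `answer_or_escalate` name and the 0.75 threshold are illustrative assumptions):

```python
def answer_or_escalate(results, min_score=0.75):
    # Route to the best-matching article only when the top match is
    # confident; otherwise escalate (e.g., to a human agent).
    if results and results[0]["score"] >= min_score:
        top = results[0]
        return {
            "action": "answer",
            "article_id": top["source_article_id"],
            "matched_question": top["matched_question"],
        }
    return {"action": "escalate"}
```
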

Benefits and Limitations

This approach has distinct advantages for support agents:

  1. High Semantic Density: The vectors represent pure intent.
  2. Symptom Matching: Users describe symptoms ("it's making a weird noise"). If your extraction phase captures this ("Why is the unit buzzing?"), you bridge the gap between user language and technical documentation.
  3. Manual Enrichment: You can improve accuracy without rewriting your docs. If users keep asking a specific question that the model misses, you just manually add that question-vector to the index and point it to the right article.
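
In practice, that enrichment is a one-off upsert. A sketch (the `add_manual_question` name, id scheme, and injected `embed` callable are assumptions):

```python
def add_manual_question(index, question, article_id, embed):
    # Bridge a user phrasing the generator missed to the right article,
    # without touching the documentation itself.
    index.upsert([{
        "id": f"manual-{article_id}-{abs(hash(question)) % 10**8}",
        "values": embed(question),
        "metadata": {
            "text_id": article_id,
            "question": question,
            "type": "manual_question",
        },
    }])
```
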

However, this is not a silver bullet. If a user asks, for example, "What is the exact width of the product in millimeters?", standard chunking or structured SQL retrieval might be superior. Question extraction works best for "How do I..." or "Why is..." scenarios.

Conclusion

RAG is not a single algorithm; it is a family of strategies. Choosing the right semantic processing method is just as important as choosing the right model.

For data that is problem-solution oriented, like help centers or FAQs, standard chunking is often a mismatch. By extracting questions, you align your data structure with your user's intent, leading to a system that actually solves the problem rather than just retrieving text.