Replaced Embedding RAG with PageIndex on My Own Site — Here's What Actually Happened

TL;DR: After hitting a wall with hybrid BM25 + Embedding + Rerank search (wrong retrievals, high infra cost, Japanese embedding model limitations), I implemented PageIndex — a vector-free RAG architecture — as the AI assistant for my company site. The result: no VectorDB, no embedding model, no Reranker needed. Two-stage LLM-based filtering via a URL tree navigated to the correct pages for queries like "What products do you have?" and "What ASR-related blog posts exist?" — zero vector math involved.


The Problem: Why Hybrid RAG Broke Down in Practice

The canonical RAG stack looks solid on paper: BM25 (keyword search) + Embedding (vector search) + Rerank. In theory it covers both lexical and semantic retrieval. In practice, running it as an AI assistant for a SaaS documentation site exposed three hard walls.

Wall 1: Queries That Miss Both Keywords and Semantics

Users don't speak the same language as your docs. If your documentation says "authentication," users ask "how do I log in." BM25 misses. Whether the embedding model captures the semantic gap depends heavily on training data distribution — and for domain-specific content (audio processing, medical, legal), general-purpose embeddings produce coarse vector representations that cluster incorrectly.

No matter how capable the LLM is, if the retriever fails, the answer is garbage. This is the core problem that no prompt engineering can fix.

Wall 2: Japanese Embedding Model Quality

For Japanese-language content specifically, the embedding model ecosystem is significantly thinner than English. Off-the-shelf multilingual models often compress Japanese semantics poorly. Fine-tuning a domain-specific embedding model is a valid solution — but it adds a non-trivial engineering workload on top of the retrieval problem you're already trying to solve.

Wall 3: Infrastructure Complexity for Small-Scale Deployments

Running a proper hybrid RAG stack requires:

  • A VectorDB (Pinecone, Weaviate, pgvector, etc.) — deployment, indexing pipeline, uptime management
  • An embedding model — either a hosted API (cost) or self-hosted (GPU, latency)
  • A reranking model — another deployment and inference cost
  • A pipeline that stitches these together and handles failures

For a company site with dozens to low hundreds of pages, this is massive overengineering. The infra cost and operational complexity dwarf the value delivered at that scale.


What Is PageIndex? The "Table of Contents" Mental Model

PageIndex is a RAG architecture that replaces vector search entirely with LLM-based hierarchical navigation.

The mental model is a book's table of contents. You don't read every page of a book to find what you need — you scan the ToC, identify the right chapter, then the right section. PageIndex automates exactly this with an LLM.

Original PageIndex Design

The original implementation hierarchizes content using Markdown headings:

text
# Chapter 1
## Section 1.1
### Subsection 1.1.1

The LLM navigates this tree to locate relevant content, never needing to compute vector similarity.
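
As a rough sketch of that design (illustrative only, not the PageIndex project's actual code), Markdown headings can be parsed into a nested ToC tree with a small stack-based walk:

```python
import re

def build_heading_tree(markdown: str) -> list[dict]:
    """Parse Markdown headings into a nested ToC tree (sketch).

    The real PageIndex implementation also attaches the body text
    under each heading; this only builds the navigable skeleton.
    """
    root: dict = {"title": "(root)", "level": 0, "children": []}
    stack = [root]  # stack[-1] is the deepest currently-open heading
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if not m:
            continue
        level = len(m.group(1))
        node = {"title": m.group(2).strip(), "level": level, "children": []}
        # Pop until the top of the stack is a strictly shallower heading
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

tree = build_heading_tree(
    "# Chapter 1\n## Section 1.1\n### Subsection 1.1.1\n## Section 1.2\n"
)
# tree[0] is "Chapter 1" with two child sections
```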

The Extension Used Here: URL-Based Hierarchy

For a multi-page website, I adapted PageIndex to treat the URL path structure itself as the content tree:

text
https://neosophie.com/
├── /ja/blog/
│   ├── /ja/blog/20260226-japanese-asr-benchmark
│   ├── /ja/blog/20260305-vibevoice-quantization
│   └── /ja/blog/tags/
│       ├── /ja/blog/tags/asr
│       └── /ja/blog/tags/diarization
├── /products
└── /contact

The LLM is shown a summary of this tree and reasons: "Which subtree is most likely to contain the answer?" No embeddings. No BM25. Just structured reasoning over a compact site map.

This approach shares conceptual DNA with hierarchical skill routing systems — where an agent reads brief skill descriptions to decide which tool to invoke, rather than scanning all tools at once. The key insight in both cases: a compact, structured summary is often more useful than raw semantic similarity.


Building the URL Tree: Step-by-Step Implementation

Prerequisites: Crawled URL Collection

Assume you have a dictionary of crawled pages keyed by path:

python
page_by_path = {
    "/": {"title": "Neosophie", "content": "..."},
    "/products": {"title": "Products", "content": "..."},
    "/ja/blog/20260226-japanese-asr-benchmark": {
        "title": "Japanese ASR Model Benchmark",
        "content": "..."
    },
    "/ja/blog/tags/asr": {"title": "Tag: asr", "content": "..."},
    # ...
}

Step 1: Expand Ancestor Paths

Crawled URLs may skip intermediate paths. If you crawled /docs/api/auth but not /docs/api or /docs, you need to synthesize those intermediate nodes to build a valid tree.

python
def _ancestor_paths(path: str) -> list[str]:
    """
    "/docs/api/auth" → ["/docs/api/auth", "/docs/api", "/docs", "/"]
    """
    result: list[str] = [path]
    current = path
    while True:
        if current == "/":
            break
        parent = current.rsplit("/", 1)[0] or "/"
        result.append(parent)
        current = parent
    return result

all_paths: set[str] = set()
for path in list(page_by_path):
    all_paths.update(_ancestor_paths(path))

Step 2: Build Nodes with Crawl Flags

python
from dataclasses import dataclass, field

@dataclass
class UrlTreeNode:
    segment: str          # Last path segment (e.g., "auth")
    path: str             # Full path (e.g., "/docs/api/auth")
    url: str              # Absolute URL
    title: str            # Page title (crawled pages only)
    crawled: bool         # Whether this page was actually fetched
    children: list["UrlTreeNode"] = field(default_factory=list)

nodes: dict[str, UrlTreeNode] = {}
for path in sorted(all_paths):
    page = page_by_path.get(path)
    crawled = page is not None
    segment = path.rsplit("/", 1)[-1] or "/"
    nodes[path] = UrlTreeNode(
        segment=segment,
        path=path,
        url=f"https://example.com{path}",
        title=page.get("title", "") if page else "",
        crawled=crawled,
    )

Step 3: Link Children to Parents

python
roots: list[UrlTreeNode] = []
for path, node in nodes.items():
    if path == "/":
        roots.append(node)
        continue
    parent_path = path.rsplit("/", 1)[0] or "/"
    parent = nodes.get(parent_path)
    if parent is not None:
        parent.children.append(node)
    else:
        roots.append(node)  # Treat as root if parent is missing

Step 4: Collapse Single-Child Intermediate Nodes

If /ja (uncrawled) has only one child /ja/blog, skip /ja and connect /ja/blog directly to the parent. This prevents unnecessary depth inflation.

python
def _collapse_single_child_intermediates(
    nodes: list[UrlTreeNode],
) -> list[UrlTreeNode]:
    result = []
    for node in nodes:
        node.children = _collapse_single_child_intermediates(node.children)
        # Uncrawled node with exactly one child → skip it
        if not node.crawled and len(node.children) == 1:
            result.append(node.children[0])
        else:
            result.append(node)
    return result

Before and after the collapse:

text
Before:
/
└── ja  (uncrawled, one child)
    └── blog  (crawled)

After:
/
└── blog  ← /ja eliminated, blog promoted directly

The LangGraph Agent: All 6 Nodes Explained

The full workflow is a LangGraph agent with six numbered stages; the second stage branches into one of two alternative nodes (2a or 2b) depending on the question type:

text
classify ─┬─(current_page)→ fetch_current_page ───────────────────────────────→ answer
          └─(general)─────→ build_subtrees → select_subtrees → collect_leaves → select_pages → answer

Node 1: classify — Routing Current-Page vs. Site-Wide Questions

Distinguishes "summarize this page" (the user is viewing a specific page) from "what blog posts exist about ASR?" (site-wide retrieval).

python
import re

_CURRENT_PAGE_PATTERN = re.compile(
    r"this page|current page|summarize|explain this|what does this",
    re.IGNORECASE,
)

def _node_classify(self, state: dict) -> dict:
    question = state.get("question", "")
    current_url = state.get("context_info", {}).get("current_url")
    if current_url and _CURRENT_PAGE_PATTERN.search(question):
        return {"question_type": "current_page"}
    return {"question_type": "general"}

Implementation note: The current regex-based classifier misses indirect phrasing like "tell me about this service." An LLM-based classifier is a planned improvement for ambiguous cases.

Node 2a: fetch_current_page (current_page route)

Directly fetches the page by URL. Falls back to the first 4 pages in the index if the URL isn't found.

python
def _node_fetch_current_page(self, state: dict) -> dict:
    current_url = state.get("context_info", {}).get("current_url", "")
    index = self._load_page_index()
    pages_by_url = {p["url"]: p for p in index.get("pages", [])}
    page = pages_by_url.get(current_url)
    if page:
        return {"selected_pages": [page]}
    # Fallback: first 4 pages
    return {"selected_pages": list(pages_by_url.values())[:4]}

Node 2b: build_subtrees (general route)

Converts the URL tree into a flat list of subtrees, each annotated with the leaf pages it contains. This is the structure the LLM will reason over in the next step.

python
def _node_build_subtrees(self, state: dict) -> dict:
    index = self._load_page_index()
    url_tree = index.get("url_tree", [])
    pages_by_url = {p["url"]: p for p in index.get("pages", [])}
    seen_paths: set[str] = set()
    subtrees: list[dict] = []

    # Prioritize top-level children (highest navigational importance)
    for root in url_tree:
        for child in root.get("children", []):
            path = child.get("path", "")
            if path not in seen_paths:
                seen_paths.add(path)
                leaf_pages = self._collect_all_leaf_pages(child, pages_by_url)
                subtrees.append({
                    "path": path,
                    "url": child.get("url", ""),
                    "title": child.get("title", ""),
                    "leaf_pages": leaf_pages,
                })

    # Add deeper non-leaf nodes (deduplicated via seen_paths)
    self._collect_non_leaf_subtrees(url_tree, pages_by_url, subtrees, seen_paths)
    return {"subtrees": subtrees}

@staticmethod
def _collect_all_leaf_pages(
    node: dict,
    pages_by_url: dict,
) -> list[dict]:
    """Recursively collect all leaf pages under a subtree node."""
    children = node.get("children", [])
    if not children:
        page = pages_by_url.get(node.get("url", ""))
        return [page] if page else []
    result = []
    for child in children:
        result.extend(
            RagAgentService._collect_all_leaf_pages(child, pages_by_url)
        )
    return result

Node 3: select_subtrees — Stage 1 LLM Filtering

The LLM receives a ToC-style text summary of all subtrees and selects up to 2 subtree IDs most relevant to the question.

python
from langchain_core.messages import HumanMessage, SystemMessage

def _node_select_subtrees(self, state: dict) -> dict:
    question = state.get("question", "")
    subtrees = state.get("subtrees", [])

    toc_lines = []
    for i, st in enumerate(subtrees, start=1):
        preview = [p.get("title", "") for p in st["leaf_pages"][:6]]
        titles_str = ", ".join(preview)
        if len(st["leaf_pages"]) > 6:
            titles_str += f" and {len(st['leaf_pages']) - 6} more"
        toc_lines.append(
            f"id={i} | section={st['title']} | url={st['url']} | "
            f"pages={len(st['leaf_pages'])} | contained pages: {titles_str}"
        )

    system_prompt = (
        "You are a search planner. "
        "From the site structure below, select the section IDs (max 2) "
        "most likely to contain the answer to the user's question. "
        "Return ONLY a JSON array of IDs. Example: [1, 3]"
    )
    human_prompt = f"Question: {question}\n\nSite structure:\n" + "\n".join(toc_lines)

    response = self._llm.invoke(
        [SystemMessage(content=system_prompt), HumanMessage(content=human_prompt)]
    )
    selected_ids = self._extract_id_list(response.content)  # "[1, 3]" → [1, 3]

    selected_paths = [
        st["path"]
        for i, st in enumerate(subtrees, start=1)
        if i in set(selected_ids)
    ]
    return {"selected_subtree_paths": selected_paths}

Node 4 → 5: collect_leaves → select_pages — Stage 2 LLM Filtering

After collecting all leaf pages from the selected subtrees, the LLM narrows down to at most 4 specific pages using headings as a compact content fingerprint.

python
def _node_select_pages(self, state: dict) -> dict:
    question = state.get("question", "")
    leaf_candidates = state.get("leaf_candidates", [])

    page_lines = []
    for i, page in enumerate(leaf_candidates, start=1):
        headings = page.get("headings", [])
        headings_str = ", ".join(headings[:5])
        page_lines.append(
            f"id={i} | title={page.get('title', '')} "
            f"| url={page.get('url', '')} | headings={headings_str}"
        )

    system_prompt = (
        "You are a search planner. "
        "From the page index below, select the page IDs (max 4) "
        "most relevant to the question. "
        "Return ONLY a JSON array of IDs. Example: [3, 8]"
    )
    human_prompt = f"Question: {question}\n\nPage index:\n" + "\n".join(page_lines)

    response = self._llm.invoke(
        [SystemMessage(content=system_prompt), HumanMessage(content=human_prompt)]
    )
    selected_ids = self._extract_id_list(response.content)

    selected_pages = [
        page
        for i, page in enumerate(leaf_candidates, start=1)
        if i in set(selected_ids)
    ]
    return {"selected_pages": selected_pages}

Node 6: answer — Final Response Generation

Selected page content is assembled into a context window (capped at 14,000 characters total, 3,500 per page) and passed to the LLM for answer generation.

python
def _node_answer(self, state: dict) -> dict:
    question = state.get("question", "")
    selected_pages = state.get("selected_pages", [])
    context_info = state.get("context_info", {})

    context_parts = []
    total_chars = 0
    for i, page in enumerate(selected_pages, start=1):
        content = page.get("content", "")[:3500]
        entry = (
            f"[{i}] title={page.get('title', '')}\n"
            f"url={page.get('url', '')}\n"
            f"headings={page.get('headings', '')}\n"
            f"content={content}"
        )
        total_chars += len(entry)
        if total_chars > 14000:
            break
        context_parts.append(entry)

    context = "\n\n".join(context_parts)

    system_prompt = (
        "You are a website assistant AI. "
        "Answer using ONLY the provided context. "
        "If context is insufficient, explicitly state what is missing. "
        "Never use tables. Prefer bullet points for structured information. "
        "End your response with a bullet list of reference URLs."
    )
    human_prompt = (
        f"User context:\n"
        f"- Current URL: {context_info.get('current_url', 'N/A')}\n"
        f"- Timestamp: {context_info.get('timestamp', 'N/A')}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    response = self._llm.invoke(
        [SystemMessage(content=system_prompt), HumanMessage(content=human_prompt)]
    )
    return {"answer": response.content}

Real Logs: What Actually Happened in Production

These are direct excerpts from server logs during initial validation.

Case 1: "What products do you have?"

The ToC passed to select_subtrees (abbreviated):

text
id=1 | section=Contact | url=https://neosophie.com/contact | pages=1
id=2 | section=Blog    | url=https://neosophie.com/ja/blog | pages=17 | contained pages: AI tools engineers check daily... and 11 more
id=4 | section=Products | url=https://neosophie.com/products | pages=1 | contained pages: Products | Neosophie

LLM selected: [4] → navigated directly to /products. Zero vector computation. Correct on first try.

Case 2: "What ASR-related blog posts exist?"

Stage 1 — Subtree selection:

Input to LLM:

text
Question: What ASR-related blog posts exist?

Site structure:
id=1 | section=Contact | url=.../contact | pages=1 | ...
id=2 | section=Blog    | url=.../ja/blog | pages=17 | contained pages: AI tools engineers check daily..., Japanese ASR Model Benchmark 2026..., ...

LLM selected: [2] → correctly narrowed to the blog section without seeing any individual pages.

Stage 2 — Page selection:

The 17 blog pages are presented as a compact index:

text
Question: What ASR-related blog posts exist?

Page index:
id=20 | title=17 Sites AI Engineers Check Daily           | headings=...
id=21 | title=Japanese ASR Benchmark 2026: Whisper, Qwen3, Voxtral, ReazonSpeech | headings=Summary, Test Environment...
id=6  | title=How VibeVoice-ASR Quantization Affects DER  | headings=Summary, Introduction...
id=11 | title=Tag: asr                                    | headings=# asr, VibeVoice-ASR..., Japanese ASR...
...

LLM selected: [21, 6, 11] — the ASR benchmark post, the quantization post, and the ASR tag page. All three are directly relevant. No hallucinated results.

Stage 3 — Answer generation:

text
Context:
[1] title=Japanese ASR Benchmark 2026: Whisper, Qwen3, Voxtral, ReazonSpeech
    url=https://neosophie.com/ja/blog/20260226-japanese-asr-benchmark
    content=Benchmark of 8 Japanese speech recognition (ASR) models under identical conditions...

[2] title=How VibeVoice-ASR Quantization Affects DER
...

The key insight from both cases: the 2-stage subtree → page filtering kept the context window minimal regardless of total site size. The LLM never saw more than it needed to.


Glossary

  • RAG: Retrieval-Augmented Generation — feeding relevant retrieved content into an LLM's context before generation, reducing hallucination and enabling knowledge that isn't in the model's weights
  • BM25: A classical full-text ranking algorithm based on term frequency and inverse document frequency; fast and effective for keyword matching, blind to semantics
  • Embedding: Converting text into a high-dimensional vector such that semantically similar texts produce nearby vectors; the foundation of vector search
  • Reranker: A model that re-scores an initial set of retrieved candidates to improve final ordering; typically a cross-encoder that reads query + document together
  • VectorDB: A database optimized for storing and querying dense vectors; examples include Pinecone, Weaviate, pgvector
  • PageIndex: A vector-free RAG method that uses heading or URL hierarchies as a navigable table of contents, with LLM reasoning replacing vector similarity search
  • LangGraph: A LangChain sub-library for defining LLM agent workflows as directed graphs with explicit state management and conditional branching
  • Leaf node: In a tree structure, a node with no children — the terminal elements
  • DER: Diarization Error Rate — a metric for speaker diarization systems measuring the fraction of audio incorrectly attributed to wrong or missed speakers
  • Cross-encoder: A retrieval model architecture that jointly encodes query and document (vs. a bi-encoder, which encodes them independently), enabling finer-grained relevance scoring at the cost of speed

Known Limitations and Planned Improvements

Limitation 1: Context Overflow on Large Sites

The current implementation passes all top-level subtrees to select_subtrees in one shot. For sites with thousands of pages, the ToC text alone could exceed the context window.

Planned fix: LLM-based pre-grouping. Cluster similar subtrees first, have the LLM select a broad category, then drill into the narrowed subtree in a 3-stage cascade.

Limitation 2: Flat URL Structures Break the Tree

If a site has no meaningful URL hierarchy — all pages at / depth — the tree provides no navigational signal. In this scenario, content-based hierarchies derived from page headings or topic clustering must substitute for URL structure.

When PageIndex won't work well: Sites with flat URL structures (e.g., all pages at root depth), content that's primarily short unstructured fragments (forum posts, chat logs), or queries requiring fuzzy semantic matching ("restaurants with a cozy vibe") are poor fits. In these cases, embedding-based retrieval is the better tool.

Limitation 3: Stale Index on Site Updates

Every site update requires a re-crawl and index rebuild. This is true for VectorDB-based RAG too, but a differential update mechanism (tracking page modification timestamps, only re-crawling changed pages) is a necessary addition for production use.
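
A differential update can be planned with a simple timestamp diff. The sketch below is a hypothetical helper, assuming you store an indexed-at timestamp per path and can fetch fresh last-modified times (from a sitemap or HTTP headers):

```python
from datetime import datetime, timezone

def changed_paths(
    indexed_at: dict[str, str],     # path -> ISO timestamp at last index build
    last_modified: dict[str, str],  # path -> ISO timestamp from a fresh sitemap
) -> dict[str, list[str]]:
    """Return a re-crawl plan: pages to refresh, add, or delete (sketch)."""
    def _parse(ts: str) -> datetime:
        return datetime.fromisoformat(ts).astimezone(timezone.utc)

    recrawl = [
        p for p in indexed_at
        if p in last_modified and _parse(last_modified[p]) > _parse(indexed_at[p])
    ]
    added = [p for p in last_modified if p not in indexed_at]
    deleted = [p for p in indexed_at if p not in last_modified]
    return {"recrawl": recrawl, "added": added, "deleted": deleted}

plan = changed_paths(
    {"/products": "2026-03-01T00:00:00+00:00", "/old": "2026-03-01T00:00:00+00:00"},
    {"/products": "2026-03-10T00:00:00+00:00", "/new": "2026-03-10T00:00:00+00:00"},
)
# plan == {"recrawl": ["/products"], "added": ["/new"], "deleted": ["/old"]}
```

Only the `recrawl` and `added` paths need fetching; `deleted` entries are dropped from the index and the URL tree is rebuilt from the surviving paths.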

Limitation 4: Classifier Edge Cases

The regex-based classify node misses indirect phrasing. A user asking "can you explain what this does?" while viewing a specific page won't trigger the current-page route. An LLM classifier is more robust but adds latency and cost to every request.


Comparison: PageIndex vs. Embedding RAG

text
Dimension                      | Embedding RAG   | PageIndex
VectorDB required              | Yes             | No
Embedding model required       | Yes             | No
Reranker (recommended)         | Yes             | No
LLM calls per query            | 1 (answer only) | Up to 3
Robustness to keyword mismatch | Low             | High
Structured content fit         | Medium          | High
Unstructured content fit       | High            | Low
Implementation complexity      | High            | Low
Operational overhead           | High            | Low

PageIndex shines under a specific set of conditions: structured site with clear URL hierarchy × resource-constrained deployment × high retrieval precision requirement. Outside those conditions, hybrid Embedding RAG remains the more flexible default.


FAQ

Q1: Is PageIndex always better than Embedding RAG?

No. PageIndex excels on well-structured content with clear URL or heading hierarchies. Embedding-based retrieval handles unstructured data, large corpora of short text fragments, and fuzzy semantic queries better. The right answer depends on your content structure and query types — ideally benchmark both on your actual data.

Q2: Can I implement this without LangGraph?

Yes. The nodes are pure functions chained in sequence with conditional branching on question_type. A simple Python function pipeline works fine. LangGraph adds structured state management, easier debugging, and cleaner conditional routing — worth the added dependency if your workflow grows more complex.

Q3: Won't 3 LLM calls per query make costs prohibitive?

In practice, the select_subtrees and select_pages calls have small, structured inputs (ToC text, not full page content) and require only basic reasoning. Small, cheap models (claude-haiku, gpt-4o-mini) handle these filtering stages reliably. The final answer call is the only one that benefits from a high-capability model. A tiered model strategy keeps costs close to single-call approaches.

Q4: Is VectorDB truly zero-cost to eliminate?

In this implementation, yes — pages are stored in JSON or SQLite. No vector index is built or queried. However, if your site reaches thousands of pages or you need BM25 keyword fallback alongside PageIndex, a lightweight hybrid configuration (adding SQLite FTS or Elasticsearch) may become necessary.
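
If keyword fallback does become necessary, SQLite's built-in FTS5 extension is one lightweight option before reaching for Elasticsearch. A minimal sketch, assuming your Python build ships with FTS5 (most do) and using made-up sample rows:

```python
import sqlite3

# In-memory full-text index over the same page records PageIndex uses
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, content)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("/products", "Products", "Speech recognition products and pricing"),
        ("/contact", "Contact", "Get in touch with our team"),
    ],
)

# BM25-style ranked keyword lookup, no extra infrastructure required
rows = conn.execute(
    "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", ("pricing",)
).fetchall()
# rows == [("/products",)]
```

This keeps the "no separate retrieval service" property: the same JSON/SQLite store that backs the page index doubles as the keyword fallback.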

Q5: What about long page content?

The _node_answer implementation caps content at 3,500 characters per page and 14,000 characters total. For long pages, apply heading-based chunking when building the index — store each heading section as a separate page entry. This keeps individual units retrievable without losing the navigational structure.
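
Heading-based chunking can happen at index-build time with a small splitter. `chunk_by_headings` is a hypothetical helper (not part of the implementation above), assuming Markdown-style `#` headings survive crawling:

```python
import re

def chunk_by_headings(path: str, title: str, content: str) -> list[dict]:
    """Split one long page into per-heading index entries (sketch)."""
    sections: list[dict] = []
    current_heading = title
    buffer: list[str] = []

    def _flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            sections.append({
                "path": f"{path}#{len(sections)}",  # Synthetic sub-page path
                "title": title if current_heading == title
                         else f"{title} / {current_heading}",
                "content": text,
            })

    for line in content.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            _flush()  # Close the previous section before starting a new one
            current_heading = m.group(1).strip()
            buffer = []
        else:
            buffer.append(line)
    _flush()
    return sections

chunks = chunk_by_headings(
    "/docs/guide", "Guide", "Intro text\n# Setup\nInstall it\n# Usage\nRun it"
)
# Three entries: the intro, "Guide / Setup", and "Guide / Usage"
```

Each entry slots into `page_by_path` like a normal page, so the two-stage filtering works on sections without any agent changes.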

Q6: Does this work with JavaScript-rendered sites (Next.js, SPA)?

The crawler must render JavaScript to extract page content — Playwright or Puppeteer are suitable tools. Once page content is extracted and stored in the index, the PageIndex agent operates identically regardless of how the source site was rendered.

Q7: How do you handle pages that don't map cleanly to URL hierarchy (e.g., tag pages)?

Tag pages (like /blog/tags/asr) naturally appear as leaf nodes in the tree. Because their titles and headings contain tag names and lists of related articles, the LLM reliably selects them for tag-related queries. The headings field in the page index is particularly useful here — it surfaces content fingerprints without loading full page text.


Conclusion

VectorDB, embedding models, and rerankers are powerful tools — but they carry real infrastructure weight. For a company site at moderate scale, that weight often isn't justified by the retrieval improvement delivered.

PageIndex provides a leaner alternative: URL hierarchy as a table of contents, LLM reasoning as the navigator, two-stage filtering to keep context minimal. The approach won't replace vector search universally, but for structured content × small teams × precision-first requirements, it deserves a serious look before you provision a VectorDB.

The implementation above covers the full agent flow; only the small helpers it references (_load_page_index, _extract_id_list, _collect_non_leaf_subtrees) are left to the reader. If you're hitting retrieval quality issues with hybrid RAG on a structured site, try navigating the tree before you tune the embeddings.
