# Replaced Embedding RAG with PageIndex on My Own Site — Here's What Actually Happened
TL;DR: After hitting a wall with hybrid BM25 + Embedding + Rerank search (wrong retrievals, high infra cost, Japanese embedding model limitations), I implemented PageIndex — a vector-free RAG architecture — as the AI assistant for my company site. The result: no VectorDB, no embedding model, no Reranker needed. Two-stage LLM-based filtering via a URL tree navigated to the correct pages for queries like "What products do you have?" and "What ASR-related blog posts exist?" — zero vector math involved.
## The Problem: Why Hybrid RAG Broke Down in Practice
The canonical RAG stack looks solid on paper: BM25 (keyword search) + Embedding (vector search) + Rerank. In theory it covers both lexical and semantic retrieval. In practice, running it as an AI assistant for a SaaS documentation site exposed three hard walls.
### Wall 1: Queries That Miss Both Keywords and Semantics
Users don't speak the same language as your docs. If your documentation says "authentication," users ask "how do I log in." BM25 misses. Whether the embedding model captures the semantic gap depends heavily on training data distribution — and for domain-specific content (audio processing, medical, legal), general-purpose embeddings produce coarse vector representations that cluster incorrectly.
No matter how capable the LLM is, if the retriever fails, the answer is garbage. This is the core problem that no prompt engineering can fix.
### Wall 2: Japanese Embedding Model Quality
For Japanese-language content specifically, the embedding model ecosystem is significantly thinner than English. Off-the-shelf multilingual models often compress Japanese semantics poorly. Fine-tuning a domain-specific embedding model is a valid solution — but it adds a non-trivial engineering workload on top of the retrieval problem you're already trying to solve.
### Wall 3: Infrastructure Complexity for Small-Scale Deployments
Running a proper hybrid RAG stack requires:
- A VectorDB (Pinecone, Weaviate, pgvector, etc.) — deployment, indexing pipeline, uptime management
- An embedding model — either a hosted API (cost) or self-hosted (GPU, latency)
- A reranking model — another deployment and inference cost
- A pipeline that stitches these together and handles failures
For a company site with dozens to low hundreds of pages, this is massive overengineering. The infra cost and operational complexity dwarf the value delivered at that scale.
## What Is PageIndex? The "Table of Contents" Mental Model
PageIndex is a RAG architecture that replaces vector search entirely with LLM-based hierarchical navigation.
The mental model is a book's table of contents. You don't read every page of a book to find what you need — you scan the ToC, identify the right chapter, then the right section. PageIndex automates exactly this with an LLM.
### Original PageIndex Design

The original implementation hierarchizes content using Markdown headings:

```markdown
# Chapter 1
## Section 1.1
### Subsection 1.1.1
```
The LLM navigates this tree to locate relevant content, never needing to compute vector similarity.
### The Extension Used Here: URL-Based Hierarchy
For a multi-page website, I adapted PageIndex to treat the URL path structure itself as the content tree:
```text
https://neosophie.com/
├── /ja/blog/
│   ├── /ja/blog/20260226-japanese-asr-benchmark
│   ├── /ja/blog/20260305-vibevoice-quantization
│   └── /ja/blog/tags/
│       ├── /ja/blog/tags/asr
│       └── /ja/blog/tags/diarization
├── /products
└── /contact
```
The LLM is shown a summary of this tree and reasons: "Which subtree is most likely to contain the answer?" No embeddings. No BM25. Just structured reasoning over a compact site map.
This approach shares conceptual DNA with hierarchical skill routing systems — where an agent reads brief skill descriptions to decide which tool to invoke, rather than scanning all tools at once. The key insight in both cases: a compact, structured summary is often more useful than raw semantic similarity.
## Building the URL Tree: Step-by-Step Implementation
### Prerequisites: Crawled URL Collection
Assume you have a dictionary of crawled pages keyed by path:
```python
page_by_path = {
    "/": {"title": "Neosophie", "content": "..."},
    "/products": {"title": "Products", "content": "..."},
    "/ja/blog/20260226-japanese-asr-benchmark": {
        "title": "Japanese ASR Model Benchmark",
        "content": "...",
    },
    "/ja/blog/tags/asr": {"title": "Tag: asr", "content": "..."},
    # ...
}
```
### Step 1: Expand Ancestor Paths
Crawled URLs may skip intermediate paths. If you crawled /docs/api/auth but not /docs/api or /docs, you need to synthesize those intermediate nodes to build a valid tree.
```python
def _ancestor_paths(path: str) -> list[str]:
    """
    "/docs/api/auth" → ["/docs/api/auth", "/docs/api", "/docs", "/"]
    """
    result: list[str] = [path]
    current = path
    while current != "/":
        parent = current.rsplit("/", 1)[0] or "/"
        result.append(parent)
        current = parent
    return result


all_paths: set[str] = set()
for path in list(page_by_path):
    all_paths.update(_ancestor_paths(path))
```
### Step 2: Build Nodes with Crawl Flags
```python
from dataclasses import dataclass, field


@dataclass
class UrlTreeNode:
    segment: str   # Last path segment (e.g., "auth")
    path: str      # Full path (e.g., "/docs/api/auth")
    url: str       # Absolute URL
    title: str     # Page title (crawled pages only)
    crawled: bool  # Whether this page was actually fetched
    children: list["UrlTreeNode"] = field(default_factory=list)


nodes: dict[str, UrlTreeNode] = {}
for path in sorted(all_paths):
    page = page_by_path.get(path)
    crawled = page is not None
    segment = path.rsplit("/", 1)[-1] or "/"
    nodes[path] = UrlTreeNode(
        segment=segment,
        path=path,
        url=f"https://example.com{path}",
        title=page.get("title", "") if page else "",
        crawled=crawled,
    )
```
### Step 3: Link Parent-Child Relationships
```python
roots: list[UrlTreeNode] = []
for path, node in nodes.items():
    if path == "/":
        roots.append(node)
        continue
    parent_path = path.rsplit("/", 1)[0] or "/"
    parent = nodes.get(parent_path)
    if parent is not None:
        parent.children.append(node)
    else:
        roots.append(node)  # Treat as root if parent is missing
```
### Step 4: Collapse Single-Child Intermediate Nodes
If /ja (uncrawled) has only one child /ja/blog, skip /ja and connect /ja/blog directly to the parent. This prevents unnecessary depth inflation.
```python
def _collapse_single_child_intermediates(
    nodes: list[UrlTreeNode],
) -> list[UrlTreeNode]:
    result: list[UrlTreeNode] = []
    for node in nodes:
        node.children = _collapse_single_child_intermediates(node.children)
        # Uncrawled node with exactly one child → skip it
        if not node.crawled and len(node.children) == 1:
            result.append(node.children[0])
        else:
            result.append(node)
    return result
```
Before and after the collapse:
```text
Before:
/
└── ja (uncrawled, one child)
    └── blog (crawled)

After:
/
└── blog   ← /ja eliminated, blog promoted directly
```
## The LangGraph Agent: All 6 Nodes Explained
The full workflow is a LangGraph agent with six nodes:
### Node 1: `classify` — Routing Current-Page vs. Site-Wide Questions
Distinguishes "summarize this page" (the user is viewing a specific page) from "what blog posts exist about ASR?" (site-wide retrieval).
```python
import re

# Module-level pattern shared by the classify node
_CURRENT_PAGE_PATTERN = re.compile(
    r"this page|current page|summarize|explain this|what does this",
    re.IGNORECASE,
)


def _node_classify(self, state: dict) -> dict:
    question = state.get("question", "")
    current_url = state.get("context_info", {}).get("current_url")
    if current_url and _CURRENT_PAGE_PATTERN.search(question):
        return {"question_type": "current_page"}
    return {"question_type": "general"}
```
Implementation note: The current regex-based classifier misses indirect phrasing like "tell me about this service." An LLM-based classifier is a planned improvement for ambiguous cases.
### Node 2a: `fetch_current_page` (current_page route)
Directly fetches the page by URL. Falls back to the first 4 pages in the index if the URL isn't found.
```python
def _node_fetch_current_page(self, state: dict) -> dict:
    current_url = state.get("context_info", {}).get("current_url", "")
    index = self._load_page_index()
    pages_by_url = {p["url"]: p for p in index.get("pages", [])}
    page = pages_by_url.get(current_url)
    if page:
        return {"selected_pages": [page]}
    # Fallback: first 4 pages
    return {"selected_pages": list(pages_by_url.values())[:4]}
```
### Node 2b: `build_subtrees` (general route)
Converts the URL tree into a flat list of subtrees, each annotated with the leaf pages it contains. This is the structure the LLM will reason over in the next step.
```python
def _node_build_subtrees(self, state: dict) -> dict:
    index = self._load_page_index()
    url_tree = index.get("url_tree", [])
    pages_by_url = {p["url"]: p for p in index.get("pages", [])}
    seen_paths: set[str] = set()
    subtrees: list[dict] = []
    # Prioritize top-level children (highest navigational importance)
    for root in url_tree:
        for child in root.get("children", []):
            path = child.get("path", "")
            if path not in seen_paths:
                seen_paths.add(path)
                leaf_pages = self._collect_all_leaf_pages(child, pages_by_url)
                subtrees.append({
                    "path": path,
                    "url": child.get("url", ""),
                    "title": child.get("title", ""),
                    "leaf_pages": leaf_pages,
                })
    # Add deeper non-leaf nodes (deduplicated via seen_paths)
    self._collect_non_leaf_subtrees(url_tree, pages_by_url, subtrees, seen_paths)
    return {"subtrees": subtrees}

@staticmethod
def _collect_all_leaf_pages(
    node: dict,
    pages_by_url: dict,
) -> list[dict]:
    """Recursively collect all leaf pages under a subtree node."""
    children = node.get("children", [])
    if not children:
        page = pages_by_url.get(node.get("url", ""))
        return [page] if page else []
    result = []
    for child in children:
        result.extend(
            RagAgentService._collect_all_leaf_pages(child, pages_by_url)
        )
    return result
```
### Node 3: `select_subtrees` — Stage 1 LLM Filtering
The LLM receives a ToC-style text summary of all subtrees and selects up to 2 subtree IDs most relevant to the question.
```python
from langchain_core.messages import HumanMessage, SystemMessage


def _node_select_subtrees(self, state: dict) -> dict:
    question = state.get("question", "")
    subtrees = state.get("subtrees", [])
    toc_lines = []
    for i, st in enumerate(subtrees, start=1):
        preview = [p.get("title", "") for p in st["leaf_pages"][:6]]
        titles_str = ", ".join(preview)
        if len(st["leaf_pages"]) > 6:
            titles_str += f" and {len(st['leaf_pages']) - 6} more"
        toc_lines.append(
            f"id={i} | section={st['title']} | url={st['url']} | "
            f"pages={len(st['leaf_pages'])} | contained pages: {titles_str}"
        )
    system_prompt = (
        "You are a search planner. "
        "From the site structure below, select the section IDs (max 2) "
        "most likely to contain the answer to the user's question. "
        "Return ONLY a JSON array of IDs. Example: [1, 3]"
    )
    human_prompt = f"Question: {question}\n\nSite structure:\n" + "\n".join(toc_lines)
    response = self._llm.invoke(
        [SystemMessage(content=system_prompt), HumanMessage(content=human_prompt)]
    )
    selected_ids = self._extract_id_list(response.content)  # "[1, 3]" → [1, 3]
    selected_paths = [
        st["path"]
        for i, st in enumerate(subtrees, start=1)
        if i in set(selected_ids)
    ]
    return {"selected_subtree_paths": selected_paths}
```
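The `_extract_id_list` helper referenced above could be implemented along these lines. This is a sketch, not the article's actual code: LLMs sometimes wrap the JSON array in prose or markdown fences, so it pays to parse defensively.

```python
import json
import re


def extract_id_list(raw: str) -> list[int]:
    """Pull a JSON array of integer IDs out of an LLM response.

    Tolerates surrounding prose or markdown fences, e.g.
    'Sure! [1, 3]' or a fenced ```json block. Returns [] when
    no parseable array is found.
    """
    match = re.search(r"\[[^\]]*\]", raw)
    if not match:
        return []
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [int(x) for x in parsed if isinstance(x, (int, float))]
```

Returning an empty list on parse failure lets the caller fall back gracefully (e.g., selecting no subtrees) instead of crashing the workflow on a malformed response.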
### Node 4 → 5: `collect_leaves` → `select_pages` — Stage 2 LLM Filtering
After collecting all leaf pages from the selected subtrees, the LLM narrows down to at most 4 specific pages using headings as a compact content fingerprint.
```python
def _node_select_pages(self, state: dict) -> dict:
    question = state.get("question", "")
    leaf_candidates = state.get("leaf_candidates", [])
    page_lines = []
    for i, page in enumerate(leaf_candidates, start=1):
        headings = page.get("headings", [])
        headings_str = ", ".join(headings[:5])
        page_lines.append(
            f"id={i} | title={page.get('title', '')} "
            f"| url={page.get('url', '')} | headings={headings_str}"
        )
    system_prompt = (
        "You are a search planner. "
        "From the page index below, select the page IDs (max 4) "
        "most relevant to the question. "
        "Return ONLY a JSON array of IDs. Example: [3, 8]"
    )
    human_prompt = f"Question: {question}\n\nPage index:\n" + "\n".join(page_lines)
    response = self._llm.invoke(
        [SystemMessage(content=system_prompt), HumanMessage(content=human_prompt)]
    )
    selected_ids = self._extract_id_list(response.content)
    selected_pages = [
        page
        for i, page in enumerate(leaf_candidates, start=1)
        if i in set(selected_ids)
    ]
    return {"selected_pages": selected_pages}
```
### Node 6: `answer` — Final Response Generation
Selected page content is assembled into a context window (capped at 14,000 characters total, 3,500 per page) and passed to the LLM for answer generation.
```python
def _node_answer(self, state: dict) -> dict:
    question = state.get("question", "")
    selected_pages = state.get("selected_pages", [])
    context_info = state.get("context_info", {})
    context_parts = []
    total_chars = 0
    for i, page in enumerate(selected_pages, start=1):
        content = page.get("content", "")[:3500]
        entry = (
            f"[{i}] title={page.get('title', '')}\n"
            f"url={page.get('url', '')}\n"
            f"headings={page.get('headings', '')}\n"
            f"content={content}"
        )
        total_chars += len(entry)
        if total_chars > 14000:
            break
        context_parts.append(entry)
    context = "\n\n".join(context_parts)
    system_prompt = (
        "You are a website assistant AI. "
        "Answer using ONLY the provided context. "
        "If context is insufficient, explicitly state what is missing. "
        "Never use tables. Prefer bullet points for structured information. "
        "End your response with a bullet list of reference URLs."
    )
    human_prompt = (
        f"User context:\n"
        f"- Current URL: {context_info.get('current_url', 'N/A')}\n"
        f"- Timestamp: {context_info.get('timestamp', 'N/A')}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = self._llm.invoke(
        [SystemMessage(content=system_prompt), HumanMessage(content=human_prompt)]
    )
    return {"answer": response.content}
```
## Real Logs: What Actually Happened in Production
These are direct excerpts from server logs during initial validation.
### Case 1: "What products do you have?"
The ToC passed to select_subtrees (abbreviated):
```text
id=1 | section=Contact | url=https://neosophie.com/contact | pages=1
id=2 | section=Blog | url=https://neosophie.com/ja/blog | pages=17 | contained pages: AI tools engineers check daily... and 11 more
id=4 | section=Products | url=https://neosophie.com/products | pages=1 | contained pages: Products | Neosophie
```
LLM selected: [4] → navigated directly to /products. Zero vector computation. Correct on first try.
### Case 2: "What ASR-related blog posts exist?"
Stage 1 — Subtree selection:
Input to LLM:

```text
Question: What ASR-related blog posts exist?

Site structure:
id=1 | section=Contact | url=.../contact | pages=1 | ...
id=2 | section=Blog | url=.../ja/blog | pages=17 | contained pages: AI tools engineers check daily..., Japanese ASR Model Benchmark 2026..., ...
```
LLM selected: [2] → correctly narrowed to the blog section without seeing any individual pages.
Stage 2 — Page selection:
The 17 blog pages are presented as a compact index:
```text
Question: What ASR-related blog posts exist?

Page index:
id=20 | title=17 Sites AI Engineers Check Daily | headings=...
id=21 | title=Japanese ASR Benchmark 2026: Whisper, Qwen3, Voxtral, ReazonSpeech | headings=Summary, Test Environment...
id=6 | title=How VibeVoice-ASR Quantization Affects DER | headings=Summary, Introduction...
id=11 | title=Tag: asr | headings=# asr, VibeVoice-ASR..., Japanese ASR...
...
```
LLM selected: [21, 6, 11] — the ASR benchmark post, the quantization post, and the ASR tag page. All three are directly relevant. No hallucinated results.
Stage 3 — Answer generation:
```text
Context:
[1] title=Japanese ASR Benchmark 2026: Whisper, Qwen3, Voxtral, ReazonSpeech
url=https://neosophie.com/ja/blog/20260226-japanese-asr-benchmark
content=Benchmark of 8 Japanese speech recognition (ASR) models under identical conditions...

[2] title=How VibeVoice-ASR Quantization Affects DER
...
```
The key insight from both cases: the 2-stage subtree → page filtering kept the context window minimal regardless of total site size. The LLM never saw more than it needed to.
## Glossary
| Term | Definition |
|---|---|
| RAG | Retrieval-Augmented Generation — feeding relevant retrieved content into an LLM's context before generation, reducing hallucination and enabling knowledge that isn't in the model's weights |
| BM25 | A classical full-text ranking algorithm based on term frequency and inverse document frequency; fast and effective for keyword matching, blind to semantics |
| Embedding | Converting text into a high-dimensional vector such that semantically similar texts produce nearby vectors; the foundation of vector search |
| Reranker | A model that re-scores an initial set of retrieved candidates to improve final ordering; typically a cross-encoder that reads query + document together |
| VectorDB | A database optimized for storing and querying dense vectors; examples include Pinecone, Weaviate, pgvector |
| PageIndex | A vector-free RAG method that uses heading or URL hierarchies as a navigable table of contents, with LLM reasoning replacing vector similarity search |
| LangGraph | A LangChain sub-library for defining LLM agent workflows as directed graphs with explicit state management and conditional branching |
| Leaf node | In a tree structure, a node with no children — the terminal elements |
| DER | Diarization Error Rate — a metric for speaker diarization systems measuring the fraction of audio incorrectly attributed to wrong or missed speakers |
| Cross-encoder | A retrieval model architecture that jointly encodes query and document (vs. bi-encoder which encodes them independently), enabling finer-grained relevance scoring at the cost of speed |
## Known Limitations and Planned Improvements
### Limitation 1: Context Overflow on Large Sites
The current implementation passes all top-level subtrees to select_subtrees in one shot. For sites with thousands of pages, the ToC text alone could exceed the context window.
Planned fix: LLM-based pre-grouping. Cluster similar subtrees first, have the LLM select a broad category, then drill into the narrowed subtree in a 3-stage cascade.
### Limitation 2: Flat URL Structures Break the Tree
If a site has no meaningful URL hierarchy — all pages at / depth — the tree provides no navigational signal. In this scenario, content-based hierarchies derived from page headings or topic clustering must substitute for URL structure.
When PageIndex won't work well: Sites with flat URL structures (e.g., all pages at root depth), content that's primarily short unstructured fragments (forum posts, chat logs), or queries requiring fuzzy semantic matching ("restaurants with a cozy vibe") are poor fits. In these cases, embedding-based retrieval is the better tool.
### Limitation 3: Stale Index on Site Updates
Every site update requires a re-crawl and index rebuild. This is true for VectorDB-based RAG too, but a differential update mechanism (tracking page modification timestamps, only re-crawling changed pages) is a necessary addition for production use.
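One minimal sketch of such a mechanism, using content hashes rather than modification timestamps (timestamps can be misleading when pages are re-rendered without actual changes); the function and parameter names here are illustrative, not part of the current implementation:

```python
import hashlib


def pages_to_recrawl(
    stored_hashes: dict[str, str],
    fetched_html: dict[str, str],
) -> set[str]:
    """Return the paths whose content changed since the last index build.

    stored_hashes: path -> sha256 hex digest saved at the previous crawl
    fetched_html:  path -> freshly fetched raw HTML
    Deleted paths would be handled separately via a set difference.
    """
    changed: set[str] = set()
    for path, html in fetched_html.items():
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if stored_hashes.get(path) != digest:
            # New page, or content differs from the stored fingerprint
            changed.add(path)
    return changed
```

In practice this would be paired with HTTP conditional requests (ETag / Last-Modified) so unchanged pages can be skipped before fetching their full HTML at all.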
### Limitation 4: Classifier Edge Cases
The regex-based classify node misses indirect phrasing. A user asking "can you explain what this does?" while viewing a specific page won't trigger the current-page route. An LLM classifier is more robust but adds latency and cost to every request.
## Comparison: PageIndex vs. Embedding RAG
| Dimension | Embedding RAG | PageIndex |
|---|---|---|
| VectorDB required | Yes | No |
| Embedding model required | Yes | No |
| Reranker (recommended) | Yes | No |
| LLM calls per query | 1 (answer only) | Up to 3 |
| Robustness to keyword mismatch | Low | High |
| Structured content fit | Medium | High |
| Unstructured content fit | High | Low |
| Implementation complexity | High | Low |
| Operational overhead | High | Low |
PageIndex shines under a specific set of conditions: structured site with clear URL hierarchy × resource-constrained deployment × high retrieval precision requirement. Outside those conditions, hybrid Embedding RAG remains the more flexible default.
## FAQ
### Q1: Is PageIndex always better than Embedding RAG?
No. PageIndex excels on well-structured content with clear URL or heading hierarchies. Embedding-based retrieval handles unstructured data, large corpora of short text fragments, and fuzzy semantic queries better. The right answer depends on your content structure and query types — ideally benchmark both on your actual data.
### Q2: Can I implement this without LangGraph?
Yes. The nodes are pure functions chained in sequence with conditional branching on question_type. A simple Python function pipeline works fine. LangGraph adds structured state management, easier debugging, and cleaner conditional routing — worth the added dependency if your workflow grows more complex.
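For illustration, a minimal stand-in for the graph using plain functions. Each node is a `dict -> dict` function whose partial result is merged into the state; the node names mirror the article's workflow, but this wiring is a sketch, not the actual implementation:

```python
def run_pipeline(state: dict, nodes: dict) -> dict:
    """Run the PageIndex workflow without LangGraph.

    `nodes` maps node names to plain functions dict -> dict; each
    result is merged into the accumulated state, and question_type
    drives the conditional branch that LangGraph would otherwise model.
    """
    state = {**state, **nodes["classify"](state)}
    if state["question_type"] == "current_page":
        state = {**state, **nodes["fetch_current_page"](state)}
    else:
        for name in ("build_subtrees", "select_subtrees",
                     "collect_leaves", "select_pages"):
            state = {**state, **nodes[name](state)}
    return {**state, **nodes["answer"](state)}
```

The trade-off is exactly as stated above: this is simpler, but you give up LangGraph's state inspection and graph visualization as the workflow grows.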
### Q3: Won't 3 LLM calls per query make costs prohibitive?
In practice, the select_subtrees and select_pages calls have small, structured inputs (ToC text, not full page content) and require only basic reasoning. Small, cheap models (claude-haiku, gpt-4o-mini) handle these filtering stages reliably. The final answer call is the only one that benefits from a high-capability model. A tiered model strategy keeps costs close to single-call approaches.
### Q4: Is eliminating the VectorDB truly cost-free?
In this implementation, yes — pages are stored in JSON or SQLite. No vector index is built or queried. However, if your site reaches thousands of pages or you need BM25 keyword fallback alongside PageIndex, a lightweight hybrid configuration (adding SQLite FTS or Elasticsearch) may become necessary.
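As a sketch of that lightweight hybrid, SQLite's built-in FTS5 extension provides BM25-ranked keyword search with no extra infrastructure. The table and column names below are illustrative, and this assumes your Python's SQLite was compiled with FTS5 (the default in most modern builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: all columns are full-text indexed
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, content)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("/products", "Products", "speech recognition products and pricing"),
        ("/contact", "Contact", "how to reach our team"),
    ],
)
# bm25() is FTS5's built-in ranking function; lower scores rank higher
rows = conn.execute(
    "SELECT url FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("speech",),
).fetchall()
```

This gives you the BM25 keyword fallback without adding Elasticsearch to the stack, which keeps the "no extra infrastructure" property of PageIndex largely intact.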
### Q5: What about long page content?
The _node_answer implementation caps content at 3,500 characters per page and 14,000 characters total. For long pages, apply heading-based chunking when building the index — store each heading section as a separate page entry. This keeps individual units retrievable without losing the navigational structure.
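A sketch of that heading-based chunking, splitting a page's Markdown into one index entry per heading section. The field names mirror the article's page index, but the anchor-URL convention and function name are assumptions for illustration:

```python
import re


def split_by_headings(markdown: str, url: str) -> list[dict]:
    """Split a page's Markdown into one index entry per heading section.

    Each entry keeps the parent URL plus a fragment anchor, so the
    navigational tree structure is preserved while individual sections
    stay under the per-page character cap.
    """
    # re.split with a capturing group keeps the matched headings:
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(r"(?m)^(#{1,3} .+)$", markdown)
    entries = []
    for heading, body in zip(parts[1::2], parts[2::2]):
        title = heading.lstrip("# ").strip()
        entries.append({
            "url": f"{url}#{title.lower().replace(' ', '-')}",
            "title": title,
            "content": body.strip(),
        })
    return entries
```

Each section then participates in Stage 2 page selection as its own candidate, so a single long page can't crowd out other results in the context window.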
### Q6: Does this work with JavaScript-rendered sites (Next.js, SPA)?
The crawler must render JavaScript to extract page content — Playwright or Puppeteer are suitable tools. Once page content is extracted and stored in the index, the PageIndex agent operates identically regardless of how the source site was rendered.
### Q7: How do you handle pages that don't map cleanly to URL hierarchy (e.g., tag pages)?
Tag pages (like /blog/tags/asr) naturally appear as leaf nodes in the tree. Because their titles and headings contain tag names and lists of related articles, the LLM reliably selects them for tag-related queries. The headings field in the page index is particularly useful here — it surfaces content fingerprints without loading full page text.
## Conclusion
VectorDB, embedding models, and rerankers are powerful tools — but they carry real infrastructure weight. For a company site at moderate scale, that weight often isn't justified by the retrieval improvement delivered.
PageIndex provides a leaner alternative: URL hierarchy as a table of contents, LLM reasoning as the navigator, two-stage filtering to keep context minimal. The approach won't replace vector search universally, but for structured content × small teams × precision-first requirements, it deserves a serious look before you provision a VectorDB.
The full implementation above is reproducible as-is. If you're hitting retrieval quality issues with hybrid RAG on a structured site, try navigating the tree before you tune the embeddings.