Classic RAG retrieves from a vector store you built ahead of time. That’s perfect for your own documents — and useless for anything that changed this morning. Live-web RAG flips it: retrieve from the open web at query time, so the model answers from current sources. This guide builds it with one Auxiliar key handling both the retrieval (search) and the fetch (scrape-to-markdown).
Why fetch the page, not just the snippet?
Search APIs return short snippets. For grounding, snippets are thin — they drop the numbers, caveats and context that make an answer correct. So the reliable pattern is: search to find the right pages, then fetch those pages as clean markdown and feed the real text to the model. The fetch step is where scrape quality matters; garbled HTML poisons the answer.
The pipeline
import os, requests
from anthropic import Anthropic

AUX = "https://api.auxiliar.ai"
H = {"Authorization": f"Bearer {os.environ['AUXILIAR_API_KEY']}"}
llm = Anthropic()

def retrieve(question, k=4):
    # Tavily is agent-native: returns relevance-scored results in one call.
    r = requests.post(f"{AUX}/tavily/search", headers=H,
                      json={"query": question, "max_results": k}, timeout=30)
    r.raise_for_status()
    return [hit["url"] for hit in r.json().get("results", [])]

def to_markdown(url):
    # Firecrawl returns clean, LLM-ready markdown (top markdown quality in our benchmark).
    r = requests.post(f"{AUX}/firecrawl/v1/scrape", headers=H,
                      json={"url": url, "formats": ["markdown"]}, timeout=60)
    return r.json().get("data", {}).get("markdown", "") if r.ok else ""

def answer(question):
    docs = [(u, to_markdown(u)) for u in retrieve(question)]
    context = "\n\n".join(f"[{i+1}] {u}\n{md[:5000]}" for i, (u, md) in enumerate(docs) if md)
    msg = llm.messages.create(
        model="claude-sonnet-5", max_tokens=800,
        messages=[{"role": "user", "content":
            f"Using only the sources below, answer and cite with [n]. If unsure, say so.\n\n"
            f"Q: {question}\n\n{context}"}])
    return msg.content[0].text

print(answer("What's the current status of the EU AI Act's rules for general-purpose models?"))
No vector database, no embeddings, no re-indexing job — just fresh retrieval at query time. Add embeddings later if you want to rank or cache; for many agents, live retrieval alone is enough.
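If you do add caching, you may not need embeddings for it at all. Here is a minimal sketch of a time-limited in-memory cache keyed by URL, so repeated questions that hit the same pages skip the fetch. The names `make_cached_fetcher`, `fake_fetch` and the 60-second TTL are illustrative assumptions, not part of the pipeline above; in practice you would wrap `to_markdown` instead of the stand-in fetcher.

```python
import time
from typing import Callable, Dict, Tuple


def make_cached_fetcher(fetch: Callable[[str], str],
                        ttl_seconds: float = 900.0,
                        clock: Callable[[], float] = time.monotonic) -> Callable[[str], str]:
    """Wrap a URL -> markdown function with an in-memory cache whose entries expire."""
    store: Dict[str, Tuple[float, str]] = {}

    def cached(url: str) -> str:
        now = clock()
        hit = store.get(url)
        if hit is not None and now - hit[0] < ttl_seconds:
            return hit[1]  # still fresh: skip the network call
        markdown = fetch(url)
        if markdown:  # do not cache failures (empty string)
            store[url] = (now, markdown)
        return markdown

    return cached


# Stand-in fetcher so the sketch runs offline (hypothetical; swap in to_markdown).
calls = []

def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"# Page at {url}"

# A controllable clock makes the expiry behaviour easy to see.
fake_now = [0.0]
cached_fetch = make_cached_fetcher(fake_fetch, ttl_seconds=60, clock=lambda: fake_now[0])

cached_fetch("https://example.com/a")  # miss: fetches
cached_fetch("https://example.com/a")  # hit: served from cache
fake_now[0] = 61.0
cached_fetch("https://example.com/a")  # expired: fetches again
print(len(calls))  # 2
```

Keep the TTL short for fast-moving topics: a long-lived cache quietly brings back the staleness that live retrieval is meant to avoid.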
Getting the retrieval right
The quality ceiling of live-web RAG is set by two choices, and both are one-line swaps on the gateway:
- The index. Tavily is agent-native; Exa is neural/semantic; Serper is raw Google. They surface different pages for the same query. Compare them in "best search API for AI agents".
- The reader. Markdown quality varies a lot between scrapers, and on protected sites the fetch can fail entirely. See "Firecrawl vs Jina" for the two leading URL-to-markdown options, and "scrape without getting blocked" for hard targets.
Because retrieval and fetch share one Auxiliar key, you can mix and match — Tavily to find, Firecrawl to read, a stealth scraper as fallback — without a single extra signup.
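One way to wire up that fallback is to treat each reader as a plain function from URL to markdown and try them in order. The sketch below is an illustration under assumptions: `read_with_fallback`, the stand-in readers and the `min_chars` threshold are hypothetical names and values, and no gateway endpoints beyond the ones shown earlier are implied. In the pipeline, the first reader would be `to_markdown` and the later ones whatever alternatives you route through the gateway.

```python
from typing import Callable, Iterable, Optional, Tuple

Reader = Callable[[str], str]  # URL in, markdown out ("" on failure)


def read_with_fallback(url: str,
                       readers: Iterable[Tuple[str, Reader]],
                       min_chars: int = 200) -> Tuple[Optional[str], str]:
    """Try each (name, reader) in order and return the first usable markdown.

    A result counts as usable when it has at least min_chars of non-whitespace-
    trimmed text; the threshold is an assumed heuristic for spotting block pages
    and empty scrapes.
    """
    for name, reader in readers:
        try:
            markdown = reader(url)
        except Exception:
            continue  # treat a crashed reader like a failed fetch
        if len(markdown.strip()) >= min_chars:
            return name, markdown
    return None, ""  # every reader failed


# Stand-in readers so the sketch runs offline (hypothetical).
def blocked_reader(url: str) -> str:
    return ""  # simulates a fetch blocked by the target site

def stealth_reader(url: str) -> str:
    return "# Title\n" + "body text " * 40

name, md = read_with_fallback(
    "https://example.com/protected",
    [("firecrawl", blocked_reader), ("stealth", stealth_reader)],
)
print(name)  # stealth
```

Returning the reader's name alongside the text makes it easy to log which reader actually served each source, which helps when you compare markdown quality later.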