From a distance, the inventory assistant we wrote about looks like a RAG system — a model grounding its answer in retrieved context. Up close it's a different pattern, and the distinction is worth a short post: the choice determines what kind of data the assistant can answer from and how fresh those answers are.
What RAG actually is
RAG — retrieval-augmented generation — is a specific pattern. You take a corpus of text (PDFs, documentation, support transcripts, release notes), split it into chunks, embed each chunk as a vector, and store those vectors in a database that can search by similarity. At query time, the user's question is embedded too, the top-k closest chunks are pulled out, and they're stuffed into the model's prompt as context. The model reads the chunks and writes an answer grounded in them.
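The retrieval half of that pipeline is simple enough to sketch. This is a toy illustration, not a real system: a bag-of-words counter stands in for the embedding model, and cosine similarity ranks the chunks. The documentation snippets are invented.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words token counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # The core RAG move: embed the question, rank chunks by similarity.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "To add a new outlet, open Settings and choose Outlets.",
    "Refunds are processed within three business days.",
    "Each outlet tracks its own stock levels independently.",
]
context = top_k("how do I add an outlet", chunks)
prompt = "Answer from this context:\n" + "\n".join(context)
```

Everything downstream — the quality of the answer included — depends on whether the right chunks land in `context`, which is exactly the fragility the next paragraph describes.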
It's a beautiful pattern for unstructured knowledge. The catch: the answer is only as fresh as the last indexing run, only as precise as the embedding model's sense of similarity, and only as complete as the chunks that happen to land in the top-k.
What we built instead
Our inventory assistant doesn't embed anything. When a user asks "how many blue T-shirts in size small at the Dhanmondi outlet," the model doesn't retrieve chunks of text — it calls a tool. The tool takes a typed input, runs a MongoDB aggregation against the tenant's live database, and returns a structured result. The model reads that result and answers.
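In code, the shape is roughly this: a tool schema the model can call, and a handler that turns the typed arguments into an aggregation pipeline. The tool name, field names, and collection layout below are invented for illustration; they are not our actual schema.

```python
# Hypothetical tool definition in the JSON-schema shape most
# function-calling APIs expect. Field names are illustrative.
COUNT_STOCK_TOOL = {
    "name": "count_stock",
    "description": "Count units matching product attributes at an outlet.",
    "parameters": {
        "type": "object",
        "properties": {
            "outlet": {"type": "string"},
            "attrs": {"type": "object"},
        },
        "required": ["outlet"],
    },
}

def build_pipeline(outlet: str, attrs: dict) -> list[dict]:
    # Translate the model's typed arguments into a MongoDB aggregation:
    # an exact-match filter, then a sum over the matching rows.
    match = {"outlet": outlet, **{f"attrs.{k}": v for k, v in attrs.items()}}
    return [
        {"$match": match},
        {"$group": {"_id": None, "units": {"$sum": "$quantity"}}},
    ]

pipeline = build_pipeline("Dhanmondi", {"color": "blue", "size": "S"})
```

The model never sees the database; it sees the schema, picks arguments, and reads back the structured result the pipeline returns.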
The industry term is tool-augmented generation, or function calling. The research community sometimes calls it agentic RAG to emphasize that the model decides which query to issue. The mechanism is different in kind, not degree.
Four differences that matter
Freshness. Vector RAG answers from whatever was last indexed — yesterday, last week, whenever the pipeline last ran. A tool call hits the live database, so "how many units sold in the last hour" is a legitimate question.
Precision. A pipeline with $match and $group returns a number. Similarity search returns a set of probably-relevant passages and the model has to reconcile them. For quantitative questions — count, sum, average, top-N — the tool path is unambiguous.
Writes. Classical RAG is read-only by design: retrieval doesn't mutate state. Our tools can — transfers, adjustments, receiving, price edits — as long as the user approves each one. That's beyond what RAG is for.
Cost profile. RAG pays for embeddings (once per chunk, re-run whenever the source updates) plus a similarity search per query. Tool calling pays for a schema preflight plus the query itself. Neither is strictly cheaper; they're optimized for different data.
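The precision point is easy to demonstrate with an in-memory stand-in for the `$match` + `$group` stages: filter rows on exact attributes, then sum a field. The rows are invented; a real query would run against the live collection.

```python
def run_count(rows: list[dict], match: dict, sum_field: str = "quantity") -> int:
    # In-memory stand-in for $match then $group/$sum:
    # keep rows where every match key is an exact hit, then total the field.
    hits = [r for r in rows if all(r.get(k) == v for k, v in match.items())]
    return sum(r[sum_field] for r in hits)

rows = [
    {"outlet": "Dhanmondi", "color": "blue", "size": "S", "quantity": 4},
    {"outlet": "Dhanmondi", "color": "blue", "size": "M", "quantity": 7},
    {"outlet": "Gulshan", "color": "blue", "size": "S", "quantity": 2},
]
run_count(rows, {"outlet": "Dhanmondi", "color": "blue", "size": "S"})  # 4
```

There is nothing for the model to reconcile: the answer is one number, and it is the same number every time the query runs against the same data.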
“When the answer lives in rows, tool calling wins. When it lives in paragraphs, RAG wins.”
Where we would still reach for RAG
Structured data lives in a database. Unstructured knowledge lives in prose — manuals, runbooks, support transcripts, release notes. If we were building a "what did this feature used to do" assistant, or a "how do I configure a new outlet" assistant, a vector store over our documentation would be exactly right. The model would answer from the docs, with citations to the source chunks.
The difference comes down to the shape of the data. Sometimes a single assistant wants both — the support bot that quotes the manual and then checks the account — and you end up with both patterns running under the same conversation.
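One way to run both under a single conversation is a router in front of the two paths. The keyword heuristic below is a deliberately naive sketch; in practice the model itself would choose by deciding whether to emit a tool call.

```python
# Hypothetical routing heuristic: quantitative phrasing suggests the
# answer lives in rows (tool call); everything else falls back to
# doc retrieval. A real system would let the model decide.
STRUCTURED_HINTS = ("how many", "count", "total", "sum", "average", "top")

def route(question: str) -> str:
    q = question.lower()
    return "tool" if any(h in q for h in STRUCTURED_HINTS) else "rag"

route("how many blue T-shirts in size small")  # "tool"
route("how do I configure a new outlet")       # "rag"
```

The point is not the heuristic but the architecture: both patterns coexist behind one assistant, each answering the questions its data shape suits.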
The right tool for the right data
It's tempting to call everything that augments an LLM "RAG," because the high-level shape is the same — retrieve context, then generate. But naming matters: it sets the expectation for what the system can do. A tool-calling agent that can move stock carries safety considerations a doc-search bot never faces, and flattening the distinction robs engineers of a mental model they need.
We went with tools because inventory lives in rows. We'd choose differently for a corpus that lives in paragraphs — and the next piece of the assistant, a help bot that answers from the BMS manual, probably will.