Blog · 22 Apr 2026 · 6 min

Making inventory feel like a conversation

An AI assistant we're building into BMS — how a fast intent router, a typed tool library and a confirmation loop turn stock management from a form into a sentence.

Ask a warehouse manager how much of a product they have left across three outlets. They open four screens, filter twice, and read you a number. Our AI inventory assistant answers the same question in a sentence — and, if the conversation is heading there, offers to transfer the stock it just found.

Why a conversation, not a form

Most inventory tools are a stack of forms. You know what you want, but you have to translate it — first into which screen, then into which filter, then into which column. A chat interface short-circuits the translation: you ask, the system does the column work, and you see the answer in plain English. When the answer isn't enough on its own, the same thread carries into the next action without a context switch.

The hard part isn't the chat bubble. It's making the model safe and specific enough that a sentence like "transfer ten blue T-shirts from the main warehouse to Dhanmondi" doesn't end in the wrong store with the wrong count. That is what the architecture below is for.

Two passes instead of one

Every user message goes through a small, fast router before the main model sees it. The router is a single-token call to a lightweight LLM with a strict system prompt: return one word — "chat" or "inventory." Greetings, acknowledgements and small talk fall into the first bucket; anything touching stock, transfers, items or locations falls into the second.

For "chat" we strip the tool library out of the next call — the model answers conversationally without the overhead of a nineteen-tool context. For "inventory" we keep the tools attached and let the model plan. The net effect: warm small talk stays cheap, and the full pipeline kicks in only when it's earned.

A tool library, not a prompt

The assistant isn't a monolithic prompt with "rules." It's the main model plus nineteen named tools — schema introspection, read-only queries over a dozen collections, stock transfers, adjustments, receiving, price and media edits, and a web-search escape hatch via Tavily. Each tool declares its input schema with Zod; each tool's description is written for the model, not for humans.

The important split is inside the library. Reads execute immediately: the model calls them, gets a result, keeps planning. Writes don't — they're declared without an execute function, which signals the AI SDK to surface a confirmation card in the UI. Nothing moves stock without a user's yes.
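The read/write split can be sketched as below. In the real library each input schema is a Zod object and the tools plug into the AI SDK; to stay self-contained, this sketch uses a plain validator in place of Zod, and the tool names (`getStockLevels`, `transferStock`) are illustrative.

```typescript
// A read tool carries an execute function; a write tool deliberately omits it,
// which tells the client layer to render a confirmation card instead of running.
interface ToolDef<Input> {
  description: string;               // written for the model, not for humans
  validate: (raw: unknown) => Input; // stands in for a Zod schema's .parse
  execute?: (input: Input) => Promise<unknown>; // absent => needs approval
}

// Illustrative read tool: runs immediately when the model calls it.
const getStockLevels: ToolDef<{ itemName: string }> = {
  description: "Look up current stock for an item across all outlets.",
  validate: (raw) => {
    const input = raw as { itemName?: unknown };
    if (typeof input?.itemName !== "string") throw new Error("itemName required");
    return { itemName: input.itemName };
  },
  execute: async ({ itemName }) => ({ itemName, total: 0 /* query goes here */ }),
};

// Illustrative write tool: no execute, so the UI must collect a "yes" first.
const transferStock: ToolDef<{ itemName: string; qty: number; from: string; to: string }> = {
  description: "Move stock between locations. Requires user confirmation.",
  validate: (raw) => raw as { itemName: string; qty: number; from: string; to: string },
};

function needsApproval(tool: ToolDef<any>): boolean {
  return tool.execute === undefined;
}
```

The nice property of this encoding is that "dangerous" is structural, not a flag someone can forget to set: a write tool is dangerous precisely because it cannot run on its own.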



The approval loop, made idempotent

When a user confirms a write, the approved tool call comes back in the next request and hits the server's tool executor. Before running the operation we check a ten-minute TTL cache keyed by the tool-call ID — if the same approval is replayed (a network reconnect, a refresh mid-stream), we return the cached result instead of running the operation twice. That turns an AI-driven chat into something that behaves like a well-written POST endpoint.
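A minimal sketch of that idempotency layer, assuming an in-process map is an acceptable cache (a shared store would be needed across server instances):

```typescript
// Idempotent tool execution: a replayed approval returns the cached result
// instead of running the write twice. Keyed by the tool-call ID.
const TTL_MS = 10 * 60 * 1000; // ten-minute TTL

interface CacheEntry {
  result: unknown;
  expiresAt: number;
}

const approvalCache = new Map<string, CacheEntry>();

async function runApprovedWrite(
  toolCallId: string,
  operation: () => Promise<unknown>,
  now: () => number = Date.now,
): Promise<unknown> {
  const cached = approvalCache.get(toolCallId);
  if (cached && cached.expiresAt > now()) {
    return cached.result; // replay: same approval, same result, no second write
  }
  const result = await operation();
  approvalCache.set(toolCallId, { result, expiresAt: now() + TTL_MS });
  return result;
}
```

A refresh mid-stream resubmits the same tool-call ID, hits the cache, and the user sees the same confirmation they would have seen the first time, with the stock moved exactly once.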

Tenant, permission, schema

Before any tool runs, the request clears three gates: it's authenticated (per-request session validation), the user holds the VIEW_ITEMS permission, and any query tool first calls getSchemaContext for its target collection. The last one is explicit in the tool's own description — the model is told, in prose, that it must read the schema before writing a pipeline. That removes a class of silent failures where the model invents a field that doesn't exist in this tenant's data.

Database connections are scoped per tenant. The tool executor is built fresh for each request around a tenantId; there is no shared pool where the AI could accidentally read another company's records.
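The per-request construction can be sketched like this. The names here (`Session`, `dbForTenant`, the executor shape) are illustrative stand-ins, not the real API; the point is that tenant scope and permissions are closed over at construction time, before any tool can run.

```typescript
// A tool executor built fresh per request: the tenant scope is fixed at
// construction, so no tool call can reach another tenant's data.
interface Session {
  userId: string;
  tenantId: string;
  permissions: Set<string>;
}

// Stand-in for a tenant-scoped connection factory.
function dbForTenant(tenantId: string) {
  return { tenantId };
}

function buildToolExecutor(session: Session | null) {
  // Gate 1: authenticated session (validated per request).
  if (!session) throw new Error("unauthenticated");
  // Gate 2: required permission.
  if (!session.permissions.has("VIEW_ITEMS")) throw new Error("forbidden");
  // Gate 3 lives in the tools themselves: query tools are instructed to call
  // getSchemaContext for their target collection before building a pipeline.
  const db = dbForTenant(session.tenantId);
  return {
    run(toolName: string, input: unknown) {
      return { toolName, input, tenant: session.tenantId, db };
    },
  };
}
```

Because the executor is rebuilt per request, there is no ambient connection for the model to misuse: whatever tool it calls, the query lands in the caller's tenant.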

Step-bounded, cache-friendly

The main generation runs with a hard ceiling: at most seven planning steps before the stream must terminate. In practice a typical query resolves in two — one to pull schema, one to run the aggregation — but the bound keeps pathological loops out of production.
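The shape of that ceiling, sketched without the SDK (the production version uses the AI SDK's built-in step limit; `nextStep` here is a stand-in for one model turn):

```typescript
// A step-bounded agent loop: the model may keep requesting tools, but the
// loop terminates after MAX_STEPS regardless of what it asks for.
const MAX_STEPS = 7;

type Step =
  | { type: "tool-call"; name: string }
  | { type: "final"; text: string };

async function runBounded(
  nextStep: (history: Step[]) => Promise<Step>,
): Promise<Step[]> {
  const history: Step[] = [];
  for (let i = 0; i < MAX_STEPS; i++) {
    const step = await nextStep(history);
    history.push(step);
    if (step.type === "final") break; // resolved early, as most queries do
  }
  return history; // at most MAX_STEPS entries, even for a looping model
}
```

A healthy query finishes the loop in two iterations; a pathological one burns its seven steps and stops, instead of spinning until a timeout.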

Everything runs through Vercel's AI Gateway with caching set to auto. The system prompt and schema payloads hit the cache; user messages don't. On a warm cache, a question like "which items expire next month" costs a fraction of what a cold one does, and the user sees a faster response.

What we're still working on

The write-class tools are the most fun and the most dangerous. Simple transfers work well today; larger flows — adjustments across a whole stocktake, bulk price updates — need a better summary card before we put them in front of customers. We would rather ship one fewer capability than a capability that moves stock you didn't mean to move.

Intent classification is also a moving target. The current router is a small heuristic on top of a small LLM; the heuristic catches "hi" and "thanks" locally, the LLM handles the rest. Over time we expect more of that to live in a classifier tuned on our own traffic — closing the loop from conversation to correct stock, one message at a time.

ProSystem Engineering
