Portfolio demo · May 4, 2026

PDF invoice extraction with hybrid semantic + numeric search

Drop in an invoice; get back structured fields, embeddings, and a search bar that knows the difference between "consulting services" and "over $20,000".
Next.js 16
OpenAI gpt-4o-mini
Pinecone
Neon Postgres
Vercel Blob

Why this exists

Document AI demos usually stop at "the model wrote a summary." Real document workflows need exact numbers — tax rates, totals, vendor names — that have to come out of the PDF and into a queryable shape. This system takes invoices end-to-end: private upload, structured extraction against a Zod schema, vector embedding, and a search endpoint that routes the same query box to either semantic similarity or parameterised SQL depending on what the user typed.

What's built

  • Private upload pipeline. Files land in private Vercel Blob and are served through an authenticated proxy at /api/documents/:id/file — the blob URL is never exposed.
  • AI extraction with strict typing. gpt-4o-mini pulls invoice number, vendor, dates, subtotal, tax rate/amount, total, and line items against a Zod schema; failures surface as retryable errors rather than silently degraded data.
  • Hybrid search. A single search box at /api/search auto-detects numeric intent and routes accordingly: "consulting services" → Pinecone top-5; "tax 9%" → tax_rate BETWEEN 0.085 AND 0.095; "over $20,000" → SQL range query.
  • Retry & delete. Failed extractions can be re-run in place; deletes cascade across blob, vectors, and the DB row in one server action.
  • Synthetic sample data. The homepage offers a ZIP of three randomly generated invoice PDFs (via pdfkit) so the entire pipeline can be tried without bringing your own data.
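
The private-proxy pattern from the first bullet can be sketched as two small helpers plus a route handler. This is an illustrative sketch, not the project's actual code — the names (getDocument, getSessionUserId) and the Doc shape are assumptions:

```typescript
type Doc = { id: string; blobUrl: string; filename: string; ownerId: string };

// Only the document's owner may fetch it through the proxy.
export function canAccess(doc: Doc, sessionUserId: string | null): boolean {
  return sessionUserId !== null && doc.ownerId === sessionUserId;
}

// Headers for the proxied response: preview inline, never cache privately
// served documents at the edge.
export function proxyHeaders(doc: Doc): Record<string, string> {
  return {
    "Content-Type": "application/pdf",
    "Content-Disposition": `inline; filename="${doc.filename}"`,
    "Cache-Control": "private, no-store",
  };
}

// Route handler shape (app/api/documents/[id]/file/route.ts), assuming
// hypothetical getDocument / getSessionUserId helpers:
//
// export async function GET(req: Request, { params }: { params: { id: string } }) {
//   const doc = await getDocument(params.id);
//   if (!doc || !canAccess(doc, await getSessionUserId(req))) {
//     return new Response("Not found", { status: 404 });
//   }
//   const blob = await fetch(doc.blobUrl); // private URL stays server-side
//   return new Response(blob.body, { headers: proxyHeaders(doc) });
// }
```

The key property is that the blob URL only ever appears inside the fetch on the server; the client sees /api/documents/:id/file and nothing else.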

Technical choices worth calling out

Two retrieval modes, one input box

The naïve approach is to embed every query and rely on vector similarity. That fails the moment a user types "over $20,000" — the embedding doesn't know what $ or > mean numerically. So the search endpoint inspects the query first: numeric tokens (currency, percentages, comparisons) route to parameterised SQL against total_amount / tax_rate; everything else routes to Pinecone. Users get one box; the backend picks the right index.
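The routing step can be sketched as a pure function. The regexes and the ±0.5-percentage-point window around a stated tax rate are illustrative assumptions, not the production values:

```typescript
type Intent =
  | { mode: "vector"; text: string }
  | { mode: "sql"; field: "total_amount" | "tax_rate"; op: ">" | "<" | "between"; lo: number; hi?: number };

export function routeQuery(q: string): Intent {
  const query = q.trim().toLowerCase();

  // "tax 9%" → tax_rate range around the stated rate
  const pct = query.match(/(\d+(?:\.\d+)?)\s*%/);
  if (pct) {
    const rate = parseFloat(pct[1]) / 100;
    return { mode: "sql", field: "tax_rate", op: "between", lo: rate - 0.005, hi: rate + 0.005 };
  }

  // "over $20,000" / "under $500" → amount comparison
  const amt = query.match(/\b(over|above|under|below)\s*\$?([\d,]+(?:\.\d+)?)/);
  if (amt) {
    const value = parseFloat(amt[2].replace(/,/g, ""));
    const op = amt[1] === "over" || amt[1] === "above" ? ">" : "<";
    return { mode: "sql", field: "total_amount", op, lo: value };
  }

  // Everything else → embed and query Pinecone.
  return { mode: "vector", text: q };
}
```

The returned intent then feeds either a parameterised query (values bound, never interpolated) or the embedding call.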

Zod schema as the contract between LLM and DB

The extraction prompt asks for an object that matches a Zod schema, and the schema is also the type that's written to Postgres. If GPT returns malformed JSON, parsing fails before any DB write — there's no "partially extracted" row. Same schema doubles as runtime validation on the API responses.

Auto-init schema on first request

DB tables are created on the first request that touches them — no manual migration script in the deploy pipeline, no db:setup step that has to be remembered. npm run db:init exists as an escape hatch for testing, but the production happy path doesn't use it.
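One way to implement this safely under concurrent first requests is a memoized init promise. A sketch, with `runSql` standing in for the real Neon client:

```typescript
type SqlRunner = (sql: string) => Promise<void>;

// Returns an ensureSchema() that runs CREATE TABLE at most once per
// server instance; concurrent callers share the same in-flight promise.
export function makeEnsureSchema(runSql: SqlRunner): () => Promise<void> {
  let initPromise: Promise<void> | null = null;
  return () => {
    initPromise ??= runSql(`
      CREATE TABLE IF NOT EXISTS documents (
        id TEXT PRIMARY KEY,
        vendor TEXT,
        total_amount NUMERIC,
        tax_rate NUMERIC
      )
    `);
    return initPromise;
  };
}
```

Every query path awaits ensureSchema() before touching the table; IF NOT EXISTS keeps it idempotent across separate serverless instances that each initialise independently.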

1024-dim embeddings to match Pinecone

text-embedding-3-small is configured to output 1024 dimensions instead of its default 1536. That's chosen specifically to fit the Pinecone Serverless cosine-1024 index — same index, same vector store, reused across this and the support-chatbot case study without spinning up a second pod.
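The only non-default knob is the `dimensions` parameter on the embeddings request, which text-embedding-3-small supports natively. A sketch using fetch directly to keep the example dependency-free (assumes OPENAI_API_KEY is set):

```typescript
// The request body: model and dimensions are the two knobs that matter.
// dimensions must match the Pinecone index exactly — 1024, not the
// model's default 1536.
export function embeddingRequestBody(input: string) {
  return {
    model: "text-embedding-3-small",
    input,
    dimensions: 1024,
  };
}

export async function embed(text: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(embeddingRequestBody(text)),
  });
  const json = await res.json();
  return json.data[0].embedding as number[];
}
```

A mismatch here fails loudly — Pinecone rejects upserts whose vector length differs from the index dimension — so pinning it in one place keeps both case studies on the same index.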

Outcome

The pipeline ships into client products that need typed extraction from documents — invoices, receipts, contracts, forms — without rewriting the schema-driven prompt, the hybrid retrieval router, or the private-blob serving pattern.

Live demo →