What this prompt does
This prompt makes the AI a senior AI application engineer that specifies a Gemini multi-modal document Q&A system tightly enough to build, returning working server and client code rather than pseudocode. You provide the [model], the [stack], and the [doc_types], and it returns the server route and the client component plus a one-line note on how citations map back to page numbers.
The six deliverables target document trust: PDF upload and parsing for your [doc_types] that preserves images and tables with OCR for scanned pages; a retrieval-and-grounding step so answers cite exact source page numbers; a multi-modal prompt that passes page images plus extracted text to your [model]; streaming responses with token-by-token rendering; prompt caching to cut cost on repeated questions over the same document; and guardrails that refuse when the answer is not in the source and surface low-confidence answers. The structure works because the trust killer in document tools is a confident answer with no source - users cannot verify it - so page-level citations and grounded refusals are baked in from the start.
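A minimal sketch of the retrieval-and-grounding step with a strict refusal, assuming pages have already been parsed into text. The names `PageChunk`, `retrieve`, and `ground_or_refuse`, the word-overlap scoring, and the 0.5 threshold are all illustrative assumptions; a real build would likely use embeddings and tune the threshold on real documents.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    page: int  # 1-based page number in the source PDF
    text: str  # extracted (or OCR'd) text for that page


def _tokens(s: str) -> set[str]:
    # Lowercased alphanumeric words; a deliberately crude stand-in for embeddings.
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve(question: str, chunks: list[PageChunk], top_k: int = 3):
    """Score each page by the fraction of question words it contains."""
    q = _tokens(question)
    scored = []
    for chunk in chunks:
        score = len(q & _tokens(chunk.text)) / len(q) if q else 0.0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def ground_or_refuse(question: str, chunks: list[PageChunk], min_score: float = 0.5):
    """Return page-tagged context for the model, or refuse if nothing matches well."""
    hits = [(s, c) for s, c in retrieve(question, chunks) if s >= min_score]
    if not hits:
        # Grounded refusal: no page supports the question, so do not call the model.
        return {"refused": True, "pages": [], "context": ""}
    context = "\n\n".join(f"[page {c.page}]\n{c.text}" for _, c in hits)
    return {"refused": False, "pages": [c.page for _, c in hits], "context": context}
```

Because every context block carries its page tag, the model can be instructed to cite only those pages, and a question with no supporting page never reaches the model at all.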
When to use it
- You are building a document Q&A tool where users must verify answers against the source.
- Your documents include images, tables, or scanned pages needing OCR.
- You need answers grounded with exact page-number citations, not free-floating claims.
- You want streaming, token-by-token responses for a responsive UI.
- You re-ask many questions over the same long document and want caching to control cost.
Example output
You get a server route that handles PDF upload, parsing with OCR, retrieval and grounding, the multi-modal call to your [model] passing page images plus text, prompt caching, and refusal guardrails - plus a client component that streams the answer token by token and shows page-number citations. A short note explains how each citation maps back to its source page.
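One way the citation-to-page mapping could be checked, sketched under the assumption that the model is told to cite with inline markers such as `[p.3]`. That marker format and the helper names `extract_citations` and `check_citations` are hypothetical, not part of the prompt's guaranteed output.

```python
from __future__ import annotations

import re

# Assumed citation marker format: [p.3] or [p. 3]
CITATION = re.compile(r"\[p\.\s*(\d+)\]")


def extract_citations(answer: str) -> list[int]:
    """Return cited page numbers in order of first appearance, without duplicates."""
    pages: list[int] = []
    for match in CITATION.finditer(answer):
        page = int(match.group(1))
        if page not in pages:
            pages.append(page)
    return pages


def check_citations(answer: str, retrieved_pages: set[int]) -> dict:
    """Flag answers that cite nothing, or cite pages that were never retrieved."""
    cited = extract_citations(answer)
    unsupported = [p for p in cited if p not in retrieved_pages]
    return {
        "cited": cited,
        "unsupported": unsupported,
        # Surface as low confidence rather than silently trusting the answer.
        "low_confidence": not cited or bool(unsupported),
    }
```

The client can render each entry in `cited` as a link to that page and show a visible warning whenever `low_confidence` is true.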
Pro tips
- Add prompt caching early; re-sending a long PDF on every follow-up question quietly wrecks the bill.
- Set [doc_types] to your real inputs - scanned invoices need OCR, while clean digital reports may not, and that changes the parsing path.
- Keep the grounded-refusal guardrail strict; an answer not in the source should be refused, not guessed.
- Match [model] to one with strong multi-modal support so page images are actually read, not ignored.
- Verify citations map to the right page on real documents; grounding logic can drift on multi-column or rotated pages.
- Surface low-confidence answers visibly so users know when to double-check rather than trusting silently.
- Pass both the page image and the extracted text to the model; tables and charts are read far more reliably from the image, while text grounding sharpens the citation.
- Stream the response token by token so a long grounded answer feels responsive instead of leaving the user staring at a blank screen while it generates.
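The caching tip above can be illustrated by keying on the document's content, so follow-up questions over the same PDF reuse one cached context instead of re-sending it. This is only a sketch: `DocumentCache` is a hypothetical helper, and `create_handle` is a placeholder for whatever call your provider's SDK offers for creating cached context, not a real API.

```python
from __future__ import annotations

import hashlib
from typing import Callable


class DocumentCache:
    """Maps a PDF's content hash to a provider-side cache handle."""

    def __init__(self, create_handle: Callable[[bytes], str]):
        # create_handle is an assumed stand-in for the provider's cache-creation call.
        self._create = create_handle
        self._handles: dict[str, str] = {}

    def handle_for(self, pdf_bytes: bytes) -> str:
        # Identical bytes hash to the same key, so re-uploads of one file share a handle.
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key not in self._handles:
            self._handles[key] = self._create(pdf_bytes)  # only on a cache miss
        return self._handles[key]
```

An in-memory dictionary is enough to show the idea; a deployed server would more likely persist the mapping and respect the expiry of the provider's cached content.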