Prompt engineering moved from a novelty to a core engineering discipline faster than almost any practice in software. Two years ago it was a meme. Today, any product that embeds an LLM has a prompt layer that determines whether the product is reliable, safe, and cost-effective — or not. This guide covers what we've learned building production prompts across customer support, content generation, document analysis, and agentic workflow systems.
Prompts Are Source Code Now
The most important mindset shift: treat prompts as source code. Version them. Review them. Test them. Deploy them through the same pipeline as the rest of your application. Teams that treat prompts as documentation that lives in a Slack message or a Google doc will regret it.
In practice this means: prompts live in source control, have automated test suites, get measured for regression when you change models, and carry the same privacy and security review as any customer-facing surface.
Structure That Makes Prompts Reliable
The System Prompt Pattern
Every production prompt has a structured system prompt that stays constant, plus a user prompt that varies per request. The system prompt typically includes:
- Role definition: "You are a customer support agent for Acme Corp..."
- Capabilities and limits: What the model can and cannot help with, and what to do at the boundary.
- Output format: Exact structure of the response — JSON schema, markdown sections, plain text bullets.
- Tone and style: Voice guidelines. Specific words to prefer or avoid.
- Safety policies: Hard refusals and how to route edge cases.
Few-Shot Examples
For any task where the output format is non-trivial, include two to five examples of input-output pairs in the prompt. Few-shot examples are more powerful than detailed instructions for teaching structure. A handful of well-chosen examples does more than a thousand words of description.
Chain-of-Thought When It Helps
For complex reasoning tasks, explicitly instructing the model to "think step by step before answering" improves accuracy substantially. The cost is latency and tokens. For user-facing real-time chat, hide the reasoning and stream only the final answer. For batch analysis or agent workflows, the reasoning trace is often the most valuable part of the output.
Output Structuring: JSON Mode and Tool Use
Freeform text output is fine for chat. Everything else should use structured output. Modern APIs (OpenAI, Anthropic, Google) all support JSON-mode or schema-enforced output that guarantees the response parses into the structure you defined. This eliminates an entire category of "the model returned weird markdown I can't parse" bugs.
For agent workflows, tool use is the structured equivalent — the model chooses from a set of defined functions and returns arguments that match the function's schema. Think of tool use as the primary interface for any LLM that needs to take action, not just respond with text.
The Evaluation Loop
The difference between a toy prompt and a production prompt is evaluation. Before shipping any LLM feature, define:
- A representative evaluation set: 50-500 real inputs with expected outputs or grading criteria. This is your regression suite.
- Scoring criteria: Exact-match for structured outputs. LLM-as-judge for quality. Human review for the highest-stakes use cases.
- A cadence: Run the eval suite on every prompt change, every model change, and periodically even when nothing changes (providers update models silently).
Libraries like Promptfoo, Inspect, and Braintrust Eval make this scaffolding cheap to set up. Teams that ship prompts without an eval suite are flying blind.
Prompt Engineering Anti-Patterns
- The "please please please" prompt: Begging the model to follow instructions rarely helps. Structure and examples do.
- Kitchen-sink system prompts: Prompts that keep growing as bugs are discovered. At some point, split into multiple specialized prompts with a router.
- Missing refusal handling: Models will refuse some requests. Your application has to detect and route refusals gracefully — not assume every response is usable content.
- Trusting model output in critical paths: LLMs hallucinate. Any factual claim that matters must be grounded in tool calls against authoritative data, not model memory.
- Ignoring cost and latency: Every token costs money and adds latency. Trim prompts. Use smaller models where they suffice. Cache system prompts where providers support it.
Prompt Caching and Cost Control
Most providers now offer prompt caching — if a prefix of your prompt is reused across requests, the provider charges a fraction of the cost for those cached tokens. For any application with a large stable system prompt and a small variable user input, caching cuts cost by 50-90% and reduces latency.
Structure your prompts to put stable content first (system prompt, examples, documentation) and variable content last (user query). This maximizes the cacheable prefix.
Safety and Prompt Injection
Any prompt that accepts user input is vulnerable to prompt injection — the user attempts to override the system prompt with their own instructions. Defenses include:
- Clear delimiters between system instructions and user content (XML tags work well).
- Instructions in the system prompt to ignore attempts to override its rules.
- Output filtering for patterns that indicate compromise.
- Privilege separation — never give an LLM access to systems beyond what its task requires.
Prompt injection is an active area of research; no defense is bulletproof. Design systems assuming the prompt can be compromised and limit the blast radius accordingly.
Frequently Asked Questions
How do I know which model to use?
Start with the smallest capable model for cost and latency. Upgrade only when evaluation scores justify the expense. The right answer is frequently a mix — a small model for classification, a large model for reasoning, and a specialized model for code or images.
Should I fine-tune or just prompt-engineer?
Prompt-engineer first. Fine-tuning is worth the investment when you have thousands of labeled examples, when the task is narrow and repetitive, and when cost or latency at scale matters. Most teams get 80%+ of the way there with prompts alone.
How do I manage prompts across multiple LLM providers?
Prompts don't translate perfectly between providers. Maintain provider-specific versions where the differences matter. Use an evaluation suite to confirm parity when switching. Abstractions like LangChain or LiteLLM help with API shape but don't eliminate the need for per-provider prompt tuning.
Open Door Digital builds production LLM features with rigorous evaluation. Talk to our team about your AI roadmap.
Related reading: AI Agents for Business Automation and LLM Fine-Tuning for Business.