Prompt engineering changes quickly, but the language developers use to discuss it changes even faster. This glossary is designed as a practical reference for people building with large language models: a place to confirm definitions, distinguish similar terms, and keep pace with how prompting workflows evolve in real applications. Rather than treating terminology as fixed, it helps you track which concepts are stable, which are drifting, and which deserve a fresh review each month or quarter.
Overview
This guide gives you a working prompt engineering glossary for developers, with an emphasis on terms that actually appear in product discussions, prompt reviews, eval docs, and implementation notes. It is not a marketing dictionary. It is a reference for people who need prompts to behave consistently inside tools, automations, and applications.
At a basic level, prompt engineering is the practice of writing structured instructions that guide a model toward usable output. As recent developer-focused guidance has stressed, the goal is not merely to ask better questions. It is to produce responses your code, workflow, or team can rely on. That means prompts often need explicit instructions, context blocks, output constraints, examples, and a clear expectation of what “good” looks like.
For developers, the safest evergreen interpretation is this: prompt engineering works best when you treat prompts like system components rather than one-off chat messages. You define inputs, expected outputs, edge cases, and tests. You refine them over time. You document them so other people can understand why they exist.
Because terminology shifts as models and APIs change, this article is also a tracker. Some terms have remained stable for years, while others have broadened, narrowed, or been absorbed into larger workflow concepts. If you revisit this glossary on a monthly or quarterly cadence, you can keep your team aligned without relearning the basics every time a vendor adds a feature.
Core terms developers should know
Prompt: The input sent to a model. In practice, this may include instructions, user content, examples, formatting rules, retrieved context, tool definitions, and output constraints.
Prompt engineering: The process of designing, testing, and refining prompts so a model produces useful and reliable output. In development work, this often includes versioning prompts, measuring failures, and integrating prompts into applications.
LLM: Large language model. A model trained on large text datasets to predict and generate language. In developer discussions, “LLM” is often shorthand for the model behind a prompt workflow, regardless of whether the task is coding, summarisation, extraction, or analysis.
Inference: The act of running the model on a prompt to generate output. Useful in engineering discussions because it separates model execution from training.
Context: The information available to the model at inference time. Context can include system instructions, user input, attached documents, retrieved passages, chat history, and tool results.
Context window: The amount of text or tokens the model can consider in a single request. This matters when prompts become long, when retrieval adds many documents, or when conversations grow over time.
Token: A unit of text processed by the model. Token limits affect cost, latency, and whether prompts fit in the context window.
System prompt: High-priority instructions that define the model’s role, style, boundaries, and output rules. System prompt examples often include task framing, forbidden behaviours, required formats, and escalation rules.
User prompt: The immediate request or task content supplied by the user or calling application.
Assistant message: Prior model output in a multi-turn conversation. This can influence later responses, sometimes helpfully and sometimes in ways that create drift.
Prompt template: A reusable prompt structure with variables inserted at runtime. Prompt templates are useful when the task is stable but the input changes, such as support classification, code review comments, or product summary generation.
Structured output: Output constrained into a predictable format such as JSON, XML, a schema, or a fixed list of fields. Structured output prompts are important when the result needs to be parsed by code.
What to track
The most useful glossary is not just a list of definitions. It tracks where confusion tends to appear. This section groups terms by the kinds of decisions developers repeatedly make: prompt design, workflow design, and reliability.
Prompting techniques
Zero-shot prompting: Asking the model to perform a task without providing examples. This is often the fastest starting point, especially when the task is common and the instruction is clear.
Few-shot prompting: Providing a small number of examples to show the model what good output looks like. Few shot prompting is especially useful for tone matching, classification consistency, edge-case handling, and extraction formats.
Chain-of-thought prompting: A method that asks the model to reason step by step. Developers should use this term carefully. It is widely discussed, but in production settings the safer practical takeaway is to request better reasoning structure or intermediate checks rather than rely on verbose hidden reasoning as a cure-all.
Prompt chaining: Breaking a complex task into multiple prompts where each step feeds the next. Prompt chaining is often more reliable than one very long instruction because it lets you separate planning, retrieval, generation, validation, and formatting.
Role prompting: Assigning the model a role such as reviewer, tutor, debugger, or policy checker. This can improve consistency when paired with clear task rules, but role labels alone rarely solve vague prompts.
Instruction hierarchy: The practical idea that some instructions carry more weight than others depending on the API and message structure. This matters when system rules conflict with user requests or retrieved content.
Application and workflow terms
Tool calling: A pattern where the model selects or triggers external tools, functions, or APIs. The key distinction is that the model is not expected to know everything itself; it can request structured operations.
Function calling: Often used interchangeably with tool calling, though some platforms reserve it for typed function interfaces. In either case, the goal is predictable interaction with external logic.
RAG: Retrieval-augmented generation. A workflow where relevant documents or records are retrieved and inserted into the prompt context before generation. RAG best practices usually focus on retrieval quality, chunking, grounding, and source control rather than prompting alone.
Grounding: Anchoring the model’s response in supplied context or source material. Grounding reduces unsupported claims when the retrieved material is relevant and clearly prioritised.
Hallucination: Model output that is incorrect, unsupported, or fabricated. The term remains common, though many teams now prefer more precise descriptions such as unsupported answer, citation error, or schema failure depending on the actual problem.
Guardrails: Constraints or checks placed around a prompt workflow. These may include content rules, input validation, schema validation, confidence checks, tool restrictions, or human review steps.
Prompt injection: An attempt to manipulate the model through malicious or conflicting instructions embedded in user input or retrieved content. This is particularly relevant in RAG systems and agent-style workflows.
Agent: A loosely defined term, which is why it should be tracked carefully. In the safest practical sense, an agent is a prompt-driven system that can plan, call tools, inspect results, and continue across multiple steps. Because usage varies, teams should define exactly what they mean when they use it.
Reliability and evaluation terms
Eval: Short for evaluation. A structured way to measure prompt or model performance against tasks, examples, rubrics, or expected outputs.
LLM evaluation: The broader practice of testing model behaviour for accuracy, consistency, safety, formatting compliance, and task success. Prompt changes should ideally be evaluated, not judged by a handful of ad hoc chats.
Test set: A collection of inputs used to compare prompt or model performance. Strong test sets include normal cases, difficult cases, and known failure modes.
Regression: A drop in performance after a prompt, model, or workflow change. Regressions are common when a prompt is improved for one use case but weakened for another.
Determinism: The degree to which the same prompt returns the same output. In practice, LLM systems are rarely perfectly deterministic, so developers focus on acceptable variance rather than exact repetition.
Temperature: A generation setting that influences variability. Lower temperature often supports more stable outputs; higher temperature can increase variation and creativity, though it may reduce consistency.
Latency: The time it takes to receive a response. Prompt length, tool use, model choice, and chaining all affect latency.
Prompt drift: A gradual decline in prompt performance as contexts, models, business rules, or user inputs change. This is one of the best reasons to maintain a living glossary and prompt review habit.
Failure mode: A repeatable way a prompt or workflow goes wrong, such as missing fields, weak citations, unsupported claims, invalid JSON, or refusal where completion was expected.
If you want to build stronger review habits around these terms, it helps to pair this glossary with a metrics framework. Fuzzypoint’s guide on how to evaluate prompt quality is a useful companion for turning vocabulary into testable practice.
Cadence and checkpoints
A glossary becomes more valuable when it has a maintenance rhythm. The point is not to update definitions for the sake of freshness. The point is to review terms when the way your team uses them has changed.
Monthly checks
Once a month, review the terms tied to active work:
- Are people using “agent,” “guardrails,” or “grounding” to mean different things?
- Have new prompt templates introduced terms that need documenting?
- Have model or API changes altered how you think about system prompts, structured output, or tool calling?
- Have new failure modes appeared in logs that deserve a clearer label?
This is also a good time to compare team language against implementation reality. For example, a workflow described internally as “RAG” may actually be simple document stuffing with weak retrieval. A monthly check helps keep terminology honest.
Quarterly checks
Every quarter, revisit the glossary at the architecture level:
- Which prompt engineering definitions still reflect current model behaviour?
- Do your eval terms still map to how you test production systems?
- Have broad labels like “hallucination” become too vague for your incident review process?
- Do you need separate definitions for prompting, orchestration, and evaluation?
Quarterly reviews are also the right place to update examples. Prompting guidance is more useful when each term includes one practical reference from your own stack, such as a classifier prompt, a support triage template, or a JSON extraction task.
Checkpoint triggers outside the calendar
Some updates should happen immediately rather than waiting for the next scheduled review:
- You adopt a new model family or API format.
- You move from chat-style prompts to tool-driven workflows.
- You start using retrieval or external knowledge sources.
- You enforce schema-based output in production.
- You notice repeated misunderstandings in code reviews or incident retrospectives.
For broader practical guidance on maintaining stable prompt systems, see Prompt Engineering Best Practices for Developers: A Living Checklist.
How to interpret changes
Not every terminology shift matters equally. Some changes signal genuine improvements in practice; others are just relabeling. The useful question is whether a new term helps you build, debug, or evaluate systems more clearly.
Good changes to adopt
Adopt new terminology when it improves precision. For example, replacing a vague “the model made something up” with a more specific label like unsupported answer, retrieval miss, or schema violation makes debugging easier. Similarly, distinguishing tool calling from prompt chaining clarifies where logic lives: in the model, the orchestrator, or the surrounding application.
Changes to treat cautiously
Be careful with terms that become fashionable before they become clear. “Agent” is the obvious example. In one team it may mean a simple loop that calls tools. In another it may imply memory, planning, retries, permissions, and long-running task execution. If a term expands too far, define your local version in writing and use that until the wider language settles.
When definitions broaden
Some terms naturally expand as platforms evolve. “Prompt” once implied a block of plain text. Now it often includes message roles, multimodal inputs, schemas, retrieved context, tool specs, and hidden instructions. That broader meaning is worth acknowledging because it affects how developers design and test systems. A prompt is no longer just phrasing. It is often a structured interface.
When definitions narrow
Other terms become narrower as teams mature. “Hallucination” is still useful as a general label, but mature teams often break it into categories because each failure type has a different fix. A fabricated citation, an extraction miss, and a stale RAG answer are not the same operational problem.
If your work involves code generation or automated engineering workflows, these distinctions become even more important. Fuzzypoint’s article on integrating LLM outputs with tests and static analysis offers a practical next step for connecting prompt terminology to software quality controls.
When to revisit
Use this glossary as a recurring checkpoint, not a one-time read. The right moment to revisit it is whenever your prompt language stops matching your real system. That usually happens before teams notice it formally.
Revisit this page when:
- a prompt that used to work starts failing more often
- you introduce a new model and old assumptions no longer hold
- your team starts using broad terms without shared definitions
- you shift from ad hoc prompting to templates, chaining, or tools
- your eval process exposes recurring failure patterns that lack clear labels
- you add retrieval, structured outputs, or tool permissions to production workflows
A practical habit is to keep a short local appendix beside this glossary for your own stack. Add three things only: the term, your team’s working definition, and one real example from production. That keeps vocabulary tied to system behaviour instead of theory.
If you are building a prompt review process, start small. Pick ten terms from this article that matter to your current workload. Define them in your docs. Map each term to one prompt, one failure mode, or one evaluation check. Review the list next month. If nothing changed, that stability is useful information. If several definitions now feel loose, update them before confusion turns into inconsistent output.
The most durable prompt engineering best practices are usually simple: be explicit, structure inputs, constrain outputs, test changes, and document assumptions. A living glossary supports all five. It gives your team a shared vocabulary for discussing what the prompt is supposed to do, what the model actually did, and what needs refinement next.
Bookmark this glossary and return to it on a monthly or quarterly cadence. In prompt engineering, shared language is part of system reliability.