Interactive course · 11 sections · 37 min

Prompt Engineering for Non-Technical Users

A vendor-neutral, evidence-based course on prompt engineering for non-technical learners. Worked examples, practical exercises, quizzes, and a final practice exam with interactive multi-choice questions.

By Site Admin 8 June 2026

Executive summary

This course is designed as a vendor-neutral, evidence-based introduction to prompt engineering for non-technical learners as of 8 June 2026. The central finding from current primary guidance is that the major providers now broadly agree on the same core workflow: define the task clearly, provide the right context, specify the desired output, test against realistic examples, and iterate using evaluation rather than trusting a first draft. OpenAI defines prompt engineering as writing effective instructions so models reliably meet requirements; Google describes prompt design as creating natural-language requests that elicit accurate, high-quality responses; Anthropic frames prompt engineering as improvement against explicit success criteria; Microsoft guidance likewise stresses specificity, context, and output design.

For non-technical users, the most transferable skills are not coding skills. They are clear task definition, audience awareness, constraints, examples, source grounding, and output checking. Official guidance from OpenAI, Anthropic, Google, and Microsoft supports all of those habits, and mainstream features such as file upload, web search, structured outputs, prompt variables, prompt versioning, and built-in evaluation tools now make them practical.

The pedagogy in this course is intentionally novice-friendly. Each unit uses worked examples, then guided practice, then a short low-stakes quiz, because research on worked examples and retrieval practice shows strong benefits for novice learning and long-term retention. A larger review of learning techniques likewise rates practice testing highly and finds that techniques differ substantially in their usefulness, which supports a course structure that repeatedly checks understanding rather than relying on passive reading alone.

The prompt-engineering workflow

1. Define the task and success criteria.
2. Draft a prompt that captures task, audience, context, constraints, and desired output.
3. Test on realistic examples rather than a single happy-path case.
4. Inspect failures to find patterns.
5. Revise the instructions, context, examples, or output format.
6. Run structured evaluations to compare versions.
7. Deploy and monitor in real use, then loop back to step 4.

This workflow reflects the evaluation-centred loops described in Anthropic's prompt-engineering and evaluation guidance and OpenAI's evaluation documentation.

Open questions and limitations

Model names, model limits, UI features, and plan availability change quickly. The model and tool comparisons in this course reflect the official pages opened on 8 June 2026 and the information visible on those pages at that time. Before procurement or production rollout, learners should re-check the current vendor model pages and product help centres.

Course blueprint

Course purpose

The purpose of this course is to help non-technical learners become reliable, critical, and safe users of LLM-based tools at work and in study. The course assumes that prompting is iterative rather than magical, and that prompted outputs should be checked, grounded, and evaluated. That framing matches the current primary guidance from OpenAI, Anthropic, Google, and Microsoft.

Learning objectives

By the end of the course, learners should be able to:

Explain in plain language what an LLM is, how it is trained at a high level, and why it can still make mistakes.
Write a clear first-draft prompt using task, context, audience, constraints, and desired format.
Improve output consistency using roles, examples, schemas, and step-by-step prompt chains.
Decide when to use files, retrieval, or web grounding instead of relying on a model's memory alone.
Reduce hallucination risk by requesting evidence, allowing uncertainty, and defining fallback behaviour.
Recognise major limitations such as knowledge cutoffs, non-determinism, and prompt injection.
Evaluate prompts with realistic test cases and simple success criteria.
Choose an appropriate platform or model family for common non-technical workflows.

Target audience and prerequisites

Item	Course assumption
Intended learner	Office workers, managers, analysts, educators, researchers, policy staff, administrators, students, and founders who want to use LLM tools effectively without writing code.
Prior technical knowledge	None required.
Required prior AI knowledge	None required.
What is helpful but optional	Comfort with copying text from documents, using web apps, and comparing outputs critically.
What is not required	Python, APIs, statistics, machine learning maths, or software engineering.
Recommended setup	Access to at least one mainstream assistant plus one prompt-testing interface if available.

The non-technical positioning is deliberate. Current vendor guidance shows that natural-language prompting, files, sources, and web interfaces are now first-class workflows rather than developer-only features.

Syllabus and estimated duration

Module	Focus	Guided time
Foundations of large language models	Architecture, training, safety, limitations, hallucinations, evaluation metrics	90 minutes
Writing your first reliable prompt	Task, context, audience, constraints, ideal output	75 minutes
Controlling output with role, context, constraints, and format	System role, structure, separators, output contracts	75 minutes
Improving consistency with examples, schemas, and chained prompts	Few-shot prompting, structured outputs, decomposition	90 minutes
Grounding answers and reducing hallucinations	Files, retrieval, search, citations, uncertainty, prompt injection	105 minutes
Evaluating prompts and building repeatable workflows	Success criteria, test cases, manual evaluation, tool selection, capstone	90 minutes
Final practice exam and review	Consolidation and assessment	60 minutes

Total recommended course time: about 9 hours 45 minutes (585 minutes of guided time), including exercises and review.

Prompt technique comparison

Technique	What it is	Best use	Main strength	Main risk or limitation
Clear task, context, and desired output	Tell the model what to do, give background, say what the answer should look like	Almost every task	Usually the highest-return starting point for non-technical users	Vague prompts produce vague results
Role or system instruction	Set role, tone, boundaries, and output contract at the highest level	Repeatable assistants, house style, safety boundaries	Helps shape behaviour and formatting	Does not guarantee compliance, so it must still be tested
Few-shot examples	Include a small number of input/output examples	Classification, extraction, tone matching, structured responses	Often improves consistency and accuracy quickly	Poor or inconsistent examples can teach the wrong pattern
Structured outputs or schemas	Require JSON or another explicit response structure	Data extraction, automation, tables, hand-off to other tools	Easier to parse and compare	Overly rigid schemas can force bad completions if the input is unrelated
Prompt chaining	Split a complex task into multiple prompts	Research, analysis, editing, transformation pipelines	Reduces overload and makes checking easier	More steps mean more coordination and more evaluation points
Grounding with files, retrieval, or search	Give the model trusted source material or allow it to retrieve current information	Policy Q&A, current events, document analysis	Improves factual accuracy and traceability	Retrieval can still miss evidence; web content can be untrusted
Explicit uncertainty or fallback policy	Tell the model what to do when unsure	High-stakes questions, policy, legal, medical, finance, internal docs	Reduces false confidence and unsupported answers	If overused, may produce unnecessary refusals or hedging

Current LLM landscape

Representative current model families, as listed on official provider pages checked on 8 June 2026.

Provider	Representative current models	Context and cutoff information
OpenAI	GPT-5.5, GPT-5.4 mini	GPT-5.5: 1M context, 128K max output, knowledge cutoff 1 Dec 2025; GPT-5.4 mini: 400K context, 128K max output, knowledge cutoff 31 Aug 2025
Anthropic	Claude Opus 4.8, Claude Sonnet 4.6, Claude Haiku 4.5	Opus 4.8: 1M context, 128K max output, reliable knowledge cutoff Jan 2026; Sonnet 4.6: 1M context, 64K max output, reliable cutoff Aug 2025; Haiku 4.5: 200K context, 64K max output, reliable cutoff Feb 2025
Google	Gemini 2.5 Pro, Gemini 3.5 Flash	Both: 1,048,576 input tokens, 65,536 output tokens; knowledge cutoff January 2025

Recommended platforms for learners

Everyday assistants

Platform	Best fit	Why use it
ChatGPT	General drafting, summarising, source-based research, file-based work	Strong all-round platform for practising prompt basics, document analysis, and research workflows
Claude	Careful drafting, reasoning, and prompt refinement	Particularly useful for learning prompt refinement and prompt testing discipline
Gemini Apps	Everyday ideation, file analysis, connected apps, source inspection	Very strong for grounded multimodal workflows and learner-facing study features
Microsoft 365 Copilot Chat	Work and education environments already using Microsoft 365	Best fit when the learner's workplace already centres its work in Microsoft 365

Prompt labs and evaluation environments

Platform	Best fit	Why use it
OpenAI Playground	Prompt iteration with version control	Excellent for teaching disciplined prompt iteration
Anthropic Console	Prompt generation and test-set-based evaluation	Excellent for showing what "prompt engineering with evals" looks like in practice
Google AI Studio	Fast experimentation	Good for rapid experimentation and classroom demos
Microsoft Copilot Studio	Low-code organisational prompting	Strong for business users who need repeatable prompt actions without full software development

Module 1: Foundations of large language models

Lesson content

At a high level, modern LLMs are built on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need", which proposed a sequence model based on attention mechanisms rather than recurrence or convolution. That architecture became the basis for later large language models. GPT-3 then showed that scaling autoregressive language models can substantially improve few-shot performance, meaning the model can adapt to tasks from instructions or a few examples without updating its weights during use.

A practical non-technical picture of training has three stages. First, a model is pretrained on very large text corpora so it learns patterns of language. Second, the model is often instruction-tuned or supervised fine-tuned on examples of desired behaviour. Third, many widely used assistants are preference-aligned using methods such as reinforcement learning from human feedback, or, in Anthropic's case, variants such as Constitutional AI and reinforcement learning from AI feedback.

Current systems process text as tokens, not as full words or sentences. OpenAI's documentation explains that tokens are chunks of text, often shorter than a word, and gives the English rule of thumb that 1 token is roughly 4 characters or 0.75 words. The same documentation notes that prompt and output tokens together must fit within a model's context length. Google describes the context window as analogous to short-term memory: it is the amount of information a model can consider in one go. Current flagship model pages from OpenAI, Anthropic, and Google all publish context windows around 1 million tokens for some top models, which is why large-document prompting is now common.
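For learners who want to see the arithmetic, the rule of thumb above can be turned into a rough size check. This is a minimal sketch, not a real tokenizer: the function names, the 8,000-token limit, and the 1,000 tokens reserved for the answer are illustrative assumptions, and actual token counts vary by model and language.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb quoted above for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb quoted above for English text


def estimate_tokens(text: str) -> int:
    """Rough estimate: take the larger of the character-based and word-based guesses."""
    by_chars = math.ceil(len(text) / CHARS_PER_TOKEN)
    by_words = math.ceil(len(text.split()) / WORDS_PER_TOKEN)
    return max(by_chars, by_words)


def fits_in_context(prompt: str, reserved_output_tokens: int, context_limit: int) -> bool:
    """Prompt tokens plus the space kept for the answer must fit in the context window."""
    return estimate_tokens(prompt) + reserved_output_tokens <= context_limit


report = "word " * 3000  # stand-in for a pasted 3,000-word document
print(estimate_tokens(report))  # 4000
# The limit and the reserved output size below are hypothetical values.
print(fits_in_context(report, reserved_output_tokens=1000, context_limit=8000))  # True
```

Taking the larger of the two guesses errs on the side of caution, which is usually the safer choice when deciding whether a long document needs to be split.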

LLMs are powerful, but they are not databases, calculators, or infallible experts. Microsoft's prompt-engineering explanation reminds learners that generative language models predict the next likely words based on prior text. OpenAI explicitly says outputs are non-deterministic, meaning the same prompt can vary across runs or model versions. Current model pages also publish knowledge cutoffs, which is one reason recent facts may require grounding or search. On truthfulness benchmarks, models can still fail badly: the original TruthfulQA paper introduced 817 questions across 38 categories and found that models often reproduced common falsehoods.

Two further limitations matter greatly in practice. First, hallucinations remain possible. Anthropic's glossary defines an honest AI partly in terms of not hallucinating or confabulating, and its hallucination guide recommends explicitly allowing uncertainty and grounding answers in quoted evidence. Second, prompt injection is a real safety risk when a model reads untrusted content. OpenAI defines prompt injection as malicious text or data that tries to override instructions, while Google analogises it to SQL injection in its safety guidance.

For non-technical learners, evaluation metrics are best understood as different lenses for checking performance. The MMLU benchmark measures broad academic and professional question answering across 57 tasks. HumanEval measures functional correctness for code generation. TruthfulQA measures truthfulness on questions designed around common misconceptions. HELM is broader: it argues for multi-metric evaluation and reports metrics such as accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency across many scenarios. These benchmarks are helpful background, but they do not replace testing a model on your own real tasks.

Plain-language explanation

Think of an LLM as a very advanced prediction engine that has read a huge amount of language and then been further trained to follow instructions. It does not "know" facts in the same way a search engine or database stores them. Instead, it generates the response that seems most likely and useful from the prompt and the context it has been given. That is why prompts matter, and that is why evidence, files, or search tools matter when accuracy matters.

A token is a small piece of text. A context window is the amount of text and other input the model can consider at the same time. A knowledge cutoff is the latest period of training data that the provider says the model knows reliably. A hallucination is an answer that sounds plausible but is unsupported or wrong. A benchmark is a standard test, but your real workflow still needs its own tests.

Step-by-step example

Goal: show why a stronger prompt is easier for a model to answer well.

Weak version

Explain machine learning.

Better version

Explain how machine learning works using a simple everyday analogy.

Requirements:
- Keep the explanation under 120 words.
- Avoid technical jargon.
- Make it understandable for someone who has never studied computer science.

Best version for a non-technical audience

Explain how machine learning works using a simple analogy about learning a skill.

Requirements:
- Use an analogy about cooking, sport, or music.
- Keep it under 100 words.
- Avoid technical terms.
- Write in 3 short paragraphs:
  1) the analogy,
  2) how the analogy maps to machine learning,
  3) one sentence summarising the core idea.
- If any part would need technical jargon, replace it with plain English.

This progression mirrors OpenAI Academy's published examples: clearer task, clearer audience, clearer format, and tighter constraints usually produce better responses.

Practical exercises

Rewrite the prompt "Tell me about AI" so it is suitable for a Year 9 pupil.
Rewrite the prompt "Summarise this report" so it is suitable for a chief executive who has two minutes to read it.
Write one prompt that asks for a response in simple prose and one that asks for a table, using the same source text.

Formative quiz

Questions

What is a token?
What does "context window" mean in practice?
Why can a well-aligned model still make factual errors?
Name one benchmark and what it is meant to measure.

Answers

A token is a chunk of text used internally by the model; it can be smaller than a word.
The context window is the amount of information the model can consider in one request; Google compares it to short-term memory.
Because the model predicts likely outputs from training and prompting, is non-deterministic, may have a knowledge cutoff, and can hallucinate or reproduce falsehoods.
Examples: MMLU for broad multitask question answering; HumanEval for code correctness; TruthfulQA for truthfulness; HELM for multi-metric, multi-scenario evaluation.

Module 2: Writing your first reliable prompt

Lesson content

OpenAI Academy's updated prompt guidance offers a beginner-friendly structure with three steps: outline the task, give helpful context, and describe your ideal output. Microsoft's Copilot Studio guidance says a prompt generally has two parts: instruction and context. Together, those sources support a simple starter model for non-technical users: tell the assistant what you want, give relevant background, and describe the form of the answer.

OpenAI also says there is no single perfect prompt and that iteration is part of effective use. Anthropic's overview similarly assumes prompt engineering starts only after you have defined success criteria and have a first draft you want to improve. Google's prompting guide explicitly calls prompt engineering iterative. The practical implication is simple: do not worship the first answer; treat prompting as editing.

A good first-draft prompt for non-technical use usually answers five questions: What is the task? Who is the audience? What context should the model use? What constraints matter? What should the output look like?

Plain-language explanation

A bad prompt leaves the model guessing. A good prompt removes guesswork. If you say only "Summarise this", the model has to guess the audience, the length, the tone, and what to keep or omit. If you say "Summarise this for a senior manager in five bullets, focusing on risks, deadlines, and decisions", the model has a much clearer target. That does not guarantee perfection, but it usually improves relevance immediately.

Step-by-step example

Scenario: you have rough meeting notes and want a useful summary for your manager.

Weak prompt

Summarise these notes.

Stronger prompt

Summarise these meeting notes for my manager.

Needs:
- 5 bullet points maximum
- focus on decisions, deadlines, and risks
- plain professional English
- do not include chit-chat

Strongest first draft

You are helping me prepare an update for my manager.

Task:
Summarise the meeting notes below.

Audience:
A busy manager who was not in the meeting.

Output:
- 5 bullet points maximum
- each bullet starts with one of: Decision, Deadline, Risk, or Open question
- end with one sentence: "What needs attention next"

Constraints:
- use only information from the notes
- if the notes do not support a conclusion, write "Not clear from the notes"

This progression adds role, task, audience, output contract, and fallback behaviour.
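Teams that reuse the same prompt shape often keep it as a small template. The sketch below is one possible way to do that, assuming the five questions from this module as section names; `build_prompt` and the sample context line are hypothetical, and no coding is needed to apply the same idea by hand.

```python
def build_prompt(task, audience, context, constraints, output_format):
    """Assemble the five answers into one labelled prompt, refusing to skip any of them."""
    sections = [
        ("Task", task),
        ("Audience", audience),
        ("Context", context),
        ("Constraints", "\n".join(f"- {c}" for c in constraints)),
        ("Output", "\n".join(f"- {o}" for o in output_format)),
    ]
    # A blank section means the model would have to guess, so stop early.
    missing = [name for name, value in sections if not value.strip()]
    if missing:
        raise ValueError(f"Prompt is missing: {', '.join(missing)}")
    return "\n\n".join(f"{name}:\n{value}" for name, value in sections)


prompt = build_prompt(
    task="Summarise the meeting notes below.",
    audience="A busy manager who was not in the meeting.",
    context="Rough notes from a project meeting, pasted below.",  # placeholder text
    constraints=[
        "use only information from the notes",
        'if the notes do not support a conclusion, write "Not clear from the notes"',
    ],
    output_format=["5 bullet points maximum"],
)
print(prompt)
```

The check for missing sections is the useful part: it turns "did I remember the audience?" into something that cannot be forgotten.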

Practical exercises

Turn "Write an email" into a prompt for a polite customer apology email.
Turn "Research this topic" into a prompt that specifies the audience, length, and citations.
Turn "Make this shorter" into a prompt that specifies the maximum word count and what must remain.

Formative quiz

Questions

What are the two core prompt parts in Microsoft Copilot Studio's guidance?
What are OpenAI Academy's three basic steps for a good prompt?
Why is a fallback phrase such as "Not clear from the notes" useful?
Why should you expect to revise a prompt?

Answers

Instruction and context.
Outline the task, give helpful context, and describe your ideal output.
Because explicit fallback behaviour reduces unsupported guessing and helps the model handle missing information more honestly.
Because official guidance from OpenAI, Anthropic, and Google treats prompt engineering as iterative rather than one-shot.

Module 3: Controlling output with role, context, constraints, and format

Lesson content

Once learners can write a strong first draft, the next step is learning to shape output intentionally. OpenAI recommends putting overall tone or role guidance in the system message and task-specific details or examples in user messages. Microsoft's current system-message guidance says system messages help steer behaviour, tone, and output format, and can define role, boundaries, format, and safety constraints. Anthropic similarly recommends giving Claude a role and separating instructions, context, and inputs with XML tags.

Structure matters because it reduces ambiguity. Microsoft's prompt-engineering guidance recommends clear syntax and separators to communicate intent and make outputs easier to parse. Google's prompting page recommends placing critical instructions and format requirements at the start of the system instruction or user prompt. Anthropic says XML tags reduce misinterpretation when prompts mix instructions, context, examples, and variable inputs.

Output formatting also deserves explicit control. Google's prompt guide notes that you can directly specify response format, such as a table, bullet list, or paragraph, and recommends structured output features for more complex schemas. OpenAI's docs likewise support structured JSON outputs, while Anthropic recommends precisely defining the desired format using JSON, XML, or custom templates when consistency matters.

Plain-language explanation

A good prompt does not only say what you want. It also says how you want it returned. Think of it like commissioning work from a colleague. If you need a chart-ready table, do not ask for "thoughts". If you need a formal update, do not ask for "a quick write-up". The more the model has to guess about role, tone, or layout, the less dependable the result becomes.

Step-by-step example

Scenario: you want to turn messy customer feedback into something operational.

Role:
You are a customer-insights assistant for a retail company.

Task:
Read the feedback comments and identify the main issues.

Context:
The comments come from customers who ordered online in the last 30 days.

Output format:
Return a table with these columns:
- issue category
- short evidence quote
- severity (low, medium, high)
- recommended action owner

Constraints:
- Use only information in the comments.
- If severity is uncertain, mark it as "unclear".
- Keep evidence quotes short.
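The same prompt can also be laid out with XML-style tags, as the Anthropic guidance mentioned in this module suggests. The following is a minimal sketch under stated assumptions: the tag names and the `tagged_prompt` helper are illustrative choices, not a required vocabulary.

```python
from xml.sax.saxutils import escape


def tagged_prompt(role: str, task: str, context: str, comments: list[str]) -> str:
    """Wrap each part of the prompt in its own tag so instructions and data stay separate."""
    # Escape <, > and & so a customer comment cannot open or close a tag by accident.
    comment_lines = "\n".join(f"<comment>{escape(c)}</comment>" for c in comments)
    return "\n".join([
        f"<role>{role}</role>",
        f"<task>{task}</task>",
        f"<context>{context}</context>",
        f"<comments>\n{comment_lines}\n</comments>",
    ])


print(tagged_prompt(
    role="You are a customer-insights assistant for a retail company.",
    task="Read the feedback comments and identify the main issues.",
    context="The comments come from customers who ordered online in the last 30 days.",
    comments=["Parcel arrived late & the box was damaged."],  # made-up sample comment
))
```

Keeping the customer comments inside their own tag makes it obvious which text is data to analyse and which text is instruction.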

Practical exercises

Rewrite a loose brainstorming prompt into one that returns a decision table.
Rewrite an email-drafting prompt so it produces three tonal variants: formal, friendly, and neutral.
Take a paragraph prompt and reformat it using separate headings for Role, Task, Context, Constraints, and Output.

Formative quiz

Questions

What is the main purpose of a system message?
Why do separators or tags help?
When should you ask for a table or JSON explicitly?
Do system messages guarantee perfect compliance?

Answers

To steer the assistant's behaviour, tone, scope, and output format at the highest level.
Because they make the prompt structure clearer and reduce misinterpretation.
When output needs to be parsed, compared, reused, or handed to another tool or workflow.
No. System messages influence behaviour but do not guarantee compliance, so they still need testing and iteration.

Module 4: Improving consistency with examples, schemas, and chained prompts

Lesson content

A first strong prompt often gets you from poor results to acceptable results. To get from acceptable to consistent, the highest-value techniques are usually examples, schemas, and decomposition. OpenAI defines few-shot learning as steering a model by including a handful of input/output examples rather than fine-tuning. Anthropic says examples are one of the most reliable ways to steer output format, tone, and structure and recommends 3 to 5 relevant, diverse, structured examples. Google's guide says few-shot prompts help regulate format, phrasing, scope, and general response patterning.

Examples work best when they are consistent. Google explicitly warns that the structure and formatting of few-shot examples should remain consistent, because one main purpose of examples is to teach the model the desired response format. Anthropic similarly recommends wrapping examples in explicit tags so the model can distinguish them from instructions.
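One way to keep example formatting consistent is to generate every example from the same template. The sketch below uses the customer-message labels from this module's worked example; the `few_shot_prompt` helper and the new test message are hypothetical.

```python
LABELS = ["Delivery problem", "Refund request", "Product quality", "Other"]

EXAMPLES = [
    ("My parcel still hasn't arrived and the tracking has not moved in days.", "Delivery problem"),
    ("The shoes arrived but the stitching opened on day one.", "Product quality"),
    ("I returned this item last week and still have no refund.", "Refund request"),
]


def few_shot_prompt(new_message: str) -> str:
    """Build a prompt in which every example follows the same Message/Label layout."""
    for _, label in EXAMPLES:
        if label not in LABELS:
            raise ValueError(f"Example uses an unknown label: {label}")
    lines = ["Classify each message into one of these labels:"]
    lines += [f"- {label}" for label in LABELS]
    lines += ["", "Examples:"]
    for message, label in EXAMPLES:
        lines += [f"Message: {message}", f"Label: {label}", ""]
    # The new message ends with an empty label for the model to fill in.
    lines += [f"Message: {new_message}", "Label:"]
    return "\n".join(lines)


print(few_shot_prompt("The blender stopped working after two uses."))
```

Because the new message is written in exactly the same layout as the examples, the model is nudged to reply with a label and nothing else.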

For complex tasks, a single mega-prompt is often worse than a series of smaller prompts. OpenAI Academy recommends breaking big tasks into smaller steps. Google formalises this as chain prompts, where the output of one prompt becomes the input to the next. Microsoft's guidance similarly notes that breaking tasks into stages can significantly improve results on large texts with many factual claims.

When the output needs to be dependable, add a schema. Google recommends using structured output features for complex JSON schemas, and Anthropic recommends defining the output format precisely with JSON, XML, or templates. OpenAI's structured outputs feature likewise supports schema-constrained output generation.

Plain-language explanation

If you want the model to copy a pattern, show the pattern. If you want it to do several jobs, split the jobs. If you want easy checking, force a structure. Those three ideas solve a large share of everyday prompting problems.

Step-by-step example

Scenario: classify customer messages and then draft a reply plan.

Step one: teach the categories with examples

Classify each message into one of these labels:
- Delivery problem
- Refund request
- Product quality
- Other

Examples:
Message: My parcel still hasn't arrived and the tracking has not moved in days.
Label: Delivery problem

Message: The shoes arrived but the stitching opened on day one.
Label: Product quality

Message: I returned this item last week and still have no refund.
Label: Refund request

Step two: force structured output

Return JSON with these keys:
- label
- short_reason
- urgency

Step three: chain the next task

Using the JSON classification above, draft a two-sentence response plan for the support agent.
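Steps two and three are easier to trust when the JSON is checked before it is passed on. The following is a minimal sketch, assuming the three keys and four labels from this example; the helper names and the sample reply are hypothetical, and the allowed values for urgency are left unchecked because the example does not define them.

```python
import json

ALLOWED_LABELS = {"Delivery problem", "Refund request", "Product quality", "Other"}
REQUIRED_KEYS = {"label", "short_reason", "urgency"}


def parse_classification(reply: str) -> dict:
    """Parse the model's reply and reject it if the agreed structure is not respected."""
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    if data["label"] not in ALLOWED_LABELS:
        raise ValueError(f"Unknown label: {data['label']}")
    return data


def next_prompt(classification: dict) -> str:
    """Chain the checked classification into the follow-up prompt."""
    return (
        "Using the JSON classification below, draft a two-sentence "
        "response plan for the support agent.\n\n"
        + json.dumps(classification, indent=2)
    )


# Hypothetical model reply, used here only to show the flow.
reply = '{"label": "Refund request", "short_reason": "Refund not received", "urgency": "high"}'
print(next_prompt(parse_classification(reply)))
```

Checking between steps means a malformed first answer stops the chain instead of quietly producing a confident but wrong response plan.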

Practical exercises

Create three few-shot examples for classifying email subject lines.
Convert an open-ended summarisation prompt into a JSON extraction prompt.
Split a complicated task into three chained prompts: extract, organise, rewrite.

Formative quiz

Questions

What is few-shot prompting?
Why should example formatting stay consistent?
What is prompt chaining?
When is a schema especially useful?

Answers

It is steering the model by including a small number of input/output examples in the prompt.
Because the examples also teach the response format, not just the task.
It is breaking a complex task into multiple sequential prompts so each step is simpler and easier to inspect.
When the answer must be stable, parseable, comparable, or ready for downstream use.

Module 5: Grounding answers and reducing hallucinations

Lesson content

A well-written prompt is not enough when the answer depends on recent information, internal documents, or exact evidence. Google's grounding documentation says that connecting Gemini to Google Search provides access to real-time web content, can reduce hallucinations by basing responses on real-world information, and can provide citations beyond the model's knowledge cutoff. Anthropic's web-search tool documentation makes the same basic point for Claude: web access gives the model up-to-date information beyond its cutoff and returns cited sources. OpenAI's ChatGPT Search and Deep Research documentation likewise emphasises current web information, citations, and documented reports.

For private or domain-specific information, the key idea is retrieval-augmented generation or related grounded workflows. The original RAG paper describes generation that combines a pretrained model with retrieved external memory. Anthropic's glossary defines RAG as combining retrieval with generation to improve accuracy, relevance, and grounding in evidence. Google's File Search tool says the Gemini API supports RAG by importing, chunking, indexing, and retrieving relevant user data. OpenAI's retrieval guide describes semantic search over your own data and notes that it is especially powerful when combined with models for synthesis.

Hallucination reduction is partly a prompt-design problem and partly a system-design problem. Anthropic recommends explicitly allowing the model to say "I don't know" and, for long factual tasks, asking it to extract direct quotes first before reasoning over them. OpenAI's optimisation guide recommends starting with prompting, then adding retrieval, then fine-tuning or fact-checking steps if needed. Google's safety guidance explicitly recommends grounding to reduce factual inaccuracies or hallucinations.

Grounding also creates new security concerns. OpenAI warns that prompt injections occur when untrusted content tries to override instructions, potentially leading to misaligned actions or data leakage. Google's safety guidance compares prompt injection to SQL injection, and OpenAI's safety best practices recommend constraining user input and relying where possible on narrower, validated inputs. Anthropic likewise documents jailbreaks and prompt injections as attempts to make the model ignore guidelines or your instructions.

The grounding workflow

Start with the user question.
Select trusted sources (your files, a trusted web search, or your retrieval system).
Retrieve the relevant files, quotes, or web results.
Ask the model to answer only from retrieved evidence.
Return the answer with citations or marked uncertainty when no evidence supports it.
Have a human check the output for sensitive or high-stakes use.
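The workflow above can be sketched in a few lines. This is a toy illustration under stated assumptions: retrieval here is crude word overlap, whereas real systems typically use semantic search, and the function names, threshold, and sample passages are all hypothetical.

```python
import re


def word_set(text: str) -> set[str]:
    """Lower-case words and numbers in the text, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], min_overlap: int = 2) -> list[str]:
    """Keep passages that share at least min_overlap words with the question, best first."""
    q = word_set(question)
    scored = [(len(q & word_set(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored if score >= min_overlap]


def grounded_prompt(question: str, passages: list[str]) -> str:
    """Build an evidence-only prompt, or fall back when nothing relevant was retrieved."""
    evidence = retrieve(question, passages)
    if not evidence:
        # No supporting evidence: do not ask the model to guess.
        return "No supporting evidence found. Escalate to a human reviewer."
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(evidence, start=1))
    return (
        "Answer only from the sources below and cite them by number.\n"
        'If they do not answer the question, say "The sources do not say."\n\n'
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


# Made-up policy passages for illustration only.
passages = [
    "Premium-economy flights are allowed only for trips longer than 6 hours.",
    "Hotel bookings must be made through the approved portal.",
]
print(grounded_prompt("Are premium-economy flights allowed for trips under 6 hours?", passages))
```

The fallback branch matters as much as the retrieval: when no passage is relevant, the question goes to a person rather than to the model's memory.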

Plain-language explanation

If the answer must be exact, give the model the source. If the topic is recent, let the model search. If the source does not contain the answer, make the model say so. This is often more important than clever wording.

Step-by-step example

Scenario: answer a question about an internal travel policy document.

You are answering questions about our travel policy.

Source rule:
Use only the uploaded policy document.

Method:
1. First extract the exact sentence or sentences that answer the question.
2. Then answer in plain English.
3. If the document does not answer the question, say "The document does not provide this information."

Question:
Are premium-economy flights allowed for trips under 6 hours?
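Asking for exact sentences first has a practical benefit: the quotes can be checked against the document. The sketch below shows one simple way to do that, assuming the quotes have already been copied out of the model's answer; the helper names and sample policy text are hypothetical.

```python
def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not cause false alarms."""
    return " ".join(text.lower().split())


def unsupported_quotes(quotes: list[str], document: str) -> list[str]:
    """Return the quotes that cannot be found word for word in the document."""
    doc = normalise(document)
    return [q for q in quotes if normalise(q) not in doc]


# Made-up policy text and model quotes for illustration only.
policy = """Premium-economy flights are allowed only for
trips longer than 6 hours. All bookings need manager approval."""

quotes = [
    "Premium-economy flights are allowed only for trips longer than 6 hours.",
    "Premium-economy is allowed on all international trips.",
]
print(unsupported_quotes(quotes, policy))
# ['Premium-economy is allowed on all international trips.']
```

Any quote returned by the check is a sign that the answer should be rejected or re-run before anyone relies on it.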

Practical exercises

Rewrite a general knowledge prompt so it requires search grounding.
Rewrite a document Q&A prompt so it requires direct quotes first.
Identify one possible prompt injection risk in a workflow where a model reads user-uploaded documents from unknown sources.

Formative quiz

Questions

What is the main purpose of grounding?
What is one easy way to reduce hallucinations in a factual task?
What is RAG in simple terms?
What is prompt injection?

Answers

To improve factual accuracy and traceability by basing the answer on external evidence rather than on model memory alone.
Tell the model it may say "I don't know" or "the source does not say", and ask it to use direct quotes or cited evidence.
RAG is a workflow that retrieves relevant external information and gives it to the model so generation is more accurate and evidence-based.
Prompt injection is malicious text or data that tries to override the model's instructions or safety rules.

Module 6: Evaluating prompts and building repeatable workflows

Lesson content

Prompt engineering becomes professional only when it is tied to evaluation. OpenAI defines evals as structured tests for measuring model performance and says they are central to improving accuracy, performance, and reliability despite the non-deterministic nature of AI systems. Anthropic says building a successful LLM application starts by defining clear success criteria and then designing evaluations to measure performance against them. Microsoft's Foundry documentation similarly describes evaluations as a way to measure performance, quality, and safety before deployment and in production.

OpenAI's current evaluation best practices recommend a concrete workflow: define the objective, collect evaluation data, define metrics, run comparisons, and iterate. The same guidance notes that LLMs are often better at discriminating between options than at open-ended self-judgement, so pairwise comparisons, classification, or scoring against explicit criteria can be more reliable than vague free-form judging. Anthropic's guidance likewise treats measurability as part of good success criteria.

Current prompt tooling now supports a disciplined workflow for non-technical users. OpenAI's Playground documentation describes project-level prompts, version history with rollback, variables, side-by-side comparison, built-in eval linkage, and an Optimize tool. Anthropic's Evaluation Tool supports generated test cases, CSV import, side-by-side comparisons, 5-point quality grading, and prompt versioning. Google AI Studio allows fast prompt experimentation and has a large public prompt gallery. Microsoft Copilot Studio supports custom prompts for summarisation, translation, extraction, and task identification.

Evaluation also helps answer a crucial analytical question: is prompt engineering enough? Anthropic explicitly says not every failing evaluation is best solved by prompt engineering alone and gives latency and cost as examples of problems that may be better solved by choosing a different model. OpenAI's optimisation guide recommends starting with prompting, then adding static examples, then retrieval, then fine-tuning or fact-checking if the task still needs improvement.

Plain-language explanation

A prompt is only "good" if it works on more than one example. If you test it only once, you are admiring a draft. If you test it on ten realistic cases and compare the outputs against simple criteria, you are doing prompt engineering properly.

Step-by-step example

Scenario: evaluate an email-summarisation prompt.

Success criteria

It identifies the main action item.
It captures any deadline.
It does not invent facts.
It stays under 80 words.
Human reviewers rate it at least 4 out of 5 for usefulness.
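
Two of these criteria, the word limit and the deadline, can be checked automatically. The others need human judgement. The following is a minimal sketch, assuming the summary is a plain string and that you already know the deadline stated in the original email. The function name `check_summary` and the sample summary are hypothetical.

```python
from typing import Optional


def check_summary(summary: str, expected_deadline: Optional[str], max_words: int = 80) -> dict:
    """Check the two criteria that need no human judgement."""
    text = summary.lower()
    return {
        "under_word_limit": len(summary.split()) < max_words,
        # If the email has no deadline, there is nothing to capture, so this passes.
        "captures_deadline": expected_deadline is None or expected_deadline.lower() in text,
    }


# Hypothetical summary, for illustration only.
summary = "Finance needs the signed budget form returned by 14 March."
print(check_summary(summary, expected_deadline="14 March"))
# {'under_word_limit': True, 'captures_deadline': True}
print(check_summary(summary, expected_deadline="21 March"))
# {'under_word_limit': True, 'captures_deadline': False}
```

Whether the summary identifies the main action item, avoids invented facts, and earns a usefulness rating of 4 or more still has to be judged by a reviewer.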

Test set

Create ten realistic emails that together cover at least these six types:

ordinary request
urgent complaint
ambiguous request
message with no deadline
message with missing information
one adversarial or messy example

Prompt revision loop

Run the prompt on all ten emails.
Mark pass or fail against each criterion.
Inspect failure patterns.
Revise the prompt.
Re-run the same test set.
Compare versions side by side.
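
The loop above can be tracked in a spreadsheet, or sketched in a few lines of Python. The example below is one possible way to record pass or fail marks and compare two prompt versions, not a prescribed method. It assumes a reviewer has already marked each output by hand; the criterion names, the `mark` helper, and the marks themselves are hypothetical, and only four test emails are shown to keep it short.

```python
CRITERIA = ["main_action", "deadline", "no_invented_facts", "under_80_words"]


def mark(*passed: bool) -> dict:
    """Pair one email's pass/fail marks with the criteria names, in order."""
    if len(passed) != len(CRITERIA):
        raise ValueError("one mark per criterion is required")
    return dict(zip(CRITERIA, passed))


def pass_rates(results: list) -> dict:
    """Share of test emails that passed each criterion."""
    return {c: sum(r[c] for r in results) / len(results) for c in CRITERIA}


def failing_cases(results: list, criterion: str) -> list:
    """1-based numbers of the test emails that failed a criterion."""
    return [i for i, r in enumerate(results, start=1) if not r[criterion]]


# Hypothetical marks for the same four test emails under two prompt versions.
version_1 = [mark(True, True, True, True), mark(True, False, True, True),
             mark(True, False, True, False), mark(False, True, True, True)]
version_2 = [mark(True, True, True, True), mark(True, True, True, True),
             mark(True, True, True, False), mark(True, True, True, True)]

for c in CRITERIA:
    print(f"{c}: v1 {pass_rates(version_1)[c]:.0%} -> v2 {pass_rates(version_2)[c]:.0%}")
print("v1 deadline failures:", failing_cases(version_1, "deadline"))  # [2, 3]
```

Listing which emails failed a criterion, rather than only the overall rate, is what makes the "inspect failure patterns" step possible. The human usefulness rating is a 1 to 5 score rather than a pass or fail mark, so it would be tracked separately.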

Practical exercises

Create a five-case evaluation set for a meeting-summary prompt.
Create a pass/fail checklist for a policy Q&A prompt.
Decide whether a poor result should be solved by rewriting the prompt, adding source material, or switching models, and explain why.

Formative quiz

Questions

What is an eval?
Why are pairwise comparisons or pass/fail criteria often better than vague free-form judging?
Name one feature in OpenAI Playground or Anthropic Console that helps prompt iteration.
If a prompt is accurate but too slow and expensive, what might help more than prompt changes?

Answers

An eval is a structured test used to measure the quality of an LLM output or system against defined criteria.
LLMs are often better at discriminating between options than at open-ended judging, so comparisons and explicit criteria tend to give more reliable results.
Examples include version history, rollback, variables, linked evals, Optimize, generated test cases, side-by-side comparison, or quality grading.
Selecting a different model may help more, especially for latency or cost.

Final practice exam

Instructions: answer all sections. Total marks: 100.

Section A

Question: Which prompt is strongest for a non-technical learner who needs a board-ready summary of a policy memo?

A. "Summarise this."
B. "Summarise this memo for me."
C. "Summarise this policy memo for a board member who has two minutes to read it. Return five bullets covering decision needed, major risk, important deadline, financial effect, and one open question. Use only the memo text and say 'Not stated in memo' if evidence is missing."
D. "Give your thoughts on this memo."

Model answer: C. It specifies task, audience, output format, and fallback behaviour.

Section B

Question: Which statement correctly explains the difference between a knowledge cutoff and a context window?

A. "A knowledge cutoff is what the model remembers between requests; a context window is the maximum file size you can upload."
B. "A knowledge cutoff is the latest period of training data the provider says the model knows reliably; a context window is the amount of input the model can consider in a single request, comparable to short-term memory."
C. "A knowledge cutoff is the deadline by which the model must respond; a context window is the browser pane the visitor types into."
D. "A knowledge cutoff is a security feature that blocks recent topics; a context window is how many users can talk to the model at once."

Model answer: B. A knowledge cutoff is about the training data; a context window is about how much input fits into one request. The other options confuse training data with runtime memory, response timeouts, or unrelated platform concepts.

Section C

Question: Which rewrite of "Help me with this report" is the strongest business prompt?

A. "Read my report and tell me what you think."
B. "Summarise my report in a few bullets."
C. "You are helping me prepare a report for senior leadership. Review the attached report draft and produce an executive summary: 6 bullets maximum, one each for main finding, business risk, opportunity, deadline, decision needed, and data gap; end with a one-sentence recommendation; use only the report; label unsupported points 'Not supported by draft'; keep the tone formal and concise."
D. "Make this report better."

Model answer: C. It adds role, task, audience, output contract, and uncertainty handling. The other options omit one or more of those, leaving the model to guess.

Section D

Question: When is few-shot prompting clearly stronger than zero-shot prompting?

A. "When the task is purely creative, such as writing a short story from scratch with no constraints."
B. "When you want the model to behave unpredictably across requests."
C. "When the model has no internet access."
D. "When the task is classification or extraction and you need consistent output that follows a fixed pattern, because a few input-output examples teach the pattern from real cases."

Model answer: D. Few-shot examples are most useful when the task has a repeatable structure. Customer-message labelling, structured extraction, and tone matching are strong candidates. The other options either describe the wrong kind of task or misread what few-shot does.

Section E

Question: Which two-step prompt chain sensibly turns messy customer comments into an action plan?

A. "Step one: predict next quarter's revenue. Step two: write a press release."
B. "Step one: ask for one paragraph summarising everything. Step two: translate it to French."
C. "Step one: extract the issues into a table with columns for issue, evidence quote, and severity. Step two: using that table, draft three recommended actions ordered by urgency."
D. "Step one: rank customers by how angry they sound. Step two: thank them all individually."

Model answer: C. The chain separates extraction from recommendation, so each step is easier to inspect, fix, and evaluate. The other options either change the goal or skip the extraction stage entirely.

Section F

Question: You need an answer about a policy that changed last week. What should you do, and why?

A. "Trust the model's memory, because LLMs continually retrain themselves on new policy documents."
B. "Use a grounded workflow such as current web search or a trusted uploaded document, because the answer depends on recent or changing information that may sit outside the model's knowledge cutoff."
C. "Ask the model to guess and add a disclaimer in the response."
D. "Ask the same question many times and pick the most common answer."

Model answer: B. Recent information is exactly the case grounding is built for. The other options rely on memory the model may not have, or on noise rather than evidence.

Section G

Question: Which option correctly defines prompt injection and names a valid mitigation?

A. "Prompt injection is when typing too quickly confuses the model; the mitigation is to slow down."
B. "Prompt injection is when the model invents data; the mitigation is to delete the conversation history."
C. "Prompt injection is malicious text or data that tries to override or manipulate the model's instructions; one mitigation is to restrict the model to trusted sources and use narrower, validated inputs."
D. "Prompt injection is a paid feature for faster responses; the mitigation is to upgrade your plan."

Model answer: C. Prompt injection is an instruction-override attack carried inside otherwise innocent-looking content. Restricting sources and validating inputs is one well-known defence; treating untrusted content as data rather than commands is another.

Section H

Question: Which evaluation plan for a prompt that summarises inbound customer emails is strongest?

A. "Run the prompt once on a single email; ship it if the output looks fine."
B. "Score every email by hand and never compare prompt versions."
C. "Define success criteria (captures the main request, any deadline, and the right tone; avoids invented facts; stays under 80 words). Use 10 to 20 realistic emails covering normal, ambiguous, edge-case, and adversarial examples. Score pass/fail per criterion plus a human usefulness score from 1 to 5. Compare prompt versions side by side and keep the better one."
D. "Ask the model to grade its own outputs and trust the result without human review."

Model answer: C. Real evaluation needs explicit criteria, realistic test data, scoring you can compare across versions, and human judgment in the loop. The other options either skip evaluation entirely or rely on the model judging itself.

Grading rubric

Criterion	Excellent	Good	Adequate	Needs revision	Marks
Task definition	Clear, specific, audience-aware, realistic	Mostly clear with minor gaps	Understandable but broad	Vague or incomplete	15
Context use	Relevant context selected deliberately	Useful context present	Some relevant context	Missing or irrelevant context	15
Constraints and fallback	Strong output contract and uncertainty handling	Good constraints with minor omissions	Basic constraints only	No meaningful constraints	15
Use of techniques	Appropriate use of role, examples, schema, or chaining	Techniques mostly appropriate	Limited technique use	Techniques absent or misused	15
Grounding and safety	Strong evidence use and clear mitigation of hallucination/injection risks	Good grounding with minor gaps	Some grounding but weak checking	Ungrounded and unsafe	15
Evaluation thinking	Clear success criteria and realistic test cases	Mostly good evaluation logic	Basic evaluation idea only	No serious evaluation plan	15
Plain-language quality	Clear, concise, accessible	Mostly clear	Some jargon or clutter	Hard to follow	10

Suggested grade bands

Distinction: 85 to 100
Merit: 70 to 84
Pass: 50 to 69
Revise and resubmit: below 50
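
For instructors who record marks electronically, the rubric and grade bands reduce to simple arithmetic. The sketch below assumes one score per rubric criterion, capped at that criterion's maximum from the table above. The function name `grade_band` and the sample scores are hypothetical.

```python
# Maximum marks per criterion, copied from the grading rubric (they sum to 100).
RUBRIC_MAX = {
    "Task definition": 15,
    "Context use": 15,
    "Constraints and fallback": 15,
    "Use of techniques": 15,
    "Grounding and safety": 15,
    "Evaluation thinking": 15,
    "Plain-language quality": 10,
}


def grade_band(scores: dict) -> tuple:
    """Validate each score against the rubric, then return (total, band)."""
    for criterion, score in scores.items():
        if not 0 <= score <= RUBRIC_MAX[criterion]:
            raise ValueError(f"{criterion}: score {score} is outside 0 to {RUBRIC_MAX[criterion]}")
    total = sum(scores.values())
    if total >= 85:
        band = "Distinction"
    elif total >= 70:
        band = "Merit"
    elif total >= 50:
        band = "Pass"
    else:
        band = "Revise and resubmit"
    return total, band


# Hypothetical learner scores, for illustration only.
scores = {
    "Task definition": 13, "Context use": 12, "Constraints and fallback": 11,
    "Use of techniques": 12, "Grounding and safety": 10,
    "Evaluation thinking": 12, "Plain-language quality": 8,
}
print(grade_band(scores))  # (78, 'Merit')
```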

Accessibility and pedagogy

This course should be delivered and maintained with explicit accessibility features. W3C guidance says clear headings help users understand what information is present and how it is organised, and the WAI headings tutorial notes that headings support in-page navigation by browsers and assistive technologies. W3C also requires text alternatives for non-text content so information can be rendered in the sensory modality that fits the user. In practice, that means every diagram should have a text summary, every image should have equivalent alt text, and important meaning should never depend on colour alone.

The language of the course should remain plain and audience-focused. Digital.gov's plain-language guidance says clear, easy-to-understand content is critical for public understanding and should be written for its specific audience. That principle is directly relevant to non-technical prompting instruction: use short sentences, concrete examples, defined terms, and minimal jargon.

The overall pedagogy should also follow Universal Design for Learning. CAST describes the UDL Guidelines as a research-informed framework intended to ensure that learners can access and participate in meaningful, challenging opportunities, and the guidelines are designed for multiple means of engagement, representation, and action or expression. In practical course design terms, that supports combining short readings, diagrams, live demos, worked examples, written exercises, discussion prompts, downloadable checklists, and optional peer review.

For novice learners in particular, worked examples should come before independent problem-solving. Reviews of worked-example research describe them as expert solutions for learners to study and note that example-based instruction is especially valuable for novice learning. Frequent low-stakes quizzes should also be retained, because test-enhanced learning research shows that taking memory tests improves later retention, and the broader learning-techniques review rates practice testing highly across many contexts.

Delivery notes

Teaching note	Practical implementation
Keep each lesson chunk short	Use 10 to 15 minute teaching segments followed by one guided example.
Always show before independent practice	Demonstrate a weak prompt, then a stronger prompt, then explain the difference.
Build sceptical habits	Ask learners to identify missing context, unverifiable claims, and better fallback wording.
Use real workplace tasks	Prefer emails, meeting notes, policy questions, planning, and research tasks over abstract tasks.
Assess process as well as answer	Grade why learners chose a technique, not just what output they wrote.
Preserve accessibility	Offer text transcripts for demos, alt text for diagrams, and a printable checklist version of templates.

References and further reading

Official vendor and platform documentation

OpenAI: Prompt engineering, Prompting, Models, Reasoning models, Evaluation best practices, Working with evals, Optimizing LLM Accuracy, Key concepts, Prompting fundamentals, Prompt management in Playground, ChatGPT Search, Deep research in ChatGPT, ChatGPT capabilities and projects.
Anthropic: Models overview, Prompt engineering overview, Prompting best practices, Console prompting tools, Define success criteria and build evaluations, Using the Evaluation Tool, Reduce hallucinations, Mitigate jailbreaks and prompt injections, Web search tool, Glossary, Claude product page.
Google AI for Developers: Prompt design strategies, Models, Gemini 2.5 Pro, Gemini 3.5 Flash, Long context, Grounding with Google Search, File Search, Google AI Studio quickstart, Prompt gallery, Safety and factuality guidance, Gemini Apps Help.
Microsoft Learn: Create a prompt, Prompt engineering techniques, System message design, Overview of Microsoft 365 Copilot Chat, Microsoft 365 Copilot Chat documentation, Run evaluations from the Microsoft Foundry portal.

Foundational model, alignment, retrieval, and evaluation papers

Vaswani et al., Attention Is All You Need.
Brown et al., Language Models are Few-Shot Learners.
Ouyang et al., Training language models to follow instructions with human feedback.
Bai et al., Constitutional AI: Harmlessness from AI Feedback.
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Hendrycks et al., Measuring Massive Multitask Language Understanding.
Chen et al., Evaluating Large Language Models Trained on Code.
Lin, Hilton, and Evans, TruthfulQA: Measuring How Models Mimic Human Falsehoods.
Liang et al., Holistic Evaluation of Language Models.
Lin, ROUGE: A Package for Automatic Evaluation of Summaries.
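Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation.
Zhang et al., BERTScore: Evaluating Text Generation with BERT.
Liu et al., G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.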

Accessibility and pedagogy sources

W3C, Web Content Accessibility Guidelines 2.2; Understanding SC 2.4.6 Headings and Labels; Understanding SC 1.1.1 Non-text Content; WAI Headings Tutorial.
CAST, Universal Design for Learning Guidelines 3.0.
Digital.gov, Plain language guide series.
Roediger and Karpicke, Test-Enhanced Learning: Taking Memory Tests Improves Long-Term Retention.
Atkinson et al., Learning from Examples: Instructional Principles from the Worked Examples Research.
Dunlosky et al., Improving Students' Learning With Effective Learning Techniques.
