The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
ArXiv: 2406.06608
🎯 Pitch
This paper synthesizes prompt engineering through a PRISMA-guided systematic review that maps 58 text-based and 40 multimodal/multilingual prompting techniques into a taxonomy with a unified vocabulary. It pairs that map with a benchmark of prompting strategies and an empirical case study in a high-stakes domain (detecting suicidal “entrapment”), giving practitioners and researchers a common basis for best practices in a field whose inconsistent terminology and scattered methods have impeded reliable, secure, and effective AI deployment.
1. Executive Summary
This paper systematizes the rapidly expanding area of prompt engineering by (a) performing a PRISMA‑guided systematic literature review that yields a taxonomy of 58 text‑based prompting techniques (plus 40 multimodal/multilingual variants) and a consolidated vocabulary of 33 terms, and (b) validating practice through two empirical studies: a technique benchmark on MMLU and a real‑world prompt‑engineering case study for detecting suicidal “entrapment.” It matters because prompting is the primary interface to modern generative models; inconsistent terminology, scattered techniques, and unclear best practices hinder reliable deployment, evaluation, and safety.
2. Context and Motivation
- Problem/gap addressed
  - Prompt engineering has grown ad hoc with conflicting terminology and overlapping techniques, which makes it hard to know what works, when, and why (Section 1; Figure 1.3 for terminology; Figure 2.2 for technique map).
  - There is no consolidated, evidence‑based guide that spans text, multilingual, and multimodal prompting, while also covering safety (prompt hacking) and evaluation (Sections 3–5).
  - Practical guidance has lacked empirical case studies demonstrating how experts iterate prompts on real tasks (Section 6).
- Why this is important
  - Prompts are the operational handle on LLM behavior in consumer, enterprise, and research settings (Section 1). Better prompts consistently improve performance across tasks (Section 1 citing Wei et al., 2022b; Liu et al., 2023b).
  - Security and safety failures (e.g., prompt injection and jailbreaking) create real business risk (Section 5.1); poor calibration, bias, and ambiguity hurt reliability (Section 5.2).
- Prior approaches and where they fall short
  - Earlier surveys covered prompting broadly (e.g., including cloze or soft prompts) or specific subareas (reasoning, multimodal, etc.). They did not provide an up‑to‑date, PRISMA‑grounded synthesis focused on modern hard, prefix prompts, nor did they pair a field map with actionable empirical case studies (Section 7).
- Positioning
  - Scope is deliberately focused on standard, deployable prompting: hard (discrete) `prefix` prompts, not `cloze` or soft prompts, and task‑agnostic techniques (Section “Scope of Study”). The work unifies vocabulary, organizes techniques, quantifies usage, and demonstrates practice through benchmarking and a detailed real‑world prompt engineering process.
3. Technical Approach
This paper has three pillars: a systematic review to build a taxonomy, an empirical benchmark, and a real‑world case study.
- Systematic review (PRISMA pipeline)
  - Data sources: arXiv, Semantic Scholar, ACL; 44 prompting‑related keywords (Appendix A.4).
  - Process (Figure 2.1; Section 2.1.1):
    - Start with 4,247 unique records after deduplication.
    - Human review on 1,661 arXiv titles/abstracts with 92% inter‑annotator agreement; criteria focused on novel prompting techniques using hard, prefix prompts; fine‑tuning papers excluded.
    - LLM‑assisted screening of remaining records using a GPT‑4 prompt (Appendix A.5); validated at 89% precision and 75% recall (F1 = 81%).
    - Final corpus: “1,565 quantitative records included in analysis” (Figure 2.1).
  - Output:
    - A terminology chart capturing components of prompts and related artifacts (Figure 1.3).
    - A taxonomy of text‑based prompting techniques grouped into six families (Figure 2.2).
    - Extensions for multilingual and multimodal prompting (Figures 3.1 and 3.2).
    - A structured treatment of security, safety, and evaluation (Sections 4–5).
- Terminology and prompt components
  - A `prompt` is any input (text, image, audio, etc.) used to guide a generative model’s output; prompts often come from `prompt templates`, which are functions with variables that render into concrete prompts (Section 1.1; Figure 1.2).
  - Typical components (Section 1.2.1):
    - `Directive` (the task, e.g., “Classify…”).
    - `Exemplars` (a.k.a. “shots”).
    - `Output formatting` and `style` constraints.
    - `Role` (a persona to influence style or behavior).
    - `Additional Information` (domain facts, constraints).
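The idea of a prompt template as a function over variables can be sketched as follows. This is a minimal illustration, not the paper's code: the `PromptTemplate` class, its field names, and the `Input:`/`Label:` layout are all assumptions chosen to mirror the components listed above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PromptTemplate:
    """Hypothetical container for the prompt components named above."""

    directive: str                      # the task instruction
    role: str = ""                      # optional persona
    additional_info: str = ""           # optional domain facts or constraints
    output_format: str = ""             # optional formatting/style constraint
    exemplars: List[Tuple[str, str]] = field(default_factory=list)  # "shots"

    def render(self, text: str) -> str:
        """Fill in the variable `text` and return a concrete prompt string."""
        header = (self.role, self.additional_info, self.directive, self.output_format)
        parts = [p for p in header if p]  # skip components left empty
        parts += [f"Input: {x}\nLabel: {y}" for x, y in self.exemplars]
        parts.append(f"Input: {text}\nLabel:")  # the query the model completes
        return "\n\n".join(parts)


template = PromptTemplate(
    directive="Classify the review as positive or negative.",
    output_format="Answer with one word.",
    exemplars=[("Great film", "positive")],
)
print(template.render("Dull plot"))
```

With an empty `exemplars` list the same template yields a zero‑shot prompt; adding pairs turns it into a few‑shot prompt without changing the directive.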
- Taxonomy of text techniques (Figure 2.2; Sections 2.2.1–2.2.5)
  - `In‑Context Learning (ICL)`: learning from exemplars or instructions inside the prompt; includes `few‑shot` prompts and `zero‑shot` instruction prompts (Figures 2.4–2.6).
  - `Thought generation`: inducing explicit reasoning, e.g., `Chain‑of‑Thought (CoT)` and its zero‑shot, few‑shot, and table/analogical variants (Section 2.2.2; Figure 2.8).
  - `Decomposition`: dividing complex problems into sub‑problems (`Least‑to‑Most`, `Tree‑of‑Thought`, `Program‑of‑Thoughts`) (Section 2.2.3).
  - `Ensembling`: producing multiple answers using prompt variations or sampling, then aggregating them (`Self‑Consistency`, `DENSE`, `USP`) (Section 2.2.4).
  - `Self‑criticism`: generating, critiquing, and revising answers (`Self‑Refine`, `COVE`, `Self‑Verification`) (Section 2.2.5).
- Formalization and answer engineering
  - Prompts are treated as conditioning mechanisms on a language model `p_LM`, optionally via a template `T(x)` and few‑shot exemplars `X` (Appendix A.8, Eqs A.1–A.5).
  - Prompt optimization is maximizing a score function `S` over a dataset, sometimes jointly with an `answer extractor` `E` that parses LLM outputs into canonical answers (Appendix A.8, Eqs A.6–A.8; Section 2.5).
  - `Answer engineering` decisions: answer `shape` (token/span), `space` (allowed values), and `extractor` (regex, verbalizer, or a separate LLM) (Section 2.5; Figure 2.13).
- Empirical benchmark (Section 6.1)
  - Task: MMLU subset (2,800 questions, 20% of each category; the sensitive “human_sexuality” category is excluded).
  - Model: `gpt-3.5-turbo`.
  - Prompting conditions (Figure 6.2): `Zero‑Shot`; `Zero‑Shot CoT` with three thought inducers; `Zero‑Shot CoT` with `Self‑Consistency` (3 samples); `Few‑Shot`; `Few‑Shot CoT`; and `Few‑Shot CoT` with `Self‑Consistency`.
  - Two question formats (Figures 6.3 and 6.4).
  - Decoding: temperature 0.5 for `Self‑Consistency`, 0.0 otherwise (Section 6.1.3).
  - Answer parsing: pattern‑based extraction of choices; variations tested (Section 6.1.4).
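To make the roles of the answer extractor `E`, the score `S`, and `Self‑Consistency` concrete, here is a small sketch. It is not the benchmark's actual parsing code: the regular expression, the function names, and the stubbed model are assumptions for illustration, and a real extractor would need to handle many more output shapes.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical pattern for four-way multiple choice (A-D); the lookahead stops
# the letter from matching the first character of an ordinary word.
CHOICE_PATTERN = re.compile(
    r"answer\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])", re.IGNORECASE
)


def extract_choice(output: str) -> Optional[str]:
    """Answer extractor E: map free-form model text to a canonical choice."""
    match = CHOICE_PATTERN.search(output)
    return match.group(1).upper() if match else None


def self_consistency(
    sample: Callable[[str], str], prompt: str, n_samples: int = 3
) -> Optional[str]:
    """Sample several outputs and return the majority extracted answer."""
    answers = [extract_choice(sample(prompt)) for _ in range(n_samples)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


def accuracy(
    answer_fn: Callable[[str], Optional[str]], dataset: Iterable[Tuple[str, str]]
) -> float:
    """Score S: fraction of (question, gold answer) pairs answered correctly."""
    items = list(dataset)
    return sum(answer_fn(q) == gold for q, gold in items) / len(items)


# Stub standing in for an LLM sampled at temperature > 0 (canned outputs).
canned = iter(["The answer is (B).", "Answer: C", "I think the answer is B"])
print(self_consistency(lambda prompt: next(canned), "Q: ..."))  # B
```

Outputs that fail to parse are dropped from the vote rather than counted as wrong samples, which is one design choice among several; an evaluation could equally treat them as errors.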
- Real‑world case study: suicidal “entrapment” detection (Section 6.2)
  - Data: 221 Reddit r/SuicideWatch posts labeled by two trained coders for presence/absence of “entrapment” (Krippendorff’s α = 0.72) (Section 6.2.2).
    - `Entrapment` is defined as “a desire to escape from an unbearable situation, tied with the perception that all escape routes are blocked” with concrete phrasing cues (Figure 6.7).
  - Goal: prompt‑engineer a binary classifier using only prompting (no fine‑tuning).
  - Process (~20 hours, 47 steps) included:
    - Guardrail conflicts: initial models responded with crisis advice instead of labels; switching to `GPT‑4‑32K` allowed one‑word labels (Section 6.2.3.2).
    - Iterative technique exploration: zero‑shot with definition, few‑shot, `CoT`, `contrastive CoT`, targeted answer extraction, ensembling, and a new method, `Automatic Directed CoT (AutoDiCoT)` (Figures 6.12–6.16).
  - `AutoDiCoT` (Figure 6.12):
    - For each training item, elicit a reasoning chain `r_i` that either justifies a correct label or explains why the earlier label was wrong; store triplets `(q_i, r_i, a_i)`.
    - Use selected triplets as exemplars (including an “incorrect reasoning” one) to steer reasoning on new inputs (Figures 6.13, 6.16).
  - Observations:
    - Seemingly innocuous context duplication (pasting the same context email twice) improved performance, and de‑duplication hurt it (Section 6.2.3.3: “Full Context Only,” “De‑Duplicating Email”).
    - Over‑restricting to “explicit” entrapment raised precision but crashed recall, which is misaligned with the clinical goal of minimizing false negatives (Section 6.2.3.3).
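The `AutoDiCoT` loop described above can be sketched roughly as follows. This is a hedged reading of the summary, not the authors' implementation: `classify` and `explain` are hypothetical callables wrapping a model, and the follow‑up wording is illustrative rather than quoted from the paper.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Exemplar(NamedTuple):
    question: str   # q_i: the post text
    reasoning: str  # r_i: model-written reasoning chain
    answer: bool    # a_i: gold label (True = entrapment)


def build_autodicot_exemplars(
    dev_set: Iterable[Tuple[str, bool]],
    classify: Callable[[str], bool],      # hypothetical: model's label for a post
    explain: Callable[[str, str], str],   # hypothetical: model's reply to a follow-up
) -> List[Exemplar]:
    """Collect (q_i, r_i, a_i) triplets, steering reasoning toward the gold label."""
    exemplars = []
    for question, gold in dev_set:
        if classify(question) == gold:
            # Model was right: ask it to justify its own label.
            follow_up = "Why?"
        else:
            # Model was wrong: state the gold label and ask for the explanation.
            verdict = "is" if gold else "is not"
            follow_up = f"This {verdict} actually entrapment. Please explain why."
        exemplars.append(Exemplar(question, explain(question, follow_up), gold))
    return exemplars


def render_exemplar(ex: Exemplar) -> str:
    """Format one triplet as a few-shot CoT exemplar (layout is an assumption)."""
    label = "entrapment" if ex.answer else "not entrapment"
    return f"Post: {ex.question}\nReasoning: {ex.reasoning}\nLabel: {label}"
```

Selecting which triplets to keep (for example, the ten used in the `10‑Shot AutoDiCoT` prompt) is a separate step that this sketch leaves out.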
- Evaluation frameworks and safety (Sections 4.2 and 5)
  - LLM‑as‑evaluator designs: prompt roles, CoT, model‑generated guidelines, output formats (binary, Likert, JSON/XML), and frameworks like `LLM‑EVAL`, `G‑EVAL`, and `ChatEval` (Sections 4.2.1–4.2.3).
  - Security taxonomy: `prompt hacking` covers both `prompt injection` (overriding developer instructions) and `jailbreaking` (coaxing unsafe actions with no developer instruction present), plus risks (data leakage, package hallucination, chatbot liability) and defenses (detectors, guardrails) (Section 5.1; Figure 5.1).
4. Key Insights and Innovations
- A field‑wide, PRISMA‑grounded taxonomy and vocabulary
  - Novelty/significance: integrates 58 text‑based techniques into six coherent families (Figure 2.2) and provides a consistent terminology (Figure 1.3), reducing confusion across papers that use overlapping or conflicting names (Section 1.2).
  - Difference from prior work: earlier reviews were broader or less structured; here the focus on deployable hard, prefix prompts, together with a formal prompt definition and an answer engineering pipeline (Appendix A.8; Section 2.5), is practice‑oriented.
- A practical, formal view of `answer engineering`
  - What’s new: elevates output parsing to a first‑class design space (answer `shape`, `space`, and `extractor`; Section 2.5; Figure 2.13) and unifies it with prompt optimization via Eq. A.8 (Appendix A.8).
  - Why it matters: many failures in real systems come from brittle post‑processing rather than from the model itself; formalizing this step makes evaluations repeatable and prompts more robust.
- `Automatic Directed CoT (AutoDiCoT)` for reasoning control
  - What it is: a simple algorithm to build CoT exemplars that explicitly steer reasoning away from observed error modes by including “what not to do” alongside correct rationales (Figure 6.12; one‑shot contrastive example in Figure 6.13).
  - Impact: on the entrapment task, a `10‑Shot AutoDiCoT` prompt reached the best development F1 (0.53) with recall 0.86 and precision 0.38 (Section 6.2.3.3; Figure 6.16 and Figure 6.6). This demonstrates targeted reasoning control without fine‑tuning.
- Empirical clarity on technique performance trade‑offs
  - Benchmark result (MMLU; `gpt‑3.5‑turbo`): `Few‑Shot CoT` outperforms `Zero‑Shot` and `Zero‑Shot CoT`, while `Self‑Consistency` helps `Zero‑Shot CoT` but not `Few‑Shot CoT` in this setup (Section 6.1.5; Figure 6.1).
  - Main numbers (accuracy):
    > Zero‑Shot 0.627; Zero‑Shot CoT 0.547; Zero‑Shot CoT + Self‑Consistency 0.574; Few‑Shot 0.652; Few‑Shot CoT 0.692; Few‑Shot CoT + Self‑Consistency 0.691 (Figure 6.1).
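A quick way to read these numbers is as differences from the plain `Zero‑Shot` baseline. The snippet below only does that arithmetic on the accuracies quoted from Figure 6.1; the variable names are illustrative.

```python
# Accuracies as quoted above from Figure 6.1 (MMLU subset, gpt-3.5-turbo).
accuracy = {
    "Zero-Shot": 0.627,
    "Zero-Shot CoT": 0.547,
    "Zero-Shot CoT + Self-Consistency": 0.574,
    "Few-Shot": 0.652,
    "Few-Shot CoT": 0.692,
    "Few-Shot CoT + Self-Consistency": 0.691,
}

baseline = accuracy["Zero-Shot"]
# Difference from the Zero-Shot baseline, rounded to three decimals.
deltas = {name: round(score - baseline, 3) for name, score in accuracy.items()}

for name, delta in deltas.items():
    print(f"{name:<34} {delta:+.3f}")
```

Read this way, `Self‑Consistency` recovers about 0.027 of the 0.080 that `Zero‑Shot CoT` loses, so the combination still sits below plain `Zero‑Shot` in this setup.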
- A consolidated treatment of prompt security and alignment
  - Security: clear distinctions between `prompt injection` and `jailbreaking`, concrete risks (e.g., training‑data leakage, package hallucination), and layered defenses (detectors, guardrails) (Section 5.1; Figure 5.1).
  - Alignment: practical prompt‑level mitigations for prompt sensitivity, miscalibration, sycophancy, bias, and ambiguity (Section 5.2; Figure 5.2).
5. Experimental Analysis
- Evaluation methodology
  - Benchmark (Section 6.1):
    - Dataset: MMLU subset (2,800 items; 20% of each category).
    - Model: `gpt-3.5-turbo`.
    - Prompt settings: six prompting conditions, each with 2 question formats; temperature 0.5 for `Self‑Consistency`, else 0.0.
    - Parsing: pattern‑based extraction rules (Section 6.1.4).
  - Case Study (Section 6.2):
    - Dataset: 221 Reddit posts; 121 for development, 100 for test.
    - Metric: F1, plus precision and recall reported throughout.
    - Iterative exploration across ~20 prompting variants (Figures 6.5, 6.6).
- Main quantitative results
  - Benchmark headline (Figure 6.1):
    > Accuracy: `Few‑Shot CoT` 0.692 ≈ `Few‑Shot CoT + Self‑Consistency` 0.691, both better than `Few‑Shot` 0.652 and `Zero‑Shot` 0.627; `Zero‑Shot CoT` alone drops to 0.547; `Zero‑Shot CoT + Self‑Consistency` partially recovers to 0.574.
  - Case study progression (Figures 6.5–6.6):
    - Starting point: zero‑shot with definition (Section 6.2.3.3 “Zero‑Shot + Context”) achieved F1 0.40 with recall 1.00 and precision 0.25.
    - Best development result: `10‑Shot AutoDiCoT` (with duplicated context) reached F1 0.53, recall 0.86, precision 0.38 (Figure 6.16; Figure 6.6).
    - Sensitivity: removing duplicated context (“De‑Duplicating Email”) reduced F1 to 0.45 (recall 0.74; precision 0.33).
    - Test set via automated prompt optimization (`DSPy`, `BootstrapFewShotWithRandomSearch`): “F1 0.548 (precision 0.385, recall 0.952)” without using the email or the “explicit entrapment” constraint (Figure 6.19).
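As a sanity check, the reported F1 values can be recomputed from the reported precision and recall using the standard harmonic‑mean definition. The helper below is a generic sketch (the function and dictionary names are not from the paper), and small mismatches in the last digit can arise because the inputs are themselves rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the list above.
reported = {
    "Zero-Shot + Context": (0.25, 1.00, 0.40),
    "10-Shot AutoDiCoT": (0.38, 0.86, 0.53),
    "DSPy (test set)": (0.385, 0.952, 0.548),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.3f}, reported = {f1}")
```

The starting point shows why F1 is reported alongside precision and recall here: perfect recall with precision 0.25 still yields an F1 of only 0.40.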
- Do the experiments support the claims?
  - The MMLU benchmark systematically compares common techniques under a controlled setup and reveals a non‑obvious outcome: `Zero‑Shot CoT` can hurt performance relative to plain `Zero‑Shot` (0.547 vs. 0.627), and `Self‑Consistency` recovers only part of that loss (0.574) (Figure 6.1). This supports the paper’s caution that technique effectiveness is context‑dependent (Section 6.1.5).
  - The entrapment case study demonstrates the realities of prompt engineering: sensitivity to context, the value of exemplar selection and reasoning control (`AutoDiCoT`), and the importance of aligning prompt objectives to domain goals (high recall in clinical screening) (Section 6.2.4).
- Ablations, failures, robustness
  - Answer‑extraction choices mattered; parsing only the “first characters” improved F1 over “exact match” during early steps (Section 6.2.3.3).
  - Ensembling (`10‑Shot AutoDiCoT Ensemble + Extraction`) unexpectedly degraded performance due to unstructured outputs requiring additional extraction (Section 6.2.3.3).
  - Overly strict instruction (“only explicit entrapment”) improved precision but sharply reduced recall, revealing value misalignment for the clinical use case (Section 6.2.3.3).
  - The paper quantifies model/dataset citation frequency to characterize technique adoption (Figures 2.9–2.11). For example:
    > Figure 2.11 shows `Chain‑of‑Thought` and `Few‑Shot` methods among the most cited in the corpus.
- Conditionality and trade‑offs
  - `Self‑Consistency` improves zero‑shot CoT but has little added value for few‑shot CoT in this benchmark, an interaction effect worth checking per task/model (Figure 6.1).
  - The best reasoning‑control prompt (`AutoDiCoT`) paired high recall (0.86) with low precision (0.38), a profile that suits high‑risk screening but perhaps not precision‑critical tasks.
6. Limitations and Trade-offs
- Scope constraints
  - Focused on `hard` (discrete) `prefix` prompts; excludes soft prompts and gradient‑based prompt tuning (Section “Scope of Study”).
  - Task‑agnostic techniques only; domain‑specific prompting is out of scope (Section “Scope of Study”).
- Empirical limits
  - Benchmark uses a single model (`gpt‑3.5‑turbo`), one dataset (MMLU subset), and limited variants (Section 6.1), so generalization across models/tasks is not guaranteed.
  - Parsing‑based evaluation can misjudge answers if the model’s formatting drifts (Section 6.1.4; Section 2.5).
  - Citation counts as a proxy for usage (Figures 2.9–2.11) reflect research discourse, not necessarily industry adoption.
- Prompt sensitivity and brittleness
  - Small format variations (e.g., duplicated context email) significantly affected results (Section 6.2.3.3), echoing broader sensitivity findings (Section 5.2.1).
  - Over‑constraint or poorly aligned instructions can degrade metrics most valued by stakeholders (e.g., recall in clinical screening) (Section 6.2.3.3).
- Compute, cost, and latency
  - Techniques like `Self‑Consistency`, ensembling, `Tree‑of‑Thought`, or agentic tool‑use increase API calls, latency, and cost (Sections 2.2.4, 4.1), which may limit real‑time or large‑scale use.
- Security defenses remain partial
  - Prompt‑based defenses can mitigate but not eliminate prompt hacking; detectors/guardrails reduce risk but are not foolproof (Section 5.1.3).
7. Implications and Future Directions
- How this work changes the landscape
  - Provides a shared map and vocabulary for the field (Figures 1.3, 2.2, 3.1, 3.2), making it easier to reason about design choices, compare techniques, and teach best practices.
  - Bridges prompting research with safety and evaluation, promoting end‑to‑end thinking: prompts, answer extraction, evaluation pipelines, and defenses (Sections 2.5, 4.2, 5).
- Follow‑up research enabled
  - Multi‑model, multi‑dataset replications of the benchmark to test interaction effects (e.g., when `Zero‑Shot CoT` helps vs. hurts).
  - Programmatic methods for `answer engineering` (learned extractors, structured decoding) that reduce formatting brittleness (Section 2.5).
  - Generalized `Directed CoT` methods: algorithmic selection of “what not to do” exemplars, with uncertainty‑aware sampling or RL.
  - Robust prompting under adversarial or noisy inputs; formal safety metrics for guards/detectors (Section 5.1.3).
- Practical applications and downstream use cases
  - Enterprise prompt design playbooks: exemplar selection (Figure 2.3), role/style instructions (Section 2.2.1.3), ensembling with cost control (Section 2.2.4), and built‑in answer extraction (Section 2.5).
  - Safety‑critical screening workflows (healthcare, trust & safety): prioritize high‑recall prompts, add calibration prompts (Section 5.2.2), and include human‑in‑the‑loop confirmation.
  - Agentic systems: tool routing and retrieval‑augmented reasoning patterns (`MRKL`, `ReAct`, `IRCoT`) for tasks needing factuality and planning (Section 4.1; Figure 4.1).
  - LLM‑as‑evaluator: adopt `G‑EVAL`/`LLM‑EVAL` styles with CoT and model‑generated guidelines for consistent, auditable assessments (Section 4.2).
Overall, the work offers a coherent blueprint for designing, evaluating, and securing prompt‑driven systems—from vocabulary and taxonomy (Figures 1.3, 2.2) through empirical guidance (Figure 6.1) and real‑world procedure (Figures 6.12–6.16)—while candidly surfacing sensitivity and safety pitfalls that practitioners must manage.