Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning
TL;DR
Imagine trying to teach a language model, as impressive as these systems are, to play chess. It can generate text that sounds like a plan, but can it actually execute one, accounting for every possible move and consequence? Getting LLMs to perform structured symbolic planning, the kind needed for reliable real-world decision-making, is surprisingly hard.
LLMs struggle with structured symbolic planning (like PDDL), lacking the logical rigor required for verifying preconditions and state transitions. PDDL-INSTRUCT aims to solve this by combining instruction tuning with logical Chain-of-Thought reasoning and external verification feedback (using VAL), resulting in instruction-tuned models achieving planning accuracy up to 94% on standard benchmarks. This represents a 66% absolute improvement over baselines.
Summary
Although LLMs have shown their strength in many domains, even general reasoning, they hit a wall when it comes to the logical reasoning and systematic verification crucial for automated planning. They struggle with formal representations like the Planning Domain Definition Language (PDDL), a standardized way to describe planning problems. While some approaches tried to have LLMs generate executable code, use environment feedback, or refine plans iteratively, they often fell short, especially as the planning tasks scaled in complexity. Previous attempts to use Chain-of-Thought (CoT) prompting, where the model generates intermediate reasoning steps, had also proven inadequate for planning.
Here comes PDDL-INSTRUCT. A novel, multi-phase instruction tuning framework designed to explicitly teach LLMs to reason through the precondition-effect structure of planning domains using a logical chain-of-thought approach. A possible saviour. The core idea is to move beyond bare plan generation and decompose planning verification into atomic reasoning steps. Instead of just asking the model to produce a plan, they guide it through the precise logical reasoning needed to determine when actions are applicable in a given state. This involves explicitly checking preconditions, applying effects, and validating invariants. Think of it as teaching the LLM to meticulously double-check its work, step by step.
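To make the precondition-effect reasoning concrete, here is a minimal sketch (my own illustration, not the paper's code) of the atomic check the model is taught to verbalize for a single Blocksworld-style action. States are represented as sets of ground atoms; the `unstack(a,b)` action and the atom names are assumptions for illustration.

```python
# Minimal sketch of the atomic reasoning step PDDL-INSTRUCT asks the LLM to
# verbalize: is an action applicable in the current state, and what state
# does it produce? States are sets of ground atoms (illustrative encoding).

state = {"on(a,b)", "clear(a)", "ontable(b)", "handempty()"}

# Hypothetical grounded Blocksworld action unstack(a, b).
action = {
    "name": "unstack(a,b)",
    "preconditions": {"on(a,b)", "clear(a)", "handempty()"},
    "add_effects": {"holding(a)", "clear(b)"},
    "del_effects": {"on(a,b)", "clear(a)", "handempty()"},
}

def applicable(state, action):
    """An action is applicable iff every precondition holds in the state."""
    return action["preconditions"] <= state

def apply(state, action):
    """Successor state: remove delete effects, then add add effects."""
    return (state - action["del_effects"]) | action["add_effects"]

if applicable(state, action):
    next_state = apply(state, action)
    print(f"{action['name']} is applicable; next state: {sorted(next_state)}")
else:
    missing = action["preconditions"] - state
    print(f"{action['name']} is NOT applicable; unsatisfied preconditions: {missing}")
```

The logical CoT asks the model to spell out exactly this check (and the resulting state) for every step of the plan, rather than emitting the action sequence in one shot.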
PDDL-INSTRUCT consists of two key training phases:
1. Phase 1: Initial Instruction Tuning. The LLM is trained on carefully crafted prompts pairing planning domains and problems with detailed explanations of their solutions. This phase establishes a foundation of planning knowledge and teaches the model to articulate logical justifications for action validity.
2. Phase 2: Logical CoT Instruction Tuning. This is where the magic happens. The initially tuned LLM is trained to produce explicit, step-by-step state-action-state sequences (⟨s₀, a₁, s₁⟩ triples). A full logical reasoning chain. These reasoning chains are then passed through a *verification module* implemented using an external 'referee', VAL, a well-established plan validator. VAL systematically checks the validity of each state transition based on action preconditions and effects (ground-truth feedback).
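Below is a rough sketch of what this verification loop could look like in practice. It assumes VAL's `Validate` binary is on the PATH and invoked as `Validate -v domain problem plan` (check your VAL build for the exact CLI), and `generate_plan` stands in for the tuned LLM; neither detail is taken from the paper.

```python
# Sketch of the Phase 2 loop: the model proposes a plan, VAL checks each state
# transition, and the validator's report is fed back so the model can revise.
# The VAL invocation and the LLM call are assumptions, not the paper's setup.
import subprocess
import tempfile

def validate_with_val(domain_file: str, problem_file: str, plan_text: str):
    """Run VAL's Validate on a candidate plan; return (is_valid, report)."""
    with tempfile.NamedTemporaryFile("w", suffix=".plan", delete=False) as f:
        f.write(plan_text)
        plan_path = f.name
    # Assumed invocation; flags may differ across VAL builds.
    result = subprocess.run(
        ["Validate", "-v", domain_file, problem_file, plan_path],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def plan_with_feedback(domain_file, problem_file, prompt, generate_plan, max_iters=5):
    """generate_plan(prompt) -> plan_text is a placeholder for the tuned LLM."""
    for _ in range(max_iters):
        plan = generate_plan(prompt)
        ok, report = validate_with_val(domain_file, problem_file, plan)
        if ok:
            return plan
        # Detailed rather than binary feedback: include the validator's report
        # so the model sees which transition failed and why.
        prompt += f"\nYour last plan was invalid. Validator report:\n{report}\nRevise the plan."
    return None
```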
Results
The experiments, conducted using Llama-3-8B and GPT-4, were impressive. The models were evaluated using PlanBench across three distinct planning domains: Blocksworld, Mystery Blocksworld, and Logistics.
PDDL-INSTRUCT significantly outperformed baseline models and models with only basic instruction tuning. Llama-3 with detailed feedback achieved an average absolute improvement of 35% over basic instruction tuning and 66% over the baseline.
Limitations
The framework currently assumes planning domains without complex PDDL features like conditional effects or durative actions. Future work could explore expanding PDDL coverage, developing self-verification capabilities (reducing reliance on external verifiers), and advancing towards optimal planning (finding not just valid plans, but the best ones).
Key Insights
- Logical CoT is Essential for Planning: When properly integrated with instruction tuning and external validation, Chain-of-Thought (CoT) prompting can successfully enable LLMs to reason about action applicability, state transitions, and plan validity.
- Detailed Feedback Trumps Binary: Providing detailed feedback—specific reasoning about which precondition failed or which effect was incorrectly applied, rather than simple binary (valid/invalid) feedback—consistently yields superior performance.
- The Necessity of External Verification: The validation step ensures logical coherence and robustness against unfaithful CoT reasoning—where the model generates plausible, but internally inconsistent, chains of logic. It’s key to achieving reliability.
- Decoupled Optimization for Robustness: The two-stage optimization process ensures the model develops the necessary logical foundation before optimizing for overall planning success.
Personal Takeaway
This provides a promising path for developing trustworthy AI systems capable of reliable, complex decision-making. The framework could extend beyond planning and be powerful in any task requiring long-horizon reasoning and systematic, verifiable deduction, such as theorem proving. I can see it being really helpful in blockchain formal verification and in more domain-specific AI applications.
I took some of the ideas from this paper and applied them to a simple ReAct agent with just two tools, and it worked pretty well. You can check it out on the github repo.