Here’s a discussion of our recent paper, along with an additional experiment in which a Reflexion-based GPT-4 agent surpasses the previous GPT-4 state of the art.
In our recent paper, "Reflexion: An Autonomous Agent with Dynamic Memory and Self-Reflection," we introduce a framework that allows AI agents to emulate human-like self-reflection, and we evaluate it on the ALFWorld and HotpotQA benchmarks. Our goal was to create AI agents that learn by reflecting on failures and improving their results, much like humans do. If you're interested in exploring this further, we've made all the code and logs available at link.
In this post, we describe some of the ideas we are exploring to extend the Reflexion framework and give a peek at some interesting results we are observing. Our slightly improved Reflexion-based GPT-4 agent achieves state-of-the-art pass@1 results (88%) on HumanEval, outperforming GPT-4 (67.0%) and CodeT: Code Generation with Generated Tests (65.8%), the previous state-of-the-art results.
Reflexion Without Definitive Ground Truth
Human intelligence is notable for its ability to learn from mistakes. We often don't solve problems on our first try, but when we make mistakes, we generate new ideas to refine our approach through self-reflection, analyzing our missteps. In our paper, we formalized this concept using a ground-truth success metric for problem-solving evaluation and showed how an agent can iteratively improve on tasks. However, many real-world situations don't have a definitive ground truth or a single optimal solution.
To address such situations, we propose a method that again mirrors human problem-solving. When given a task whose solution is not clearly defined, we usually take time to plan and create an internal test suite based on our contextual understanding, either consciously or unconsciously. We evaluate various potential solutions against these tests and assign a confidence level to each, making adjustments until one candidate is likely to satisfy all or most of the tests; that candidate then becomes the proposed solution to be executed. In this scenario, the solution that passes all or most internal test cases is accepted as the one most likely to match the ground truth, and the chance of success depends on the probability that the tests themselves were designed incorrectly.
This method can be applied to many problems without a firm ground truth (pass@1), similar to problems that span fields such as protein design, chemical design, and architectural design, as well as simple problems that we encounter daily. As LLMs and other large neural networks with sensory capabilities advance, we may see widespread applications of Reflexion in tasks traditionally performed by humans. For instance, an AI chef could create dishes based on your cravings, refining the recipe through continuous feedback. Similarly, an AI business consultant might develop a successful business strategy without a predefined path. By using self-reflection for iterative learning, we can develop high-confidence solutions for problems in which a concrete ground truth is unavailable.
This concept applies not only to complex human-centered problems but also to simpler text-based problems such as code implementation. When developers implement programs, they participate in an iterative loop of writing, executing, and debugging code. Typically, programmers spend more time resolving bugs in existing code than writing new code from scratch. This iterative nature makes program implementation an ideal application of Reflexion.
Applying Reflexion to HumanEval
The HumanEval dataset has become a widely recognized benchmark to measure code generation accuracy. Our Reflexion-based agent was benchmarked on the HumanEval dataset and achieved 88% accuracy, surpassing GPT-4 (67%), CodeT (65.8%), and PaLM (26.2%).
Typically, in the initial stage of program implementation, a programmer designs internal tests that they may use to evaluate their future implementations’ performance. This allows them to continuously refine their code to satisfy the constraints of the internal tests. When all of the internal tests pass, the programmer will push their code with confidence that they have produced a solution that has accomplished the task to the best of their understanding of the problem at hand.
While there are many ways to program in practice, the process used in programming competitions gives a good sense of an optimal workflow for program implementation. In such competitions, participants are often allowed to write their implementation in an external editor and run self-designed test cases, and some competition guidelines go as far as allowing competitors to evaluate their implementations on a subset of “visible” test cases. In everyday software development, this is the idea behind test-driven development (TDD). TDD works as follows: (1) a human receives a description of a technical problem and a list of goal features; (2) the human designs a suite of unit tests that captures the ideal behavior of the program, to the best of their understanding of the problem; (3) the human writes an implementation in code; (4) the human runs their test suite on the program; (5) if all of the tests pass and the human is confident in the design and coverage of their tests, they push their code to the codebase with high confidence.
TDD for code generation is not novel, however. CodeT: Code Generation with Generated Tests uses dual execution agreement, which involves generating and executing internal tests to estimate confidence among the generated samples. Until the public release of GPT-4, CodeT held the state-of-the-art result of 65.8% accuracy on HumanEval. The CodeT pass@1 pipeline works as follows: (1) the agent is given a function signature and docstring; (2) the agent generates a collection of internal unit tests without access to the ground truth; (3) the agent generates a collection of function body implementations; (4) the agent evaluates the implementations on the internal unit tests; (5) the agent returns the implementation that passes the most tests.
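To make step (5) concrete, here is a minimal sketch of the selection step, assuming a hypothetical helper `run_test(impl, test)` that returns True when a candidate implementation passes a given internal test; this is a simplification, not CodeT's exact dual execution agreement implementation.

```python
def select_best(candidate_impls, internal_tests, run_test):
    """Return the candidate implementation that passes the most internal tests.

    A simplified sketch of CodeT-style selection; `run_test(impl, test)` is a
    hypothetical helper that returns True when `impl` passes `test`.
    """
    return max(
        candidate_impls,
        key=lambda impl: sum(run_test(impl, t) for t in internal_tests),
    )
```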
The CodeT approach improves performance as the number of function body samples increases. While this allows a large state space to be explored from the base state (the given function signature and docstring), it does not allow the agent to explore starting from previous, high-confidence states, as quantified by the evaluation of its internal tests. This idea can be explained more intuitively with an example. Suppose a human is given the task of finding an object in a house that may contain several rooms, drawers, and cabinets, with the constraint that they may only examine one area of the house at a time. If the equivalent of CodeT were applied, they would design an internal test suite, such as a description of the object they are looking for; then, if they find an object that matches the description, they can report with high confidence that they have completed the task. Across their search samples, such as “check room 1”, “check room 2”, etc., their ability to find the object increases as the number of samples increases. However, if the object is well hidden, there is a significant chance that the human will never explore the area where the object is hidden, because they have no memory of past attempts.
In program development, this is analogous to blindly generating N proposed solutions in the hope that one sample will satisfy the constraints of the internal tests. Our Reflexion paper demonstrates an approach that allows an agent to refine its past attempts in alignment with ground-truth solutions. However, in cases such as the pass@1 metric, the ground truth is not available. Thus, we use Reflexion with a relaxed success evaluation to explore high-confidence states beyond the start state and find solutions that satisfy the internal tests while maintaining adherence to pass@1 standards.
Relaxing Success Evaluation
By using Reflexion to iteratively refine the current implementation, we shift the “accuracy bottleneck” from syntactically and semantically correct code generation to syntactically and semantically correct test generation. In theory, test generation should be much easier to accomplish than code generation. Following this assumption, we hypothesized that if an agent can design diverse and accurate tests, then it can use those internal tests to iteratively refine its implementation, and the agent’s accuracy can be redefined as its ability to generate accurate tests.
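As a rough illustration, here is a minimal sketch of the loop this implies. The helper callables (`generate_tests`, `generate_implementation`, `run_tests`, `revise_implementation`) are hypothetical placeholders for the components described in the Implementation section below, not the exact functions from our codebase.

```python
def reflexion_pass_at_1(problem, generate_tests, generate_implementation,
                        run_tests, revise_implementation, max_iters=5):
    """A minimal sketch of Reflexion with a relaxed success evaluation.

    `run_tests` is assumed to return (number of tests passed, verbose feedback).
    """
    tests = generate_tests(problem)          # internal unit tests, no ground truth
    impl = generate_implementation(problem)  # first attempt from the base state
    for _ in range(max_iters):
        passed, feedback = run_tests(impl, tests)
        if passed == len(tests):             # relaxed success: all internal tests pass
            break
        impl = revise_implementation(problem, impl, feedback)
    return impl                              # the single submission scored by pass@1
```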
Implementation
Test generation
The method for test generation was inspired by CodeT: Code Generation with Generated Tests found at https://github.com/microsoft/CodeT.
Function body generation
The method for code implementation generation was inspired by CodeT: Code Generation with Generated Tests found at https://github.com/microsoft/CodeT.
Unit test execution
Unit test execution was implemented to provide the agent with the following features: (1) evaluation, to assess its current accuracy on the internal unit tests, and (2) feedback, a verbose log of pass/fail status per test with the error type or output value for each failed test, e.g. “string”, 5, AssertionError, SyntaxError, etc. To evaluate accuracy on the internal unit tests, we pair the current function implementation with every internal unit test. If a test passes, we add it to a list of passed tests. If a test fails, we use a language-specific abstract syntax tree (in this case, the Python AST module) to construct a function call with the same parameters as the failed test, capture the error type or return output, and add it to a list of failed tests. Examples are shown below:
Function call construction
assert func(x1, y1) == z1 → func(x1, y1)
assert func(x2, y2) == z2 → func(x2, y2)
Feedback example output
Passed tests:
assert func(x0, y0) == z0
assert func(x3, y3) == z3
assert func(x4, y4) == z4
Failed tests:
assert func(x1, y1) == z1 # output: AssertionError
assert func(x2, y2) == z2 # output: 5
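For illustration, here is a minimal sketch of the call construction and feedback capture described above, using Python's `ast` module. It assumes each internal test is a single `assert` that compares one function call with `==`; the example implementation and test at the bottom are hypothetical.

```python
import ast

def call_from_assert(test: str) -> str:
    """Rewrite `assert func(args) == expected` into the bare call `func(args)`."""
    node = ast.parse(test).body[0]
    if (isinstance(node, ast.Assert)
            and isinstance(node.test, ast.Compare)
            and isinstance(node.test.left, ast.Call)):
        return ast.unparse(node.test.left)  # requires Python 3.9+
    raise ValueError(f"unsupported test form: {test}")

def feedback_for_failed_test(impl: str, test: str) -> str:
    """Run the constructed call against an implementation string and report the
    error type or the returned output for the verbose feedback log."""
    env: dict = {}
    exec(impl, env)  # define the candidate function in a scratch namespace
    try:
        result = eval(call_from_assert(test), env)
        return f"{test} # output: {result!r}"
    except Exception as e:
        return f"{test} # output: {type(e).__name__}"

# Hypothetical usage with an illustrative implementation and internal test:
impl = "def func(x, y):\n    return x + y"
print(feedback_for_failed_test(impl, "assert func(2, 2) == 5"))
# -> assert func(2, 2) == 5 # output: 4
```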
Self-reflection generation
Building upon our work in the Reflexion paper, we aim to further isolate individual problems to achieve iterative improvement. For problems rooted in natural language, it is common to see implementations that require the LLM to handle two or more subtasks at once. Specifically, we want to isolate two tasks: (1) error identification, for instance, "the second for loop in this function is unnecessary and may cause runtime errors as shown in tests #1 and #2"; and (2) implementation correction, for example, "here is an updated implementation with the corrections: ```python\n<new code>```". To isolate these tasks, we make two calls to the LLM. The first call generates a natural language instruction based on self-reflection, while the second call produces a new implementation, conditioned on the internal test feedback, the previous implementation, and the instruction for a revised version.
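A minimal sketch of this two-call structure is shown below. Here, `chat` is a hypothetical wrapper around whichever chat-completion API is in use, and the prompt wording is illustrative rather than the exact prompts from our implementation.

```python
def reflect_and_revise(chat, problem, implementation, test_feedback):
    """Two-call self-reflection: identify errors, then produce a revision."""
    # Call 1: error identification, a natural-language self-reflection on why
    # the current implementation fails the internal tests.
    reflection = chat(
        f"Problem:\n{problem}\n\n"
        f"Implementation:\n{implementation}\n\n"
        f"Unit test feedback:\n{test_feedback}\n\n"
        "In a few sentences, explain what is wrong with the implementation."
    )
    # Call 2: implementation correction, conditioned on the test feedback, the
    # previous implementation, and the self-reflection instruction.
    revised = chat(
        f"Problem:\n{problem}\n\n"
        f"Previous implementation:\n{implementation}\n\n"
        f"Unit test feedback:\n{test_feedback}\n\n"
        f"Self-reflection:\n{reflection}\n\n"
        "Write a corrected implementation. Return only code."
    )
    return revised
```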
Apply Reflexion
By relaxing the success criteria to internal test accuracy, we are able to run an iterative feedback loop that still respects the criteria for pass@1 performance. We hope this post provides meaningful ideas for future Reflexion implementations, and we encourage others to apply Reflexion to enable agents to solve a variety of complex tasks that are currently dominated by human intelligence. If you have any questions, don’t hesitate to contact Noah at noahshinn024@gmail.com or Ashwin at agopi@mit.edu.
References
Reflexion: an autonomous agent with dynamic memory and self-reflection
I Speak, You Verify: Toward Trustworthy Neural Program Synthesis
Emphasizes that trust is the most important aspect for useful agents
Agents should only give an answer when they are confident
Toolformer: Language Models Can Teach Themselves to Use Tools
Agents can be trained to use tools
Faithful Reasoning Using Large Language Models
CodeT: Code Generation with Generated Tests
Sampling and selecting
Large Language Models Can Self-Improve
Self-improvement on unlabeled data sets using fine-tuning