OpenAI’s new “CriticGPT” model is trained to critique GPT-4 outputs

Illustration created by OpenAI.

On Thursday, OpenAI researchers unveiled CriticGPT, a new artificial intelligence model designed to identify bugs in code generated by ChatGPT. It aims to improve the process of getting AI systems to behave in ways that humans want (called “alignment”) through Reinforcement Learning from Human Feedback (RLHF), a technique in which human reviewers help refine the outputs of large language models (LLMs).

As outlined in a new research paper titled “LLM Critics Help Catch LLM Bugs,” OpenAI created CriticGPT to act as an AI assistant for human trainers who review the programming code generated by the ChatGPT AI assistant. CriticGPT—based on the GPT-4 family of LLMs—analyzes code and highlights potential bugs, making it easier for people to spot problems that might otherwise go unnoticed. The researchers trained CriticGPT on a dataset of code samples with intentionally embedded errors, teaching it to recognize and flag various coding errors.
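
The paper does not publish CriticGPT’s exact output format, but a purely illustrative sketch conveys the idea: given a subtly buggy snippet of the kind ChatGPT might generate, the model produces a targeted, written critique. The snippet and critique wording below are invented for illustration.

```python
# Hypothetical ChatGPT-generated snippet with a subtle bug (invented for illustration).
def moving_average(values, window):
    """Return the moving average of `values` over `window`-sized slices."""
    averages = []
    for i in range(len(values)):
        chunk = values[i:i + window]
        averages.append(sum(chunk) / window)  # BUG: trailing chunks are shorter than `window`
    return averages

# The kind of targeted critique CriticGPT is trained to produce (wording is illustrative):
# "The last `window - 1` chunks contain fewer than `window` elements, but the sum is still
#  divided by `window`, so the trailing averages are biased toward zero. Divide by
#  `len(chunk)` instead, or stop the loop at `len(values) - window + 1`."
```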

The researchers found that annotators preferred CriticGPT’s critiques over human critiques in 63 percent of cases involving naturally occurring LLM errors, and that human-machine teams using CriticGPT wrote more comprehensive critiques than humans alone while reducing confabulation (hallucination) rates compared to AI critiques alone.

Development of an automated critic

The development of CriticGPT involved training the model on a large number of inputs containing intentionally embedded errors. Human trainers were asked to edit the code written by ChatGPT, introduce errors, and then provide example feedback as if they had discovered the errors. This process allowed the model to learn to identify and criticize different types of coding errors.
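
Concretely, each training example produced by this tampering process pairs the deliberately broken code with the trainer’s write-up of the inserted bug. The record structure and field names below are assumptions for illustration; the paper’s internal data schema is not public.

```python
from dataclasses import dataclass

@dataclass
class TamperingExample:
    """One hypothetical training record from the tampering process described above."""
    original_code: str       # code ChatGPT originally wrote
    tampered_code: str       # the same code after a trainer deliberately inserted a bug
    reference_critique: str  # the trainer's description of the bug, as if found in the wild

example = TamperingExample(
    original_code="def area(r):\n    return 3.14159 * r ** 2\n",
    tampered_code="def area(r):\n    return 3.14159 * r * 2\n",  # inserted bug: r * 2 instead of r ** 2
    reference_critique=(
        "The function multiplies the radius by 2 instead of squaring it, so it returns "
        "roughly the circumference (2*pi*r) rather than the area (pi*r^2)."
    ),
)
```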

In experiments, CriticGPT demonstrated its ability to catch both the deliberately embedded errors and naturally occurring errors in ChatGPT’s output. Trainers favored the new model’s critiques over those generated by ChatGPT itself in 63 percent of cases involving natural errors (the statistic mentioned above). This preference was due in part to CriticGPT producing fewer unhelpful “nitpicks” and fewer false positives, or hallucinated problems.

The researchers also developed a new technique they call Force Sampling Beam Search (FSBS). This method helps CriticGPT write more detailed code reviews. It lets researchers adjust how thoroughly CriticGPT searches for problems while controlling how often it invents problems that don’t actually exist, and they can tune this balance for different AI training tasks.
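
The exact FSBS procedure is not reproduced here; the following is only a heavily simplified sketch of the knob being tuned. It assumes a set of candidate critiques, each with a quality score and a count of flagged issues, and picks one by trading thoroughness against the risk of invented problems. The scoring function and numbers are made up for illustration and are not OpenAI’s implementation.

```python
# Simplified sketch of the comprehensiveness-vs-precision tradeoff described above.
# `candidates` is a list of (critique_text, quality_score, num_flagged_issues) tuples;
# both the scoring rule and the example values are invented for illustration.

def select_critique(candidates, length_bonus):
    """Pick the candidate whose quality score plus a per-issue bonus is highest.

    Raising `length_bonus` favors longer, more thorough critiques (more bugs caught);
    lowering it favors conservative critiques with fewer spurious complaints.
    """
    return max(candidates, key=lambda c: c[1] + length_bonus * c[2])

candidates = [
    ("One serious bug flagged.", 0.80, 1),
    ("Three issues flagged, one of them dubious.", 0.65, 3),
]

print(select_critique(candidates, length_bonus=0.02))  # conservative setting -> first critique wins
print(select_critique(candidates, length_bonus=0.10))  # thorough setting -> second critique wins
```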

Interestingly, the researchers discovered that CriticGPT’s capabilities go beyond just code review. In their experiments, they applied the model to a subset of ChatGPT training data that had previously been rated as error-free by human annotators. Surprisingly, CriticGPT identified errors in 24 percent of these cases—errors that were subsequently confirmed by human reviewers. OpenAI thinks this demonstrates the model’s potential to generalize to non-code tasks and highlights its ability to pick up subtle errors that even careful human evaluation might miss.

Despite these promising results, CriticGPT, like all AI models, has limitations. The model was trained on relatively short ChatGPT responses, which may not fully prepare it to evaluate the longer and more complex tasks that future AI systems might tackle. Furthermore, while CriticGPT reduces confabulations, it does not eliminate them completely, and human trainers may still make labeling errors based on these spurious outputs.

The research team recognizes that CriticGPT is most effective at identifying bugs that can be pinpointed to one specific location in the code. However, real errors in AI outputs can often be spread over multiple parts of the answer, posing a challenge for future iterations of the model.

OpenAI plans to integrate CriticGPT-like models into its RLHF labeling pipeline, providing its trainers with AI assistance. For OpenAI, this is a step toward developing better tools for evaluating outputs from LLM systems that can be difficult for humans to assess without additional support. However, the researchers caution that even with tools like CriticGPT, extremely complex tasks or responses can still be challenging for human raters—even those aided by AI.
