Exploring the Limits of GPT-4: Breaking the Benchmark

🤖 Introduction

Since late April, machine learning engineer Josh Stapleton and I have been evaluating over 120,000 answers from GPT models to explore their limits. In my original Smart GPT video, I showed that even popular TED Talks calling GPT-4 stupid were not accurately testing what GPT-4 could do; in fact, it could easily get such questions right. Little did we foresee that, come the summer, our tests with GPT-4 would reveal a host of mistakes in an official, globally used benchmark, uncovering concerns that even OpenAI and Google don't appear to be aware of. By the end of this article, I want to show how you can tangibly benefit from our experiments, including in unexpected domains like medicine.

- Introduction

- Table of Contents

- Smart GPT: A Quick Intro

- The Problem with Benchmarking GPT-4

- Smart GPT Framework

- Breaking the Benchmark

- The Need for Independent Professional Benchmarking

- Practical Applications of Smart GPT

- Pros and Cons of Smart GPT

- Highlights

- FAQ

🤖 Smart GPT: A Quick Intro

Smart GPT was a way of using the latest prompt engineering research to trigger better performance in a model like GPT-4. Getting the model to think a bit, that is, to use some tokens before giving a final answer, was key. Another important element I talked about in that video was the power of getting the model to self-reflect, an insight I drew from talks with the lead author of the famous Reflexion paper. My manual experiments showed that by using optimized prompts, reflection, and self-dialogue, you could boost performance in almost any task, and I demoed the improvement on formal logic and college mathematics.
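To make the "think before answering" idea concrete, here is a minimal sketch in Python. It is not the actual Smart GPT prompt: the wording of the instruction, the `Final answer:` convention, and the function names `build_prompt` and `extract_final_answer` are all assumptions made for illustration. It shows one way to ask for reasoning first and then pull a single answer letter out of a long response.

```python
import re
from typing import Optional

LETTERS = "ABCDEFGHIJ"

def build_prompt(question: str, options: list) -> str:
    """Format a multiple-choice question so the model reasons before answering.

    The instruction wording is a hypothetical example, not the Smart GPT prompt.
    """
    lines = [f"Question: {question}"]
    for letter, option in zip(LETTERS, options):
        lines.append(f"{letter}. {option}")
    lines.append(
        "Think through the problem step by step, then finish with a line "
        "of the form 'Final answer: <letter>'."
    )
    return "\n".join(lines)

# Matches "Final answer: B" or "Final answer: (b)"; the \b stops it from
# matching the first letter of an ordinary word such as "Because".
FINAL_ANSWER = re.compile(r"Final answer:\s*\(?([A-J])\b", re.IGNORECASE)

def extract_final_answer(response: str) -> Optional[str]:
    """Return the last answer letter stated in the response, or None."""
    matches = FINAL_ANSWER.findall(response)
    return matches[-1].upper() if matches else None
```

Taking the last match rather than the first is a deliberate choice in this sketch: a response that reflects on and revises an earlier answer should be scored on its final one.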

But there was a problem, which is why you guys haven't heard about Smart GPT in a while. How could I systematically benchmark GPT-4 using these methods when I'm just one guy? Well, enter Josh Stapleton, machine learning engineer extraordinaire. Without him, it would have been impossible to build out such a fleshed-out, flexible code base with which we could systematize experiments and iterate rapidly.

🤖 The Problem with Benchmarking GPT-4

We quickly realized that there was another problem with benchmarking the original version of Smart GPT on tens of thousands of official questions. It would be hell to manually extract the final answers from pages of reflection and resolving, not to mention that it would cost tens of thousands of dollars. And trust me, a month of YouTube advertising revenue would not even cover the first hour of that run. And no, we would never compromise by asking GPT-4 to grade its own answers. It would be unscientific and inaccurate. The infamous MIT paper is enough evidence of that: GPT-4 didn't get 100% on an MIT degree, and that paper was withdrawn.

So yes, we had to lower the power level of Smart GPT and get rid of the reflection and resolving, deliberately sacrificing some of its intelligence because we simply couldn't afford to unleash it fully. And yet we still got a new, albeit unofficial, record of 88.4% on the MMLU, which not only beats the 86.4% recorded by OpenAI but also beats the projections for 2024 that Metaculus recorded before ChatGPT came out. We are both convinced that there are at least a dozen more ways performance can be further boosted using existing models. Yes, that might mean GPT-4 getting a result reserved for June of 2025. The thing is, we have hit the limits of what a self-funding team of two can do.

🤖 Smart GPT Framework

The Smart GPT framework is highly parallelized and can handle industry-scale use cases. We used a thread- and asyncio-based approach to make simultaneous calls to the API at the answer-option, answer, and subject levels, stacking parallelization upon parallelization. This led to crazy iteration speed boosts. For example, we were able to complete the final GPT-4 run in under two hours. Generating single answer options in series would have taken weeks.

Smart GPT is a model-agnostic, parametrized, and highly flexible system that can be applied to disparate use cases. We are already working on applications in a number of domains in both the public and private sectors. The system is evolving and improving constantly under the hood as we continue to innovate. While the current system can get state-of-the-art results and handle enterprise-scale data, there are a number of known ways to improve it, which we aim to implement in the near future: better and more numerous automatically sourced exemplars, LLM-driven prompt optimization, and fine-tuning.
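The nested parallelism described above can be sketched roughly as follows. This is not the Smart GPT code base: it uses only `asyncio` (no threads), `fake_model_call` is a hypothetical stand-in that sleeps instead of calling a real API, and the function names, the `n_samples` parameter, and the concurrency cap are assumptions chosen for illustration.

```python
import asyncio

async def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for an API request; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def answer_question(question: str, n_samples: int, limit: asyncio.Semaphore) -> list:
    """Innermost level: request several samples for one question concurrently."""
    async def one_sample(i: int) -> str:
        async with limit:  # cap the number of requests in flight at once
            return await fake_model_call(f"{question} [sample {i}]")
    return await asyncio.gather(*(one_sample(i) for i in range(n_samples)))

async def run_subject(questions: list, n_samples: int, limit: asyncio.Semaphore) -> list:
    """Middle level: all questions within one subject run concurrently."""
    return await asyncio.gather(
        *(answer_question(q, n_samples, limit) for q in questions)
    )

async def run_benchmark(subjects: dict, n_samples: int = 3, max_concurrent: int = 8) -> dict:
    """Outer level: all subjects run concurrently, sharing one concurrency cap."""
    limit = asyncio.Semaphore(max_concurrent)
    results = await asyncio.gather(
        *(run_subject(qs, n_samples, limit) for qs in subjects.values())
    )
    return dict(zip(subjects.keys(), results))

if __name__ == "__main__":
    demo = {"formal_logic": ["q1", "q2"], "college_mathematics": ["q3"]}
    output = asyncio.run(run_benchmark(demo))
    for subject, answers in output.items():
        print(subject, len(answers), "questions answered")
```

The shared semaphore is the design choice worth noting: stacking `gather` calls at three levels would otherwise launch every request at once, which a rate-limited API would likely reject.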

🤖 Breaking the Benchmark

The MMLU (Massive Multitask Language Understanding) is arguably the best-known benchmark of language model performance. It is called "massive" because it has over 14,000 questions and "multitask" because it covers 57 different domains. The idea behind it was truly fantastic, and it is important enough to feature prominently on the first page of the GPT-4 technical report. In the past, I have said that getting 100% on this test would be a good sign of AGI. Others have talked about 95%. As one such prediction put it: "I do think there's like a 50% chance that, within the next 20 years or so, there might be something that we might call AGI or transformative AI. Well, maybe we can measure it on benchmarks. There's this famous MMLU benchmark, something that scores like 95% on this."

The paper itself notes that a score of 89.8% represents human expert ability, which we are achingly close to beating. And as you'll see in a moment, GPT-4 with the full power of prompt engineering could likely get upwards of 90 to 92% right now. Frankly, whether it's GPT-5 or Gemini, that 95% threshold should easily be broken by next year, not in 20 years.

🤖 The Need for Independent Professional Benchmarking

We need an independent professional benchmarking organization to step in, perhaps one of the major education companies like Pearson, funded by the top AGI labs and inspired by the fantastic vision behind the MMLU. They could design an incredibly broad range of subject tests, and those could be rolled into a benchmark that stretches all the way to extreme difficulties. Each question could be rigorously vetted to be unambiguous, and the volume and diversity of questions would reduce the kind of overfitting that people are worrying about now with HumanEval. And of course, the answers could also be blind human-graded. All models, including open-source ones, would be held to the exact same standard and given the best chance to shine. We need to benchmark these models to the best of their abilities and find the ceiling of what they can do, not the floor.

🤖 Practical Applications of Smart GPT

Smart GPT can be applied to a wide range of domains, including medicine. For example, we tested GPT-4 on a medical diagnosis question. With no exemplars and no self-consistency, it always got the question wrong; with the full power of Smart GPT (exemplars, self-consistency, and self-reflection), it always got it right. Obviously, we are not saying that you should rely on GPT-4 for medical diagnoses, but we are saying that you might be surprised by the diversity of domains to which such methods can be applied.
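Of the three ingredients named above, self-consistency is the easiest to illustrate: sample several answers to the same question and keep the most common one. The snippet below is a minimal sketch of that voting step only, assuming each sampled response has already been reduced to an answer letter (or `None` when no answer could be extracted). The function name `majority_vote` is hypothetical and is not taken from the Smart GPT code.

```python
from collections import Counter
from typing import Optional

def majority_vote(answers: list) -> Optional[str]:
    """Return the most frequent answer among the samples.

    Samples with no extractable answer (None) are ignored. If two answers
    tie, the one that appeared first among the samples wins.
    """
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    samples = ["B", "A", "B", None, "B"]
    print(majority_vote(samples))
```

Voting only helps when the model is right more often than it is wrong on a given question, which is presumably why the source pairs it with exemplars and self-reflection rather than relying on it alone.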

🤖 Pros and Cons of Smart GPT

Pros:

- Smart GPT can boost performance in almost any task.

- The Smart GPT framework is highly parallelized and can handle industry-scale use cases.

- Smart GPT is a model-agnostic, parametrized, and highly flexible system that can be applied to disparate use cases.

Cons:

- Smart GPT sacrifices some of the model's intelligence to make it more affordable.

- Smart GPT requires human grading, which can be time-consuming and expensive.

🤖 Highlights

- Smart GPT can boost performance in almost any task.

- The Smart GPT framework is highly parallelized and can handle industry-scale use cases.

- We need an independent professional benchmarking organization to be asked to step in.

- Smart GPT can be applied to a wide range of domains, including medicine.

🤖 FAQ

Q: What is Smart GPT?

A: Smart GPT is a way of using the latest prompt engineering research to trigger better performance in a model like GPT-4.

Q: What is the MMLU?

A: The MMLU (Massive Multitask Language Understanding) is arguably the best-known benchmark of language model performance.

Q: What are the pros of Smart GPT?

A: Smart GPT can boost performance in almost any task, the Smart GPT framework is highly parallelized and can handle industry-scale use cases, and Smart GPT is a model-agnostic, parametrized, and highly flexible system that can be applied to disparate use cases.

Q: What are the cons of Smart GPT?

A: Smart GPT sacrifices some of the model's intelligence to make it more affordable.
