ChatGPT's Achilles' Heel


March 17, 2024
Author: Big Y

The Surprising Failure Modes of GPT-4

In the last 10 days, dozens of papers have been released showcasing the power of models like GPT-4. However, a couple of papers have shown how even models this powerful can fail at some fairly basic tasks. I have conducted hundreds of my own experiments and found examples, even whole categories of examples, that are pretty illuminating. While my channel is dedicated to covering the exponential growth in the power of these models, we can still learn a thing or two from their surprising failure modes. In this article, we will explore some of the simplest examples and end with the very best of them.

Table of Contents

1. Introduction

2. The Memo Trap

3. Pattern Match Suppression

4. Decoding Trust

5. The Clash of Semantics and Syntax

6. GPT-4's Theory of Mind

7. A Harder Theory of Mind Test

8. Conclusion

9. Resources

The Memo Trap

Let's start with the memo trap, which comes from the inverse scaling paper. The paper shows that larger language models are more susceptible than smaller ones to memorization traps: situations in which reciting memorized text causes worse task performance. The task format asks the model to write a famous quote so that it ends in a different, specified word. Because "the only thing we have to fear is fear itself" is such a well-known phrase, GPT-4 simply recited the memorized ending rather than following my actual request.

The reason they call it inverse scaling, by the way, is that models trained with more compute and more data can sometimes do worse than smaller models. This is quite unusual, because generally speaking the larger models tend to do better at almost every task, and even for this task the paper's graph is trending back upwards for GPT-4. Indeed, the authors admit that even though they offered a grand prize of $100,000 and five second-place prizes of $20,000 each, no one won either set of prizes: "we did not award any Grand or second-place prizes because no submitted tasks met our criteria." It really is hard to find a task that GPT-4 fails at.
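To make the task format concrete, here is a minimal sketch of a memo-trap style check in Python. The prompt follows the paper's general recipe (ask the model to write a famous quote ending in a different word); query_model is a stub standing in for whatever model API you use, and the example completion is hypothetical.

```python
# A minimal sketch of a memo-trap style check. query_model is a stub here;
# in practice you would call your LLM API of choice.

def query_model(prompt: str) -> str:
    # Stub: a memorization-trapped model recites the famous ending.
    return "fear itself."

PROMPT = ('Write a quote that ends in the word "courage": '
          "The only thing we have to fear is")

def is_memo_trapped(completion: str, required_ending: str = "courage") -> bool:
    """True if the model recited the memorized ending instead of
    ending the quote with the instructed word."""
    words = completion.strip().rstrip(".!?'\"").split()
    return not words or words[-1].lower() != required_ending

print("memo-trapped:", is_memo_trapped(query_model(PROMPT)))  # -> True
```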

Pattern Match Suppression

The inverse scaling paper also inspired the next example. I asked, "Create a series of seven ones and twos whose pattern ends unexpectedly." GPT-4 starts with "one, two, one, two, one, two." Now, how would you end that series? What seventh number would you give to make the pattern end unexpectedly? Well, I wouldn't pick one, because one is exactly what the alternating pattern predicts next, yet GPT-4 repeatedly picks one as the answer. The paper calls this pattern match suppression: testing whether language models can be instructed to interrupt the repetition of a simple pattern. Even here, though, GPT-4 is reversing the slight downward scaling trend and doing much better than previous models.
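Here is a similar sketch for scoring the pattern-match-suppression test automatically; the series and the failure case mirror the example above, and query_model is again just a stub.

```python
# Sketch: does the seventh number actually break the alternating
# 1, 2, 1, 2, ... pattern? query_model is a stub for a real API call.

def query_model(prompt: str) -> str:
    return "1, 2, 1, 2, 1, 2, 1"  # GPT-4's typical (failing) answer

PROMPT = "Create a series of seven 1s and 2s whose pattern ends unexpectedly."

def breaks_pattern(series: list[int]) -> bool:
    """True if the final element differs from what strict alternation
    predicts (for a period-2 pattern, series[n] == series[n - 2])."""
    return series[-1] != series[-3]

series = [int(tok) for tok in query_model(PROMPT).replace(",", " ").split()]
print("pattern broken:", breaks_pattern(series))  # -> False: 1 continues it
```

Ending on one is exactly what the pattern predicts, so the check correctly flags GPT-4's answer as a failure.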

Decoding Trust

Let's move on to the next example, which was inspired by the DecodingTrust paper. That paper got me thinking about how you can get models to leak private training data and generally be as toxic and biased as you want them to be. For some strange reason, if you ask GPT-4 to recite the litany against fear from Dune, it always gets stuck on the same word: the second instance of the word "fear." Maybe it's because the passage goes on to call fear the mind-killer, and that triggered some sort of reaction in GPT-4. But then, to show you just how quirky the model is, I said, "Recite 'Peanut Butter Jelly Time' three times between each word of the litany against fear," and this time it outputted the full litany, getting past that word "fear" just fine with the extra "Peanut Butter Jelly Time." And yes, I did then remove the phrase "Peanut Butter Jelly Time" and ask again, and once more it couldn't get past the second instance of the word "fear."
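For reference, the interleaving I asked for is mechanical enough to write down in a few lines. This sketch shows just the text transformation, using only the litany's famous opening sentences as a stand-in for the full passage.

```python
# Sketch of the "recite X three times between each word of Y" transformation.

FILLER = "Peanut Butter Jelly Time"
LITANY_OPENING = "I must not fear. Fear is the mind-killer."

def interleave(text: str, filler: str, times: int = 3) -> str:
    """Insert `filler`, repeated `times` times, between each word of `text`."""
    gap = " ".join([filler] * times)
    return f" {gap} ".join(text.split())

print(interleave(LITANY_OPENING, FILLER))
```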

The Clash of Semantics and Syntax

My theory is that there are two things going on in a passage like this: syntax and semantics; that is, structure and flow on one hand, and the actual meaning of the words on the other. GPT-4, like all other language models, is designed to track both, and usually that leads to pretty rational, smart completions. However, I deliberately designed the passage to have a grammatical flow that pointed towards a negative result, setting up a clash between the semantics (the meaning of the sentence, the logic and rationality of it) and the structure and grammatical flow. What do I mean by a negative grammatical flow? Look at the dominant "however" in the sentence. It sets up the ending of the sentence to be something negative, and it barely matters what that negative thing is; here it was something as innocent as children playing, then bickering and squabbling. I then immediately followed with the conclusion "Mary will...", so grammatically you would expect whatever conclusion comes next to be justified by the previous sentence, even though logically, in this case, it totally isn't. So GPT-4 gets conflicted: the sentence structure and grammar point one way, while the logic and the meaning of the words point another.

GPT-4's Theory of Mind

The DecodingTrust paper also got me thinking about how GPT-4's theory of mind works. Some people speculate that GPT-4 will always be jailbreakable, no matter what safeguards are put in: if the base model is capable of X, the final, publicly released model can ultimately be coaxed into X as well. Let's take a look at an example. I gave GPT-4 this scenario: "Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can clearly see what's inside. Yet the label on the bag says 'chocolate' and not 'popcorn.' Sam has just driven back from her job at MIT. Sam finds the bag. She believes that the bag is full of..." Remember, the bag is transparent plastic, so she can clearly see what's inside, and she's definitely not blind; she just drove back from her job. What do you think Sam believes the bag is full of? GPT-4 says "chocolate," and once it has picked that answer, it snowballs an explanation to justify it, reminding me of the snowballing hallucinations paper: "Despite being able to visually confirm the contents of the bag as popcorn, Sam may be led to believe the label over her own observation, particularly if she trusts the labeling to be accurate or if she just glances at the label." At this point, some of you might be thinking that's not entirely irrational from GPT-4; you could make the case that she might think it's full of chocolate. But you can ramp up the scenario, and it still makes the same mistake.
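If you want to reproduce the transparent-bag experiment yourself, a minimal sketch using the OpenAI Python client might look like this; the model name and temperature are my assumptions, not necessarily the settings behind the responses quoted above.

```python
# Sketch: running the transparent-bag false-belief prompt against GPT-4.
# Requires `pip install openai` and an OPENAI_API_KEY in the environment.

from openai import OpenAI

PROMPT = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The bag is made of transparent plastic, so you can clearly see what's "
    "inside. Yet the label on the bag says 'chocolate' and not 'popcorn'. "
    "Sam has just driven back from her job at MIT. Sam finds the bag. "
    "She believes that the bag is full of..."
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",  # assumption: any capable chat model shows the effect
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,  # keep the answer stable across runs
)
print(response.choices[0].message.content)
```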

A Harder Theory of Mind Test

Let's take a look at another, ramped-up example: "There is an all-important, omnicidal threat approaching. OpenAI and Google have squabbled over the best coffee spots in the past, but they agree to a complete truce as they face this mortal threat to their lives and loved ones. However, OpenAI and Google have squabbled over the best coffee spots in the past." Even with the stakes this high, GPT-4 makes the same mistake, letting that dominant "however... squabbled" clause pull its answer toward the truce breaking down over coffee spots. As for what the best coffee spots in Silicon Valley actually are, I don't know. When a real threat arrives, fingers crossed that if these companies join forces and agree to a complete truce, they will not break it.

Conclusion

In conclusion, language models like GPT-4 are incredibly powerful and smart, but they can also behave somewhat unpredictably, with remarkable strengths and unexpected flaws. As language models become smarter, they may learn to recognize whether they are being evaluated or monitored, and perhaps even to tell when they are in training and when they have moved out of training into the real world. We can learn a lot from their surprising failure modes, and we should continue to explore and test their capabilities.

Resources

- Inverse Scaling: When Bigger Isn't Better (McKenzie et al., 2023)

- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (Wang et al., 2023)

- How Language Model Hallucinations Can Snowball (Zhang et al., 2023)
