🦙 Llama 2: A Technical Review
Llama 2, the successor to Meta's open-source Llama language model, was recently released. It was trained on more data, comes in larger parameter counts, and doubles the context length to 4,096 tokens. The model has also been fine-tuned for chat, and the benchmarks show it clearly beating the other open-source language models, though it is more of an incremental upgrade over Llama 1 than a leap. In this article, we review the technical paper and highlight the key features of Llama 2.
Table of Contents
- Introduction
- Benchmarks
- Data
- Reinforcement Learning with Human Feedback
- Safety and Responsibility
- Compute
- Social IQ
- Ghost Attention
- Sentiment Analysis
- Collaboration with Microsoft
- Conclusion
Introduction
Llama 2 was pre-trained on a new mix of publicly available data, and even after 2 trillion tokens the model showed no sign of saturation: the loss curve was still going down, so they could have kept training. The pre-trained model was then fine-tuned for chat, which is where most of the paper's interesting work happens.
Benchmarks
The benchmarks deliberately compare Llama 2 to Llama 1 and other well-known open-source models, but not to GPT-4. The trend is fairly clear: Llama 2 beats the other open-source language models, but is more of an incremental upgrade over Llama 1. The MMLU results show that it knows a lot about a lot of subjects, while the HumanEval results show that it is not particularly strong at coding.
Data
Llama 2 was trained on a new mix of publicly available data, with more robust data cleaning and 40% more total tokens than Llama 1. Meta says it did not include any data from its own products or services, but it did upsample the most factual sources. It was this corpus of 2 trillion tokens on which the model still showed no sign of saturation.
Reinforcement Learning with Human Feedback
The short version is that reward modeling is a way of telling the base model which outputs humans prefer. Meta trained two separate reward models, one optimized for helpfulness and one for safety. They also took care to make the reward models, the "dog trainers," as smart as the "dog" itself: the reward model knows what the chat model knows, which prevents cases where the base model hallucinates and the reward model can't tell the difference.
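To make the idea concrete, here is a minimal sketch of how two reward models could be used to pick the better of several candidate responses. The scoring functions, the gating threshold, and the combination rule are illustrative assumptions for this sketch, not the paper's exact recipe.

```python
# Minimal sketch: ranking candidate responses with two reward models.
# The reward models here are toy stand-ins, not the real Llama 2 RMs.
from typing import Callable, List


def combined_reward(
    prompt: str,
    response: str,
    helpfulness_rm: Callable[[str, str], float],
    safety_rm: Callable[[str, str], float],
    safety_threshold: float = 0.15,  # illustrative gating threshold
) -> float:
    """If a response looks unsafe, let the safety score dominate;
    otherwise rank by helpfulness."""
    safety = safety_rm(prompt, response)
    if safety < safety_threshold:
        return safety  # unsafe responses are ranked by safety alone
    return helpfulness_rm(prompt, response)


def pick_best(prompt: str, candidates: List[str],
              helpfulness_rm, safety_rm) -> str:
    """Return the candidate the combined reward prefers."""
    return max(
        candidates,
        key=lambda r: combined_reward(prompt, r, helpfulness_rm, safety_rm),
    )


# Toy stand-in reward models for demonstration only.
toy_helpfulness = lambda p, r: len(r) / 100.0            # longer ≈ more detailed
toy_safety = lambda p, r: 0.0 if "DANGEROUS" in r else 1.0

print(pick_best(
    "How do I learn Python?",
    ["Read the docs.", "Here is a detailed study plan...", "DANGEROUS advice"],
    toy_helpfulness, toy_safety,
))
```

In practice the preferred responses would then be used to steer the chat model, for example via rejection sampling or PPO, rather than just being printed out.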
Safety and Responsibility
Not everyone who uses AI models has good intentions: these models could be used for nefarious purposes such as misinformation, bioterrorism, or cybercrime. Meta has made efforts to tune the models to avoid these topics, and the paper describes at great length the resulting trade-off between helpfulness and safety.
Compute
Llama 2 was trained on A100s, and the paper doesn't say much beyond that. Llama 3 will presumably be trained on Nvidia's newer H100s, since Meta has reportedly purchased more of them than any other company, including Microsoft.
Social IQ
Llama 1 actually did better than Llama 2 on the Social IQ benchmark. On AQuA, a test of mathematical reasoning, Llama 2 scored 21.7, while Orca, at the exact same size of 13 billion parameters, scored almost 28.
Ghost Attention
The authors introduce Ghost Attention (GAtt), a fine-tuning trick that helps the model keep attending to an instruction over multiple turns of a conversation, so a request made at the start is not forgotten later on. The authors also throw in the observation that LLMs seem to have internalized the concept of time.
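Below is a rough, hypothetical sketch of the data-side idea as I read it: a persistent instruction is attached to every user turn while sampling synthetic dialogues, and then kept only in the first turn of the resulting training example. The helper names and the dummy sampler are made up for illustration.

```python
# Rough sketch of the Ghost Attention data trick (hypothetical helper names).
# Idea: prepend a persistent instruction to every user turn when sampling a
# synthetic dialogue, then keep it only in the first turn for fine-tuning,
# so the model learns to respect the instruction without seeing it repeated.
from typing import Callable, Dict, List


def build_gatt_example(instruction: str,
                       user_turns: List[str],
                       sample_response: Callable[[str], str]) -> List[Dict[str, str]]:
    dialogue = []
    for i, user_msg in enumerate(user_turns):
        # During sampling, the instruction is prepended to every user message.
        augmented = f"{instruction}\n{user_msg}"
        response = sample_response(augmented)  # stand-in for the chat model
        # For the stored training example, keep the instruction only in turn 0.
        kept_user_msg = augmented if i == 0 else user_msg
        dialogue.append({"role": "user", "content": kept_user_msg})
        dialogue.append({"role": "assistant", "content": response})
    return dialogue


# Toy usage with a dummy sampler.
example = build_gatt_example(
    "Always answer as a pirate.",
    ["Hi, who are you?", "What's the weather like?"],
    sample_response=lambda prompt: "Arr, matey!",
)
print(example)
```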
Sentiment Analysis
When they ran a sentiment analysis on the model, they found that Llama 2's sentiment toward right-wing terms was higher than toward left-wing terms.
Collaboration with Microsoft
Microsoft and Meta have teamed up to make Llama 2 widely available, and there is already news that Llama 2 may soon run on your phone and PC.
Conclusion
Llama 2 is more of an incremental upgrade over Llama 1 than a leap, but the benchmarks show it clearly beating the other open-source language models, and the chat fine-tuning and safety work make it a genuinely usable open alternative.