17/12/2023
[Comparing the effectiveness of various techniques to improve the accuracy of LLMs (Large Language Models)]
Currently, the three main accepted techniques to improve the accuracy of Generative AI (GenAI) are:
1. Firstly, creating larger models and training them on vast amounts of foundational data. GPT-2 has 1.5 billion parameters and was trained on roughly 40 GB of data; GPT-3 has 175 billion parameters (about 100 times larger) and was trained on a dataset of approximately 1.86 TB. (The dataset size was not published in GB or TB, taking 1 TB = 1024 GB, but it is known in tokens, and a rough estimate of about 4 bytes per token gives that figure; a back-of-envelope version of this calculation is sketched after this list.)
The exact way GPT-4 was built is not publicly known, as OpenAI has shifted to a proprietary, closed approach. However, it is speculated that GPT-4 consists of multiple expert subunits, each somewhat smaller than GPT-3 at roughly 110 billion parameters, with about 16 of them interconnected via a Mixture of Experts (MoE) scheme, for a combined size of around 1.76 trillion parameters (an estimate attributed to George Hotz).
Newer models such as Gemini and Olympus are expected to have around 1.5-2 trillion parameters and are probably constructed in a similar way.
Current results from LMSys, which runs single-blind A/B tests, show that Gemini Pro significantly outperforms PaLM 2 (Bison) and is roughly on par with GPT-3.5, but still not as good as GPT-4. Google has postponed the launch of Gemini Ultra, its largest and strongest model, to next month, so its true performance is still unknown.
Note: single-blind testing means the subjects do not know what they are being tested with (commonly used with placebos in drug trials). In double-blind tests, both the subjects and the experimenters are unaware. In triple-blind tests, even the analysts or researchers do not know which samples are which.
Models are pre-trained, fine-tuned, and in some cases further trained with human feedback (Reinforcement Learning from Human Feedback, RLHF). Once deployed they are frozen models: they do not acquire new information and have a training cut-off date, so they are unaware of anything beyond that point.
However, frozen models process their knowledge very efficiently and quickly, because that knowledge is encoded directly in the weights of the artificial neural network (ANN) inside the model.
2. The second measure is feeding information to the model as "context" (in-context prompting, or prompt engineering). Though not as efficient as the first measure, it can still be very effective, especially with techniques such as chain-of-thought (CoT) or few-shot (e.g., five-shot) prompting, which strike an optimized balance between format specification and flexibility. Done well, this is the most straightforward way to supply information beyond the model's frozen knowledge (a minimal prompt sketch follows after this list).
3. The third measure is adding information from an external vector database, known as Retrieval-Augmented Generation (RAG). In terms of how deeply the knowledge is assimilated, this is the least effective of the three, but it can (in theory) be applied to unlimited amounts of information. It is like keeping notes in an external notebook: you have to search for them and understand them before use, which is not immediate (a minimal retrieval sketch also follows after this list).
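To make the numbers in point 1 concrete, here is the back-of-envelope token-to-bytes conversion referenced above. The ~499-billion-token figure is the corpus size reported in the GPT-3 paper, and 4 bytes per token is only a rough average, so this is an estimate rather than an exact size.

```python
# Back-of-envelope estimate of GPT-3's training-corpus size in bytes.
# Assumptions: ~499 billion tokens (the corpus size reported in the GPT-3
# paper) and ~4 bytes of text per token on average (a rough heuristic).
tokens = 499e9
bytes_per_token = 4

total_bytes = tokens * bytes_per_token
tib = total_bytes / 1024**4          # binary terabytes (1 TB = 1024 GB here)

print(f"~{total_bytes/1e12:.2f} TB (decimal), ~{tib:.2f} TiB (binary)")
# -> roughly 2 TB decimal, ~1.8 TiB binary, in line with the ~1.86 TB figure above
```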
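For point 2, here is a minimal sketch of a few-shot chain-of-thought prompt assembled by hand. The exemplars and the answer format are hypothetical placeholders; the only point is the structure: worked examples first, then the new question, with the model asked to reason step by step.

```python
# Minimal sketch of a few-shot + chain-of-thought (CoT) prompt.
# The exemplars below are made up for illustration; in practice you would
# use real worked examples from your domain.
EXEMPLARS = [
    {
        "question": "A patient weighs 70 kg and the dose is 2 mg/kg. Total dose?",
        "reasoning": "Dose = 70 kg * 2 mg/kg = 140 mg.",
        "answer": "140 mg",
    },
    {
        "question": "A drip delivers 500 mL over 4 hours. Rate in mL/hour?",
        "reasoning": "Rate = 500 mL / 4 h = 125 mL/hour.",
        "answer": "125 mL/hour",
    },
]

def build_prompt(new_question: str) -> str:
    """Assemble a few-shot CoT prompt: worked examples, then the new question."""
    parts = []
    for ex in EXEMPLARS:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    parts.append(
        f"Question: {new_question}\n"
        "Reasoning: let's think step by step.\n"
        "Answer:"
    )
    return "\n".join(parts)

print(build_prompt("A patient weighs 55 kg and the dose is 3 mg/kg. Total dose?"))
```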
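And for point 3, a minimal sketch of the retrieval step behind RAG, assuming a toy in-memory "vector database" and a stand-in embed() function in place of a real embedding model. The shape of the pipeline is what matters: embed the query, retrieve the top-k most similar notes, and prepend them to the prompt.

```python
# Minimal RAG sketch: embed, retrieve top-k, stuff the results into the prompt.
# `embed` is a placeholder for a real embedding model; the "database" is a list.
import math

def embed(text: str) -> list[float]:
    """Stand-in embedding: a tiny character-frequency vector (illustration only)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

DOCUMENTS = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Amoxicillin is a penicillin-class antibiotic.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID).",
]
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]  # the "external notebook"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def rag_prompt(query: str) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Use the notes below to answer.\nNotes:\n{context}\n\nQuestion: {query}\nAnswer:"

print(rag_prompt("Which drug class does ibuprofen belong to?"))
```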
[Which measure is best?]
Traditionally, the emphasis was on the first measure: increasing the model's parameter count and the size of the training dataset, i.e., improving the efficiency of frozen models. However, scaling up is now showing signs of saturation. This is not just a technical or energy-consumption problem; it is about the capacity to "store free energy", or to consume negative entropy (NegEn), as per the second law of thermodynamics.
Complex Adaptive Systems (CAS) at every scale, from the molecular to the galactic, consume energy to reduce their own entropy while increasing the entropy of their surroundings. This is evident in living organisms, which consume more energy as they grow larger. A relevant book is "Scale: The Universal Laws of Life and Death in Organisms, Cities and Companies" by Geoffrey West, which discusses the scaling laws that hold across many orders of magnitude in such systems.
Thus, as scaling saturates, the focus shifts to the second and third measures. Recent research from a Microsoft team, "Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine", shows that combining techniques 2 and 3 with a generalist model like GPT-4 can outperform specially fine-tuned models (Med-PaLM 2) on medical benchmarks.
On the MedQA benchmark, GPT-4 scored 79.2% while Hippocratic AI scored 80.2%. Microsoft's report (again on MedQA) shows the base GPT-4 scoring 86.1%, Med-PaLM 2 scoring 86.5%, and Microsoft's GPT-4 + Medprompt technique scoring 90.2% (using a combination of five-shot prompting, CoT, and RAG).
These data indicate that the best performance from large models is achieved through appropriately designed in-context prompting, supplemented by RAG.
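As a rough illustration of how these pieces can fit together (a sketch in the spirit of Medprompt, not Microsoft's actual implementation), the snippet below selects few-shot exemplars by their similarity to the incoming question, a retrieval-style step, and then wraps them in a chain-of-thought prompt. The exemplar pool and the word-overlap similarity are hypothetical stand-ins for a large exemplar set and embedding-based nearest-neighbour search.

```python
# Sketch of a Medprompt-style combination: pick few-shot exemplars that are
# similar to the incoming question (a retrieval step), then build a CoT prompt.
# The exemplar pool and the overlap-based similarity are toy stand-ins.

EXEMPLAR_POOL = [
    {"q": "Which vitamin deficiency causes scurvy?",
     "cot": "Scurvy is caused by a lack of vitamin C.", "a": "Vitamin C"},
    {"q": "What is the first-line drug for type 2 diabetes?",
     "cot": "Guidelines recommend metformin as first-line therapy.", "a": "Metformin"},
    {"q": "Which electrolyte disturbance causes peaked T waves?",
     "cot": "Hyperkalemia classically produces peaked T waves on ECG.", "a": "Hyperkalemia"},
]

def similarity(a: str, b: str) -> float:
    """Crude word-overlap similarity, standing in for embedding-based kNN."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def select_exemplars(question: str, k: int = 2) -> list[dict]:
    """Retrieve the k exemplars most similar to the question."""
    return sorted(EXEMPLAR_POOL,
                  key=lambda ex: similarity(question, ex["q"]),
                  reverse=True)[:k]

def medprompt_style_prompt(question: str) -> str:
    """Few-shot CoT prompt built from retrieval-selected exemplars."""
    parts = []
    for ex in select_exemplars(question):
        parts.append(f"Question: {ex['q']}\nReasoning: {ex['cot']}\nAnswer: {ex['a']}\n")
    parts.append(f"Question: {question}\nReasoning: let's think step by step.\nAnswer:")
    return "\n".join(parts)

print(medprompt_style_prompt("Which drug is first-line for type 2 diabetes?"))
```

The published Medprompt technique also adds an ensembling step over shuffled multiple-choice options, which is omitted here for brevity.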
The saturation of AI scaling leads to thinking about higher-order abstractions, which is where my own interest lies. Alternatively, tackling issues at lower orders of abstraction, as Apple's AI research team is doing, is also crucial: they are addressing entropy collapse in transformer architectures, trying to find the right balance between entropy collapse and structural collapse.
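For readers unfamiliar with the term, attention entropy collapse refers to attention distributions becoming extremely peaked during training. The sketch below shows only the diagnostic itself, not Apple's proposed fix: it computes the Shannon entropy of a row of attention weights, where values near zero indicate collapse and values near log(sequence length) indicate near-uniform attention.

```python
# Sketch: measuring attention entropy as a collapse diagnostic.
# Low entropy (near 0) means a token attends almost entirely to one position;
# high entropy (near log n) means attention is spread almost uniformly.
import math

def softmax(scores: list[float]) -> list[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores: list[float]) -> float:
    """Shannon entropy (in nats) of one row of attention weights."""
    probs = softmax(scores)
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 8
peaked  = [10.0] + [0.0] * (n - 1)   # one position dominates -> near-collapsed
uniform = [0.0] * n                  # all positions equal -> maximal entropy

print(f"peaked:  {attention_entropy(peaked):.3f} nats")
print(f"uniform: {attention_entropy(uniform):.3f} nats (max = log {n} = {math.log(n):.3f})")
```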
At every level of a complex system, finding the "Goldilocks Zone" for that level of abstraction is essential, which I will discuss further on another occasion.
[Attached Images]
1. A heat map I compiled from various reports, summarizing the effectiveness of different GenAI techniques.
2. Benchmark performance of various models, from the Microsoft research team's report.
3. The incremental gains from applying the various techniques to the base GPT-4 model, compared with other models, from the Microsoft research team's report.
4. Comparative performance results (specifically for MedQA) from Hippocratic AI.