Demystifying Mixture of Experts in AI: Beyond the Hype

AndReda Mind
5 min read · Oct 17, 2024


When it comes to advanced AI models like GPT-4, there’s a lot of buzz, and sometimes confusion, about the technologies that power them. One such topic is the “mixture of experts.” Contrary to popular belief, this technique doesn’t involve multiple specialized models, each an expert in its own field.
Instead, it’s a clever engineering approach that enhances the efficiency and performance of a single, massive model. Let’s break down what mixture of experts really means and how it plays a role in cutting-edge AI models.

The Scale of Modern AI Models

First, let’s grasp the sheer size we’re dealing with. GPT-4 is rumored to have a staggering 1.8 trillion parameters. To put that into perspective:

  • 1.8 trillion parameters equal 1,800 billion, or 1.8 million million.
  • If a person were to process each parameter at a rate of one per second, it would take 57,000 years to handle them all.
  • Even if all 8 billion people on Earth worked together, each processing one parameter per second, the job would still take about 225 seconds, just under four minutes (a quick back-of-envelope check follows this list).
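For readers who want to verify these comparisons, here is a quick back-of-envelope check in plain Python, assuming the rumored 1.8-trillion-parameter figure:

```python
# Back-of-envelope check of the comparisons above.
# Assumes the rumored figure of 1.8 trillion parameters for GPT-4.

TOTAL_PARAMS = 1.8e12                      # 1.8 trillion parameters
SECONDS_PER_YEAR = 365 * 24 * 3600
WORLD_POPULATION = 8e9                     # roughly 8 billion people

# One person, one parameter per second:
years_alone = TOTAL_PARAMS / SECONDS_PER_YEAR
print(f"One person: ~{years_alone:,.0f} years")          # ~57,000 years

# Everyone on Earth, one parameter per second each:
seconds_together = TOTAL_PARAMS / WORLD_POPULATION
print(f"All of humanity: ~{seconds_together:.0f} seconds "
      f"(~{seconds_together / 60:.1f} minutes)")         # ~225 s, under 4 minutes
```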

Yet, transformer models like GPT-4 perform these calculations in milliseconds. How is this possible? The secret lies in sophisticated engineering techniques, including the mixture of experts.

What Is a Mixture of Experts?

Despite the name, a mixture of experts doesn’t mean having multiple specialized AI models each handling different tasks. Instead, it involves a single model that contains multiple smaller components, known as “experts,” within its architecture. Let’s take a closer look using Mixtral 8x7B, a GPT-4-style model developed by the French startup Mistral AI.

Understanding the Basics

A transformer model like GPT-4 or Mixtral works by predicting the next word in a sentence, one word at a time. Here’s a simplified breakdown of the process:

  1. Embeddings: The input text is converted into numerical representations called embeddings. Think of these as large lists of numbers that capture various attributes of each word or token, such as its meaning, position in the sentence, and more.
  2. Transformer Blocks: These embeddings pass through multiple layers of transformer blocks, each containing two main components:

    - Attention Mechanism: Helps the model understand the context by determining how different words in the sentence relate to each other.

    - Feedforward Networks: Process each token individually to refine its representation before passing it to the next layer (a minimal code sketch of such a block follows below).
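To make those two components concrete, here is a minimal PyTorch-style sketch of a single transformer block. It is a toy illustration under simplified assumptions, not the actual code of GPT-4 or Mixtral; the class name and dimensions are invented for the example.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Toy transformer block: attention followed by a feedforward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                  # the feedforward network a MoE later replaces
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq_len, d_model) embeddings
        attn_out, _ = self.attn(x, x, x)          # each token looks at the others for context
        x = self.norm1(x + attn_out)              # residual connection + normalization
        x = self.norm2(x + self.ff(x))            # feedforward refines each token independently
        return x

tokens = torch.randn(1, 10, 512)                  # a batch with 10 token embeddings
print(TransformerBlock()(tokens).shape)           # torch.Size([1, 10, 512])
```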

Enter the Mixture of Experts


The mixture of experts enhances this process by introducing multiple feedforward networks within each transformer block. Here’s how it works:

  • Multiple Experts: Instead of a single feedforward network, each block contains several of them in parallel (e.g., 8 in Mixtral 8x7B). These are not separate models but different parts of the same model architecture.
  • Router or Gating Network: This mini-network decides which experts process each token. For efficiency, only a few experts (e.g., 2 out of 8) are activated for any given token, a strategy known as a sparse mixture of experts (sketched in code below).
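Here is a minimal sketch of that routing idea in the same PyTorch style: a small router scores the experts for each token, only the top 2 run, and their outputs are combined using the router’s weights. The class, names, and dimensions are again invented for illustration, and Mixtral’s real implementation differs in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Replaces one feedforward network with several experts plus a router (illustrative)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)      # gating network: one score per expert
        self.top_k = top_k

    def forward(self, x):                                # x: (n_tokens, d_model)
        scores = self.router(x)                          # (n_tokens, n_experts)
        weights, picked = scores.topk(self.top_k, dim=-1)  # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)             # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                   # only the selected experts ever run
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(SparseMoELayer()(tokens).shape)                    # torch.Size([10, 512])
```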

Why Use Multiple Experts?

  • Efficiency: Because only a few experts run for each token, the model can hold far more parameters without a proportional increase in computation per token.
  • Scalability: It allows the model to scale up its capabilities without becoming prohibitively slow or resource-intensive.

A Hospital Analogy

Imagine a hospital with various specialized departments (the Experts). When a patient arrives (a Token), the reception (Router) directs them to the appropriate department based on their symptoms (the token’s attributes).

Not every department is involved in every case, just as not all experts are used for every token. This ensures that resources are used efficiently, and patients receive the specialized care they need without unnecessary delays.


The Reality of Experts

Interestingly, in models like Mixtral, these experts aren’t specialized in distinct areas like math or language. Instead, the router tends to spread tokens across the experts in a fairly balanced way, with little evidence of topic-level specialization. So while having multiple experts increases the total number of parameters, it doesn’t necessarily lead to specialized handling of different types of data. The primary benefit is the efficient use of computational resources rather than targeted expertise.

The Origins and Evolution

The mixture of experts approach isn’t new. The core idea goes back to research on adaptive mixtures of local experts in the early 1990s, was revisited for deep learning in a 2013 paper, and has since been scaled up for use in modern transformer models. OpenAI and other AI labs are widely believed to have built on this foundational idea, integrating it into large-scale models to enhance their performance and efficiency.

Breaking Down the Numbers

Let’s revisit the numbers to understand the impact:

  • The name Mixtral 8x7B suggests 8 experts of 7 billion parameters each, or 56 billion parameters in total.
  • In practice the experts share the attention layers and embeddings, so the model actually has about 47 billion parameters, and because only 2 experts are active at any time, only about 13 billion of them are used to process a given token.
  • This means that despite the large total parameter count, the active computation remains manageable, which is what keeps per-token processing in the range of milliseconds (the short calculation below spells this out).
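The same bookkeeping in a few lines of Python, using the totals Mistral has reported for Mixtral; the figures are rounded and the split between shared and expert parameters is simplified for illustration:

```python
# Rough parameter bookkeeping for Mixtral 8x7B. The "7B" describes one
# expert path through the network, not eight independent 7B models.

n_experts, active_experts = 8, 2
per_expert_path = 7e9                                  # parameters along one expert path
naive_total = n_experts * per_expert_path
print(f"Naive reading of the name: {naive_total / 1e9:.0f}B parameters")   # 56B

# Attention layers and embeddings are shared by all experts, so the
# totals reported by Mistral are smaller:
reported_total, reported_active = 46.7e9, 12.9e9
print(f"Actual total:                ~{reported_total / 1e9:.1f}B")
print(f"Active per token (2 experts): ~{reported_active / 1e9:.1f}B")
```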

Conclusion

The mixture of experts is a powerful technique that allows AI models to handle vast numbers of parameters efficiently. By incorporating multiple feedforward networks and intelligently routing tokens to the appropriate experts, models like GPT-4 and Mixtral achieve remarkable performance without sacrificing speed. While the term “mixture of experts” might suggest specialized models working in tandem, the reality is a sophisticated integration of multiple components within a single model architecture. Understanding this helps demystify some of the complexities behind today’s most advanced AI systems and showcases the innovative engineering that drives their capabilities.

Thank you for diving into the intricacies of the mixture of experts with us. Stay tuned for more insights and explanations on the fascinating world of AI.
