
Large language models (LLMs) have captured the hearts and minds of AI practitioners, enterprises, and even individuals. The prevailing trend is to scale these models for much greater performance while keeping costs roughly stable; it turns out that an LLM's performance correlates positively with model size. This scaling demands immense computational resources, and as you might guess, the bigger the model, the higher the costs.

LLM performance based on parameter count: Source

This is one of the biggest challenges in the industry. While Mixture of Experts (MoE) has recently been hyped for improving transformer models, there is a newer approach that ML practitioners find even more promising – Mixture of Tokens (MoT). Drawbacks that MoEs exhibited across different models created a need for alternative methods. In this blog post, we'll walk through these techniques and examine how MoTs scale large language models while keeping training and inference costs stable.

Mixture of Experts

Mixture of Experts gained fame for dramatically improving transformers' scalability. To understand how, let's first clarify what those 'experts' are. In MoEs, experts are models specialized for one or several tasks. In a standard transformer, every token is processed by the same feed-forward layer. MoEs instead direct each token to a pool of experts via a small routing network called the controller, which ensures that each token is processed by only a small subset of experts. Several refinements of this technique were later introduced, most notably switch and expert choice.
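To make the routing concrete, here is a minimal NumPy sketch of an MoE layer in which a controller scores experts per token and each token is processed by its top-k experts. The dimensions, controller, and expert weights are all illustrative toy values, not any production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens, top_k = 8, 4, 6, 2

# Hypothetical weights: a linear controller plus one linear layer per expert.
W_ctrl = rng.normal(size=(d_model, n_experts))
W_experts = rng.normal(size=(n_experts, d_model, d_model)) * 0.1

def moe_layer(x):
    """Route each token to its top-k experts, weighted by controller scores."""
    scores = x @ W_ctrl                              # (n_tokens, n_experts)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax over experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(probs[t])[-top_k:]       # top-k experts for this token
        for e in chosen:
            out[t] += probs[t, e] * (x[t] @ W_experts[e])
    return out

x = rng.normal(size=(n_tokens, d_model))
y = moe_layer(x)
print(y.shape)  # (6, 8)
```

The key property: each token touches only `top_k` of the `n_experts` expert layers, so parameter count can grow without a proportional increase in per-token compute.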

The switch transformer sends each token to exactly one expert – the one with the highest score produced by the controller. This technique yields a dramatic efficiency gain: a 1.6T-parameter model (T5 architecture) runs at the FLOPS cost of an equivalent 1.4B-parameter vanilla Transformer.

Expert choice offers a slightly different approach. Instead of having tokens select the top-k experts, the experts themselves choose the top-k tokens. This method guarantees even load balancing (each expert receives the same number of tokens) and achieves substantial gains in training efficiency and downstream performance. However, there's a risk that some tokens won't be chosen at all.
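The difference is easy to see in a small sketch: with expert choice, each expert picks a fixed number of tokens, so load balance holds by construction, but token coverage is not guaranteed. The scores and sizes below are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_experts, capacity = 8, 3, 4  # each expert takes exactly `capacity` tokens

scores = rng.normal(size=(n_tokens, n_experts))  # controller affinity scores

# Expert choice: each expert independently selects its top-`capacity` tokens,
# so the load is perfectly balanced by construction.
assignments = {e: np.argsort(scores[:, e])[-capacity:].tolist()
               for e in range(n_experts)}

loads = [len(toks) for toks in assignments.values()]
print(loads)  # [4, 4, 4] -- every expert processes the same number of tokens

# Risk: a token may appear in no expert's top list and get dropped entirely.
covered = {t for toks in assignments.values() for t in toks}
dropped = sorted(set(range(n_tokens)) - covered)
print("dropped tokens:", dropped)
```

Whether `dropped` is empty depends on the scores; nothing in the routing rule prevents a token from being skipped by every expert.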

MoE approaches, from left to right: standard feed-forward, switch, expert choice

Limitations of current approaches

While the performance of the huge-parameter-count MoE architectures is impressive, they come with a new set of challenges during both training and inference. The most notable ones are:

Training instability: These methods choose and match experts to tokens discretely. That means small changes in controller weights can have disproportionate effects on controller decisions.

Load imbalance: The routing network's choices are not effectively constrained, so token-expert assignments cannot be balanced efficiently. As a result, some tokens are left without any expert to process them (token dropping), while most tokens are assigned to only a few experts (model collapse).

Information leak: Some successful MoE methods process tokens from different positions in a sequence together (i.e., by comparing scores of all tokens in a batch). This imposes an intra-sequence information leak and hinders their utility in autoregressive decoding.

Knowledge hybridity: Experts in conventional MoE architectures often accumulate a wide array of knowledge due to a limited number of experts. This broad knowledge base dilutes the specialization and effectiveness of individual experts.

Knowledge redundancy: There is a tendency for multiple experts to converge in learning similar information, resulting in overlapping knowledge domains and inefficient use of model parameters.

Addressing MoE limitations

In their recent paper, scientists from Cohere AI discussed ways to tackle one of the main MoE challenges – storing all the experts in memory. They propose an extremely parameter-efficient MoE by uniquely combining MoE architecture with lightweight experts. Their MoE architecture outperforms standard PEFT methods and is on par with full fine-tuning by only updating the lightweight experts – less than 1% of an 11B parameters model.

A recent paper discusses the last two limitations of MoEs and suggests a new technique to address them – DeepSeekMoE. This new MoE architecture aims to enhance expert specialization through two key strategies: fine-grained expert segmentation and shared expert isolation.

Fine-grained expert segmentation involves subdividing the FFN intermediate hidden dimension, allowing for a more nuanced distribution of knowledge among fine-grained experts. This segmentation enables each expert to focus on more specific knowledge areas, thereby achieving a higher level of specialization while keeping a constant computational cost.

Concurrently, the shared expert isolation strategy designates specific experts as 'shared,' responsible for capturing common knowledge across various contexts. By concentrating general knowledge on these shared experts, we reduce redundancy in the learning process of other experts. This approach improves parameter efficiency and ensures that each expert remains focused on unique and distinct knowledge areas.
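The two strategies combine naturally in one layer: many narrow routed experts (fine-grained segmentation) plus a few always-active shared experts. The sketch below is a toy NumPy illustration under assumed sizes; the expert counts, hidden width, and weights are hypothetical, not DeepSeekMoE's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 8
n_fine = 8       # fine-grained routed experts (hidden dim split m-fold)
n_shared = 2     # always-active shared experts for common knowledge
top_k = 4        # routed experts activated per token

W_ctrl = rng.normal(size=(d_model, n_fine))
# Each fine-grained expert gets a narrower hidden layer, so total
# parameters and per-token compute can stay roughly constant.
d_hidden = d_model // 2
W_in  = rng.normal(size=(n_fine + n_shared, d_model, d_hidden)) * 0.1
W_out = rng.normal(size=(n_fine + n_shared, d_hidden, d_model)) * 0.1

def expert(e, x):
    return np.maximum(x @ W_in[e], 0) @ W_out[e]     # tiny ReLU FFN

def deepseek_style_moe(x):
    out = np.zeros_like(x)
    # Shared experts: always applied, so common knowledge is concentrated here.
    for e in range(n_fine, n_fine + n_shared):
        out += expert(e, x)
    # Routed experts: softmax gate over fine-grained experts, keep top-k.
    scores = x @ W_ctrl
    probs = np.exp(scores - scores.max()); probs /= probs.sum()
    for e in np.argsort(probs)[-top_k:]:
        out += probs[e] * expert(e, x)
    return out

y = deepseek_style_moe(rng.normal(size=d_model))
print(y.shape)  # (8,)
```

Because the shared experts handle common patterns unconditionally, the routed experts are free to specialize on distinct knowledge areas.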

DeepSeekMoE. Across these three architectures, the number of expert parameters and computational costs remain constant: Source

DeepSeekMoE is scaled to train a 16B model, and with only about 40% of computations, it achieves comparable performance with DeepSeek 7B and LLaMA2 7B. Researchers also plan to scale up DeepSeekMoE to 145B, highlighting its advantages over the GShard architecture and showing a comparable performance with DeepSeek 67B.

Mixture of Tokens

Several MoE drawbacks led to the rise of Mixture of Tokens (MoTs). This modified approach solves many of the problems posed by the methods discussed above. Instead of routing tokens to experts, MoT mixes tokens from different examples before feeding them to the experts. This allows the model to learn from all token-expert combinations, improving training stability and expert utilization. After the experts process the mixtures, each result is redistributed back to the original tokens.

How is Mixture of Tokens performed? First, the controller assigns importance weights to each token, followed by a softmax over the resulting token scores – so token weights are computed independently for each expert. Finally, each token is multiplied by its importance weight, and the weighted tokens are summed together.
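The mix-process-redistribute steps above can be sketched in a few lines of NumPy. The sizes and weights are illustrative assumptions; the point is that every operation is a smooth weighted sum, with no discrete top-k selection anywhere.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, n_tokens, n_experts = 8, 6, 3

W_ctrl = rng.normal(size=(d_model, n_experts))
W_experts = rng.normal(size=(n_experts, d_model, d_model)) * 0.1

def mot_layer(x):
    """Mix tokens per expert, process each mixture, then redistribute
    the result back to the original tokens using the same weights."""
    scores = x @ W_ctrl                              # (n_tokens, n_experts)
    # Softmax over tokens: importance weights, computed per expert,
    # so each expert's mixing weights sum to 1 across the token group.
    w = np.exp(scores - scores.max(0, keepdims=True))
    w /= w.sum(0, keepdims=True)
    out = np.zeros_like(x)
    for e in range(n_experts):
        mixture = w[:, e] @ x                        # weighted sum of tokens
        processed = mixture @ W_experts[e]           # expert handles one mixture
        out += np.outer(w[:, e], processed)          # redistribute to tokens
    return out

y = mot_layer(rng.normal(size=(n_tokens, d_model)))
print(y.shape)  # (6, 8)
```

Since the mixing weights enter the computation multiplicatively rather than through an argmax, gradients flow through them end to end – which is exactly the differentiability advantage discussed below.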

Mixture of Tokens: Tokens are mixed uniquely for each expert (mixing weights are decided by the controller, omitted here for simplicity), then each mixture is processed and redistributed back to the original tokens (using the same weights as before).

MoT addresses the problems with MoE models by making the following changes:

  1. Mixes tokens from different examples before feeding them to experts; this improves training stability and expert utilization by allowing the models to learn from ALL token-expert combinations.
  2. Mixture of Tokens is a fully differentiable model, meaning it can be trained using standard gradient-based methods. This avoids the need for auxiliary losses or other difficult-to-train techniques, making the model easier to train and deploy.
MoEs vs MoTs: In Mixture of Experts (left), each token is routed to a different expert feed-forward layer. In Mixture of Tokens (right), tokens within each group are mixed, and the mixed token is processed by an expert feed-forward layer.


Mixture of Tokens has the potential to significantly improve the performance and efficiency of LLMs. It has already demonstrated a 3x decrease in training time compared to a vanilla Transformer, and we anticipate that MoTs will continue to yield even more significant improvements in the future.

MoTs reach the dense vanilla Transformer's final training loss in just 1/4 of the steps and 1/3 of the training time, and are expected to improve further: Source
Disclaimer: This post is informed by research from the scholarly article “Mixture of Tokens: Efficient LLMs through cross-example aggregation,” authored by multiple contributors.
