Mistral AI, an emerging leader in the AI industry, has just announced the release of Mixtral 8x7B, a cutting-edge sparse mixture-of-experts (SMoE) model with open weights. This new model is a significant leap forward, outperforming Llama 2 70B on most benchmarks while offering 6x faster inference. Licensed under the open and permissive Apache 2.0, Mixtral 8x7B stands as the most robust open-weight model available. It's setting new standards in cost-performance efficiency, rivaling and sometimes outdoing GPT-3.5 on mainstream benchmarks.
Mixtral 8x7B marked a major milestone for Mistral AI as the company closed a €400 million Series A funding round. The investment brings the company's valuation to an impressive $2 billion, signaling a solid entry into the competitive AI landscape. The round, led by the renowned Andreessen Horowitz, also saw participation from Lightspeed Venture Partners and many other prominent investors, including Salesforce and BNP Paribas.
The company, co-founded by alumni of Google's DeepMind and Meta, focuses on foundational AI models with a distinct approach, emphasizing open technology. This strategy positions Mistral AI as a potential European counterpart to established players like OpenAI.
The three Mistrals
Mistral-tiny and mistral-small currently serve Mistral's two released open models; the third endpoint, mistral-medium, serves a higher-performing prototype model that Mistral is testing in a deployed setting.
Mistral-tiny: Mistral's most cost-effective endpoint. It currently serves Mistral 7B Instruct v0.2, a new minor release of Mistral 7B Instruct. Mistral-tiny works only in English and scores 7.6 on MT-Bench. The instruct model is available for download.
Mistral-small: Serves Mixtral 8x7B, masters English/French/Italian/German/Spanish and code, and obtains 8.3 on the MT-Bench.
Mistral-medium: Serves a prototype model that improves on Mixtral 8x7B and is available to alpha users of the API. It masters English/French/Italian/German/Spanish, is good at coding, and scores 8.6 on MT-Bench, putting it very close to GPT-4 and above all other models tested.
So far, Mistral has published technical insights only for its 7B and 8x7B models.
We'll dive deeper into the 8x7B model very soon, but first, let's explore Mistral 7B.
Mistral AI's initial model, Mistral 7B, diverged from competing directly with larger models like GPT-4. Instead, it is a much smaller model (7 billion parameters), offering a unique alternative in the AI model landscape. In a move to emphasize accessibility, Mistral AI has made this model available for free download, allowing developers to run it on their own systems. Mistral 7B is a small language model that costs considerably less to run than models like GPT-4. Though GPT-4 can do much more than such small models, it is more expensive and complex to operate.
Here are the top things to know about Mixtral:
- It handles 32k-token context.
- It runs English, French, Italian, German and Spanish.
- It's good at coding.
- It can be fine-tuned into an instruction-following model, achieving an 8.3 score on the MT-Bench.
The model is compatible with existing optimization tools such as Flash Attention 2, bitsandbytes, and PEFT libraries. The checkpoints are released under the mistralai organization on the Hugging Face Hub.
How Mixtral 8x7B works
Mixtral employs a sparse mixture-of-experts (MoE) architecture.
The image below illustrates a setup where each token is processed by a specific expert, with a total of four experts involved. Mixtral 8x7B is more complex: it features 8 experts and uses 2 of them for each token. At each layer, and for every token, a specialized router network selects 2 of the 8 experts to process the token, and their outputs are added together.
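The top-2 routing just described can be sketched in a few lines of NumPy. This is a minimal illustration, not Mixtral's implementation: the single weight matrix per "expert" is a toy stand-in for a full feed-forward block, and all shapes are arbitrary.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=2):
    """One sparse MoE feed-forward step for a single token.

    x: (hidden,) hidden state of one token.
    expert_weights: list of (hidden, hidden) matrices -- toy stand-ins
        for each expert's full feed-forward block.
    router_weights: (hidden, n_experts) router projection.
    """
    logits = x @ router_weights                 # one score per expert
    top = np.argsort(logits)[-top_k:]           # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax over the selected experts only
    # additive (weighted) combination of the chosen experts' outputs
    return sum(g * (x @ expert_weights[e]) for g, e in zip(gates, top))

# Toy usage: 8 experts, hidden size 16, matching the routing scheme above.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
experts = [rng.standard_normal((16, 16)) * 0.1 for _ in range(8)]
router = rng.standard_normal((16, 8))
y = moe_layer(x, experts, router)               # (16,) output for this token
```

Only the two selected experts' matrices are ever multiplied against the token, which is exactly where the compute savings come from.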
So, why use MoEs? Naively, combining all 8 experts in the Mixtral model, each sized like a 7B model, would give a total parameter count close to 56B. In practice the figure is lower, because the MoE method is applied only to the feed-forward (MoE) layers, while the self-attention weight matrices are shared across experts. The actual total is therefore around 47B parameters.
The key advantage lies in how the router functions. It directs tokens so that, at any given time during the forward pass, only about 13B parameters are engaged, not the full 47B. Each token is processed by only two experts out of 8 at every layer. However, the experts can be different ones at different layers, enabling more complex processing paths. This selective engagement of parameters makes both training and, more importantly, inference significantly faster than in comparable dense (non-MoE) models. This efficiency is a primary reason for opting for an MoE-based approach in models like Mixtral.
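The total and active figures can be roughly reproduced with back-of-the-envelope arithmetic. The dimensions below (hidden size 4096, 32 layers, FFN size 14336, 8 experts, grouped-query attention with 8 KV heads, 32k vocabulary) are taken from Mixtral's released configuration; treat them as reported values, and the breakdown itself as an approximation that ignores small terms like layer norms.

```python
# Rough parameter count for Mixtral 8x7B from its published config values.
hidden, layers, ffn, n_experts, top_k = 4096, 32, 14336, 8, 2
kv_heads, head_dim, vocab = 8, 128, 32000

# SwiGLU feed-forward: gate, up, and down projections = 3 matrices per expert
ffn_per_expert = 3 * hidden * ffn
experts_total = layers * n_experts * ffn_per_expert

# Grouped-query attention: full-size Q and O projections, smaller K and V
attn_per_layer = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
attn_total = layers * attn_per_layer            # shared across experts

embeddings = 2 * vocab * hidden                 # input embedding + LM head
router = layers * hidden * n_experts            # tiny gating network

total = experts_total + attn_total + embeddings + router
active = layers * top_k * ffn_per_expert + attn_total + embeddings + router

print(f"total  ≈ {total / 1e9:.1f}B")   # ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B")  # ≈ 12.9B
```

The expert feed-forward blocks dominate the count; the shared attention, embeddings, and router together contribute under 2B parameters, which is why only the two-of-eight expert selection changes the active total.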
Mixtral 8x7B / Mistral 7B vs. LLaMa
Table 1 compares Mistral 7B and Mixtral 8x7B with Llama 2 7B/13B/70B and Llama 1 34B in different categories. Mixtral outperforms Llama 2 on most metrics, especially in code and mathematics. In terms of size, Mixtral uses only 13B active parameters for each token, roughly five times fewer than Llama 2 70B, and is thus much more efficient.
It's important to note that this comparison focuses on the active parameter count, which drives computational cost, not memory or hardware cost. Mixtral's memory cost is proportional to its sparse parameter count, 47B, which is still smaller than Llama 2 70B's. Regarding device utilization, SMoEs run more than one expert per device, which increases memory load and makes them better suited to batched workloads.
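As a back-of-the-envelope illustration of the memory point, assuming 16-bit weights (2 bytes per parameter) and ignoring activations and the KV cache:

```python
BYTES_PER_PARAM = 2  # fp16 / bf16 weights

def weight_footprint_gb(n_params: float) -> float:
    """Approximate weight-only memory footprint in GB."""
    return n_params * BYTES_PER_PARAM / 1e9

# Memory scales with the sparse (total) count, not the active count.
print(f"Mixtral 8x7B (47B sparse params): ~{weight_footprint_gb(47e9):.0f} GB")
print(f"Llama 2 70B  (70B dense params):  ~{weight_footprint_gb(70e9):.0f} GB")
```

So even though Mixtral computes like a ~13B model, it must hold roughly 94 GB of weights in memory at fp16, versus about 140 GB for Llama 2 70B.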
Here's a more in-depth comparison chart on different benchmarks, demonstrating the Mistral models' performance against the LLaMa models.
Mixtral 8x7B vs. LLaMa 2 70B vs. GPT-3.5
Mixtral matches or outperforms Llama 2 70B, as well as GPT-3.5, on most benchmarks.
Mistral currently serves Mixtral 8x7B behind mistral-small, which is available in beta. In a strategic move to monetize its technological advancements, Mistral AI has opened its developer platform. The platform allows other companies to integrate Mistral AI's models into their operations via APIs, representing a significant step toward commercializing its AI innovations.
Table 3 shows the performance of the Mixtral and Llama models on multilingual benchmarks. Compared to Mistral 7B, Mixtral was trained with an upsampled proportion of multilingual data, which gave it the extra capacity to perform well on these benchmarks while maintaining high accuracy in English. Mixtral significantly outperforms Llama 2 70B in French, German, Spanish, and Italian.
Expert routing analysis
Mistral AI researchers analyzed the experts' token-selection behavior to see whether individual experts specialize in particular domains. As shown in the image below, the analysis indicates no significant relationship between experts and topics. The image represents the experts selected as either the first or second choice by the router.
Looking at the expert assignments for ArXiv, biology (PubMed abstracts), and philosophy (PhilPapers) papers, we notice that while they cover completely different topics, they have very similar expert-assignment distributions. DM Mathematics is the only domain that varies significantly from the others, possibly due to its synthetic nature and limited representation of natural language. This suggests the router exhibits some structured syntactic behavior rather than domain specialization.
In conclusion, Mistral AI's introduction of the innovative Mixtral 8x7B model and the successful completion of a €400 million funding round mark a significant turning point in the AI industry. This European AI pioneer is not only redefining efficiency and performance standards with its advanced technology but also solidifying its position as a key player in the global AI landscape. With substantial financial support and a focus on open and accessible AI solutions, Mistral AI is well-positioned to lead future developments and applications in this rapidly evolving field.