**TLDR:** Mixtral-8x7B uses a "sparse mixture-of-experts" architecture, activating only a subset of its 46.7B parameters (12.9B) for each token it processes. This lets Mixtral match the response quality of much larger models while running at the speed and resource cost of a much smaller one.
Mixtral is a [[sparse mixture-of-experts (MoE) network]]. It is a decoder-only model in which the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combines their outputs additively.
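Below is a minimal sketch of such a block in PyTorch. The class names, the SwiGLU-style expert, and the softmax over the top-2 router scores are illustrative assumptions based on the description above, not Mixtral's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One of the 8 parameter groups: a SwiGLU-style feedforward block (assumed shape)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class MoEFeedForward(nn.Module):
    """Sparse MoE layer: route each token to its top-2 experts and sum their outputs."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # per-token expert scores
        self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model), one row per token
        scores = self.router(x)                                 # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)   # pick 2 experts per token
        weights = F.softmax(top_scores, dim=-1)                 # normalize over the chosen 2
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = top_idx[:, k] == e                       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out


# Example: 10 tokens through a small layer; each token only touches 2 of the 8 experts.
layer = MoEFeedForward(d_model=64, d_ff=256)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```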
This technique increases the number of parameters of a model while controlling cost and latency, because the model only uses a fraction of the total parameter set per token. Concretely, Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. It therefore processes input and generates output at the same speed and cost as a 12.9B model.
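A back-of-the-envelope count shows roughly where the active-parameter figure comes from. The dimensions below (d_model = 4096, expert hidden size 14336, 32 layers) are Mixtral-scale values assumed here purely for illustration, applied to the sketch above.

```python
# Rough parameter accounting for a top-2-of-8 MoE stack (illustrative dimensions).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2

per_expert = 3 * d_model * d_ff                   # w1, w2, w3 in the SwiGLU expert above
total_expert = n_layers * n_experts * per_expert  # all expert weights in the model
active_expert = n_layers * top_k * per_expert     # expert weights a single token passes through

print(f"total expert params:  {total_expert / 1e9:.1f}B")   # ~45.1B
print(f"active expert params: {active_expert / 1e9:.1f}B")  # ~11.3B
# Shared weights (attention, embeddings, norms) are always active; adding them on top
# is roughly how the quoted 46.7B total / 12.9B active figures arise.
```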