**TLDR:** For local LLM deployment, hardware needs scale with model size and quantization. A basic setup can manage a 7B model, but a 13B model demands a more robust system, such as a high-end GPU with substantial VRAM and RAM. For the largest models, like the 65B Llama, specialized hardware is essential. Many enthusiasts run open-source LLMs like the Mistral series at home; if you're considering following suit, understand [[Neural Net Quantization]] before you continue reading.

# Hardware Requirements for Running Open-source LLMs Locally

Quantization introduces trade-offs between accuracy and model size, particularly with low-precision integer formats such as INT8, and doing it well requires prior knowledge and often extensive finetuning. If you're just getting into running open-source LLMs locally, you'll therefore probably download a model someone else has already quantized. It's like downloading a compressed movie torrent rather than the full Blu-ray rip: you may lack the compute or storage for the original, or the knowledge and desire to compress it yourself.

For example, some hardware configurations can run a quantized 7B Mistral model on as little as 6 GB of VRAM or RAM, assuming a modern multi-core CPU. These minimal setups tend to process slowly, though, since memory is the bottleneck. A 13B model (Mistral only ships a 7B, so think Llama 2 13B) typically requires either a high-end GPU with more than 12 GB of VRAM and over 32 GB of RAM, or a dual-GPU setup. A machine with 24 or 32 GB of VRAM and 64 or 128 GB of RAM is more than adequate, not just for a 13B model but also for running 7B models very efficiently.

Even those specifications aren't sufficient for much larger models, like the 65B-[[Parameters]] Llama. Models at that scale call for specialized hardware, such as the Tinybox, a $15,000 computer developed by George Hotz (geohot). The Tinybox and similar systems are designed to handle the immense computational and cooling demands of large-scale models. More information is available at [Tinygrad.org](https://tinygrad.org/).

[[Mixtral-8x7B]], however, works around these hardware limits with a sparse mixture-of-experts architecture. It has roughly 47B parameters in total, and its responses reflect that, but only about 13B of them are active for any given token, so it generates text with the speed and per-token compute cost of a 13B model. The full 47B weights still have to fit in memory, though.
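To make these numbers concrete, here's a minimal back-of-the-envelope sketch in Python. The `dense_model_vram_gb` helper and its 1.2x overhead factor are my own assumptions, standing in for KV cache, activations, and runtime buffers; actual usage depends on context length and inference backend, so treat the output as a rough guide rather than a spec.

```python
def dense_model_vram_gb(params_billion: float, bits_per_weight: float,
                        overhead: float = 1.2) -> float:
    """Rough memory estimate for holding a model's weights.

    overhead=1.2 is an assumed fudge factor covering KV cache,
    activations, and runtime buffers; real usage varies by backend
    and context length.
    """
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb * overhead

# Dense models at common precisions: FP16, INT8, and 4-bit quantization.
for name, params in [("7B", 7.0), ("13B", 13.0), ("65B", 65.0)]:
    estimates = ", ".join(
        f"{bits}-bit ~{dense_model_vram_gb(params, bits):.1f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{name}: {estimates}")

# Mixtral-8x7B: ~46.7B total parameters, but only 2 of 8 experts run
# per layer per token (~12.9B active), so per-token compute tracks a
# 13B dense model while ALL 46.7B weights must still fit in memory.
print(f"Mixtral-8x7B weights at 4-bit: "
      f"~{dense_model_vram_gb(46.7, 4):.1f} GB, "
      f"but per-token compute like a ~13B dense model")
```

Running this shows why a 4-bit 7B model fits comfortably under 6 GB, and why a 65B model stays out of reach for a single consumer GPU even at 4-bit.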