Local Large Model: What is a large model, why and how to choose

Needone App
Feb 9, 2025


The release of the M4 Mac mini, combined with China's national subsidy program, has sparked a wave of purchases. The 16GB of base memory, the roughly 3,500-yuan price tag, and the redesigned chassis all tempted me to buy one.

Apple's decision to start the Mac mini at 16GB of memory is due in part to Apple Intelligence. Some people hold a misconception: if Apple has outsourced large models to OpenAI, why does it need on-device computation? In fact, most Apple Intelligence features run on Apple's own models [1], while third-party providers play a role more like a search engine.

According to the architecture diagram in the Apple Foundation Models (AFM) paper [2], Apple Intelligence can be roughly divided into two parts:

  • On-device models (AFM-on-device): approximately 2.73B parameters, quantized so they can run on the Apple Neural Engine (keep "quantization" and "running on the ANE" in mind; both come up again later).
  • Cloud models (AFM-server): parameter count unknown, running on Apple silicon servers inside Private Cloud Compute (PCC), which provides privacy guarantees. For more on PCC, see the blog post [3]. Note that all third-party model services run outside of PCC.

Take the 2.73B on-device model as an example: after quantization each weight averages about 3.5 bits, so the weights alone need roughly 2.73×10⁹ × 3.5 / 8 ≈ 1.19 GB of unified memory at runtime. And that is just the language model; different scenarios may load their own multimodal models on top, which makes running all of this in a machine with the old default of 8GB of memory impractical.
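
As a quick sanity check on that arithmetic, here is a minimal sketch (the 3.5-bit figure is the post-quantization average quoted above; KV cache and activations are ignored):

```python
def quantized_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, ignoring KV cache and activations."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# AFM-on-device: ~2.73B parameters at ~3.5 bits per weight on average.
print(f"{quantized_weight_memory_gb(2.73, 3.5):.2f} GB")  # ≈ 1.19 GB
```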

With the release of DeepSeek v2, Qwen-2.5, Llama-3.3 and other open-source models, open-source capabilities are gradually approaching, and in places surpassing, gpt-4o, and the local LLM community has seen a new wave of interest. Flagship open models have grown to around 100B parameters, which consumer-grade single machines still struggle to run even after reasonable quantization; smaller language models (SLMs) and multimodal models, however, are making rapid progress and are worth watching.

The following chart, from the Qwen-2.5 blog [4], shows that models scoring above 65 on MMLU keep getting smaller.

Another study [5] (Xiao, C. et al., 2024, "Densing Law of LLMs", arXiv:2412.04315) found that model capacity density doubles roughly every 3.3 months, allowing more knowledge to be packed into a smaller model.

Why We Need Local Large Models

Many people may wonder: why do we need to deploy local large models when the commercial model services are convenient, guaranteed, and even free?

The question is much like asking why anyone self-hosts services at home, runs a NAS, or keeps a home server, though with some differences:

Privacy?

Some people might be concerned about privacy when sending data to external large models. In practice, though, this concern is relatively minor.

In reality, user input is no longer the primary source of training data for the top-tier model companies; they increasingly lean on synthetic data to cover complex reasoning tasks. And if the goal is simply to avoid ad tracking, routing requests through the third-party services offered by phone manufacturers can also keep personal data from being tracked.

Still, just as with smart-speaker voice interactions, a personal photo library on a NAS, or home security camera footage, keeping this kind of data local remains a perfectly good reason to deploy a local model to process it.

Fun!

Yes, relax. Unless you are a Mac mini enthusiast who has chained several machines together over Thunderbolt, or a "power company's favorite customer" running four NVIDIA A100s, you can treat deploying large models locally as a challenge and a form of entertainment.

Like other self-hosted services, it comes with a learning curve and some maintenance cost (though not as much as you might fear, since local models are supposed to lower the barrier to solving problems), but it can be genuinely enjoyable. For example, here are a few ideas I have not had time to implement yet that a large model could handle:

  • Voice-controlled home automation: current voice assistants can only switch devices on and off or adjust settings, which is still primitive; they are essentially keyword matchers and cannot really understand what the user wants. A local LLM can understand complex instructions like "every Saturday at 12:00 PM, if there's only one person in the house, close the curtains" and output an automation that Home Assistant can execute (see the sketch after this list).
  • Local RSS filtering: Based on user-defined interests, filter RSS feed content through text and summary analysis. This seems useful for filtering arXiv articles.
  • A simple cat food monitor: have a camera photograph the feeder at regular intervals and ask a multimodal model whether there is enough food left; if not, dispense more. Goodbye to those expensive, privacy-invasive products whose makers might vanish at any time.
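
To illustrate the first idea, here is a minimal sketch that asks a locally served model to turn a spoken instruction into Home Assistant automation YAML. It assumes an OpenAI-compatible endpoint such as the one Ollama exposes by default; the model name and prompt are placeholders of my own, not anything prescribed by Home Assistant or Ollama:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server, e.g. Ollama's default endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

SYSTEM_PROMPT = (
    "You translate natural-language smart-home requests into Home Assistant "
    "automation YAML. Output only valid YAML, with no commentary."
)

def command_to_automation(command: str, model: str = "qwen2.5:14b") -> str:
    """Ask the local model to draft a Home Assistant automation for a spoken command."""
    response = client.chat.completions.create(
        model=model,  # placeholder: whichever model you have pulled locally
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": command},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(command_to_automation(
    "Every Saturday at 12:00 PM, if only one person is home, close the curtains."
))
```

The generated YAML would still need to be validated, and ideally reviewed, before being loaded into Home Assistant; wiring that step in automatically is the rest of the project.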

This is where your imagination comes in. Fortunately, thanks to the excellent generality of large models, we no longer need to fine-tune, train, or wade through outdated deployment docs; even without help from external API services, projects like these can be completed.

Beyond these, there are even more advanced ways to play. Thanks to the availability of open-source weights, we can manipulate the model parameters themselves to get more interesting behavior:

  • Model merging: use mergekit [6] to merge the parameters of several models into one that is more capable. It sounds dubious, but plenty of companies actually build their models by merging community models. For example, you can average the parameters of Qwen-Coder and Qwen-Math to get a model that is passable at both coding and math (see the sketch after this list).
  • Feature steering: this technique comes from Anthropic's interpretability research on Sonnet [7]. By amplifying specific internal features, you can change the model's output behavior. For example, as shown in the image, after strengthening the "Golden Gate Bridge" feature tenfold, the model began to identify itself as the Golden Gate Bridge, without any prompt telling it to.
  • Model hacking: models released by vendors are generally safety-aligned, which can limit them in some respects, such as creativity. With targeted further training, these restrictions can be loosened.
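
To make the first item concrete, below is a minimal sketch of the simplest possible merge: a plain 50/50 parameter average of two same-architecture checkpoints. This is hand-rolled linear merging for illustration only; mergekit implements this and far more sophisticated methods, and the model names are placeholders that happen to share a base architecture:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoints: any two models with identical architecture and tensor shapes.
model_a = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct", torch_dtype=torch.bfloat16
)
model_b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-7B-Instruct", torch_dtype=torch.bfloat16
)

state_a = model_a.state_dict()
state_b = model_b.state_dict()

# 50/50 linear merge: average every weight tensor by name.
merged = {name: (state_a[name] + state_b[name]) / 2 for name in state_a}

model_a.load_state_dict(merged)
model_a.save_pretrained("qwen2.5-coder-math-linear-merge")
```

Whether the result is actually decent at both coding and math is an empirical question; in practice people sweep merge ratios and methods (SLERP, TIES, DARE, and so on) rather than settling for a plain average.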

It's worth noting that the active open-source community has driven progress on multiple fronts: not only better training and inference tooling, but also advances in model techniques themselves. For example, NTK-aware scaling, a widely used method for extending context length, was discovered almost by accident during work on SuperHOT [8] and published directly on r/LocalLLaMA.

Choosing a Large Model

Let's open HuggingFace and start picking a model, only to be immediately overwhelmed by the sheer number available. Fortunately, HuggingFace has kindly published an overview of open-source models in 2024 [9], where we can see which models are most popular:

Distribution of open-source model downloads as of 2024

The naming conventions manufacturers use are worth learning. From the chart above, a model name typically breaks down into:

  • Model Name and Version: For example, Qwen-2.5 or Llama-3.1. Generally, larger numbers are better.
  • Model Parameter Size: 1.5B, 7B, 70B, etc. Most downloads are of models with around 10B parameters, which is suitable for consumer-grade GPUs. Again, larger numbers are generally better.
  • Dialogue Alignment: Models with Chat/Instruct in their name can perform dialogue tasks, while Base models are limited to completion tasks.
  • Special Abilities: Coder/Code has been trained on code generation, Math has been enhanced with mathematical knowledge, etc.
  • Multimodal: Vision or VL represents visual capabilities, Audio represents audio processing abilities, etc.
  • Model (Quantization) Format: GGUF, AWQ, GPTQ, etc. These are used for quantization to save memory.

Example of the DeepSeek model

By reading these naming conventions, we can roughly identify models that fit our needs. Note, however, that not all open-source models are as multilingual as ChatGPT; Llama, for example, does not officially support Chinese.

Model capability is the first criterion when choosing a model. But what determines capability, and which factors does it depend on?

What Determines Model Capability?

According to traditional training scaling laws, the loss achieved during training is directly tied to the amount of compute spent (which in turn depends on model parameter count and training data volume). Once the loss drops below a certain level, the model exhibits "emergent behavior," where its accuracy on tasks suddenly improves.
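
For reference, one widely cited way of writing this relationship down is the Chinchilla-style form below (general background, not taken from the papers cited in this post), which expresses training loss as a function of parameter count N and training tokens D:

```latex
% Chinchilla-style scaling law (background; constants are fitted per model family):
% E is the irreducible loss; A, B, \alpha, \beta are fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Loss falls as either the parameter count or the token count grows, which is the formal version of "bigger brain, more books read."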

We can intuitively understand this as “the bigger the brain, the more books read, the better the grades.” The model will suddenly “wake up” at some point. For example:

  • Qwen-1.5-72B was trained on 3T tokens, while Qwen-2.5-72B was trained on 18T tokens [10], so the latter naturally performs better.
  • Qwen-32B will undoubtedly outperform Qwen-7B, but as we go further up to Qwen-72B, the relative improvement may be smaller for specific application scenarios.

Now that gains from scaling up training are starting to saturate, models such as o1 and the open-source QwQ instead spend more compute at inference time to improve capability. Intuitively: "the longer you think, the better your answer." Of course, we do not have to use these dedicated reasoning models; chain-of-thought prompting, majority-vote sampling, and similar techniques achieve a similar effect.
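
As an example, here is a minimal sketch of majority-vote sampling (self-consistency) against a locally served model; the endpoint, model name, and answer-extraction rule are all assumptions made for illustration:

```python
import re
from collections import Counter

from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. Ollama's default endpoint).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

def majority_vote(question: str, n_samples: int = 8, model: str = "qwen2.5:14b") -> str:
    """Sample several chains of thought and return the most common final answer."""
    prompt = question + "\nThink step by step, then finish with a line 'Answer: <value>'."
    answers = []
    for _ in range(n_samples):
        reply = client.chat.completions.create(
            model=model,  # placeholder local model
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,  # diversity across samples is what makes voting useful
        ).choices[0].message.content
        match = re.search(r"Answer:\s*(.+)", reply)
        if match:
            answers.append(match.group(1).strip())
    return Counter(answers).most_common(1)[0][0] if answers else ""

print(majority_vote("A train travels 120 km in 1.5 hours. What is its average speed in km/h?"))
```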

Is Model Capability = Leaderboard Performance?

This question reflects a core challenge in evaluating large models: how to measure effectiveness correctly. My short answer up front: generally yes, but trust your own hands-on testing.

Just like human exams, large model testing also has various types of questions:

  • Objective Questions: For example, multiple-choice questions on knowledge topics; math and programming problems with clear answers or verifiable results.
  • Subjective Questions: Without standard answers, such as multi-round dialogue, creative writing, security assessment, etc. Model (e.g., gpt-4o) or human evaluation is required.

In fact, most papers and leaderboards mainly evaluate objective questions, such as this sample from the commonly used MMLU (Massive Multitask Language Understanding):

Question: In 2016, about how many people in the United States were homeless?
A. 55,000  B. 550,000  C. 5,500,000  D. 55,000,000
Answer: B

For dialogue models, the model is shown everything except the answer, and accuracy is measured on whether it generates the correct option. Clearly this deviates somewhat from how we actually use these models, unless you use a large model to do your homework.
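
To make that procedure concrete, a generation-style version of this scoring might look like the sketch below (many evaluation harnesses actually compare option log-likelihoods instead; the endpoint, model name, and single sample item are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# A placeholder item in the four-option MMLU format.
ITEMS = [
    {
        "question": "In 2016, about how many people in the United States were homeless?",
        "choices": ["55,000", "550,000", "5,500,000", "55,000,000"],
        "answer": "B",
    },
]

def accuracy(items, model: str = "qwen2.5:14b") -> float:
    """Score generation-style multiple-choice accuracy for a local model."""
    correct = 0
    for item in items:
        options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"]))
        prompt = (
            f"{item['question']}\n{options}\n"
            "Respond with only the letter of the correct option."
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content.strip()
        correct += reply[:1].upper() == item["answer"]
    return correct / len(items)

print(f"accuracy: {accuracy(ITEMS):.0%}")
```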

On the other hand, most leaderboard questions are public and have been around for a long time, so it is hard for model developers to fully avoid data leakage and overfitting to the rankings. Still, if a model performs well across multiple leaderboards, it probably has some real skill. There are also leaderboards that rely on human evaluation, keep their questions private, or rotate them over time. Does that make them completely fair? Mostly, but there are still ways to optimize against a leaderboard and squeeze out a higher score.

So along which dimensions can we evaluate model capability? Take, for example, the post-training evaluation areas used for Llama 3 [11]:

  • Knowledge, instruction following, and general indicators: instruction following measures how well the model "obeys"
  • Mathematical reasoning
  • Code
  • Multilingual
  • Tool usage: calling external interfaces
  • Long text: processing long segments of input and output text

Which Leaderboards Can We Refer To?

Open LLM Leaderboard

The Open LLM Leaderboard is an open, reproducible large-model leaderboard run by HuggingFace. Its evaluation suite has gone through several iterations and now consists mostly of relatively difficult benchmarks (e.g., GPQA), so absolute scores can look low. Selecting For Consumers/Mid-range and Only Official Providers narrows the list to official models suitable for home use. Community fine-tunes and merges are also worth considering, but make sure quantized versions are available for download.

Chatbot Arena

Chatbot Arena, run by LMSYS, is a human-voted leaderboard and the primary venue for model promotion and PR. It offers limited filtering options, and the top spots are mostly closed-source models. Notably, although its mechanism makes gaming the rankings difficult, it is not impossible; treat it as an auxiliary signal when choosing a model.

LiveBench

LiveBench updates its questions monthly and fully refreshes them every six months to prevent data leakage, and it relies on objective scoring rather than subjective judging. The current ranking by coding ability:

LiveBench — Code

OpenCompass (Sinan)

OpenCompass is an open-source evaluation toolkit with a leaderboard covering both Chinese and English. However, it includes a limited number of models, and most of them are closed-source.

OpenCompass — November

Should we choose multimodal or reasoning models?

There are plenty of multimodal models to choose from, but inference-framework support for them may be lacking. Still, it is reasonable to be optimistic that 2025 will bring more open-source multimodal models capable of handling all kinds of home scenarios.

Among open-source reasoning models, the realistic option right now is QwQ-32B-Preview, whose drawback is long, rambling output that makes it time-consuming to use.

Model size selection

This decision depends on the task you want to solve, the sequence length (context) you need, how much latency you can tolerate, and, of course, your hardware. As a rough guide:

  • Small (~<5B): typical examples are Gemma2-2B, Llama3.2-3B, and Qwen2.5-Coder-3B, suitable for basic language understanding and summarization tasks.
  • Medium (~10B): typically 7B and 14B, suitable for simple programming tasks and logical reasoning.
  • Large (~30B): models like Qwen2.5-Coder-32B approach the coding ability of top-tier models and, after Q4 quantization, can run on consumer-grade devices.

Model selection also touches on more complex topics such as model bottlenecks, hardware, quantization, and inference engines, which we will cover in detail in later articles.
