Key concept of DeepSeek Part 2

Needone App


Continuing from my previous article, Key concept of DeepSeek Part 1, this piece carries on my research journey into DeepSeek.

In this article, I will focus on understanding the System 1 Thinking and System 2 Thinking mentioned in the DeepSeek paper. These concepts originate with Daniel Kahneman, a Nobel laureate in economics who is best known for his research on human decision-making and the various flaws and biases that affect our decisions. In his book “Thinking, Fast and Slow”, he elaborates on these concepts:

  • System 1: fast, automatic, intuitive, and emotionally driven, which is our brain’s default operating system;
  • System 2: slow, effortful, logical, and deliberative, which requires intentional activation and a lot of effort, so it cannot be the default option;
  • System 1 often influences System 2’s analysis: our deliberate analysis (System 2) tends to build narratives around examples that fit our initial concepts and intuitions (System 1). This is why DeepSeek uses methods such as RL and MoE to optimize its models, helping to mitigate the potential impact of System 1 Thinking on model decision-making.

P.S.: when I found out that Nobel-winning research on human economics was being applied to AI, I realized that large models have moved from shallow imitation of humans to deep learning of human behavior, and perhaps they will soon reach the Technological Singularity depicted in “Transcendence” and create an AI that surpasses human intelligence.

Below, I’ll explore how DeepSeek applies System 1/2 Thinking in RLMs (Reasoning Language Models):

System 1 vs. System 2

System 1 Thinking

Definition:
System 1 Thinking is a fast, intuitive, and pattern-matching based thinking style. It relies on pre-trained patterns and statistical rules to quickly generate responses. This thinking style corresponds to the “intuitive system” in human psychology, mainly generating outputs by recognizing input similarities with training data.

Features:

  1. Fast Response: generates responses quickly because it relies on pre-trained patterns and statistical rules.
  2. Pattern Matching: recognizes input similarities with training data to directly generate the most likely outputs (a toy sketch follows this list).
  3. Heuristic Method: uses heuristic rules to simplify complex decision-making processes, which are often based on experience or common patterns.
  4. Lack of Deep Reasoning: lacks the ability for in-depth analysis and step-by-step reasoning, relying more on surface-level pattern matching.
  5. Dependence on Training Data: heavily depends on training data patterns; if there’s no similar pattern in the training data, models may not generate accurate responses.
  6. Lack of Flexibility: lacks dynamic adjustment of reasoning strategies, making it difficult to adapt to new tasks or contexts.
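
To make the pattern-matching idea concrete, here is a toy Python sketch (purely illustrative, not how DeepSeek or any LLM is actually implemented): a bigram “model” that memorizes which word follows which in its training text and always emits the most frequent continuation. It is fast, but it can only reproduce patterns it has seen.

```python
# A minimal sketch of System 1-style generation: a toy bigram model that
# memorizes patterns from "training data" and always picks the most frequent
# continuation, with no deliberation or search.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": count which word follows which (pattern matching over data).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def system1_generate(start, max_len=8):
    """Fast, greedy generation: always emit the single most likely next token."""
    out = [start]
    for _ in range(max_len):
        followers = bigrams.get(out[-1])
        if not followers:          # no similar pattern in the training data -> stuck
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

print(system1_generate("the"))     # e.g. "the cat sat on the cat sat on the"
```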

System 2 Thinking

Definition:
System 2 Thinking is a slow, deliberate, and logic-based thinking style. It relies on logical reasoning and step-by-step analysis to handle complex tasks and problems. This thinking style corresponds to the “deliberative system” in human psychology, generating solutions through global evaluation.

Features:

  1. Slow and Deliberate: performs more in-depth analysis and step-by-step reasoning to handle complex tasks.
  2. Logical Reasoning: relies on logical rules and structured analysis to solve problems step by step.
  3. Flexibility: can adapt to different tasks and contexts through dynamic adjustment of reasoning strategies.
  4. Innovation: can explore new solutions rather than relying solely on known patterns.
  5. Global Evaluation: evaluates the quality of reasoning paths globally and selects the best solution (see the search sketch after this list).
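
For contrast with the System 1 sketch above, here is an equally toy sketch of System 2-style behavior (again purely illustrative and unrelated to DeepSeek’s implementation): a breadth-first search that deliberately explores many step-by-step paths through a tiny number puzzle and returns the globally best (shortest) one.

```python
# A minimal sketch of System 2-style reasoning: deliberate, step-by-step search
# over many candidate paths, keeping the globally best one instead of the first
# greedy guess. The "reach a target number" puzzle is purely illustrative.
from collections import deque

def system2_solve(start, target, max_steps=10):
    """Breadth-first search: explores alternatives and returns the shortest path."""
    ops = {"+3": lambda x: x + 3, "*2": lambda x: x * 2, "-1": lambda x: x - 1}
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        value, path = queue.popleft()
        if value == target:
            return path                      # globally best (shortest) path found
        if len(path) >= max_steps:
            continue
        for name, fn in ops.items():
            nxt = fn(value)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"{value} {name} = {nxt}"]))
    return None

print(system2_solve(2, 11))   # ['2 +3 = 5', '5 +3 = 8', '8 +3 = 11']
```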

System 1/2 Comparison

| Feature | System 1 Thinking | System 2 Thinking |
| --- | --- | --- |
| Speed | Generates responses quickly | Relatively slow, requiring more time for analysis |
| Reasoning Depth | Lacks deep reasoning | Performs in-depth analysis and step-by-step reasoning |
| Flexibility | Lacks flexibility | Highly flexible, able to adapt to different tasks |

System 1/2 in RLM

Interaction between System 1 and System 2

In human thinking, System 1 often takes the lead, running efficiently so that our brains are not overwhelmed by over-analysis. It achieves this through cognitive biases. Although “bias” usually carries a negative connotation, biases themselves are a neutral phenomenon: from an information-theoretic perspective, they are simply shortcuts that decision-making systems (whether human or AI) use to prioritize and use information. These shortcuts can be positive (e.g., avoiding something because it smells bad), neutral (e.g., liking chocolate milk), or negative (e.g., racial bias).

The main difference between human and AI biases lies in their origin: human biases arise from the structure of our minds, while AI biases reveal deep-seated problems in our data.

In the field of AI, System 1 is often cast as a simple checker and System 2 as the powerful deep learning model, but System 1’s role extends beyond this. In most cases, System 1 also acts as our feature extractor/data pipeline, feeding input directly to System 2. Two biases show how this shaping happens:

  1. Confirmation Bias: System 1 reinforces existing beliefs by seeking out supporting information and ignoring contradictory information, resulting in a distorted perception of reality.
  2. Framing Effects: the way information is presented (its “frame”) can influence our decisions even when the underlying facts remain unchanged. System 1 is especially susceptible to framing, leading to choices driven by presentation rather than logic.

Mean Reversion

Reversion to the mean describes the statistical tendency for extreme results or events to be followed by less extreme ones, eventually converging back toward an average. This may seem obvious, but understanding its implications is crucial for any decision-maker, especially one working with data. Several factors drive mean reversion (a small simulation after the list below illustrates the effect):

  • Random Variation: any result is influenced by chance. An exceptional result might be due to a run of good luck or external factors that are unlikely to recur.
  • Measurement Error: testing and measurement are not perfect. Abnormally high or low scores may partly reflect measurement error, so subsequent results look more average.
  • Underlying Stability: most systems have a natural mean or equilibrium state they tend toward. Deviations from this mean usually correct themselves over time.
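
The following small simulation (my own illustration, not from the paper) shows regression to the mean in action: each “score” is a stable underlying ability plus random noise, and runs selected for an extreme first score look much more average on the second measurement.

```python
# A small simulation of regression to the mean: scores that combine a stable
# "true ability" with random noise. Runs selected for extreme first scores
# tend to look more average on the second measurement.
import random

random.seed(0)
true_ability = 50.0                                  # underlying stable mean
def noisy_score():
    return true_ability + random.gauss(0, 10)        # random variation / measurement error

pairs = [(noisy_score(), noisy_score()) for _ in range(10_000)]

# Condition on an extreme first result, then look at the follow-up.
extreme = [(a, b) for a, b in pairs if a > 65]
avg_first = sum(a for a, _ in extreme) / len(extreme)
avg_second = sum(b for _, b in extreme) / len(extreme)
print(f"first score (selected extremes): {avg_first:.1f}")   # well above 50
print(f"second score (same runs):        {avg_second:.1f}")  # close to 50 again
```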

With mean reversion in mind, RLM design and training should guard against several kinds of incorrect reasoning:

  • Misinterpreting Performance: be careful not to over-interpret a single extreme result (whether positive or negative). Mean reversion suggests it may be partly a chance phenomenon rather than a permanent reflection of reality.
  • Evaluating Interventions: if we make a change (e.g., start a new employee training program) and observe a significant improvement, we might attribute the change entirely to the intervention. Mean reversion may also be at play, and we need to account for it to evaluate results correctly.
  • Predicting the Future: mean reversion warns against extrapolating from recent extreme results, since future results are likely to be less extreme. This is why better data sampling and a better understanding of the underlying domain are crucial, as both help us judge whether past data is a good model for predicting the future.

P.S.: I’m gradually understanding why DeepSeek’s training results remain so strong even after data distillation!

Loss Aversion

Daniel Kahneman’s research on loss aversion found that the motivation to avoid losses is about 2.5 times greater than the motivation to pursue gains of the same size.
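
As a quick illustration, here is a prospect-theory-style value function using the 2.5x factor quoted above; the functional form and the numbers are my own illustrative assumptions, not something taken from the DeepSeek paper.

```python
# Illustrative loss-aversion value function: losses are weighted ~2.5x more
# heavily than equivalent gains (the exact form here is an assumption).
LOSS_AVERSION = 2.5

def subjective_value(outcome: float) -> float:
    """Felt value of an outcome: losses loom ~2.5x larger than gains."""
    return outcome if outcome >= 0 else LOSS_AVERSION * outcome

# A fair coin flip: win $100 or lose $100. The expected monetary value is 0,
# but the felt (subjective) value is negative, so the gamble feels unattractive.
expected_felt_value = 0.5 * subjective_value(100) + 0.5 * subjective_value(-100)
print(expected_felt_value)   # -75.0
```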

The following related biases are relevant to AI:

  • Anchoring Bias: we tend to over-weight the first piece of information we are given (the “anchor”), even when it is completely unrelated to the decision. Anchoring can also amplify automation bias, our tendency to accept the output of automated decision-making systems uncritically.
  • Framing Effects: the way information is presented (its “frame”) has a significant impact on how we perceive it and what we decide, which is why the deliberate naming of benchmarks and techniques can steer decisions. The TruthfulQA benchmark in AI is a good example.
  • Availability Heuristic: we judge the likelihood or frequency of an event by how easily examples come to mind rather than by its actual base rate. This often leads to misjudged probabilities; because social media tends to push extreme content and creators, people who spend a lot of time online may mistake these outliers for the “normal” case.
  • Representativeness Heuristic: we judge things or people by their similarity to a typical example, sometimes ignoring statistical probability. LLMs excel at imitating human intelligence and language understanding, but have not demonstrated the underlying drivers of intelligence.

P.S.: Enhancing reasoning and the thinking process with RL in RLM post-training seems well supported by this evidence.

Applications in RLM

System 1 Thinking in RLM:

  • Generating Reasoning Steps: in RLM, System 1 Thinking generates new reasoning steps through the policy model. The policy model is built on a large language model (LLM) and can quickly produce coherent text to drive the reasoning process forward.
  • Pattern Matching: the policy model relies on pattern matching and statistical regularities to rapidly generate the most likely next reasoning step, which makes it well suited to fast generation of reasoning paths.

System 2 Thinking in RLM:

  • Evaluating Reasoning Paths: in RLM, System 2 Thinking evaluates the quality of reasoning paths through the value model. The value model performs global evaluation and chooses the most promising reasoning path to optimize the reasoning process.
  • Reasoning Strategies: System 2 Thinking uses search strategies such as Monte Carlo Tree Search (MCTS) to balance exploration and exploitation, optimizing how reasoning paths are selected and extended. By simulating multiple reasoning paths and selecting the best one, MCTS demonstrates System 2 Thinking’s global evaluation and step-by-step reasoning.

Combining System 1 and System 2:

  • RLM Reasoning Process: RLM combines System 1 and System 2 thinking to solve complex problems efficiently. The policy model generates new reasoning steps (System 1) and the value model evaluates the quality of reasoning paths (System 2), together driving the reasoning process forward (see the sketch after this list).
  • Training Mechanism: RLM training combines supervised fine-tuning (SFT) and reinforcement learning (RL) to optimize the policy model and value model, improving reasoning ability and generalization. Supervised fine-tuning ensures high-quality generation of reasoning steps, while reinforcement learning further optimizes reasoning strategies through interaction with the environment.
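
To tie the pieces together, here is a heavily simplified sketch of how a policy model (System 1: cheap step proposals) and a value model (System 2: global path evaluation) can cooperate in a search over reasoning paths. Both models below are mock functions, and the search is a simple best-first/beam loop rather than full MCTS; this is not DeepSeek’s actual implementation, just the division of labor in miniature.

```python
# Illustrative policy+value search over reasoning paths.
# policy_propose (System 1): fast, cheap proposals for the next step.
# value_score   (System 2): global evaluation of a partial path.
import heapq
import itertools

def policy_propose(path: list[str]) -> list[str]:
    """Mock policy model: cheaply propose a few candidate next reasoning steps."""
    step_no = len(path) + 1
    return [f"step {step_no}: option {c}" for c in "ABC"]

def value_score(path: list[str]) -> float:
    """Mock value model: globally score a partial reasoning path (higher is better)."""
    return sum(1.0 if "option A" in s else 0.5 for s in path)

def search(max_depth: int = 3, beam: int = 8) -> list[str]:
    counter = itertools.count()                     # tie-breaker for the heap
    frontier = [(0.0, next(counter), [])]           # max-heap via negated scores
    best_path, best_score = [], float("-inf")
    while frontier:
        neg_score, _, path = heapq.heappop(frontier)
        if path and -neg_score > best_score:
            best_score, best_path = -neg_score, path
        if len(path) >= max_depth:
            continue
        for step in policy_propose(path):           # System 1: generate candidates
            new_path = path + [step]
            score = value_score(new_path)            # System 2: evaluate the path
            heapq.heappush(frontier, (-score, next(counter), new_path))
        frontier = heapq.nsmallest(beam, frontier)  # keep only the best candidates
        heapq.heapify(frontier)
    return best_path

print(search())   # the path that always picks "option A" scores highest
```

Swapping this best-first loop for MCTS mainly changes how candidate paths are selected and how value estimates are propagated back up the tree; the System 1/System 2 division of labor between proposal and evaluation stays the same.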

Conclusion

System 1 Thinking and System 2 Thinking have unique applications and advantages in RLM. System 1 Thinking drives the reasoning process forward by quickly generating reasoning steps; System 2 Thinking optimizes the selection and extension of reasoning paths through global evaluation and step-by-step reasoning. By combining these two thinking modes, RLM efficiently solves complex problems, generates high-quality reasoning paths, and excels in various tasks.

References

  1. What I Learned From Thinking Fast And Slow
  2. System 1 and System 2 Thinking
  3. DeepSeek V3 Technical Report
