This publication is licensed under the terms of the Creative Commons Attribution License 4.0 which permits unrestricted use, provided the original authors and source are credited.
To aid accessibility for government officials and policymakers, this summary has been streamlined by minimising technical complexity. For readers interested in more technical details, we recommend reading the full paper on which this Expert Analysis is based.
The relatively short history of Generative AI has been punctuated by big steps forward in model capability. One such step came with the releases of DeepSeek-V3 – a Chinese-made competitor to OpenAI’s GPT-4o – in late December 2024 and DeepSeek-R1 on 20 January 2025. DeepSeek-V3 was reportedly trained in two months for approximately $5.6 million, around 2% of the cost of comparable models. The DeepSeek-R1 release comprises a set of reasoning models exhibiting “numerous powerful and intriguing reasoning behaviours” and achieving performance comparable to OpenAI’s o1. Both DeepSeek models are open for researchers to examine.
This openness is welcome for the many AI researchers who are keen to understand more about the models they are using. DeepSeek released the models as open weights – which can be built upon and freely used (under the MIT licence) – but without the training data. While this means that they are not truly open source, the company published more details about the training process than usual.
The release of the models has at least two major technical implications. The first is that it is possible to distil information from larger models into a smaller model, which provides a shortcut in post-training. The second is that simple reinforcement learning (RL) can yield significant, albeit narrow, performance improvements at lower computational costs. Both approaches could change risk thresholds across the defence and national security portfolio – not least in areas such as malicious cyber activity, misinformation and disinformation (including deepfake generation) – by providing a foundation for better reasoning ability in smaller, non-centralised models.
The open weights release of DeepSeek does not solve problems commonly associated with large language models (LLMs), such as hallucination, but, bolstered by media attention, it has raised the question of whether such models are good enough for widespread adoption by businesses, researchers and hobbyists. Some users have already installed a distilled version based on Qwen, an AI model produced by Chinese multinational Alibaba, on Raspberry Pi systems – although this yields a relatively slow 1.2 tokens per second, allowing relatively little ‘thinking’ in a reasonable time. And the comparatively cheap cost of application programming interface (API) access has prompted developers to write their own VSCode plug-ins that use the DeepSeek model instead of GitHub Copilot.
Some experts hypothesise that this kind of grassroots adoption – a shift in the ubiquity rather than ability of AI systems – is a key step towards artificial general intelligence. If this is the case, it will be vital to understand the societal and security implications of DeepSeek’s models.
New efficiencies
DeepSeek-V3 employs the mixture of experts (MoE) architecture alongside many engineering efficiencies. This architecture essentially divides the model into a collection of specialised smaller ‘expert’ networks – suited to different kinds of input, such as maths or coding – with only a few experts activated for each token, which eases the training burden. The architecture featured in Google’s GShard machine-translation transformer in 2020 and in the Mixtral LLM released in January 2024. DeepSeek published a paper on its approach to MoE in January 2024, one of a flurry of papers on the technique to emerge last year.
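For technically minded readers, the sketch below illustrates the general pattern of top-k expert routing in PyTorch. It is a minimal illustration, not DeepSeek’s actual implementation (which adds refinements such as shared experts and sophisticated load balancing); all names here are our own.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Minimal mixture-of-experts layer: a router scores the experts for
    each token, and only the top-k experts are run and mixed."""

    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, dim)
        weights, chosen = self.router(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):  # each token runs only k of the experts
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoELayer()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

The key point is that compute per token scales with k, not with the total number of experts, which is why MoE models can grow very large while remaining relatively cheap to train and run.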
DeepSeek-V3 is one of several recently released Chinese models that highlight the importance of algorithmic efficiency and resource optimisation. Instead of relying on brute-force scaling, DeepSeek has shown how to achieve high performance with significantly fewer resources. This is reflected in OpenAI’s subsequent price cuts, and the mounting pressure on the company to allow users to access reasoning tokens.
On 31 January, OpenAI also responded with the deployment of the o3-mini reasoning model. The model uses deliberative alignment, which involves reviewing a set of internal policies at every reasoning step, to ensure that it is not ignoring any safety rules. However, OpenAI acknowledges that reasoning models are better than most at breaking through the guardrails their designers impose on them.
With DeepSeek’s app at the top of the App Store charts in the UK, US and China, the company’s breakthrough in efficiency appears to have broader commercial and policy implications. It suggests that US export controls on advanced chips, which were designed to slow China in the AI race, may have inadvertently encouraged innovation. Nvidia, which makes the top-of-the-line chips used in many advanced AI models, lost nearly $600bn in market value in January.
DeepSeek-R1: Reasoning
With DeepSeek-R1, the company aims to improve reasoning capabilities using pure RL, without supervised fine-tuning, so that the model can ‘self-evolve’. With the V3 model (671B parameters) as a base and group relative policy optimisation (GRPO) as the scalable RL framework, the resulting R1-Zero model showed improvements in reasoning and maths but also exhibited problems such as poor readability and language mixing.
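The core of GRPO – scoring each sampled answer relative to the other answers in its group, removing the need for a separate learned ‘critic’ – can be sketched in a few lines. This is a simplified illustration based on the published description; DeepSeek’s full objective also includes a clipped policy-ratio term and a KL penalty, omitted here.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each sampled answer is scored against
    the mean and spread of its own group, so no learned value function
    (critic) is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled answers: the first two were marked correct.
print(grpo_advantages([1.0, 1.0, 0.0, 0.0]))  # approx. [ 1.  1. -1. -1.]
```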
Over the course of RL training, the R1-Zero model’s score on AIME 2024 increased from 15.6% to 71.0% – comparable to OpenAI-o1-0912 – and reached 86.7% when majority voting over multiple sampled answers was applied. The company then reintroduced some supervised fine-tuning to produce the R1 model, which reportedly achieves scores on par with OpenAI’s o1 model on many reasoning and maths-based evaluation tasks.
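Majority voting (sometimes called self-consistency) is conceptually simple: sample several answers to the same question and return the most common one. A minimal sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among several samples."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled solutions to one problem, three agreeing on '42'.
print(majority_vote(["42", "17", "42", "42", "9"]))  # 42
```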
As DeepSeek’s paper on R1 observes, RL encourages the model to generate more tokens to solve reasoning tasks. During this process, and as test-time computation increases, behaviours such as reflection and the exploration of alternative approaches arise spontaneously. (Some experts use the term ‘aha moment’ to describe the point at which an intermediate model learns to rethink, using an anthropomorphic tone.)
Another observation from the R1 paper is that the model’s performance decreased when DeepSeek introduced a reward during RL to encourage language consistency, trading off benchmark performance against useability and readability. The paper also explains how the reasoning patterns of larger models can be distilled into smaller models via a supervised fine-tuning dataset, arguing that these distilled versions perform better than applying the same RL process directly to the small model. The hope is that this distillation can be built upon to yield even smaller, yet still effective, models. The distilled models improved on their original baseline benchmarks, with R1-Distill-Qwen-32B and R1-Distill-Llama-70B outperforming OpenAI’s o1-mini on tasks involving coding and mathematical reasoning.
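In outline, this kind of distillation uses a large ‘teacher’ model to generate reasoning traces, keeps the good ones, and fine-tunes a smaller ‘student’ model on them. The sketch below illustrates the data-generation step with toy stand-ins; the helper names and the arithmetic ‘teacher’ are our own inventions, not DeepSeek’s pipeline.

```python
def distil_dataset(teacher_generate, prompts, accept):
    """Build a supervised fine-tuning corpus by sampling reasoning traces
    from a large 'teacher' model and keeping only the traces that pass a
    quality check (e.g. a correct final answer)."""
    corpus = []
    for prompt in prompts:
        trace = teacher_generate(prompt)   # chain of thought + final answer
        if accept(prompt, trace):
            corpus.append({"prompt": prompt, "completion": trace})
    return corpus                          # used to fine-tune the student

# Toy stand-ins: an arithmetic 'teacher' and a correctness check.
prompts = ["2+2", "3*5"]
teacher = lambda p: f"Reasoning: compute {p}. Answer: {eval(p)}"
accept = lambda p, trace: trace.endswith(str(eval(p)))
print(distil_dataset(teacher, prompts, accept))
```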
DeepSeek-R1: Replication
On 25 January, researchers at the Hong Kong University of Science and Technology released a paper describing how their attempt to recreate the R1-Zero model achieved “surprisingly strong results on complex mathematical reasoning”, with long chain-of-thought (CoT) and self-reflection emerging on a 7B model trained on only 8k MATH examples. They started with Qwen2.5-Math-7B as a base model and performed RL on it directly, without supervised fine-tuning or a reward model. The researchers took a somewhat different approach to DeepSeek, starting with a smaller model and without a large-scale RL setup, and using proximal policy optimisation (PPO) instead of group relative policy optimisation. They observed the same increase in CoT length and the same emergent self-reflection. The resulting model scored 33.3% on AIME and 77.2% on MATH (up from 16.7% and 52.4% respectively for the base model). This is comparable to the performance of Microsoft’s rStar-Math model, which uses more than 50 times the data and requires more complicated components.
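For comparison with GRPO above, the clipped surrogate loss at the heart of PPO can be written in a few lines. This is a simplified sketch – real PPO training also involves a learned value function and other terms not shown here.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective: updates are limited so the new
    policy cannot drift too far from the old one in a single step."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# ratio = new_policy_prob / old_policy_prob for each sampled token
print(ppo_clip_loss(np.array([0.9, 1.5]), np.array([1.0, 1.0])))
# -1.05: the 1.5 ratio is clipped to 1.2, limiting the update
```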
US company Hugging Face is recreating R1 in a fully open process, with the aim of releasing the complete data and training pipeline. The firm intends to replicate the R1-Distill models by extracting a high-quality reasoning corpus from DeepSeek-R1, to reproduce the pure RL pipeline used to create the R1-Zero model, and to demonstrate the transition from a base model to an RL-tuned model through multi-stage training.
Competing models
As discussed, the DeepSeek models are not the only notable innovations to come out of China in recent weeks. On 22 January, ByteDance – the company behind TikTok – released its Doubao-1.5 Pro model, which reportedly outperforms GPT-4o while being around 50 times cheaper to use. It employs MoE and a highly optimised architecture that balances performance with reduced computational demands. Doubao is one of the most popular AI chatbots in China, with 60 million active users. The company focuses on building AI models that balance intelligence with communication, aiming to produce more emotionally aware, natural-sounding interactions. It is likely that Doubao incorporates improved prompt optimisation techniques and communication-efficient MoE training via locality-sensitive hashing (LSH). The latter aims to tackle the latency challenges inherent in training sparse-gated MoE models, reportedly resulting in 2.2 times faster inference.
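The technical details of Doubao’s training are not public, but the general LSH idea – hashing similar vectors into the same bucket so they can be grouped together cheaply – is easy to sketch. Below is an illustrative random-hyperplane version; the function and parameters are our own, not ByteDance’s.

```python
import numpy as np

def lsh_buckets(vectors, n_planes=4, seed=0):
    """Random-hyperplane LSH: vectors pointing in similar directions get
    the same bucket id, so similar tokens can be batched together."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_planes))
    bits = (vectors @ planes) > 0                          # one sign bit per plane
    return bits.astype(int) @ (1 << np.arange(n_planes))  # pack bits into an id

tokens = np.random.default_rng(1).standard_normal((6, 32))
print(lsh_buckets(tokens))  # tokens with equal ids share a bucket
```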
On 15 January, iFlytek launched Spark Deep Reasoning X1, its own deep-reasoning large model trained on a fully domestic computing platform. It demonstrates characteristics similar to ‘slow thinking’ during problem-solving, while achieving what the company calls “industry-leading” results with relatively low computing power. It is particularly strong at Chinese mathematical tasks and has already been applied in the education sector as an intelligent teaching assistant.
On 20 January, Chinese research company Moonshot AI released Kimi k1.5, reporting performance equivalent to o1 on reasoning tasks (77.5% on AIME and 96.2% on MATH) and the use of RL in post-training. Kimi is reportedly multimodal, handling text, code and images. It has a context length of 128k tokens, meaning that it can read whole novels via the prompt. Its simplified RL framework balances exploration and exploitation, penalising the model for generating verbose responses. Kimi also encourages shorter, faster responses by blending the weights of long- and short-CoT models. In late January, Qwen released a new family of models, Qwen2.5-VL. This multimodal model has several advantages over Qwen2, including better text recognition (covering handwriting, multiple languages and tables), improved object detection and spatial reasoning, and better agent and video functionality.
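In its simplest form, blending two checkpoints is a linear interpolation of their parameters. The toy sketch below is illustrative only – Moonshot AI’s actual merging recipe is not public in this form.

```python
def blend_weights(long_cot, short_cot, alpha=0.5):
    """Merge two model checkpoints by linearly interpolating each
    parameter - here, a long-CoT and a short-CoT model."""
    return {name: alpha * long_cot[name] + (1 - alpha) * short_cot[name]
            for name in long_cot}

# Toy two-parameter 'models'.
print(blend_weights({"w": 1.0, "b": 0.0}, {"w": 0.0, "b": 2.0}))
# {'w': 0.5, 'b': 1.0}
```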
On 2 February, OpenAI announced Deep Research, claiming that “it accomplishes in tens of minutes what would take a human many hours.” After the release of the DeepSeek models, there was widespread speculation that this might force OpenAI to rush its next release to maintain market dominance. It is too early to determine whether this was the case – or, if so, whether it had an impact on the model.
Observations of AI researchers
Members of the AI research community have made several important observations about DeepSeek’s models:
- The smaller models can be run on a local machine, for free, with increased privacy. They can be installed via Hugging Face and Ollama (a minimal sketch follows this list).
- The R1 model can be brittle and difficult to prompt.
- Its reasoning capabilities can reportedly be used to help jailbreak itself; indeed, the model is easy for a user to jailbreak.
- The model refuses to answer questions on certain topics related to the censorship practices of the Chinese Government. This may be more relevant for the V3 model; because the R1 model has been developed for improved reasoning, such censorship is unlikely to affect its performance. (The censorship appears not to be present when the model is run locally.)
- There is some scepticism about the costs described in the V3 paper, in which DeepSeek states that it spent approximately $5.6M on training the V3 model – although there are also suggestions that the figures presented are plausible. Scale AI founder Alexandr Wang has said that he believes DeepSeek has 50,000 H100 GPUs.
- Similar approaches were tried on models two years ago, but the results were nowhere near as good. The assumption is that the quality of the base model is a key factor.
- RLCoT (chain of thought learned via RL) is considered emergent behaviour, which does not appear until models reach roughly 1.5B parameters. The specific type of RL used makes little difference to this.
- The CoT internal dialogue is often full of self-doubt and exhibits very little confidence, yet the model delivers its final answer in an overly confident tone. Exposing this self-doubting dialogue appears more honest and, as a consequence, builds user trust in the model.
- Many of these systems use generative AI to help create or collate the datasets they train on for better reasoning. It is currently unclear whether this approach will suffer from the degradation associated with training LLMs on LLM-generated material.
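On the first point above, a minimal sketch of running one of the distilled models locally via Hugging Face’s transformers library. This assumes the officially released DeepSeek-R1-Distill-Qwen-7B checkpoint and a machine with sufficient memory; smaller or quantised variants are also available.

```python
# Requires: pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```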
Figure 1: Comparison of ChatGPT and DeepSeek outputs
Political implications
Many observers have commented on the DeepSeek models’ refusal to answer questions on certain topics related to censorship by the Chinese Government. From a national security point of view, this raises several concerns, particularly about how the risk profile changes if most users move from an LLM aligned with American values to one aligned with the Chinese Government. This is especially relevant when a large proportion of users rely on LLMs instead of search engines to retrieve facts (see Figure 1 for an example of the discrepancy between responses, which we generated on 3 February).
Political commentators have suggested that the release of the DeepSeek-R1 model was timed to coincide with President Donald Trump’s inauguration, aiming to undermine the perceived US dominance of the AI sector or to blunt the impact of the new Stargate project. However, the timing could equally reflect the rush to release products before the Chinese New Year.
The US and Australian Governments raised concerns about the use of DeepSeek by staff, with the US Navy banning the application on “security and ethical” grounds. Meanwhile, Italy has imposed a nationwide ban on the application, pending privacy watchdog Garante’s investigation into its handling of personal data. Coupled with a recent data breach that allowed researchers to access more than 1 million plain-text chat histories, this paints a worrying picture of data-handling practices within the fast-paced AI environment.
David Sacks, the White House’s AI and crypto czar, recently stated that “there’s substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI’s models.” It will be interesting to see whether OpenAI mitigates such teacher–student threats – in which a highly capable, large ‘teacher’ model is used to guide and train a smaller ‘student’ model, often via knowledge distillation – and, if so, how it does so without affecting usability. It will also be interesting to see the implications of a more restrictive usage policy, should OpenAI choose to go down this route, as it could push more people towards open-source, non-Western alternatives. Alternatively, such a move could fracture the frontier-model landscape, leading to siloed models tailored to their target audiences. Indeed, we are already seeing some suggestions of this with the development of the OpenEuroLLM project.
Conclusion
As with any other model announcement, more thorough research into DeepSeek’s products is needed to ensure that the benchmarks and statistics it has put forward are consistent and provide a fair (and repeatable) comparison. Yet the performance of the R1 model does appear to be impressing some in the online research community. Those performing early evaluations seem impressed with the model’s overall ability – even if, for some, responses take too long, especially when a task results in long chains of thought.
This flurry of reasoning model releases, with lower training and inference costs, is China’s technical response to data (and compute) scaling limitations. The new Chinese models demonstrate an innovative mix of KISS (‘keep it simple, stupid’) approaches and clever engineering, building on open-source literature and using many techniques that are traceable through recently published papers – albeit with the details of the data used for training frustratingly absent from the documentation.
The focus on improving maths and coding (through reasoning) may support future agentic approaches. Indeed, 2025 has sometimes been touted as the year of the agent. These domains also suit RL-type approaches because their evaluations are relatively easy to automate: correct maths answers are definite, and coding tasks with unit tests can be checked automatically.
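A toy sketch of why coding suits RL: the reward signal can be computed automatically by running the generated code against unit tests. The function below is illustrative, not any lab’s actual harness.

```python
import subprocess, sys, tempfile

def unit_test_reward(candidate_code, test_code, timeout=10):
    """Binary reward for a coding task: 1.0 if the candidate passes the
    unit tests, 0.0 otherwise (including on a crash or timeout)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(unit_test_reward(solution, tests))  # 1.0
```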
However, if one considers that simple RL allows models to be upskilled with relatively small datasets (such as the 8k MATH examples), what other skills could be developed in small models? Is this technique only effective for pass/fail datasets? Or would, for example, upskilling a model to be more creative in its story-writing produce similar returns?
Without greater certainty about the technology used and the true costs of training, it is difficult to reach accurate and reliable conclusions about DeepSeek. This poses an interesting research question: can one use a released model to glean insights about the development pipeline and the datasets used during training?
While the DeepSeek models are impressive, they do the same for less rather than providing a step change in capability. Users can now train models much more efficiently, but this does not mean that they have access to a model that is substantially better than those that were previously available.
The views expressed in this article are those of the authors, and do not necessarily represent the views of The Alan Turing Institute or any other organisation.
For new insights into developments at the intersection of emerging technology and security, sign up to the CETaS Network here.
Citation information
Sarah Mercer, Samuel Spillard and Daniel Martin, "China’s AI Evolution: DeepSeek and National Security," Alan Turing Institute Expert Analysis (February 2025).