China's AI model could not stop a single harmful prompt

Headline-grabbing DeepSeek R1, a new chatbot from a Chinese startup, failed abysmally in key safety and security tests conducted by a research team at Cisco in collaboration with researchers from the University of Pennsylvania.

“DeepSeek R1 exhibited a 100% attack success rate, meaning it failed to block a single harmful prompt,” said the research team.

February 18, 2025

The new chatbot has garnered massive attention for its impressive performance in reasoning tasks at a fraction of the usual cost. DeepSeek R1’s development reportedly involved around $6 million in training expenses, compared to the billions invested by other major players like OpenAI, Meta, and Google.

“DeepSeek has combined chain-of-thought prompting and reward modeling with distillation to create models that significantly outperform traditional large language models (LLMs) in reasoning tasks while maintaining high operational efficiency,” explained the team.

However, the Cisco report has exposed flaws that render DeepSeek R1 highly susceptible to malicious use.

“Our findings suggest that DeepSeek’s claimed cost-efficient training methods, including reinforcement learning, chain-of-thought self-evaluation, and distillation may have compromised its safety mechanisms,” added the report.

Researchers used algorithmic jailbreaking

The team employed “algorithmic jailbreaking,” a technique used to identify vulnerabilities in AI models by constructing prompts designed to bypass safety protocols. They tested DeepSeek R1 against 50 prompts from the HarmBench dataset.

“The HarmBench benchmark has a total of 400 behaviors across 7 harm categories including cybercrime, misinformation, illegal activities, and general harm,” highlighted the team.

The results of this evaluation are concerning. DeepSeek R1 exhibited a 100% attack success rate: for every one of the 50 harmful prompts presented, the model failed to recognize the danger and produced a response, so none of its internal safeguards held.
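As a rough illustration of the metric, attack success rate is the share of harmful prompts that got past the model. The sketch below is not the Cisco team's evaluation code; the function name and the list of per-prompt outcomes are hypothetical, and the 50 outcomes are filled in only to mirror the reported result.

```python
def attack_success_rate(outcomes):
    """Return the percentage of prompts for which the attack succeeded.

    `outcomes` holds one boolean per harmful prompt (hypothetical format):
    True  -> the model produced the harmful response (attack succeeded)
    False -> the model blocked or refused the prompt
    """
    if not outcomes:
        raise ValueError("at least one prompt outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative data only: 50 prompts, none blocked, as reported for DeepSeek R1.
deepseek_r1_outcomes = [True] * 50
print(attack_success_rate(deepseek_r1_outcomes))  # 100.0
```

A model that blocked even one of the 50 prompts would score 98% on this measure.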

“This contrasts starkly with other leading models, which demonstrated at least partial resistance,” said the team.

To provide further context, the research team also tested other leading language models for their vulnerability to algorithmic jailbreaking. Llama 3.1 405B had a 96% attack success rate, GPT-4o had 86%, Gemini 1.5 Pro had 64%, Claude 3.5 Sonnet had 36%, and o1-preview had 26%.
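To make these percentages more concrete, the sketch below converts each reported rate into a count of blocked prompts. It assumes, which the report excerpt here does not state explicitly, that every model was tested on the same 50 prompts; the variable and function names are illustrative.

```python
PROMPTS_TESTED = 50  # assumption: each model saw the same 50 HarmBench prompts

# Attack success rates (percent) as quoted in the article.
reported_asr = {
    "DeepSeek R1": 100,
    "Llama 3.1 405B": 96,
    "GPT-4o": 86,
    "Gemini 1.5 Pro": 64,
    "Claude 3.5 Sonnet": 36,
    "o1-preview": 26,
}


def blocked_prompts(asr_percent, prompts=PROMPTS_TESTED):
    """Number of prompts a model blocked, given its attack success rate."""
    successful = round(asr_percent * prompts / 100)
    return prompts - successful


blocked = {model: blocked_prompts(asr) for model, asr in reported_asr.items()}
for model, count in blocked.items():
    print(f"{model}: blocked {count} of {PROMPTS_TESTED} prompts")
```

Under that assumption, the gap between the models is the difference between blocking none of the prompts (DeepSeek R1) and blocking 37 of them (o1-preview).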

These other models, while not impervious, possess some level of internal safeguards designed to prevent the generation of harmful content. DeepSeek R1 appears to lack these safeguards.

Controversies surrounding DeepSeek R1

The research team’s analysis points to a potential trade-off between efficiency and safety in DeepSeek’s approach. While the company has succeeded in developing a high-performing model at a fraction of the usual cost, it appears to have done so at the expense of robust safety mechanisms.

DeepSeek R1 has also faced several controversies since its launch. Recently, independent research company SemiAnalysis suggested that developing this AI model could have cost around $1.3 billion, far more than the roughly $6 million the company claimed.

In addition, OpenAI has accused DeepSeek of data theft. Sam Altman’s company said that the Chinese AI startup used outputs from OpenAI's proprietary models to train a competing chatbot. It is worth noting, however, that OpenAI itself has been sued on multiple occasions for alleged copyright infringement and data misuse.

Meanwhile, a group of researchers in the United States has claimed to have reproduced the core technology behind DeepSeek’s headline-grabbing AI at a total cost of roughly $30.

While developing an AI chatbot cost-effectively is certainly tempting, the Cisco report underscores that safety and security should not be sacrificed for performance or cost.
