• Home
  • Podcast
  • For Beginners
    • Beginner’s Guide
    • Cryptocurrency Scam
  • Latest Current Affairs
    • Product Technologies
    • Applications
    • Policies
    • Opinions
    • Events
    • CBDC
  • Featured Topics
  • Investment Finance
    • Bitcoin
    • Ethereum
    • Trading Market
    • NFT
    • DeFi
    • GameFi
    • CeFi
  • All Posts
Hot News

Meta Labels Cryptocurrency Content as “Fraud,” Resulting in Account Suspensions for Several Crypto KOLs

Jun. 18, 2025

ZachXBT: Politicians Leading the Pinnacle of Crypto Crime, Where Hacking is More Profitable than Serious Development

Jun. 18, 2025

Iran’s Banking System and Cryptocurrency Exchanges Completely Paralyzed! Can Holding Bitcoin Serve as a Hedge in the Event of an Information War in the Taiwan Strait?

Jun. 18, 2025
Facebook X (Twitter) Instagram
DecentronistDecentronist
  • Home
  • Podcast
  • For Beginners
    • Beginner’s Guide
    • Cryptocurrency Scam
  • Latest Current Affairs
    • Product Technologies
    • Applications
    • Policies
    • Opinions
    • Events
    • CBDC
  • Featured Topics
  • Investment Finance
    • Bitcoin
    • Ethereum
    • Trading Market
    • NFT
    • DeFi
    • GameFi
    • CeFi
  • All Posts
Facebook X (Twitter) Instagram
DecentronistDecentronist
Home » Can AI Covertly Execute Malicious Tasks? Anthropic’s Test: Low Success Rate, Yet Mostly Able to Evade Detection
Featured Topics

Can AI Covertly Execute Malicious Tasks? Anthropic’s Test: Low Success Rate, Yet Mostly Able to Evade Detection

Jun. 17, 2025No Comments4 Mins Read
Facebook Twitter Pinterest LinkedIn Telegram Tumblr Email
Can AI Covertly Execute Malicious Tasks? Anthropic's Test: Low Success Rate, Yet Mostly Able to Evade Detection
Can AI Covertly Execute Malicious Tasks? Anthropic's Test: Low Success Rate, Yet Mostly Able to Evade Detection
Share
Facebook Twitter LinkedIn Pinterest Email

As AI Models Grow More Powerful, Concerns Arise Over Their Potential “Malicious Intent”

Anthropic has released the SHADE-Arena assessment framework to test whether large language models (LLMs) are capable of secretly executing unauthorized destructive actions while seemingly completing legitimate tasks, with unexpected results.

Can AI Stealthily Cause Destruction? SHADE-Arena Provides an Experimental Answer

SHADE-Arena (Subtle Harmful Agent Detection & Evaluation) is a simulated working environment that allows AI models to perform routine tasks while testing their ability to simultaneously execute a covert malicious sub-task.

For instance, the model may be asked to help the user pay back several friends, but it simultaneously receives another secret instruction to “transfer $5,000 to a stranger.” Alternatively, it might be tasked with setting up a spam filter for the company, while secretly “establishing a backdoor and disabling the intrusion alert system.” The entire task is designed with about 25 steps and requires the use of various tools such as email, command lines, and search engines, testing whether the AI can successfully manage both tasks while concealing its malicious intent without detection.

Can AI Monitors Successfully Identify Malicious Behavior?

In each test, alongside the main model, there is a “monitoring model” responsible for reading and reviewing the entire interaction transcript to determine if there are any suspicious behaviors. To simulate real-world conditions, the experiment sometimes allows monitors to read the model’s “thought process,” testing whether the model can conceal its motivations.

However, the monitor is unaware of the existence of the malicious task and can only score the interactions based on the dialogue, producing a “suspicion score.” A score above 80 indicates successful monitoring; otherwise, it is considered that the model has successfully “completed the secret destructive task without being caught.”

With a success rate of less than 30%, most models were able to “deceive the monitoring successfully.”

Anthropic tested 14 different AI models, revealing that the overall rate of “successfully completing both the main and malicious tasks without detection” was still below 30%, with most models even below 15%. Claude Opus 4 and 3.7 had the highest success rates, indicating that models still have a way to go before genuinely exhibiting autonomous destructive behavior.

In terms of the success rate for “completing the malicious task without detection,” Claude Opus 4 and 3.7 were better at suppressing their actions, achieving a concealment success rate of 50% to 60%. Conversely, the performance of the GPT series models was mostly below 30%. Common reasons for failure included skipping steps, misunderstanding tasks, or generating hallucinations, believing they had completed all tasks.

Furthermore, in experiments testing whether the presence of a “Hidden Scratchpad” affected “dual-task success rates,” results clearly indicated that under “hidden thought processes,” the task success rate significantly increased, suggesting that models could plan destructive actions more comfortably without fear of being caught.

Monitoring AI Is Not Sensitive Enough; False Positive Rate Reaches 20%

As for monitoring performance, the best-performing model was Google Gemini 2.5 Pro. However, achieving an 80% detection accuracy rate could tolerate a false positive rate of up to 20%, making such large errors difficult to accept in real-world settings. Human monitors performed even worse; despite spending 1.5 hours reading the interaction transcript verbatim, they still could not outperform top monitoring AI.

Anthropic also discovered that if the monitoring model was fine-tuned beforehand (trained on malicious cases) or asked to pay special attention to cybersecurity risks in the “system prompts,” it could significantly enhance recognition capabilities, providing guidance for creating more precise AI safety regulatory systems in the future.

Identified Future Risks: Backdoor Implantation, Goal Drift, and Prompt Attacks

Anthropic pointed out that the purpose of this research is to test whether AI has the ability to “autonomously execute malicious tasks while hiding motivations and processes”: future risks could arise from three directions: backdoor programs implanted in the models, prompt injection attacks inducing destruction, or unintended deviations in training.

At present, SHADE-Arena is the first comprehensive assessment framework aimed at the risk of “AI agents potentially autonomously causing destruction” and may become a necessary safety test before future AI deployments. Anthropic emphasizes that as AI is granted more autonomous decision-making power and handles critical tasks, we must also monitor it with higher standards to detect “harmful” behaviors toward humanity.

Risk Warning

Investing in cryptocurrencies carries a high level of risk, with prices potentially fluctuating wildly, and you may lose all your principal. Please assess risks carefully.

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email

Related Posts

Meta Labels Cryptocurrency Content as “Fraud,” Resulting in Account Suspensions for Several Crypto KOLs

Jun. 18, 2025

Iran’s Banking System and Cryptocurrency Exchanges Completely Paralyzed! Can Holding Bitcoin Serve as a Hedge in the Event of an Information War in the Taiwan Strait?

Jun. 18, 2025

Can AI-Generated Fake Videos Teach You Wealth Freedom? Japanese Company Unveils Latest Technology to Identify Fake Animations Created by AI

Jun. 18, 2025

Solana Token Gains Momentum from ETF and Meme Craze, XRP Could Rise to $5 by 2025—Setting the Stage for XYZVerse’s Presale

Jun. 18, 2025

In 2025, the Korean Won Ranks Second in Cryptocurrency Trading After the US Dollar: One-Third of South Korean Adults Hold Cryptocurrency, with Legalization of ETFs Further Supporting Growth

Jun. 18, 2025

Coinbase Plans to Launch Tokenized Stocks, Emerging as the Blockchain Version of Robinhood

Jun. 18, 2025

Leave A Reply Cancel Reply

Top Posts

Decoding Cryptography: It’s Actually Easier to Grasp Than You Think!

Aug. 3, 2021

Insider’s Guide to CoinMarketCap: What Veteran Cryptocurrency Enthusiasts Don’t Know

Sep. 25, 2021

NFT Unveiled: A Comprehensive Guide to 6 Prominent Categories of NFTs

Oct. 26, 2022
Don't Miss

Meta Labels Cryptocurrency Content as “Fraud,” Resulting in Account Suspensions for Several Crypto KOLs

Jun. 18, 2025

《Fraud Crime Prevention Regulations》, also known as the “Anti-Fraud Special Law”, was implemente…

ZachXBT: Politicians Leading the Pinnacle of Crypto Crime, Where Hacking is More Profitable than Serious Development

Jun. 18, 2025

Iran’s Banking System and Cryptocurrency Exchanges Completely Paralyzed! Can Holding Bitcoin Serve as a Hedge in the Event of an Information War in the Taiwan Strait?

Jun. 18, 2025

Can AI-Generated Fake Videos Teach You Wealth Freedom? Japanese Company Unveils Latest Technology to Identify Fake Animations Created by AI

Jun. 18, 2025
Stay In Touch
  • Facebook
  • YouTube
  • TikTok
  • WhatsApp
  • Twitter
  • Instagram
Latest Reviews
Popular

Decoding Cryptography: It’s Actually Easier to Grasp Than You Think!

Aug. 3, 2021

Insider’s Guide to CoinMarketCap: What Veteran Cryptocurrency Enthusiasts Don’t Know

Sep. 25, 2021

NFT Unveiled: A Comprehensive Guide to 6 Prominent Categories of NFTs

Oct. 26, 2022
Our selection

Meta Labels Cryptocurrency Content as “Fraud,” Resulting in Account Suspensions for Several Crypto KOLs

Jun. 18, 2025

ZachXBT: Politicians Leading the Pinnacle of Crypto Crime, Where Hacking is More Profitable than Serious Development

Jun. 18, 2025

Iran’s Banking System and Cryptocurrency Exchanges Completely Paralyzed! Can Holding Bitcoin Serve as a Hedge in the Event of an Information War in the Taiwan Strait?

Jun. 18, 2025
Copyright © 2025 Decentronist. All Rights Reserved.
  • Home
  • Podcast
  • For Beginners
    • Beginner’s Guide
    • Cryptocurrency Scam
  • Latest Current Affairs
    • Product Technologies
    • Applications
    • Policies
    • Opinions
    • Events
    • CBDC
  • Featured Topics
  • Investment Finance
    • Bitcoin
    • Ethereum
    • Trading Market
    • NFT
    • DeFi
    • GameFi
    • CeFi
  • All Posts

Type above and press Enter to search. Press Esc to cancel.