A new AI benchmark suggests models can now handle tasks taking 16 hours, crossing a critical threshold for autonomous work and cybersecurity applications.
A frontier AI model from Anthropic has demonstrated the ability to autonomously complete complex software engineering tasks lasting up to 16 hours, a new capability threshold that is reshaping the landscape for AI-driven cybersecurity. The result, from AI evaluation group METR, suggests super-exponential growth in model capabilities, a trend that cybersecurity vendors such as Palo Alto Networks Inc. report is already having a dramatic impact on both offensive and defensive operations.
"Using [frontier AI] to assist in vulnerability analysis, in just 3 weeks, the depth and breadth of work completed are equivalent to the workload of an entire top-level penetration testing team for a whole year," Palo Alto Networks wrote in a recent report on the technology's impact.
The new benchmark shows that Anthropic's Claude Mythos model can achieve a 50% success rate on tasks that require 16 hours of human work. This leap in capability is forcing a rapid recalculation of risk and productivity across the software world. Palo Alto Networks, which was granted early access to the model, found it could find and chain together multiple low-risk vulnerabilities into a severe attack chain in just 25 minutes, compressing a process that would otherwise take far longer.
The development accelerates an AI arms race among cybersecurity firms, putting pressure on incumbents like Palo Alto Networks (PANW), Fortinet (FTNT), and Zscaler (ZS). It also intensifies the platform competition between AI developers like Anthropic and its rival OpenAI. For investors, the key question is how this new level of AI autonomy translates into reliable enterprise products and defensible revenue streams.
A New Benchmark for AI Autonomy
The METR "time horizon" graph measures the length, in human working hours, of software development tasks that frontier models can complete. The latest results show Mythos successfully handling 16-hour tasks half the time, a significant jump from the minutes-long or single-hour tasks that models could handle in previous years. The evaluator noted that its own ability to test models is being strained: it has a limited number of tasks designed to take more than 16 hours, making it difficult to measure the true upper bound of the model's capability.
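The core idea behind a time-horizon estimate can be sketched simply: record success or failure on tasks of varying human-time length, fit a logistic curve of success probability against log task length, and report the length at which the fitted probability crosses 50%. The data and fitting loop below are purely illustrative, not METR's actual dataset or code.

```python
import numpy as np

# Illustrative sketch of a time-horizon estimate: fit a logistic model
# P(success) = sigmoid(a + b * log2(length)) to made-up task outcomes,
# then solve for the task length where P(success) = 0.5.
lengths = np.array([0.25, 0.5, 1, 2, 4, 8, 8, 16, 16, 32, 64])  # hours
success = np.array([1,    1,   1, 1, 1, 1, 0, 1,  0,  0,  0])   # invented

x = np.log2(lengths)
a, b = 0.0, 0.0
for _ in range(20000):  # plain gradient ascent on the log-likelihood
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    a += 0.02 * np.sum(success - p)
    b += 0.02 * np.sum((success - p) * x)

# P(success) = 0.5 where a + b*x = 0, so the 50% horizon is 2**(-a/b) hours.
horizon = 2.0 ** (-a / b)
print(f"fitted 50% time horizon: {horizon:.1f} hours")
```

With these made-up outcomes the fitted horizon lands between the longest solved and shortest failed tasks, which is all the 50% line claims: half of tasks at that length succeed, not that every such task does.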
This rapid, accelerating progress has been dubbed "super-exponential" growth, with each generational leap in AI capability appearing larger than the last. The trendline suggests that capabilities predicted for 2027 are already being met, fueling both excitement about productivity gains and anxiety about the security implications of increasingly powerful and autonomous AI agents.
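"Super-exponential" has a concrete meaning here: not only does the time horizon grow exponentially, its doubling time itself shrinks from one generation to the next. The horizon figures below are invented for illustration, not METR's published measurements; the check itself is just arithmetic.

```python
import numpy as np

# Super-exponential growth: log(horizon) grows faster than linearly in
# time, so the per-interval doubling time shrinks. Figures are invented.
months = np.array([0, 12, 24, 36, 48])
horizon_hours = np.array([0.05, 0.15, 0.6, 3.5, 28.0])

rates = np.diff(np.log2(horizon_hours)) / np.diff(months)  # doublings/month
doubling_time_months = 1.0 / rates
print(doubling_time_months)  # shrinking values => super-exponential

# Plain exponential growth would keep these roughly constant instead.
```

A constant doubling time is what forecasts like "capabilities predicted for 2027" were built on; a shrinking one is why those forecasts are being met early.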
From Lab to Live Fire: Cybersecurity's 'Atomic Moment'
The findings from Palo Alto Networks' research provide a stark, real-world example of the METR benchmark's implications. The ability to automate a year's worth of work by a top-tier human team into three weeks represents a fundamental shift in the balance between cyber offense and defense.
This capability is not limited to one company. Competitors are also integrating advanced AI. CrowdStrike Holdings (CRWD), recently named a leader in the 2026 Gartner Magic Quadrant for Cyberthreat Intelligence, is expanding its Project QuiltWorks coalition to apply frontier AI to risk management. SentinelOne (S) has launched its Wayfinder service, using AI to identify and prioritize exploitable attack paths, while Okta Inc. (OKTA) is developing new frameworks to manage identities for AI agents themselves.
Reality Check: Is 50% Success Good Enough?
While the 16-hour figure is impressive, critics caution against over-extrapolating from the benchmark. The key qualifier is the 50% success rate. For research and development, where a human expert can review and discard failed attempts, a 50% success rate on a 16-hour task is transformative: on average, every two model attempts yield one usable result, a large multiplier on a single engineer's output.
However, for a fully autonomous system deployed in a production environment, a 50% failure rate is unacceptable. "The reliability threshold for autonomous commercial use is somewhere between 95% and 99.9%," noted AI researcher Gary Marcus in a recent analysis. He argues that the METR graph, by focusing only on the 50% success line, doesn't show how quickly AI is closing the gap to enterprise-grade reliability. The debate over how long it will take to bridge the gap from 50% to 99% success is central to the discussion around artificial general intelligence (AGI) and its real-world impact.
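One hedged way to frame the 50%-to-99% gap: if failures can be reliably detected by a verifier, human or automated, independent retries compound. Assuming attempts are truly independent, which in practice they may not be, the arithmetic looks like this:

```python
import math

# With per-attempt success rate p and a verifier that reliably catches
# failures, k independent retries succeed with probability 1 - (1 - p)**k.
# Independence between attempts is a strong assumption in practice.
p = 0.5
for k in (1, 2, 4, 7):
    print(k, 1 - (1 - p) ** k)

# Attempts needed to reach a target reliability:
target = 0.99
k_needed = math.ceil(math.log(1 - target) / math.log(1 - p))
print(f"attempts needed for {target:.0%} reliability: {k_needed}")
```

This only helps where failed attempts are cheap to detect and retry; for a fully autonomous system with no verifier in the loop, the per-attempt rate is the rate that matters, which is the substance of Marcus's objection.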
This article is for informational purposes only and does not constitute investment advice.