Security

Mythos vs curl: One Bug, Five Reports, and a Lot of Hype

Anthropic said Mythos was too dangerous to release. It scanned 178,000 lines of curl and found one low-severity issue. The takeaway is more interesting than the headline.

Yacine Kahlerras
Software Engineer, Platform & UX at TurboDocx
May 15, 2026 · 9 min read

The short version

Written after reading Daniel Stenberg's writeup and sitting with it for a couple of days

  1. Anthropic introduced Mythos in April 2026 and described it as so capable at finding security flaws that they were rationing access. That framing set very high expectations.

  2. When Mythos was pointed at curl, a 178,000-line C codebase running on roughly twenty billion devices, it returned five “confirmed” vulnerabilities. The curl team kept one.

  3. The interesting question is not “is Mythos bad?” It is “what does this tell us about how mature curl already is, and how teams should actually use AI scanners?”

What actually happened

I want to be careful with the framing here because Daniel Stenberg, who has maintained curl for almost three decades, wrote his own post about this and his tone was measured. He was not dunking on Anthropic. He was telling a story about expectations.

curl is one of those projects that sits underneath almost everything. It ships with macOS, Linux, Windows, embedded firmware, satellites, and game consoles. The README says it runs on over twenty billion devices across more than a hundred operating systems. That is a lot of attack surface, and it is the kind of code base where finding a real, exploitable bug is genuinely hard because so many people have already looked.

In early May 2026, someone with access to Mythos ran it against the curl master branch. The scan covered roughly 178,000 lines of C at commit 455bebc. It returned five items the model called confirmed security vulnerabilities. The curl team then did what they always do: triage.

Three of the five were rejected as false positives. They flagged behaviors already documented as expected API limitations. The fourth was classified as a regular bug, worth fixing but not a security issue. The fifth survived, and it is now scheduled to ship as a low-severity CVE in curl 8.21.0, planned for late June 2026.

Around twenty smaller observations also came out of the scan. None of them are vulnerabilities. The curl team will work through them when there is time. That is a useful side effect, and worth acknowledging.

A short timeline

  1. Apr 2026

    Anthropic announces Mythos, calling it dangerously good at finding security flaws

  2. Late Apr 2026

    Anthropic distributes Mythos to selected open source projects via the Linux Foundation Alpha Omega initiative

  3. May 6, 2026

    Someone with Mythos access scans curl master, covering 178,000 lines of C at 455bebc

  4. May 6, 2026

    Mythos reports five confirmed security vulnerabilities to the curl team

  5. May 7 to 10, 2026

    curl security team triages the report and rejects four of the five findings

  6. May 11, 2026

    Daniel Stenberg publishes his writeup; one low-severity CVE confirmed for curl 8.21.0

The triage, in numbers

Numbers help. Here is what the five findings turned into after the curl team finished reading them, plus the broader observations.

5 reports submitted; after triage:

  • 1 confirmed CVE. Low-severity, ships in curl 8.21.0.

  • 1 regular bug. Worth fixing, not a security issue.

  • 3 documented behaviors. False positives flagging known API limits.

~20 secondary observations outside the original five, currently being worked through by the curl team.

0 memory safety flaws found in a 178,000-line C codebase. The canonical demo case for AI code analysis, and Mythos came back empty.

Why curl is one of the worst possible targets for a benchmark

This is the part most of the coverage skipped. If you picked a random open source library and ran Mythos on it, the chances of finding something serious would probably be higher. curl is not a random library.

The project has been audited multiple times. It has continuous fuzzing through OSS-Fuzz. Daniel and the maintainers run a tight disclosure process with a real triage SLA. The codebase has been picked over by every previous generation of AI security tool, from AISLE to Zeropath to Codex Security. Those earlier tools triggered a couple of hundred bug fixes over eight to ten months. There is simply less low-hanging fruit left.

When you point a new scanner at a hardened codebase, you are not measuring the scanner. You are measuring the residual surface that survived everything that came before. If the new scanner finds zero memory bugs in curl, that is not the same thing as it being unable to find memory bugs anywhere. It would probably embarrass the average npm package within an hour.

We saw this same pattern with the Axios supply chain attack and the Claude Code source map leak. The interesting failures live in the seams between projects, not inside the projects that have already been hardened. Mythos was tested in exactly the place where it had the least to discover.

What Mythos did get right

I do not want to leave the impression that this was a failed run. It was a focused, well-structured analysis. The technique itself is worth copying for teams who want to point their own LLMs at their own code.

  • Parallel reading. The setup used LLM subagents reading files in parallel, which is how you cover 178k lines of C in a reasonable time window.

  • Source verification. Findings were re-checked against the actual source before they were recorded, which cut down on hallucinated bugs.

  • No black-box SAST. No traditional static analyzer in the loop. The whole thing was driven by language model reasoning over the code.

  • Real triage, not just noise. Even with four false positives, the report was specific enough for the curl team to triage in days rather than weeks.
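The first two points are easy to replicate at a smaller scale. Here is a minimal sketch of the pattern in Python, with `ask_model` as a hypothetical stand-in for whatever LLM API you use (Mythos's actual interface is not public, and nothing below is its real setup):

```python
import concurrent.futures
from pathlib import Path

def ask_model(prompt: str) -> list:
    """Hypothetical stand-in for an LLM API call. Expected to return
    findings as dicts like {"snippet": "...", "issue": "..."}."""
    return []  # wire up your actual provider here

def review_file(path: Path, model=ask_model) -> list:
    source = path.read_text(errors="replace")
    findings = model(f"Review this C file for memory safety issues:\n{source}")
    # Source verification: drop any finding whose quoted snippet does not
    # literally appear in the file. This is the step that cuts hallucinations.
    return [f for f in findings if f.get("snippet", "") in source]

def scan(root: str, workers: int = 8) -> list:
    # Parallel reading: one subagent (here, one worker thread) per file.
    files = sorted(Path(root).rglob("*.c"))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(review_file, files)
    return [finding for findings in per_file for finding in findings]
```

The verification step is cheap, and it targets the most common LLM failure mode: a confident report that quotes code which is not actually in the file.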

What this means for your team

If you read the headlines from the past week, you might walk away thinking AI code review is overhyped. I think that is the wrong takeaway. The right takeaway is that an AI scanner is only as useful as the foundation underneath it.

Before you spend a quarter integrating an AI security tool, ask yourself if you have done the boring work first. The boring work is the work that makes the AI useful.

# A pragmatic baseline before you point an AI at your codebase
# This is not a Mythos config. It is a checklist most teams skip.
1. Run fuzzers on every parser and protocol boundary
- libFuzzer, AFL++, OSS-Fuzz if your project qualifies
- Fix every crash before you call anything "audited"
2. Set up AddressSanitizer and UndefinedBehaviorSanitizer in CI
- Build at least one job with -fsanitize=address,undefined
- These catch the bugs an AI is most likely to confidently miss
3. Pin dependencies and gate updates
- Lockfiles in every language ecosystem
- Cooldown windows before pulling new versions
4. Establish a real disclosure policy
- A SECURITY.md with a reachable contact
- A triage SLA you actually meet
5. Have humans review the AI output
- Treat findings as leads, not verdicts
- One false positive is fine, ten will burn your team out
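Point 5 is where most teams under-invest. Mythos's five curl reports collapsed to one after human triage, and the first pass of that collapse can be automated. A small sketch, assuming findings arrive as dicts with `summary` and `confidence` fields (the field names are illustrative, not any real tool's schema):

```python
# Rank order for reading queue: humans look at high-confidence leads first.
CONFIDENCE_RANK = {"high": 0, "medium": 1, "low": 2}

def triage(findings, documented_behaviors):
    """Split raw scanner findings into a review queue and an auto-rejected pile."""
    queue, rejected = [], []
    seen = set()
    for f in findings:
        key = f["summary"].strip().lower()
        if key in seen:
            continue  # duplicate report of the same issue
        seen.add(key)
        # Reject anything matching a known, documented limitation -- three of
        # Mythos's five curl reports fell into exactly this bucket.
        if any(doc in key for doc in documented_behaviors):
            rejected.append(f)
        else:
            queue.append(f)
    queue.sort(key=lambda f: CONFIDENCE_RANK.get(f.get("confidence", "low"), 2))
    return queue, rejected
```

This does not replace the human read; it just ensures the human reads leads in a sensible order and never re-reads a documented limitation twice.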

Once the foundation is there, the second lever is how you brief the model. The default behavior of most AI scanners is to report anything that looks unusual. That is where the false positives come from. A short framing prompt that names what you care about, and explicitly names what to ignore, can cut the noise dramatically.

# How to brief an AI security scanner so the output is useful
Before:
"Find security vulnerabilities in this codebase."
After:
"You are reviewing src/ for memory safety issues, input
validation gaps, and improper error handling that could
lead to information disclosure.
Ignore: documented API limitations, performance issues,
style problems, missing test coverage.
For each finding, include:
- File path and line range
- The exact code that is unsafe
- A concrete reproduction or trigger
- The class of vulnerability (CWE if you can map it)
- Confidence: high, medium, or low
If a finding is already documented as expected behavior
in the README, man page, or header comment, drop it."
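If you ask for that structure, hold the model to it. A minimal validator, assuming the scanner returns JSON objects with fields mirroring the prompt above (these field names are my own invention, not any tool's real output format):

```python
# Fields the briefing prompt demands for every finding.
REQUIRED_FIELDS = {"path", "lines", "code", "trigger", "confidence"}
VALID_CONFIDENCE = {"high", "medium", "low"}

def validate_finding(finding: dict) -> list:
    """Return a list of problems; an empty list means the finding is usable."""
    problems = [f"missing field: {k}"
                for k in sorted(REQUIRED_FIELDS - finding.keys())]
    if finding.get("confidence") not in VALID_CONFIDENCE:
        problems.append("confidence must be high, medium, or low")
    return problems

def usable(findings):
    """Keep only findings that honor the briefing's output contract."""
    return [f for f in findings if not validate_finding(f)]
```

Rejecting malformed output before a human reads it keeps the triage burden proportional to real signal, not to the model's enthusiasm.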

The honest take on the hype

I think the most useful thing Daniel said in his writeup was that he saw no evidence Mythos was performing at a higher level than the previous wave of AI security tools. That is not the same as saying it is bad. He explicitly said AI code analyzers are valuable and that teams ignoring them are leaving real bugs on the table.

What he was pushing back on was the framing. When you announce a tool by saying you cannot release it because it is too dangerous, you set the expectation that the first public datapoint will be a headline. The first public datapoint was one low-severity CVE in a codebase that had already been picked apart by the previous generation of tools. The gap between framing and result is what created the story.

I am not in the camp that thinks AI scanners are marketing. I have shipped enough of my own bugs to know that a tireless second reader is worth a lot. I am in the camp that thinks the bar for “dangerous” should be higher than “finds documented behavior and calls it a vulnerability.” The product needs to grow into the framing, and that is a perfectly normal place for a brand new model to be.

If you want to read more about how engineering teams are adapting to this generation of tooling, I wrote about the junior developer crisis in an AI native team, and our backend team put together a comparison of Cursor, Claude Code, and OpenCode that goes into more of the day to day workflow. If you are setting your team up with AI assistants, our CLAUDE.md guide covers how to brief them well — the same discipline that determines whether a security scanner gives you signal or noise.


Ship faster, ship safer

TurboDocx automates documents, signatures, and workflows with enterprise-grade security baked in. Spend your engineering hours on the hard problems, not on the paperwork around them.
