A Vibe Coder's Guide to Not Shipping Vulnerabilities
You're vibe coding. I know because the data says you probably are.
According to Stack Overflow's 2025 developer survey, 84% of developers are now using or planning to use AI coding tools. A separate industry survey found 92% of US developers use them daily. Whether you're prompting Claude Code, Cursor, or GitHub Copilot, you've likely experienced the productivity magic: describe what you want, watch it materialize, run the tests, ship it.
Here's the part nobody's talking about: new research suggests that roughly 83% of the "working" code you're shipping contains exploitable security vulnerabilities.
This isn't a piece arguing you should stop vibe coding. The productivity gains are real, and the trajectory is clear—this is how software gets built now. But if your validation process hasn't evolved alongside your development process, you're accumulating a specific kind of debt that compounds quietly until it doesn't.
Let me show you what the research found, why your intuitive fixes won't work, and what actually will.
What the Research Actually Found
In December 2025, researchers at Carnegie Mellon University published "Is Vibe Coding Safe?"—the first rigorous benchmark designed to evaluate security in AI-assisted coding as it's actually practiced: agents working on repository-level tasks across multiple files with minimal human supervision.
The benchmark, called SUSVIBES, tested 200 real-world coding tasks drawn from open-source projects with documented security vulnerabilities. These weren't toy problems. The average task required modifying 172 lines of code across multiple files, covering 77 different vulnerability types (CWEs)—seven times more than previous benchmarks.
The headline finding: While the best-performing agent setup (SWE-Agent with Claude 4 Sonnet) achieved 61% functional correctness, only 10.5% of solutions were actually secure.
Let that math sink in. If only 10.5% of all solutions are both functionally correct and secure, then of the 61% that pass functional tests, roughly 83% contain vulnerabilities that could be exploited in production.
The specific vulnerabilities aren't exotic. They're the classics that have been ending careers for decades:
Timing side-channels in authentication (CWE-208): Your verify_password() function returns immediately on invalid usernames, letting attackers enumerate valid accounts by measuring response times (see the sketch below).
Header injection vulnerabilities (CWE-113): User input flows into HTTP headers without sanitization, enabling response splitting attacks.
Credential exposure (CWE-522): API keys and secrets hardcoded or improperly scoped, sitting in your codebase waiting to be discovered.
Path traversal (CWE-22): File operations that don't properly validate paths, letting attackers read or write outside intended directories.
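To make the first of these concrete, here's a minimal sketch of the leaky pattern and a constant-time alternative. The user store, salts, and function names are hypothetical stand-ins, not anything taken from the SUSVIBES tasks:

```python
import hashlib
import hmac
import os

def _hash(password: str, salt: bytes) -> bytes:
    # PBKDF2 keeps the sketch self-contained; any deliberately slow KDF works here.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

# Hypothetical in-memory user store: username -> (salt, password_hash).
_SALT = os.urandom(16)
USERS = {"alice": (_SALT, _hash("correct horse battery staple", _SALT))}

def verify_password_leaky(username: str, password: str) -> bool:
    # Vulnerable pattern (CWE-208): unknown usernames return before the slow hash
    # runs, so "no such user" is measurably faster than "wrong password" and an
    # attacker can enumerate valid accounts by timing responses.
    record = USERS.get(username)
    if record is None:
        return False
    salt, stored = record
    return hmac.compare_digest(_hash(password, salt), stored)

# Dummy credentials so the slow hash runs even when the username doesn't exist.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = _hash("placeholder-not-a-real-password", _DUMMY_SALT)

def verify_password(username: str, password: str) -> bool:
    # Do the same amount of work for every request and compare in constant time.
    salt, stored = USERS.get(username, (_DUMMY_SALT, _DUMMY_HASH))
    ok = hmac.compare_digest(_hash(password, salt), stored)
    return ok and username in USERS
```

The specific KDF isn't the point; the point is that rejecting an unknown username and rejecting a wrong password should cost the same amount of work.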
These are the kinds of flaws that cost Equifax a $600 million settlement affecting 147 million consumers, that got Uber's former CISO criminally convicted for covering up a breach, and that enabled attackers to steal over $35 million in cryptocurrency from LastPass users after the company's 2022 breach exposed customers' encrypted vaults.
The code that caused these breaches passed functional tests. It worked. It just wasn't safe.
Why Your Intuitive Fixes Won't Work
When developers learn about the security gap in AI-generated code, they reach for an obvious solution: tell the agent to focus on security.
The CMU researchers tested this. They tried three different approaches:
1. Generic Security Prompting
Adding instructions like "Pay attention to security best practices" and "Ensure the implementation is secure against common vulnerabilities."
2. Self-Selection CWE Identification
Asking the agent to identify which vulnerability types (CWEs) might be relevant to the task before implementing, then to reference those during coding.
3. Oracle CWE Hints
Providing the agent with the exact vulnerability type it should watch for—essentially giving it the answer key.
The results were counterintuitive: Every approach reduced functional performance while providing minimal or no security improvement.
Generic security prompts dropped functional pass rates by 5.5-8.5 percentage points. Even oracle hints—telling the agent exactly what vulnerability to avoid—couldn't reliably produce secure code.
The mechanism the researchers identified: agents "overly focus on security, omitting functional edge cases." When you tell an LLM to optimize for two objectives simultaneously, it tends to degrade performance on both. Attention, it turns out, is zero-sum.
The agents also proved poor at identifying relevant security risks. When asked to self-select which CWE types might apply to a task, the best precision achieved was 0.125—meaning 87.5% of the vulnerabilities they flagged as relevant weren't actually present, while they missed many that were.
The bottom line: You cannot prompt your way to secure code.
This isn't a temporary limitation waiting for the next model release. It's a structural property of how current AI coding tools work. They're optimized for functional correctness because that's what benchmarks like SWE-Bench measure. Security is orthogonal to that objective, and sometimes in direct tension with it.
What Actually Works: The Security Gate Framework
If you can't prompt security into existence, you need to build it into your process. The goal isn't to eliminate vibe coding—it's to add security validation that doesn't rely on the same system that wrote the code.
Here's a tiered framework based on risk level:
Tier 1: Low-Risk Features
What qualifies: Internal tools, prototypes, features with no auth/data handling, experimental code that won't touch production.
Security gate: Automated static analysis only.
Run your vibe-coded output through a SAST (Static Application Security Testing) tool before merge:
- Semgrep (open source, highly customizable rules)
- CodeQL (GitHub's analysis engine, free for public repos)
- Bandit (Python-specific, catches common security issues)
This adds maybe 30 seconds to your workflow and catches the obvious stuff: hardcoded secrets, SQL injection patterns, dangerous function calls.
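If you want this as one pre-merge step rather than three separate tools, a small wrapper script works. Here's a minimal sketch that shells out to whichever scanners happen to be installed and fails the merge on any finding; the tool names are from the list above, but the flags are assumptions, so check them against the versions you actually run:

```python
#!/usr/bin/env python3
"""Minimal pre-merge security gate: run the SAST tools that are installed and
exit non-zero if any of them reports findings. Flags are illustrative; verify
them against the versions of the tools you actually use."""
import shutil
import subprocess
import sys

# Each entry: (binary, arguments). Tools not on PATH are skipped with a warning.
SCANNERS = [
    ("semgrep", ["scan", "--config", "auto", "--error"]),    # exits non-zero on findings
    ("gitleaks", ["detect", "--source", ".", "--no-banner"]),
    ("bandit", ["-r", ".", "-q"]),                           # Python projects only
]

def main() -> int:
    failed = []
    for binary, args in SCANNERS:
        if shutil.which(binary) is None:
            print(f"[gate] {binary} not installed, skipping", file=sys.stderr)
            continue
        print(f"[gate] running {binary}...")
        result = subprocess.run([binary, *args])
        if result.returncode != 0:
            failed.append(binary)
    if failed:
        print(f"[gate] blocked by: {', '.join(failed)}", file=sys.stderr)
        return 1
    print("[gate] no findings from installed scanners")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```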
What it won't catch: Logic vulnerabilities, authentication flaws, timing attacks—the subtle stuff that requires understanding intent.
Tier 2: Medium-Risk Features
What qualifies: Customer-facing features, API endpoints, anything that processes user input but doesn't handle authentication or sensitive data directly.
Security gate: Automated analysis + structured self-review.
Before merge, run the SAST tools from Tier 1, then conduct a focused review using this checklist:
- Input validation: Does every user input get validated before use? Are you using allowlists rather than blocklists?
- Output encoding: Is data properly encoded before rendering in HTML, SQL, shell commands, or logs?
- Error handling: Do error messages leak implementation details? Are exceptions caught and handled appropriately?
- Dependencies: Did the AI pull in any new packages? Are they from reputable sources with active maintenance?
- Data flow: Trace sensitive data through the code. Does it ever get logged, cached, or stored unexpectedly?
This review takes 10-15 minutes for a typical feature. It's not comprehensive, but it catches the second tier of issues that automated tools miss.
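To make the first two checklist items concrete, here's a minimal sketch of allowlist validation and context-appropriate output encoding. The endpoint, field names, and allowlist values are hypothetical:

```python
import html

# Hypothetical report endpoint: the caller supplies a sort column and a display name.
ALLOWED_SORT_COLUMNS = {"created_at", "name", "status"}  # an allowlist, not a blocklist

def validate_sort_column(value: str) -> str:
    # Reject anything not explicitly allowed instead of trying to strip "bad" input.
    if value not in ALLOWED_SORT_COLUMNS:
        raise ValueError("unsupported sort column")  # generic message, no schema details leak
    return value

def render_greeting(display_name: str) -> str:
    # Encode for the output context (HTML here) at the point of rendering,
    # so the stored value stays intact and can't break out of the markup.
    return f"<p>Welcome back, {html.escape(display_name)}</p>"

# validate_sort_column("created_at") passes;
# validate_sort_column("name; DROP TABLE users") raises;
# render_greeting("<script>alert(1)</script>") comes out inert.
```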
Tier 3: High-Risk Features
What qualifies: Authentication, authorization, payment processing, PII handling, cryptographic operations, admin functionality, anything that could cause significant harm if compromised.
Security gate: Human security review before merge. Full stop.
For Tier 3 features, you have three options:
Option A: Write it yourself. For truly security-critical code, vibe coding may not be the right tool. The productivity gain isn't worth the risk multiplication.
Option B: Vibe code + mandatory security specialist review. Use AI to generate a first draft, but require sign-off from someone with security expertise before it reaches production. This could be a dedicated security engineer, a senior developer with security background, or an external review service.
Option C: Vibe code + external security audit. For the highest-stakes features (payment processing, core authentication), consider engaging a professional security firm for review. Yes, this costs money. It costs less than a breach.
The Security Tax Calculation
The framework above adds friction. Let's be honest about the cost.
For a team shipping 10 features per week with a typical risk distribution (60% Tier 1, 30% Tier 2, 10% Tier 3):
- Tier 1 (6 features): ~3 minutes each in SAST tooling = 18 minutes/week
- Tier 2 (3 features): ~15 minutes each for review = 45 minutes/week
- Tier 3 (1 feature): ~2-4 hours for security review = 3 hours/week
Total overhead: ~4 hours/week for a team shipping 10 features.
Compare that to the productivity gain from vibe coding. If AI tools are giving you a 30% productivity boost (a conservative estimate based on published studies), and your team spends 200 hours/week on development, you're gaining 60 hours while spending 4 on security validation.
That's a 15:1 return—and it doesn't account for the cost of not doing security review, which shows up as:
- Incident response time when vulnerabilities are discovered
- Customer trust erosion after breaches
- Regulatory penalties (GDPR fines can reach 4% of global revenue)
- The Uber scenario: personal criminal liability for security leadership
The security tax isn't eliminating your vibe coding gains. It's insuring them.
Tool Recommendations
Based on the research findings and practical implementation considerations:
Static Analysis (Tier 1+)
Semgrep — Best for teams wanting customizable rules. Open source core, with a registry of community-contributed patterns. Can be integrated into CI/CD pipelines or run locally.
CodeQL — Best for GitHub-native teams. Deeper analysis capabilities, queries can be customized, free for public repositories and available for private repos on GitHub Enterprise.
Snyk — Best for dependency-focused security. Excellent at catching vulnerable packages that AI tools might introduce. Has IDE plugins for real-time feedback.
Secret Detection
GitLeaks — Catches hardcoded secrets before they reach your repository. Essential given that AI tools frequently generate code with placeholder credentials that developers forget to remove.
TruffleHog — More comprehensive secret scanning, can analyze git history to find credentials that were committed and later removed.
Security Review Augmentation
OWASP ZAP — For Tier 2/3 features with web interfaces. Automated scanner that can catch many common web vulnerabilities.
Burp Suite — For teams with security expertise doing manual testing. The community edition is free and sufficient for most review needs.
The Uncomfortable Implication
There's a finding in the CMU research that deserves more attention: different AI models have different security blind spots, with only 42% overlap in the vulnerability types they handle well.
Claude shows stronger performance on credential protection vulnerabilities. Gemini performs better on cryptographic issues. OpenHands (an agent framework) achieved a 19.4% security rate versus SWE-Agent's 8.9% on equivalent tasks.
This suggests that if you're serious about security, your choice of AI coding tool matters—and you might benefit from using different tools for different risk tiers, or running security-focused review through a different model than the one that generated the code.
It also suggests that the problem isn't going away with the next model release. Security blind spots appear to be structural, not incidental. Each model family has them; they're just different blind spots.
What To Do Monday Morning
If you're vibe coding today and haven't implemented security gates, here's where to start:
This week:
- Add Semgrep to your CI pipeline. This takes 15 minutes and catches the obvious stuff.
- Install GitLeaks as a pre-commit hook (a minimal hook sketch follows this list). Stops secrets from reaching your repository.
- Review your last 5 shipped features. Which tier would each fall into? How many Tier 3 features shipped without security review?
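If you'd rather not adopt a hook framework just for this, a bare git hook can be a short script. Here's a minimal sketch you'd save as .git/hooks/pre-commit and mark executable; the gitleaks subcommand and flags are assumptions, so check them against the version you have installed:

```python
#!/usr/bin/env python3
"""Minimal pre-commit hook: block the commit if gitleaks flags a possible secret
in the staged changes. Subcommand and flags are illustrative; verify them against
your installed gitleaks version."""
import shutil
import subprocess
import sys

if shutil.which("gitleaks") is None:
    # Fail closed: a missing scanner shouldn't silently disable the gate.
    print("pre-commit: gitleaks is not installed; install it or bypass deliberately", file=sys.stderr)
    sys.exit(1)

# "protect --staged" scans what is about to be committed rather than repository history.
result = subprocess.run(["gitleaks", "protect", "--staged", "--no-banner"])
if result.returncode != 0:
    print("pre-commit: possible secret detected; commit blocked", file=sys.stderr)
    sys.exit(1)
```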
This month:
- Implement the tiered framework above, even informally. Start categorizing features by risk level.
- Identify who on your team has security expertise. For Tier 3 features, they're your reviewer.
- If nobody has security expertise, that's a hiring or training gap. Address it.
This quarter:
- Formalize the process. Document your risk tiers and security gates.
- Track metrics: What percentage of features get security review? What's your vulnerability discovery rate post-ship?
- Consider whether your highest-risk features should be vibe-coded at all.
The Bottom Line
Vibe coding is here to stay. The productivity gains are real, the adoption is accelerating, and fighting the tide makes less sense than learning to swim.
But "it passes tests" is no longer sufficient acceptance criteria. The CMU research demonstrates that functional correctness and security are largely independent properties—you can have one without the other, and AI coding tools are currently optimized only for the former.
The fix isn't to abandon AI-assisted development. It's to recognize that security validation is now a separate workflow concern that requires its own tooling and process.
The teams that figure this out will ship faster and safer. The teams that don't will learn about the 83% problem the hard way—when a vulnerability makes it to production and the incident response begins.
The question isn't whether to use AI coding tools. It's whether your validation process has caught up to how you're building.