The vibe coding spectrum: from weekend hacks to the dark factory
By Iain,
A year ago, Andrej Karpathy posted a tweet that would come to define how an entire industry talks about itself. “There’s a new kind of coding I call ‘vibe coding,’” he wrote, “where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” He described asking for trivial UI changes through voice commands, accepting all suggestions without reading the diffs, and copy-pasting error messages back into the chat until things worked. “It’s not really coding,” he admitted. “I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.”
The term became inescapable almost overnight. It was named Collins English Dictionary’s Word of the Year for 2025. Y Combinator reported that 25% of startups in its Winter 2025 batch had codebases that were 95% AI-generated. Lovable, a vibe coding platform, became a unicorn eight months after being founded. By the end of the year, 72% of developers were using AI tools daily, and those tools were contributing to roughly 42% of all committed code. The iOS App Store saw a 60% increase in new releases, attributed largely to the accessibility of AI-assisted development.
But the cultural response followed a predictable pattern. The backlash was immediate, vocal, and often justified. Lovable-created applications were found to have security vulnerabilities that left personal data exposed. A CodeRabbit analysis of open-source GitHub pull requests found that AI co-authored code contained 2.74 times more security vulnerabilities and 75% more logic errors than human-written code. A METR randomized controlled trial found that experienced open-source developers were actually 19% slower when using AI tools, despite believing they were 20% faster. Researchers from several universities published a paper titled “Vibe Coding Kills Open Source,” arguing that increased AI-assisted development was reducing engagement with open-source maintainers in ways that carried hidden costs for the entire ecosystem.
The criticism crystalized into a familiar refrain. Andrew Ng called the term “unfortunate” and “misleading,” insisting that guiding an AI to write software was “a deeply intellectual exercise” that left him “frankly exhausted by the end of the day.” Bloggers declared vibe coding was “a weekend hack that’s not ready for the real world.” LinkedIn was awash with posts insisting that we would always need humans in the loop, that understanding the code was non-negotiable, and that anyone telling young engineers not to learn programming was dispensing reckless career advice.
Much of this pushback was perfectly sensible. The security problems were genuine. The quality concerns were well-founded. The skepticism about non-engineers building production systems without understanding what they were deploying was appropriate and necessary.
But something happened that most of these critics failed to anticipate, and it happened faster than almost anyone expected.
On the spectrum
The problem with the vibe coding discourse was never that the critics were wrong about the risks. The problem was that almost everyone, advocates and skeptics alike, was talking about only one end of a very wide spectrum. Karpathy himself was clear about this. He was describing weekend projects, throwaway experiments, the kind of thing where failure costs nothing and the fun is the point. He acknowledged that the code grew “beyond my usual comprehension” and that when bugs proved stubborn, he would “just work around it or ask for random changes until it goes away.”
This was not a prescription for building enterprise software. It was a description of play. And yet the entire conversation that followed, for nearly a year, treated “vibe coding” as if it were a single thing with a single set of implications, rather than the label for one point on a continuum that stretches from hobbyist experimentation to something almost nobody was prepared for.
At the opposite end of that spectrum, a three-person team at StrongDM was building security software without any human ever writing or even reviewing the code. And they were doing it on purpose, as a matter of principle, with rules that explicitly prohibited it.
Code must not be reviewed by humans
In July 2025, Justin McCarthy, co-founder and CTO of StrongDM, formed a new team with Jay Taylor and Navan Chauhan. Their charter contained a line that would have seemed reckless to most of the industry at the time. As McCarthy described it: “No hand-coded software.” The team established two foundational rules that they have maintained since. Code must not be written by humans. Code must not be reviewed by humans.
They added a practical benchmark that gives some sense of the scale of their ambition. If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.
This was not Karpathy-style vibe coding. This was not a weekend hack, not a throwaway prototype, not a toy app to see how far the technology could stretch. This was a team of deeply experienced engineers building permission management software for enterprise security, software where getting things wrong means unauthorized access to sensitive systems, and they were doing it by constructing an entirely new methodology for how software gets made.
The catalyst, McCarthy explained, was a transition they observed in late 2024. With the second revision of Claude 3.5 (released October 2024), long-horizon agentic coding workflows began to compound correctness rather than error. This is a subtle but critical distinction. Prior to this improvement, letting an AI iterate on a codebase would gradually introduce more problems than it solved. Each step added new misunderstandings, hallucinations, version conflicts, and logic errors until the project collapsed under its own accumulated confusion. After the October 2024 model update, the opposite started happening. Each iteration tended to make things better rather than worse, and this compounding effect meant that an agent could work through complex problems over extended sessions without the whole thing falling apart.
By the time Simon Willison observed what he called the November 2025 inflection point, when Claude Opus 4.5 and GPT 5.2 appeared to cross another invisible capability line, StrongDM’s team had already been operating without human-written code for four months and had built something extraordinary.
The validation problem
The immediate and obvious objection to this approach is the one that every critic of AI-generated code has been making since the beginning. If you are not reading the code, how do you know it works? If the AI writes both the implementation and the tests, what stops it from writing tests that trivially pass? assert true is a perfectly valid test from the agent’s perspective, and it will make the test suite glow green while proving absolutely nothing.
StrongDM’s answer to this challenge is, as Willison put it in his writeup, “the most consequential question in software development right now.” Their solution draws on the concept of scenario testing, originally articulated by Cem Kaner in 2003, but takes it in a direction that the original concept never imagined.
They repurposed the word “scenario” to mean an end-to-end user story that is stored outside the codebase, deliberately kept where the coding agents cannot see it, much like a holdout set in machine learning. Where traditional software testing asks a binary question (“does the test suite pass?”), StrongDM moved to what they call “satisfaction,” a probabilistic and empirical measure. Of all the observed trajectories through all the scenarios, what fraction of them are likely to satisfy the user? This is a fundamentally different way of thinking about software quality, and it mirrors the way machine learning practitioners already think about model evaluation. You do not ask whether a model gets the training set right. You ask whether it generalizes to data it has never seen.
The holdout analogy is powerful because it addresses the reward-hacking problem head on. When the agent that writes the code can also see the tests, it has every incentive and capability to game them. When the validation scenarios live in a completely separate system that the coding agents cannot access, the only way to pass them is to build software that actually works in the ways that matter.
The Digital Twin Universe
The second piece of StrongDM’s approach is the one that Willison described as making the strongest impression on him during his visit to the team in October 2025. They call it the Digital Twin Universe, and it represents the kind of thing that every software team has fantasized about but dismissed as economically impossible.
Their software manages user permissions across a suite of connected services, meaning it needs to interact with Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets. Testing permission management software against live versions of these services is slow, expensive, rate-limited, and potentially dangerous. You cannot stress-test your security software against production Okta at the volumes needed for thorough validation without hitting rate limits, triggering abuse detection, or accumulating enormous API costs.
So StrongDM built complete behavioral clones of all of these services, every one of them generated by coding agents. The approach involved feeding the full public API documentation of each service into their agent harness and having it build a faithful imitation as a self-contained Go binary, complete with a simplified UI for running scenarios. Jay Taylor, who created the DTU, shared on Hacker News that the key insight was using popular publicly available reference SDK client libraries as compatibility targets, always aiming for 100% compatibility.
With their own independent clones of half a dozen major SaaS platforms, free from rate limits and usage quotas, their army of simulated testers could run thousands of scenarios per hour. The scenario tests became scripts for agents to constantly execute against the new systems as they were being built, providing continuous validation at a scale and speed that would be absurd against live services.
This is where the economics of the situation become interesting and also where the “vibe coding is just for prototypes” argument runs aground. As StrongDM puts it, creating a high-fidelity clone of a significant SaaS application was always possible but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against but self-censored the proposal to build it. They did not even bring it to their manager, because they knew the answer would be no.
The agentic moment changed the math so completely that the thing everyone wanted but nobody could justify building became routine.
The Dark Factory
Dan Shapiro, CEO of Glowforge and Wharton Research Fellow, published a taxonomy in January 2026 that provides useful language for understanding where StrongDM fits. In “The Five Levels: from Spicy Autocomplete to the Dark Factory,” Shapiro mapped AI-assisted development to the NHTSA’s five levels of driving automation.
Level Zero is manual coding. You might use AI as a search engine, but every character hits the disk with your approval. Level One is offloading specific tasks. “Write a unit test for this.” Level Two is pair programming with AI, the flow state that most people calling themselves AI-native developers occupy. Level Three is where you become a manager, reviewing the endless diffs that your coding agents produce. “Your life is diffs,” Shapiro wrote. “For many people, this feels like things got worse.” He noted that almost everyone tops out here.
Level Four is where you become, in Shapiro’s terms, “that which you loathed: you’re a PM.” You write specs, argue about them, craft skills for your agents, plan schedules, then leave for twelve hours and check whether the tests pass.
Level Five is the Dark Factory. Named after the Fanuc robot factory in Japan where robots build robots in complete darkness because no humans are present who would need the lights on. At Level Five, the software process is a black box that turns specs into software. “I know a handful of people who are doing this,” Shapiro wrote. “They’re small teams, less than five people. And what they’re doing is nearly unbelievable.”
StrongDM is operating at Level Five. And it is critical to understand what that means in context. This is not a team of amateurs who do not know what code looks like. Justin McCarthy is the CTO and co-founder of a company that has been building security infrastructure for years. These are people who understand exactly what they are choosing not to look at and who have built an elaborate system of validation, simulation, and continuous testing to compensate for the absence of human code review. They are not ignoring the risks that the vibe coding critics worry about. They are addressing those risks through architectural decisions rather than through the traditional mechanism of a human reading every line of code.
The wider movement
StrongDM is not working in isolation. The approach they have pioneered sits within a broader movement of experienced engineers pushing the boundaries of what is possible with coding agents.
Jesse Vincent’s Superpowers, which has accumulated over 43,000 stars on GitHub, takes a complementary approach by encoding decades of software engineering discipline into “skills” that Claude Code agents are compelled to follow. Rather than letting agents improvise their approach to development, Superpowers enforces a mandatory workflow of design, planning, test-driven development, and systematic review. As one user put it, “My personal output now exceeds what my entire teams at Oracle Cloud Infrastructure could produce.” Vincent’s insight was that you do not make agents more reliable by watching them more carefully. You make them more reliable by building better systems around them.
Cognition Labs’ Devin demonstrated another dimension of the transformation when it helped Nubank refactor a monolithic ETL repository spanning over six million lines of code, a migration that would have consumed enormous human engineering hours. After fine-tuning on examples of previous manual migrations, Devin doubled its task completion scores and achieved a four-fold improvement in speed. Goldman Sachs began using hundreds of Devin instances internally, treating them as junior engineers with human review. Amazon’s internal deployment of AI coding tools reportedly saved an estimated 4,500 developer years of effort and $260 million on a single large migration project.
Meanwhile, spec-driven development emerged as a formalized methodology. GitHub published Agent Skills as an open standard in December 2025, and Microsoft, OpenAI, Atlassian, and Figma adopted it. AWS launched Kiro, a spec-driven IDE, in public preview. The Thoughtworks Technology Radar featured spec-driven development in its November 2025 edition. All of this activity pointed in the same direction. The gap between “tell an AI what you want and hope for the best” and “build rigorous systems that turn specifications into validated software” was being filled with real methodologies, real tooling, and verifiable results.
Techniques that compound
StrongDM published a techniques page that reads like a glossary for the next era of software engineering. Beyond the Digital Twin Universe, they describe Gene Transfusion, where agents extract working patterns from one codebase and reproduce them in another context. Semports are semantically-aware automated ports that move code between languages while preserving intent. Pyramid Summaries provide multiple levels of compression so that an agent can quickly scan high-level descriptions and zoom into detail as needed, a form of reversible summarization that addresses one of the fundamental challenges of working with large codebases through AI.
Their description of Shift Work captures the difference between interactive and non-interactive development. When intent is complete, meaning the specs, tests, and reference implementations are all in place, an agent can run end-to-end without any back-and-forth. The human work happens upstream, in the definition and validation of what the software should do. The machine work happens downstream, in the mechanical production and testing of the software itself.
Perhaps the most provocative thing StrongDM released was Attractor, their non-interactive coding agent. The GitHub repository contains no code at all, just three markdown files describing the spec in meticulous detail and a note in the README that you should feed those specs into your coding agent of choice. This is software released as pure specification, a bet that the spec is the product and the code is a generated artifact, as interchangeable and disposable as compiled binaries.
The backlash got it backwards
The persistent criticism of vibe coding tends to fall into one of two categories. The first, which is mostly correct, warns about the risks of deploying AI-generated code without understanding it. Security vulnerabilities, logic errors, maintainability nightmares, skill degradation, these are legitimate concerns and they apply with full force to the “accept all, never read the diffs” approach that Karpathy originally described.
The second category of criticism is the one that StrongDM’s work exposes as fundamentally shortsighted. This is the argument that we will always need humans to write the code, that we will always need humans to review the code, that the role of the software engineer as someone who personally crafts and inspects every line of logic is eternal and non-negotiable. The people making this argument point to the very real failures of naive vibe coding and conclude that the entire trajectory leads nowhere serious, that AI-assisted development is a useful accelerant for human programmers but will never replace the human in the loop.
This position has a comforting logic to it, but it confuses two different things. It confuses the necessity of rigorous validation with the necessity of human code review. StrongDM validates their software more rigorously than most human-reviewed codebases have ever been validated. They run thousands of scenarios per hour against behavioral clones of every service they integrate with. They treat code the way machine learning practitioners treat model weights, as opaque outputs whose correctness is inferred exclusively from externally observable behavior. The code is not the thing that matters. The behavior is the thing that matters.
What they dispensed with was not quality control. What they dispensed with was the specific mechanism of a human reading the source code, and they replaced it with something arguably more thorough, more consistent, and more scalable.
One endpoint on a spectrum
Go back to Karpathy’s original tweet and read it again, but this time think about it as a description of one endpoint on a spectrum rather than a complete philosophy.
At one end, you have someone with no programming knowledge asking an AI to build them a personal app over a weekend. They accept all suggestions, they do not understand the output, and they copy-paste errors until things mostly work. This is the version of vibe coding that the critics rightly worry about. It produces fragile software, it creates false confidence, and it gives people the illusion that they have built something robust when they have built something that happens to work under the conditions they tested.
At the other end, you have StrongDM, where deeply experienced engineers have designed a system in which humans never touch the code not because they are lazy or naive but because they have determined that human code review is the wrong abstraction for ensuring quality at the scale and speed they need. They replaced it with scenario-based validation, Digital Twin Universes, probabilistic satisfaction metrics, and continuous agent-driven testing. The humans are not absent from this process. They are present at the most important layer, defining what the software should do and building the systems that verify whether it does it.
The difference between these two approaches is not a difference of degree. It is a difference of methodology, sophistication, and engineering maturity. And yet the word “vibe coding” has been stretched to cover both of them, which is why the debate has generated more heat than light.
What this means for software engineering
The people who dismiss all of this as hype, who insist that the fundamentals of software engineering are unchanged and that we will always need programmers in roughly the form we have them today, are making the same mistake that people made about every previous wave of abstraction. Compilers did not eliminate the need for people who understood hardware. But they did make it unnecessary for most programmers to think about register allocation. Object-oriented programming did not eliminate the need for structured thinking. But it did change what that thinking was about.
What StrongDM, Jesse Vincent, and the broader software factory movement are demonstrating is that the next abstraction layer moves the human from the code to the specification and from the review to the validation system. The engineering does not disappear. It moves up the stack. The person who builds a Digital Twin Universe of six SaaS platforms and designs a probabilistic satisfaction framework for evaluating agent-generated code is doing engineering of the highest order. They are just not doing it in the same place.
This is consequential because it changes the economics of software production in ways that compound. When building a behavioral clone of Slack or Okta becomes an afternoon’s work for a coding agent rather than a six-month project for a team of engineers, things that were previously unthinkable become routine. When validation can run at thousands of scenarios per hour against simulated environments, the feedback loop between specification and working software shrinks from weeks to hours. When code is treated as a generated artifact rather than a crafted document, it can be regenerated from improved specs instead of incrementally patched, which inverts the entire relationship between software teams and technical debt.
Shapiro called this technical deflation. The cost of producing code is dropping so fast that every assumption built on the previous cost structure needs to be re-examined. The smart teams, he argued, are deferring payment on human hours today to pay them back with cheaper AI hours tomorrow.
The fundamental question
There is a natural desire to find the safe middle ground in this conversation, to acknowledge that AI tools are useful while insisting that human developers remain essential in all the same ways they have always been. That position felt reasonable a year ago. It feels increasingly difficult to maintain in the face of what StrongDM has demonstrated.
This does not mean that all software will be built this way tomorrow, or that there are no legitimate concerns about the approach. The token costs are significant. $1,000 per day per engineer translates to $20,000 per month in overhead, which means this methodology needs to produce proportionally more value to justify itself. StrongDM is building in a specific domain with specific characteristics, permission management across SaaS platforms, and it remains to be seen how broadly the Dark Factory pattern generalizes. The November 2025 inflection point that made this reliability possible was just a few months ago, and the long-term track record of agent-built software is by definition short.
But the direction of travel is clear, and the people leading it are not amateurs or futurists or LinkedIn influencers spinning provocative narratives. They are working engineers who have committed to a methodology, built the infrastructure to support it, open-sourced their specifications, and published their techniques for others to learn from.
The question is not whether humans will always need to be in the loop. The question is what “in the loop” means when the loop itself has been redesigned from scratch.
The vibe coding debate, as it has been conducted for the past year, was always asking the wrong question. It was asking whether AI can replace programmers. The answer was always going to be complicated and unsatisfying. The better question, the one that StrongDM forces us to confront, is what happens when the best programmers in the world decide to stop programming and start building the factories instead.
That question has an answer now, and it is sitting in a GitHub repository that contains no code at all.
Like this? Get email updates or grab the RSS feed.
More from the blog:
-
Claude Opus 4.6 just shipped agent teams. But can you trust them?
Anthropic shipped Claude Opus 4.6 this week. The headline features are strong: a 1M token context window (a first for Opus models), 128K output tokens, adaptive thinking that adjusts reasoning depth to the task, and top-of-the-table benchmark scores across coding, finance, and l…
-
AI slop: psychology, history, and the problem of the ersatz
In 2025, the term “slop” emerged as the dominant descriptor for low-quality AI-generated output. It has quickly joined our shared lexicon, and Merriam-Webster’s human editors chose it as their Word of the Year. As a techno-optimist, I am at worst ambivalent about AI outputs, so…
-
The missiles are the destination
One of my uncommon enjoyments is the work that happens right in the middle of a big problem that needs to be solved, or even a nosedive. A calmness kicks in, the path gets clearer and I can usually tunnel vision my way through to course correction. I used to think this was spec…
-
Fall back
What creative studios and dev shops (and probably everyone else, too) need to do to stay relevant in the AI era without becoming commoditized slop. What’s covered: Your people are your moat · Easy to do, hard to be the best · Quality and simplicity · Never look to others · Don…
-
On getting paid faster
These five cashflow levers are arguably the quickest, easiest wins when optimizing your service business. Frequency of online payment deposits Update your online payment system to deposit into your bank account daily instead of weekly. Or whatever the quickest interval availab…