775 points by dheerajvs 235 days ago | 485 comments on HN
> developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs. with. The other is that it's tempting to measure AI-assisted work with metrics like how long until the PR was opened or merged. But the AI workflow fundamentally shifts engineering hours so that a greater percentage of time is spent on refactoring, testing, and resolving issues later in the process, including after the code was initially approved and merged. I can see how it's easy for a developer to report that AI completed a task quickly because the PR was opened quickly, discounting the amount of future work that the PR created.
It is 80/20 again - it gets you 80% of the way in 20% of the time and then you spend 80% of the time to get the rest of the 20% done. And since it always feels like it is almost there, sunk-cost fallacy comes into play as well and you just don't want to give up.
I think an approach that I tried recently is to use it as a friction remover instead of a solution provider. I do the programming but use it to remove pebbles such as that small bit of syntax I forgot, basically to keep up the velocity. However, I don't look at the wholesale code it offers. I think keeping the active thinking cap on results in code I actually understand while avoiding skill atrophy.
As an open source maintainer on the brink of tech debt bankruptcy, I feel like AI is a savior, helping me keep up with rapid changes to dependencies, build systems, release methodology, and idioms.
So they paid developers 300 x 246 = about $73K just to recruit developers for a study that is not published in any academic journal and has no peer review? The underlying paper looks quite polished and not overtly AI-generated, so I don't want to say it's entirely made up, but how were they even able to get funding for this?
This study focused on experienced OSS maintainers. Here is my personal experience, from a very different persona (close to the opposite of the one in the study). I always wanted to contribute to OSS but never had time to. I finally was able to do that, thanks to AI. Last month, I was able to contribute to 4 different repositories, which I would never have dreamed of doing. I was using an async coding agent I built[1] to generate PRs given a GitHub issue. Some PRs took a lot of back and forth, and some PRs were accepted as is. Without AI, there is no way I would have contributed to those repositories.
One thing that did work in my favor is that I was consistently creating a failing repro test case and including before-and-after results along with the PR. That helped get the PR landed.
There are also a few PRs that never got accepted because the repro is not as strong or clear.
My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.
So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
A quarter of the participants saw increased performance, 3/4 saw reduced performance.
One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:
> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
So far in my own hobby OSS projects, AI has only hampered things as code generation/scaffolding is probably the least of my concerns, whereas code review, community wrangling, etc. are more impactful. And AI tooling can only do so much.
But it has hampered me in that others, uninvited, toss an AI code review tool at some of my open PRs, and that spits out a 2-page document with cute emoji and formatted bullet points going over all aspects of a 30-line PR.
Just adds to the noise, so now I spend time deleting or hiding those comments in PRs, which means I have even _less_ time for actual useful maintenance work. (Not that I have much already.)
One thing I've experienced in trying to use LLMs to code in an existing large code base is that it's _extremely_ hard to accurately describe what you want to do. Oftentimes, you are working on a problem with a web of interactions all over the code and describing the problem to an LLM will take far longer than just doing it manually. This is not the case with generating new (boilerplate) code for projects, which is where users report the most favorable interaction with LLMs.
Wow, these are extremely interesting results, especially this part:
> This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
I wonder what could explain such a large difference between estimation/experience and reality. Any ideas?
Maybe our brains are measuring mental effort and distorting our experience of time?
For me, the measurable gain in productivity comes when I am working with a new language or new technology. If I were to use Claude Code to help implement a feature of a Python library I've worked on for years, then I don't think it would help much (maybe even hurt). However, if I use Claude Code on some Go code I have very little experience with, or use it to write/modify Helm charts, then I can definitely say it speeds me up.
But, taking a broader view, it's possible that these initial speed-ups are negated by the fact that I never really learn Go or Helm charts as deeply now that I use Claude Code. Over time, it's possible that my net productivity is still reduced. Hard to say for sure, especially considering I might not have even considered tackling these more difficult Go library modifications if I didn't have Claude Code to hold my hand.
Regardless, these tools are out there, increasing in effectiveness and I do feel like I need to jump on the train before it leaves me at the station.
IME AI coding is excellent for one-off scripts, personal automation tooling (I iterate on a tool to scrape receipts and submit expenses for my specific needs) and generally stuff that can be run in environments where the creator and the end user are effectively the same (and only) entity.
Scaled up slightly, we use it to build plenty of internal tooling in our video content production pipeline (syncing between encoding tools and a status dashboard for our non-technical content team).
Using it for anything more than boilerplate code, well-defined but tedious refactors, or quickly demonstrating how to use an unfamiliar API in production code, before a human takes a full pass at everything is something I'm going to be wary of for a long time.
I actually think that pasting questions into ChatGPT etc. and then getting general answers to put into your code is the way.
“One shotting” apps, or even cursor and so forth seem like a waste of time. It feels like if you prompt it just right it might help but then it never really does.
It used to be that all you needed to program was a computer and to RTFM, but now we need to pay for API "tokens" and pray that there are no rug pulls in the future.
So: slow until a learning curve is hit (or, as one user posited, "until you forget how to work without it").
But isn't the important thing to measure... how long does it take to debug the resulting code at 3AM when you get a PagerDuty alert?
Similarly... how about the quality of this code over time? It's taken a lot of effort to bring some of the code bases I work in into a more portable, less coupled, more concise state through the hard work of
- bringing shared business logic up into shared folders
- working to ensure call chains flow top down towards root then back up through exposed APIs from other modules as opposed to criss-crossing through the directory structure
- working to separate business logic from API logic from display logic
- working to provide encapsulation through the use of wrapper functions creating portability
- using techniques like dependency injection to decouple concepts allowing for easier testing
etc
So, do we end up with better code quality that ends up being more maintainable, extensible, portable, and composable? Or do we just end up with lots of poor quality code that eventually grows to become a tangled mess we spend 50% of our time fighting bugs on?
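As one concrete instance of the dependency-injection point above (all class and method names here are invented for illustration): depending on an interface rather than a concrete collaborator lets a test inject a fake, which is a big part of what makes code testable and portable. A minimal sketch:

```java
import java.util.List;

// The dependency is expressed as an interface, not a concrete service.
interface PriceSource {
    double priceOf(String sku);
}

class InvoiceCalculator {
    private final PriceSource prices;

    // The collaborator is injected by the caller instead of being
    // constructed internally, so tests control what it returns.
    InvoiceCalculator(PriceSource prices) {
        this.prices = prices;
    }

    double total(List<String> skus) {
        return skus.stream().mapToDouble(prices::priceOf).sum();
    }
}

public class DiDemo {
    public static void main(String[] args) {
        // In a test, a hard-coded fake stands in for the real price service.
        PriceSource fake = sku -> 2.50;
        InvoiceCalculator calc = new InvoiceCalculator(fake);
        System.out.println(calc.total(List.of("a", "b")));  // prints 5.0
    }
}
```

In production code the same constructor would receive a real implementation (a database- or API-backed `PriceSource`), with no change to `InvoiceCalculator` itself.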
I've been using LLMs almost every day for the past year. They're definitely helpful for small tasks, but in real, complex projects, reviewing and fixing their output can sometimes take more time than just writing the code myself.
We probably need a bit less wishful thinking. Blindly trusting what the AI suggests tends to backfire. The real challenge is figuring out where it actually helps, and where it quietly gets in the way.
AI used to refer to the extensive range of techniques of the field of Artificial Intelligence. Now it refers to LLMs and maybe other multi-layer networks trained on vast datasets. LLMs are great for some tasks and are also great as parts of hybrid systems like the IBM Watson Jeopardy! system. There's much more to Artificial Intelligence, e.g. https://en.m.wikipedia.org/wiki/Knowledge_representation_and... et al.
It's really hard to attribute productivity gains/losses to specific technologies or practices, I'm very wary of self-reported anecdotes in any direction precisely because it's so easy to fool ourselves.
I'm not making any claim in either direction, the authors themselves recognize the study's limitations, I'm just trying to say that everyone should have far greater error bars. This technology is the weirdest shit I've seen in my lifetime, making deductions about productivity from anecdotes and dubious benchmarks is basically reading tea leaves.
> and then you spend 80% of the time to get the rest of the 20% done
This was my pre-AI experience anyway, so getting that first chunk of time back is helpful.
Related: one of the better takes I've seen on AI from an experienced developer was, "90% of my skills just became worthless, and the other 10% just became 1,000 times more valuable." There's some hyperbole there, but I like the gist.
I think it’s most useful when you basically need Stack Overflow on steroids: I basically know what I want to do but I’m not sure how to achieve it using this environment. It can also be helpful for debugging and rubber ducking generally.
Well, we used to have a sort of inverse Pareto, where 80% of the work took 80% of the effort and the remaining 20% of the work also took 80% of the effort.
I do think you're onto something with getting pebbles out of the road, inasmuch as once I know what I need to do, AI coding makes the doing much faster. Just yesterday I was playing around with removing things from a List object using the Java streams API and I kept running into ConcurrentModificationException, which is thrown when a list is structurally modified while it's being iterated (the fail-fast iterator can't guarantee it's seeing the list unaltered by other code or other threads). I spent about an hour trying to write a method that deep copies the list, makes the change, and then returns the copy, running into all sorts of problems, until I asked AI to build me a thread-safe list mutation method and it was like "Sure, this is how I'd do it, but also the API you're working with already has a method that just....does this." Cases like this are where AI is supremely useful: intricate but well-defined problems.
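For anyone who has hit the same wall: the exception in question is Java's ConcurrentModificationException, and the built-in method the commenter is most likely alluding to is Collection.removeIf (the comment doesn't name it, so that's my guess). It mutates the list in place without tripping the fail-fast iterator check, no deep copy needed:

```java
import java.util.ArrayList;
import java.util.List;

public class SafeRemoval {
    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6));

        // Removing inside a for-each loop throws ConcurrentModificationException:
        // for (Integer n : numbers) if (n % 2 == 0) numbers.remove(n);  // boom

        // Collection.removeIf performs the removal safely in a single pass:
        numbers.removeIf(n -> n % 2 == 0);
        System.out.println(numbers);  // prints [1, 3, 5]
    }
}
```

For lists that are genuinely shared across threads, java.util.concurrent.CopyOnWriteArrayList is the usual answer instead, since its iterators work on a snapshot.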
Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate
Companies produce whitepapers all the time, right? They are typically some combination of technical report, policy suggestion, and advertisement for the organization.
Hey Simon -- thanks for the detailed read of the paper - I'm a big fan of your OS projects!
Noting a few important points here:
1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found the slowdown, the only experience-related concern most external reviewers raised was about prompting, as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but because the without-AI baseline got worse). In other words, we're sorta between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!
(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)
I notice that some people have become more productive thanks to AI tools, while others are not.
My working hypothesis is that people who are fast at scanning lots of text (or code for that matter) have a serious advantage. Being able to dismiss unhelpful suggestions quickly and then iterating to get to helpful assistance is key.
Being fast at scanning code correlates with seniority, but there are also senior developers who can write at a solid pace, but prefer to take their time to read and understand code thoroughly. I wouldn't assume that this kind of developer gains little profit from typical AI coding assistance. There are also juniors who can quickly read text, and possibly these have an advantage.
A similar effect has been around with being able to quickly "Google" something. I wouldn't be surprised if this is the same trait at work.
Noting that most of our power comes from the number of tasks that developers complete; it's 246 total completed issues in the course of this study -- developers do about 15 issues (7.5 with AI and 7.5 without AI) on average.
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
I totally agree with this. Although also, you can end up in a bad spot even after you've gotten pretty good at getting the AI tools to give you good output, because you fail to learn the code you're producing well.
A developer gets better at the code they're working on over time. An LLM gets worse.
You can use an LLM to write a lot of code fast, but if you don't pay enough attention, you aren't getting any better at the code while the LLM is getting worse. This is why you can get like two months of greenfield work done in a weekend but then hit a brick wall - you didn't learn anything about the code that was written, and while the LLM started out producing reasonable code, it got worse until you have a ball of mud that neither the LLM nor you can effectively work on.
So a really difficult skill, in my mind, is continually avoiding the temptation to vibe. Take a whole week to do a month's worth of features, not a weekend to do two months' worth, and put in the effort to guide the LLM to keep producing clean code, and to be sure you know the code. You do want to know the code, and you can't do that without putting in work yourself.
Figure 21 shows that initial implementation time (which I take to be time to PR) increased as well, although post-review time increased even more (but doesn't seem to have a significant impact on the total).
But Figure 18 shows that time spent actively coding decreased (which might be where the feeling of a speed-up was coming from) and the gains were eaten up by time spent prompting, waiting for and then reviewing the AI output and generally being idle.
So maybe it's not a good idea to use LLMs for tasks that you could've done yourself in under 5 minutes.
Agreed and +1 on "always feels like it is almost there" leading to time sink. AI is especially good at making you feel like it's doing something useful; it takes a lot of skill to discern the truth.
Hey I just wanted to say this is one of the better studies I've seen - not clickbaity, very forthright about what is being claimed, and presented in such an easy-to-digest format. Thanks so much for doing this.
Qualitatively, we don't see a drop in PR quality between the AI-allowed and AI-disallowed conditions in the study; the devs who participate are generally excellent, know their repositories' standards super well, and aren't really into the 'get up a bad PR' vibe -- the median review time on the PRs in the study is about a minute.
Developers do spend time totally differently, though -- this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not. In general, when these devs have AI, they spend a smaller % of time writing code and a larger % of time working with AI (which... makes sense).
Yeah, I'll note that this study does _not_ capture the entire OS dev workflow -- you're totally right that reviewing PRs is a big portion of the time that many maintainers spend on their projects (and thanks to them for doing this [often hard] work). In the paper [1], we explore this factor in more detail -- see section (C.2.2) - Unrepresentative task distribution.
There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros to it too!
Well, there are two possible interpretations here of 75% of participants (all of whom had some experience using LLMs) being slower using generative AI:
1. LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).
2. Current LLMs just are not as good as they are sold to be as a programming assistant, and people consistently predict and self-report in the wrong direction on how useful they are.
Honestly, this is a fair point -- and it speaks to the difficulty of figuring out the right baseline to measure against here!
If we studied folks with _no_ AI experience, then we might underestimate speedup, as these folks are learning tools (see a discussion of learning effects in section (C.2.7) - Below-average use of AI tools - in the paper). If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.
In some sense, these are just two separate and interesting questions - I'm excited for future work to really dig in on both!
Was any attention paid to whether the tickets being implemented with AI assistance were an appropriate use case for AI?
If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.
If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.
That's my experience as well. It's where Knuth comes in again: the program doesn't just live in the code, but also in the mind of its creator. Unless I communicate all that context from the start, I can't just dump years of concepts and strategy out of my brain into the LLM without missing details that would be relevant.
> I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs with.
The standard experimental design that solves this is random assignment to a treatment condition (with AI) and a control condition (without AI), which is what they did, randomizing at the level of issues rather than participants. This isolates the variable (with or without AI) while averaging out uncontrollable individual, context, and environmental differences. You don't need to know how the same individual in the same context would have behaved in the other condition. With a large enough sample size and effect size, you can establish statistical significance, i.e. that the with-or-without-AI variable is the only systematic difference.
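The per-developer randomization described above can be sketched in a few lines; the issue names, seed, and the balanced (alternating-after-shuffle) variant here are invented for illustration, not taken from the study's actual protocol:

```java
import java.util.*;

public class IssueAssignment {
    // Randomly order a developer's issues, then alternate the treatment so
    // each developer works a balanced mix of AI-allowed and AI-disallowed
    // tasks. Differences in outcome can then be attributed to the treatment
    // rather than to the developer or the repository.
    static Map<String, Boolean> assign(List<String> issues, long seed) {
        List<String> shuffled = new ArrayList<>(issues);
        Collections.shuffle(shuffled, new Random(seed));
        Map<String, Boolean> aiAllowed = new LinkedHashMap<>();
        for (int i = 0; i < shuffled.size(); i++) {
            aiAllowed.put(shuffled.get(i), i % 2 == 0);  // true = AI allowed
        }
        return aiAllowed;
    }

    public static void main(String[] args) {
        Map<String, Boolean> plan = assign(List.of(
                "fix-null-deref", "add-retry-logic", "update-docs",
                "refactor-parser", "bump-deps", "cache-invalidation"), 42);
        plan.forEach((issue, ai) ->
                System.out.println(issue + (ai ? ": AI allowed" : ": AI disallowed")));
    }
}
```

Because every developer serves as their own control, between-person differences in skill and familiarity cancel out of the with-vs-without comparison.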
I would speculate that it's because there's been a huge concerted effort to make people want to believe that these tools are better than they are.
The "economic experts" and "ml experts" are in many cases effectively the same group-- companies pushing AI coding tools have a vested interest in people believing they're more useful than they are. Executives take this at face value and broadly promise major wins. Economic experts take this at face value and use this for their forecasts.
This propagates further, and now novices and casual individuals begin to believe in the hype. Eventually, as an experienced engineer it moves the "baseline" expectation much higher.
Unfortunately this is very difficult to capture empirically.
If you stewarded that much tech debt in the first place, how can you be sure an LLM will help prevent it going forward? In my experience, LLMs add more tech debt due to the lack of cohesion in their edits.