+0.11 Measuring the impact of AI on experienced open-source developer productivity (metr.org, S: +0.13)
775 points by dheerajvs 235 days ago | 485 comments on HN | Mild Positive · Contested · Editorial · v3.7 · 2026-02-28 12:35:05
Summary: AI Safety & Scientific Integrity Acknowledged
This research article presents a randomized controlled trial measuring how AI tools affect experienced open-source developer productivity, reporting a surprising 19% slowdown contrary to developer expectations. While not primarily focused on human rights, the study methodology demonstrates respect for worker autonomy and fair compensation, and its emphasis on measuring AI safety risks and maintaining scientific integrity reflects implicit acknowledgment of duties to protect human interests and advance public knowledge.
Article Heatmap
Preamble: +0.05 — Preamble
Article 1: +0.04 — Freedom, Equality, Brotherhood
Article 2: ND — Non-Discrimination
Article 3: +0.16 — Life, Liberty, Security
Article 4: ND — No Slavery
Article 5: ND — No Torture
Article 6: ND — Legal Personhood
Article 7: ND — Equality Before Law
Article 8: ND — Right to Remedy
Article 9: ND — No Arbitrary Detention
Article 10: ND — Fair Hearing
Article 11: ND — Presumption of Innocence
Article 12: ND — Privacy
Article 13: ND — Freedom of Movement
Article 14: ND — Asylum
Article 15: ND — Nationality
Article 16: ND — Marriage & Family
Article 17: ND — Property
Article 18: +0.09 — Freedom of Thought
Article 19: +0.18 — Freedom of Expression
Article 20: ND — Assembly & Association
Article 21: ND — Political Participation
Article 22: ND — Social Security
Article 23: +0.13 — Work & Equal Pay
Article 24: ND — Rest & Leisure
Article 25: ND — Standard of Living
Article 26: ND — Education
Article 27: +0.18 — Cultural Participation
Article 28: +0.16 — Social & International Order
Article 29: +0.13 — Duties to Community
Article 30: +0.09 — No Destruction of Rights
Negative Neutral Positive No Data
Aggregates
Editorial Mean +0.11 Structural Mean +0.13
Weighted Mean +0.13 Unweighted Mean +0.12
Max +0.18 Article 19 Min +0.04 Article 1
Signal 10 No Data 21
Volatility 0.05 (Low)
Negative 0 · Channels: E 0.6 / S 0.4
SETL 0.00 Balanced
FW Ratio 50% 18 facts · 18 inferences
Evidence 27% coverage
6 High · 4 Medium · 21 No Data
Theme Radar
Foundation: 0.05 (2 articles) · Security: 0.16 (1 article) · Legal: 0.00 (0 articles) · Privacy & Movement: 0.00 (0 articles) · Personal: 0.09 (1 article) · Expression: 0.18 (1 article) · Economic & Social: 0.13 (1 article) · Cultural: 0.18 (1 article) · Order & Duties: 0.13 (3 articles)
HN Discussion 20 top-level · 30 replies
Jabrov 2025-07-10 16:45 UTC link
Very interesting methodology, but the sample size (16) is way too low. Would love to see this repeated with more participants.
kokanee 2025-07-10 16:53 UTC link
> developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs with. The other is that it's tempting to time an AI with metrics like how long until the PR was opened or merged. But the AI workflow fundamentally shifts engineering hours so that a greater percentage of time is spent on refactoring, testing, and resolving issues later in the process, including after the code was initially approved and merged. I can see how it's easy for a developer to report that AI completed a task quickly because the PR was opened quickly, discounting the amount of future work that the PR created.

noisy_boy 2025-07-10 17:04 UTC link
It is 80/20 again - it gets you 80% of the way in 20% of the time and then you spend 80% of the time to get the rest of the 20% done. And since it always feels like it is almost there, sunk-cost fallacy comes into play as well and you just don't want to give up.

An approach I tried recently is to use it as a friction remover instead of a solution provider. I do the programming but use it to remove pebbles such as that small bit of syntax I forgot, basically to keep up the velocity. However, I don't look at the wholesale code it offers. I think keeping the active thinking cap on results in code I actually understand while avoiding skill atrophy.

fritzo 2025-07-10 17:05 UTC link
As an open source maintainer on the brink of tech debt bankruptcy, I feel like AI is a savior, helping me keep up with rapid changes to dependencies, build systems, release methodology, and idioms.
tcdent 2025-07-10 17:15 UTC link
This study neglects to incorporate the fact that I have forgotten how to write code.
NewsaHackO 2025-07-10 17:15 UTC link
So they paid developers 300 x 246 = about $73K just on developer recruitment for a study that isn't in any academic journal and has no peer review? The underlying paper looks quite polished and not overtly AI-generated, so I don't want to say it's entirely made up, but how were they even able to get funding for this?
30minAdayHN 2025-07-10 17:24 UTC link
This study focused on experienced OSS maintainers. Here's my personal experience, though from a very different persona (the opposite of the one in the study). I always wanted to contribute to OSS but never had time to. Finally I was able to, thanks to AI. Last month, I contributed to 4 different repositories, which I would never have dreamed of doing. I was using an async coding agent I built[1] to generate PRs given a GitHub issue. Some PRs took a lot of back and forth. And some PRs were accepted as is. Without AI, there is no way I would have contributed to those repositories.

One thing that did work in my favor is that I was clearly creating a failing repro test case and including before-and-after results along with the PR. That helped get the PR landed.

There are also a few PRs that never got accepted because the repro wasn't as strong or clear.

[1] https://workback.ai

MYEUHD 2025-07-10 17:26 UTC link
This does not mention the open-source developer time wasted reviewing vibe-coded PRs.
narush 2025-07-10 17:28 UTC link
Hey HN, study author here. I'm a long-time HN user -- and I'll be in the comments today to answer questions/comments when possible!

If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.

[1] https://x.com/METR_Evals/status/1943360399220388093

simonw 2025-07-10 17:36 UTC link
Here's the full paper, which has a lot of details missing from the summary linked above: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.

They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.

So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.

A quarter of the participants saw increased performance, 3/4 saw reduced performance.

One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:

> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.

My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.

geerlingguy 2025-07-10 17:51 UTC link
So far in my own hobby OSS projects, AI has only hampered things as code generation/scaffolding is probably the least of my concerns, whereas code review, community wrangling, etc. are more impactful. And AI tooling can only do so much.

But it's hampered me in the fact that others, uninvited, toss an AI code review tool at some of my open PRs, and that spits out a 2-page document with cute emoji and formatted bullet points going over all aspects of a 30 line PR.

Just adds to the noise, so now I spend time deleting or hiding those comments in PRs, which means I have even _less_ time for actual useful maintenance work. (Not that I have much already.)

groos 2025-07-10 18:36 UTC link
One thing I've experienced in trying to use LLMs to code in an existing large code base is that it's _extremely_ hard to accurately describe what you want to do. Oftentimes, you are working on a problem with a web of interactions all over the code and describing the problem to an LLM will take far longer than just doing it manually. This is not the case with generating new (boilerplate) code for projects, which is where users report the most favorable interaction with LLMs.
pera 2025-07-10 18:54 UTC link
Wow, these are extremely interesting results, especially this part:

> This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

I wonder what could explain such a large difference between estimation/experience and reality -- any ideas?

Maybe our brains are measuring mental effort and distorting our experience of time?

doctoboggan 2025-07-10 19:11 UTC link
For me, the measurable gain in productivity comes when I am working with a new language or new technology. If I were to use claude code to help implement a feature of a python library I've worked on for years, then I don't think it would help much (maybe even hurt). However, if I use claude code on some go code I have very little experience with, or use it to write/modify helm charts, then I can definitely say it speeds me up.

But, taking a broader view, it's possible that these initial speed-ups are negated by the fact that I never really learn go or helm charts as deeply now that I use claude code. Over time, it's possible that my net productivity is still reduced. Hard to say for sure, especially considering I might not have even considered tackling these more difficult go library modifications if I didn't have claude code to hold my hand.

Regardless, these tools are out there, increasing in effectiveness and I do feel like I need to jump on the train before it leaves me at the station.

keerthiko 2025-07-10 19:32 UTC link
IME AI coding is excellent for one-off scripts, personal automation tooling (I iterate on a tool to scrape receipts and submit expenses for my specific needs) and generally stuff that can be run in environments where the creator and the end user are effectively the same (and only) entity.

Scaled up slightly, we use it to build plenty of internal tooling in our video content production pipeline (syncing between encoding tools and a status dashboard for our non-technical content team).

Using it for anything more than boilerplate code, well-defined but tedious refactors, or quickly demonstrating how to use an unfamiliar API in production code, before a human takes a full pass at everything is something I'm going to be wary of for a long time.

thepasswordis 2025-07-10 20:22 UTC link
I actually think that pasting questions into chatGPT etc. and then getting general answers to put into your code is the way.

“One shotting” apps, or even cursor and so forth seem like a waste of time. It feels like if you prompt it just right it might help but then it never really does.

bit1993 2025-07-10 21:37 UTC link
It used to be that all you required to program was a computer and to RTFM, but now we need to pay for API "tokens" and pray that there are no rug pulls in the future.
codyb 2025-07-10 22:12 UTC link
So slow until a learning curve is hit (or, as one user posited, "until you forget how to work without it").

But isn't the important thing to measure... how long does it take to debug the resulting code at 3AM when you get a PagerDuty alert?

Similarly... how about the quality of this code over time? It's taken a lot of effort to bring some of the code bases I work in into a more portable, less coupled, more concise state through the hard work of

- bringing shared business logic up into shared folders

- working to ensure call chains flow top down towards root then back up through exposed APIs from other modules as opposed to criss-crossing through the directory structure

- working to separate business logic from API logic from display logic

- working to provide encapsulation through the use of wrapper functions creating portability

- using techniques like dependency injection to decouple concepts allowing for easier testing

etc

So, do we end up with better code quality that ends up being more maintainable, extensible, portable, and composable? Or do we just end up with lots of poor quality code that eventually grows to become a tangled mess we spend 50% of our time fighting bugs on?
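As one illustration of the dependency-injection point in the list above, a minimal Python sketch (names and data are hypothetical, not from any codebase discussed here):

```python
from typing import Callable

# Dependency injection in miniature: the report logic takes its data
# source as a parameter instead of reaching for a global or a network
# client, so tests can pass a stub with no real I/O.
def build_report(fetch_orders: Callable[[], list[dict]]) -> str:
    orders = fetch_orders()
    total = sum(o["amount"] for o in orders)
    return f"{len(orders)} orders, total {total}"

# Production code would inject a real client; a test injects a stub:
def fake_fetch():
    return [{"amount": 10}, {"amount": 5}]

assert build_report(fake_fetch) == "2 orders, total 15"
```

The decoupling is what makes the testing question above answerable at all: code structured this way can be exercised without standing up its dependencies, AI-generated or not.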

Amaury-El 2025-07-11 02:06 UTC link
I've been using LLMs almost every day for the past year. They're definitely helpful for small tasks, but in real, complex projects, reviewing and fixing their output can sometimes take more time than just writing the code myself.

We probably need a bit less wishful thinking. Blindly trusting what the AI suggests tends to backfire. The real challenge is figuring out where it actually helps, and where it quietly gets in the way.

GregDavidson 2025-07-17 20:47 UTC link
AI used to refer to the extensive range of techniques of the field of Artificial Intelligence. Now it refers to LLMs and maybe other multi-layer networks trained on vast datasets. LLMs are great for some tasks and are also great as parts of hybrid systems like the IBM Watson Jeopardy system. There's much more to Artificial Intelligence, e.g. https://en.m.wikipedia.org/wiki/Knowledge_representation_and... et al.
IshKebab 2025-07-10 17:03 UTC link
They paid the developers about $75k in total to do this so I wouldn't hold your breath!
qsort 2025-07-10 17:08 UTC link
It's really hard to attribute productivity gains/losses to specific technologies or practices, I'm very wary of self-reported anecdotes in any direction precisely because it's so easy to fool ourselves.

I'm not making any claim in either direction, the authors themselves recognize the study's limitations, I'm just trying to say that everyone should have far greater error bars. This technology is the weirdest shit I've seen in my lifetime, making deductions about productivity from anecdotes and dubious benchmarks is basically reading tea leaves.

aerhardt 2025-07-10 17:13 UTC link
But what about producing actual code?
wmeredith 2025-07-10 17:15 UTC link
> and then you spend 80% of the time to get the rest of the 20% done

This was my pre-AI experience anyway, so getting that first chunk of time back is helpful.

Related: One of the better takes I've seen on AI from an experienced developer was, "90% of my skills just became worthless, and the other 10% just became 1,000 times more valuable." There's some hyperbole there, but I like the gist.

emodendroket 2025-07-10 17:17 UTC link
I think it’s most useful when you basically need Stack Overflow on steroids: I basically know what I want to do but I’m not sure how to achieve it using this environment. It can also be helpful for debugging and rubber ducking generally.
iLoveOncall 2025-07-10 17:30 UTC link
https://metr.org/about Seems like they get paid by AI companies, and they also get government funding.
reverendsteveii 2025-07-10 17:31 UTC link
well we used to have a sort of inverse pareto where 80% of the work took 80% of the effort and the remaining 20% of the work also took 80% of the effort.

I do think you're onto something with getting pebbles out of the road inasmuch as once I know what I need to do, AI coding makes the doing much faster. Just yesterday I was playing around with removing things from a List object using the Java streams API and I kept running into ConcurrentModificationExceptions, which happen when multiple threads are mutating the list object at the same time because no thread can guarantee it has the latest copy of the list unaltered by other threads. I spent about an hour trying to write a method that deep copies the list, makes the change, and then returns the copy, running into all sorts of problems, til I asked AI to build me a thread-safe list mutation method and it was like "Sure, this is how I'd do it, but also the API you're working with already has a method that just... does this." Cases like this are where AI is supremely useful: intricate but well-defined problems.
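The mutate-while-iterating pitfall described above has a close Python analogue, except that Python fails silently where Java throws. A minimal sketch (illustrative only, not tied to the commenter's actual code):

```python
# Pitfall: removing items from a list while iterating over it
# silently skips elements, because the iterator's index drifts
# as items are deleted out from under it.
nums = [1, 2, 2, 3]
for n in nums:
    if n == 2:
        nums.remove(n)
print(nums)  # a stray 2 survives: [1, 2, 3]

# Fix: build a new list instead of mutating the one being traversed,
# the same idea as Java's stream().filter().collect() pipeline or
# Collection.removeIf, which handle removal safely.
nums = [1, 2, 2, 3]
filtered = [n for n in nums if n != 2]
print(filtered)  # [1, 3]
```

The fix is the same design lesson the AI surfaced: prefer the API's built-in, well-defined removal path over hand-rolled mutation during traversal.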

narush 2025-07-10 17:36 UTC link
Our largest funding was through The Audacious Project -- you can see an announcement here: https://metr.org/blog/2024-10-09-new-support-through-the-aud...

Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate

bee_rider 2025-07-10 17:40 UTC link
Companies produce whitepapers all the time, right? They are typically some combination of technical report, policy suggestion, and advertisement for the organization.
fabianhjr 2025-07-10 17:41 UTC link
Most of the world provides funding for research, the US used to provide funding but now that has been mostly gutted.
resource_waste 2025-07-10 17:50 UTC link
>which is not in any academic journal, or has no peer reviews?

As a philosopher who is into epistemology and ontology, I find this attitude as abhorrent as religion.

With 'science', it doesn't matter who publishes it. Science needs to be replicated.

The psychology replication crisis is a prime example of why peer review and journal publication matter zero.

narush 2025-07-10 17:52 UTC link
Hey Simon -- thanks for the detailed read of the paper - I'm a big fan of your OS projects!

Noting a few important points here:

1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.

2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, prompting was the only experience-related concern most external reviewers raised -- as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.

3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (but not because AI was better, but just because with AI is much worse). In other words, we're sorta in between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!

4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.

5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.

In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).

I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!

(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)

resource_waste 2025-07-10 17:53 UTC link
I'm curious what space people are working in where AI does their job entirely.

I can use it for parts of code, algorithms, error solving, and maybe sometimes a 'first draft'.

But there is no way I could finish an entire piece of software with AI only.

smokel 2025-07-10 17:58 UTC link
I notice that some people have become more productive thanks to AI tools, while others are not.

My working hypothesis is that people who are fast at scanning lots of text (or code for that matter) have a serious advantage. Being able to dismiss unhelpful suggestions quickly and then iterating to get to helpful assistance is key.

Being fast at scanning code correlates with seniority, but there are also senior developers who can write at a solid pace, but prefer to take their time to read and understand code thoroughly. I wouldn't assume that this kind of developer gains little profit from typical AI coding assistance. There are also juniors who can quickly read text, and possibly these have an advantage.

A similar effect has been around with being able to quickly "Google" something. I wouldn't be surprised if this is the same trait at work.

narush 2025-07-10 17:59 UTC link
Noting that most of our power comes from the number of tasks that developers complete; it's 246 total completed issues in the course of this study -- developers do about 15 issues (7.5 with AI and 7.5 without AI) on average.
jsnider3 2025-07-10 18:01 UTC link
It's good to know that Claude 3.7 isn't enough to build Skynet!
furyofantares 2025-07-10 18:02 UTC link
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

I totally agree with this. Although also, you can end up in a bad spot even after you've gotten pretty good at getting the AI tools to give you good output, because you fail to learn the code you're producing well.

A developer gets better at the code they're working on over time. An LLM gets worse.

You can use an LLM to write a lot of code fast, but if you don't pay enough attention, you aren't getting any better at the code while the LLM is getting worse. This is why you can get like two months of greenfield work done in a weekend but then hit a brick wall - you didn't learn anything about the code that was written, and while the LLM started out producing reasonable code, it got worse until you have a ball of mud that neither the LLM nor you can effectively work on.

So a really difficult skill in my mind is continually avoiding temptation to vibe. Take a whole week to do a month's worth of features, not a weekend to do two month's worth, and put in the effort to guide the LLM to keep producing clean code, and to be sure you know the code. You do want to know the code and you can't do that without putting in work yourself.

yorwba 2025-07-10 18:02 UTC link
Figure 21 shows that initial implementation time (which I take to be time to PR) increased as well, although post-review time increased even more (but doesn't seem to have a significant impact on the total).

But Figure 18 shows that time spent actively coding decreased (which might be where the feeling of a speed-up was coming from) and the gains were eaten up by time spent prompting, waiting for and then reviewing the AI output and generally being idle.

So maybe it's not a good idea to use LLMs for tasks that you could've done yourself in under 5 minutes.

causal 2025-07-10 18:08 UTC link
Agreed and +1 on "always feels like it is almost there" leading to time sink. AI is especially good at making you feel like it's doing something useful; it takes a lot of skill to discern the truth.
causal 2025-07-10 18:09 UTC link
Hey I just wanted to say this is one of the better studies I've seen - not clickbaity, very forthright about what is being claimed, and presented in such an easy-to-digest format. Thanks so much for doing this.
narush 2025-07-10 18:10 UTC link
Qualitatively, we don't see a drop in PR quality between AI-allowed and AI-disallowed conditions in the study; the devs who participate are generally excellent, know their repositories' standards super well, and aren't really into the 'put up a bad PR' vibe -- the median review time on the PRs in the study is about a minute.

Developers totally spend time differently, though -- this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not. In general, when these devs have AI, they spend a smaller % of time writing code, and a larger % of time working with AI (which... makes sense).

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

narush 2025-07-10 18:16 UTC link
Yeah, I'll note that this study does _not_ capture the entire OS dev workflow -- you're totally right that reviewing PRs is a big portion of the time that many maintainers spend on their projects (and thanks to them for doing this [often hard] work). In the paper [1], we explore this factor in more detail -- see section (C.2.2) - Unrepresentative task distribution.

There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros to it too!

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

grey-area 2025-07-10 18:25 UTC link
Well, there are two possible interpretations here of 75% of participants (all of whom had some experience using LLMs) being slower using generative AI:

LLMs have a v. steep and long learning curve as you posit (though note the points from the paper authors in the other reply).

Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.

narush 2025-07-10 18:39 UTC link
Honestly, this is a fair point -- and speaks to the difficulty of figuring out the right baseline to measure against here!

If we studied folks with _no_ AI experience, then we might underestimate speedup, as these folks are learning tools (see a discussion of learning effects in section (C.2.7) - Below-average use of AI tools - in the paper). If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.

In some sense, these are just two separate and interesting questions - I'm excited for future work to really dig in on both!

antonvs 2025-07-10 18:46 UTC link
Was any attention paid to whether the tickets being implemented with AI assistance were an appropriate use case for AI?

If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.

If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.

9dev 2025-07-10 19:06 UTC link
That's my experience as well. It's where Knuth comes in again: the program doesn't just live in the code, but also in the mind of its creator. Unless I communicate all that context from the start, I can't just dump years of concepts and strategy out of my brain into the LLM without missing details that would be relevant.
gitremote 2025-07-10 19:18 UTC link
> I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs with.

The standard experimental design that solves this is to randomly assign participants to the experiment group (with AI) and the control group (without AI), which is what they did. This isolates the variable (with or without AI), taking into account uncontrollable individual, context, and environmental differences. You don't need to know how the single individual and context would have behaved in the other group. With a large enough sample size and effect size, you can determine statistical significance, and that the with-or-without-AI variable was the only difference.

isoprophlex 2025-07-10 19:45 UTC link
I'll just say that the methodology of the paper and the professionalism with which you are answering us here is top notch. Great work.
alfalfasprout 2025-07-10 20:00 UTC link
I would speculate that it's because there's been a huge concerted effort to make people want to believe that these tools are better than they are.

The "economic experts" and "ml experts" are in many cases effectively the same group-- companies pushing AI coding tools have a vested interest in people believing they're more useful than they are. Executives take this at face value and broadly promise major wins. Economic experts take this at face value and use this for their forecasts.

This propagates further, and now novices and casual individuals begin to believe in the hype. Eventually, as an experienced engineer it moves the "baseline" expectation much higher.

Unfortunately this is very difficult to capture empirically.

candiddevmike 2025-07-10 20:03 UTC link
If you stewarded that much tech debt in the first place, how can you be sure an LLM will help prevent it going forward? In my experience, LLMs add more tech debt because their edits lack cohesion.
Editorial Channel
What the content says
+0.20
Article 3 Life, Liberty, Security
High Advocacy Framing
Editorial
+0.20
SETL
+0.14

The article explicitly frames AI acceleration as a potential threat to 'safeguards' and 'oversight,' linking to concerns about 'catastrophic risks' and security implications.

+0.20
Article 19 Freedom of Expression
High Advocacy
Editorial
+0.20
SETL
+0.10

The article is published as open research with full transparency about methodology, limitations, and data handling, exemplifying free expression of scientific findings.

+0.20
Article 27 Cultural Participation
High Advocacy
Editorial
+0.20
SETL
+0.10

The article exemplifies participation in scientific culture through open methodology, citations, and reproducible research practices.

+0.20
Article 28 Social & International Order
High Advocacy Framing
Editorial
+0.20
SETL
+0.14

The article explicitly frames the research as contributing to maintaining social order and oversight against risks of uncontrolled AI acceleration affecting international governance.

+0.15
Article 29 Duties to Community
High Advocacy
Editorial
+0.15
SETL
+0.09

The article explicitly emphasizes duties to the scientific community and society through commitment to scientific integrity and responsible research practices.

+0.05
Preamble Preamble
Medium Advocacy
Editorial
+0.05
SETL
0.00

The article implicitly acknowledges human dignity and freedom by prioritizing scientific integrity and safety concerns, though not explicitly.

+0.05
Article 18 Freedom of Thought
Medium Practice
Editorial
+0.05
SETL
-0.12

The article does not explicitly discuss freedom of thought, but the experimental design respects developer autonomy in choosing tools and tasks.

+0.05
Article 23 Work & Equal Pay
High Practice
Editorial
+0.05
SETL
-0.22

The article does not advocate for labor rights explicitly, but the methodology respects fair work conditions and developer choice.

+0.05
Article 30 No Destruction of Rights
Medium Practice
Editorial
+0.05
SETL
-0.12

The article does not explicitly discuss exploitation, but the methodology protects developers from exploitative conditions.

0.00
Article 1 Freedom, Equality, Brotherhood
Medium Practice
Editorial
0.00
SETL
-0.10

The article does not discuss equality and human dignity as core principles.

ND
Article 2 Non-Discrimination

No evidence of discrimination or non-discrimination policies discussed.

ND
Article 4 No Slavery

Not addressed.

ND
Article 5 No Torture

Not addressed.

ND
Article 6 Legal Personhood

Not addressed.

ND
Article 7 Equality Before Law

Not addressed.

ND
Article 8 Right to Remedy

Not addressed.

ND
Article 9 No Arbitrary Detention

Not addressed.

ND
Article 10 Fair Hearing

Not addressed.

ND
Article 11 Presumption of Innocence

Not addressed.

ND
Article 12 Privacy

Not addressed.

ND
Article 13 Freedom of Movement

Not addressed.

ND
Article 14 Asylum

Not addressed.

ND
Article 15 Nationality

Not addressed.

ND
Article 16 Marriage & Family

Not addressed.

ND
Article 17 Property

Not addressed.

ND
Article 20 Assembly & Association

Not addressed.

ND
Article 21 Political Participation

Not addressed.

ND
Article 22 Social Security

Not addressed.

ND
Article 24 Rest & Leisure

Not addressed.

ND
Article 25 Standard of Living

Not addressed.

ND
Article 26 Education

Not addressed.

Structural Channel
What the site does
+0.25
Article 23 Work & Equal Pay
High Practice
Structural
+0.25
Context Modifier
ND
SETL
-0.22

The study implements key provisions of Article 23: developers work on real projects they care about, receive fair compensation ($150/hour for experienced professionals), and retain autonomy over task selection and tool use.

+0.15
Article 18 Freedom of Thought
Medium Practice
Structural
+0.15
Context Modifier
ND
SETL
-0.12

The RCT structure respects developer agency: participants choose which issues to work on and retain autonomy over their thinking and approach within treatment conditions.
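The treatment-condition randomization described above can be sketched as follows. This is an illustrative reconstruction, not the study's actual code; the issue identifiers, assignment probability, and seed are all assumptions:

```python
import random


def assign_conditions(issue_ids, p_ai=0.5, seed=0):
    """Randomly assign each issue to an AI-allowed or AI-disallowed condition.

    Developers still choose which issues to work on and how to approach them;
    only the per-issue treatment condition is randomized (illustrative).
    """
    rng = random.Random(seed)
    return {
        issue: ("ai-allowed" if rng.random() < p_ai else "ai-disallowed")
        for issue in issue_ids
    }


conditions = assign_conditions(["repo1#101", "repo1#102", "repo2#7"])
```

With a fixed seed the assignment is reproducible, which is what lets an RCT's analysis be audited after the fact.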

+0.15
Article 19 Freedom of Expression
High Advocacy
Structural
+0.15
Context Modifier
ND
SETL
+0.10

METR publishes research openly on its public-facing blog with no paywalls or access restrictions, enabling free dissemination of knowledge.

+0.15
Article 27 Cultural Participation
High Advocacy
Structural
+0.15
Context Modifier
ND
SETL
+0.10

METR publishes research open-access, enabling public participation in scientific knowledge production and cultural advancement through shared understanding of AI capabilities.

+0.15
Article 30 No Destruction of Rights
Medium Practice
Structural
+0.15
Context Modifier
ND
SETL
-0.12

The study's structure protects against exploitation: fair compensation, voluntary participation, researcher oversight of working conditions (screen recording), and developer autonomy over effort and tools.

+0.10
Article 1 Freedom, Equality, Brotherhood
Medium Practice
Structural
+0.10
Context Modifier
ND
SETL
-0.10

All 16 study participants receive identical compensation ($150/hour) regardless of background, demonstrating equal treatment in practice.

+0.10
Article 3 Life, Liberty, Security
High Advocacy Framing
Structural
+0.10
Context Modifier
ND
SETL
+0.14

METR's entire research infrastructure is designed to measure and characterize risks to human security and systemic safeguards from AI acceleration.

+0.10
Article 28 Social & International Order
High Advocacy Framing
Structural
+0.10
Context Modifier
ND
SETL
+0.14

METR's research infrastructure is designed to characterize AI capabilities in ways that support informed policy and international coordination on AI risks.

+0.10
Article 29 Duties to Community
High Advocacy
Structural
+0.10
Context Modifier
ND
SETL
+0.09

METR's institutional commitment to 'scientific integrity as a core value' is reflected in transparent methodology, acknowledgment of limitations, and commitment to publish findings regardless of outcome.

+0.05
Preamble Preamble
Medium Advocacy
Structural
+0.05
Context Modifier
ND
SETL
0.00

The organization's structure as a safety-focused nonprofit reflects commitment to protecting human interests from AI risks.

ND
Article 2 Non-Discrimination

No discriminatory structures observed, but also no explicit non-discrimination framework.

ND
Article 4 No Slavery

Not applicable to this content.

ND
Article 5 No Torture

Not applicable to this content.

ND
Article 6 Legal Personhood

Not applicable to this content.

ND
Article 7 Equality Before Law

Not applicable to this content.

ND
Article 8 Right to Remedy

Not applicable to this content.

ND
Article 9 No Arbitrary Detention

Not applicable to this content.

ND
Article 10 Fair Hearing

Not applicable to this content.

ND
Article 11 Presumption of Innocence

Not applicable to this content.

ND
Article 12 Privacy

Not applicable to this content.

ND
Article 13 Freedom of Movement

Not applicable to this content.

ND
Article 14 Asylum

Not applicable to this content.

ND
Article 15 Nationality

Not applicable to this content.

ND
Article 16 Marriage & Family

Not applicable to this content.

ND
Article 17 Property

Not applicable to this content.

ND
Article 20 Assembly & Association

Not applicable to this content.

ND
Article 21 Political Participation

Not applicable to this content.

ND
Article 22 Social Security

Not applicable to this content.

ND
Article 24 Rest & Leisure

Not applicable to this content.

ND
Article 25 Standard of Living

Not applicable to this content.

ND
Article 26 Education

Not applicable to this content.

Supplementary Signals
How this content communicates, beyond directional lean.
Epistemic Quality
How well-sourced and evidence-based is this content?
0.81 · medium claims
Sources
0.8
Evidence
0.8
Uncertainty
0.8
Purpose
0.9
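The four sub-scores presumably roll up into the 0.81 headline figure, though the exact weighting is not disclosed; a plain average of Sources, Evidence, Uncertainty, and Purpose gives 0.825, close to but not equal to the displayed value. The sketch below shows only a generic weighted mean with illustrative equal weights:

```python
def composite_score(subscores, weights=None):
    """Weighted mean of signal sub-scores (weights are illustrative, not the site's)."""
    if weights is None:
        weights = {k: 1.0 for k in subscores}
    total = sum(weights.values())
    return sum(subscores[k] * weights[k] for k in subscores) / total


epistemic = {"sources": 0.8, "evidence": 0.8, "uncertainty": 0.8, "purpose": 0.9}
result = composite_score(epistemic)  # 0.825 with equal weights; the page shows 0.81
```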
Propaganda Flags
No manipulative rhetoric detected
0 techniques detected
Emotional Tone
Emotional character: positive/negative, intensity, authority
measured
Valence
0.0
Arousal
0.4
Dominance
0.6
Transparency
Does the content identify its author and disclose interests?
1.00
✓ Author ✓ Conflicts ✓ Funding
More signals: context, framing & audience
Solution Orientation
Does this content offer solutions or only describe problems?
0.59 · mixed
Reader Agency
0.7
Stakeholder Voice
Whose perspectives are represented in this content?
0.62 · 5 perspectives
Speaks: researchers, corporation
About: workers, institution, corporation
Temporal Framing
Is this content looking backward, at the present, or forward?
mixed · short term
Geographic Scope
What geographic area does this content cover?
global
Complexity
How accessible is this content to a general audience?
moderate · medium jargon · general
Longitudinal · 4 evals
Audit Trail 24 entries
2026-02-28 12:35 model_divergence Cross-model spread 0.53 exceeds threshold (4 models)
2026-02-28 12:35 eval Evaluated by claude-haiku-4-5-20251001: +0.13 (Mild positive)
2026-02-28 09:12 eval_success Light evaluated: Neutral (0.00)
2026-02-28 09:12 rater_validation_warn Light validation warnings for model llama-4-scout-wai: 0W 1R
2026-02-28 09:12 model_divergence Cross-model spread 0.53 exceeds threshold (3 models)
2026-02-28 09:12 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral)
reasoning: Editorial stance on AI productivity study, neutral rights discussion
2026-02-28 09:05 model_divergence Cross-model spread 0.53 exceeds threshold (2 models)
2026-02-28 09:05 eval_success Light evaluated: Neutral (0.00)
2026-02-28 09:05 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
reasoning: Technical blog post
2026-02-28 09:05 rater_validation_warn Light validation warnings for model llama-3.3-70b-wai: 0W 1R
2026-02-26 23:24 eval_success Evaluated: Moderate positive (0.53)
2026-02-26 23:24 eval Evaluated by deepseek-v3.2: +0.53 (Moderate positive) 11,938 tokens
2026-02-26 22:08 rater_validation_fail Parse failure for model llama-4-scout-wai: SyntaxError: Expected ':' after property name in JSON at position 3477 (line 131 column 19)
2026-02-26 20:01 dlq Dead-lettered after 1 attempts: Measuring the impact of AI on experienced open-source developer productivity
2026-02-26 20:01 dlq Dead-lettered after 1 attempts: Measuring the impact of AI on experienced open-source developer productivity
2026-02-26 20:01 eval_failure Evaluation failed: Error: Unknown model in registry: llama-4-scout-wai
2026-02-26 20:01 eval_failure Evaluation failed: Error: Unknown model in registry: llama-4-scout-wai
2026-02-26 19:59 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b
2026-02-26 19:59 dlq Dead-lettered after 1 attempts: Measuring the impact of AI on experienced open-source developer productivity
2026-02-26 19:59 eval_failure Evaluation failed: Error: Unknown model in registry: llama-4-scout-wai
2026-02-26 19:59 eval_failure Evaluation failed: Error: Unknown model in registry: llama-4-scout-wai
2026-02-26 19:58 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b
2026-02-26 19:57 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b
2026-02-26 19:12 dlq Dead-lettered after 1 attempts: Measuring the impact of AI on experienced open-source developer productivity
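The "cross-model spread" flagged repeatedly in the trail is consistent with a max-minus-min check over the per-model scores logged above: deepseek-v3.2 at +0.53, both Llama raters at 0.00, and claude-haiku at +0.13 give a spread of exactly 0.53. A sketch of such a check; the 0.5 threshold is an assumption inferred from the log, not a documented value:

```python
def model_divergence(scores, threshold=0.5):
    """Return (spread, flagged): spread is max - min across model scores.

    `threshold` is an assumption; the audit trail only shows that 0.53 trips it.
    """
    spread = max(scores.values()) - min(scores.values())
    return spread, spread > threshold


scores = {
    "deepseek-v3.2": 0.53,
    "llama-3.3-70b-wai": 0.00,
    "llama-4-scout-wai": 0.00,
    "claude-haiku-4-5": 0.13,
}
spread, flagged = model_divergence(scores)  # spread == 0.53, flagged is True
```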