345 points by nee1r 6 days ago | 80 comments on HN
Neutral Editorial · v3.7 · 2026-02-26 00:31:01
Summary: Technology Governance & Labor Rights
This technical product announcement for FDM-1, a foundation model for autonomous computer use, demonstrates mild positive engagement with free expression and information access (Article 19) through public research sharing and internet-scale data advocacy, but exhibits significant negative signals on privacy (Article 12), accessibility (Article 2), labor rights (Article 23), and social governance (Articles 28-29). The post frames technological capability advancement without addressing privacy consent for training data, worker displacement concerns, or institutional frameworks needed to govern autonomous systems deployment in high-stakes domains.
Curious about the masked diffusion IDM choice. They mention CTC loss and cross-entropy both underperformed — I'd love to see ablations on that. The claim that typos were "extremely common" with non-causal cross-entropy is interesting but hand-wavy without numbers.
Hey guys! I’m Neel. I’ve been holed up in our South Park office for the past year working on model training. Excited to share our research!
This is a preview of a very different type of computer use model—we train on the internet. Specifically, we have 11 million hours of computer video stored on our storage cluster (previously shared at https://news.ycombinator.com/item?id=45438496 !) and the model works at 30 FPS. Since we match the fundamental form factor of computer use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale more, it's a fun frontier to work on (not language models :) ).
The team and I will be online responding to the comments, so drop any questions.
I really liked the point about ctrl-c only being labellable retrocausally. I do think that with enough past context you should be able to know what was copied - in some sense the past does encode the future - but also an agentic decision is precisely the kind where the future is more informative than the past for reconstructing that decision.
It does make me wonder if you should have the inverse dynamics model split into specifically retrocausal and causal parts. You kind of do this already with the inverse and forward dynamics models, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is kind of interesting.
I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.
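The causal/retrocausal split could indeed be emulated with visibility masks rather than two separate networks. A minimal sketch, assuming one token per frame (the masking scheme here is illustrative, not FDM-1's actual architecture):

```python
import numpy as np

def context_masks(T: int, t: int):
    """Visibility masks over T frames for labelling the action at frame t.

    causal: the labeller sees only frames <= t (the past).
    retrocausal: the labeller sees only frames > t (the future).
    Hypothetical scheme; the post does not specify FDM-1's masking.
    """
    idx = np.arange(T)
    causal = idx <= t
    retrocausal = idx > t
    return causal, retrocausal

past, future = context_masks(T=8, t=3)
# The two views partition the clip: a standard non-causal IDM would
# see past | future, i.e. every frame at once.
```

Training both views from the same diffusion backbone with different masks would get the feedback-loop effect without an architecture change, which is essentially the suggestion above.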
The video compression is very cool. And the small tricks like binning the mouse movements.
Wonder how much data is generalizable across different UIs? ie how good will the model be at using Figma if it’s never seen it before but has seen a lot of Photoshop
This looks extremely impressive, really deserves more attention here.
Are the inverse dynamics and forward dynamics models trained separately? It sounds like if the inverse dynamics model is meant to extrapolate more training data, then perhaps all that means is it takes very little data to generalize directly with the forward dynamics model assuming the right architecture.
At first glance, this looks incredible to me. The authors train one model on 40K hours of computer-use video, previously labeled by contractors with keyboard and mouse actions, then use that model, in effect, to label 11M hours of computer-use video, which they use to train the computer-action model. The key advance is in compression. Quoting from the OP:
> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.
I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO), to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.
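As a rough sanity check on the quoted numbers, taking "nearly 2 hours" as ~110 minutes (an assumption) and comparing against the million-tokens-per-minute figure specifically — the post's 50x/100x claims presumably use different baselines:

```python
FPS = 30
PREV_TOKENS_PER_MIN = 1_000_000     # "burn a million tokens ... one minute"
FDM_WINDOW_MIN = 110                # "nearly 2 hours" in the same budget (assumed)

# Per-frame token budget implied by each encoder.
prev_tokens_per_frame = PREV_TOKENS_PER_MIN / (FPS * 60)                  # ~556
fdm_tokens_per_frame = PREV_TOKENS_PER_MIN / (FDM_WINDOW_MIN * FPS * 60)  # ~5
ratio = prev_tokens_per_frame / fdm_tokens_per_frame                      # ~110x
```

A handful of tokens per 30 FPS frame is what makes multi-hour context windows over raw screen video tractable at all.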
yeah we've done audio work in the past so we'll def merge the recipes at some point. long term we should have the full io a human has (except maybe not generating video for video calls, that seems a bit much)
good question! we use exponential binning (map the mouse movements onto a plane with exponentially increasing tick marks https://si.inc/fdm1/exponential_binning.webp) but tried a bunch of other methods (linear creates too many tokens for the model to learn well). Polar coordinates seem like a better solution but empirically didn't work well because the tokens got too coarse too fast.
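A minimal 1-D sketch of the idea — the base, bin count, and clipping here are made-up parameters, not the ones in the linked figure:

```python
import math

def exp_bin(delta: int, base: float = 2.0, max_bin: int = 10) -> int:
    """Map a mouse delta (pixels) to a signed, exponentially spaced bin.

    Bin k covers magnitudes in [base**(k-1), base**k): small moves keep
    near-pixel precision, large moves get coarse, and the vocabulary
    stays tiny compared to one token per pixel offset.
    """
    if delta == 0:
        return 0
    k = min(int(math.log(abs(delta), base)) + 1, max_bin)
    return k if delta > 0 else -k

# deltas of 1, 3, 100 px land in bins 1, 2, 7; a +-1000px axis needs
# ~21 tokens instead of ~2001 under linear binning.
```

This also makes the tradeoff with polar coordinates concrete: angle bins would have to coarsen as radius grows, which matches the "too coarse too fast" observation.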
the main chain of experiments was trying causal => non-causal => non-causal with CTC and CE. i think a good intuition here is that you fundamentally need a generative approach because there definitely are multiple correct IDM labels.
i actually drove the car (with arrow keys) around South Park for ~45 minutes as finetuning data, no extra labelling other than that. think the car line graph is super cool because you actually see the videogame prior working
it's a pretty general policy but this is all super early, it's great at exploring websites so fuzzing was easy, for CAD it has good enough base rates with the few-shot prompt when we do the repetitive stuff, and we gave it checkpoints on each step, the other stuff in the mosaic are just some of our favorite clips from internal evals
thanks! the inverse dynamics model is trained first on 40k hours of data and then frozen to label all 11 million hours. yup! the idea is that it should take a small amount of data to generalize environment dynamics, then you can use a lot of data to understand actions.
In particular, the forward rollout module is very important. It aligns your (effectively) world model with what it expects from the world, and keeping those in sync, I think, is what gives this the power to generate the state-action pairs it needs to keep training semi-supervised.
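One plausible reading of that sync is an agreement filter: pseudo-label a transition with the IDM, roll it forward with the forward dynamics model, and keep the pair only when the prediction matches the observed next frame. A numeric toy sketch (all names and the scalar "state" are illustrative, not the authors' method):

```python
def agreement_filter(states, idm_label, fdm_predict, observed_next, tol=0.1):
    """Keep (state, action) pairs only where the inverse and forward
    dynamics models agree with the actual recording.

    idm_label(s)      -> pseudo-labelled action for state s
    fdm_predict(s, a) -> predicted next state after taking a in s
    observed_next(s)  -> next state the video actually shows
    """
    kept = []
    for s in states:
        a = idm_label(s)
        if abs(fdm_predict(s, a) - observed_next(s)) <= tol:
            kept.append((s, a))  # both models consistent with reality
    return kept


# Toy dynamics where everything agrees: all three transitions survive.
kept = agreement_filter(
    [0.0, 1.0, 2.0],
    idm_label=lambda s: 1.0,
    fdm_predict=lambda s, a: s + a,
    observed_next=lambda s: s + 1.0,
)
```

Transitions where the two models disagree would be dropped or down-weighted, which is one way such a loop could stay stable as it trains on its own labels.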
Content explicitly discusses information freedom: model trains 'unsupervised from the entirety of the internet' and post advocates for 'internet-scale video corpus' as necessary for AI development. Technical documentation and demos are publicly shared.
FW Ratio: 57%
Observable Facts
Post states model 'can learn unsupervised from the entirety of the internet,' affirming internet-scale information access.
Content advocates for 'internet-scale video corpus' as prerequisite for AI capability, asserting need for broad information access.
Technical documentation and video demos are publicly published without paywall.
DCP notes 'Blog post is publicly readable; technical content is open-access.'
Inferences
Advocacy for unrestricted internet-scale data sourcing directly supports Article 19 freedom to seek and receive information.
Public publication of technical findings supports freedom of expression and information dissemination.
DCP access model modifier (+0.05) reflects mild positive signal from open-access policy.
Post describes technical innovation and scientific advancement; frames FDM-1 as enabling new capability in computer vision and AI research. Advocates for 'internet-scale video corpus' as necessary for scientific progress.
FW Ratio: 57%
Observable Facts
Post presents FDM-1 as scientific advancement: 'the first model' with capability for previously unachievable tasks.
Content shares technical methodology, training approach, and performance demonstrations publicly.
Post advocates for 'internet-scale' data access as prerequisite for scientific advancement.
DCP notes 'Org develops AI/computer automation tech; neutral toward human rights; modest framing around technical capability.'
Inferences
Public research sharing supports scientific participation and advancement.
Framing innovation as incremental progress on scale aligns with cultural and scientific contribution narrative.
DCP mission modifier (+0.1) reflects neutral-to-modest positive signal from public technical contribution.
Content is published openly and accessible globally; post discusses training on internet-scale video and knowledge sharing through technical publication.
FW Ratio: 60%
Observable Facts
Blog post is publicly accessible without paywall or login requirement.
Content references 'unsupervised' learning from 'the entirety of the internet,' suggesting global information access.
Post is available on public domain (si.inc) with no apparent geographic blocking.
Inferences
Public availability of technical content supports freedom of movement through information dissemination.
Global training data sourcing indicates engagement with worldwide information flows.
No observable engagement with right to life or personal security in context of autonomous vehicle deployment.
FW Ratio: 67%
Observable Facts
Self-driving demo demonstrates autonomous navigation on public roads in San Francisco.
No safety assurance statements or risk disclosures visible in the post.
Inferences
Absence of explicit safety framing on an autonomous vehicle demo is neutral; product deployment itself raises Article 3 questions but post does not address them.
No engagement with slavery or forced servitude; content does not address labor implications of automation.
FW Ratio: 50%
Observable Facts
Post discusses model training on 11-million-hour video dataset but does not detail data labeling workforce conditions.
Inferences
While large-scale data labeling historically involves precarious labor, the post provides no information about labeling workforce conditions or protections.
Content framing emphasizes technical capability and product launch without engagement with dignity, human welfare, or social purpose dimensions reflected in UDHR Preamble.
FW Ratio: 60%
Observable Facts
Page presents FDM-1 as 'the first model' with capability framing centered on technical scale and performance metrics.
No explicit reference to human dignity, freedom, or justice in the opening framing.
Content is publicly available and readable without paywall or login.
Inferences
The framing prioritizes innovation and capability over societal impact or human-centered values.
Public accessibility suggests some alignment with information dissemination principles, but lack of values-based framing indicates mild negative directionality on Preamble.
Post frames technical capability advancement without discussing duties or responsibilities. It omits discussion of human rights obligations, responsible deployment, and ethical constraints on model use.
FW Ratio: 57%
Observable Facts
Content focuses on capability demonstration (CAD, autonomous driving, testing) without discussing responsible use or limitations.
No discussion of model limitations, failure modes, or potential harms.
Post does not reference human rights constraints on AI deployment or model use.
Framing emphasizes capability scaling without counterbalancing duty or responsibility language.
Inferences
Absence of responsibility framing in context of powerful automation technologies suggests Article 29 duties are not integrated into narrative.
Technology is presented as advancing capability independent of corresponding human rights duties or ethical constraints.
Lack of limitations discussion suggests incomplete engagement with responsibilities associated with advanced AI systems.
Content does not address discrimination or equal protection; self-driving demo shows real-world deployment without discussion of accessibility or inclusivity.
FW Ratio: 50%
Observable Facts
Self-driving demo uses 'arrow keys' for remote control, assuming standard keyboard interaction.
Video demonstrations lack visible captions or transcripts.
No discussion of accessibility considerations in model design or deployment.
Inferences
Keyboard-only control excludes users with motor disabilities; lack of captions excludes deaf and hard-of-hearing users.
Absence of accessibility discussion in a product with real-world applications (autonomous driving) suggests inclusive, non-discriminatory design is not a priority.
Post emphasizes replacing 'contractor-annotated screenshots' with unsupervised learning, framing contractor labor as an obsolete cost center rather than recognizing worker dignity. Describes automation of 'CAD, finance, engineering, and eventually ML research' without discussion of labor displacement or worker rights.
FW Ratio: 57%
Observable Facts
Post states previous approach 'requires contractor-labeled annotations. These are expensive, so current computer action datasets are tiny.'
Presents unsupervised learning as solution to eliminate need for contractor labor.
Enumerates applications as 'CAD, finance, engineering, and eventually ML research' without discussing labor impact.
No discussion of retraining, transition support, or worker rights in context of automation.
Inferences
Framing contractor labor as a cost problem to be eliminated treats workers as expenses rather than rights-holders.
Absence of labor impact discussion in a post about workplace automation suggests workers' rights (Article 23) are not a design consideration.
Technology is explicitly designed to displace human labor in multiple sectors without observable mitigation measures.
Post presents technical capability (autonomous driving, UI testing, CAD automation) without discussing social order or frameworks needed to govern deployment. Frames capabilities as inevitable ('coworker for CAD, finance, engineering') without addressing regulatory, ethical, or institutional structures required by Article 28.
FW Ratio: 57%
Observable Facts
Self-driving demo shows autonomous vehicle navigation on public roads without discussion of regulatory compliance or testing protocols.
Post presents applications as future certainties: 'will become a coworker for CAD, finance, engineering, and eventually ML research.'
No reference to legal frameworks, ethical review, or institutional oversight of model deployment.
Describes 'fuzzing' banking apps and autonomous driving without discussing security, safety, or regulatory considerations.
Inferences
Framing capabilities as inevitable without engaging social structures needed to govern them suggests Article 28 (social order and security) is not a primary consideration.
Autonomous vehicle deployment on public roads without discussion of institutional coordination or regulatory frameworks indicates potential misalignment with ordered social frameworks.
Absence of governance discussion in high-stakes applications (autonomous driving, financial systems) suggests technology is presented as advancing independent of social order.
Post describes training on 11-million-hour screen recording dataset without addressing privacy consent or data subject notification. Frames data collection as technical achievement rather than privacy-sensitive activity.
FW Ratio: 57%
Observable Facts
Content states FDM-1 is 'trained on videos from a portion of our 11-million-hour screen recording dataset.'
No disclosure of consent mechanisms, data subject notification, or privacy protections for recorded video.
Dataset sourced from internet-scale video corpus including 'film editing, coding livestreams, video game playthroughs' without discussion of consent.
No privacy policy link or disclosure visible on the post.
Inferences
Training on massive internet-scale video corpus implies collection of personal data (screen recordings may contain sensitive information) without observable consent disclosures.
Framing dataset as a technical resource rather than privacy-sensitive material suggests privacy protections were not a primary design consideration.
Absence of privacy policy and consent transparency suggests structural underperformance on Article 12.
Blog post itself is public and freely accessible; DCP access model modifier (+0.05) reflects open-access technical content. Technical audience orientation somewhat limits accessibility but does not restrict information availability.
Blog post is publicly readable without geographic restrictions; open technical content suggests support for freedom of movement through information access.
Public sharing of technical research and demonstrations supports scientific participation; DCP notes modest mission framing around technical capability (modifier +0.1 affects Article 19 and 27).
No observable structural engagement with social order principles; product deployment (autonomous vehicle demo) occurs without apparent coordination with governance frameworks.
No privacy policy visible on URL per DCP; JavaScript-heavy site may employ tracking. DCP notes potential privacy concerns from dataset training sourced but no on-domain policy visible (DCP modifier: null, but affects Article 12).
Video content embedded without captions; JavaScript-heavy player may exclude assistive technology users; inherited DCP modifier (-0.15) for accessibility applies.
Video-heavy content with JavaScript-intensive player and no captions; DCP accessibility modifier (-0.15) applies. Technical content is specialized and assumes domain knowledge, limiting education accessibility.
Supplementary Signals
How this content communicates, beyond directional lean.
Claims of 'first model' and 'uniquely good' without peer review or third-party validation visible in post; authority rests on company assertion.
loaded language
Describes contractor labor as 'expensive' cost problem; frames elimination of human annotation work as technical progress without acknowledging worker impact.
causal oversimplification
Suggests internet-scale video corpus is necessary condition for AI competence without discussing alternatives, safeguards, or tradeoffs (privacy, consent, labor).
How accessible is this content to a general audience?
technical · high jargon · domain-specific
Longitudinal
1883 HN snapshots · 8 evals
Audit Trail
28 entries
2026-02-28 14:29
eval_success
Lite evaluated: Neutral (0.00)
--
2026-02-28 14:29
eval
Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
reasoning
Technical post no rights stance
2026-02-27 16:34
eval_success
Light evaluated: Neutral (0.00)
--
2026-02-27 16:34
eval
Evaluated by llama-4-scout-wai: 0.00 (Neutral)
2026-02-26 21:17
eval_success
Evaluated: Neutral (0.02)
--
2026-02-26 21:17
eval
Evaluated by deepseek-v3.2: +0.02 (Neutral) 11,043 tokens
2026-02-26 20:26
dlq
Dead-lettered after 1 attempts: The First Fully General Computer Action Model
--
2026-02-26 20:25
rate_limit
OpenRouter rate limited (429) model=llama-3.3-70b
--
2026-02-26 20:23
rate_limit
OpenRouter rate limited (429) model=llama-3.3-70b
--
2026-02-26 20:22
rate_limit
OpenRouter rate limited (429) model=llama-3.3-70b
--
2026-02-26 17:51
dlq
Dead-lettered after 1 attempts: The First Fully General Computer Action Model
--
2026-02-26 17:49
rate_limit
OpenRouter rate limited (429) model=llama-3.3-70b
--
2026-02-26 17:48
rate_limit
OpenRouter rate limited (429) model=llama-3.3-70b
--
2026-02-26 17:47
rate_limit
OpenRouter rate limited (429) model=llama-3.3-70b
--
2026-02-26 15:05
rater_validation_fail
Parse failure for model deepseek-v3.2: Error: Failed to parse OpenRouter JSON: SyntaxError: Expected ',' or '}' after property value in JSON at position 421 (line 14 column 4). Extracted text starts with: {
"schema_version": "3.7",
"ev
--
2026-02-26 09:19
dlq
Dead-lettered after 1 attempts: The First Fully General Computer Action Model
--
build 1ad9551+j7zs · deployed 2026-03-02 09:09 UTC · evaluated 2026-03-02 13:57:54 UTC
Support HN HRCB
Each evaluation uses real API credits. HN HRCB runs on donations — no ads, no paywalls.
If you find it useful, please consider helping keep it running.