1082 points by jacquesm 2758 days ago | 140 comments on HN
Moderate positive · Contested
Landing Page · v3.7 · 2026-02-28 11:42:51
Summary: Open Knowledge Access Champions
Academic Torrents is a platform for globally distributing research datasets via BitTorrent infrastructure. The content champions open access to scientific knowledge and educational resources, with strong positive alignment to UDHR Articles 19 (freedom of information), 26 (education), and 27 (scientific advancement and intellectual property rights). The platform's mission and structure are fundamentally oriented toward democratizing research access for researchers and institutions worldwide.
I can't find any info on the website about how the data is hosted, so I'm wondering whether it works like The Pirate Bay or whether it also hosts data itself. If the former, it will be hard for researchers to use and share. For one, academic institutions nowadays have tightened control over network access, which definitely hinders hosting large amounts of data shared over the BT protocol; for another, research is often fragmented, which by itself limits the pool of interested users, so the sharing falls on a few people's goodwill.
Perhaps I'm going off tangent, but the social dynamics associated with torrenting are pretty darn interesting.
On one hand, they seem to converge toward a consensus via the most-seeded and most-downloaded files, with popularity as a trust factor. On the other, they also promote the dissemination of ideas whose mere knowledge poses a threat to the status quo, that is, the state toward which a society has been coerced.
On one hand, torrents are about rejecting the Publisher and Big Media status quo, but on the other they are about arriving at a democratic consensus on which films/books/... are the best or most useful.
And don't even get me started about the constant ethical dilemmas associated with sharing and who should control or own the data.
To tie all those threads into a broader topic, we could associate the torrent subculture with the Dionysian archetype Nietzsche wrote about.
Curious whether a potential solution would be open, read-only databases that you could query directly, versus everyone copying the same data over and over. Kind of how you don't download Wikipedia but access what you need. I realize there are a lot of things to consider. And not just a REST API or the like, but an actual database.
I realize it wouldn't scale, would cost money, etc., but it could be interesting.
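As a toy sketch of the idea (everything here is hypothetical: the dataset file, the endpoint, the size cap), a read-only query endpoint over a local snapshot could look like:

```python
# Serve read-only SQL queries against a local SQLite snapshot of a
# dataset, rejecting anything that isn't a SELECT.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "dataset.db"  # hypothetical snapshot

@app.route("/query")
def query():
    sql = request.args.get("sql", "")
    # Crude read-only guard; a real service would parse and sandbox.
    if not sql.lstrip().lower().startswith("select"):
        return jsonify(error="only SELECT statements allowed"), 400
    conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)  # read-only open
    try:
        cur = conn.execute(sql)
        cols = [d[0] for d in cur.description]
        return jsonify(columns=cols, rows=cur.fetchmany(1000))  # cap result size
    except sqlite3.Error as e:
        return jsonify(error=str(e)), 400
    finally:
        conn.close()
```

Throttling, auth, and a real query planner are exactly the "lot of things to consider."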
I find it fascinating how difficult it is to find geological data. The combined datasets of oil and mining companies, plus government data, have a huge amount of Earth mapped. And yet this data is extremely hard to find in a computer-consumable way. Most of it is locked up in PDFs or image scans of maps, or in proprietary MapInfo/Autodesk formats. It seems to me that a large dataset of all human knowledge of Earth would be massively valuable to humanity. Unfortunately, oil/mineral maps are a cornerstone of a lot of very powerful companies, so I don't think we'll see them released any time soon.
Organizing this data would also be a hell of an effort because the maps use different projections, are from a huge variety of times, and are often inconsistent (overlapping areas with different mineral deposit analyses).
My browser reports the "create an account" page is not secure, so maybe best not to use this as an uploader, at least until they fix that. For the creator of the site: pages that collect passwords should be served over HTTPS.
All the .torrent files are served over HTTP, so with a simple MITM attack a bad actor could swap in their own custom-tweaked version of any dataset here to serve whatever goals they have.
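As a partial mitigation until that's fixed, a downloader can at least check the .torrent file against a fingerprint obtained out of band (say, published over HTTPS or printed in a paper). A minimal sketch; the URL and hash are hypothetical:

```python
# Fetch a .torrent over plain HTTP and refuse it unless its SHA-256
# matches a fingerprint obtained over a trusted channel.
import hashlib
import urllib.request

def fetch_and_verify(url: str, expected_sha256: str) -> bytes:
    data = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"torrent file hash mismatch: got {digest}")
    return data
```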
I really wish we could get basic security concepts added to the default curriculum for grade schoolers. You shouldn't need a PhD in computer security to know this stuff. These site creators have PhDs in other fields, but obviously no concept of security. This stuff should be basic literacy for everyone.
Anyone know why they opted not to use Webtorrent for this? Obviously straight Bittorrent is more battle-tested, but the extra friction of having to know how to work a BT client is non-trivial.
It's good that more than one option exists for something like this, though I personally prefer something like Zenodo, where every record automatically gets a DOI attached.
1. It appears to be sponsored by seedbox hosting companies, plus a Google ad. This is misleading (no, it is not directly sponsored or endorsed by Salesforce, which is the Google ad I see).
2. Many higher education institutions will block BitTorrent on their firewalls to prevent/reduce copyright infringement.
3. How legitimate is the data? Is there any vetting of the content to ensure that it doesn't violate copyright and that the data was legally obtained (e.g., not scraped)? A DMCA takedown comes too late if we've already accidentally seeded infringing information, which could harm our reputation.
4. The site claims to be "used by" a group of very big names (Stanford, MIT, UT Austin, etc.). Did they ask for or give permission to be cited? Do they endorse the use of this service?
5. HTTPS. Please?
It's a great idea but it needs a bit more polish before I could even suggest this to my management.
Perfect use case for http://datproject.org/. It has Git-style versioning on top of BitTorrent, so if something in the dataset gets updated you only download the diff (unlike a torrent).
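Not dat's actual wire format, but the core idea is easy to sketch: index a dataset as content-addressed chunks, so a mirror that already has version 1 only fetches the chunks whose hashes changed in version 2. The chunk size here is arbitrary:

```python
# Toy content-addressed chunk index: compare two versions of a dataset
# by chunk hash and transfer only what changed.
import hashlib

CHUNK = 1 << 20  # 1 MiB chunks (arbitrary choice)

def chunk_index(path: str) -> list[str]:
    hashes = []
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

def chunks_to_fetch(old: list[str], new: list[str]) -> set[str]:
    # Only these chunks need to cross the wire for an update.
    return set(new) - set(old)
```

(Real systems use content-defined chunk boundaries so an insertion near the start doesn't shift every hash after it.)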
It's in the name: Academic 'Torrents'. They just host the torrent files, which are only a couple hundred kilobytes. I feel like you're sort of missing the point of a service like this; it's not to provide the fastest download available, but to ensure data is accessible even if the original download source is unavailable or inaccessible for certain people or locations.
As long as you're not downloading copyrighted data, there should be no issue with using the BT protocol on a company or academic network, provided there is no outright ban on the protocol in your network usage policy. The BT protocol actually lends itself quite well to large datasets like those hosted here, thanks to its built-in error checking (no more spending hours downloading a huge dataset only to find your connection did something silly for a second and corrupted the whole file). It can also provide much faster download speeds on popular files due to the number of peers available, whereas a normal hosting arrangement would likely get slower on popular files due to network congestion and file access speeds.
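That built-in error checking is just per-piece hashing: the torrent's info dictionary carries a SHA-1 digest for every fixed-size piece, and a client re-downloads only the pieces that fail. A rough sketch for a single-file torrent, assuming `piece_length` and the concatenated `pieces` digests have already been parsed out of the metainfo:

```python
# Recompute each piece's SHA-1 and report the indices that don't match;
# a client re-requests only those pieces, not the whole file.
import hashlib

def verify_pieces(path: str, piece_length: int, pieces: bytes) -> list[int]:
    expected = [pieces[i:i + 20] for i in range(0, len(pieces), 20)]
    bad = []
    with open(path, "rb") as f:
        for index, digest in enumerate(expected):
            data = f.read(piece_length)  # last piece may be shorter
            if hashlib.sha1(data).digest() != digest:
                bad.append(index)
    return bad
```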
Based on the sponsors I'd say a lot of the content is hosted by some seedbox companies so you wouldn't have to worry about people seeding at the beginning or on slow connections that much.
The absence of rules is anarchic, but not democratic. In anarchy, the powerful coerce, manipulate, and otherwise dominate the masses, creating a status quo that the powerful desire, and abusing the weak without restraint. Historically, the outcome is despots, warlords, feudalism, and brutality. In democracy everyone has an equal vote and equal rights, and it requires a system of rules.
Many had the same hopes for the Internet and social media, for example. But when these things became valuable - influential - powerful interests acted to control and manipulate them, to obtain money, political power and social outcomes. It's hard to claim that the results are that people are choosing information that is "the best or most useful".
I think politics and social outcomes, such as status quos, are unavoidable results of human interaction. Eliminating rules eliminates the protection against arbitrary power and returns us to the world of despots. The politics is unavoidable; the question is, how do we want to manage it?
EDIT: Some major edits; sorry if you read an earlier version.
>On the other, they also promote the dissemination of ideas whose mere knowledge poses a threat to the status quo, that is, the state toward which a society has been coerced.
Actually, one of the biggest uses of torrents is to disseminate pop culture materials that fall right in the middle of US culture. Probably dwarfing "radical" stuff by many orders of magnitude.
For me, torrenting is mostly about its stigma of illegality on one side and its very competitive performance on the other.
So as soon as someone distributes some data via a torrent, everybody starts asking if it is legal to use that data. When the data is offered via a download link on some website, most people assume that they got the data through a legal channel.
That is ignoring that a sufficiently motivated actor can ensure that doesn't happen.
In one of my private trackers there is a person with a seedbox that downloads every single torrent as soon as it is uploaded, and they have been doing so for quite a few years now.
This ensures that while some things will indeed be seeded more, nothing quite vanishes.
Then again, the media on that specific tracker is fairly small, so it is not prohibitively expensive to archive everything. A single raw Blu-ray ISO movie file elsewhere could be the size of thousands upon thousands of torrents on that tracker.
Maybe something useful would be a distributed torrent system tied more closely to the tracker, where membership would require you to integrate into the swarm by automatically downloading a percentage of the entire corpus, ensuring the health of the tracker.
A new peer would thus bear part of the load of keeping everything accessible.
I think this would require decently heavy curation, but I could see how it could be useful for something like the OP specifically, where having scientific papers lost for good would be a shame.
Most US government maps are available in a single clearinghouse: https://nationalmap.gov/. State governments and counties also have websites with geological data (you need it for a septic permit) and land plots. The data is just getting more open, which is awesome. It may be in different formats, but nothing a few lines of Python and a PostGIS database can't handle.
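For instance, a hedged sketch with geopandas (table and connection details are made up; geopandas reads anything GDAL/OGR has a driver for, including MapInfo TAB):

```python
# Read a MapInfo file, normalize its projection, and load it into PostGIS.
# Requires geopandas, sqlalchemy, and geoalchemy2.
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/geology")  # hypothetical DSN

gdf = gpd.read_file("deposits.tab")  # GDAL/OGR picks the right driver
gdf = gdf.to_crs(epsg=4326)          # reproject to a common CRS
gdf.to_postgis("mineral_deposits", engine, if_exists="replace")
```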
I work with a slowly changing dataset that's about 100GB to download in full. A few people a week download it.
I've considered adding a torrent download, because it includes built-in verification of the download. A common problem is users reporting that their download over HTTP is corrupt, but I'm not sure they'd be able or willing to use BitTorrent.
(Also, for many users the download is probably fine, but they can't open it in Excel. BitTorrent won't help with that.)
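A lighter-weight middle ground (a sketch; the file and hash names are hypothetical) is publishing a SHA-256 alongside the HTTP download, so corrupt transfers are at least detectable without asking users to learn a BT client:

```python
# Hash a large download in streaming fashion and compare against the
# checksum published next to the file.
import hashlib

def sha256_file(path: str, bufsize: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(bufsize):
            h.update(block)
    return h.hexdigest()

# Usage:
# assert sha256_file("dataset.tar.gz") == PUBLISHED_SHA256
```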
I've been thinking about the same for quite a while now. In fact, look at the overlap between GraphQL and SQL conceptually. I absolutely think there is something to this.
In the past, I have used certain wide open read only genomics databases (not going to name it so it doesn't get hammered by HN).
Other posters are right about services such as BigQuery, but I think there's a place for an open-source project here: a layer that interfaces SQL to databases and adds caching, throttling, and other services on top. That's how you make it scale.
The Dremio project (open source by the backers of Apache Arrow) has a SQL REST API that converts a standard SQL dialect/datatypes to the underlying systems. I think that's a good start and Dremio has a ton of other awesome functionality like Apache Arrow caching.
A simple model: expose an expression language (it could even be something other than SQL, like JSONiq), a mapper from that to SQL, and a web-service API on top with a pluggable connector model.
I say I'm going to start an open-source project around this all the time but haven't built up the momentum to do it. Argh!
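A minimal sketch of that model, with all names hypothetical and a naive cache standing in for the caching/throttling layer:

```python
# Map a tiny JSON expression language to SQL and run it read-only,
# memoizing results. A real mapper would validate identifiers against a
# schema whitelist instead of interpolating them.
import sqlite3
from functools import lru_cache

def to_sql(expr: dict) -> str:
    # e.g. {"from": "stars", "select": ["ra", "dec"], "limit": 10}
    cols = ", ".join(expr.get("select", ["*"]))
    limit = int(expr.get("limit", 100))
    return f"SELECT {cols} FROM {expr['from']} LIMIT {limit}"

@lru_cache(maxsize=1024)
def run(sql: str) -> tuple:
    conn = sqlite3.connect("file:catalog.db?mode=ro", uri=True)  # read-only
    try:
        return tuple(conn.execute(sql).fetchall())
    finally:
        conn.close()
```

Swap the sqlite3 call for BigQuery, Postgres, or whatever backend via the pluggable-connector idea.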
> This stuff should be basic literacy for everyone.
Arguably, one compromised x.509 CA in the PKI jeopardizes all SSL/TLS channel security if there's no certificate pinning and no alternate channel for distributing signed certificate fingerprints (cryptographically signed hashes).
We could teach blockchain and cryptocurrency principles (private/secret keys, public keys, hash verification); there, at least, there's money on the table.
GPG presumes secure key distribution (`gpg --verify .asc`).
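Pinning itself is simple enough to demonstrate in a few lines (the pinned value is hypothetical and must come from some trusted channel):

```python
# Fetch a server's leaf certificate and compare its SHA-256 fingerprint
# to a pin distributed out of band.
import hashlib
import ssl

def cert_fingerprint(host: str, port: int = 443) -> str:
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()

# if cert_fingerprint("example.org") != PINNED_SHA256: refuse to connect
```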
Sometimes people upload 1TB files which are not intended to be mirrored or are not of interest to many people. We don't want people who donate hosting to mirror this content unless they really want to, but we also want to make mirroring easy and automatic. Using collections, each of which has an RSS feed, content can be curated by someone you trust to decide what should be mirrored. I curate many collections, including video lectures, deep learning, and medical datasets.
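A mirror can consume such a feed with a few lines; this sketch assumes a standard RSS layout where item links point at .torrent files:

```python
# Pull every .torrent link out of a collection's RSS feed; hand the URLs
# to a BitTorrent client's watch directory to mirror automatically.
import urllib.request
import xml.etree.ElementTree as ET

def torrent_links(feed_url: str) -> list[str]:
    xml_data = urllib.request.urlopen(feed_url).read()
    root = ET.fromstring(xml_data)
    return [el.text for el in root.iter("link")
            if el.text and el.text.endswith(".torrent")]
```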
The project is run by the Institute for Reproducible Research, a U.S. 501(c)(3) nonprofit (http://reproducibilityinstitute.org), and this site has an overhead cost of ~$500/year. We plan to fund this project for at least the next 30 years. The community hosts the data, and we also coordinate donations of hosting from our sponsors (listed on the home page).
We also run the project ShortScience.org! Check it out!
The data is hosted by the community, and we also coordinate hosting from our sponsors (listed on the home page).
We work with academic institutions to ensure they allow this service. Please report universities which block the service using the feedback button shown in the lower right of the webpage.
We also encourage uploaders to specify HTTP seeds (aka url-lists) to provide a backup URL that can be contacted automatically if BT is blocked. We also offer a Python API designed for clusters and university computers, written in pure Python, which supports HTTP seeds: https://github.com/AcademicTorrents/python-r-api
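The fallback idea itself (BEP 19's "url-list" key) is straightforward; here's a rough sketch, assuming the metainfo has already been bencode-decoded into a dict (this is not the python-r-api interface):

```python
# If BitTorrent is blocked, fetch the payload directly from a web seed.
import urllib.request

def http_fallback(metainfo: dict, dest: str) -> None:
    seeds = metainfo.get(b"url-list", [])
    if isinstance(seeds, bytes):  # BEP 19 allows a single URL string
        seeds = [seeds]
    for url in seeds:
        try:
            with urllib.request.urlopen(url.decode()) as r, open(dest, "wb") as f:
                f.write(r.read())
            return
        except OSError:
            continue  # try the next web seed
    raise RuntimeError("no reachable web seed")
```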
> I find it fascinating how difficult it is to find geological data.
Discoverability for openly released scientific datasets is a huge problem in general. While some enterprising folks have worked on adding parsers for scientific data formats such as NetCDF and HDF5 to Apache Tika (which can then be indexed by Solr/Elasticsearch/whatever) [0], the vast majority of scientific file formats don't have parsers available. Even worse, in the climate of publish-or-perish, most scientists are unaware of, or unlikely to prioritize, metadata extraction and indexing tools, even though these would make their data far more searchable by relevant metadata (such as equipment settings).
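The extraction step itself isn't even the hard part; for HDF5, a sketch like this (h5py assumed) already surfaces the attributes you'd want to index:

```python
# Walk an HDF5 file and collect every attribute on every group/dataset,
# ready to feed into a search index.
import h5py

def extract_metadata(path: str) -> dict:
    meta = {}
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            for key, value in obj.attrs.items():
                meta[f"{name}/{key}"] = str(value)  # e.g. equipment settings
        f.visititems(visit)
        meta.update({k: str(v) for k, v in f.attrs.items()})  # root attrs
    return meta
```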
I have some personal experience in this area- when I was working as a research assistant, I basically did helpdesk support for an open access dataset, answering questions from researchers at other institutions. I'd estimate that of the questions I received in my inbox, close to 70% could have been resolved with a good implementation of faceted search. A related issue I encountered is that rather than relevant metadata existing alongside a dataset, sometimes I'd have to dive into an article's methods section to find it, often in a weird place that wasn't obvious at first glance due to the obtuse writing style that is encouraged for scientific publications.
The bigger problem, however, is that the culture of science in academia right now puts way too much emphasis on flashiness over sustainability and admittedly non-sexy tasks like properly versioning and packaging scientific software, documenting analyses, and producing well-characterized datasets.
It's not up currently, but make a note of the domain for a project I'm working on and releasing soon: a general platform for accumulating any type of geo/metadata/media about a point in space.
Strongest signal: explicitly advocates for scientific advancement and researcher intellectual property rights through open scientific knowledge sharing
FW Ratio: 57%
Observable Facts
Mission statement: 'distributed system for sharing enormous datasets - for researchers, by researchers'
Page includes 'Upload a dataset' enabling researcher contribution
Identified as 'community-maintained distributed repository for datasets and scientific knowledge'
Platform designed specifically to enable scientific advancement through globally distributed research access
Inferences
The repeated emphasis on researcher control ('by researchers') indicates respect for researcher intellectual property and scientific authorship.
The infrastructure enabling universal scientific knowledge distribution exemplifies the Article 27 vision of scientific advancement as a human right.
Researcher-controlled upload and distribution model balances IP protection with scientific advancement goals.
Core advocacy for freedom of expression and access to information; explicitly champions open access to 298TB+ of research data with no paywall mentioned
FW Ratio: 67%
Observable Facts
Page headline: 'Making over 298.05TB of research data available'
Navigation includes 'Browse' and 'Search' functions
Platform described as 'distributed repository' enabling 'blazing fast download speeds'
No paywall or access restrictions mentioned; free distribution is foundational model
Inferences
The platform's entire purpose and messaging center on removing barriers to research information access, directly exemplifying Article 19 freedoms.
The technical infrastructure choice (BitTorrent) specifically optimizes for global, uncensored information distribution.
Strongly advocates for educational access; prominently features partnerships with 12+ major research universities and frames data as supporting research education
FW Ratio: 60%
Observable Facts
Page displays logos of 12+ major universities including Stanford, Berkeley, CMU, Cornell, MIT-affiliated institutions
Platform explicitly described as supporting research education across academic institutions
Mission includes enabling educational use globally via institutional partnerships
Inferences
The prominence of educational institution partnerships demonstrates active commitment to supporting research education rights.
Free distribution to universities exemplifies infrastructure designed to advance educational access.
Infrastructure is designed entirely around free, global, universal access to research information; search and browse functions enable information discovery
Platform architecture centered on enabling researcher-controlled scientific knowledge distribution; respects researcher IP while facilitating global scientific advancement
Infrastructure directly enables educational access by providing students and researchers at partner institutions with free access to 298TB+ of research datasets