343 points by Twirrim 5 days ago | 125 comments on HN
The linked post is a detailed technical blog post analyzing Python performance optimization strategies through empirical benchmarking, code examples, and comparative analysis of different optimization tools (PyPy, Mypyc, NumPy, JAX, Numba, Cython).
CPython 3.13 went further with an experimental copy-and-patch JIT compiler -- a lightweight JIT that stitches together pre-compiled machine code templates instead of generating code from scratch. It's not a full optimizing JIT like V8's TurboFan or a tracing JIT like PyPy's;
Good news: Python 3.15 adapts PyPy's tracing approach to its JIT, and there are real performance gains now:
> The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize. ...
> 4 bytes of number, 24 bytes of machinery to support dynamism. a + b means: dereference two heap pointers, look up type slots, dispatch to int.__add__, allocate a new PyObject for the result (unless it hits the small-integer cache), update reference counts.
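The overhead the quote describes is easy to see from the interpreter itself. A hedged sketch (exact sizes vary by CPython version and platform):

```python
import sys

# A Python int is a full heap object: on a typical 64-bit CPython build,
# even a tiny integer occupies ~28 bytes (refcount + type pointer + size
# field + digit storage), versus 4-8 bytes for a machine integer.
print(sys.getsizeof(1))          # typically 28 on 64-bit CPython

# CPython pre-allocates small integers (-5..256), so "allocating" the
# result of an addition in that range just returns a cached object.
a = int("200") + int("56")       # 256: inside the small-integer cache
b = int("255") + int("1")
print(a is b)                    # True: both are the cached 256 object

c = int("200") + int("57")       # 257: outside the cache
d = int("256") + int("1")
print(c is d)                    # usually False: two fresh heap objects
```

The `int(...)` calls are there only to defeat compile-time constant folding, so the additions really run through the dispatch machinery at runtime.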
Would Python be a lot less useful without being maximally dynamic everywhere? Are there domains/frameworks/packages that benefit from this where this is a good trade-off?
I can't think of cases in strongly, statically typed languages where I've wanted something like monkey patching, and when I see monkey patching elsewhere there's often some reasonable alternative, or it only needs to be used very rarely.
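For readers who haven't hit it, a minimal illustration of the monkey-patching the thread is debating, and why it defeats static assumptions: a method lookup can resolve to different code from one call to the next.

```python
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
print(g.greet())          # "hello"

# Monkey-patch: replace the method on the class at runtime. Every
# existing instance immediately sees the new behaviour, so the
# interpreter can never assume g.greet resolves to the same code twice.
def shout(self):
    return "HELLO"

Greeter.greet = shout
print(g.greet())          # "HELLO"

# An instance can also shadow the class attribute, adding yet another
# lookup layer the runtime must check on every call.
g.greet = lambda: "hi from the instance"
print(g.greet())          # "hi from the instance"
```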
I've been in the pandas (and now polars world) for the past 15 years. Staying in the sandbox gets most folks good enough performance. (That's why Python is the language of data science and ML).
I generally teach my clients to reach for numba first. Potentially lots of bang for little buck.
One overlooked area in the article is running on GPUs. Some numpy and pandas (and polars) code can get a big speedup by using GPUs (same code with import change).
Python is perfect as a "glue" language. "Inner Loops" that have to run efficiently is not where it shines, and I would write them in C or C++ and patch them with Python for access to the huge library base.
This is the "two language problem" ( I would like to hear from people who extensively used Julia by the way, which claims to solve this problem, does it really ?)
Surprised Python is only 21x slower than C for tree traversal stuff. In my experience that's one of the most painful places to use Python. But maybe that's because I use numpy automatically when simple arrays are involved, and there's no easy path for trees.
Kudos for going through all the existing JIT approaches, instead of reaching for rewrite into X straight away.
However if Rust with PyO3 is part of the alternatives, then Boost.Python, cppyy, and pybind11 should also be accounted for, given their use in HPC and HFT integrations.
Significant AI smell in this write-up. As a result, my current reflex is to immediately stop reading. Not a judgement on the actual analysis and human effort that went in. It's just that the other context is missing.
>The usual suspects are the GIL, interpretation, and dynamic typing. All three matter, but none of them is the real story. The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize.
OK, I guess the harder question is: why isn't Python as fast as JavaScript?
> Missing @cython.cdivision(True) inserts a zero-division check before every floating-point divide in the inner loop. Millions of branches that are never taken.
I thought never taken branches were essentially free. Does this mean something in the loop is messing with the branch predictor?
In practice the ladder has two rungs for me. Write it in Python with numpy/scipy doing the heavy lifting, and if that's not enough, rewrite the hot path in C. The middle steps always felt like they added complexity without fully solving the problem.
The JIT work kenjin4096 describes is really promising though. If the tracing JIT in 3.15 actually sticks, a lot of this ladder just goes away for common workloads.
Great write up and recognisable performance. For a pipeline with many (~50) build dependencies unfortunately switching interpreter or experimenting with free threading is not an easy route as long as packages are not available (which is completely understandable).
I’m not one of these rewrite-in-Rust types, but some isolated jobs are just so well suited to full-control systems programming that delegating to Rust is worth the investment, IMO.
Another part worth investigating for IO-bound pipelines is different multiprocessing techniques. We recently got a boost from using ThreadPoolExecutor over standard multiprocessing, plus careful profiling to identify which tasks are left hanging and best allocated their own worker. The price you pay, though, is shared memory and therefore no automatic thread safety, which only works if your pipeline can be staggered.
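A minimal sketch of the thread-pool approach described above. The `fetch` task is hypothetical; in a real pipeline it would be a network call or file read, where threads help because the GIL is released while waiting on IO:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical IO-bound task: stands in for a blocking network call
# or file read.
def fetch(item):
    # ... blocking IO would happen here ...
    return item * 2

items = range(8)

# Threads share memory, so results need no pickling (unlike
# multiprocessing), but any shared mutable state must be handled
# carefully -- the "no automatic thread safety" trade-off above.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, items))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```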
Seeing Graal and Pypy beat the gcc C versions suggests to me there's something wrong with the C version. Perhaps they need a -march=native or there's something else wrong. The C version would be a different implementation in the benchmark game, but usually they are highly optimised.
Edit: looking at [1] the top C version uses x86 intrinsics, perhaps the article's writer had to find a slower implementation to have it running natively on his M4 Pro? It would be good to know which C version he used, there's a few at [1]. The N-body benchmark is one where they specify that the same algorithm must be used for all implementations.
All the approaches beyond PyPy are to either use a different lang that's superficially similar to python or to write a native extension for python in a different language, which is at odds with the stated premise.
People here on HN have in the past suggested that TypeScript is the superior-in-all-ways, just-as-easy/fun-to-code-in language and should replace Python in pretty much all use cases.
Anyone have an opinion on how TS would fare in this comparison?
It's missing the easiest of the choices: core performance-sensitive code in C, interface it to python with pybind11, build app in python. Small stack, huge gains, best of both worlds.
> I don't know JAX well enough to explain exactly why it's 3x faster than NumPy on the same matrix multiplications.
JAX is basically a frontend for the XLA compiler, as you note. The secret sauce is two insights - 1) if you have enough control, you can modify the layout of tensor computations and permute them so they don’t have to match that of the input program but have a more favorable memory access pattern; 2) most things are memory bound, so XLA creates fusion kernels that combine many computations together between memory accesses. I don’t know if the Apple BLAS library has fused kernels with GEMM + some output layer, but XLA is capable of writing GEMM fusions and might pick them if they autotune faster on given input/output shapes.
> But I haven't verified that in detail. Might be time to learn.
If you set the environment variable XLA_FLAGS=--xla_dump_to=$DIRECTORY then you’ll find out! There will be a “custom-call” op if it’s dispatching to BLAS, otherwise it will have a “dot” op in the post-optimization XLA HLO for the module. See the docs:
While this is great, I expected faster CPython to eventually culminate in what YJIT is for Ruby. I'm not sure the current approaches they are trying will get the ecosystem there.
This problem has been solved already by Lisp, Scheme, Java, .NET, Eiffel, among others, with their pick and choose mix of JIT and AOT compiler toolchains and runtimes.
There are some use cases for very dynamic code, like ORMs; with descriptors you can add attributes + behavior at runtime and it's quite useful.
Anyways, breaking metaprogramming and the more dynamic features would mean Python 4, and we know how 2 -> 3 went. I also don't think it's where the core developers are going. Also, there are other things I'd change before going after monkey patching, like some scoping rules, mutable default arguments, better async ergonomics, etc.
I've always thought the flexibility should allow python to consume things like gRPC proto files or OpenAPI docs and auto-generate the classes/methods at runtime as opposed to using codegen tools. But as far as I know, there aren't any libraries out there actually doing that.
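The runtime generation the comment imagines can be sketched with the three-argument form of `type()`. The schema dict below is hypothetical, standing in for something parsed from an OpenAPI document or a proto descriptor:

```python
# Hypothetical "schema", e.g. parsed at import time from an OpenAPI doc.
schema = {
    "User": {"fields": ["id", "name"]},
    "Order": {"fields": ["id", "total"]},
}

def make_class(name, fields):
    # Build an __init__ that accepts exactly the declared fields.
    def __init__(self, **kwargs):
        for f in fields:
            setattr(self, f, kwargs.get(f))
    # type(name, bases, namespace) creates a class at runtime,
    # no codegen step required.
    return type(name, (object,), {"__init__": __init__, "fields": fields})

classes = {name: make_class(name, spec["fields"])
           for name, spec in schema.items()}

u = classes["User"](id=1, name="ada")
print(u.name)   # "ada"
```

This is the same dynamism the performance discussion above treats as a liability, used as the feature it was designed to be.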
Be careful with that, numpy arrays can be slower than Python tuples for some operations. The creation is always slower and the overhead has to be worth it.
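The warning above can be demonstrated directly. A hedged sketch (assumes numpy is installed; exact timings vary by machine, so only the direction of the comparison matters):

```python
import timeit

import numpy as np

# For tiny fixed-size data, constructing a numpy array is far more
# expensive than a tuple: the array carries dtype, shape, and buffer
# machinery that only pays off at larger sizes.
t_tuple = timeit.timeit("(1.0, 2.0)", number=100_000)
t_array = timeit.timeit("np.array((1.0, 2.0))",
                        globals={"np": np}, number=100_000)
print(t_array > t_tuple)   # typically True, by a wide margin

# The crossover: elementwise math over a million elements is where
# numpy's per-array overhead is amortised and it wins decisively.
big = np.arange(1_000_000, dtype=np.float64)
print(float(big.sum()))    # 499999500000.0
```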
I didn't notice any signs of AI writing until seeing this comment and re-reading (though I did notice it on the second pass).
That said, I think this article demonstrates that focusing on whether or not an article used AI might be focusing on the wrong “problem.” I appreciate being sensitive to the "smell" (the number of low-effort, AI posts flying around these days has made me sensitive too), but personally, I found this article both (1) easy to read and (2) insightful. I think the number of AI-written content lacking (2) is the problem.
You can turn trees into numpy-style matrix operations because graphs and matrices are two sides of the same coin. I don't see the code for the binary-tree benchmark in the repo to see how it's written, but there are libraries like graphblas that use the equivalence for optimization.
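The graph/matrix equivalence mentioned above fits in a few lines. A pure-Python sketch, not how GraphBLAS actually stores things (it uses sparse matrices and real kernels), just the idea that one adjacency-matrix product is one traversal step:

```python
# Binary tree 0 -> (1, 2), 1 -> (3, 4) as an adjacency matrix.
n = 5
adj = [[0] * n for _ in range(n)]
for parent, child in [(0, 1), (0, 2), (1, 3), (1, 4)]:
    adj[parent][child] = 1

def step(frontier):
    # One matrix-vector product over the Boolean semiring:
    # next[j] = OR over i of (frontier[i] AND adj[i][j]).
    return [int(any(frontier[i] and adj[i][j] for i in range(n)))
            for j in range(n)]

frontier = [1, 0, 0, 0, 0]        # start at the root
frontier = step(frontier)
print(frontier)                    # [0, 1, 1, 0, 0]: children of node 0
frontier = step(frontier)
print(frontier)                    # [0, 0, 0, 1, 1]: grandchildren
```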
I also seem to be developing an immune response to several slopisms. But the actual content is useful for outlining tradeoffs if you’re needing to make your Python code go faster.
One thing with python is that usually I will use one of the many c based libraries to get reasonable speed and well thought out abstractions from the start. I architect around numpy, scipy, shapely, pandas/polars or whatever. So my code runs at reasonable speed from the start. But transpiling to rust then effectively means a complete redesign of the code, data structures, algorithms etc. And I have seen the AI tools really struggle to get it right, as my intent gets lost somewhere.
So what I do now (since Claude Code) is write a really bare-bones (and slow) pure-Python implementation (like I used to do for numba-, pypy- or cython-ready code), with minimal dependencies. Then I use the REPL, notebooks and nice plotting tools to get a real understanding of the problem space and the intricacies of my algorithm/problem at hand. When done, I let Claude add tests and I ask it to transpile to equivalent Rust and boom! A flawless 1000x speed upgrade in minutes.
The great thing is I don't need to do the mental gymnastics to vectorize code in a write only mode like I've had to do since my Matlab days. Instead I can write simple to read for loops that follow my intent much better, and result in much more legible code. So refreshing!
And with pyO3 i can still expose the Rust lib to python, and continue to use Python for glue and plotting
They're cheap but not free, especially at the front end of the CPU where it's just a lot more instructions to churn through. What the branch predictor gets you is it turns branches, which would normally cause a pipeline bubble, to be executed like straightline code if they're predicted right. It's a bit like a tracing jit. But you will still have a bunch of extra instructions to, like, compute the branch predicate.
> ok I guess the harder question is. Why isn't python as fast as javascript?
Actually there is a pretty easy answer: worldwide, the amount of javascript being evaluated every day is many orders of magnitude higher than the amount of python. The amount of money available for optimizing it has thus been many orders of magnitude higher as well.
That has been a thing forever: many "Python" libraries are actually bindings to C, C++ and Fortran.
The culture of calling them "Python" is one reason why JITs have such a hard time gaining adoption in Python. The problem isn't the dynamism (see Smalltalk, SELF, Ruby, ...), rather the culture of rewriting code in C, C++ and Fortran and still calling it Python.
I have used Julia for my main language for years. Yes, it really does solve the two language problem. It really is as fast as C and as expressive as Python.
It then gives you a bunch of new problems. First and foremost that you now work in a niche language with fewer packages and fewer people who can maintain the code. Then you get the huge JIT latency. And deployment issues. And lack of static tooling which Rust and Python have.
For me, as a research software engineer writing performance sensitive code, those tradeoffs are worth it. For most people, it probably isn’t. But if you’re the kind of person who cares about the Python optimization ladder, you should look into Julia. It’s how I got hooked.
The dynamism exists to support the object model. That's the actual dependency. Monkey-patching, runtime class mutation, vtable dispatch. These aren't language features people asked for. They're consequences of building everything on mutable objects with identity.
Strip the object model. Keep Python.
You get most of the speed back without touching a compiler, and your code gets easier to read as a side effect.
I built a demo: Dishonest code mutates state behind your back; Honest code takes data in and returns data out. Classes vs pure functions in 11 languages, same calculation. Honest Python beats compiled C++ and Swift on the same problem. Not because Python is fast, but because the object model's pointer-chasing costs more than the Python VM overhead.
Don't take my word for it. It's dockerized and on GitHub. Run it yourself: honestcode.software, hit the Surprise! button.
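Not the commenter's actual benchmark (that's in their repo), but a minimal sketch of the two styles being contrasted, producing the same result:

```python
# "Dishonest" style per the comment above: an object mutates its own
# state behind a method call.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x   # state changes as a side effect

# "Honest" style: data in, data out. No hidden state, trivially
# testable, and no per-call attribute mutation for the runtime to chase.
def accumulate(values):
    total = 0
    for x in values:
        total += x
    return total

values = list(range(100))

acc = Accumulator()
for x in values:
    acc.add(x)

assert acc.total == accumulate(values) == 4950
print(accumulate(values))  # 4950
```

Whether the honest version actually beats compiled C++ on a given problem, as the comment claims, is something to verify against their repo rather than take from this sketch.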
If you're patching hot paths with C and praying the interface layer doesn't explode, you can spend almost as much time chasing ABI boundary bugs as you save on perf. Type hints in Python are still docs for humans and maybe your LSP. Julia does address the two-language problem in theory, but getting your whole stack and your deps to exist there is its own weird pain, and people underplay how much library inertia matters once you leave numerics.
I have the same issue now. It's especially annoying when it happens while reading a "serious" publication like a newspaper or long-form magazine. Whether it's because an AI wrote it, or because "real" writers have spent so much time reading AI slop they've picked up the same style, is kinda by the by. It all reads to me like SEO, which was the slop template that LLMs took their inspiration from, apparently. It just flattens language into the most exhausting version of it, where you need to subconsciously blank out all the unnecessary flourishes and weird hype phrases to figure out what is actually trying to be said. I guess humans who learn to ignore it might do better in this brave new world, but it's definitely annoying that humans are being forced to adapt to machines instead of the other way around.
As a sibling comment mentions, yes it does. Just don’t expect code that runs as fast as C without some effort put into it. You still need to write your program in a static enough way to obtain that speed. It’s not the easiest thing in the world, since the tooling is, yes, improving, but still not there yet.
If you then want to access fully trimmed small executables then you have to start writing Julia similarly to how you write rust.
To me the fact that this is even possible blows my mind and I have tons of fun coding in it. Except when precompiling things. That is something that really needs to be addressed.
From when I was working on optimizing one or two things with Cython years ago, it wasn’t per se the branch cost that hurt: it was that the check impeded the compiler's loop optimisations, potentially blocking auto-vectorisation entirely.
Beyond the economic arguments, there’s a lot in JS that actually makes it a lot easier: almost all of the operators can only return a subset of the types and cannot be overridden (e.g., the binary + operator in JS can only return a string or a number primitive), the existence of like string and number primitives dramatically reduce the amount of dynamic behaviour they can have, only proxy objects can exhibit the same amount of dynamism as arbitrary Python ones (and thus only they pay the performance cost)…
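The contrast drawn above is visible from the Python side: `+` dispatches through the type's `__add__` slot, which can be redefined at any time, even after instances exist, and can return any type at all. A small sketch with a hypothetical `Meters` class:

```python
class Meters:
    def __init__(self, value):
        self.value = value

m = Meters(3)

# In JS, `+` on primitives can only yield a number or string primitive.
# In Python, `+` dispatches through __add__, installed here at runtime.
def add(self, other):
    return Meters(self.value + other.value)

Meters.__add__ = add
total = m + Meters(4)
print(total.value)    # 7

# And it can change again, now returning an entirely different type:
Meters.__add__ = lambda self, other: f"{self.value}+{other.value}"
print(m + Meters(4))  # "3+4"
```

A JIT for Python therefore has to guard every `+` against this kind of redefinition, where a JS engine only needs guards on a much narrower set of cases.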
Jax seems quite interesting even from this point of view… numpy has the same problem as blas basically, right? The limited interface. Eventually this leads to heresies like daxpby, and where does the madness stop once you’ve allowed that sort of thing? Better to create some sort of array language.