C with intrinsics can get very close to straight assembly performance. The FFmpeg devs are somewhat infamously against intrinsics (IIRC they don't allow them in their codebase even if the performance is as good as equivalent assembly), but even by TFA's own estimates the difference between intrinsics and assembly is on the order of 10-15%.
You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that, which is often, because auto-vectorization still mostly sucks beyond trivial cases. It's not really a surprise that expert code runs circles around naive code though.
"You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that"
I can get far more than 10x over naive C just by reordering memory accesses. With SIMD it can be 7x more, but that can be done with ISPC; it doesn't need to be done with asm.
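To make the memory-access point concrete, here is a minimal sketch in plain C, not taken from any particular codebase. The function names `sum_strided` and `sum_sequential` are made up for illustration. Both walk the same row-major matrix and return the same sum; only the loop nesting differs, so one jumps `cols` elements per step while the other touches memory in order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Column-first walk over a row-major matrix: consecutive iterations
 * are `cols` elements apart, so large matrices touch a new cache line
 * on almost every access. */
uint64_t sum_strided(const uint32_t *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    uint64_t total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}

/* Row-first walk: consecutive iterations touch adjacent elements,
 * which is friendly to caches, prefetchers and auto-vectorizers. */
uint64_t sum_sequential(const uint32_t *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    uint64_t total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}
```

How big the gap is depends on the matrix size relative to the cache hierarchy, so measure on your own data rather than trusting any fixed multiplier.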
However, you can write better-than-naive C by compiling and inspecting the compiler output.
I stopped writing assembly back around Y2K, as I was fairly consistently getting beaten by the compiler when I wrote compiler-friendly high-level code. Memory organization is also something you can control fairly well from the high-level side.
Sure, some niches remained, but for my projects the gains were very modest compared to the time invested.
"The FFmpeg devs are somewhat infamously against intrinsics (they don't allow them in their codebase even if the performance is as good as equivalent assembly)"
You're not wrong, but that's more of an issue with C than an issue with intrinsics. In higher-level languages like C++ or Rust you have the option to wrap intrinsics in types, which are much nicer to work with.
Not just an eyesore: they are also typed, so any widening, narrowing, or use of only part of a vector register ends up needing casts, and things can get extremely confusing and cluttered when doing anything beyond basic algebra. With asm it's a much shorter, more elegant and visually aligned waterfall of code.
TL;DR They want to squeeze every drop of performance out of the CPU when processing media, and maintaining a mixture of intrinsics code and assembly is not worth the trade-off when doing 100% assembly offers better performance guarantees, readability, and ease of maintenance / onboarding of developers.
Intrinsics have the disadvantages of asm (non-portable) but don't reliably have its advantages (compilers are pretty unpredictable about optimizing with them), and they're ugly (especially x86 with its weird Hungarian stuff).
There is just a little bit of intrinsics code in ffmpeg, which I wrote, that does memory copies.
It's like this because we didn't want to hide the memory accesses from the compiler, because that hurts optimization, as well as memory tools like ASan.
Intrinsics have the huge advantage of enabling wrapper functions, which remove the ugly names and allow you to write user code only once, such that it is even portable (or at least multiplatform-dependent).
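As a rough sketch of what such wrappers can look like in C (this is not FFmpeg's code, and the names `vec_u8x16`, `vec_load`, `vec_store`, `vec_adds` and `add_saturate_u8` are invented for this example): the per-platform intrinsics sit behind three tiny inline functions, and the user code at the bottom is written once. The SSE2 and NEON intrinsics used are real; the scalar branch is a fallback I added so the sketch builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i vec_u8x16;
static inline vec_u8x16 vec_load(const uint8_t *p) { return _mm_loadu_si128((const __m128i *)p); }
static inline void vec_store(uint8_t *p, vec_u8x16 v) { _mm_storeu_si128((__m128i *)p, v); }
static inline vec_u8x16 vec_adds(vec_u8x16 a, vec_u8x16 b) { return _mm_adds_epu8(a, b); }
#elif defined(__ARM_NEON)
#include <arm_neon.h>
typedef uint8x16_t vec_u8x16;
static inline vec_u8x16 vec_load(const uint8_t *p) { return vld1q_u8(p); }
static inline void vec_store(uint8_t *p, vec_u8x16 v) { vst1q_u8(p, v); }
static inline vec_u8x16 vec_adds(vec_u8x16 a, vec_u8x16 b) { return vqaddq_u8(a, b); }
#else
/* Scalar fallback (an assumption of this sketch) with the same interface. */
typedef struct { uint8_t lane[16]; } vec_u8x16;
static inline vec_u8x16 vec_load(const uint8_t *p) { vec_u8x16 v; memcpy(v.lane, p, 16); return v; }
static inline void vec_store(uint8_t *p, vec_u8x16 v) { memcpy(p, v.lane, 16); }
static inline vec_u8x16 vec_adds(vec_u8x16 a, vec_u8x16 b)
{
    vec_u8x16 r;
    for (int i = 0; i < 16; i++) {
        unsigned s = (unsigned)a.lane[i] + b.lane[i];
        r.lane[i] = s > 255 ? 255 : (uint8_t)s;
    }
    return r;
}
#endif

/* User code, written once: saturating add of two byte arrays. */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)              /* 16 bytes per iteration */
        vec_store(dst + i, vec_adds(vec_load(a + i), vec_load(b + i)));
    for (; i < n; i++) {                      /* scalar tail */
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}
```

Because the loads and stores go through ordinary pointers, the compiler and tools like ASan can still see the memory accesses, which ties in with the point made above about not hiding them.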
Good point about ASan and other instrumentation :) Hm, I'd think that is very important for codecs in particular?
Well, that was more true when you had to care about the 8 registers of x86, CPUs were only like 2-4 wide, and codecs preferred to operate on 8x8 blocks and one bitdepth.
Nowadays the impact of compilers' suboptimal register allocation and addressing calculations is almost unmeasurable, given that you have 16/32 registers available and CPUs that are 8-10 wide in the frontend but have only 3-4 vector units in the backend. But the added complexity of newer codecs has strained their use of the nasm/gas macro systems to the point of being far less readable or maintainable than intrinsics. Like, think of how unmaintainable complex C macros are and double that.
And it's not uncommon to find asm in ffmpeg or related projects written suboptimally in a way a compiler wouldn't, either because the author didn't fully read/understand CPU performance manuals or because rewriting/twisting the existing macros to fix a small suboptimality is more work than it's worth.
(yes, I have written some asm for ffmpeg in the past)