C *with intrinsics* can get very close to straight assembly performance. The FFm...

CyberDildonics · 2025-02-22T20:29:51 1740256191

> You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that,

I can get far more than 10x over naive C just by reordering memory accesses. With SIMD it can be 7x more, but that can be done with ISPC, it doesn't need to be done with asm.
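For anyone unsure what "reordering memory accesses" looks like, here is a minimal sketch. The matrix size and function names are made up for illustration, and how large the gap is depends entirely on the hardware and the data size, so treat it as a demonstration of the idea rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dimensions, chosen only for illustration. */
enum { ROWS = 512, COLS = 512 };

/* Column-by-column walk over a row-major matrix: consecutive reads are
 * COLS elements apart, so most of them land on a different cache line. */
long sum_column_order(const int *m)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += m[r * COLS + c];
    return total;
}

/* Same arithmetic with the loops swapped: reads are now sequential,
 * which is friendlier to caches, prefetchers and auto-vectorizers. */
long sum_row_order(const int *m)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += m[r * COLS + c];
    return total;
}
```

Both functions return the same value; only the order in which memory is touched differs.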

magicalhippo · 2025-02-23T04:32:04 1740285124

> I can get far more than 10x over naive C

However, you can write better than naive C by compiling and watching the compiler output.

I stopped writing assembly back around y2k, as I was fairly consistently getting beaten by the compiler when I wrote compiler-friendly high-level code. Memory organization is also something you can control fairly well from the high-level side.
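One small example of what "compiler-friendly" can mean, as a sketch only: the function name is hypothetical, and whether a given compiler actually vectorizes it is something to confirm by reading its output. Declaring the pointers `restrict` promises that the arrays do not overlap, which removes one common reason for a compiler to refuse to vectorize a loop.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] += k * src[i] for n elements.
 * `restrict` (C99) tells the compiler dst and src never alias, so it may
 * load and store several elements at once without an overlap check. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Compiling with optimizations enabled and inspecting the generated assembly is the way to check what the compiler did with it.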

Sure some niches remained, but for my projects the gains were very modest compared to invested time.

UltraSane · 2025-02-22T18:42:17 1740249737

"The FFmpeg devs are somewhat infamously against intrinsics (they don't allow them in their codebase even if the performance is as good as equivalent assembly)"

Why?

Narishma · 2025-02-22T19:00:05 1740250805

I don't know if it's their reason but I myself avoid them because I find them harder to read than assembly language.

oguz-ismail · 2025-02-22T18:46:09 1740249969

Have you seen C code with SIMD intrinsics? They are an eyesore

jsheard · 2025-02-22T18:48:01 1740250081

You're not wrong, but that's more of an issue with C than with intrinsics. In higher-level languages like C++ or Rust you have the option to wrap intrinsics in types, which are much nicer to work with.

oguz-ismail · 2025-02-22T19:03:54 1740251034

>C++ or Rust

Nah. I find well-commented three-column AT&T assembly with light use of C preprocessor macros easier and more enjoyable to read.

Inityx · 2025-02-22T20:21:28 1740255688

Now that's what I call an unpopular opinion.

saagarjha · 2025-02-23T08:57:42 1740301062

Among people who write assembly regularly it's not that unpopular

t-3 · 2025-02-22T22:26:47 1740263207

Not just an eyesore: they are also typed, so any widening, narrowing, or use of only part of a vector register ends up needing casts, and things get confusing and cluttered as soon as you do anything beyond basic algebra. With asm it's a much shorter, more elegant and visually aligned waterfall of code.
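A small sketch of the clutter in question, assuming x86 with SSE2 (the function name is hypothetical, and there is a scalar fallback so it builds elsewhere). Zero-extending eight bytes to eight 16-bit values is three instructions in asm, roughly movq / punpcklbw / movdqu, but the intrinsics version needs a pointer cast on both the load and the store:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Zero-extends 8 unsigned bytes to 8 unsigned 16-bit values. */
void widen_u8x8_to_u16x8(uint16_t out[8], const uint8_t in[8])
{
    assert(out != NULL && in != NULL);
#if defined(__SSE2__)
    /* The 64-bit load is declared as taking a const __m128i *, hence the cast. */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)in);
    /* Interleaving with zeros turns each byte into a 16-bit lane. */
    __m128i words = _mm_unpacklo_epi8(bytes, _mm_setzero_si128());
    /* Another cast to store the result as plain uint16_t. */
    _mm_storeu_si128((__m128i *)out, words);
#else
    /* Scalar fallback for targets without SSE2. */
    for (size_t i = 0; i < 8; i++)
        out[i] = in[i];
#endif
}
```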

xgkickt · 2025-02-23T00:21:12 1740270072

Only if using x86-64 IME. Other architectures that don’t require as much shuffling of data are far more legible.

schainks · 2025-02-22T18:47:44 1740250064

Did you read lesson one?

TL;DR: They want to squeeze every drop of performance out of the CPU when processing media, and maintaining a mixture of intrinsics code and assembly is not worth the trade-off when doing 100% assembly offers better performance guarantees, readability, and ease of maintenance / onboarding of developers.

astrange · 2025-02-22T23:57:49 1740268669

Intrinsics have the disadvantages of asm (non-portable) but don't reliably have its advantages (compilers are pretty unpredictable about optimizing with them), and they're ugly (especially x86 with its weird Hungarian stuff).

There is just a little bit of intrinsics code in ffmpeg, which I wrote, that does memory copies.

https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/i...

It's like this because we didn't want to hide the memory accesses from the compiler, because that hurts optimization, as well as memory tools like ASan.

janwas · 2025-02-23T16:02:42 1740326562

Intrinsics have the huge advantage of enabling wrapper functions, which remove the ugly names and allow you to write user code only once, such that it is even portable (or at least multiplatform-dependent).
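A minimal sketch of what I mean, in plain C. The `vec4f` type and the `vec4f_*` names are hypothetical, not taken from any real library; only the `_mm_*` calls are actual SSE intrinsics. The user-level function `add_arrays` is written once and never mentions an intrinsic; another target (NEON, say) would be one more `#elif` branch in the wrapper layer.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* x86 backend: thin wrappers around SSE intrinsics. */
typedef __m128 vec4f;
static inline vec4f vec4f_load(const float *p) { return _mm_loadu_ps(p); }
static inline void vec4f_store(float *p, vec4f v) { _mm_storeu_ps(p, v); }
static inline vec4f vec4f_add(vec4f a, vec4f b) { return _mm_add_ps(a, b); }
#else
/* Portable fallback with the same interface. */
typedef struct { float lane[4]; } vec4f;
static inline vec4f vec4f_load(const float *p)
{
    vec4f v;
    for (int i = 0; i < 4; i++) v.lane[i] = p[i];
    return v;
}
static inline void vec4f_store(float *p, vec4f v)
{
    for (int i = 0; i < 4; i++) p[i] = v.lane[i];
}
static inline vec4f vec4f_add(vec4f a, vec4f b)
{
    for (int i = 0; i < 4; i++) a.lane[i] += b.lane[i];
    return a;
}
#endif

/* User code: written once against the wrapper names only. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)          /* four lanes at a time */
        vec4f_store(dst + i, vec4f_add(vec4f_load(a + i), vec4f_load(b + i)));
    for (; i < n; i++)                  /* scalar tail */
        dst[i] = a[i] + b[i];
}
```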

Good point about asan and other instrumentation :) hm, I'd think that is very important for codecs in particular?

brigade · 2025-02-22T20:19:49 1740255589

Well that was more true when you had to care about the 8 registers of x86, CPUs were only like 2-4 wide, and codecs preferred to operate on 8x8 blocks and one bitdepth.

Nowadays the impact of compilers' suboptimal register allocation and address calculations is almost unmeasurable, given 16/32 available registers and CPUs that are 8-10 wide in the frontend but have only 3-4 vector units in the backend. But the added complexity of newer codecs has strained their use of the nasm/gas macro systems to the point of being far less readable or maintainable than intrinsics. Like, think of how unmaintainable complex C macros are and double that.

And it's not uncommon to find asm in ffmpeg or related projects written suboptimally in a way a compiler wouldn't, either because the author didn't fully read/understand CPU performance manuals or because rewriting/twisting the existing macros to fix a small suboptimality is more work than it's worth.

(yes, I have written some asm for ffmpeg in the past)