It doesn't mention the downsides of using assembly. The biggest of which is that...

PaulDavisThe1st · 2025-02-22T19:14:51 1740251691

I can't speak for ffmpeg, but I can report on why we use non-portable assembler inside Ardour (a cross-platform digital audio workstation).

Ardour's own code doesn't do very much DSP (it's a policy choice), but one thing that our own code does do is metering: comparing a current sample value to every previous sample value in a given audio data stream within a given time window to decide if it is higher (or lower) than the previous max (or min).

When someone stepped forward (hi Sampo!) to code this in hand-written SIMD assembler, we got a 30% reduction in CPU usage when using mid-sized buffers on moderate size sessions (say, 24 tracks or so).

That's a worthy tradeoff, even though it means that we now have 5 different asm versions of about half-a-dozen functions. The good news is that they don't really need to be maintained. New SIMD architectures mean new implementations, not hacks to existing code.

However, I should note that it is always very important to compare what compilers are capable of, and to keep comparing that. In the decade or more after our asm metering code was first written, gcc improved to the point where simply using C(++) and some compiler flags produced code that was within an instruction or two of our hand-crafted version (and may be more correct in the face of all possible conditions).

So ... you can get dramatic performance benefits that are worth the effort, and the maintenance costs are low, but you should keep checking how your code compares with today's compilers' best optimization efforts.
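A minimal sketch of the kind of metering loop described above, in plain C. This is not Ardour's code; `peak_t` and `find_peaks` are hypothetical names used only for illustration. Whether a compiler vectorizes it depends on the compiler version and flags (for gcc, something like `-O3`, possibly with `-ffast-math` for the floating-point reduction, is an assumption worth verifying), so inspect the generated assembly rather than taking it on faith.

```c
#include <stddef.h>

/* Hypothetical type, not from Ardour: the running min and max of a stream. */
typedef struct {
    float min;
    float max;
} peak_t;

/* Scan n samples and update the running min/max passed in as `current`. */
peak_t find_peaks(const float *buf, size_t n, peak_t current)
{
    for (size_t i = 0; i < n; i++) {
        float s = buf[i];
        /* Plain comparisons keep the loop simple enough for the
         * compiler to consider turning it into SIMD min/max. */
        current.min = s < current.min ? s : current.min;
        current.max = s > current.max ? s : current.max;
    }
    return current;
}
```

Calling it once per buffer, and carrying the returned `peak_t` into the next call, gives the running peak over a time window.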

thayne · 2025-02-23T05:06:59 1740287219

I'm not at all saying that it isn't worth it for ffmpeg to use assembly, but there is a tradeoff there. Ffmpeg needs to do one of three things: support only a limited number of architectures and duplicate code for all of them; have asm implementations for the most popular architectures (probably x86(_64) and arm) plus a slower, arch-independent fallback implementation in C for the rest; or have asm implementations for a large number of ISAs. I'm guessing ffmpeg does the middle option, especially since this guide focuses on x86 assembly, but ffmpeg supports many other architectures.

The performance wins may very well be worth it, but it is still good to be aware of the tradeoff involved.
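The "middle option" can be sketched as a function pointer chosen at runtime. This is an illustration under stated assumptions, not how ffmpeg actually does it: `sum_fn`, `sum_c`, `sum_avx2` and `select_sum` are hypothetical names, the AVX2 routine is a stand-in for hand-written assembly (it is the same C body compiled with AVX2 enabled), and `__attribute__((target("avx2")))` and `__builtin_cpu_supports` are GCC/Clang extensions for x86.

```c
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Portable fallback: plain C, builds on every architecture. */
static float sum_c(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_X86_DISPATCH 1
/* Stand-in for a hand-written AVX2 routine: the compiler is allowed
 * to use AVX2 instructions inside this one function only. */
__attribute__((target("avx2")))
static float sum_avx2(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}
#endif

/* Pick the best implementation the running CPU reports support for. */
sum_fn select_sum(void)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_c;
}
```

Callers run `select_sum()` once at startup and keep the pointer, so the per-call cost is one indirect call. Every other architecture silently gets `sum_c`, which is exactly the tradeoff being described.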

saagarjha · 2025-02-23T09:02:30 1740301350

ffmpeg has multiple implementations for each architecture to take advantage of microarchitectural wins.

sweeter · 2025-02-22T22:54:05 1740264845

Ardour is a great piece of software! Thanks for that. I love hearing experiences like these.

arkj · 2025-02-22T18:53:35 1740250415

If you look at this from a top-down perspective, you’ll see downsides, but from a bottom-up view, those same differences can be an advantage. Different architectures have different capabilities, and writing assembly means you’re optimizing for performance rather than prioritizing code portability or maintenance.

adgjlsfhk1 · 2025-02-22T20:27:25 1740256045

The counterpoint to this is that if you can write AVX2 assembly, that will be supported on ~99% of x86 CPUs around today (Haswell was 2013), so just that one branch covers ~80% of the desktop/laptop market.

sorenjan · 2025-02-22T21:12:36 1740258756

94.67% according to Steam hardware survey, which is probably close enough.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

Someone · 2025-02-22T21:32:01 1740259921

There’s no guarantee that the fastest AVX2 assembly is the same on all CPUs, and reading https://stackoverflow.com/a/64782733, there are differences between CPUs.

So, chances are you’ll need to have more than one AVX2 assembly version of your code if you want to have the fastest code.

anonymoushn · 2025-02-23T18:00:22 1740333622

I suspect that it is not worth using AVX2 vector gathers on any CPU. But certainly you could end up with the best implementation varying between microarchitectures for other reasons.

renhanxue · 2025-02-23T06:54:47 1740293687

If you really care about performance though you'd want to be a lot more specific than this. I've seen image processing code that not only avoids specific instructions on some CPU families (for example, it avoids the vpermd instruction on Zen1/2/3 CPUs because of excessive latency), but also queries the CPU cache topology at runtime and uses buffer allocation strategies that ensure that it can work in data batches that fit in cache.
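A rough sketch of the cache-aware batching idea, assuming a POSIX system. It only reads the L1 data cache size rather than the full topology, and it is not taken from any particular image library: `l1d_cache_bytes` and `batch_elems` are hypothetical helpers, `_SC_LEVEL1_DCACHE_SIZE` is a glibc extension to `sysconf` that may be missing or report 0 elsewhere, and both the 32 KiB fallback and the "use half the cache" rule are arbitrary assumptions.

```c
#include <stddef.h>
#include <unistd.h>

/* L1 data cache size in bytes, or an assumed 32 KiB if unknown. */
size_t l1d_cache_bytes(void)
{
#ifdef _SC_LEVEL1_DCACHE_SIZE
    long sz = sysconf(_SC_LEVEL1_DCACHE_SIZE); /* glibc extension */
    if (sz > 0)
        return (size_t)sz;
#endif
    return (size_t)32 * 1024; /* assumption, not a measured value */
}

/* How many elements of each buffer to process per batch so that
 * n_buffers working buffers together fit in half the cache. */
size_t batch_elems(size_t cache_bytes, size_t elem_size, size_t n_buffers)
{
    if (elem_size == 0 || n_buffers == 0)
        return 1;
    size_t per_buffer = cache_bytes / 2 / n_buffers;
    size_t elems = per_buffer / elem_size;
    return elems > 0 ? elems : 1;
}
```

For instance, `batch_elems(l1d_cache_bytes(), sizeof(float), 2)` sizes a batch for one input and one output buffer of floats.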

withinboredom · 2025-02-22T21:05:51 1740258351

hmmm... that's not exactly true. Hosts may not expose all instructions to VMs, especially certain hosts. So, yeah, I agree with you on the desktop/laptop market, but be wary if your target is servers.

wffurr · 2025-02-22T19:13:01 1740251581

What about Highway? https://github.com/google/highway I suppose that's C++ not C though.

kccqzy · 2025-02-22T21:16:30 1740258990

I've enjoyed using Highway, but it does in fact use plenty of C++ features that would make it unappealing to C projects. And if you make even just one mistake, it's easy to get several screenfuls of error messages; I accept that as a C++ developer but C developers would hate it.

femto · 2025-02-22T21:23:09 1740259389

In a similar vein (C++) there is also Eigen: https://eigen.tuxfamily.org

astrange · 2025-02-22T23:52:24 1740268344

The code would be architecture specific anyway. ffmpeg is meant to be fast, so it's split into architecture independent and dependent (DSP) parts. The first relies on compiler optimizations; the second is what uses SIMD, asm, etc.

There is no such thing as a generic "SIMD API" it could use because it uses all specific hardware tools it can to be performant. Anyone who thinks this is possible is simply mistaken. You can tell because none of them have written ffmpeg.

(There are some things called "array languages" or "stream processing" or "autoscalarization" that work better than SIMD - an example is ispc. But they're not a great fit here, because ffmpeg isn't massively parallel. It's just parallel enough to work.)

hereonout2 · 2025-02-22T19:04:56 1740251096

Possibly they could have added that warning, but at the same time this is a guide from the ffmpeg project, presumably for ffmpeg developers.

They lay it out quite clearly I think, but things like libavcodec are probably one of the few types of project where the benefits of assembly outweigh the lack of portability.

I'm not sure Rust or Zig's support for SIMD would be the project's first complaint either. Likely more concerned with porting a 25-year-old codebase to a new language first.

aidenn0 · 2025-02-22T22:24:06 1740263046

I don't know what the state of the art is today, but historically compilers have been terribly inefficient with inline assembly because they inhibit optimizations around it, so inline asm is often slower than intrinsics. For DSP code, your performance critical code is often a large number of iterations through a hot loop, so the function-call overhead incurred by calling your assembly function is negligible.
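To make the intrinsics alternative concrete, here is a small sketch of a hot loop written with SSE intrinsics, with a scalar path for builds where `__SSE__` is not defined. `buffer_max` is a hypothetical name; `_mm_set1_ps`, `_mm_loadu_ps`, `_mm_max_ps` and `_mm_storeu_ps` are standard SSE intrinsics from `<xmmintrin.h>`. The function is called once per buffer, so the call overhead is spread over every sample in the loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Return the larger of `start` and the maximum sample in buf[0..n). */
float buffer_max(const float *buf, size_t n, float start)
{
    float best = start;
    size_t i = 0;
#if defined(__SSE__)
    /* Hot loop: four samples per iteration, unaligned loads. */
    __m128 vmax = _mm_set1_ps(start);
    for (; i + 4 <= n; i += 4)
        vmax = _mm_max_ps(vmax, _mm_loadu_ps(buf + i));
    /* Reduce the four lanes to a single scalar. */
    float lanes[4];
    _mm_storeu_ps(lanes, vmax);
    for (int k = 0; k < 4; k++)
        best = lanes[k] > best ? lanes[k] : best;
#endif
    /* Tail (or the whole buffer when SSE is unavailable). */
    for (; i < n; i++)
        best = buf[i] > best ? buf[i] : best;
    return best;
}
```

Because the compiler understands intrinsics, it can still schedule, inline and allocate registers around them, which is the optimization that an opaque inline asm block tends to block.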

jsheard · 2025-02-22T23:38:12 1740267492

MSVC doesn't even support inline assembly anymore, so to be portable across the big three compilers you have to use either intrinsics or standalone assembly.

brigade · 2025-02-22T21:30:27 1740259827

Asm is only good on one architecture; inline asm further restricts that to at most two compilers. Plus most of the "documentation" for inline asm constraints is scattered across various comments in the source code of those compilers, and you generally can't safely use gas macros or directives.

wyldfire · 2025-02-22T21:41:19 1740260479

> at most two compilers.

As far as C and C++ go, that's two out of three. So it's not as bad as it sounds to be "at most two".

anonymoushn · 2025-02-23T18:04:22 1740333862

You can use https://github.com/simd-everywhere/simde if you like. In general portable SIMD libraries are of limited utility because having different primitives available on different architectures often means that you should approach problems differently. That is to say, in many cases using any portable SIMD API to solve your problem means leaving 200% speedups on the table on at least one of your top 3 targets.

The thing that is present in Zig and not yet stable in Rust does not include any dynamic shuffles, so these end up requiring intrinsics or asm for all sorts of things. It's a significant weakness compared to e.g. highway, eve, or simde.