Vector ALU Patterns (moonbaseotago.github.io)
53 points by hasheddan on June 26, 2023 | 8 comments


Author here - this is very much a work in progress, not that interesting (yet)

For those who are interested: RISC-V has a cool vector unit - you can write portable code that runs on a whole range of vector units of different sizes. Unlike, say, AVX-512, the same code can run on 64/128/256/512/1024/... bit units and is simply faster on the wider ones. It's different from most mainstream designs - worth a look-see if you're a vector fan.
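A minimal sketch of that portability point, modeled in plain Python rather than real RVV (the `vsetvli` here just mimics the instruction's role): a strip-mined loop asks the hardware each iteration how many elements it may process, so the same code is correct at any vector length and simply takes fewer strips on wider units.

```python
def vsetvli(avl, vlmax):
    # models RVV's vsetvli: grant up to the hardware maximum (VLMAX)
    # of the remaining application vector length (avl)
    return min(avl, vlmax)

def vadd_loop(a, b, vlmax):
    # strip-mined elementwise add; vlmax stands in for the hardware's
    # vector register width, which the code never hard-codes
    out = []
    i, n = 0, len(a)
    while i < n:
        vl = vsetvli(n - i, vlmax)
        out += [x + y for x, y in zip(a[i:i+vl], b[i:i+vl])]
        i += vl
    return out
```

A "VLEN=512" machine (vlmax=16 for 32-bit elements) and a "VLEN=128" one (vlmax=4) run the identical loop; only the trip count differs.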


May I suggest adding a faster way to navigate the blog?

Since this is a series of posts, the current layout (full posts, starting from the most recent) makes it hard for a new reader to start reading the whole thing.

Either display short (cut-off) versions when showing multiple posts at once, or add a navigation menu - by month, or by order of publication - listing only the titles.


"posts" on the sidebar links to https://moonbaseotago.github.io/posts.html, which is a page with just the dates & titles.



That's a bit dramatic. Yes, RVV does add to the problem of the same binary running differently on different implementations, but it's not like that doesn't already happen a lot. The main thing that sucks about it, imo, is that you can't test a piece of code on a single implementation and be sure it'll work the same on all (though I'd imagine CPUs would allow restricting their VLEN to a smaller one, and you can always emulate higher ones); but if your code doesn't work on all implementations, you've really just written code as bad as if it didn't work on any (assuming you didn't do it intentionally).

A compiler can be made to handle the variable vector width relatively easily (clang and gcc already autovectorize obeying the variable width). With explicit intrinsics or ASM, a programmer can of course write bad code, but it's actually surprisingly non-trivial to do so without noticing - yes, an in-register shuffle/permute can easily break, but to be able to run it you need to have some source operands for it, and it'll be obvious if you build the to-be-shuffled vector incorrectly. Pretty much the only way to get bad-length data is to load it from memory (as RVV doesn't provide any way to make constant vectors, since those don't make sense), and to do that badly you'd have to pass a bad vl to the load, which'll be very obvious. Or you can decide to cap the to-be-shuffled vector at the minimum guaranteed 128 bits, at which point there's no problem. On the other hand, you can actually use the shuffles in a fully length-agnostic way by constructing the table programmatically.
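To illustrate that last sentence with a toy model (plain Python standing in for RVV's vid.v and vrgather.vv, not real intrinsics): a permutation whose index table is computed from vl, rather than loaded as a constant, stays correct at every vector length.

```python
def vid(vl):
    # models vid.v: the index vector 0, 1, ..., vl-1
    return list(range(vl))

def vrgather(src, idx):
    # models vrgather.vv: out[i] = src[idx[i]]; out-of-range indices read 0
    return [src[i] if i < len(src) else 0 for i in idx]

def reverse(v):
    # length-agnostic reverse: the shuffle table is derived from vl,
    # so no assumption about the hardware vector width is baked in
    vl = len(v)
    idx = [vl - 1 - i for i in vid(vl)]
    return vrgather(v, idx)
```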

I imagine that there will be a decent amount of things that explicitly stick to 128 bits due to not being suited for automatic extension to wider vectors (and RVV does allow doing this very easily!), but I don't think that's particularly bad, nor should it restrict how other vector-width-independent things in the same program have to perform. And RISC-V's openness means that 256-bit and wider implementations will easily exist, making the wider widths not completely ignorable. And explicit intrinsics will be 0.01% of actual RVV usage anyway; the 99.99% that is autovectorized will nicely extend to future wide implementations.


My apologies. I linked that video mostly as a cheap shot at risc-v, and because it articulates the problem nicely. But doing that is somewhat antisocial, and further, I do not per se agree with its solution. It paints fixed-width vectors as the lesser of two evils; you seem to me to argue that variable-width vectors are not quite that bad, actually.

To begin with, performance is always beholden to microarchitectural concerns. It seems likely that, in general, adding two pairs of numbers at once can be faster than adding one pair and then adding another (and perhaps very significantly so, especially for large values of two). But how much exactly depends on the specifics of the hardware. An algorithm which works well for hardware implementing 128-bit vectors might be quite bad for one implementing 2048-bit vectors. Some problems might be amenable to implementation for variable-length vectors—many will not. But even fewer will be implementable in a performance-portable fashion.

The things which can be implemented performance-portably are likely to be the simplest. Basically, applications of rank where a given atom is only ever matched with other atoms or with large cells. Few interesting applications are bottlenecked solely by such loops. They will likely have some; but they are likely to also have many loops which cannot be vectorised performance-portably. Hence, the promise of these vector isas, even if it could be realised nominally, would be a lie.

Of course, this problem of performance portability afflicts fixed-width vectors, too. Intel goes to some lengths to try to provide it[0], but it's not obvious that that's sustainable in the long term (though, it's not clear that x86 is sustainable in the long term). (An example: I have some half-written code to use vector shuffles in place of loads for indexing small arrays in the j interpreter. What's 'small'? Once you go past a single array, the size of the mux grows quite quickly; but amd has a much better shuffle unit than intel, and a much worse memory subsystem, so I wouldn't be surprised if amd wanted to go a level deeper than intel.)

In a recently-oft-circulated thread[1], fabian giesen mentions, among other things, the reason why avx512 hasn't seen wider adoption: if your logical vector-width is n, then you can implement that on an n/2-width alu without too much trouble, but n/4 starts to cause problems. (Where 'without too much trouble' means that the cost model doesn't change much on either the hardware or the software side.) So with fixed-width vectors, you do get one factor of two of hardware flexibility, but no more. Arguably, this gives the software more flexibility (than variable-length vectors), and doesn't harm the hardware, if you can pick the right n. Avx512 seems to demonstrate that you can't; some heavy vector workloads want 512, and some low-power client hardware wants 128.

Recently, I predicted[2] that sve would effectively standardise on 256 bits (not 128 bits!—too anemic, and anyway neon already exists). I still hope that that happens. I think 256-bit vectors are a monotonic improvement over variable-width ones. Of course, see the above comment about heavier vector workloads really benefiting from 512-bit vectors; but it's naive to expect that the ideal vector size will grow without bound over the course of time, without other fundamental aspects of our models changing too, invalidating the entire enterprise.

And this really gets at the heart of the issue. ISAs are bad[3] (which is why we keep needing to make more of them[4]), because they expose an overly low-level interface. They should be gotten rid of. The correct interface to these sorts of functionality is a higher-level language—a sensible one which can actually be optimised, not c; perhaps a nice apl, perhaps with some concessions. Code should be designed to work uniformly and universally; you should be able to create arrays that logically have any size, and the hardware and software collaborate to execute it as well as they know how. If you know something about the underlying microarchitecture, then—uniformly, and as you always have been able to do—you can exploit that. For instance, if you know the underlying vector size is 64 bytes, then you may take care to work with length-32 vectors of 2-byte integers, which may enable further interesting optimisations to be done by the compiler.

Making the high-level language expressive and optimisable helps to avoid the need to expose specific low-level functionality (as in c via 'intrinsics'). Where they are necessary—as they may occasionally be[5]—they can still be cast in more generic form and optimised more for clarity, e.g. (taking a pertinent example) acting generically wrt array size.

0. https://lobste.rs/s/gjza8a/#c_ircalq

1. https://mastodon.gamedev.place/@rygorous/110572829749524388

2. https://lobste.rs/s/4cynth/introducing_iguana_extremely_fast...

3. https://lobste.rs/s/oxzs1q/note_on_metal_shader_converter#c_.... Sorry I only sketch the argument here; I have been planning to write it up more completely and rigorously, but have not yet done so.

4. Take care with the direction of the implicatory arrow; continually needing to make more of something is not in general an indication that the thing is bad.

5. Programming languages may be prone to the same pathologies as isas; an isa is a programming language. The difference is one of degree, and higher-level languages, in balance, end up in a much nicer and more useful place.


> An algorithm which works well for hardware implementing 128-bit vectors might be quite bad for one implementing 2048-bit vectors

Yeah, that's definitely a concern. I do wish that RVV had some built-in way to cap the width in a request (of course emulatable by doing a min() manually), but beyond that it's up to the hardware I guess.

> but they are likely to also have many loops which cannot be vectorised performance-portably

Such loops would likely not be autovectorizable either, making whether they can be vectorised performance-portably much less significant, due to the much lower frequency of manually-vectorized things. In most situations, if you're at the point of manually vectorizing some piece of code, it's enough of a bottleneck that you can tune it for whatever (set of) hardware you're targeting.

I, as someone who enjoys writing manually vectorized stuff, don't particularly like that added burden, but it's a pretty reasonable decision.

> I have some half-written code to use vector shuffles in place of loads for indexing small arrays in the j interpreter

Marshall has written that for CBQN[0] for AVX2, primarily tuned for Intel[1], but yeah, the boundary is pretty sketchy to have to hard-code, and RVV could definitely make it worse. But base 128-bit RVV with LMUL=8 provides 1024-bit lookup tables in a single instruction, which should at least mostly negate the need to manually blend multiple shuffles, which is where you'd start wanting tuning. Though, I wouldn't be surprised if an LMUL=8 in-register shuffle on some hardware were slower than a memory gather for some element widths, which'd suck.
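The arithmetic behind the 1024-bit claim, as a back-of-the-envelope sketch (the helper name here is made up): a register group at LMUL=m spans m × VLEN bits, so a table that would need several blended 128-bit shuffles fits in one gather at LMUL=8.

```python
import math

VLEN = 128  # bits; the guaranteed RVV minimum

def shuffles_needed(table_bits, lmul):
    # single-instruction lookups needed to cover a table, when one
    # register group holds lmul * VLEN bits
    return math.ceil(table_bits / (lmul * VLEN))

# a 1024-bit table: eight blended 128-bit shuffles without grouping,
# but a single vrgather over an LMUL=8 register group
assert shuffles_needed(1024, 1) == 8
assert shuffles_needed(1024, 8) == 1
```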

> but n/4 starts to cause problems

The existence of LMUL=8 means that hardware will necessarily have to figure out some way to split single instructions across many cycles. Arguably that's by itself pretty bad, but at least I'd imagine that further splitting up should be largely free, though the register file size problem of course persists.

> sve would effectively standardise on 256 bits

Yeah, I feel like 256 is a pretty nice sweet spot too.

Regarding higher-level hardware interfacing, I'd worry that it'd end up making everything uniformly slow instead of uniformly fast, thus potentially not actually being any better than the status quo. And it would be in a very significantly worse spot in terms of performance-portable tuning than RVV. Not to mention that nicer array operation behavior wouldn't help the vast majority of code, which is very scalar & dispatchy. An equivalent of this would be to just share programs in IR rather than machine-code form, which we can already do, but don't really for whatever reason (if near-native speed is required; otherwise there's JS/WASM/JVM. Then there's LLVM IR/MLIR, but I haven't heard of those being used as a shared binary format. Maybe Julia would count, given that it compiles LLVM IR at runtime and has high-level array ops?).

[0]: https://github.com/dzaima/CBQN/blob/a67a19dd0bf6be5eb765cac8... - wd is bit width of the table elements, xl - number of such elements in the table. Beyond that, it just does the expected thing of blending permutes & shuffles via some Singeli that not even I have bothered to particularly understand

[1]: "Select" section at https://mlochbaum.github.io/bencharray/pages/summary.html on Tiger Lake i5-1135G7, with similarly scaling results on haswell & skylake


As an extra note on big/little cores - instead of having the small core break up ops, you could possibly instead have the big core handle higher LMUL natively. This requires code to use said higher LMUL (currently, clang defaults to LMUL 2 and gcc to LMUL 1[0], but that's of course up for change), but would mean that you'd have higher potential throughput on big cores while leaving the register size tractable for the small cores.

Also has some fun possibilities of, say, having native LMUL=2 add/sub, but only LMUL=1 multiplications or something.
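A back-of-the-envelope model of that trade-off (hypothetical numbers and names, not any real core): if a core's datapath natively handles up to some LMUL cap, a higher-LMUL instruction decodes into proportionally more micro-ops, so the big core gets more work per instruction while the little core just cracks it further.

```python
import math

def uops_for(lmul, native_lmul_cap):
    # micro-ops one vector instruction cracks into on a core whose
    # datapath natively executes up to native_lmul_cap per uop
    return math.ceil(lmul / native_lmul_cap)

# the same LMUL=8 add: 4 uops on a big core with a native-LMUL=2
# datapath, 8 on a little core capped at LMUL=1 - identical
# architectural results, different throughput
assert uops_for(8, 2) == 4
assert uops_for(8, 1) == 8
```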

[0]: https://godbolt.org/z/YqjWhT8Y8 - "m1" vs "m2" in "vsetvli"



