There's a difference because audio processing is often "massively parallel", or at least operates on big buffers like 1024 samples at a time, but in video codecs an operation might only touch 4 pixels at once, and you have to stretch to find enough extra work to fill the SIMD lanes.
So you can't necessarily batch work across blocks, because video is compression, and compression means not predictable. (If it were predictable, it wasn't compressed well enough.)
That means you have to stay inside the current block. But there are some tricks. For example, in an IDCT there's a previous stage whose output memory layout you can rearrange for free, so you can shuffle the data into whatever order fits your vectors without paying for a separate shuffle pass.