In remote sensing | computational physics applications it's rare to have only a single FFT to compute (whatever algorithm is chosen).
Hence the practice of stuffing many FFTs through GPU grids in parallel and working to max out hardware usage in order to increase application throughput.
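To make the batching point concrete, here's a minimal pure-Python sketch (sizes and function names are my own, purely illustrative): one FFT routine, then a batch wrapper whose loop is exactly the work that a GPU batched plan (e.g. cuFFT's "plan many" style API) spreads across the grid so that throughput, not single-transform latency, is what gets maximized.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the half-size sub-transforms.
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fft_batch(signals):
    """Transform many independent signals in one call.

    On a GPU, this loop is the part a batched FFT plan parallelizes:
    each signal is independent, so they all fly through the hardware
    at once instead of serializing.
    """
    return [fft(s) for s in signals]
```

A batched library call amortizes plan setup and kernel-launch overhead across all the transforms, which is where the throughput win over launching FFTs one at a time comes from.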
What I mean is: where did you take that from? I program FFTs on GPUs, and I see no reason for the claim that they "inherently can't reach 100% utilization by any metric".
I interpret that comment as meaning you're not going to be using every silicon block the GPU provides, like video codecs and rasterization hardware. If you've maxed out compute without going over the power budget, for example, you'd likely still be able to decode video if the GPU has a separate block for it.
I had a similar read .. I packed a lot of parallel FFTs and other processing into custom TI DSP cards, but the DSP family chips were RISC and carried little 'baggage' - just fat 32 bit | 64 bit floating point pipelines with instruction sets optimised for modular ring indexing of scalar | vector operations.
Even then they ran @ 80% "by design" for expected hard real time usage .. they only went to 11 - and dropped results on the floor - during smoke tests and when operators redlined the limits (and got feedback to that effect).
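The "modular ring indexing" mentioned above is worth a sketch. DSP addressing modes do this wrap in hardware; the Python below is my own illustrative stand-in (names `push`/`tap` and the ring size are assumptions, not any TI API) showing the mod-N pointer arithmetic that lets a delay line run with no bounds checks or branches.

```python
# Ring (circular) buffer with modular indexing, the software analogue of
# a DSP's modulo addressing mode: index arithmetic wraps mod N, so the
# pointer never needs an explicit bounds check.
N = 8                        # ring size; typically a power of two on DSPs
ring = [0.0] * N
head = 0                     # next write position

def push(sample):
    """Write into the ring; the pointer wraps with a mod, not a branch."""
    global head
    ring[head] = sample
    head = (head + 1) % N    # the wrap a DSP does for free in one cycle

def tap(delay):
    """Read the sample written `delay` pushes ago (a delay-line tap)."""
    return ring[(head - 1 - delay) % N]
```

Once more than N samples have been pushed, the oldest entries are silently overwritten - exactly the behavior a fixed-size hard-real-time delay line wants, since stale data ages out with zero management overhead.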