> but which requires all parallelism to be statically declared ahead of time
this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a general-purpose CPU, since the compiler can "waste" 30 min ahead of time figuring out the perfect routing/sequencing of operations, instead of the CPU having to decide it on the fly in nanoseconds/cycles
another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place
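
to make the "figure out the sequencing ahead of time" part concrete, here's a toy sketch of a static scheduler. everything in it is made up for illustration (the `static_schedule` name, the op names, the one-cycle-per-op assumption); real TPU/Cerebras compilers are far more involved, this only shows the idea that every op gets a fixed (cycle, unit) slot before anything runs, so the hardware never has to decide anything at runtime:

```python
def static_schedule(deps, num_units):
    """Assign every op a fixed (cycle, unit) slot before execution.

    deps maps an op name to the list of ops it depends on.
    Assumption for this sketch: every op takes exactly one cycle
    and any unit can run any op.
    """
    remaining = {op: set(d) for op, d in deps.items()}
    done = set()
    schedule = {}
    cycle = 0
    while remaining:
        # ops whose inputs were all produced in earlier cycles
        ready = sorted(op for op, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        # greedily fill the available units for this cycle
        issued = ready[:num_units]
        for unit, op in enumerate(issued):
            schedule[op] = (cycle, unit)
            del remaining[op]
        done.update(issued)
        cycle += 1
    return schedule


# hypothetical 4-op graph on a machine with 2 compute units
example = static_schedule(
    {
        "load_a": [],
        "load_b": [],
        "matmul": ["load_a", "load_b"],
        "relu": ["matmul"],
    },
    num_units=2,
)
print(example)
# {'load_a': (0, 0), 'load_b': (0, 1), 'matmul': (1, 0), 'relu': (2, 0)}
```

the greedy pass here is the cheap version; the point of having 30 min instead of nanoseconds is that you can afford to search over many such schedules and keep the best one.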