To take advantage of the blocking algorithm, matrix multiplication is implemented using Fast Fourier Transforms (FFT). The optimized FFT routine executes a length 512 FFT in approximately 7200 cycles. I am using a property of the FFT to do two real valued FFTs in one FFT operation (with some overhead). The number of FFTs per block is where M is the number of microphones. The block size is length 256. So for a sample rate of 8khz, the total computation for the FFT code alone is approximately or about 12 million cycles for 8 microphones. At first this seems pretty good. But after you consider the matrix inverse update code, the FFT overhead, and correcting the filter output, it turns out to be more than 5 or 6 times as much computation. And this doesn't even include beamforming.
Ok, so it still runs in under 150 million cycles, right? Not quite. The memory requirements exceed the 64k on the processor. And even the L2 cache runs considerably slower than the L1 cache. At a minimum we need to try to keep memory in the L2 cache as much as possible. Unfortunately, the algorithm runs through more than 64k of memory 8000 times a second. So just using cache may not be very efficient. Another solution is to use memory overlays. This gives the programmer more control over where the memory is at all times. And this is what I'm currently working on.