Another solution is to model the acoustic echo path of the near room. Since we know the reference signal, the input to the microphone can be corrected by processing the reference signal with our model of the room echo response and subtracting the result from the received signal. Unfortunately, there is no way to know the room echo response ahead of time, and worse yet, the room response can change at any time.
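The subtraction described above can be written compactly. This is only a sketch under the assumption that the room echo response is modeled as a finite impulse response filter with K taps; the symbols are introduced here for illustration and do not come from the original text: x(n) is the reference signal, d(n) is the signal received at the microphone, \hat{h}_k are the coefficients of our model of the room, and e(n) is the corrected signal.

```latex
\[
\hat{y}(n) = \sum_{k=0}^{K-1} \hat{h}_k \, x(n-k),
\qquad
e(n) = d(n) - \hat{y}(n)
\]
```

If the model matched the room exactly, e(n) would contain only the near room talker.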
The typical solution is to use an adaptive filter. An adaptive filter uses the statistics of the reference signal and the error in the filter output to estimate a new filter. The filter is re-estimated after every new sample of data (that's about 8000 times a second for telephone-quality speech). Because the filter is updated many times a second, we always have a current estimate of the room response, even when it's changing.
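To make the per-sample update concrete, here is a minimal sketch in Python. The text does not say which adaptive algorithm is used, so this assumes the normalized least-mean-squares (NLMS) update, one common choice. The function name, the parameters `num_taps`, `step` and `eps`, and the four-tap synthetic "room" are all hypothetical, chosen only for illustration. The sketch also ignores the case where the near room talker speaks at the same time as the far room talkers.

```python
import numpy as np

def nlms_echo_canceller(reference, mic, num_taps=64, step=0.5, eps=1e-8):
    """Illustrative NLMS echo canceller (an assumed algorithm, not the source's).

    reference: samples sent to the loudspeaker (far room signal)
    mic:       samples received at the microphone
    Returns the corrected signal and the final filter estimate.
    """
    reference = np.asarray(reference, dtype=float)
    mic = np.asarray(mic, dtype=float)
    weights = np.zeros(num_taps)   # current estimate of the room echo response
    history = np.zeros(num_taps)   # recent reference samples, newest first
    corrected = np.zeros(len(mic))
    for n in range(len(mic)):
        # Shift the newest reference sample into the history buffer.
        history = np.roll(history, 1)
        history[0] = reference[n]
        # Predict the echo with the current filter and subtract it.
        echo_estimate = weights @ history
        error = mic[n] - echo_estimate
        corrected[n] = error
        # Update the filter after every sample, scaled by reference power.
        weights += step * error * history / (history @ history + eps)
    return corrected, weights

# Synthetic check: one second of noise at 8000 samples per second,
# passed through a made-up four-tap "room".
rng = np.random.default_rng(0)
reference = rng.standard_normal(8000)
room = np.array([0.6, 0.0, -0.3, 0.1])
mic = np.convolve(reference, room)[:len(reference)]
corrected, weights = nlms_echo_canceller(reference, mic, num_taps=8)
```

In this toy setup the microphone hears only echo, so the corrected signal should shrink toward zero as the filter estimate approaches the made-up room response. A practical canceller would also need to pause or slow adaptation while the near room talker is speaking.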
Here is a simulation of what echo sounds like.
Far room talkers.
Near room talker.
Signal received at the microphone.
Corrected signal using an adaptive filter.