Cross-correlation can be more robust than autocorrelation when analyzing short stretches of speech. Regardless of the total duration of your segment of speech, a voice report requires analyzing short sub-sequences. This is because jitter is the difference between consecutive periods, and if you compute f0 over long stretches, you've already averaged those differences out.

The service this board uses to render math appears to be having some kind of technical problem. I'll explain why cross-correlation can be more robust than autocorrelation more clearly once the service is back up. In the meantime, here's an attempt with as little mathematical notation as possible. Remember that a cross-correlation has three arguments: two signals *f* and *g*, and a lag tau. The cross-correlation of the two signals at a particular lag is the infinite sum, over all *t*, of the product of *f*(*t*) and *g*(*t*-tau): $\sum_{t=-\infty}^{\infty} f(t)\,g(t-\tau)$. In the autocorrelation approach, we estimate f0 by setting *f* and *g* to be the same signal, call it *h*. The largest autocorrelation value occurs when tau is zero, because *h*(*t*) = *h*(*t*-0) for all *t*. The second-largest peak should usually occur at the lag equal to the period of the fundamental frequency, but things like random noise can interfere.
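Here is a minimal sketch of that autocorrelation recipe in Python with NumPy. Everything concrete in it is an assumption for illustration: the 16 kHz sampling rate, the synthetic 200 Hz test signal, the helper name `autocorrelation`, and the choice to ignore lags shorter than 40 samples so the search skips the lag-zero peak.

```python
import numpy as np

def autocorrelation(h, max_lag):
    """Sum of h[t] * h[t - tau] over the available samples, for tau = 0..max_lag."""
    n = len(h)
    return np.array([np.dot(h[tau:], h[:n - tau]) for tau in range(max_lag + 1)])

sample_rate = 16000                      # assumed sampling rate, Hz
true_f0 = 200.0                          # assumed fundamental, Hz (period = 80 samples)
t = np.arange(1600) / sample_rate        # 0.1 s of signal
# Synthetic "voice": a fundamental plus a weaker second harmonic.
h = np.sin(2 * np.pi * true_f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * true_f0 * t)

r = autocorrelation(h, max_lag=200)

# Lag 0 always wins, so search for the next peak above a minimum lag.
min_lag = 40                             # assumed: ignore candidates above 400 Hz
best_lag = min_lag + int(np.argmax(r[min_lag:]))
estimated_f0 = sample_rate / best_lag
print(best_lag, estimated_f0)
```

On a clean, long signal like this one the second peak sits at the period. The trouble described below starts when the analysis window is short.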

In practice, we do not have an infinite sum, because we can only gather so many air pressure measurements before the signals change. To deal with this, practical applications will 'window' each signal. At the midpoint of the window, the signal is multiplied by 1, and it gets multiplied by smaller numbers as you get further away from (before and after) the midpoint. Outside of the window, the scaled measurements are 0.
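As a small illustration of such a window, here is a Hann window from NumPy. The Hann shape and the 9-sample length are assumptions picked for readability; the text above does not commit to a particular window.

```python
import numpy as np

window = np.hanning(9)          # assumed window shape and length
midpoint = len(window) // 2

# The weight is 1 at the midpoint and falls off symmetrically towards 0 at the edges.
print(np.round(window, 3))

segment = np.ones(9)            # stand-in for 9 air pressure measurements
windowed = segment * window     # samples far from the midpoint are scaled down
```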

However, this windowing can cause a problem for the autocorrelation approach. When tau gets close to the width of the window, either *h*(*t*) or *h*(*t*-tau) will be scaled towards zero or fall outside the window altogether, so the sum of the products of *h*(*t*) and *h*(*t*-tau) will be artificially small. This means that the windowed autocorrelation gives lower peaks, which are more likely to be lost in noise, as tau approaches the window size. The problem becomes especially pronounced when the window is very narrow, as is the case when analyzing short stretches of speech to estimate variation in adjacent periods.
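The shrinking peaks are easy to see numerically. The sketch below assumes a perfectly periodic sine with an 80-sample period and a Hann window only three periods wide; both numbers, and the helper name `autocorrelation`, are illustrative choices.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sum of x[t] * x[t - tau] over the available samples, for tau = 0..max_lag."""
    n = len(x)
    return np.array([np.dot(x[tau:], x[:n - tau]) for tau in range(max_lag + 1)])

period = 80                              # assumed pitch period in samples
n = 3 * period                           # a narrow window: only three periods
h = np.sin(2 * np.pi * np.arange(n) / period)
hw = h * np.hanning(n)                   # windowed signal

r = autocorrelation(hw, max_lag=2 * period)
r_norm = r / r[0]                        # normalize so the lag-0 value is 1

# Without windowing, a perfectly periodic signal would score 1 at every
# multiple of the period. With the window, the peaks shrink as the lag grows.
print(r_norm[period], r_norm[2 * period])
```

The peak at two periods is much smaller than the peak at one period, even though the underlying signal is exactly periodic, which is the artifact described above.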

The cross-correlation approach to f0 estimation addresses this by essentially windowing the lagged signals separately. *f* is set to be the speech signal windowed around the time period of interest, and *g* is set to be the speech signal windowed about the lag point. This cross-correlation approach essentially decouples the choice of window size from the choice of the range of pitch periods to consider, and allows the algorithm to use the same number of non-scaled samples for both short and long lags.
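A rough sketch of this idea follows. It is not taken from any particular tool: the function name `cross_correlation_f0`, the rectangular (unweighted) windows, the normalization by the energies of the two segments, and all numeric parameters are assumptions made for illustration.

```python
import numpy as np

def cross_correlation_f0(signal, start, window_len, min_lag, max_lag, sample_rate):
    """Estimate f0 at `start` by comparing one fixed segment with lagged segments.

    Hypothetical helper. Requires start >= max_lag so every lagged segment exists.
    """
    f = signal[start:start + window_len]             # windowed around the time of interest
    best_lag, best_score = min_lag, -np.inf
    for tau in range(min_lag, max_lag + 1):
        g = signal[start - tau:start - tau + window_len]   # windowed around the lag point
        # Every lag uses the same number of unscaled samples (window_len).
        # Assumes neither segment is all zeros.
        score = np.dot(f, g) / np.sqrt(np.dot(f, f) * np.dot(g, g))
        if score > best_score:
            best_lag, best_score = tau, score
    return sample_rate / best_lag, best_score

sample_rate = 16000                                  # assumed sampling rate, Hz
t = np.arange(1600) / sample_rate
signal = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

# The window is a single 80-sample period, yet lags up to 120 samples are searched.
f0_est, score = cross_correlation_f0(
    signal, start=400, window_len=80, min_lag=40, max_lag=120, sample_rate=sample_rate
)
print(f0_est, score)
```

Note that the window length (80 samples) is shorter than the longest lag considered (120 samples), which a windowed autocorrelation could not handle. The lag range here is capped below two periods so the example has a single best candidate.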