LONG-TERM PREDICTION

by: Adit Aviv 
      Kfir Grichman

Introduction:

The speech signal has been studied for various reasons and applications by many researchers for many years. Some studies broke down the speech signal into its smallest portions, called phonemes. Here, however, we describe the speech signal in terms of its general characteristics. The traditional vocoders, which have been in use for many years, classify the input speech signal as either voiced or unvoiced. A voiced speech segment is characterised by its relatively high energy content but, more importantly, it contains periodicity, which is called the pitch of voiced speech. The unvoiced part of speech, on the other hand, looks more like random noise with no periodicity. However, there are some parts of speech that are neither voiced nor unvoiced, but a mixture of the two. These are usually called the transition regions, where there is a change either from voiced to unvoiced or from unvoiced to voiced.

One of the most powerful speech analysis methods is Linear Predictive Coding, commonly referred to as LPC analysis. In LPC analysis the short-term correlations between speech samples (formants) are modelled and removed by a very efficient low-order filter. Another equally powerful and related method is pitch prediction, in which the long-term correlations between speech samples are modelled. In the following report, these linear prediction techniques are examined and discussed.

 

LINEAR PREDICTIVE CODING (LPC) OF SPEECH

The linear predictive coding (LPC) method for speech analysis and synthesis is based on modeling the vocal tract as a linear all-pole (IIR) filter having the system function:

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_p(k)\, z^{-k}}$$

where p is the number of poles, G is the filter gain, and {a_p(k)} are the parameters that determine the poles. There are two mutually exclusive excitation functions, used to model voiced and unvoiced speech sounds. On a short-time basis, voiced speech is periodic with a fundamental frequency F0, or a pitch period 1/F0, which depends on the speaker. Thus voiced speech is generated by exciting the all-pole filter model with a periodic impulse train whose period equals the desired pitch period. Unvoiced speech sounds are generated by exciting the all-pole filter model with the output of a random-noise generator.

Block diagram model for the generation of a speech signal
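
The source-filter model described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`impulse_train`, `white_noise`, `all_pole_filter`), the example coefficients, and the sign convention H(z) = G / (1 - sum a(k) z^-k) are all assumptions made for the example.

```python
import random

def impulse_train(num_samples, pitch_period):
    """Voiced excitation: unit impulses spaced one pitch period apart."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]

def white_noise(num_samples, seed=0):
    """Unvoiced excitation: zero-mean Gaussian noise from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def all_pole_filter(excitation, a, gain=1.0):
    """Compute s(n) = gain * x(n) + sum_{k=1..p} a[k-1] * s(n-k).

    Samples before the start of the buffer are treated as zero.
    """
    s = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

# Hypothetical 2-pole "vocal tract" (stable: pole radius is about 0.77).
a = [1.3, -0.6]
# One 20 ms frame at 8 kHz with an 80-sample pitch period (100 Hz).
voiced = all_pole_filter(impulse_train(160, 80), a, gain=1.0)
unvoiced = all_pole_filter(white_noise(160), a, gain=0.1)
```

The only difference between the two synthesized frames is the excitation: the filter and its coefficients are shared, which is exactly the switch shown in the block diagram.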

Given a short-time segment of a speech signal, usually about 20 ms, or 160 samples at an 8 kHz sampling rate, the speech encoder at the transmitter must determine the proper excitation function, the pitch period for voiced speech, the gain parameter G, and the coefficients a_p(k).
A block diagram that illustrates the speech encoding system is given in the following figure:

Encoder and decoder for LPC
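
The frame size quoted above follows directly from the frame duration and the sampling rate. A small sketch of that arithmetic, and of cutting a signal into analysis frames, is given below. The helper names are hypothetical, and non-overlapping frames are an assumption; the report does not say whether its frames overlap.

```python
def frame_length(frame_ms, sample_rate_hz):
    """Number of samples in one analysis frame."""
    return round(frame_ms * sample_rate_hz / 1000)

def split_into_frames(signal, n):
    """Cut a signal into consecutive, non-overlapping frames of n samples.

    An incomplete frame at the end of the signal is dropped.
    """
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

n = frame_length(20, 8000)          # 20 ms at 8 kHz
frames = split_into_frames(list(range(400)), n)
```

Each frame is then analysed independently to obtain its own excitation type, pitch period, gain and filter coefficients.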

At the receiver, the speech signal is synthesized from the model and the excitation signal.

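The parameters of the all-pole filter model are easily determined from the speech samples by means of linear prediction.

To be specific, the output of the FIR linear prediction filter is:

$$\hat{s}(n) = \sum_{k=1}^{p} a_p(k)\, s(n-k) \qquad (1.1)$$

and the corresponding error between the observed sample s(n) and the predicted value is

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_p(k)\, s(n-k) \qquad (1.2)$$

By minimizing the sum of squared errors, that is,

$$\varepsilon = \sum_{n} e^2(n) \qquad (1.3)$$

we can determine the pole parameters {a_p(k)} of the model. Differentiating with respect to each of the parameters and equating the result to zero gives a set of p linear equations,

$$\sum_{k=1}^{p} a_p(k)\, r_{ss}(m-k) = r_{ss}(m), \qquad m = 1, 2, \ldots, p \qquad (1.4)$$

where r_ss(m) is the autocorrelation of the sequence s(n), defined as

$$r_{ss}(m) = \sum_{n} s(n)\, s(n+m) \qquad (1.5)$$

Here the sums run over the analysis frame, with s(n) taken as zero outside it.

The linear equations (1.4) can be expressed in matrix form as

$$\mathbf{R}_{ss}\, \mathbf{a} = \mathbf{r}_{ss} \qquad (1.6)$$

where R_ss is a p×p autocorrelation matrix, r_ss is a p×1 autocorrelation vector, and a is a p×1 vector of model parameters. Hence

$$\mathbf{a} = \mathbf{R}_{ss}^{-1}\, \mathbf{r}_{ss} \qquad (1.7)$$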
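
The autocorrelation method outlined above can be sketched as follows. This is an illustrative sketch under stated assumptions: the predictor is taken as s^(n) = sum a(k) s(n-k), so the normal equations read R a = r (with the opposite sign convention the coefficients simply change sign). `solve_linear` is a small hypothetical helper that uses plain Gaussian elimination; practical coders usually exploit the Toeplitz structure of the matrix with the Levinson-Durbin recursion instead.

```python
def autocorrelation(s, max_lag):
    """r_ss(m) = sum_n s(n) * s(n + m) over the frame, for m = 0..max_lag."""
    count = len(s)
    return [sum(s[n] * s[n + m] for n in range(count - m))
            for m in range(max_lag + 1)]

def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def lpc_coefficients(s, order):
    """Build the p x p autocorrelation matrix and solve R a = r."""
    r = autocorrelation(s, order)
    big_r = [[r[abs(i - j)] for j in range(order)] for i in range(order)]
    return solve_linear(big_r, r[1:order + 1])

# Check on the impulse response of a known 2-pole filter (assumed example).
true_a = [1.3, -0.6]
s = [1.0, 1.3]
for n in range(2, 400):
    s.append(true_a[0] * s[n - 1] + true_a[1] * s[n - 2])
estimated_a = lpc_coefficients(s, 2)   # close to true_a
```

Because the test signal is the (fully decayed) impulse response of an all-pole filter, the estimated coefficients match the ones used to generate it; on real speech frames the fit is only approximate.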

LONG-TERM PREDICTION (LTP)

The LPC residual shows the ability of LPC analysis to remove the correlations between adjacent or neighbouring samples in speech. As observed, this is equivalent to removing the spectral envelope from the signal spectrum. However, as the spectra in the following figure show, there are still considerable variations in the spectrum after LPC analysis, i.e. it is far from white.

Spectra of (a) original speech envelope, (b) original speech spectrum, and (c) LPC residual spectrum.

Looking at the residual signal in the Figures above, it is clear that long-term correlations, especially during voiced regions, still exist between samples.
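
The LPC residual discussed here is obtained by inverse filtering the speech with the prediction coefficients. A minimal sketch follows, assuming the predictor convention s^(n) = sum a(k) s(n-k) and treating samples before the start of the buffer as zero; the function name `lpc_residual` is hypothetical.

```python
def lpc_residual(s, a):
    """Inverse filter: r(n) = s(n) - sum_{k=1..p} a[k-1] * s(n-k)."""
    residual = []
    for n in range(len(s)):
        predicted = sum(ak * s[n - k]
                        for k, ak in enumerate(a, start=1) if n - k >= 0)
        residual.append(s[n] - predicted)
    return residual

# A decaying first-order response is fully explained by a = [0.5],
# so only the initial excitation pulse survives in the residual.
r = lpc_residual([1.0, 0.5, 0.25, 0.125], [0.5])   # [1.0, 0.0, 0.0, 0.0]
```

For voiced speech the surviving samples are the periodic pitch pulses, which is why the residual still shows long-term correlation.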


The most evident of these are the sharp periodic pulses. Since they come from the excitation signal, this is hardly surprising, as our original source-filter model assumes this type of input signal. This also explains why the LPC analysis, which models the vocal tract, cannot adequately remove them. Consequently, to remove the periodic structure of the residual or excitation signal, a second stage of prediction is required. The objective of this second stage is again to spectrally flatten the signal, i.e. to remove the fine structure. But unlike the LPC analysis, it exploits the correlation between speech samples that are one or more pitch periods apart. For this reason, the pitch predictor (filter) is usually called the long-term predictor (LTP), and the filter delay is called the lag. In the following, these long-term or distant-sample-based predictors are described.

Pitch predictor (filter) formulation

Before discussing methods of pitch or long-term prediction, it is perhaps worth considering what our objectives are. Our aim is to model the long-term correlation left in the speech residual signal after LPC inverse filtering (or in the original speech signal) such that when the model parameters are used in a filter, it will remove the long-term correlation as much as possible, or spectrally flatten our signal. There are no obvious reasons why we must use the residual and not the original signal to model the long-term correlation in the speech signal, as long as the effects of the formants are taken into account during determination of the long-term delay (pitch) in our model.

The order in which the short-term predictor (STP, i.e. the LPC filter) and the LTP are combined is not too critical if the combination is carefully optimised, e.g. block edge effects must be carefully compensated to avoid 'clicking' type distortions. It is worth noting that the prediction gain of the combined system will always be less than the sum of the gains of systems employing the STP and the LTP in isolation. This is because in reality the vocal tract and excitation are not completely separable, as assumed in our model, but are interconnected. The LTP can be interpreted as

$$\frac{1}{P(z)} = \frac{1}{1 - \sum_{j=-I}^{I} b_j\, z^{-(T+j)}} \qquad (1.8)$$

where T is the 'pitch period', 2I + 1 is the number of filter taps, and b_j are the 'pitch gain' coefficients which reflect the amount of correlation between the distant samples. The combined analysis model can be represented by a time domain difference equation:

$$s(n) = \sum_{j=-I}^{I} b_j\, r(n-T-j) + \sum_{i=1}^{p} a_i\, s(n-i) + e(n) \qquad (1.9)$$

where r(n) is the past excitation signal and e(n) is the remaining, unpredicted part of the signal. Following a similar procedure to that of the LPC analysis, our goal is to determine estimates (B_j, T, a_i) of the model parameters (b_j, T, a_i). Then, the prediction error is given by:

$$e(n) = s(n) - \sum_{j=-I}^{I} B_j\, r(n-T-j) - \sum_{i=1}^{p} a_i\, s(n-i) \qquad (2.0)$$

The mean squared error solution to equation (2.0) is not as straightforward as for the LPC analysis due to the presence of the delay factor T. To overcome this hurdle, two sub-optimal approaches can be taken:

(1) One-Shot Optimisation: if one assumes that the pitch spectrum information of the residual r(n) is 
     close to the pitch spectrum information of the input speech s(n), then we can solve for a_i as before 
     and then use the residual from the LPC inverse filter to determine (B, T). Thus, during the first iteration, 
     the STP coefficients are estimated to minimise the intermediate residual energy. The LTP coefficients 
     are then found using this intermediate residual signal. This procedure can be considered to be near 
     optimal provided the long-term lag, T, is greater than the analysis frame size, i.e. T > N. 
(2) Iterative Sequential Approach: an analysis similar to the one-shot method described above is first 
     performed. During subsequent iterations, the STP is re-optimised given the previously determined 
     LTP coefficients. Also, the LTP is recalculated based on the newly formed intermediate 
     residual. This iteration process can then be continued until a certain threshold, or a fixed number of 
     iterations, is reached.

For practical reasons, the one-shot method is usually preferred as it only requires one iteration. In the iterative sequential method, the main difficulty is to set a suitable threshold for terminating the iteration run, and overall it is substantially more complicated. However, the iterative method has been reported to give a better prediction gain and better perceptual performance. This is usually achieved by shifting prediction gain from the STP to the LTP. Here, only the one-shot method is considered, as follows.
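By removing the STP effect in equation (2.0), i.e. replacing s(n) - Σ a_i s(n-i) with the LPC residual r(n), we obtain

$$e(n) = r(n) - \sum_{j=-I}^{I} B_j\, r(n-T-j) \qquad (2.1)$$

The estimates can now be determined by minimising the mean squared error, i.e.

$$E = \left\langle e^2(n) \right\rangle = \left\langle \Big[ r(n) - \sum_{j=-I}^{I} B_j\, r(n-T-j) \Big]^2 \right\rangle \qquad (2.2)$$

Replacing the expectation with finite summations, we get

$$E = \sum_{n=0}^{N-1} \Big[ r(n) - \sum_{j=-I}^{I} B_j\, r(n-T-j) \Big]^2 \qquad (2.3)$$

By setting $\partial E / \partial B_i$ to zero for i = -I, ..., I, we obtain

$$\sum_{j=-I}^{I} B_j \sum_{n=0}^{N-1} r(n-T-i)\, r(n-T-j) = \sum_{n=0}^{N-1} r(n)\, r(n-T-i) \qquad (2.4)$$

which can be written in matrix form as

$$[B_j] = [V(i,j)]^{-1}\, [v(i)] \qquad (2.5)$$

where

$$V(i,j) = \sum_{n=0}^{N-1} r(n-T-i)\, r(n-T-j) \qquad (2.6)$$
$$v(i) = \sum_{n=0}^{N-1} r(n)\, r(n-T-i) \qquad (2.7)$$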

The B_j coefficients can now be solved by inverting V(i,j), e.g. using Cholesky's decomposition. In the above formulation, a 'fix-up' may be used to ensure that the filter so formed is stable: for example, adding a small noise source into the formulation makes the matrix inversion to obtain [V(i,j)]^{-1} more reliable. However, a stable LTP is not a pre-condition of the LTP analysis, as rapid transitions are sometimes desired.
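
The solution for the pitch gains at a given lag can be sketched as below. Several things here are assumptions made for illustration rather than details from the report: the buffer layout (past residual followed by the current frame), the names `ltp_gains` and `solve_linear`, the use of Gaussian elimination instead of Cholesky's decomposition, and the realisation of the 'fix-up' as a small diagonal loading term `reg`, which has the same effect as adding a little white noise to the correlations.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def ltp_gains(buf, start, n_frame, lag, half_taps=1, reg=1e-9):
    """Return [B_-I, ..., B_I] for the frame buf[start:start + n_frame].

    buf must hold at least lag + half_taps past samples before `start`.
    """
    taps = list(range(-half_taps, half_taps + 1))

    def corr(shift_a, shift_b):
        return sum(buf[start + n - shift_a] * buf[start + n - shift_b]
                   for n in range(n_frame))

    big_v = [[corr(lag + i, lag + j) for j in taps] for i in taps]
    small_v = [corr(0, lag + i) for i in taps]
    # Assumed 'fix-up': load the diagonal so the matrix stays invertible.
    load = reg * max(big_v[k][k] for k in range(len(taps)))
    if load == 0.0:
        return [0.0] * len(taps)       # silent history: nothing to predict
    for k in range(len(taps)):
        big_v[k][k] += load
    return solve_linear(big_v, small_v)
```

With `half_taps=1` this gives the 3-tap predictor, and with `half_taps=0` it reduces to the single gain of a 1-tap predictor.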

In the above formulation it is assumed that the pitch lag, T, has already been found and that B_j = B_{j,T}. In our project we used the 3-tap LTP (I = 1), which forms the pitch prediction from three past samples at T - 1, T and T + 1. To determine T, various pitch measurement algorithms can be used. These include the auto-correlation, average magnitude difference function (AMDF), cepstrum and maximum likelihood methods. These methods perform differently, especially with a noisy input signal. For simplicity, the auto-correlation algorithm is used for the general description below.
As the preceding analysis to determine B_j has shown, pitch analysis is performed on a block containing N samples. However, the window from which the block is taken must be considerably longer than the analysis frame length, N. This is because the pitch value, T, can vary between a minimum, Tmin, of around 16 samples and a maximum, Tmax, of around 150 samples. Therefore, the ideal analysis window has length N + Tmax (200-256 samples), so that it contains more than one complete pitch period. For simplicity, consider a 1-tap LTP, i.e. I = 0:

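$$e(n) = r(n) - B\, r(n-T) \qquad (2.8)$$

Thus

$$E = \sum_{n=0}^{N-1} \big[ r(n) - B\, r(n-T) \big]^2 \qquad (2.9)$$

and setting $\partial E / \partial B$ to zero gives

$$B = \frac{\sum_{n=0}^{N-1} r(n)\, r(n-T)}{\sum_{n=0}^{N-1} r^2(n-T)} \qquad (3.0)$$

Substituting this into equation (2.3),

$$E = \sum_{n=0}^{N-1} r^2(n) - \frac{\left[ \sum_{n=0}^{N-1} r(n)\, r(n-T) \right]^2}{\sum_{n=0}^{N-1} r^2(n-T)} \qquad (3.1)$$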

The main problem

To determine the optimum T, candidate lags between Tmin and Tmax are tested, and the lag which minimises the error E is taken as the optimum. Having found T, the gain B can be found. A plot of the LPC residual and the signal (secondary excitation) after LTP inverse filtering is shown in the figures below.
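
The lag search just described can be sketched for the 1-tap case. For a fixed lag, the error energy with the optimal gain equals the frame energy minus C^2 / D, where C = sum r(n) r(n-T) and D = sum r^2(n-T), so minimising E is the same as maximising C^2 / D. The function name and the buffer layout (at least Tmax past residual samples stored before the current frame) are assumptions made for this example; the default limits of 16 and 150 samples are the values quoted in the text.

```python
def find_pitch_lag(buf, start, n_frame, t_min=16, t_max=150):
    """Exhaustive 1-tap search; returns (lag, gain).

    The frame is buf[start:start + n_frame]; buf must hold at least
    t_max past samples before `start`.
    """
    best_lag, best_score, best_gain = t_min, -1.0, 0.0
    for lag in range(t_min, t_max + 1):
        c = sum(buf[start + n] * buf[start + n - lag] for n in range(n_frame))
        d = sum(buf[start + n - lag] ** 2 for n in range(n_frame))
        if d <= 0.0:
            continue                   # no energy at this lag, skip it
        score = c * c / d              # energy removed by the predictor
        if score > best_score:
            best_lag, best_score, best_gain = lag, score, c / d
    return best_lag, best_gain
```

Some coders additionally require C to be positive or search fractional lags; neither refinement is shown here.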


Time domain plots of both LPC and pitch residuals.

It is clear that the secondary excitation no longer possesses the sharp pulse-like characteristics of the residual, i.e. it looks much whiter than the LPC residual. A similar formulation can also be given for multiple-tap LTPs.
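
How much 'whiter' the secondary excitation is can be quantified by applying the LTP inverse filter and comparing energies before and after. The sketch below assumes the same buffer layout as a lag search would use (past residual followed by the current frame) and measures the prediction gain as an energy ratio in dB; the function names are hypothetical.

```python
import math

def ltp_residual(buf, start, n_frame, lag, gains):
    """e(n) = r(n) - sum_j B_j * r(n - T - j), with j = -I..I.

    `gains` lists B_-I .. B_I, so its length is 2I + 1 (1 for a 1-tap LTP).
    """
    half = len(gains) // 2
    out = []
    for n in range(n_frame):
        idx = start + n
        predicted = sum(b * buf[idx - lag - j]
                        for b, j in zip(gains, range(-half, half + 1)))
        out.append(buf[idx] - predicted)
    return out

def prediction_gain_db(before, after):
    """10 * log10 of the energy ratio between input and residual."""
    e_before = sum(x * x for x in before)
    e_after = sum(x * x for x in after)
    if e_after == 0.0:
        return float("inf")            # perfectly predicted frame
    if e_before == 0.0:
        return float("-inf")           # silent input, non-silent residual
    return 10.0 * math.log10(e_before / e_after)

# Toy frame that repeats the previous 4 samples at half amplitude.
buf = [1.0, 2.0, 3.0, 4.0, 0.5, 1.0, 1.5, 2.0]
exact = ltp_residual(buf, 4, 4, 4, [0.5])      # gain matches: residual is zero
partial = ltp_residual(buf, 4, 4, 4, [0.25])   # gain too small: half remains
gain_db = prediction_gain_db(buf[4:], partial)
```

A poorly chosen gain still removes part of the periodic structure, but the prediction gain drops accordingly.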

Multiple-tap LTPs generally provide better performance than the single-tap LTP, but at the cost of increased complexity and a larger capacity requirement for transmitting the two extra filter taps, B_{-1} and B_{1}.

Final conclusion:
Once the "white residual" has been found, the remaining error signal can be compressed to a minimum, which gives a clear signal at a low bit rate.
  