Performance Evaluation of Pitch Detection Algorithms
http://access.feld.cvut.cz/view.php?cisloclanku=2009060001
Published on 02. 06. 2009 (15123 views)
Performance comparison of speech fundamental frequency detection algorithms
This article presents a comparative study of the performance of four pitch detection algorithms. The algorithms covered in the study are: the autocorrelation method modified by center clipping, a method based on the normalized cross-correlation function, a method based on the average magnitude difference function, and the cepstral method. The algorithms were compared on a database of speech signals consisting of utterances by 5 men and 5 women in Czech. Algorithm performance was evaluated using the different types of errors that arise when these algorithms detect the fundamental frequency.
Keywords: pitch detection algorithms; algorithm performance; autocorrelation method; cross-correlation function method; average magnitude difference method; cepstral method
1. Introduction
Speech coding is a fundamental element of digital communications, and it has progressed in parallel with the growing demand for telecommunication services. The development of good-quality, low-bit-rate speech codecs has been the objective of a substantial amount of research [1]. A primary underlying task in this area is the extraction of features from speech signals that can be used in speech synthesis applications. One important feature is the fundamental frequency, more commonly referred to as pitch. In this paper, the terms pitch and its primary acoustical correlate, fundamental frequency, are used interchangeably. The fundamental frequency (F0) corresponds to the rate at which the human vocal cords vibrate.
A pitch detector is an essential component of a variety of speech processing systems. Besides providing necessary information about the nature of the excitation source for speech coding, the pitch contour of an utterance is useful for recognizing speakers, determining their emotional state, voice activity detection, and many other applications [2]. Various pitch detection algorithms (PDAs) have been developed in the past: the autocorrelation method [1], HPS [2], RAPT [3], the AMDF method [4], CPD [5], SIFT [6], and DFE [7]. Most of them estimate pitch very accurately on voiced speech, but the error rate of the voicing decision is still quite high. Moreover, PDA performance degrades significantly as signal conditions deteriorate [8]. Pitch detection algorithms can be classified into the following basic categories: time-domain tracking, frequency-domain tracking, and joint time-frequency-domain tracking.
In this paper, the principles of four PDAs, including their preprocessing and pitch pattern extraction techniques, are summarized and their implementation is described. Experiments and a discussion of the results follow.
2. Pitch detection algorithms
2.1 Modified Autocorrelation Function (MACF) Method
The autocorrelation approach is the most widely used time-domain method for estimating the pitch period of a speech signal [2]. This method is based on detecting the highest value of the autocorrelation function in the region of interest. For a given discrete signal x(n), the autocorrelation function is generally defined as
$$R(m) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\,x(n+m), \qquad 0 \le m < M_0 \tag{1}$$
where N is the length of the analyzed sequence and M0 is the number of autocorrelation points to be computed. For pitch detection, if we assume that x(n) is a periodic sequence, x(n) = x(n+P) for all n, it can be shown that the autocorrelation function is also periodic with the same period, R(m) = R(m+P). Conversely, periodicity in the autocorrelation function indicates periodicity in the signal. For a non-stationary signal such as speech, the concept of a long-time autocorrelation measurement given by (1) is not really meaningful. In practice, we operate on short speech segments consisting of a finite number of samples. That is why in autocorrelation-based PDAs the short-time autocorrelation function, given by
$$R(m) = \frac{1}{N}\sum_{n=0}^{N-1-m} x(n)\,x(n+m), \qquad 0 \le m < M_0 \tag{2}$$
is used. The variable m in (2) is called the lag or delay, and the pitch period is equal to the value of m that maximizes R(m) within the searched lag range. The modified autocorrelation pitch detector (MACF) [6] differs from the common autocorrelation method by using a center-clipping technique in a pre-processing stage. The relation between the input signal x(n) and the center-clipped signal y(n) is
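$$y(n) = \begin{cases} x(n) - C_L, & x(n) > C_L \\ 0, & |x(n)| \le C_L \\ x(n) + C_L, & x(n) < -C_L \end{cases} \tag{3}$$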
where CL is the clipping threshold. Generally, CL is about 50% of the maximum absolute signal value within the signal frame. Non-linear operations on the speech signal, such as center clipping, tend to flatten the spectrum of the signal passed to the candidate generator. This increases the distinctiveness of the true period peaks in the autocorrelation function. Figure 1 shows an example of a voiced frame, its center-clipped version, and the difference between the autocorrelation functions calculated from the original and the center-clipped signal frames.
Figure 1: Comparison of autocorrelation function calculated from voiced frame and its center-clipped version.
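The center-clipping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `center_clip` and the `ratio` parameter are hypothetical, and the default of 0.5 follows the "about 50%" guideline given in the text.

```python
def center_clip(frame, ratio=0.5):
    """Center-clip one frame of samples.

    The clipping threshold CL is ratio * max(|x(n)|) over the frame
    (ratio=0.5 is an assumed default based on the 50% guideline).
    """
    if not frame:
        return []
    cl = ratio * max(abs(x) for x in frame)
    clipped = []
    for x in frame:
        if x > cl:
            clipped.append(x - cl)   # positive peak: shift down by CL
        elif x < -cl:
            clipped.append(x + cl)   # negative peak: shift up by CL
        else:
            clipped.append(0.0)      # everything inside [-CL, CL] is zeroed
    return clipped
```

Only the parts of the waveform that exceed the threshold survive, which is why the low-level formant ripple between pitch pulses disappears from the autocorrelation function.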
Figure 2: Block diagram of modified autocorrelation pitch detector.
Figure 2 shows the block diagram of the modified autocorrelation pitch detection algorithm. At the beginning of processing, the speech signal must be segmented into overlapping frames. Speech recordings with an 8 kHz sampling frequency were used for the experiments, so the input speech signal is segmented into overlapping frames of 240 samples (30 ms) with an overlap of 160 samples (20 ms). The method requires low-pass filtering (LPF) up to 800 Hz. Then, the autocorrelation function of the 30 ms section is computed over the range of lags from 16 to 160 samples (i.e., periods of 2 ms to 20 ms). This range of lags corresponds to fundamental frequencies of 50-500 Hz. Additionally, the autocorrelation function at zero delay is computed for normalization. The normalized autocorrelation function is then searched for its maximum peak, and the value of the peak and its position are recorded. If the peak value exceeds a threshold, the frame is classified as voiced; otherwise, it is classified as unvoiced (Fig. 3). The fundamental frequency is computed from the pitch period as F0 [Hz] = 1/T [s].
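As a minimal sketch of the segmentation step, the following Python code splits a signal into 240-sample frames with a 160-sample overlap (a hop of 80 samples) and converts a lag in samples to a frequency in Hz. The function names are hypothetical, and dropping a trailing partial frame is an assumption of this sketch rather than something the text specifies.

```python
def split_into_frames(signal, frame_len=240, overlap=160):
    """Split a sequence of samples into overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    hop = frame_len - overlap  # 80 samples = 10 ms at 8 kHz
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def lag_to_f0(lag, fs=8000):
    """Convert a pitch period in samples to F0 in Hz (F0 = 1/T = fs/lag)."""
    return fs / lag


frames = split_into_frames(list(range(400)))
print(len(frames), lag_to_f0(16), lag_to_f0(160))
```

The two calls to `lag_to_f0` reproduce the search limits quoted above: a lag of 16 samples maps to 500 Hz and a lag of 160 samples to 50 Hz.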
Figure 3: Process of voiced/unvoiced classification.
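The peak search and the voiced/unvoiced decision for a single frame might look as follows. This is a hedged sketch, not the authors' implementation: it omits the low-pass filter and the center clipper, the function names are hypothetical, and the `voicing_threshold` default of 0.3 is an assumed value, since the text does not state the threshold that was used.

```python
import math


def autocorrelation(frame, lag):
    """Short-time autocorrelation of one frame at a single lag (plain sum)."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def detect_pitch_acf(frame, fs=8000, min_lag=16, max_lag=160,
                     voicing_threshold=0.3):
    """Return F0 in Hz for a voiced frame, or 0.0 for an unvoiced one.

    voicing_threshold is an assumed value, not taken from the paper.
    """
    r0 = autocorrelation(frame, 0)           # zero-lag value for normalization
    if r0 <= 0.0:
        return 0.0                           # silent frame
    max_lag = min(max_lag, len(frame) - 1)
    scores = {m: autocorrelation(frame, m) / r0
              for m in range(min_lag, max_lag + 1)}
    if not scores:
        return 0.0                           # frame shorter than min_lag
    best_lag = max(scores, key=scores.get)   # highest normalized peak
    if scores[best_lag] > voicing_threshold:
        return fs / best_lag
    return 0.0


# A 100 Hz sine sampled at 8 kHz has a period of 80 samples.
frame = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(240)]
print(detect_pitch_acf(frame))
```

Recomputing the autocorrelation for every lag with a plain sum costs O(N) per lag; a real-time implementation would more likely use an FFT-based or incremental computation.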
In general, the pitch detection described above still exhibits errors as a result of erroneous voiced/unvoiced decisions and inaccurate pitch estimation. Consequently, a smoothing stage such as median filtering is necessary to improve the performance of the system. A major limitation of the autocorrelation function is that it can contain many peaks other than those due to the basic periodic component (see Fig. 1). For voiced speech, numerous peaks are present in the autocorrelation function because of the damped oscillations of the vocal tract response. Because these extraneous peaks are themselves periodic, it is difficult for any simple peak-picking process to reject peaks that do not correspond to the real pitch period. Peak selection is more robust if preprocessing techniques or a relatively long time window are used. However, a long time window tracks rapid changes in pitch poorly. In spite of these peak-picking problems, the autocorrelation method performs well in the majority of cases and is relatively immune to noise [2].
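The median smoothing stage mentioned above can be illustrated with a short sketch. The window width of 5 frames and the shrinking window at the track edges are assumptions of this example; the text does not specify the filter length.

```python
import statistics


def median_smooth(f0_track, width=5):
    """Median-filter a per-frame F0 track (0.0 marks unvoiced frames).

    width should be odd; near the edges the window simply shrinks
    (an assumption of this sketch).
    """
    half = width // 2
    smoothed = []
    for i in range(len(f0_track)):
        window = f0_track[max(0, i - half):i + half + 1]
        smoothed.append(statistics.median(window))
    return smoothed


# An isolated octave error (200 Hz among 100 Hz frames) is removed.
print(median_smooth([100.0, 100.0, 200.0, 100.0, 100.0], width=3))
```

Because the median ignores isolated outliers, a single doubled or halved estimate is replaced by its neighbours' value, while a sustained pitch change that lasts longer than half the window passes through.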
2.2 Normalized Cross Correlation Function (NCCF) Method
The normalized cross-correlation function (NCCF) is very similar to the autocorrelation function, but it better follows rapid changes in the pitch and amplitude of the speech signal [3]. The NCCF-based PDA overcomes most of the shortcomings of autocorrelation-based algorithms at the cost of a slight increase in computational complexity. The NCCF for a speech segment x(n), 0 ≤ n ≤ N-1, is defined as
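$$\mathrm{NCCF}(m) = \frac{\sum_{n=0}^{N-m-1} x(n)\,x(n+m)}{\sqrt{\sum_{n=0}^{N-m-1} x^2(n)\,\sum_{n=0}^{N-m-1} x^2(n+m)}}, \qquad 0 \le m < M_0 \tag{4}$$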
where N is the length of the analyzed frame, m is the lag, and M0 is the number of autocorrelation points to be computed. Note that the values of the NCCF always lie in the interval [-1, 1]. The NCCF tends to be close to 1 for lags corresponding to integer multiples of the true pitch period, regardless of rapid changes in the amplitude of x(n) (Fig. 4). The NCCF is better suited for pitch detection than the ordinary autocorrelation function [6] and is more frequently used in pitch detection algorithms [2]. Compared with the ordinary autocorrelation function, the NCCF peaks corresponding to the pitch period are more prominent and less affected by rapid variations in signal amplitude. The advantage of the NCCF over the autocorrelation method is illustrated in Fig. 4.
Figure 4: Illustration of the autocorrelation and NCCF for a typical voiced speech frame.
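A small Python sketch makes the normalization concrete. It assumes the common form of the NCCF in which the cross-product sum is divided by the geometric mean of the energies of the two overlapping segments; the function name `nccf` is hypothetical.

```python
import math


def nccf(frame, lag):
    """Normalized cross-correlation of a frame with itself shifted by lag.

    The numerator and both energies are summed over the same
    len(frame) - lag overlapping samples, so the result lies in [-1, 1].
    """
    n_terms = len(frame) - lag
    num = sum(frame[n] * frame[n + lag] for n in range(n_terms))
    e_head = sum(frame[n] ** 2 for n in range(n_terms))
    e_tail = sum(frame[n + lag] ** 2 for n in range(n_terms))
    denom = math.sqrt(e_head * e_tail)
    return num / denom if denom > 0.0 else 0.0


# Sine with an 80-sample period: one full period of lag, then half a period.
frame = [math.sin(2 * math.pi * n / 80) for n in range(240)]
print(nccf(frame, 80), nccf(frame, 40))
```

Because each lag is normalized by the energy of exactly the samples that enter the numerator, scaling the frame or a gradual change in amplitude across the frame has far less effect on the peak height than it has on the plain autocorrelation.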
Figure 5: Block diagram of NCCF pitch detector.
The block diagram of the NCCF pitch detector is shown in Fig. 5. After the speech signal is segmented into overlapping frames, low-pass filtering is again required. The normalized cross-correlation function is then computed over a range of lags from 16 to 160 samples and searched for its maximum value; the value of the peak and its position are recorded. Despite the relative robustness of the NCCF for pitch detection, the magnitude of its largest peak is not a reliable indicator of whether the speech segment is voiced or unvoiced. The voiced/unvoiced decision is therefore made on the basis of frame energy. If the speech frame is classified as voiced, the lag corresponding to the highest NCCF peak is taken as the pitch period. A median filter is used to compensate for errors in pitch detection and in the voiced/unvoiced decision. This filtering operation produces the final F0 estimates.
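The energy-based voicing decision can be sketched as below. The text does not say how the energy is computed or which threshold is used, so the mean-square energy and the caller-supplied `energy_threshold` are assumptions, and both function names are hypothetical.

```python
def short_time_energy(frame):
    """Mean-square energy of one frame (assumed energy measure)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def is_voiced_by_energy(frame, energy_threshold):
    """Classify a frame as voiced when its energy exceeds the threshold.

    The threshold must be tuned to the recording level; no value is
    given in the paper, so none is built in here.
    """
    return short_time_energy(frame) > energy_threshold


loud = [0.5] * 240    # energy 0.25
quiet = [0.01] * 240  # energy 0.0001
print(is_voiced_by_energy(loud, 0.01), is_voiced_by_energy(quiet, 0.01))
```

In a complete detector this gate would run first, and the NCCF peak lag would be converted to F0 only for frames that pass it.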
2.3 Average Magnitude Difference Function (AMDF) Method
The average magnitude difference function (AMDF) [4] is another type of autocorrelation analysis. Instead of correlating the input speech at various delays (where multiplications and summations are formed at each value), a difference signal is formed between the delayed speech and the original, and the absolute magnitude is taken at each delay value. For a frame of N samples, the short-term difference function (AMDF) is defined as
$$\mathrm{AMDF}(m) = \frac{1}{N}\sum_{n=0}^{N-1} \left| x(n) - x(n-m) \right| \tag{5}$$
where x(n) are the samples of the analyzed speech frame, x(n-m) are the samples shifted in time by m samples, and N is the frame length. The difference function is expected to have a strong local minimum if the lag m is equal to or very close to the fundamental period. Figure 6 shows the AMDF of a voiced frame.
Figure 6: AMDF function of voiced frame of speech
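The following sketch computes the AMDF at a single lag. It differs from a literal reading of the definition in one assumed detail: the sum runs only over the sample pairs that lie inside the frame and is divided by the number of such pairs, which avoids a bias toward long lags. The function name `amdf` is hypothetical.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference of a frame at one lag.

    Normalizing by the number of overlapping pairs (len(frame) - lag)
    is a choice made in this sketch.
    """
    n_terms = len(frame) - lag
    if n_terms <= 0:
        raise ValueError("lag must be shorter than the frame")
    return sum(abs(frame[n + lag] - frame[n]) for n in range(n_terms)) / n_terms


# Sine with an 80-sample period: near-zero at one period, large at half.
frame = [math.sin(2 * math.pi * n / 80) for n in range(240)]
print(amdf(frame, 80), amdf(frame, 40))
```

Only subtractions and absolute values are needed, which is the property that makes the method attractive for low-cost hardware.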
A PDA based on the average magnitude difference function has the advantage of relatively low computational cost and simple implementation [2]. Unlike the autocorrelation function, the AMDF requires no multiplications, which is a desirable property for real-time applications. The processing procedure of the AMDF-based pitch detector is quite similar to that of the NCCF algorithm. After segmentation, the signal is low-pass filtered to remove the effects of intensity variations and background noise. The average magnitude difference function is then computed on the speech segment at lags from 16 to 160 samples. The pitch period is identified as the lag at which the minimum of the AMDF occurs. In addition to the pitch estimate, the ratio between the maximum and minimum values of the AMDF (MAX/MIN) is obtained. This measurement, together with the frame energy, is used to make the voiced/unvoiced decision [6]. Estimation errors may occur in transition segments between voiced, unvoiced, and silent regions; F0 doubling and halving errors are the most frequent. Median filtering is therefore used in the AMDF-based PDA.
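The minimum search and the MAX/MIN measurement described above could be sketched as follows. The function name `amdf_pitch` is hypothetical, the AMDF is normalized by the number of overlapping pairs (an assumption), and how the ratio is combined with frame energy into a decision is left out because the text gives no thresholds.

```python
import math


def amdf_pitch(frame, fs=8000, min_lag=16, max_lag=160):
    """Return (F0 estimate in Hz, MAX/MIN ratio of the AMDF) for one frame."""
    max_lag = min(max_lag, len(frame) - 1)
    values = {}
    for lag in range(min_lag, max_lag + 1):
        n_terms = len(frame) - lag
        total = sum(abs(frame[n + lag] - frame[n]) for n in range(n_terms))
        values[lag] = total / n_terms
    if not values:
        return 0.0, 0.0                       # frame too short to search
    best_lag = min(values, key=values.get)    # deepest valley = pitch period
    d_min = values[best_lag]
    d_max = max(values.values())
    ratio = d_max / d_min if d_min > 0.0 else float("inf")
    return fs / best_lag, ratio


# Slowly decaying 100 Hz sine at 8 kHz (period of 80 samples).
frame = [0.9995 ** n * math.sin(2 * math.pi * n / 80) for n in range(240)]
f0, ratio = amdf_pitch(frame)
print(f0, ratio)
```

A large MAX/MIN ratio indicates a deep, well-defined valley and therefore strong periodicity; for noise-like frames the AMDF is comparatively flat and the ratio stays close to 1.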
2.4 Cepstrum Pitch Determination (CPD)
Cepstral analysis also provides a way to estimate pitch. The cepstrum of voiced speech intervals has a strong peak corresponding to the pitch period [5]. The cepstrum pitch determination technique has some advantages over autocorrelation-based PDAs. It is assumed that a sequence of voiced speech s(n) can be represented as
$$s(n) = e(n) * h(n) \tag{6}$$
where e(n) is the source excitation sequence, h(n) is the discrete impulse response of the vocal tract, and * denotes convolution. In the autocorrelation function, the effects of the vocal source and the vocal tract are convolved with each other. This results in broad peaks, and in some cases multiple peaks, in the autocorrelation function. In the frequency domain, the convolution relationship between vocal source and vocal tract effects becomes a multiplicative relationship
$$S(\omega) = E(\omega)\,H(\omega) \tag{7}$$
where S(ω)=F{s(n)}, E(ω)=F{e(n)}, and H(ω)=F{h(n)}. The symbol F stands for the discrete Fourier transform (DFT). Taking the logarithm of the magnitude of (7) gives (8),
$$\log\left|S(\omega)\right| = \log\left|E(\omega)\right| + \log\left|H(\omega)\right| \tag{8}$$
The multiplicative relationship between source and tract effects is thus transformed into an additive relationship in the cepstrum. The effects of the vocal source and vocal tract become nearly independent, or at least easily identifiable and separable. It is possible to separate the part of the cepstrum that represents the source signal and find the true pitch period. That is why, in general, cepstrum pitch determination is more accurate than autocorrelation PDAs [2]. For pitch determination, the real part of the cepstrum is sufficient. The real cepstrum of the discrete signal s(n) is defined as
$$c(n) = \frac{1}{N}\sum_{k=0}^{N-1} S(k)\,e^{\,j 2\pi k n / N}, \qquad 0 \le n \le N-1 \tag{9}$$
where S(k) is the logarithmic magnitude spectrum of s(n),
$$S(k) = \log\left|\sum_{n=0}^{N-1} s(n)\,e^{-j 2\pi k n / N}\right| \tag{10}$$
The cepstrum consists of a peak at a high quefrency equal to the pitch period in seconds, and of low-quefrency information corresponding to the formant structure in the log spectrum [5]. To estimate the fundamental frequency from the cepstrum, we look for a peak in the quefrency region corresponding to typical speech fundamental frequencies. Figure 7 shows an example of fundamental frequency estimation from the cepstrum of a voiced frame.
Figure 7: Waveform (a) and real cepstrum (b) of voiced speech segment.
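The real cepstrum and its peak search can be written out directly from the definitions. This sketch uses a naive O(N²) DFT to stay dependency-free, which is adequate for 240-sample frames; an FFT would normally be used. The function names, the small offset inside the logarithm, and the default upper search limit of half the frame length are assumptions of this example. The limit is chosen because the real cepstrum of an N-point frame is symmetric about N/2, so higher quefrencies only mirror lower ones.

```python
import cmath
import math


def hamming_window(length):
    """Standard Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming_window(n))]
    log_mag = []
    for k in range(n):
        s_k = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        log_mag.append(math.log(abs(s_k) + 1e-12))  # offset avoids log(0)
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]


def cepstral_peak(frame, min_lag=16, max_lag=None):
    """Return (quefrency in samples, cepstral value) of the highest peak."""
    cep = real_cepstrum(frame)
    if max_lag is None:
        max_lag = len(frame) // 2   # assumed limit, see the note above
    peak_lag = max(range(min_lag, max_lag + 1), key=lambda q: cep[q])
    return peak_lag, cep[peak_lag]


# Idealized excitation: one pulse every 80 samples (100 Hz at 8 kHz).
frame = [1.0 if n % 80 == 0 else 0.0 for n in range(240)]
lag, value = cepstral_peak(frame)
print(lag, 8000 / lag)
```

With 240-sample frames the default limit restricts the search to periods up to 120 samples; zero-padding the frame before the DFT is one way to extend the usable quefrency range.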
The processing procedure of the cepstrum-based pitch detector is similar to that of the PDAs described above, except that the cepstral pitch detector uses the full-band speech signal. Each block of 240 samples is weighted by a 240-point Hamming window, and the cepstrum of that block is computed. The peak cepstral value and its location are determined. If the value of this peak exceeds a fixed threshold, the section is called voiced and the pitch period is the location of the peak. If the peak does not exceed the threshold, a zero-crossing count is made on the block. If the zero-crossing count exceeds a given threshold, the block is marked as unvoiced. Otherwise, it is called voiced and the period is the location of the maximum value of the cepstrum.
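The two-stage decision described above can be sketched as a small function. It takes the cepstral peak value and location as inputs so that it stays independent of how the cepstrum is computed. Both thresholds are passed in by the caller because the text gives no numeric values, and the function names are hypothetical.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def cepstral_decision(frame, peak_value, peak_lag,
                      peak_threshold, zc_threshold, fs=8000):
    """Return F0 in Hz for a voiced block, or 0.0 for an unvoiced one.

    peak_threshold and zc_threshold are tuning parameters that the
    paper does not specify.
    """
    if peak_value > peak_threshold:
        return fs / peak_lag              # strong cepstral peak: voiced
    if zero_crossings(frame) > zc_threshold:
        return 0.0                        # weak peak, noise-like: unvoiced
    return fs / peak_lag                  # weak peak, few crossings: voiced


noisy = [1.0, -1.0] * 120                 # 239 zero crossings
print(cepstral_decision(noisy, 0.05, 80, 0.1, 100))
```

The zero-crossing count acts as a fallback: unvoiced fricatives have many sign changes per frame, whereas low-level voiced speech has few, so a weak cepstral peak alone does not force an unvoiced label.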
3. Experiments and discussion
3.1 Experiment Settings and Evaluation Criteria
Our experiments evaluate the performance of the described PDAs. We used speech signals from the Czech “SpeechDat” database of telephone speech. The recordings consist of the numbers from 1 to 10 in Czech, pronounced by 5 male and 5 female speakers. The speech signals were sampled at 8 kHz with a 16-bit A/D converter. The recordings were hand-labeled with voiced/unvoiced regions and individual pitch values.
The accuracy of the different pitch detection algorithms was measured according to the following criteria [8]:
1. Classification Error (CE): the percentage of unvoiced frames classified as voiced and voiced frames classified as unvoiced.
2. Gross Error (GE): the percentage of voiced frames whose estimated fundamental frequency deviates from the reference value by more than 20%.
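The two criteria can be computed from a reference and an estimated F0 track as sketched below. Several conventions here are assumptions rather than details given in the text: unvoiced frames are coded as 0, CE is taken relative to all frames, and GE is taken relative to the frames that both the reference and the detector mark as voiced. The function names are hypothetical.

```python
def classification_error(ref_f0, est_f0):
    """Percentage of frames whose voiced/unvoiced label is wrong.

    A frame counts as voiced when its F0 value is greater than 0.
    """
    if not ref_f0:
        return 0.0
    wrong = sum(1 for r, e in zip(ref_f0, est_f0) if (r > 0) != (e > 0))
    return 100.0 * wrong / len(ref_f0)


def gross_error(ref_f0, est_f0, tolerance=0.2):
    """Percentage of voiced frames deviating by more than 20% from the reference.

    Only frames voiced in both tracks are counted (assumed convention).
    """
    voiced = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0]
    if not voiced:
        return 0.0
    gross = sum(1 for r, e in voiced if abs(e - r) > tolerance * r)
    return 100.0 * gross / len(voiced)


ref = [0.0, 100.0, 100.0, 200.0, 0.0]
est = [0.0, 100.0, 150.0, 0.0, 120.0]
print(classification_error(ref, est), gross_error(ref, est))
```

In this toy example two of five frames carry the wrong voicing label, and one of the two jointly voiced frames is off by 50%, so the sketch reports CE = 40% and GE = 50%.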
3.2 Results of PDAs Performance Evaluation
To observe the performance of the four described PDAs, utterances pronounced by male and female speakers were selected from the SpeechDat database. Each utterance consists of ten words (the numbers from one to ten) in Czech. The results of the PDA performance evaluation are shown in Table 1 and Fig. 8, separately for male and female speech. Fig. 8 gives a graphical representation of PDA performance; for readability, it presents only part of the utterances used in the experiments. The pitch detection algorithms performed better on utterances pronounced by male speakers. As Table 1 shows, the best results were achieved by the algorithm based on the normalized cross-correlation function (NCCF). For male speech, only 0.72% of voiced frames have an estimated F0 that deviates from the reference by more than 20%. NCCF has the lowest overall gross error, although it is closely followed by MACF (GE = 0.81%). In addition, the MACF algorithm outperforms NCCF in voiced/unvoiced classification for male speech: with MACF, only 3.07% of frames were misclassified (voiced as unvoiced or unvoiced as voiced). The CPD algorithm achieves good results in pitch estimation for male and female speech (GE = 0.79% and 3.02%, respectively). However, it falls behind the MACF and NCCF algorithms in voiced/unvoiced classification (CE = 10.84% and 18.83% for male and female speech, respectively). The results of our experiments show that the AMDF method is the least accurate one: it has the largest gross error and classification error. Pitch doubling and pitch halving are evident problems for this algorithm.
Table 1. Performance evaluation results of different PDAs on clean speech
Figure 8: Evaluation of pitch detection algorithms on male (a) and female (b) speech.
4. Conclusion
This paper focused on pitch detection algorithms for speech signals. Four PDAs, based on the autocorrelation function, the normalized cross-correlation function, the average magnitude difference function, and cepstral analysis, were introduced. Each of the described algorithms has its advantages and drawbacks. Judging from the experimental results, the MACF method is the most convenient for common usage: it gives accurate pitch estimates at low computational complexity. The NCCF method gives the best results in pitch detection accuracy and voiced/unvoiced classification, but it is computationally more complex than the MACF and needs energy calculations for voiced/unvoiced classification. The CPD method shows good pitch estimation accuracy, and its fundamental frequency estimation is immune to errors caused by vocal tract effects. However, the CPD method is computationally complex and needs additional parameters for the voiced/unvoiced decision. The AMDF method has the great advantage of very low computational complexity, which makes it possible to implement in real-time applications. However, this algorithm showed poor results in accuracy of pitch estimation and pattern recognition.
Acknowledgements
This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic within project MSM6840770014.
References
[1] A. M. Kondoz, "Digital Speech: Coding for Low Bit Rate Communication Systems", 2nd ed., John Wiley & Sons, England, 2004.
[2] W. J. Hess, "Pitch Determination of Speech Signals", Springer, New York, 1993.
[3] D. Talkin, "A robust algorithm for pitch tracking (RAPT)", in Speech Coding and Synthesis, Elsevier Science, Amsterdam, pp. 495-518, 1995.
[4] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference function pitch extractor", IEEE Transactions on ASSP, vol. ASSP-22, no. 5, pp. 353-362, Oct. 1974.
[5] A. M. Noll, "Cepstrum Pitch Determination", Journal of the Acoustical Society of America, vol. 41, no. 2, pp. 293-309, 1967.
[6] L. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal, "A comparative performance study of several pitch detection algorithms", IEEE Transactions on ASSP, vol. 24, pp. 399-417, 1976.
[7] H. Bořil and P. Pollák, "Direct Time Domain Fundamental Frequency Estimation of Speech in Noisy Conditions", Proc. EUSIPCO 2004, Wien, Austria, vol. 1, pp. 1003-1006, 2004.
[8] B. Kotnik, H. Höge, and Z. Kacic, "Evaluation of Pitch Detection Algorithms in Adverse Conditions", Proc. 3rd International Conference on Speech Prosody, Dresden, Germany, pp. 149-152, 2006.
Authors: E. Verteletskaya, B. Šimák