Before you start

Before you start developing a speech application, you need to consider several important points. They will determine how you implement the application.

Algorithms

Speech technology places several important limits on how an application can be implemented. For example, as noted above, it is not possible to recognize every known word of a language. You need to consider ways to overcome such limitations. For most types of applications these ways are well known, and they are described later in the tutorial. To follow them, you sometimes need to rethink how your application will behave and interact with the user.

Although we try to provide important examples, we obviously can't cover everything. There is no utterance verification or speaker identification example yet, though they could be created later. Most algorithms are widely covered in the scientific literature, and some of them are explained later in this section of the tutorial. Moreover, new methods for solving old problems appear every year.

Here are several common applications and the ways to approach them:

Generic dictation is never really generic. You need to decide which domain you'll recognize: dialogs, readings, meetings, voicemails, legal or medical transcriptions. If you consider voicemails, note that the language there is far more restricted than general language. It's actually a very small vocabulary with a specialized sequence of terms:

  • It's Sandy. Let's meet tomorrow
  • Hi. That's Joe, I'm going to sell you that car

There will be a lot of names, and that's a problem, but you'll never find a voicemail about quantum physics, and that's a very good thing. The recognizer uses the restrictions you provide through the language model to improve the accuracy of the result.

You'll have to build a language model for your domain, but that's not as complicated as you might think. Don't be afraid of vocabulary size either: if you cover the 60k most common words in English, the accuracy will be the same as with 120k words. For other languages with rich morphology the situation is different, but it is also solvable with morphology-based subwords. You also have to build a post-processing system, an adaptation system and a user-identification system.
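One way to check whether a vocabulary is large enough for a domain is to measure how many words of held-out text fall outside it. The sketch below is only an illustration with toy data: the function names `top_vocabulary` and `oov_rate` and the sample sentences are assumptions, not part of CMUSphinx.

```python
from collections import Counter


def top_vocabulary(tokens, size):
    """Return the set of the `size` most frequent words in `tokens`."""
    return {word for word, _ in Counter(tokens).most_common(size)}


def oov_rate(tokens, vocabulary):
    """Return the fraction of `tokens` that are not in `vocabulary`."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


# Toy domain text (hypothetical); a real corpus would hold millions of words.
train = "hi it's sandy let's meet tomorrow hi that's joe let's meet today".split()
held_out = "hi it's joe let's meet tomorrow about physics".split()

for size in (3, 100):
    vocabulary = top_vocabulary(train, size)
    print(size, oov_rate(held_out, vocabulary))
```

If the out-of-vocabulary rate stops dropping as the vocabulary grows, adding more words is unlikely to help accuracy much.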

For recognition on an embedded processor, there are two approaches to consider: recognition on the server and recognition on the device. The former is more popular nowadays because it lets you use the power and flexibility of cloud computing.

Language learning will require you to build a framework for tracking incorrect pronunciations. That includes generating incorrect pronunciations and scoring them.
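As a rough illustration of generating incorrect pronunciations, the sketch below substitutes one phone at a time using a confusion table. The table `CONFUSIONS`, the function names and the scores are all hypothetical; in a real system the scores would come from the recognizer, not from a hand-written dictionary.

```python
# Hypothetical table of phones that learners commonly substitute.
CONFUSIONS = {"TH": ["S", "T"], "R": ["L"], "V": ["W"]}


def mispronunciations(phones, confusions=CONFUSIONS):
    """Return variants of `phones` with exactly one phone substituted."""
    variants = []
    for index, phone in enumerate(phones):
        for wrong in confusions.get(phone, []):
            variants.append(phones[:index] + [wrong] + phones[index + 1:])
    return variants


def best_match(scores):
    """Return the pronunciation (a tuple of phones) with the highest score."""
    return max(scores, key=scores.get)


correct = ["TH", "R", "IY"]  # "three"
for variant in mispronunciations(correct):
    print(" ".join(variant))

# Made-up scores standing in for what a recognizer would report.
scores = {("TH", "R", "IY"): -120.0, ("S", "R", "IY"): -95.5}
print(best_match(scores))
```

If the best-scoring pronunciation is one of the generated variants rather than the correct one, the framework can report which phone was mispronounced.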

For command and control, it was popular for a long time to use a finite state grammar. Unfortunately, we can't recommend that nowadays. It's much better to employ a medium-vocabulary recognizer with a semantic analysis framework on top of it, which improves the user experience and lets users speak more or less natural language. There is no sense in starting with a finite state grammar right now. For more details on the semantic architecture, look at the Olympus project. Dialog systems will require a user feedback framework as well.
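To give a feeling for what a semantic layer on top of the recognizer does, here is a minimal sketch that maps a recognized sentence to an intent with keyword patterns. The intents and patterns are hypothetical, and a real semantic analysis framework is considerably richer than this.

```python
import re

# Hypothetical intents; each pattern tolerates extra words around the command.
INTENT_PATTERNS = [
    ("lights_on", re.compile(r"\b(turn|switch) on\b.*\blights?\b|\blights? on\b")),
    ("lights_off", re.compile(r"\b(turn|switch) off\b.*\blights?\b|\blights? off\b")),
    ("play_music", re.compile(r"\bplay\b.*\b(music|song)\b")),
]


def interpret(hypothesis):
    """Return the first intent whose pattern matches the recognized text."""
    text = hypothesis.lower()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return None  # nothing matched; a dialog system would ask the user again


print(interpret("could you please turn on the lights"))
print(interpret("what time is it"))
```

Unlike a finite state grammar, this accepts many phrasings of the same command, because the recognizer is free to output any sentence from its vocabulary.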

Voice search, semantic analysis and translation need to be built on top of the lattices generated by the engine. You need to take the lattices with confidence scores and feed them into the upper levels, such as a translation engine.

For open vocabulary recognition, such as recognition of names and places, you will need a subword language model.

Text alignment, such as caption synchronization, will require you to build a specialized language model from the reference text to restrict the search.
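The idea of restricting the search with the reference text can be sketched as follows: collect, for every word, the set of words that may follow it in the reference. This is only an illustration. The function name `allowed_successors` and the `<s>` and `</s>` boundary markers are assumptions, and a real aligner would use a weighted language model or grammar rather than plain sets.

```python
from collections import defaultdict


def allowed_successors(reference_text):
    """Map each word to the set of words that follow it in the reference."""
    words = ["<s>"] + reference_text.lower().split() + ["</s>"]
    successors = defaultdict(set)
    for previous, following in zip(words, words[1:]):
        successors[previous].add(following)
    return dict(successors)


model = allowed_successors("the cat saw the dog")
for word in sorted(model):
    print(word, sorted(model[word]))
```

A decoder restricted this way can only follow word sequences that occur in the reference, which is what makes alignment of long recordings tractable.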

Existing accuracy figures

For most of the tasks above there are published accuracy results. You can find them once you identify the task. Those results may or may not be good enough for your users. You might count on beating the published figures, but that is unlikely to happen quickly.

For example, the broadcast news recognition task is done with a word error rate of 20-25%. If that's not enough for your application, you probably need to consider modifying the application. You might add a hand-correction step or a preliminary adaptation step to improve accuracy. If accuracy is still not sufficient after that, it's probably better to think about whether you need speech at all. There are other, more reliable interfaces you could use.
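Published figures for recognition tasks are usually given as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch of that computation, assuming whitespace-separated words and a hypothetical function name `word_error_rate`:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("hi that's joe", "hi it's joe"))
```

Running this over your own test recordings gives a number you can compare directly with the published results for a similar task.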

For example, though ASR-based IVR systems are fancy and handy, many people still prefer DTMF systems, web-based forms or just email to contact a company. Remember that you need an effective interface, not a fancy one.

Resources

The next issue to consider is the availability of speech material for training, testing and optimizing the system. You need to find out which resources are available to you.

The test set is a critical issue for any speech recognition application. It should be representative enough both acoustically and in terms of language. But the test set doesn't necessarily have to be large; you can spend 10 minutes to create a good one. It might consist of sample recordings you make yourself.

For the training set and models, you should check the resources that already exist. The increasing interest in speech technology leads people to contribute models for their native languages. In general, you'll have to collect audio material for the specific language. Actually, that's not so complicated to do. Audio books, movies and podcasts provide enough recordings to build a very good acoustic model with little effort.

To build a phonetic dictionary, you can use an existing TTS synthesizer; nowadays they cover a lot of languages. You can also bootstrap a dictionary by hand and then extend it with machine learning tools.
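When bootstrapping a dictionary by hand, it helps to know which words in your texts still lack a pronunciation. The sketch below assumes a plain-text dictionary layout with one word per line followed by its space-separated phones; the function names and the seed entries are hypothetical.

```python
def missing_words(corpus_words, dictionary):
    """Return the sorted words from the corpus that have no pronunciation yet."""
    return sorted({word.lower() for word in corpus_words} - set(dictionary))


def format_dictionary(dictionary):
    """Render entries as lines of the form 'word PHONE PHONE ...'."""
    return "\n".join(
        f"{word} {' '.join(phones)}" for word, phones in sorted(dictionary.items())
    )


# Hand-made seed entries (hypothetical example).
seed = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

print(format_dictionary(seed))
print(missing_words(["Hello", "speech", "world", "speech"], seed))
```

The list of missing words is what you would then transcribe by hand or pass to a trained grapheme-to-phoneme tool.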

For language models you'll have to find a lot of text for your domain. It might be textbooks, already transcribed recordings or other sources such as website content crawled from the web.
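Collected text usually has to be cleaned before it can be used for a language model. The sketch below shows one possible normalization for English text: lowercase everything and keep only letters and apostrophes. The function name `normalize_line` is an assumption, and dropping digits is a simplification; a real pipeline would more likely spell numbers out as words.

```python
import re


def normalize_line(line):
    """Lowercase a line and keep only letters, apostrophes and single spaces."""
    line = line.lower()
    # Replace every run of other characters (punctuation, digits) with a space.
    line = re.sub(r"[^a-z' ]+", " ", line)
    # Collapse repeated whitespace and trim both ends.
    return " ".join(line.split())


print(normalize_line("Hi, it's Sandy!  Let's meet at 5pm."))
```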

Technologies

The third thing to consider is the set of particular technologies you will build on. Although CMUSphinx tries to provide a more or less complete program suite for developing speech applications, you'll sometimes need to use other packages, programming languages or tools. You need to decide for yourself whether you are going to continue with Java, C or any of the scripting languages CMUSphinx supports. The rule for choosing between sphinx4 and pocketsphinx is the following:

  • Need speed or portability → use pocketsphinx
  • Need flexibility and manageability → use sphinx4

Although people often ask which is more accurate, sphinx4 or pocketsphinx, you shouldn't bother with this question at all. Accuracy is not the argument here. Both sphinx4 and pocketsphinx provide acceptable accuracy, and even then it depends on many factors, not just the engine. The engine is just one part of a system that should include many more components. If we are talking about a large vocabulary decoder, there must be a diarization framework, an adaptation framework and a postprocessing framework. They all need to cooperate somehow. The flexibility of sphinx4 allows you to build such a system quickly. It's easy to embed sphinx4 into a Flash server such as Red5 to provide web-based recognition, and it's easy to manage many sphinx4 instances doing large-scale decoding on a cluster.

On the other hand, if your system needs to be efficient and reasonably accurate, if you are running on an embedded device, or if you are interested in using the recognizer with some exotic language like Erlang, pocketsphinx is your choice. It's very hard to integrate Java with languages not supported by the JVM; pocketsphinx is much better here.

The next thing you need to consider is the choice of development platform. If you are bound to a particular one, that's an easy question for you. If you can choose, we highly recommend GNU/Linux as a development platform. We can help you with Windows or Mac issues, but there are no guarantees; our main development platform is Linux. For many tasks you'll need to run complex scripts written in Perl or Python. On Windows that might be problematic.

Got it? Let's start! The next section describes the process of creating a sample application with either sphinx4 or pocketsphinx. Choose the right one.
