The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training
data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction.
This project proposes to extend Mahout's Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework for enhanced model fitting over large data sets.

Detailed Description:

Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made
them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging.

A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple
(A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose
elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are
unobservable and follow the Markov property of memorylessness.

Rabiner [1] defined three main problems for HMMs:

1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called
Forward algorithm.

2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict
the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.

3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting
to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the Baum-Welch training methods, the
Baum-Welch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons:

(1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.


(2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map-Reduce framework
[2], such as the existing Map Reduce implementation of k-means in Mahout.

Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward
algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in
[3] by viewing it as a specific case of the Expectation-Maximization algorithm and will be followed for implementation in this project.

The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization
is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence
condition expressed as a fixpoint or when the number of iterations exceeds a user defined value.

Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition
and emission counts using the Forward and Backward algorithms. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training
instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hash-map:

(1) Expected number of times a hidden state is reached (Pi).

(2) Number of times each observable symbol is generated by each hidden state (B).

(3) Number of transitions between each pair of states in the hidden state space (A).

The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary
to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters
by each reducer are written to the HDFS and are fed to the mappers in the next iteration.

The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed,
unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. Since k-means is also an EM algorithm, particular attention will be paid to its code at each step for possible reuse.

A list of milestones, associated deliverable and high level implementation details is given below.

Time-line: April 26 - Aug 15.

Milestones:

April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize
with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.

May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.

June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration.
The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.

July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from
the HmmTrainer.trainBaumWelch() method if possible. Also, mid-term review.

July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization
problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.

July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW
[4]. Tidy up code and fix loose ends, finish wiki documentation.

Additional Information:

I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory
of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank
[5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in
[6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem
[7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner.


I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer
mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow
me to contribute within my modest means to the overall spirit of open source programming and Machine Learning.

References:

[1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

[2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

[3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

[4] http://flexcrfs.sourceforge.net/#Case_Study

[5] http://www.rcsb.org/pdb/home/home.do

[6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

[7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.

MR for Baum-Welch algorithm的更多相关文章

  1. Baum–Welch algorithm

    Baum–Welch algorithm 世界上只有一个巴菲特,也只有一家文艺复兴科技公司_搜狐财经_搜狐网 http://www.sohu.com/a/157018893_649112

  2. Baum Welch估计HMM参数实例

    Baum Welch估计HMM参数实例 下面的例子来自于<What is the expectation maximization algorithm?> 题面是:假设你有两枚硬币A与B, ...

  3. [ML] Concept Learning

    Candidate Elimination Thanks for Sanketh Vedula. This is a good demo to understand candidate elimina ...

  4. [Scikit-learn] Dynamic Bayesian Network - HMM

    Warning The sklearn.hmm module has now been deprecated due to it no longer matching the scope and th ...

  5. HMM基础

    一.HMM建模 HMM参数: 二.HMM的3个假设 (一)马尔科夫假设 (二)观测独立性假设 (三)不变性假设 转移矩阵A不随时间变化 三.HMM的3个问题 (一)概率计算/评估---likeliho ...

  6. [转]语音识别中区分性训练(Discriminative Training)和最大似然估计(ML)的区别

    转:http://blog.sina.com.cn/s/blog_66f725ba0101bw8i.html 关于语音识别的声学模型训练方法已经是比较成熟的方法,一般企业或者研究机构会采用HTK工具包 ...

  7. 转Generative Model 与 Discriminative Model

    没有完全看懂,以后再看,特别是hmm,CRF那里,以及生成模型产生的数据是序列还是一个值,hmm应该是序列,和图像的关系是什么. [摘要]    - 生成模型(Generative Model) :无 ...

  8. Generative Model 与 Discriminative Model

      [摘要]    - 生成模型(Generative Model) :无穷样本==>概率密度模型 = 产生模型==>预测    - 判别模型(Discriminative Model): ...

  9. 生成模型(Generative Model)和 判别模型(Discriminative Model)

    引入 监督学习的任务就是学习一个模型(或者得到一个目标函数),应用这一模型,对给定的输入预测相应的输出.这一模型的一般形式为一个决策函数Y=f(X),或者条件概率分布P(Y|X). 监督学习方法又可以 ...

  10. 生成模型(Generative Model)Vs 判别模型(Discriminative Model)

      概率图分为有向图(bayesian network)与无向图(markov random filed).在概率图上可以建立生成模型或判别模型.有向图多为生成模型,无向图多为判别模型. 判别模型(D ...

随机推荐

  1. Java之继承深刻理解

    1.关于私有成员变量 无论父类中的成员变量是私有的.共有的.还是其它类型的,子类都会拥有父类中的这些成员变量.但是父类中的私有成员变量,无法在子类中直接访问,必须通过从父类中继承得到的protecte ...

  2. springMVC+Hibernate4+Spring整合一(配置文件部分)

    本实例采用springMvc hibernate 与 spring 进行整合, 用springmvc 取代了原先ssh(struts,spring,hibernate)中的struts来扮演view层 ...

  3. T-SQL动态查询(2)——关键字查询

    接上文:T-SQL动态查询(1)--简介 前言: 在开发功能的过程中,我们常常会遇到类似以下情景:应用程序有一个查询功能,允许用户在很多查询条件中选择所需条件.这个也是本系列的关注点. 但是有时候你也 ...

  4. Struts 2 标签库

    <s:if>标签 拥有一个test属性,其表达式的值用来决定标签里内容是否显示 <s:if test="#request.username=='clf'"> ...

  5. JavaScript与jQuery获取相邻控件

    原始代码如下,需求是onclick中的OpenIframe方法捕捉到input中的value值,由于某些限制无法使用正常的操作dom根据name值来取,所以决定通过相邻空间的方式获取 <div& ...

  6. Linux设备驱动编程---miscdevice杂类设备的使用方法

    miscdev简称杂类设备杂类设备就是对字符设备驱动做一个封装,方便简单使用杂类设备封装字符设备需要包含的头文件:#include <linux/miscdevice.h>(1)杂类设备的 ...

  7. javascript 下拉列表 自动取值 无需value

    <select id="applyType" name="$!{status.expression}" class="inp" onc ...

  8. 【unix网络编程第三版】阅读笔记(五):I/O复用:select和poll函数

    本博文主要针对UNP一书中的第六章内容来聊聊I/O复用技术以及其在网络编程中的实现 1. I/O复用技术 I/O多路复用是指内核一旦发现进程指定的一个或者多个I/O条件准备就绪,它就通知该进程.I/O ...

  9. 基于表单数据的封装,泛型,反射以及使用BeanUtils进行处理

    在Java Web开发过程中,会遇到很多的表单数据的提交和对表单数据的处理.而每次都需要对这些数据的字段进行一个一个的处理就显得尤为繁琐,在Java语言中,面向对象的存在目的便是为了消除重复代码,减少 ...

  10. (NO.00003)iOS游戏简单的机器人投射游戏成形记(二十)

    接上一篇文章,我们现在来实现篮框的感应器. 所谓感应器,就是在物体接触到的时候做出反应的节点.我们需要将感应器放在篮框底部,这样子弹接触感应器的时候,我们就知道子弹坠入了篮框,从而得分. 为了放置子弹 ...