Abstract – In many practical data mining applications such as web page classification, unlabeled training examples are readily available but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms such as co-training have attracted much attention. In this paper, a new co-training style semi-supervised learning algorithm named tri-training is proposed. This algorithm generates three classifiers from the original labeled example set. These classifiers are then refined using unlabeled examples in the tri-training process. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. Since tri-training neither requires that the instance space be described with sufficient and redundant views nor puts any constraints on the supervised learning algorithm, its applicability is broader than that of previous co-training style algorithms. Experiments on UCI data sets and application to the web page classification task indicate that tri-training can effectively exploit unlabeled data to improve learning performance.

Index Terms – Data Mining, Machine Learning, Learning from Unlabeled Data, Semi-supervised Learning, Co-training, Tri-training, Web Page Classification

I. INTRODUCTION

IN many practical data mining applications such as web page classification, unlabeled training examples are readily available but labeled ones are fairly expensive to obtain because they require human effort. Therefore, semi-supervised learning that exploits unlabeled examples in addition to labeled ones has become a hot topic.

Many current semi-supervised learning algorithms use a generative model for the classifier and employ Expectation Maximization (EM) to model the label estimation or parameter estimation process. For example, mixture of Gaussians, mixture of experts, and naive Bayes have been respectively used as the generative model, while EM is used to combine labeled and unlabeled data for classification. There are also many other algorithms such as using transductive inference for support vector machines to optimize performance on a specific test set, constructing a graph on the examples such that minimum cut on the graph yields an optimal labeling of the unlabeled examples according to certain optimization functions, etc.

A prominent achievement in this area is the co-training paradigm proposed by Blum and Mitchell, which trains two classifiers separately on two different views, i.e. two independent sets of attributes, and uses the predictions of each classifier on unlabeled examples to augment the training set of the other. This idea of utilizing the natural redundancy in the attributes has been employed in a number of other works. For example, Yarowsky performed word sense disambiguation by constructing a sense classifier using the local context of the word and a classifier based on the senses of other occurrences of that word in the same document; Riloff and Jones classified noun phrases for geographic locations by considering both the noun phrase itself and the linguistic context in which the noun phrase appears; Collins and Singer performed named entity classification using both the spelling of the entity itself and the context in which the entity occurs. It is noteworthy that the co-training paradigm has already been used in many domains such as statistical parsing and noun phrase identification.

The standard co-training algorithm requires two sufficient and redundant views, that is, the attributes must be naturally partitioned into two sets, each of which is sufficient for learning and conditionally independent of the other given the class label. Dasgupta et al. have shown that when this requirement is met, the co-trained classifiers can make fewer generalization errors by maximizing their agreement over the unlabeled data. Unfortunately, such a requirement can hardly be met in most scenarios. Goldman and Zhou proposed an algorithm which does not exploit attribute partition. However, it requires two different supervised learning algorithms that partition the instance space into a set of equivalence classes, and it employs a time-consuming cross-validation technique to determine how to label the unlabeled examples and how to produce the final hypothesis.

In this paper, a new co-training style algorithm named tri-training is proposed. Tri-training requires neither sufficient and redundant views nor the use of different supervised learning algorithms whose hypotheses partition the instance space into a set of equivalence classes. Therefore it can be easily applied to common data mining scenarios. In contrast to previous algorithms that utilize two classifiers, tri-training uses three classifiers. This setting tackles the problem of determining how to label the unlabeled examples and how to produce the final hypothesis, which contributes much to the efficiency of the algorithm. Moreover, better generalization ability can be achieved by combining these three classifiers. Experiments on UCI data sets and application to the web page classification task show that tri-training can effectively exploit unlabeled data, and that the generalization ability of its final hypothesis is quite good, sometimes even outperforming that of an ensemble of the three classifiers provided with the labels of all the unlabeled examples.

II. TRI-TRAINING

Let L denote the labeled example set with size |L| and U denote the unlabeled example set with size |U|. In previous co-training style algorithms, two classifiers are initially trained from L, each of which is then re-trained with the help of unlabeled examples that are labeled by the latest version of the other classifier. In order to determine which example in U should be labeled and which classifier should be biased in prediction, the confidence of the labeling of each classifier must be explicitly measured. Sometimes such a measuring process is quite time-consuming.

Assume that besides these two classifiers, i.e. h1 and h2, a classifier h3 is initially trained from L. Then, for any classifier, an unlabeled example can be labeled for it as long as the other two classifiers agree on the labeling of this example, while the confidence of the labeling of the classifiers need not be explicitly measured. For instance, if h1 and h2 agree on the labeling of an example x in U, then x can be labeled for h3. It is obvious that in such a scheme h3 receives a valid new example if the predictions of h1 and h2 on x are correct; otherwise h3 will get an example with a noisy label. However, even in the worst case, the increase in the classification noise rate can be compensated if the amount of newly labeled examples is sufficient, under certain conditions, as shown below.
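To make the labeling rule concrete, the following is a minimal sketch of the agreement-based step; the scikit-learn-style predict interface and all variable names are assumptions made for illustration, not notation from the paper.

```python
import numpy as np

def label_by_agreement(h_j, h_k, U_X):
    """Return the unlabeled examples on which h_j and h_k agree, together with
    the agreed-upon labels; these can then be (tentatively) added to the
    training set of the third classifier."""
    pred_j = h_j.predict(U_X)
    pred_k = h_k.predict(U_X)
    agree = pred_j == pred_k            # boolean mask of agreed examples
    return U_X[agree], pred_j[agree]    # some of these labels may be noisy
```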

Inspired by Goldman and Zhou, the finding of Angluin and Laird is used in the following analysis. That is, if a sequence σ of m samples is drawn, where the sample size m satisfies Eq. 1:

m \ge \frac{2}{\varepsilon^{2}(1-2\eta)^{2}} \ln\frac{2N}{\delta} \qquad (1)

where ε is the hypothesis worst-case classification error rate, η (< 0.5) is an upper bound on the classification noise rate, N is the number of hypotheses, and δ is the confidence, then a hypothesis H_i that minimizes disagreement with σ will have the PAC property:

\Pr\bigl[\, d(H_i, H^{*}) \ge \varepsilon \,\bigr] \le \delta \qquad (2)

where d(·,·) is the sum over the probability of elements from the symmetric difference between the two hypothesis sets H_i and H* (the ground-truth). Let c = 2 ln(2N/δ) and let m be the value that makes Eq. 1 hold with equality; then Eq. 1 becomes Eq. 3:

m = \frac{c}{\varepsilon^{2}(1-2\eta)^{2}} \qquad (3)

To simplify the computation, it is helpful to compute the quotient of the constant c divided by the square of the error:

u = \frac{c}{\varepsilon^{2}} = m(1-2\eta)^{2} \qquad (4)
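As a concrete illustration of Eq. 4 (the constant c and the numbers below are invented solely for this example), the bound can be rearranged as ε = sqrt(c / (m(1 − 2η)²)), so the worst-case error shrinks whenever m(1 − 2η)² grows, even if the noise rate η itself increases:

```python
import math

def error_bound(m, eta, c=1.0):
    # Worst-case error implied by Eq. 4: u = c / eps^2 = m * (1 - 2*eta)^2
    return math.sqrt(c / (m * (1.0 - 2.0 * eta) ** 2))

# More (noisier) pseudo-labeled examples can still tighten the bound,
# provided m * (1 - 2*eta)^2 increases from one round to the next:
print(error_bound(m=100, eta=0.10))   # ~0.125
print(error_bound(m=300, eta=0.20))   # ~0.096, despite the higher noise rate
```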

In each round of tri-training, the classifiers h2 and h3 choose some examples in U to label for h1. Since the classifiers are refined in the tri-training process, the amount as well as the concrete unlabeled examples chosen to label may be different in different rounds. Let L_t and L_{t-1} denote the set of examples that are labeled for h1 in the t-th round and the (t-1)-th round, respectively. Then the training sets for h1 in the t-th round and the (t-1)-th round are respectively L ∪ L_t and L ∪ L_{t-1}, whose sizes m_t and m_{t-1} are |L ∪ L_t| and |L ∪ L_{t-1}|, respectively. Note that the unlabeled examples labeled in the (t-1)-th round, i.e. L_{t-1}, won't be put into the original labeled example set L. Instead, in the t-th round all the examples in L_{t-1} will be regarded as unlabeled and put into U again.
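A small sketch of how the training set of one classifier can be assembled in round t follows; the array names are assumptions (L_X, L_y hold the original labeled examples and Lt_X, Lt_y the examples newly labeled for this classifier in the current round, which are returned to the unlabeled pool before the next round).

```python
import numpy as np

def round_training_set(L_X, L_y, Lt_X, Lt_y):
    # Training set for round t: the original labeled set L plus L_t only;
    # L is never permanently enlarged, and L_{t-1} is discarded each round.
    X = np.vstack([L_X, Lt_X])
    y = np.concatenate([L_y, Lt_y])
    return X, y   # size m_t = |L ∪ L_t|
```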

Let η_L denote the classification noise rate of L, that is, the number of examples in L that are mislabeled is η_L |L|. Let ê^t denote the upper bound of the classification error rate of h2 & h3 in the t-th round, i.e. the error rate of the hypothesis derived from the combination of h2 and h3. Assuming there are z examples on which the classification made by h2 agrees with that made by h3, and among these examples both h2 and h3 make correct classifications on z' examples, then ê^t can be estimated as (z − z')/z. Thus, the number of examples in L_t that are mislabeled is ê^t |L_t|. Therefore the classification noise rate in the t-th round is:

\eta_t = \frac{\eta_L |L| + \hat{e}^{t} |L_t|}{|L| + |L_t|} \qquad (5)

Then, according to Eq. 4, u_t can be computed as:

u_t = \frac{c}{\varepsilon_t^{2}} = m_t (1-2\eta_t)^{2} = \bigl(|L| + |L_t|\bigr)\Bigl(1 - \frac{2\bigl(\eta_L |L| + \hat{e}^{t} |L_t|\bigr)}{|L| + |L_t|}\Bigr)^{2} \qquad (6)
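These estimates translate directly into a few lines of code; the following hedged sketch simply transcribes the formulas above (all names and the numbers in the usage example are invented for illustration):

```python
def combined_error(z, z_correct):
    # ê^t = (z - z') / z: error of the combined hypothesis, estimated on the
    # z examples where the two labeling classifiers agree
    return (z - z_correct) / z

def noise_rate(eta_L, size_L, e_hat, size_Lt):
    # Eq. 5: classification noise rate of L ∪ L_t in the t-th round
    return (eta_L * size_L + e_hat * size_Lt) / (size_L + size_Lt)

def u_value(size_L, size_Lt, eta_t):
    # Eq. 6: u_t = m_t * (1 - 2*eta_t)^2; a larger u_t means a smaller error bound
    m_t = size_L + size_Lt
    return m_t * (1.0 - 2.0 * eta_t) ** 2

e_hat = combined_error(z=200, z_correct=180)                       # 0.10
eta_t = noise_rate(eta_L=0.0, size_L=50, e_hat=e_hat, size_Lt=200) # 0.08
print(eta_t, u_value(50, 200, eta_t))                              # 0.08, ~176.4
```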

The pseudo-code of tri-training is presented in Table I. The error-estimating function attempts to estimate the classification error rate of the hypothesis derived from the combination of h_j and h_k. Since it is difficult to estimate the classification error on the unlabeled examples, here only the original labeled examples are used, heuristically based on the assumption that the unlabeled examples hold the same distribution as that held by the labeled ones. In detail, the classification error of the hypothesis is approximated by dividing the number of labeled examples on which both h_j and h_k make an incorrect classification by the number of labeled examples on which the classification made by h_j is the same as that made by h_k. The subsampling function randomly removes |L_t| − s examples from L_t, where s is computed according to Eq. 10.
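Since Table I itself is not reproduced here, the following is only a hedged sketch of the two helper routines described above, assuming scikit-learn-style classifiers and numpy arrays; the fallback value used when the two classifiers never agree is an assumption of this sketch, not of the paper.

```python
import numpy as np

def measure_error(h_j, h_k, L_X, L_y):
    # Error of the combined hypothesis h_j & h_k, approximated on the original
    # labeled set: (# examples both misclassify) / (# examples on which they agree)
    pred_j = h_j.predict(L_X)
    pred_k = h_k.predict(L_X)
    agree = pred_j == pred_k
    if agree.sum() == 0:
        return 0.5                      # no agreement at all: conservative fallback
    both_wrong = agree & (pred_j != L_y)
    return both_wrong.sum() / agree.sum()

def subsample(Lt_X, Lt_y, s, rng=None):
    # Randomly keep only s of the newly labeled examples, i.e. remove |L_t| - s of them
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(Lt_y), size=s, replace=False)
    return Lt_X[idx], Lt_y[idx]
```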

It is noteworthy that the initial classifiers in tri-training should be diverse, because if all the classifiers are identical, then for any of these classifiers, the unlabeled examples labeled by the other two classifiers will be the same as those labeled by the classifier for itself. Thus, tri-training degenerates to self-training with a single classifier. In the standard co-training algorithm, the use of sufficient and redundant views enables the classifiers to be different. In fact, previous research has shown that even when there is no natural attribute partition, if there is sufficient redundancy among the attributes then a fairly reasonable attribute partition will enable co-training to exhibit advantages. In the extended co-training algorithm which does not require sufficient and redundant views, the diversity among the classifiers is achieved through using different supervised learning algorithms. Since the tri-training algorithm assumes neither sufficient and redundant views nor different supervised learning algorithms, the diversity of the classifiers has to be sought from other channels. Indeed, here the diversity is obtained through manipulating the original labeled example set. In detail, the initial classifiers are trained from data sets generated via bootstrap sampling from the original labeled example set. These classifiers are then refined in the tri-training process, and the final hypothesis is produced via majority voting. The generation of the initial classifiers thus resembles training an ensemble from the labeled example set with a popular ensemble learning algorithm, namely Bagging.
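A hedged sketch of this Bagging-like initialization and of the majority-vote prediction follows; the use of scikit-learn's DecisionTreeClassifier as the base learner and of non-negative integer class labels are assumptions made only for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def init_classifiers(L_X, L_y, n=3, seed=0):
    # Train the initial classifiers on bootstrap samples of the labeled set,
    # so they are diverse even though they share the same learning algorithm.
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(n):
        idx = rng.choice(len(L_y), size=len(L_y), replace=True)   # bootstrap sample
        classifiers.append(DecisionTreeClassifier().fit(L_X[idx], L_y[idx]))
    return classifiers

def majority_vote(classifiers, X):
    # Final hypothesis: majority vote over the refined classifiers.
    preds = np.stack([h.predict(X) for h in classifiers])           # shape (3, n_samples)
    vote = lambda col: np.bincount(col.astype(int)).argmax()        # assumes non-negative integer labels
    return np.apply_along_axis(vote, 0, preds)
```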

Tri-training can be regarded as a new extension to the co-training algorithms. As mentioned before, Blum and Mitchell's algorithm requires that the instance space be described by two sufficient and redundant views, which can hardly be satisfied in common data mining scenarios. Since tri-training does not rely on different views, its applicability is broader. Goldman and Zhou's algorithm does not rely on different views either. However, their algorithm requires two different supervised learning algorithms that partition the instance space into a set of equivalence classes. Moreover, their algorithm frequently uses 10-fold cross validation on the original labeled example set to determine how to label the unlabeled examples and how to produce the final hypothesis. If the original labeled example set is rather small, cross validation will exhibit high variance and is not helpful for model selection. Also, the frequently used cross validation makes the learning process time-consuming. Since tri-training neither puts any constraint on the supervised learning algorithm nor employs time-consuming cross validation processes, both its applicability and efficiency are better.
