Weka: Calling the EM Algorithm for Clustering

EM algorithm:

In Eclipse, write code that reads the data file, then calls the EM algorithm and prints the result:

package EMAlg;

import java.io.*;
import weka.core.*;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.clusterers.*;

public class EMAlg {

    public EMAlg() {
        System.out.println("this is the EMAlg");
    }

    public static void main(String[] args) throws Exception {
        // Path to the sample data set shipped with Weka.
        String file = "C:\\Program Files/DataMining/Weka-3-6-10/data/labor.arff";
        Instances data;
        // Read the ARFF file; the reader is closed as soon as loading finishes.
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            data = new Instances(reader);
        }
        data.setClassIndex(data.numAttributes() - 1); // use the last attribute as the class attribute
        Remove filter = new Remove();
        System.out.println("Output of data.classIndex(): " + data.classIndex());
        System.out.println("Number of attributes read: " + data.numAttributes() + ".");
        // classIndex() is 0-based, while Remove expects 1-based indices, hence the +1.
        filter.setAttributeIndices("" + (data.classIndex() + 1));

        /*
         * filter.setAttributeIndices(String rangeList)
         * Sets which attributes are to be deleted (or kept if invert is true).
         * Parameters:
         *     rangeList - a string representing the list of attributes. Since the string
         *     will typically come from a user, attributes are indexed from 1.
         *     e.g.: first-3,5,6-last
         */

        filter.setInputFormat(data);
        /*
         * public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
         * Sets the format of the input instances. If the filter is able to determine the
         * output format before seeing any input instances, it does so here.
         * This default implementation clears the output format and output queue, and the
         * new batch flag is set.
         * Overriders should call super.setInputFormat(Instances).
         * Parameters:
         *     instanceInfo - an Instances object containing the input instance structure
         *     (any instances contained in the object are ignored - only the structure is required).
         * Returns:
         *     true if the outputFormat may be collected immediately
         * Throws:
         *     java.lang.Exception - if the inputFormat can't be set successfully
         */

        Instances dataCluster = Filter.useFilter(data, filter);
        /*
         * public static Instances useFilter(Instances data, Filter filter) throws java.lang.Exception
         * Filters an entire set of instances through a filter and returns the new set.
         * Parameters:
         *     data - the data to be filtered
         *     filter - the filter to be used
         * Returns:
         *     the filtered set of data
         * Throws:
         *     java.lang.Exception - if the filter can't be used successfully
         */

        EM clusterer = new EM();
        /*
         * public class EM
         * extends RandomizableDensityBasedClusterer
         * implements NumberOfClustersRequestable, WeightedInstancesHandler
         *
         * Simple EM (expectation maximisation) class.
         * EM assigns a probability distribution to each instance which indicates the
         * probability of it belonging to each of the clusters. EM can decide how many
         * clusters to create by cross validation, or you may specify a priori how many
         * clusters to generate.
         * The cross validation performed to determine the number of clusters is done in
         * the following steps:
         * 1. the number of clusters is set to 1
         * 2. the training set is split randomly into 10 folds.
         * 3. EM is performed 10 times using the 10 folds the usual CV way.
         * 4. the loglikelihood is averaged over all 10 results.
         * 5. if loglikelihood has increased the number of clusters is increased by 1 and
         *    the program continues at step 2.
         * The number of folds is fixed to 10, as long as the number of instances in the
         * training set is not smaller than 10. If this is the case the number of folds is
         * set equal to the number of instances.
         *
         * Valid options are:
         * -N <num>
         *     number of clusters. If omitted or -1 specified, then cross validation is
         *     used to select the number of clusters.
         * -I <num>
         *     max iterations. (default 100)
         * -V
         *     verbose.
         * -M <num>
         *     minimum allowable standard deviation for normal density computation
         *     (default 1e-6)
         * -O
         *     Display model in old format (good when there are many clusters)
         * -S <num>
         *     Random number seed. (default 100)
         */

        String[] options = new String[4];
        // maximum number of iterations
        options[0] = "-I";
        options[1] = "100";
        // number of clusters
        options[2] = "-N";
        options[3] = "2";
        clusterer.setOptions(options);
        // Build the clusterer on the data with the class attribute removed.
        clusterer.buildClusterer(dataCluster);
        // Evaluate the clusterer. data still has its class index set, so the
        // evaluation also reports a classes-to-clusters comparison.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(clusterer);
        eval.evaluateClusterer(data);
        // print results
        System.out.println("Number of instances: " + data.numInstances()
                + ", number of attributes: " + data.numAttributes());
        System.out.println(eval.clusterResultsToString());
    }
}
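
For intuition about what an EM clusterer does on numeric attributes, here is a minimal plain-Java sketch of EM for a two-component, one-dimensional Gaussian mixture. It is not Weka's implementation: the class name EmSketch, the method fit, the initialisation at the smallest and largest value, and the toy data are all assumptions made for illustration. The E-step computes, for each instance, the probability of belonging to each cluster; the M-step re-estimates each cluster's weight, mean and standard deviation from those probabilities.

```java
public class EmSketch {

    /** Density of a normal distribution with the given mean and standard deviation. */
    static double gaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    /**
     * Fits a two-component 1-D Gaussian mixture by EM.
     * Returns {weight0, mean0, sd0, weight1, mean1, sd1}.
     */
    static double[] fit(double[] x, int maxIterations) {
        // Crude initialisation (an assumption of this sketch):
        // start the two means at the smallest and largest value.
        double min = x[0], max = x[0];
        for (double v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] weight = {0.5, 0.5};
        double[] mean = {min, max};
        double[] sd = {1.0, 1.0};
        double[][] resp = new double[x.length][2];

        for (int iter = 0; iter < maxIterations; iter++) {
            // E-step: probability of each cluster given each instance.
            for (int i = 0; i < x.length; i++) {
                double p0 = weight[0] * gaussian(x[i], mean[0], sd[0]);
                double p1 = weight[1] * gaussian(x[i], mean[1], sd[1]);
                resp[i][0] = p0 / (p0 + p1);
                resp[i][1] = p1 / (p0 + p1);
            }
            // M-step: re-estimate the parameters of each cluster.
            for (int k = 0; k < 2; k++) {
                double sum = 0, sumX = 0;
                for (int i = 0; i < x.length; i++) {
                    sum += resp[i][k];
                    sumX += resp[i][k] * x[i];
                }
                mean[k] = sumX / sum;
                double sumSq = 0;
                for (int i = 0; i < x.length; i++) {
                    double d = x[i] - mean[k];
                    sumSq += resp[i][k] * d * d;
                }
                // Floor the standard deviation, in the spirit of a minimum-std-dev setting.
                sd[k] = Math.max(Math.sqrt(sumSq / sum), 1e-6);
                weight[k] = sum / x.length;
            }
        }
        return new double[] {weight[0], mean[0], sd[0], weight[1], mean[1], sd[1]};
    }

    public static void main(String[] args) {
        // Made-up toy data: five values near 1 and five values near 5.
        double[] x = {1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9};
        double[] p = fit(x, 50);
        System.out.println("cluster 0: weight=" + p[0] + " mean=" + p[1] + " sd=" + p[2]);
        System.out.println("cluster 1: weight=" + p[3] + " mean=" + p[4] + " sd=" + p[5]);
    }
}
```

The weights play the same role as the figures printed in parentheses under each cluster number in Weka's output, and the per-cluster mean and standard deviation correspond to the rows shown there for numeric attributes.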

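The Weka program above reads the labor.arff file from the data folder of the Weka installation directory.

Its output is:

Output of data.classIndex(): 16
Number of attributes read: 17.
Number of instances: 57, number of attributes: 17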

EM
==

Number of clusters: 2

                                 Cluster
Attribute                              0       1
                                  (0.14)  (0.86)
=================================================
duration
  mean                             1.5702  2.2532
  std. dev.                        0.4953  0.6764

wage-increase-first-year
  mean                             3.0708  3.9184
  std. dev.                        1.0028  1.3571

wage-increase-second-year
  mean                             3.8141  3.9964
  std. dev.                        0.8153  1.0624

wage-increase-third-year
  mean                             3.9133  3.9133
  std. dev.                        0.6522  0.6952

cost-of-living-adjustment
  none                             7.0614 36.9386
  tcf                              1.3707  8.6293
  tc                               2.2872  6.7128
  [total]                         10.7192 52.2808

working-hours
  mean                            39.4412 37.8196
  std. dev.                        0.8911  2.4268

pension
  none                             6.4515  6.5485
  ret_allw                         2.3211  3.6789
  empl_contr                       1.9466 42.0534
  [total]                         10.7192 52.2808

standby-pay
  mean                             6.7945  7.5462
  std. dev.                        1.4912   1.918

shift-differential
  mean                             3.4074  5.1002
  std. dev.                        1.6629  3.4277

education-allowance
  yes                               3.167   8.833
  no                               6.5522 42.4478
  [total]                          9.7192 51.2808

statutory-holidays
  mean                             10.555 11.1788
  std. dev.                         0.572  1.2533

vacation
  below_average                     4.657  21.343
  average                          4.0313 14.9687
  generous                         2.0309 15.9691
  [total]                         10.7192 52.2808

longterm-disability-assistance
  yes                              2.9977 48.0023
  no                               6.7215  3.2785
  [total]                          9.7192 51.2808

contribution-to-dental-plan
  none                             7.1218  3.8782
  half                             2.5419 34.4581
  full                             1.0556 13.9444
  [total]                         10.7192 52.2808

bereavement-assistance
  yes                              5.7192 50.2808
  no                                    4       1
  [total]                          9.7192 51.2808

contribution-to-health-plan
  none                             6.2887  3.7113
  half                             1.8752  9.1248
  full                             2.5554 39.4446
  [total]                         10.7192 52.2808

Clustered Instances

0       8 ( 14%)
1      49 ( 86%)

Log likelihood: -18.37167

Class attribute: class
Classes to Clusters:

  0  1  <-- assigned to cluster
  8 12 | bad
  0 37 | good

Cluster 0 <-- bad
Cluster 1 <-- good

Incorrectly clustered instances :    12.0     21.0526 %
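
The classes-to-clusters figures at the end of the output can be checked by hand. The sketch below recomputes the number of incorrectly clustered instances from the printed matrix and the printed cluster-to-class mapping. The class name ClassesToClusters and its two helper methods are hypothetical names chosen for this illustration; they are not part of Weka.

```java
public class ClassesToClusters {

    /**
     * counts[c][k] is the number of instances of class c assigned to cluster k.
     * clusterToClass[k] is the class that cluster k is mapped to.
     * Returns how many instances sit in a cluster mapped to a different class.
     */
    static int incorrect(int[][] counts, int[] clusterToClass) {
        int wrong = 0;
        for (int c = 0; c < counts.length; c++) {
            for (int k = 0; k < counts[c].length; k++) {
                if (clusterToClass[k] != c) {
                    wrong += counts[c][k];
                }
            }
        }
        return wrong;
    }

    /** Total number of instances in the matrix. */
    static int total(int[][] counts) {
        int n = 0;
        for (int[] row : counts) {
            for (int v : row) {
                n += v;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Rows: class bad, class good. Columns: cluster 0, cluster 1.
        int[][] counts = {{8, 12}, {0, 37}};
        // Cluster 0 <-- bad (class 0), Cluster 1 <-- good (class 1).
        int[] clusterToClass = {0, 1};
        int wrong = incorrect(counts, clusterToClass);
        int n = total(counts);
        System.out.println("Incorrectly clustered: " + wrong + " of " + n
                + " (" + (100.0 * wrong / n) + " %)");
    }
}
```

With the matrix above, the 12 instances of class bad that fall into cluster 1 are the only mismatches, and 12 out of 57 is about 21.05 %, matching the last line of the output.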
