Evaluation Metrics for Clustering Results and Their Java Implementation

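I. Introduction

I picked up a new skill. While working on a clustering algorithm, I needed to judge how good the proposed algorithm was, either by comparing it with known algorithms or by comparing its output against manually annotated labels. That is where evaluation metrics for clustering results come in. The ones I looked into are listed below.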

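First, define a few quantities, all counted over pairs of objects (adapted from this blog post: http://blog.csdn.net/luoleicn/article/details/5350378):

TP: two objects that belong together were correctly placed in the same cluster (they are in the same class in the reference labeling and in the same cluster in the result under evaluation).

TN: two objects that should not be together were correctly separated (they are in different classes in the reference labeling and in different clusters in the result under evaluation).

FP: two objects that should not be together were wrongly placed in the same cluster (different classes in the reference labeling, same cluster in the result under evaluation).

FN: two objects that should not be separated were wrongly split apart (same class in the reference labeling, different clusters in the result under evaluation).

P = TP + FN (all pairs that share a class in the reference labeling)

N = TN + FP (all pairs that do not share a class in the reference labeling)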
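To make the four pair counts concrete, here is a minimal brute-force sketch in Java that classifies every unordered pair of objects. The class name `PairCounts` and its field names are my own illustrative choices, not part of the listing later in this post, and the sketch assumes both arrays hold one integer label per object in the same order.

```java
// Minimal sketch: classify every unordered pair of objects as TP, FP, FN or TN.
// PairCounts and its field names are illustrative, not taken from the original listing.
final class PairCounts {
    final long tp, fp, fn, tn;

    PairCounts(long tp, long fp, long fn, long tn) {
        this.tp = tp;
        this.fp = fp;
        this.fn = fn;
        this.tn = tn;
    }

    /** predicted[i] is the cluster id of object i, truth[i] is its reference label. */
    static PairCounts of(int[] predicted, int[] truth) {
        if (predicted.length != truth.length) {
            throw new IllegalArgumentException("label arrays must have equal length");
        }
        long tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < predicted.length; i++) {
            for (int j = i + 1; j < predicted.length; j++) {
                boolean sameCluster = predicted[i] == predicted[j];
                boolean sameClass = truth[i] == truth[j];
                if (sameCluster && sameClass) {
                    tp++;        // together, and they belong together
                } else if (sameCluster) {
                    fp++;        // together, but they belong apart
                } else if (sameClass) {
                    fn++;        // apart, but they belong together
                } else {
                    tn++;        // apart, and they belong apart
                }
            }
        }
        return new PairCounts(tp, fp, fn, tn);
    }

    public static void main(String[] args) {
        int[] predicted = {0, 0, 0, 1, 1};
        int[] truth     = {0, 0, 1, 1, 1};
        PairCounts c = PairCounts.of(predicted, truth);
        // prints "2 2 2 4": the four counts always add up to n*(n-1)/2 = 10 pairs
        System.out.println(c.tp + " " + c.fp + " " + c.fn + " " + c.tn);
    }
}
```

This runs in O(n^2) time, which is fine for checking small examples by hand but slow for large data sets.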

1. Accuracy (recognition rate), also known as the Rand index (RI)

accuracy = (TP + TN)/(P + N)

2. Error rate (misclassification rate)

error rate = (FP + FN)/(P + N)

3. Sensitivity

sensitivity = TP / P

4. Specificity

specificity = TN / N

5. Precision

precision = TP / (TP + FP)

6. Recall

recall = TP / (TP + FN)

7. RI is simply the accuracy from item 1.
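The ratios above are easy to get wrong by one term, so here is a small sketch that writes them as functions of the four pair counts. `PairMetrics` is an illustrative name of my own. Sensitivity and specificity use the standard definitions TP/(TP+FN) and TN/(TN+FP), and returning 0.0 when a denominator is empty is an assumed convention rather than something the formulas prescribe.

```java
// Minimal sketch: the pair-based ratios as functions of TP, FP, FN and TN.
// PairMetrics is an illustrative name; the 0.0 result for an empty denominator is an assumption.
final class PairMetrics {
    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    static double accuracy(long tp, long fp, long fn, long tn) {   // Rand index
        return ratio(tp + tn, tp + fp + fn + tn);
    }

    static double errorRate(long tp, long fp, long fn, long tn) {
        return ratio(fp + fn, tp + fp + fn + tn);
    }

    static double sensitivity(long tp, long fn) {                  // same value as recall
        return ratio(tp, tp + fn);
    }

    static double specificity(long tn, long fp) {
        return ratio(tn, tn + fp);
    }

    static double precision(long tp, long fp) {
        return ratio(tp, tp + fp);
    }

    static double recall(long tp, long fn) {
        return ratio(tp, tp + fn);
    }

    public static void main(String[] args) {
        long tp = 2, fp = 2, fn = 2, tn = 4;   // made-up counts for a five-object toy example
        System.out.println("RI          = " + accuracy(tp, fp, fn, tn));
        System.out.println("error rate  = " + errorRate(tp, fp, fn, tn));
        System.out.println("precision   = " + precision(tp, fp));
        System.out.println("recall      = " + recall(tp, fn));
        System.out.println("specificity = " + specificity(tn, fp));
    }
}
```

Accuracy and error rate always sum to 1, which is a quick sanity check on any set of counts.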

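8. F-measure

F = (beta^2 + 1) * P * R / (beta^2 * P + R)

Here P is the precision, R is the recall, and beta weights recall against precision (beta = 1 gives the usual F1 score).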
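As a quick illustration of how beta shifts the balance, here is a minimal sketch of the F-measure as a standalone function. `FMeasure.fBeta` is a hypothetical helper name, and returning 0.0 when precision and recall are both zero is an assumed convention.

```java
// Minimal sketch of the F-measure; FMeasure and fBeta are illustrative names.
final class FMeasure {
    /** beta > 1 favours recall, beta < 1 favours precision, beta = 1 is the harmonic mean. */
    static double fBeta(double precision, double recall, double beta) {
        double b2 = beta * beta;
        double denominator = b2 * precision + recall;
        // Assumed convention: an undefined F (0/0) is reported as 0.0.
        return denominator == 0.0 ? 0.0 : (b2 + 1) * precision * recall / denominator;
    }

    public static void main(String[] args) {
        double p = 1.0, r = 0.5;
        System.out.println("F1   = " + fBeta(p, r, 1.0));
        System.out.println("F2   = " + fBeta(p, r, 2.0));   // pulled towards the lower recall
        System.out.println("F0.5 = " + fBeta(p, r, 0.5));   // pulled towards the higher precision
    }
}
```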

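9. NMI (normalized mutual information)

NMI(X, Y) = I(X; Y) / sqrt(H(X) * H(Y))

Here I(X; Y) is the mutual information between the two partitions and H is the entropy of a partition. This is the normalization that the computeNMI method in Part II uses.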
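For readers who want to see NMI computed directly from two label arrays, here is a minimal sketch. It assumes the square-root normalization I(X; Y) / sqrt(H(X) * H(Y)) with base-2 logarithms. `NmiSketch` is an illustrative name, and returning 0.0 when either partition has zero entropy (a single cluster) is an assumed convention; other tools handle that edge case differently.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of NMI with sqrt(H(X) * H(Y)) normalization; NmiSketch is an illustrative name.
final class NmiSketch {
    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy of a partition given the size of each of its clusters. */
    private static double entropy(Map<Integer, Integer> sizes, int n) {
        double h = 0.0;
        for (int size : sizes.values()) {
            double p = (double) size / n;
            h -= p * log2(p);
        }
        return h;
    }

    static double nmi(int[] x, int[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("label arrays must be non-empty and of equal length");
        }
        int n = x.length;
        Map<Integer, Integer> sizeX = new HashMap<>();
        Map<Integer, Integer> sizeY = new HashMap<>();
        Map<Integer, Map<Integer, Integer>> joint = new HashMap<>();   // contingency table
        for (int i = 0; i < n; i++) {
            sizeX.merge(x[i], 1, Integer::sum);
            sizeY.merge(y[i], 1, Integer::sum);
            joint.computeIfAbsent(x[i], k -> new HashMap<>()).merge(y[i], 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Integer, Map<Integer, Integer>> row : joint.entrySet()) {
            double nx = sizeX.get(row.getKey());
            for (Map.Entry<Integer, Integer> cell : row.getValue().entrySet()) {
                double ny = sizeY.get(cell.getKey());
                double nxy = cell.getValue();
                mi += (nxy / n) * log2(nxy * n / (nx * ny));
            }
        }
        double hx = entropy(sizeX, n);
        double hy = entropy(sizeY, n);
        // Assumed convention: NMI is reported as 0.0 when either partition is a single cluster.
        return (hx == 0.0 || hy == 0.0) ? 0.0 : mi / Math.sqrt(hx * hy);
    }

    public static void main(String[] args) {
        // Same partition with the labels swapped: NMI should be (approximately) 1.
        System.out.println(nmi(new int[]{0, 0, 1, 1}, new int[]{1, 1, 0, 0}));
        // Statistically independent partitions: NMI should be (approximately) 0.
        System.out.println(nmi(new int[]{0, 0, 1, 1}, new int[]{0, 1, 0, 1}));
    }
}
```

Unlike the pair-counting metrics, NMI does not depend on which label values are used, only on how the objects are grouped.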

10. Jaccard coefficient

J = TP / (TP + FP + FN)
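The Jaccard coefficient differs from the Rand index only in that it ignores TN, which matters because TN usually dominates when there are many clusters. A minimal sketch, with `JaccardSketch` as an illustrative name and 0.0 for an empty denominator as an assumed convention:

```java
// Minimal sketch of the pair-based Jaccard coefficient; JaccardSketch is an illustrative name.
final class JaccardSketch {
    static double jaccard(long tp, long fp, long fn) {
        long denominator = tp + fp + fn;   // TN is deliberately left out
        // Assumed convention: report 0.0 when no pair is grouped together in either labeling.
        return denominator == 0 ? 0.0 : (double) tp / denominator;
    }

    public static void main(String[] args) {
        // Made-up counts: 2 of the 6 pairs grouped together by either labeling are grouped by both.
        System.out.println(jaccard(2, 2, 2));
    }
}
```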

II. Java implementation (not optimized)

There is a lot of duplicated code in it that I have not cleaned up yet.

package others;

import java.util.HashMap;

import java.util.HashSet;

import java.util.Iterator;

import java.util.Map;

import java.util.Map.Entry;

import java.util.Set;

/* Function: commonly used clustering evaluation metrics: purity, precision, recall, RI, F-score and Jaccard

 * @author: Wenbao Li

 * @date: 2015-07-13

 */

public class ClusterEvaluation {

	public static void main(String[] args){

		int[] A = {1,3,3,3,3,3,3,2,1,0,2,0,2,0,2,1,1,0,1,1};

		int[] B = {2,2,0,0,0,3,2,2,3,1,3,1,0,1,2,1,0,1,3,3};

		double purity = Purity(A,B);

		System.out.println("purity\t\t"+purity);

		System.out.println("Pre\t\t"+Precision(A,B));

		System.out.println("Recall\t\t"+Recall(A,B));

		System.out.println("RI(Accuracy)\t\t"+RI(A,B));

		System.out.println("Fvalue\t\t"+F_score(A,B));

		System.out.println("NMI\t\t"+NMI(A,B));

	}

	/*

	 * Groups the object indices (1-based) by cluster label, which gives the clusters and the members of each one.

	 */

	public static Map<Integer,Set<Integer>> clusterDistri(int[] A){

		Map<Integer,Set<Integer>> clusterD = new HashMap<Integer,Set<Integer>>();

		int max = -1;

		for(int i = 0;i< A.length;i++){

			if(max < A[i]){

				max = A[i];

			}

		}

		for(int i = 0;i< A.length;i++){

			int temp = A[i];

			if(temp < max+1){

				if(clusterD.containsKey(temp)){

					Set<Integer> set = clusterD.get(temp);

					set.add(i+1);

					clusterD.put(temp, set);

				}else{

					Set<Integer> set = new HashSet<Integer>();

					set.add(i+1);

					clusterD.put(temp, set);

				}

			}

		}

		return clusterD;

	}

	public static double ClusEvaluate(String method,int[] A,int[] B){

		switch(method){

		case "Purity":

			return Purity(A,B);

		case "Precision":

			return Precision(A,B);

		case "Recall":

			return Recall(A,B);

		case "RI":

			return RI(A,B);

		case "F_score":

			return F_score(A,B);

		case "NMI":

			return NMI(A,B);

		case "Jaccard":

			return Jaccard(A,B);

		default:

			return -1.0;

		}

	}

	public static int[] commNum(Map<Integer,Set<Integer>> A,Map<Integer,Set<Integer>> B){

		int[] commonNo = new int[A.size()];

		int com = 0;

		Iterator<Map.Entry<Integer,Set<Integer>>> itA = A.entrySet().iterator();

		int i = 0;

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			Set<Integer> setA = entryA.getValue();

			Iterator<Map.Entry<Integer,Set<Integer>>> itB = B.entrySet().iterator();

			int maxComm = -1;

			while(itB.hasNext()){

				Entry<Integer,Set<Integer>> entryB = itB.next();

				Set<Integer> setB = entryB.getValue();

				int lengthA = setA.size();

				Set<Integer> temp = new HashSet<Integer>(setA);

				temp.removeAll(setB);

				int lengthCom = lengthA - temp.size();

				if(maxComm < lengthCom){

					maxComm = lengthCom;

				}

			}

			commonNo[i] = maxComm;

			com = com + maxComm;

			i++;

		}

		return commonNo;

	}

	/*

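	 * Purity: the number of correctly assigned objects summed over all clusters, divided by the total number of objects. B holds the reference labels.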

	 */

	public static double Purity(int[] A,int[] B){

		double value;

		Map<Integer,Set<Integer>> clusterA = clusterDistri(A);

		Map<Integer,Set<Integer>> clusterB = clusterDistri(B);

		int[] commonNo = commNum(clusterA,clusterB);

		int com = 0;

		for(int i = 0;i<commonNo.length;i++){

			com = com + commonNo[i];

		}

		value = com*1.0/A.length;

		return value;

	}

	/*

	 * @param A,B

	 * @return precision

	 */

	public static double Precision(int[] A,int[] B){

		double value = 0.0;

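		Map<Integer,Set<Integer>> clusterA = clusterDistri(A);// cluster membership of the result A under evaluation

		Map<Integer,Set<Integer>> clusterB = clusterDistri(B);// class membership of the reference labeling B

		int[] commonNo = commNum(clusterA,clusterB);// for each cluster in A, its largest overlap with any class in B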

		int allP = 0;

		int TP = 0;

		int FP = 0;

		int TN = 0;

		int FN = 0;

		Iterator<Map.Entry<Integer,Set<Integer>>> itA = clusterA.entrySet().iterator();

		int i = 0;

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			allP = allP + combination(entryA.getValue().size(),2);

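			// TP: pairs that share a cluster in A and a class in B, summed over every (cluster, class) overlap.

			for(Set<Integer> setB : clusterB.values()){

				int common = intersect(entryA.getValue(),setB);

				TP = TP + common*(common-1)/2;

			}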

			i++;

		}

		FP = allP - TP;

		itA = clusterA.entrySet().iterator();

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			Iterator<Map.Entry<Integer,Set<Integer>>> itA2 = clusterA.entrySet().iterator();

			while(itA2.hasNext()){

				Entry<Integer,Set<Integer>> entryA2 = itA2.next();

				if(entryA.getKey() < entryA2.getKey()){ // visit each unordered pair of clusters once

					Set<Integer> s1 = entryA.getValue();

					Set<Integer> s2 = entryA2.getValue();

					for(Integer i1 :s1){

						for(Integer i2:s2){

							if(B[i1-1] != B[i2-1]){

								TN++;

							}else{

								FN++;

							}

						}

					}

				}

			}

		}

		double P = TP*1.0/(TP + FP);

		return P;

	}

	/*

	 * @param A,B

	 * @return recall

	 */

	public static double Recall(int[] A,int[] B){

		double value = 0.0;

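		Map<Integer,Set<Integer>> clusterA = clusterDistri(A);// cluster membership of the result A under evaluation

		Map<Integer,Set<Integer>> clusterB = clusterDistri(B);// class membership of the reference labeling B

		int[] commonNo = commNum(clusterA,clusterB);// for each cluster in A, its largest overlap with any class in B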

		int allP = 0;

		int TP = 0;

		int FP = 0;

		int TN = 0;

		int FN = 0;

		Iterator<Map.Entry<Integer,Set<Integer>>> itA = clusterA.entrySet().iterator();

		int i = 0;

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			allP = allP + combination(entryA.getValue().size(),2);

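			// TP: pairs that share a cluster in A and a class in B, summed over every (cluster, class) overlap.

			for(Set<Integer> setB : clusterB.values()){

				int common = intersect(entryA.getValue(),setB);

				TP = TP + common*(common-1)/2;

			}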

			i++;

		}

		FP = allP - TP;

		itA = clusterA.entrySet().iterator();

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			Iterator<Map.Entry<Integer,Set<Integer>>> itA2 = clusterA.entrySet().iterator();

			while(itA2.hasNext()){

				Entry<Integer,Set<Integer>> entryA2 = itA2.next();

				if(entryA.getKey() < entryA2.getKey()){ // visit each unordered pair of clusters once, so FN and TN are not counted twice

					Set<Integer> s1 = entryA.getValue();

					Set<Integer> s2 = entryA2.getValue();

					for(Integer i1 :s1){

						for(Integer i2:s2){

							if(B[i1-1] != B[i2-1]){

								TN++;

							}else{

								FN++;

							}

						}

					}

				}

			}

		}

		double R = TP * 1.0/(TP + FN);

		return R;

	}

	/*

	 * @param A,B

	 * @return Rand index

	 */

	public static double RI(int[] A,int[] B){

		double value = 0.0;

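		Map<Integer,Set<Integer>> clusterA = clusterDistri(A);// cluster membership of the result A under evaluation

		Map<Integer,Set<Integer>> clusterB = clusterDistri(B);// class membership of the reference labeling B

		int[] commonNo = commNum(clusterA,clusterB);// for each cluster in A, its largest overlap with any class in B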

		int P = 0;

		int TP = 0;

		int FP = 0;

		int TN = 0;

		int FN = 0;

		Iterator<Map.Entry<Integer,Set<Integer>>> itA = clusterA.entrySet().iterator();

		int i = 0;

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			P = P + combination(entryA.getValue().size(),2);

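			// TP: pairs that share a cluster in A and a class in B, summed over every (cluster, class) overlap.

			for(Set<Integer> setB : clusterB.values()){

				int common = intersect(entryA.getValue(),setB);

				TP = TP + common*(common-1)/2;

			}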

			i++;

		}

		FP = P - TP;

		itA = clusterA.entrySet().iterator();

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			Iterator<Map.Entry<Integer,Set<Integer>>> itA2 = clusterA.entrySet().iterator();

			while(itA2.hasNext()){

				Entry<Integer,Set<Integer>> entryA2 = itA2.next();

				if(entryA.getKey() < entryA2.getKey()){ // visit each unordered pair of clusters once, so FN and TN are not counted twice

					Set<Integer> s1 = entryA.getValue();

					Set<Integer> s2 = entryA2.getValue();

					for(Integer i1 :s1){

						for(Integer i2:s2){

							if(B[i1-1] != B[i2-1]){

								TN++;

							}else{

								FN++;

							}

						}

					}

				}

			}

		}

		value = (TP + TN)*1.0/(TP + FP + FN + TN);

		return value;

	}

	/*

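	 * F-score: balances precision against recall.

	 * @param A: clustering under evaluation. B: reference labels. The balance parameter beta is fixed at 1.0 inside the method.

	 * @return F-score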

	 */

	public static double F_score(int[] A,int[] B){

		double beta = 1.0;

		double value = 0.0;

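		Map<Integer,Set<Integer>> clusterA = clusterDistri(A);// cluster membership of the result A under evaluation

		Map<Integer,Set<Integer>> clusterB = clusterDistri(B);// class membership of the reference labeling B

		int[] commonNo = commNum(clusterA,clusterB);// for each cluster in A, its largest overlap with any class in B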

		int allP = 0;

		int TP = 0;

		int FP = 0;

		int TN = 0;

		int FN = 0;

		Iterator<Map.Entry<Integer,Set<Integer>>> itA = clusterA.entrySet().iterator();

		int i = 0;

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			allP = allP + combination(entryA.getValue().size(),2);

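			// TP: pairs that share a cluster in A and a class in B, summed over every (cluster, class) overlap.

			for(Set<Integer> setB : clusterB.values()){

				int common = intersect(entryA.getValue(),setB);

				TP = TP + common*(common-1)/2;

			}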

			i++;

		}

		FP = allP - TP;

		itA = clusterA.entrySet().iterator();

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			Iterator<Map.Entry<Integer,Set<Integer>>> itA2 = clusterA.entrySet().iterator();

			while(itA2.hasNext()){

				Entry<Integer,Set<Integer>> entryA2 = itA2.next();

				if(entryA.getKey() < entryA2.getKey()){ // visit each unordered pair of clusters once, so FN and TN are not counted twice

					Set<Integer> s1 = entryA.getValue();

					Set<Integer> s2 = entryA2.getValue();

					for(Integer i1 :s1){

						for(Integer i2:s2){

							if(B[i1-1] != B[i2-1]){

								TN++;

							}else{

								FN++;

							}

						}

					}

				}

			}

		}

		double P = TP*1.0/(TP + FP);

		double R = TP * 1.0/(TP + FN);

		value = (beta*beta + 1)*P * R/(beta*beta*P + R);

		return value;

	}

	public static double Jaccard(int[] A,int[] B){

		double value = 0.0;

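		Map<Integer,Set<Integer>> clusterA = clusterDistri(A);// cluster membership of the result A under evaluation

		Map<Integer,Set<Integer>> clusterB = clusterDistri(B);// class membership of the reference labeling B

		int[] commonNo = commNum(clusterA,clusterB);// for each cluster in A, its largest overlap with any class in B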

		int allP = 0;

		int TP = 0;

		int FP = 0;

		int TN = 0;

		int FN = 0;

		Iterator<Map.Entry<Integer,Set<Integer>>> itA = clusterA.entrySet().iterator();

		int i = 0;

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			allP = allP + combination(entryA.getValue().size(),2);

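			// TP: pairs that share a cluster in A and a class in B, summed over every (cluster, class) overlap.

			for(Set<Integer> setB : clusterB.values()){

				int common = intersect(entryA.getValue(),setB);

				TP = TP + common*(common-1)/2;

			}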

			i++;

		}

		FP = allP - TP;

		itA = clusterA.entrySet().iterator();

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			Iterator<Map.Entry<Integer,Set<Integer>>> itA2 = clusterA.entrySet().iterator();

			while(itA2.hasNext()){

				Entry<Integer,Set<Integer>> entryA2 = itA2.next();

				if(entryA.getKey() < entryA2.getKey()){ // visit each unordered pair of clusters once, so FN and TN are not counted twice

					Set<Integer> s1 = entryA.getValue();

					Set<Integer> s2 = entryA2.getValue();

					for(Integer i1 :s1){

						for(Integer i2:s2){

							if(B[i1-1] != B[i2-1]){

								TN++;

							}else{

								FN++;

							}

						}

					}

				}

			}

		}

		value = TP * 1.0 / (TP + FP + FN);

		return value;

	}

	public static double NMI(int[] A,int[] B){

		Map<Integer,Set<Integer>> clusterA = clusterDistri(A);// cluster membership of the result A under evaluation

		Map<Integer,Set<Integer>> clusterB = clusterDistri(B);// class membership of the reference labeling B

		Iterator<Map.Entry<Integer,Set<Integer>>> itA = clusterA.entrySet().iterator();

		Iterator<Map.Entry<Integer,Set<Integer>>> itB = clusterB.entrySet().iterator();

		Set<Set<Integer>> partitionF = new HashSet<Set<Integer>>();

		Set<Set<Integer>> partitionR = new HashSet<Set<Integer>>();

		int nodeCount = B.length;

		while(itA.hasNext()){

			Entry<Integer,Set<Integer>> entryA = itA.next();

			Set<Integer> setA = entryA.getValue();

			partitionF.add(setA);

			setA = null;

			entryA = null;

		}

		while(itB.hasNext()){

			Entry<Integer,Set<Integer>> entryB = itB.next();

			Set<Integer> setB = entryB.getValue();

			partitionR.add(setB);

			setB = null;

			entryB = null;

		}

		return computeNMI(partitionF,partitionR,nodeCount);

	}

	public static double computeNMI(Set<Set<Integer>> partitionF,

			Set<Set<Integer>> partitionR,int nodeCount) {

		int[][] XY = new int[partitionR.size()][partitionF.size()];

		int[] X = new int[partitionR.size()];

		int[] Y = new int[partitionF.size()];

		int i = 0;

		int j = 0;

		for (Set<Integer> com1 : partitionR) {

			j = 0;

			for (Set<Integer> com2 : partitionF) {

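				XY[i][j] = intersect(com1, com2);// elements shared by the i-th reference cluster and the j-th cluster under evaluation

				X[i] += XY[i][j];// row total, i.e. the size of the i-th reference cluster

				Y[j] += XY[i][j];// column total, i.e. the size of the j-th cluster under evaluation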

				j++;

			}

			i++;

		}

		int N = nodeCount;

		double Ixy = 0;

		double Ixy2 = 0;

		for (i = 0; i < partitionR.size(); i++) {

			for (j = 0; j < partitionF.size(); j++) {

				if (XY[i][j] > 0) {

					Ixy += ((double) XY[i][j] / N)

							* (Math.log((double) XY[i][j] * N / (X[i] * Y[j])) / Math

									.log(2.0));

//					Ixy2 = (float) (Ixy2 + -2.0D * XY[i][j]

//							* Math.log(XY[i][j] * N / X[i] * Y[j]));

				}

			}

		}

//		System.out.println(Ixy2);

//		double denom = 0.0F;

//		for (int ii = 0; ii < X.length; ++ii)

//			denom = (double) (denom + X[ii] * Math.log(X[ii] / N));

//		for (int jj = 0; jj < Y.length; ++jj) {

//			denom = (double) (denom + Y[jj] * Math.log(Y[jj] / N));

//		}

//

//		System.out.println(denom);

//		double M = (Ixy / denom);

//

//		return M;

		double Hx = 0;

		double Hy = 0;

		for (i = 0; i < partitionR.size(); i++) {

			if (X[i] > 0)

				Hx += h((double) X[i] / N);

		}

		for (j = 0; j < partitionF.size(); j++) {

			if (Y[j] > 0)

				Hy += h((double) Y[j] / N);

		}

		double InormXY = Ixy / Math.sqrt(Hx * Hy);

		return InormXY;

	}

	private static double h(double p) {

		return -p * (Math.log(p) / Math.log(2.0));

	}

	/*

	 * Number of elements common to both sets.

	 */

	private static int intersect(Set<Integer> com1, Set<Integer> com2) {

		int num = 0;

		for (Integer v1 : com1) {

			if (com2.contains(v1))

				num++;

		}

		return num;

	}

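	/*

	 * C(m,n): the number of ways to choose n items out of m.

	 */

	public static int combination(int m,int n){

		if(n < 0 || m < n){

			return 0; // no such selection exists, so it contributes no pairs

		}

		long result = 1;

		for(int k = 1;k <= n;k++){

			result = result * (m - n + k) / k; // exact at every step, and avoids the int overflow of factorial(m) once m exceeds 12

		}

		return (int) result;

	}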

	public static int factorial(int m){

		if((m == 1) || (m == 0)){

			return 1;

		}else if(m < 0){

			return -1;

		}else{

			return m*factorial(m-1);

		}

	}

}
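The five pair-based methods above repeat the same counting loops. One way to remove that duplication is to build a contingency table once and read all four counts from it, using TP = sum of C(n_ij, 2) over the cells, TP + FP = sum of C(cluster size, 2), and TP + FN = sum of C(class size, 2). The following is only a sketch of that idea under my own naming (`ContingencyCounts`, `pairCounts`, `purity`), not a drop-in replacement for the class above.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: derive TP, FP, FN, TN and purity from one contingency table.
// ContingencyCounts and its method names are illustrative, not part of the original class.
final class ContingencyCounts {
    private static long choose2(long m) {
        return m * (m - 1) / 2;   // 0 for m = 0 or m = 1
    }

    /** table.get(cluster).get(cls) = number of objects with that cluster id and reference class. */
    private static Map<Integer, Map<Integer, Integer>> table(int[] predicted, int[] truth) {
        if (predicted.length != truth.length) {
            throw new IllegalArgumentException("label arrays must have equal length");
        }
        Map<Integer, Map<Integer, Integer>> t = new HashMap<>();
        for (int i = 0; i < predicted.length; i++) {
            t.computeIfAbsent(predicted[i], k -> new HashMap<>()).merge(truth[i], 1, Integer::sum);
        }
        return t;
    }

    /** Returns {TP, FP, FN, TN} counted over all unordered pairs of objects. */
    static long[] pairCounts(int[] predicted, int[] truth) {
        Map<Integer, Integer> classSizes = new HashMap<>();
        for (int label : truth) {
            classSizes.merge(label, 1, Integer::sum);
        }
        long tp = 0, sameCluster = 0, sameClass = 0;
        for (Map<Integer, Integer> row : table(predicted, truth).values()) {
            long clusterSize = 0;
            for (int cell : row.values()) {
                tp += choose2(cell);
                clusterSize += cell;
            }
            sameCluster += choose2(clusterSize);   // TP + FP
        }
        for (int size : classSizes.values()) {
            sameClass += choose2(size);            // TP + FN
        }
        long fp = sameCluster - tp;
        long fn = sameClass - tp;
        long tn = choose2(predicted.length) - tp - fp - fn;
        return new long[] {tp, fp, fn, tn};
    }

    /** Purity: each cluster votes for its majority class; the matching objects are divided by n. */
    static double purity(int[] predicted, int[] truth) {
        long majoritySum = 0;
        for (Map<Integer, Integer> row : table(predicted, truth).values()) {
            int best = 0;
            for (int cell : row.values()) {
                best = Math.max(best, cell);
            }
            majoritySum += best;
        }
        return predicted.length == 0 ? 0.0 : (double) majoritySum / predicted.length;
    }

    public static void main(String[] args) {
        int[] predicted = {0, 0, 0, 1, 1};
        int[] truth     = {0, 0, 1, 1, 1};
        long[] c = pairCounts(predicted, truth);
        System.out.println(c[0] + " " + c[1] + " " + c[2] + " " + c[3]);   // prints "2 2 2 4"
        System.out.println(purity(predicted, truth));
    }
}
```

With the counts in hand, precision, recall, RI, F-score and Jaccard each become a one-line ratio, and the cost drops from comparing every pair of objects to a single pass over the data plus a pass over the table.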
