Unsupervised Classification - Sprawl Classification Algorithm
Idea
Points (data) in the same cluster are near each other, or are connected to each other through other points.
So:
- For a distance d, every point in a cluster can always find at least one other point of the same cluster within d.
- Distances between points in different clusters are larger than d.
These conditions do not always hold; for example, when clusters share common points, the second condition fails.
So the idea needs some improvement.
Sprawl Classification Algorithm
- Input:
  - data: the training data
  - d: the minimum distance between clusters
  - minConnectedPoints: the minimum number of connected points
- Output:
  - Result: an array of clusters, where Result[i] holds the points of cluster i
- Logic:
Load data into TotalCache.
i = 0
while (TotalCache.size > 0 or Cache2.size > 0)
{
if Cache2.size = 0, take any point A from TotalCache, put A into Cache2, and remove A from TotalCache.
In TotalCache, find the points 'nearPoints' that are less than d from any point in Cache2.
Put Cache2 points into Cache1.
Clear Cache2.
Put nearPoints into Cache2.
Remove nearPoints from TotalCache.
if Cache2.size = 0
{
Add Cache1 points into Result[i].
Clear Cache1.
i++
}
}
Return Result
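The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: it assumes points are tuples of numbers and that distance means Euclidean distance, and the names `sprawl_cluster`, `remaining`, `frontier`, and `cluster` are illustrative stand-ins for TotalCache, Cache2, and Cache1. The `minConnectedPoints` input is left out because the steps above do not say how it is applied.

```python
import math


def sprawl_cluster(points, d):
    """Group points by repeatedly spreading from a seed to neighbours closer than d."""
    remaining = list(points)   # plays the role of TotalCache
    result = []                # Result: one list of points per cluster
    while remaining:
        frontier = [remaining.pop()]   # Cache2: points reached in the last step
        cluster = []                   # Cache1: points already expanded
        while frontier:
            near, rest = [], []
            # Split the unassigned points into those within d of the frontier
            # and those that stay unassigned for now.
            for p in remaining:
                if any(math.dist(p, q) < d for q in frontier):
                    near.append(p)
                else:
                    rest.append(p)
            cluster.extend(frontier)
            frontier = near
            remaining = rest
        # The frontier is empty, so the cluster cannot grow any further.
        result.append(cluster)
    return result


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
clusters = sprawl_cluster(points, d=0.6)
```

With these five points and `d=0.6`, the three points on the x-axis chain together into one cluster even though the two end points are 1.0 apart, which is the "connected by each other" behaviour described in the idea.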
Note: Because the algorithm calculates distances between points, the data may need to be normalized first so that each feature has the same weight.
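One possible way to normalize is min-max scaling, which maps every feature to the range [0, 1]. The note does not name a method, so the choice of min-max scaling and the function name `min_max_normalize` are assumptions made for this sketch.

```python
def min_max_normalize(points):
    """Rescale every feature of the given points to the range [0, 1]."""
    columns = list(zip(*points))          # one tuple of values per feature
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    # A constant feature has zero span; use 1.0 so it maps to 0.0
    # instead of dividing by zero.
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]
    return [
        tuple((x - lo) / span for x, lo, span in zip(p, lows, spans))
        for p in points
    ]


raw = [(0, 100), (5, 300), (10, 200)]
normalized = min_max_normalize(raw)
```

After scaling, a single value of d applies to all features, whereas on the raw data the second feature would dominate every distance.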
Improvement
A big problem is that the method needs too many distance calculations between points. In the worst case the number of calculations is \(\frac{n (n - 1)}{2}\).
Improvement ideas:
- Checking the distance for one feature first may be quicker.
We do not need to calculate the real distance for each pair, because we only need to know whether the distance is less than \(d\).
For points x1 and x2, the distance is greater than or equal to \(d\) whenever there is a feature \(i\) with \(\vert x1[i] - x2[i] \vert \geqslant d\).
- Split the data into multiple areas.
For an n-dimensional (n-feature) dataset, we can split the dataset into multiple smaller datasets, each contained in an n-dimensional cube with side length \(d\) (volume \(d^{n}\)).
We can imagine these small spaces as n-dimensional cubes that adjoin each other,
so we only need to compare a point with the points in its own cube and in the neighbouring cubes.
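Both ideas can be sketched together in Python. This is an illustrative sketch under assumptions: points are tuples of numbers, distance is Euclidean, and the names `is_near`, `build_grid`, and `near_points` are hypothetical helpers, not part of the original algorithm description.

```python
import math
from collections import defaultdict
from itertools import product


def is_near(p, q, d):
    """Return True when the Euclidean distance between p and q is below d."""
    # Cheap rejection: one coordinate gap of d or more rules the pair out.
    if any(abs(a - b) >= d for a, b in zip(p, q)):
        return False
    return math.dist(p, q) < d


def build_grid(points, d):
    """Bucket points by the cube of side d that contains them."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(x / d) for x in p)].append(p)
    return grid


def near_points(p, grid, d):
    """Find points closer than d to p, checking only p's cube and its neighbours."""
    cell = tuple(math.floor(x / d) for x in p)
    found = []
    # Offsets of -1, 0, +1 per dimension cover the cube itself and all adjoining cubes.
    for offset in product((-1, 0, 1), repeat=len(cell)):
        neighbour = tuple(c + o for c, o in zip(cell, offset))
        for q in grid.get(neighbour, ()):
            if q != p and is_near(p, q, d):
                found.append(q)
    return found


points = [(0.1, 0.1), (0.9, 0.1), (1.2, 0.1), (3.0, 3.0)]
grid = build_grid(points, d=1.0)
neighbours = near_points((0.9, 0.1), grid, d=1.0)
```

Two points closer than d differ by less than d in every coordinate, so their cube indices differ by at most one per dimension, which is why the adjoining cubes are enough. The number of cubes inspected is \(3^{n}\), so the saving is largest when the number of features is small.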
Cons
- Needs a large amount of calculation.
- Needs improvement to handle clusters that have common points.