[comment]: # Unsupervised Classification - Sprawl Classification Algorithm

Idea

Points (data) in the same cluster are near each other, or are connected to each other through other points.

So:

For a distance d, every point in a cluster can always find some other point of the same cluster within that distance.
Distances between points in different clusters are bigger than the distance d.

The above conditions may not hold in every case, e.g. they are incorrect for clusters that have common points.

So some improvement is needed.

Sprawl Classification Algorithm

Input:
- data: the training data
- d: the minimum distance between clusters
- minConnectedPoints: the minimum number of connected points
Output:
- Result: an array of classified data
Logic:

Load data into TotalCache.

i = 0

while (TotalCache.size > 0)

{

    Take any point A from TotalCache and put A into Cache2.

    Remove A from TotalCache.

    while (Cache2.size > 0)

    {

        In TotalCache, find the points 'nearPoints' that are less than d from any point in Cache2.

        Put the Cache2 points into Cache1.

        Clear Cache2.

        Put nearPoints into Cache2.

        Remove nearPoints from TotalCache.

    }

    Add the Cache1 points into Result[i].

    Clear Cache1.

    i++

}

Return Result
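
The steps above can be sketched in Python as follows. This is only one reading of the logic, under some assumptions: points are tuples of numbers, the distance is Euclidean (the text does not name a metric), and the expansion is repeated until no new near points are found. The function name `sprawl_classify` is made up for illustration, and `minConnectedPoints` is left out because the text does not say how it is applied.

```python
import math

def sprawl_classify(data, d):
    """Group points so that each cluster grows through neighbours closer than d.

    Assumptions: points are numeric tuples and the distance is Euclidean.
    """
    total_cache = list(data)          # points not yet assigned to a cluster
    result = []
    while total_cache:
        cache2 = [total_cache.pop()]  # frontier: newly added points of the cluster
        cache1 = []                   # points already absorbed into the cluster
        while cache2:
            near_points, remaining = [], []
            for p in total_cache:
                # p joins the cluster if it is closer than d to any frontier point
                if any(math.dist(p, q) < d for q in cache2):
                    near_points.append(p)
                else:
                    remaining.append(p)
            cache1.extend(cache2)     # the old frontier is now settled
            cache2 = near_points      # the near points become the new frontier
            total_cache = remaining
        result.append(cache1)
    return result

data = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.1)]
clusters = sprawl_classify(data, 0.6)
```

With this sample, the three points on the x axis should end up in one cluster because they are chained by steps of 0.5, even though the two end points are 1.0 apart.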

Note: As the algorithm needs to calculate the distances between points, it may be necessary to normalize the data first so that each feature has the same weight.
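
A minimal sketch of such a normalization, assuming min-max scaling to the range [0, 1] is acceptable (the note does not prescribe a method, and `min_max_normalize` is a name chosen here for illustration):

```python
def min_max_normalize(data):
    """Rescale every feature to [0, 1] so no feature dominates the distance."""
    columns = list(zip(*data))            # one tuple of values per feature
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    normalized = []
    for point in data:
        normalized.append(tuple(
            (x - lo) / (hi - lo) if hi > lo else 0.0   # a constant feature maps to 0
            for x, lo, hi in zip(point, lows, highs)
        ))
    return normalized

scaled = min_max_normalize([(0, 100), (5, 200), (10, 300)])
```

After scaling, a single value of d means the same thing along every feature, which is what the distance threshold relies on.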

Improvement

A big problem is that the method needs too many distance calculations between points. The maximum number of calculations is $\frac{n (n - 1)}{2}$.
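
As a quick check of that bound, the short sketch below counts the unordered pairs of a small sample and compares the count with the formula. The helper name `max_distance_checks` is hypothetical.

```python
from itertools import combinations

def max_distance_checks(n):
    """Upper bound on pairwise distance calculations for n points."""
    return n * (n - 1) // 2

points = list(range(6))                      # stand-ins for 6 data points
pairs = list(combinations(points, 2))        # every unordered pair, once
pair_count = len(pairs)
```

For 1,000 points the same formula gives 499,500 distance calculations, which is why the ideas below try to skip most pairs.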

Improvement ideas:

Checking the distance for one feature first may be quicker.

We do not need to calculate the real distance for each pair, because we only need to know whether the distance is less than $d$.

For points x1 and x2, the distance is greater than or equal to $d$ whenever there is a feature $i$ with $\vert x1[i] - x2[i] \vert \geqslant d$.
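
A hedged sketch of this per-feature check, assuming Euclidean distance and d > 0 (`is_near` is an illustrative name, not part of the original algorithm):

```python
def is_near(x1, x2, d):
    """Return True if the Euclidean distance between x1 and x2 is less than d.

    Rejects early as soon as one feature alone differs by d or more,
    or the running squared sum already reaches d squared.
    """
    limit = d * d
    total = 0.0
    for a, b in zip(x1, x2):
        diff = abs(a - b)
        if diff >= d:          # one feature is already too far apart
            return False
        total += diff * diff
        if total >= limit:     # partial sum already rules the pair out
            return False
    return True
```

Comparing squared values also avoids taking a square root for every pair.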

Split the data into multiple areas

For a dataset with n dimensions (features), we can split the dataset into multiple smaller datasets, each lying in an n-dimensional space of size $d^{n}$.

We can imagine each small space as an n-dimensional cube with side length $d$ that adjoins its neighbours.

So we only need to calculate distances between points in the current space and its neighbouring spaces.
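
One way to sketch this splitting is a grid keyed by integer cell coordinates. The names `build_grid` and `candidates` are hypothetical, and the cube side is assumed to be exactly d:

```python
import math
from collections import defaultdict
from itertools import product

def build_grid(points, d):
    """Map each cube (cell) of side d to the points that fall inside it."""
    grid = defaultdict(list)
    for p in points:
        cell = tuple(math.floor(x / d) for x in p)
        grid[cell].append(p)
    return grid

def candidates(grid, point, d):
    """Collect points from the cell of `point` and all adjoining cells."""
    cell = tuple(math.floor(x / d) for x in point)
    found = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        neighbour = tuple(c + o for c, o in zip(cell, offset))
        found.extend(grid.get(neighbour, []))   # .get avoids creating empty cells
    return found

points = [(0.1, 0.1), (0.9, 0.2), (1.1, 0.1), (3.5, 3.5)]
grid = build_grid(points, 1.0)
near_origin = candidates(grid, (0.1, 0.1), 1.0)
```

Only the returned candidates still need a real distance check. Note that a cell has $3^{n}$ neighbouring cells including itself, so this helps most when the number of features is small.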

Cons

Needs a large amount of calculation.
Needs improvement to handle clusters that have common points.
