Excerpted from: http://aircconline.com/ijdkp/V4N6/4614ijdkp04.pdf

In the syntactic approach, we define binary attributes that correspond to each fixed-length substring of words (or characters). These substrings, called shingles, are the basis of a framework for near-duplicate detection. A shingle is a sequence of words with two parameters: the length and the offset. The length is the number of words in the shingle, and the offset is the distance between the beginnings of consecutive shingles. We assign a hash code to each shingle, so equal shingles have the same hash code, and it is improbable that different shingles share a hash code (this depends on the hashing algorithm used). We then randomly choose a subset of shingles as a concise image of the document [6, 8, 9]. An approach like this is used in the AltaVista search engine (M. Henzinger [32]). There are several methods for selecting the shingles for the image: a fixed number of shingles, a logarithmic number of shingles, a linear number of shingles (every nth shingle), etc.

In lexical methods, representative words are chosen according to their significance, which is usually based on frequency. Words whose frequencies fall within a given interval are taken, except for stop-words from a special list of about 30 stop-words (articles, prepositions, and pronouns). High-frequency words can be uninformative, and low-frequency words can be misprints or occasional words.
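The shingling steps described above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the use of SHA-1 truncated to 64 bits, the "k smallest hashes" rule for the fixed-size image, and the Jaccard overlap used to compare two images are all choices made here, not details given in the excerpt.

```python
import hashlib


def shingles(text, length=3, offset=1):
    """Return word shingles of `length` words, one starting every `offset` words."""
    words = text.lower().split()
    last_start = len(words) - length
    # If the text is shorter than one shingle, the range is empty.
    return [tuple(words[i:i + length]) for i in range(0, last_start + 1, offset)]


def shingle_hash(shingle):
    """Map a shingle to a 64-bit integer (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def image_every_nth(text, n=2, length=3, offset=1):
    """Linear selection: keep the hash of every nth shingle."""
    return {shingle_hash(s) for s in shingles(text, length, offset)[::n]}


def image_fixed(text, k=4, length=3, offset=1):
    """Fixed-size selection: keep the k smallest hashes (an assumed rule)."""
    hashes = sorted({shingle_hash(s) for s in shingles(text, length, offset)})
    return set(hashes[:k])


def overlap(image_a, image_b):
    """Jaccard overlap of two images (assumed comparison measure)."""
    if not image_a and not image_b:
        return 1.0
    return len(image_a & image_b) / len(image_a | image_b)
```

With length 3 and offset 1, the text "a b c d e" yields three shingles; with offset 2 it yields two. Taking the smallest hashes rather than a truly random subset makes the image deterministic, so two copies of the same document always get the same image.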
In lexical methods such as I-Match [11], a large text corpus is used to generate the lexicon. The words that appear in the lexicon represent the document. When the lexicon is generated, the words with the lowest and highest frequencies are deleted. I-Match generates a signature and a hash code of the document. If two documents get the same hash code, it is likely that the similarity measures of these documents are equal as well. I-Match is sometimes unstable under changes in the text [22]. Jun Fan et al. [16] introduced the idea of fusing algorithms (shingling, I-Match, simhash) and presented experiments. Multiple fingerprints generated from random lexicons are incorporated into a shingling-based simhash algorithm, which they named the "shingling based multi fingerprints simhash algorithm". The combination performed much better than the original simhash.
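A rough sketch of the lexicon-and-signature idea described above, in Python. It is not the published I-Match implementation: the document-frequency thresholds `low` and `high`, the whitespace tokenization, and the SHA-1 hash are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def build_lexicon(corpus, low=0.3, high=0.9):
    """Keep words whose document frequency (fraction of documents) is in [low, high].

    The thresholds are hypothetical; they stand in for deleting the
    lowest- and highest-frequency words.
    """
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.lower().split()))
    n = len(corpus)
    return {word for word, count in doc_freq.items() if low <= count / n <= high}


def signature(doc, lexicon):
    """Hash the sorted set of the document's words that appear in the lexicon."""
    kept = sorted(set(doc.lower().split()) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
]
lexicon = build_lexicon(corpus)  # {"cat", "sat", "on", "dog"}
```

Two documents that differ only in words outside the lexicon, such as "the cat sat on the mat" and "a cat sat on a rug", get the same signature. Changing a single lexicon word changes the whole hash, which is consistent with the instability noted above.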

 
The paper proposes a novel approach for detecting and eliminating duplicate and near-duplicate web pages to increase the efficiency of web crawling. The proposed technique aims to help document classification in web content mining, and document clustering, by eliminating near-duplicate documents. To this end, a novel algorithm is proposed to evaluate the content similarity of two documents.
 
 
Duplicate Detection (DD) Algorithm
Step 1: Consider the stemmed keywords of the web page.
Step 2: Based on the starting character (A-Z), assume that the hash values start with 1-26.
Step 3: Scan every word of the sample and compare it with the DB (database), which initially contains no key values. When a new keyword is found, generate its hash value and store that key value in the temporary DB.
Step 4: Repeat step 3 until all keywords have been processed.
Step 5: Store all hash values for the given sample in a local DB (here, an array list).
Step 6: Repeat steps 1 to 5 for N samples.
Step 7: Once all selected samples have been processed, calculate the similarity measure between the sample hash values stored in the local DB and those of the web pages in the repository.
Step 8: From the similarity measure, generate a report on the samples with scores as percentages. Pages that are 80% similar are considered to be near duplicates.
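The steps above leave the hash function and the similarity measure unspecified, so the following Python sketch fills those gaps with assumptions: the hash is the 1-26 letter bucket followed by a CRC32 of the keyword, the similarity is the Jaccard percentage of two hash sets, and "80% similar" is read as "at least 80%". Stemming is assumed to have been done already; all function names are hypothetical.

```python
import zlib

NEAR_DUPLICATE_THRESHOLD = 80.0  # percent, from Step 8


def keyword_hash(word):
    """Hash value whose leading part is 1-26 for the initial letter A-Z.

    The CRC32 suffix is an assumption; the steps do not define the rest
    of the hash. Words not starting with A-Z go to bucket 0.
    """
    first = word[0].lower()
    bucket = ord(first) - ord("a") + 1 if "a" <= first <= "z" else 0
    return f"{bucket}-{zlib.crc32(word.lower().encode('utf-8')):08x}"


def page_hashes(stemmed_keywords):
    """Steps 3-5: hash each new keyword once and collect the hash values."""
    seen = {}  # temporary DB: keyword -> hash value
    for word in stemmed_keywords:
        key = word.lower()
        if key and key not in seen:
            seen[key] = keyword_hash(key)
    return list(seen.values())  # local DB for this sample


def similarity_percent(hashes_a, hashes_b):
    """Step 7: Jaccard similarity as a percentage (assumed measure)."""
    a, b = set(hashes_a), set(hashes_b)
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)


def is_near_duplicate(hashes_a, hashes_b):
    """Step 8: flag pairs at or above the 80% threshold."""
    return similarity_percent(hashes_a, hashes_b) >= NEAR_DUPLICATE_THRESHOLD
```

For example, a page with the keywords `web`, `page`, `crawl`, `index`, `rank` and a page with only the first four share four of five distinct hashes, which scores 80% and is flagged as a near duplicate under this reading.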
 
Hmm, it seems I did not catch the essence of it!
 

A Near-Duplicate Detection Algorithm to Facilitate Document Clustering (worth reading the related work in it when there is time)
