[IR] Information Extraction

郝一二三, 2024-10-02 11:45:59 (original post)

Interim summary

Boolean retrieval

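Term queries

【Qword1 and Qword2】 O(x+y), where x and y are the lengths of the two postings lists (linear merge)

【Qword1 and Qword2】- improvement: Galloping Search O(2a*log₂(b/a)), with a the shorter and b the longer list length

【Qword1 and not Qword2】 O(m*log₂n) (binary-search each of the m postings of Qword1 in the n postings of Qword2)

【Qword1 or not Qword2】 O(m+n) for the merge itself; note that the result can contain nearly every document in the collection, so enumerating it may cost O(N) for N documents

【Qword1 and Qword2 and Qword3 and ...】 O(Total_Length * log₂k) for k postings lists (k-way merge)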
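The two AND strategies above can be sketched as follows. This is a minimal illustration, assuming postings lists are sorted Python lists of integer doc IDs; the function names `intersect_merge` and `intersect_galloping` are made up for this example.

```python
from bisect import bisect_left

def intersect_merge(p1, p2):
    """Linear merge of two sorted postings lists: O(x + y)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_galloping(short, long):
    """For each doc ID in the shorter list, gallop through the longer one."""
    out = []
    lo = 0  # everything before lo in `long` is smaller than the current doc ID
    for doc_id in short:
        # Double the step until we reach an entry >= doc_id or run off the end.
        step = 1
        hi = lo
        while hi < len(long) and long[hi] < doc_id:
            lo = hi
            hi += step
            step *= 2
        # Binary search inside the bracket found by galloping.
        lo = bisect_left(long, doc_id, lo, min(hi + 1, len(long)))
        if lo < len(long) and long[lo] == doc_id:
            out.append(doc_id)
    return out

print(intersect_merge([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
print(intersect_galloping([2, 31, 54, 101], [1, 2, 4, 11, 31, 45, 173, 174]))
```

Galloping pays off when one list is much shorter than the other; for lists of similar length the plain merge is usually simpler and just as fast.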

Phrase queries

1. Biword Indexes

2. Positional Index --> Proximity Queries
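A positional index stores, for each term, the positions at which it occurs in each document, which is what makes proximity queries possible. The sketch below is only an illustration under simple assumptions: whitespace tokenization, lowercase matching, and hypothetical helper names `build_positional_index` and `proximity_query`.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def proximity_query(index, t1, t2, k=1, ordered=True):
    """Return doc IDs where t2 occurs within k positions of t1.

    With k=1 and ordered=True this is the two-word phrase query "t1 t2".
    """
    hits = []
    common = index.get(t1, {}).keys() & index.get(t2, {}).keys()
    for doc_id in sorted(common):
        for p1 in index[t1][doc_id]:
            if any((0 < p2 - p1 <= k) if ordered else (0 < abs(p2 - p1) <= k)
                   for p2 in index[t2][doc_id]):
                hits.append(doc_id)
                break
    return hits

docs = {1: "to be or not to be", 2: "be quick to act"}
idx = build_positional_index(docs)
print(proximity_query(idx, "to", "be"))                      # phrase "to be"
print(proximity_query(idx, "to", "be", k=2, ordered=False))  # within 2 words
```

A biword index would answer the phrase query with a single dictionary lookup, at the cost of a much larger vocabulary; the positional index is more flexible because the window k can vary per query.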

Index Construction

Exploring the sort step during construction:

Blocked sort-based indexing (BSBI)
Single-pass in-memory indexing (SPIMI)
Dynamic indexing
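As a rough sketch of the single-pass idea: postings are appended directly to a per-block dictionary, each block is sorted by term once it is full, and the blocks are merged at the end. The example makes several simplifying assumptions that are not in the notes: blocks are limited by token count rather than by memory, blocks stay in memory instead of being written to disk, and the (term, doc ID) stream arrives in non-decreasing doc ID order. The names `spimi_blocks` and `merge_blocks` are hypothetical.

```python
import heapq
from collections import defaultdict

def spimi_blocks(token_stream, block_size):
    """Yield sorted (term, postings) blocks built from at most block_size tokens each."""
    block = defaultdict(list)
    count = 0
    for term, doc_id in token_stream:
        postings = block[term]          # new terms get a fresh postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        count += 1
        if count == block_size:
            yield sorted(block.items())  # sort terms only when the block is full
            block = defaultdict(list)
            count = 0
    if block:
        yield sorted(block.items())

def merge_blocks(blocks):
    """Merge term-sorted blocks into a single term -> postings index."""
    index = {}
    for term, postings in heapq.merge(*blocks):
        merged = index.setdefault(term, [])
        for doc_id in postings:
            if not merged or merged[-1] != doc_id:
                merged.append(doc_id)
    return index

stream = [("a", 1), ("b", 1), ("a", 2), ("c", 2), ("b", 3), ("a", 3)]
print(merge_blocks(spimi_blocks(stream, block_size=2)))
```

BSBI differs mainly in that it collects (termID, docID) pairs and sorts the pairs themselves, which requires a global term-to-ID mapping; the sketch above avoids that mapping by keying on the term string.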

Compression

Heaps’ law: M = kT^b

Zipf’s law: cf_i = K/i

Dictionary compression

Postings list compression

Approach: basic querying first, then index construction, then compression
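One common way to compress a postings list is to store gaps between consecutive doc IDs and encode each gap with a variable byte (VB) code. The notes do not say which scheme is meant, so the following is a sketch of that particular choice, with 7 payload bits per byte and the high bit marking the last byte of each number.

```python
def vb_encode_number(n):
    """Variable byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128  # continuation bit marks the final byte of this number
    return bytes(out)

def vb_encode(doc_ids):
    """Gap-encode a sorted postings list, then VB-encode each gap."""
    encoded = bytearray()
    prev = 0
    for doc_id in doc_ids:
        encoded += vb_encode_number(doc_id - prev)
        prev = doc_id
    return bytes(encoded)

def vb_decode(data):
    """Inverse of vb_encode: rebuild doc IDs from the byte stream."""
    doc_ids = []
    n = 0
    prev = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            prev += 128 * n + (byte - 128)  # gap complete, add it to the running doc ID
            doc_ids.append(prev)
            n = 0
    return doc_ids

postings = [824, 829, 215406]
packed = vb_encode(postings)
print(list(packed))        # [6, 184, 133, 13, 12, 177]
print(vb_decode(packed))   # [824, 829, 215406]
```

Gaps are small for frequent terms, so most of them fit in a single byte, which is where the saving comes from.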

Tolerant Retrieval & Spelling Correction & Language Model

WILD-CARD QUERIES

prefix
suffix
"mon*ing"
"Permuterm vocabulary"
K-gram indexes
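To make the two wildcard structures concrete, here is a small sketch. It assumes `$` as the end-of-word marker and handles only queries with a single `*`; the function names are made up for illustration.

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation is a key pointing back to the term."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def permuterm_key(query):
    """Rotate a single-* wildcard query X*Y into Y$X, to be used as a prefix lookup."""
    head, tail = query.split("*")
    return tail + "$" + head

def kgrams(term, k=2):
    """k-grams of '$term$', the units stored in a k-gram index."""
    s = "$" + term + "$"
    return [s[i:i + k] for i in range(len(s) - k + 1)]

key = permuterm_key("mon*ing")
print(key)  # ing$mon
print(any(r.startswith(key) for r in permuterm_rotations("monitoring")))  # True
print(kgrams("moon"))  # ['$m', 'mo', 'oo', 'on', 'n$']
```

With a k-gram index, the query "mon*ing" would instead be turned into a Boolean AND over the k-grams of `$mon` and `ing$`, followed by a post-filtering step against the original pattern, since matching k-grams alone can produce false positives.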

Spelling Correction

(1) Error detection

(2) Error correction
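A minimal sketch of the two steps, assuming isolated-word correction against a known vocabulary and Levenshtein edit distance as the similarity measure (the notes do not fix either choice). `correct` and its `max_dist` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Detection: the word is not in the vocabulary. Correction: nearest vocabulary term."""
    if word in vocabulary:
        return word
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else word

vocab = {"information", "retrieval", "extraction"}
print(edit_distance("kitten", "sitting"))  # 3
print(correct("informaton", vocab))        # information
```

Scanning the whole vocabulary is only workable for small dictionaries; in practice a k-gram index is typically used first to narrow the candidate set before computing edit distances.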

Language Model

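Query likelihood model --> mixture model: Jelinek-Mercer method

Compute the probability of the query under the document language model M_d, then rank documents by it.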
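The ranking step can be sketched as below, using unigram maximum-likelihood estimates and linear interpolation between the document model and the collection model. The mixing weight `lam`, the tokenized-list input format, and the choice to skip query terms that never occur in the collection are assumptions of this example.

```python
import math
from collections import Counter

def jm_score(query, doc, collection_counts, collection_len, lam=0.5):
    """log P(q | d) with P(t | d) = lam * P_mle(t | M_d) + (1 - lam) * P_mle(t | M_c).

    Assumes 0 < lam < 1 and a non-empty document.
    """
    doc_counts = Counter(doc)
    score = 0.0
    for t in query:
        p_col = collection_counts[t] / collection_len
        if p_col == 0:
            continue  # assumption: ignore query terms unseen in the whole collection
        p_doc = doc_counts[t] / len(doc)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

def rank(query, docs, lam=0.5):
    """Return doc IDs sorted by decreasing query likelihood."""
    collection_counts = Counter(t for d in docs.values() for t in d)
    collection_len = sum(len(d) for d in docs.values())
    scored = [(jm_score(query, d, collection_counts, collection_len, lam), doc_id)
              for doc_id, d in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

docs = {
    "d1": "xyzzy reports a profit but revenue is down".split(),
    "d2": "quorn narrows quarter loss but revenue decreases further".split(),
}
print(rank(["revenue", "down"], docs))  # ['d1', 'd2']
```

Without the collection component, d2 would get probability zero for the query because it never contains "down"; smoothing keeps it rankable while still preferring d1.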

Probabilistic Model

Binary Independence Model (BIM)

For a given query, should a given term appear in the document?

When a new doc arrives, compute the relationship between every term and that doc to obtain C_i.
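For reference, the notes do not spell out C_i. In the usual textbook formulation of the BIM (assumed here), let p_i be the probability that term i appears in a relevant document and r_i the probability that it appears in a non-relevant one. The term weight and the resulting retrieval status value of a document d are then:

```latex
C_i = \log \frac{p_i \,(1 - r_i)}{r_i \,(1 - p_i)},
\qquad
\mathrm{RSV}_d = \sum_{i \,:\, x_i = q_i = 1} C_i
```

Here x_i and q_i indicate whether term i occurs in the document and in the query, so only query terms present in the document contribute to the score.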

Link Analysis

The number of pages with in-degree i is proportional to 1/i^α, e.g. α = 2.1

1. Number of In Degree.

2. "Flow" Model

- small graphs.
- large graphs. (asymptotic behavior of Markov chains)

  - Spider traps
  - Dead Ends
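The flow model on large graphs is usually solved by power iteration, with random teleportation as the standard remedy for both spider traps and dead ends. The sketch below assumes that remedy; the damping factor `beta`, the adjacency-dict input, and the function name `pagerank` are choices made for this example.

```python
def pagerank(links, beta=0.85, iters=100, tol=1e-10):
    """links: dict node -> list of out-neighbours. Returns dict node -> rank."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: 0.0 for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = beta * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
        # Rank lost to teleportation (1 - beta) and to dead ends is re-spread evenly.
        leaked = 1.0 - sum(new.values())
        for u in nodes:
            new[u] += leaked / n
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank

# "b" links only to itself (a spider trap); teleportation keeps "a" above zero.
print(pagerank({"a": ["b"], "b": ["b"]}, beta=0.8))
# "c" has no out-links (a dead end); the ranks still sum to 1.
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```

Without teleportation, a spider trap would eventually absorb all the rank and a dead end would let rank leak out of the system, so the iteration would not converge to a useful distribution.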

Ranking - top k

Exact methods:

Cosine Similarity: tf-idf

Exact speedups:

Use Quick Select: O(n + k * log(k)): "find top k" + "sort top k"
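The "find top k, then sort top k" idea can be sketched as follows, assuming the scores have already been computed and are given as (score, doc_id) pairs. The partition is a Hoare-style scheme with a random pivot, which gives expected (not worst-case) linear time; `top_k` is a hypothetical name.

```python
import random

def top_k(scores, k):
    """Return the k largest (score, doc_id) pairs in descending order.

    Quickselect moves the k largest items to the front in expected O(n);
    sorting just those k items costs O(k log k).
    """
    items = list(scores)
    if k <= 0:
        return []
    if k >= len(items):
        return sorted(items, reverse=True)
    lo, hi = 0, len(items) - 1
    while lo < hi:
        pivot = items[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:  # partition in descending order around the pivot value
            while items[i] > pivot:
                i += 1
            while items[j] < pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Now items[lo..j] >= pivot >= items[i..hi]; keep the side holding index k-1.
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break
    return sorted(items[:k], reverse=True)

scores = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4"), (0.1, "d5")]
print(top_k(scores, 2))  # [(0.9, 'd2'), (0.7, 'd4')]
```

A bounded min-heap of size k is a common alternative with O(n log k) cost; it is simpler and works well when scores are produced one document at a time.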

Threshold Methods - MaxScore Method

Approximate speedups:

Index Elimination (heuristic function)

3 of 4 query terms

Champion List

Cluster Pruning Method

Evaluation

Evaluation of unranked retrieval results
Evaluation of ranked retrieval results
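The notes do not name specific measures, so the sketch below picks the usual ones as an assumption: precision, recall, and F1 for unranked result sets, and average precision for ranked lists.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based (unranked) measures."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(ranking, relevant):
    """Ranked measure: precision@k summed over the ranks k that hold a relevant doc,
    divided by the total number of relevant docs."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for k, doc_id in enumerate(ranking, 1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

print(precision_recall_f1([1, 2, 3, 4], [2, 4, 5]))
print(average_precision([1, 2, 3, 4], [2, 4, 5]))
```

Averaging `average_precision` over a set of queries gives MAP, which rewards systems that place relevant documents near the top rather than merely somewhere in the result list.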

Big goals --> small goals

• Text Categorization:
　　– Classify an entire document

• Information Extraction (IE):
　　– Identify and classify small units within documents

segmentation: extract terms (NE), syntax
classification: identify terms (type, Chunking), semantics
association: cluster terms

• Named Entity Extraction (NE):
　　– A subset of IE
　　– Identify and classify proper names: "People, locations, organizations"

Main tasks
• Named Entity Recognition
• Relation Extraction

Pattern-based Relation Extraction

– Relation extraction and its difficulties

– Use of POS Tags
– Use of Constituent Parse
– Use of Dependency Parse
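As an illustration of the POS-tag variant only, the sketch below matches the surface pattern "proper noun, verb, optional preposition, proper noun" over tokens that are assumed to be already tagged with Penn Treebank tags by some external tagger. The pattern and the function names are made up for this example and are far simpler than what a real system would need.

```python
def chunk_proper_nouns(tagged):
    """Group consecutive NNP tokens into one entity; other tokens stay as they are."""
    chunks = []
    for word, tag in tagged:
        if tag == "NNP" and chunks and chunks[-1][1] == "NNP":
            chunks[-1] = (chunks[-1][0] + " " + word, "NNP")
        else:
            chunks.append((word, tag))
    return chunks

def extract_relations(tagged):
    """Match the pattern  NNP  VB*  (IN)?  NNP  and emit (arg1, relation, arg2)."""
    chunks = chunk_proper_nouns(tagged)
    relations = []
    for i, (word, tag) in enumerate(chunks):
        if tag != "NNP" or i + 2 >= len(chunks):
            continue
        j = i + 1
        if not chunks[j][1].startswith("VB"):
            continue
        rel = [chunks[j][0]]
        j += 1
        if chunks[j][1] == "IN" and j + 1 < len(chunks):
            rel.append(chunks[j][0])  # fold the preposition into the relation name
            j += 1
        if chunks[j][1] == "NNP":
            relations.append((word, " ".join(rel), chunks[j][0]))
    return relations

sentence = [("Turing", "NNP"), ("worked", "VBD"), ("at", "IN"),
            ("Bletchley", "NNP"), ("Park", "NNP")]
print(extract_relations(sentence))  # [('Turing', 'worked at', 'Bletchley Park')]
```

Surface patterns like this break as soon as modifiers or clauses separate the arguments, which is one reason constituent and dependency parses are used: they relate the arguments through syntactic structure rather than adjacency.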

