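A baseline for the paper: keywords and rule matching
For the detailed reasoning, see the note 小论文树立0317 (short-paper outline, 0317).
The keywords fall into the following groups:
/**** Generic filter words. Could these be derived from the words that co-occur with a program, combined with tf-idf? *****/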
public static String[] tvTerms={"观看","收看","节目","电视","表演","演出"};
public static String[] channelTerms={"央视","中央电视台","春晚","春节联欢晚会"};
public static String[] commentTerms={"赞","好看","精彩","失望","感动","吐槽","无聊"};
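For each program:
the program's actors and the program's category,
plus expansions based on the actors and the category, which carry a natural weight.
Filtering strategy:
If the text contains both the title and one of the program's actors, label True
If the text contains both the title and the program's category, label True
If the program title is wrapped in quotation marks (in the code: preceded directly by the title mark 《), label True
For the remaining keywords, sum their weights; if the sum exceeds the threshold, label True
- Choosing the threshold: (leave the keywords aside for now) <the weighting part was never carried any further, though>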
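The weighted rule in the filtering strategy above was not carried through in this project, so the following is only a minimal sketch of what it could look like. The keyword weights and the threshold are made-up placeholders, and `keyword_score` and `label_by_weight` are hypothetical names; none of them come from the original code.

```python
# Hypothetical weights and threshold (assumptions for illustration only;
# the original project never fixed these values).
KEYWORD_WEIGHTS = {
    "春晚": 0.5,
    "节目": 0.3,
    "表演": 0.3,
    "精彩": 0.2,
}
THRESHOLD = 0.6


def keyword_score(text, weights):
    """Sum the weights of every keyword that occurs in the text."""
    return sum(w for kw, w in weights.items() if kw in text)


def label_by_weight(text, weights=KEYWORD_WEIGHTS, threshold=THRESHOLD):
    """Return 1 when the summed keyword weight exceeds the threshold, else 0."""
    return 1 if keyword_score(text, weights) > threshold else 0
```

A text that hits several keywords passes the threshold, while a text that hits only one low-weight comment word does not. In practice the weights would have to be estimated from data (for example from tf-idf), which is the step the note says was left undone.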
- The Java project for determining the weights
package com.bobo.baseline;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;

import com.bobo.features.ActorsFeature;
import com.bobo.features.CategoryFeature;
import com.bobo.features.ExpandFeature;
import com.bobo.features.GeneralRulesFeatures;
import com.bobo.features.TitleFeature;
import com.bobo.myinterface.MyFileFilter;
import com.bobo.util.Constants;
import com.bobo.util.FileUtil;

public class KeywordAndRulesMatherBaseLine {
    // Annotated input files (*.dealed) and the matching output files (*.keywordsMatch)
    private ArrayList<File> dealedList = new ArrayList<File>();
    private ArrayList<File> keywordsOutList = new ArrayList<File>();

    public static void main(String[] args) {
        KeywordAndRulesMatherBaseLine baseLine = new KeywordAndRulesMatherBaseLine();
        baseLine.init();
        baseLine.labelForAll();
        System.out.println("All done");
    }

    private void init() {
        // Collect every annotated data file
        FileUtil.showAllFiles(new File(Constants.DataDir + "/" + "raw_data"), new MyFileFilter(".dealed"), dealedList);
        for (int i = 0; i < dealedList.size(); i++) {
            String dealedPath = dealedList.get(i).getAbsolutePath();
            // Swap the last extension for .keywordsMatch
            String outPath = dealedPath.substring(0, dealedPath.lastIndexOf(".")) + ".keywordsMatch";
            keywordsOutList.add(new File(outPath));
        }
    }

    public void labelForAll() {
        for (int i = 0; i < dealedList.size(); i++) {
            File dealed = dealedList.get(i);
            File out = keywordsOutList.get(i);
            String path = dealed.getAbsolutePath();
            // Pick the actor and category lists by the program title found in the path
            if (path.contains("时间都去哪儿")) {
                labelForFile(dealed, out, Constants.ActorShijian, Constants.categoryGequ, "时间都去哪儿");
            } else if (path.contains("团圆饭")) {
                labelForFile(dealed, out, Constants.ActorTuanyuan, Constants.categoryMoshu, "团圆饭");
            } else if (path.contains("说你什么好")) {
                labelForFile(dealed, out, Constants.ActorShuoni, Constants.categoryXiangsheng, "说你什么好");
            } else if (path.contains("我就这么个人")) {
                labelForFile(dealed, out, Constants.ActorWojiu, Constants.categoryXiaopin, "我就这么个人");
            } else if (path.contains("我的要求不算高")) {
                labelForFile(dealed, out, Constants.ActorWode, Constants.categoryGequ, "我的要求不算高");
            } else if (path.contains("扶不扶")) {
                labelForFile(dealed, out, Constants.ActorFubu, Constants.categoryXiaopin, "扶不扶");
            } else if (path.contains("人到礼到")) {
                labelForFile(dealed, out, Constants.ActorRendao, Constants.categoryXiaopin, "人到礼到");
            }
            System.out.println(out + " processed!");
        }
    }

    public void labelForFile(File dealedFile, File keywordsFile, String[] actors,
            String[] categorys, String title) {
        // try-with-resources closes both streams even when opening or reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(dealedFile));
             PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(keywordsFile)))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] lineArr = line.split("\t");
                // The weibo text is the last tab-separated field
                String weiboText = lineArr[lineArr.length - 1];
                // Output format: true label \t predicted label \t weibo text
                pw.println(lineArr[0] + "\t" + labelForSingle(weiboText, actors,
                        categorys, title) + "\t" + weiboText);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public Integer labelForSingle(String weiboText, String[] actors,
            String[] categorys, String title) {
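        // Label values assumed: 1 = related to the program (label True), 0 = not related
        // Rule: the text mentions one of the program's actors
        for (String actor : actors) {
            if (weiboText.contains(actor)) {
                return 1;
            }
        }
        // Rule: the text mentions the program's category
        for (String cate : categorys) {
            if (weiboText.contains(cate)) {
                return 1;
            }
        }
        // Rule: the text contains a generic TV term
        for (String word : Constants.tvTerms) {
            if (weiboText.contains(word)) {
                return 1;
            }
        }
        // Rule: the text contains a generic comment term
        for (String word : Constants.commentTerms) {
            if (weiboText.contains(word)) {
                return 1;
            }
        }
        // Rule: the title directly follows the first opening title mark 《
        if (!weiboText.contains("《") || !weiboText.contains(title)) {
            return 0;
        } else {
            int symbolIndex = weiboText.indexOf("《");
            int titleIndex = weiboText.indexOf(title);
            if (titleIndex == symbolIndex + 1) {
                return 1;
            }
        }
        return 0;
    }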
}

package com.bobo.util;

public class Constants {
    public final static String RootDir = "H:/paper_related/socialTvProgram";
    public final static String DataDir = "/media/新加卷/小论文实验/data/liweibo";

    /*** Actors of each program ***/
    // 时间都去哪儿
    public final static String[] ActorShijian = {"王铮亮"};
    // 我的要求不算高
    public final static String[] ActorWode = {"黄渤"};
    // 团圆饭
    public final static String[] ActorTuanyuan = {"YIF", "yif", "Yif", "王亦丰"};
    // 说你什么好
    public final static String[] ActorShuoni = {"曹云金", "刘云天"};
    // 我就这么个人
    public final static String[] ActorWojiu = {"冯巩", "曹随峰", "蒋诗萌"};
    // 扶不扶
    public final static String[] ActorFubu = {"杜晓宇", "马丽", "沈腾"};
    // 人到礼到
    public final static String[] ActorRendao = {"郭子", "郭冬临", "邵峰", "牛莉"};

    /*** Program categories ***/
    public final static String[] categoryGequ = {"歌", "唱"};
    public final static String[] categoryXiaopin = {"小品"};
    public final static String[] categoryMoshu = {"魔术"};
    public final static String[] categoryXiangsheng = {"相声"};

    /**** Generic filter words ****/
    public static String[] tvTerms = {"观看", "收看", "节目", "电视", "表演", "演出"};
    public static String[] channelTerms = {"央视", "中央电视台", "春晚", "春节联欢晚会"};
    public static String[] commentTerms = {"赞", "好看", "精彩", "吐槽", "无聊", "不错", "给力", "接地气"};
}
- The Python project for the evaluation metrics
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import myUtil
from sklearn import metrics

root_dir = "/media/新加卷/小论文实验/data/liweibo/raw_data"


def loadAllFileWithSuffix(suffix):
    file_list = list()
    myUtil.traverseFile(root_dir, suffix, file_list)
    return file_list


# inFilePath is the keywordsMatch file in a program's directory; each line is
# true label "\t" predicted label "\t" weibo text
def testForEachFile(inFilePath):
    y_true = list()
    y_pred = list()
    print(inFilePath)
    with open(inFilePath) as inFile:
        for line in inFile:
            fields = line.split("\t")
            y_true.append(int(fields[0]))
            y_pred.append(int(fields[1]))
    # The script as originally run called metrics.accuracy_score here, so the
    # "precision" figures it printed are equal to accuracy.
    precision = metrics.precision_score(y_true, y_pred)
    recall = metrics.recall_score(y_true, y_pred)
    accuracy = metrics.accuracy_score(y_true, y_pred)
    # beta=1 (plain F1) is assumed
    f = metrics.fbeta_score(y_true, y_pred, beta=1)
    print("precision:%0.2f,recall:%0.2f,f:%0.2f,accuracy:%0.2f" % (precision, recall, f, accuracy))
    return (precision, recall, accuracy, f)


# Call testForEachFile on every file in turn and average precision, recall, accuracy and f
def testForAll(inFileList):
    mean_precision = 0.0
    mean_recall = 0.0
    mean_accuracy = 0.0
    mean_f = 0.0
    for inFilePath in inFileList:
        (precision, recall, accuracy, f) = testForEachFile(inFilePath)
        mean_precision += precision
        mean_recall += recall
        mean_accuracy += accuracy
        mean_f += f
    listLen = len(inFileList)
    mean_precision /= listLen
    mean_recall /= listLen
    mean_accuracy /= listLen
    mean_f /= listLen
    print("Mean of each metric over all programs:")
    print("mean_precision:%0.2f,mean_recall:%0.2f,mean_f:%0.2f,mean_accuracy:%0.2f" % (mean_precision, mean_recall, mean_f, mean_accuracy))
    return (mean_precision, mean_recall, mean_accuracy, mean_f)


def main():
    fileList = loadAllFileWithSuffix(['keywordsMatch'])
    testForAll(fileList)


if __name__ == '__main__':
    main()
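The final results are as follows. The precision figure equals the accuracy figure in every row because the script, as it was run, computed precision with metrics.accuracy_score: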
/media/新加卷/小论文实验/data/liweibo/raw_data/人到礼到/人到礼到.title.sample.annotate.keywordsMatch
precision:0.87,recall:0.84,f:0.89,accuracy:0.87
/media/新加卷/小论文实验/data/liweibo/raw_data/团圆饭/团圆饭.title.sample.annotate.keywordsMatch
precision:0.81,recall:0.98,f:0.79,accuracy:0.81
/media/新加卷/小论文实验/data/liweibo/raw_data/我就这么个人/我就这么个人.title.sample.annotate.keywordsMatch
precision:0.94,recall:0.97,f:0.96,accuracy:0.94
/media/新加卷/小论文实验/data/liweibo/raw_data/我的要求不算高/我的要求不算高.title.sample.annotate.keywordsMatch
precision:0.91,recall:0.94,f:0.93,accuracy:0.91
/media/新加卷/小论文实验/data/liweibo/raw_data/扶不扶/扶不扶.title.sample.annotate.keywordsMatch
precision:0.72,recall:0.69,f:0.81,accuracy:0.72
/media/新加卷/小论文实验/data/liweibo/raw_data/时间都去哪儿/时间都去哪儿.title.sample.annotate.keywordsMatch
precision:0.72,recall:0.62,f:0.73,accuracy:0.72
/media/新加卷/小论文实验/data/liweibo/raw_data/说你什么好/说你什么好.title.sample.annotate.keywordsMatch
precision:0.93,recall:0.98,f:0.92,accuracy:0.93
Mean of each metric over all programs:
mean_precision:0.84,mean_recall:0.86,mean_f:0.86,mean_accuracy:0.84
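For reference, the four metrics printed above can be derived directly from the confusion counts of the 0/1 labels. The sketch below is not part of the original project; `binary_metrics` is a hypothetical helper written for Python 3 without scikit-learn, shown only to make the difference between precision and accuracy concrete.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1, accuracy) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, f1, accuracy


if __name__ == "__main__":
    # Toy labels, made up for illustration.
    print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

On the toy labels there are 2 true positives, 1 false positive, 1 false negative and 1 true negative, so precision is 2/3 while accuracy is 3/5. The two only coincide by chance, which is why a precision column that always matches accuracy points to the metric call rather than to the data.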