Original link: http://www.one2know.cn/nlp17/

  • Dataset

    The 20 Newsgroups dataset in scikit-learn: 18,846 emails in total, 11,314 in the training set, 7,532 in the test set, 20 categories.
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
x_train = newsgroups_train.data
x_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
print('List of all 20 categories:')
print(newsgroups_train.target_names,'\n')
print('Sample Email:')
print(x_train[0])
print('Sample Target Category:')
print(y_train[0])
print(newsgroups_train.target_names[y_train[0]])

Output:

List of all 20 categories:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Sample Email:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
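  • Implementation steps
  1. Preprocessing

    1) Remove punctuation

    2) Tokenize

    3) Convert all words to lowercase

    4) Remove stop words

    5) Keep only words of at least 3 characters

    6) Stemming

    7) Part-of-speech tagging

    8) Lemmatization
  2. TF-IDF vectorization
  3. Training and testing the deep learning model
  4. Model evaluation and analysis of results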
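
Steps 1) to 5) of the preprocessing stage can be tried on a single line of text without any NLP library. The following is only a minimal sketch: the whitespace split stands in for a real tokenizer, `basic_clean` is a hypothetical helper name, and `TOY_STOP_WORDS` is a deliberately tiny made-up stop-word list, not NLTK's English list. Stemming, part-of-speech tagging and lemmatization are left out because they need NLTK.

```python
import string

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
TOY_STOP_WORDS = {"the", "was", "this", "from", "what"}

def basic_clean(text):
    """Apply steps 1-5: strip punctuation, tokenize, lowercase, drop stop words, keep len >= 3."""
    # 1) replace every punctuation character with a space
    no_punct = "".join(" " if ch in string.punctuation else ch for ch in text)
    # 2) tokenize (plain whitespace split as a stand-in for a real tokenizer)
    tokens = no_punct.split()
    # 3) lowercase
    tokens = [t.lower() for t in tokens]
    # 4) and 5) remove stop words and tokens shorter than 3 characters
    return [t for t in tokens if t not in TOY_STOP_WORDS and len(t) >= 3]

cleaned = basic_clean("Subject: WHAT car is this!?")
print(cleaned)
```

Only `subject` and `car` survive here: `what` and `this` are in the toy stop-word list, and `is` is shorter than 3 characters.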
  • Code
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
x_train = newsgroups_train.data
x_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# print('List of all 20 categories:')
# print(newsgroups_train.target_names,'\n')
# print('Sample Email:')
# print(x_train[0])
# print('Sample Target Category:')
# print(y_train[0])
# print(newsgroups_train.target_names[y_train[0]])

# The NLTK tokenizer, stop-word, POS-tagger and WordNet data must be
# downloaded beforehand with nltk.download()
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import pandas as pd
from nltk import pos_tag
from nltk.stem import PorterStemmer

def preprocessing(text):
    # Replace punctuation with spaces, split on whitespace, then rejoin with single spaces
    text2 = ' '.join(''.join([' ' if ch in string.punctuation else ch for ch in text]).split())
    # Tokenize
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
    tokens = [word.lower() for word in tokens]
    stopwds = stopwords.words('english')
    # Drop stop words and tokens shorter than 3 characters
    tokens = [token for token in tokens if token not in stopwds and len(token) >= 3]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    # Part-of-speech tagging
    tagged_corpus = pos_tag(tokens)
    # Common noun, proper noun, plural proper noun, plural common noun
    Noun_tags = ['NN','NNP','NNPS','NNS']
    # Base form, past tense, present participle, past participle,
    # present tense, third-person singular present
    Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    lemmatizer = WordNetLemmatizer()
    def prat_lemmatize(token,tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token,'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token,'v')
        else:
            return lemmatizer.lemmatize(token,'n')
    pre_proc_text = ' '.join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])
    return pre_proc_text

# Preprocess the dataset
x_train_preprocessed = []
for i in x_train:
    x_train_preprocessed.append(preprocessing(i))
x_test_preprocessed = []
for i in x_test:
    x_test_preprocessed.append(preprocessing(i))

# Compute the TF-IDF vector of each document
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2,ngram_range=(1,2),stop_words='english',
                             max_features=10000,strip_accents='unicode',norm='l2')
# Sparse matrix => dense array (toarray() returns a plain ndarray rather than np.matrix)
x_train_2 = vectorizer.fit_transform(x_train_preprocessed).toarray()
x_test_2 = vectorizer.transform(x_test_preprocessed).toarray()

# Import the deep learning modules
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation
from keras.optimizers import Adadelta,Adam,RMSprop
from keras.utils import np_utils

np.random.seed(0)
nb_classes = 20
batch_size = 64 # batch size
nb_epochs = 20 # number of epochs

# Convert the 20 class labels to one-hot vectors
Y_train = np_utils.to_categorical(y_train,nb_classes)

# Build the Keras model: 3 hidden layers with 1000, 500 and 50 units,
# 50% dropout after each hidden layer, Adam as the optimizer
model = Sequential()
model.add(Dense(1000,input_shape=(10000,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam')
# loss = categorical cross-entropy, optimizer = Adam
print(model.summary())

# Train the model
model.fit(x_train_2,Y_train,batch_size=batch_size,epochs=nb_epochs,verbose=1)

# Predict classes
# Note: predict_classes exists only in older Keras releases; where it is
# unavailable, np.argmax(model.predict(x), axis=-1) gives the same labels.
y_train_predclass = model.predict_classes(x_train_2,batch_size=batch_size)
y_test_preclass = model.predict_classes(x_test_2,batch_size=batch_size)
from sklearn.metrics import accuracy_score,classification_report
print("\n\nDeep Neural Network - Train accuracy:",round(accuracy_score(y_train,y_train_predclass),3))
print("\nDeep Neural Network - Test accuracy:",round(accuracy_score(y_test,y_test_preclass),3))
print("\nDeep Neural Network - Train Classification Report")
print(classification_report(y_train,y_train_predclass))
print("\nDeep Neural Network - Test Classification Report")
print(classification_report(y_test,y_test_preclass))
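
For readers who want to see what the TF-IDF step computes, the sketch below reproduces the weighting on a toy corpus in plain Python. It assumes the smoothed IDF formula that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by the L2 row normalization requested by `norm='l2'`. The helper `tfidf_rows` and the three toy documents are made up for illustration and are not part of the original code; n-grams, `min_df` and `max_features` are ignored.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed inverse document frequency (assumed scikit-learn default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy corpus (made up for illustration)
vocab, rows = tfidf_rows(["car engine car", "hockey game", "car game"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the second toy document, `hockey` gets a larger weight than `game` because it occurs in fewer documents, which is the effect the vectorizer relies on to emphasize category-specific words.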

Output of the full script:

Using TensorFlow backend.
WARNING:tensorflow:From
D:\Python37\Lib\site-packages\tensorflow\python\framework\op_def_library.py:263:
colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a
future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From
D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout
(from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a
future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 1000)              10001000
_________________________________________________________________
activation_1 (Activation)    (None, 1000)              0
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0
_________________________________________________________________
dense_2 (Dense)              (None, 500)               500500
_________________________________________________________________
activation_2 (Activation)    (None, 500)               0
_________________________________________________________________
dropout_2 (Dropout)          (None, 500)               0
_________________________________________________________________
dense_3 (Dense)              (None, 50)                25050
_________________________________________________________________
activation_3 (Activation)    (None, 50)                0
_________________________________________________________________
dropout_3 (Dropout)          (None, 50)                0
_________________________________________________________________
dense_4 (Dense)              (None, 20)                1020
_________________________________________________________________
activation_4 (Activation)    (None, 20)                0
=================================================================
Total params: 10,527,570
Trainable params: 10,527,570
Non-trainable params: 0
_________________________________________________________________
None
WARNING:tensorflow:From
D:\Python37\Lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from
tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/20
2019-07-06 23:03:46.934966: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU
supports instructions that this TensorFlow binary was not compiled to use: AVX2
   64/11314 [..............................] - ETA: 4:41 - loss: 2.9946
  128/11314 [..............................] - ETA: 2:43 - loss: 2.9948
  192/11314 [..............................] - ETA: 2:03 - loss: 2.9951
  256/11314 [..............................] - ETA: 1:43 - loss: 2.9947
  320/11314 [..............................] - ETA: 1:32 - loss: 2.9938
(output for the remaining batches and epochs omitted)

Deep Neural Network - Train accuracy: 0.999

Deep Neural Network - Test accuracy: 0.811

Deep Neural Network - Train Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       480
           1       1.00      0.99      1.00       584
           2       0.99      1.00      1.00       591
           3       1.00      1.00      1.00       590
           4       1.00      1.00      1.00       578
           5       1.00      1.00      1.00       593
           6       1.00      1.00      1.00       585
           7       1.00      1.00      1.00       594
           8       1.00      1.00      1.00       598
           9       1.00      1.00      1.00       597
          10       1.00      1.00      1.00       600
          11       1.00      1.00      1.00       595
          12       1.00      1.00      1.00       591
          13       1.00      1.00      1.00       594
          14       1.00      1.00      1.00       593
          15       1.00      1.00      1.00       599
          16       1.00      1.00      1.00       546
          17       1.00      1.00      1.00       564
          18       1.00      1.00      1.00       465
          19       1.00      1.00      1.00       377

    accuracy                           1.00     11314
   macro avg       1.00      1.00      1.00     11314
weighted avg       1.00      1.00      1.00     11314

Deep Neural Network - Test Classification Report
              precision    recall  f1-score   support

           0       0.78      0.78      0.78       319
           1       0.70      0.74      0.72       389
           2       0.68      0.69      0.68       394
           3       0.71      0.69      0.70       392
           4       0.82      0.76      0.79       385
           5       0.84      0.74      0.78       395
           6       0.73      0.87      0.80       390
           7       0.85      0.86      0.86       396
           8       0.93      0.91      0.92       398
           9       0.89      0.91      0.90       397
          10       0.96      0.97      0.96       399
          11       0.87      0.95      0.91       396
          12       0.69      0.72      0.70       393
          13       0.88      0.77      0.82       396
          14       0.83      0.92      0.87       394
          15       0.91      0.84      0.88       398
          16       0.78      0.83      0.80       364
          17       0.97      0.87      0.92       376
          18       0.74      0.66      0.70       310
          19       0.59      0.62      0.61       251

    accuracy                           0.81      7532
   macro avg       0.81      0.81      0.81      7532
weighted avg       0.81      0.81      0.81      7532
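
Each row of the classification reports above is derived from simple counts of correct and incorrect predictions. As a rough sketch, the hypothetical helper `per_class_scores` below computes precision, recall and F1 for one class from true and predicted labels; the label lists are made-up toy data, not the newsgroup results.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up for illustration)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
precision, recall, f1 = per_class_scores(y_true, y_pred, label=0)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Read this way, the test report shows class 19 (talk.religion.misc) as the weakest category with an F1 of 0.61, and the gap between the 0.999 train accuracy and the 0.811 test accuracy suggests that the network overfits the training set.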
