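1. Dataset

trec06c is a public spam corpus released by the Text REtrieval Conference (TREC). The 2006 corpus comes in an English set (trec06p) and a Chinese set (trec06c). All messages come from real mail and keep their original format and content. Download: https://plg.uwaterloo.ca/~gvcormac/treccorpus06/

The messages are scattered across many files, so for convenience I copied the positive samples (ham) and the negative samples (spam) into the ham_data and spam_data folders respectively (blame a Virgo's compulsiveness).

Positive samples (ham): 21766

Negative samples (spam): 42854

Chinese stop words: chinese_stop_vocab.txt

All the data used below has been uploaded to GitHub.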
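
As a quick sanity check on the sample counts above, the labels in the corpus index can be tallied. This is a minimal sketch, assuming each index line has the form `<label> ../data/<folder>/<file>` (the format that the index-parsing code in section 6.2 relies on); the function name `count_labels` and the sample lines are made up for illustration.

```python
from collections import Counter

def count_labels(index_lines):
    """Tally the label (first whitespace-separated field) of each index line."""
    counts = Counter()
    for line in index_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0].lower()] += 1
    return counts

# Hypothetical sample lines in the assumed index format
sample = ["spam ../data/000/000", "ham ../data/000/001", "spam ../data/000/002"]
print(count_labels(sample))
```

Run over the real index file (an open file object can be passed directly, since it iterates line by line), the tallies should match the ham and spam counts listed above.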

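2. Approach

Preprocess each email
- Remove all non-Chinese characters: punctuation, English letters, digits, URLs and other special characters
- Segment the email body into words
- Filter out stop words
Build the feature matrix and the label set
- feature_matrix: shape = (samples, feature_word_nums)
- label: shape = (samples, 1)
- Word representation: vocabulary index (word counts) or word2vec; note the difference between the two
Split the data into training, test and validation sets
Choose a model; SVM is used here
Train, test and tune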
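
The preprocessing steps above can be sketched on already-segmented text. This is a minimal illustration, not the implementation used below: it assumes the words have already been produced by a segmenter such as jieba, and the function name `clean_tokens`, the tiny stop-word set and the sample tokens are made up.

```python
import re

# Matches any character in the CJK Unified Ideographs block U+4E00..U+9FFF
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")

def clean_tokens(tokens, stop_words):
    """Keep tokens that contain at least one Chinese character and are not stop words."""
    return [t for t in tokens if CHINESE_CHAR.search(t) and t not in stop_words]

stop_words = {"的", "了"}  # hypothetical stop words
tokens = ["我", "的", "邮箱", "abc", "123", "，", "http://a.cn", "收到", "了"]
print(clean_tokens(tokens, stop_words))
```

Note that the check is "contains a Chinese character", so a mixed token such as `QQ群` would pass; a stricter filter would have to strip the non-Chinese characters inside each token.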

3. Implementation

3.1 Libraries used

import os

import jieba
import numpy as np
import pandas as pd
from scipy.stats import uniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import LinearSVC

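3.2 A class that converts emails into a word-feature matrix

class EmailToWordFeatures:
    '''
    Converts emails into a word-feature matrix.

    The process consists of:
    - segmenting the email body into words
    - removing all non-Chinese tokens: punctuation, English letters, digits, URLs and other special characters
    - filtering out stop words
    - building the feature matrix
    '''

    def __init__(self, stop_word_file=None, features_vocabulary=None):
        self.features_vocabulary = features_vocabulary
        self.stop_vocab_dict = {}  # stop words, empty by default
        if stop_word_file is not None:
            self.stop_vocab_dict = self._get_stop_words(stop_word_file)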

    def text_to_feature_matrix(self, words, vocabulary=None, threshold=10):
        cv = CountVectorizer()
        if vocabulary is None:
            cv.fit(words)
        else:
            cv.fit(vocabulary)
        words_to_vect = cv.transform(words)
        words_to_matrix = pd.DataFrame(words_to_vect.toarray())  # dense count matrix, one column per vocabulary index
        print(words_to_matrix.shape)
        # Feature selection: keep a word only if its total count over all emails reaches the threshold.
        selected_features = []
        selected_features_index = []
        for key, value in cv.vocabulary_.items():
            if words_to_matrix[value].sum() >= threshold:  # total count of this word across all emails
                selected_features.append(key)
                selected_features_index.append(value)
        # Rename the selected columns from vocabulary index to the word itself
        words_to_matrix.rename(columns=dict(zip(selected_features_index, selected_features)), inplace=True)
        return words_to_matrix[selected_features]

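    def get_email_words(self, email_path, max_email=600):
        '''
        Read emails from a folder (or a single file) and return one space-joined word string per email.
        Because of limited machine resources, this test reads at most 600 emails per call,
        i.e. 600 positive and 600 negative samples.
        '''
        self.emails = email_path
        if os.path.isdir(self.emails):
            emails = os.listdir(self.emails)
            is_dir = True
        else:
            emails = [self.emails]
            is_dir = False
        count = 0
        all_email_words = []
        for email in emails:
            if count >= max_email:  # stop once the cap on the number of emails is reached
                break
            # In the single-file case the list entry is already the full path
            file_path = os.path.join(self.emails, email) if is_dir else email
            email_words = self._email_to_words(file_path)
            all_email_words.append(' '.join(email_words))
            count += 1
        return all_email_words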

    def _email_to_words(self, email):
        '''
        Segment one email into words, dropping stop words and tokens without Chinese characters.
        return: words_list
        '''
        email_words = []
        with open(email, 'rb') as pf:
            for line in pf.readlines():
                line = line.strip().decode('gbk', 'ignore')
                if not self._check_contain_chinese(line):  # skip lines without any Chinese character
                    continue
                word_list = jieba.cut(line, cut_all=False)  # word segmentation (accurate mode)
                for word in word_list:
                    # skip stop words and tokens that contain no Chinese character
                    if word in self.stop_vocab_dict or not self._check_contain_chinese(word):
                        continue
                    email_words.append(word)
        return email_words

    def _get_stop_words(self, file):
        '''
        Load the stop words, one per line, into a dict for fast lookup.
        '''
        stop_vocab_dict = {}
        with open(file, 'rb') as pf:
            for line in pf.readlines():
                line = line.decode('utf-8', 'ignore').strip()
                stop_vocab_dict[line] = 1
        return stop_vocab_dict

    def _check_contain_chinese(self, check_str):
        '''
        Return True if the string contains at least one Chinese character (U+4E00 to U+9FFF).
        '''
        for ch in check_str:
            if '\u4e00' <= ch <= '\u9fff':
                return True
        return False
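
One detail worth knowing about `text_to_feature_matrix`: with default settings, scikit-learn's `CountVectorizer` tokenizes with the pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters, so single-character Chinese words never reach the feature matrix. The sketch below mimics that tokenization and the total-count threshold using the standard library only. `select_features` and the sample documents are made up for illustration, and the function is a simplified stand-in, not the class's actual code path.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's documented default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def select_features(documents, threshold=10):
    """Return the words whose total count across all documents is >= threshold."""
    totals = Counter()
    for doc in documents:
        totals.update(TOKEN_PATTERN.findall(doc))
    return sorted(word for word, n in totals.items() if n >= threshold)

# Hypothetical space-joined, already-segmented emails
docs = ["发票 代开 发票 的", "开会 通知 发票", "开会 的 的"]
print(select_features(docs, threshold=2))
```

If single-character words matter for the task, passing a custom `token_pattern` to `CountVectorizer` is one way to keep them.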

3.3 Convert the ham and spam sets into lists of word strings, one item per email

index_file = '.\\datasets\\trec06c\\full\\index'
stop_word_file = '.\\datasets\\trec06c\\chinese_stop_vocab.txt'
ham_file = '.\\datasets\\trec06c\\ham_data'
spam_file = '.\\datasets\\trec06c\\spam_data'

email_to_features = EmailToWordFeatures(stop_word_file=stop_word_file)
ham_words = email_to_features.get_email_words(ham_file)
spam_words = email_to_features.get_email_words(spam_file)
print('ham email numbers:', len(ham_words))
print('spam email numbers:', len(spam_words))

ham email numbers: 600

spam email numbers: 600

3.4 Convert all emails into the word-feature matrix, i.e. the model input

all_email = []
all_email.extend(ham_words)
all_email.extend(spam_words)
print('all test email numbers:', len(all_email))

words_to_matrix = email_to_features.text_to_feature_matrix(all_email)
print(words_to_matrix)

all test email numbers: 1200

(1200, 22556)

      故事  领导  回到  儿子  感情  有个  大概  民俗  出国  教育  ...  培训网  商友会  网管  埃森哲  驱鼠器  条例  \

0      1   2   1   1   1   1   1   1   1   1  ...    0    0   0    0    0   0

1      0   0   0   0   5   0   0   0   0   0  ...    0    0   0    0    0   0

2      0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0

3      0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0

4      0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0

...   ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...  ...  ..  ...  ...  ..

1195   0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0

1196   0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0

1197   0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0

1198   0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0

1199   0   0   0   0   0   0   0   0   0   0  ...    0    0   0    0    0   0   

      智囊  教练  含双早  王府井

0      0   0    0    0

1      0   0    0    0

2      0   0    0    0

3      0   0    0    0

4      0   0    0    0

...   ..  ..  ...  ...

1195   0   0    0    0

1196   0   0    0    0

1197   0   0    0    0

1198   0   0    0    0

1199   0   0    0    0  

[1200 rows x 3099 columns]

3.5 Build the label matrix

# Ham emails come first in all_email and get label 1; spam emails keep label 0
label_matrix = np.zeros((len(all_email), 1))
label_matrix[0:len(ham_words), :] = 1

4. Training with the SVM model

# Split the data set
x_train, x_test, y_train, y_test = train_test_split(words_to_matrix, label_matrix, test_size=0.2, random_state=42)

# Train a LinearSVC, tuning C with a randomized search
svc = LinearSVC(loss='hinge', dual=True)
param_distributions = {'C': uniform(0, 10)}
rscv_clf = RandomizedSearchCV(estimator=svc, param_distributions=param_distributions, cv=3, n_iter=200, verbose=2)
rscv_clf.fit(x_train, y_train.ravel())  # ravel() gives the 1-D label array scikit-learn expects
print('best_params:', rscv_clf.best_params_)

Fitting 3 folds for each of 200 candidates, totalling 600 fits

[CV] C=6.119041659192192 .............................................

[CV] .............................. C=6.119041659192192, total=   0.0s

[CV] C=6.119041659192192 .............................................

[CV] .............................. C=6.119041659192192, total=   0.1s

[CV] C=6.119041659192192 .............................................

[CV] .............................. C=6.119041659192192, total=   0.1s

[CV] C=6.103402593686549 .............................................

...

...

...

[CV] .............................. C=4.395657632563425, total=   0.2s

best_params: {'C': 0.0279898379592336}
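
The best `C` reported above (about 0.028) sits very close to the lower end of the `uniform(0, 10)` search range, where a uniform distribution places few candidates. One possible refinement, offered as a suggestion and not something the original experiment did, is to sample `C` on a logarithmic scale. The standard-library sketch below compares how many of 200 candidates fall below 0.1 under each scheme; the range 1e-3 to 10 and the helper names are assumptions.

```python
import math
import random

def sample_uniform(rng, low, high, n):
    """Draw n candidates uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]

def sample_log_uniform(rng, low, high, n):
    """Draw n candidates whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** rng.uniform(lo, hi) for _ in range(n)]

rng = random.Random(42)  # fixed seed so the comparison is repeatable
uniform_cs = sample_uniform(rng, 0.0, 10.0, 200)
log_cs = sample_log_uniform(rng, 1e-3, 10.0, 200)

small_uniform = sum(c < 0.1 for c in uniform_cs)
small_log = sum(c < 0.1 for c in log_cs)
print("candidates below 0.1:", small_uniform, "(uniform) vs", small_log, "(log-uniform)")
```

In expectation about 1% of uniform candidates and about half of the log-uniform candidates land below 0.1. With scikit-learn, a comparable search could use `scipy.stats.loguniform(1e-3, 10)` as the distribution for `C` in recent SciPy versions.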

# Evaluate on the test set

y_pred = rscv_clf.predict(x_test)

print('accuracy:', accuracy_score(y_test, y_pred))

accuracy: 0.9791666666666666
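
For reference, accuracy is the fraction of test samples whose predicted label equals the true label. With 1200 emails and `test_size=0.2` the test set holds 240 emails, and the reported value is what 235 correct predictions out of 240 would give. The count 235 is inferred from the printed accuracy, not taken from the original output, and the way the five errors are split between classes below is invented. A minimal sketch with a made-up helper `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels: 240 test emails with 5 mistakes (assumed split of the errors)
y_true = [1] * 120 + [0] * 120
y_pred = [1] * 117 + [0] * 3 + [0] * 118 + [1] * 2
print(accuracy(y_true, y_pred))
```

Accuracy alone does not say which class the errors fall in; for a spam filter, ham misclassified as spam is usually the costlier mistake.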

5. Predicting on one ham email and one spam email

The ham email reads as follows:

很久以前，我为了考人大，申请了他的ID，而现在却不对外开放了。

申请水木的ID，真的是不知道出于什么缘故。离开校园寻找一份校园的感觉，怀着对清华的向往，还是为了偶尔无聊工作的一些调剂……

我讨厌多愁善感，却时常沉浸于其中，生活中的挫折，不幸，让我知道自己得坚强……

可每天的灰色心情却挥之不去，我可以把自己的心事埋于深处，笑着对待我身边的每一个人，告诉我乐观。身边最亲的人，也许知道或不知道我的脆弱和恐惧。而唯一知道的人，告诉我“希望你坚不可摧”。

回想多年前，为“在靠近的地方住下，能掩耳不听烦世喧嚣，要一份干净的自由自在”而感动，那，是否是对今天的预见，无知是快乐的，而不知道责任也是快乐的。我可以逃避一时，却始终要面对。

The spam email reads as follows:

这是一封善意的邮件,如给你造成不便,请随手删除.SOHO建站代理网诚聘兼职网站代理　

一、职业要求:

１、有上网条件(在家中、办公室、网吧等地)；

２、每天能有１－２小时上网时间；

３、有网络应用的基础（会上论坛发贴子、发电子邮件,

与客户QQ沟通等）。

二、工作任务:

您报名加入我公司后，公司将分配给您一个属

于自己的冠名为SOHO致富联盟的网站，作为站长，您的任

务就是利用各种方法宣传这个网站，让客户从你的网站上

购买更多的商品，并接受你的建议，也同意加盟SOHO建站

代理网网站的兼职代理，这样你便拥有滚滚不断的财源。

三、工资待遇:3000元以上／月。业绩累积,收入直线上升.

def email_to_predict_matrix(words, features):
    # Build a one-row matrix for a single email whose columns are the training features
    cv = CountVectorizer()
    words_to_vect = cv.fit_transform(words)
    words_to_matrix = pd.DataFrame(words_to_vect.toarray())
    vocabulary = cv.vocabulary_
    words_numbers_list = []  # count of each training feature in this email
    for feature in features:
        if feature in vocabulary:
            words_numbers_list.append(words_to_matrix[vocabulary[feature]][0])
        else:
            words_numbers_list.append(0)  # feature does not occur in this email
    words_numbers_matrix = pd.DataFrame([words_numbers_list], columns=features)
    return words_numbers_matrix
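
The idea behind `email_to_predict_matrix` is to count, for a new email, only the words that were kept as training features, in the same column order, using 0 for features that do not occur. A standard-library sketch of that alignment follows; `count_for_features` and the sample data are made up. In scikit-learn, `CountVectorizer` also accepts a `vocabulary` argument that fixes the columns in a similar way.

```python
from collections import Counter

def count_for_features(email_words, features):
    """Count each training feature in a space-separated email, in feature order."""
    counts = Counter(email_words.split())
    return [counts.get(feature, 0) for feature in features]

features = ["发票", "代理", "开会"]   # hypothetical training features
email = "代理 网站 代理 兼职 发票"       # hypothetical segmented email
print(count_for_features(email, features))
```

Words that are not training features ("网站", "兼职" here) are simply ignored, which is why the prediction row always has the same width as the training matrix.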

valid_ham_email = '.\\datasets\\trec06c\\valid_ham_email'
valid_spam_email = '.\\datasets\\trec06c\\valid_spam_email'

email_to_features_valid = EmailToWordFeatures(stop_word_file=stop_word_file)
valid_ham_email_words = email_to_features_valid.get_email_words(valid_ham_email)
valid_spam_email_words = email_to_features_valid.get_email_words(valid_spam_email)

# Align both emails with the columns (feature words) of the training matrix
valid_ham_words_matrix = email_to_predict_matrix(valid_ham_email_words, words_to_matrix.columns)
valid_spam_words_matrix = email_to_predict_matrix(valid_spam_email_words, words_to_matrix.columns)

print('Testing the ham email ----------')
print('Prediction:', rscv_clf.predict(valid_ham_words_matrix))

Testing the ham email ----------

Prediction: [1.]

print('Testing the spam email ----------')
print('Prediction:', rscv_clf.predict(valid_spam_words_matrix))

Testing the spam email ----------

Prediction: [0.]

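6. Appendix

6.1 Planned improvements

Replace the word-count feature matrix with word2vec
Train with an MXNet neural network model

6.2 Code used to organize the dataset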

# Copy the positive and negative samples into the ham_data and spam_data folders
import os
import shutil

index_file = '.\\datasets\\trec06c\\full\\index'
data_file = '.\\datasets\\trec06c\\data'
root = '.\\datasets\\trec06c\\'
new_ham_root = '.\\datasets\\trec06c\\ham_data'
new_spam_root = '.\\datasets\\trec06c\\spam_data'

ham_file_list = []
spam_file_list = []

# Read the index file; each line has the form "<label> ../data/<folder>/<file>"
with open(index_file, 'r') as pf:
    for line in pf.readlines():
        content = line.strip().split('..')
        label, path = content
        path = path.replace('/', '\\')
        if label.strip() == 'spam':
            spam_file_list.append(path)
        else:
            ham_file_list.append(path)

def copy_file(filelist, new_file_path):
    for file in filelist:
        file_name = file.split('\\')
        path = root + file
        if not os.path.exists(new_file_path):
            os.makedirs(new_file_path)
        # Name each copy "<folder>_<file>" so files from different folders do not collide
        shutil.copyfile(path, new_file_path + '\\' + file_name[-2] + '_' + file_name[-1])

copy_file(ham_file_list, new_ham_root)
copy_file(spam_file_list, new_spam_root)
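
The hard-coded backslashes make the script above Windows-only. A portable variant of the naming step could use `pathlib`. This is a sketch under the assumption that index paths look like `../data/<folder>/<file>`; `flat_name` is a hypothetical helper, not part of the original script.

```python
from pathlib import PurePosixPath

def flat_name(index_path):
    """Build '<folder>_<file>' from an index path such as '../data/000/123'."""
    p = PurePosixPath(index_path)
    return f"{p.parent.name}_{p.name}"

print(flat_name("../data/000/123"))
```

The destination could then be built as `Path(new_root) / flat_name(path)`, which lets `pathlib` choose the separator for the current operating system.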
