Machine Learning 06

Support Vector Machine (SVM)

Principles of Support Vector Machines

  1. Finding the optimal classification boundary

    Correct: classifies the majority of samples correctly.

    Generalizable: maximizes the margin between the support vectors.

    Fair: stays equidistant from the support vectors.

    Simple: linear, a line or a plane, i.e. a separating hyperplane.

  2. Dimension lifting with kernel functions

    A feature transformation called a kernel function adds new features, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in a higher-dimensional space.
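
    For intuition, here is a minimal NumPy sketch (our own illustration, not from the course material): the XOR-style points below cannot be split by any line in the plane, but adding a single product feature x1*x2 makes them separable by a plane.

    import numpy as np
    # XOR-style samples: no straight line in the (x1, x2) plane separates the two classes
    x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])
    # lift to 3-D by adding the product feature z = x1 * x2
    x3 = np.column_stack((x, x[:, 0] * x[:, 1]))
    # in the lifted space the plane x1 + x2 - 2z = 0.5 separates the classes
    score = x3[:, 0] + x3[:, 1] - 2 * x3[:, 2]
    print((score > 0.5).astype(int))  # -> [0 1 1 0], matches y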

    Linear kernel: linear, performs no dimension lifting through a kernel function; it looks for a linear classification boundary in the original feature space.

    SVM classification API based on the linear kernel:

    model = svm.SVC(kernel='linear')
    model.fit(train_x, train_y)

    Case: classify the data in multiple2.txt.

    import numpy as np
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
    x = data[:, :-1]
    y = data[:, -1]
    train_x, test_x, train_y, test_y = \
        ms.train_test_split(x, y, test_size=0.25, random_state=5)
    # SVM classifier with a linear kernel
    model = svm.SVC(kernel='linear')
    model.fit(train_x, train_y)
    n = 500
    l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
    b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
    grid_x = np.meshgrid(np.linspace(l, r, n),
                         np.linspace(b, t, n))
    flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_test_y = model.predict(test_x)
    cr = sm.classification_report(test_y, pred_test_y)
    print(cr)
    mp.figure('SVM Linear Classification', facecolor='lightgray')
    mp.title('SVM Linear Classification', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
    mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
    mp.show()
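
    The "generalizable" and "fair" principles above can be checked on the fitted model: SVC exposes the support vectors that determine the boundary. A short follow-up sketch, reusing the model variable from the case above:

    # the classification boundary is determined only by the support vectors
    print(model.support_vectors_)  # coordinates of the support vectors
    print(model.n_support_)        # number of support vectors per class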

    Polynomial kernel: poly, adds higher-degree polynomial combinations of the original sample features, for example:

    y = x_1 + x_2
    y = x_1^2 + 2x_1x_2 + x_2^2
    y = x_1^3 + 3x_1^2x_2 + 3x_1x_2^2 + x_2^3
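
    A minimal sketch of what such an expansion produces, using sklearn's PolynomialFeatures purely for illustration (the poly kernel computes these products implicitly instead of materializing new columns):

    import numpy as np
    import sklearn.preprocessing as sp
    x = np.array([[2.0, 3.0]])
    # all monomials of x1, x2 up to degree 2: 1, x1, x2, x1^2, x1*x2, x2^2
    poly = sp.PolynomialFeatures(degree=2)
    print(poly.fit_transform(x))  # [[1. 2. 3. 4. 6. 9.]]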
    Case: train on the samples in multiple2.txt with a polynomial kernel.

    # SVM classifier with a polynomial kernel
    model = svm.SVC(kernel='poly', degree=3)
    model.fit(train_x, train_y)

    Radial basis function kernel: rbf, adds features derived from a Gaussian distribution over the original samples.

    Case: train on the samples in multiple2.txt with an RBF kernel.

    # SVM classifier with an RBF kernel
    # C: regularization strength
    # gamma: a larger gamma means a smaller standard deviation; a tall, narrow Gaussian only acts near the support vector samples, which makes overfitting likely.
    model = svm.SVC(kernel='rbf', C=600, gamma=0.01)
    model.fit(train_x, train_y)

Grid Search

One way to obtain an optimal hyperparameter is to plot a validation curve, but a validation curve can only tune one hyperparameter at a time. When several hyperparameters produce many possible combinations, grid search can be used to find the best hyperparameter combination.

For each hyperparameter combination in the list, grid search instantiates the given model, runs cv-fold cross-validation, takes the combination with the highest mean f1 score as the best choice, and instantiates the model with it.

Grid search API:

import sklearn.model_selection as ms
params = [{'kernel':['linear'], 'C':[1, 10, 100, 1000]},
          {'kernel':['poly'], 'C':[1], 'degree':[2, 3]},
          {'kernel':['rbf'], 'C':[1, 10, 100], 'gamma':[1, 0.1, 0.01]}]
model = ms.GridSearchCV(estimator, params, cv=k)  # k: number of CV folds
model.fit(train_x, train_y)
# every hyperparameter combination tried by the grid search
model.cv_results_['params']
# mean test score of each hyperparameter combination
model.cv_results_['mean_test_score']
# best hyperparameters, best score, best estimator
model.best_params_
model.best_score_
model.best_estimator_

Case: modify the confidence-probability case and find the optimal hyperparameters with grid search.

# SVM classifier, hyperparameters chosen by grid search
params = [{'kernel':['linear'], 'C':[1, 10, 100, 1000]},
          {'kernel':['poly'], 'C':[1], 'degree':[2, 3]},
          {'kernel':['rbf'], 'C':[1, 10, 100, 1000], 'gamma':[1, 0.1, 0.01, 0.001]}]
model = ms.GridSearchCV(svm.SVC(probability=True), params, cv=5)
model.fit(train_x, train_y)
for p, s in zip(model.cv_results_['params'],
                model.cv_results_['mean_test_score']):
    print(p, s)
# hyperparameters with the best score
print(model.best_params_)
# best score
print(model.best_score_)
# best estimator
print(model.best_estimator_)

Logistic regression, decision trees, AdaBoost, RF, GBDT, SVM ...

Sentiment Analysis

Text sentiment analysis, also called opinion mining or orientation analysis, is, simply put, the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional color. The internet generates a large volume of valuable comments on people, events, products, and more. These comments express people's various emotions and sentiment orientations, such as joy, anger, sorrow, and happiness, or criticism and praise. Potential users can browse these subjective comments to learn how public opinion views a given event or product.

1. The room was great and very close to the main road, really convenient. Nice.                          Positive
2. The room was a bit dirty, the toilet leaked, and the air conditioning did not cool. Never again.      Negative
3. The floor was not very clean and the TV had no signal, but the air conditioning was OK. Acceptable.   Positive

First tokenize the training text and count word frequencies, then use the term frequency-inverse document frequency (TF-IDF) algorithm to measure each word's contribution to the sample's meaning. Build a supervised classification model from these per-word contributions. Finally, feed test samples to the model to obtain their sentiment class.
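
A minimal end-to-end sketch of this pipeline on made-up data (the reviews, labels, and the TfidfVectorizer shortcut below are our own illustration; the sections that follow build the same steps one by one):

import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
# made-up training corpus: 1 = positive review, 0 = negative review
train_docs = ['great room, very convenient location',
              'dirty room and a leaking toilet',
              'friendly staff, nice view, would return',
              'no hot water, never coming back']
train_y = [1, 0, 1, 0]
# TfidfVectorizer = tokenization + word counts + TF-IDF in one step
vec = ft.TfidfVectorizer()
train_x = vec.fit_transform(train_docs)
# supervised classifier over the per-word contributions
model = nb.MultinomialNB()
model.fit(train_x, train_y)
# predict a sentiment label for a new review
print(model.predict(vec.transform(['nice room, very convenient'])))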

pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple/

Text Tokenization

Tokenization API (NLTK will look for the punkt resource):
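
If the punkt resource is not present locally, it can be downloaded once beforehand (assumes network access):

import nltk
nltk.download('punkt')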

import nltk.tokenize as tk
# split the sample into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# split the sample into words; word_list: list of words
word_list = tk.word_tokenize(text)
# split the sample into words with a tokenizer object; punctTokenizer: tokenizer instance
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text)

Example:

import nltk.tokenize as tk
doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)

Bag-of-Words Model

A sentence's meaning depends to a large extent on how many times each word appears. So take all words that may appear in the sentences as feature names, treat each sentence as one sample, and use a word's occurrence count in the sentence as the feature value; the resulting mathematical model is called the bag-of-words model.

This hotel is very bad. The toilet in this hotel smells bad. The environment of this hotel is very good.

1 This hotel is very bad.
2 The toilet in this hotel smells bad.
3 The environment of this hotel is very good.

This  hotel  is  very  bad  The  toilet  in  smells  environment  of  good
  1     1     1    1    1    0     0      0     0         0        0    0
  1     1     0    0    1    1     1      1     1         0        0    0
  1     1     1    1    0    1     0      0     0         1        1    1

Bag-of-words API:

import sklearn.feature_extraction.text as ft

# build a bag-of-words vectorizer
cv = ft.CountVectorizer()
# fit the model: all words that may appear become feature names, each sentence is one sample, and a word's count in the sentence is the feature value.
bow = cv.fit_transform(sentences).toarray()
print(bow)
# get all feature names
words = cv.get_feature_names()

Example:

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'This hotel is very bad. The toilet in this hotel smells bad. The environment of this hotel is very good.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names()
print(words)

Term Frequency (TF)

Term frequency is the number of times a word appears in a sentence divided by the total number of words in the sentence, i.e. how frequently the word occurs in that sentence. Compared with raw counts, term frequency evaluates a word's contribution to the sentence's meaning more objectively: the higher the term frequency, the larger the contribution. Normalizing the bag-of-words matrix yields the term frequencies.

Example: normalize the bag-of-words matrix

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp
doc = 'This hotel is very bad. The toilet in this hotel smells bad. The environment of this hotel is very good.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names()
print(words)
tf = sp.normalize(bow, norm='l1')
print(tf)

Document Frequency (DF)

DF = \frac{\text{number of documents containing the word}}{\text{total number of documents}}
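
A minimal sketch of computing DF directly from a bag-of-words count matrix (the toy matrix below is made up for illustration):

import numpy as np
# toy bag-of-words matrix: 3 documents (rows) x 4 vocabulary words (columns)
bow = np.array([[1, 0, 2, 1],
                [0, 0, 1, 1],
                [3, 1, 0, 1]])
# DF: fraction of documents that contain each word
df = (bow > 0).mean(axis=0)
print(df)  # -> approximately [0.667 0.333 0.667 1.]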

Inverse Document Frequency (IDF)

IDF = \log\left(\frac{\text{total number of documents}}{1 + \text{number of documents containing the word}}\right)
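
Continuing the same toy matrix, a sketch of the formula above (note that sklearn's TfidfTransformer uses a smoothed variant, log((1 + N) / (1 + df)) + 1 by default, so its values differ):

import numpy as np
bow = np.array([[1, 0, 2, 1],
                [0, 0, 1, 1],
                [3, 1, 0, 1]])
n_docs = bow.shape[0]
contains = (bow > 0).sum(axis=0)       # documents containing each word
idf = np.log(n_docs / (1 + contains))  # IDF as defined above
print(np.round(idf, 3))                # -> [ 0.     0.405  0.    -0.288]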

Term Frequency-Inverse Document Frequency (TF-IDF)

Multiply each element of the term-frequency matrix by the corresponding word's inverse document frequency. The larger the value, the more the word contributes to the sample's meaning. Build the learning model on these per-word contributions.

API for obtaining the term frequency-inverse document frequency (TF-IDF) matrix:

# build the bag-of-words model
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
# get a TF-IDF transformer
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()

Example: obtain the TF-IDF matrix:

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'This hotel is very bad. The toilet in this hotel smells bad. The environment of this hotel is very good.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names()
print(words)
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
print(tfidf)

Text Classification (Topic Identification)

Train a topic-identification model on the given text dataset, then test the model's accuracy with a custom test set.

Example:

import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
train = sd.load_files('../data/20news', encoding='latin1',
                      shuffle=True, random_state=7)
# each folder name under 20news is the topic label of the files inside
# train.data holds each file's text content
# train.target holds each file's parent folder (topic) index
train_data = train.data
train_y = train.target
categories = train.target_names
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_data)
tt = ft.TfidfTransformer()
train_x = tt.fit_transform(train_bow)
model = nb.MultinomialNB()
model.fit(train_x, train_y)
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left',
    'Caesar cipher is an ancient form of encryption',
    'This two-wheeler is really good on slippery roads']
test_bow = cv.transform(test_data)
test_x = tt.transform(test_bow)
pred_test_y = model.predict(test_x)
for sentence, index in zip(test_data, pred_test_y):
    print(sentence, '->', categories[index])

Naive Bayes Classification

Naive Bayes classification is a classification method built on statistical probability theory. Consider this dataset:

Weather      Clothing style   Date with girlfriend  ==>  Mood
0 (sunny)    0 (casual)       0 (had a date)        ==>  0 (happy)
0            1 (flashy)       1 (no date)           ==>  0
1 (cloudy)   1                0                     ==>  0
0            2 (shabby)       1                     ==>  1 (gloomy)
2 (rainy)    2                0                     ==>  0
...
0            1                0                     ==>  ?

Given the training samples above, how do we predict the mood on a sunny day, wearing casual clothes, with no date? We could collect the samples whose feature values match exactly and compute the probability of each class. But what if no sample in the training set matches exactly?

1) the probability of being happy given sunny weather, casual clothing, and no date

2) the probability of being gloomy given sunny weather, casual clothing, and no date

Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B), which follows from P(A, B) = P(A) P(B|A) = P(B) P(A|B)

For example:

Suppose a school has 60% boys and 40% girls. Among the girls, equal numbers wear trousers and skirts; all boys wear trousers. Someone sees, from a distance, a random student wearing trousers. What is the probability that this student is a girl?

P(girl) = 0.4
P(trousers|girl) = 0.5
P(trousers) = 0.6 + 0.2 = 0.8
P(girl|trousers) = P(trousers|girl) * P(girl) / P(trousers) = 0.5 * 0.4 / 0.8 = 0.25
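
The same answer can be sanity-checked by enumeration over 100 hypothetical students:

# 100 students: 60 boys (all in trousers), 40 girls (half in trousers)
boys_in_trousers = 60
girls_in_trousers = 40 * 0.5
p_girl_given_trousers = girls_in_trousers / (boys_in_trousers + girls_in_trousers)
print(p_girl_given_trousers)  # 0.25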

Using Bayes' theorem, how do we predict the mood when it is sunny, the clothing is casual, and there is no date?

P(happy|sunny,casual,no date)
    = P(sunny,casual,no date|happy) P(happy) / P(sunny,casual,no date)
P(gloomy|sunny,casual,no date)
    = P(sunny,casual,no date|gloomy) P(gloomy) / P(sunny,casual,no date)

The denominators are equal, so it suffices to compare:

P(sunny,casual,no date|happy) P(happy)
P(sunny,casual,no date|gloomy) P(gloomy)

"Naive": under the conditional independence assumption, the feature values have no causal relationship with one another, so we compare:

P(sunny|happy) P(casual|happy) P(no date|happy) P(happy)
P(sunny|gloomy) P(casual|gloomy) P(no date|gloomy) P(gloomy)

Therefore, count over the whole sample space the probability of being happy when it is sunny, casual, and without a date, and the probability of being gloomy under the same conditions, and take the larger of the two as the final result.
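
A minimal sketch of this counting on the five-row toy table above (raw frequencies, no smoothing; real implementations such as sklearn's add smoothing to avoid zero probabilities):

import numpy as np
# columns: weather, clothing, date; last column: mood (see the table above)
data = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [1, 1, 0, 0],
                 [0, 2, 1, 1],
                 [2, 2, 0, 0]])
x, y = data[:, :-1], data[:, -1]
query = [0, 0, 1]  # sunny, casual, no date
for mood in (0, 1):
    rows = x[y == mood]
    prior = len(rows) / len(x)  # P(mood)
    # naive independence: multiply the per-feature conditional frequencies
    likelihood = np.prod([(rows[:, i] == v).mean() for i, v in enumerate(query)])
    print('mood', mood, 'score:', prior * likelihood)
# the mood with the larger score is the prediction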

Naive Bayes classifier API:

import sklearn.naive_bayes as nb
# Gaussian naive Bayes classifier (continuous features)
model = nb.GaussianNB()
# multinomial naive Bayes classifier (count features)
model = nb.MultinomialNB()
model.fit(x, y)
result = model.predict(samples)

Example:

import numpy as np
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp
data = np.loadtxt('../data/multiple1.txt', unpack=False, dtype='U20', delimiter=',')
print(data.shape)
x = np.array(data[:, :-1], dtype=float)
y = np.array(data[:, -1], dtype=float)
# build a Gaussian naive Bayes classifier
model = nb.GaussianNB()
model.fit(x, y)
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
n = 500
grid_x, grid_y = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
samples = np.column_stack((grid_x.ravel(), grid_y.ravel()))
grid_z = model.predict(samples)
grid_z = grid_z.reshape(grid_x.shape)
mp.figure('Naive Bayes Classification', facecolor='lightgray')
mp.title('Naive Bayes Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x, grid_y, grid_z, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()

Code Summary

Support Vector Machine (SVM)

SVM with a linear kernel

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# SVM classifier with a linear kernel
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n),
                     np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()
             precision    recall  f1-score   support

        0.0       0.64      0.96      0.77        45
        1.0       0.75      0.20      0.32        30

avg / total       0.69      0.65      0.59        75

Polynomial kernel

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# SVM classifier with a polynomial kernel
model = svm.SVC(kernel='poly', degree=2)
model.fit(train_x, train_y)
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n),
                     np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()
             precision    recall  f1-score   support

        0.0       0.91      0.89      0.90        45
        1.0       0.84      0.87      0.85        30

avg / total       0.88      0.88      0.88        75

RBF kernel

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# SVM classifier with an RBF kernel
model = svm.SVC(kernel='rbf', C=100, gamma=0.01)
model.fit(train_x, train_y)
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n),
                     np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()
             precision    recall  f1-score   support

        0.0       0.91      0.93      0.92        45
        1.0       0.90      0.87      0.88        30

avg / total       0.91      0.91      0.91        75

Grid search

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# SVM classifier; find the best model by grid search
model = svm.SVC()
params = \
    [{'kernel':['linear'], 'C':[1, 10, 100, 1000]},
     {'kernel':['poly'], 'C':[1], 'degree':[2, 3]},
     {'kernel':['rbf'], 'C':[1, 10, 100], 'gamma':[1, 0.1, 0.01]}]
model = ms.GridSearchCV(model, params, cv=5)
# train the model
model.fit(train_x, train_y)
# best hyperparameters, best score, best estimator
print(model.best_params_)
print(model.best_score_)
print(model.best_estimator_)
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n),
                     np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()
{'C': 1, 'gamma': 1, 'kernel': 'rbf'}
0.96
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       0.95      0.93      0.94        45
        1.0       0.90      0.93      0.92        30

avg / total       0.93      0.93      0.93        75

NLP Sentiment Analysis Basics

English tokenization with nltk

import nltk.tokenize as tk

doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action.
1 Are you curious about tokenization?
2 Let's see how it works!
3 We need to analyze a couple of sentences with punctuations to see it in action.
---------------
1 Are
2 you
3 curious
4 about
5 tokenization
6 ?
7 Let
8 's
9 see
10 how
11 it
12 works
13 !
14 We
15 need
16 to
17 analyze
18 a
19 couple
20 of
21 sentences
22 with
23 punctuations
24 to
25 see
26 it
27 in
28 action
29 .
---------------
1 Are
2 you
3 curious
4 about
5 tokenization
6 ?
7 Let
8 '
9 s
10 see
11 how
12 it
13 works
14 !
15 We
16 need
17 to
18 analyze
19 a
20 couple
21 of
22 sentences
23 with
24 punctuations
25 to
26 see
27 it
28 in
29 action
30 .

Bag-of-words model

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'This hotel is very bad. The toilet in this hotel smells bad. The environment of this hotel is very good.'
print(doc)
sents = tk.sent_tokenize(doc)
print(sents)
# turn each sentence in sents into a bag-of-words row
cv = ft.CountVectorizer()
bow = cv.fit_transform(sents).toarray()
print(bow)
print(cv.get_feature_names())
This hotel is very bad. The toilet in this hotel smells bad. The environment of this hotel is very good.
[[1 0 0 1 0 1 0 0 0 1 0 1]
[1 0 0 1 1 0 0 1 1 1 1 0]
[0 1 1 1 0 1 1 0 1 1 0 1]]
['bad', 'environment', 'good', 'hotel', 'in', 'is', 'of', 'smells', 'the', 'this', 'toilet', 'very']

TF-IDF (term frequency-inverse document frequency) matrix

import numpy as np

tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
print(np.round(tfidf, 3))
[[0.488 0.    0.    0.379 0.    0.488 0.    0.    0.    0.379 0.    0.488]
 [0.345 0.    0.    0.268 0.454 0.    0.    0.454 0.345 0.268 0.454 0.   ]
 [0.    0.429 0.429 0.253 0.    0.326 0.429 0.    0.326 0.253 0.    0.326]]

20news article classification

import numpy as np
import sklearn.datasets as sd
import sklearn.model_selection as ms
import sklearn.linear_model as lm
import sklearn.metrics as sm
import sklearn.feature_extraction.text as ft
# load the dataset
data = sd.load_files('../../../DataSinece/data/20news', encoding='latin1')
data.keys()
# dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
# turn all sample inputs into a TF-IDF matrix
cv = ft.CountVectorizer()
bow = cv.fit_transform(data.data).toarray()
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
# assemble inputs and outputs, split train and test sets
x, y = tfidf, data.target
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.2, random_state=7, stratify=y)
# pick a classification model and train it
model = lm.LogisticRegression()
# or use naive Bayes instead
import sklearn.naive_bayes as nb
model = nb.MultinomialNB()
model.fit(train_x, train_y)
# evaluate the model on the test set
pred_test_y = model.predict(test_x)
print(sm.classification_report(test_y, pred_test_y))
              precision    recall  f1-score   support

           0       0.97      0.79      0.87       117
           1       0.94      0.97      0.96       120
           2       0.97      0.98      0.98       119
           3       0.90      1.00      0.95       119
           4       0.95      0.97      0.96       119

    accuracy                           0.95       594
   macro avg       0.95      0.95      0.94       594
weighted avg       0.95      0.95      0.94       594

test_data = [
    'The curveballs of right handed pitchers tend to curve to the left',
    'Caesar cipher is an ancient form of encryption',
    'This two-wheeler is really good on slippery roads',
    'Recently, we studied an encryption algorithm: asymmetric encryption',
    "I heard Tesla's rocket blew up"]
bow = cv.transform(test_data)
tfidf = tt.transform(bow)
pred_y = model.predict(tfidf)
print(pred_y)
print(data.target_names)
[2 3 1 3 4]
['misc.forsale', 'rec.motorcycles', 'rec.sport.baseball', 'sci.crypt', 'sci.space']
