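Source code: 自然语言处理练习: code written while learning natural language processing (gitee.com)

Data source: Mercari Price Suggestion Challenge | Kaggle

If you cannot reach Kaggle, the dataset is also available on Baidu Cloud:

Link: https://pan.baidu.com/s/1EM2MwjX4bLlypLSIJYZqeg?pwd=xqs0
Extraction code: xqs0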

1. Loading the dataset

After obtaining the dataset, first print a few basic views of it to get a feel for its contents.

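# Imports assumed by the snippets in this article (aliases inferred from how the code uses them;
# stop_words.STOP_WORDS is assumed to come from spaCy)

import re
import string
from collections import Counter

import bokeh.plotting as bp
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.offline as py
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import show
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import TSNE
from spacy.lang.en import stop_words
from wordcloud import WordCloud

# Configure pandas display options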

pd.set_option('display.max_columns', 1000)

pd.set_option('display.width', 1000)

pd.set_option('display.max_colwidth', 1000)

train = pd.read_csv('data/train.tsv', sep='\t')

test = pd.read_csv('data/test.tsv', sep='\t')

# Show the dataset sizes

print(train.shape)

print(test.shape)

# Show the column names and dtypes

print(train.dtypes)

# Show the first few rows

print(train.head())

# Show summary statistics of the item price

print(train.price.describe())

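2. Analyzing the factors that affect price

Step 1 gave a rough picture of what the dataset contains, as well as the range and mean of the prices. Next we look more closely at the factors that affect price.

2.1 Log-transforming the price and comparing the distributions

The raw prices are spread over a very wide range, so we apply a log transform to the price.

# Log-transform the price and compare the distributions before and after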

plt.subplot(1, 2, 1)

(train['price']).plot.hist(bins=50, figsize=(20, 10), edgecolor='white', range=[0, 250])

plt.xlabel('price', fontsize=17)

plt.ylabel('frequency', fontsize=17)

plt.tick_params(labelsize=15)

plt.title('Price Distribution - Training Set', fontsize=17)

plt.subplot(1, 2, 2)

np.log(train['price'] + 1).plot.hist(bins=50, figsize=(20, 10), edgecolor='white')

plt.xlabel('log(price+1)', fontsize=17)

plt.ylabel('frequency', fontsize=17)

plt.tick_params(labelsize=15)

plt.title('Log(Price) Distribution - Training Set', fontsize=17)

plt.show()

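The raw prices are heavily right-skewed: the high-price range is wide but contains few items. After the log transform, the distribution is much closer to a normal distribution.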
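To make the effect of the transform concrete, here is a minimal sketch on a handful of made-up prices (they are not taken from the Mercari data). It applies log(price + 1), which stays defined when the price is 0, and checks that exp(x) - 1 recovers the original values. The `spread` helper is a hypothetical, crude long-tail indicator introduced only for this illustration.

```python
import math
import statistics

# Hypothetical prices, chosen only to illustrate a long right tail
prices = [0.0, 5.0, 10.0, 12.0, 20.0, 35.0, 150.0, 2000.0]

# log(price + 1): log1p is the numerically stable form and is defined at 0
log_prices = [math.log1p(p) for p in prices]

# exp(x) - 1 undoes the transform, e.g. to turn log-scale predictions back into prices
recovered = [math.expm1(x) for x in log_prices]

def spread(values):
    """Ratio of the largest value to the median: a crude indicator of a long tail."""
    return max(values) / statistics.median(values)

print('raw spread: %.1f' % spread(prices))
print('log spread: %.1f' % spread(log_prices))
```

The raw spread is far larger than the log spread, which is the same compression the two histograms show.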

2.2 Effect of shipping on price

Next we check whether price depends on who pays for shipping.

# Who pays for shipping: shipping == 1 means the seller pays, shipping == 0 means the buyer pays

print(train.shipping.value_counts() / len(train))

About 55% of the items have shipping == 0, i.e. the buyer pays for shipping; the seller covers shipping for the remaining 45% or so.

# Compare the price distributions of the two shipping types

prc_shipBySeller = train.loc[train.shipping == 1, 'price']

prc_shipByBuyer = train.loc[train.shipping == 0, 'price']

fig, ax = plt.subplots(figsize=(20, 10))

ax.hist(np.log(prc_shipBySeller + 1), color='#8CB4E1', alpha=1.0, bins=50, label='Price when Seller pays Shipping')

ax.hist(np.log(prc_shipByBuyer + 1), color='#007D00', alpha=0.7, bins=50, label='Price when Buyer pays Shipping')

plt.legend()

plt.xlabel('log(price+1)', fontsize=17)

plt.ylabel('frequency', fontsize=17)

plt.title('Price Distribution by Shipping Type', fontsize=17)

plt.tick_params(labelsize=15)

plt.show()

Items tend to be priced somewhat higher when the buyer pays for shipping than when the seller covers it.
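A histogram comparison can be backed up with a few summary numbers. The sketch below groups a small made-up table by the `shipping` flag and reports count, mean, and median price per group. The rows are hypothetical; the label mapping follows the Kaggle data description, where `shipping` is 1 if the seller pays and 0 if the buyer pays.

```python
import pandas as pd

# Hypothetical rows, only to show the grouping pattern
toy = pd.DataFrame({
    'shipping': [0, 0, 0, 1, 1, 1],
    'price': [30.0, 45.0, 12.0, 10.0, 25.0, 16.0],
})

# One row per shipping type with count, mean and median price
summary = toy.groupby('shipping')['price'].agg(['count', 'mean', 'median'])
summary.index = summary.index.map({0: 'buyer pays', 1: 'seller pays'})
print(summary)
```

Running the same `groupby` on `train` would give the real group statistics for the full dataset.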

2.3 Effect of item category on price

First, count the item categories.

# Item categories

print('There are %d unique values in the category column' % train['category_name'].nunique())

print(train['category_name'].value_counts()[:5])

print('There are %d items that do not have a label' % train['category_name'].isnull().sum())

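There are 1,287 distinct categories, and the output lists the five most common ones. Another 6,327 items have no category label; we do not analyze these further, and the split below gives them the placeholder "No Label".

That is too many categories to work with. Each category name has the form main category/subcategory 1/subcategory 2, so we split it into its levels and work with the coarser groups.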

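# Too many categories: split each name into main category / subcategory 1 / subcategory 2

def split_cat(text):

    try:

        return text.split("/")

    except AttributeError:  # missing category: NaN is a float and has no .split

        return "No Label", "No Label", "No Label"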

train['general_cat'], train['subcat_1'], train['subcat_2'] = zip(*train['category_name'].apply(lambda x: split_cat(x)))

print(train.head())

test['general_cat'], test['subcat_1'], test['subcat_2'] = zip(*test['category_name'].apply(lambda x: split_cat(x)))

print('There are %d unique general_cat' % train['general_cat'].nunique())

print('There are %d unique first sub-categories' % train['subcat_1'].nunique())

print('There are %d unique second sub-categories' % train['subcat_2'].nunique())

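This gives 11 main categories, 114 first-level subcategories, and 871 second-level subcategories.

Next, look at how the items are distributed over the main categories.

# Distribution of the main categories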
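The same split can also be written with pandas string methods instead of `apply` plus `zip`. The following is a sketch on three made-up category strings, including a missing value and a path with four levels; it keeps only the first three levels. One difference to be aware of: `fillna` here would also label a missing lower level of a shorter path as "No Label".

```python
import pandas as pd

# Hypothetical category strings, including a missing value and a 4-level path
cats = pd.Series(['Men/Tops/T-shirts', None, 'Electronics/Computers & Tablets/iPad/Tablet'])

# expand=True yields one column per level; keep the first three levels only
parts = cats.str.split('/', expand=True).iloc[:, :3]
parts.columns = ['general_cat', 'subcat_1', 'subcat_2']

# Rows without a category become the placeholder label
parts = parts.fillna('No Label')
print(parts)
```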

x = train['general_cat'].value_counts().index.values.astype('str')

y = train['general_cat'].value_counts().values

pct = [('%.2f' % (v * 100)) + '%' for v in (y / len(train))]

tracel = go.Bar(x=x, y=y, text=pct)

layout = dict(title="Number of Items by Main Category", yaxis=dict(title='Count'), xaxis=dict(title='Category'))

fig = dict(data=[tracel], layout=layout)

py.iplot(fig)

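The Women category dominates with about 45% of all items; Beauty is second and Kids is third.

There are many subcategories, so we only show the distribution of the 15 most common ones.

# Distribution of the top 15 subcategories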

x = train['subcat_1'].value_counts().index.values.astype('str')[:15]

y = train['subcat_1'].value_counts().values[:15]

pct = [('%.2f' % (v * 100)) + '%' for v in (y / len(train))][:15]

tracel = go.Bar(x=x, y=y, text=pct, marker=dict(color=y, colorscale='Portland', showscale=True, reversescale=False))

layout = dict(title="Number of Items by Sub Category(Top 15)", yaxis=dict(title='Count'),

              xaxis=dict(title='SubCategory'))

fig = dict(data=[tracel], layout=layout)

py.iplot(fig)

Next, look at the price range of each main category.

# Price range by main category

general_cats = train['general_cat'].unique()

x = [train.loc[train['general_cat'] == cat, 'price'] for cat in general_cats]

data = [go.Box(y=np.log(x[i] + 1), name=general_cats[i]) for i in range(len(general_cats))]

layout = dict(title='Price Distribution by General Category', yaxis=dict(title='log(price+1)'),

              xaxis=dict(title='Category'))

fig = dict(data=data, layout=layout)

py.iplot(fig)

2.4 Brand distribution

Now look at how the items are distributed across brands.

# Item counts of the top 10 brands

x = train['brand_name'].value_counts().index.values.astype('str')[:10]

y = train['brand_name'].value_counts().values[:10]

tracel = go.Bar(x=x, y=y, marker=dict(color=y, colorscale='Portland', showscale=True, reversescale=False))

layout = dict(title="Top 10 Brand by Number of Items", yaxis=dict(title='Count'), xaxis=dict(title='Brand Name'))

fig = dict(data=[tracel], layout=layout)

py.iplot(fig)

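2.5 Effect of description length on price

Count the words in each item description and study how the length relates to price.

# Effect of the item description on price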

def wordCount(text):

    try:

        text = text.lower()

        regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

        txt = regex.sub(' ', text)

        words = [w for w in txt.split(" ") if w not in stop_words.STOP_WORDS and len(w) > 3]

        return len(words)

    except AttributeError:  # missing description: NaN is a float and has no .lower

        return 0

train['desc_len'] = train['item_description'].apply(lambda x: wordCount(x))

test['desc_len'] = test['item_description'].apply(lambda x: wordCount(x))

print(train.head())

df = train.groupby('desc_len')['price'].mean().reset_index()

tracel = go.Scatter(x=df['desc_len'], y=np.log(df['price'] + 1), mode='lines+markers', name='lines+markers')

layout = dict(title='Average Log(Price) by Description Length', yaxis=dict(title='Average Log(Price)'),

              xaxis=dict(title='Description Length'))

fig = dict(data=[tracel], layout=layout)

py.iplot(fig)

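Items with medium-length descriptions have the highest average price. Items with very short descriptions may be simple products and therefore cheaper, while items with very long descriptions may be niche products, which could also explain their lower prices.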
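To see what the length measure actually counts, here is a self-contained sketch of the same idea: lowercase the text, replace punctuation and digits with spaces, and count the words that are longer than three characters and not stop words. The five-word `STOP_WORDS` set is a stand-in assumed for this example so that it runs without extra packages; the article's code relies on a library stop-word list instead.

```python
import re
import string

# Tiny stand-in stop-word list (an assumption for this sketch)
STOP_WORDS = {'this', 'that', 'with', 'have', 'very'}

# Punctuation, digits and line breaks are all replaced by spaces
_CLEAN = re.compile('[' + re.escape(string.punctuation) + r'0-9\r\t\n]')

def word_count(text):
    """Count words longer than 3 characters that are not stop words."""
    if not isinstance(text, str):  # e.g. NaN for a missing description
        return 0
    words = _CLEAN.sub(' ', text.lower()).split()
    return sum(1 for w in words if w not in STOP_WORDS and len(w) > 3)

print(word_count('Brand new with tags! Size 10, never worn.'))
print(word_count(float('nan')))
```

Because short words and stop words are excluded, the resulting length is smaller than a plain whitespace word count.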

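3. Keywords in the item descriptions

3.1 Counting common keywords

Count the keywords most often used in the item descriptions. Note that some items have no description; those rows need to be dropped first.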

print(train.item_description.isnull().sum())

# Drop rows with a missing description

train = train[pd.notnull(train['item_description'])]

# Collect and tokenize the description text of each main category

cat_desc = dict()

for cat in general_cats:

    text = ' '.join(train.loc[train['general_cat'] == cat, 'item_description'].values)

    # word_tokenize returns word-level tokens (it needs the NLTK punkt data)

    cat_desc[cat] = nltk.word_tokenize(text)

# Count the 20 most frequent tokens across all categories

flat_lst = [item for sublist in list(cat_desc.values()) for item in sublist]

allWordsCount = Counter(flat_lst)

all_top20 = allWordsCount.most_common(20)

x = [w[0] for w in all_top20]

y = [w[1] for w in all_top20]

tracel = go.Bar(x=x, y=y)

layout = dict(title='Word Frequency', yaxis=dict(title='Count'), xaxis=dict(title='Word'))

fig = dict(data=[tracel], layout=layout)

py.iplot(fig)

3.2 Keywords of each category

First tokenize the item descriptions and remove the stop words.

# Show the keywords of each category

stop = set(stopwords.words('english'))

def tokenize(text):

    try:

        regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

        text = regex.sub(' ', text)

        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]

        tokens = []

        for token_by_sent in tokens_:

            tokens += token_by_sent

        tokens = list(filter(lambda t: t.lower() not in stop, tokens))

        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]

        filtered_tokens = [w.lower() for w in filtered_tokens if len(w) >= 3]

        return filtered_tokens

    except TypeError as e:  # non-string input such as NaN

        print(text, e)

        return []

train['tokens'] = train['item_description'].map(tokenize)

test['tokens'] = test['item_description'].map(tokenize)

train.reset_index(drop=True, inplace=True)

test.reset_index(drop=True, inplace=True)

for description, tokens in zip(train['item_description'].head(), train['tokens'].head()):

    print('description:', description)

    print('tokens:', tokens)

    print()

cat_desc = dict()

for cat in general_cats:

    text = ' '.join(train.loc[train['general_cat'] == cat, 'item_description'].values)

    cat_desc[cat] = tokenize(text)

cat_desc100 = dict()

for key, value in cat_desc.items():

    cat_desc100[key] = Counter(value).most_common(100)

def generate_wordcloud(tup):

    wordcloud = WordCloud(background_color='white', max_words=50, max_font_size=40, random_state=42).generate_from_frequencies(dict(tup))

    return wordcloud

fig, axes = plt.subplots(len(cat_desc100) // 2 + 1, 2, figsize=(30, 15))

for i, (key, cat) in enumerate(cat_desc100.items()):

    ax = axes[i // 2, i % 2]

    ax.imshow(generate_wordcloud(cat), interpolation='bilinear')

    ax.axis('off')

    ax.set_title("%s Top 100" % key, fontsize=12)

plt.show()

For each category, the 100 most frequent keywords and their counts are used to generate a word cloud.
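The counting step behind the word clouds is just `Counter.most_common` applied per category. A minimal sketch with made-up token lists (the category names and tokens are illustrative only, and `top_n` is 2 here instead of 100):

```python
from collections import Counter

# Hypothetical token lists per category
tokens_by_cat = {
    'Women': ['dress', 'size', 'new', 'dress', 'size', 'dress'],
    'Electronics': ['case', 'iphone', 'new', 'case'],
}

top_n = 2  # the article keeps 100 per category

# (word, count) pairs of the most frequent tokens in each category
top_words = {cat: Counter(tokens).most_common(top_n) for cat, tokens in tokens_by_cat.items()}

for cat, pairs in top_words.items():
    print(cat, pairs)
```

Each list of (word, count) pairs converts directly to a dictionary of frequencies, which is the form a word-cloud generator typically expects.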

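4. The tf-idf algorithm

The word clouds show that many frequent keywords are shared across categories, so they do not characterize any single category. We therefore use tf-idf to mine keywords. The basic idea of tf-idf is that the more often a term appears in one document and the less often it appears in the other documents, the more likely it is to be a keyword. We first vectorize the descriptions into at most 180,000 features and then compute the tf-idf scores.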
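As a worked example of the scoring, the sketch below computes tf-idf by hand for three made-up, already tokenized descriptions. It uses raw term counts as tf, the smoothed idf that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization per document. It is meant to illustrate the arithmetic, not to replace `TfidfVectorizer`.

```python
import math
from collections import Counter

# Three hypothetical, already tokenized descriptions
docs = [
    ['new', 'dress', 'size', 'small'],
    ['new', 'iphone', 'case'],
    ['new', 'dress', 'never', 'worn'],
]

n_docs = len(docs)

# Document frequency: in how many descriptions does each term occur?
df = Counter(term for doc in docs for term in set(doc))

# Smoothed idf (the scikit-learn default, smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc):
    """tf-idf weights of one document, L2-normalized."""
    tf = Counter(doc)
    raw = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

for term, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print('%-7s idf=%.3f' % (term, weight))
print(tfidf(docs[1]))
```

The term "new" occurs in every description and gets the lowest idf, while terms that occur in a single description get the highest.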

# tf-idf

vectorizer = TfidfVectorizer(min_df=10, max_features=180000, tokenizer=tokenize, ngram_range=(1, 2))

all_desc = np.append(train['item_description'].values, test['item_description'].values)

vz = vectorizer.fit_transform(list(all_desc))

print(vz.shape)

tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

tfidf = pd.DataFrame.from_dict(tfidf, orient='index')

tfidf.columns = ['tfidf']

print(tfidf.sort_values(by=['tfidf'], ascending=True).head(10))

print(tfidf.sort_values(by=['tfidf'], ascending=False).head(10))

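The scores printed here are the idf weights. The lowest-scoring terms are mostly stop-word-like: they are very frequent but carry little discriminative value. The highest-scoring terms are rare and more specific, so they are better candidates for keywords.

Next, SVD reduces the features to 30 dimensions (n_comp = 30 in the code), and t-SNE then reduces them to 2 dimensions for plotting.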
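The SVD step can be illustrated with plain NumPy. The sketch below takes a small random matrix as a stand-in for a tf-idf matrix (an assumption made for the example; real tf-idf matrices are sparse and much larger), keeps the top k singular directions, and projects every row onto them. `TruncatedSVD` produces the same kind of projection with a solver suited to sparse input.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for a tf-idf matrix: 100 documents x 40 terms
X = rng.random((100, 40))

k = 5  # number of components to keep (the article uses 30)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Project every document onto the top-k right singular vectors
X_reduced = X @ Vt[:k].T

# The same projection expressed through U and the singular values
assert np.allclose(X_reduced, U[:, :k] * s[:k])
print(X.shape, '->', X_reduced.shape)
```

Reducing the dimensionality first makes the subsequent t-SNE step considerably cheaper, since t-SNE scales poorly with the number of input features.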

trn = train.copy()

tst = test.copy()

trn['is_train'] = 1

tst['is_train'] = 0

sample_sz = 15000

# Draw a random sample

combined_df = pd.concat([trn, tst])

combined_sample = combined_df.sample(n=sample_sz)

vz_sample = vectorizer.fit_transform(list(combined_sample['item_description']))

# Dimensionality reduction with SVD

n_comp = 30

svd = TruncatedSVD(n_components=n_comp, random_state=42)

svd_tfidf = svd.fit_transform(vz_sample)

# t-SNE down to 2 dimensions (newer scikit-learn releases rename n_iter to max_iter)

tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)

tsne_tfidf = tsne_model.fit_transform(svd_tfidf)

plot_tfidf = bp.figure(width=700, height=600, title='tf-idf clustering of the item description',

                       tools='pan, wheel_zoom, box_zoom, reset, hover', x_axis_type=None, y_axis_type=None,

                       min_border=1)

combined_sample.reset_index(inplace=True, drop=True)

tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])

tfidf_df['description'] = combined_sample['item_description']

tfidf_df['tokens'] = combined_sample['tokens']

tfidf_df['category'] = combined_sample['general_cat']

plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)

hover = plot_tfidf.select(dict(type=HoverTool))

hover.tooltips = {'description': '@description', 'tokens': '@tokens', 'category': '@category'}

show(plot_tfidf)

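Descriptions with similar keywords are plotted close to one another.

5. Grouping the descriptions

5.1 Grouping the points above with a clustering algorithm

# Clustering with k-means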

num_clusters = 10

kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1, init_size=10000, batch_size=1000,

                               verbose=0, max_iter=1000)

kmeans_model.fit(vz_sample)

kmeans_clusters = kmeans_model.predict(vz_sample)

kmeans_distances = kmeans_model.transform(vz_sample)

tsne_kmeans = tsne_model.fit_transform(kmeans_distances)

kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])

kmeans_df['cluster'] = kmeans_clusters

kmeans_df['description'] = combined_sample['item_description']

kmeans_df['category'] = combined_sample['general_cat']

plot_kmeans = bp.figure(width=700, height=600, title='KMeans clustering of the description',

                        tools='pan, wheel_zoom, box_zoom, reset, hover', x_axis_type=None, y_axis_type=None,

                        min_border=1)

print(kmeans_clusters)

colormap = {'0': 'red', '1': 'green', '2': 'blue', '3': 'black', '4': 'yellow', '5': 'pink', '6': 'purple', '7': 'grey',

            '8': 'brown', '9': 'orange'}

# Map each cluster id to its color through the colormap defined above

color = pd.Series(kmeans_clusters).astype(str).map(colormap)

source = ColumnDataSource(

    data=dict(x=kmeans_df['x'], y=kmeans_df['y'], color=color, description=kmeans_df['description'],

              category=kmeans_df['category'], cluster=kmeans_df['cluster']))

plot_kmeans.scatter(x='x', y='y', color='color', source=source)

hover = plot_kmeans.select(dict(type=HoverTool))

hover.tooltips = {'description': '@description', 'category': '@category', 'cluster': '@cluster'}

show(plot_kmeans)
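For readers who want to see what the clustering does internally, here is a minimal sketch of plain k-means (Lloyd's algorithm) in NumPy on two made-up blobs of 2-D points. It is a simplified illustration: `MiniBatchKMeans` updates the centers from small random batches and uses k-means++ initialization, whereas this hypothetical `kmeans` helper uses all points in every step and starts from randomly chosen data points.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (a simpler choice than k-means++)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Final distances and labels with respect to the final centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers, dists

rng = np.random.default_rng(1)
# Two hypothetical, well separated blobs of 2-D points
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers, dists = kmeans(X, k=2)
print(np.bincount(labels))
```

The returned `dists` matrix holds the distance of every point to every center, which is the same kind of cluster-distance representation that the article feeds into t-SNE.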

5.2 Grouping with an LDA topic model

Besides clustering, an LDA topic model can also be used to group the descriptions.

# Grouping with LDA

cvectorizer = CountVectorizer(min_df=4, max_features=180000, tokenizer=tokenize, ngram_range=(1, 2))

cvz = cvectorizer.fit_transform(combined_sample['item_description'])

lda_model = LatentDirichletAllocation(n_components=10, learning_method='online', max_iter=20, random_state=42)

X_topics = lda_model.fit_transform(cvz)

# Extract the top words of each topic

n_top_words = 10

topic_summaries = []

topic_word = lda_model.components_

vocab = cvectorizer.get_feature_names_out()

for i, topic_dist in enumerate(topic_word):

    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]

    topic_summaries.append(' '.join(topic_words))

    print('Topic {}:{}'.format(i, '|'.join(topic_words)))

tsne_lda = tsne_model.fit_transform(X_topics)

# Dominant topic of each description: the topic with the largest weight in its row

lda_keys = np.asarray(X_topics).argmax(axis=1)

lda_df = pd.DataFrame(tsne_lda, columns=['x', 'y'])

lda_df['description'] = combined_sample['item_description']

lda_df['category'] = combined_sample['general_cat']

lda_df['topic'] = lda_keys

lda_df['topic'] = lda_df['topic'].map(int)

plot_lda = bp.figure(width=700, height=600, title='LDA topic visualization',

                     tools='pan, wheel_zoom, box_zoom, reset, hover', x_axis_type=None, y_axis_type=None, min_border=1)

# Color each point by its dominant LDA topic

source = ColumnDataSource(

    data=dict(x=lda_df['x'], y=lda_df['y'], color=lda_df['topic'].astype(str).map(colormap),

              description=lda_df['description'], topic=lda_df['topic'], category=lda_df['category']))

plot_lda.scatter(x='x', y='y', color='color', source=source)

hover = plot_lda.select(dict(type=HoverTool))

hover.tooltips = {'description': '@description', 'topic': '@topic', 'category': '@category'}

show(plot_lda)

The descriptions are thus grouped into ten topics.
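Two small array operations do most of the work when reading an LDA result: sorting each topic's word weights to get its top words, and taking the largest entry in each document's topic weights to get its dominant topic. The sketch below shows both on made-up matrices; the vocabulary and all weights are hypothetical.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (rows: topics, columns: words)
vocab = np.array(['dress', 'size', 'iphone', 'case', 'new'])
topic_word = np.array([
    [9.0, 7.0, 0.1, 0.2, 3.0],   # a clothing-like topic
    [0.2, 0.5, 8.0, 6.0, 2.5],   # a phone-accessory-like topic
])

n_top_words = 2
# argsort is ascending, so reverse it to list the heaviest words first
top_words = [vocab[np.argsort(row)[::-1][:n_top_words]].tolist() for row in topic_word]
for i, words in enumerate(top_words):
    print('Topic %d: %s' % (i, ' | '.join(words)))

# Hypothetical document-topic weights (rows: documents, columns: topics)
doc_topic = np.array([[0.9, 0.1], [0.3, 0.7], [0.55, 0.45]])
dominant = doc_topic.argmax(axis=1)  # index of the heaviest topic per document
print(dominant)
```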


NLP for Beginners (2): Product Information Visualization and Text Analysis in Practice