【337】Text Mining Using Twitter Streaming API and Python
Reference: An Introduction to Text Mining using Twitter Streaming API and Python
Reference: How to Register a Twitter App in 8 Easy Steps
- Getting Data from Twitter Streaming API
- Reading and Understanding the data
- Mining the tweets
Key Methods:
- map()
- lambda
- set()
- pandas.DataFrame()
- matplotlib
1. Getting Data from Twitter Streaming API
The file twitter_streaming.py is used to extract tweets from Twitter.
# Import the necessary classes from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Variables that contain the user credentials to access the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # This line filters Twitter streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])
You can run the script from the command prompt (CMD) and redirect its output to a file with the following command:
python twitter_streaming.py > twitter_data.txt
Next, we read the text file line by line and parse each line as JSON.
import json

tweets_data_path = r"..\twitter_data.txt"
tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except ValueError:
            # Skip blank or malformed lines
            continue
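As a minimal sketch of the same parsing step, the function below works on any iterable of lines, so it can be tried without a data file. The name parse_tweet_lines and the sample lines are hypothetical and only stand in for twitter_data.txt.

```python
import json


def parse_tweet_lines(lines):
    """Parse an iterable of text lines, keeping only lines that hold a JSON object."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise malformed lines
        if isinstance(record, dict):
            tweets.append(record)
    return tweets


# Hypothetical sample lines standing in for the contents of twitter_data.txt
sample_lines = [
    '{"text": "Learning python", "lang": "en"}',
    '',
    '{"text": "broken',
    '{"text": "ruby tutorial", "lang": "en"}',
]
parsed = parse_tweet_lines(sample_lines)
print(len(parsed))
```

Catching json.JSONDecodeError instead of using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being silently swallowed.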
The parsed tweets are now stored in tweets_data, and we can read specific fields with the following script.
Reference: python JSON only get keys in first level
# get the text content and language of the first 9 tweets
num = 0
for tweet in tweets_data:
    num += 1
    if num == 10:
        break
    else:
        tweet_text = tweet["text"]
        tweet_lang = tweet["lang"]
        print(str(num))
        print(tweet_lang)
        print(tweet_text)
        print()

# get all the keys from json
tweets_data[0].keys()
2. Reading and Understanding the data
We can also collect the values of a specific key from every tweet by combining list(), map() and a lambda expression, as in the following session.
Reference: Python中map与lambda的结合使用 (Using map together with lambda in Python)
>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'
We can also use set() to get the unique values in the list.
Reference: Python set() 函数 (The Python set() function)
Reference: Python统计列表中的重复项出现的次数的方法 (How to count occurrences of duplicate items in a Python list)
>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
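The map() and lambda combination above can also be written as a list comprehension, and collections.Counter counts how often each value occurs. The following is a small sketch on hypothetical sample tweets. It assumes that some records may lack the lang key, in which case tweet['lang'] would raise a KeyError while dict.get returns None.

```python
from collections import Counter

# Hypothetical tweets; the second one has no "lang" key
sample_tweets = [
    {"text": "a", "lang": "en"},
    {"text": "b"},
    {"text": "c", "lang": "ja"},
    {"text": "d", "lang": "en"},
]

# dict.get returns None instead of raising KeyError when the key is absent
langs = [tweet.get("lang") for tweet in sample_tweets]

unique_langs = set(langs)      # distinct values, order not guaranteed
lang_counts = Counter(langs)   # value -> number of occurrences

print(unique_langs)
print(lang_counts.most_common(1))
```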
Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.
>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] != None else None, tweets_data))
>>> tweets['lang'].value_counts()
en 1119
ja 278
es 113
pt 36
und 26
...
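The same DataFrame can be built in one step from a dictionary of columns. The sketch below uses hypothetical sample tweets and a hypothetical helper, country_of, to keep the handling of a missing place readable.

```python
import pandas as pd

# Hypothetical tweets; "place" is None when the tweet carries no location
sample_tweets = [
    {"text": "python tutorial", "lang": "en", "place": {"country": "Canada"}},
    {"text": "ruby", "lang": "ja", "place": None},
    {"text": "javascript", "lang": "en", "place": None},
]


def country_of(tweet):
    """Return the country name, or None when the tweet has no place."""
    place = tweet.get("place")
    return place["country"] if place is not None else None


tweets = pd.DataFrame({
    "text": [t["text"] for t in sample_tweets],
    "lang": [t["lang"] for t in sample_tweets],
    "country": [country_of(t) for t in sample_tweets],
})
print(tweets["lang"].value_counts())
```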
Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.
>>> tweets_by_lang = tweets['lang'].value_counts()
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()
Next, we will create a chart describing the Top 5 countries from which the tweets were sent.
>>> tweets_by_country = tweets['country'].value_counts()
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()
3. Mining the tweets
Our main goals in these text mining tasks are to compare the popularity of the Python, Ruby, and JavaScript programming languages and to retrieve programming tutorial links. We will do this in 3 steps:
- Add tags to our tweets DataFrame so that we can manipulate the data easily.
- Target tweets that have the "programming" or "tutorial" keywords.
- Extract links from the relevant tweets.
Adding Python, Ruby, and Javascript tags
First, we will create a function that checks whether a specific keyword is present in a text. We will do this by using regular expressions.
Python provides a library for regular expressions called re. We will start by importing this library.
Next, we will create a function called word_in_text(word, text). This function returns True if word is found in text; otherwise it returns False.
>>> import re
>>> def word_in_text(word, text):
        word = word.lower()
        text = text.lower()
        match = re.search(word, text)
        if match:
            return True
        return False
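Because word_in_text passes the keyword to re.search as a pattern, a keyword with regex metacharacters (for example "c++") is not treated literally, and a keyword also matches inside longer words. The sketch below shows two hypothetical variants, word_in_text_safe and whole_word_in_text, that use re.escape and word boundaries. They are illustrations, not part of the original tutorial.

```python
import re


def word_in_text_safe(word, text):
    """Return True if `word` occurs in `text` as a literal, case-insensitive match."""
    pattern = re.escape(word)  # treat every character of the keyword literally
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def whole_word_in_text(word, text):
    """Return True only if `word` appears as a whole word, not inside a longer one."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(word_in_text_safe("c++", "I like C++ and Python"))
print(whole_word_in_text("ruby", "The Rubygate scandal"))
print(whole_word_in_text("ruby", "Learning Ruby today"))
```

The whole-word variant relies on \b, which only works as intended when the keyword begins and ends with a letter, digit, or underscore.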
Next, we will add 3 columns to our tweets DataFrame by applying this function to the text column with pandas.Series.apply().
>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))
We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:
>>> print(tweets['python'].value_counts()[True])
447
>>> print(tweets['ruby'].value_counts()[True])
529
>>> print(tweets['javascript'].value_counts()[True])
275
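A lookup with value_counts()[True] can fail when a keyword never matches, because the result then has no True entry. As a hedged alternative, summing a boolean column counts the True values directly. The small DataFrame below, named flags, is hypothetical sample data.

```python
import pandas as pd

# Hypothetical keyword flags; "ruby" never matches in this sample
flags = pd.DataFrame({
    "python": [True, False, True],
    "ruby": [False, False, False],
})

# True counts as 1 and False as 0, so the sum is the number of matches
python_count = int(flags["python"].sum())
ruby_count = int(flags["ruby"].sum())
print(python_count, ruby_count)

# Series.get returns the default when the True label is missing
ruby_via_get = flags["ruby"].value_counts().get(True, 0)
print(ruby_via_get)
```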
We can make a simple comparison chart by executing the following:
>>> prg_langs = ['python', 'ruby', 'javascript']
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()
This shows that the keyword ruby is the most popular, followed by python and then javascript. However, the tweets DataFrame contains all tweets that contain one of the 3 keywords and is not restricted to tweets about the programming languages. For example, many tweets that contain the keyword ruby relate to the political scandal Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.
Targeting relevant tweets
We are interested in targeting tweets that are related to programming languages. Such tweets often contain one of 2 keywords: "programming" or "tutorial". We will add 2 columns to our tweets DataFrame to record this information.
>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))
We will add another column called relevant that takes the value True if the tweet contains either the "programming" or the "tutorial" keyword; otherwise it takes the value False.
>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))
We can print the counts of relevant tweets by executing the commands below.
>>> print(tweets['programming'].value_counts()[True])
55
>>> print(tweets['tutorial'].value_counts()[True])
22
>>> print(tweets['relevant'].value_counts()[True])
74
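The relevant column is the element-wise OR of the programming and tutorial columns, which is why its count (74) is smaller than 55 + 22: a tweet that contains both keywords is counted once. A minimal sketch on hypothetical flags:

```python
import pandas as pd

# Hypothetical keyword flags for four tweets; the last tweet has both keywords
flags = pd.DataFrame({
    "programming": [True, False, False, True],
    "tutorial": [False, True, False, True],
    "python": [True, True, False, False],
})

# Element-wise OR of two boolean columns
flags["relevant"] = flags["programming"] | flags["tutorial"]

# Element-wise AND counts tweets that are both relevant and about python
relevant_python = int((flags["relevant"] & flags["python"]).sum())

print(int(flags["relevant"].sum()), relevant_python)
```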
We can now compare the popularity of the programming languages by executing the commands below.
The expression tweets[tweets['relevant'] == True]['python'] selects the python column for the rows where relevant is True.
>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])
11
Python is the most popular with a count of 31, followed by javascript with a count of 11 and ruby with a count of 8. We can make a comparison chart by executing the following:
>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],
tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],
tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()
Extracting links from the relevant tweets
Now that we have extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses a regular expression to retrieve a link that starts with "http://", "https://", or "www." from a text. This function returns the first URL it finds; otherwise it returns an empty string.
>>> def extract_link(text):
        regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
        match = re.search(regex, text)
        if match:
            return match.group()
        return ''
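Since re.search stops at the first match, extract_link returns only one URL per tweet. If a tweet may carry several links, re.findall with the same pattern returns all of them. The function name extract_links and the sample text are hypothetical.

```python
import re

# Same pattern as extract_link, compiled once for reuse
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')


def extract_links(text):
    """Return every URL found in `text`, in order of appearance."""
    return URL_PATTERN.findall(text)


sample = "Python tutorial https://t.co/abc123 and mirror http://example.com/py"
print(extract_links(sample))
print(extract_links("no links here"))
```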
Next, we will add a column called link to our tweets DataFrame. This column will contain the URL extracted from each tweet.
>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of tweets DataFrame and contains all relevant tweets that have a link.
In other words, we slice the original DataFrame.
>>> tweets_relevant = tweets[tweets['relevant'] == True]
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']
We can now print out all links for python, ruby, and javascript by executing the commands below:
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])
40 https://t.co/zoAgyQuMAZ
105 https://t.co/ogaPbuIbEW
274 https://t.co/y4sUmovFOn
329 https://t.co/A030fqWeWA
339 https://t.co/LaaVc5T2rQ
391 https://t.co/8bYvlziCZb
413 https://t.co/8bYvlziCZb
436 https://t.co/EByqxT1qyN
444 https://t.co/8bYvlziCZb
445 https://t.co/5Jujg6h31B
462 https://t.co/UrFHlOaJYf
476 https://t.co/5Jujg6h31B
477 https://t.co/EByqxT1qyN
589 https://t.co/UrFHlOaJYf
603 https://t.co/5Jujg6h31B
822 https://t.co/Oc21FrzQc5
1060 https://t.co/qOAIuKfyD0
1097 https://t.co/qOAIuKfyD0
1248 https://t.co/V3ZNKuYsK7
1278 https://t.co/qOAIuKfyD0
1411 https://t.co/szHRHavQKy
1594 https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])
782 https://t.co/JgY40r2NSo
833 https://t.co/JgY40r2NSo
1177 https://t.co/xycOG3ndi9
1254 https://t.co/xycOG3ndi9
1293 https://t.co/LMHW050TGs
1328 https://t.co/SS4DzEnSBZ
1393 https://t.co/NZlUce5Ne8
1619 https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])
130 https://t.co/AbJFaSI0B8
286 https://t.co/7dNBIsQ5Gq
467 https://t.co/3YIK588j8t
471 https://t.co/vjBJWWzvfv
830 https://t.co/T4mUjwUcgL
1093 https://t.co/wvLZLjuVKF
1180 https://t.co/luxL2qbxte
1526 https://t.co/G3ZTFL0RKv
Name: link, dtype: object
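The lists above contain repeated links (for example, https://t.co/8bYvlziCZb appears three times), most likely because the same tweet was retweeted. A sketch of removing the repeats with pandas.Series.drop_duplicates, using hypothetical URLs:

```python
import pandas as pd

# Hypothetical link column; the index mimics tweet row numbers
links = pd.Series(
    ["https://t.co/aaa", "https://t.co/bbb", "https://t.co/aaa"],
    index=[391, 413, 444],
    name="link",
)

# Keep the first occurrence of each link
unique_links = links.drop_duplicates()
print(unique_links.tolist())

# value_counts shows which links were shared most often
link_counts = links.value_counts()
print(link_counts.head(1))
```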