Reference: An Introduction to Text Mining using Twitter Streaming API and Python

Reference: How to Register a Twitter App in 8 Easy Steps

  • Getting Data from Twitter Streaming API
  • Reading and Understanding the data
  • Mining the tweets

Key Methods:

  • Map()
  • Lambda()
  • Set()
  • Pandas.DataFrame()
  • matplotlib

1. Getting Data from Twitter Streaming API

twitter_streaming.py, this file is used to extract information from Twitter.

#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream #Variables that contains the user credentials to access Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET" #This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener): def on_data(self, data):
print(data)
return True def on_error(self, status):
print(status) if __name__ == '__main__': #This handles Twitter authetification and the connection to Twitter Streaming API
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l) #This line filter Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
stream.filter(track=['python', 'javascript', 'ruby'])

You can use the following command to store information in the specific file. (By CMD)

python twitter_streaming.py > twitter_data.txt

Then we will get the information from the above text file and store them in JSON format.

import json
tweets_data_path = r"..\twitter_data.txt"
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
try:
tweet = json.loads(line)
tweets_data.append(tweet)
except:
continue

Data are stored in tweets_data, and we can get the specific information by the following scripts.

Reference: python JSON only get keys in first level

# get the text content, language from the specific tweets
num = 0
for tweet in tweets_data:
num += 1
if num == 10:
break
else:
tweet_text = tweet["text"]
tweet_lang = tweet["lang"]
print(str(num))
print(tweet_lang)
print(tweet_text)
print() # get all the keys from json
tweets_data[0].keys()

2. Reading and Understanding the data

Now we can also get the specific key by list(), map() and lambda() with the following scripts.

Reference: Python中map与lambda的结合使用

>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'

Or we can also use set() method to get the unique values of the list.

Reference: Python set() 函数

Reference: Python统计列表中的重复项出现的次数的方法

>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}

Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.

>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] != None else None, tweets_data))
>>> tweets['lang'].value_counts()
en 1119
ja 278
es 113
pt 36
und 26
...

Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.

>>> tweets_by_lang = tweets['lang'].value_counts()

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()

Next, we will create a chart describing the Top 5 countries from which the tweets were sent.

>>> tweets_by_country = tweets['country'].value_counts()

>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()

3. Mining the tweets

Out main goals in these text mining tasks are: compare the popularity of Python, Ruby and Javascript programming languages and to retrieve programming tutorial links. We will do this in 3 steps:

  • We will add tags to our tweets DataFrame in order to be able to manipulate the data easily.
  • Target tweets that have "programming" or tutorial" keywords.
  • Extract links from the relevant tweets.

Adding Python, Ruby, and Javascript tags

First, we will create a function that checks if a specific keyword is present in a text. We will do this by using regular expression (正则表达式).

Python provides a library for regular expression called re. We will start by importing this library.

Next, we will create a function called word_in_text(word, text). This function return True if a word is found in text, otherwise it returns False.

>>> import re
>>> def word_in_text(word, text):
word = word.lower()
text = text.lower()
match = re.search(word, text)
if match:
return True
return False

Next, we will add 3 columns to our tweets DataFrame by pandas.DataFrame.apply().

>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))

We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:

>>> print(tweets['python'].value_counts()[True])
447
>>> print(tweets['ruby'].value_counts()[True])
529
>>> print(tweets['javascript'].value_counts()[True])
275

We can make a simple comparison chart by executing the following:

>>> prg_langs = ['python', 'ruby', 'javascript']
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

This shows, that the keyword ruby is the most popular, followed by python then javascript. However, the tweets DataFrame contains information about all tweets that contains one of the 3 keywords and doesn't restrict the information to the programming languages. For example, there are a lot of tweets that contains the keyword ruby and that are related to a political scandal Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.

Targeting relevant tweets

We are interested in targeting tweets that are related to programming languages. Such tweets often have one of the 2 keywords: "programming" or "tutorial". We will create 2 additional columns to our tweets DataFrame where we will add this information.

>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))

We will add an additional column called relevant that take value True if the tweet has either "programming" or "tutorial" keyword, otherwise it takes value False.

>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))

We can print the counts of relevant tweet by executing the commands below.

>>> print(tweets['programming'].value_counts()[True])
55
>>> print(tweets['tutorial'].value_counts()[True])
22
>>> print(tweets['relevant'].value_counts()[True])
74

We can compare now the popularity of the programming languages by executing the commands below.

tweets[tweets['relevant'] == True]['python'] # 将 relevant 为 True 的索引对应 Python 组成一个新的列
>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])
11

Python is the most popular with a count of 31, followed by javascript by a count of 11, and ruby by a count of 185. We can make a comparison

>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],
tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],
tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

Extracting links from the relevants tweets

Now that we extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses regular expressions for retrieving link that start with "http://" or "https:" from a text. This function will return the url if found, otherwise it returns an empty string.

>>> def extract_link(text):
regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
match = re.search(regex, text)
if match:
return match.group()
return ''

Next, we will add a column called link to our tweets DataFrame. This column will contain the urls information.

>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))

Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of tweets DataFrame and contains all relevant tweets that have a link.

将原有 DataFrame 进行截取。

>>> tweets_relevant = tweets[tweets['relevant'] == True]
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']

We can now print out all links for python, ruby, and javascript by executing the commands below:

>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])
40 https://t.co/zoAgyQuMAZ
105 https://t.co/ogaPbuIbEW
274 https://t.co/y4sUmovFOn
329 https://t.co/A030fqWeWA
339 https://t.co/LaaVc5T2rQ
391 https://t.co/8bYvlziCZb
413 https://t.co/8bYvlziCZb
436 https://t.co/EByqxT1qyN
444 https://t.co/8bYvlziCZb
445 https://t.co/5Jujg6h31B
462 https://t.co/UrFHlOaJYf
476 https://t.co/5Jujg6h31B
477 https://t.co/EByqxT1qyN
589 https://t.co/UrFHlOaJYf
603 https://t.co/5Jujg6h31B
822 https://t.co/Oc21FrzQc5
1060 https://t.co/qOAIuKfyD0
1097 https://t.co/qOAIuKfyD0
1248 https://t.co/V3ZNKuYsK7
1278 https://t.co/qOAIuKfyD0
1411 https://t.co/szHRHavQKy
1594 https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])
782 https://t.co/JgY40r2NSo
833 https://t.co/JgY40r2NSo
1177 https://t.co/xycOG3ndi9
1254 https://t.co/xycOG3ndi9
1293 https://t.co/LMHW050TGs
1328 https://t.co/SS4DzEnSBZ
1393 https://t.co/NZlUce5Ne8
1619 https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])
130 https://t.co/AbJFaSI0B8
286 https://t.co/7dNBIsQ5Gq
467 https://t.co/3YIK588j8t
471 https://t.co/vjBJWWzvfv
830 https://t.co/T4mUjwUcgL
1093 https://t.co/wvLZLjuVKF
1180 https://t.co/luxL2qbxte
1526 https://t.co/G3ZTFL0RKv
Name: link, dtype: object

【337】Text Mining Using Twitter Streaming API and Python的更多相关文章

  1. An Introduction to Text Mining using Twitter Streaming

    Text mining is the application of natural language processing techniques and analytical methods to t ...

  2. 【LeetCode】299. Bulls and Cows 解题报告(Python)

    [LeetCode]299. Bulls and Cows 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ 题 ...

  3. 【LeetCode】743. Network Delay Time 解题报告(Python)

    [LeetCode]743. Network Delay Time 解题报告(Python) 标签(空格分隔): LeetCode 作者: 负雪明烛 id: fuxuemingzhu 个人博客: ht ...

  4. 【LeetCode】518. Coin Change 2 解题报告(Python)

    [LeetCode]518. Coin Change 2 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ 题目 ...

  5. 【LeetCode】474. Ones and Zeroes 解题报告(Python)

    [LeetCode]474. Ones and Zeroes 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ ...

  6. 【LeetCode】731. My Calendar II 解题报告(Python)

    [LeetCode]731. My Calendar II 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn/ 题 ...

  7. 【LeetCode】785. Is Graph Bipartite? 解题报告(Python)

    [LeetCode]785. Is Graph Bipartite? 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu. ...

  8. 【LeetCode】895. Maximum Frequency Stack 解题报告(Python)

    [LeetCode]895. Maximum Frequency Stack 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxueming ...

  9. 【LeetCode】764. Largest Plus Sign 解题报告(Python)

    [LeetCode]764. Largest Plus Sign 解题报告(Python) 作者: 负雪明烛 id: fuxuemingzhu 个人博客: http://fuxuemingzhu.cn ...

随机推荐

  1. bzoj 4539 [Hnoi2016]树——主席树+倍增

    题目:https://www.lydsy.com/JudgeOnline/problem.php?id=4539 明明就是把每次复制的一个子树当作一个点,这样能连出一个树的结构,自己竟然都没想到.思维 ...

  2. bzoj3143 游走

    Description 一个无向连通图,顶点从1编号到N,边从1编号到M. 小Z在该图上进行随机游走,初始时小Z在1号顶点,每一步小Z以相等的概率随机选 择当前顶点的某条边,沿着这条边走到下一个顶点, ...

  3. 利用x-requested-with判断请求是否是Ajax请求

    在服务器端判断request来自Ajax请求(异步)还是传统请求(同步):         两种请求在请求的Header不同,Ajax 异步请求比传统的同步请求多了一个头参数 1.传统同步请求参数 a ...

  4. [转]C#鼠标拖动任意控件

    C#鼠标拖动任意控件(winform) 分类: c#2011-08-15 22:51 178人阅读 评论(0) 收藏 举报 winformc#userwindowsobjectapi using Sy ...

  5. 【转载】深入浅出REST

    英文原文:A Brief Introduction to REST 作者:Stefan Tilkov ,译者:苑永凯,发布于 2007-12-25 不知你是否意识到,围绕着什么才是实现异构的应用到应用 ...

  6. 1109 Group Photo (25 分)

    1109 Group Photo (25 分) Formation is very important when taking a group photo. Given the rules of fo ...

  7. 1030 Travel Plan (30 分)

    1030 Travel Plan (30 分) A traveler's map gives the distances between cities along the highways, toge ...

  8. Python 标准库 ConfigParser 模块 的使用

    Python 标准库 ConfigParser 模块 的使用 demo #!/usr/bin/env python # coding=utf-8 import ConfigParser import ...

  9. Callable接口和Future

    本篇说明的是Callable和Future,它俩很有意思的,一个产生结果,一个拿到结果.        Callable接口类似于Runnable,从名字就可以看出来了,但是Runnable不会返回结 ...

  10. (转)Win7 64位系统下 Retional rose 2003 安装及破解

    网上关于Retional rose 2003安装和破解的文章比较多,这里,我结合自己的亲身体验,和大家分享一下win7 旗舰版 64位系统下Retional rose 2003(下面简称rose200 ...