Reference: An Introduction to Text Mining using Twitter Streaming API and Python

Reference: How to Register a Twitter App in 8 Easy Steps

  • Getting Data from Twitter Streaming API
  • Reading and Understanding the data
  • Mining the tweets

Key Methods:

  • map()
  • lambda
  • set()
  • pandas.DataFrame()
  • matplotlib

1. Getting Data from Twitter Streaming API

The file twitter_streaming.py is used to collect tweets from the Twitter Streaming API.

# Import the necessary classes from the tweepy library
# (StreamListener is only available in tweepy versions before 4.0)
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Variables that contain the user credentials to access the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # This line filters the Twitter stream to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])

You can run the script from the command line (CMD) and redirect its output to a file:

python twitter_streaming.py > twitter_data.txt

Next, we read that text file line by line and parse each line as JSON.

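import json

tweets_data_path = r"..\twitter_data.txt"
tweets_data = []
# Read the file line by line; each non-empty line should hold one JSON document
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except ValueError:
            # Skip blank or malformed lines
            continue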
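The same parsing step can be tried without a real twitter_data.txt. The sketch below is only an illustration: load_tweets is a hypothetical helper name, and the in-memory sample stands in for the file. It also counts the lines it skips, which makes it easier to notice when a capture is badly damaged.

```python
import io
import json

def load_tweets(lines):
    """Parse one JSON document per line, skipping blank or malformed lines."""
    tweets = []
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            tweets.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # e.g. a line cut off when the capture was interrupted
    return tweets, skipped

# Hypothetical sample standing in for twitter_data.txt
sample = io.StringIO(
    '{"text": "I love python", "lang": "en"}\n'
    '\n'
    '{"text": "truncated line\n'
    '{"text": "ruby tutorial", "lang": "en"}\n'
)

tweets, skipped = load_tweets(sample)
print(len(tweets), skipped)  # 2 1
```

Because any iterable of lines works, the function can also be given an open file object.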

The parsed tweets are now stored in tweets_data, and we can pull out specific fields with the following script.

Reference: python JSON only get keys in first level

# Print the language and text of the first 9 tweets
num = 0
for tweet in tweets_data:
    num += 1
    if num == 10:
        break
    else:
        tweet_text = tweet["text"]
        tweet_lang = tweet["lang"]
        print(str(num))
        print(tweet_lang)
        print(tweet_text)
        print()

# Get all the top-level keys of the first tweet's JSON object
tweets_data[0].keys()
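Indexing with tweet["text"] raises a KeyError when a record has no such key. Assuming the captured file may contain records that are not ordinary tweets (the second record below is a made-up stand-in for one), a small filter before the loop avoids that. This is a sketch on hypothetical sample data, not part of the original script.

```python
# Hypothetical sample: the second record has no "text" or "lang" key
records = [
    {"text": "learning python", "lang": "en"},
    {"limit": {"track": 5}},
    {"text": "javascript tips", "lang": "en"},
]

# Keep only the records that carry the keys we are about to read
tweets_only = [r for r in records if "text" in r and "lang" in r]

for num, tweet in enumerate(tweets_only, start=1):
    print(num, tweet["lang"], tweet["text"])
```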

2. Reading and Understanding the data

We can also collect the value of a specific key from every tweet by combining list(), map() and a lambda expression:

Reference: Using map together with lambda in Python (Python中map与lambda的结合使用)

>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'
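For readers who find map() with a lambda hard to read, a list comprehension gives the same result. The sketch below uses a small hypothetical list in place of tweets_data.

```python
# Hypothetical stand-in for tweets_data
tweets_sample = [
    {"text": "python is fun", "lang": "en"},
    {"text": "ruby on rails", "lang": "en"},
    {"text": "hola javascript", "lang": "es"},
]

# map() applies the lambda to every element; list() materialises the result
texts_map = list(map(lambda tweet: tweet["text"], tweets_sample))

# The equivalent list comprehension
texts_comp = [tweet["text"] for tweet in tweets_sample]

print(texts_map == texts_comp)  # True
```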

We can also use set() to get the unique values in the list.

Reference: Python set() function (Python set() 函数)

Reference: How to count occurrences of repeated items in a list in Python (Python统计列表中的重复项出现的次数的方法)

>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
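set() only says which values occur, not how often. If the counts are needed before moving to pandas, collections.Counter from the standard library is one option. The list below is a hypothetical sample, not the real langs data.

```python
from collections import Counter

# Hypothetical stand-in for the langs list
langs_sample = ["en", "ja", "en", "es", "en", "ja"]

unique_langs = set(langs_sample)     # unique values, unordered
lang_counts = Counter(langs_sample)  # value -> number of occurrences

print(sorted(unique_langs))        # ['en', 'es', 'ja']
print(lang_counts.most_common(2))  # [('en', 3), ('ja', 2)]
```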

Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.

>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] is not None else None, tweets_data))
>>> tweets['lang'].value_counts()
en     1119
ja      278
es      113
pt       36
und      26
...
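The country lambda above has to guard against tweets whose place is None. Pulling that logic into a named function can make it easier to read and to test. The helper name get_country and the sample records are assumptions made for this sketch.

```python
def get_country(tweet):
    """Return the country name from a tweet's place, or None if unavailable."""
    place = tweet.get("place")
    if place is None:
        return None
    return place.get("country")

# Hypothetical sample records
sample_tweets = [
    {"text": "hello", "place": {"country": "Japan"}},
    {"text": "no location", "place": None},
    {"text": "no place key at all"},
]

countries = [get_country(t) for t in sample_tweets]
print(countries)  # ['Japan', None, None]
```

The resulting list can be assigned to a DataFrame column in the same way as the map() result.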

Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.

>>> tweets_by_lang = tweets['lang'].value_counts()

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()

Next, we will create a chart describing the Top 5 countries from which the tweets were sent.

>>> tweets_by_country = tweets['country'].value_counts()

>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()

3. Mining the tweets

Our main goals in these text mining tasks are to compare the popularity of the Python, Ruby and JavaScript programming languages and to retrieve links to programming tutorials. We will do this in 3 steps:

  • We will add tags to our tweets DataFrame in order to be able to manipulate the data easily.
  • Target tweets that have the "programming" or "tutorial" keywords.
  • Extract links from the relevant tweets.

Adding Python, Ruby, and Javascript tags

First, we will create a function that checks if a specific keyword is present in a text. We will do this by using regular expressions.

Python provides a library for regular expressions called re. We will start by importing this library.

Next, we will create a function called word_in_text(word, text). This function returns True if word is found in text; otherwise it returns False.

>>> import re
>>> def word_in_text(word, text):
...     word = word.lower()
...     text = text.lower()
...     match = re.search(word, text)
...     if match:
...         return True
...     return False
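Note that word_in_text matches substrings, so 'ruby' is also found inside a longer word such as 'Rubygate'. One possible variation, shown here only as a sketch, is to require word boundaries with \b and to escape the keyword with re.escape so that it is treated as literal text. The name whole_word_in_text is hypothetical and is not used elsewhere in this post.

```python
import re

def word_in_text(word, text):
    """Substring-style match, equivalent to the function above."""
    return re.search(word.lower(), text.lower()) is not None

def whole_word_in_text(word, text):
    """Match word only when it stands alone as a whole word."""
    # re.escape keeps regex metacharacters in the keyword from being interpreted
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(word_in_text("ruby", "The Rubygate story"))         # True
print(whole_word_in_text("ruby", "The Rubygate story"))   # False
print(whole_word_in_text("ruby", "Learning #Ruby today")) # True
```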

Next, we will add 3 columns to our tweets DataFrame by calling pandas.Series.apply() on the text column.

>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))

We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:

>>> print(tweets['python'].value_counts()[True])
447
>>> print(tweets['ruby'].value_counts()[True])
529
>>> print(tweets['javascript'].value_counts()[True])
275

We can make a simple comparison chart by executing the following:

>>> prg_langs = ['python', 'ruby', 'javascript']
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks(x_pos)  # plt.bar centres each bar on its x position by default
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

This shows that the keyword ruby is the most popular, followed by python, then javascript. However, the tweets DataFrame contains every tweet that includes one of the 3 keywords, not only those about the programming languages. For example, many tweets contain the keyword ruby but relate to a political scandal, Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.

Targeting relevant tweets

We are interested in targeting tweets that are related to programming languages. Such tweets often contain one of 2 keywords: "programming" or "tutorial". We will add 2 columns to our tweets DataFrame to record this information.

>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))

We will add another column called relevant that takes the value True if the tweet contains either the "programming" or the "tutorial" keyword; otherwise it takes the value False.

>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))

We can print the counts of relevant tweets by executing the commands below.

>>> print(tweets['programming'].value_counts()[True])
55
>>> print(tweets['tutorial'].value_counts()[True])
22
>>> print(tweets['relevant'].value_counts()[True])
74
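The three counts are consistent with each other: a tweet containing both keywords is counted once in relevant but once in each of the other two columns, so 55 + 22 - 74 = 3 tweets contain both "programming" and "tutorial". The sketch below checks the same identity on a small hypothetical sample using plain Python sets.

```python
# Counts reported above
programming_count = 55
tutorial_count = 22
relevant_count = 74  # tweets containing either keyword

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
both_count = programming_count + tutorial_count - relevant_count
print(both_count)  # 3

# The same identity on a hypothetical sample of tweet texts
texts = [
    "python programming",
    "ruby tutorial",
    "programming tutorial for javascript",
    "just python",
]
programming = {i for i, t in enumerate(texts) if "programming" in t}
tutorial = {i for i, t in enumerate(texts) if "tutorial" in t}
relevant = programming | tutorial  # union: either keyword

print(len(relevant), len(programming & tutorial))  # 3 1
```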

We can now compare the popularity of the programming languages by executing the commands below.

tweets[tweets['relevant'] == True]['python'] # select the 'python' column for the rows where relevant is True
>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])
11

Python is the most popular with a count of 31, followed by javascript with a count of 11, and ruby with a count of 8. We can make a comparison chart by executing the following:

>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],
...                       tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],
...                       tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks(x_pos)  # plt.bar centres each bar on its x position by default
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

Extracting links from the relevant tweets

Now that we have identified the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses a regular expression to retrieve a link that starts with "http://", "https://" or "www." from a text. This function returns the first URL found; otherwise it returns an empty string.

>>> def extract_link(text):
...     regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
...     match = re.search(regex, text)
...     if match:
...         return match.group()
...     return ''
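extract_link returns only the first URL in a text because it uses re.search. If a tweet may carry several links, re.findall with the same pattern returns all of them. The function name extract_links and the sample text are assumptions made for this sketch.

```python
import re

# Same pattern as in extract_link above
URL_REGEX = r'https?://[^\s<>"]+|www\.[^\s<>"]+'

def extract_links(text):
    """Return every URL found in text, in order of appearance."""
    return re.findall(URL_REGEX, text)

# Hypothetical sample text
sample_text = "Two tutorials: https://example.com/a and www.example.org/b today"
print(extract_links(sample_text))          # ['https://example.com/a', 'www.example.org/b']
print(extract_links("no links in here"))   # []
```

The pattern stops only at whitespace, angle brackets and double quotes, so punctuation that directly follows a URL is captured as part of it.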

Next, we will add a column called link to our tweets DataFrame. This column will contain the URL extracted from each tweet.

>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))

Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of tweets DataFrame and contains all relevant tweets that have a link.

This slices the original DataFrame.

>>> tweets_relevant = tweets[tweets['relevant'] == True]
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']

We can now print out all links for python, ruby, and javascript by executing the commands below:

>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])
40 https://t.co/zoAgyQuMAZ
105 https://t.co/ogaPbuIbEW
274 https://t.co/y4sUmovFOn
329 https://t.co/A030fqWeWA
339 https://t.co/LaaVc5T2rQ
391 https://t.co/8bYvlziCZb
413 https://t.co/8bYvlziCZb
436 https://t.co/EByqxT1qyN
444 https://t.co/8bYvlziCZb
445 https://t.co/5Jujg6h31B
462 https://t.co/UrFHlOaJYf
476 https://t.co/5Jujg6h31B
477 https://t.co/EByqxT1qyN
589 https://t.co/UrFHlOaJYf
603 https://t.co/5Jujg6h31B
822 https://t.co/Oc21FrzQc5
1060 https://t.co/qOAIuKfyD0
1097 https://t.co/qOAIuKfyD0
1248 https://t.co/V3ZNKuYsK7
1278 https://t.co/qOAIuKfyD0
1411 https://t.co/szHRHavQKy
1594 https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])
782 https://t.co/JgY40r2NSo
833 https://t.co/JgY40r2NSo
1177 https://t.co/xycOG3ndi9
1254 https://t.co/xycOG3ndi9
1293 https://t.co/LMHW050TGs
1328 https://t.co/SS4DzEnSBZ
1393 https://t.co/NZlUce5Ne8
1619 https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])
130 https://t.co/AbJFaSI0B8
286 https://t.co/7dNBIsQ5Gq
467 https://t.co/3YIK588j8t
471 https://t.co/vjBJWWzvfv
830 https://t.co/T4mUjwUcgL
1093 https://t.co/wvLZLjuVKF
1180 https://t.co/luxL2qbxte
1526 https://t.co/G3ZTFL0RKv
Name: link, dtype: object
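Several links appear more than once in the output above. To get each link only once while keeping the order of first appearance, one option is dict.fromkeys, sketched here on a few of the python links listed above.

```python
# A few of the python links printed above (rows 391 to 445)
links = [
    "https://t.co/8bYvlziCZb",
    "https://t.co/8bYvlziCZb",
    "https://t.co/EByqxT1qyN",
    "https://t.co/8bYvlziCZb",
    "https://t.co/5Jujg6h31B",
]

# dict keys are unique and keep insertion order, so this removes repeats
unique_links = list(dict.fromkeys(links))
for link in unique_links:
    print(link)
```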
