【337】Text Mining Using Twitter Streaming API and Python

Reference: An Introduction to Text Mining using Twitter Streaming API and Python

Reference: How to Register a Twitter App in 8 Easy Steps

Getting Data from Twitter Streaming API
Reading and Understanding the data
Mining the tweets

Key Methods:

map()
lambda
set()
pandas.DataFrame()
matplotlib

1. Getting Data from Twitter Streaming API

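twitter_streaming.py is the script used to collect tweets from the Twitter Streaming API. It uses the tweepy 3.x interface (StreamListener was merged into Stream in tweepy 4.0).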

#Import the necessary methods from tweepy library

from tweepy.streaming import StreamListener

from tweepy import OAuthHandler

from tweepy import Stream

#Variables that contain the user credentials to access Twitter API

access_token = "ENTER YOUR ACCESS TOKEN"

access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"

consumer_key = "ENTER YOUR API KEY"

consumer_secret = "ENTER YOUR API SECRET"

#This is a basic listener that just prints received tweets to stdout.

class StdOutListener(StreamListener):

    def on_data(self, data):

        print(data)

        return True

    def on_error(self, status):

        print(status)

if __name__ == '__main__':

    #This handles Twitter authentication and the connection to Twitter Streaming API

    l = StdOutListener()

    auth = OAuthHandler(consumer_key, consumer_secret)

    auth.set_access_token(access_token, access_token_secret)

    stream = Stream(auth, l)

    #This line filters Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'

    stream.filter(track=['python', 'javascript', 'ruby'])

To save the output to a file, redirect it from the command line (CMD):

python twitter_streaming.py > twitter_data.txt

Next, we read that text file line by line and parse each line as JSON.

import json

tweets_data_path = r"..\twitter_data.txt"

tweets_data = []

tweets_file = open(tweets_data_path, "r")

for line in tweets_file:

	try:

		tweet = json.loads(line)

		tweets_data.append(tweet)

	except ValueError:  # skip lines that are not valid JSON

		continue
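The parsing loop above can be tried without a real capture file. The following is a minimal sketch that feeds an in-memory sample through the same logic; the sample lines and the helper name load_tweets are made up for illustration and are not part of the original script.

```python
import io
import json

# Hypothetical sample standing in for twitter_data.txt: two valid JSON
# lines, one blank line, and one truncated line.
sample = io.StringIO(
    '{"text": "I love python", "lang": "en"}\n'
    '\n'
    '{"text": "ruby tutorial", "lang": "en"}\n'
    '{"text": "broken\n'
)

def load_tweets(lines):
    """Parse one JSON object per line, skipping lines that are not valid JSON."""
    parsed = []
    for line in lines:
        try:
            parsed.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
    return parsed

# A real file opened with open(path) can be passed in the same way;
# the with block closes the handle when parsing is done.
with sample as handle:
    sample_tweets = load_tweets(handle)

print(len(sample_tweets))
```

Catching ValueError instead of using a bare except keeps unrelated problems, such as a KeyboardInterrupt, from being silently swallowed.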

The parsed tweets are now stored in tweets_data, and we can pull out specific fields with the following scripts.

Reference: python JSON only get keys in first level

# get the text content, language from the specific tweets

num = 0

for tweet in tweets_data:

	num += 1

	if num == 10:

		break

	else:

		tweet_text = tweet["text"]

		tweet_lang = tweet["lang"]

		print(str(num))

		print(tweet_lang)

		print(tweet_text)

		print()

# get all the keys from json

tweets_data[0].keys()
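Indexing with tweet["text"] raises a KeyError when a record has no such key. As far as I know, a stream capture can also contain non-tweet messages (for example rate-limit notices), so a more defensive lookup may be useful. The sketch below uses dict.get on made-up records; the helper name summarize is hypothetical.

```python
# Hypothetical records: a normal tweet, a tweet without place data,
# and a non-tweet message that has neither "text" nor "lang".
records = [
    {"text": "python rocks", "lang": "en", "place": {"country": "Canada"}},
    {"text": "hola ruby", "lang": "es", "place": None},
    {"limit": {"track": 5}},
]

def summarize(record):
    """Return (text, lang, country), using None for anything that is missing."""
    place = record.get("place") or {}  # treat a missing or None place as empty
    return record.get("text"), record.get("lang"), place.get("country")

rows = [summarize(record) for record in records]
for row in rows:
    print(row)
```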

2. Reading and Understanding the data

We can also extract a specific key from every tweet by combining list(), map() and a lambda expression:

Reference: Combining map and lambda in Python

>>> a = list(map(lambda tweet: tweet['text'], tweets_data))

>>> len(a)

1633

>>> a[0]

'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'

We can also use set() to get the unique values in the list.

Reference: Python set() function

Reference: How to count the occurrences of duplicate items in a Python list

>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))

>>> len(langs)

1633

>>> set(langs)

{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
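set() only tells us which values occur, not how often. One way to get both from the standard library is collections.Counter. The sketch below uses a short made-up list in place of the real langs list.

```python
from collections import Counter

# Hypothetical language codes, standing in for the langs list above.
sample_langs = ["en", "ja", "en", "es", "en", "ja", "und"]

unique_langs = set(sample_langs)     # distinct values, order not guaranteed
lang_counts = Counter(sample_langs)  # value -> number of occurrences

print(sorted(unique_langs))
print(lang_counts.most_common(2))    # the two most frequent codes
```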

Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.

>>> import pandas as pd

>>> tweets = pd.DataFrame()

>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))

>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))

>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] is not None else None, tweets_data))

>>> tweets['lang'].value_counts()

en     1119

ja      278

es      113

pt       36

und      26

...
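The same DataFrame can also be built in one step from list comprehensions. This is only a sketch on three made-up tweets, assuming pandas is installed; it is an alternative spelling, not the approach used in the session above.

```python
import pandas as pd

# Hypothetical tweets with the same keys used above.
sample_tweets = [
    {"text": "python tips", "lang": "en", "place": {"country": "Canada"}},
    {"text": "ruby news", "lang": "en", "place": None},
    {"text": "javascript memo", "lang": "ja", "place": {"country": "Japan"}},
]

frame = pd.DataFrame({
    "text": [t["text"] for t in sample_tweets],
    "lang": [t["lang"] for t in sample_tweets],
    # place is None for tweets without location data
    "country": [(t["place"] or {}).get("country") for t in sample_tweets],
})

lang_counts = frame["lang"].value_counts()  # counts per language code
print(lang_counts)
```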

Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.

>>> tweets_by_lang = tweets['lang'].value_counts()

>>> import matplotlib.pyplot as plt

>>> fig, ax = plt.subplots()

>>> ax.tick_params(axis='x', labelsize=15)

>>> ax.tick_params(axis='y', labelsize=10)

>>> ax.set_xlabel('Languages', fontsize=15)

Text(0.5, 0, 'Languages')

>>> ax.set_ylabel('Number of tweets' , fontsize=15)

Text(0, 0.5, 'Number of tweets')

>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')

Text(0.5, 1.0, 'Top 5 languages')

>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')

<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>

>>> plt.show()

Next, we will create a chart describing the Top 5 countries from which the tweets were sent.

>>> tweets_by_country = tweets['country'].value_counts()

>>> fig, ax = plt.subplots()

>>> ax.tick_params(axis='x', labelsize=15)

>>> ax.tick_params(axis='y', labelsize=10)

>>> ax.set_xlabel('Countries', fontsize=15)

Text(0.5, 0, 'Countries')

>>> ax.set_ylabel('Number of tweets' , fontsize=15)

Text(0, 0.5, 'Number of tweets')

>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')

Text(0.5, 1.0, 'Top 5 countries')

>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')

<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>

>>> plt.show()

3. Mining the tweets

Our main goals in these text mining tasks are to compare the popularity of the Python, Ruby and Javascript programming languages and to retrieve programming tutorial links. We will do this in 3 steps:

Add tags to our tweets DataFrame so that we can manipulate the data easily.
Target tweets that contain the "programming" or "tutorial" keywords.
Extract links from the relevant tweets.

Adding Python, Ruby, and Javascript tags

First, we will create a function that checks whether a specific keyword is present in a text. We will do this using a regular expression.

Python provides a library for regular expression called re. We will start by importing this library.

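Next, we will create a function called word_in_text(word, text). This function returns True if word is found in text; otherwise it returns False. Note that word is used as a regular expression pattern and matches anywhere in text, including inside longer words.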

>>> import re

>>> def word_in_text(word, text):

	word = word.lower()

	text = text.lower()

	match = re.search(word, text)

	if match:

		return True

	return False
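Because re.search looks for the pattern anywhere in the text, a keyword such as ruby also matches inside a longer word. If whole-word matching is wanted, one option is to wrap the escaped keyword in \b word boundaries. The function below is a hypothetical variant for illustration, not the function used in the rest of this post.

```python
import re

def whole_word_in_text(word, text):
    """Hypothetical variant of word_in_text that matches whole words only."""
    # re.escape keeps characters in word from being read as regex syntax;
    # \b requires a word boundary on each side of the keyword.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(whole_word_in_text("ruby", "Learning Ruby today"))
print(whole_word_in_text("ruby", "The Rubygate scandal"))
```

With this variant, the counts in the following steps would likely differ from the ones shown, since substring hits are no longer included.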

Next, we will add 3 columns to our tweets DataFrame using pandas.Series.apply().

>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))

>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))

>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))

We can calculate the number of tweets for each programming language using pandas.Series.value_counts as follows:

>>> print(tweets['python'].value_counts()[True])

447

>>> print(tweets['ruby'].value_counts()[True])

529

>>> print(tweets['javascript'].value_counts()[True])

275
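value_counts()[True] looks up the label True, so I would expect it to fail with a KeyError when no tweet matches a keyword. Summing the boolean column is an alternative that returns 0 in that case. A minimal sketch on made-up columns, assuming pandas is installed:

```python
import pandas as pd

# Hypothetical boolean columns: one with matches, one without any.
has_python = pd.Series([True, False, True, True])
has_cobol = pd.Series([False, False, False])

# True counts as 1 and False as 0, so the sum is the number of matches.
python_count = int(has_python.sum())
cobol_count = int(has_cobol.sum())

print(python_count, cobol_count)
```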

We can make a simple comparison chart by executing the following:

>>> prg_langs = ['python', 'ruby', 'javascript']

>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]

>>> x_pos = list(range(len(prg_langs)))

>>> width = 0.8

>>> fig, ax = plt.subplots()

>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')

<BarContainer object of 3 artists>

>>> # Setting axis labels and ticks

>>> ax.set_ylabel('Number of tweets', fontsize=15)

Text(0, 0.5, 'Number of tweets')

>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')

Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')

>>> ax.set_xticks(x_pos)  # plt.bar centers each bar on its x position by default

[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]

>>> ax.set_xticklabels(prg_langs)

[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]

>>> plt.grid()

>>> plt.show()

This shows that the keyword ruby is the most popular, followed by python and then javascript. However, the tweets DataFrame contains all tweets that contain one of the 3 keywords, and is not restricted to tweets about the programming languages. For example, many tweets contain the keyword ruby but are related to the political scandal Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.

Targeting relevant tweets

We are interested in targeting tweets that are related to programming languages. Such tweets often contain one of 2 keywords: "programming" or "tutorial". We will add 2 columns to our tweets DataFrame to record this information.

>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))

>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))

We will add another column called relevant that takes the value True if the tweet contains either the "programming" or the "tutorial" keyword; otherwise it takes the value False.

>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))

We can print the counts of relevant tweets by executing the commands below.

>>> print(tweets['programming'].value_counts()[True])

55

>>> print(tweets['tutorial'].value_counts()[True])

22

>>> print(tweets['relevant'].value_counts()[True])

74
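The relevant count is smaller than the sum of the two keyword counts because a tweet that contains both keywords is counted once. Using the counts printed above, the overlap follows from inclusion-exclusion; the variable names below are only for illustration.

```python
# Counts taken from the output above.
programming_count = 55
tutorial_count = 22
relevant_count = 74  # tweets with "programming" OR "tutorial"

# |A or B| = |A| + |B| - |A and B|, so the overlap is:
both_count = programming_count + tutorial_count - relevant_count

print(both_count)  # tweets that contain both keywords
```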

We can now compare the popularity of the programming languages by executing the commands below.

tweets[tweets['relevant'] == True]['python'] # select the 'python' column for the rows where relevant is True

>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])

31

>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])

8

>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])

11

Python is the most popular with a count of 31, followed by javascript with a count of 11, and ruby with a count of 8. We can make a comparison chart by executing the following:

>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],

			  tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],

			  tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]

>>> x_pos = list(range(len(prg_langs)))

>>> width = 0.8

>>> fig, ax = plt.subplots()

>>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')

<BarContainer object of 3 artists>

>>> ax.set_ylabel('Number of tweets', fontsize=15)

Text(0, 0.5, 'Number of tweets')

>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')

Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')

>>> ax.set_xticks(x_pos)  # plt.bar centers each bar on its x position by default

[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]

>>> ax.set_xticklabels(prg_langs)

[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]

>>> plt.grid()

>>> plt.show()

Extracting links from the relevant tweets

Now that we have extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses a regular expression to retrieve a link that starts with "http://", "https://" or "www." from a text. This function returns the URL if one is found; otherwise it returns an empty string.

>>> def extract_link(text):

	regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'

	match = re.search(regex, text)

	if match:

		return match.group()

	return ''
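extract_link returns only the first match. If a tweet may contain several links, re.findall with the same pattern returns all of them. The helper name extract_all_links and the sample strings below are made up for illustration.

```python
import re

# Same pattern as in extract_link, compiled once for reuse.
LINK_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

def extract_all_links(text):
    """Return every link found in text, in order of appearance."""
    # The pattern has no capture groups, so findall returns full matches.
    return LINK_PATTERN.findall(text)

print(extract_all_links("Python tutorial https://t.co/abc and www.example.com/x"))
print(extract_all_links("no links here"))
```

The pattern stops only at whitespace, angle brackets and double quotes, so punctuation directly after a link is captured as part of it.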

Next, we will add a column called link to our tweets DataFrame. This column will contain the extracted URL for each tweet.

>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))

Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of tweets DataFrame and contains all relevant tweets that have a link.

This takes a subset of the original DataFrame.

>>> tweets_relevant = tweets[tweets['relevant'] == True]

>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']

We can now print out all links for python, ruby, and javascript by executing the commands below:

>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])

40      https://t.co/zoAgyQuMAZ

105     https://t.co/ogaPbuIbEW

274     https://t.co/y4sUmovFOn

329     https://t.co/A030fqWeWA

339     https://t.co/LaaVc5T2rQ

391     https://t.co/8bYvlziCZb

413     https://t.co/8bYvlziCZb

436     https://t.co/EByqxT1qyN

444     https://t.co/8bYvlziCZb

445     https://t.co/5Jujg6h31B

462     https://t.co/UrFHlOaJYf

476     https://t.co/5Jujg6h31B

477     https://t.co/EByqxT1qyN

589     https://t.co/UrFHlOaJYf

603     https://t.co/5Jujg6h31B

822     https://t.co/Oc21FrzQc5

1060    https://t.co/qOAIuKfyD0

1097    https://t.co/qOAIuKfyD0

1248    https://t.co/V3ZNKuYsK7

1278    https://t.co/qOAIuKfyD0

1411    https://t.co/szHRHavQKy

1594    https://t.co/X6KWMlzlv6

Name: link, dtype: object

>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])

782     https://t.co/JgY40r2NSo

833     https://t.co/JgY40r2NSo

1177    https://t.co/xycOG3ndi9

1254    https://t.co/xycOG3ndi9

1293    https://t.co/LMHW050TGs

1328    https://t.co/SS4DzEnSBZ

1393    https://t.co/NZlUce5Ne8

1619    https://t.co/e4nwrn3N2j

Name: link, dtype: object

>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])

130     https://t.co/AbJFaSI0B8

286     https://t.co/7dNBIsQ5Gq

467     https://t.co/3YIK588j8t

471     https://t.co/vjBJWWzvfv

830     https://t.co/T4mUjwUcgL

1093    https://t.co/wvLZLjuVKF

1180    https://t.co/luxL2qbxte

1526    https://t.co/G3ZTFL0RKv

Name: link, dtype: object
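Several links appear more than once in the output, presumably because of retweets. One way to drop the repeats while keeping the original order is dict.fromkeys. The sketch below applies it to the ruby links printed above; on the DataFrame column itself, pandas' Series.drop_duplicates() would serve the same purpose.

```python
# The ruby links from the output above, in the order they were printed.
ruby_links = [
    "https://t.co/JgY40r2NSo",
    "https://t.co/JgY40r2NSo",
    "https://t.co/xycOG3ndi9",
    "https://t.co/xycOG3ndi9",
    "https://t.co/LMHW050TGs",
    "https://t.co/SS4DzEnSBZ",
    "https://t.co/NZlUce5Ne8",
    "https://t.co/e4nwrn3N2j",
]

# dict keys are unique and keep insertion order, so this removes repeats
# without reordering the remaining links.
unique_ruby_links = list(dict.fromkeys(ruby_links))

print(len(ruby_links), len(unique_ruby_links))
```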