Twitter is a popular social network where users share short SMS-like messages called tweets. Users share thoughts, links and pictures on Twitter, journalists comment on live events, companies promote products and engage with customers. The list of different ways to use Twitter could go on for a long time, and with 500 million tweets per day, there’s a lot of data to analyse and to play with.

This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we’ll see different options to collect data from Twitter. Once we have built a data set, in the next episodes we’ll discuss some interesting data applications.

1. Collecting Data

1.1 Register Your App

In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is the registration of your app. In particular, you need to point your browser to http://apps.twitter.com, log in to Twitter (if you’re not already logged in) and register a new application. You can now choose a name and a description for your app (for example “Mining Demo” or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Similarly to the consumer keys, these strings must also be kept private: they provide the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change your permissions to include write access, you must negotiate a new access token.
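Since both pairs of credentials must stay private, one option is to keep them out of your source code entirely. Here is a minimal sketch that reads them from environment variables (the variable names below are just an illustrative convention, not something Twitter requires):

    import os

    # Read the four credentials from the environment instead of hard-coding them.
    # The variable names are a convention chosen for this example.
    consumer_key = os.environ['TWITTER_CONSUMER_KEY']
    consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
    access_token = os.environ['TWITTER_ACCESS_TOKEN']
    access_secret = os.environ['TWITTER_ACCESS_SECRET']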

Important Note: there are rate limits on the use of the Twitter API, as well as limitations in case you want to provide a downloadable dataset; see:

https://dev.twitter.com/overview/terms/agreement-and-policy

https://dev.twitter.com/rest/public/rate-limiting

1.2 Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There is also a number of Python-based clients out there that we can use without re-inventing the wheel. In particular, Tweepy is one of the most interesting and straightforward to use, so let’s install it:

    pip install tweepy==3.5.0

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

    import tweepy
    from tweepy import OAuthHandler

    consumer_key = 'YOUR-CONSUMER-KEY'
    consumer_secret = 'YOUR-CONSUMER-SECRET'
    access_token = 'YOUR-ACCESS-TOKEN'
    access_secret = 'YOUR-ACCESS-SECRET'

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.
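Before moving on, it can be useful to check that the authentication actually works and, given the rate limits mentioned in 1.1, to let Tweepy wait automatically when a limit is hit. A minimal sketch (the wait_on_rate_limit options are Tweepy 3.x keyword arguments; the printed fields are standard attributes of the User model):

    # Optionally let Tweepy sleep and retry automatically when a rate limit is hit
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    # Quick sanity check that the credentials work
    me = api.me()
    print(me.screen_name, me.followers_count)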

For example, we can read our own timeline (i.e. our Twitter homepage) with:

    for status in tweepy.Cursor(api.home_timeline).items(10):
        # Process a single status
        print(status.text)

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.
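To get a feel for what that dictionary contains, you can peek at its keys for a single status (a quick sketch using the same cursor as above):

    # Inspect the available fields of a single status as a plain dictionary
    for status in tweepy.Cursor(api.home_timeline).items(1):
        print(sorted(status._json.keys()))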

  • So the code above can be re-written to process/store the JSON:

        for status in tweepy.Cursor(api.home_timeline).items(10):
            # Process a single status
            process_or_store(status._json)

  • What if we want a list of all our friends (i.e. the users we follow)? There you go:

        for friend in tweepy.Cursor(api.friends).items():
            process_or_store(friend._json)

  • And how about a list of all our tweets? Simple:

        for tweet in tweepy.Cursor(api.user_timeline).items():
            process_or_store(tweet._json)

In this way we can easily collect tweets (and more) and store them in the original JSON format, which is fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).
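For example, if the tweets are stored one JSON document per line, loading them into a document store is straightforward. A minimal sketch assuming a local MongoDB instance, the pymongo driver, and a file called mytweets.json (both the file and database names are illustrative):

    import json
    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017')
    db = client['twitter']

    # Bulk-import the line-delimited JSON file into a "tweets" collection
    with open('mytweets.json', 'r') as f:
        tweets = [json.loads(line) for line in f]
    if tweets:
        db.tweets.insert_many(tweets)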

The function process_or_store() is a place-holder for your custom implementation. In the simplest form, you could just print out the JSON, one tweet per line:

    import json

    def process_or_store(tweet):
        print(json.dumps(tweet))
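Alternatively, if you prefer to write the tweets directly to disk, a variant might look like this (the output file name is arbitrary):

    import json

    def process_or_store(tweet, fname='all_tweets.json'):
        # Append each tweet as one JSON document per line (JSON Lines format)
        with open(fname, 'a') as f:
            f.write(json.dumps(tweet) + '\n')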

1.3 Streaming

If we want to “keep the connection open” and gather all the upcoming tweets about a particular event, the streaming API is what we need. We extend StreamListener() to customise the way we process the incoming data. Here is a working example that gathers all the new tweets with the #python hashtag:

    from tweepy import Stream
    from tweepy.streaming import StreamListener

    class MyListener(StreamListener):

        def on_data(self, data):
            try:
                with open('python.json', 'a') as f:
                    f.write(data)
                return True
            except BaseException as e:
                print("Error on_data: %s" % str(e))
                return True

        def on_error(self, status):
            print(status)
            return True

    twitter_stream = Stream(auth, MyListener())
    twitter_stream.filter(track=['#python'])

Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with worldwide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to know how many tweets you’ve gathered.
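If you want to put a hard cap on the collection, one option is to count the tweets as they arrive and disconnect once a threshold is reached. A minimal sketch (BoundedListener and max_tweets are illustrative names; returning False from on_data tells Tweepy to close the stream):

    from tweepy import Stream
    from tweepy.streaming import StreamListener

    class BoundedListener(StreamListener):
        """Stop the stream after max_tweets tweets have been saved."""

        def __init__(self, max_tweets=1000):
            super(BoundedListener, self).__init__()
            self.max_tweets = max_tweets
            self.counter = 0

        def on_data(self, data):
            with open('python.json', 'a') as f:
                f.write(data)
            self.counter += 1
            # Returning False closes the connection
            return self.counter < self.max_tweets

        def on_error(self, status):
            print(status)
            return True

    twitter_stream = Stream(auth, BoundedListener(max_tweets=1000))
    twitter_stream.filter(track=['#python'])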

You can see a minimal working example of the Twitter Stream API in the following Gist:

    # config.py
    consumer_key = 'your-consumer-key'
    consumer_secret = 'your-consumer-secret'
    access_token = 'your-access-token'
    access_secret = 'your-access-secret'
    # twitter_stream_download.py
    # To run this code, first edit config.py with your configuration, then:
    #
    # mkdir data
    # python twitter_stream_download.py -q apple -d data
    #
    # It will produce the list of tweets for the query "apple"
    # in the file data/stream_apple.json

    import tweepy
    from tweepy import Stream
    from tweepy import OAuthHandler
    from tweepy.streaming import StreamListener
    import time
    import argparse
    import string
    import config
    import json


    def get_parser():
        """Get parser for command line arguments."""
        parser = argparse.ArgumentParser(description="Twitter Downloader")
        parser.add_argument("-q",
                            "--query",
                            dest="query",
                            help="Query/Filter",
                            default='-')
        parser.add_argument("-d",
                            "--data-dir",
                            dest="data_dir",
                            help="Output/Data Directory")
        return parser


    class MyListener(StreamListener):
        """Custom StreamListener for streaming data."""

        def __init__(self, data_dir, query):
            super(MyListener, self).__init__()
            query_fname = format_filename(query)
            self.outfile = "%s/stream_%s.json" % (data_dir, query_fname)

        def on_data(self, data):
            try:
                with open(self.outfile, 'a') as f:
                    f.write(data)
                    print(data)
                    return True
            except BaseException as e:
                print("Error on_data: %s" % str(e))
                time.sleep(5)
            return True

        def on_error(self, status):
            print(status)
            return True


    def format_filename(fname):
        """Convert file name into a safe string.

        Arguments:
            fname -- the file name to convert
        Return:
            String -- converted file name
        """
        return ''.join(convert_valid(one_char) for one_char in fname)


    def convert_valid(one_char):
        """Convert a character into '_' if invalid.

        Arguments:
            one_char -- the char to convert
        Return:
            Character -- converted char
        """
        valid_chars = "-_.%s%s" % (string.ascii_letters, string.digits)
        if one_char in valid_chars:
            return one_char
        else:
            return '_'


    @classmethod
    def parse(cls, api, raw):
        # Note: this helper is not used by the script as written
        status = cls.first_parse(api, raw)
        setattr(status, 'json', json.dumps(raw))
        return status


    if __name__ == '__main__':
        parser = get_parser()
        args = parser.parse_args()
        auth = OAuthHandler(config.consumer_key, config.consumer_secret)
        auth.set_access_token(config.access_token, config.access_secret)
        api = tweepy.API(auth)

        twitter_stream = Stream(auth, MyListener(args.data_dir, args.query))
        twitter_stream.filter(track=[args.query])

2. Text Pre-processing

2.1 The Anatomy of a Tweet

Assuming that you have collected a number of tweets and stored them in JSON as suggested above, let’s have a look at the structure of a tweet:

    import json

    with open('mytweets.json', 'r') as f:
        line = f.readline()  # read only the first tweet/line
        tweet = json.loads(line)  # load it as Python dict
        print(json.dumps(tweet, indent=4))  # pretty-print

The key attributes are the following:

  • text: the text of the tweet itself
  • created_at: the date of creation
  • favorite_count, retweet_count: the number of favourites and retweets
  • favorited, retweeted: boolean stating whether the authenticated user (you) has favourited or retweeted this tweet
  • lang: acronym for the language (e.g. “en” for English)
  • id: the tweet identifier
  • place, coordinates, geo: geo-location information if available
  • user: the author’s full profile
  • entities: list of entities like URLs, @-mentions, hashtags and symbols
  • in_reply_to_user_id: user identifier if the tweet is a reply to a specific user
  • in_reply_to_status_id: status identifier if the tweet is a reply to a specific status

As you can see there’s a lot of information we can play with. All the *_id fields also have a *_id_str counterpart, where the same information is stored as a string rather than a big int (to avoid overflow problems). We can imagine how these data already allow for some interesting analysis: we can check who is most favourited/retweeted, who is talking to whom, what the most popular hashtags are and so on. Most of the goodness we’re looking for, though, i.e. the content of a tweet, is embedded in the text, and that’s where we’ll start our analysis.
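As a quick taste of this kind of analysis, here is a minimal sketch that counts the most common hashtags across the tweets saved earlier (it assumes the line-delimited file mytweets.json and relies on the entities field described above):

    import json
    from collections import Counter

    hashtag_counts = Counter()
    with open('mytweets.json', 'r') as f:
        for line in f:
            tweet = json.loads(line)
            # Each hashtag entity is a dict with a 'text' key (without the '#')
            for hashtag in tweet['entities']['hashtags']:
                hashtag_counts[hashtag['text'].lower()] += 1

    print(hashtag_counts.most_common(10))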

We start our analysis by breaking the text down into words. Tokenisation is one of the most basic, yet most important, steps in text analysis. The purpose of tokenisation is to split a stream of text into smaller units called tokens, usually words or phrases. While this is a well understood problem with several out-of-the-box solutions from popular libraries, Twitter data pose some challenges because of the nature of the language.

2.2 How to Tokenise a Tweet Text

Let’s see an example, using the popular NLTK library to tokenise a fictitious tweet:

    from nltk.tokenize import word_tokenize

    tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
    print(word_tokenize(tweet))
    # ['RT', '@', 'marcobonzanini', ':', 'just', 'an', 'example', '!', ':', 'D', 'http', ':', '//example.com', '#', 'NLP']

You will notice some peculiarities that are not captured by a general-purpose English tokeniser like the one from NLTK: @-mentions, emoticons, URLs and #hash-tags are not recognised as single tokens. The following code proposes a pre-processing chain that takes these aspects of the language into account.

    import re

    emoticons_str = r"""
        (?:
            [:=;] # Eyes
            [oO\-]? # Nose (optional)
            [D\)\]\(\]/\\OpP] # Mouth
        )"""

    regex_str = [
        emoticons_str,
        r'<[^>]+>', # HTML tags
        r'(?:@[\w_]+)', # @-mentions
        r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
        r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
        r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
        r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
        r'(?:[\w_]+)', # other words
        r'(?:\S)' # anything else
    ]

    tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
    emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

    def tokenize(s):
        return tokens_re.findall(s)

    def preprocess(s, lowercase=False):
        tokens = tokenize(s)
        if lowercase:
            tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
        return tokens

    tweet = "RT @marcobonzanini: just an example! :D http://example.com #NLP"
    print(preprocess(tweet))
    # ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

As you can see, @-mentions, emoticons, URLs and #hash-tags are now preserved as individual tokens.

If we want to process all our tweets, previously saved on file:

    import json

    with open('mytweets.json', 'r') as f:
        for line in f:
            tweet = json.loads(line)
            tokens = preprocess(tweet['text'])
            do_something_else(tokens)

The tokeniser is probably far from perfect, but it gives you the general idea. The tokenisation is based on regular expressions (regexp), which is a common choice for this type of problem. Some particular types of tokens (e.g. phone numbers or chemical names) will not be captured, and will probably be broken into several tokens. To overcome this problem, as well as to improve the richness of your pre-processing pipeline, you can improve the regular expressions, or even employ more sophisticated techniques like Named Entity Recognition.
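As an illustration of the first option, here is a minimal sketch that adds a (deliberately naive) pattern for dash-separated phone numbers to the regex_str list defined above; the pattern and the example number are made up, and the new entry must come before the generic numbers pattern because the alternation is tried left to right:

    # A naive pattern for digit sequences with dashes/parentheses, e.g. 555-123-4567
    phone_str = r'(?:\+?\d[\d\-\(\)]{6,}\d)'

    # Insert it just before the generic numbers pattern (index 5 in regex_str)
    regex_str = regex_str[:5] + [phone_str] + regex_str[5:]
    tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)

    print(tokenize("call me at 555-123-4567"))
    # ['call', 'me', 'at', '555-123-4567']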

The core component of the tokeniser is the regex_str variable, which is a list of possible patterns. In particular, we try to capture some emoticons, HTML tags, Twitter @usernames (@-mentions), Twitter #hashtags, URLs, numbers, words with and without dashes and apostrophes, and finally “anything else”. Please take a moment to observe the regexp for capturing numbers: why don’t we just use \d+? The problem here is that numbers can appear in several different ways, e.g. 1000 can also be written as 1,000 or 1,000.00 — and we can get into more complications in a multi-lingual environment where commas and dots are inverted: “one thousand” can be written as 1.000 or 1.000,00 in many non-anglophone countries. The task of identifying numeric tokens correctly just gives you a glimpse of how difficult tokenisation can be.
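To see the number pattern in action, here is a quick check with a made-up sentence (the expected output follows from the regular expressions above):

    print(preprocess("Tickets start at 1,000.00 dollars"))
    # ['Tickets', 'start', 'at', '1,000.00', 'dollars']

    # The European-style notation mentioned above is not handled as a single token:
    print(preprocess("1.000,00"))
    # ['1.000', ',', '00']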

The regular expressions are compiled with the flags re.VERBOSE, to allow spaces in the regexp to be ignored (see the multi-line emoticons regexp), and re.IGNORECASE to catch both upper and lower case. The tokenize() function simply catches all the tokens in a string and returns them as a list. This function is used within preprocess(), which serves as a pre-processing chain: in this case we simply add a lowercasing feature for all the tokens that are not emoticons (e.g. :D is not converted to :d).
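For example, calling the chain with lowercasing enabled on the tweet used above gives something like this (output derived from the code above):

    tweet = "RT @marcobonzanini: just an example! :D http://example.com #NLP"
    print(preprocess(tweet, lowercase=True))
    # ['rt', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#nlp']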
