Methods for Scraping Twitter Data (Part 2)
Scraping Tweets Directly from Twitter’s Search Page – Part 2
Published January 11, 2015
In the previous post we covered the theory of how we can search for and extract tweets from Twitter without having to use their API.
First, let’s have a quick recap of what we learned in the previous post. We have a URL that we can use to search Twitter with:
https://twitter.com/i/search/timeline
This includes the following parameters:
Key | Value |
---|---|
q | URL encoded query string |
f | Type of query (omit for top results or realtime for all) |
scroll_cursor | Lets us paginate through results. If omitted, the first page is returned |
We also know that Twitter returns the following JSON response:
{ |
Finally, we know that we can extract the following information for each tweet:
Selector | Value |
---|---|
div.original-tweet[data-tweet-id] | The ID of the tweet |
div.original-tweet[data-name] | The name of the author |
div.original-tweet[data-user-id] | The user ID of the author |
span._timestamp[data-time] | Timestamp of the post |
span._timestamp[data-time-ms] | Timestamp of the post in ms |
p.tweet-text | Text of Tweet |
span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Retweets |
span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Favourites |
Ok, recap done, let’s consider some pseudo code to get us started. As the example is going to be in Java, the pseudo code will take on a Java syntax.
searchTwitter(String query, long rateDelay) { |
Firstly, we define a function called searchTwitter, to which we pass a query value as a string and a time to pause the thread between calls. We pass the query string to a function that creates our search URL. Then, in a while loop, we execute the search to return a TwitterResponse object that represents the JSON Twitter returns. After checking that the response is not null, that it has more items, and that we are not repeating the scroll cursor, we extract the tweets from the items html, save them, and create our next search URL. Finally, we sleep the thread for however long we chose with rateDelay, so we are not bombarding Twitter with a stupid amount of requests that could be viewed as a very crap DDoS.
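To make the loop concrete, here is a minimal, self-contained sketch of the same control flow. It is not the code from the repository: PageSketch, SearchLoopSketch and the fetchPage function are hypothetical stand-ins, tweets are plain strings, and the HTTP call is replaced by a function you pass in so the loop can be exercised without touching Twitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical stand-in for the JSON response described above.
class PageSketch {
    final List<String> tweets;
    final boolean hasMoreItems;
    final String scrollCursor;

    PageSketch(List<String> tweets, boolean hasMoreItems, String scrollCursor) {
        this.tweets = tweets;
        this.hasMoreItems = hasMoreItems;
        this.scrollCursor = scrollCursor;
    }
}

class SearchLoopSketch {
    // fetchPage maps a scroll cursor (null for the first page) to a page,
    // or returns null when the request failed.
    static List<String> search(Function<String, PageSketch> fetchPage, long rateDelayMs) {
        List<String> saved = new ArrayList<>();
        String cursor = null;
        PageSketch page = fetchPage.apply(cursor);
        // Stop on a null response, when there are no more items,
        // or when the cursor we are handed is one we have already used.
        while (page != null && page.hasMoreItems && !Objects.equals(page.scrollCursor, cursor)) {
            saved.addAll(page.tweets);      // "save" step
            cursor = page.scrollCursor;     // basis for the next search URL
            try {
                Thread.sleep(rateDelayMs);  // be polite between requests
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            page = fetchPage.apply(cursor);
        }
        return saved;
    }
}
```

Note that, as written, a page reporting no more items ends the loop before its tweets are saved; whether that is what you want depends on how the final page is reported.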
Now that we’ve got an idea of what algorithm we’re going to use, let’s start coding.
I’m going to use Gradle as the build system, as we are going to use some additional dependencies to make things easier. You can download it and set it up on your machine if you want, but I’ve also added a Gradle wrapper (gradlew) to the repository so you can run the build without installing Gradle. All you need to do is make sure that your JAVA_HOME environment variable is set and points to wherever Java is installed.
Let’s take a look at the Gradle file.
apply plugin: 'java' |
As this is a Java project, we’ve applied the java plugin. This assumes the standard directory structure that we get with Gradle and Maven projects: src/main/java and src/test/java.
In addition, there are several dependencies I’ve included to help make the task a little easier. HTTPClient provides libraries that make it easier to construct URIs, GSON is a useful JSON processing library that will allow us to convert the response from Twitter into a Java object, and JSoup is an HTML parsing library that we can use to extract what we need from the items_html value that Twitter returns to us. Finally, I’ve included JUnit, although I won’t go into unit testing in this example.
Let’s start writing our code. Again, if you’re not familiar with Gradle, the root for your packages should be in src/main/java. If the folders are not already there, create them, and feel free to look at the example code if you’re still unclear.
package uk.co.tomkdickinson.twitter.search; |
package uk.co.tomkdickinson.twitter.search; |
You’ll notice the additional method getTweets() in TwitterResponse. For now, just return an empty ArrayList, but we will revisit this later.
In addition to these bean classes, we also want to handle the edge cases where someone searches with a null or empty string, or where the query contains characters not allowed in a URL. To handle this, we will also create a small exception class called InvalidQueryException.
package uk.co.tomkdickinson.twitter.search; |
Next, we need to create a TwitterSearch class and its basic structure. An important thing to consider here is that we are interested in making the code reusable, so in the example I have made the class abstract with an abstract method called saveTweets. The nice thing about this is that it decouples the saving logic from the extraction logic. In other words, it allows you to implement your own save solution without having to rewrite any of the TwitterSearch code. You might also note that I’ve specified that the saveTweets method returns a boolean. This allows anyone extending the class to provide their own exit condition, for example once a certain number of tweets have been extracted. By returning false, we can tell our code to stop extracting tweets from Twitter.
package uk.co.tomkdickinson.twitter.search; |
Finally, let’s also create a TwitterSearchImpl. This will contain a small implementation of TwitterSearch so we can test our code as we go along.
package uk.co.tomkdickinson.twitter.search; |
All this implementation does is print out each tweet’s date and text, collecting up to a maximum of 500 tweets, at which point the program should terminate.
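The pattern of an abstract save method whose boolean return value doubles as an exit condition can be sketched in isolation. The class names below (TweetSinkSketch, LimitedSinkSketch) are hypothetical and are not the classes from the repository; tweets are plain strings, and the sketch keeps them in a list rather than printing them so the behaviour is easy to check.

```java
import java.util.ArrayList;
import java.util.List;

abstract class TweetSinkSketch {
    // Return false to tell the caller to stop fetching further pages.
    abstract boolean saveTweets(List<String> tweets);

    // Feeds pages to saveTweets until it returns false.
    // Returns how many pages were consumed.
    int consume(List<List<String>> pages) {
        int consumed = 0;
        for (List<String> page : pages) {
            consumed++;
            if (!saveTweets(page)) {
                break;
            }
        }
        return consumed;
    }
}

class LimitedSinkSketch extends TweetSinkSketch {
    private final int limit;
    final List<String> kept = new ArrayList<>();

    LimitedSinkSketch(int limit) {
        this.limit = limit;
    }

    @Override
    boolean saveTweets(List<String> tweets) {
        for (String tweet : tweets) {
            if (kept.size() >= limit) {
                return false; // limit reached part-way through a page
            }
            kept.add(tweet);
        }
        return kept.size() < limit; // keep going only while under the limit
    }
}
```

Because the fetching code only ever sees the boolean, swapping the in-memory list for a database or a file does not require touching the search logic.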
Now that we have the skeleton of our project set up, let’s start implementing some of the functionality. Following our pseudo code from earlier, let’s start with the TwitterSearch class:
public void search(final String query, final long rateDelay) { |
As you can probably tell, that is most of our main pseudo code implemented. Running it will have no effect, as we haven’t implemented any of the actual steps yet, but it is a good start.
Let’s implement some of our other methods, starting with constructURL.
public final static String TYPE_PARAM = "f"; |
First, we check whether the query is valid. If not, we throw the InvalidQueryException from earlier. Additionally, we may hit a MalformedURLException or URISyntaxException, both caused by an invalid query string, so when either is caught we throw a new InvalidQueryException. Next, using a URIBuilder, we build our URL from some constants we specify as variables, plus the query and scroll_cursor values we pass in. Our initial query will have a null scroll cursor, so we also check for that. Finally, we build the URI and return it as a URL, so we can use it to open an InputStream later on.
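For readers who want to see the shape of the resulting URL without pulling in HTTPClient, here is a rough sketch that uses only the JDK. It swaps URIBuilder for java.net.URLEncoder, returns a String rather than a URL, and always requests the realtime feed; UrlSketch and InvalidQuerySketchException are hypothetical names standing in for the article’s classes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical stand-in for the article's InvalidQueryException.
class InvalidQuerySketchException extends Exception {
    InvalidQuerySketchException(String message) {
        super(message);
    }
}

class UrlSketch {
    static final String BASE = "https://twitter.com/i/search/timeline";

    static String constructUrl(String query, String scrollCursor) throws InvalidQuerySketchException {
        // Reject null, empty and whitespace-only queries up front.
        if (query == null || query.trim().isEmpty()) {
            throw new InvalidQuerySketchException("Query must not be null or empty");
        }
        try {
            StringBuilder url = new StringBuilder(BASE)
                    .append("?f=realtime")
                    .append("&q=").append(URLEncoder.encode(query, "UTF-8"));
            // The first request has no cursor, so the parameter is simply omitted.
            if (scrollCursor != null) {
                url.append("&scroll_cursor=").append(URLEncoder.encode(scrollCursor, "UTF-8"));
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            throw new InvalidQuerySketchException("Could not encode query: " + query);
        }
    }
}
```

URLEncoder applies form encoding, so a space becomes + and # becomes %23, which is enough to illustrate the idea; a URI builder remains the more robust choice for real use.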
Let’s implement our executeSearch function. This is where we actually call Twitter and parse its response.
public static TwitterResponse executeSearch(final URL url) { |
This is a fairly simple method. All we’re doing is opening a URLConnection for our Twitter query, then parsing the response with Gson into a TwitterResponse object, deserializing the JSON into a Java object that we can use. As we’ve already implemented the logic for using the scroll cursor, if we were to run this now, rather than terminating after a few seconds the program would keep running until there is no longer a valid response from Twitter. However, we haven’t quite finished, as we have yet to extract any information from the tweets.
The TwitterResponse object currently holds all the Twitter data in its items_html variable, so what we now need to do is go back to TwitterResponse and add some code that lets us extract that data. If you remember from earlier, we added a getTweets() method to the TwitterResponse object, but it currently returns an empty list. We’re going to fully implement that method so that when called, it builds up a list of tweets from the response’s items_html.
To do this, we are going to be using JSoup, and we can even refer to some of those CSS queries that we noted earlier.
public List getTweets() { |
Let’s discuss what we’re doing here. First, we create a JSoup document from the items_html variable. This allows us to select elements within the document using CSS selectors. Next, we go through each of the li elements that represent a tweet, and extract all the information that we are interested in. As you can see, there are a number of catch statements in here, as we want to guard against edge cases where particular data items might not be there (e.g. the user’s real name), while at the same time not using an all-encompassing catch statement that would skip a tweet just because it is missing a single piece of information. The only value that we require to save the tweet here is the tweetId, as this allows us to fully extract information about the tweet later on if we want. Obviously, you can modify this section to your heart’s content to meet your own rules.
Finally, let’s run our program again. This is the final time, and you should now see tweets being extracted and printed out. That’s it. Job done, finished!
Obviously, there are many ways this code can be improved. For example, a more generic error checking approach could be implemented to handle missing attributes (or you could just use Groovy and its ?. safe-navigation operator). You could implement Runnable in the TwitterSearch class to allow multiple calls to Twitter with a thread pool (although, I stress, respect rate limits). You could even change TwitterResponse so it parses the tweets into a list on creation, rather than extracting them from items_html each time you access them.
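As one possible take on the more generic error checking idea, the repeated try/catch blocks could be folded into a single helper that turns any failed or missing extraction into an empty Optional. This is only a sketch: SafeExtractSketch and attempt are hypothetical names, and the extractor would in practice wrap whatever JSoup call reads the attribute.

```java
import java.util.Optional;
import java.util.function.Supplier;

class SafeExtractSketch {
    // Runs the extractor and returns its value, or an empty Optional
    // if it returned null or threw a runtime exception.
    static <T> Optional<T> attempt(Supplier<T> extractor) {
        try {
            return Optional.ofNullable(extractor.get());
        } catch (RuntimeException e) {
            return Optional.empty();
        }
    }
}
```

A call such as SafeExtractSketch.attempt(() -> Long.parseLong(rawTimestamp)), where rawTimestamp is whatever string was read from the element, then yields an empty Optional when the value is missing or malformed, so one absent field no longer needs its own catch block.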