【转】jsoup的使用

Jsoup的使用

jsoup 是一款 Java 的HTML 解析器，可直接解析某个URL地址、HTML文本内容。它提供了一套非常省力的API，可通过DOM，CSS以及类似于JQuery的操作方法来取出和操作数据。请参考：http://jsoup.org/

jsoup的主要功能如下：

从一个URL，文件或字符串中解析HTML；

使用DOM或CSS选择器来查找、取出数据；

可操作HTML元素、属性、文本；

jsoup是基于MIT协议发布的，可放心使用于商业项目。

下载和安装：

maven安装方法：

把下面放入pom.xml下

<groupId>org.jsoup</groupId>

<artifactId>jsoup</artifactId>

</dependency>

用jsoup解析html的方法如下：

解析url html方法

Document doc =Jsoup.connect("http://example.com")  .data("query","Java")

  .userAgent("Mozilla")

  .cookie("auth","token")

  .timeout(3000)

  .post();

从文件中解析的方法：

File input =newFile("/tmp/input.html");Document doc =Jsoup.parse(input,"UTF-8","http://example.com/");

类试js jsoup提供下面方法：

getElementById(String id) 用id获得元素
getElementsByTag(String tag) 用标签获得元素
getElementsByClass(String className) 用class获得元素
getElementsByAttribute(String key) 用属性获得元素

同时还提供下面的方法提供获取兄弟节点：

siblingElements(), firstElementSibling(), lastElementSibling();nextElementSibling(), previousElementSibling()

用下面方法获得元素的数据：

attr(String key) 获得元素的数据
attr(String key, String value) t设置元素数据
attributes() 获得所以属性
id(), className() classNames() 获得id class得值
text()获得文本值
text(String value) 设置文本值
html() 获取html
html(String value)设置html
outerHtml() 获得内部html
data()获得数据内容
tag() 获得tag 和 tagName() 获得tagname

操作html提供了下面方法：

通过类似jquery的方法操作html

File input =newFile("/tmp/input.html");Document doc =Jsoup.parse(input,"UTF-8","http://example.com/");Elements links = doc.select("a[href]");// a with hrefElements pngs = doc.select("img[src$=.png]");

  // img with src ending .pngElement masthead = doc.select("div.masthead").first();

  // div with class="masthead"Elements resultLinks = doc.select("h3.r > a");// direct a after h3

支持的操作有下面这些：

tagname 操作tag
ns|tag ns或tag
#id 用id获得元素
.class 用class获得元素
[attribute] 属性获得元素
[^attr]: 以attr开头的属性
[attr=value] 属性值为value
[attr^=value], [attr$=value], [attr*=value]
[attr~=regex]正则
*:所以的标签

选择组合

el#id el和id定位
el.class e1和class定位
el[attr] e1和属性定位
ancestor child ancestor下面的child

等等

抓取网站标题和内容及里面图片的事例：

public void parse(String urlStr) {
// 返回结果初始化。
Document doc = null;
try {
doc = Jsoup
.connect(urlStr)
.userAgent(
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)") // 设置User-Agent
.timeout(5000) // 设置连接超时时间
.get();
} catch (MalformedURLException e) {
log.error( e);
return ;
} catch (IOException e) {
if (e instanceof SocketTimeoutException) {
log.error( e);
return ;
}
if(e instanceof UnknownHostException){
log.error(e);
return ;
}
log.error( e);
return ;
}
system.out.println(doc.title());
Element head = doc.head();
Elements metas = head.select("meta");
for (Element meta : metas) {
String content = meta.attr("content");
if ("content-type".equalsIgnoreCase(meta.attr("http-equiv"))
&& !StringUtils.startsWith(content, "text/html")) {
log.debug( urlStr);
return ;
}
if ("description".equalsIgnoreCase(meta.attr("name"))) {
system.out.println(meta.attr("content"));
}
}
Element body = doc.body();
for (Element img : body.getElementsByTag("img")) {
String imageUrl = img.attr("abs:src");//获得绝对路径
for (String suffix : IMAGE_TYPE_ARRAY) {
if(imageUrl.indexOf("?")>0){
imageUrl=imageUrl.substring(0,imageUrl.indexOf("?"));
}
if (StringUtils.endsWithIgnoreCase(imageUrl, suffix)) {
imgSrcs.add(imageUrl);
break;
}
}
}
}

这里重点要提的是怎么获得图片或链接的决定地址：

如上获得绝对地址的方法String imageUrl = img.attr("abs:src");//获得绝对路径，前面添加abs：jsoup就会获得决定地址；

想知道原因，咱们查看下源码，如下：

//该方面是先从map中找看是否有该属性key，如果有直接返回，如果没有检查是否
//以abs：开头
public String attr(String attributeKey) {
Validate.notNull(attributeKey);
if (hasAttr(attributeKey))
return attributes.get(attributeKey);
else if (attributeKey.toLowerCase().startsWith("abs:"))
return absUrl(attributeKey.substring("abs:".length()));
else return "";
}

接着查看absUrl方法：

/**
* Get an absolute URL from a URL attribute that may be relative (i.e. an <code><a href></code> or
* <code><img src></code>).
* <p/>
* E.g.: <code>String absUrl = linkEl.absUrl("href");</code>
* <p/>
* If the attribute value is already absolute (i.e. it starts with a protocol, like
* <code>http://</code> or <code>https://</code> etc), and it successfully parses as a URL, the attribute is
* returned directly. Otherwise, it is treated as a URL relative to the element's {@link #baseUri}, and made
* absolute using that.
* <p/>
* As an alternate, you can use the {@link #attr} method with the <code>abs:</code> prefix, e.g.:
* <code>String absUrl = linkEl.attr("abs:href");</code>
*
* @param attributeKey The attribute key
* @return An absolute URL if one could be made, or an empty string (not null) if the attribute was missing or
* could not be made successfully into a URL.
* @see #attr
* @see java.net.URL#URL(java.net.URL, String)
*/
//看到这里大家应该明白绝对地址是怎么取的了
public String absUrl(String attributeKey) {
Validate.notEmpty(attributeKey);
String relUrl = attr(attributeKey);
if (!hasAttr(attributeKey)) {
return ""; // nothing to make absolute with
} else {
URL base;
try {
try {
base = new URL(baseUri);
} catch (MalformedURLException e) {
// the base is unsuitable, but the attribute may be abs on its own, so try that
URL abs = new URL(relUrl);
return abs.toExternalForm();
}
// workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
if (relUrl.startsWith("?"))
relUrl = base.getPath() + relUrl;
URL abs = new URL(base, relUrl);
return abs.toExternalForm();
} catch (MalformedURLException e) {
return "";
}
}
}

【转】jsoup的使用的更多相关文章

Jsoup问题---获取http协议请求失败 org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml.
Jsoup问题---获取http协议请求失败 1.问题:用Jsoup在获取一些网站的数据时,起初获取很顺利,但是在访问某浪的数据是Jsoup报错,应该是请求头里面的请求类型(ContextType)不 ...
Jsoup系列学习(2)-解析html文件
解析html文件 1.当我们通过发送http请求时,有时候返回结果是一个html格式字符串,你需要从一个网站获取和解析一个HTML文档,并查找其中的相关数据.你可以使用下面解决方法: 使用 Jsoup ...
Jsoup系列学习(1)-发送get或post请求
简介 jsoup 是一款Java 的HTML解析器,可直接解析某个URL地址.HTML文本内容.它提供了一套非常省力的API,可通过DOM,CSS以及类似于jQuery的操作方法来取出和操作数据. 官 ...
使用 jsoup 对 HTML 文档进行解析和操作
jsoup 简介 Java 程序在解析 HTML 文档时,相信大家都接触过 htmlparser 这个开源项目,我曾经在 IBM DW 上发表过两篇关于 htmlparser 的文章,分别是:从 HT ...
jsoup获取图片示例
import java.io.File; import java.io.FileOutputStream; import java.io.IOException; import java.io.Inp ...
jsoup获取文档类示例
import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsou ...
Jsoup解析html终于成功了！！！
package com.eric.pickupjoke.activity; import java.io.IOException; import java.io.InputStream; import ...
Jsoup做接口测试
最早用Jsoup是有一个小的爬虫应用要写,发现Jsoup较HttpClient轻便多了,API也方便易懂,上手很快,对于response的Document解析的选择器用的是cssSelector(Jq ...
jsoup开发网页客户端3
这个系列好久没更新,最近好忙,老大说未来是Html5的,所以最近一直学习前端以及Html5的一些东西.Android5.0的诞生,让我们眼前一亮,独特的Material风格更是吸引了无数人. 话说不学 ...
Jsoup开发网站客户端第二篇，图片轮播，ScrollView兼容ListView
最近一段日子忙的焦头烂额,代码重构,新项目编码,导致jsoup开发网站客户端也没时间继续下去,只能利用晚上时间去研究了.今天实现美食网首页图片轮播效果,网站效果图跟Android客户端实现如图: 从浏 ...

随机推荐

tensorboard的安装及遇到的问题
1 安装tensorboard 打开anaconda prompt,键入下边的命令: activate tensorflow pip install tensorboard 当执行“activate ...
Oracle中date转为timstam可以函数to_timestamp的方式来转化
data 转为timstam可以函数to_timestamp的方式来转化 Select to_timestamp('2018-02-27 09:48:28','yyyy-mm-dd hh24:mi:s ...
Iterator 遍历器
1.遍历器(Iterator)是一种接口,为各种不同的数据结构提供统一的访问机制.任何数据结构只要部署Iterator接口,就可以完成遍历操作(即依次处理该数据结构的所有成员). 2.Iterator ...
CI框架定义判断POST GET AJAX
CI框架当中并没有提供,类似tp框架中IS_POST,IS_AJAX,IS_GET的方法. 所有就得我们自己造轮子了.下面就介绍一下,如何定义这些判断请求的方法.其实很简单的. 首先打开constan ...
Lyft Level 5 Challenge 2018 - Final Round (Open Div. 2) B 1075B （思维）
B. Taxi drivers and Lyft time limit per test 1 second memory limit per test 256 megabytes input stan ...
FPGA基础学习(1) -- FFT IP核（Quartus）
为了突出重点,仅对I/O数据流为steaming的情况作简要说明,以便快速上手,有关FFT ip核模型及每种设置详细介绍请参考官方手册FFT MegaCore Function User Guide. ...
商品录入功能v1.0【持续优化中...】
# 录入商品 def goods_record(): print("欢迎使用铜锣辉的购物商城[商品管理][录入商品]".center(30, "*")) whi ...
老男孩python作业3-购物车程序优化
购物车优化要求:用户入口: 1.商品信息存在文件里 2.已购商品,余额记录.第一次启动程序时需要记录工资,第二次启动程序时谈出上次余额 3.允许用户根据商品编号购买商品 4.用户选择商品后,检测是否够 ...
POJ 3734 Blocks（矩阵快速幂+矩阵递推式）
题意:个n个方块涂色, 只能涂红黄蓝绿四种颜色,求最终红色和绿色都为偶数的方案数. 该题我们可以想到一个递推式 . 设a[i]表示到第i个方块为止红绿是偶数的方案数, b[i]为红绿恰有一个是偶数 ...
JS匿名函数以及arguments.callee的调用
var res = (function (n) { if( n>1 ) { return n + arguments.callee( n-1 ); } else { ...

【转】jsoup的使用

选择组合

【转】jsoup的使用的更多相关文章

随机推荐

热门专题