How a Java crawler can determine a web page's character encoding before scraping its content

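I have recently been building a crawler that fetches web page content, runs semantic analysis on it, and then tags the page, so that attributes of the users who visit that page can be inferred.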

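While fetching content I ran into garbled text, so the encoding of the page content has to be determined first. There are broadly three ways to do this: 1. read the charset from the Content-Type field of the HTTP response header; 2. read the charset from the Content-Type declaration in the page's meta tag; 3. analyze the page content itself to infer the encoding.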
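To make the first two ways concrete, here is a minimal sketch of pulling the charset out of a Content-Type header value and out of a meta tag. The class name `DeclaredCharset` and its two methods are hypothetical helpers written for illustration, not part of any library, and the regular expressions are an assumption about what typical declarations look like rather than a full HTML parser.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts a declared charset from a header value or a meta tag. */
class DeclaredCharset {
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8', ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches one whole <meta ...> tag so the charset search stays inside it.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Way 1: parse a header value such as "text/html; charset=UTF-8"; null if no charset. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    /** Way 2: scan the meta tags of an HTML string for a charset declaration; null if none. */
    static String fromMeta(String html) {
        if (html == null) {
            return null;
        }
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String charset = fromContentType(tag.group());
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8")); // utf-8
        System.out.println(fromMeta(
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">")); // gb2312
        System.out.println(fromMeta("<meta charset=\"utf-8\">")); // utf-8
        System.out.println(fromContentType("text/html")); // null
    }
}
```

Both methods return null when nothing is declared, which lets a caller fall through to the next way.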

The first two ways do not reliably indicate the encoding the page actually uses, so to be thorough the third way is added as well.
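The cost of trusting a wrong declaration can be shown with the JDK alone. The sketch below is an illustration, not part of the crawler: the class `WrongCharsetDemo` and its method are hypothetical, and ISO-8859-1 stands in for whatever incorrect charset a header or meta tag might declare.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the same bytes decoded with the right and with a wrong charset. */
class WrongCharsetDemo {
    /** Encodes text as UTF-8, then decodes the bytes with the declared charset. */
    static String decodeUtf8BytesAs(String text, Charset declared) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        return new String(raw, declared);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the two Chinese characters for "Chinese"
        String ok = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);
        String garbled = decodeUtf8BytesAs(original, StandardCharsets.ISO_8859_1);
        System.out.println(ok.equals(original));      // true
        System.out.println(garbled.equals(original)); // false
        System.out.println(garbled.length());         // 6, one char per UTF-8 byte
    }
}
```

The bytes are intact in both cases; only the decoding step differs, which is why the encoding has to be settled before the content is turned into text.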

The third way uses the open-source jar info.monitorenter.cpdetector, which can be downloaded from GitHub (https://github.com/onilton/cpdetector-maven-repo/tree/master/info/monitorenter/cpdetector/1.0.10).

package com.mobivans.encoding;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.ByteOrderMarkDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.IOUtils;

public class PageEncoding {

    /**
     * Test driver.
     * @param args unused
     */
    public static void main(String[] args) {
//        String charset = getEncodingByHeader("http://blog.csdn.net/liuzhenwen/article/details/4060922");
//        String charset = getEncodingByMeta("http://blog.csdn.net/liuzhenwen/article/details/4060922");
        String charset = getEncodingByContentStream("http://blog.csdn.net/liuzhenwen/article/details/5930910");
        System.out.println(charset);
    }

    /**
     * Gets the page encoding from the HTTP response header.
     * @param strUrl the page URL
     * @return the declared charset, or null if the header carries none
     */
    public static String getEncodingByHeader(String strUrl) {
        String charset = null;
        try {
            URLConnection urlConn = new URL(strUrl).openConnection();
            // Value of the Content-Type header, e.g. "text/html; charset=utf-8";
            // null when the server did not send one
            String contentType = urlConn.getContentType();
            if (contentType != null) {
                for (String att : contentType.split(";")) {
                    String[] kv = att.trim().split("=", 2);
                    if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                        // Drop surrounding whitespace and optional quotes
                        charset = kv[1].trim().replace("\"", "");
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset;
    }

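    /**
     * Gets the page encoding from the meta tag.
     * @param strUrl the page URL
     * @return the declared charset, or null if no meta tag declares one
     */
    public static String getEncodingByMeta(String strUrl) {
        String charset = null;
        try {
            URLConnection urlConn = new URL(strUrl).openConnection();
            // Send a browser User-Agent so the request is not rejected
            urlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36");
            List<String> lines;
            try (InputStream in = urlConn.getInputStream()) {
                // Read the HTML into a list of lines. ISO-8859-1 maps every byte to a
                // char, so the ASCII meta tag survives whatever the real encoding is
                lines = IOUtils.readLines(in, "ISO-8859-1");
            }
            String key = "charset=";
            for (String line : lines) {
                String lower = line.toLowerCase();
                int start = lower.indexOf(key);
                if (!lower.contains("<meta") || start < 0) {
                    continue;
                }
                // Covers <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
                // as well as <meta charset="utf-8">
                start += key.length();
                if (start < line.length() && "\"'".indexOf(line.charAt(start)) >= 0) {
                    start++; // skip an opening quote
                }
                int end = start;
                while (end < line.length() && "\"' ;/>".indexOf(line.charAt(end)) < 0) {
                    end++;
                }
                if (end > start) {
                    charset = line.substring(start, end);
                    break; // the first declaration wins
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset;
    }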

    /**
     * Detects the page encoding from the page content.
     *     Use case: pages that can be read directly (exception: some blog sites
     *     reject requests that carry no User-Agent header).
     * @param url the page URL
     * @return the lower-cased charset name, or null if detection failed
     */
    public static String getEncodingByContentUrl(String url) {
        CodepageDetectorProxy cdp = CodepageDetectorProxy.getInstance();
        cdp.add(JChardetFacade.getInstance()); // requires antlr.jar and chardet.jar
        cdp.add(ASCIIDetector.getInstance());
        cdp.add(UnicodeDetector.getInstance());
        cdp.add(new ParsingDetector(false));
        cdp.add(new ByteOrderMarkDetector());
        Charset charset = null;
        try {
            charset = cdp.detectCodepage(new URL(url));
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset == null ? null : charset.name().toLowerCase();
    }

    /**
     * Detects the page encoding from the page content.
     *     Use case: pages that cannot be read directly. The page is first read into
     *     an input stream that supports mark(), and the encoding is detected from it.
     * @param strUrl the page URL
     * @return the lower-cased charset name, or null if detection failed
     */
    public static String getEncodingByContentStream(String strUrl) {
        Charset charset = null;
        try {
            URLConnection urlConn = new URL(strUrl).openConnection();
            // Send a browser User-Agent so the request is not rejected
            urlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36");
            // Set up the detectors that analyze the page content
            CodepageDetectorProxy cdp = CodepageDetectorProxy.getInstance();
            cdp.add(JChardetFacade.getInstance()); // requires antlr.jar and chardet.jar
            cdp.add(ASCIIDetector.getInstance());
            cdp.add(UnicodeDetector.getInstance());
            cdp.add(new ParsingDetector(false));
            cdp.add(new ByteOrderMarkDetector());
            try (InputStream in = urlConn.getInputStream()) {
                byte[] bytes = IOUtils.toByteArray(in);
                // detectCodepage(InputStream in, int length) only supports streams that
                // support mark(), so the page is buffered in a ByteArrayInputStream
                charset = cdp.detectCodepage(new ByteArrayInputStream(bytes), bytes.length);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset == null ? null : charset.name().toLowerCase();
    }

}
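The idea behind content-based detection can also be sketched without cpdetector. The example below is a swapped-in, much simpler technique and not how cpdetector works internally: it tries a list of candidate charsets (for example the ones declared in the header and the meta tag) and keeps the first one under which the bytes decode without errors. The class `StrictDecodeCheck` and its methods are hypothetical names, and the candidate order is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper: checks candidate charsets against the actual page bytes. */
class StrictDecodeCheck {
    /** True if the bytes decode under cs with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the first candidate, in order, that decodes the bytes cleanly, or null. */
    static String firstCleanCharset(byte[] bytes, String... candidates) {
        for (String name : candidates) {
            try {
                if (name != null && Charset.isSupported(name)
                        && decodesCleanly(bytes, Charset.forName(name))) {
                    return Charset.forName(name).name().toLowerCase(Locale.ROOT);
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name in a declaration: skip this candidate
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = {(byte) 0xE9}; // e-acute in ISO-8859-1, malformed as UTF-8
        System.out.println(firstCleanCharset(utf8, "utf-8", "iso-8859-1"));
        System.out.println(firstCleanCharset(latin1, "utf-8", "iso-8859-1"));
    }
}
```

A clean decode only shows that a charset is possible, not that it is the right one: single-byte charsets such as ISO-8859-1 accept every byte sequence, so they belong at the end of the candidate list. Statistical detectors like the ones in cpdetector exist to handle those ambiguous cases.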

Points to note:

1. info.monitorenter.cpdetector is not published on mvn-repository, so it cannot be downloaded from there. Download the jar, then install it into the local repository by hand with the following mvn command:

mvn install:install-file -Dfile=<path to the jar> -DgroupId=<the jar's groupId> -DartifactId=<the jar's artifactId> -Dversion=<the jar's version> -Dpackaging=jar

Then add the jar as a dependency in pom.xml:

<!-- charset detector -->
<dependency>
    <groupId>info.monitorenter.cpdetector</groupId>
    <artifactId>cpdetector</artifactId>
    <version>1.0.10</version>
</dependency>

2. JChardetFacade.getInstance() throws an exception until antlr.jar and chardet.jar are on the classpath. Add both jars as dependencies in pom.xml:

<!-- antlr -->
<dependency>
    <groupId>antlr</groupId>
    <artifactId>antlr</artifactId>
    <version>2.7.7</version>
</dependency>

<!-- ChardetFacade -->
<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>

For a plain (non-Maven) project, pom.xml does not matter: download the three jars and add them to the project's classpath.
