java简单web爬虫(网页图片)

java简单web爬虫(网页图片)
效果，执行main（）方法后图片就下载道C盘的res文件夹中。没有的话创建一个文件夹
代码里的常量根据自己的需求修改，代码附到下面。

package com.sinitek.sirm.common.utils;
 
import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
/**
 * java简单web爬虫(网页图片)
 */
public class Main {
 
    // 地址
    private static final String URL = "http://www.xxx";
    // 获取img标签正则
    private static final String IMGURL_REG = "<img.*src=(.*?)[^>]*?>";
    // 获取src路径的正则
    private static final String IMGSRC_REG = "src\\s*=\\s*\"?(.*?)(\"|>|\\s+)";
    //图片原始路径(如果src里的路径正确则不用)
    private static final String IMG_LUJING = "http://xxx/";
    //下载路径
    private static final String LUJING = "C:/res/";
 
    public static void main(String[] args) {
        try {
            Main cm=new Main();
            //获得html文本内容
            String HTML = cm.getHtml(URL);
            //获取图片标签
            List<String> imgUrl = cm.getImageUrl(HTML);
            //获取图片src地址
            List<String> imgSrc = cm.getImageSrc(imgUrl);
            //下载图片
            cm.Download(imgSrc);
 
        }catch (Exception e){
            System.out.println("发生错误");
        }
 
    }
 
   //获取HTML内容
    private String getHtml(String url)throws Exception{
        URL url1=new URL(url);
        URLConnection connection=url1.openConnection();
        InputStream in=connection.getInputStream();
        InputStreamReader isr=new InputStreamReader(in);
        BufferedReader br=new BufferedReader(isr);
 
        String line;
        StringBuffer sb=new StringBuffer();
        while((line=br.readLine())!=null){
            sb.append(line,0,line.length());
            sb.append('\n');
        }
        br.close();
        isr.close();
        in.close();
        return sb.toString();
    }
 
    //获取ImageUrl地址
    private List<String> getImageUrl(String html){
        Matcher matcher=Pattern.compile(IMGURL_REG).matcher(html);
        List<String>listimgurl=new ArrayList<String>();
        while (matcher.find()){
            listimgurl.add(matcher.group());
        }
        return listimgurl;
    }
 
    //获取ImageSrc地址
    private List<String> getImageSrc(List<String> listimageurl){
        List<String> listImageSrc=new ArrayList<String>();
        for (String image:listimageurl){
            // 匹配<img>中的src数据
            Matcher m = Pattern.compile(IMGSRC_REG).matcher(image);
            while (m.find()) {
                String a = m.group(1);//获取图片路径
                a = IMG_LUJING+a;//数据拼接
                listImageSrc.add(a);
            }
        }
        return listImageSrc;
    }
 
    //下载图片
    private void Download(List<String> listImgSrc) {
        try {
            //开始时间
            Date begindate = new Date();
            for (String url : listImgSrc) {
                //开始时间
                Date begindate2 = new Date();
                String imageName = url.substring(url.lastIndexOf("/") + 1, url.length());
                URL uri = new URL(url);
                InputStream in = uri.openStream();
                FileOutputStream fo = new FileOutputStream(new File(LUJING+imageName));//路径
                byte[] buf = new byte[1024];
                int length = 0;
                System.out.println("开始下载:" + url);
                while ((length = in.read(buf, 0, buf.length)) != -1) {
                    fo.write(buf, 0, length);
                }
                in.close();
                fo.close();
                System.out.println(imageName + "下载完成");
                //结束时间
                Date overdate2 = new Date();
                double time = overdate2.getTime() - begindate2.getTime();
                System.out.println("耗时：" + time / 1000 + "s");
            }
            Date overdate = new Date();
            double time = overdate.getTime() - begindate.getTime();
            System.out.println("总耗时：" + time / 1000 + "s");
        } catch (Exception e) {
            System.out.println("下载失败");
        }
    }
}

java简单web爬虫(网页图片)的更多相关文章

java爬虫-简单爬取网页图片
刚刚接触到“爬虫”这个词的时候是在大一,那时候什么都不明白,但知道了百度.谷歌他们的搜索引擎就是个爬虫. 现在大二.再次燃起对爬虫的热爱,查阅资料,知道常用java.python语言编程,这次我选择了 ...
Python爬虫网页图片
一概述参考http://www.cnblogs.com/abelsu/p/4540711.html 弄了个Python捉取单一网页的图片,但是Python已经升到3+版本了.参考的已经失效,基本用 ...
从urllib和urllib2基础到一个简单抓取网页图片的小爬虫
urllib最常用的两大功能(个人理解urllib用于辅助urllib2) 1.urllib.urlopen() 2. urllib.urlencode() #适当的编码,可用于后面的post提交 ...
node爬虫 -- 网页图片
相信大家都听说过爬虫,我们也听说过Python是可以很方便地爬取网络上的图片,但是奈何本人不会Python,就只有通过 Node 来实践一下了. 接下来看我如何板砖 ! !!
Python3简单爬虫抓取网页图片
现在网上有很多python2写的爬虫抓取网页图片的实例,但不适用新手(新手都使用python3环境,不兼容python2), 所以我用Python3的语法写了一个简单抓取网页图片的实例,希望能够帮助到 ...
Python并发编程-一个简单的爬虫
一个简单的爬虫 #网页状态码 #200 正常 #404 网页找不到 #502 504 import requests from multiprocessing import Pool def get( ...
Java简单爬虫(一)
简单的说,爬虫的意思就是根据url访问请求,然后对返回的数据进行提取,获取对自己有用的信息.然后我们可以将这些有用的信息保存到数据库或者保存到文件中.如果我们手工一个一个访问提取非常慢,所以我们需要编 ...
JAVA多线程超时加载当网页图片
先上图: 这一次没有采取正则匹配,而采取了最简单的java分割和替代方法进行筛选图片它能够筛选如下的图片并保存到指定的文件夹如: “http://xxxx/xxxx/xxx.jpg” 'http: ...
Python爬虫之网页图片抓取
一.引入这段时间一直在学习Python的东西,以前就听说Python爬虫多厉害,正好现在学到这里,跟着小甲鱼的Python视频写了一个爬虫程序,能实现简单的网页图片下载. 二.代码 __author ...

随机推荐

The twentyth day
10th Dec 2018 Cause It's hard for me to lose in my life I've found 因为失去你是一种煎熬 Only time will tell a ...
创建简单的node服务器
昨天咱们说了封装ajax,今天咱们说一下自己创建一个建议的node服务器: 话不多说直接上代码: var http = require('http') //对URL 解析为对象//1.导入模块 UR ...
菜鸟学习Spring——SpringIoC容器基于三种配置的对比
一.概述对于实现Bean信息定义的目标,它提供了基于XML.基于注解及基于java类这三种选项.下面总结一下三种配置方式的差异. 二.Bean不同配置方式比较. 三.Bean不同配置方式的适用场合. ...
linux 中环境变量配置文件说明
1. 修改/etc/profile文件特点:所有用户的shell都有权使用你配置好的环境变量说明:如果你的电脑仅用作开发,建议使用此配置,因为所有用户的shell都有权使用你配置好的环境变量,所以 ...
android的MVP模式
MVP简介相信大家对MVC都是比较熟悉了:M-Model-模型.V-View-视图.C-Controller-控制器,MVP作为MVC的演化版本,那么类似的MVP所对应的意义:M-Model-模型. ...
js：JavaScript中的ActiveXObject对象
JavaScript中的ActiveXObject对象作用: https://blog.csdn.net/pl1612127/article/details/77862174
把IDENTITY_INSERT 设置为 ON ，还不能插入数据问题
IDENTITY_INSERT 为 ON 时 , 必须把需要插入的列名列出来不然报错正确例子: SET IDENTITY_INSERT table(表名) ONinsert into table ...
zan-framework mysql连接
①根据文档内容要配置sqlmap连接池的读写白名单 http://doc.zanphp.io/zh/libs/connection_pool.html 示例代码 // demo.demo.demo_s ...
TP5.0：新建控制器
例如,我们在admin模块下创建一个名为OneMenu.php的控制器 1.在该控制器文件中内容为: 2.访问的URL为:http://localhost/tp5/public/index.php/a ...
March 2 2017 Week 9 Thursday
The first duty of love is to listen. 爱的首要责任是倾听. Yesterday, I read an article that says a successful ...

java简单web爬虫(网页图片)

java简单web爬虫(网页图片)的更多相关文章

随机推荐

热门专题