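Using jsoup in a Java crawler

jsoup is a library for parsing HTML. Much as JavaScript does in the browser, it lets you pull useful information straight out of a page by walking the DOM or using CSS-style selectors.

Example 1:

Parsing data from an HTML string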

// Parse directly from a string
    public static void getParByString()
    {
        String html = "<html><head><title>String content goes here</title></head>"
                + "<body><p class='p1'>A demo of what jsoup does</p></body></html>";
        Document doc = Jsoup.parse(html);
        // Select every <p> element that has a class attribute
        Elements paragraphs = doc.select("p[class]");
        for (Element p : paragraphs) {
            String className = p.className();
            String text = p.text();
            System.out.println(text);
            System.out.println(className);
        }
    }

Parsing data from a local file

// Parse from a local file
    public static void getHrefByLocal()
    {
        File input = new File("C:\\Users\\Idea\\Desktop\\html\\Home.html");
        Document doc;
        try {
            // The third argument is the base URI; jsoup uses it to turn relative links into absolute ones
            doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to select from if the file could not be read
        }
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println(linkText + ":" + linkHref);
        }
    }
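
The base URI matters because links in a saved page are often relative. As a rough illustration of what "resolving against a base" means, the sketch below uses only java.net.URI from the standard library, not jsoup itself; the class name BaseUriDemo and the sample paths are made up for the example.

```java
import java.net.URI;

class BaseUriDemo {
    // Resolve a possibly relative href against a base URI
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.oschina.net/";
        // Relative path: appended to the base
        System.out.println(resolve(base, "news/index.html"));
        // Root-relative path: replaces the path of the base
        System.out.println(resolve(base, "/p/jsoup"));
        // Already absolute: returned unchanged
        System.out.println(resolve(base, "http://jsoup.org/"));
    }
}
```

With jsoup, the "abs:href" attribute prefix used in the network example plays the same role for links in a parsed document.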

Parsing data directly from the network

public static HashMap<String, String> getHrefByNet(String url)
    {
        HashMap<String, String> hm = new HashMap<String, String>();
        try {
            // Fetch the page with a GET request
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // "abs:href" returns the link resolved against the page URL
                String linkHref = link.attr("abs:href");
                String linkText = link.text();
                //System.out.println(linkText+":"+linkHref);
                // Links that share the same text overwrite each other in the map
                hm.put(linkText, linkHref);
            }
            // The alternative is a POST request
            /*
            Document docPost = Jsoup.connect(url)
                    .data("query", "Java")
                    .userAgent("I am jsoup")
                    .cookie("auth", "token")
                    .timeout(10000)
                    .post();
            Elements linksPost = docPost.select("a[href]");
            for (Element link : linksPost) {
                String linkHref = link.attr("abs:href");
                String linkText = link.text();
                //System.out.println(linkText+":"+linkHref);
            }*/
        } catch (IOException e) {
            e.printStackTrace();
            hm.put("Failed to load", "error");
        }
        return hm;
    }

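Note: the examples above need the following imports:

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

Finally, here is the download link for the jar:

http://jsoup.org/packages/jsoup-1.8.1.jar
For a complete real-world project, see the Java crawler hands-on project.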

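Looping over the keys and values in a Hashtable

/* Create a test table of key-value pairs */
Hashtable<String, String> h = new Hashtable<String, String>();
/* Add data to it (key and value stand for your own strings) */
h.put(key, value);
/* Then loop over the entries, reading each key and value in turn */
Iterator<Map.Entry<String, String>> it = h.entrySet().iterator();
        while (it.hasNext())
        {
            Map.Entry<String, String> m = it.next();
            System.out.println(m.getValue());
            System.out.println(m.getKey());
        }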
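
The same loop can be written more compactly with a for-each over entrySet(). The following is a small self-contained sketch; the class name HashtableLoopDemo and the sample entries are assumptions for illustration. Because a Hashtable does not guarantee any iteration order, the sketch copies the entries into a TreeMap first so that the output is sorted by key.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.TreeMap;

class HashtableLoopDemo {
    // Render every entry as "key=value", one per line, sorted by key
    static String describe(Map<String, String> table) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(table).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Hashtable<String, String> links = new Hashtable<>();
        links.put("oschina", "http://www.oschina.net/");
        links.put("jsoup", "http://jsoup.org/");
        System.out.print(describe(links));
    }
}
```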

Creating a folder in Java (check whether it exists first; create it if not)

// Create the folder if it does not exist; leave it alone if it does
     public void makedir(){
         // Folder path
         String filePath = "D:/home/Lucy";
         File file = new File(filePath);
         if (!file.exists())
         {
             System.out.println("Does not exist");
             // mkdirs() also creates any missing parent folders; mkdir() creates only the last one
             if (file.mkdirs())
             {
                 System.out.println("Folder created successfully!");
             }
             else{
                 System.out.println("Failed to create the folder!");
             }
         }
         else{
             System.out.println("Path already exists!");
         }
     }
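
To make the difference between mkdir() and mkdirs() concrete, here is a minimal sketch that tries both on nested paths inside a temporary directory. The class name MkdirDemo and the folder names are hypothetical; only java.io.File and java.nio.file.Files from the standard library are used.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

class MkdirDemo {
    // Returns { result of mkdir(), result of mkdirs() } for nested paths
    // whose parent folders do not exist yet
    static boolean[] demo() {
        try {
            File root = Files.createTempDirectory("mkdir-demo").toFile();
            // Parent "a" is missing, so mkdir() cannot create "a/b"
            boolean single = new File(root, "a/b").mkdir();
            // mkdirs() creates "c" first and then "c/d"
            boolean nested = new File(root, "c/d").mkdirs();
            return new boolean[] { single, nested };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("mkdir:  " + r[0]);
        System.out.println("mkdirs: " + r[1]);
    }
}
```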

Creating a file in Java (check whether it exists first; create it if not)

// Create the file if it does not exist
     public void makeFile()
     {
         String fileName = "D:/file2.txt";
         File file = new File(fileName);
         if (!file.exists())
         {
            try {
                // createNewFile() returns true if the file was created, false if it already existed
                if (file.createNewFile())
                {
                    System.out.println("File created successfully!");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
         }
         else{
          System.out.println("File already exists!");
          }
     }

Writing content to a file

 // Write text to the file
     public void writeText(String s)
     {
         String fileName = "D:/file2.txt";
        File file = new File(fileName);
        if (file.exists() && file.isFile()) // the file exists, so content can be written
        {
            // try-with-resources closes the stream even if write() throws.
            // Opening the stream this way overwrites any existing content.
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(s.getBytes("UTF-8"));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        else{
            System.out.println("File does not exist; cannot write to it");
        }
     }
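
A FileOutputStream opened with only a file or file name overwrites whatever was there. If the crawler should keep adding lines to the same file, the two-argument constructor with append set to true is one option. The sketch below, with the hypothetical class name WriteTextDemo, writes twice to a temporary file and reads the result back so the two modes can be compared.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class WriteTextDemo {
    // append=true keeps the existing content; append=false overwrites it
    static void write(File file, String text, boolean append) {
        try (FileOutputStream fos = new FileOutputStream(file, append)) {
            fos.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes "first", then "second", and returns the final file content
    static String demo(boolean append) {
        try {
            File tmp = File.createTempFile("write-demo", ".txt");
            tmp.deleteOnExit();
            write(tmp, "first", false);
            write(tmp, "second", append);
            return new String(Files.readAllBytes(tmp.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // second write replaced the first
        System.out.println(demo(true));  // second write was appended
    }
}
```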

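Getting the system time in Java:

public static void getTime()
    {
        // Requires java.text.SimpleDateFormat and java.util.Date
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date = new Date();
        System.out.println(f.format(date));
        // The same instant with Chinese date and time unit labels
        System.out.println(new SimpleDateFormat("yyyy年MM月dd日   HH时mm分ss秒").format(date));
        // Date.toString() prints the default representation
        System.out.println(date);
    }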
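
On Java 8 and later, the java.time API is a common alternative to SimpleDateFormat, which is not thread-safe. The following sketch shows the same "yyyy-MM-dd HH:mm:ss" pattern with DateTimeFormatter; the class name TimeDemo is an assumption for illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeDemo {
    // DateTimeFormatter is immutable, so one shared instance is safe to reuse
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String format(LocalDateTime time) {
        return FORMAT.format(time);
    }

    public static void main(String[] args) {
        // Current date and time in the system default time zone
        System.out.println(format(LocalDateTime.now()));
    }
}
```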

Connecting to a MySQL database from Java

First add the MySQL JDBC driver jar to the project (download the jar).

import java.sql.*;
import java.util.ArrayList;

public class connectDoctorMySql {
    /*
    public static final String url = "jdbc:mysql://192.168.0.16/hive";
    public static final String name = "com.mysql.jdbc.Driver";
    public static final String user = "hive";
    public static final String password = "hive";
    public Connection conn = null;
    public PreparedStatement pst = null;
    public Statement stmt = null;
    ResultSet rs = null;*/
    public static final String url = "jdbc:mysql://127.0.0.1/orcl?useUnicode=true&characterEncoding=utf-8&useSSL=false";
    // Driver class for Connector/J 5.x; Connector/J 8.x uses "com.mysql.cj.jdbc.Driver"
    public static final String name = "com.mysql.jdbc.Driver";
    public static final String user = "root";
    public static final String password = "China123";
    public Connection conn = null;
    public PreparedStatement pst = null;
    public Statement stmt = null;
    ResultSet rs = null;

    // Initialize the database connection
    public void init(){
        try {
            Class.forName(name); // load the JDBC driver
            conn = DriverManager.getConnection(url, user, password); // open the connection
            stmt = conn.createStatement();
        } catch (Exception e) {
            System.out.println("Database connection failed...");
            e.printStackTrace();
        }
    }

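    // Execute an SQL statement that changes data (INSERT, UPDATE, DELETE) or schema
    public void excute(String sql){
        init();
        try {
            if (stmt != null) { // stmt is null when init() failed
                stmt.executeUpdate(sql);
            }
        } catch (SQLException e) {
            System.out.println("Statement failed: " + sql); // print the SQL statement
            e.printStackTrace();
        } finally {
            try {
                if (rs != null) {
                    rs.close();
                }
                if (pst != null) {
                    pst.close();
                }
                if (stmt != null) {
                    stmt.close();
                }
                if (conn != null) {
                    conn.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }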

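    // Run a query and return columns x and y of every row (column indexes start at 1)
    public ArrayList<String[]> select(String sql, int x, int y){
        init();
        ArrayList<String[]> result = new ArrayList<String[]>();
        try {
            if (stmt != null) { // stmt is null when init() failed
                // Assign to the field rs so that the finally block closes it
                rs = stmt.executeQuery(sql);
                while (rs.next())
                {
                    String[] str = new String[2];
                    str[0] = rs.getString(x);
                    str[1] = rs.getString(y);
                    result.add(str);
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            try {
                if (rs != null) {
                    rs.close();
                }
                if (pst != null) {
                    pst.close();
                }
                if (stmt != null) {
                    stmt.close();
                }
                if (conn != null) {
                    conn.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        return result;
    }
}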
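
The methods above build the SQL as a plain string and close each resource by hand. As one possible alternative, the sketch below uses try-with-resources, which closes the result set, statement and connection automatically, together with PreparedStatement placeholders instead of string concatenation. The class name JdbcSketch, the method name selectPairs and its parameter list are assumptions for illustration, not part of the code above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class JdbcSketch {
    // Runs a parameterized query and returns columns x and y of every row
    static List<String[]> selectPairs(String url, String user, String password,
                                      String sql, int x, int y, Object... params)
            throws SQLException {
        List<String[]> rows = new ArrayList<>();
        // Resources declared here are closed in reverse order when the block
        // exits, whether normally or through an exception
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement pst = conn.prepareStatement(sql)) {
            for (int i = 0; i < params.length; i++) {
                pst.setObject(i + 1, params[i]); // JDBC parameters are 1-based
            }
            try (ResultSet rs = pst.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] { rs.getString(x), rs.getString(y) });
                }
            }
        }
        return rows;
    }
}
```

A call might look like selectPairs(url, user, password, "select empno, ename from emp where deptno = ?", 1, 2, 10), where the table and column names are placeholders for your own schema. Errors surface as an SQLException for the caller to handle, rather than being printed and swallowed.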

Connecting to an Oracle database from Java

import java.sql.*;
import java.util.ArrayList;

public class connectDoctor {
    // Connection settings for the Oracle database
    public static final String url = "jdbc:oracle:thin:@127.0.0.1:1521:orcl";
    public static final String name = "oracle.jdbc.driver.OracleDriver";
    public static final String user = "c238891";
    public static final String password = "Rapid111";
    public Connection conn = null;
    public PreparedStatement pst = null;
    public Statement stmt = null;
    ResultSet rs = null;

    // Initialize the database connection
    public void init(){
        try {
            Class.forName(name); // load the JDBC driver
            conn = DriverManager.getConnection(url, user, password); // open the connection
            stmt = conn.createStatement();
        } catch (Exception e) {
            System.out.println("Database connection failed:");
            e.printStackTrace();
        }
    }

    // Test the database connection
    public void start()
    {
        init();
        String sql = "select * from emp";
        try {
            if (conn != null) { // conn is null when init() failed
                pst = conn.prepareStatement(sql);
                rs = pst.executeQuery();
                while (rs.next()) {
                    System.out.println("No.: " + rs.getString("empno")
                            + "; name: " + rs.getString("ename")
                            + "; job: " + rs.getString("job")
                            + "; manager: " + rs.getString("mgr")
                            + "; hire date: " + rs.getString("hiredate")
                            + "; salary: " + rs.getString("sal")
                            + "; commission: " + rs.getString("comm")
                            + "; department: " + rs.getString("deptno"));
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            try {
                // Close each resource whether or not the others were opened
                if (rs != null) {
                    rs.close();
                }
                if (pst != null) {
                    pst.close();
                }
                if (stmt != null) {
                    stmt.close();
                }
                if (conn != null) {
                    conn.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

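    // Execute an SQL statement that changes data (INSERT, UPDATE, DELETE) or schema
    public void excute(String sql){
        init();
        try {
            if (stmt != null) { // stmt is null when init() failed
                stmt.executeUpdate(sql);
            }
        } catch (SQLException e) {
            System.out.println("Statement failed: " + sql); // print the SQL statement
            e.printStackTrace();
        } finally {
            try {
                if (rs != null) {
                    rs.close();
                }
                if (pst != null) {
                    pst.close();
                }
                if (stmt != null) {
                    stmt.close();
                }
                if (conn != null) {
                    conn.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }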

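    // Run a query and return columns x and y of every row (column indexes start at 1)
    public ArrayList<String[]> select(String sql, int x, int y){
        init();
        ArrayList<String[]> result = new ArrayList<String[]>();
        try {
            if (stmt != null) { // stmt is null when init() failed
                // Assign to the field rs so that the finally block closes it
                rs = stmt.executeQuery(sql);
                while (rs.next())
                {
                    String[] str = new String[2];
                    str[0] = rs.getString(x);
                    str[1] = rs.getString(y);
                    result.add(str);
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            try {
                if (rs != null) {
                    rs.close();
                }
                if (pst != null) {
                    pst.close();
                }
                if (stmt != null) {
                    stmt.close();
                }
                if (conn != null) {
                    conn.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        return result;
    }
}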
