A .NET Crawler Application with DotnetSpider

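A friend recently asked me to put together a simple crawler. The requirements were modest: it should be easy to extend to different websites, and it should fetch the data we want. Analysis of the collected data is a separate topic that I will cover step by step in later posts.

There are plenty of open-source crawler frameworks. I have previously looked into Nutch (Java), which also offers Lucene-based full-text search, as well as various Python crawlers. Why DotnetSpider? I had earlier built a distributed framework in .NET whose mechanics resemble DotnetSpider's, so it felt familiar right away and I took to it quickly.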

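First, the overall layering of the solution:

InternetSpider: console application; it can later be deployed as a service on Windows

ISee.Shaun.Spiders.Business: central scheduling layer of the crawler, responsible for configuring, starting, and running spiders

ISee.Shaun.Spiders.Common: shared utilities, including the reflection code, the Dianping (大众点评) district dictionary, and the callback delegate definitions

ISee.Shaun.Spiders.Pipeline: implementations of BasePipeline, mainly responsible for saving data

ISee.Shaun.Spiders.Processor: implementations of BasePageProcessor, mainly responsible for extracting data with XPath

ISee.Shaun.Spiders.SpiderModel: data model layer, responsible for entity definitions and EF data access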

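Taking Hunan cuisine listings on Dianping as the example, the program runs as follows:

InternetSpider reads the configuration file to get the URLs to crawl. Dianping search results are paginated to at most 50 pages, so to get more data the search has to be narrowed. After some observation, crawling district by district works reasonably well. The URL template is http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}, where {0} is the district code and {1} is the page number.

Figure 1: search URL for Hunan cuisine

Figure 2: search URL narrowed by district, 11 pages in total

Where do the district codes come from? Open the page in Chrome's developer tools; they are all in the page source.

Here is the dictionary: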

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.Common
{
    public static class DazhongdianpingArea
    {
        private static Dictionary<string, string> areaDic = null;

        public static Dictionary<string, string> GetAreaDic()
        {
            if (areaDic == null)
            {
                areaDic = new Dictionary<string, string>();
                areaDic.Add("r16", "西城区");
                areaDic.Add("r15", "东城区");
                areaDic.Add("r17", "海淀区");
                areaDic.Add("r328", "石景山区");
                areaDic.Add("r14", "朝阳区");
                areaDic.Add("r20", "丰台区");
                areaDic.Add("r9158", "顺义区");
                areaDic.Add("r5950", "昌平区");
                areaDic.Add("r5952", "大兴区");
                areaDic.Add("r9157", "房山区");
                areaDic.Add("r5951", "通州区");
                areaDic.Add("c4453", "怀柔区");
                areaDic.Add("c435", "延庆区");
                areaDic.Add("c434", "密云区");
                areaDic.Add("c4454", "门头沟区");
                areaDic.Add("c4455", "平谷区");
            }
            return areaDic;
        }
    }
}

OK, now look at the configuration file and match up the URLs we need:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <configSections>
    <!-- For more information on Entity Framework configuration, visit http://go.microsoft.com/fwlink/?LinkID=237468 -->
    <section name="entityFramework" type="System.Data.Entity.Internal.ConfigFile.EntityFrameworkSection, EntityFramework, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
  </configSections>
  <appSettings>
    <!-- Top-level category URL, 50 pages in total -->
    <add key="WebUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/p{0}" />
    <!-- Narrowed URL with the district code added -->
    <add key="WebAreaUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}" />
  </appSettings>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6.1" />
  </startup>
  <connectionStrings>
    <!-- Database connection string -->
    <add name="ConnectionStr" connectionString="data source=.;initial catalog=Membership_Spider;integrated security=True;user id=sa;password=123asd!@#;multipleactiveresultsets=True;" providerName="System.Data.SqlClient" />
  </connectionStrings>
  <entityFramework>
    <defaultConnectionFactory type="System.Data.Entity.Infrastructure.LocalDbConnectionFactory, EntityFramework">
      <parameters>
        <parameter value="mssqllocaldb" />
      </parameters>
    </defaultConnectionFactory>
    <providers>
      <provider invariantName="System.Data.SqlClient" type="System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer" />
    </providers>
  </entityFramework>
</configuration>

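Once we have the page URLs, we initialize the crawler service. I defined a RunSpider class; its constructor takes the names of the Processor and Pipeline implementation classes, the encoding, and a flag for removing outbound links. Calling its Run method starts the crawl.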

using ISee.Shaun.Spiders.Business;
using ISee.Shaun.Spiders.Common;
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace InternetSpider
{
    class Program
    {
        private static string urlInfo = ConfigurationManager.AppSettings["WebUrls"];
        private static string urlAreaInfo = ConfigurationManager.AppSettings["WebAreaUrls"];

        static void Main(string[] args)
        {
            Run();
        }

        /// <summary>
        /// Begin spider
        /// </summary>
        private static void Run()
        {
            // Build one URL per district code and page number
            Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
            List<string> urls = new List<string>();
            foreach (var key in areaDic.Keys)
            {
                // The loop bounds were lost from the published listing; 1..50 is
                // assumed here because Dianping serves at most 50 pages per search.
                for (int i = 1; i <= 50; i++)
                {
                    urls.Add(string.Format(urlAreaInfo, key, i));
                }
            }

            RunSpider runSpiders = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
            runSpiders.Run(urls);

            // Alternative: crawl the top-level category by page number only
            //RunSpider runSpider = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
            //runSpider.Run(urlInfo, 50);
        }
    }
}

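I won't explain RunSpider at length; see the code comments. Its main purpose is to make it easy to start new tasks, to call sites under different domains, or, as in the delegate here, to launch crawls of child pages. Reflection was added so that a later extension can create batch task configuration files and run the tasks automatically.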

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DotnetSpider.Core;
using DotnetSpider.Core.Downloader;
using DotnetSpider.Core.Pipeline;
using DotnetSpider.Core.Processor;
using DotnetSpider.Core.Scheduler;
using ISee.Shaun.Spiders.Common;
using ISee.Shaun.Spiders.Pipeline;
using ISee.Shaun.Spiders.Processor;

namespace ISee.Shaun.Spiders.Business
{
    public class RunSpider
    {
        private const string ASSEMBLY_PROCESSOR_NAME = "ISee.Shaun.Spiders.Processor";
        private const string ASSEMBLY_PIPELINE_NAME = "ISee.Shaun.Spiders.Pipeline";

        private BaseProcessor processor = null;
        private BasePipeline pipeline = null;
        private Site site = null;
        private string encoding = string.Empty;
        private bool removeOutBound = false;

        // The default was lost from the published listing; 1 is an assumed placeholder.
        private int spiderThreadNums = 1;

        public int SpiderThreadNums { get => spiderThreadNums; set => spiderThreadNums = value; }

        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="processorName"></param>
        /// <param name="pipeLineName"></param>
        public RunSpider(string processorName, string pipeLineName, string encoding, bool removeOutBound)
        {
            // Create the processor instance via reflection
            processor = ReflectionInvoke.GetInstance(ASSEMBLY_PROCESSOR_NAME, processorName, null) as BaseProcessor;
            // Callback for passing information back; here it continues with the crawl of child pages
            processor.InvokeFoodUrls = this.InvokeNext;
            pipeline = ReflectionInvoke.GetInstance(ASSEMBLY_PIPELINE_NAME, pipeLineName, null) as BasePipeline;
            this.encoding = encoding;
            this.removeOutBound = removeOutBound;
        }

        /// <summary>
        /// Run by page number
        /// </summary>
        /// <param name="urlInfo"></param>
        /// <param name="times"></param>
        public void Run(string urlInfo, int times)
        {
            SetSite(encoding, removeOutBound, urlInfo, times);
            Run();
        }

        /// <summary>
        /// Run against a list of URLs
        /// </summary>
        /// <param name="urlList"></param>
        public void Run(List<string> urlList)
        {
            SetSite(encoding, removeOutBound, urlList);
            Run();
        }

        /// <summary>
        /// Begin spider
        /// </summary>
        private void Run()
        {
            Spider spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), processor);
            spider.AddPipeline(pipeline);
            spider.Downloader = new HttpClientDownloader();
            spider.ThreadNum = this.spiderThreadNums;
            // The next two values were lost from the published listing;
            // 3000 and 50 are assumed placeholders, tune them for your crawl.
            spider.EmptySleepTime = 3000;
            spider.Deep = 50;
            spider.Run();
        }

        private void InvokeNext(string processorName, string pipeLineName, List<string> foodUrls)
        {
            RunSpider runSpider = new RunSpider(processorName, pipeLineName, this.encoding, true);
            runSpider.Run(foodUrls);
        }

        /// <summary>
        /// Set the site's start URLs from a template with a variable page number
        /// </summary>
        /// <param name="encoding"></param>
        /// <param name="removeOutBound"></param>
        /// <param name="urlInfo"></param>
        /// <param name="times"></param>
        private void SetSite(string encoding, bool removeOutBound, string urlInfo, int times)
        {
            // Note: removeOutBound is currently ignored; RemoveOutboundLinks is always false
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = false };
            if (times == 0)
            {
                this.site.AddStartUrl(urlInfo);
            }
            else
            {
                List<string> urls = new List<string>();
                for (int i = 1; i <= times; ++i)
                {
                    urls.Add(string.Format(urlInfo, i));
                }
                this.site.AddStartUrls(urls);
            }
        }

        /// <summary>
        /// Set the site's start URLs from a list of URLs
        /// </summary>
        /// <param name="encoding"></param>
        /// <param name="removeOutBound"></param>
        /// <param name="urlList"></param>
        private void SetSite(string encoding, bool removeOutBound, List<string> urlList)
        {
            // Note: removeOutBound is currently ignored; RemoveOutboundLinks is always false
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = false };
            this.site.AddStartUrls(urlList);
        }
    }
}
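
The ReflectionInvoke.GetInstance helper called in the constructor is not shown in this post. As a rough idea of what such a helper could look like, here is a minimal sketch. It assumes the type's full name is the assembly name plus the class name (which matches how the two ASSEMBLY_* constants are used above); the class name ReflectionSketch is hypothetical and this is not the project's actual code.

```csharp
using System;
using System.Reflection;

public static class ReflectionSketch
{
    // Hypothetical stand-in for ReflectionInvoke.GetInstance: loads the assembly
    // by name and instantiates the type "<assemblyName>.<className>".
    public static object GetInstance(string assemblyName, string className, object[] args)
    {
        Assembly assembly = Assembly.Load(assemblyName);
        // throwOnError: true raises a descriptive exception if the type is missing,
        // instead of returning null and failing later.
        Type type = assembly.GetType(assemblyName + "." + className, true);
        // A null or empty args array selects the parameterless constructor.
        return Activator.CreateInstance(type, args);
    }
}
```

Because the caller only passes strings, a new site can be added by dropping in a new Processor/Pipeline pair without touching RunSpider itself.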

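As for the Processor, I will later add implementation classes for different websites, so the shared properties are pulled up into an abstraction. The code: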

using DotnetSpider.Core;
using DotnetSpider.Core.Processor;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using static ISee.Shaun.Spiders.Common.DelegeteDefine;

namespace ISee.Shaun.Spiders.Processor
{
    public class BaseProcessor : BasePageProcessor
    {
        // URLs of child pages collected while handling a page
        protected List<string> foodUrls = null;

        // Callback used to hand the collected URLs to a follow-up spider
        public CallbackEventHandler InvokeFoodUrls { get; set; }

        protected string SourceWebsite { get; set; }

        public BaseProcessor() { foodUrls = new List<string>(); }

        // Site-specific subclasses must override this
        protected override void Handle(Page page)
        {
            throw new NotImplementedException();
        }

        protected virtual void InvokeCallback(string processorName, string pipeLineName)
        {
            if (InvokeFoodUrls != null && this.foodUrls.Count > 0)
            {
                InvokeFoodUrls(processorName, pipeLineName, this.foodUrls);
            }
        }
    }
}
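
The CallbackEventHandler delegate from DelegeteDefine is not listed in this post. Judging from the signature of RunSpider.InvokeNext, it presumably takes a processor name, a pipeline name, and a list of URLs. The sketch below illustrates that callback pattern in isolation; the delegate shape is inferred, and DelegateSketch and ListPageSketch are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class DelegateSketch
{
    // Assumed shape of DelegeteDefine.CallbackEventHandler, inferred from InvokeNext.
    public delegate void CallbackEventHandler(string processorName, string pipeLineName, List<string> urls);
}

public class ListPageSketch
{
    private readonly List<string> detailUrls = new List<string>();

    // Set by the scheduling layer, like processor.InvokeFoodUrls = this.InvokeNext.
    public DelegateSketch.CallbackEventHandler OnDetailUrls { get; set; }

    // Remember a child-page URL found on the current list page.
    public void Collect(string url)
    {
        if (!string.IsNullOrEmpty(url)) detailUrls.Add(url);
    }

    // Hands the collected URLs to the callback; returns true if it was invoked.
    public bool Flush(string processorName, string pipeLineName)
    {
        if (OnDetailUrls == null || detailUrls.Count == 0) return false;
        // Pass a copy so the receiver is unaffected by the Clear() below.
        OnDetailUrls(processorName, pipeLineName, new List<string>(detailUrls));
        detailUrls.Clear();
        return true;
    }
}
```

The delegate keeps the Processor layer free of any reference to the Business layer: the processor only knows that someone will take the URLs, not who.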

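Next comes the concrete implementation class. I won't go into XPath here, since there is plenty of material online. If the page structure is unclear, use Chrome's developer tools or grab the HTML while debugging and analyze it yourself; this post does not include screenshots of that.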

using DotnetSpider.Core;
using DotnetSpider.Core.Processor;
using DotnetSpider.Core.Selector;
using ISee.Shaun.Spiders.Common;
using ISee.Shaun.Spiders.SpiderModel.Model;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using static ISee.Shaun.Spiders.Common.DelegeteDefine;

namespace ISee.Shaun.Spiders.Processor
{
    public class DazhongdianpingProcessor : BaseProcessor
    {
        public DazhongdianpingProcessor() : base()
        {
            // Mark where the data comes from
            SourceWebsite = "大众点评";
        }

        /// <summary>
        /// Overrides the base method and performs the data extraction
        /// </summary>
        /// <param name="page"></param>
        protected override void Handle(Page page)
        {
            // Use Selectable to query the page and build the data objects we want
            var totalVideoElements = page.Selectable.SelectList(Selectors.XPath(".//div[@class='shop-list J_shop-list shop-all-list']/ul/li")).Nodes();
            if (totalVideoElements == null)
            {
                return;
            }

            // Collection of records to hand to the pipeline
            List<Restaurant> restaurantList = new List<Restaurant>();
            foreach (var restElement in totalVideoElements)
            {
                var restaurant = new Restaurant() { SourceWebsite = SourceWebsite };

                // Extract the restaurant fields with XPath
                restaurant.Name = restElement.Select(Selectors.XPath(".//h4")).GetValue();
                var price = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a[@class='mean-price']/b")).GetValue();
                restaurant.AveragePrice = string.IsNullOrEmpty(price) ? "" : price.Replace("￥", "");
                restaurant.Type = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a/span[@class='tag']")).GetValue();
                restaurant.Star = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='comment']/span/@title")).GetValue();
                restaurant.ImageUrl = restElement.Select(Selectors.XPath(".//div[@class='pic']/a/img/@src")).GetValue();

                // The last URL segment looks like "r16p3": district code, then 'p', then page number
                var areaCode = page.Url.Substring(page.Url.LastIndexOf('/') + 1);
                int pageMarker = areaCode.IndexOf('p');
                if (pageMarker > 0 && (areaCode.Contains("r") || areaCode.Contains("c")))
                {
                    Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
                    string result = areaCode.Substring(0, pageMarker);
                    if (areaDic.ContainsKey(result))
                    {
                        restaurant.Area = areaDic[result];
                    }
                }

                // Three scores in order: taste, environment, service
                List<ISelectable> infoList = restElement.SelectList(Selectors.XPath("./div[@class='txt']/span[@class='comment-list']/span/b")).Nodes() as List<ISelectable>;
                if (infoList != null && infoList.Count >= 3)
                {
                    var result = infoList[0].GetValue();
                    restaurant.Taste = string.IsNullOrEmpty(result) ? string.Empty : result;
                    result = infoList[1].GetValue();
                    restaurant.Environment = string.IsNullOrEmpty(result) ? string.Empty : result;
                    result = infoList[2].GetValue();
                    restaurant.ServiceScore = string.IsNullOrEmpty(result) ? string.Empty : result;
                }

                var recommetList = restElement.SelectList(Selectors.XPath(".//div[@class='txt']/div[@class='recommend']/a")).Nodes();
                restaurant.Recommendation = string.Join(",", recommetList.Select(o => o.GetValue()));
                restaurant.Address = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/span")).GetValue();
                restaurant.Position = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a[@data-click-name='shop_tag_region_click']/span[@class='tag']")).GetValue();

                var shopUrl = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a/@href")).GetValue();
                if (!string.IsNullOrEmpty(shopUrl))
                {
                    // The shop code is the last segment of the shop URL
                    restaurant.Code = shopUrl.Substring(shopUrl.LastIndexOf('/') + 1);
                    // Add next links
                    this.foodUrls.Add(shopUrl);
                }
                restaurantList.Add(restaurant);
            }

            // For a second-level crawl, uncomment this line and implement the two classes it names
            //InvokeCallback("DazhongdianpingFoodProcessor", "DazhongdianpingFoodPipeline");

            // Save the data object in the page under a custom key for the Pipeline to pick up
            page.AddResultItem("RestaurantList", restaurantList);
        }
    }
}
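
The district lookup in Handle depends on the last URL segment having the form district code, then 'p', then page number (for example r16p3). That parsing step can be isolated and checked on its own. The following is only a sketch of the idea using a regular expression; AreaCodeParser is a hypothetical helper and is not part of the project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AreaCodeParser
{
    // Matches a last URL segment such as "r16p3" or "c4453p12":
    // a district code ('r' or 'c' plus digits), the letter 'p', then the page number.
    private static readonly Regex Segment =
        new Regex(@"^(?<area>[rc][0-9]+)p(?<page>[0-9]+)$");

    public static bool TryParse(string url, out string areaCode, out int page)
    {
        areaCode = null;
        page = 0;
        if (string.IsNullOrEmpty(url)) return false;

        // LastIndexOf returns -1 when there is no '/', so the whole string is used.
        string last = url.Substring(url.LastIndexOf('/') + 1);
        Match m = Segment.Match(last);
        if (!m.Success) return false;
        if (!int.TryParse(m.Groups["page"].Value, out page)) return false;

        areaCode = m.Groups["area"].Value;
        return true;
    }
}
```

URLs built from the WebUrls template (ending in p{0} with no district code) simply fail to parse, so the caller can leave the Area field empty for them.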

Definition of a data entity (FoodInfo, which references Restaurant through a foreign key):

using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.SpiderModel.Model
{
    public class FoodInfo
    {
        [Key]
        public int Id { get; set; }
        public int RestaurantId { get; set; }
        public string Code { get; set; }
        public string RestaurantCode { get; set; }
        public string Name { get; set; }
        public string Price { get; set; }
        public string FoodImageUrl { get; set; }

        [ForeignKey("RestaurantId")]
        public Restaurant restaurant { get; set; }
    }
}

Once the data has been fetched, the spider automatically hands it to the pipeline, which processes the collected records. Straight to the code:

using DotnetSpider.Core;
using DotnetSpider.Core.Pipeline;
using ISee.Shaun.Spiders.SpiderModel.Model;
using ISee.Shaun.Spiders.SpiderModel;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ISee.Shaun.Spiders.Pipeline
{
    public class DazhongdianpingPipeline : BasePipeline
    {
        /// <summary>
        /// Process restaurant records
        /// </summary>
        /// <param name="resultItems"></param>
        /// <param name="spider"></param>
        public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
        {
            // Iterate over the result set
            foreach (ResultItems entry in resultItems)
            {
                // Create the EF context
                using (var rEntity = new FoodInfoEntity())
                {
                    List<Restaurant> resList = new List<Restaurant>();
                    foreach (Restaurant result in entry.Results["RestaurantList"])
                    {
                        // Use restaurant name plus address as the duplicate check
                        var resultList = rEntity.RestaurantInfo.Where(o => o.Name == result.Name && o.Address == result.Address).ToList();
                        if (resultList.Count == 0)
                        {
                            resList.Add(result);
                        }
                    }

                    // Insert only the restaurants that are not stored yet
                    if (resList.Count > 0)
                    {
                        rEntity.RestaurantInfo.AddRange(resList);
                        rEntity.SaveChanges();
                    }
                }
            }
        }
    }
}
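
The duplicate check above only compares each record against rows already in the database, so two identical entries arriving in the same batch would both pass it. One possible complement is to deduplicate the batch in memory on the same key (name plus address) before the database check. This is a sketch of that idea, not the project's code; ShopSketch and BatchDeduplicator are hypothetical names standing in for Restaurant and a helper class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the Restaurant entity, reduced to the key fields.
public class ShopSketch
{
    public string Name { get; set; }
    public string Address { get; set; }
}

public static class BatchDeduplicator
{
    // Returns the batch with later duplicates of (Name, Address) dropped,
    // preserving the order of first occurrence.
    public static List<ShopSketch> Distinct(IEnumerable<ShopSketch> batch)
    {
        var seen = new HashSet<Tuple<string, string>>();
        var result = new List<ShopSketch>();
        foreach (ShopSketch shop in batch)
        {
            // HashSet.Add returns false when the key is already present.
            if (seen.Add(Tuple.Create(shop.Name, shop.Address)))
            {
                result.Add(shop);
            }
        }
        return result;
    }
}
```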

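That's the whole thing, and it really is this simple. A few points are still worth stressing:

1. If you need to crawl a large number of pages, you can add a separate XML configuration file that defines the crawl rules or tasks. (I won't go into detail; leave a comment if you have questions.)

2. To extend the crawler to another site such as Meituan (美团网), implement the corresponding classes in the Processor and Pipeline projects.

3. For the data entities I used EF Code First. Feel free to switch to whatever approach you prefer or to a different database; there are plenty of articles on EF online.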
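
Point 1 mentions a task configuration file without showing one. As an illustration only, the sketch below assumes a made-up XML layout (a tasks root with task elements carrying processor, pipeline, url, and pages attributes) and parses it with LINQ to XML. The element names, attribute names, and the SpiderTask and SpiderTaskConfig classes are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical description of one crawl task.
public class SpiderTask
{
    public string Processor { get; set; }
    public string Pipeline { get; set; }
    public string UrlTemplate { get; set; }
    public int Pages { get; set; }
}

public static class SpiderTaskConfig
{
    // Parses an assumed layout:
    // <tasks><task processor=".." pipeline=".." url=".." pages=".." /></tasks>
    public static List<SpiderTask> Parse(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("task")
            .Select(t => new SpiderTask
            {
                Processor = (string)t.Attribute("processor"),
                Pipeline = (string)t.Attribute("pipeline"),
                UrlTemplate = (string)t.Attribute("url"),
                // A missing pages attribute defaults to a single page.
                Pages = (int?)t.Attribute("pages") ?? 1
            })
            .ToList();
    }
}
```

Each parsed task carries exactly the strings that the RunSpider constructor and Run(urlInfo, times) expect, so a loop over the list could start one spider per task.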

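That's all for today. This post was mostly code, so I'll leave the interpretation to you. Also, starting next week, 1024伐木累, which has been on hold for more than two years, will resume updates. I just want to see it through properly. All the best!

Addendum: the GitHub repository is at https://github.com/sall84993356/Spiders.git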
