Scraping HTML Images with High Concurrency in Golang
Prerequisites
1. Install Golang
2. Download the crawler packages
go get -v github.com/hunterhug/marmot/util
go get -v github.com/hunterhug/marmot/tool
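Program
This program can only fetch images referenced in the HTML by a src attribute whose value starts with the http(s) scheme, as in src="http...". Images referenced in other ways, such as data-src attributes or URLs assembled inside JavaScript, cannot be fetched.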
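To make the limitation concrete, here is a minimal sketch of this kind of matching using only the Go standard library. It is not marmot's actual implementation: the regular expression, the extractImageURLs helper and the example.com URLs are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// srcPattern is a hypothetical pattern: it matches an <img> tag whose src
// attribute holds an absolute http(s) URL. The \s before "src" keeps
// attributes such as data-src from matching.
var srcPattern = regexp.MustCompile(`(?i)<img[^>]*?\ssrc\s*=\s*["'](https?://[^"']+)["']`)

// extractImageURLs returns every absolute image URL found in html.
func extractImageURLs(html string) []string {
	var urls []string
	for _, m := range srcPattern.FindAllStringSubmatch(html, -1) {
		urls = append(urls, m[1]) // m[1] is the captured URL
	}
	return urls
}

func main() {
	page := `
<img src="http://example.com/a.jpg">
<img src='https://example.com/b.png' alt="b">
<img src="/relative/c.jpg">
<img data-src="http://example.com/d.jpg">
<script>var u = "http:" + "//example.com/e.jpg";</script>`

	// Only the first two tags carry src="http(s)://..."; the relative path,
	// the data-src attribute and the URL built in JS are skipped.
	for _, u := range extractImageURLs(page) {
		fmt.Println(u)
	}
}
```

Under these assumptions only a.jpg and b.png are found, which mirrors the three cases the program is said to miss.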
See: https://github.com/hunterhug/marmot/blob/master/example/practice/pictures/main.go
package main

import (
	"fmt"

	"github.com/hunterhug/marmot/tool"
	"github.com/hunterhug/marmot/util"
)

// MinerNum is the number of miners that run at the same time to crawl faster.
var MinerNum = 5

// ProxyAddress decides whether requests go through a proxy; set it to enable one.
var ProxyAddress interface{}

func main() {
	// Uncomment to use a proxy.
	// ProxyAddress = "socks5://127.0.0.1:1080"

	fmt.Println(`Welcome: Input "url" and picture keep "dir"`)
	fmt.Println("---------------------------------------------")

	// Read the page URL and the output directory, falling back to the defaults.
	url := util.Input(`URL(Like: "http://publicdomainarchive.com")`, "http://publicdomainarchive.com")
	dir := util.Input(`DIR(Default: "./picture")`, "./picture")
	fmt.Printf("You will keep %s picture in dir %s\n", url, dir)
	fmt.Println("---------------------------------------------")

	// Start downloading the pictures found on the page.
	err := tool.DownloadHTMLPictures(url, dir, MinerNum, ProxyAddress)
	if err != nil {
		fmt.Println("Error:" + err.Error())
	}
}
The comments in the code explain each step. Running the program prints:
Welcome: Input "url" and picture keep "dir"
---------------------------------------------
URL(Like: "http://publicdomainarchive.com")
DIR(Default: "./picture")
You will keep http://publicdomainarchive.com picture in dir ./picture
---------------------------------------------
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_02_03_modern.jpg
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_google_dark.png
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-003-667x1000-192684_667x675.jpg
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_powered-by-wp-engine.png
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_divi.png
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-002-1000x667.jpg
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_public-domain-mark.png
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_01_03_public-domain-images-free-stock-photos008-1000x625.jpg
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_09_03_Weekly.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-054-1000x667.jpg
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_10_03_instagram_dark.png
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_02_03_vintage.jpg
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-001-1000x667.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-070-1000x667.jpg
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_twitter02_dark.png
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_01_03_public-domain-images-free-stock-photos001-1000x750-167066_1000x675.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-035-1000x667.jpg
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_facebook_dark.png
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-060-1000x667.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-013-1000x667.jpg
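The SP0 to SP4 prefixes in this output appear to correspond to the five workers set by MinerNum, and the file names look like the image URL with ":" replaced by "_04_" and "/" replaced by "_03_". Both points are inferred from the output, not taken from marmot's documentation. A minimal standard-library sketch of that pattern follows; fileNameFor, downloadAll and the fetch callback are hypothetical names, and the callback only prints instead of performing a real download.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// fileNameFor mirrors the naming seen in the output (an inference, not
// marmot's documented behavior): ":" becomes "_04_" and "/" becomes "_03_".
func fileNameFor(url string) string {
	return strings.NewReplacer(":", "_04_", "/", "_03_").Replace(url)
}

// downloadAll starts `workers` goroutines that pull URLs from a shared
// channel. fetch is a hypothetical callback standing in for the real
// HTTP download; any errors it returns are collected and returned.
func downloadAll(urls []string, workers int, fetch func(worker int, url string) error) []error {
	jobs := make(chan string)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for u := range jobs {
				if err := fetch(id, u); err != nil {
					mu.Lock()
					errs = append(errs, err)
					mu.Unlock()
				}
			}
		}(w)
	}
	for _, u := range urls {
		jobs <- u // blocks until some worker is free to take the URL
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return errs
}

func main() {
	urls := []string{
		"http://publicdomainarchive.com/wp-content/uploads/2014/02/modern.jpg",
		"http://publicdomainarchive.com/wp-content/uploads/2014/02/vintage.jpg",
	}
	errs := downloadAll(urls, 5, func(worker int, url string) error {
		fmt.Printf("SP%d: Keep in ./picture/%s\n", worker, fileNameFor(url))
		return nil
	})
	fmt.Println("errors:", len(errs))
}
```

Because the workers take URLs from the channel as they become free, the order of the lines and the worker number attached to each file can differ between runs, which is consistent with the interleaved SP numbers in the output.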