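[Crawler] Scraping Tmall product pages with Go
A recent requirement at work was to scrape product information from Tmall. The overall flow is as follows.
First, the backend ad-exchange code is modified to parse a URL out of the creative material uploaded by Alibaba. The URL looks like this: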
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D
It is clearly URL-encoded, so it has to be decoded first. An online decoder is available at:
http://tool.chinaz.com/Tools/urlencode.aspx
After decoding, we get:
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content={"items":[{"images":["https://asearch.alicdn.com/bao/uploaded//i4/22356367/TB2PMQinN6I8KJjy0FgXXXXzVXa_!!0-saturn_solar.jpg"],"itemid":"7664169349","shorttitle":"乒乓球拍 无线专属"}]}
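All we need from it is "itemid":"7664169349".
We then visit https://detail.tmall.com/item.htm?id=7664169349, which opens the product detail page.
This is the page we need to scrape. The ad exchange pushes the parsed ItemId into nsq; the crawler reads ItemIds from nsq, builds the URL, scrapes the key information from the page, and sends it to Kafka. Hive and ES then consume it from Kafka so that it can be queried.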
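The data flow described above can be sketched with Go channels standing in for nsq and Kafka. Everything in this sketch is an illustrative assumption: `runPipeline`, the `fetch` callback and the in-memory channels are hypothetical names, not the real nsq or Kafka client APIs, and the fetcher is faked so the example runs without network access.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// itemURLPattern is the detail-page pattern quoted in the article.
const itemURLPattern = "https://detail.tmall.com/item.htm?id=%d"

// runPipeline reads item IDs from in (a stand-in for the nsq topic), calls
// fetch for each detail-page URL, and writes one JSON document per item to
// out (a stand-in for the Kafka topic). fetch is a hypothetical hook that a
// real crawler would implement with HTTP requests and HTML parsing.
func runPipeline(in <-chan int64, out chan<- []byte, fetch func(url string) (map[string]interface{}, error)) {
	defer close(out)
	for id := range in {
		url := fmt.Sprintf(itemURLPattern, id)
		info, err := fetch(url)
		if err != nil {
			continue // a real system would log and possibly retry
		}
		info["id"] = id
		info["url"] = url
		doc, err := json.Marshal(info)
		if err != nil {
			continue
		}
		out <- doc
	}
}

func main() {
	in := make(chan int64, 1)
	out := make(chan []byte, 1)
	in <- 7664169349
	close(in)

	// A fake fetcher so the sketch runs offline.
	fake := func(url string) (map[string]interface{}, error) {
		return map[string]interface{}{"name": "placeholder"}, nil
	}
	go runPipeline(in, out, fake)
	for doc := range out {
		fmt.Println(string(doc))
	}
}
```

Passing the fetcher in as a function keeps the queue handling testable on its own; the real scraping logic can be swapped in later.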
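Step 1
Step 1 is to extract the ItemId. The ad exchange gives us the URL to parse; we then decode the URL in code and pull out the ItemId value. The project uses Golang, so the examples are in Golang. It would be even simpler in Python, and the principle is the same.
For URL parsing, see:
https://gobyexample.com/url-parsing
For JSON serialization and deserialization, see:
https://www.cnblogs.com/liang1101/p/6741262.html
Here is my code: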
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// Fields must be exported (capitalized) so that encoding/json can fill them.
// Key matching is case-insensitive, so "itemid" maps to ItemId.
type item struct {
	Images     []string
	ItemId     string
	ShortTitle string
}

func main() {
	urlstring := "https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D"
	// Decode the whole URL only to print it for inspection.
	unescape, err := url.QueryUnescape(urlstring)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println(unescape)
	// Parse the original, still-encoded URL: Query() decodes the values
	// itself, so the content is not decoded twice.
	parse, err := url.Parse(urlstring)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println(parse.RawQuery)
	content := parse.Query().Get("content")
	fmt.Printf("%T, %v\n", content, content)
	m := make(map[string][]item)
	if err := json.Unmarshal([]byte(content), &m); err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println("m:", m)
	if len(m["items"]) == 0 {
		fmt.Println("no items found")
		return
	}
	itemValue := m["items"][0]
	fmt.Println(itemValue.ItemId)
	// Convert to int64.
	i, err := strconv.ParseInt(itemValue.ItemId, 10, 64)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Printf("%T, %v\n", i, i)
}
Running the program gives us the ItemId value we need.
Step 2
Step 2 is to build the URL and scrape the page content.
How do we fetch a web page in Go? Here is a simple demo.
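Inside a service, the same extraction is probably more useful as a function that returns errors instead of printing them. The following is a minimal sketch under that assumption; `extractItemID`, the `creative` type and the shortened sample URL are hypothetical and only mirror the `content` structure shown earlier.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// creative mirrors only the part of the "content" JSON that is needed.
type creative struct {
	Items []struct {
		ItemID string `json:"itemid"`
	} `json:"items"`
}

// extractItemID returns the first itemid found in the "content" query
// parameter of a creative URL. The function name is hypothetical.
func extractItemID(rawURL string) (int64, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	// Query() percent-decodes the parameter values, so no separate
	// QueryUnescape call is needed.
	content := u.Query().Get("content")
	if content == "" {
		return 0, errors.New("no content parameter")
	}
	var c creative
	if err := json.Unmarshal([]byte(content), &c); err != nil {
		return 0, err
	}
	if len(c.Items) == 0 {
		return 0, errors.New("no items in content")
	}
	return strconv.ParseInt(c.Items[0].ItemID, 10, 64)
}

func main() {
	// Shortened, hypothetical URL with the same content structure.
	raw := "https://handycam.alicdn.com/x.mp4?content=%7B%22items%22%3A%5B%7B%22itemid%22%3A%227664169349%22%7D%5D%7D"
	id, err := extractItemID(raw)
	fmt.Println(id, err)
}
```

Returning an error for a missing parameter or an empty item list lets the caller decide whether to skip the creative or report it.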
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	website := "http://www.future.org.cn"
	resp, err := http.Get(website)
	if err != nil {
		fmt.Println("Cannot connect the server:", err)
		return
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Cannot read from connected http server:", err)
		return
	}
	fmt.Println("HTML content:", string(body))
}
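After fetching the page, however, you will notice a problem: the Chinese text comes out garbled, because the page is GBK-encoded rather than UTF-8.
To fix the garbled text, install iconv-go:
go get github.com/djimenez/iconv-go
The content can then be transcoded after it has been fetched, for example: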
func convFromGbk(s string) string {
gbkConvert, _ := iconv.NewConverter("gbk", "utf-8")
res, _ := gbkConvert.ConvertString(s)
return res
}
Alternatively, the Reader itself can be wrapped in a transcoding reader:
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return nil, err
}
req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
rsp, err := j.client.Do(req)
if err != nil {
return nil, err
}
// Transcode the GB2312 body to UTF-8.
utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
//if body, err := ioutil.ReadAll(utfBody); err == nil {
// fmt.Println("HTML content:", string(body))
//}
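The snippets above hard-code the source encoding. One possible refinement, shown here as a sketch and not as part of the original crawler, is to read the charset from the response's Content-Type header and transcode only when it is not UTF-8. `charsetOf` and `needsTranscoding` are hypothetical helpers built on the standard `mime` package; pages that declare their charset only in a `<meta>` tag are not handled.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetOf returns the lower-cased charset parameter of a Content-Type
// header value, or "" when none is declared or the value is malformed.
func charsetOf(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// needsTranscoding reports whether a body with this Content-Type declares a
// charset other than UTF-8 and should be converted before use.
func needsTranscoding(contentType string) bool {
	cs := charsetOf(contentType)
	return cs != "" && cs != "utf-8" && cs != "utf8"
}

func main() {
	for _, ct := range []string{"text/html;charset=GBK", "text/html; charset=utf-8", "text/html"} {
		fmt.Printf("%q -> charset=%q transcode=%v\n", ct, charsetOf(ct), needsTranscoding(ct))
	}
}
```

In a crawler the header value would come from `rsp.Header.Get("Content-Type")`, and the detected charset could be passed to the converter in place of the literal "gb2312".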
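The fetched page then has to be parsed; XPath is used here.
For XPath usage, see:
http://www.w3school.com.cn/xpath/xpath_axes.asp
It is very simple and can be picked up in one read.
What comes back is HTML, so you only need to pick out the content you want. In other words, this step is just HTML parsing.
The next difficulty is that the static page we fetch contains most of the information but not the price, because the price is loaded dynamically.
Let's analyze it.
Open the request that loads the price, copy its URL and open it in a browser: access is denied with a 403. Don't worry; it is enough to add the Referer header to the request.
In code: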
referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
req.Header.Set("Referer", referer)
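The effect of the Referer header can be tried offline with a local test server. This is only a sketch: `newFakePriceServer` is a hypothetical stand-in that imitates the 403 behaviour described above, and it says nothing about how the real endpoint decides.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newFakePriceServer returns a local test server that answers 403 unless the
// request carries a Referer from detail.tmall.com. It is a stand-in for
// illustration, not the real price endpoint.
func newFakePriceServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.HasPrefix(r.Referer(), "https://detail.tmall.com/") {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, `{"isSuccess":true}`)
	}))
}

// statusWithReferer issues a GET and returns the status code; an empty
// referer means the header is not sent at all.
func statusWithReferer(target, referer string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return 0, err
	}
	if referer != "" {
		req.Header.Set("Referer", referer)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	srv := newFakePriceServer()
	defer srv.Close()

	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", int64(7664169349))
	without, _ := statusWithReferer(srv.URL, "")
	with, _ := statusWithReferer(srv.URL, referer)
	fmt.Println(without, with) // prints "403 200"
}
```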
Looking at the response, we find that it is JSON.
Formatted (formatter: http://tool.oschina.net/codeformat/json), it looks like this:
{
"defaultModel": {
"bannerDO": {
"success": true
},
"deliveryDO": {
"areaId": 110100,
"deliveryAddress": "浙江金华",
"deliverySkuMap": {
"6310159781": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"default": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"6310159797": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"3280089025135": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"3280089025136": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
]
},
"destination": "北京市",
"success": true
},
"detailPageTipsDO": {
"crowdType": 0,
"hasCoupon": true,
"hideIcons": false,
"jhs99": false,
"minicartSurprise": 0,
"onlyShowOnePrice": false,
"priceDisplayType": 4,
"primaryPicIcons": [ ],
"prime": false,
"showCuntaoIcon": false,
"showDou11Style": false,
"showDou11SugPromPrice": false,
"showDou12CornerIcon": false,
"showDuo11Stage": 0,
"showJuIcon": false,
"showMaskedDou11SugPrice": false,
"success": true,
"trueDuo11Prom": false
},
"doubleEleven2014": {
"doubleElevenItem": false,
"halfOffItem": false,
"showAtmosphere": false,
"showRightRecommendedArea": false,
"step": 0,
"success": true
},
"extendedData": { },
"extras": { },
"gatewayDO": {
"changeLocationGateway": {
"queryDelivery": true,
"queryProm": false
},
"success": true,
"trade": {
"addToBuyNow": { },
"addToCart": { }
}
},
"inventoryDO": {
"hidden": false,
"icTotalQuantity": 225,
"skuQuantity": {
"3280089025136": {
"quantity": 71,
"totalQuantity": 71,
"type": 1
},
"6310159781": {
"quantity": 33,
"totalQuantity": 33,
"type": 1
},
"6310159797": {
"quantity": 44,
"totalQuantity": 44,
"type": 1
},
"3280089025135": {
"quantity": 77,
"totalQuantity": 77,
"type": 1
}
},
"success": true,
"totalQuantity": 225,
"type": 1
},
"itemPriceResultDO": {
"areaId": 110100,
"duo11Item": false,
"duo11Stage": 0,
"extraPromShowRealPrice": false,
"halfOffItem": false,
"hasDPromotion": false,
"hasMobileProm": false,
"hasTmallappProm": false,
"hiddenNonBuyPrice": false,
"hideMeal": false,
"priceInfo": {
"6310159781": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "178.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "75.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
},
"6310159797": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "178.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "75.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
},
"3280089025135": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "168.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "68.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
},
"3280089025136": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "168.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "68.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
}
},
"queryProm": false,
"success": true,
"successCall": true,
"tmallShopProm": [ ]
},
"memberRightDO": {
"activityType": 0,
"level": 0,
"postageFree": false,
"shopMember": false,
"success": true,
"time": 1,
"value": 0.5
},
"miscDO": {
"bucketId": 15,
"city": "北京",
"cityId": 110100,
"debug": { },
"hasCoupon": false,
"region": "东城区",
"regionId": 110101,
"rn": "fa015e69c6a4ca4bb559805d670557e7",
"smartBannerFlag": "top",
"success": true,
"supportCartRecommend": false,
"systemTime": "1555232632711",
"town": "东华门街道",
"townId": 110101001
},
"regionalizedData": {
"success": true
},
"sellCountDO": {
"sellCount": "5",
"success": true
},
"servicePromise": {
"has3CPromise": false,
"servicePromiseList": [
{
"description": "商品支持正品保障服务",
"displayText": "正品保证",
"icon": "无",
"link": "//www.tmall.com/wow/portal/act/bzj",
"rank": -1
},
{
"description": "极速退款是为诚信会员提供的退款退货流程的专享特权,额度是根据每个用户当前的信誉评级情况而定",
"displayText": "极速退款",
"icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif",
"link": "//vip.tmall.com/vip/privilege.htm?spm=3.1000588.0.141.2a0ae8&priv=speed",
"rank": -1
},
{
"description": "卖家为您购买的商品投保退货运费险(保单生效以下单显示为准)",
"displayText": "赠运费险",
"icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif",
"link": "//service.tmall.com/support/tmall/knowledge-1121473.htm?spm=0.0.0.0.asbDA1",
"rank": -1
},
{
"description": "七天无理由退换",
"displayText": "七天无理由退换",
"icon": "//img.alicdn.com/tps/i3/T1Vyl6FCBlXXaSQP_X-16-16.png",
"link": "//pages.tmall.com/wow/seller/act/seven-day",
"rank": -1
}
],
"show": true,
"success": true,
"titleInformation": [ ]
},
"soldAreaDataDO": {
"currentAreaEnable": true,
"success": true,
"useNewRegionalSales": true
},
"tradeResult": {
"cartEnable": true,
"cartType": 2,
"miniTmallCartEnable": true,
"startTime": 1554812946000,
"success": true,
"tradeEnable": true
},
"userInfoDO": {
"activeStatus": 0,
"companyPurchaseUser": false,
"loginMember": false,
"loginUserType": "buyer",
"success": true,
"userId": 0
}
},
"isSuccess": true
}
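The price URL used later in ali_spider.go carries `callback=setMdskip`, and the crawler strips everything outside the parentheses before decoding, so the raw body is presumably wrapped as `setMdskip({...})`. A minimal sketch of that unwrapping step follows; `unwrapJSONP` is a hypothetical helper name. It uses the last `)` so that parentheses inside string values do not cut the payload short.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// unwrapJSONP returns the text between the first "(" and the last ")".
func unwrapJSONP(body string) (string, error) {
	left := strings.Index(body, "(")
	right := strings.LastIndex(body, ")")
	if left < 0 || right <= left {
		return "", errors.New("not a JSONP response")
	}
	return strings.TrimSpace(body[left+1 : right]), nil
}

func main() {
	// Hypothetical body; the "note" value shows why LastIndex is used.
	body := `setMdskip({"isSuccess":true,"note":"a (nested) value"})`
	payload, err := unwrapJSONP(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	var v map[string]interface{}
	if err := json.Unmarshal([]byte(payload), &v); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(v["isSuccess"], v["note"])
}
```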
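The JSON is very large, and parsing every field would be tedious. We only need the price information, i.e. priceInfo, so we want something like XPath for JSON. JSONPath is used here.
See: https://github.com/DarrenChanChenChi/jsonpath
Its usage is much the same as XPath.
Then just extract the fields we want.
Full code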
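For comparison, the same fields can be reached with the standard library alone. This is a different technique from JSONPath, not the one the crawler below uses: declare a struct that names only the path `defaultModel.itemPriceResultDO.priceInfo` and let `encoding/json` ignore the rest. The type and function names are hypothetical, the sample is a trimmed copy of the response shown earlier, and falling back to the list price when no promotion exists is an assumption of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// priceResponse declares only the fields on the path to priceInfo;
// encoding/json silently ignores every field that is not listed.
type priceResponse struct {
	DefaultModel struct {
		ItemPriceResultDO struct {
			PriceInfo map[string]struct {
				Price         string `json:"price"`
				PromotionList []struct {
					Price string `json:"price"`
				} `json:"promotionList"`
			} `json:"priceInfo"`
		} `json:"itemPriceResultDO"`
	} `json:"defaultModel"`
}

// sample is a trimmed copy of the response shown above.
const sample = `{"defaultModel":{"itemPriceResultDO":{"priceInfo":{
  "6310159781":{"price":"178.00","promotionList":[{"price":"75.00"}]},
  "3280089025135":{"price":"168.00","promotionList":[{"price":"68.00"}]}}}},
  "isSuccess":true}`

// skuPrices maps each SKU id to {list price, first promotion price}.
func skuPrices(data []byte) (map[string][2]string, error) {
	var r priceResponse
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	out := make(map[string][2]string)
	for sku, info := range r.DefaultModel.ItemPriceResultDO.PriceInfo {
		promo := info.Price // assumption: fall back to the list price
		if len(info.PromotionList) > 0 {
			promo = info.PromotionList[0].Price
		}
		out[sku] = [2]string{info.Price, promo}
	}
	return out, nil
}

func main() {
	prices, err := skuPrices([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(prices)
}
```

The struct approach gives compile-time field names and no extra dependency; JSONPath is more convenient when the path is only known at run time, as with the per-SKU lookups in the crawler.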
common.go:
package main

import (
	"fmt"
	"math/rand"
	"net"
	"net/http"
	"strings"
	"time"

	"github.com/djimenez/iconv-go"
	"gopkg.in/xmlpath.v2"
)

// Msg is the message read from the queue.
type Msg struct {
	AdID     int64  `json:"ad_id"`
	SourceID int64  `json:"source_id"`
	Source   string `json:"source"`
	ItemID   int64  `json:"item_id"`
	URL      string `json:"url"`
	UID      int64  `json:"uid"`
	DID      int64  `json:"did"`
}

// convFromGbk converts a GBK string to UTF-8 and returns the input
// unchanged if the conversion fails.
func convFromGbk(s string) string {
	gbkConvert, err := iconv.NewConverter("gbk", "utf-8")
	if err != nil {
		return s
	}
	defer gbkConvert.Close()
	res, err := gbkConvert.ConvertString(s)
	if err != nil {
		return s
	}
	return res
}

func newHTTPClient() *http.Client {
	client := &http.Client{
		Transport: &http.Transport{
			Dial: func(netw, addr string) (net.Conn, error) {
				return net.DialTimeout(netw, addr, 1500*time.Millisecond)
			},
			MaxIdleConnsPerHost: 200,
		},
		Timeout: 1500 * time.Millisecond,
	}
	return client
}

// parseNode returns the first non-empty match of xpath.
func parseNode(node *xmlpath.Node, xpath string) string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		fmt.Println("compile xpath:", err)
		return ""
	}
	it := path.Iter(node)
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			//return convFromGbk(s)
			return s
		}
	}
	return ""
}

// parseNodeForAll returns all non-empty matches of xpath.
func parseNodeForAll(node *xmlpath.Node, xpath string) []string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		fmt.Println("compile xpath:", err)
		return nil
	}
	it := path.Iter(node)
	elements := []string{}
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			//return convFromGbk(s)
			elements = append(elements, s)
		}
	}
	return elements
}

// percent returns true with a probability of pct percent.
func percent(pct int) bool {
	if pct < 0 || pct > 100 {
		return false
	}
	return pct > rand.Intn(100)
}
ali_spider.go:
package main

import (
	"code.byted.org/gopkg/logs"
	"encoding/json"
	"fmt"
	"github.com/djimenez/iconv-go"
	"github.com/ngaut/logging"
	"github.com/oliveagle/jsonpath"
	"gopkg.in/xmlpath.v2"
	"io/ioutil"
	"math/rand"
	"net/http"
	"strconv"
	"strings"
)

const itemURLPatternAli = "https://detail.tmall.com/item.htm?id=%d"
const priceURLPatternAli = "https://mdskip.taobao.com/core/initItemDetail.htm?isUseInventoryCenter=false&cartEnable=true&service3C=false&isApparel=true&isSecKill=false&tmallBuySupport=true&isAreaSell=false&tryBeforeBuy=false&offlineShop=false&itemId=%d&showShopProm=false&isPurchaseMallPage=false&itemGmtModified=1555201252000&isRegionLevel=false&household=false&sellerPreview=false&queryMemberRight=true&addressLevel=2&isForbidBuyItem=false&callback=setMdskip&timestamp=1555210888509&isg=bBQF1SmIvk4dQ8UGBOCNIZNDTp7T7IRAguWjmN99i_5Qy1Y_p8_OlZkxNev6Vj5RsG8p46-P7M29-etfw&isg2=BPPzr6M1qyiTZGdgYB4puOBagvEXdGgbstRSkqWQUpJJpBNGLPrUOlF1XpTvBN_i"

// ualist holds the User-Agent strings that requests rotate through.
var ualist = []string{
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}

type AliSpider struct {
	client *http.Client
}

func NewAliSpider() *AliSpider {
	return &AliSpider{
		client: newHTTPClient(),
	}
}

// loadPage fetches a page and parses it into an xmlpath node tree.
func (j *AliSpider) loadPage(url string) (*xmlpath.Node, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()
	// Transcode the GB2312 body to UTF-8 before parsing.
	utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
	if err != nil {
		return nil, err
	}
	//if body, err := ioutil.ReadAll(utfBody); err == nil {
	//	fmt.Println("HTML content:", string(body))
	//}
	return xmlpath.ParseHTML(utfBody)
}

// parsePrice returns, per SKU id, the list price and the promotion price.
func (j *AliSpider) parsePrice(itemID int64) (map[string]map[string]float64, error) {
	priceURL := fmt.Sprintf(priceURLPatternAli, itemID)
	req, err := http.NewRequest("GET", priceURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	// Without a Referer pointing at the detail page the endpoint answers 403.
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
	req.Header.Set("Referer", referer)
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()
	priceInfoRaw, err := ioutil.ReadAll(rsp.Body)
	if err != nil {
		return nil, err
	}
	jsonStr := convFromGbk(string(priceInfoRaw))

	// The body is wrapped in a callback, e.g. setMdskip({...}); keep only the
	// JSON between the first "(" and the last ")".
	leftIndex := strings.Index(jsonStr, "(") + 1
	rightIndex := strings.LastIndex(jsonStr, ")")
	if rightIndex < leftIndex {
		return nil, fmt.Errorf("unexpected price response for item %d", itemID)
	}
	var json_data interface{}
	if err := json.Unmarshal([]byte(jsonStr[leftIndex:rightIndex]), &json_data); err != nil {
		return nil, err
	}

	skuQuantity, err := jsonpath.JsonPathLookup(json_data, "$.defaultModel.inventoryDO.skuQuantity")
	if err != nil {
		logs.Info("json path is err, err is %v", err)
		return nil, err
	}
	skuQuantityMap, ok := skuQuantity.(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("skuQuantity has unexpected type %T", skuQuantity)
	}
	itemPriceResultMap := map[string]map[string]float64{}
	for skuQuantityId := range skuQuantityMap {
		jpathPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.price", skuQuantityId)
		jpathPromotionPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.promotionList[0].price", skuQuantityId)
		price, err := jsonpath.JsonPathLookup(json_data, jpathPrice)
		if err != nil {
			logs.Info("jpathPrice is err, err is %v", err)
		}
		promotionPrice, err := jsonpath.JsonPathLookup(json_data, jpathPromotionPrice)
		if err != nil {
			logs.Info("jpathPromotionPrice is err, err is %v", err)
		}
		// A failed lookup leaves the value nil; the checked assertion then
		// yields "" and the price parses as 0 instead of panicking.
		priceStr, _ := price.(string)
		promotionPriceStr, _ := promotionPrice.(string)
		// Create a fresh map per SKU. Sharing one map across iterations would
		// make every SKU report the prices of the last SKU processed.
		itemPriceResultDetailMap := map[string]float64{}
		itemPriceResultDetailMap["price"], _ = strconv.ParseFloat(priceStr, 64)
		itemPriceResultDetailMap["promotion_price"], _ = strconv.ParseFloat(promotionPriceStr, 64)
		itemPriceResultMap[skuQuantityId] = itemPriceResultDetailMap
	}
	return itemPriceResultMap, nil
}

// Parse scrapes the detail page and the price endpoint for one message.
func (j *AliSpider) Parse(msg *Msg) (map[string]interface{}, error) {
	defer func() {
		if r := recover(); r != nil {
			logging.Errorf("parse msg %v, error %v", *msg, r)
		}
	}()
	itemURL := fmt.Sprintf(itemURLPatternAli, msg.ItemID)
	node, err := j.loadPage(itemURL)
	if err != nil {
		return nil, err
	}
	//metricsClient.EmitCounter("jd_spider", 1, "", map[string]string{"step": "parse"})
	name := parseNode(node, "//h1[@data-spm]")
	// Product attributes as they appear on the page, e.g.:
	//   产品名称:纽曼            (product name)
	//   品牌: 纽曼               (brand)
	//   型号: EX16               (model)
	//   功能: 睡眠监测 计步 防水 (features)
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		// Split on the first colon only, so values containing ":" stay intact.
		split := strings.SplitN(detail, ":", 2)
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
	// Shop scores: description, service, logistics.
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	if len(shopinfos) < 3 {
		return nil, fmt.Errorf("invalid html page %s: shop scores not found", itemURL)
	}
	describe, _ := strconv.ParseFloat(shopinfos[0], 64)
	service, _ := strconv.ParseFloat(shopinfos[1], 64)
	logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
	// Prices (one entry per SKU; price is the list price, promotion_price the promotional price), e.g.:
	//map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
	itemPriceResultMap, err := j.parsePrice(msg.ItemID)
	if err != nil {
		logging.Errorf("parse price of item %d, error %v", msg.ItemID, err)
	}
	res := map[string]interface{}{}
	res["source"] = "Ali"
	res["source_id"] = msg.SourceID
	res["id"] = msg.ItemID
	res["ad_id"] = msg.AdID
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["uid"] = msg.UID
	res["did"] = msg.DID
	res["item_price"] = itemPriceResultMap
	// Validate a few fields that must be present.
	if res["name"] == "" && res["shopname"] == "" {
		return nil, fmt.Errorf("invalid html page %s", itemURL)
	}
	return res, nil
}
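The attribute lines taken from J_AttrUL are split into key and value at a colon. A small standalone sketch of that step is shown below. `parseAttrs` is a hypothetical helper; accepting the full-width colon as well as the ASCII one is an assumption about what a page might contain, and the "note" line is made-up input to show that only the first separator is used.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAttrs turns lines such as "品牌: 纽曼" into a key/value map. It splits
// on the first colon only and accepts both the ASCII ":" and the full-width
// ":" separator; lines without a separator are skipped.
func parseAttrs(lines []string) map[string]string {
	attrs := make(map[string]string, len(lines))
	for _, line := range lines {
		i := strings.IndexAny(line, "::")
		if i < 0 {
			continue
		}
		// The full-width colon is three bytes long in UTF-8, so the
		// separator width has to be determined before slicing.
		sep := ":"
		if strings.HasPrefix(line[i:], ":") {
			sep = ":"
		}
		key := strings.TrimSpace(line[:i])
		val := strings.TrimSpace(line[i+len(sep):])
		if key != "" {
			attrs[key] = val
		}
	}
	return attrs
}

func main() {
	attrs := parseAttrs([]string{"品牌: 纽曼", "型号:EX16", "note: a:b", "no separator"})
	fmt.Println(len(attrs), attrs["品牌"], attrs["型号"], attrs["note"])
}
```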
ali_spider_test.go:
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
	"testing"
)

func TestName(t *testing.T) {
	//conf, err := ssconf.LoadSsConfFile(confFile)
	//if err != nil {
	//	panic(err)
	//}
	aliSpider := NewAliSpider()
	//554867117919 585758506034
	var itemId int64 = 7664169349
	itemURL := fmt.Sprintf(itemURLPatternAli, itemId)
	node, err := aliSpider.loadPage(itemURL)
	if err != nil {
		t.Fatalf("load page: %v", err)
	}
	//fmt.Println(node)
	name := parseNode(node, "//h1[@data-spm]")
	// Product attributes as they appear on the page, e.g.:
	//   产品名称:纽曼            (product name)
	//   品牌: 纽曼               (brand)
	//   型号: EX16               (model)
	//   功能: 睡眠监测 计步 防水 (features)
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		// Split on the first colon only, so values containing ":" stay intact.
		split := strings.SplitN(detail, ":", 2)
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
	// Shop scores: description, service, logistics.
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	if len(shopinfos) < 3 {
		t.Fatalf("expected 3 shop scores, got %d", len(shopinfos))
	}
	describe, _ := strconv.ParseFloat(shopinfos[0], 64)
	service, _ := strconv.ParseFloat(shopinfos[1], 64)
	logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
	// Prices (one entry per SKU; price is the list price, promotion_price the promotional price), e.g.:
	//map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
	itemPriceResultMap, err := aliSpider.parsePrice(itemId)
	if err != nil {
		t.Fatalf("parse price: %v", err)
	}
	res := map[string]interface{}{}
	res["source"] = "Ali"
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["item_price"] = itemPriceResultMap
	bytes, err := json.Marshal(res)
	if err != nil {
		t.Fatalf("marshal result: %v", err)
	}
	fmt.Println(string(bytes))
}
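Output of the original run. Every SKU shows 168/68 here even though the price JSON above lists 178.00/75.00 for SKUs 6310159781 and 6310159797. That happens when all SKUs share a single inner map in parsePrice, which is why that map must be created inside the loop: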
{"describe":4.9,"details":{"上市时间":"2014年冬季","乒乓底板材质":"其他","品牌":"Palio/拍里奥","型号":"TNT-1","层数":"9层","拍柄重量":"头沉柄轻","是否商场同款":"是","系列":"拍里奥TNT-1","货号":"TNT-1","颜色分类":"TNT-1直拍(短柄)1只+赠送:1海绵护边【7木+2碳】 TNT-1横拍(长柄)1只+赠送:1海绵护边【7木+2碳】 新TNT直拍(短柄)1只+赠送:1海绵护边【5木+2碳】 新TNT横拍(长柄)1只+赠送:1海绵护边【5木+2碳】"},"item_price":{"3280089025135":{"price":168,"promotion_price":68},"3280089025136":{"price":168,"promotion_price":68},"6310159781":{"price":168,"promotion_price":68},"6310159797":{"price":168,"promotion_price":68}},"logistics":4.8,"name":"正品 拍里奥乒乓球底板新TNT-1碳素快攻弧圈乒乓球拍底板球拍球板","service":4.8,"shopname":"玺源运动专营店","source":"Ali","url":"https://detail.tmall.com/item.htm?id=7664169349"}