[Crawler] Scraping Tmall Product Pages with Go
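A recent task at work required scraping product information from Tmall. The overall flow is as follows.
First, the backend ad exchange platform code is modified to parse a URL out of the creatives uploaded by Alibaba. The URL looks like this: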
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D
It is clearly percent-encoded, so the first step is to decode it. An online decoder is available at:
http://tool.chinaz.com/Tools/urlencode.aspx
After decoding, we get:
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content={"items":[{"images":["https://asearch.alicdn.com/bao/uploaded//i4/22356367/TB2PMQinN6I8KJjy0FgXXXXzVXa_!!0-saturn_solar.jpg"],"itemid":"7664169349","shorttitle":"乒乓球拍 无线专属"}]}
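What we need from it is "itemid":"7664169349".
Visiting https://detail.tmall.com/item.htm?id=7664169349 then opens the product's detail page.
That page is what we need to scrape. The ad exchange platform puts the parsed ItemId into nsq; the crawler reads ItemIds from nsq, builds the URL, scrapes the key information from the page, and sends it to Kafka. Hive and ES then consume it from Kafka to serve queries.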
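The flow above can be sketched in-process, with a Go channel standing in for nsq and a slice of JSON records standing in for what would be published to Kafka. Everything here is hypothetical (the `record` type, `runCrawler`, and the `fetch` callback are illustrative names, not part of the real system); the point is only the shape: read ItemIds, build the detail URL, emit one JSON record per item.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record is a hypothetical, trimmed-down result message.
type record struct {
	ItemID int64  `json:"item_id"`
	URL    string `json:"url"`
	Name   string `json:"name"`
}

// runCrawler drains item IDs from in (standing in for an nsq topic), calls
// fetch for each detail URL, and returns the JSON records that would be
// published to Kafka. Items whose fetch or encoding fails are skipped.
func runCrawler(in <-chan int64, fetch func(url string) (string, error)) [][]byte {
	var out [][]byte
	for id := range in {
		u := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", id)
		name, err := fetch(u)
		if err != nil {
			continue
		}
		b, err := json.Marshal(record{ItemID: id, URL: u, Name: name})
		if err != nil {
			continue
		}
		out = append(out, b)
	}
	return out
}

func main() {
	in := make(chan int64, 1)
	in <- 7664169349
	close(in)
	// A stub fetcher; a real one would download and parse the page.
	stub := func(url string) (string, error) { return "stub name", nil }
	for _, b := range runCrawler(in, stub) {
		fmt.Println(string(b))
	}
}
```

Passing the fetcher in as a function keeps the queue handling testable without touching the network.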
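Step 1
Step 1 is to extract the ItemId. The ad exchange platform gives us the URL to parse; next we decode it in code and extract the ItemId value. The project is written in Golang, so the examples here use Golang. It would be even simpler in Python, and the principle is the same.
For URL parsing, see: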
https://gobyexample.com/url-parsing
For JSON serialization and deserialization, see:
https://www.cnblogs.com/liang1101/p/6741262.html
Here is my code:
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// Field names must be exported (start with an uppercase letter) so that
// encoding/json can populate them; JSON keys are matched case-insensitively.
type item struct {
	Images     []string
	ItemId     string
	ShortTitle string
}

func main() {
	var urlstring string = "https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D"
	// url.Parse keeps the query percent-encoded; Query() then decodes each
	// parameter exactly once, so no separate QueryUnescape pass is needed
	// (decoding the whole URL first would decode the query twice).
	parse, err := url.Parse(urlstring)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println(parse.RawQuery)
	content := parse.Query().Get("content")
	fmt.Printf("%T, %v\n", content, content)
	m := make(map[string][]item)
	if err := json.Unmarshal([]byte(content), &m); err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Println("m:", m)
	if len(m["items"]) == 0 {
		fmt.Println("no items in content")
		return
	}
	itemValue := m["items"][0]
	fmt.Println(itemValue.ItemId)
	// Convert to int64
	i, err := strconv.ParseInt(itemValue.ItemId, 10, 64)
	if err != nil {
		fmt.Println("err is", err)
		return
	}
	fmt.Printf("%T, %v", i, i)
}
Running it gives us the ItemId value we need.
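For reuse inside a service, the same steps can be folded into one function that returns an error instead of printing. This is a minimal sketch, not the platform's actual code: `extractItemID` and the `creative` type are hypothetical names, the example.com URL is made up, and the sketch assumes the first entry of `items` is the one wanted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// creative mirrors only the part of the content payload that is needed.
type creative struct {
	Items []struct {
		ItemID string `json:"itemid"`
	} `json:"items"`
}

// extractItemID returns the first itemid found in the "content" query
// parameter of rawURL.
func extractItemID(rawURL string) (int64, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	// Query() percent-decodes the parameter values.
	content := u.Query().Get("content")
	if content == "" {
		return 0, fmt.Errorf("no content parameter in %q", rawURL)
	}
	var c creative
	if err := json.Unmarshal([]byte(content), &c); err != nil {
		return 0, err
	}
	if len(c.Items) == 0 {
		return 0, fmt.Errorf("no items in content")
	}
	return strconv.ParseInt(c.Items[0].ItemID, 10, 64)
}

func main() {
	// Hypothetical URL carrying {"items":[{"itemid":"7664169349"}]}.
	raw := "https://example.com/a.mp4?content=%7B%22items%22%3A%5B%7B%22itemid%22%3A%227664169349%22%7D%5D%7D"
	id, err := extractItemID(raw)
	fmt.Println(id, err) // prints: 7664169349 <nil>
}
```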
Step 2
Step 2 is to build the URL and scrape the page content.
How do we fetch a web page in Go? Here is a simple demo.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	var website string = "http://www.future.org.cn"
	if resp, err := http.Get(website); err == nil {
		defer resp.Body.Close()
		if body, err := ioutil.ReadAll(resp.Body); err == nil {
			fmt.Println("HTML content:", string(body))
		} else {
			fmt.Println("Cannot read from connected http server:", err)
		}
	} else {
		fmt.Println("Cannot connect the server:", err)
	}
}
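After fetching the page, however, a problem shows up: the Chinese text is garbled. The page is served in a GBK-family encoding, while Go treats strings as UTF-8.
To fix the garbled text, install iconv-go: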
go get github.com/djimenez/iconv-go
The fetched content can then be transcoded, for example:
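// convFromGbk converts a GBK-encoded string to UTF-8.
// If the conversion fails, the input is returned unchanged.
func convFromGbk(s string) string {
	res, err := iconv.ConvertString(s, "gbk", "utf-8")
	if err != nil {
		return s
	}
	return res
}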
Alternatively, wrap the response body in a transcoding Reader:
// Excerpt from the loadPage method in the full code (ali_spider.go).
req, err := http.NewRequest("GET", url, nil)
if err != nil {
	return nil, err
}
req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
rsp, err := j.client.Do(req)
if err != nil {
	return nil, err
}
// Transcode: wrap the GB2312 body in a Reader that yields UTF-8
utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
if err != nil {
	return nil, err
}
//if body, err := ioutil.ReadAll(utfBody); err == nil {
//	fmt.Println("HTML content:", string(body))
//}
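Hard-coding "gb2312" matches this page, but the source encoding can also be read from the response's Content-Type header when the server sends one. The sketch below uses only the standard library; `charsetOf` is a hypothetical helper, and it deliberately ignores charsets declared in an HTML `<meta>` tag.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetOf extracts the charset parameter from a Content-Type header value
// and lower-cases it. It returns fallback when the header is missing,
// malformed, or carries no charset.
func charsetOf(contentType, fallback string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return fallback
	}
	if cs := params["charset"]; cs != "" {
		return strings.ToLower(cs)
	}
	return fallback
}

func main() {
	fmt.Println(charsetOf("text/html;charset=GBK", "utf-8")) // prints: gbk
	fmt.Println(charsetOf("text/html", "utf-8"))             // prints: utf-8
}
```

The returned name could then be passed as the source encoding to the transcoding Reader, with "gb2312" as the fallback.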
The fetched page then needs to be parsed; XPath is used here.
For how to use XPath, see:
http://www.w3school.com.cn/xpath/xpath_axes.asp
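It is simple enough to pick up in one read.
The fetched content is HTML, so all that remains is to extract the parts you want; in other words, parse the HTML.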
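Once XPath has returned the text of each attribute item, turning lines such as "品牌: 纽曼" into key/value pairs is plain string handling. A minimal sketch, assuming each line uses an ASCII colon as the separator; `splitAttrs` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAttrs turns "key: value" lines into a map. Only the first colon
// separates key from value, so values may themselves contain colons.
// Lines without a colon or with an empty key are skipped.
func splitAttrs(lines []string) map[string]string {
	attrs := make(map[string]string, len(lines))
	for _, line := range lines {
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		if key == "" {
			continue
		}
		attrs[key] = strings.TrimSpace(parts[1])
	}
	return attrs
}

func main() {
	attrs := splitAttrs([]string{"品牌: 纽曼", "型号: EX16", "no separator"})
	fmt.Println(attrs["品牌"], attrs["型号"], len(attrs)) // prints: 纽曼 EX16 2
}
```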
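One more difficulty remains: the static page we fetch contains most of the information, but not the price, because the price is loaded dynamically.
Let's take a closer look.
If we open the request that loads the price, copy its URL, and visit it in a browser, access is denied with a 403. No need to worry: it is enough to add a Referer header pointing at the item page to the request.
In code: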
referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
req.Header.Set("Referer", referer)
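The effect of the header can be reproduced locally. In this sketch a test server from `net/http/httptest` stands in for the price endpoint and rejects any request without a Referer; the real server's checks are not known, and `getWithReferer` and `newRefererGuardedServer` are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// getWithReferer issues a GET with the given Referer header (omitted when
// empty) and returns the status code and body.
func getWithReferer(client *http.Client, url, referer string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, "", err
	}
	if referer != "" {
		req.Header.Set("Referer", referer)
	}
	rsp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer rsp.Body.Close()
	body, err := io.ReadAll(rsp.Body)
	if err != nil {
		return rsp.StatusCode, "", err
	}
	return rsp.StatusCode, string(body), nil
}

// newRefererGuardedServer starts a local stand-in server that answers 403
// to requests lacking a Referer header and 200 otherwise.
func newRefererGuardedServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Referer() == "" {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "ok")
	}))
}

func main() {
	srv := newRefererGuardedServer()
	defer srv.Close()
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", int64(7664169349))
	status, _, _ := getWithReferer(srv.Client(), srv.URL, "")
	fmt.Println("without Referer:", status)
	status, body, _ := getWithReferer(srv.Client(), srv.URL, referer)
	fmt.Println("with Referer:", status, body)
}
```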
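The response turns out to be JSON (wrapped in the setMdskip(...) callback named in the request URL).
Formatted with http://tool.oschina.net/codeformat/json, it looks like this: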
{
"defaultModel": {
"bannerDO": {
"success": true
},
"deliveryDO": {
"areaId": 110100,
"deliveryAddress": "浙江金华",
"deliverySkuMap": {
"6310159781": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"default": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"6310159797": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"3280089025135": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
],
"3280089025136": [
{
"arrivalNextDay": false,
"arrivalThisDay": false,
"forceMocked": false,
"postage": "快递: 0.00 ",
"postageFree": false,
"skuDeliveryAddress": "浙江金华",
"type": 0
}
]
},
"destination": "北京市",
"success": true
},
"detailPageTipsDO": {
"crowdType": 0,
"hasCoupon": true,
"hideIcons": false,
"jhs99": false,
"minicartSurprise": 0,
"onlyShowOnePrice": false,
"priceDisplayType": 4,
"primaryPicIcons": [ ],
"prime": false,
"showCuntaoIcon": false,
"showDou11Style": false,
"showDou11SugPromPrice": false,
"showDou12CornerIcon": false,
"showDuo11Stage": 0,
"showJuIcon": false,
"showMaskedDou11SugPrice": false,
"success": true,
"trueDuo11Prom": false
},
"doubleEleven2014": {
"doubleElevenItem": false,
"halfOffItem": false,
"showAtmosphere": false,
"showRightRecommendedArea": false,
"step": 0,
"success": true
},
"extendedData": { },
"extras": { },
"gatewayDO": {
"changeLocationGateway": {
"queryDelivery": true,
"queryProm": false
},
"success": true,
"trade": {
"addToBuyNow": { },
"addToCart": { }
}
},
"inventoryDO": {
"hidden": false,
"icTotalQuantity": 225,
"skuQuantity": {
"3280089025136": {
"quantity": 71,
"totalQuantity": 71,
"type": 1
},
"6310159781": {
"quantity": 33,
"totalQuantity": 33,
"type": 1
},
"6310159797": {
"quantity": 44,
"totalQuantity": 44,
"type": 1
},
"3280089025135": {
"quantity": 77,
"totalQuantity": 77,
"type": 1
}
},
"success": true,
"totalQuantity": 225,
"type": 1
},
"itemPriceResultDO": {
"areaId": 110100,
"duo11Item": false,
"duo11Stage": 0,
"extraPromShowRealPrice": false,
"halfOffItem": false,
"hasDPromotion": false,
"hasMobileProm": false,
"hasTmallappProm": false,
"hiddenNonBuyPrice": false,
"hideMeal": false,
"priceInfo": {
"6310159781": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "178.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "75.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
},
"6310159797": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "178.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "75.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
},
"3280089025135": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "168.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "68.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
},
"3280089025136": {
"areaSold": true,
"onlyShowOnePrice": false,
"price": "168.00",
"promotionList": [
{
"amountPromLimit": 0,
"amountRestriction": "",
"basePriceType": "IcPrice",
"canBuyCouponNum": 0,
"endTime": 1561651200000,
"extraPromTextType": 0,
"extraPromType": 0,
"limitProm": false,
"postageFree": false,
"price": "68.00",
"promType": "normal",
"start": false,
"startTime": 1546267717000,
"status": 2,
"tfCartSupport": false,
"tmallCartSupport": false,
"type": "火爆促销",
"unLogBrandMember": false,
"unLogShopVip": false,
"unLogTbvip": false
}
],
"sortOrder": 0
}
},
"queryProm": false,
"success": true,
"successCall": true,
"tmallShopProm": [ ]
},
"memberRightDO": {
"activityType": 0,
"level": 0,
"postageFree": false,
"shopMember": false,
"success": true,
"time": 1,
"value": 0.5
},
"miscDO": {
"bucketId": 15,
"city": "北京",
"cityId": 110100,
"debug": { },
"hasCoupon": false,
"region": "东城区",
"regionId": 110101,
"rn": "fa015e69c6a4ca4bb559805d670557e7",
"smartBannerFlag": "top",
"success": true,
"supportCartRecommend": false,
"systemTime": "1555232632711",
"town": "东华门街道",
"townId": 110101001
},
"regionalizedData": {
"success": true
},
"sellCountDO": {
"sellCount": "5",
"success": true
},
"servicePromise": {
"has3CPromise": false,
"servicePromiseList": [
{
"description": "商品支持正品保障服务",
"displayText": "正品保证",
"icon": "无",
"link": "//www.tmall.com/wow/portal/act/bzj",
"rank": -1
},
{
"description": "极速退款是为诚信会员提供的退款退货流程的专享特权,额度是根据每个用户当前的信誉评级情况而定",
"displayText": "极速退款",
"icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif",
"link": "//vip.tmall.com/vip/privilege.htm?spm=3.1000588.0.141.2a0ae8&priv=speed",
"rank": -1
},
{
"description": "卖家为您购买的商品投保退货运费险(保单生效以下单显示为准)",
"displayText": "赠运费险",
"icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif",
"link": "//service.tmall.com/support/tmall/knowledge-1121473.htm?spm=0.0.0.0.asbDA1",
"rank": -1
},
{
"description": "七天无理由退换",
"displayText": "七天无理由退换",
"icon": "//img.alicdn.com/tps/i3/T1Vyl6FCBlXXaSQP_X-16-16.png",
"link": "//pages.tmall.com/wow/seller/act/seven-day",
"rank": -1
}
],
"show": true,
"success": true,
"titleInformation": [ ]
},
"soldAreaDataDO": {
"currentAreaEnable": true,
"success": true,
"useNewRegionalSales": true
},
"tradeResult": {
"cartEnable": true,
"cartType": 2,
"miniTmallCartEnable": true,
"startTime": 1554812946000,
"success": true,
"tradeEnable": true
},
"userInfoDO": {
"activeStatus": 0,
"companyPurchaseUser": false,
"loginMember": false,
"loginUserType": "buyer",
"success": true,
"userId": 0
}
},
"isSuccess": true
}
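The JSON is large, and parsing every field would be tedious. We only need the price information, i.e. priceInfo, so we want something XPath-like for JSON; here we use JSONPath.
Reference: https://github.com/DarrenChanChenChi/jsonpath
Its usage is much the same as XPath.
Then just extract the fields we want.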
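As a point of comparison, and not the approach used in this article, the standard library alone can pull out the same values: declare a struct that names only the path to priceInfo, and encoding/json skips everything else. The sketch below also strips the callback wrapper. `priceResponse`, `unwrapJSONP`, and `skuPrices` are hypothetical names, and the sample payload is a cut-down version of the response shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// priceResponse declares only defaultModel.itemPriceResultDO.priceInfo;
// encoding/json silently skips every field that is not declared here.
type priceResponse struct {
	DefaultModel struct {
		ItemPriceResultDO struct {
			PriceInfo map[string]struct {
				Price         string `json:"price"`
				PromotionList []struct {
					Price string `json:"price"`
				} `json:"promotionList"`
			} `json:"priceInfo"`
		} `json:"itemPriceResultDO"`
	} `json:"defaultModel"`
}

// unwrapJSONP strips a callback wrapper such as setMdskip({...}).
func unwrapJSONP(s string) (string, error) {
	l, r := strings.Index(s, "("), strings.LastIndex(s, ")")
	if l < 0 || r < l {
		return "", fmt.Errorf("not a JSONP payload")
	}
	return s[l+1 : r], nil
}

// skuPrices maps each SKU id to its list price and, when a promotion is
// present, its first promotional price.
func skuPrices(jsonp string) (map[string]map[string]float64, error) {
	payload, err := unwrapJSONP(jsonp)
	if err != nil {
		return nil, err
	}
	var resp priceResponse
	if err := json.Unmarshal([]byte(payload), &resp); err != nil {
		return nil, err
	}
	out := make(map[string]map[string]float64)
	for sku, info := range resp.DefaultModel.ItemPriceResultDO.PriceInfo {
		price, err := strconv.ParseFloat(info.Price, 64)
		if err != nil {
			return nil, err
		}
		entry := map[string]float64{"price": price} // a fresh map per SKU
		if len(info.PromotionList) > 0 {
			if p, err := strconv.ParseFloat(info.PromotionList[0].Price, 64); err == nil {
				entry["promotion_price"] = p
			}
		}
		out[sku] = entry
	}
	return out, nil
}

func main() {
	sample := `setMdskip({"defaultModel":{"itemPriceResultDO":{"priceInfo":{"6310159781":{"price":"178.00","promotionList":[{"price":"75.00"}]}}}}})`
	prices, err := skuPrices(sample)
	fmt.Println(prices, err)
}
```

The trade-off is that the struct has to be edited whenever another field is needed, whereas a JSONPath expression is just a string.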
Full code
common.go:
package main

import (
	"fmt"
	"math/rand"
	"net"
	"net/http"
	"strings"
	"time"

	"github.com/djimenez/iconv-go"
	"gopkg.in/xmlpath.v2"
)

// Msg describes one item to crawl.
type Msg struct {
	AdID     int64  `json:"ad_id"`
	SourceID int64  `json:"source_id"`
	Source   string `json:"source"`
	ItemID   int64  `json:"item_id"`
	URL      string `json:"url"`
	UID      int64  `json:"uid"`
	DID      int64  `json:"did"`
}

// convFromGbk converts a GBK-encoded string to UTF-8.
// If the conversion fails, the input is returned unchanged.
func convFromGbk(s string) string {
	res, err := iconv.ConvertString(s, "gbk", "utf-8")
	if err != nil {
		return s
	}
	return res
}

// newHTTPClient returns a client with 1.5s dial and overall timeouts.
func newHTTPClient() *http.Client {
	client := &http.Client{
		Transport: &http.Transport{
			Dial: func(netw, addr string) (net.Conn, error) {
				return net.DialTimeout(netw, addr, 1500*time.Millisecond)
			},
			MaxIdleConnsPerHost: 200,
		},
		Timeout: 1500 * time.Millisecond,
	}
	return client
}

// parseNode returns the first non-empty match of xpath under node.
func parseNode(node *xmlpath.Node, xpath string) string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		fmt.Printf("compile xpath %q: %v\n", xpath, err)
		return ""
	}
	it := path.Iter(node)
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			//return convFromGbk(s)
			return s
		}
	}
	return ""
}

// parseNodeForAll returns every non-empty match of xpath under node.
func parseNodeForAll(node *xmlpath.Node, xpath string) []string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		fmt.Printf("compile xpath %q: %v\n", xpath, err)
		return nil
	}
	it := path.Iter(node)
	elements := []string{}
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			//return convFromGbk(s)
			elements = append(elements, s)
		}
	}
	return elements
}

// percent returns true with a probability of pct percent.
func percent(pct int) bool {
	if pct < 0 || pct > 100 {
		return false
	}
	return pct > rand.Intn(100)
}
ali_spider.go:
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"math/rand"
	"net/http"
	"strconv"
	"strings"

	"code.byted.org/gopkg/logs" // company-internal logging package; substitute your own logger
	"github.com/djimenez/iconv-go"
	"github.com/ngaut/logging"
	"github.com/oliveagle/jsonpath"
	"gopkg.in/xmlpath.v2"
)

const itemURLPatternAli = "https://detail.tmall.com/item.htm?id=%d"
const priceURLPatternAli = "https://mdskip.taobao.com/core/initItemDetail.htm?isUseInventoryCenter=false&cartEnable=true&service3C=false&isApparel=true&isSecKill=false&tmallBuySupport=true&isAreaSell=false&tryBeforeBuy=false&offlineShop=false&itemId=%d&showShopProm=false&isPurchaseMallPage=false&itemGmtModified=1555201252000&isRegionLevel=false&household=false&sellerPreview=false&queryMemberRight=true&addressLevel=2&isForbidBuyItem=false&callback=setMdskip&timestamp=1555210888509&isg=bBQF1SmIvk4dQ8UGBOCNIZNDTp7T7IRAguWjmN99i_5Qy1Y_p8_OlZkxNev6Vj5RsG8p46-P7M29-etfw&isg2=BPPzr6M1qyiTZGdgYB4puOBagvEXdGgbstRSkqWQUpJJpBNGLPrUOlF1XpTvBN_i"

var ualist = []string{
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}

type AliSpider struct {
	client *http.Client
}

func NewAliSpider() *AliSpider {
	return &AliSpider{
		client: newHTTPClient(),
	}
}

func (j *AliSpider) loadPage(url string) (*xmlpath.Node, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()
	// Transcode the GB2312 body to UTF-8
	utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
	if err != nil {
		return nil, err
	}
	//if body, err := ioutil.ReadAll(utfBody); err == nil {
	//	fmt.Println("HTML content:", string(body))
	//}
	return xmlpath.ParseHTML(utfBody)
}

func (j *AliSpider) parsePrice(itemID int64) (map[string]map[string]float64, error) {
	priceURL := fmt.Sprintf(priceURLPatternAli, itemID)
	req, err := http.NewRequest("GET", priceURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	// Without a Referer pointing at the item page the endpoint answers 403
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
	req.Header.Set("Referer", referer)
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()
	priceInfoRaw, err := ioutil.ReadAll(rsp.Body)
	if err != nil {
		return nil, err
	}
	priceInfo := string(priceInfoRaw)
	jsonStr := convFromGbk(priceInfo)

	// The response is JSONP, e.g. setMdskip({...}); keep only the payload
	leftIndex := strings.Index(jsonStr, "(") + 1
	rightIndex := strings.LastIndex(jsonStr, ")")
	if leftIndex == 0 || rightIndex < leftIndex {
		return nil, fmt.Errorf("unexpected price response for item %d", itemID)
	}
	var json_data interface{}
	if err := json.Unmarshal([]byte(jsonStr[leftIndex:rightIndex]), &json_data); err != nil {
		return nil, err
	}

	skuQuantity, err := jsonpath.JsonPathLookup(json_data, "$.defaultModel.inventoryDO.skuQuantity")
	if err != nil {
		logs.Info("json path is err, err is %v", err)
		return nil, err
	}
	skuQuantityMap, ok := skuQuantity.(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("unexpected skuQuantity type %T", skuQuantity)
	}
	itemPriceResultMap := map[string]map[string]float64{}
	for skuQuantityId := range skuQuantityMap {
		jpathPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.price", skuQuantityId)
		jpathPromotionPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.promotionList[0].price", skuQuantityId)
		price, err := jsonpath.JsonPathLookup(json_data, jpathPrice)
		if err != nil {
			logs.Info("jpathPrice is err, err is %v", err)
			continue
		}
		priceStr, ok := price.(string)
		if !ok {
			continue
		}
		// Each SKU gets its own map; sharing one map across the loop would
		// make every SKU report whichever values were written last.
		itemPriceResultDetailMap := map[string]float64{}
		itemPriceResultDetailMap["price"], _ = strconv.ParseFloat(priceStr, 64)
		promotionPrice, err := jsonpath.JsonPathLookup(json_data, jpathPromotionPrice)
		if err != nil {
			logs.Info("jpathPromotionPrice is err, err is %v", err)
		} else if promotionPriceStr, ok := promotionPrice.(string); ok {
			itemPriceResultDetailMap["promotion_price"], _ = strconv.ParseFloat(promotionPriceStr, 64)
		}
		itemPriceResultMap[skuQuantityId] = itemPriceResultDetailMap
	}
	return itemPriceResultMap, nil
}

func (j *AliSpider) Parse(msg *Msg) (res map[string]interface{}, err error) {
	defer func() {
		if r := recover(); r != nil {
			logging.Errorf("parse msg %v, error %v", *msg, r)
			// Report the panic to the caller instead of returning (nil, nil)
			res, err = nil, fmt.Errorf("parse msg %v: %v", *msg, r)
		}
	}()
	itemURL := fmt.Sprintf(itemURLPatternAli, msg.ItemID)
	node, err := j.loadPage(itemURL)
	if err != nil {
		return nil, err
	}
	//metricsClient.EmitCounter("jd_spider", 1, "", map[string]string{"step": "parse"})
	name := parseNode(node, "//h1[@data-spm]")
	// Product attributes, for example:
	/**
	产品名称:纽曼
	品牌: 纽曼
	型号: EX16
	功能: 睡眠监测 计步 防水
	*/
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		// Split on the first colon only so values may contain colons
		split := strings.SplitN(detail, ":", 2)
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
	// Shop scores: description, service, logistics
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	if len(shopinfos) < 3 {
		return nil, fmt.Errorf("invalid html page %s", itemURL)
	}
	describe, _ := strconv.ParseFloat(shopinfos[0], 64)
	service, _ := strconv.ParseFloat(shopinfos[1], 64)
	logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
	// Prices (one entry per SKU; price is the list price, promotion_price is the promotional price)
	//map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
	itemPriceResultMap, err := j.parsePrice(msg.ItemID)
	if err != nil {
		logs.Info("parse price failed, err is %v", err)
	}
	res = map[string]interface{}{}
	res["source"] = "Ali"
	res["source_id"] = msg.SourceID
	res["id"] = msg.ItemID
	res["ad_id"] = msg.AdID
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["uid"] = msg.UID
	res["did"] = msg.DID
	res["item_price"] = itemPriceResultMap
	// Validate a few fields that every real product page must contain
	if res["name"] == "" && res["shopname"] == "" {
		return nil, fmt.Errorf("invalid html page %s", itemURL)
	}
	return res, nil
}
ali_spider_test.go:
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
	"testing"
)

func TestName(t *testing.T) {
	//conf, err := ssconf.LoadSsConfFile(confFile)
	//if err != nil {
	//	panic(err)
	//}
	aliSpider := NewAliSpider()
	//554867117919 585758506034
	var itemId int64 = 7664169349
	itemURL := fmt.Sprintf(itemURLPatternAli, itemId)
	node, err := aliSpider.loadPage(itemURL)
	if err != nil {
		t.Fatalf("load page %s: %v", itemURL, err)
	}
	//fmt.Println(node)
	name := parseNode(node, "//h1[@data-spm]")
	// Product attributes, for example:
	/**
	产品名称:纽曼
	品牌: 纽曼
	型号: EX16
	功能: 睡眠监测 计步 防水
	*/
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		split := strings.SplitN(detail, ":", 2)
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
	// Shop scores: description, service, logistics
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	if len(shopinfos) < 3 {
		t.Fatalf("expected 3 shop scores, got %d", len(shopinfos))
	}
	describe, _ := strconv.ParseFloat(shopinfos[0], 64)
	service, _ := strconv.ParseFloat(shopinfos[1], 64)
	logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
	// Prices (one entry per SKU; price is the list price, promotion_price is the promotional price)
	//map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
	itemPriceResultMap, err := aliSpider.parsePrice(itemId)
	if err != nil {
		t.Fatalf("parse price: %v", err)
	}
	res := map[string]interface{}{}
	res["source"] = "Ali"
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["item_price"] = itemPriceResultMap
	bytes, err := json.Marshal(res)
	if err != nil {
		t.Fatalf("marshal result: %v", err)
	}
	fmt.Println(string(bytes))
}
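Output. Note that all four SKUs report 168/68 although the price JSON above lists 178.00/75.00 for SKUs 6310159781 and 6310159797: this run was produced with a single detail map shared by all SKUs in parsePrice, so every SKU shows whichever values were written last.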
{"describe":4.9,"details":{"上市时间":"2014年冬季","乒乓底板材质":"其他","品牌":"Palio/拍里奥","型号":"TNT-1","层数":"9层","拍柄重量":"头沉柄轻","是否商场同款":"是","系列":"拍里奥TNT-1","货号":"TNT-1","颜色分类":"TNT-1直拍(短柄)1只+赠送:1海绵护边【7木+2碳】 TNT-1横拍(长柄)1只+赠送:1海绵护边【7木+2碳】 新TNT直拍(短柄)1只+赠送:1海绵护边【5木+2碳】 新TNT横拍(长柄)1只+赠送:1海绵护边【5木+2碳】"},"item_price":{"3280089025135":{"price":168,"promotion_price":68},"3280089025136":{"price":168,"promotion_price":68},"6310159781":{"price":168,"promotion_price":68},"6310159797":{"price":168,"promotion_price":68}},"logistics":4.8,"name":"正品 拍里奥乒乓球底板新TNT-1碳素快攻弧圈乒乓球拍底板球拍球板","service":4.8,"shopname":"玺源运动专营店","source":"Ali","url":"https://detail.tmall.com/item.htm?id=7664169349"}