Parsing Document Metadata with Bing Scaping

Set up the environment - install goquery package.

https://github.com/PuerkitoBio/goquery

go get github.com/PuerkitoBio/goquery

Modify the Proxy setting if in China. Refer to: https://sum.golang.org/

Unzip an Office file and analyze the Open XML file struct. "creator", "lastModifiedBy" in core.xml and "Application", "Company", "AppVersion" in app.xml are of primary interest.

Defining the metadata Package and mapping the data to structs in GO to open, parse, and extract Office Open XML documents.

package metadata

import (
"archive/zip"
"encoding/xml"
"strings"
) // Open XML type definition and version mapping
type OfficeCoreProperty struct {
XMLName xml.Name `xml:"coreProperties"`
Creator string `xml:"creator"`
LastModifiedBy string `xml:"lastModifiedBy"`
} type OfficeAppProperty struct {
XMLName xml.Name `xml:"Properties"`
Application string `xml:"Application"`
Company string `xml:"Company"`
Version string `xml:"AppVersion"`
} var OfficeVersion = map[string]string{
"16": "2016",
"15": "2013",
"14": "2010",
"12": "2007",
"11": "2003",
} func (a *OfficeAppProperty) GetMajorVersion() string {
tokens := strings.Split(a.Version, ".") if len(tokens) < 2 {
return "Unknown"
}
v, ok := OfficeVersion[tokens[0]]
if !ok {
return "Unknown"
}
return v
} // Processing Open XML archives and embedded XML documents
func NewProperties(r *zip.Reader) (*OfficeCoreProperty, *OfficeAppProperty, error) {
var coreProps OfficeCoreProperty
var appProps OfficeAppProperty for _, f := range r.File {
switch f.Name {
case "docProps/core.xml":
if err := process(f, &coreProps); err != nil {
return nil, nil, err
}
case "docProps/app.xml":
if err := process(f, &appProps); err != nil {
return nil, nil, err
}
default:
continue
}
}
return &coreProps, &appProps, nil
} func process(f *zip.File, prop interface{}) error {
rc, err := f.Open()
if err != nil {
return err
}
defer rc.Close() if err := xml.NewDecoder(rc).Decode(&prop); err != nil {
return err
} return nil
}

Figure out how to search for and retrieve files by using Bing.

1. Submit a search request to Bing with proper filters to retrieve targeted results.

2. Scrape the HTML response, extracting the HRER(link) data to obtain direct URLs for documents.

3. Submit an HTTP request for each direct document URL.

4. Parse the response body to create a zip.Reader.

5. Pass the zip.Reader into the code you already developed to extract metadata.

Analyze the search result elements in Bing.

Now scrap Bing results and parse the document metadata.

package metadata

import (
"archive/zip"
"encoding/xml"
"strings"
) // Open XML type definition and version mapping
type OfficeCoreProperty struct {
XMLName xml.Name `xml:"coreProperties"`
Creator string `xml:"creator"`
LastModifiedBy string `xml:"lastModifiedBy"`
} type OfficeAppProperty struct {
XMLName xml.Name `xml:"Properties"`
Application string `xml:"Application"`
Company string `xml:"Company"`
Version string `xml:"AppVersion"`
} var OfficeVersion = map[string]string{
"16": "2016",
"15": "2013",
"14": "2010",
"12": "2007",
"11": "2003",
} func (a *OfficeAppProperty) GetMajorVersion() string {
tokens := strings.Split(a.Version, ".") if len(tokens) < 2 {
return "Unknown"
}
v, ok := OfficeVersion[tokens[0]]
if !ok {
return "Unknown"
}
return v
} // Processing Open XML archives and embedded XML documents
func NewProperties(r *zip.Reader) (*OfficeCoreProperty, *OfficeAppProperty, error) {
var coreProps OfficeCoreProperty
var appProps OfficeAppProperty for _, f := range r.File {
switch f.Name {
case "docProps/core.xml":
if err := process(f, &coreProps); err != nil {
return nil, nil, err
}
case "docProps/app.xml":
if err := process(f, &appProps); err != nil {
return nil, nil, err
}
default:
continue
}
}
return &coreProps, &appProps, nil
} func process(f *zip.File, prop interface{}) error {
rc, err := f.Open()
if err != nil {
return err
}
defer rc.Close() if err := xml.NewDecoder(rc).Decode(&prop); err != nil {
return err
} return nil
}

Go Pentester - HTTP CLIENTS(5)的更多相关文章

  1. Go Pentester - HTTP CLIENTS(1)

    Building HTTP Clients that interact with a variety of security tools and resources. Basic Preparatio ...

  2. Go Pentester - HTTP CLIENTS(4)

    Interacting with Metasploit msf.go package rpc import ( "bytes" "fmt" "gopk ...

  3. Go Pentester - HTTP CLIENTS(3)

    Interacting with Metasploit Early-stage Preparation: Setting up your environment - start the Metaspl ...

  4. Go Pentester - HTTP CLIENTS(2)

    Building an HTTP Client That Interacts with Shodan Shadon(URL:https://www.shodan.io/)  is the world' ...

  5. Creating a radius based VPN with support for Windows clients

    This article discusses setting up up an integrated IPSec/L2TP VPN using Radius and integrating it wi ...

  6. Deploying JRE (Native Plug-in) for Windows Clients in Oracle E-Business Suite Release 12 (文档 ID 393931.1)

    In This Document Section 1: Overview Section 2: Pre-Upgrade Steps Section 3: Upgrade and Configurati ...

  7. ZK 使用Clients.response

    参考: http://stackoverflow.com/questions/11416386/how-to-access-au-response-sent-from-server-side-at-c ...

  8. MySQL之aborted connections和aborted clients

    影响Aborted_clients 值的可能是客户端连接异常关闭,或wait_timeout值过小. 最近线上遇到一个问题,接口日志发现有很多超时报错,根据日志定位到数据库实例之后发现一切正常,一般来 ...

  9. 【渗透测试学习平台】 web for pentester -2.SQL注入

    Example 1 字符类型的注入,无过滤 http://192.168.91.139/sqli/example1.php?name=root http://192.168.91.139/sqli/e ...

随机推荐

  1. RocketMQ系列(七)事务消息(数据库|最终一致性)

    终于到了今天了,终于要讲RocketMQ最牛X的功能了,那就是事务消息.为什么事务消息被吹的比较热呢?近几年微服务大行其道,整个系统被切成了多个服务,每个服务掌管着一个数据库.那么多个数据库之间的数据 ...

  2. CSS中可以继承的元素(需要记住)

    可以继承的属性很少,只有颜色,文字,字体间距行高对齐方式,和列表的样式可以继承. 所有元素可继承:visibility和cursor. 内联元素可继承:letter-spacing.word-spac ...

  3. opencv C++矩阵操作

    int main(){ cv::Mat src1=(cv::Mat_<float>(2,3)<<1,2,3,4,5,6); cv::Mat src2=(cv::Mat_< ...

  4. 硬件对同步的支持-TAS和CAS指令

    目录 Test and Set Compare and Swap 使用CAS实现线程安全的数据结构. 现在主流的多处理器架构都在硬件水平上提供了对并发同步的支持. 今天我们讨论两个很重要的硬件同步指令 ...

  5. 入门大数据---HDFS-API

    第一步:创建一个新的项目 并导入需要的jar包 公共核心包 公共依赖包 hdfs核心包 hdfs依赖包 第二步:将Linux中hadoop的配置文件拷贝到项目的src目录下 第三步:配置windows ...

  6. 且谈 Apache Spark 的 API 三剑客:RDD、DataFrame 和 Dataset

    作者:Jules S. Damji 译者:足下 本文翻译自 A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets ,翻译已 ...

  7. DOM-BOM-EVENT(1)

    1.DOM简介 DOM(Document Object Model)即文档对象模型,是HTML和XML文档的编程接口.它提供了对文档的结构化的表述,并定义了一种方式可以使得从程序中对该结构进行访问,从 ...

  8. asp.net Core依赖注入(自带的IOC容器)

    今天我们主要讲讲如何使用自带IOC容器,虽然自带的功能不是那么强大,但是胜在轻量级..而且..不用引用别的库. 在新的ASP.NET Core中,大量的采用了依赖注入的方式来编写代码. 比如,在我们的 ...

  9. 小米商城项目(JSP+Servlet项目)

    小米商城项目 项目已托管到GitHub,大家可以去GitHub查看下载!并搜索关注微信公众号 码出Offer 领取各种学习资料! 在这里插入图片描述 基于Servlet+JSP开发的小米商城项目,因为 ...

  10. 生日聚会Party——这个线性dp有点嚣张

    题目描述 今天是hidadz小朋友的生日,她邀请了许多朋友来参加她的生日party. hidadz带着朋友们来到花园中,打算 坐成一排玩游戏.为了游戏不至于无聊,就座的方案应满足如下条件:对于任意连续 ...