An HTML parsing class that lets you get HTML elements easily without writing regular expressions yourself - C# .NET

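Features:

1. Easily get a specified HTML element.

2. Filter elements by tag name and attributes.

3. Every result is a strongly typed List, so no casting is needed.

Anyone who has used XElement knows how convenient it is for parsing XML, but it cannot cope with the many format variations found in real-world HTML.

So I wrote XHtmlElement, a class modeled on XElement.

Usage: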

            string filePath = Server.MapPath("~/file/test.htm");
            //Read the HTML source
            string mailBody = FileHelper.FileToString(filePath);
 
            XHtmlElement xh = new XHtmlElement(mailBody);
 
            //Get the a tags that are direct children of body and have class="icon"
            var link = xh.Descendants("body").ChildDescendants("a").Where(c => c.Attributes.Any(a => a.Key == "class" && a.Value == "icon")).ToList();
 
            //Get the a elements that have an href attribute
            var links = xh.Descendants("a").Where(c => c.Attributes.Any(a => a.Key == "href")).ToList();
            foreach (var r in links)
            {
                Response.Write(r.Attributes["href"]); //output the href
            }
 
            //Get the nearest img elements (call FirstOrDefault() on the result if only the first one is needed)
            var img = xh.Descendants("img");
 
            //Get the nearest p element together with the other p elements at the same level
            var ps = xh.Descendants("p");
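
The attribute filters above scan the key/value pairs with Any. Since HtmlInfo.Attributes is a Dictionary<string, string>, a direct key lookup is a possible alternative. The helper below is only a sketch: AttributeFilterSketch and HasAttribute are hypothetical names that are not part of the class, and the lookup is case-sensitive, just like the a.Key == "class" comparison in the usage code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of XHtmlElement.
public static class AttributeFilterSketch
{
    // True when attrs contains key and, if value is supplied, the stored value equals it.
    public static bool HasAttribute(IDictionary<string, string> attrs, string key, string value = null)
    {
        string actual;
        if (attrs == null || !attrs.TryGetValue(key, out actual)) return false;
        return value == null || actual == value;
    }
}
```

With it, the class filter could be written as xh.Descendants("body").ChildDescendants("a").Where(c => AttributeFilterSketch.HasAttribute(c.Attributes, "class", "icon")).ToList(), and the href filter as Where(c => AttributeFilterSketch.HasAttribute(c.Attributes, "href")).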

Code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Text;
using System.Text.RegularExpressions;
 
namespace SyntacticSugar
{
    /// <summary>
    /// ** Description: HTML parsing class
    /// ** Created: 2015-4-23
    /// ** Modified: -
    /// ** Author: sunkaixuan
    /// ** QQ: 610262374. Feedback is welcome, including suggestions on naming and syntax.
    /// </summary>
    public class XHtmlElement
    {
        private string _html;
        public XHtmlElement(string html)
        {
            _html = html;
        }
 
        /// <summary>
        /// Gets the nearest matching HTML elements that share the same level
        /// </summary>
        /// <param name="elementName">Tag name to match; null matches every element</param>
        /// <returns></returns>
        public List<HtmlInfo> Descendants(string elementName = null)
        {
            if (_html == null)
            {
                throw new ArgumentNullException("html", "html must not be null.");
            }
            var allList = RootDescendants(_html);
            var reval = allList.Where(c => elementName == null || c.TagName.ToLower() == elementName.ToLower()).ToList();
            if (reval == null || reval.Count == 0)
            {
                reval = GetDescendantsSource(allList, elementName);
            }
            return reval;
        }
 
        /// <summary>
        /// Gets the first-level (root) elements
        /// </summary>
        /// <param name="html">HTML to parse; null uses the HTML passed to the constructor</param>
        /// <returns></returns>
        public List<HtmlInfo> RootDescendants(string html = null)
        {
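            /*
             * Logic:
             * 1. Take the first HTML tag and scan forward for its closing tag; each time another opening tag with the same name is met along the way, one more closing tag is required
             * 2. Once the first element has been extracted, repeat step 1 for the 2nd ... Nth element
             */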
            if (html == null) html = _html;
            var firstTag = Regex.Match(html, "<.+?>");
 
            List<string> eleList = new List<string>();
            List<HtmlInfo> reval = new List<HtmlInfo>();
            GetElementsStringList(html, ref eleList);
            foreach (var r in eleList)
            {
                HtmlInfo data = new HtmlInfo();
                data.OldFullHtml = r;
                data.SameLeveHtml = html;
                // Letters followed by optional digits, so that h1..h6 are recognised as tag names
                data.TagName = Regex.Match(r, @"(?<=\s{1}|\<)[a-zA-Z][a-zA-Z0-9]*(?=\>|\s)", RegexOptions.IgnoreCase).Value;
                data.InnerHtml = Regex.Match(r, @"(?<=\>).+(?=<)", RegexOptions.Singleline).Value;
                var eleBegin = Regex.Match(r, "<.+?>").Value;
                // Capture name="value" pairs; the value comes from a group so that '=' inside it (e.g. a query string) is preserved
                var attrList = Regex.Matches(eleBegin, @"([a-zA-Z][\w-]*)\s*=\s*""(.*?)""").Cast<Match>().Select(c => new { key = c.Groups[1].Value, value = c.Groups[2].Value }).ToList();
                data.Attributes = new Dictionary<string, string>();
                foreach (var a in attrList)
                {
                    // The indexer avoids an exception when an attribute name is repeated
                    data.Attributes[a.key] = a.value;
                }
                reval.Add(data);
            }
            return reval;
 
        }
 
        #region private
        private List<HtmlInfo> GetDescendantsSource(List<HtmlInfo> allList, string elementName)
        {
            foreach (var r in allList)
            {
                if (r.InnerHtml == null || !r.InnerHtml.Contains("<")) continue;
                var childList = RootDescendants(r.InnerHtml).Where(c => elementName == null || c.TagName.ToLower() == elementName.ToLower()).ToList();
                if (childList == null || childList.Count == 0)
                {
                    childList = GetDescendantsSource(RootDescendants(r.InnerHtml), elementName);
                    if (childList != null && childList.Count > 0)
                        return childList;
                }
                else
                {
                    return childList;
                }
            }
            // Nothing matched at any depth: return an empty list so callers can chain LINQ safely
            return new List<HtmlInfo>();
        }
 
        private void GetElementsStringList(string html, ref List<string> eleList)
        {
            HtmlInfo info = new HtmlInfo();
            info.TagName = Regex.Match(html, @"(?<=\<\s{0,5}|\<)([a-z,A-Z]+|h\d{1})(?=\>|\s)", RegexOptions.IgnoreCase).Value;
            string currentTagBeginReg = @"<\s{0,10}" + info.TagName + @".*?>";//regex for the current tag's opening tag
            string currentTagEndReg = @"\<\/" + info.TagName + @"\>";//regex for the current tag's closing tag
            if (string.IsNullOrEmpty(info.TagName)) return;
 
            string eleHtml = "";
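            //Case 1: <a/> (self-closing tag)
            //Case 2: <a></a>
            //Case 3: <a> (malformed, no closing tag)
            //Case 4: conditional comment ending in [endif]
            if (Regex.IsMatch(html, @"<\s{0,10}" + info.TagName + "[^<].*?/>"))//self-closing tag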
            {
                eleHtml = Regex.Match(html, @"<\s{0,10}" + info.TagName + "[^<].*?/>").Value;
            }
            else if (!Regex.IsMatch(html, currentTagEndReg))//no closing tag
            {
                if (Regex.IsMatch(html, @"\s{0,10}\<\!\-\-\[if"))
                {
                    eleHtml = GetElementString(html, @"\s{0,10}\<\!\-\-\[if", @"\[endif\]\-\-\>", 1);
                }
                else
                {
                    eleHtml = Regex.Match(html, currentTagBeginReg,RegexOptions.Singleline).Value;
                }
            }
            else
            {
                eleHtml = GetElementString(html, currentTagBeginReg, currentTagEndReg, 1);
            }
 
            try
            {
                eleList.Add(eleHtml);
                html = html.Replace(eleHtml, "");
                html = Regex.Replace(html, @"<\!DOCTYPE.*?>", "");
                if (!Regex.IsMatch(html, @"^\s*$"))
                {
                    GetElementsStringList(html, ref eleList);
                }
 
            }
            catch (Exception ex)
            {
                throw new Exception("Sorry, this HTML cannot be parsed.", ex);
 
            }
 
        }
 
        private string GetElementString(string html, string currentTagBeginReg, string currentTagEndReg, int i)
        {
 
            string newHtml = GetRegNextByNum(html, currentTagBeginReg, currentTagEndReg, i);
            var currentTagBeginMatches = Regex.Matches(newHtml, currentTagBeginReg, RegexOptions.Singleline).Cast<Match>().Select(c => c.Value).ToList();
            var currentTagEndMatches = Regex.Matches(newHtml, currentTagEndReg).Cast<Match>().Select(c => c.Value).ToList();
            if (currentTagBeginMatches.Count == currentTagEndMatches.Count)
            { //the opening and closing tag counts match
                return newHtml;
            }
            return GetElementString(html, currentTagBeginReg, currentTagEndReg, ++i);
        }
 
        private string GetRegNextByNum(string val, string currentTagBeginReg, string currentTagEndReg, int i)
        {
            return Regex.Match(val, currentTagBeginReg + @"((.*?)" + currentTagEndReg + "){" + i + "}?", RegexOptions.IgnoreCase | RegexOptions.Singleline).Value;
        }
        #endregion
 
    }
    public static class XHtmlElementExtendsion
    {
        /// <summary>
        /// Gets the nearest matching HTML elements that share the same level
        /// </summary>
        /// <param name="elementName">Tag name to match; null matches every element</param>
        /// <returns></returns>
        public static List<HtmlInfo> Descendants(this  IEnumerable<HtmlInfo> htmlInfoList, string elementName = null)
        {
            var html = htmlInfoList.First().InnerHtml;
            XHtmlElement xhe = new XHtmlElement(html);
            return xhe.Descendants(elementName);
        }
        /// <summary>
        /// Gets the direct child elements
        /// </summary>
        /// <param name="elementName">Tag name to match; null matches every element</param>
        /// <returns></returns>
        public static List<HtmlInfo> ChildDescendants(this  IEnumerable<HtmlInfo> htmlInfoList, string elementName = null)
        {
            var html = htmlInfoList.First().InnerHtml;
            XHtmlElement xhe = new XHtmlElement(html);
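            // Compare tag names case-insensitively, consistent with Descendants
            return xhe.RootDescendants(html).Where(c => elementName == null || string.Equals(c.TagName, elementName, StringComparison.OrdinalIgnoreCase)).ToList();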
        }
 
        /// <summary>
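        /// Gets the parent element
        /// </summary>
        /// <param name="htmlInfoList">Elements whose parent is wanted</param>
        /// <param name="fullHtml">The complete HTML document that contains them</param>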
        /// <returns></returns>
        public static List<HtmlInfo> ParentDescendant(this  IEnumerable<HtmlInfo> htmlInfoList,string fullHtml)
        {
            var saveLeveHtml = htmlInfoList.First().SameLeveHtml;
            string replaceGuid=Guid.NewGuid().ToString();
            fullHtml = fullHtml.Replace(saveLeveHtml,replaceGuid);
            var parentHtml = Regex.Match(fullHtml, @"<[^<]+?>[^<]*?" + replaceGuid + @".*?<\/.+?>").Value;
            parentHtml = parentHtml.Replace(replaceGuid, saveLeveHtml);
            XHtmlElement xhe = new XHtmlElement(parentHtml);
            return xhe.RootDescendants();
        }
    }
    /// <summary>
    /// HTML element information class
    /// </summary>
    public class HtmlInfo
    {
        /// <summary>
        /// Tag name
        /// </summary>
        public string TagName { get; set; }
        /// <summary>
        /// Element attributes
        /// </summary>
        public Dictionary<string, string> Attributes { get; set; }
        /// <summary>
        /// Inner HTML of the element
        /// </summary>
        public string InnerHtml { get; set; }
 
        public string OldFullHtml { get; set; }
 
        public string SameLeveHtml { get; set; }
 
        /// <summary>
        /// Gets the full HTML of the element
        /// </summary>
        /// <returns></returns>
        public string FullHtml
        {
            get
            {
                StringBuilder reval = new StringBuilder();
                string attributesString = string.Empty;
                if (Attributes != null && Attributes.Count > 0)
                {
                    attributesString = string.Join(" ", Attributes.Select(c => string.Format("{0}=\"{1}\"", c.Key, c.Value)));
                }
                reval.AppendFormat("<{0} {2}>{1}</{0}>", TagName, InnerHtml, attributesString);
                return reval.ToString();
            }
        }
    }
}
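
The comment in RootDescendants describes the core idea: start from an opening tag and keep scanning until its closing tag, allowing for nested tags of the same name. The class does this by repeatedly widening a regex match until the opening and closing counts agree. The sketch below shows the same idea with an explicit depth counter instead. It is an illustration only, not the code used above: TagBalanceSketch and FirstBalancedElement are hypothetical names, and it assumes that attribute values contain no '>' character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration, not part of XHtmlElement.
public static class TagBalanceSketch
{
    // Returns the first complete <tagName>...</tagName> element in html,
    // or null when no opening tag is found or it is never balanced.
    public static string FirstBalancedElement(string html, string tagName)
    {
        // Matches an opening or a closing tag with the given name.
        var tagPattern = new Regex(
            @"<(?<close>/)?" + Regex.Escape(tagName) + @"\b[^>]*>",
            RegexOptions.IgnoreCase);

        int depth = 0;   // number of currently open tags with this name
        int start = -1;  // index of the first opening tag
        foreach (Match m in tagPattern.Matches(html))
        {
            bool isClosing = m.Groups["close"].Success;
            bool isSelfClosing = m.Value.EndsWith("/>", StringComparison.Ordinal);

            if (!isClosing)
            {
                if (start < 0) start = m.Index;
                if (isSelfClosing)
                {
                    if (depth == 0) return m.Value; // <tag/> at the top level is a whole element
                    continue;                        // a nested <tag/> leaves the depth unchanged
                }
                depth++;
            }
            else if (start >= 0)
            {
                depth--;
                if (depth == 0)
                    return html.Substring(start, m.Index + m.Length - start);
            }
        }
        return null;
    }
}
```

For example, FirstBalancedElement("<div><div>x</div></div><div>y</div>", "div") returns the first outer div together with its nested div and stops before the second top-level div.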

Sample HTML page:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title></title>
</head>
<body>
    <a id="1">I am 1</a>
    <a id="2" class="icon">icon</a>
    <img />
</body>
</html>
