Fetching a Douyin video with PHP from its share link

Douyin is currently a very popular short-video platform with plenty of good clips. This article shows how to use PHP to fetch content from Douyin.

1: Open Douyin, pick a video you like, tap Share, and copy the link. You will get text like the following:

　　#科比愿你去的地方也有篮球陪伴，也能披着24号紫金战衣！ #动态壁纸 https://v.douyin.com/36xkCS/ 复制此链接，打开【抖音短视频】，直接观看视频！

This text contains the link we will use for scraping.
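
The short link has to be separated from the surrounding hashtags and prompt text before it can be requested. Here is a minimal sketch, assuming the link always has the form https://v.douyin.com/<id>/ with an alphanumeric id. The helper name extract_share_url is hypothetical and not part of the original article:

```php
<?php
// Hypothetical helper (an illustration, not the article's own code):
// return the first v.douyin.com short link found in the copied share text.
function extract_share_url(string $shareText): ?string
{
    // Assumed link shape: http(s)://v.douyin.com/<alphanumeric id>/
    if (preg_match('#https?://v\.douyin\.com/[A-Za-z0-9]+/?#', $shareText, $m)) {
        return $m[0];
    }
    return null; // no share link present
}

// The share text copied in step 1.
$shareText = '#科比愿你去的地方也有篮球陪伴，也能披着24号紫金战衣！ #动态壁纸 '
           . 'https://v.douyin.com/36xkCS/ 复制此链接，打开【抖音短视频】，直接观看视频！';

echo extract_share_url($shareText), PHP_EOL;
```

Returning null when nothing matches lets the caller reject malformed input before making any network request.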

2: We need simple_html_dom, a PHP library for parsing HTML.

　　Create simple_html_dom.php with the following code:

 <?php

 /**

  * Website: http://sourceforge.net/projects/simplehtmldom/

  * Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)

  * Contributions by:

  *     Yousuke Kumakura (Attribute filters)

  *     Vadim Voituk (Negative indexes supports of "find" method)

  *     Antcs (Constructor with automatically load contents either text or file/url)

  *

  * all affected sections have comments starting with "PaperG"

  *

  * Paperg - Added case insensitive testing of the value of the selector.

  * Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.

  *  This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source,

  *  it will almost always be smaller by some amount.

  *  We use this to determine how far into the file the tag in question is.  This "percentage" will never be accurate as the $dom->size is the "real" number of bytes the dom was created from.

  *  but for most purposes, it's a really good estimation.

  * Paperg - Added the forceTagsClosed to the dom constructor.  Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.

  * Allow the user to tell us how much they trust the html.

  * Paperg add the text and plaintext to the selectors for the find syntax.  plaintext implies text in the innertext of a node.  text implies that the tag is a text node.

  * This allows for us to find tags based on the text they contain.

  * Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.

  * Paperg: added parse_charset so that we know about the character set of the source document.

  *  NOTE:  If the user's system has a routine called get_last_retrieve_url_contents_content_type available, we will assume it's returning the content-type header from the

  *  last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.

  *

  * Found infinite loop in the case of broken html in restore_noise.  Rewrote to protect from that.

  * PaperG (John Schlick) Added get_display_size for "IMG" tags.

  *

  * Licensed under The MIT License

  * Redistributions of files must retain the above copyright notice.

  *

  * @author S.C. Chen <me578022@gmail.com>

  * @author John Schlick

  * @author Rus Carroll

  * @version 1.5 ($Rev: 196 $)

  * @package PlaceLocalInclude

  * @subpackage simple_html_dom

  */

 /**

  * All of the Defines for the classes below.

  * @author S.C. Chen <me578022@gmail.com>

  */

 define('HDOM_TYPE_ELEMENT', 1);

 define('HDOM_TYPE_COMMENT', 2);

 define('HDOM_TYPE_TEXT',    3);

 define('HDOM_TYPE_ENDTAG',  4);

 define('HDOM_TYPE_ROOT',    5);

 define('HDOM_TYPE_UNKNOWN', 6);

 define('HDOM_QUOTE_DOUBLE', 0);

 define('HDOM_QUOTE_SINGLE', 1);

 define('HDOM_QUOTE_NO',     3);

 define('HDOM_INFO_BEGIN',   0);

 define('HDOM_INFO_END',     1);

 define('HDOM_INFO_QUOTE',   2);

 define('HDOM_INFO_SPACE',   3);

 define('HDOM_INFO_TEXT',    4);

 define('HDOM_INFO_INNER',   5);

 define('HDOM_INFO_OUTER',   6);

 define('HDOM_INFO_ENDSPACE',7);

 define('DEFAULT_TARGET_CHARSET', 'UTF-8');

 define('DEFAULT_BR_TEXT', "\r\n");

 define('DEFAULT_SPAN_TEXT', " ");

 define('MAX_FILE_SIZE', 600000);

 // helper functions

 // -----------------------------------------------------------------------------

 // get html dom from file

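 // $maxLen defaults to -1 (PHP_STREAM_COPY_ALL); it is kept for signature compatibility and is not passed to file_get_contents().

 // $offset defaults to 0: since PHP 7.1 a negative offset seeks from the end of the stream, which fails for remote (HTTP) URLs.

 function file_get_html($url, $use_include_path = false, $context=null, $offset = 0, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)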

 {

     // We DO force the tags to be terminated.

     $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);

     // For sourceforge users: uncomment the next line and comment the retrieve_url_contents line 2 lines down if it is not already done.

     $contents = file_get_contents($url, $use_include_path, $context, $offset);

     // Paperg - use our own mechanism for getting the contents as we want to control the timeout.

     //$contents = retrieve_url_contents($url);

     if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)

     {

         return false;

     }

     // The second parameter can force the selectors to all be lowercase.

     $dom->load($contents, $lowercase, $stripRN);

     return $dom;

 }

 // get html dom from string

 function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

 {

     $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);

     if (empty($str) || strlen($str) > MAX_FILE_SIZE)

     {

         $dom->clear();

         return false;

     }

     $dom->load($str, $lowercase, $stripRN);

     return $dom;

 }

 // dump html dom tree

 function dump_html_tree($node, $show_attr=true, $deep=0)

 {

     $node->dump($show_attr, $deep);

 }

 /**

  * simple html dom node

  * PaperG - added ability for "find" routine to lowercase the value of the selector.

  * PaperG - added $tag_start to track the start position of the tag in the total byte index

  *

  * @package PlaceLocalInclude

  */

 class simple_html_dom_node

 {

     public $nodetype = HDOM_TYPE_TEXT;

     public $tag = 'text';

     public $attr = array();

     public $children = array();

     public $nodes = array();

     public $parent = null;

     // The "info" array - see HDOM_INFO_... for what each element contains.

     public $_ = array();

     public $tag_start = 0;

     private $dom = null;

     function __construct($dom)

     {

         $this->dom = $dom;

         $dom->nodes[] = $this;

     }

     function __destruct()

     {

         $this->clear();

     }

     function __toString()

     {

         return $this->outertext();

     }

     // clean up memory due to php5 circular references memory leak...

     function clear()

     {

         $this->dom = null;

         $this->nodes = null;

         $this->parent = null;

         $this->children = null;

     }

     // dump node's tree

     function dump($show_attr=true, $deep=0)

     {

         $lead = str_repeat('    ', $deep);

         echo $lead.$this->tag;

         if ($show_attr && count($this->attr)>0)

         {

             echo '(';

             foreach ($this->attr as $k=>$v)

                 echo "[$k]=>\"".$this->$k.'", ';

             echo ')';

         }

         echo "\n";

         if ($this->nodes)

         {

             foreach ($this->nodes as $c)

             {

                 $c->dump($show_attr, $deep+1);

             }

         }

     }

     // Debugging function to dump a single dom node with a bunch of information about it.

     function dump_node($echo=true)

     {

         $string = $this->tag;

         if (count($this->attr)>0)

         {

             $string .= '(';

             foreach ($this->attr as $k=>$v)

             {

                 $string .= "[$k]=>\"".$this->$k.'", ';

             }

             $string .= ')';

         }

         if (count($this->_)>0)

         {

             $string .= ' $_ (';

             foreach ($this->_ as $k=>$v)

             {

                 if (is_array($v))

                 {

                     $string .= "[$k]=>(";

                     foreach ($v as $k2=>$v2)

                     {

                         $string .= "[$k2]=>\"".$v2.'", ';

                     }

                     $string .= ")";

                 } else {

                     $string .= "[$k]=>\"".$v.'", ';

                 }

             }

             $string .= ")";

         }

         if (isset($this->text))

         {

             $string .= " text: (" . $this->text . ")";

         }

         $string .= " HDOM_INNER_INFO: '";

         if (isset($this->_[HDOM_INFO_INNER]))

         {

             $string .= $this->_[HDOM_INFO_INNER] . "'";

         }

         else

         {

             $string .= ' NULL ';

         }

         $string .= " children: " . count($this->children);

         $string .= " nodes: " . count($this->nodes);

         $string .= " tag_start: " . $this->tag_start;

         $string .= "\n";

         if ($echo)

         {

             echo $string;

             return;

         }

         else

         {

             return $string;

         }

     }

     // returns the parent of node

     // If a node is passed in, it will reset the parent of the current node to that one.

     function parent($parent=null)

     {

         // I am SURE that this doesn't work properly.

         // It fails to unset the current node from its current parent's nodes or children list first.

         if ($parent !== null)

         {

             $this->parent = $parent;

             $this->parent->nodes[] = $this;

             $this->parent->children[] = $this;

         }

         return $this->parent;

     }

     // verify that node has children

     function has_child()

     {

         return !empty($this->children);

     }

     // returns children of node

     function children($idx=-1)

     {

         if ($idx===-1)

         {

             return $this->children;

         }

         if (isset($this->children[$idx])) return $this->children[$idx];

         return null;

     }

     // returns the first child of node

     function first_child()

     {

         if (count($this->children)>0)

         {

             return $this->children[0];

         }

         return null;

     }

     // returns the last child of node

     function last_child()

     {

         if (($count=count($this->children))>0)

         {

             return $this->children[$count-1];

         }

         return null;

     }

     // returns the next sibling of node

     function next_sibling()

     {

         if ($this->parent===null)

         {

             return null;

         }

         $idx = 0;

         $count = count($this->parent->children);

         while ($idx<$count && $this!==$this->parent->children[$idx])

         {

             ++$idx;

         }

         if (++$idx>=$count)

         {

             return null;

         }

         return $this->parent->children[$idx];

     }

     // returns the previous sibling of node

     function prev_sibling()

     {

         if ($this->parent===null) return null;

         $idx = 0;

         $count = count($this->parent->children);

         while ($idx<$count && $this!==$this->parent->children[$idx])

             ++$idx;

         if (--$idx<0) return null;

         return $this->parent->children[$idx];

     }

     // function to locate a specific ancestor tag in the path to the root.

     function find_ancestor_tag($tag)

     {

         global $debugObject;

         if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }

         // Start by including ourselves in the comparison.

         $returnDom = $this;

         while (!is_null($returnDom))

         {

             if (is_object($debugObject)) { $debugObject->debugLog(2, "Current tag is: " . $returnDom->tag); }

             if ($returnDom->tag == $tag)

             {

                 break;

             }

             $returnDom = $returnDom->parent;

         }

         return $returnDom;

     }

     // get dom node's inner html

     function innertext()

     {

         if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];

         if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);

         $ret = '';

         foreach ($this->nodes as $n)

             $ret .= $n->outertext();

         return $ret;

     }

     // get dom node's outer text (with tag)

     function outertext()

     {

         global $debugObject;

         if (is_object($debugObject))

         {

             $text = '';

             if ($this->tag == 'text')

             {

                 if (!empty($this->text))

                 {

                     $text = " with text: " . $this->text;

                 }

             }

             $debugObject->debugLog(1, 'Innertext of tag: ' . $this->tag . $text);

         }

         if ($this->tag==='root') return $this->innertext();

         // trigger callback

         if ($this->dom && $this->dom->callback!==null)

         {

             call_user_func_array($this->dom->callback, array($this));

         }

         if (isset($this->_[HDOM_INFO_OUTER])) return $this->_[HDOM_INFO_OUTER];

         if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);

         // render begin tag

         if ($this->dom && $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]])

         {

             $ret = $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]->makeup();

         } else {

             $ret = "";

         }

         // render inner text

         if (isset($this->_[HDOM_INFO_INNER]))

         {

             // If it's a br tag...  don't return the HDOM_INNER_INFO that we may or may not have added.

             if ($this->tag != "br")

             {

                 $ret .= $this->_[HDOM_INFO_INNER];

             }

         } else {

             if ($this->nodes)

             {

                 foreach ($this->nodes as $n)

                 {

                     $ret .= $this->convert_text($n->outertext());

                 }

             }

         }

         // render end tag

         if (isset($this->_[HDOM_INFO_END]) && $this->_[HDOM_INFO_END]!=0)

             $ret .= '</'.$this->tag.'>';

         return $ret;

     }

     // get dom node's plain text

     function text()

     {

         if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];

         switch ($this->nodetype)

         {

             case HDOM_TYPE_TEXT: return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);

             case HDOM_TYPE_COMMENT: return '';

             case HDOM_TYPE_UNKNOWN: return '';

         }

         if (strcasecmp($this->tag, 'script')===0) return '';

         if (strcasecmp($this->tag, 'style')===0) return '';

         $ret = '';

         // In rare cases, (always node type 1 or HDOM_TYPE_ELEMENT - observed for some span tags, and some p tags) $this->nodes is set to NULL.

         // NOTE: This indicates that there is a problem where it's set to NULL without a clear happening.

         // WHY is this happening?

         if (!is_null($this->nodes))

         {

             foreach ($this->nodes as $n)

             {

                 $ret .= $this->convert_text($n->text());

             }

             // If this node is a span... add a space at the end of it so multiple spans don't run into each other.  This is plaintext after all.

             if ($this->tag == "span")

             {

                 $ret .= $this->dom->default_span_text;

             }

         }

         return $ret;

     }

     function xmltext()

     {

         $ret = $this->innertext();

         $ret = str_ireplace('<![CDATA[', '', $ret);

         $ret = str_replace(']]>', '', $ret);

         return $ret;

     }

     // build node's text with tag

     function makeup()

     {

         // text, comment, unknown

         if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);

         $ret = '<'.$this->tag;

         $i = -1;

         foreach ($this->attr as $key=>$val)

         {

             ++$i;

             // skip removed attribute

             if ($val===null || $val===false)

                 continue;

             $ret .= $this->_[HDOM_INFO_SPACE][$i][0];

             //no value attr: nowrap, checked selected...

             if ($val===true)

                 $ret .= $key;

             else {

                 switch ($this->_[HDOM_INFO_QUOTE][$i])

                 {

                     case HDOM_QUOTE_DOUBLE: $quote = '"'; break;

                     case HDOM_QUOTE_SINGLE: $quote = '\''; break;

                     default: $quote = '';

                 }

                 $ret .= $key.$this->_[HDOM_INFO_SPACE][$i][1].'='.$this->_[HDOM_INFO_SPACE][$i][2].$quote.$val.$quote;

             }

         }

         $ret = $this->dom->restore_noise($ret);

         return $ret . $this->_[HDOM_INFO_ENDSPACE] . '>';

     }

     // find elements by css selector

     //PaperG - added ability for find to lowercase the value of the selector.

     function find($selector, $idx=null, $lowercase=false)

     {

         $selectors = $this->parse_selector($selector);

         if (($count=count($selectors))===0) return array();

         $found_keys = array();

         // find each selector

         for ($c=0; $c<$count; ++$c)

         {

             // The change on the below line was documented on the sourceforge code tracker id 2788009

             // used to be: if (($levle=count($selectors[0]))===0) return array();

             if (($levle=count($selectors[$c]))===0) return array();

             if (!isset($this->_[HDOM_INFO_BEGIN])) return array();

             $head = array($this->_[HDOM_INFO_BEGIN]=>1);

             // handle descendant selectors, no recursive!

             for ($l=0; $l<$levle; ++$l)

             {

                 $ret = array();

                 foreach ($head as $k=>$v)

                 {

                     $n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k];

                     //PaperG - Pass this optional parameter on to the seek function.

                     $n->seek($selectors[$c][$l], $ret, $lowercase);

                 }

                 $head = $ret;

             }

             foreach ($head as $k=>$v)

             {

                 if (!isset($found_keys[$k]))

                     $found_keys[$k] = 1;

             }

         }

         // sort keys

         ksort($found_keys);

         $found = array();

         foreach ($found_keys as $k=>$v)

             $found[] = $this->dom->nodes[$k];

         // return nth-element or array

         if (is_null($idx)) return $found;

         else if ($idx<0) $idx = count($found) + $idx;

         return (isset($found[$idx])) ? $found[$idx] : null;

     }

     // seek for given conditions

     // PaperG - added parameter to allow for case insensitive testing of the value of a selector.

     protected function seek($selector, &$ret, $lowercase=false)

     {

         global $debugObject;

         if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }

         list($tag, $key, $val, $exp, $no_key) = $selector;

         // xpath index

         if ($tag && $key && is_numeric($key))

         {

             $count = 0;

             foreach ($this->children as $c)

             {

                 if ($tag==='*' || $tag===$c->tag) {

                     if (++$count==$key) {

                         $ret[$c->_[HDOM_INFO_BEGIN]] = 1;

                         return;

                     }

                 }

             }

             return;

         }

         $end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0;

         if ($end==0) {

             $parent = $this->parent;

             while (!isset($parent->_[HDOM_INFO_END]) && $parent!==null) {

                 $end -= 1;

                 $parent = $parent->parent;

             }

             $end += $parent->_[HDOM_INFO_END];

         }

         for ($i=$this->_[HDOM_INFO_BEGIN]+1; $i<$end; ++$i) {

             $node = $this->dom->nodes[$i];

             $pass = true;

             if ($tag==='*' && !$key) {

                 if (in_array($node, $this->children, true))

                     $ret[$i] = 1;

                 continue;

             }

             // compare tag

             if ($tag && $tag!=$node->tag && $tag!=='*') {$pass=false;}

             // compare key

             if ($pass && $key) {

                 if ($no_key) {

                     if (isset($node->attr[$key])) $pass=false;

                 } else {

                     if (($key != "plaintext") && !isset($node->attr[$key])) $pass=false;

                 }

             }

             // compare value

             if ($pass && $key && $val  && $val!=='*') {

                 // If they have told us that this is a "plaintext" search then we want the plaintext of the node - right?

                 if ($key == "plaintext") {

                     // $node->plaintext actually returns $node->text();

                     $nodeKeyValue = $node->text();

                 } else {

                     // this is a normal search, we want the value of that attribute of the tag.

                     $nodeKeyValue = $node->attr[$key];

                 }

                 if (is_object($debugObject)) {$debugObject->debugLog(2, "testing node: " . $node->tag . " for attribute: " . $key . $exp . $val . " where nodes value is: " . $nodeKeyValue);}

                 //PaperG - If lowercase is set, do a case insensitive test of the value of the selector.

                 if ($lowercase) {

                     $check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));

                 } else {

                     $check = $this->match($exp, $val, $nodeKeyValue);

                 }

                 if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}

                 // handle multiple class

                 if (!$check && strcasecmp($key, 'class')===0) {

                     foreach (explode(' ',$node->attr[$key]) as $k) {

                         // Without this, there were cases where leading, trailing, or double spaces lead to our comparing blanks - bad form.

                         if (!empty($k)) {

                             if ($lowercase) {

                                 $check = $this->match($exp, strtolower($val), strtolower($k));

                             } else {

                                 $check = $this->match($exp, $val, $k);

                             }

                             if ($check) break;

                         }

                     }

                 }

                 if (!$check) $pass = false;

             }

             if ($pass) $ret[$i] = 1;

             unset($node);

         }

         // It's passed by reference so this is actually what this function returns.

         if (is_object($debugObject)) {$debugObject->debugLog(1, "EXIT - ret: ", $ret);}

     }

     protected function match($exp, $pattern, $value) {

         global $debugObject;

         if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}

         switch ($exp) {

             case '=':

                 return ($value===$pattern);

             case '!=':

                 return ($value!==$pattern);

             case '^=':

                 return preg_match("/^".preg_quote($pattern,'/')."/", $value);

             case '$=':

                 return preg_match("/".preg_quote($pattern,'/')."$/", $value);

             case '*=':

                 if ($pattern[0]=='/') {

                     return preg_match($pattern, $value);

                 }

                 return preg_match("/".$pattern."/i", $value);

         }

         return false;

     }

     protected function parse_selector($selector_string) {

         global $debugObject;

         if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}

         // pattern of CSS selectors, modified from mootools

         // Paperg: Add the colon to the attribute, so that it properly finds <tag attr:ibute="something" > like google does.

         // Note: if you try to look at this attribute, you MUST use getAttribute since $dom->x:y will fail the php syntax check.

         // Notice the \[ starting the attribute?  and the @? following?  This implies that an html attribute specifier may start with an @ sign that is NOT captured by the expression.

         // Further study is required to determine if this should be documented or removed.

 //        $pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";

         // The hyphens are escaped because PCRE2 (PHP 7.3+) rejects ranges such as [\w-:] as invalid.

         $pattern = "/([\w\-:\*]*)(?:\#([\w\-]+)|\.([\w\-]+))?(?:\[@?(!?[\w\-:]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";

         preg_match_all($pattern, trim($selector_string).' ', $matches, PREG_SET_ORDER);

         if (is_object($debugObject)) {$debugObject->debugLog(2, "Matches Array: ", $matches);}

         $selectors = array();

         $result = array();

         //print_r($matches);

         foreach ($matches as $m) {

             $m[0] = trim($m[0]);

             if ($m[0]==='' || $m[0]==='/' || $m[0]==='//') continue;

             // for browser generated xpath

             if ($m[1]==='tbody') continue;

             list($tag, $key, $val, $exp, $no_key) = array($m[1], null, null, '=', false);

             if (!empty($m[2])) {$key='id'; $val=$m[2];}

             if (!empty($m[3])) {$key='class'; $val=$m[3];}

             if (!empty($m[4])) {$key=$m[4];}

             if (!empty($m[5])) {$exp=$m[5];}

             if (!empty($m[6])) {$val=$m[6];}

             // convert to lowercase

             if ($this->dom->lowercase) {$tag=strtolower($tag); $key=strtolower($key);}

             //elements that do NOT have the specified attribute

             if (isset($key[0]) && $key[0]==='!') {$key=substr($key, 1); $no_key=true;}

             $result[] = array($tag, $key, $val, $exp, $no_key);

             if (trim($m[7])===',') {

                 $selectors[] = $result;

                 $result = array();

             }

         }

         if (count($result)>0)

             $selectors[] = $result;

         return $selectors;

     }

     function __get($name) {

         if (isset($this->attr[$name]))

         {

             return $this->convert_text($this->attr[$name]);

         }

         switch ($name) {

             case 'outertext': return $this->outertext();

             case 'innertext': return $this->innertext();

             case 'plaintext': return $this->text();

             case 'xmltext': return $this->xmltext();

             default: return array_key_exists($name, $this->attr);

         }

     }

     function __set($name, $value) {

         switch ($name) {

             case 'outertext': return $this->_[HDOM_INFO_OUTER] = $value;

             case 'innertext':

                 if (isset($this->_[HDOM_INFO_TEXT])) return $this->_[HDOM_INFO_TEXT] = $value;

                 return $this->_[HDOM_INFO_INNER] = $value;

         }

         if (!isset($this->attr[$name])) {

             $this->_[HDOM_INFO_SPACE][] = array(' ', '', '');

             $this->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;

         }

         $this->attr[$name] = $value;

     }

     function __isset($name) {

         switch ($name) {

             case 'outertext': return true;

             case 'innertext': return true;

             case 'plaintext': return true;

         }

         //no value attr: nowrap, checked selected...

         return (array_key_exists($name, $this->attr)) ? true : isset($this->attr[$name]);

     }

     function __unset($name) {

         if (isset($this->attr[$name]))

             unset($this->attr[$name]);

     }

     // PaperG - Function to convert the text from one character set to another if the two sets are not the same.

     function convert_text($text)

     {

         global $debugObject;

         if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}

         $converted_text = $text;

         $sourceCharset = "";

         $targetCharset = "";

         if ($this->dom)

         {

             $sourceCharset = strtoupper($this->dom->_charset);

             $targetCharset = strtoupper($this->dom->_target_charset);

         }

         if (is_object($debugObject)) {$debugObject->debugLog(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}

         if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))

         {

             // Check if the reported encoding could have been incorrect and the text is actually already UTF-8

             if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))

             {

                 $converted_text = $text;

             }

             else

             {

                 $converted_text = iconv($sourceCharset, $targetCharset, $text);

             }

         }

         // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.

         if ($targetCharset == 'UTF-8')

         {

             if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")

             {

                 $converted_text = substr($converted_text, 3);

             }

             if (substr($converted_text, -3) == "\xef\xbb\xbf")

             {

                 $converted_text = substr($converted_text, 0, -3);

             }

         }

         return $converted_text;

     }

     /**

     * Returns true if $str is valid UTF-8 and false otherwise.

     *

     * @param mixed $str String to be tested

     * @return boolean

     */

     static function is_utf8($str)

     {

         $c=0; $b=0;

         $bits=0;

         $len=strlen($str);

         for($i=0; $i<$len; $i++)

         {

             $c=ord($str[$i]);

             if($c >= 128) // 0x80 and above are never single-byte characters in UTF-8

             {

                 if(($c >= 254)) return false;

                 elseif($c >= 252) $bits=6;

                 elseif($c >= 248) $bits=5;

                 elseif($c >= 240) $bits=4;

                 elseif($c >= 224) $bits=3;

                 elseif($c >= 192) $bits=2;

                 else return false;

                 if(($i+$bits) > $len) return false;

                 while($bits > 1)

                 {

                     $i++;

                     $b=ord($str[$i]);

                     if($b < 128 || $b > 191) return false;

                     $bits--;

                 }

             }

         }

         return true;

     }

     /*

     function is_utf8($string)

     {

         //this is buggy

         return (utf8_encode(utf8_decode($string)) == $string);

     }

     */

     /**

      * Function to try a few tricks to determine the displayed size of an img on the page.

      * NOTE: This will ONLY work on an IMG tag. Returns FALSE on all other tag types.

      *

      * @author John Schlick

      * @version April 19 2012

      * @return array an array containing the 'height' and 'width' of the image on the page or -1 if we can't figure it out.

      */

     function get_display_size()

     {

         global $debugObject;

         $width = -1;

         $height = -1;

         if ($this->tag !== 'img')

         {

             return false;

         }

         // See if there is a height or width attribute in the tag itself.

         if (isset($this->attr['width']))

         {

             $width = $this->attr['width'];

         }

         if (isset($this->attr['height']))

         {

             $height = $this->attr['height'];

         }

         // Now look for an inline style.

         if (isset($this->attr['style']))

         {

             // Thanks to user gnarf from stackoverflow for this regular expression.

             $attributes = array();

             preg_match_all("/([\w-]+)\s*:\s*([^;]+)\s*;?/", $this->attr['style'], $matches, PREG_SET_ORDER);

             foreach ($matches as $match) {

               $attributes[$match[1]] = $match[2];

             }

             // If there is a width in the style attributes:

             if (isset($attributes['width']) && $width == -1)

             {

                 // check that the last two characters are px (pixels)

                 if (strtolower(substr($attributes['width'], -2)) == 'px')

                 {

                     $proposed_width = substr($attributes['width'], 0, -2);

                     // Now make sure that it's an integer and not something stupid.

                     if (filter_var($proposed_width, FILTER_VALIDATE_INT))

                     {

                         $width = $proposed_width;

                     }

                 }

             }

             // If there is a height in the style attributes:

             if (isset($attributes['height']) && $height == -1)

             {

                 // check that the last two characters are px (pixels)

                 if (strtolower(substr($attributes['height'], -2)) == 'px')

                 {

                     $proposed_height = substr($attributes['height'], 0, -2);

                     // Now make sure that it's an integer and not something stupid.

                     if (filter_var($proposed_height, FILTER_VALIDATE_INT))

                     {

                         $height = $proposed_height;

                     }

                 }

             }

         }

         // Future enhancement:

         // Look in the tag to see if there is a class or id specified that has a height or width attribute to it.

         // Far future enhancement

         // Look at all the parent tags of this image to see if they specify a class or id that has an img selector that specifies a height or width

         // Note that in this case, the class or id will have the img subselector for it to apply to the image.

         // ridiculously far future development

         // If the class or id is specified in a SEPARATE css file that's not on the page, go get it and do what we were just doing for the ones on the page.

         $result = array('height' => $height,

                         'width' => $width);

         return $result;

     }

     // camel naming conventions

     function getAllAttributes() {return $this->attr;}

     function getAttribute($name) {return $this->__get($name);}

     function setAttribute($name, $value) {$this->__set($name, $value);}

     function hasAttribute($name) {return $this->__isset($name);}

     function removeAttribute($name) {$this->__set($name, null);}

     function getElementById($id) {return $this->find("#$id", 0);}

     function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);}

     function getElementByTagName($name) {return $this->find($name, 0);}

     function getElementsByTagName($name, $idx=null) {return $this->find($name, $idx);}

     function parentNode() {return $this->parent();}

     function childNodes($idx=-1) {return $this->children($idx);}

     function firstChild() {return $this->first_child();}

     function lastChild() {return $this->last_child();}

     function nextSibling() {return $this->next_sibling();}

     function previousSibling() {return $this->prev_sibling();}

     function hasChildNodes() {return $this->has_child();}

     function nodeName() {return $this->tag;}

     function appendChild($node) {$node->parent($this); return $node;}

 }

 /**

  * simple html dom parser

  * Paperg - in the find routine: allow us to specify that we want case insensitive testing of the value of the selector.

  * Paperg - change $size from protected to public so we can easily access it

  * Paperg - added ForceTagsClosed in the constructor which tells us whether we trust the html or not.  Default is to NOT trust it.

  *

  * @package PlaceLocalInclude

  */

 class simple_html_dom

 {

     public $root = null;

     public $nodes = array();

     public $callback = null;

     public $lowercase = false;

     // Used to keep track of how large the text was when we started.

     public $original_size;

     public $size;

     protected $pos;

     protected $doc;

     protected $char;

     protected $cursor;

     protected $parent;

     protected $noise = array();

     protected $token_blank = " \t\r\n";

     protected $token_equal = ' =/>';

     protected $token_slash = " />\r\n\t";

     protected $token_attr = ' >';

     // Note that this is referenced by a child node, and so it needs to be public for that node to see this information.

     public $_charset = '';

     public $_target_charset = '';

     protected $default_br_text = "";

     public $default_span_text = "";

     // use isset instead of in_array, performance boost about 30%...

     protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'link'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);

     protected $block_tags = array('root'=>1, 'body'=>1, 'form'=>1, 'div'=>1, 'span'=>1, 'table'=>1);

     // Known sourceforge issue #2977341

     // B tags that are not closed cause us to return everything to the end of the document.

     protected $optional_closing_tags = array(

         'tr'=>array('tr'=>1, 'td'=>1, 'th'=>1),

         'th'=>array('th'=>1),

         'td'=>array('td'=>1),

         'li'=>array('li'=>1),

         'dt'=>array('dt'=>1, 'dd'=>1),

         'dd'=>array('dd'=>1, 'dt'=>1),

         'dl'=>array('dd'=>1, 'dt'=>1),

         'p'=>array('p'=>1),

         'nobr'=>array('nobr'=>1),

         'b'=>array('b'=>1),

         'option'=>array('option'=>1),

     );

     function __construct($str=null, $lowercase=true, $forceTagsClosed=true, $target_charset=DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

     {

         if ($str)

         {

             if (preg_match("/^https?:\/\//i",$str) || is_file($str)) // accept both http:// and https:// URLs

             {

                 $this->load_file($str);

             }

             else

             {

                 $this->load($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);

             }

         }

         // Forcing tags to be closed implies that we don't trust the html, but it can lead to parsing errors if we SHOULD trust the html.

         if (!$forceTagsClosed) {

             $this->optional_closing_tags=array();

         }

         $this->_target_charset = $target_charset;

     }

     function __destruct()

     {

         $this->clear();

     }

     // load html from string

     function load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

     {

         global $debugObject;

         // prepare

         $this->prepare($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);

         // strip out comments

         $this->remove_noise("'<!--(.*?)-->'is");

         // strip out cdata

         $this->remove_noise("'<!\[CDATA\[(.*?)\]\]>'is", true);

         // Per sourceforge http://sourceforge.net/tracker/?func=detail&aid=2949097&group_id=218559&atid=1044037

         // Script tags removal now precedes style tag removal.

         // strip out <script> tags

         $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");

         $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");

         // strip out <style> tags

         $this->remove_noise("'<\s*style[^>]*[^/]>(.*?)<\s*/\s*style\s*>'is");

         $this->remove_noise("'<\s*style\s*>(.*?)<\s*/\s*style\s*>'is");

         // strip out preformatted tags

         $this->remove_noise("'<\s*(?:code)[^>]*>(.*?)<\s*/\s*(?:code)\s*>'is");

         // strip out server side scripts

         $this->remove_noise("'(<\?)(.*?)(\?>)'s", true);

         // strip smarty scripts

         $this->remove_noise("'(\{\w)(.*?)(\})'s", true);

         // parsing

         while ($this->parse());

         // end

         $this->root->_[HDOM_INFO_END] = $this->cursor;

         $this->parse_charset();

         // make load function chainable

         return $this;

     }

     // load html from file

     function load_file()

     {

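         $args = func_get_args();

         $contents = call_user_func_array('file_get_contents', $args);

         // Bail out if the file or URL could not be read. Checking the return value directly
         // avoids error_get_last(), which also reports unrelated errors raised earlier.

         if ($contents === false) {

             $this->clear();

             return false;

         }

         $this->load($contents, true);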

     }

     // set callback function

     function set_callback($function_name)

     {

         $this->callback = $function_name;

     }

     // remove callback function

     function remove_callback()

     {

         $this->callback = null;

     }

     // save dom as string

     function save($filepath='')

     {

         $ret = $this->root->innertext();

         if ($filepath!=='') file_put_contents($filepath, $ret, LOCK_EX);

         return $ret;

     }

     // find dom node by css selector

     // Paperg - allow us to specify that we want case insensitive testing of the value of the selector.

     function find($selector, $idx=null, $lowercase=false)

     {

         return $this->root->find($selector, $idx, $lowercase);

     }

     // clean up memory due to php5 circular references memory leak...

     function clear()

     {

         foreach ($this->nodes as $n) {$n->clear(); $n = null;}

         // The next line was added per sourceforge issue 2977248, as a fix for ongoing memory leaks that occur even with the use of clear.

         if (isset($this->children)) foreach ($this->children as $n) {$n->clear(); $n = null;}

         if (isset($this->parent)) {$this->parent->clear(); unset($this->parent);}

         if (isset($this->root)) {$this->root->clear(); unset($this->root);}

         unset($this->doc);

         unset($this->noise);

     }

     function dump($show_attr=true)

     {

         $this->root->dump($show_attr);

     }

     // prepare HTML data and init everything

     protected function prepare($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

     {

         $this->clear();

         // set the length of content before we do anything to it.

         $this->size = strlen($str);

         // Save the original size of the html that we got in.  It might be useful to someone.

         $this->original_size = $this->size;

         //before we save the string as the doc...  strip out the \r \n's if we are told to.

         if ($stripRN) {

             $str = str_replace("\r", " ", $str);

             $str = str_replace("\n", " ", $str);

             // set the length of content since we have changed it.

             $this->size = strlen($str);

         }

         $this->doc = $str;

         $this->pos = 0;

         $this->cursor = 1;

         $this->noise = array();

         $this->nodes = array();

         $this->lowercase = $lowercase;

         $this->default_br_text = $defaultBRText;

         $this->default_span_text = $defaultSpanText;

         $this->root = new simple_html_dom_node($this);

         $this->root->tag = 'root';

         $this->root->_[HDOM_INFO_BEGIN] = -1;

         $this->root->nodetype = HDOM_TYPE_ROOT;

         $this->parent = $this->root;

         if ($this->size>0) $this->char = $this->doc[0];

     }

     // parse html content

     protected function parse()

     {

         if (($s = $this->copy_until_char('<'))==='')

         {

             return $this->read_tag();

         }

         // text

         $node = new simple_html_dom_node($this);

         ++$this->cursor;

         $node->_[HDOM_INFO_TEXT] = $s;

         $this->link_nodes($node, false);

         return true;

     }

     // PAPERG - dkchou - added this to try to identify the character set of the page we have just parsed so we know better how to spit it out later.

     // NOTE:  IF you provide a routine called get_last_retrieve_url_contents_content_type which returns the CURLINFO_CONTENT_TYPE from the last curl_exec

     // (or the content_type header from the last transfer), we will parse THAT, and if a charset is specified, we will use it over any other mechanism.

     protected function parse_charset()

     {

         global $debugObject;

         $charset = null;

         if (function_exists('get_last_retrieve_url_contents_content_type'))

         {

             $contentTypeHeader = get_last_retrieve_url_contents_content_type();

             $success = preg_match('/charset=(.+)/', $contentTypeHeader, $matches);

             if ($success)

             {

                 $charset = $matches[1];

                 if (is_object($debugObject)) {$debugObject->debugLog(2, 'header content-type found charset of: ' . $charset);}

             }

         }

         if (empty($charset))

         {

             $el = $this->root->find('meta[http-equiv=Content-Type]',0);

             if (!empty($el))

             {

                 $fullvalue = $el->content;

                 if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag found' . $fullvalue);}

                 if (!empty($fullvalue))

                 {

                     $success = preg_match('/charset=(.+)/', $fullvalue, $matches);

                     if ($success)

                     {

                         $charset = $matches[1];

                     }

                     else

                     {

                         // If there is a meta tag, and they don't specify the character set, research says that it's typically ISO-8859-1

                         if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag couldn\'t be parsed. using iso-8859 default.');}

                         $charset = 'ISO-8859-1';

                     }

                 }

             }

         }

         // If we couldn't find a charset above, then let's try to detect one based on the text we got...

         if (empty($charset))

         {

             // Have php try to detect the encoding from the text given to us.

             $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) );

             if (is_object($debugObject)) {$debugObject->debugLog(2, 'mb_detect found: ' . $charset);}

             // and if this doesn't work...  then we need to just wrongheadedly assume it's UTF-8 so that we can move on - cause this will usually give us most of what we need...

             if ($charset === false)

             {

                 if (is_object($debugObject)) {$debugObject->debugLog(2, 'since mb_detect failed - using default of utf-8');}

                 $charset = 'UTF-8';

             }

         }

         // Since CP1252 is a superset, if we get one of its subsets, we want it instead.

         if ((strtolower($charset) == strtolower('ISO-8859-1')) || (strtolower($charset) == strtolower('Latin1')) || (strtolower($charset) == strtolower('Latin-1')))

         {

             if (is_object($debugObject)) {$debugObject->debugLog(2, 'replacing ' . $charset . ' with CP1252 as its a superset');}

             $charset = 'CP1252';

         }

         if (is_object($debugObject)) {$debugObject->debugLog(1, 'EXIT - ' . $charset);}

         return $this->_charset = $charset;

     }

     // read tag info

     protected function read_tag()

     {

         if ($this->char!=='<')

         {

             $this->root->_[HDOM_INFO_END] = $this->cursor;

             return false;

         }

         $begin_tag_pos = $this->pos;

         $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

         // end tag

         if ($this->char==='/')

         {

             $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

             // This represents the change in the simple_html_dom trunk from revision 180 to 181.

             // $this->skip($this->token_blank_t);

             $this->skip($this->token_blank);

             $tag = $this->copy_until_char('>');

             // skip attributes in end tag

             if (($pos = strpos($tag, ' '))!==false)

                 $tag = substr($tag, 0, $pos);

             $parent_lower = strtolower($this->parent->tag);

             $tag_lower = strtolower($tag);

             if ($parent_lower!==$tag_lower)

             {

                 if (isset($this->optional_closing_tags[$parent_lower]) && isset($this->block_tags[$tag_lower]))

                 {

                     $this->parent->_[HDOM_INFO_END] = 0;

                     $org_parent = $this->parent;

                     while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)

                         $this->parent = $this->parent->parent;

                     if (strtolower($this->parent->tag)!==$tag_lower) {

                         $this->parent = $org_parent; // restore original parent

                         if ($this->parent->parent) $this->parent = $this->parent->parent;

                         $this->parent->_[HDOM_INFO_END] = $this->cursor;

                         return $this->as_text_node($tag);

                     }

                 }

                 else if (($this->parent->parent) && isset($this->block_tags[$tag_lower]))

                 {

                     $this->parent->_[HDOM_INFO_END] = 0;

                     $org_parent = $this->parent;

                     while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)

                         $this->parent = $this->parent->parent;

                     if (strtolower($this->parent->tag)!==$tag_lower)

                     {

                         $this->parent = $org_parent; // restore original parent

                         $this->parent->_[HDOM_INFO_END] = $this->cursor;

                         return $this->as_text_node($tag);

                     }

                 }

                 else if (($this->parent->parent) && strtolower($this->parent->parent->tag)===$tag_lower)

                 {

                     $this->parent->_[HDOM_INFO_END] = 0;

                     $this->parent = $this->parent->parent;

                 }

                 else

                     return $this->as_text_node($tag);

             }

             $this->parent->_[HDOM_INFO_END] = $this->cursor;

             if ($this->parent->parent) $this->parent = $this->parent->parent;

             $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

             return true;

         }

         $node = new simple_html_dom_node($this);

         $node->_[HDOM_INFO_BEGIN] = $this->cursor;

         ++$this->cursor;

         $tag = $this->copy_until($this->token_slash);

         $node->tag_start = $begin_tag_pos;

         // doctype, cdata & comments...

         if (isset($tag[0]) && $tag[0]==='!') {

             $node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until_char('>');

             if (isset($tag[2]) && $tag[1]==='-' && $tag[2]==='-') {

                 $node->nodetype = HDOM_TYPE_COMMENT;

                 $node->tag = 'comment';

             } else {

                 $node->nodetype = HDOM_TYPE_UNKNOWN;

                 $node->tag = 'unknown';

             }

             if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';

             $this->link_nodes($node, true);

             $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

             return true;

         }

         // text

         if (strpos($tag, '<')!==false) {

             $tag = '<' . substr($tag, 0, -1);

             $node->_[HDOM_INFO_TEXT] = $tag;

             $this->link_nodes($node, false);

             $this->char = $this->doc[--$this->pos]; // prev

             return true;

         }

         if (!preg_match("/^[\w\-:]+$/", $tag)) { // hyphen escaped: an unescaped "\w-:" is an invalid range under PCRE2 (PHP 7.3+)

             $node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until('<>');

             if ($this->char==='<') {

                 $this->link_nodes($node, false);

                 return true;

             }

             if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';

             $this->link_nodes($node, false);

             $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

             return true;

         }

         // begin tag

         $node->nodetype = HDOM_TYPE_ELEMENT;

         $tag_lower = strtolower($tag);

         $node->tag = ($this->lowercase) ? $tag_lower : $tag;

         // handle optional closing tags

         if (isset($this->optional_closing_tags[$tag_lower]) )

         {

             while (isset($this->optional_closing_tags[$tag_lower][strtolower($this->parent->tag)]))

             {

                 $this->parent->_[HDOM_INFO_END] = 0;

                 $this->parent = $this->parent->parent;

             }

             $node->parent = $this->parent;

         }

         $guard = 0; // prevent infinite loop

         $space = array($this->copy_skip($this->token_blank), '', '');

         // attributes

         do

         {

             if ($this->char!==null && $space[0]==='')

             {

                 break;

             }

             $name = $this->copy_until($this->token_equal);

             if ($guard===$this->pos)

             {

                 $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                 continue;

             }

             $guard = $this->pos;

             // handle endless '<'

             if ($this->pos>=$this->size-1 && $this->char!=='>') {

                 $node->nodetype = HDOM_TYPE_TEXT;

                 $node->_[HDOM_INFO_END] = 0;

                 $node->_[HDOM_INFO_TEXT] = '<'.$tag . $space[0] . $name;

                 $node->tag = 'text';

                 $this->link_nodes($node, false);

                 return true;

             }

             // handle mismatch '<'

             if ($this->doc[$this->pos-1]=='<') {

                 $node->nodetype = HDOM_TYPE_TEXT;

                 $node->tag = 'text';

                 $node->attr = array();

                 $node->_[HDOM_INFO_END] = 0;

                 $node->_[HDOM_INFO_TEXT] = substr($this->doc, $begin_tag_pos, $this->pos-$begin_tag_pos-1);

                 $this->pos -= 2;

                 $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                 $this->link_nodes($node, false);

                 return true;

             }

             if ($name!=='/' && $name!=='') {

                 $space[1] = $this->copy_skip($this->token_blank);

                 $name = $this->restore_noise($name);

                 if ($this->lowercase) $name = strtolower($name);

                 if ($this->char==='=') {

                     $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                     $this->parse_attr($node, $name, $space);

                 }

                 else {

                     //no value attr: nowrap, checked selected...

                     $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;

                     $node->attr[$name] = true;

                     if ($this->char!='>') $this->char = $this->doc[--$this->pos]; // prev

                 }

                 $node->_[HDOM_INFO_SPACE][] = $space;

                 $space = array($this->copy_skip($this->token_blank), '', '');

             }

             else

                 break;

         } while ($this->char!=='>' && $this->char!=='/');

         $this->link_nodes($node, true);

         $node->_[HDOM_INFO_ENDSPACE] = $space[0];

         // check self closing

         if ($this->copy_until_char_escape('>')==='/')

         {

             $node->_[HDOM_INFO_ENDSPACE] .= '/';

             $node->_[HDOM_INFO_END] = 0;

         }

         else

         {

             // reset parent

             if (!isset($this->self_closing_tags[strtolower($node->tag)])) $this->parent = $node;

         }

         $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

         // If it's a BR tag, we need to set its text to the default text.

         // This way when we see it in plaintext, we can generate formatting that the user wants.

         // since a br tag never has sub nodes, this works well.

         if ($node->tag == "br")

         {

             $node->_[HDOM_INFO_INNER] = $this->default_br_text;

         }

         return true;

     }

     // parse attributes

     protected function parse_attr($node, $name, &$space)

     {

         // Per sourceforge: http://sourceforge.net/tracker/?func=detail&aid=3061408&group_id=218559&atid=1044037

         // If the attribute is already defined inside a tag, only pay attention to the first one as opposed to the last one.

         if (isset($node->attr[$name]))

         {

             return;

         }

         $space[2] = $this->copy_skip($this->token_blank);

         switch ($this->char) {

             case '"':

                 $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;

                 $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                 $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('"'));

                 $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                 break;

             case '\'':

                 $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_SINGLE;

                 $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                 $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('\''));

                 $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                 break;

             default:

                 $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;

                 $node->attr[$name] = $this->restore_noise($this->copy_until($this->token_attr));

         }

         // PaperG: Attributes should not have \r or \n in them, that counts as html whitespace.

         $node->attr[$name] = str_replace("\r", "", $node->attr[$name]);

         $node->attr[$name] = str_replace("\n", "", $node->attr[$name]);

         // PaperG: If this is a "class" selector, let's get rid of the preceding and trailing space since some people leave it in the multi class case.

         if ($name == "class") {

             $node->attr[$name] = trim($node->attr[$name]);

         }

     }

     // link node's parent

     protected function link_nodes(&$node, $is_child)

     {

         $node->parent = $this->parent;

         $this->parent->nodes[] = $node;

         if ($is_child)

         {

             $this->parent->children[] = $node;

         }

     }

     // as a text node

     protected function as_text_node($tag)

     {

         $node = new simple_html_dom_node($this);

         ++$this->cursor;

         $node->_[HDOM_INFO_TEXT] = '</' . $tag . '>';

         $this->link_nodes($node, false);

         $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

         return true;

     }

     protected function skip($chars)

     {

         $this->pos += strspn($this->doc, $chars, $this->pos);

         $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

     }

     protected function copy_skip($chars)

     {

         $pos = $this->pos;

         $len = strspn($this->doc, $chars, $pos);

         $this->pos += $len;

         $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

         if ($len===0) return '';

         return substr($this->doc, $pos, $len);

     }

     protected function copy_until($chars)

     {

         $pos = $this->pos;

         $len = strcspn($this->doc, $chars, $pos);

         $this->pos += $len;

         $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

         return substr($this->doc, $pos, $len);

     }

     protected function copy_until_char($char)

     {

         if ($this->char===null) return '';

         if (($pos = strpos($this->doc, $char, $this->pos))===false) {

             $ret = substr($this->doc, $this->pos, $this->size-$this->pos);

             $this->char = null;

             $this->pos = $this->size;

             return $ret;

         }

         if ($pos===$this->pos) return '';

         $pos_old = $this->pos;

         $this->char = $this->doc[$pos];

         $this->pos = $pos;

         return substr($this->doc, $pos_old, $pos-$pos_old);

     }

     protected function copy_until_char_escape($char)

     {

         if ($this->char===null) return '';

         $start = $this->pos;

         while (1)

         {

             if (($pos = strpos($this->doc, $char, $start))===false)

             {

                 $ret = substr($this->doc, $this->pos, $this->size-$this->pos);

                 $this->char = null;

                 $this->pos = $this->size;

                 return $ret;

             }

             if ($pos===$this->pos) return '';

             if ($this->doc[$pos-1]==='\\') {

                 $start = $pos+1;

                 continue;

             }

             $pos_old = $this->pos;

             $this->char = $this->doc[$pos];

             $this->pos = $pos;

             return substr($this->doc, $pos_old, $pos-$pos_old);

         }

     }

     // remove noise from html content

     // save the noise in the $this->noise array.

     protected function remove_noise($pattern, $remove_tag=false)

     {

         global $debugObject;

         if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }

         $count = preg_match_all($pattern, $this->doc, $matches, PREG_SET_ORDER|PREG_OFFSET_CAPTURE);

         for ($i=$count-1; $i>-1; --$i)

         {

             $key = '___noise___'.sprintf('% 5d', count($this->noise)+1000);

             if (is_object($debugObject)) { $debugObject->debugLog(2, 'key is: ' . $key); }

             $idx = ($remove_tag) ? 0 : 1;

             $this->noise[$key] = $matches[$i][$idx][0];

             $this->doc = substr_replace($this->doc, $key, $matches[$i][$idx][1], strlen($matches[$i][$idx][0]));

         }

         // reset the length of content

         $this->size = strlen($this->doc);

         if ($this->size>0)

         {

             $this->char = $this->doc[0];

         }

     }

     // restore noise to html content

     function restore_noise($text)

     {

         global $debugObject;

         if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }

         while (($pos=strpos($text, '___noise___'))!==false)

         {

             // Sometimes there is a broken piece of markup, and we don't GET the pos+11 etc... token which indicates a problem outside of us...

             if (strlen($text) > $pos+15)

             {

                 $key = '___noise___'.$text[$pos+11].$text[$pos+12].$text[$pos+13].$text[$pos+14].$text[$pos+15];

                 if (is_object($debugObject)) { $debugObject->debugLog(2, 'located key of: ' . $key); }

                 if (isset($this->noise[$key]))

                 {

                     $text = substr($text, 0, $pos).$this->noise[$key].substr($text, $pos+16);

                 }

                 else

                 {

                     // do this to prevent an infinite loop.

                     $text = substr($text, 0, $pos).'UNDEFINED NOISE FOR KEY: '.$key . substr($text, $pos+16);

                 }

             }

             else

             {

                 // There is no valid key being given back to us... We must get rid of the ___noise___ or we will have a problem.

                 $text = substr($text, 0, $pos).'NO NUMERIC NOISE KEY' . substr($text, $pos+11);

             }

         }

         return $text;

     }

     // Sometimes we NEED one of the noise elements.

     function search_noise($text)

     {

         global $debugObject;

         if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }

         foreach($this->noise as $noiseElement)

         {

             if (strpos($noiseElement, $text)!==false)

             {

                 return $noiseElement;

             }

         }

     }

     function __toString()

     {

         return $this->root->innertext();

     }

     function __get($name)

     {

         switch ($name)

         {

             case 'outertext':

                 return $this->root->innertext();

             case 'innertext':

                 return $this->root->innertext();

             case 'plaintext':

                 return $this->root->text();

             case 'charset':

                 return $this->_charset;

             case 'target_charset':

                 return $this->_target_charset;

         }

     }

     // camel naming conventions

     function childNodes($idx=-1) {return $this->root->childNodes($idx);}

     function firstChild() {return $this->root->first_child();}

     function lastChild() {return $this->root->last_child();}

     function createElement($name, $value=null) {return @str_get_html("<$name>$value</$name>")->first_child();}

     function createTextNode($value) {return @end(str_get_html($value)->nodes);}

     function getElementById($id) {return $this->find("#$id", 0);}

     function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);}

     function getElementByTagName($name) {return $this->find($name, 0);}

     function getElementsByTagName($name, $idx=-1) {return $this->find($name, $idx);}

     function loadFile() {$args = func_get_args();$this->load_file($args);}

 }

 ?>
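
The `remove_noise()` / `restore_noise()` pair near the end of the library swaps fragments such as script blocks for fixed-width placeholder keys before parsing and puts them back afterwards. The idea can be sketched on its own as follows. This is a simplified illustration, not the library's code: `stashNoise()` and `restoreNoise()` are hypothetical helper names, and only the key format is copied from the listing above.

```php
<?php
// Standalone sketch of the placeholder technique (hypothetical helpers, not library code).

/** Replace every match of $pattern with a placeholder key and remember the original text. */
function stashNoise(string $doc, string $pattern, array &$noise): string
{
    return preg_replace_callback($pattern, function (array $m) use (&$noise): string {
        // Same key shape as the library: "___noise___" plus a 5-character counter.
        $key = '___noise___' . sprintf('% 5d', count($noise) + 1000);
        $noise[$key] = $m[0];
        return $key;
    }, $doc);
}

/** Put the stashed fragments back in place of their placeholder keys. */
function restoreNoise(string $text, array $noise): string
{
    return $noise ? strtr($text, $noise) : $text;
}

$noise = array();
$html = '<p>a</p><script>var x = "<b>";</script><p>b</p>';
$clean = stashNoise($html, '/<script[^>]*>.*?<\/script>/is', $noise);
$restored = restoreNoise($clean, $noise);
```

With the script block replaced by a key, the `<b>` inside the JavaScript string can no longer be mistaken for a tag while the document is parsed.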

3: Create the scraping script test.php with the following code:

 <?php

     //error_reporting(0);

     set_time_limit(0);

     include_once 'simple_html_dom.php';

     echo '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';

     //$data = '#在抖音，记录美好生活#【库存】抛个硬币 如果摔碎了 今天我就不吃饭了 #校园生活  #大学 https://v.douyin.com/p9taJa/ 复制此链接，打开【抖音短视频】，直接观看视频！';

     $data = '#科比 愿你去的地方也有篮球陪伴，也能披着24号紫金战衣！ #动态壁纸 https://v.douyin.com/36xkCS/ 复制此链接，打开【抖音短视频】，直接观看视频！';

     $data = getData($data);

     echo json_encode($data);

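     function getData($data){

         $url = getUrl($data);

         $cookie_jar = dirname(__FILE__).'/tmp.txt';//tempnam('./tmp','cookie');

         // Fetch the page only when a share link was found; $html stays false otherwise.

         $html = ($url === false) ? false : get_content($url, $cookie_jar);

         // str_get_html() can return false (for example on empty input).

         $page = empty($html) ? false : str_get_html($html);

         $data = array(

                 'base'=>array(

                     'headimg'=>false, // avatar

                     'name'=>false, // nickname

                     'title'=>false, // title (for lack of a better name)

                     'description'=>false // description

                 ),

                 'video'=>array(

                     'cover'=>false, // cover image

                     'src'=>false, // video URL

                     'width'=>false, // width

                     'height'=>false // height

                 )

             );

         // Nothing to parse: return the defaults instead of calling find() on false.

         if($page === false){

             return $data;

         }

         $user = $page->find('div[class=user-info]');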

         // avatar and nickname

         if(count($user) > 0){

             $img = $user[0]->find('div[class=avatar]');

             if(count($img) > 0){

                 $img = $img[0]->find('img');

                 if(count($img) > 0){

                     // avatar

                     $data['base']['headimg'] = $img[0]->src;

                     // nickname

                     $data['base']['name'] = $img[0]->alt;

                 }

             }

         }

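         // title and description

         $title = $page->find('div[class=challenge-info]');

         if(count($title) > 0){

             $description = $title[0]->next_sibling();

             // Walk down step by step so a missing child does not cause a fatal error.

             $first = $title[0]->first_child();

             $title = empty($first) ? null : $first->first_child();

             if(!empty($title)){

                 $data['base']['title'] = $title->innertext;

             }

             if(!empty($description)){

                 $data['base']['description'] = $description->innertext;

             }

         }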

         $video = $page->find('div[id=pageletReflowVideo]');

         if(count($video) > 0){

             $script = $video[0]->next_sibling();

             if(!empty($script)){

                 $script = $script->next_sibling();

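                 if(!empty($script)){

                     $script = $script->next_sibling();

                     // The third sibling is expected to be the inline script that holds the video data.

                     if(!empty($script)){

                         $data['video'] = getVideo($script->innertext);

                     }

                 }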

             }

         }

         return $data;

     }

     function getVideo($scripts){

         $video = array();

         $scripts = preg_replace('/\s+/','',$scripts);

         // width

         preg_match('/videoWidth:([0-9.]*),/' , $scripts, $matches);

         if(empty($matches) || count($matches) < 2){

             $video['width'] = false;

         }else{

             $video['width'] = $matches[1];

         }

         // height

         preg_match('/videoHeight:([0-9.]*),/' , $scripts, $matches);

         if(empty($matches) || count($matches) < 2){

             $video['height'] = false;

         }else{

             $video['height'] = $matches[1];

         }

         // video URL ([^"]* stops at the closing quote instead of running on to the last ", in the script)

         preg_match('/playAddr:"([^"]*)",/' , $scripts, $matches);

         if(empty($matches) || count($matches) < 2){

             $video['src'] = false;

         }else{

             $video['src'] = $matches[1];

         }

         // cover image ([^"]* keeps the match inside a single quoted value)

         preg_match('/cover:"([^"]*)"}/' , $scripts, $matches);

         if(empty($matches) || count($matches) < 2){

             $video['cover'] = false;

         }else{

             $video['cover'] = $matches[1];

         }

         return $video;

     }

     function get_content($url, $cookie,$referfer='') {

     //var_dump($post);exit;

     $useragent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

     /*if ($curl_loops++ >= $curl_max_loops) {

         $curl_loops = 0;

         return false;

     }*/

     if($referfer == ''){

         $referfer = 'https://www.kujiale.com/';

     }

     $header = array("Referer: ".$referfer);

       $curl = curl_init();// initialise the cURL handle

       curl_setopt($curl, CURLOPT_URL, $url);// URL to request

       curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // skip certificate verification (insecure; acceptable only for a quick test)

       curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false); // skip host name verification

       curl_setopt($curl, CURLOPT_HEADER, 1);// include the response headers in the returned string

       curl_setopt($curl, CURLOPT_HTTPHEADER,$header);

       //curl_setopt ($curl,CURLOPT_REFERER,'http://www.kujiale.com/');

       curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);// return the response as a string instead of printing it

       curl_setopt($curl, CURLOPT_COOKIEFILE, $cookie); // read cookies from this file

       curl_setopt($curl, CURLOPT_COOKIEJAR, $cookie); // write cookies to this file when the handle is closed

       //curl_setopt($curl, CURLOPT_POST, 1);// submit via POST

       //curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query($post));// data to submit

       //curl_setopt($curl,CURLOPT_POSTFIELDS,$post);

       curl_setopt($curl, CURLOPT_USERAGENT, $useragent);

       //curl_setopt($curl, CURLOPT_REFERER, 'http://www.kujiale.com/');

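       $data = curl_exec($curl);// run the request

       if ($data === false) {

             // The request itself failed (network error, bad URL, ...), so there is nothing to parse.

             curl_close($curl);

             return false;

       }

       $ret = $data;

       list($header, $data) = explode("\r\n\r\n", $data, 2);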

       $http_code = curl_getinfo($curl, CURLINFO_HTTP_CODE);

       $last_url = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL);

       //var_dump($last_url);

       //$httpCode = curl_getinfo($curl,CURLINFO_HTTP_CODE);

       //var_dump($httpCode);

       //echo '<hr/>';

       curl_close($curl);// close the cURL handle and free its resources

       if ($http_code == 301 || $http_code == 302) {

             $matches = array();

             // Header names are case-insensitive (HTTP/2 responses use lowercase), hence the i flag.

             preg_match('/Location:(.*?)\n/i', $header, $matches);

             $url = @parse_url(trim(array_pop($matches)));

             if (!$url) {

                   $curl_loops = 0;

                   return $data;

             }

             $new_url = $url['scheme'] . '://' . $url['host'] . $url['path']

                   . (isset($url['query']) ? '?' . $url['query'] : '');

             $new_url = stripslashes($new_url);

             return get_content($new_url,$cookie);

      } else {

           $curl_loops = 0;

           list($header, $data) = explode("\r\n\r\n", $ret, 2);

           return $data;

      }

 }

     function getUrl($data){

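         // Dots are escaped and the code segment stops at the next slash, so trailing text is never swallowed.

         preg_match('/https:\/\/v\.douyin\.com\/[^\/\s]+\//' , $data, $matches);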

         if(empty($matches) || count($matches) != 1){

             return false;

         }else{

             return $matches[0];

         }

     }

 ?>
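
To make the regex step in `getVideo()` easier to follow, here is a minimal standalone sketch that runs the same kind of pattern against a sample script string. The sample script and the `extractField()` helper are assumptions made for illustration only. The real page markup may look different and has very likely changed since this was written.

```php
<?php
// Hypothetical inline script, shaped like the input getVideo() expects; real pages may differ.
$script = 'var data = { videoWidth: 720, videoHeight: 1280, '
        . 'playAddr: "https://example.com/play/?video_id=abc", '
        . 'cover: "https://example.com/cover.jpg" };';

/**
 * Pull one `key: value` field out of the script text.
 * Returns the captured value, or false when the field is absent.
 * Hypothetical helper, not part of test.php.
 */
function extractField(string $script, string $key, bool $quoted)
{
    // Strip all whitespace first, as getVideo() does, so "key: value" becomes "key:value".
    $compact = preg_replace('/\s+/', '', $script);
    // Quoted values stop at the next double quote; numeric values are digits and dots.
    $value = $quoted ? '"([^"]*)"' : '([0-9.]+)';
    if (preg_match('/' . preg_quote($key, '/') . ':' . $value . '/', $compact, $m) === 1) {
        return $m[1];
    }
    return false;
}

$width = extractField($script, 'videoWidth', false);
$src   = extractField($script, 'playAddr', true);
$cover = extractField($script, 'cover', true);
$none  = extractField($script, 'missing', true);
```

Matching `[^"]*` rather than `.*` keeps each capture inside a single quoted value, which matters once the script contains more than one quoted string.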

Replace $data in test.php with the share text you copied. The script returns JSON, which can be rendered directly on the front end or stored in a database.
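
A consumer of that JSON has to cope with fields that are still `false` because they were not found on the page. The sketch below shows one way to do that. The sample JSON uses placeholder values, and `usableVideoFields()` is a hypothetical helper that is not part of test.php.

```php
<?php
// Hypothetical sample of the JSON printed by test.php; all values are placeholders.
$json = '{"base":{"headimg":false,"name":"demo_user","title":false,"description":"demo"},'
      . '"video":{"cover":false,"src":"https://example.com/v.mp4","width":"720","height":"1280"}}';

/**
 * Decode the result and return only the video fields that were actually found.
 * test.php leaves a field as false when scraping failed, so those entries are dropped.
 */
function usableVideoFields(string $json): array
{
    $result = json_decode($json, true);
    if (!is_array($result) || !isset($result['video']) || !is_array($result['video'])) {
        return array();
    }
    return array_filter($result['video'], function ($value) {
        return $value !== false;
    });
}

$video = usableVideoFields($json);
```

If the output is meant to be read by people, passing `JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES` as the second argument of `json_encode()` keeps Chinese text and URLs readable.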

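That completes scraping video information from Douyin. The write-up is rough, so please go easy on me. If you have any suggestions, leave a comment and I will reply when I see it. Thanks.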
