How to Make Your Ajax Application's Content Crawlable by Search Engines
Overview of Solution
Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler. The search results will show the original URL.
Step-by-step guide
1. Indicate to the crawler that your site supports the AJAX crawling scheme
The first step to getting your AJAX site indexed is to indicate to the crawler that your site supports the AJAX crawling scheme. The way to do this is to use a special token in your hash fragments (that is, everything after the # sign in a URL): hash fragments have to begin with an exclamation mark. For example, if your AJAX app contains a URL like this:
www.example.com/ajax.html#key=value
it should now become this:
www.example.com/ajax.html#!key=value
When your site adopts the scheme, it will be considered "AJAX crawlable." This means that the crawler will see the content of your app if your site supplies HTML snapshots.
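If your app already uses plain hash fragments, the change described above amounts to a small string rewrite. The following is a minimal sketch of that rewrite; the class and method names (HashBangLinks, toCrawlable) are made up for illustration and are not part of the scheme.

```java
/** Illustrative helper, not part of the scheme: rewrites "#state" links to "#!state". */
final class HashBangLinks {

    /**
     * Returns the URL with its hash fragment prefixed by '!'.
     * URLs without a fragment, with an empty fragment, or already
     * using "#!" are returned unchanged.
     */
    static String toCrawlable(String url) {
        int hash = url.indexOf('#');
        if (hash < 0 || hash == url.length() - 1) {
            return url; // no fragment, or nothing after '#'
        }
        if (url.charAt(hash + 1) == '!') {
            return url; // already follows the scheme
        }
        return url.substring(0, hash) + "#!" + url.substring(hash + 1);
    }

    public static void main(String[] args) {
        System.out.println(toCrawlable("www.example.com/ajax.html#key=value"));
    }
}
```

Apply a rewrite like this only to fragments that carry application state; ordinary in-page anchors do not identify separate content and are probably best left alone.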
2. Set up your server to handle requests for URLs that contain _escaped_fragment_
Suppose you would like to get www.example.com/index.html#!key=value indexed. Your part of the agreement is to provide the crawler with an HTML snapshot of this URL, so that the crawler sees the content. How will your server know when to return an HTML snapshot instead of a regular page? The answer is the URL that is requested by the crawler: the crawler will modify each AJAX URL such as
www.example.com/ajax.html#!key=value
to temporarily become
www.example.com/ajax.html?_escaped_fragment_=key=value
You may wonder why this is necessary. There are two very important reasons:
Hash fragments are never (by specification) sent to the server as part of an HTTP request. In other words, the crawler needs some way to let your server know that it wants the content for the URL www.example.com/ajax.html#!key=value (as opposed to simply www.example.com/ajax.html).
Your server, on the other hand, needs to know that it has to return an HTML snapshot, rather than the normal page sent to the browser. Remember: an HTML snapshot is all the content that appears on the page after the JavaScript has been executed. Your server's end of the agreement is to return the HTML snapshot for www.example.com/index.html#!key=value (that is, the original URL!) to the crawler.
Note: The crawler escapes certain characters in the fragment during the transformation. To retrieve the original fragment, make sure to unescape all %XX characters in the fragment. More specifically, %26 should become &, %20 should become a space, %23 should become #, and %25 should become %, and so on.
Now that you have your original URL back and you know what content the crawler is requesting, you need to produce an HTML snapshot. How do you do that? There are various ways; here are some of them:
If a lot of your content is produced with JavaScript, you may want to use a headless browser such as HtmlUnit to obtain the HTML snapshot. Alternatively, you can use a different tool such as crawljax or watij.com.
If much of your content is produced with a server-side technology such as PHP or ASP.NET, you can use your existing code and only replace the JavaScript portions of your web page with static or server-side created HTML.
You can create a static version of your pages offline, as is the current practice. For example, many applications draw content from a database that is then rendered by the browser. Instead, you may create a separate HTML page for each AJAX URL.
It's highly recommended that you try out your HTML snapshot mechanism. It's important to make sure that the headless browser indeed renders the content of your application's state correctly. Surely you'll want to know what the crawler will see, right? To do this, you can write a small test application and see the output, or you can use a tool such as Fetch as Googlebot.
To summarize, make sure the following happens on your server:
A request URL of the form www.example.com/ajax.html?_escaped_fragment_=key=value is mapped back to its original form: www.example.com/ajax.html#!key=value.
The token is URL unescaped. The easiest way to do this is to use standard URL decoding. For example, in Java you would do this:
mydecodedfragment = URLDecoder.decode(myencodedfragment, "UTF-8");
An HTML snapshot is returned, ideally along with a prominent link at the top of the page, letting end users know that they have reached the _escaped_fragment_ URL in error. (Remember that _escaped_fragment_ URLs are meant to be used only by crawlers.) For all requests that do not have an _escaped_fragment_, the server will return content as before.
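The three points above can be sketched as a single dispatch function. This is only an outline under stated assumptions: SnapshotDispatcher, normalPage, and snapshotFor are hypothetical names, the snapshot itself is a placeholder, and the code assumes that _escaped_fragment_ is the last query parameter and that no other parameters need to be carried over to the original URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical request dispatcher; all names here are illustrative. */
final class SnapshotDispatcher {
    private static final String PARAM = "_escaped_fragment_=";

    /** Returns the decoded fragment if the raw query asks for a snapshot, else null. */
    static String requestedFragment(String rawQuery) {
        if (rawQuery == null) return null;
        int at = rawQuery.indexOf(PARAM);
        // The parameter must start the query string or follow an '&'.
        if (at < 0 || (at > 0 && rawQuery.charAt(at - 1) != '&')) return null;
        try {
            // Assumption: everything after the parameter name is the escaped token.
            // Note that URLDecoder also turns '+' into a space.
            return URLDecoder.decode(rawQuery.substring(at + PARAM.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Chooses between the normal page and the HTML snapshot for one request. */
    static String respond(String path, String rawQuery) {
        String fragment = requestedFragment(rawQuery);
        if (fragment == null) {
            return normalPage(); // no _escaped_fragment_: return content as before
        }
        // Map the request back to its original #! form (empty token: no fragment).
        String prettyUrl = fragment.isEmpty() ? path : path + "#!" + fragment;
        String notice = "<p><a href=\"" + escapeHtml(prettyUrl)
                + "\">Go to the interactive version of this page</a></p>\n";
        return notice + snapshotFor(fragment);
    }

    // Placeholders: a real server would render actual content here.
    static String normalPage() { return "<p>normal AJAX page</p>"; }
    static String snapshotFor(String fragment) {
        return "<p>HTML snapshot for state: " + escapeHtml(fragment) + "</p>";
    }

    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(respond("/ajax.html", "_escaped_fragment_=key=value"));
    }
}
```

Keeping the decision in one function that takes the path and the raw query string makes it easy to call from whatever server framework you use, and to test without a running server.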
3. Handle pages without hash fragments
Some of your pages may not have hash fragments. For example, you might want your home page to be www.example.com, rather than www.example.com#!home. For this reason, we have a special provision for pages without hash fragments.
Note: Make sure you use this option only for pages that contain dynamic, Ajax-created content. For pages that have only static content, it would not give extra information to the crawler, but it would put extra load on your servers and on Google's.
In order to make pages without hash fragments crawlable, you include a special meta tag in the head of the HTML of your page. The meta tag takes the following form:
<meta name="fragment" content="!">
This indicates to the crawler that it should crawl the ugly version of this URL, that is, the version that carries the _escaped_fragment_ parameter. As per the agreement described in step 2, the crawler will temporarily map the pretty URL to the corresponding ugly URL. In other words, if you place <meta name="fragment" content="!"> into the page www.example.com, the crawler will temporarily map this URL to www.example.com?_escaped_fragment_= and will request this from your server. Your server should then return the HTML snapshot corresponding to www.example.com. Please note that one important restriction applies to this meta tag: the only valid content is "!". In other words, the meta tag will always take the exact form <meta name="fragment" content="!">, which indicates an empty hash fragment, but a page with AJAX content.
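When testing your server, it can help to reproduce the mapping the crawler applies, for both #! URLs and pages that carry the meta tag. The sketch below is an approximation, not the crawler's actual code: the names are invented, the set of escaped characters (control characters and space, #, %, &, +, and non-ASCII bytes) is an assumption based on the examples in the note in step 2, and appending with & when a query string already exists is likewise an assumption.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: approximates the pretty-to-ugly URL mapping for test purposes. */
final class UglyUrlMapper {
    private static final String PARAM = "_escaped_fragment_=";

    /**
     * Maps a pretty URL to the URL a crawler would request.
     * hasFragmentMetaTag says whether the page contains
     * <meta name="fragment" content="!"> in its head.
     */
    static String toUglyUrl(String prettyUrl, boolean hasFragmentMetaTag) {
        int bang = prettyUrl.indexOf("#!");
        if (bang >= 0) {
            String base = prettyUrl.substring(0, bang);
            String fragment = prettyUrl.substring(bang + 2);
            return base + separator(base) + PARAM + escape(fragment);
        }
        if (hasFragmentMetaTag && prettyUrl.indexOf('#') < 0) {
            // Page without a hash fragment: the token is empty.
            return prettyUrl + separator(prettyUrl) + PARAM;
        }
        return prettyUrl; // not part of the scheme: requested as is
    }

    private static String separator(String base) {
        // Assumption: append to an existing query string with '&'.
        return base.indexOf('?') >= 0 ? "&" : "?";
    }

    /** Percent-escapes the assumed set of special bytes in the fragment. */
    static String escape(String fragment) {
        StringBuilder out = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean mustEscape = v <= 0x20 || v == 0x23 || v == 0x25
                    || v == 0x26 || v == 0x2B || v >= 0x7F;
            if (mustEscape) {
                out.append(String.format("%%%02X", v));
            } else {
                out.append((char) v);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUglyUrl("www.example.com/ajax.html#!key=value", false));
        System.out.println(toUglyUrl("www.example.com", true));
    }
}
```

You can feed the resulting URLs to your own server and check that each one returns the snapshot you expect.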
4. Consider updating your Sitemap to list the new AJAX URLs
Crawlers use Sitemaps to complement their discovery crawl. Your Sitemap should include the version of your URLs that you'd prefer to have displayed in search results, so in most cases it would be http://example.com/ajax.html#!key=value. Do not include links such as http://example.com/ajax.html?_escaped_fragment_=key=value in the Sitemap. Googlebot does not follow links that contain _escaped_fragment_! If you have an entry page to your site, such as your homepage, that you would like displayed in search results without the #!, then add this URL to the Sitemap as is. For instance, if you want this version displayed in search results:
http://example.com/
then include
http://example.com/
in your Sitemap and make sure that <meta name="fragment" content="!"> is included in the head of the HTML document. For more information, check out our additional articles on Sitemaps.
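As a rough illustration of the Sitemap advice above, the following sketch builds a minimal Sitemap from a list of URLs and leaves out any that contain _escaped_fragment_. The class name AjaxSitemap and its methods are hypothetical; the output uses only the basic urlset, url, and loc elements of the Sitemaps protocol.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative Sitemap builder; names are made up for this example. */
final class AjaxSitemap {

    /** Builds a minimal Sitemap, skipping URLs that contain _escaped_fragment_. */
    static String build(List<String> urls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            if (url.contains("_escaped_fragment_")) {
                continue; // list only the URLs you want shown in search results
            }
            xml.append("  <url><loc>").append(escapeXml(url)).append("</loc></url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    /** Entity-escapes characters that are special in XML. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.print(build(Arrays.asList(
                "http://example.com/",
                "http://example.com/ajax.html#!key=value",
                "http://example.com/ajax.html?_escaped_fragment_=key=value")));
    }
}
```

Note that a #! URL with several parameters contains & characters, which is why the sketch entity-escapes each URL before writing it into the XML.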
5. Optionally, but importantly, test the crawlability of your app: see what the crawler sees with "Fetch as Googlebot".
Google provides a tool, Fetch as Googlebot, that gives you an idea of what the crawler sees. Use it to check whether your implementation is correct and whether the bot can now see all the content you want a user to see. It is also important to use this tool to ensure that your site is not cloaking.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies.
Last updated June 18, 2014.