Recently, my two teammates and I have been working on a project: a simplified Chinese search engine for children (of primary-school age). We call it "kidsearch".

Since our project will be built on the Baidu search engine, I'd like to start with a brief analysis of it.

First, Baidu is not really designed for children. As a commercial company, Baidu offers the public a free search service, so it is natural that not everything shown in its results is what people need. Some results appear because of commercial interests and other factors. This may not matter much to adults, who can tell good content from bad, but the impact on children is obvious. For example, try searching Baidu for these keywords: "波" ("wave"; look at the pictures), "交换群" (literally "exchange group"; look at the results), and "医院" ("hospital"; look at the advertisements). These are fairly ordinary words, not to mention the results for worse keywords. Such results are not just inappropriate; some of them are even harmful. This situation has to be fixed, and fixing it is the purpose of our project, "kidsearch".

Actually, searching the Internet is simpler for children than for adults, so the problem is simplified too. We can simply use Baidu as a tool (this is no exaggeration): rearrange the results, fix the improper or useless entries, and add some content suitable for children. After these fixes, the search engine will be much better for children.
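The three steps above (fix, rearrange, add) can be sketched roughly as follows. This is only a minimal sketch under assumptions: the `Result` fields, the `rework_results` function, and the word and site lists are all hypothetical names made up for illustration, not Baidu's real data format or our final design.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One search hit; the fields are hypothetical, not Baidu's real schema."""
    title: str
    url: str
    snippet: str = ""


def rework_results(raw, blocked_words, preferred_sites, curated=()):
    """Drop improper entries, move child-friendly sites up, prepend curated ones."""
    # 1. Fix: remove entries whose title or snippet contains a blocked word.
    kept = [
        r for r in raw
        if not any(w in r.title or w in r.snippet for w in blocked_words)
    ]
    # 2. Rearrange: the sort is stable, so preferred sites rise to the top
    #    while everything else keeps the order the engine returned.
    kept.sort(key=lambda r: not any(site in r.url for site in preferred_sites))
    # 3. Add: hand-picked content for children goes first.
    return list(curated) + kept
```

A plain substring match like this would probably be too crude in practice, but it shows where each of the three steps would plug in.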

So, what content is appropriate for children?

Based on the thoughts above, I summarized what children may need from a search engine. (The list probably doesn't cover everything yet; we will refine it in the future.)

1. Concepts -- encyclopedia

2. Material -- pictures, music, video

3. Entertainment -- games

4. Study -- homework, knowledge
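One way these four needs might be used is to guess which need a query belongs to, so that the matching kind of content can be shown first. The sketch below assumes a tiny hand-written table of hint words; `NEED_HINTS`, `guess_need`, and the labels are hypothetical, and a query with no hint word simply falls back to an encyclopedia lookup.

```python
# Hypothetical hint words for each need; a real table would be much longer.
NEED_HINTS = {
    "material": ["图片", "音乐", "视频"],   # picture, music, video
    "entertainment": ["游戏"],              # game
    "study": ["作业", "练习"],              # homework, exercise
}


def guess_need(query):
    """Return the first need whose hint word appears in the query."""
    for need, hints in NEED_HINTS.items():
        if any(hint in query for hint in hints):
            return need
    # No hint matched: treat the query as something to look up.
    return "encyclopedia"
```

For instance, `guess_need("恐龙图片")` ("dinosaur pictures") would fall under "material" because the query contains "图片".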

Moreover, there are some kinds of content that children don't need:

1. Advertisements

2. Adult (mature) content

3. Sexual or homosexual content

4. Sidebar (mostly ads, adult content, or material useless to children)
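As a rough illustration, each entry on a results page could be checked against this list before it is shown. The sketch assumes every entry has already been parsed into a dict; the keys `"title"`, `"is_ad"`, and `"region"` and the function `rejection_reason` are hypothetical, and a simple word list stands in for the content-based categories.

```python
def rejection_reason(entry, adult_words):
    """Return why an entry should be hidden, or None if it may be shown.

    `entry` is a dict with hypothetical keys: "title", "is_ad", "region".
    """
    if entry.get("is_ad"):
        return "advertisement"
    if entry.get("region") == "sidebar":
        return "sidebar"
    if any(word in entry.get("title", "") for word in adult_words):
        return "adult content"
    return None
```

Returning a reason instead of just True or False would make it easier to see later which rule hides the most results.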

Now that we know what children need, the next step is to tackle these items one by one.

What technology will we use?

After trying several options, such as PHP, Java, and Python, I decided to use Python because it makes the crawling job really convenient. Although building web pages with it is a bit harder than with PHP, that doesn't matter too much.

Besides, there is a huge number of libraries available for Python, such as requests, flask, django, and jieba. I have tried all of them in a preliminary way.

More details will follow later. Our aim is to create a search engine that children can use and like to use.
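To give a feel for how little code the crawling side needs, here is a minimal sketch that builds a Baidu results URL and downloads the page with requests. It assumes the keyword is passed in the `wd` query parameter of `https://www.baidu.com/s`; the function names are hypothetical, and parsing the returned HTML is left out.

```python
from urllib.parse import urlencode

BAIDU_SEARCH = "https://www.baidu.com/s"


def build_search_url(query):
    """Build the results URL for a query (assumes the keyword goes in `wd`)."""
    return BAIDU_SEARCH + "?" + urlencode({"wd": query})


def fetch_results_page(query, timeout=5):
    """Download the raw results page; parsing it is not part of this sketch."""
    # requests is third-party; it is imported here so that the URL helper
    # above still works when requests is not installed.
    import requests

    response = requests.get(build_search_url(query), timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```

Calling `fetch_results_page("恐龙")` would return the HTML of the results page as a string, which the later steps would then clean up.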
