Part 1:

前言:

最近看了一些关于短址(short URL)方面的一些博客,有些博客说到一些好的东西,但是,也不是很全,所以,这篇博客算是对其它博客的一个总结吧。

介绍:

短址,顾名思义,就是把长的 URL 转成短的 URL, 现在提供这种服务的有很多公司,我们以google家的 URL shortener 服务: http://goo.gl/ 为例。

首先我们到 http://goo.gl/,然后把本文博客的地址http://blog.csdn.net/beiyeqingteng 输入进去,最后它会返回一个更短的URL,http://goo.gl/Jfs6q 。如下图所示:


URL 解析:

当我们在浏览器里输入 http://goo.gl/Jfs6q 时,DNS首先解析获得http://goo.gl/的IP地址。当DNS获得IP地址以后(比如:74.125.225.72),会向这个地址发送HTTP GET请求,查询 Jfs6q, 这个时候,http://goo.gl/服务器会把请求通过HTTP 301转到对应的长URL http://blog.csdn.net/beiyeqingteng 。后面的解析过程就和平常网址解析是一样的了。

短址本质:

短址本质上是实现了一个映射函数 f: X -> Y 。而这个映射函数必须同时具有两个特点:

1. 如果 x1 != x2, 则 f (x1) != f(x2);

2. 对于每一个 y, 能够找到唯一的一个 x 使得 f(x) = y;

对于任何的线性函数,比如 f(x) = 2x,都满足这样的条件。

好了,如果了解了短址的本质,我们再来看它是如何实现的。

注明:在google URL shortener 服务中,它允许一个长 url 对应多个短的url。这可能是出于安全上的考虑。在本文中,我们不考虑这种情况。

实现:

短址的长度一般设为 6 位,而每一位是由 [a - z, A - Z, 0 - 9] 总共 62 个字母组成的,所以6位的话,总共会有 62^6 ~= 568亿种组合,基本上够用了。在google URL shortener 服务中,短址长度为 5,大概有9亿多种组合.

假设我们用数据库来保存长地址和短地址的映射,那么,在表 LongtoShortURL 中,我们会有三列:

1. ID,int,  自动增长;

2. LURL,varchar,  // 长URL;

3. SURL, varchar,  // 短URL。

现在我们考虑通过如何长URL得到唯一的短URL。

在讲具体算法以前,先提一个问题:10进制数和16进制数是否满足刚刚提到的映射函数 f: X -> Y中的两个条件?

答案: 是。

本文的思路也是利用进制之间的转换。因为我们总共有 62 个字母,我们可以自创一种进制,叫做 62 进制。其规则如下:

   → a
→ b
...
→ z
...

所以,对于每一个长地址,我们可以根据它的ID,得到一个6位的 62 进制数,这个6位的 62 进制数就是我们的短址。具体实现如下:

 public List<Integer> base62(int id) {
List<Integer> value = new LinkedList<Integer>();
while (id > ) {
int remainder = id % ;
value.add(, remainder);
id = id / ;
}
return value;
}

举例:

对于 ID = 138,通过 base62(138), 我们得到 value = [2, 14]。根据上面的对应规则表,我们可以得到其对应的短址为:aaaabn 。(由 value 得到具体的短址,可以通过switch 语句得到,因为代码太长,在此略过。)

当我们想通过短址找到所对应的长地址,方法也很简单,就是把62进制数转成10进制数即可,这样我们就可以得到长地址的ID了。

比如,对于短址aaae9a,其62进制为[0, 0, 0, 4,61,0] ,则其长地址的ID 为[0, 0, 0, 4,61,0] = 0×62^5+ 0×62^4 + 0×62^3 + 4×62^2 + 61×62^1 + 0×62^0 = 1915810。有了ID,我们自然就可以得到长地址了。

Part 2:

Question

How to create a TinyURL system?

If you are not familiar with TinyRUL, I’ll briefly explain here. Basically, TinyURL is a URL shortening service, a web service that provides short aliases for redirection of long URLs. There are many other similar services like Google URL ShortenerBitly etc..

For example, URL http://blog.gainlo.co/index.php/2015/10/22/8-things-you-need-to-know-before-system-design-interviews/ is long and hard to remember, TinyURL can create a alias for it – http://tinyurl.com/j7ve58y. If you click the alias, it’ll redirect you to the original URL.

So if you would design this system that allows people input URLs with short alias URLs generated, how would you do it?

High-level idea

Let’s get started with basic and high-level solutions and we can keep optimizing it later on.

At first glance, each long URL and the corresponding alias form a key-value pair. I would expect you to think about something related to hash immediately.

Therefore, the question can be simplified like this – given a URL, how can we find hash function F that maps URL to a short alias:
F(URL) = alias
and satisfies following condition:s

  1. Each URL can only be mapped to a unique alias
  2. Each alias can be mapped back to a unique URL easily

The second condition is the core as in the run time, the system should look up by alias and redirect to the corresponding URL quickly.

Basic solution

To make things easier, we can assume the alias is something like http://tinyurl.com/<alias_hash> and alias_hash is a fixed length string.

If the length is 7 containing [A-Z, a-z, 0-9], we can serve 62 ^ 7 ~= 3500 billion URLs. It’s said that there are ~644 million URLs at the time of this writing.

To begin with, let’s store all the mappings in a single database. A straightforward approach is using alias_hash as the ID of each mapping, which can be generated as a random string of length 7.

Therefore, we can first just store <ID, URL>. When a user inputs a long URL “http://www.gainlo.co”, the system creates a random 7-character string like “abcd123” as ID and inserts entry <“abcd123”, “http://www.gainlo.co”> into the database.

In the run time, when someone visits http://tinyurl.com/abcd123, we look up by ID “abcd123” and redirect to the corresponding URL “http://www.gainlo.co”.

Performance VS flexibility

There are quite a few follow-up questions for this problem. One thing I’d like to further discuss here is that by using GUID (Globally Unique Identifier) as the entry ID, what would be pros/cons versus incremental ID in this problem?

If you dig into the insert/query process, you will notice that using random string as IDs may sacrifice performance a little bit. More specifically, when you already have millions of records, insertion can be costly. Since IDs are not sequential, so every time a new record is inserted, the database needs to go look at the correct page for this ID. However, when using incremental IDs, insertion can be much easier – just go to the last page.

So one way to optimize this is to use incremental IDs. Every time a new URL is inserted, we increment the ID by 1 for the new entry. We also need a hash function that maps each integer ID to a 7-character string. If we think each string as a 62-base numeric, the mapping should be easy (Of course, there are other ways).

On the flip side, using incremental IDs will make the mapping less flexible. For instance, if the system allows users to set custom short URL, apparently GUID solution is easier because for whatever custom short URL, we can just calculate the corresponding hash as the entry ID.

Note: In this case, we may not use random generated key but a better hash function that maps any short URL into an ID, e.g. some traditional hash functions like CRC32SHA-1 etc..

Cost

I can hardly not ask about how to evaluate the cost of the system. For insert/query, we’ve already discussed above. So I’ll focus more on storage cost.

Each entry is stored as <ID, URL> where ID is a 7-character string. Assuming max URL length is 2083 characters, then each entry takes 7 * 4 bytes + 2083 * 4 bytes = 8.4 KB. If we store a million URL mappings, we need around 8.4G storage.

If we consider the size of database indexand we may also store other information like user ID, date each entry is inserted etc., it definitely requires much more storage.

Multiple machines

Apparently, when the system has developed to certain scale, a single machine is not capable to store all the mappings. How do we scale with multiple instances?

The more general problem is how to store hash mapping across multiple machines. If you know distributed key-value store, you should know that this can be a very complicated problem. I will discuss only high-level ideas here and if you are interested in all those details, I’d recommend you read papers like Dynamo: Amazon’s Highly Available Key-value Store.

In a nutshell, if you want to store a huge amount of key-value pairs across multiple instances, you need to design a lookup algorithm that allows you to find the corresponding machine for a given lookup key.

For example, if the incoming short alias is http://tinyurl.com/abcd123, based on key “abcd123” the system should know which machine stores the database that contains entry for this key. This is exactly the same idea of database sharding.

A common approach is to have machines that act as a proxy, which is responsible for dispatching requests to corresponding backend stores based on the lookup key. Backend stores are actually databases that store the mapping. They can be split by various ways like use hash(key) % 1024 to divide mappings to 1024 stores.

There are tons of details that can make the system complicated, I’ll just name a few here:

    • Replication. Data stores can crash for various random reasons, therefore a common solution is having multiple replicas for each database. There can be many problems here: how to replicate instances? How to recover fast? How to keep read/write consistent?
    • Resharding. When the system scales to another level, the original sharding algorithm might not work well. We may need to use a new hash algorithm to reshard the system. How to reshard the database while keeping the system running can be an extremely difficult problem.
    • Concurrency. There can be multiple users inserting the same URL or editing the same alias at the same time. With a single machine, you can control this with a lock. However, things become much more complicated when you scale to multiple instances.

Reference:

http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener (本文算法来源)

http://blog.sina.com.cn/s/blog_65db99840100lg4n.html

http://blog.csdn.net/beiyeqingteng

http://blog.gainlo.co/index.php/2016/03/08/system-design-interview-question-create-tinyurl-system/

Design Tiny URL的更多相关文章

  1. [LeetCode] Design TinyURL 设计精简URL地址

    Note: For the coding companion problem, please see: Encode and Decode TinyURL. How would you design ...

  2. [面试] Design Questions

    Uber总是考一些系统设计的题目,而且重复率很高,汇总了一下地里的所有design的题目,希望可以跟小伙伴们讨论下. Uber Design Questions 1.    让design uber ...

  3. [LeetCode] Encode and Decode TinyURL 编码和解码精简URL地址

    Note: This is a companion problem to the System Design problem: Design TinyURL. TinyURL is a URL sho ...

  4. LeetCode Design TinyURL

    原题链接在这里:https://leetcode.com/problems/design-tinyurl/description/ 题目: How would you design a URL sho ...

  5. 535. Encode and Decode TinyURL 长短URL

    [抄题]: TinyURL is a URL shortening service where you enter a URL such as https://leetcode.com/problem ...

  6. leetcode 新题型----SQL,shell,system design

    leetcode 主要是一个针对北美的coder人群找工作的代码练习网站,我在2015年初次接触这个网站的时候,总共只有200多道题目,是一个类似acm 的a题网站.这些年变化越来越大,主要是因为找工 ...

  7. [LeetCode] 534. Design TinyURL 设计短网址

    Note: For the coding companion problem, please see: Encode and Decode TinyURL. How would you design ...

  8. Using URL Schemes to Communicate with Apps

    要素:1)通信双方:2)协议:3)支持机制(系统,进行协议注册):4)数据 https://developer.apple.com/library/content/documentation/iPho ...

  9. leetcode bugfree note

    463. Island Perimeterhttps://leetcode.com/problems/island-perimeter/就是逐一遍历所有的cell,用分离的cell总的的边数减去重叠的 ...

随机推荐

  1. C 语言学习 第12次作业总结

    作业总结 本次课堂的内容为字符串相关的几个函数还有结构体. 字符串相关函数 在此之前的课程中,输入主要都是使用scanf这个函数.而在这节课上,冯老师讲解了字符串获取函数gets.在不需要控制符的情况 ...

  2. Python学习--Python简介

    Python 简介 Python是一种解释型.编译性.面向对象.动态数据类型的高级程序设计语言.Python由Guido van Rossum于1989年底发明,第一个公开发行版发行于1991年. P ...

  3. SQL Server 阻止了对组件 'Ad Hoc Distributed Queries' 的 STATEMENT'OpenRowset/OpenDatasource' 的访问

    delphi ado 跨数据库访问 语句如下 ' and db = '帐套1' 报错内容是:SQL Server 阻止了对组件 'Ad Hoc Distributed Queries' 的 STATE ...

  4. 高通vuforia+Unity3D 制作ar app

    很简单就可以用Unity3D做出增强现实的一个小例子 新人第一次写博客,若出现错误望指正^_^ 需要下载de东西: unity3d 5.0 http://unity3d.com/get-unity   ...

  5. 基于redis 实现分布式锁的方案

    在电商项目中,经常有秒杀这样的活动促销,在并发访问下,很容易出现上述问题.如果在库存操作上,加锁就可以避免库存卖超的问题.分布式锁使分布式系统之间同步访问共享资源的一种方式 基于redis实现分布式锁 ...

  6. 实例1-gettree

    -- zx create PROCEDURE getbomtree @MaterialID INT --参数,父节点的id AS BEGIN -- 如果主BOM禁用,不显示树结构 ) BEGIN RE ...

  7. winform总结2> Action<> ,Action,func<>,委托相关的理解

    1,他们是什么: Action 封装一个方法,该方法不具有参数并且不返回值. Action<> 封装一个方法,该方法具有最多16个参数并且不返回值. func<> 封装一个具有 ...

  8. Android Automotive开发之一《环境: JDK7&JDK8切换 》

    http://ubuntuhandbook.org/index.php/2015/01/install-openjdk-8-ubuntu-14-04-12-04-lts/ 安装OpenJDK8 sud ...

  9. Jquery实现特效滑动菜单栏

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/ ...

  10. java spring mvc restful 上传文件

    spring mvc 配置文件 <bean class="com.baiyyy.yfz.core.RestfulHandlerMethodMapping" />     ...