[Repost] NoSQL by Martin Fowler

==============================================================

URL1 nosql

==============================================================

The rise of NoSQL databases marks the end of the era of relational database dominance

But NoSQL databases will not become the new dominators. Relational databases will still be popular and used in the majority of situations. They will, however, no longer be the automatic choice.

The era of Polyglot Persistence has begun

Big Data is the driver for NoSQL’s rise, but not the only reason to use NoSQL

Many NoSQL databases are designed to run well on large clusters, which makes them more attractive for large data volumes.

But often people select NoSQL due to easier database interaction in their applications.

To take advantage of this change, you need to be familiar with concepts underlying NoSQL

In the future, organizations will use many data technologies. Data professionals will need to be familiar with these different approaches and know how to match them to different problems. This introduces new ways to think about data modeling, data consistency, and evolution.

Learning the concepts is an important first step, but to really understand you’ll need to get the experience of building representative systems using these technologies.

==============================================================

URL2 Polyglot Persistence

==============================================================

Martin Fowler 16 November 2011

In 2006, my colleague Neal Ford coined the term Polyglot Programming, to express the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems. Complex applications combine different types of problems, so picking the right language for the job may be more productive than trying to fit all aspects into a single language.

Over the last few years there's been an explosion of interest in new languages, particularly functional languages, and I'm often tempted to spend some time delving into Clojure, Scala, Erlang, or the like. But my time is limited and I'm giving a higher priority to another, more significant shift, that of the DatabaseThaw. The first drips have been coming through from clients and other contacts and the prospects are enticing. I'm confident in saying that if you are starting a new strategic enterprise application you should no longer be assuming that your persistence should be relational. The relational option might be the right one - but you should seriously look at other alternatives.

One of the interesting consequences of this is that we are gearing up for a shift to polyglot persistence [1] - where any decent-sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.

This polyglot effect will be apparent even within a single application [2]. A complex enterprise application uses different kinds of data, and already usually integrates information from different sources. Increasingly we'll see such applications manage their own data using different technologies depending on how the data is used. This trend will be complementary to the trend of breaking up application code into separate components integrating through web services. A component boundary is a good way to wrap a particular storage technology chosen for the way its data is manipulated.

This will come at a cost in complexity. Each data storage mechanism introduces a new interface to be learned. Furthermore, data storage is usually a performance bottleneck, so you have to understand a lot about how the technology works to get decent speed. Using the right persistence technology will make this easier, but the challenge won't go away.

Many of these NoSQL options involve running on large clusters. This introduces not just a different data model, but a whole range of new questions about consistency and availability. The transactional single point of truth will no longer hold sway (although its role as such has often been illusory).

So polyglot persistence will come at a cost - but it will come because the benefits are worth it. When relational databases are used inappropriately, they exert a significant drag on application development. I was recently talking to a team whose application was essentially composing and serving web pages. They only looked up page elements by ID, they had no need for transactions, and no need to share their database. A problem like this is much better suited to a key-value store than the corporate relational hammer they had to use. A good public example of using the right NoSQL choice for the job is The Guardian - who have felt a definite productivity gain from using MongoDB over their previous relational option.

Another benefit comes in running over a cluster. Scaling to lots of traffic gets harder and harder to do with vertical scaling - a fact we've known for a long time. Many NoSQL databases are designed to operate over clusters and can tackle larger volumes of traffic and data than is realistic with a single server. As enterprises look to use data more, this kind of scaling will become increasingly important. The Danish medication system described at gotoAarhus2011 was a good example of this.

All of this leads to a big change, but it won't be a rapid one - companies are naturally conservative when it comes to their data storage.

The more immediate question is which types of projects should consider an alternative persistence model? My thinking is that firstly you should only consider projects that are at the strategic end of the UtilityVsStrategicDichotomy. That's because utility projects don't have enough benefit to be worth a new technology.

Given a strategic project, you then have two drivers that raise alternatives: either reducing development drag or dealing with intensive data needs. Even here I suspect many projects, probably a majority, are better off sticking with the relational orthodoxy. But the minority that shouldn't is a significant one.

One factor that is perhaps less important is whether the project is new, or already established. The Guardian's shift to MongoDB has been happening over the last year or so on a code base developed several years ago. Polyglot persistence is something you can introduce on an existing code base.

What all of this means is that if you're working in the enterprise application world, now is the time to start familiarizing yourself with alternative data storage options. This won't be a fast revolution, but I do believe the next decade will see the database thaw progress rapidly.

Notes

1: As far as I can tell, Scott Leberknight was the first person to start using the term "polyglot persistence".

2: Don't take the example in the diagram too seriously. I'm not making any recommendations about which database technology to use for what kind of service. But I do think that people should consider these kinds of technologies as part of application architecture.

==============================================================

URL3 Key Points from NoSQL Distilled

==============================================================

NoSQL Distilled

1 Why NoSQL?

Relational databases have been a successful technology for twenty years, providing persistence, concurrency control, and an integration mechanism.

Application developers have been frustrated with the impedance mismatch between the relational model and the in-memory data structures.

There is a movement away from using databases as integration points towards encapsulating databases within applications and integrating through services.

The vital factor for a change in data storage was the need to support large volumes of data by running on clusters. Relational databases are not designed to run efficiently on clusters.

NoSQL is an accidental neologism. There is no prescriptive definition—all you can make is an observation of common characteristics.

The common characteristics of NoSQL databases are
- Not using the relational model
- Running well on clusters
- Open-source
- Built for the 21st century web estates
- Schemaless

The most important result of the rise of NoSQL is Polyglot Persistence.

2 Aggregate Data Models

An aggregate is a collection of data that we interact with as a unit. Aggregates form the boundaries for ACID operations with the database.
Key-value, document, and column-family databases can all be seen as forms of aggregate-oriented database.
Aggregates make it easier for the database to manage data storage over clusters.
Aggregate-oriented databases work best when most data interaction is done with the same aggregate; aggregate-ignorant databases are better when interactions use data organized in many different formations.
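
A minimal sketch of what "interacting with an aggregate as a unit" can look like. It assumes a hypothetical order aggregate and uses an in-memory dict as a stand-in for a key-value or document store; the names `store`, `save_order`, `load_order`, and `order_total` are illustrative and not taken from the book.

```python
import copy

# In-memory stand-in for an aggregate-oriented store: one key, one whole aggregate.
store = {}

def save_order(order):
    """Write the whole aggregate in one step; a real store would make this atomic."""
    store[order["id"]] = copy.deepcopy(order)

def load_order(order_id):
    """Read the whole aggregate back as a single unit."""
    return copy.deepcopy(store[order_id])

def order_total(order):
    """Everything needed lives inside the aggregate, so no joins are required."""
    return sum(item["price"] * item["qty"] for item in order["items"])

# Hypothetical order aggregate: customer details, line items, and address
# travel together because they are read and written together.
order = {
    "id": "order-1001",
    "customer": {"id": "cust-7", "name": "Ann"},
    "items": [
        {"sku": "book-1", "price": 30, "qty": 2},
        {"sku": "pen-9", "price": 5, "qty": 1},
    ],
    "shipping_address": {"city": "Chicago"},
}
save_order(order)
total = order_total(load_order("order-1001"))
```

Because the whole order lives under one key, a cluster can place it on a single node. The trade-off is that a question cutting across orders, such as total sales per product, has to visit every aggregate.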

3 More Details on Data Models

Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than intra-aggregate relationships.
Graph databases organize data into node and edge graphs; they work best for data that has complex relationship structures.
Schemaless databases allow you to freely add fields to records, but there is usually an implicit schema expected by users of the data.
Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations.
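
To illustrate the graph point, here is a small sketch that stores relationships as nodes and edges and answers a "who is within N hops" question by traversal. The adjacency list, the names in it, and the function `reachable_within` are assumptions made for the example; a real graph database would offer its own query language for this.

```python
from collections import deque

# Hypothetical "knows" edges between people, stored as an adjacency list.
edges = {
    "anna": ["barbara", "carol"],
    "barbara": ["dawn"],
    "carol": ["dawn", "elizabeth"],
    "dawn": [],
    "elizabeth": [],
}

def reachable_within(start, max_hops):
    """Breadth-first traversal: nodes reachable in at most max_hops edges."""
    seen = {start}
    found = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return found

friends = reachable_within("anna", 1)
friends_of_friends = reachable_within("anna", 2)
```

In an aggregate-oriented store, the same question would mean loading one aggregate per hop, which is why relationship-heavy data tends to fit a graph model better.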

4 Distribution Models

There are two styles of distributing data:
- Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data.
- Replication copies data across multiple servers, so each bit of data can be found in multiple places.
A system may use either or both techniques.
Replication comes in two forms:
- Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.
- Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.
Master-slave replication reduces the chance of update conflicts but peer-to-peer replication avoids loading all writes onto a single point of failure.
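
The two techniques can be combined, as in the following rough sketch. It assumes hash-based shard routing and a manual `sync` step standing in for replication traffic; the classes `ReplicatedShard` and `ShardedStore` are hypothetical and ignore failure handling entirely.

```python
import hashlib

class ReplicatedShard:
    """One master copy that takes writes, plus slave copies that may serve reads."""

    def __init__(self, slave_count=2):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]

    def write(self, key, value):
        # All writes go to the master, which avoids write-write conflicts.
        self.master[key] = value

    def sync(self):
        # Stand-in for replication: slaves copy the master's current state.
        for slave in self.slaves:
            slave.clear()
            slave.update(self.master)

    def read_from_slave(self, key, index=0):
        # May return stale data if sync() has not run since the last write.
        return self.slaves[index].get(key)

class ShardedStore:
    """Sharding: each key belongs to exactly one shard, chosen by a stable hash."""

    def __init__(self, shard_count=3):
        self.shards = [ReplicatedShard() for _ in range(shard_count)]

    def shard_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def write(self, key, value):
        self.shard_for(key).write(key, value)

cluster = ShardedStore()
cluster.write("user:1", "Ann")
shard = cluster.shard_for("user:1")
stale_read = shard.read_from_slave("user:1")   # slaves have not synced yet
shard.sync()
fresh_read = shard.read_from_slave("user:1")
```

The stale read before `sync()` shows the read-write inconsistency window that master-slave replication introduces.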

5 Consistency

Write-write conflicts occur when two clients try to write the same data at the same time. Read-write conflicts occur when one client reads inconsistent data in the middle of another client's write.
Pessimistic approaches lock data records to prevent conflicts. Optimistic approaches detect conflicts and fix them.
Distributed systems see read-write conflicts due to some nodes having received updates while other nodes have not. Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.
Clients usually want read-your-writes consistency, which means a client can write and then immediately read the new value. This can be difficult if the read and the write happen on different nodes.
To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade off consistency versus latency.
The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.
Durability can also be traded off against latency, particularly if you want to survive failures with replicated data.
You do not need to contact all replicas to preserve strong consistency with replication; you just need a large enough quorum.
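
The quorum point can be sketched with the commonly used rules for N replicas: a write quorum W with W > N/2, and a read quorum R with R + W > N. The functions below are illustrative assumptions; in particular, `quorum_read` simply takes the first `r` replicas, where a real system would contact whichever nodes respond.

```python
def is_strongly_consistent(n, w, r):
    """True when read and write quorums must overlap (W + R > N)
    and two write quorums must overlap (W > N / 2)."""
    return w + r > n and 2 * w > n

def quorum_read(replicas, r):
    """Read r replicas and return the value carrying the highest version."""
    responses = replicas[:r]
    return max(responses, key=lambda rec: rec["version"])["value"]

# N = 3 replicas; a write with W = 2 reached the last two, the first is stale.
replicas = [
    {"version": 1, "value": "old"},
    {"version": 2, "value": "new"},
    {"version": 2, "value": "new"},
]
read_with_quorum = quorum_read(replicas, 2)     # overlaps the write quorum
read_without_quorum = quorum_read(replicas, 1)  # may hit only the stale replica
```

With N = 3, W = 2, and R = 2, every read set shares at least one node with the latest write, so the newest version is always visible. That is why contacting all replicas is unnecessary.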

6 Version Stamps

Version stamps help you detect concurrency conflicts. When you read data, then update it, you can check the version stamp to ensure nobody updated the data between your read and write.
Version stamps can be implemented using counters, GUIDs, content hashes, timestamps, or a combination of these.
With distributed systems, a vector of version stamps allows you to detect when different nodes have conflicting updates.
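
Both ideas can be sketched briefly. The first part assumes a counter-based version stamp checked on every write (a conditional update); the second compares two vector stamps, represented here as dicts from node name to counter. `VersionedStore`, `VersionConflict`, and `compare_vector_stamps` are hypothetical names for this example.

```python
class VersionConflict(Exception):
    """Raised when the data changed between a client's read and its write."""

class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(f"{key}: expected {expected_version}, found {current_version}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1

def compare_vector_stamps(a, b):
    """Return 'equal', 'a_newer', 'b_newer', or 'conflict' for two vector stamps."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"  # each side has seen an update the other has not
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

db = VersionedStore()
version, _ = db.read("room-12")                       # both clients read version 0
db.write("room-12", "booked by Ann", version)         # first writer succeeds
try:
    db.write("room-12", "booked by Bob", version)     # second writer is stale
    second_write_rejected = False
except VersionConflict:
    second_write_rejected = True
```

A counter works when one node is authoritative. The vector form is needed under peer-to-peer replication, where no single counter orders the updates.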

7 Map-Reduce

Map-reduce is a pattern to allow computations to be parallelized over a cluster.
The map task reads data from an aggregate and boils it down to relevant key-value pairs. Maps only read a single record at a time and can thus be parallelized and run on the node that stores the record.
Reduce tasks take many values for a single key output from map tasks and summarize them into a single output. Each reducer operates on the result of a single key, so it can be parallelized by key.
Reducers that have the same form for input and output can be combined into pipelines. This improves parallelism and reduces the amount of data to be transferred.
Map-reduce operations can be composed into pipelines where the output of one reduce is the input to another operation's map.
If the result of a map-reduce computation is widely used, it can be stored as a materialized view.
Materialized views can be updated through incremental map-reduce operations that only compute changes to the view instead of recomputing everything from scratch.
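
A single-process sketch of the pattern, assuming hypothetical order aggregates and a per-product sales summary. On a cluster the map calls would run on the nodes holding each order and the reduce calls would be partitioned by key; here both simply run in a loop.

```python
from collections import defaultdict

# Hypothetical order aggregates; each map call sees exactly one of them.
orders = [
    {"id": 1, "items": [{"product": "tea", "qty": 2, "price": 3},
                        {"product": "coffee", "qty": 1, "price": 5}]},
    {"id": 2, "items": [{"product": "tea", "qty": 1, "price": 3}]},
]

def map_order(order):
    """Map: boil one aggregate down to (product, partial summary) pairs."""
    for item in order["items"]:
        yield item["product"], {"qty": item["qty"],
                                "revenue": item["qty"] * item["price"]}

def reduce_product(values):
    """Reduce: summarize all values for one key. Input and output share the
    same shape, so partial results can be reduced again (a combinable reducer)."""
    return {"qty": sum(v["qty"] for v in values),
            "revenue": sum(v["revenue"] for v in values)}

def map_reduce(records):
    grouped = defaultdict(list)
    for record in records:                 # parallelizable per record
        for key, value in map_order(record):
            grouped[key].append(value)
    # parallelizable per key
    return {key: reduce_product(values) for key, values in grouped.items()}

totals = map_reduce(orders)
```

The resulting `totals` dict is the kind of result that could be stored as a materialized view and refreshed incrementally as new orders arrive.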

12 Schema Migrations

Databases with strong schemas, such as relational databases, can be migrated by saving each schema change, plus its data migration, in a version-controlled sequence.
Schemaless databases still need careful migration due to the implicit schema in any code that accesses the data.
Schemaless databases can use the same migration techniques as databases with strong schemas.
Schemaless databases can also read data in a way that's tolerant to changes in the data's implicit schema and use incremental migration to update data.
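
The last point can be sketched as follows, assuming a hypothetical change in which a customer document's single `name` field is split into `first_name` and `last_name`. The field names, the `schema_version` marker, and both functions are invented for the illustration.

```python
def read_customer_name(doc):
    """Tolerant read: accept both the old shape ('name') and the new shape."""
    if "first_name" in doc:
        return doc["first_name"], doc["last_name"]
    first, _, last = doc.get("name", "").partition(" ")
    return first, last

def migrate_on_write(doc):
    """Incremental migration: upgrade a document to the new shape the next
    time it is saved, instead of rewriting the whole database at once."""
    first, last = read_customer_name(doc)
    upgraded = {k: v for k, v in doc.items() if k != "name"}
    upgraded.update({"first_name": first, "last_name": last, "schema_version": 2})
    return upgraded

old_doc = {"id": 1, "name": "Martin Fowler"}
new_doc = {"id": 2, "first_name": "Pramod", "last_name": "Sadalage"}
upgraded_doc = migrate_on_write(old_doc)
```

Old and new documents can coexist while the data is migrated gradually. The cost is that the reading code must keep handling every shape still present in the store.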

13 Polyglot Persistence

Polyglot persistence is about using different data storage technologies to handle varying data storage needs.
Polyglot persistence can apply across an enterprise or within a single application.
Encapsulating data access into services reduces the impact of data storage choices on other parts of a system.
Adding more data storage technologies increases complexity in programming and operations, so the advantages of a good data storage fit need to be weighed against this complexity.
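
One way to picture service encapsulation is a small interface that hides the storage choice from its callers. In this sketch the session-storage use case, the `SessionStore` interface, and both implementations are assumptions; the relational variant uses Python's standard `sqlite3` module with an in-memory database.

```python
import json
import sqlite3
from abc import ABC, abstractmethod

class SessionStore(ABC):
    """The boundary the rest of the application depends on."""

    @abstractmethod
    def put(self, session_id, data): ...

    @abstractmethod
    def get(self, session_id): ...

class KeyValueSessionStore(SessionStore):
    """Stand-in for a key-value database: a plain dict."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, data):
        self._data[session_id] = dict(data)

    def get(self, session_id):
        return self._data.get(session_id)

class SqliteSessionStore(SessionStore):
    """Relational implementation behind the same interface."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, data TEXT)")

    def put(self, session_id, data):
        self._conn.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?)",
                           (session_id, json.dumps(data)))

    def get(self, session_id):
        row = self._conn.execute("SELECT data FROM sessions WHERE id = ?",
                                 (session_id,)).fetchone()
        return json.loads(row[0]) if row else None

def current_user(store, session_id):
    """Caller code: unaware of which storage technology sits behind the store."""
    session = store.get(session_id)
    return session["user"] if session else None

results = []
for backend in (KeyValueSessionStore(), SqliteSessionStore()):
    backend.put("s-1", {"user": "ann"})
    results.append(current_user(backend, "s-1"))
```

Swapping the backend changes one constructor call and nothing in `current_user`, which is the sense in which encapsulation limits the impact of a storage choice.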

14 Beyond NoSQL

NoSQL is just one set of data storage technologies. As they increase our comfort with polyglot persistence, we should consider other data storage technologies whether or not they bear the NoSQL label.

15 Choosing Your Database

The two main reasons to use NoSQL technology are:
- To improve programmer productivity by using a database that better matches an application's needs.
- To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput.
It's essential to test your expectations about programmer productivity and/or performance before committing to using a NoSQL technology.
Service encapsulation supports changing data storage technologies as needs and technology evolve. Separating parts of applications into services also allows you to introduce NoSQL into an existing application.
Most applications, particularly nonstrategic ones, should stick with relational technology—at least until the NoSQL ecosystem becomes more mature.

Pramod Sadalage & Martin Fowler: 12 Sep 2012

There are no key points for Chapters 8-11 since these are examples of database use.