Unprecedented data volumes are pushing developers and businesses to look at alternatives to the relational databases that have been in use for more than thirty years. Taken together, these technologies are known as "NoSQL databases".
The main problem is that relational databases cannot cope with the loads typical of today's high-load projects. There are three specific areas of concern:
- horizontal scaling for large amounts of data, as in the case of Digg (3 TB for the green badges shown when a friend has dugg an article), Facebook (50 TB for inbox search) or eBay (2 PB in total)
- the performance of each server
- flexibility of the schema (the logical structure of the data).
Scalability
Some people take scalability to mean replication, so to be clear: in this context, scalability means automatic distribution of data across multiple servers. We call such systems distributed databases. They include Cassandra, HBase, Riak, Scalaris and Voldemort. This is your only option if you have more data than a single machine can process, or if you do not want to manage the distribution manually. There are two things to look for in a distributed database: support for multiple datacenters, and the ability to add new machines to a running cluster transparently to your applications.
Distributed databases do not include CouchDB, MongoDB, Neo4j, Redis and Tokyo Cabinet. These systems can serve as a storage layer for distributed systems: MongoDB provides limited support for sharding, as does the separate Lounge project for CouchDB, and Tokyo Cabinet can be used as a storage engine for Voldemort.
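To make "automatic data distribution" and "adding machines to a running cluster" more concrete, here is a minimal sketch of consistent hashing, one common way to assign keys to servers. It is an illustration only, not the exact mechanism of any system named above; the class `HashRing`, the server names and the `vnodes` parameter are all made up for this example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key belongs to the next point clockwise."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual points per node, to even out the load
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def node_for(self, key: str) -> str:
        if not self._points:
            raise ValueError("ring has no nodes")
        # First point past the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["server-a", "server-b", "server-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("server-d")  # grow the cluster
after = {k: ring.node_for(k) for k in keys}

# Keys that changed owner; every one of them now lives on the new node.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to server-d")
```

The point of the design is that adding a server relocates only the keys the new server takes over, so the application keeps calling `node_for` and never has to repartition by hand.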
Data model and queries
NoSQL databases offer a huge variety of data models and query APIs. (The query APIs in use include Thrift, map/reduce, cursors, graph traversal, collections, nested hashes and plain get/put.)
The column family model is used in Cassandra and HBase; both took the idea from the papers describing Google's Bigtable (although Cassandra moved somewhat away from Bigtable's ideas and introduced supercolumns). In both systems you have rows and columns as you are used to seeing them, but the rows are sparse: each row can have as many or as few columns as needed, and the columns do not have to be defined in advance.
The key/value model is the simplest and the easiest to implement, but it is inefficient if you are only interested in querying or updating part of a value. It is also difficult to implement more complex structures on top of distributed key/value systems.
Document-oriented databases are essentially the next level up from key/value systems, allowing nested data to be associated with each key. Supporting queries on that nested data is more efficient than simply returning the entire BLOB each time.
Neo4j has a unique data model, storing objects and relationships as the nodes and edges of a graph. For queries that fit this model (e.g., hierarchical data), it can be a thousand times faster than the alternatives.
Scalaris is unique in offering distributed transactions across multiple keys. A discussion of the tradeoffs between consistency and availability is beyond the scope of this post, but it is another aspect that must be considered when evaluating distributed systems.
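The difference between the key/value, document and column family models is easier to see side by side. The sketch below imitates each one with plain Python dictionaries; it does not use the API of any real database, and all names (`kv_store`, `doc_store`, `find_by_path`, `column_family`) and sample records are invented for illustration.

```python
import json

# Key/value: the store sees only an opaque blob per key.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "Ann", "city": "Oslo", "visits": 3})

# Updating one field means reading and rewriting the whole value.
blob = json.loads(kv_store["user:1"])
blob["visits"] += 1
kv_store["user:1"] = json.dumps(blob)

# Document model: the store understands the nested structure, so it can
# answer a query on one field instead of handing back every blob.
doc_store = {
    "user:1": {"name": "Ann", "address": {"city": "Oslo"}},
    "user:2": {"name": "Bob", "address": {"city": "Rome"}},
}


def find_by_path(store, path, wanted):
    """Return keys of documents whose nested field at `path` equals `wanted`."""
    hits = []
    for key, doc in store.items():
        node = doc
        for part in path:
            node = node.get(part) if isinstance(node, dict) else None
        if node == wanted:
            hits.append(key)
    return hits


print(find_by_path(doc_store, ["address", "city"], "Rome"))

# Column family: rows are sparse, and columns are not declared in advance.
column_family = {
    "row-1": {"name": "Ann", "email": "ann@example.com"},
    "row-2": {"name": "Bob", "phone": "555-0100", "twitter": "@bob"},
}
column_family["row-1"]["last_login"] = "2009-11-09"  # new column, one row only
```

In a real system the work done by `find_by_path` would happen on the server, which is where the efficiency gain over fetching whole blobs comes from.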
Storage
By storage I mean how data is stored within the system.
The storage system can tell us a lot about what kind of load a database can handle well.
In-memory databases are very, very fast (Redis can perform up to 100,000 operations per second), but they cannot work with data sets that exceed the available RAM. Durability (keeping data through a server crash or power outage) can also be a problem, although newer versions of Redis support an append-only log. The amount of data still waiting to be written to disk, and therefore at risk, is potentially large. Scalaris, another in-memory system, addresses durability through replication, but it does not support multiple datacenters, so data loss is still possible in the event of a power outage.
Memtables and SSTables buffer write requests in memory (in a memtable) after first writing them to a commit log for durability (it is hard to explain briefly, but you can read more in the Cassandra wiki: http://wiki.apache.org/cassandra/ArchitectureOverview). Once enough writes have accumulated, the memtable is sorted and written to disk as an SSTable. This gives performance close to that of an in-memory system while avoiding the problems of memory-only storage. (The procedure is described in more detail in sections 5.3 and 5.4 of the Bigtable paper, as well as in the paper on log-structured merge-trees, "The log-structured merge-tree".)
B-trees have been used in databases for a long time. They provide robust indexing support, but performance is poor on machines with magnetic hard disks (which are still the most cost-effective), because reading or writing data requires a large number of head seeks. An interesting variant is CouchDB's append-only B-trees (B-trees that are only ever appended to and never rewritten in place), which give reasonably good performance when writing data to disk.
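The memtable and SSTable write path described above can be sketched in a few lines. This is a toy model under stated assumptions, not Cassandra's or HBase's implementation: the class `TinyLSM` and its `flush_threshold` parameter are hypothetical, the "commit log" and "SSTables" are in-memory lists standing in for files on disk, and compaction of old SSTables is left out.

```python
import bisect


class TinyLSM:
    """Toy log-structured store: commit log -> memtable -> sorted SSTables."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []  # stand-in for an append-only file on disk
        self.memtable = {}    # recent writes, held in memory
        self.sstables = []    # immutable sorted runs, oldest first

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for durability
        self.memtable[key] = value            # 2. then update memory
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Sort once and write the run out in order: sequential I/O, no seeks.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed entries no longer need the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # the newest run wins
            # Rebuilding the key list on each lookup keeps the sketch short;
            # a real store would keep an index instead.
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None


db = TinyLSM(flush_threshold=2)
db.put("b", 1)
db.put("a", 2)  # the memtable is full: it is sorted and flushed
db.put("a", 3)  # a newer value for "a" stays in the memtable
print(db.get("a"), db.get("b"), db.sstables)
```

Writes only ever append to the log and update memory, which is why this layout avoids the disk seeks that make B-trees slow on magnetic disks.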
Opinion
The NoSQL movement grew dramatically in 2009, driven by the growing number of companies dealing with large amounts of data. More and more systems are appearing that let you store, process and manage huge amounts of data transparently. I hope this short article has shown you some of the strengths of NoSQL systems, and that you may contribute to the development of this movement.