The following is a description of current and emerging Big Data database technologies. The discussion is based on the objectives of these technologies as well as the type of data involved.
First Some Noteworthy Data Facts
An interesting observation from most successful Big Data systems is that the value of an individual piece of data decreases with time, while the value of a collection of data rises with time. The value of aggregated data should therefore continue to increase over time, and shortening the time taken to extract, transform, and load (ETL) a data item increases the value of the data more rapidly as the system approaches the theoretical ideal of real-time decision making. As with many well-engineered systems, the closer we get to zero defects and real-time processing, the more expensive the implementation becomes for the system owner.
So how do we most effectively achieve our Big Data decision making objectives given the tools available today? By selecting the proper database management tool that most closely matches our analytical decision making requirements.
In the world of database management systems used today for processing Big Data we have the following solutions:
1. RDBMS/SQL - These are the traditional relational database management systems, built on the relational tables and indexes that we're used to. Some examples are Microsoft SQL Server, Oracle, MySQL, etc.
Benefits:
A well-understood and consistent model, meaning an application that runs on MySQL can be altered to run on Oracle without changing its basic assumptions.
Maintains relational integrity and provides ACID guarantees. ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably.
Comprehensive OLTP/transaction support. Strong OLAP/analysis tools, often built in (MS Analysis Services, Oracle OLAP).
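To make the atomicity part of ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the transfer function and the balances are hypothetical and chosen only for illustration; a production RDBMS would behave similarly in principle, but this is not taken from any particular vendor's documentation.

```python
import sqlite3

# Hypothetical schema: balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        # The connection context manager commits on success and
        # rolls back the whole transaction if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # enough funds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer credits bob before the debit on alice violates the constraint, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.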
Problems:
Most solutions are expensive.
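Scales up (i.e. bigger servers), but struggles to scale out (i.e. lots of servers), and scaling up is itself expensive.
Not 'natural' for developers: program objects have to be mapped to tables, which results in translation overhead and common mistakes like N+1 errors (one query to fetch a list, then one more query for every item in it).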
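The N+1 mistake mentioned above can be sketched as follows, again with Python's sqlite3 module. The authors and books tables, the function names and the sample rows are all hypothetical. The first function issues one query per author, the second gets the same result with a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES authors(id),
        title TEXT
    );
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO books VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
""")


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author."""
    queries = 1
    result = {}
    authors = conn.execute(
        "SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id",
            (author_id,)).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """The same result from a single LEFT JOIN."""
    result = {}
    rows = conn.execute("""
        SELECT a.name, b.title
        FROM authors a
        LEFT JOIN books b ON b.author_id = a.id
        ORDER BY a.id, b.id
    """)
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without books yield a NULL title
            result[name].append(title)
    return result, 1


slow, slow_queries = titles_n_plus_one(conn)
fast, fast_queries = titles_single_join(conn)
```

With three authors the first version runs four queries; with a million authors it would run a million and one, which is why the pattern hurts once data volumes grow.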
2. NoSQL - Non-relational databases, often in-memory
These generally don't use the SQL language (hence the name) and, more significantly, most don't offer full ACID guarantees or enforce relationships between tables. Instead they're designed to store and query document, key-value, column-family or graph data very quickly.
Examples: Hadoop (strictly a storage and processing framework rather than a database), MongoDB, CouchDB, Riak, Redis, Cassandra, Neo4j, Membase, HBase, etc.
Benefits:
Cheap, mostly open source implementations. Systems can scale out very easily; tables can be readily sharded/federated across servers.
Most store native programmer objects, so no translation to tables.
Very, very fast at finding records from massive datasets.
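As a rough illustration of how sharding lets a store scale out while keeping native objects, here is a minimal sketch of hash-based shard routing. The ShardedStore class and its methods are hypothetical, each "server" is just a Python dict, and real products use more elaborate schemes (consistent hashing, replication, rebalancing) that are not shown here.

```python
import hashlib


class ShardedStore:
    """Toy key-document store that spreads keys across several shards."""

    def __init__(self, shard_count):
        # Each dict stands in for one server holding a subset of the data.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash means the same key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, document):
        # The document is stored as-is, with no translation to table rows.
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        # A lookup touches exactly one shard, however many shards exist.
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(shard_count=4)
store.put("user:42", {"name": "Ann", "tags": ["admin", "beta"]})
user = store.get("user:42")
```

Because a lookup by key only ever visits one shard, adding servers adds capacity without slowing individual reads. The trade-off is that anything spanning many keys has to visit many shards.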
Problems:
No common model, and there are quite a few differences between the many solutions.
No ACID guarantees; instead, high fault tolerance must be built into the application.
Transactions are at the row level only (if supported at all).
Poor at aggregation - where an RDBMS solution would use SUM, AVG and GROUP BY, a NoSQL solution has map-reduce, which (some minor optimizations aside) has to do the equivalent of a table scan.
Poor at complex joins, although arguably this is something you'd design differently for.
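The aggregation point can be illustrated by computing the same per-group totals two ways: with SQL's SUM and GROUP BY, and with a hand-rolled map, shuffle and reduce pipeline. This is a simplified single-process sketch with hypothetical data and function names, not the API of any particular NoSQL product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (region, amount) records.
orders = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]


def sql_totals(rows):
    """Relational approach: the engine handles grouping and summing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return dict(conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"))


def map_phase(rows):
    # Visits every record and emits a (key, value) pair for each one.
    for region, amount in rows:
        yield region, amount


def shuffle(pairs):
    # Groups the emitted values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Collapses each key's list of values into a single total.
    return {key: sum(values) for key, values in groups.items()}


def mapreduce_totals(rows):
    return reduce_phase(shuffle(map_phase(rows)))


sql_result = sql_totals(orders)
mr_result = mapreduce_totals(orders)
```

Both approaches give the same totals, but the map phase has to visit every record, which is the table-scan cost described above.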
3. NewSQL - In-memory relational databases
NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance as NoSQL systems for online transaction processing (read-write) workloads while still maintaining the ACID guarantees of a traditional single-node database system.
These maintain ACID and relational integrity, but are in memory (like NoSQL) and readily scalable. They support SQL syntax. These are relatively new implementations and many traditional database vendors have rolled out their own solutions with the same capabilities. Think Oracle, Sybase, and even SAP with their in-memory HANA solution.
The most popular NewSQL systems are designed to operate in a distributed cluster of shared-nothing nodes, where each node typically owns a subset of the data. SQL queries are split into query fragments and sent to the nodes that own the relevant data, and the partial results are then combined. These databases aim to scale close to linearly as additional nodes are added.
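The fragment distribution described above can be sketched as a scatter-gather over in-process "nodes". Everything here is hypothetical and simplified: the Node class, the modulo partitioning rule and the single supported query shape (a filtered SUM and COUNT) are assumptions for illustration, and threads stand in for network calls to real servers.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One shared-nothing node that owns a private subset of the rows."""

    def __init__(self, rows):
        self.rows = rows  # list of (order_id, amount) tuples

    def run_fragment(self, min_amount):
        # Local fragment of:
        #   SELECT SUM(amount), COUNT(*) FROM orders WHERE amount >= ?
        matching = [amount for (_id, amount) in self.rows
                    if amount >= min_amount]
        return sum(matching), len(matching)


def partition(rows, node_count):
    # Assumed placement rule: a row lives on node (order_id % node_count).
    buckets = [[] for _ in range(node_count)]
    for row in rows:
        buckets[row[0] % node_count].append(row)
    return [Node(bucket) for bucket in buckets]


def distributed_sum_count(nodes, min_amount):
    # Scatter: every node runs the same fragment on its own data in parallel.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(
            lambda node: node.run_fragment(min_amount), nodes))
    # Gather: the coordinator merges the partial sums and counts.
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total, count


rows = [(i, i * 10) for i in range(1, 10)]  # amounts 10, 20, ..., 90
nodes = partition(rows, node_count=3)
result = distributed_sum_count(nodes, min_amount=50)
```

SUM and COUNT merge cleanly because partial results can simply be added together. An aggregate such as AVG would have to be rebuilt from the merged sum and count rather than by averaging the per-node averages.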
Examples: Clustrix, VoltDB, GenieDB, etc.