Monday, April 27, 2015

Do NoSQL databases have a place in your enterprise?

There’s been a quiet revolution going on in the world of database technology. After dominating for years, the monopoly of relational databases is ending. Names like Cassandra, MongoDB, Google App Engine Datastore, Amazon’s DynamoDB, HBase (a Hadoop technology), and Riak are showing up alongside traditional databases like Oracle, SQL Server, and MySQL. The primary data storage technologies offered today by many cloud providers are NoSQL, and NoSQL databases are already at work within a growing number of enterprises, enabling them to store and process more data than ever before.

So what changed?

Loosely encompassing everything ‘not SQL’, or not relational, NoSQL databases began emerging as a phenomenon in the 2000s at the intersection of several developments in the industry. Internet presences like Google, Amazon, Facebook, and Twitter were rapidly growing. Trend-defining concepts like ‘cloud computing’ and ‘big data’ were also beginning to take hold, powered by more affordable physical data storage. Together these increased the demand for databases that could scale far beyond what was typically needed before.

At the same time an important change was also taking place on the software side. Web services were becoming a prominent means of providing access to data, reducing dependence on SQL. SQL is tied to relational databases. Web services opened the door to alternatives.

And in the era of ever ‘bigger data’, companies needed alternatives. Relational databases weren’t designed to be run on clusters of commodity computers, where many of the benefits that give them value can be lost and traditional pricing structures just don’t fit. They also sometimes don't give enough control over query performance, and it can be tough for developers to make them fit application logic.

Enter NoSQL.

Led by pioneering work by Google and Amazon to create databases that could scale to massive clusters of commodity computer hardware, NoSQL is now a loose term encompassing all databases that don’t model and access data in the same way relational databases do.

The majority of NoSQL databases are for clusters and come in roughly three flavors, each of which models data in a different way:

Key-value databases associate chunks of data with keys which are used to access the chunks. These chunks can be virtually anything, but to the database are just meaningless bytes to store and retrieve. For example, a customer chunk may include name and address information, but a key-value database cannot identify them and can only retrieve these chunks by their key.
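To make the idea concrete, here is a minimal in-memory sketch in Python. It is not the API of any real key-value product; the KeyValueStore class, the "customer:1001" key format, and the sample customer are all made up for illustration. The point is only that the store sees bytes, and the application alone knows what is inside them.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never looks inside the value; it is just bytes.
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # The key is the only way to reach a chunk.
        return self._data[key]


store = KeyValueStore()
customer = {"name": "Ada Lopez", "address": "12 Elm St"}

# The application serializes the chunk before storing it.
store.put("customer:1001", json.dumps(customer).encode("utf-8"))

# Only the application knows how to turn the bytes back into a customer.
raw = store.get("customer:1001")
restored = json.loads(raw.decode("utf-8"))
```

Notice there is no way to ask this store for "customers named Ada Lopez" short of fetching and decoding every chunk yourself.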

Document databases also associate chunks of data with keys. The difference with document databases is that information within chunks can be identified to the database and used for indexing and queries. For example, a customer chunk’s name and address information could be identified to a document database which would then be able to find customer chunks by name, address, or by key.
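A rough Python sketch of the difference, again using a made-up in-memory class rather than any real product's API. The DocumentStore name, its find method, and the sample customers are assumptions for illustration. Real document databases maintain indexes; this toy simply scans, but the point is that the store can see the fields.

```python
class DocumentStore:
    """Toy in-memory document store (hypothetical, for illustration only)."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        # Documents are stored as structured data, not opaque bytes.
        self._docs[key] = doc

    def get(self, key):
        return self._docs[key]

    def find(self, field, value):
        # Because the store understands the fields, it can query on them.
        return sorted(
            key for key, doc in self._docs.items() if doc.get(field) == value
        )


store = DocumentStore()
store.put("customer:1001", {"name": "Ada Lopez", "address": "12 Elm St"})
store.put("customer:1002", {"name": "Ben Osei", "address": "9 Oak Ave"})
store.put("customer:1003", {"name": "Cy Tran", "address": "12 Elm St"})

by_key = store.get("customer:1002")
by_address = store.find("address", "12 Elm St")
by_name = store.find("name", "Ben Osei")
```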

Column family databases save chunk information in columns within column families. For example, a customer chunk's name and address could be stored in columns in one column family, and customer orders in columns in another column family, enabling information on customers to be accessed independently from their orders. Column family databases like Cassandra and HBase store column family data across rows together, giving them some resemblance to column-oriented databases. It’s fairly easy to imagine storing data that needs to be summed or averaged in a column family so processes can get to the data efficiently within a cluster. Storing event information like customer updates in a CRM system, posted messages in a blog, and user actions in a system are other uses that come to mind.
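Here is a simplified Python sketch of the column family idea. It does not reflect how Cassandra or HBase are actually implemented or called; the ColumnFamilyStore class, the family names, and the amounts (in cents) are assumptions made up for illustration. It only shows customer profile data and order data living in separate families, so each can be read without touching the other.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy column family store (hypothetical, for illustration only)."""

    def __init__(self):
        # family name -> row key -> {column name: value}
        self._families = defaultdict(dict)

    def put(self, family, row_key, columns):
        self._families[family].setdefault(row_key, {}).update(columns)

    def get(self, family, row_key):
        # Read one row's columns from a single family.
        return dict(self._families[family].get(row_key, {}))

    def scan(self, family):
        # Walk every row in one family without reading any other family.
        return list(self._families[family].items())


store = ColumnFamilyStore()
store.put("profile", "customer:1001", {"name": "Ada Lopez", "address": "12 Elm St"})
store.put("orders", "customer:1001", {"order:1": 2500, "order:2": 1000})
store.put("orders", "customer:1002", {"order:3": 400})

# Profile data is read independently of orders.
profile = store.get("profile", "customer:1001")

# Summing order amounts only touches the "orders" family.
total_cents = sum(
    amount for _, columns in store.scan("orders") for amount in columns.values()
)
```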

In addition to those designed for clusters, the NoSQL term also includes other types like NoSQL graph databases. Graph databases are not designed specifically for clusters and are instead focused on connecting data values together in a variety of ways for social, organizational, spatial, and other rich interrelationships.
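As a small Python sketch of what "connecting data values together" can mean, the toy graph below stores labeled, directed relationships and answers a friends-of-friends question. The Graph class, the relationship labels, and the sample names are all hypothetical; real graph databases offer their own query languages and far richer traversal.

```python
from collections import defaultdict


class Graph:
    """Toy labeled, directed graph (hypothetical, for illustration only)."""

    def __init__(self):
        # node -> relationship label -> set of connected nodes
        self._edges = defaultdict(lambda: defaultdict(set))

    def relate(self, source, label, target):
        self._edges[source][label].add(target)

    def neighbors(self, node, label):
        return set(self._edges[node][label])


def friends_of_friends(graph, person):
    """People two 'friend_of' hops away who are not already direct friends."""
    direct = graph.neighbors(person, "friend_of")
    second = set()
    for friend in direct:
        second |= graph.neighbors(friend, "friend_of")
    return second - direct - {person}


g = Graph()
g.relate("ada", "friend_of", "ben")
g.relate("ben", "friend_of", "cy")
g.relate("ada", "works_at", "acme")      # organizational relationship
g.relate("ada", "located_in", "austin")  # spatial relationship

suggestions = friends_of_friends(g, "ada")
```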

While many organizations aspire to standardize on a single family of technology products, making all data storage needs fit in a single database technology is a bit like mandating carpenters only use a table saw for cutting wood. Modern functions like online session management, transaction processing, event tracking, data warehousing, analysis, social relations, spatial correlation, customer preferences, compliance, and search, to name a few, have different database needs. As the data grows, these differences only become more apparent. This is why in the era of big data the trend is towards matching databases with specific needs, even to the extent of using more than one database within a single application.

I'll paint a simple picture to illustrate.

Consider an e-commerce system positioned for growth. Such a system today may include the following high-level requirements:
  • Collect rich data on user interactions for future data mining of customer patterns, security audits, compliance, and disclosures.
  • Capture relationships between customers for cross-marketing.
  • Support broad international expansion.
  • Aggregate an expanding inventory of catalog items from a growing list of international and regional suppliers.
  • Perform well under heavy loads.
  • Scale without technology costs spiraling exponentially.

Such a system probably could not use relational databases exclusively and meet these requirements under today’s loads. Implementations of large systems do, however, meet similar requirements by incorporating NoSQL technologies, oftentimes alongside relational database technologies. Breaking the system down into subsystems, and then relating each to a type of database that could fit, a hypothetical implementation could look something like this:
  • User profiles, preferences, and other cached data served from a NoSQL key-value database for simple, highly scalable, rapid lookup of arbitrary chunks of data.
  • A user session shopping cart subsystem also built on a NoSQL key-value database, keyed by user session.
  • A central accounting and finance subsystem built on a relational database for strong consistency, rigid data structures, rich data validation, and ad-hoc reporting.
  • A customer subsystem for capturing customer information, relationships, and recommendations for cross-marketing and social mining built, in part, on a NoSQL graph database, capable of connecting data together in a variety of ways.
  • A catalog of products capable of handling and organizing large amounts of data from a variety of suppliers based on a NoSQL document database where stored values have structure that can be queried on.
  • An order maintenance subsystem for capturing, viewing, and exchanging information on recent orders to supplier fulfillment systems also based on a NoSQL document database.
  • An analytics subsystem for archiving orders for trend analysis and data mining based on a NoSQL column family database, where related information in rows can be grouped together into column families for efficient processing.
  • An event tracking and logging subsystem that captures actions taken in the system for experience optimization, fraud analysis, intrusion detection, and future data mining also built on a NoSQL column family database.

NoSQL technologies are increasingly being used like this to power modern systems, but they do have limitations of their own.

NoSQL databases built for clusters keep chunks of data whole for more efficient distribution within a cluster, but this can limit real-time, ad-hoc reporting capability. For example, it may make sense to save orders in NoSQL whole, with each order chunk containing complete information on the customer who made the order and all items in the order. Doing so, however, makes it more difficult to query for things like “give me all customers who ordered a particular item” because this information is spread among all order chunks. In contrast, a relational database design would typically store customer and item information separately then relate both in separate order records, enabling the data to be queried in a variety of ways.
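A short Python sketch of that contrast, using plain dictionaries and lists in place of real databases. The sample orders, field names, and helper functions are made up for illustration. With whole order chunks, the item question means opening every chunk; with the relational-style layout, the same facts sit in flat rows that can be filtered on any column.

```python
# Whole order chunks, as a clustered NoSQL store might keep them.
order_chunks = {
    "order:1": {"customer": {"id": "c1", "name": "Ada"}, "items": ["widget", "gear"]},
    "order:2": {"customer": {"id": "c2", "name": "Ben"}, "items": ["gear"]},
    "order:3": {"customer": {"id": "c1", "name": "Ada"}, "items": ["bolt"]},
}


def customers_who_ordered(chunks, item):
    # Every chunk has to be opened, because items are buried inside orders.
    return sorted(
        {order["customer"]["id"] for order in chunks.values() if item in order["items"]}
    )


# Relational-style layout: customers stored once, orders relate customers to items.
customers = {"c1": "Ada", "c2": "Ben"}
order_rows = [
    ("order:1", "c1", "widget"),
    ("order:1", "c1", "gear"),
    ("order:2", "c2", "gear"),
    ("order:3", "c1", "bolt"),
]


def customers_who_ordered_relational(rows, item):
    # The same rows can be filtered by item, by customer, or by order.
    return sorted({customer_id for _, customer_id, row_item in rows if row_item == item})


from_chunks = customers_who_ordered(order_chunks, "gear")
from_rows = customers_who_ordered_relational(order_rows, "gear")
```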

This, of course, relates to the problem of how to group data together into chunks in the first place. Continuing with the same example, would it be best to group information into order chunks or customer chunks, where each customer chunk would contain information on all orders and items in each order for that customer? Grouping by customer chunks could be best for providing customers access to their orders because satisfying a customer's request to view their orders would only require a single retrieval of data within a cluster, helping make the application more responsive. On the other hand, grouping by order chunks may be best for a fulfillment system so it’s not necessary to retrieve and open all customer chunks just to see if there’s an active order that needs to be fulfilled. How to group data into chunks depends on how data will be queried. Unfortunately, it’s usually not possible to predict, especially early in projects, exactly how data will be queried, and making changes to how data is grouped into chunks down the road can be difficult.
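The trade-off can be sketched in a few lines of Python, with dictionaries standing in for a cluster. The sample data, field names, and helper functions are assumptions for illustration only. Grouped by customer, "show me my orders" is one lookup, but finding a given order means searching through customer chunks. Grouped by order, the reverse holds.

```python
# Grouping 1: one chunk per order.
order_chunks = {
    "order:1": {"customer_id": "c1", "items": ["widget"], "status": "shipped"},
    "order:2": {"customer_id": "c2", "items": ["gear"], "status": "active"},
    "order:3": {"customer_id": "c1", "items": ["bolt"], "status": "active"},
}


def group_by_customer(chunks):
    """Grouping 2: regroup the same data as one chunk per customer."""
    customer_chunks = {}
    for order_id, order in chunks.items():
        chunk = customer_chunks.setdefault(order["customer_id"], {"orders": {}})
        chunk["orders"][order_id] = {"items": order["items"], "status": order["status"]}
    return customer_chunks


customer_chunks = group_by_customer(order_chunks)

# Customer view: a single retrieval returns all of a customer's orders.
orders_for_c1 = sorted(customer_chunks["c1"]["orders"])

# Fulfillment view with order chunks: one lookup by order key.
status_direct = order_chunks["order:3"]["status"]


def find_order_status(chunks_by_customer, order_id):
    # Fulfillment view with customer chunks: open chunks until the order turns up.
    for chunk in chunks_by_customer.values():
        if order_id in chunk["orders"]:
            return chunk["orders"][order_id]["status"]
    return None


status_searched = find_order_status(customer_chunks, "order:3")
```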

This example’s a bit oversimplified, but it serves to illustrate some of the new design challenges NoSQL databases present in contrast to those of relational databases. Relational technology offers fairly standard guidance on how data should be broken apart into finer-grained units so new queries can be made without restructuring the data or extra processing, but in doing so trades off the ability to easily distribute the data in clusters. For example, how can order information linked to separate customer and item information be sent to different computers in a cluster and the relational database still support queries -- at least without sending linked customer and item information along to each machine too?

If it sounds like NoSQL presents new software engineering challenges, it does. Work like designing data structures and queries, ensuring data integrity and consistency, aggregating data for reporting, and performance tuning, handled for years by Database Administrators (DBAs) using built-in relational database features, is headed squarely back over to your application developers. And NoSQL technologies are young and changing quickly. Software design skills capable of modeling data in new ways and keeping database logic separate from application logic are becoming more important again.

While relational databases may continue to be a good fit for many enterprise applications, big data has ended their monopoly. NoSQL databases are now prominent offerings of major cloud platform providers like Amazon, Rackspace, and Google. Whether it be because of a next-generation application, the ability to archive a richer set of data for mining purposes, or the need to collect more system events for new compliance requirements, it’s prudent to prepare for NoSQL having a place alongside relational databases within your enterprise.
