Big Data Acceleration Project on Apache HBase – White Paper from ThirdEye (2018-12-10)
WHITE PAPER

Apache HBase –
Big Data Acceleration Project

OVERVIEW OF APACHE HBASE

The whole concept of big data, or total data, and how to collect it and get it to the data lake can sound scary, but it becomes less so if you break the data collection problem into subsets. All of the technologies mentioned below are part of the Big Data framework:

  • Apache HBase is a popular open-source, scale-out NoSQL columnar database designed for low-latency
    random access to data stored on the popular Hadoop platform.
  • Apache Hadoop is a Big Data framework that uses HDFS (Hadoop Distributed File System) to store data
    and the MapReduce framework to process it. Java is the native language for writing MapReduce
    programs.
  • Apache HBase and Cassandra are both NoSQL databases that do not follow strict ACID transactions.
    Both are columnar databases and need proper data modeling to be used effectively. Their performance can
    be evaluated by benchmarking the database with a tool called YCSB (Yahoo Cloud Serving Benchmark).
  • Apache HBase is an Apache open source project offering NoSQL data storage. Often used together with
    HDFS, HBase is widely used across the world. Well-known users include Facebook, Twitter, Yahoo, and more.
  • Apache Cassandra has a masterless, shared-nothing, ring-based architecture that does not depend on the
    Hadoop framework. It is a good fit for write-heavy applications with fewer reads.
  • Apache Hive is a batch processing framework used to process data with a language called Hive
    Query Language (HQL). HQL is a SQL wrapper on top of HDFS that avoids having to write MapReduce programs
    in Java; instead, one can use a SQL-like language for daily tasks.

While other NoSQL databases, such as Cassandra and MongoDB, are designed and written as single standalone executables, HBase depends heavily on outside processes such as Apache ZooKeeper, the Hadoop Distributed File System (HDFS), and Hadoop’s MapReduce. These dependencies are open-source projects maintained outside of the HBase project, and they can present challenges when troubleshooting an HBase cluster.

There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon’s DynamoDB system, has an eventual consistency model and is write-optimized while HBase is a Google BigTable clone with read-optimization and strong consistency. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use!

Benchmarking NoSQL databases with YCSB

Yahoo! Cloud Serving Benchmark (YCSB) is a popular benchmark tool for NoSQL databases and comes with six out-of-the-box workloads, each testing a different common use case. Workload C is a 100% read-only workload. From a user’s point of view, the latency of each individual query matters very much. As we work with users to test, tune, and optimize HBase workloads, we now encounter a significant number who want bounded 99th percentile operation latencies. That means a full round trip, from client request to the response back at the client, within 100 milliseconds. Random read latency (YCSB Workload C) was measured for a 150GB dataset. Workload C is intended to model the type of user database that serves information to interactive web services.
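To make the percentile targets concrete, the sketch below computes average, 95th, and 99th percentile latencies from a set of per-operation round-trip times. The samples here are made up for illustration; YCSB reports these statistics for real runs. Note how a single slow outlier dominates the tail percentiles while barely moving the average:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value that is greater than
    or equal to p percent of the samples."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
    return ordered[rank - 1]

# Hypothetical per-operation latencies in milliseconds, with one outlier.
latencies = [3, 4, 4, 5, 5, 6, 7, 9, 12, 95]

avg = sum(latencies) / len(latencies)   # 15.0 ms
p95 = percentile(latencies, 95)         # 95 ms -- the outlier
p99 = percentile(latencies, 99)         # 95 ms -- the outlier
```

This is why a 100-millisecond 99th percentile target is far stricter than a 100-millisecond average: it constrains the slowest one percent of operations, which are exactly the ones a disk-bound cluster produces.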

Zipfian Distribution:

The following table shows measurements of the 150GB dataset using a Zipfian read distribution. During the 30-40 minute period, the HBase flash cache exhibited about 30% lower average latency. After 80 minutes, the flash cache case outperformed the pure-HDD case considerably: 7x, 21x, and 12x lower latency in the average, 95th percentile, and 99th percentile, respectively.

The following chart compares the average latency for Zipfian distribution for flash cache- accelerated and HDD (no bucket cache) cases.

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – Operations Per Second; 90-Minute Run, 20 Threads, Zipfian]

The following charts compare the average, 95th, and 99th percentile latencies for Zipfian distribution for the flash cache-accelerated and HDD (no bucket cache) cases:

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – Average Read Latency; 90-Minute Run, 20 Threads, Zipfian]

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – 95th Percentile Read Latency; 90-Minute Run, 20 Threads, Zipfian]

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – 99th Percentile Read Latency; 90-Minute Run, 20 Threads, Zipfian]

Uniform Distribution:

Although a uniform distribution is hard to find in the real world, it is often used to make comparisons as it is a special case with a complete absence of skew. The reason: a uniform query distribution should be difficult to cache effectively, regardless of where in the memory or cache hierarchy it is seen.

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – Operations Per Second; 90-Minute Run, 20 Threads, Uniform]

The following chart compares the average latency for uniform distribution for flash cache-accelerated and HDD (no bucket cache) cases.

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – Average Read Latency; 90-Minute Run, 20 Threads, Uniform]

The following charts compare the 95th and 99th percentile latencies for uniform distribution for the flash cache-accelerated and HDD (no bucket cache) cases:

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – 95th Percentile Read Latency; 90-Minute Run, 20 Threads, Uniform]

[Chart: YCSB Workload C (Read Intensive), HBase 0.95 – 99th Percentile Read Latency; 90-Minute Run, 20 Threads, Uniform]

ThirdEye’s approach:

ThirdEye brings a strong set of real-world experiences and expertise around Big Data technologies and projects. Our partnership approach to projects, our no-surprises project management philosophy, and our ability to leverage an onsite-offsite-offshore delivery model ensure value for every dollar spent. ThirdEye will conduct your project in phases, which will include data ingestion, profiling, modeling, and ETL design.

PROJECT PHASES

1. Logical test composition with test data and parameter variations

2. Baseline hard disk performance testing

3. HBase + Hadoop, tuning and testing (possibly iterating on LSI card tweaks, distro, patches)

4. HBase random access and batch characterization

5. Cassandra tuning and testing (possibly iterating on LSI card tweaks, distro, patches)

6. Cassandra random access characterization

PROJECT REQUIREMENTS

  • Trace of full cycle I/O stats
  • Recommendation for acceleration caching algorithm
  • Recommendation for data that can be pre-fetched

PROJECT DELIVERABLES

An improvement of 30% or more over disk-only for random access (using the best tuning parameters for each case) should certainly be possible, as long as there is more solid-state store than RAM and the hot parts of the dataset are bigger than RAM. The caveat here being: the processing time must be sufficiently brief to show an advantage over disk-only. The processing time includes app-level and subsystem-level expansion.

TECHNOLOGIES INCORPORATED

HBase version: 0.94.4+
Measurement methods: Well-known OS-level tools will be used to observe I/O operation rates and throughput.
Hadoop Distros to be used:

  • Cloudera CDH4 – to start with.
  • Facebook Hadoop (has changes to support SSD)
  • Will progress on to Apache Hadoop, Hortonworks HDP, MapR
  • Suggested Distros – Intel, EMC and Wandisco

Access Requirements: SSH access from Internet to each test machine.
Selectivity: Special hardware should serve data partitions, not the OS.
Hardware server: Dell R720XD with Nytro MegaRAID card

Hardware Requirements:
Ideally, we need 3-4 hosts, each with plain-old local disk, Nytro WarpDrive, and a Nytro MegaRAID card. This enables A/B comparisons. Machines should be interconnected via gigabit ethernet (or better), and must have the same cpu, plain-old-disk, memory, OS configurations. The plain-old-disk for data area should be about the same capacity as LSI target devices.

Performance measurement:
Yahoo! Cloud Serving Benchmark (YCSB) version 0.14 was used to measure the performance because it is both
well-known and open source. Random read latency (YCSB workload c) was measured for a 150GB dataset.

KEY FEATURES OF USING APACHE HBASE

Deep Integration with Apache Hadoop – Since HBase has been built on top of Hadoop, it supports parallelized processing via MapReduce. HBase can be used as both a source and output for MapReduce jobs. Integration with Apache Hive allows users to query HBase tables using the Hive Query Language, which is similar to SQL.

Strong Consistency – The HBase project has made strong consistency of reads and writes a core design tenet. A single server in an HBase cluster is responsible for a subset of data, and with atomic row operations, HBase is able to ensure consistency.
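A toy model of the per-row atomicity described above (this is an illustration of the semantics, not HBase's actual implementation): all mutations to a single row are serialized through that row's lock, so a conditional write like HBase's checkAndPut can test and update the row as one indivisible step:

```python
import threading

class Row:
    """Toy model of HBase's per-row atomicity: every mutation to this
    row goes through one lock, so readers never see a partial write."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}

    def put(self, updates):
        with self._lock:
            self._cells.update(updates)

    def check_and_put(self, column, expected, updates):
        # The check and the write happen under the same lock, so no
        # other writer can slip in between them.
        with self._lock:
            if self._cells.get(column) != expected:
                return False
            self._cells.update(updates)
            return True

row = Row()
row.put({"status": "new"})
ok = row.check_and_put("status", "new", {"status": "active"})      # True
denied = row.check_and_put("status", "new", {"status": "stale"})   # False
```

The second call fails because the first already changed the column, which is exactly the guarantee that lets applications build compare-and-swap style logic on a single row without external coordination.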

Failure Detection – When a node fails, HBase automatically recovers the writes in progress and edits that have not been flushed, then reassigns the regions that the failed node was serving to another region server.

Real-time Queries – HBase provides random, real-time access to its data by utilizing bloom filters, block caches, and Log-Structured Merge trees to efficiently store and query data.
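To illustrate the role of the bloom filter in that pipeline: it answers either "definitely not here" or "possibly here," which lets a read skip store files that cannot contain the requested row key. The sketch below is a minimal generic Bloom filter, not HBase's implementation (in HBase, filters are enabled per column family via configuration):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: adding a key sets a few bits; a lookup that
    finds any of those bits clear proves the key was never added, so
    there are no false negatives (only occasional false positives)."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key):
        # Derive num_hashes pseudo-independent bit positions by salting
        # SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    bf.add(row_key)

present = all(bf.might_contain(k) for k in ("row-001", "row-002", "row-003"))
false_positives = sum(bf.might_contain(f"absent-{i}") for i in range(1000))
```

Because a negative answer is definitive, each store file a read can rule out this way saves a disk seek, which is where much of the random-read latency discussed earlier comes from.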

Third Eye Data
5201 Great America Parkway, Suite 320,

Santa Clara, CA USA 95054
Phone: 408-462-5257
