Skip directly to content

Big Data & Big Compute with Grid Engine and Hadoop

on Mon, 03/28/2011 - 07:46

Big Data and Big Compute – Friends or Enemies?

Processing huge sets of data on large and scalable computational infrastructures (recently coined big data) gains increasing importance, not only for information giants such as Google. Apache Hadoop is an innovative and popular framework for developing and running data analytics applications such as data mining or parameter studies. It is an open implementation of the map-reduce paradigm and comes with HDFS, the Hadoop Distributed File System for high throughput access to distributed data.

Large and scalable computational environments are used for years for many other tasks, such as batch and throughput computing, technical computing, simulations or parallel high performance computing. Big compute, as one might call these environments, is the workhorse for innovation in computer aided research and development in industries as well as science. They are commonly operated by Distributed Resource Management Systems (DRMS) which enable the efficient sharing of resources across applications and among users and which allow to utilize even thousands of servers at 90% or higher.

Big data and big compute are both utilizing ever growing computational infrastructures and yet it is quite rare that they are used together, be it in combination or just sharing the same pool of resources. Sometimes the reason simply can be that organizations either are focused on data analytics or large scale computation and they entertain their own computational environment. Another and maybe more common reason is that these environments often do not cooperate well. They are unaware of the each other and using them on the same pool of resources can result in conflicts or bad performance.

Bridging the Gap – The Grid Engine Hadoop Integration

A solution for combining the two worlds is provided by the integration of Grid Engine with Apache Hadoop. It allows to jointly use a resource pool for Hadoop calculations as well as for other workloads under Grid Engine control and it enhances Hadoop with enterprise level accounting and reporting features as well as expanded policy control.

Using Grid Engine together with Hadoop will be beneficial in the following situations, for example:

  • Resource Pool Sharing: You have the intention to utilize your resource pool not only for Hadoop calculations but also sharing those resources for other workloads.

  • Analytics Plus Compute: You need to integrate Hadoop applications with non-Hadoop applications.

  • State-of-the-Art Management: Your focus are Hadoop-based applications only, but you require the comprehensive resource utilization control, policy management and accounting features of Grid Engine.

  • Cloud Interoperability: You want to operate Hadoop embedded in a private, hybrid or public cloud framework managed by Grid Engine.

How the Integration between Hadoop and Grid Engine works

Grid Engine treats Hadoop applications as a so called 'tightly integrated' parallel job. In doing so, Grid Engine has full control over all aspects of the Hadoop framework:

  • It selects nodes which are suitable and available for executing Hadoop jobs and have resident the HDFS data blocks needed by the Hadoop application
    ==> This leads to best achievable performance

  • It sets up a MapReduce cluster on those nodes

  • It submits the Hadoop job to the newly configured MapReduce cluster

  • It has full accounting and process control over the Hadoop application

Hadoop thereby inherits all the benefits which Grid Engine as an advanced enterprise-grade resource management system provides, such as versatile policy control and full accounting and reporting. It also does so in friendly coexistence with other workloads in that Grid Engine cluster, be they other parallel applications, e.g. using MPI, or throughput applications or even interactive work under Grid Engine control. Furthermore, on demand cloud management provided by Grid Engine and its integration with Univa's UniCloud (see “About Univa Grid Engine” below) becomes a given also for Hadoop.

The integration of Grid Engine with Hadoop has been tested with Apache Hadoop and with Cloudera's builds of Hadoop, CDH3 in particular (see “About Cloudera's Hadoop Distribution CDH3” below).

About Univa Grid Engine

Univa Grid Engine is the advanced and enterprise-ready version of the world's most popular resource and workload management system Grid Engine. It allows sites to manage shared usage of large resource pools like clusters of thousands of servers. Resources can be shared across users, user groups, departments or projects with advanced policies. Grid Engine provides a wide variety of resource entitlement control features like resource quotas or urgency, deadline and fair-share policies. Further features offered by Univa Grid Engine are advance reservation and, in combination with Univa UniSight, end-to-end accounting and reporting.

Univa Grid Engine supports arbitrary workloads ranging from batch jobs to parametric studies, parallel processing and interactive sessions. Together with Univa's UniCloud product, Univa Grid Engine can provide private cloud facilities, can expand into a public cloud to form a hybrid cloud or can be operated in a self-contained dynamically resizing public cloud.

About Cloudera's Hadoop Distribution CDH3

The Hadoop distribution from Cloudera, namely CDH3 (see http://www.cloudera.com/downloads/) offers a complete bundled platform based on the most recent version of Apache Hadoop plus a significant number of patches and enhancements, and it is very stable. According to Cloudera, the CDH is also the most widely used Hadoop distribution by far.