Hadoop Ecosystem

Apache Hadoop is the talk of the town pretty much all over the Big Data world. For beginners in Big Data and Hadoop, there are quite a few terms, frameworks, and libraries to digest before getting a feel for the Hadoop ecosystem.

While learning them myself, I came across a wonderful article written by Edd Dumbill. I have excerpted a few notes from his article and formatted them in tabular form in this blog entry.

Apache Hadoop

  • An open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
  • It is a batch-oriented system.
  • Components
    • MapReduce – Framework for parallel computation on server clusters, based on the programming model published by Google.
    • HDFS (Hadoop Distributed File System) – Distributed, redundant file system for storing unstructured and schemaless data in Hadoop.
    • YARN (Yet Another Resource Negotiator) – A framework for job scheduling and cluster resource management.
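
To get a feel for the MapReduce idea, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

The point of the split is that each map call looks at only one line and each reduce call looks at only one key, which is what lets a framework spread the work over a cluster.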

Programmability

Pig

  • High-level programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results.
  • Pig’s built-in operations can make sense of semi-structured data, such as log files.
  • Its main advantage is that it drastically cuts the amount of code needed compared to direct use of Hadoop’s Java APIs.
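
The load, transform, store pattern described above can be sketched in a few lines of plain Python. This is only an analogy for the shape of a Pig data flow, not Pig Latin itself. The log format, the `LOG_PATTERN` regular expression, and the function names are assumptions made up for this example.

```python
import re
from collections import Counter

# Assumed log format for illustration: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def load(lines):
    """Load: parse each line into a record, skipping lines that do not match."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


def transform(records):
    """Transform: keep ERROR records and count them per message."""
    return Counter(r["message"] for r in records if r["level"] == "ERROR")


def store(counts):
    """Store: render results as tab-separated lines, most frequent first."""
    return [f"{message}\t{count}" for message, count in counts.most_common()]


if __name__ == "__main__":
    sample = [
        "2014-01-01T00:00:00 ERROR disk full",
        "2014-01-01T00:00:01 INFO started",
        "2014-01-01T00:00:02 ERROR disk full",
    ]
    for row in store(transform(load(sample))):
        print(row)
```

Malformed lines are simply skipped in `load`, which mirrors the idea of working with semi-structured input where not every line fits the expected shape.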

Hive

  • Enables Hadoop to operate as a data warehouse with SQL-like access. Integrates easily via JDBC/ODBC.
  • It superimposes structure on data in HDFS, and then permits queries over the data using a familiar SQL-like syntax.
  • More suitable for data warehousing tasks.
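
The idea of superimposing structure on raw files and then querying them can be illustrated with a small plain-Python sketch. It is not Hive or HiveQL: the sample data in `RAW`, the `SCHEMA` list, and both function names are hypothetical, chosen only to show a schema being applied when the data is read rather than when it is written.

```python
import csv
import io

# Hypothetical raw file contents as they might sit in HDFS: no header, no types.
RAW = "1,alice,34\n2,bob,27\n3,carol,41\n"

# Schema applied at read time as (column name, type converter) pairs,
# similar in spirit to a table definition laid over existing files.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(raw_text, schema):
    """Yield one typed dict per line by zipping the fields with the schema."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def select_names_older_than(rows, min_age):
    """Roughly: SELECT name FROM users WHERE age > min_age."""
    return [row["name"] for row in rows if row["age"] > min_age]


if __name__ == "__main__":
    print(select_names_older_than(read_with_schema(RAW, SCHEMA), 30))
```

The raw text never changes here. Only the interpretation laid over it does, so a different schema could be applied to the same bytes later.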

Data Collection

Sqoop

  • A tool to import data from relational databases into Hadoop, either directly into HDFS or into Hive.

Flume

  • A tool to import streaming flows of log and event data directly into HDFS.
  • Efficient service for collecting, aggregating, and moving large amounts of log data.

Chukwa

  • Open-source data collection system for monitoring large distributed systems.
  • Built on top of HDFS and MapReduce.

Data Serialization

Avro

  • Data serialization framework.
  • Used in Hadoop primarily as both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.

Configuration and Coordination

ZooKeeper

  • A tool for configuration management and coordination of computing nodes in a cluster.

Workflow

Oozie

  • Orchestration and workflow management tool that manages workflows and their dependencies, removing the need for developers to code custom solutions.

Deployment, Monitoring and Administration

Ambari

  • Tool to help system administrators deploy and configure Hadoop, upgrade clusters, and monitor services. Through an API it may be integrated with other system management tools.
  • Developed by Hortonworks.

Whirr

  • Tool for cloud-agnostic deployment of clusters. It offers a way of running services, including Hadoop, on cloud platforms.
  • Currently supports the Amazon EC2 and Rackspace services.

Machine Learning

Mahout

  • Library of machine learning and data mining algorithms.
  • Use cases include user collaborative filtering, user recommendations, clustering and classification.

Databases

HBase

  • A column-oriented database scaling to billions of rows.
  • Runs on top of HDFS for rapid data access.
  • MapReduce can use HBase as both a source and a destination for its computations.
  • Hive and Pig can be used in combination with HBase.
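
To picture what "column-oriented" means here, the sketch below models a table as row keys mapping to `family:qualifier` columns, with rows read back in sorted key order. It is a toy in-memory model, not the HBase client API: the class `TinyColumnStore`, its methods, and the sample keys are all made up for illustration.

```python
from collections import defaultdict


class TinyColumnStore:
    """Toy in-memory model: row key -> 'family:qualifier' -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; rows need not share the same set of columns."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Return one cell, or a copy of the whole row if no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Yield rows with start_key <= key < stop_key in sorted key order."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])


if __name__ == "__main__":
    store = TinyColumnStore()
    store.put("user#001", "info:name", "alice")
    store.put("user#001", "stats:logins", 3)
    store.put("user#002", "info:name", "bob")
    print(list(store.scan("user#001", "user#003")))
```

Because each row stores only the columns it actually has, sparse data costs nothing extra, and range scans over sorted row keys are the natural way to read a slice of the table.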