Hadoop Ecosystem


Apache Hadoop is the talk of the town pretty much all over the Big Data world. For beginners in Big Data and Hadoop, there are quite a few terminologies, frameworks, libraries, etc. to digest to get a feel for the Hadoop Ecosystem.

In the process of learning them myself, I came across a wonderful article written by Edd Dumbill. I have excerpted a few notes from his article and formatted them in tabular form in this blog entry.

Apache Hadoop

  • An open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
  • It is a batch-oriented system.
  • Components
    • MapReduce – A framework for parallel computation on server clusters, based on the programming model published by Google.
    • HDFS (Hadoop Distributed File System) – A distributed, redundant file system for storing unstructured and schemaless data in Hadoop.
    • YARN (Yet Another Resource Negotiator) – A framework for job scheduling and cluster resource management.
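The MapReduce model itself is simple enough to sketch outside Hadoop. Below is a minimal, hypothetical Python simulation of the map/shuffle/reduce flow for a word count — it is not Hadoop code, just an illustration of what the framework parallelizes across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(map_phase(lines))
print(counts)  # in Hadoop, each phase runs in parallel across the cluster
```

In Hadoop, the map and reduce functions are user code, while the framework handles splitting the input, shuffling intermediate pairs, and re-running failed tasks.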





Apache Pig

  • A high-level programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results.
  • Pig’s built-in operations can make sense of semi-structured data, such as log files.
  • Its main advantage is that it drastically cuts the amount of code needed compared to direct use of Hadoop’s Java APIs.
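To make the load/filter/group pattern concrete, here is roughly the kind of pipeline a short Pig Latin script (LOAD, FILTER, GROUP) expresses, sketched in plain Python against a made-up log format; the log lines and field layout are invented for illustration:

```python
from collections import Counter

# Hypothetical semi-structured log lines: "<level> <message>"
logs = [
    "ERROR disk full",
    "INFO startup complete",
    "ERROR network timeout",
]

# LOAD + FILTER: parse each line and keep only the error records.
errors = [line.split(" ", 1) for line in logs if line.startswith("ERROR")]

# GROUP + COUNT: tally the filtered records by level.
counts = Counter(level for level, _ in errors)
print(counts["ERROR"])
```

A Pig script states the same pipeline declaratively in a few lines, and Pig compiles it down to MapReduce jobs.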



Apache Hive

  • Enables Hadoop to operate as a data warehouse with SQL-like access, and integrates easily via JDBC/ODBC.
  • It superimposes structure on data in HDFS, and then permits queries over the data using a familiar SQL-like syntax.
  • Best suited to data warehousing tasks.
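Hive’s query language is close enough to standard SQL that the idea can be sketched with Python’s built-in sqlite3 module standing in for the Hive engine; the table and data below are invented, and in Hive the records would live as files in HDFS rather than in a local database:

```python
import sqlite3

# Superimpose structure on raw records, then query with SQL-like syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 120), ("/about", 30), ("/home", 80)],
)

# The same shape of query would run in Hive over files in HDFS.
rows = conn.execute(
    "SELECT url, SUM(views) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)
```

Under the hood, Hive translates such queries into MapReduce jobs, which is why it suits batch analytics rather than low-latency lookups.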

Data Collection



Apache Sqoop

  • A tool to import data from relational databases into Hadoop: either directly into HDFS, or into Hive.



Apache Flume

  • A tool to import streaming flows of log and event data directly into HDFS.
  • An efficient service for collecting, aggregating, and moving large amounts of log data.



Apache Chukwa

  • An open-source data collection system for monitoring large distributed systems.
  • Built on top of HDFS and MapReduce.

Data Serialization



Apache Avro

  • A data-serialization framework.
  • Primarily used in Hadoop both as a serialization format for persistent data and as a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
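The key idea behind Avro — the schema lives alongside the data, so records can be packed compactly without per-field labels — can be sketched with Python’s struct module. This is not the real Avro wire format, just a toy illustration of schema-driven binary encoding:

```python
import struct

# Toy schema: field order and types are defined once, outside the data.
SCHEMA = [("user_id", "q"), ("score", "d")]  # int64, float64
FMT = "<" + "".join(t for _, t in SCHEMA)

def encode(record):
    # Pack field values in schema order; no field names are stored.
    return struct.pack(FMT, *(record[name] for name, _ in SCHEMA))

def decode(blob):
    # The schema supplies the field names back on the way out.
    values = struct.unpack(FMT, blob)
    return dict(zip((name for name, _ in SCHEMA), values))

blob = encode({"user_id": 42, "score": 3.5})
print(len(blob), decode(blob))
```

Real Avro adds schema evolution, variable-length encodings, and container files, but the space saving comes from the same principle: the schema, not the payload, carries the structure.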

Configuration and Coordination


Apache ZooKeeper

  • A tool for configuration management and coordination of computing nodes in a cluster.




Apache Oozie

  • An orchestration and workflow management tool that manages workflows and dependencies, removing the need for developers to code custom solutions.

Deployment, Monitoring and Administration



Apache Ambari

  • A tool to help system administrators deploy and configure Hadoop, upgrade clusters, and monitor services. Through an API, it can be integrated with other system-management tools.
  • Developed by Hortonworks.


Apache Whirr

  • A tool for cloud-agnostic deployment of clusters; it offers a way of running services, including Hadoop, on cloud platforms.
  • Currently supports the Amazon EC2 and Rackspace services.

Machine Learning



Apache Mahout

  • A library of machine learning and data mining algorithms.
  • Use cases include collaborative filtering, user recommendations, clustering, and classification.
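The flavor of collaborative filtering can be shown in a few lines of plain Python. This is a toy in-memory version with invented preference data; Mahout runs this kind of computation as distributed jobs over far larger datasets:

```python
# Toy user-item preference data (invented for illustration).
prefs = {
    "alice": {"item1", "item2", "item3"},
    "bob":   {"item1", "item2", "item4"},
    "carol": {"item2", "item5"},
}

def recommend(user):
    # Score each unseen item by how similar its fans are to this user,
    # where similarity is simply the count of shared liked items.
    seen = prefs[user]
    scores = {}
    for other, items in prefs.items():
        if other == user:
            continue
        overlap = len(seen & items)
        for item in items - seen:
            scores[item] = scores.get(item, 0) + overlap
    return max(scores, key=scores.get) if scores else None

print(recommend("alice"))  # bob shares two items, so his item4 ranks first
```

Production recommenders replace the overlap count with weighted similarity measures, but the structure — find similar users, surface what they liked — is the same.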


Apache HBase

  • A column-oriented database that scales to billions of rows.
  • Runs on top of HDFS for rapid data access.
  • MapReduce can use HBase as both a source and a destination for its computations.
  • Hive and Pig can be used in combination with HBase.
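HBase’s data model — rows addressed by key, with values grouped into column families — can be sketched with nested Python dicts. This is a hypothetical in-memory miniature that ignores versioning, regions, and persistence:

```python
# table[row_key][column_family][qualifier] = value
table = {}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    # Lookup by row key is the fast path, as in HBase itself.
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#1001", "info", "name", "Alice")
put("user#1001", "metrics", "logins", 7)
print(get("user#1001", "info", "name"))
```

The row-key-first layout is why HBase gives rapid point reads over HDFS data, while scans across arbitrary columns remain a job for MapReduce, Hive, or Pig.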
