7 V’s of Big Data – Briefly

  1. Velocity – the speed at which data is created today is almost unimaginable. e.g., 2.5 million Google queries per minute.
  2. Volume – the enormous amount of data generated. e.g., airplanes generate 2.5 billion TB of data each year from the sensors installed in their engines.
  3. Variety – data today comes in many different formats: structured, semi-structured, unstructured, and even complex structured data. e.g., Facebook, Twitter, etc.
  4. Veracity – the trustworthiness of the data. Any analysis performed is useless unless the data is accurate.
  5. Variability – the meaning of the same data can differ with context. e.g., the same words in a tweet can carry different meanings and sentiments.
  6. Visualization – the visual representation of analysed data in a comprehensible way.
  7. Value – data in itself is not valuable at all. The value lies in the analyses performed on it: how the data is turned into information and, eventually, into knowledge.

More detailed explanation: http://www.bigdata-startups.com/3vs-sufficient-describe-big-data/

Hadoop Ecosystem

Apache Hadoop is the talk of the town pretty much all over the Big Data world. For beginners in Big Data and Hadoop, there are quite a few terminologies, frameworks, libraries, etc. to digest before getting a feel for the Hadoop Ecosystem.

In the process of learning them myself, I came across a wonderful article written by Edd Dumbill. I have excerpted a few notes from his article and formatted them in tabular form in this blog entry.

Apache Hadoop

  • An open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware.
  • It is a batch-oriented system.
  • Components
    • MapReduce – a framework for parallel computation on server clusters, based on the programming model published by Google (a word-count sketch follows this list).
    • HDFS (Hadoop Distributed File System) – a distributed, redundant file system for storing unstructured and schema-less data in Hadoop.
    • YARN (Yet Another Resource Negotiator) – a framework for job scheduling and cluster resource management.
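
To make the model concrete, here is a minimal word-count sketch against Hadoop’s org.apache.hadoop.mapreduce Java API; the input and output paths come from the command line and are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input/output HDFS paths are illustrative.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }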

Programmability

Pig

  • A high-level language (Pig Latin) that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results.
  • Pig’s built-in operations can make sense of semi-structured data, such as log files.
  • Its main advantage is drastically cutting the amount of code needed compared to direct use of Hadoop’s Java APIs (see the sketch below).
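
As a rough illustration of that brevity, the Java word count above collapses to a few lines of Pig Latin. The sketch below runs it through Pig’s embedded Java API (PigServer); the file names are illustrative and local mode is assumed:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class WordCountPig {
      public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would target a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // The same word count as above, expressed in Pig Latin.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "wordcount-output");
      }
    }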

Hive

  • Enables Hadoop to operate as a data warehouse with SQL-like access, and is easily integrated via JDBC/ODBC (a JDBC sketch follows this list).
  • It superimposes structure on data in HDFS, and then permits queries over the data using a familiar SQL-like syntax.
  • Compared with Pig, it is more suitable for data warehousing tasks.
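
A minimal sketch of that JDBC route, assuming a HiveServer2 endpoint at localhost:10000 and an illustrative words table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are illustrative.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
          // Familiar SQL-like syntax over data stored in HDFS.
          ResultSet rs = stmt.executeQuery(
              "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
          while (rs.next()) {
            System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
          }
        }
      }
    }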

Data Collection

Sqoop

  • A tool to import data from relational databases into Hadoop: either directly into HDFS, or into Hive.
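
A hedged sketch of a typical import, here driven through Sqoop 1’s Java entry point (org.apache.sqoop.Sqoop.runTool); the JDBC URL, credentials, table, and target directory are placeholders:

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
      public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` on the command line.
        // Connection details and table name are illustrative placeholders.
        int exitCode = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/etl/orders"   // destination in HDFS
        });
        System.exit(exitCode);
      }
    }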

Flume

  • A tool to import streaming flows of log and event data directly into HDFS.
  • An efficient service for collecting, aggregating, and moving large amounts of log data.
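
Flume agents are wired together in a properties file rather than in code. A minimal sketch, assuming an agent named agent1 that tails an application log into HDFS (all names and paths are illustrative):

    # Name the source, channel, and sink of this agent.
    agent1.sources = tail1
    agent1.channels = mem1
    agent1.sinks = hdfs1

    # Source: follow a log file (command and path are illustrative).
    agent1.sources.tail1.type = exec
    agent1.sources.tail1.command = tail -F /var/log/app/app.log
    agent1.sources.tail1.channels = mem1

    # Channel: buffer events in memory between source and sink.
    agent1.channels.mem1.type = memory

    # Sink: write events into HDFS.
    agent1.sinks.hdfs1.type = hdfs
    agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
    agent1.sinks.hdfs1.channel = mem1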

Chukwa

  • An open-source data collection system for monitoring large distributed systems.
  • Built on top of HDFS and MapReduce.

Data Serialization

Avro

  • A data-serialization framework.
  • Primarily used in Hadoop both as a serialization format for persistent data and as a wire format for communication between Hadoop nodes, and from client programs to Hadoop services.
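
A minimal sketch with Avro’s generic Java API: parse a schema, build a record against it, and serialize the record to Avro’s compact binary format (the User schema is illustrative):

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroExample {
      public static void main(String[] args) throws Exception {
        // An illustrative schema: a "User" record with two fields.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Build a record conforming to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize it to Avro's binary wire format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        System.out.println("Serialized to " + out.size() + " bytes");
      }
    }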

Configuration and Coordination

ZooKeeper

  • A tool for configuration management and coordination of the computing nodes in a cluster.
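
A minimal sketch of that idea with the standard ZooKeeper Java client: publish a configuration value under a znode and read it back (connection string, timeout, and paths are illustrative):

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperConfigExample {
      public static void main(String[] args) throws Exception {
        // Connect to the ensemble and wait for the session to be established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
          if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
            connected.countDown();
          }
        });
        connected.await();

        // Publish a configuration value as a znode.
        byte[] value = "max.connections=100".getBytes(StandardCharsets.UTF_8);
        zk.create("/app-config", value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any node in the cluster can now read (and watch) the same value.
        byte[] read = zk.getData("/app-config", false, null);
        System.out.println(new String(read, StandardCharsets.UTF_8));

        zk.close();
      }
    }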

Workflow

Oozie

  • An orchestration and workflow management tool that manages workflows and their dependencies, removing the need for developers to code custom solutions.
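
A hedged sketch of submitting a workflow through the Oozie Java client (OozieClient), assuming a workflow.xml already deployed to HDFS; the server URL, application path, and parameters are placeholders:

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;

    public class OozieSubmitExample {
      public static void main(String[] args) throws Exception {
        // Oozie server URL is illustrative.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Point at a workflow.xml already deployed to HDFS (path is illustrative).
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/wordcount-wf");
        conf.setProperty("inputDir", "/user/etl/input");    // workflow parameters
        conf.setProperty("outputDir", "/user/etl/output");

        // Submit and start the workflow; Oozie manages the step dependencies.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
      }
    }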

Deployment, Monitoring and Administration

Ambari

  • A tool to help system administrators deploy and configure Hadoop, upgrade clusters, and monitor services. Through its API it can be integrated with other system management tools.
  • Developed by Hortonworks.

Whirr

  • A tool for cloud-agnostic deployment of clusters; it offers a way of running services, including Hadoop, on cloud platforms.
  • Currently supports the Amazon EC2 and Rackspace services.

Machine Learning

Mahout

  • A library of machine learning and data mining algorithms.
  • Use cases include collaborative filtering, user recommendations, clustering, and classification.
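
A minimal sketch of a user-based recommender with Mahout’s Taste API (the classic 0.x interface); the ratings file, neighborhood size, and user ID are assumptions:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MahoutRecommenderExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv holds userID,itemID,rating lines (file name is illustrative).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Compare users by the correlation of their ratings.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Consider each user's 10 most similar users (neighborhood size assumed).
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommend 3 items for user 42 based on what similar users liked.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " score=" + item.getValue());
        }
      }
    }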

Databases

HBase

  • A column-oriented database scaling to billions of rows.
  • Runs on top of HDFS for rapid data access.
  • MapReduce can use HBase as both a source and a destination for its computations.
  • Hive and Pig can be used in combination with HBase.
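
A minimal sketch with the HBase Java client (the Connection/Table API), assuming a users table with an info column family already exists:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        // Reads the cluster location from hbase-site.xml on the classpath.
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {

          // Write one cell: row "user1", column family "info", qualifier "name".
          Put put = new Put(Bytes.toBytes("user1"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
          table.put(put);

          // Random read of the same row, by key.
          Result result = table.get(new Get(Bytes.toBytes("user1")));
          byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
          System.out.println("name = " + Bytes.toString(name));
        }
      }
    }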