Big Data Solution in Enterprises

Advait Kulkarni, CTO, Digistic LLC

Introduction:

Data is the most valuable currency in today's digital era. Whether the goal is to gather internal operational process metrics, collect customer satisfaction scores, or gain insight into sales and marketing effectiveness, data is of prime importance. It is then used to drive continuous improvement and predictive planning: improving customer satisfaction, informing big decisions quickly, eliminating waste, and reducing risk in different areas. Because these data sets are too large and complex to be processed by standard tools, Big Data has emerged as the art and science of combining rapidly changing enterprise data, social data, and machine data of different varieties to derive new insights that would otherwise not be possible.

History of Big Data components:

The earliest and biggest entrant in the Big Data space is the Hadoop infrastructure. To make indexing of the immense amount of data generated by the web possible, Google created the MapReduce style of processing. MapReduce programming uses two functions: a map job that converts a data set into key/value pairs, and a reduce job that combines the outputs of the map job into a single result. This approach to problem solving was then adopted by developers working on Apache’s “Nutch” web-crawler project and developed into Hadoop.
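The map and reduce functions described above can be illustrated with a small, single-process word count. This is a minimal sketch in plain Python, not Hadoop's own API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["big data", "Big clusters"])` returns `{"big": 2, "data": 1, "clusters": 1}`. In a real cluster the map and reduce calls would run on many nodes at once; here they run one after another to keep the idea visible.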

Hadoop is made up of four modules; the two key modules provide storage and processing functions.

1. MapReduce: The MapReduce component supports the distributed computation of large projects by breaking the problem into smaller parts and combining the results to derive a final answer.

2. Hadoop Distributed File System (HDFS): This component supports the distributed storage of large data files. HDFS splits up the data and distributes it across the nodes in the Hadoop cluster. It creates multiple copies of the data for redundancy and reliability. If a node fails, the data is automatically read from one of the replicas. The data managed by HDFS can be either structured or unstructured, and it can support almost any format. Hadoop does not require the use of HDFS; other file systems such as Amazon S3 or the MapR File System can also be used with Hadoop.

3. Yet Another Resource Negotiator (YARN): Introduced in Hadoop 2.0, YARN provides scheduling services and manages the Hadoop cluster’s computing resources. Through YARN’s features, Hadoop can run other frameworks besides MapReduce. This has extended Hadoop’s functionality so that it can support real-time, interactive computations on streaming data in addition to batch job processing.

4. Hadoop Common: The library and utilities that support the other three modules.
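The way HDFS splits, replicates, and recovers data, as described in the list above, can be sketched as a toy model. This is only an illustration under simplifying assumptions: the helper names are hypothetical, replicas are placed round-robin (real HDFS placement is rack-aware), and nothing here calls an actual HDFS API.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin stands in for the real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

def read_block(block_id, placement, failed_nodes):
    """Return the first live node holding the block, skipping failed ones."""
    for node in placement[block_id]:
        if node not in failed_nodes:
            return node
    raise IOError(f"all replicas of block {block_id} are unavailable")
```

With `placement = place_replicas(3, ["n1", "n2", "n3", "n4"], replication=2)`, block 0 lives on `n1` and `n2`, so `read_block(0, placement, {"n1"})` falls back to `n2` when `n1` has failed.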

Hadoop can run on a single machine, which is useful for experimenting with it, but normally Hadoop runs in a cluster configuration. Clusters can range from just a few nodes to thousands of nodes. When the data is managed by HDFS, there is a master NameNode that holds the file index. The data is stored on slave DataNodes. Due to the introduction of YARN, the scheduling processes that run on the nodes differ between Hadoop 1 and Hadoop 2. Hadoop 1 managed resources with the JobTracker and TaskTracker processes on the master and slave nodes respectively. In Hadoop 2, they are replaced by YARN’s ResourceManager and NodeManager daemons, together with a per-application ApplicationMaster.

Benefits and Limitations of Using Hadoop

Hadoop provides a number of advantages for solving Big Data problems:

• Hadoop is cost-effective: The best part of Hadoop is that commodity hardware is used to achieve large-capacity storage plus high availability and fault tolerant computing.

• Hadoop solves problems efficiently: The efficiency comes partly from using multiple nodes to work on the problem’s parts in parallel, and partly from performing computation on the storage nodes, which eliminates delays due to moving data from storage to compute nodes. Because most data is processed where it is stored rather than moved between servers, the volume is less likely to overload the network.

• Hadoop is extensible: Servers can be added dynamically, and each machine added provides an increase in both storage and compute capacity.

• Hadoop is flexible: Although most commonly used to run MapReduce, it can be used to run other applications, as well. It can handle any type of data, structured or unstructured.
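The efficiency point in the list above, computing where the data is stored and moving only small results, can be made concrete with a toy job. This is a sketch under stated assumptions: the node names and the functions `local_aggregate`, `merge_summaries`, and `run_job` are hypothetical, and the per-node work runs sequentially here rather than in parallel.

```python
from collections import Counter

def local_aggregate(records):
    """Runs on the node that stores the records; returns a small summary."""
    return Counter(records)

def merge_summaries(summaries):
    """Runs on one node; combines the per-node summaries."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

def run_job(partitions):
    """`partitions` maps a node name to the records stored on that node."""
    summaries = [local_aggregate(records) for records in partitions.values()]
    shipped = sum(len(s) for s in summaries)            # entries sent over the network
    stored = sum(len(r) for r in partitions.values())   # raw records left in place
    return merge_summaries(summaries), shipped, stored
```

Given `{"node1": ["error", "ok", "ok"], "node2": ["ok", "ok", "error", "ok"]}`, the job counts 5 `ok` and 2 `error` records while shipping only 4 summary entries instead of all 7 raw records. The gap widens as the number of records per node grows.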

Hadoop may not be needed for all purposes. Problems with smaller data sets can most likely be solved more easily with traditional methods. HDFS was intended to support write-once, read-many operations, and may not work for applications that need to update data in place.

Big Data Ecosystem

Hadoop is typically used in conjunction with several other Apache products to form a complete analytics processing environment. These products include:

• Pig: Pig is a scripting language that makes the data manipulations commonly needed for analytics (the ETL operations: extract, transform, and load) easy. Scripts written in “Pig Latin” get turned into MapReduce jobs.

• Hive: Hive provides a query language, similar to SQL, which can interface with a Hadoop-based data warehouse.

• Oozie: Oozie is a job scheduler that can be used to manage Hadoop jobs.

• Sqoop: Sqoop provides tools for moving data between Hadoop and relational databases.

For users who find creating and supporting their own Hadoop environment too complex, there are several vendors who provide supported versions that make getting started easier. They also provide enhanced services that make Hadoop enterprise-ready. Some vendors include:

• Amazon Web Services: AWS can rapidly provision a Hadoop cluster, add resources to it, and provide administrative support as a managed service. Other products often used with Hadoop, like Pig and Hive, can also be deployed.

• Google Cloud Platform: Similar to AWS, Google lets you rapidly provision Hadoop clusters and associated resources like Pig and Hive. Other Google offerings, like BigQuery and Cloud Datastore connectors, make it easy to access data stored in Google’s data warehouse and cloud storage services.

• Cloudera: One of the co-creators of Hadoop is now Cloudera’s Chief Architect. The firm offers a fully supported environment for enterprise Hadoop, including professional services.

• Hortonworks: Hortonworks offers Hortonworks Data Platform, based on Hadoop, to support enterprise data lakes and analysis, plus Hortonworks DataFlow for real-time data collection and analysis.

Competitors for Apache Hadoop

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. The advantages that Spark has over Hadoop are as follows:

According to the Spark project, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It supports multiple languages, including Scala, Java, Python and R; C# is available through the separate .NET for Apache Spark project. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and these libraries can be combined in the same application. It runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and Amazon S3.

Spark has the following components for processing data:

• Spark Streaming
• Spark SQL
• MLlib
• GraphX

Modern Trends:

For unstructured data of all kinds, Apache Hadoop continues to be the mainstay application component.

It is used for data acquisition, transformation, cleansing, and queryable archiving. However, when it comes to some of Hadoop’s core components, like MapReduce for massively parallel data processing and Hadoop Distributed File System (HDFS) for data storage, the trend is slowly moving towards Apache Spark and distributed object storage.
