How To Configure And Monitor A Hadoop Cluster




Hadoop started as an open-source Apache project and today is one of the most widely used frameworks for clustered data storage and processing. Data is distributed across networked nodes by the Hadoop Distributed File System (HDFS), which handles redundancy and scalability of data across nodes. Another core component is Hadoop YARN, a job-scheduling framework that executes data processing tasks across all nodes.

Today, anyone pursuing a data science certification will come across cluster-based processing in their training. So how does Hadoop work, and what steps should be taken to configure and monitor a Hadoop cluster?

Installing a Hadoop Cluster

Installing a Hadoop cluster typically involves either installing the software on every machine through a packaging system appropriate for your operating system, or unpacking the software distribution on every machine in the cluster.
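
As a minimal sketch of the unpacking route, assuming a tarball installation on a Linux host (the release version and download mirror below are illustrative, not prescribed by this article):

    # Download and unpack a Hadoop release on each machine (version/mirror are examples)
    wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
    tar -xzf hadoop-3.3.6.tar.gz -C /opt
    export HADOOP_HOME=/opt/hadoop-3.3.6
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin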

Usually, one machine in the cluster is designated as the NameNode and another as the ResourceManager; these are the masters. For other services (such as the Web App Proxy Server), dedicated hardware or shared infrastructure is used, depending on the load.

The other machines in the cluster act as both DataNodes and NodeManagers; in other words, the worker machines.
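
With the masters chosen, the worker hosts are usually listed one per line so the helper scripts know where to start DataNodes and NodeManagers; in Hadoop 3 this file is etc/hadoop/workers. A sketch with placeholder hostnames:

    # etc/hadoop/workers - hostnames below are placeholders
    worker1.example.com
    worker2.example.com
    worker3.example.com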

Configuring Hadoop In Non-Secure Mode

Hadoop's Java configuration is driven by two types of configuration files:

  • Read-only default configuration - core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
  • Site-specific configuration - etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml (a minimal example follows below).
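
As a minimal sketch of the site-specific files, assuming the NameNode runs on a host called namenode.example.com (the hostname, port, replication factor and directory are placeholders):

    <!-- etc/hadoop/core-site.xml: point clients and daemons at the NameNode -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:9000</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml: replication factor and NameNode storage directory -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/namenode</value>
      </property>
    </configuration>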

In addition to this, the scripts found in the bin/ directory of the distribution can be controlled by setting site-specific values through etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.
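
For example, hadoop-env.sh is typically where the JDK location, log directory and daemon heap size are set; a sketch with placeholder values:

    # etc/hadoop/hadoop-env.sh (paths and sizes are examples)
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_LOG_DIR=/var/log/hadoop
    export HADOOP_HEAPSIZE_MAX=4g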

For Hadoop cluster configuration, you will need to configure both the environment in which the Hadoop daemons execute and the configuration parameters for the daemons themselves.

The HDFS daemons are the NameNode, the SecondaryNameNode and the DataNodes. The YARN daemons are the ResourceManager, the NodeManagers and the WebAppProxy. If you decide to use MapReduce, the MapReduce Job History Server will also be run alongside them. For heavy-load installations, these daemons are generally run on separate hosts.
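
Once the configuration is in place, the daemons can be brought up individually; a sketch assuming Hadoop 3's per-daemon commands, each run as the appropriate user on its designated host:

    # One-time only: format the filesystem on the NameNode host
    hdfs namenode -format
    # Start HDFS daemons (NameNode host, then each worker)
    hdfs --daemon start namenode
    hdfs --daemon start datanode
    # Start YARN daemons (ResourceManager host, then each worker)
    yarn --daemon start resourcemanager
    yarn --daemon start nodemanager
    # Start the MapReduce Job History Server, only if MapReduce is used
    mapred --daemon start historyserver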

Monitoring the Hadoop Cluster

A very important thing you will be taught during data analytics courses is how to monitor Hadoop clusters. A wide selection of tools is available for analyzing and optimizing the performance of a Hadoop cluster, and they fall into three groups: open source, free but closed source, and commercial.

Open Source

Starting with open source, all of Hadoop’s components come packaged with their own administrative tools and interfaces, which can be used to collect cluster-wide performance data and metrics. The only challenge here is that collecting data from these disparate sources can be very time consuming and difficult.
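
For example, each daemon publishes its metrics as JSON over HTTP through a built-in /jmx endpoint; a minimal sketch in Python, assuming a NameNode web UI on Hadoop 3's default port 9870 and a placeholder hostname:

    # Poll the NameNode's JMX servlet for basic filesystem metrics.
    import json
    from urllib.request import urlopen

    # Hostname is a placeholder; 9870 is the NameNode's default HTTP port in Hadoop 3.
    URL = ("http://namenode.example.com:9870/jmx"
           "?qry=Hadoop:service=NameNode,name=FSNamesystemState")

    with urlopen(URL) as resp:
        beans = json.load(resp)["beans"][0]

    print("Live DataNodes:", beans["NumLiveDataNodes"])
    print("Capacity used (bytes):", beans["CapacityUsed"])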

A recommended and very useful Hadoop analysis and management tool is the Apache Ambari project. Ambari provides a well-designed graphical interface that helps you monitor the cluster with ease, and it acts as a single interface from which to provision, manage and monitor clusters of thousands of machines.

Nagios is also a powerful cluster monitoring system, and it monitors not only hosts and servers but also the interconnected devices in the cluster, such as switches and routers. Its prompt alerting allows for a fast response when system problems arise.

Free, Closed Source Tools

Of all the free but closed source tools available for cluster management, Cloudera Manager is considered the best. It has both a free and a paid version, but the free version works just as well for beginners and for clusters with lighter loads. Just like Ambari, it provides a good graphical interface for management.

You can try out both tools to see which one suits you and your clusters best.

Commercial Tools

The list of commercial tools goes on, but one suggested name is Datadog. Datadog is limited as a management interface, but when it comes to monitoring it covers more than just the Hadoop ecosystem: it can collect metrics from all the applications in your cluster and environment.

These are some of the basics of configuring and monitoring Hadoop clusters. A data science certification can teach you all you need to know about the processes to follow and what to be on the lookout for when configuring and monitoring large clusters.

About The Author

James Maningo
Data Scientist (Growth) at QuickStart

James is a stochastic tinkerer with over 8 years of experience in digital analytics. His passion lies in providing meaningful impact through data, using growth hacking techniques for business and "quantified self" for his personal life. His weapons of choice are Linux, Python, tmux+vim and good old common sense.