In this series of posts we explore how to monitor computers with Big-Data technologies.
The biggest misunderstanding about Big Data is that there is NOT one technology, a unique tool or
silver bullet, to manage a large set of data; rather, there are many technologies grouped under the name Big Data.
Every single piece of technology is used to manage a single aspect.
For this reason I normally prefer the terminology Big-Data technologies instead of simply Big Data.
Objective
Create a monitoring system for a remote Linux machine and show the information on a dashboard, obviously using Big-Data technologies.
Simple, isn't it? Oh no... there is a lot of stuff involved.
Let's explore a possible solution.
Analysis
Data source
The simplest way to monitor a system is to collect its message logging. In computing there is a standard for distributing log messages called syslog. In this project a remote machine sends its log messages (kernel, auth, ...) over the network, and the Big-Data server is the system that stores them.
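In practice rsyslog on the remote machine does this forwarding; purely as a hedged sketch of the idea, Python's standard SysLogHandler can ship a log line with the syslog protocol over the network. The collector hostname and the UDP port 514 below are assumptions.

```python
# Minimal sketch: emit a log message with the syslog protocol over the network.
# The collector host/port are assumptions (rsyslog usually forwards on UDP 514).
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address=("bigdata-server.example.com", 514))
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user login succeeded")  # sent as a syslog datagram over the network
```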
Data collection
In order to collect and keep the syslog data there are mainly two options:
omhdfs (Hadoop Filesystem Output Module): an rsyslog module that supports writing messages into files on Hadoop HDFS;
flume: "Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application." (flume website)
We will not use omhdfs because it is not flexible (you cannot change how the data is stored or preprocessed) and it is strongly tied to the Hadoop version: if Hadoop is upgraded to a new major release, the rsyslog module has to be recompiled.
For these reasons, in this project we will use flume with a dedicated agent, a JVM process that hosts the components through which events flow from an external source to the next destination, that receives syslog events and stores the information into HDFS.
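To make the source-to-sink flow concrete, here is a toy Python stand-in for the agent, not Flume itself: it listens for syslog datagrams the way a Flume syslog source would and appends them to a local file in place of the HDFS sink. The listening port and the output file name are assumptions.

```python
# Toy illustration of the agent's flow (source -> sink); this is NOT Flume.
# A real Flume agent would wire a syslog source, a channel and an HDFS sink.
import socket

LISTEN_PORT = 5140           # assumed port where the agent receives syslog events
OUTPUT_FILE = "events.log"   # local file standing in for the HDFS sink

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", LISTEN_PORT))

with open(OUTPUT_FILE, "a") as sink:
    while True:
        data, addr = sock.recvfrom(65535)  # one syslog event per datagram
        sink.write(data.decode(errors="replace").rstrip() + "\n")
        sink.flush()
```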
Storage
We need to store data in a way that makes it easy to query, scalable and distributed. In this post we will explore two ways of storing data:
Hadoop HDFS (Hadoop Distributed File System): the information retrieved from flume will be saved as plain text files;
elasticsearch: a distributed, scalable and highly available real-time search and analytics engine. It will be used not only for storing data but also for indexing and querying data via a RESTful API (see the indexing sketch after this list).
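As a hedged sketch, this is roughly how one parsed syslog event could be stored in elasticsearch through its RESTful API. The host, the index name, the document fields and the `_doc` endpoint (which depends on the elasticsearch version) are all assumptions.

```python
# Minimal sketch: index one parsed syslog event via the elasticsearch REST API.
# Host, index name and document layout are assumptions; the endpoint may differ
# between elasticsearch versions.
import requests

event = {
    "timestamp": "2015-03-01T10:15:00Z",
    "host": "remote-linux-01",
    "facility": "auth",
    "message": "Accepted password for user from 192.168.1.10",
}

resp = requests.post("http://localhost:9200/syslog/_doc", json=event)
resp.raise_for_status()
print(resp.json()["_id"])  # id assigned by elasticsearch to the indexed event
```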
Processing
In order to retrieve and query the stored data we will use:
Hive: to query the data stored in HDFS in a SQL-like way (many programmers know SQL, a lingua franca for querying data);
elasticsearch: via its RESTful API (both query paths are sketched after this list).
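As a hedged sketch of both query paths, the snippet below runs an SQL query against Hive through the third-party PyHive client and a full-text search through the elasticsearch RESTful API. The table name `syslog_events`, the index name, the fields and the Thrift port 10000 are hypothetical.

```python
# Two hedged query sketches: SQL through Hive (PyHive, assumed installed) and a
# full-text search through the elasticsearch REST API. Names are hypothetical.
import requests
from pyhive import hive

# --- Hive: query the plain-text files stored in HDFS with SQL ---
conn = hive.Connection(host="localhost", port=10000, database="default")
cur = conn.cursor()
cur.execute("SELECT host, COUNT(*) FROM syslog_events GROUP BY host")
for host, total in cur.fetchall():
    print(host, total)

# --- elasticsearch: full-text search over the indexed events ---
query = {"query": {"match": {"message": "error"}}, "size": 10}
resp = requests.post("http://localhost:9200/syslog/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["message"])
```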
Visualization
To get more value from the collected data we will use kibana, a flexible analytics and visualisation platform.
Architecture
I prepared a high level diagram to describe this solution:
Repository
All source code is available on GitHub.
Post
So I will split this series into three parts:
Part One: Set up Virtual Machine
Part Two: Retrieve and query the remote linux machine message logging
Part Three: Create a monitoring system for a remote linux machine and show the informations with a dashboard