Channel: stackArmor OpenOpsIQ Blog » HDFS

Does real-time log data collection and analysis = monitoring?


The various stacks of an application, including the infrastructure, generate tons of log data. Traditionally, there has been a latency between the logging action (i.e. writing to a file) and the ability to analyze and act upon the data. This latency tends to vary depending on the criticality of the application/system. In order to meet SLAs and ensure the availability of applications, most IT Ops people tend to apply another layer of monitoring software that detects and alerts on critical events within a system. Overall, this leads to a costly and often brittle solution with multiple layers of software and system data silos. Key challenges with the existing logging and monitoring approaches are described below.

  1. Log Lag – because log data cannot be collected, analyzed and acted upon quickly, organizations add a separate monitoring layer, which introduces significant cost overhead. At times this monitoring layer is brittle and needs significant tuning to avoid noise.
  2. The Developer versus Ops Engineer Gap – many organizations still have a disconnect between the development team and the operations team. The Ops team is generally aware of and uses the log system, because its general use cases tend to be trending analysis, system behavior analysis etc. Developers, on the other hand, tend to go straight to the log on the system or application rather than to the centralized logging system, because real-time data is generally not captured there and because their knowledge of the application tells them which log to analyze for fault diagnosis. Overall, there is a disconnect between the Developers (who tend to ultimately get involved in solving production issues) and the Ops teams. While the DevOps process and methodology have done a lot for the production release process, the real “ops” process is still a disconnected silo.
  3. Data Anarchy – The sheer variety of logging data is just mind-boggling when one starts to look at the entire operations stack which includes application, systems software, infrastructure, network etc. On top of that you have other logs generated through vulnerability scans, anti-virus scans etc. Getting to a truly “unified event collection and logging system” is nearly impossible for most organizations. The lack of such a system drives up the maintenance costs and makes it difficult to truly know the “State of Operations” of the system.
  4. Management Overhead – Given some of the reasons described in the comments, the requirements for logging include performance, management, compliance etc. The log management process is fairly cumbersome and costly. Organizations struggle with transporting, aggregating, analyzing and acting upon this log data. These costs add up and include log management software, labor for log management, storage, network and compute. Overall, I believe that the logging and system monitoring process is broken and extremely costly, which adds to the general inefficiency in IT operations.

The diagram below captures the essence of the approach today.

Image

With the emergence of real-time streaming solutions like AWS Kinesis and Apache Storm, best practices now suggest implementing a unified real-time log data collection, analysis and presentation/alerting engine, which removes the need for a separate monitoring stack. The picture below shows a high-level schematic of the approach.

Image

What is really interesting is the emergence of real-time logging and event processing solutions that meld stream processing, highly available infrastructure and Hadoop/HDFS-enabled storage for heterogeneous formats. Examples include log collectors such as Apache Flume and Apache Kafka, streaming solutions such as Apache Storm and AWS Kinesis, and HDFS for storage.
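To make the idea of a unified collect/analyze/alert path a little more concrete, here is a minimal sketch in plain Python. It is not tied to Kafka, Storm or Kinesis; the log line format, the `Event` type, and the `parse` and `alerts` helpers are all assumptions made up for illustration. It consumes log lines as a stream, normalizes them into one event shape, and raises an alert when too many errors arrive within a sliding time window, which is roughly the job the separate monitoring layer does today.

```python
import re
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Assumed line format (hypothetical): "<epoch seconds> <source> <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d+(?:\.\d+)?) (?P<source>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$"
)


@dataclass
class Event:
    """One normalized log event, regardless of which layer produced it."""
    ts: float
    source: str
    level: str
    msg: str


def parse(line: str) -> Optional[Event]:
    """Turn a raw log line into an Event, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Event(float(m["ts"]), m["source"], m["level"], m["msg"])


def alerts(lines: Iterable[str], window_s: float = 60.0,
           threshold: int = 3) -> Iterator[str]:
    """Yield an alert when `threshold` ERROR events fall within `window_s` seconds."""
    recent = deque()  # timestamps of ERROR events still inside the window
    for line in lines:
        ev = parse(line)
        if ev is None or ev.level != "ERROR":
            continue
        recent.append(ev.ts)
        # Drop errors that have aged out of the sliding window.
        while recent and ev.ts - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= threshold:
            yield (f"ALERT at {ev.ts:.0f}: {len(recent)} errors in the last "
                   f"{window_s:.0f}s (latest from {ev.source})")
            recent.clear()  # reset so one burst produces one alert


# Hypothetical sample lines from different layers of the stack.
sample = [
    "100 web-1 INFO started",
    "110 web-1 ERROR db timeout",
    "120 app-2 ERROR db timeout",
    "125 net-1 WARN link flap",
    "130 web-1 ERROR db timeout",
    "400 web-1 ERROR disk full",
]

if __name__ == "__main__":
    for alert in alerts(sample):
        print(alert)
```

In a real deployment the `lines` iterable would be fed by a log collector or stream consumer rather than a list, and the alert would go to a notification channel instead of standard output; the point of the sketch is only that collection, analysis and alerting can sit on the same stream.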

Can the Cost of Systems Operations be reduced through more robust and real-time event data collection and processing?

If you are into systems management and operations, you may also like the following post:

http://peter.gillardmoss.me.uk/blog/2013/05/28/monitor-dont-log/


