Scope of the Case Study
Streamlining and Enhancing Nagios XI, NNA, and NLS Platforms
Introduction
The client, a global leader in audio equipment, required a robust monitoring solution to support its large-scale infrastructure spanning three continents: the USA, Europe, and Asia. Tasked with managing over 10,000 devices, the client relied on Nagios XI, Nagios Network Analyzer (NNA), and Nagios Log Server (NLS) to monitor its complex IT ecosystem. Despite the capabilities of these tools, inefficient configurations, false alerts, and a lack of advanced monitoring posed significant challenges.
To address these issues, Tetra undertook a comprehensive overhaul of the client’s Nagios environment, delivering streamlined processes, automation, plugin enhancements, and robust monitoring capabilities.
Challenges Faced
- High Volume of False Alerts: Inefficient configurations led to numerous false and unknown alerts, hindering operational efficiency.
- Lack of Advanced Monitoring: Existing plugins could not cater to advanced monitoring needs such as kernel-level metrics, CRC errors, and AWS-specific data.
- Complex Infrastructure Changes: Frequent changes to infrastructure caused gaps in monitoring dashboards and missing hosts/services.
- Global Distributed Architecture: Managing a distributed Nagios setup across three data centres with various device types added complexity.
- Restricted Access Monitoring: Monitoring secured servers behind firewalls, like CDE servers, posed technical and procedural challenges.
Objectives of the Overhaul
- Streamline and enhance the Nagios XI, NNA, and NLS platforms.
- Develop custom plugins for advanced monitoring.
- Automate routine processes like AWS instance deployment.
- Provide a secure and scalable environment.
- Build comprehensive dashboards and SOPs for seamless operations.
Key Solutions Delivered by Tetra
1. Streamlining Existing Platforms
- Conducted a thorough gap analysis to identify inefficiencies in the Nagios XI, NNA, and NLS setups.
- Reduced false alerts by optimizing configurations and workflows.
- Improved the usability of Nagios Network Analyzer and Log Server for real-time insights.
2. Advanced Monitoring Capabilities
Tetra developed plugins for:
- Network Monitoring: CRC/input errors, fan/power supply status, BGP link and neighbour statuses, OSPF link monitoring, and VPN tunnels.
- Windows Monitoring: RAID alarms, cluster services, LDAP and DNS response checks, and critical event detection for Windows servers.
- AWS Monitoring: ELB services, HTTP status codes, disk operations, auto-scaling group sizes, and EC2 performance metrics.
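As a rough illustration of what a plugin of this kind involves, the sketch below evaluates an interface CRC error counter against thresholds and produces the standard Nagios plugin result: an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of output with performance data after the pipe character. The function name, thresholds, and counter values are hypothetical and are not taken from the plugins Tetra delivered; a real plugin would also poll the device (for example over SNMP) to obtain the counters.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_crc_errors(previous, current, warn, crit):
    """Return (exit_code, output_line) for the CRC error delta between two polls.

    previous/current are raw counter readings (hypothetical inputs);
    warn/crit are thresholds on the number of new errors since the last poll.
    """
    if previous is None or current is None or current < previous:
        # Missing data or a counter reset: report UNKNOWN rather than guess.
        return UNKNOWN, "CRC UNKNOWN - counter unavailable or reset"

    delta = current - previous
    if delta >= crit:
        state = CRITICAL
    elif delta >= warn:
        state = WARNING
    else:
        state = OK

    # Text before '|' is for operators; text after it is performance data
    # in the label=value;warn;crit form that Nagios can graph.
    line = (
        f"CRC {STATE_NAMES[state]} - {delta} new errors "
        f"| crc_errors={delta};{warn};{crit}"
    )
    return state, line
```

A wrapper script would print the returned line and exit with the returned code, which is how Nagios reads a plugin's result. Returning UNKNOWN only when the data is genuinely unusable, instead of on every transient hiccup, is one way a plugin can avoid adding to the volume of unknown alerts.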
3. Automation and Processes
- Automated AWS instance deployment for streamlined resource management.
- Created SOPs for infrastructure changes, troubleshooting, and onboarding new devices.
- Documented processes to empower client teams to manage ongoing enhancements independently.
4. Customized Dashboards and BPI Integration
- Built traditional dashboards for the client’s OCC team and BPI-enabled dashboards for management insights.
- Introduced parent-child device relationships for enhanced correlation of alerts and events.
- Designed secure, hardened environments for dashboards and reports.
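The value of parent-child relationships for alert correlation can be shown with a small model. In Nagios, a host whose check fails is treated as DOWN if at least one of its parents is up (or it has no parents), and as UNREACHABLE if every parent is itself down or unreachable. The sketch below reproduces that rule over a hypothetical three-device topology; the host names and the `classify_hosts` helper are illustrative assumptions, not part of the client's configuration, and the parent graph is assumed to contain no cycles.

```python
def classify_hosts(parents, check_ok):
    """Map each host to "UP", "DOWN", or "UNREACHABLE".

    parents:  dict of host -> list of parent hosts (empty list means the
              host is reached directly from the monitoring server)
    check_ok: dict of host -> True if the host's own check succeeded
    """
    states = {}

    def resolve(host):
        if host in states:
            return states[host]
        if check_ok[host]:
            state = "UP"
        elif parents.get(host) and all(resolve(p) != "UP" for p in parents[host]):
            # Every path to this host is broken upstream, so its failure
            # is a symptom rather than a root cause.
            state = "UNREACHABLE"
        else:
            state = "DOWN"
        states[host] = state
        return state

    for host in check_ok:
        resolve(host)
    return states


# Hypothetical topology: monitoring server -> core-router -> edge-switch -> app-server
topology = {"core-router": [], "edge-switch": ["core-router"], "app-server": ["edge-switch"]}
outage = {"core-router": False, "edge-switch": False, "app-server": False}
print(classify_hosts(topology, outage))
```

In this example only the router is reported as DOWN; the switch and server are UNREACHABLE, so operators can be notified about one root cause instead of three separate failures.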
5. Integration with External Tools
- Enabled seamless integration with the Cherwell ticketing tool for automated incident reporting.
- Incorporated database and application monitoring based on data availability.
Architecture Overview
The project featured a distributed architecture:
- Three Data Centres: USA, Europe, and Asia.
- Devices Deployed: Over 10,000 devices, including System X servers, PowerEdge servers, Brocade NetIron switches, Palo Alto firewalls, Aruba controllers, and Cisco ASR routers.
- Mod-Gearman Servers: For UAT and test development platforms.
Key Components and Monitoring Areas
1. Network Monitoring
- Errors: CRC, media, crypto, and hardware-related module errors.
- Protocols: VPN tunnels, BGP/OSPF statuses, and neighbour relationships.
- Performance Metrics: Bandwidth, top talkers, and historical data.
2. Windows Monitoring
- Physical Hardware: RAID alarms, HBA, and battery statuses.
- Roles and Services: LDAP, DNS, Hyper-V, and cluster services.
- Alerts: Critical events and disk/CPU utilization.
3. AWS Monitoring
- ELB Services: HTTP codes, load balancer capacity, and connection metrics.
- Target Groups: Healthy hosts and backend connection statuses.
- Auto Scaling Groups: Desired capacity and instance statuses.
- EC2 Instances: Disk reads/writes, network I/O, and status checks.
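To make the Auto Scaling Group item concrete, here is a minimal sketch of how a check might compare desired capacity with the number of healthy, in-service instances. It assumes the input dictionary has the shape of one entry in the `AutoScalingGroups` list returned by the AWS `describe_auto_scaling_groups` call (`DesiredCapacity`, and `Instances` with `LifecycleState` and `HealthStatus`). The `check_asg` name, the thresholds, and the sample group are hypothetical and do not describe the plugin actually delivered.

```python
def check_asg(group, warn_missing=1, crit_missing=2):
    """Return (exit_code, message) for one Auto Scaling Group description.

    warn_missing / crit_missing are illustrative thresholds on how many
    instances short of the desired capacity the group may be.
    """
    desired = group["DesiredCapacity"]
    healthy = sum(
        1
        for inst in group.get("Instances", [])
        if inst.get("LifecycleState") == "InService"
        and inst.get("HealthStatus") == "Healthy"
    )
    missing = max(desired - healthy, 0)

    if missing >= crit_missing:
        code, label = 2, "CRITICAL"
    elif missing >= warn_missing:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"

    name = group.get("AutoScalingGroupName", "unknown")
    message = (
        f"ASG {label} - {name}: {healthy}/{desired} in service "
        f"| in_service={healthy} desired={desired}"
    )
    return code, message


# Hypothetical sample: three instances desired, one still launching.
sample_group = {
    "AutoScalingGroupName": "web-asg",
    "DesiredCapacity": 3,
    "Instances": [
        {"InstanceId": "i-a", "LifecycleState": "InService", "HealthStatus": "Healthy"},
        {"InstanceId": "i-b", "LifecycleState": "InService", "HealthStatus": "Healthy"},
        {"InstanceId": "i-c", "LifecycleState": "Pending", "HealthStatus": "Healthy"},
    ],
}
print(check_asg(sample_group))
```

Keeping the evaluation separate from the AWS API call means the threshold logic can be tested with canned data, and the same function can be reused whether the description comes from a live query or a cached report.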
Results Achieved
- Improved Monitoring Accuracy: Reduced false and unknown alerts by 80%, enhancing system reliability.
- Advanced Monitoring: Enabled proactive detection of critical issues through new plugins and integrations.
- Automation and Efficiency: Streamlined AWS instance deployment and operational workflows, saving valuable time.
- Enhanced Insights: Comprehensive dashboards provided real-time data and actionable intelligence.
- Global Cohesion: Unified monitoring across all data centres ensured consistency and operational transparency.
Conclusion
Tetra’s Nagios overhaul for this client is a testament to the transformative power of tailored IT solutions. By addressing inefficiencies, enabling advanced monitoring, and providing scalable automation, Tetra delivered a robust and future-ready monitoring ecosystem for the client’s global operations.
Partner with Tetra to redefine your IT monitoring strategy with Nagios. Let us help you achieve seamless, proactive, and efficient infrastructure management.