
Telecom Fault Management System (netAICE)

Large telecommunications networks comprise a variety of systems and components: routers, switches, wireless access points, network interface controllers, and modems. A major telco’s network contains hundreds of thousands of network elements, connected by cabling long enough to wrap around the Earth several times.

The complexity lies not only in scale but also in the presence of tens or even hundreds of device manufacturers in a single telecom network. When in tune, these networks can perform amazing tasks. However, when things go wrong, problem resolution can be daunting. And because a communications service provider’s (CSP’s) reputation depends directly on the health of its network, every telecom company does its best to ensure uninterrupted service.

[Diagram: the CyberVision NetAICE alarm correlation process and the approximate results at each stage]


Any fault, whether a cable outage or an unavailable server, can affect neighboring devices, which in turn affect their neighbors, and so on. As a result, a large network segment may suffer serious outages. Even a minor glitch may lead to SLA violations and revenue loss. When problems occur, affected network devices trigger alarm messages to alert network operators to the malfunction. These alarms often flood management stations, making it difficult for operators to process them and take corrective action. According to IEEE’s analytic reports, a typical mid-size or large network can generate event floods of more than 50 alarms per second.

The challenge is to efficiently isolate the specific fault at the root of the alarm storm. The size and complexity of today’s networks make the level of human intervention required for this task prohibitively high. Instead, telecom companies are increasingly turning to powerful Fault Management systems to perform root cause analysis.

For over a decade, CyberVision has been supplying solutions to the telecom industry. CyberVision’s NetAICE (Artificial Intelligence Correlation Engine) system takes an innovative approach to alarm correlation and root cause analysis. Built on our Artificial Intelligence-based correlation engine, NetAICE delivers superior root cause analysis, offering the following benefits:

  • Faster determination of root cause by applying different methods for correlating alarms;
  • Grouping alarms by root cause;
  • Minimizing false alarms generated by Trouble Ticketing systems by tracking interconnections between network elements and suppressing error messages from incidentally affected devices.

A significant advantage of CyberVision’s NetAICE is our “Enhanced Impact Analysis Module” (EIAM). When faults occur, this module calculates the possible consequences and predicts the future state of network elements. EIAM also helps estimate a problem’s severity and its location in the topology, and helps plan the steps for resolving it. EIAM is especially useful for identifying possible SLA violations.
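As a rough illustration of what impact analysis involves, the sketch below walks a network topology to list every element downstream of a failed one, then looks up which services depend on those elements. It is a minimal sketch and not a description of how EIAM works internally: the topology dictionary, the function names `impacted_elements` and `services_at_risk`, and the service names are all hypothetical.

```python
from collections import deque


def impacted_elements(downstream, failed):
    """Return {element: hop distance} for everything downstream of `failed`.

    `downstream` is a hypothetical adjacency map: element -> list of the
    elements that depend on it. A breadth-first walk visits each once.
    """
    distance = {failed: 0}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, ()):
            if child not in distance:
                distance[child] = distance[node] + 1
                queue.append(child)
    del distance[failed]  # report only the consequences, not the fault itself
    return distance


def services_at_risk(impacted, services_by_element):
    """List the services (for example, SLA-backed ones) on impacted elements."""
    return sorted(
        {svc for element in impacted for svc in services_by_element.get(element, ())}
    )


# Hypothetical three-tier topology and service placement.
TOPOLOGY = {
    "core": ["agg1", "agg2"],
    "agg1": ["acc1", "acc2"],
    "agg2": ["acc3"],
}
SERVICES = {"acc1": ["gold-vpn"], "acc3": ["voice"]}

impact = impacted_elements(TOPOLOGY, "agg1")
print(impact, services_at_risk(impact, SERVICES))
```

The hop distance gives a crude measure of how far a fault can spread. A real module would also have to account for redundancy and rerouting, which this sketch ignores.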

The core of an industrial Fault Management system is its correlation engine, which is responsible for identifying dependencies between alarms and for filtering out spurious alarms. When performing ideally, a correlation engine sets the stage for fast and accurate determination of root cause. More typically, a correlation engine generates hundreds of misleading reports that obscure the actual problem. In that case, determining root cause often requires reviewing every alarm or testing manually.

The majority of existing Fault Management systems on the market use only a few alarm correlation methods, correlating on average only 15% of incoming alarms. The advantage of CyberVision’s NetAICE solution is that it combines the four most effective correlation methods, reducing the number of alarms by 70% to 90% while processing as many as 100 alarms per second.

The four most effective correlation methods:

  • Non-topology correlation
  • Topological correlation
  • Artificial neural networks
  • Bayesian Belief Networks and subsidiary non-deterministic methods

Each method is activated as needed, depending on the alarm type, severity level, uncertainty level, and so on. During non-topology correlation, CyberVision’s NetAICE compiles a preliminary list of alarms by discarding those deemed irrelevant or inessential. The remaining alarms are sorted and aggregated according to their parameters and rule sets.
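The filter-then-aggregate step described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not NetAICE’s rule engine: the `Alarm` fields, the severity scale, the ignored alarm types, and the grouping key are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    device: str
    alarm_type: str
    severity: int  # hypothetical scale: 1 = informational ... 5 = critical


# Hypothetical rule set: alarm types to discard and a minimum severity.
IGNORED_TYPES = {"HEARTBEAT", "CONFIG_SAVED"}
MIN_SEVERITY = 2


def filter_and_aggregate(alarms):
    """Discard irrelevant alarms, then group the rest by (device, type)."""
    groups = defaultdict(list)
    for alarm in alarms:
        if alarm.alarm_type in IGNORED_TYPES or alarm.severity < MIN_SEVERITY:
            continue  # deemed irrelevant by the rule set
        groups[(alarm.device, alarm.alarm_type)].append(alarm)

    summary = [
        {
            "device": device,
            "alarm_type": alarm_type,
            "count": len(members),
            "max_severity": max(a.severity for a in members),
        }
        for (device, alarm_type), members in groups.items()
    ]
    # Most severe groups first, then the largest, then a stable name order.
    summary.sort(
        key=lambda s: (-s["max_severity"], -s["count"], s["device"], s["alarm_type"])
    )
    return summary


SAMPLE_ALARMS = [
    Alarm("r1", "LINK_DOWN", 5),
    Alarm("r1", "LINK_DOWN", 5),
    Alarm("r1", "HEARTBEAT", 1),
    Alarm("s2", "HIGH_TEMP", 3),
    Alarm("s3", "FAN_NOTICE", 1),
]

for row in filter_and_aggregate(SAMPLE_ALARMS):
    print(row)
```

In this sample, five raw alarms collapse into two aggregated entries, which is the kind of preliminary list the later correlation stages would work from.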

CyberVision’s NetAICE system applies topological correlation as its primary correlation method, which takes into account the relative locations and interconnections of network elements. In modern, complex networks this mechanism has proven indispensable because it can directly track the current state of network devices. Topology-based analysis successfully replaces complicated rule sets and, in the process, increases the number of accurately correlated alarms in a large network by up to several times.
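One simple form of topological correlation is to group each alarmed device under the highest alarmed device directly upstream of it, on the assumption that a failed upstream device explains the alarms below it. The sketch below illustrates that idea only. The `upstream` map, the device names, and the function `correlate_by_topology` are hypothetical, and a tree-shaped topology with a single upstream per device is assumed.

```python
def correlate_by_topology(upstream, alarmed):
    """Group alarmed devices by probable root cause.

    `upstream` maps each device to its single upstream neighbor (or None).
    A device's probable root cause is found by climbing the topology while
    the next upstream device is also alarmed; the climb stops at the first
    healthy device.
    """
    alarmed = set(alarmed)
    groups = {}
    for device in sorted(alarmed):
        root = device
        visited = {device}  # guards against accidental loops in the map
        parent = upstream.get(root)
        while parent in alarmed and parent not in visited:
            root = parent
            visited.add(parent)
            parent = upstream.get(root)
        groups.setdefault(root, []).append(device)
    return groups


# Hypothetical three-tier topology: access -> aggregation -> core.
UPSTREAM = {
    "core": None,
    "agg1": "core",
    "agg2": "core",
    "acc1": "agg1",
    "acc2": "agg1",
    "acc3": "agg2",
}

print(correlate_by_topology(UPSTREAM, ["agg1", "acc1", "acc2", "acc3"]))
```

Here four alarms reduce to two probable root causes: `agg1`, which explains the alarms on `acc1` and `acc2`, and `acc3`, whose upstream device is healthy. Real networks have redundant paths, so a production system would need a richer graph model than this single-parent map.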

CyberVision’s NetAICE system uses two additional alarm correlation methods to complement non-topology and topological correlation: Bayesian Belief Networks with related non-deterministic algorithms, and Neural Networks. In practice, a large number of alarms slip past rule-based methods because rule sets are incomplete and cannot anticipate every possible scenario. Moreover, topology information may be inaccurate or incomplete for certain network segments, making direct topology analysis ineffective. Non-deterministic algorithms such as Bayesian Belief Networks and Neural Networks are well suited to situations where alarms cannot be processed using rule-based and topological correlation.

A Bayesian Belief Network is a mathematical technique for representing probabilistic relationships between network faults and their possible sources. To identify the root cause of a problem, it takes into account the relationships between network elements and calculates the most probable cause.
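To make the idea concrete, the sketch below applies Bayes’ rule to the simplest possible belief network: one hidden root cause with several alarms that depend on it. It assumes exactly one cause is active and that alarms are independent given the cause, which is a naive simplification rather than the model NetAICE uses. All probabilities, cause names, and alarm names are invented for illustration.

```python
def posterior_root_causes(priors, likelihoods, observed):
    """Return P(cause | observed alarms) for each candidate cause.

    priors:      {cause: P(cause)}
    likelihoods: {cause: {alarm: P(alarm raised | cause)}}
    observed:    set of alarm names actually seen

    Alarms that appear in no likelihood table are ignored.
    """
    known_alarms = {a for table in likelihoods.values() for a in table}
    scores = {}
    for cause, prior in priors.items():
        score = prior
        for alarm in known_alarms:
            p_alarm = likelihoods[cause].get(alarm, 0.0)
            # Multiply by P(alarm | cause) if seen, else by P(no alarm | cause).
            score *= p_alarm if alarm in observed else 1.0 - p_alarm
        scores[cause] = score

    total = sum(scores.values())
    if total == 0:
        raise ValueError("observed alarms are impossible under every cause")
    return {cause: score / total for cause, score in scores.items()}


# Hypothetical numbers, chosen only to show the mechanics.
PRIORS = {"fiber_cut": 0.5, "power_loss": 0.5}
LIKELIHOODS = {
    "fiber_cut": {"LOSS_OF_SIGNAL": 0.9, "LINK_DOWN": 0.8, "HIGH_TEMP": 0.1},
    "power_loss": {"LOSS_OF_SIGNAL": 0.1, "LINK_DOWN": 0.8, "HIGH_TEMP": 0.5},
}

print(posterior_root_causes(PRIORS, LIKELIHOODS, {"LOSS_OF_SIGNAL", "LINK_DOWN"}))
```

With these invented numbers, seeing a loss-of-signal alarm together with a link-down alarm makes a fiber cut far more probable than a power loss. A full belief network would chain such conditional tables across many interrelated elements.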

Finally, the Neural Network approach to alarm correlation is a particularly powerful feature of CyberVision’s NetAICE solution. Artificial Intelligence has proven to be an essential feature for managing next generation networks, especially in situations that involve a near-infinite range of scenarios and a changing network architecture. Neural Networks offer flexibility and can be trained to perform a variety of tasks. When rule-based analysis fails, Neural Networks can identify root cause alarms in cases of incomplete information and can learn new alarm patterns following network topology modifications.
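The learning idea can be shown with the smallest possible neural model: a single perceptron trained on labeled alarm patterns. This is a teaching sketch only, far simpler than any production network, and it does not represent NetAICE’s architecture. The feature layout, the training windows, and their labels are hypothetical.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, epochs=100, learning_rate=1.0):
    """Classic perceptron rule: nudge the weights after each wrong prediction."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            delta = label - predict(weights, bias, features)
            if delta != 0:
                weights = [
                    w + learning_rate * delta * x for w, x in zip(weights, features)
                ]
                bias += learning_rate * delta
                mistakes += 1
        if mistakes == 0:  # every training pattern is classified correctly
            break
    return weights, bias


# Hypothetical alarm windows. Features: [LOSS_OF_SIGNAL, LINK_DOWN, HIGH_TEMP]
# (1 = alarm present). Label: 1 if the window was diagnosed as a fiber cut.
TRAINING_WINDOWS = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 1], 0),
    ([0, 0, 1], 0),
    ([0, 1, 0], 0),
]

weights, bias = train_perceptron(TRAINING_WINDOWS)
for features, label in TRAINING_WINDOWS:
    print(features, "expected", label, "predicted", predict(weights, bias, features))
```

Retraining on fresh labeled windows after a topology change is the basic mechanism by which such a model would pick up new alarm patterns. A single perceptron can only separate patterns with a linear boundary, so realistic alarm correlation would need a multi-layer network.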

 

 

© 2011 CyberVision, Inc.
All Rights Reserved.