NetScout vs. BMC
Photo: Tambako

If you’re like many of our customers, you have accumulated plenty of monitoring tools.  And with the variety of tools comes the challenge of managing all the data.  One approach is to integrate all of your data sources together, though the extra complexity often brings challenges of its own.

One customer recently asked us how to go about integrating NetScout nGenius data with BPPM for metric and service level reporting.  They had developed a fairly complex reporting solution based on Perl scripts and Excel spreadsheets, but the original developer(s) had since moved on to other things.  Meanwhile, their reporting needs had far outgrown the original scope.  They needed either to rewrite the custom solution for scalability or to replace it altogether with something off the shelf.  The goal was to provide weekly reports showing the business availability of the WAN links to various data centers around the globe after accounting for local business hours, planned outages, etc.

Since we had just finished helping them deploy BPPM and BPPM Reporting, we wondered if it would make sense to use that to replace their scripts and spreadsheets.  It would just be a matter of integrating their NetScout data into their BMC reporting platform.  Though there wasn’t a vendor-supported solution available from either BMC Software or NetScout, the tools from both vendors are fairly extensible.  Rapidly developing a prototype integration seemed doable.  Of course, it wasn’t quite as simple as it seemed…


NetScout vs. BMC

The first challenge we found was that while nGenius provides a handy Common Data Export (CDE) facility for dumping historical data for reporting purposes, BPPM requires data to be imported in real time.  There is no practical way to import historical data into BPPM, and the nGenius CDE isn’t really designed to be a real-time data pipeline.

That said, the nGenius CDE does offer enough granularity to select a limited snapshot of data, such as the last 15 minutes.  We reasoned that it should be possible to create a sufficiently real-time stream of data to meet the requirements of BPPM by polling every 15 minutes for the last 15 minutes of data.
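To make the polling idea concrete, here is a minimal Python sketch of how a poller might work out the most recent complete 15-minute window. The function names are ours, chosen for illustration, and the timestamp layout simply mirrors the start and end times in the example command shown later in this post.

```python
from datetime import datetime, timedelta


def last_complete_window(now, minutes=15):
    """Return (start, end) of the most recent fully elapsed window.

    Assumes `minutes` divides evenly into an hour.
    """
    # Round "now" down to the nearest window boundary.
    end = now.replace(minute=now.minute - now.minute % minutes,
                      second=0, microsecond=0)
    start = end - timedelta(minutes=minutes)
    return start, end


def fmt(ts):
    # Same "YYYY-MM-DD HH:MM:SS" layout as the example command's timestamps.
    return ts.strftime("%Y-%m-%d %H:%M:%S")


start, end = last_complete_window(datetime(2013, 6, 24, 10, 17, 42))
print(fmt(start), "->", fmt(end))
# prints: 2013-06-24 10:00:00 -> 2013-06-24 10:15:00
```

Aligning each window to a fixed boundary, rather than to "now minus 15 minutes", should keep consecutive polls from overlapping or leaving gaps.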

Timing is Everything

That’s when we discovered the next complication.  Querying nGenius took far longer than we had hoped: a single query for the status of one link took about 15 seconds.  That much overhead hardly matters when you’re pulling a week of data for a historical report, but it adds up quickly when you’re polling every 15 minutes.  And we had a lot more than one link to poll.

Without knowing the source of the bottleneck, we explored two approaches to the scaling issue, using a limited sample of around 20 WAN links: sequential queries and parallel queries.  When the queries were sent sequentially, waiting for a response to the previous query before sending the next, the entire process took about 5 minutes to complete.  When the queries were sent simultaneously (20 at once), the results were horribly unreliable, with many queries failing to return at all.  Plus, it took much longer overall from the first query until the final result came back, possibly due to collisions and timeouts.

So, with a best case of polling 20 WAN links every 5 minutes, the prospects were looking bleak.  We needed to monitor more than a thousand links.

Now what?

The eureka moment came after we discovered that the 15-second overhead was actually due to authentication handshaking.  Fortunately, the nGenius CDE facility provides a way to batch multiple queries together to be executed within a single authentication session.  The process isn’t very intuitive, but it works.

Batching the commands together in this way dramatically improved the scalability of the integration.  It turns out that, while each authentication exchange takes about 15 seconds, the actual query and response of metric data consumes only about 500 milliseconds.  So, instead of taking 5 minutes to query the statistics for 20 WAN links, we were able to query 500 WAN links in the same amount of time.
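The arithmetic behind that claim can be sketched in a few lines of Python. This is a simplified model built only on the rough figures quoted above (about 15 seconds per authentication, about 500 milliseconds per query); the function names are ours, and real timings would vary.

```python
AUTH_SECONDS = 15.0   # approximate authentication handshake (figure from this post)
QUERY_SECONDS = 0.5   # approximate query and response per link (figure from this post)


def sequential_seconds(links):
    # Every query pays for its own authentication handshake.
    return links * (AUTH_SECONDS + QUERY_SECONDS)


def batched_seconds(links):
    # One handshake for the whole batch, then only the per-query cost.
    return AUTH_SECONDS + links * QUERY_SECONDS


print(sequential_seconds(20))   # 310.0 seconds, roughly 5 minutes
print(batched_seconds(500))     # 265.0 seconds, still under 5 minutes
```

Under this model the handshake is paid once per batch instead of once per link, which is why the link count can grow by a factor of 25 without the run taking any longer.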

Brass Tacks

With the above in mind, we settled on an approach.  We’d create a Knowledge Module (KM) for BMC PATROL, which would loop through a list of WAN links and write a set of commands like the following for each one:

nGeniusCDE
  -vt link
  -vs overtime
  -me <WAN Link>
  -st "2013-06-24 10:00:00"
  -et "2013-06-24 10:15:00"
  -res 15min
  -fp <Path to output file>
  -fn <output file name>

The KM would save the commands in a batch file to be executed as follows:

nGeniusCDE –script <batch file name>
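As a rough illustration of the batch-file step, the following Python sketch writes one command per WAN link using the flags from the example above. Python stands in here purely for illustration, since this post does not show the KM's actual code. The one-command-per-line layout, the quoting of values, and the `link_<n>.csv` output names are all assumptions on our part, as are the function names.

```python
def build_command(link, start, end, out_dir, out_name):
    # Flags copied from the example command above. Quoting every value is an
    # assumption, made so that names containing spaces survive intact.
    return (
        f'nGeniusCDE -vt link -vs overtime -me "{link}" '
        f'-st "{start}" -et "{end}" -res 15min '
        f'-fp "{out_dir}" -fn "{out_name}"'
    )


def write_batch_file(path, links, start, end, out_dir):
    # Assumption: the batch file holds one complete command per line.
    with open(path, "w") as fh:
        for i, link in enumerate(links):
            # Hypothetical naming scheme: one output file per link.
            out_name = f"link_{i}.csv"
            fh.write(build_command(link, start, end, out_dir, out_name) + "\n")
```

Writing a separate output file per link keeps the later parsing step simple, because each file can be tied back to exactly one link.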

nGenius would then route the output for each query into a CSV file for processing by the KM.  The KM would read and parse each file, storing the appropriate metrics in PATROL parameters, and then clean up the file.  PATROL would forward the collected data to the BPPM Integration Server to be picked up by the PATROL adapter.
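The read, parse, and clean-up cycle might look something like the Python sketch below. The CSV layout is an assumption: we assume a header row, and the column name passed in (for example `utilization`) is hypothetical, since this post does not document the export format. Storing the values in PATROL parameters is left out because that part is specific to the KM.

```python
import csv
import os


def read_metrics(csv_path, value_column):
    """Return one export file's samples as floats, then delete the file."""
    values = []
    with open(csv_path, newline="") as fh:
        # Assumption: the first row of the file holds the column names.
        for row in csv.DictReader(fh):
            raw = (row.get(value_column) or "").strip()
            if raw:  # skip rows that carry no sample for this column
                values.append(float(raw))
    # Clean up so the next polling cycle starts with an empty output directory.
    os.remove(csv_path)
    return values
```

Deleting each file only after it has been read successfully means a failed parse leaves the file in place for troubleshooting.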

The approach worked, and we were able to get the data from nGenius into PATROL and through to ProactiveNet for eventual use in Business Objects and Service Level Management.

Have you dealt with a challenging integration recently?  Tell us about it…
