[Good for enterprise] How to monitor GFE? - 2014-08-09 update
Related GFE monitoring posts:
http://www.cnblogs.com/LarryAtCNBlog/p/3890743.html
Emails are welcome: larry.song@outlook.com
A few days after I published that post, the issue happened again. This time no users were affected; only the internal team noticed.
The issue relates to several events from my previous post: 5662, 5669, 5675 and 5733. As I said there, these events should not occur; if they do, there is a communication problem with the Good NOC, and that always holds. This time the issue generated only one or two events and recovered so quickly that no user noticed. The good news is that our monitoring script captured it and sent us an alert. The problem comes back from time to time: the Good NOC is occasionally unreachable, which delays email for several minutes. We should exclude this situation from our monitoring, otherwise the on-call person will be woken up at night for nothing.
I also opened a case with Good, but the engineer could not give a specific root cause because many parties are involved: the packets leave through our proxy, pass through several ISPs, and then reach the Good NOC, so a network failure anywhere along the path breaks the communication. Because the outage is extremely short and the alert arrives with a delay, we could not capture trace logs or ping logs, so I expected that Good would be unable to provide a solution.
In the end I changed approach and asked Good for the NOC web service URLs and for a way to decode Good logs, since I wanted to write a script that rules out false alerts by checking the web service URLs and analyzing the logs.
PS: Some GFE logs are encoded. Good can provide decoding, but only with approval from the Good product owner. In my company that is someone in the UK, and I could hardly get his approval, so I finally gave up on log analysis.
I got the NOC web service URLs from the Good engineer, who gave them to me without any questions:
https://xml28.good.com/
https://xml29.good.com/
https://xml30.good.com/
The basic idea is to update the original monitoring: if the script captures NOC failure events, it invokes a NOC connection test script to check whether the web services really cannot be reached.
So here comes Test-NOC.ps1. If all of those URLs are reachable, the script returns $true; if any one of them is unreachable, it returns $false. Good has some balancing mechanism, so GFE does not use only one web service URL.
### This script is invoked by EventID.Monitoring.ps1
### Used to test Good NOC connectivity
# Add-Log and $strLogFile_e are defined by the calling script

$GoodNOC_Url = @(
    'https://xml28.good.com/',
    'https://xml29.good.com/',
    'https://xml30.good.com/'
)

$WebProxy = New-Object 'System.Net.WebProxy'
# Change below proxy to your own proxy server and port
$WebProxy.Address = 'http://ProxyServer:Port'
$WebClient = New-Object 'System.Net.WebClient'
$WebClient.Proxy = $WebProxy

$Result = $true
foreach($Url in $GoodNOC_Url)
{
    $LoopCount = 0
    # Try each URL up to 3 times before treating it as unreachable
    do
    {
        $LoopResult = $false
        $LoopCount++
        try
        {
            if(($WebClient.DownloadString($Url)).Contains('Congratulations! You have successfully connected to the GoodLink Service.'))
            {
                $LoopResult = $true
                break
            }
        }
        catch
        {
            # A download failure counts as a failed attempt
        }
    }
    while($LoopCount -lt 3)

    $Result = $Result -and $LoopResult
    # Log the result of this URL, not the accumulated result
    if($LoopResult)
    {
        Add-Log -Path $strLogFile_e -Value "NOC Testing succeed: [$Url]" -Type Info
    }
    else
    {
        Add-Log -Path $strLogFile_e -Value "NOC Testing failed: [$Url]" -Type Warning
    }
}

return $Result
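The rule used by Test-NOC.ps1 (every URL must answer, with up to three attempts each) can be tried without touching the network. The following is only a minimal sketch: `Test-AllEndpoints` and the two stub probes are hypothetical names that I made up for illustration and are not part of the monitoring scripts.

```powershell
# Hypothetical helper, not part of Test-NOC.ps1: returns $true only when
# every URL passes the probe within $MaxAttempts tries.
function Test-AllEndpoints {
    param(
        [string[]]$Urls,
        [scriptblock]$Probe,     # takes one URL, returns $true/$false or throws
        [int]$MaxAttempts = 3
    )
    $allOk = $true
    foreach ($url in $Urls) {
        $urlOk = $false
        for ($i = 1; ($i -le $MaxAttempts) -and (-not $urlOk); $i++) {
            try   { $urlOk = [bool](& $Probe $url) }
            catch { $urlOk = $false }   # a thrown error counts as a failed attempt
        }
        if (-not $urlOk) { $allOk = $false }
    }
    return $allOk
}

# Stub probes so the example runs without network access.
$alwaysUp = { param($u) $true }
$oneDown  = { param($u) $u -ne 'https://xml29.good.com/' }
$urls = 'https://xml28.good.com/', 'https://xml29.good.com/', 'https://xml30.good.com/'

Test-AllEndpoints -Urls $urls -Probe $alwaysUp   # True
Test-AllEndpoints -Urls $urls -Probe $oneDown    # False
```

Passing the probe as a script block keeps the retry logic separate from the actual download, so the real check (a `DownloadString` call through the proxy) can be swapped in later.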
The main monitoring script, updated:
# Change the working directory to the folder of this script
Set-Location (Get-Item ($MyInvocation.MyCommand.Definition)).DirectoryName

# Define the events to be monitored and their properties
# EventClass groups similar events; events of the same class trigger the class script to do the final judgement
# ID means event ID; an array like @(xx, yy) means combine the results first, e.g. xx matched 10 entries, yy matched 10 entries, combined as 20, then compared with the threshold
# Pattern is a .NET regular expression, used to filter specific events
# MinusPattern is also a regular expression, used to filter specific events
# If Pattern and MinusPattern are both defined, and Pattern matched 100 entries while MinusPattern matched 90 entries, the final number is 10, which is then compared with the threshold; this is the way to exclude "auto-recover"
$Events = @(
    @{EventClass = 1; ID = 3563; Pattern = '\bPausing .*MAPI error'; MinusPattern = 'Unpausing'; Threshold = 100;},
    @{EventClass = 2; ID = @(1299, 1300, 1301); Pattern = $null; Threshold = 100;},
    @{EventClass = 1; ID = 3386; Pattern = 'GDMAPI_OpenMsgStore failed'; Threshold = 100;},
    @{EventClass = 3; ID = @(5662, 5669); Pattern = $null; Threshold = 1;},
    @{EventClass = 3; ID = 5675; Pattern = 'errNetConnect'; Threshold = 1;},
    @{EventClass = 3; ID = 5733; Pattern = 'errNetTimeout'; Threshold = 1;}
)

# Script is null: no external script is invoked, everything depends on the threshold
# Script isn't null: trigger the external script, the final judgement depends on its return value
$EventClass = @{
    1 = @{Script = $null; Description = 'MAPI Error'};
    2 = @{Script = $null; Description = 'Good thread hung up'};
    3 = @{Script = 'Test-NOC.ps1'; Description = 'Failed to contact NOC'};
}

$Date = Get-Date
$strDate = $Date.ToString("yyyy-MM-dd")

$End_time = $Date
$Start_time = $Date.AddMinutes(-15)
$strLogFile = "${strDate}.log.txt"
$strLogFile_e = "${strDate}_Error.log.txt"

# Define email properties
$Mail_From = "$($env:COMPUTERNAME)@fil.com"
$Mail_To = 'xxxxx@xxx.xxx'
$Mail_Subject = 'Good event IDs warning'
$Mail_SMTPServer = 'smtpserver'

# Start each run with an empty error log
Set-Content -Path $strLogFile_e -Value $null

# Write a timestamped line to the console and, if a path is given, to the log file
function Add-Log
{
    PARAM(
        [String]$Path,
        [String]$Value,
        [String]$Type
    )
    $Type = $Type.ToUpper()
    Write-Host "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
    if($Path){
        Add-Content -Path $Path -Value "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
    }
}

Add-Log -Path $strLogFile_e -Value "Catch logs after : $($Start_time.ToString('HH:mm:ss'))" -Type Info
Add-Log -Path $strLogFile_e -Value "Catch logs before: $($End_time.ToString('HH:mm:ss'))" -Type Info
Add-Log -Path $strLogFile_e -Value "Working directory: $($PWD.Path)" -Type Info

$EventsCache = @(Get-EventLog -LogName Application -After $Start_time -Before $End_time.AddMinutes(5))
Add-Log -Path $strLogFile_e -Value "Total logs count : $($EventsCache.Count)" -Type Info

# Collect the classes whose events reach their threshold
$Error_Array = @()
foreach($e in $Events)
{
    $Events_e_ALL = $null
    $Events_e_Matched = $null
    $Events_e_NMatched = $null
    $Events_e_FinalCount = 0
    $Events_e_ALL = @($EventsCache | ?{$e.ID -contains $_.EventID})
    Add-Log -Path $strLogFile_e -Value "Captured [$($e.ID -join '], [')], count: $($Events_e_ALL.Count)" -Type Info
    $Events_e_Matched = @($Events_e_ALL | ?{$_.Message -imatch $e.Pattern})
    Add-Log -Path $strLogFile_e -Value "Pattern matched, count: $($Events_e_Matched.Count)" -Type Info
    if($e.MinusPattern)
    {
        $Events_e_NMatched = @($Events_e_ALL | ?{$_.Message -imatch $e.MinusPattern})
        Add-Log -Path $strLogFile_e -Value "Minus pattern matched, count: $($Events_e_NMatched.Count)" -Type Info
    }
    $Events_e_FinalCount = $Events_e_Matched.Count - [int]$Events_e_NMatched.Count
    Add-Log -Path $strLogFile_e -Value "Final matched, count: $Events_e_FinalCount" -Type Info
    if($Events_e_FinalCount -ge $e.Threshold)
    {
        Add-Log -Path $strLogFile_e -Value "Over threshold: $($e.Threshold)" -Type Warning
        if($Error_Array -notcontains $e.EventClass)
        {
            $Error_Array += $e.EventClass
        }
    }
}

Add-Log -Path $strLogFile_e -Value "Alert classes captured: [$($Error_Array -join '], [')]" -Type Info
# Run the final script of each captured class; a $true result cancels the alert for that class
for($e = 0; $e -lt $Error_Array.Count; $e++)
{
    Add-Log -Path $strLogFile_e -Value "Process class: [$($Error_Array[$e])]" -Type Info
    if($EventClass.$($Error_Array[$e]).Script -imatch '^$')
    {
        Add-Log -Path $strLogFile_e -Value 'Final script not set, need to send alert.' -Type Warning
    }
    else
    {
        Add-Log -Path $strLogFile_e -Value "Run final script: [$($EventClass.$($Error_Array[$e]).Script)]" -Type Info
        if((& $EventClass.$($Error_Array[$e]).Script) -eq $true)
        {
            Add-Log -Path $strLogFile_e -Value 'Final script: [Positive], no need to send alert.' -Type Info
            $Error_Array[$e] = $null
        }
        else
        {
            Add-Log -Path $strLogFile_e -Value 'Final script: [Negative], need to send alert' -Type Warning
        }
    }
}

# Build the mail body from the descriptions of the classes that still need an alert
$Error_Array | %{$Mail_Body = @()}{
    if($_)
    {
        $Mail_Body += $EventClass.$_.Description
    }
}
$Mail_Body = $Mail_Body -join "`n"

Add-Log -Path $strLogFile_e -Value "===================split line====================" -Type Info
Get-Content -Path $strLogFile_e | Add-Content -Path $strLogFile

If($Mail_Body)
{
    try
    {
        # -ErrorAction Stop makes a send failure reach the catch block
        Send-MailMessage -From $Mail_From -To $Mail_To -Subject $Mail_Subject -Body $Mail_Body -SmtpServer $Mail_SMTPServer -Attachments $strLogFile_e -ErrorAction Stop
    }
    catch
    {
        Add-Log -Path $strLogFile -Value "Failed to send mail, cause: $($Error[0])" -Type Error
    }
}
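To see how the Pattern / MinusPattern arithmetic excludes auto-recovered events, here is a small sketch that can be run on its own. `Get-NetMatchCount` is a hypothetical helper and the messages are made up for illustration; real GFE event text may look different.

```powershell
# Hypothetical helper mirroring the Pattern / MinusPattern arithmetic of the
# main script: entries matching Pattern minus entries matching MinusPattern.
function Get-NetMatchCount {
    param(
        [string[]]$Messages,
        [string]$Pattern,
        [string]$MinusPattern
    )
    $matched = @($Messages | Where-Object { $_ -imatch $Pattern }).Count
    $minus = 0
    if ($MinusPattern) {
        $minus = @($Messages | Where-Object { $_ -imatch $MinusPattern }).Count
    }
    return ($matched - $minus)
}

# Made-up messages: three pauses, two of which were followed by a recovery.
$messages = @(
    'Pausing user1 for 5 minutes: MAPI error',
    'Unpausing user1',
    'Pausing user2 for 5 minutes: MAPI error',
    'Unpausing user2',
    'Pausing user3 for 5 minutes: MAPI error'
)

$net = Get-NetMatchCount -Messages $messages -Pattern '\bPausing .*MAPI error' -MinusPattern 'Unpausing'
$net            # 1
$net -ge 100    # False, so this class would not be flagged
```

The `\b` in the pattern matters here: without it, 'Unpausing' entries that also mention a MAPI error could be counted as pauses.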