[Good for enterprise] How to monitor GFE? - 2014-08-09 update

Related GFE monitoring post:

http://www.cnblogs.com/LarryAtCNBlog/p/3890743.html

Feel free to email me at: larry.song@outlook.com

A few days after I published that post, the issue came back. No users were impacted this time; it was only noticed within the internal team.

The issue relates to several events from my previous post: 5662/5669/5675/5733. As I said, those events shouldn't occur; if they do, there is a communication problem with the Good NOC. That still holds. This time the issue generated only one or two of these events and recovered so quickly that it was as if nothing had happened, so no user noticed. The good news is that our monitoring script captured the issue and sent us an alert. This problem happens from time to time: the Good NOC is occasionally unreachable, which delays email for several minutes. Of course, we should exclude this situation from our monitoring system, otherwise the on-call person will have to wake up at night.

I also opened a case with Good. The engineer couldn't give a specific root cause because many parties are involved: the packets leave our proxy, pass through many ISPs, then reach the Good NOC, and a network failure anywhere along the way will break the communication. Because the outage is extremely short and the alert is delayed, we were unable to capture trace logs or ping logs, so I expected that Good would be unable to provide a solution.

In the end I changed tack and asked Good for the NOC web service URLs and for how to decode Good logs, since I wanted to write a script that filters out the false alerts by checking the Good web service URLs and analyzing the Good logs.

PS: Some of GFE's logs are encoded. Good can provide decoding, but that needs the approval of the Good product owner. For my company that is a man in the UK, and I can hardly get approval from him, so in the end I gave up on log analysis.

I got the NOC web service URLs from a Good engineer, who gave them to me without any questions:

https://xml28.good.com/
https://xml29.good.com/
https://xml30.good.com/

The basic idea is to update the original monitoring: if the script captures NOC failure events, it invokes a NOC connection test script to see whether the web service really can't be reached.
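
The decision itself is small enough to sketch on its own. This is only an illustration, not the code of the real script: the function name Test-ShouldAlert and its parameters are hypothetical, and the follow-up check is passed in as a script block instead of a .ps1 file.

```powershell
# Hypothetical helper, only to illustrate the decision logic
function Test-ShouldAlert
{
    PARAM(
        [bool]$OverThreshold,       # did the event count reach the threshold?
        [ScriptBlock]$FinalCheck    # $null when the class has no follow-up check
    )
    # below the threshold: nothing to report
    if(-not $OverThreshold){ return $false }
    # no follow-up check defined: the threshold alone decides
    if(-not $FinalCheck){ return $true }
    # the check returns $true when the service is reachable, so the alert is suppressed
    return -not ((& $FinalCheck) -eq $true)
}

Test-ShouldAlert -OverThreshold $true -FinalCheck $null        # True
Test-ShouldAlert -OverThreshold $true -FinalCheck { $true }    # False
Test-ShouldAlert -OverThreshold $true -FinalCheck { $false }   # True
Test-ShouldAlert -OverThreshold $false -FinalCheck { $false }  # False
```

In other words, an alert only goes out when the events reach the threshold and the follow-up check can't confirm that the service is fine.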

So here comes Test-NOC.ps1. If all of those URLs are reachable, the script returns $true; if any one of them is unreachable, the script returns $false. Good has some load-balancing mechanism, so GFE does not use only one web service URL.

### This script is invoked by EventID.Monitoring.ps1
### It is used to test Good NOC connectivity

$GoodNOC_Url = @(
    'https://xml28.good.com/',
    'https://xml29.good.com/',
    'https://xml30.good.com/'
)

$WebProxy = New-Object 'System.Net.WebProxy'
# Change below proxy to your own proxy server and port
$WebProxy.Address = 'http://ProxyServer:Port'

$WebClient = New-Object 'System.Net.WebClient'
$WebClient.Proxy = $WebProxy

$Result = $true
foreach($Url in $GoodNOC_Url)
{
    $LoopCount = 0
    # try each URL up to 3 times
    do
    {
        $LoopResult = $false
        $LoopCount++
        try
        {
            # DownloadString throws when the URL can't be reached, so an exception counts as a failed attempt
            if(($WebClient.DownloadString($Url)).Contains('Congratulations!  You have successfully connected to the GoodLink Service.'))
            {
                $LoopResult = $true
                break
            }
        }
        catch
        {
            $LoopResult = $false
        }
    }
    while($LoopCount -lt 3)
    $Result = $Result -and $LoopResult
    # log the outcome of this URL, not the cumulative result
    if($LoopResult)
    {
        Add-Log -Path $strLogFile_e -Value "NOC Testing succeed: [$Url]" -Type Info
    }
    else
    {
        Add-Log -Path $strLogFile_e -Value "NOC Testing failed: [$Url]" -Type Warning
    }
}

return $Result
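
The retry part of Test-NOC.ps1 can also be tried without any network. Below is a minimal sketch, assuming a hypothetical helper named Test-WithRetry (it is not part of my scripts) and a made-up check that fails twice before it succeeds, which stands in for a flaky connection.

```powershell
# Hypothetical helper: run a check up to $MaxAttempts times, treating exceptions as failed attempts
function Test-WithRetry
{
    PARAM(
        [ScriptBlock]$Check,
        [int]$MaxAttempts = 3
    )
    for($Attempt = 1; $Attempt -le $MaxAttempts; $Attempt++)
    {
        try
        {
            if(& $Check){ return $true }
        }
        catch
        {
            # ignore the error and go on to the next attempt
        }
    }
    return $false
}

# Simulated check: throws on the first two calls, succeeds on the third
$script:Calls = 0
$Flaky = {
    $script:Calls++
    if($script:Calls -lt 3){ throw 'simulated network error' }
    $true
}

Test-WithRetry -Check $Flaky -MaxAttempts 3      # True, succeeds on the third attempt
Test-WithRetry -Check { throw 'always down' }    # False, all three attempts fail
```

Three attempts is the same number Test-NOC.ps1 uses; whether that is enough depends on how long the short outages usually last.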

The updated main monitoring script:

#change working directory
Set-Location (Get-Item ($MyInvocation.MyCommand.Definition)).DirectoryName

#define events to be monitored and their properties
#EventClass groups similar events; events of the same class trigger the class script for the final judgement
#ID is the event ID; an array like @(xx,yy) means the results are combined first, e.g. xx matched 10 entries and yy matched 10 entries, combined as 20, then compared with the threshold
#Pattern is a .NET regular expression (as in C#), used to filter specific events.
#MinusPattern is also a regular expression, used to filter the events that are subtracted.
#if both Pattern and MinusPattern are defined, e.g. Pattern matched 100 entries and MinusPattern matched 90 entries, the final number is 10, which is then compared with the threshold; this is the way to exclude "auto-recover".
$Events = @(
    @{EventClass = 1; ID = 3563; Pattern = '\bPausing .*MAPI error'; MinusPattern = 'Unpausing'; Threshold = 100;},
    @{EventClass = 2; ID = @(1299, 1300, 1301); Pattern = $null; Threshold = 100;},
    @{EventClass = 1; ID = 3386; Pattern = 'GDMAPI_OpenMsgStore failed'; Threshold = 100;},
    @{EventClass = 3; ID = @(5662, 5669); Pattern = $null; Threshold = 1;},
    @{EventClass = 3; ID = 5675; Pattern = 'errNetConnect'; Threshold = 1;},
    @{EventClass = 3; ID = 5733; Pattern = 'errNetTimeout'; Threshold = 1;}
)

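# Script is null: no external script is invoked, everything depends on the threshold
# Script isn't null: the external script is triggered, and the final judgement depends on its return value
# The keys are the EventClass numbers used in $Events above
# The script path starts with .\ because PowerShell doesn't run scripts from the current directory by bare name
$EventClass = @{
    1 = @{Script = $null; Description = 'MAPI Error'};
    2 = @{Script = $null; Description = 'Good thread hung up'};
    3 = @{Script = '.\Test-NOC.ps1'; Description = 'Failed to contact NOC'};
}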

$Date = Get-Date
$strDate = $Date.ToString("yyyy-MM-dd")

$End_time = $Date
$Start_time = $Date.AddMinutes(-15)
$strLogFile = "${strDate}.log.txt"
$strLogFile_e = "${strDate}_Error.log.txt"

#define email properties
$Mail_From = "$($env:COMPUTERNAME)@fil.com"
$Mail_To = 'xxxxx@xxx.xxx'
$Mail_Subject = 'Good event IDs warning'
$Mail_SMTPServer = 'smtpserver'

Set-Content -Path $strLogFile_e -Value $null 

function Add-Log
{
    PARAM(
        [String]$Path,
        [String]$Value,
        [String]$Type
    )
    $Type = $Type.ToUpper()
    Write-Host "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
    if($Path){
        Add-Content -Path $Path -Value "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
    }
}

Add-Log -Path $strLogFile_e -Value "Catch logs after : $($Start_time.ToString('HH:mm:ss'))" -Type Info
Add-Log -Path $strLogFile_e -Value "Catch logs before: $($End_time.ToString('HH:mm:ss'))" -Type Info
Add-Log -Path $strLogFile_e -Value "Working directory: $($PWD.Path)" -Type Info

$EventsCache = @(Get-EventLog -LogName Application -After $Start_time -Before $End_time.AddMinutes(5))
Add-Log -Path $strLogFile_e -Value "Total logs count : $($EventsCache.Count)" -Type Info
$Error_Array = @()
foreach($e in $Events)
{
    $Events_e_ALL = $null
    $Events_e_Matched = $null
    $Events_e_NMatched = $null
    $Events_e_FinalCount = 0

    $Events_e_ALL = @($EventsCache | ?{$e.ID -contains $_.EventID})
    Add-Log -Path $strLogFile_e -Value "Captured [$($e.ID -join '], [')], count: $($Events_e_ALL.Count)" -Type Info
    $Events_e_Matched = @($Events_e_ALL | ?{$_.Message -imatch $e.Pattern})
    Add-Log -Path $strLogFile_e -Value "Pattern matched, count: $($Events_e_Matched.Count)" -Type Info
    
    if($e.MinusPattern)
    {
        $Events_e_NMatched = @($Events_e_ALL | ?{$_.Message -imatch $e.MinusPattern})
        Add-Log -Path $strLogFile_e -Value "Minus pattern matched, count: $($Events_e_NMatched.Count)" -Type Info
    }

    $Events_e_FinalCount = $Events_e_Matched.Count - [int]$Events_e_NMatched.Count
    Add-Log -Path $strLogFile_e -Value "Final matched, count: $Events_e_FinalCount" -Type Info
    if($Events_e_FinalCount -ge $e.Threshold)
    {
        Add-Log -Path $strLogFile_e -Value "Over threshold: $($e.Threshold)" -Type Warning
        if($Error_Array -notcontains $e.EventClass)
        {
            $Error_Array += $e.EventClass
        }
    }
}

Add-Log -Path $strLogFile_e -Value "Alert classes captured: [$($Error_Array -join '], [')]" -Type Info
for($e = 0; $e -lt $Error_Array.Count; $e++)
{
    Add-Log -Path $strLogFile_e -Value "Process class: [$($Error_Array[$e])]" -Type Info
    if($EventClass.$($Error_Array[$e]).Script -imatch '^$')
    {
        Add-Log -Path $strLogFile_e -Value 'Final script not set, need to send alert.' -Type Warning
    }
    else
    {
        Add-Log -Path $strLogFile_e -Value "Run final script: [$($EventClass.$($Error_Array[$e]).Script)]" -Type Info
        if((& $EventClass.$($Error_Array[$e]).Script) -eq $true)
        {
            Add-Log -Path $strLogFile_e -Value 'Final script: [Positive], no need to send alert.' -Type Info
            $Error_Array[$e] = $null
        }
        else
        {
            Add-Log -Path $strLogFile_e -Value 'Final script: [Negative], need to send alert' -Type Warning
        }
    }
}

$Error_Array | %{$Mail_Body = @()}{
    if($_)
    {
        $Mail_Body += $EventClass.$_.Description
    }
}
$Mail_Body = $Mail_Body -join "`n"

Add-Log -Path $strLogFile_e -Value "===================split line====================" -Type Info
Get-Content -Path $strLogFile_e | Add-Content -Path $strLogFile

If($Mail_Body)
{
    try
    {
        Send-MailMessage -From $Mail_From -To $Mail_To -Subject $Mail_Subject -Body $Mail_Body -SmtpServer $Mail_SMTPServer -Attachments $strLogFile_e
    }
    catch
    {
        Add-Log -Path $strLogFile -Value "Failed to send mail, cause: $($Error[0])" -Type Error
    }
}
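
To see how Pattern and MinusPattern work together, here is a small worked example. It is only a sketch: the messages are made up for illustration and are not real GFE event texts; the two patterns are the ones from the 3563 entry in $Events.

```powershell
# Made-up sample messages, not real GFE logs
$SampleMessages = @(
    'Pausing user1 for 5 minutes, MAPI error',
    'Pausing user2 for 5 minutes, MAPI error',
    'Pausing user3 for 5 minutes, MAPI error',
    'Unpausing user1',
    'Unpausing user2'
)

$Pattern = '\bPausing .*MAPI error'
$MinusPattern = 'Unpausing'

# Same filtering as in the main script: count both patterns, then subtract
$Matched = @($SampleMessages | ?{$_ -imatch $Pattern})
$NMatched = @($SampleMessages | ?{$_ -imatch $MinusPattern})
$FinalCount = $Matched.Count - $NMatched.Count

"Matched: $($Matched.Count), minus: $($NMatched.Count), final: $FinalCount"
# Matched: 3, minus: 2, final: 1
```

The \b in Pattern matters here: without it, the case-insensitive match would also find "pausing" inside "Unpausing". Two of the three pauses recovered by themselves, so only one counts against the threshold.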

 

posted @ 2014-08-09 14:59  LarryIsTaken