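Two problems encountered on 2013-04-27: 503 errors and unbalanced CPU usage across the Couchbase cluster
(Note: neither of these problems has anything to do with Alibaba Cloud.)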
1. 503 errors
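Today, from about 13:00 to 13:10, we saw 503 errors. Our initial diagnosis was that the number of concurrent requests had exceeded the Queue Length of the IIS application pool, which was still at the IIS default of 1000 (see the figure below).
We changed Queue Length from 1000 to 2000, which resolved the problem (the maximum allowed value is 65535).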
Later we found that Queue Length can be sized by watching "HTTP Service Request Queues" -> "ArrivalRate" in Performance Monitor.
For example, if the maximum "ArrivalRate" shown in the figure above is 400, Queue Length should be greater than 400.
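As a rough illustration of the sizing rule above, the sketch below takes a series of arrival-rate samples (for example, values read off Performance Monitor) and suggests a Queue Length. The function name `suggest_queue_length`, the `headroom` multiplier, and the sample values are assumptions made up for illustration; only the default of 1000 and the maximum of 65535 come from the text.

```python
import math

IIS_DEFAULT_QUEUE_LENGTH = 1000   # IIS default mentioned in the text
IIS_MAX_QUEUE_LENGTH = 65535      # maximum allowed value mentioned in the text


def suggest_queue_length(arrival_rates, headroom=2.0):
    """Suggest a Queue Length that stays above the peak arrival rate.

    arrival_rates: iterable of sampled arrival-rate values.
    headroom: hypothetical safety multiplier (an assumption, not an IIS setting).
    """
    samples = list(arrival_rates)
    if not samples:
        raise ValueError("need at least one arrival-rate sample")
    peak = max(samples)
    wanted = math.ceil(peak * headroom)
    # Never go below the IIS default and never above the allowed maximum.
    return min(max(wanted, IIS_DEFAULT_QUEUE_LENGTH), IIS_MAX_QUEUE_LENGTH)


if __name__ == "__main__":
    # Made-up samples with a peak of 400, matching the example in the text.
    print(suggest_queue_length([120, 250, 400, 310]))
```

With a peak of 400 the default of 1000 already leaves plenty of room, so the sketch keeps the default. A larger value only comes out when the observed peak times the headroom exceeds 1000.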
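Here is the CPU monitoring graph for one of the load-balanced web servers at the time:
(The red curve is % Processor Time; the green curve is Request Execution Time.)
We did not know what abnormal event had hit this cloud server at that moment. It looked as though the root cause of the 503 errors was a CPU anomaly on the cloud server, so we filed a ticket with Alibaba Cloud to find out more.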
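Update:
After careful investigation, the 503 errors turned out to be caused by the application pool crashing at that time. The crash was triggered by the Couchbase client, while we were adding and removing servers in the Couchbase cluster.
The evidence comes from the Windows event log: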
Exception: System.NullReferenceException
Message: Object reference not set to an instance of an object.
StackTrace:
   at Hammock.RestClient.CompleteWithQuery(WebQuery query, RestRequest request, RestCallback callback, WebQueryAsyncResult result)
   at Hammock.RestClient.<>c__DisplayClass18.<BeginRequestImpl>b__15(Object sender, WebQueryResponseEventArgs args)
   at System.EventHandler`1.Invoke(Object sender, TEventArgs e)
   at Hammock.Web.WebQuery.OnQueryResponse(WebQueryResponseEventArgs args)
   at Hammock.Web.WebQuery.HandleWebException(WebException exception)
   at Hammock.Web.WebQuery.GetAsyncResponseCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.HttpWebRequest.SetResponse(Exception E)
   at System.Net.ConnectionReturnResult.SetResponses(ConnectionReturnResult returnResult)
   at System.Net.Connection.CompleteConnectionWrapper(Object request, Object state)
   at System.Net.PooledStream.ConnectionCallback(Object owningObject, Exception e, Socket socket, IPAddress address)
   at System.Net.ServicePoint.ConnectSocketCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)

Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
Stack:
   at System.Net.ServicePoint.ConnectSocketCallback(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Net.ContextAwareResult.Complete(IntPtr)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x000007ff0033cbed
Faulting process id: 0x10b4
Faulting application start time: 0x01ce42fb6c5d3e18
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: unknown
Report Id: 30767fd7-aef7-11e2-8bf7-e5d3e0390d57
2. Unbalanced CPU usage across the Couchbase cluster
(Couchbase admin console)
(Output of the Linux top command)
The cluster consists of two Couchbase servers, yet their CPU usage differed greatly. The Couchbase version was 2.0.0.
A Google search turned up "High cpu usage in memcached process": this is a bug in Couchbase 2.0.0, and upgrading to Couchbase 2.0.1, the latest version at the time, fixes it.
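Upgrade procedure:
1. Download the installation package on both Couchbase servers: wget http://packages.couchbase.com/releases/2.0.1/couchbase-server-enterprise_x86_64_2.0.1.rpm
2. In the Couchbase admin console, remove one server from the cluster. For the detailed steps, see couchbase-getting-started-upgrade-online
3. Upgrade Couchbase to 2.0.1: rpm -U couchbase-server-enterprise_x86_64_2.0.1.rpm (it is best to restart the Couchbase service after the upgrade: service couchbase-server restart)
4. Add the upgraded Couchbase server back to the cluster.
5. Repeat the same upgrade on the other Couchbase server.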
After the upgrade, the problem was resolved.
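The upgrade procedure above can be written down as an ordered checklist, which makes it harder to skip a step when more than two servers are involved. The sketch below only prints the plan and runs nothing. The node names `cb1` and `cb2` and the function name `upgrade_plan` are hypothetical, and the service name `couchbase-server` is assumed to be the init script installed by the RPM. Removing and re-adding a server is left as a manual console step, as in the procedure.

```python
PACKAGE_URL = ("http://packages.couchbase.com/releases/2.0.1/"
               "couchbase-server-enterprise_x86_64_2.0.1.rpm")
PACKAGE_FILE = PACKAGE_URL.rsplit("/", 1)[-1]


def upgrade_plan(nodes):
    """Return ordered (node, kind, detail) steps for a one-at-a-time upgrade.

    kind is "run" for a shell command to execute on that node, or
    "console" for an action performed in the Couchbase admin console.
    """
    steps = []
    # Step 1: download the package on every node before touching the cluster.
    for node in nodes:
        steps.append((node, "run", f"wget {PACKAGE_URL}"))
    # Steps 2-5: take one node out, upgrade it, put it back, then do the next.
    for node in nodes:
        steps.append((node, "console", "remove this server from the cluster"))
        steps.append((node, "run", f"rpm -U {PACKAGE_FILE}"))
        # Assumed service name; adjust if your install uses a different one.
        steps.append((node, "run", "service couchbase-server restart"))
        steps.append((node, "console", "add this server back to the cluster"))
    return steps


if __name__ == "__main__":
    # Hypothetical node names for a two-server cluster.
    for node, kind, detail in upgrade_plan(["cb1", "cb2"]):
        print(f"[{node}] {kind}: {detail}")
```

Keeping the plan as data rather than executing it directly is deliberate: the console steps cannot be scripted from the information given here, and each command has to be run on the node it names.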