如何找bug
将发送端sleep时间定为0, 即sleep(0).
事例率也在700Hz持续3分钟后降为0.查看log文件:
1. SFI 的log文件:
WARNING 2017-Feb-08 15:39:55 [DC::StatusWord EventAssembly::EventCompleted(...) at SFI/src/EventAssembly.cxx:480] Problem with the flow of data: Event with LVL1ID 130745 misses 1 data fragment(s) from: ROS-Eth-07,
WARNING 2017-Feb-08 15:39:55 [DC::StatusWord EventAssembly::EventCompleted(...) at SFI/src/EventAssembly.cxx:480] Problem with the flow of data: Event with LVL1ID 130749 misses 1 data fragment(s) from: ROS-Eth-07,
WARNING 2017-Feb-08 15:39:55 [DC::StatusWord EventAssembly::EventCompleted(...) at SFI/src/EventAssembly.cxx:480] Problem with the flow of data: Event with LVL1ID 130753 misses 1 data fragment(s) from: ROS-Eth-07,
2. ROS 的log文件:
Warning could not load libEthClientSequentialReadoutModule.so, trying libROSEthClientSequentialReadoutModule.so
ERROR 2017-Feb-08 15:44:10 [virtual void ROS::RequestHandler::run(...) at ROSCore/src/RequestHandler.cpp:274] Unclassified error: RequestHandler caught exception, deleting Request
was caused by: ERROR 2017-Feb-08 15:44:10 [unknown at unknown:0] No pages available
找到ROSCore/src/RequestHandler.cpp:274:
catch (std::exception& issue) { 274 ENCAPSULATE_ROS_EXCEPTION(newIssue, CoreException, UNCLASSIFIED, issue, "RequestHandler caught exception, deleting Request "); 275 ers::error(newIssue); 276 deleteRequest(); 277 }
先找到 ENCAPSULATE_ROS_EXCEPTION的定义:并没有看出啥,只看出来例外的类型是CoreException, errorCode是UNCLASSIFIED.
/** \def ENCAPSULATE_ROS_EXCEPTION(instanceName, exceptionClass, errorCode, oldException, messageContent) Macro to encapsulate an exception inside a ROSException along with some explanatory text. */ #define ENCAPSULATE_ROS_EXCEPTION(instanceName, exceptionClass, errorCode, oldException, messageContent) \ std::ostringstream instanceName##_tStream; \ instanceName##_tStream << messageContent; \ exceptionClass instanceName(oldException, exceptionClass::errorCode, instanceName##_tStream.str(), ERS_HERE) #endif
然后再查看try里面的语句
243 try { 244 TS_RECORD(TS_H1,stamp_offset + 2100); 245 int requestOk; 246 //unsigned int tsdata = m_handlerId * 100 + 3000000000; 247 //TS_RECORD(TS_H5,tsdata); 248 249 requestOk = m_request->execute(); 250 TS_RECORD(TS_H1,stamp_offset + 2200); 251 252 if (requestOk == Request::REQUEST_OK) { 253 m_requestsHandled++; 254 deleteRequest(); 255 TS_RECORD(TS_H1, stamp_offset + 2300); 256 } 257 else if (requestOk == Request::REQUEST_TIMEOUT) { 258 m_requestsTimedOut++; 259 deleteRequest(); 260 TS_RECORD(TS_H1, stamp_offset + 2300); 261 } 262 else { 263 TS_RECORD(TS_H1, stamp_offset + 2400); 264 DEBUG_TEXT(DFDB_ROSCORE, 15, "RequestHandler(" << m_handlerId <<")::run: Request non completed: will be requeued."); 265 m_requestsFailed++; 266 m_request = m_requestQueue->swap(m_request); 267 DFThread::yieldOrCancel(); 268 TS_RECORD(TS_H1, stamp_offset + 2500); 269 } 270 271 TS_RECORD(TS_H1, stamp_offset + 2990); 272 }
整个try...catch语句是在requesthandler的run函数里面,run函数是requesthandler的执行线程,从request queue里面拿出一个request, 执行这个request,然后再删除这个request.
通过加打印得到:try语句里执行了第一个if语句,证明request被执行后返回了REQUEST_OK,在执行if内部的语句deleteRequest();时,出现了例外。
再查看deleteRequest()函数里面写了什么:
193 void RequestHandler::deleteRequest() 194 { 195 if (m_request != 0) { 197 m_mutex->lock() ; // Memory deallocation has to be protected as 198 delete(m_request); // it is thread-unsafe! 199 m_request=0; 200 m_mutex->unlock(); 201 } 202 }
在 if 里面每条语句里面加打印,在requesthandler的析构函数里面加打印,并查看log文件
void RequestHandler::deleteRequest() 194 { 195 if (m_request != 0) { 196 std::cout << "m_request is not 0:1!" << std::endl; 197 m_mutex->lock() ; // Memory deallocation has to be protected as 198 std::cout << "m_request is not 0:2!" << std::endl; 199 delete(m_request); // it is thread-unsafe! 200 std::cout << "m_request is not 0:3!" << std::endl; 201 m_request=0; 202 std::cout << "m_request is not 0:4!" << std::endl; 203 m_mutex->unlock(); 204 std::cout << "m_request is not 0:5!" << std::endl; 205 } 206 }
查看ROS-Eth-04的log文件,发现:
[lhaaso@cmm03node01 part_dk_ef]$ grep "m_request is not 0:1" ROS-Eth-04_cmm03node01_1486545277.out |wc -l
261133
[lhaaso@cmm03node01 part_dk_ef]$ grep "m_request is not 0:2" ROS-Eth-04_cmm03node01_1486545277.out |wc -l
261239
[lhaaso@cmm03node01 part_dk_ef]$ grep "~Request" ROS-Eth-04_cmm03node01_1486545277.out |wc -l
130624
[lhaaso@cmm03node01 part_dk_ef]$ grep "m_request is not 0:3" ROS-Eth-04_cmm03node01_1486545277.out |wc -l
261239
[lhaaso@cmm03node01 part_dk_ef]$ grep "m_request is not 0:4" ROS-Eth-04_cmm03node01_1486545277.out |wc -l
261239
[lhaaso@cmm03node01 part_dk_ef]$ grep "m_request is not 0:5" ROS-Eth-04_cmm03node01_1486545277.out |wc -l
261238
这几条语句的执行次数,发现deleteRequest操作的次数比其前后打印的次数要少很多,证明有很多次delteRequest并没有成功,抛出例外,并导致程序报错了。
(批注:实际上是Request有两个子类,FragmentRequest 和 ReleaseFragment类, RequestHandler执行的时候根据收到SFI的消息种类判断执行哪一种Request。)