Log Analysis (repost)

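I told them I couldn't do that socket interview question, so the company gave me another, even nastier one:

Log analysis
1) Extract the strings from the file (the characters enclosed in [] in the sample below).

2) Use a hash algorithm to put all the strings into a hash table, and count how many times each distinct string occurs.

3) Write a sorting algorithm (for example quicksort) and sort the strings by frequency of occurrence.

4) Save the result to a file, one record per line, in the format:

string\count\r\n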
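
Steps 2 and 3 seem to ask for a hand-written hash table and sort rather than library calls. A minimal sketch of what that could look like, assuming FNV-1a as the hash function and separate chaining for collisions (the task names neither). Entry, CountTable, fnv1a and quicksort_desc are made-up names for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: one distinct string and its occurrence count.
struct Entry {
    std::string key;
    unsigned int count;
};

// FNV-1a (32-bit), a simple non-cryptographic string hash.
// Chosen here as an assumption; the task does not prescribe a hash.
inline unsigned int fnv1a(const std::string& s) {
    unsigned int h = 2166136261u;
    for (std::size_t i = 0; i < s.size(); ++i) {
        h ^= static_cast<unsigned char>(s[i]);
        h *= 16777619u;
    }
    return h;
}

// Fixed-size hash table with separate chaining; each bucket is a small vector.
class CountTable {
public:
    explicit CountTable(std::size_t buckets) : buckets_(buckets) {
        assert(buckets > 0);
    }

    // Step 2: increment the counter for key, inserting it on first sight.
    void add(const std::string& key) {
        std::vector<Entry>& b = buckets_[fnv1a(key) % buckets_.size()];
        for (std::size_t i = 0; i < b.size(); ++i) {
            if (b[i].key == key) { ++b[i].count; return; }
        }
        Entry e = { key, 1 };
        b.push_back(e);
    }

    // Flatten all buckets into one vector so it can be sorted.
    std::vector<Entry> entries() const {
        std::vector<Entry> out;
        for (std::size_t i = 0; i < buckets_.size(); ++i)
            out.insert(out.end(), buckets_[i].begin(), buckets_[i].end());
        return out;
    }

private:
    std::vector<std::vector<Entry> > buckets_;
};

// Step 3: quicksort by descending count (Hoare-style partition, middle pivot).
// lo and hi are inclusive indices into v.
inline void quicksort_desc(std::vector<Entry>& v, int lo, int hi) {
    if (lo >= hi) return;
    const unsigned int pivot = v[lo + (hi - lo) / 2].count;
    int i = lo, j = hi;
    while (i <= j) {
        while (v[i].count > pivot) ++i;   // already on the "large" side
        while (v[j].count < pivot) --j;   // already on the "small" side
        if (i <= j) { std::swap(v[i], v[j]); ++i; --j; }
    }
    quicksort_desc(v, lo, j);
    quicksort_desc(v, i, hi);
}
```

Adding the keys mp3, a, mp3, b, mp3, a and then sorting the result of entries() with quicksort_desc(v, 0, v.size() - 1) puts mp3 (3) first, then a (2), then b (1). The bucket count is fixed in this sketch; a real table would grow and rehash as it fills up.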

Sample input file:

09/30 14:40:49 PB .1306 TRACE: [西部地区] cl=2 lm=0 ct=0 si=gi tn=sohu pn=0 | disp=33817 list=200(10) 143ms cache=0

09/30 14:40:50 PB .1302 TRACE: [河南电脑福利彩票] cl=0 lm=0 ct=0 si=gi tn=sohu pn=0 | disp=39 list=39(10) 245ms cache=0

09/30 14:40:50 PB .1304 TRACE: [富丽雅] cl=2 lm=0 ct=0 si=gi tn=sohu pn=0 | disp=46 list=46(10) 151ms cache=0

09/30 14:40:50 PB .1297 TRACE: [国内生产总值] cl=2 lm=0 ct=0 si=gi tn=sohu pn=0 | disp=21674 list=200(10) 7ms cache=1

09/30 14:40:50 PB .1308 TRACE: [超星图书浏览器] cl=2 lm=0 ct=0 si=gi tn=sohu pn=0 | disp=54 list=54(10) 145ms cache=0

09/30 14:40:50 PB .1294 TRACE: [mp3] cl=2 lm=0 ct=0 si=gi tn=sohu pn=0 | disp=96110 list=200(10) 7ms cache=1
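
In the sample lines above, the key is simply the text between the first '[' and the next ']', so a plain byte search may be enough and is likely cheaper than a general regular expression. A sketch follows, with extract_key as a hypothetical helper name. It assumes an encoding such as UTF-8, in which the bytes for '[' and ']' never occur inside a multi-byte character; GBK does not guarantee that, because its trail bytes can fall in the ASCII range.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: copies the text between the first '[' and the next ']'
// into out. Returns false (leaving out untouched) when the line has no such pair.
inline bool extract_key(const std::string& line, std::string& out) {
    const std::string::size_type open = line.find('[');
    if (open == std::string::npos) return false;
    const std::string::size_type close = line.find(']', open + 1);
    if (close == std::string::npos) return false;
    assert(close > open);
    // Copy the characters strictly between the two brackets.
    out.assign(line, open + 1, close - open - 1);
    return true;
}
```

Called on the last sample line, it yields mp3. A line without brackets returns false, and an empty pair "[]" yields an empty key.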

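Requirements:
1) Use C++.
2) Explain the key points of the design.
3) Say where the performance bottleneck of the implementation is and how to optimize it.
4) Give test results for log files of 5M, 10M, 50M, 500M and 1G, with a brief analysis.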
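
Requirement 4 needs input files of several sizes and a way to time a run. One possible harness, assuming a C++11 compiler for the chrono header and std::to_string. write_synthetic_log and time_ms are hypothetical helpers, and the generated keys (key0, key1, ...) are synthetic stand-ins for real queries:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical generator: writes lines shaped like the sample input until at
// least target_bytes bytes have been written. Keys cycle through
// "key0" .. "key<distinct-1>". Returns the number of lines written.
inline std::size_t write_synthetic_log(const std::string& path,
                                       std::size_t target_bytes,
                                       unsigned int distinct) {
    assert(distinct > 0);
    std::ofstream out(path.c_str(), std::ios::binary);
    std::size_t written = 0, lines = 0;
    while (out && written < target_bytes) {
        const std::string line = "09/30 14:40:50 PB .1294 TRACE: [key" +
            std::to_string(lines % distinct) +
            "] cl=2 lm=0 ct=0 si=gi tn=sohu pn=0 | disp=54 list=54(10) 7ms cache=1\n";
        out << line;
        written += line.size();
        ++lines;
    }
    return lines;
}

// Runs work() once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F work) {
    const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    work();
    const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For example, write_synthetic_log("5m.log", 5u * 1024 * 1024, 1000) produces a file of roughly 5 MB with 1000 distinct keys, and time_ms can wrap the whole read, count, sort and write pass. Real logs probably have a far more skewed key distribution, so timings on synthetic data are only a rough guide.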

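Good grief. Algorithms again, and 500M thrown around like it's nothing. Is your company writing an operating system? And in C++, no less. Anyway, I'd appreciate any help from the experts here. Thanks.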

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
#include <boost/regex.hpp>
#include <boost/foreach.hpp>
#include <boost/unordered_map.hpp>

using namespace std;
using namespace boost;

struct logs_t
{
        string id;
        unsigned int cnt;
};

typedef unordered_map<string,unsigned int> logcntmap_t;
typedef vector<logs_t> logsv_t;

bool scomp( const logs_t & v1, const logs_t & v2 )
{
        return v1.cnt > v2.cnt;
}

int main(int argc, char ** argv)
{
        if( argc<3 )
        {
                cout << "Usage: logc InputLogFile OutputFile" << endl;
                return -1;
        }

        ifstream ifs( argv[1] );
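        // Binary mode: line endings are written explicitly, so newline
        // translation by the stream is suppressed.
        ofstream ofs( argv[2], ios::binary );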

        if( !ifs || !ofs )
        {
                cout << "File(s) open error" << endl;
                return -2;
        }

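        // Capture the text between '[' and ']'. The backslashes are doubled
        // because the pattern is written as a C++ string literal.
        regex pat(".*\\[(.*)\\].*");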
        string line;
        smatch sval;
        logcntmap_t lgm;

        while( getline(ifs,line) )
        {
                if( regex_match( line, sval, pat ) )
                {
                        // operator[] value-initializes a missing counter to 0,
                        // so no separate find() is needed.
                        ++lgm[ sval[1].str() ];
                }
        }
        ifs.close();

        logsv_t logsv;
        logsv.reserve( lgm.size() );
        BOOST_FOREACH( logcntmap_t::value_type & r, lgm )
        {
                logs_t lg = {r.first, r.second};
                logsv.push_back( lg );
        }
        lgm.clear();

        sort( logsv.begin(), logsv.end(), scomp );

        BOOST_FOREACH( logsv_t::value_type & v, logsv )
        {
                // One record per line: string, a backslash, the count, then CRLF.
                ofs << v.id << "\\" << v.cnt << "\r\n";
        }
        ofs.close();

        cout << logsv.size() << " lines written to file " << argv[2] << endl;
        return 0;
}

posted @ 2011-03-21 11:38 BiG5
