Time limit: 3.000 seconds
Many researchers are faced with an ever increasing number of journal articles to read and find it difficult to locate papers of relevance to their particular lines of research. However, it is possible to subscribe to various services which claim that they will find articles that fit an `interest profile' that you supply, and pass them on to you. One simple way of performing such a search is to determine whether a pair of keywords occurs `sufficiently' close to each other in the title of an article. The threshold is determined by the researchers themselves, and refers to the number of words that may occur between the pair of keywords. Thus an archeologist interested in cave paintings could specify her profile as ``0 rock art'', meaning that she wants all titles in which the words ``rock'' and ``art'' appear with 0 words in between, that is next to each other. This would select not only ``Rock Art of the Maori'' but also ``Pop Art, Rock, and the Art of Hang-glider Maintenance''.
Write a program that will read in a series of
profiles followed by a series of titles and determine which of the
titles (if any) are selected by each of the profiles. A title is
selected by a profile if at least one pair of keywords from the profile
is found in the title, separated by no more than the given threshold.
For the purposes of this program, a word is a sequence of letters,
preceded by one or more blanks and terminated by a blank or the end of
line marker.
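The selection rule above can be sketched as a direct brute-force check. This is only an illustration, not the approach used by the solution later in this post: the names `keywordIndex` and `isSelected` are hypothetical, the title is assumed to be already split into lower-case, letters-only words, and a "pair" is assumed to mean two different keywords of the same profile (with no keyword listed twice).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: position of `word` in the keyword list, or -1.
static int keywordIndex(const std::vector<std::string> &keywords,
                        const std::string &word)
{
    for (std::size_t i = 0; i < keywords.size(); ++i)
        if (keywords[i] == word)
            return static_cast<int>(i);
    return -1;
}

// Hypothetical helper: true when two different keywords occur in the title
// with at most `threshold` words between them.
// Assumes `titleWords` is already normalised (lower case, letters only).
bool isSelected(std::size_t threshold,
                const std::vector<std::string> &keywords,
                const std::vector<std::string> &titleWords)
{
    for (std::size_t a = 0; a < titleWords.size(); ++a) {
        const int ka = keywordIndex(keywords, titleWords[a]);
        if (ka < 0)
            continue;
        // b - a - 1 is the number of words strictly between positions a and b.
        for (std::size_t b = a + 1;
             b < titleWords.size() && b - a - 1 <= threshold; ++b) {
            const int kb = keywordIndex(keywords, titleWords[b]);
            if (kb >= 0 && kb != ka)
                return true;
        }
    }
    return false;
}
```

With the profile ``0 rock art'', this check accepts both example titles from the introduction, since ``art'' and ``rock'' are adjacent in each of them once punctuation is removed.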
Input will consist of no more than 50 profiles followed by no more than 250 titles. Each profile and title will be numbered in the order of their appearance, starting from 1, although the numbers will not appear in the file.
Each profile will start with the characters ``P:'', and will consist of a
number representing a threshold, followed by two or more keywords in
lower case.
Each title will start with the characters ``T:'', and will consist of a
string of characters terminated by ``|''. The character ``|'' will not
occur anywhere in a title except at the end. No title will be longer
than 255 characters, and if necessary it will flow on to more than one
line. No line will be longer than eighty characters and each
continuation line of a title will start with at least one blank. Line
breaks will only occur between words.
All non-alphabetic characters are to
be ignored, thus the title ``Don't Rock -- the Boat as Metaphor in
1984'' would be treated as ``Dont Rock the Boat as Metaphor in'' and
``HP2100X'' will be treated as ``HPX''. The file will be terminated by a
line consisting of a single #.
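The rule for non-alphabetic characters can be illustrated with a small normalisation routine. It is a sketch under assumptions: the function name `titleWords` is hypothetical, blanks (including tabs and line breaks) are taken as the only separators, and words are additionally converted to lower case so that they can be compared with the lower-case keywords, which the statement itself does not spell out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a raw title on blanks, drop every non-letter,
// and lower-case what remains. Tokens that end up empty (such as "--" or
// "1984") are discarded.
std::vector<std::string> titleWords(const std::string &title)
{
    std::vector<std::string> words;
    std::string word;
    for (std::size_t i = 0; i <= title.size(); ++i) {
        // Treat the end of the string like a final blank so that the last
        // word is flushed.
        const int c = (i < title.size())
                          ? static_cast<unsigned char>(title[i])
                          : ' ';
        if (std::isspace(c)) {
            if (!word.empty()) {
                words.push_back(word);
                word.clear();
            }
        } else if (std::isalpha(c)) {
            word.push_back(static_cast<char>(std::tolower(c)));
        }
        // Any other character (digits, punctuation, '|') is ignored.
    }
    return words;
}
```

Applied to the title from the statement, this yields the words dont, rock, the, boat, as, metaphor, in, and ``HP2100X'' becomes the single word hpx.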
Output will consist of a series of lines, one for each profile in the input. Each line will consist of the profile number (the number of its appearance in the input) followed by ``:'', a blank space, and the numbers of the selected titles in numerical order, separated by commas and with no spaces.
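A minimal sketch of the output format described above; `formatLine` is a hypothetical name, and the caller is assumed to pass the selected title numbers already sorted and free of duplicates. For a profile that selects no titles, the sketch assumes the line consists of just the number, the colon and the blank.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: build one output line for a profile.
// `titles` is assumed to be sorted in ascending order without duplicates.
std::string formatLine(std::size_t profileNumber,
                       const std::vector<std::size_t> &titles)
{
    std::ostringstream out;
    out << profileNumber << ": ";
    for (std::size_t i = 0; i < titles.size(); ++i) {
        if (i != 0)
            out << ',';  // commas only, no spaces between title numbers
        out << titles[i];
    }
    return out.str();
}
```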
Sample input
Sample output
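- Non-letter characters are not processed at all (they are simply dropped);
- Only spaces and line breaks act as word separators;
- All words are handled in lower case;
- Any two keywords in a profile count as a pair.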
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <utility>

typedef unsigned long ulong;
typedef unsigned short ushort;

// A profile: its threshold and its keywords converted to numeric ids
struct PROFILE {
    size_t nThreshold;
    std::vector<ushort> nArray;
};

// For a profile: the threshold and the profile index.
// For a title: the distance between two keywords and the title index.
struct INFO {
    size_t nDist;
    size_t nIdx;
};

typedef std::vector<std::string> VECSTR;
typedef std::vector<ushort> ARRAY;
typedef std::vector<ARRAY> MATRIX;
typedef std::map<ulong, std::vector<INFO> > MAPINFO;
typedef std::pair<size_t, size_t> PAIR;

// Pack the ids of a keyword pair into one unsigned long, independent of order
ulong MakeWordPair(ushort w1, ushort w2) {
    return (w1 > w2) ? (ulong(w1) | (ulong(w2) << 16))
                     : (ulong(w2) | (ulong(w1) << 16));
}

// Ordering used for sorting: by distance (threshold) first, then by index
bool operator<(const INFO &f1, const INFO &f2) {
    return (f1.nDist < f2.nDist ||
            (f1.nDist == f2.nDist && f1.nIdx < f2.nIdx));
}

// Equality used for removing duplicates
bool operator==(const INFO &f1, const INFO &f2) {
    return (f1.nDist == f2.nDist && f1.nIdx == f2.nIdx);
}

int main(void) {
    VECSTR profileStrs, titleStrs;
    // Read the input: "P:" starts a profile, "T:" starts a title, and a line
    // that starts with a blank or a tab continues the previous title.
    for (std::string str; std::getline(std::cin, str); ) {
        if (str.empty()) continue;
        if (str[0] == '#') break;
        switch (str[0]) {
        case 'P':
            profileStrs.push_back(std::string(str.begin() + 2, str.end()));
            break;
        case 'T':
            titleStrs.push_back(std::string(str.begin() + 2, str.end()));
            break;
        case ' ':
        case '\t':
            if (!titleStrs.empty()) titleStrs.back() += str;
            break;
        }
    }

    // wordTbl assigns an id (starting from 1) to every keyword
    std::map<std::string, ushort> wordTbl;
    // Each profile's keyword list converted to the matching id sequence
    std::vector<PROFILE> arrProfile;
    for (VECSTR::iterator i = profileStrs.begin(); i != profileStrs.end(); ++i) {
        i->push_back(' ');
        // A profile is a threshold followed by keywords: find where the
        // threshold starts
        std::string::iterator iBeg = i->begin();
        for (; iBeg != i->end() && !isdigit(*iBeg); ++iBeg);
        // Find where the threshold ends and collect its digits
        std::string strThre;
        std::string::iterator iEnd = iBeg;
        for (; iEnd != i->end() && isdigit(*iEnd); ++iEnd)
            strThre.push_back(*iEnd);
        // Store the threshold and the keyword id sequence of this profile
        arrProfile.push_back(PROFILE());
        PROFILE &cur = arrProfile.back();
        // Convert the threshold from text to a number
        cur.nThreshold = atoi(strThre.c_str());
        // Holds the keyword currently being read
        std::string word;
        for (std::string::iterator j = iEnd; j != i->end(); ++j) {
            if (*j != ' ' && *j != '\t')
                word.push_back(*j);
            else if (!word.empty()) {
                // Update the keyword-to-id table
                ushort &wordIdx = wordTbl[word];
                if (wordIdx == 0)
                    wordIdx = static_cast<ushort>(wordTbl.size());
                // Append the id to the keyword sequence
                cur.nArray.push_back(wordIdx);
                word.clear();
            }
        }
    }

    // The input maps one profile to a set of keyword pairs; invert that so
    // each keyword pair maps to the profiles (and thresholds) that use it
    MAPINFO profileTbl;
    for (std::vector<PROFILE>::iterator i = arrProfile.begin();
         i != arrProfile.end(); ++i) {
        // Every combination of two keywords forms a keyword pair
        for (ARRAY::iterator j = i->nArray.begin(); j != i->nArray.end(); ++j) {
            for (ARRAY::iterator k = j + 1; k != i->nArray.end(); ++k) {
                INFO info = {i->nThreshold,
                             static_cast<size_t>(i - arrProfile.begin())};
                profileTbl[MakeWordPair(*j, *k)].push_back(info);
            }
        }
    }

    MATRIX titleAry;
    for (VECSTR::iterator i = titleStrs.begin(); i != titleStrs.end(); ++i) {
        // A trailing blank flushes the last word; '|' is dropped as a non-letter
        i->push_back(' ');
        titleAry.push_back(ARRAY());
        std::string word;
        // Strip non-letters as the problem requires, then convert the title
        // to an id sequence: a keyword gets its id, any other word gets -1
        for (std::string::iterator j = i->begin(); j != i->end(); ++j) {
            char cTmp = tolower(*j);
            if (cTmp != ' ' && cTmp != '\t') {
                if (isalpha(cTmp))
                    word.push_back(cTmp);
            }
            else if (!word.empty()) {
                std::map<std::string, ushort>::iterator idx = wordTbl.find(word);
                titleAry.back().push_back(
                    idx != wordTbl.end() ? idx->second : ushort(-1));
                word.clear();
            }
        }
    }

    // A title may contain many keyword pairs: compute and store the distance
    // of each pair
    MAPINFO titleTbl;
    for (MATRIX::iterator i = titleAry.begin(); i != titleAry.end(); ++i) {
        // For the current title: keyword pair -> smallest distance
        std::map<ulong, ushort> curWordmap;
        for (ARRAY::iterator j = i->begin(); j != i->end(); ++j) {
            if (*j == ushort(-1)) continue;
            for (ARRAY::iterator k = j + 1; k != i->end(); ++k) {
                if (*k == ushort(-1)) continue;
                // Both words are keywords: keep the minimum distance
                ushort nDist = k - j;
                ushort &nWord = curWordmap[MakeWordPair(*j, *k)];
                if (nWord == 0 || nDist < nWord)
                    nWord = nDist;
            }
        }
        // Keyword pair -> list of (distance, title index)
        for (std::map<ulong, ushort>::iterator j = curWordmap.begin();
             j != curWordmap.end(); ++j) {
            INFO info = {j->second, static_cast<size_t>(i - titleAry.begin())};
            titleTbl[j->first].push_back(info);
        }
    }

    // Compare profiles with titles to find which titles each profile selects
    std::vector<PAIR> result;
    for (MAPINFO::iterator i = profileTbl.begin(); i != profileTbl.end(); ++i) {
        // Skip keyword pairs that do not occur in any title
        MAPINFO::iterator t = titleTbl.find(i->first);
        if (t == titleTbl.end()) continue;
        std::vector<INFO> &curP = i->second;
        std::vector<INFO> &curT = t->second;
        // Sort the profiles by threshold and drop duplicates
        std::sort(curP.begin(), curP.end());
        curP.erase(std::unique(curP.begin(), curP.end()), curP.end());
        // Sort the titles by distance
        std::sort(curT.begin(), curT.end());
        for (std::vector<INFO>::iterator icurP = curP.begin(),
             icurT = curT.begin();
             icurP != curP.end() && icurT != curT.end();) {
            // nDist - 1 is the number of words between the two keywords.
            // If it fits the current threshold, the title is selected by the
            // current profile and by every later one (larger thresholds);
            // otherwise move on to the next threshold.
            if (icurT->nDist - 1 <= icurP->nDist) {
                for (std::vector<INFO>::iterator j = icurP; j != curP.end(); ++j)
                    result.push_back(PAIR(j->nIdx + 1, icurT->nIdx + 1));
                ++icurT;
            }
            else ++icurP;
        }
    }

    // Sort the (profile, title) pairs, drop duplicates produced by different
    // keyword pairs, and print one line per profile (even if it is empty)
    std::sort(result.begin(), result.end());
    result.erase(std::unique(result.begin(), result.end()), result.end());
    std::vector<PAIR>::iterator it = result.begin();
    for (size_t p = 1; p <= arrProfile.size(); ++p) {
        std::cout << p << ": ";
        for (bool first = true; it != result.end() && it->first == p; ++it) {
            if (!first) std::cout << ',';
            std::cout << it->second;
            first = false;
        }
        std::cout << std::endl;
    }
    return 0;
}

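Author: 王雨濛; Sina Weibo: @吉祥村码农; Source: the 《程序控》 blog -- http://www.cnblogs.com/devymex/ Copyright of this article belongs to the author (unless otherwise stated); any repost must credit the author and the source. You may not use it for commercial purposes or modify the original content.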
