Site Search: Indexing and XML Parsing

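Building the index means requesting each target page, downloading its content locally, tokenizing the text, and saving the result. The saved, tokenized data is the index.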

Here is the code (let the code do the talking):

            string indexPath = "c:/index"; // where the index is saved; should be configurable in web.config
            FSDirectory directory = FSDirectory.Open(new DirectoryInfo(indexPath), new NativeFSLockFactory()); // index files on disk
            bool isUpdate = IndexReader.IndexExists(directory); // true if the directory already contains an index
            if (isUpdate)
            {
                // If the index directory is locked (e.g. the program crashed mid-indexing), unlock it first
                if (IndexWriter.IsLocked(directory))
                {
                    IndexWriter.Unlock(directory);
                }
            }
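            // Specify the analyzer (PanGu); article text is tokenized by it before being stored in the index
            IndexWriter writer = new IndexWriter(directory, new PanGuAnalyzer(), !isUpdate, Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED);

            WebClient wc = new WebClient();
            wc.Encoding = Encoding.UTF8; // otherwise the downloaded text is garbled
            // Different sites store their articles differently; this project targets a forum
            // todo: read the RSS feed; the number in the first item's link is the largest topic ID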

            int maxId = GetMaxId();
            for (int i = 2000; i <= maxId; i++)
            {
                string url = "http://localhost:8081/showtopic-" + i.ToString() + ".aspx";
                string html = wc.DownloadString(url);

                HTMLDocumentClass doc = new HTMLDocumentClass(); // mshtml extracts the text of the page; IE parses pages the same way
                doc.designMode = "on"; // keep the parsing engine from trying to run JavaScript
                doc.IHTMLDocument2_write(html);
                doc.close();

                string title = doc.title;
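                string body = doc.body.innerText; // text with the tags stripped

                // getElementById() is also available on the document to pull out a specific element

                // To avoid indexing a topic twice, delete any record with number == i before re-adding it; otherwise duplicates pile up on every run
                writer.DeleteDocuments(new Term("number", i.ToString()));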

                Document document = new Document();
                // Only fields that need full-text search should be ANALYZED
                document.Add(new Field("number", i.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                document.Add(new Field("title", title, Field.Store.YES, Field.Index.NOT_ANALYZED)); // the title is not tokenized here, so it is indexed as a single term
                document.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED, Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS)); // index the tag-stripped text rather than the raw html
                writer.AddDocument(document);

            }
            writer.Close();
            directory.Close(); // don't forget to Close, or the indexed results cannot be found by searches
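
To make the "tokenize, then store" idea and the delete-before-add step more concrete, here is a minimal in-memory sketch. It is not how Lucene.Net stores its data: `ToyIndex`, `AddOrReplace` and the whitespace tokenizer are hypothetical stand-ins (the real code uses PanGuAnalyzer and IndexWriter), and the sample topic texts are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy inverted index: term -> set of topic numbers containing that term.
public class ToyIndex
{
    private readonly Dictionary<string, HashSet<int>> postings =
        new Dictionary<string, HashSet<int>>();

    // Remove every posting that points at the given topic number.
    public void Delete(int number)
    {
        foreach (HashSet<int> docs in postings.Values)
            docs.Remove(number);
    }

    // Delete-then-add, so indexing the same number again never duplicates it.
    public void AddOrReplace(int number, string body)
    {
        Delete(number);
        // Naive whitespace tokenizer standing in for a real analyzer.
        string[] terms = body.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string term in terms)
        {
            HashSet<int> docs;
            if (!postings.TryGetValue(term, out docs))
            {
                docs = new HashSet<int>();
                postings[term] = docs;
            }
            docs.Add(number);
        }
    }

    // Topic numbers containing the term, in ascending order.
    public List<int> Search(string term)
    {
        HashSet<int> docs;
        return postings.TryGetValue(term.ToLowerInvariant(), out docs)
            ? docs.OrderBy(n => n).ToList()
            : new List<int>();
    }
}

public static class ToyIndexDemo
{
    public static void Main()
    {
        var index = new ToyIndex();
        index.AddOrReplace(2000, "Lucene index basics");
        index.AddOrReplace(2001, "PanGu analyzer and Lucene");
        index.AddOrReplace(2000, "Lucene index basics revised"); // index topic 2000 a second time
        Console.WriteLine(string.Join(",", index.Search("lucene"))); // 2000,2001
        Console.WriteLine(index.Search("revised").Count);            // 1
    }
}
```

Topic 2000 appears only once in the results even though it was indexed twice, which is the effect the DeleteDocuments call is there to achieve.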

The RSS feed the forum publishes is an XML file, so the largest topic ID can be obtained by parsing that file:

            private int GetMaxId()
            {
                // The first <item> in the feed is the newest topic, so its link carries the largest topic ID
                XDocument xdoc = XDocument.Load("http://localhost:8081/tools/rss.aspx");
                XElement channel = xdoc.Root.Element("channel");
                XElement firstItem = channel.Elements("item").First();
                XElement link = firstItem.Element("link");
                Match match = Regex.Match(link.Value, @"showtopic-(\d+)\.aspx");
                string id = match.Groups[1].Value; // Groups[1] is the (\d+) capture (I had half forgotten my regex)
                return Convert.ToInt32(id);
            }
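
The feed lookup can be tried without a running forum. The sketch below parses an inline sample and returns -1 instead of throwing when the feed has no item or the link does not match. The sample feed content, the `RssMaxId` class, the `ExtractMaxTopicId` name and the -1 convention are all assumptions made for illustration; only the `showtopic-N.aspx` link pattern comes from the code above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class RssMaxId
{
    // Hypothetical sample; the real feed would come from /tools/rss.aspx.
    public const string SampleFeed = @"<rss version=""2.0"">
  <channel>
    <title>Forum</title>
    <item><link>http://localhost:8081/showtopic-2310.aspx</link></item>
    <item><link>http://localhost:8081/showtopic-2309.aspx</link></item>
  </channel>
</rss>";

    // Returns the topic ID found in the first <item>'s <link>, or -1 when
    // the feed has no item/link or the link does not match the pattern.
    public static int ExtractMaxTopicId(string rssXml)
    {
        XDocument xdoc = XDocument.Parse(rssXml);
        XElement link = xdoc.Root?
            .Element("channel")?
            .Elements("item").FirstOrDefault()?
            .Element("link");
        if (link == null) return -1;

        Match match = Regex.Match(link.Value, @"showtopic-(\d+)\.aspx");
        return match.Success ? int.Parse(match.Groups[1].Value) : -1;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractMaxTopicId(SampleFeed)); // 2310
    }
}
```

Checking `match.Success` matters because `Groups[1].Value` is an empty string on a failed match, and converting that to an integer throws.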

I had forgotten how to work with XML, so this was a useful refresher:

http://www.cnblogs.com/malin/archive/2010/03/04/1678352.html

posted on 2012-03-16 09:09 by ancient_sir