Feature-NLP - Post Category (Page 2) - 郝壹贰叁

[IR] Graph Compression

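Summary: Ref: [IR] Compression. Ref: [IR] Link Analysis. Planar Graph. From: http://www.csie.ntnu.edu.tw/~u91029/PlanarGraph.html#1 Because dual graphs lack an elegant regularity, isomorphism is customarily ignored when discussing them. The most special ...

posted @ 2017-06-07 10:30 郝壹贰叁 Views (761) Comments (0) Recommendations (0)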

[IR] XPath for Search Query

Summary: Topics: XPath 1.0; XPath Containment; Distributed Query Evaluation; RE and DFA. XPath 1.0: its use in XML. XPath syntax: http://www.w3school.com.cn/xpath/xpath_syntax.asp ...

posted @ 2017-06-06 16:23 郝壹贰叁 Views (491) Comments (0) Recommendations (0)
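
To make the XPath entry above concrete, here is a minimal sketch of path queries over a small XML tree. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath 1.0. The sample document and the helper `member_names` are hypothetical, invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented purely for illustration.
XML = """
<staff>
  <dept name="search">
    <member>Ann</member>
    <member>Bob</member>
  </dept>
  <support>
    <member>Cid</member>
  </support>
</staff>
""".strip()

root = ET.fromstring(XML)


def member_names(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


# Child steps: members that sit directly under a dept element.
dept_members = member_names("./dept/member")

# Descendant step: every member element anywhere below the root.
all_members = member_names(".//member")

# Attribute predicate: dept elements whose name attribute equals "search".
search_depts = [el.get("name") for el in root.findall("./dept[@name='search']")]
```

Full XPath 1.0 features such as functions and most axes are outside what `ElementTree` supports; a dedicated XPath engine would be needed for those.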

[IR] Advanced XML Compression - XBW

Summary: Question to consider: what conclusions can be drawn from a comparison with ISX? How it works: We proposed the XBW-transform that mimics on trees the nice structural properties of the Burrows-and-Wheeler Transform on strin ...

posted @ 2017-06-06 15:19 郝壹贰叁 Views (331) Comments (0) Recommendations (0)
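
The XBW entry above refers to the Burrows-Wheeler Transform on strings. The sketch below shows only that string transform, not XBW itself, using the naive sorted-rotations construction. It assumes a sentinel character `$` that does not occur in the input and sorts before every other character; the function names are illustrative.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler Transform via sorted rotations (quadratic; for illustration only)."""
    if end in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end="$"):
    """Invert the transform by repeatedly prepending the last column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform groups equal characters into runs, which is what makes the output friendlier to a downstream compressor.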

[IR] Advanced XML Compression - ISX

Summary: Original paper: http://www.cse.unsw.edu.au/~wong/papers/www07.pdf ISX requirements. For mobile devices: to find a space-efficient storage scheme for XML dat ...

posted @ 2017-06-05 19:51 郝壹贰叁 Views (340) Comments (0) Recommendations (0)

[IR] XML Compression

Summary: Ref: https://www.ibm.com/developerworks/cn/xml/x-datacompression/ Language-Equivalent (similar to path compression): there are three paths from root to o12: staff, dept/member, support/member ...

posted @ 2017-06-05 10:36 郝壹贰叁 Views (245) Comments (0) Recommendations (0)

[IR] What is XML

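Summary: Concept: http://www.w3school.com.cn/xml/xml_cdata.asp Semistructured: compared with ordinary plain text, semistructured data has a certain amount of structure. OEM (Object Exchange Model) is a typical semistructured data model. An OEM object ...

posted @ 2017-06-05 10:33 郝壹贰叁 Views (277) Comments (0) Recommendations (0)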

[Bayes] Concept Search and PLSA

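Summary: [Topic Model] Topic models: Probabilistic Latent Semantic Analysis. My impression is that LDA's advantage in practice is actually small, so learning pLSA well is the key point. Reading notes: PLSI. By 2008, pLSA had already been overshadowed by the newly emerging LDA. LDA is pLSA's generaliz ...

posted @ 2017-05-13 21:02 郝壹贰叁 Views (380) Comments (0) Recommendations (0)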

[Bayes] Concept Search and LSI

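Summary: Research on extending Bayesian network information retrieval models based on term relationships. LSI. Reading notes. Background: an improved co-occurrence frequency method is proposed and used to mine the correlations between index terms. These correlations are introduced into the belief network model, giving an extended belief network model with two layers of term nodes, whose performance is verified experimentally. Synonyms of the query terms are introduced into the belief network model as query evidence, and ...

posted @ 2017-05-13 18:11 郝壹贰叁 Views (319) Comments (0) Recommendations (0)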

[Bayes] Concept Search and LDA

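Summary: What matters is gaining a deeper understanding of Bayesian thinking through practice; first, a brief look at LDA. Related mathematics: LDA-math: MCMC and Gibbs Sampling; LDA-math: Understanding the Beta/Dirichlet Distributions; LDA-math: The Magical Gamma Function; LDA Study Notes (1): The Gamma Function and Beta ...

posted @ 2017-05-05 19:27 郝壹贰叁 Views (409) Comments (0) Recommendations (0)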

[IR] Search Server - Sphinx

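Summary: Better MySQL search with Sphinx (IBM). Although MySQL is an excellent general-purpose database, if your application needs to do a lot of searching, Sphinx can deliver better performance. Although Sphinx is a full-text search tool, it can still speed up an application even when used with non-full-text queries. This article introduces ...

posted @ 2017-05-03 18:49 郝壹贰叁 Views (280) Comments (0) Recommendations (0)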

[IR] Open Source Search Engines

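Summary: From: http://blog.csdn.net/xum2008/article/details/8740063 This document is a brief introduction to existing open-source search engines. 1. Lucene: Lucene is written in Java and is the best-known open-source search engine in the Java family; in the Java world it is already the standard full-text ...

posted @ 2017-05-03 17:09 郝壹贰叁 Views (420) Comments (0) Recommendations (0)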

[IR] Information Extraction

Summary: Interim summary. Boolean retrieval, word search. [Qword1 and Qword2]: O(x+y). [Qword1 and Qword2], improved with Galloping Search: O(2a*log2(b/a)). [Qword1 and not Qword2]: O(m*log2n). [ ...

posted @ 2016-11-08 19:11 郝壹贰叁 Views (288) Comments (0) Recommendations (0)
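
The Galloping Search improvement named in the entry above can be sketched as follows. This is a minimal illustration that assumes both posting lists are sorted lists of integer document IDs; the function names `gallop` and `intersect_galloping` are hypothetical.

```python
from bisect import bisect_left


def gallop(postings, target, lo):
    """Return the first index >= lo whose value is >= target.

    The probe distance doubles until the target is overshot, then a binary
    search finishes inside the last bracket.
    """
    step = 1
    hi = lo
    while hi < len(postings) and postings[hi] < target:
        lo = hi + 1  # everything up to hi is known to be too small
        hi += step
        step *= 2
    return bisect_left(postings, target, lo, min(hi, len(postings)))


def intersect_galloping(short, long):
    """Answer [Qword1 and Qword2] by galloping through the longer list."""
    if len(short) > len(long):
        short, long = long, short
    result, pos = [], 0
    for doc_id in short:
        pos = gallop(long, doc_id, pos)
        if pos == len(long):
            break  # the longer list is exhausted
        if long[pos] == doc_id:
            result.append(doc_id)
    return result


common = intersect_galloping([4, 13, 40], [1, 2, 3, 5, 8, 13, 21, 34])
```

Galloping pays off when one list is much shorter than the other; for lists of similar length a plain linear merge does roughly the same work.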

[IR] Evaluation

Summary: Evaluating unranked retrieval results: Precision P = tp/(tp + fp); Recall R = tp/(tp + fn); Accuracy = (tp + tn)/(tp + fp + fn + tn). Evaluating ranked retrieval results: a precision-recall curve. Debugging sea ...

posted @ 2016-11-08 16:33 郝壹贰叁 Views (438) Comments (0) Recommendations (0)
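
The three unranked-retrieval measures in the entry above translate directly into code. In this small sketch the confusion-matrix counts are made-up numbers chosen only to exercise the formulas.

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of relevant documents that were retrieved: tp / (tp + fn)."""
    return tp / (tp + fn)


def accuracy(tp, fp, fn, tn):
    """Fraction of all decisions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Made-up counts for illustration: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
p = precision(40, 10)
r = recall(40, 20)
a = accuracy(40, 10, 20, 30)
```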

[IR] Ranking - top k

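Summary: PageRanking is one way; what, then, is the traditional way? Every term has a weight (standing) in a given doc. For the terms shared by the query and the doc, the corresponding normalized weights are multiplied and the products are summed, which gives the cosine similarity. Efficient cosine ranking: the traditional approach of putting everything into a heap: n * l ...

posted @ 2016-11-08 13:25 郝壹贰叁 Views (317) Comments (0) Recommendations (0)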
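
The cosine scoring described in the entry above (normalize the term weights, multiply the weights of shared terms, sum the products) can be sketched together with a heap-based top-k selection. The documents, the weights and the function names here are hypothetical, and `heapq.nlargest` stands in for whatever heap strategy the full post describes.

```python
import heapq
import math


def normalize(weights):
    """Scale a term-weight vector to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {term: w / norm for term, w in weights.items()}


def cosine(query, doc):
    """Sum the products of normalized weights over the terms both vectors share."""
    q, d = normalize(query), normalize(doc)
    return sum(q[term] * d[term] for term in q.keys() & d.keys())


def top_k(query, docs, k):
    """Return the ids of the k highest-scoring documents, best first."""
    scored = ((cosine(query, vector), doc_id) for doc_id, vector in docs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Hypothetical term-weight vectors, invented for illustration.
docs = {
    "d1": {"ibm": 1.0, "computer": 1.0},
    "d2": {"apple": 1.0},
    "d3": {"ibm": 2.0},
}
best_two = top_k({"ibm": 1.0}, docs, 2)
```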

[IR] Link Analysis

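Summary: Characteristics of web information: Query: "IBM" --> "Computer" --> documentIDs. The number of pages with in-degree i is proportional to 1/i^α, for example α = 2.1; that is, the larger i is, the fewer such pages there are. Query processing: First retrieve all pages meet ...

posted @ 2016-11-08 09:05 郝壹贰叁 Views (825) Comments (0) Recommendations (0)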

[IR] Probabilistic Model

Summary: If the user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as a Naive Bayes model. ...

posted @ 2016-11-07 20:08 郝壹贰叁 Views (1045) Comments (0) Recommendations (0)
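
As a hedged illustration of the entry above, the sketch below trains a tiny multinomial Naive Bayes classifier on documents the user has marked relevant or irrelevant. Add-one smoothing, the toy training data and the function names are assumptions made for this example; they are not taken from the post.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Collect per-class document counts and term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = {}
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        counts = term_counts[label]
        denominator = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokens:
            if token in vocab:  # terms never seen in training are skipped
                score += math.log((counts[token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy relevance feedback, invented for illustration.
feedback = [
    (["xml", "compression"], "relevant"),
    (["xml", "index"], "relevant"),
    (["football", "score"], "irrelevant"),
]
model = train(feedback)
label_xml = classify(["xml"], model)
label_football = classify(["football"], model)
```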

[IR] Tolerant Retrieval & Spelling Correction & Language Model

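Summary: The dictionary does not have to be a list; it can take many forms. Why hashing is abandoned: a tree is usually the more suitable structure. From: http://www.cnblogs.com/v-July-v/archive/2011/06/07/2075992.html B-tree: a B-tree is also called a balanced multiway search tree. ...

posted @ 2016-11-06 19:48 郝壹贰叁 Views (520) Comments (0) Recommendations (0)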

[IR] Compression

Summary: Relationship: vocabulary vs. collection size. Heaps' law: M = kT^b, where M is the size of the vocabulary and T is the number of tokens in the collection. Typical values: 30 ≤ ...

posted @ 2016-11-05 15:04 郝壹贰叁 Views (366) Comments (0) Recommendations (0)
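
Heaps' law from the entry above, M = kT^b, is easy to evaluate numerically. In the sketch below, k = 44 and b = 0.5 are illustrative assumptions rather than values from the post, whose "typical values" are cut off in the summary.

```python
def heaps_vocabulary_size(num_tokens, k, b):
    """Estimate the vocabulary size M = k * T**b for a collection of T tokens."""
    return k * num_tokens ** b


# Illustrative parameters only (assumed, not taken from the post).
K, B = 44, 0.5

estimate_small = heaps_vocabulary_size(1_000_000, K, B)
estimate_large = heaps_vocabulary_size(100_000_000, K, B)
```

With b = 0.5, growing the collection a hundredfold grows the estimated vocabulary only tenfold, which is the sublinear growth the law describes.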

[IR] Index Construction

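Summary: Posing the problem: building an inverted index. There are three steps to constructing an inverted index. Sorting a massive number of terms is the hardest step: the most practical problem in step 2 is how to sort, say, 100 GB of terms. Reference: http://home.ustc.edu.cn/~zhu ...

posted @ 2016-11-05 13:36 郝壹贰叁 Views (661) Comments (0) Recommendations (0)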
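
One common answer to the sorting question raised in the entry above is an external merge sort: sort chunks that fit in memory, write each as a run on disk, then merge the runs. The sketch below is a generic illustration of that idea and is not necessarily the method the post goes on to describe. It assumes terms contain no newline or other control characters, and `external_sort` and `_flush_run` are hypothetical names.

```python
import heapq
import os
import tempfile


def _flush_run(buffer, directory, paths):
    """Sort one in-memory chunk and write it to its own run file."""
    buffer.sort()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "w", encoding="utf-8") as run_file:
        run_file.writelines(term + "\n" for term in buffer)
    paths.append(path)


def external_sort(terms, run_size, directory):
    """Yield terms in sorted order while holding at most run_size terms in memory."""
    paths, buffer = [], []
    for term in terms:
        buffer.append(term)
        if len(buffer) >= run_size:
            _flush_run(buffer, directory, paths)
            buffer = []
    if buffer:
        _flush_run(buffer, directory, paths)

    # k-way merge: heapq.merge reads each sorted run lazily, line by line.
    run_files = [open(path, encoding="utf-8") for path in paths]
    try:
        for line in heapq.merge(*run_files):
            yield line.rstrip("\n")
    finally:
        for run_file in run_files:
            run_file.close()


with tempfile.TemporaryDirectory() as scratch:
    sorted_terms = list(
        external_sort(["pear", "apple", "fig", "apple", "kiwi"], 2, scratch)
    )
```

For real index construction the records would be (term, docID) pairs rather than bare terms, but the run-then-merge structure stays the same.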

[IR] Inverted Index & Boolean retrieval

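Summary: Textbook: Introduction to Information Retrieval (《信息检索导论》). Inverted index: how to build an inverted index? 1. Token sequence. 2. Sort by terms. 3. Dictionary & Postings. Querying for documents that contain both words [Qword1 and Qword2]: advance the two pointers in step (contour-line style). O ...

posted @ 2016-11-03 14:00 郝壹贰叁 Views (887) Comments (0) Recommendations (0)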
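
The three construction steps and the two-pointer AND query from the entry above can be sketched as follows. Whitespace tokenization, lowercasing and the toy documents are simplifying assumptions made for this example.

```python
def build_inverted_index(docs):
    """Build term -> sorted list of doc ids from a {doc_id: text} mapping."""
    # Step 1: token sequence of (term, doc_id) pairs.
    pairs = [
        (token, doc_id)
        for doc_id, text in docs.items()
        for token in text.lower().split()
    ]
    # Step 2: sort by term, then by doc id.
    pairs.sort()
    # Step 3: dictionary and postings, dropping duplicate doc ids.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


def intersect(p1, p2):
    """Merge two sorted posting lists, advancing whichever pointer lags behind."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Toy collection, invented for illustration.
docs = {1: "new home sales", 2: "home sales rise", 3: "new sales"}
index = build_inverted_index(docs)
both = intersect(index["new"], index["home"])
```

Each pointer only moves forward, so the merge touches every posting at most once.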

Machine learning runs deep

We all have two lives. The second one starts when we realize that we only have one. --- Tom Hiddleston
