Information Retrieval: Questions and Answers
Chapter 1
1. What is the definition of Information Retrieval?
Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.
2. What is the primary goal of an IR system?
The primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
3. What is the difference between information retrieval (IR) and data retrieval (DR)?
Data retrieval: determines which documents contain a given set of keywords. Its semantics are well defined, and a single erroneous object among the results implies failure.
Information retrieval: seeks information about a subject or topic. Its semantics are frequently loose, and small errors are tolerated.
4. What affects the effective retrieval of relevant information?
The user task and the logical view of the documents adopted by the retrieval system.
5. What is the relation between retrieval and browsing?
Retrieval: a purposeful search for information or data. Browsing: glancing around a collection without a clearly defined goal.
6. Pay particular attention to Sections 1.2, 1.3, and 1.4.
Chapter 2
1. What is an index term?
In a restricted sense, an index term is a keyword (or group of related words) which has some meaning of its own (i.e., which usually has the semantics of a noun). In its more general form, an index term is simply any word which appears in the text of a document in the collection.
2. What are the three classic models in information retrieval?
The three classic models in information retrieval are the Boolean, vector, and probabilistic models.
3. What is a taxonomy of information retrieval models?
User task: Retrieval (ad hoc, filtering)
    Classic models: Boolean, vector, probabilistic
        Set theoretic: fuzzy, extended Boolean
        Algebraic: generalized vector, latent semantic indexing, neural networks
        Probabilistic: inference network, belief network
    Structured models: non-overlapping lists, proximal nodes
User task: Browsing
    Browsing models: flat, structure guided, hypertext
4. What are the definitions and characteristics of ad hoc retrieval and filtering?
Ad hoc retrieval: the documents in the collection remain relatively static while new queries are submitted to the system.
Filtering: the queries remain relatively static while new documents come into the system (and leave).
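5. Describe the Boolean model.
The Boolean model is a simple retrieval model based on set theory and Boolean algebra.
Advantages: inherent simplicity and a neat formalism.
Drawbacks: retrieval is based on a binary decision (a document either matches or does not), so there is no ranking of results, and information needs are often hard to translate into Boolean expressions.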
6. What is a hypertext?
A hypertext is a high-level interactive navigational structure which allows us to browse text non-sequentially on a computer screen. It consists basically of nodes which are correlated by directed links in a graph structure.
7. How can users avoid getting lost in a hypertext such as the Web?
It is desirable that the hypertext include a hypertext map which shows where the user is at all times.
8. Describe the Web as a hypertext.
The Web is not exactly a proper hypertext because it lacks an underlying data model, a navigational plan, and a consistently designed user interface.
Chapter 3
1. What is the definition of retrieval performance evaluation?
Information retrieval systems require an evaluation of how precise the answer set is. This type of evaluation is referred to as retrieval performance evaluation.
2. What does a test reference collection consist of?
A test reference collection consists of a collection of documents, a set of example information requests, and a set of relevant documents (provided by specialists) for each example information request.
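3. Calculate R (recall ratio), P (precision ratio), O (omission ratio), and M (miss ratio).
R = (number of relevant items retrieved / total number of relevant items in the system) × 100%
O = 100% - R
P = (number of relevant items retrieved / total number of items retrieved) × 100%
M = 100% - P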
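The four ratios above can be computed directly from two sets of document identifiers. The following is a minimal sketch; the function name `retrieval_ratios` and the example numbers are illustrative assumptions, and both sets are assumed to be non-empty.

```python
def retrieval_ratios(retrieved, relevant):
    """Return (R, P, O, M) as percentages for one query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids judged relevant in the collection
    Both sets are assumed to be non-empty.
    """
    hits = len(retrieved & relevant)           # relevant items actually retrieved
    recall = 100.0 * hits / len(relevant)      # R
    precision = 100.0 * hits / len(retrieved)  # P
    omission = 100.0 - recall                  # O
    miss = 100.0 - precision                   # M
    return recall, precision, omission, miss


# Hypothetical example: 10 relevant documents exist, the system returns 8,
# of which 4 are relevant.
relevant = set(range(1, 11))
retrieved = {1, 2, 3, 4, 11, 12, 13, 14}
print(retrieval_ratios(retrieved, relevant))  # (40.0, 50.0, 60.0, 50.0)
```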
4. Search the Web for information about current trends in TREC.
Chapter 4
1. What is the definition of a query language?
Typical queries on structure allow the selection of areas that contain (or not) other areas, that are contained (or not) in other areas, that follow (or are followed by) other areas, that are close to other areas, and set manipulation. Many of them are implemented in most models, although each model has unique features. Some kind of standardization, expressiveness taxonomy, or formal categorization would be highly desirable but does not exist yet.
2. What are the advantages of keyword-based queries?
They are intuitive, easy to express, and allow for fast ranking.
3. Why do keyword-based queries often contain so few words?
A query can be (and in many cases is) simply a single word, although in general it can be a more complex combination of operations involving several words.
4. What is a query?
A query is the formulation of a user information need. In its simplest form, a query is composed of keywords, and the documents containing such keywords are searched for.
5. What is a retrieval unit?
The retrieval unit is the basic element which can be retrieved as an answer to a query.
6. Describe the Boolean query.
A Boolean query has a syntax composed of atoms (i.e., basic queries) that retrieve documents, and of Boolean operators which work on their operands (which are sets of documents) and deliver sets of documents.
7. What are the Boolean operators of a Boolean query?
OR: the query (e1 OR e2) selects all documents which satisfy e1 or e2. Duplicates are eliminated.
AND: the query (e1 AND e2) selects all documents which satisfy both e1 and e2.
BUT: the query (e1 BUT e2) selects all documents which satisfy e1 but not e2. Notice that classical Boolean logic uses a NOT operation, where (NOT e2) is valid whenever e2 is not.
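Since the operands of these operators are sets of documents, the three operations map naturally onto set union, intersection, and difference. The sketch below assumes a small hypothetical index; the helper names `atom`, `q_or`, `q_and`, and `q_but` are illustrative only.

```python
# Hypothetical inverted lists: term -> set of ids of documents containing it.
index = {
    "information": {1, 2, 4},
    "retrieval": {2, 3, 4},
    "database": {4, 5},
}


def atom(term):
    """Basic query: the set of documents containing the term."""
    return index.get(term, set())


def q_or(e1, e2):
    return e1 | e2   # union; a set holds each document once, so no duplicates


def q_and(e1, e2):
    return e1 & e2   # documents satisfying both operands


def q_but(e1, e2):
    return e1 - e2   # documents satisfying e1 but not e2


# (information AND retrieval) BUT database
result = q_but(q_and(atom("information"), atom("retrieval")), atom("database"))
print(sorted(result))  # [2]
```

Note that BUT only ever removes documents from the result of its first operand, so it never has to enumerate every document that fails the second operand, as a standalone NOT would.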
8. Give an example of a query protocol.
Z39.50 is a protocol approved as a standard in 1995 by ANSI (American National Standards Institute) and NISO (National Information Standards Organization). This protocol is intended to query bibliographical information using a standard interface between the client and the host database manager which is independent of the client user interface and of the query database language at the host.
Chapter 5
1. What is the definition of a query reformulation strategy?
Expanding the original query with new terms and reweighting the terms in the expanded query.
2. What is the main idea of user relevance feedback?
The main idea consists of selecting important terms, or expressions, attached to the documents that have been identified as relevant by the user, and of enhancing the importance of these terms in a new query formulation.
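A rough illustration of this idea is given below. It is a simplified, positive-feedback-only sketch and not a specific published formula: the function name `reformulate`, the weight `beta`, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def reformulate(query, relevant_docs, beta=0.5, max_new_terms=3):
    """Expand and reweight a query using documents the user marked relevant.

    query:         dict mapping term -> weight
    relevant_docs: non-empty list of token lists
    beta:          weight given to feedback evidence (illustrative value)
    """
    # Average frequency of each term over the relevant documents.
    totals = Counter()
    for doc in relevant_docs:
        totals.update(doc)
    avg = {t: n / len(relevant_docs) for t, n in totals.items()}

    new_query = dict(query)
    # Reweighting: boost original terms that also occur in relevant documents.
    for term in query:
        new_query[term] += beta * avg.get(term, 0.0)
    # Expansion: add the most frequent terms not yet in the query.
    candidates = sorted((t for t in avg if t not in query),
                        key=lambda t: (-avg[t], t))
    for term in candidates[:max_new_terms]:
        new_query[term] = beta * avg[term]
    return new_query


query = {"jaguar": 1.0}
relevant_docs = [["jaguar", "cat", "habitat"], ["jaguar", "cat", "prey"]]
print(reformulate(query, relevant_docs, max_new_terms=1))
# {'jaguar': 1.5, 'cat': 0.5}
```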
Chapter 6
1. What is the definition of metadata?
Metadata is information on the organization of the data, the various data domains, and the relationship between them.
In short, metadata is 'data about the data.'
2. What are the functions of metadata?
3. Describe the application of metadata on the Web.
Metadata on Web documents is used for cataloging, content rating, property rights, and digital signatures.
A new standard, the Resource Description Framework (RDF), describes Web resources to facilitate automated processing of information.
4. Give some examples of text formats.
Formats for document interchange (RTF)
Formats for displaying (PDF, PostScript)
Formats for encoding email (MIME)
5. Give some examples of multimedia formats.
Tagged Image File Format (TIFF)
Joint Photographic Experts Group (JPEG)
Portable Network Graphics (PNG)
Moving Picture Experts Group (MPEG)
6. What is a markup language?
Markup is defined as extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc.
7. What are the trends in markup languages?
Chapter 7
1. What are text operations?
Text operations are the various techniques used to transform text.
2. What is the function of text compression?
Compressed text requires less storage space, takes less time to be transmitted over a communication link, and takes less time to search directly.
3. Describe the procedure of document preprocessing.
Document preprocessing is a procedure which can be divided mainly into five text operations (or transformations): lexical analysis, elimination of stopwords, stemming, selection of index terms, and construction of thesauri.
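The first four of these operations can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list is a tiny sample, the suffix stripper is a naive stand-in for a real stemmer such as Porter's, index term selection simply keeps every distinct stem, and thesaurus construction is left out.

```python
import re

# Tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}
# Naive suffix stripping, a stand-in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def lexical_analysis(text):
    """Turn the character stream into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = remove_stopwords(lexical_analysis(text))
    # Index term selection: here simply every distinct stem.
    return sorted({stem(t) for t in tokens})


print(preprocess("The indexing of documents is connected to retrieving documents."))
# ['connect', 'document', 'index', 'retriev']
```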
4. What is the function of elimination of stopwords?
Stopwords occur too frequently to be good discriminators, so they are filtered out as potential index terms. Their elimination also reduces the size of the indexing structure considerably.
5. What are the definition and function of a stem?
A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes).
Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept. Furthermore, stemming has the secondary effect of reducing the size of the indexing structure because the number of distinct index terms is reduced.
6. What is the motivation of text compression?
Text compression is about finding ways to represent the text in fewer bits or bytes.
7. What are the approaches to text compression?
There are two general approaches to text compression: statistical and dictionary based.
8. What is the compression ratio?
Compression ratio is the size of the compressed file as a fraction of the uncompressed file.
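As a quick illustration of this definition, the ratio can be measured with Python's standard `zlib` module, whose DEFLATE format combines a dictionary method (LZ77) with Huffman coding. The helper name `compression_ratio` and the sample text are assumptions made for the example.

```python
import zlib


def compression_ratio(text):
    """Compressed size as a fraction of the uncompressed size."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw)
    return len(compressed) / len(raw)


sample = "information retrieval " * 200
ratio = compression_ratio(sample)
print(f"compression ratio: {ratio:.3f}")  # well below 1 for repetitive text
```

A lower ratio means better compression; very short or random-looking inputs can yield a ratio above 1 because of format overhead.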
Chapter 8
1. What are the approaches to searching for a basic query?
An obvious option in searching for a basic query is to scan the text sequentially.
A second option is to build data structures over the text (called indices) to speed up the search.
2. What are the three main indexing techniques?
Inverted files
Suffix arrays
Signature files
3. What are the definition and structure of an inverted file?
An inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
Structure of an inverted file:
Vocabulary: the set of all distinct words in the text.
Occurrences: lists containing all the information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.).
4. What are the steps of the search algorithm on an inverted index?
The search algorithm on an inverted index follows three steps:
Vocabulary search: the words present in the query are searched in the vocabulary.
Retrieval of occurrences: the lists of the occurrences of all words found are retrieved.
Manipulation of occurrences: the occurrences are processed to solve the query.
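The three steps can be sketched on a toy inverted file as follows. The collection, the function names, and the choice to treat a multi-word query as a conjunction (all words must occur) are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical three-document collection.
docs = {
    1: "information retrieval systems",
    2: "retrieval of text",
    3: "text compression",
}


def build_inverted_file(docs):
    """Map each vocabulary word to its occurrences: {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return dict(index)


def search(index, query):
    """Answer a conjunctive query (all words must occur)."""
    # Step 1, vocabulary search: look up each query word.
    words = query.lower().split()
    # Step 2, retrieval of occurrences: fetch the list for each word.
    lists = [index.get(w, {}) for w in words]
    # Step 3, manipulation of occurrences: intersect the document sets.
    if not lists:
        return []
    result = set(lists[0])
    for occ in lists[1:]:
        result &= set(occ)
    return sorted(result)


index = build_inverted_file(docs)
print(index["retrieval"])               # {1: [1], 2: [0]}
print(search(index, "text retrieval"))  # [2]
```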
Chapter 9
1. What is the definition of parallel computing?
Parallel computing is the simultaneous application of multiple processors to solve a single problem.
2. What is the definition of distributed computing?
Distributed computing is the application of multiple computers connected by a network to solve a single problem.
3. What are parallel IR and distributed IR?
There are two ways in which a retrieval system can exploit a MIMD machine:
Parallel multitasking;
Partitioned parallel processing.
Distributed information retrieval federates heterogeneous data distributed over a wide area into one logical whole, providing users with powerful retrieval capabilities.
[Figure: distributed retrieval architecture]
4. Describe source selection.
Source selection is the process of determining which of the distributed document collections are most likely to contain relevant documents for the current query, and therefore should receive the query for processing.
5. Describe query processing.
(1) Select collections to search.
(2) Distribute the query to the selected collections.
(3) Evaluate the query at the distributed collections in parallel.
(4) Combine the results from the distributed collections into the final result.
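These four steps can be simulated in a single process, with threads standing in for remote collections. Everything here is a hypothetical sketch: the collections, the crude source selection rule (keep any collection containing a query term), and the occurrence-count scoring are assumptions, and in practice scores from different collections may need normalization before merging.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical collections: name -> {doc_id: text}.
collections = {
    "news": {"n1": "parallel retrieval on clusters", "n2": "weather report"},
    "papers": {"p1": "distributed retrieval retrieval models"},
    "recipes": {"r1": "tomato soup"},
}


def select_collections(query_terms):
    """Step 1: keep collections that contain at least one query term."""
    return [name for name, docs in collections.items()
            if any(t in text.split() for text in docs.values() for t in query_terms)]


def evaluate(name, query_terms):
    """Step 3: score local documents by counting query-term occurrences."""
    scored = []
    for doc_id, text in collections[name].items():
        words = text.split()
        score = sum(words.count(t) for t in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    return scored


def process_query(query):
    terms = query.lower().split()
    selected = select_collections(terms)                    # step 1
    with ThreadPoolExecutor() as pool:                      # steps 2 and 3
        partial = list(pool.map(lambda n: evaluate(n, terms), selected))
    merged = [hit for hits in partial for hit in hits]      # step 4
    return sorted(merged, key=lambda h: (-h[0], h[1]))


print(process_query("distributed retrieval"))
# [(3, 'p1'), (1, 'n1')]
```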
Chapter 10
1. What are the principles for the design of user interfaces?
Offer informative feedback.
Permit easy reversal of actions.
Support an internal locus of control.
Reduce working memory load.
Provide alternative interfaces for novice and expert users.
2. Which design principles are particularly important for information access interfaces?
1) Offer informative feedback;
2) Reduce working memory load;
3) Provide alternative interfaces for novice and expert users.
Chapters 11 and 12
1. What is the definition of a Multimedia IR system?
The most important characteristic of a multimedia information system is the variety of data it must be able to support.
The architecture of a Multimedia IR system depends on two main factors: first, the peculiar characteristics of multimedia data, and second, the kinds of operations to be performed on such data.
2. What is the main goal of a Multimedia IR system?
The main goal of a Multimedia IR system is to efficiently perform retrieval, based on user requests, exploiting not only data attributes, as in traditional DBMSs, but also the content of multimedia objects.
3. Give some examples of multimedia data support in commercial DBMSs.
For example, the Oracle DBMS provides the VARCHAR2 data type to represent variable-length character strings.
The Sybase SQL server supports IMAGE and TEXT data types to store images and unstructured text.
4. Describe query languages for Multimedia IR.
Because of the semi-structured nature of multimedia objects, the exact-match approach of traditional DBMSs is no longer adequate in a Multimedia IR system.
More often, a similarity-based approach is applied that considers both the structure and the content of the objects. Queries of the latter type are called content-based queries.
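One way to picture a content-based query is as a ranking of objects by the similarity of their feature vectors to an example. The sketch below assumes hypothetical 3-bin color histograms and uses cosine similarity as the measure; real systems may use quite different features and distance functions.

```python
import math

# Hypothetical content features, e.g. 3-bin color histograms of images.
features = {
    "sunset.jpg": [0.8, 0.15, 0.05],
    "forest.jpg": [0.1, 0.8, 0.1],
    "beach.jpg": [0.6, 0.1, 0.3],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_query(example, k=2):
    """Return the k objects most similar to the example feature vector."""
    ranked = sorted(features, key=lambda name: cosine(features[name], example),
                    reverse=True)
    return ranked[:k]


print(content_query([0.9, 0.05, 0.05]))  # ['sunset.jpg', 'beach.jpg']
```

Unlike an exact-match predicate, the query returns the closest objects even when none matches the example exactly.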
Chapter 13
1. What are the challenges posed by the Web?
The problems related to the data are: distributed data; a high percentage of volatile data; large volume; unstructured and redundant data; quality of data; and heterogeneous data.
2. How can the Web be searched?
There are basically three different forms of searching the Web. The first is to use search engines that index a portion of the Web documents as a full-text database. The second is to use Web directories. The third, not yet fully available, is to search the Web exploiting its hyperlink structure.
3. What is the definition of a search engine?
Search engines are retrieval systems that model the Web as a full-text database. One main difference between standard IR systems and the Web is that, in the Web, all queries must be answered without accessing the text (that is, only the indices are available).
4. What is the structure of a search engine?
5. What are the important aspects of the user interface?
There are two important aspects of the user interface of search engines: the query interface and the answer interface.
6. Give examples of ranking techniques.
Most search engines use variations of the Boolean or vector model to do ranking, such as the classical tf-idf scheme.
Newer ranking algorithms also use hyperlink information, for example PageRank, which is part of the ranking algorithm used by Google.
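Both techniques can be sketched in a few lines. The code below is a simplified illustration, not any search engine's actual implementation: the tf-idf variant (term frequency normalized by the most frequent term, times log(N / n_t)), the damping factor of 0.85, the iteration count, and the sample data are all assumptions, and the PageRank function assumes every page has at least one outgoing link.

```python
import math


def tf_idf(term, doc, docs):
    """Classical tf-idf weight: normalized term frequency times log(N / n_t)."""
    tf = doc.count(term) / max(doc.count(w) for w in set(doc))
    n_t = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / n_t) if n_t else 0.0


def pagerank(links, damping=0.85, iterations=100):
    """Power iteration on a link graph {page: [pages it links to]}.

    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


docs = [["web", "search", "web"], ["search", "engine"], ["link", "analysis"]]
print(round(tf_idf("web", docs[0], docs), 3))  # 1.099
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))               # c
```

Page `c` ranks highest here because it receives links from both other pages, which is the intuition behind using hyperlink information for ranking.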
7. What is the definition of a metasearcher?
Metasearchers are Web servers that send a given query to several search engines, Web directories and other databases, collect the answers and unify them.
8. What are hyperlinks used for?
Hyperlinks can also be used to infer information about the Web. Although this is not exactly searching the Web, this is an important trend called Web mining.
9. Describe Web query languages.
The first generation of Web query languages was aimed at combining content with structure.
The second generation of languages, called Web data manipulation languages, maintains the emphasis on semi-structured data.
Chapters 14 and 15
1. What is the definition of a digital library?
DLs can be just part of the ‘middleware’ of the Internet, providing various services that can be embedded in other task-support systems. DLs can also be independent systems and so must have an architecture of their own in order to be built.
2. What are the architectural issues of DLs?
DLs can be just part of the ‘middleware’ of the Internet, providing various services that can be embedded in other task-support systems. DLs can also be independent systems and so must have an architecture of their own in order to be built.
3. Give an example of a digital library project.
At the University of Michigan, the emphasis has been on agent technologies [97]. This approach can have a number of classes of entities involved in far-flung distributed processing. It is still unknown how efficiently an agent-based DL can operate or even be built.
4. Describe the application of metadata in digital libraries.
We see that DLs can be complex collections with various structuring mechanisms for managing data and descriptions of that data, the so-called metadata.