Data Mining Concepts
- Use case: a use case is a list of steps, typically defining interactions between a role (known in UML as an "actor") and a system, to achieve a goal. The actor can be a human or an external system.
- e.g.
- Stemming: In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. For example, "fishing" and "fished" both reduce to "fish".
- Stop words: In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text).
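A minimal sketch of stop-word filtering before further processing. The `STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions, not a standard list or library API, and tokenization is plain whitespace splitting:

```python
# Hypothetical, deliberately tiny stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "and"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The process of reducing words to a stem"))
# -> ['process', 'reducing', 'words', 'stem']
```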
- Broad query and exact match: in the results of a broad query, the keywords can appear in any order; in an exact match, they must appear in the order given.
- double-blind experiment: an experimental procedure in which neither the subjects of the experiment nor the persons administering the experiment know the critical aspects of the experiment; "a double-blind procedure is used to guard against both experimenter bias and placebo effects"
- tf-idf: term frequency–inverse document frequency is a numerical statistic that reflects how important a word is to a document in a collection or corpus. The number of times a term occurs in a document is called its term frequency. tf-idf is the product of two statistics, term frequency and inverse document frequency, and is calculated as:
- tfidf(t,d,D) = tf(t,d) × idf(t,D); idf(t,D) = log(|D| / (number of documents containing t + 1))
- where tf(t,d) can be the raw frequency of term t in document d, and |D| is the number of documents in the corpus D.
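The tf-idf definition above can be sketched directly in code. This is only an illustration: it assumes raw counts for tf, the "+1" smoothed idf given in these notes, and documents that are already tokenized into lists of words. The function names and the toy corpus are made up for the example.

```python
import math

def tf(term, doc):
    """Raw term frequency: how many times `term` occurs in the token list `doc`."""
    return doc.count(term)

def idf(term, corpus):
    """log(|D| / (number of documents containing term + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

# Toy corpus (illustrative data only).
corpus = [
    ["data", "mining", "finds", "patterns", "in", "data"],
    ["search", "engines", "rank", "pages"],
    ["random", "walks", "on", "graphs"],
    ["text", "retrieval", "systems"],
]

# "data" occurs twice in the first document and in 1 of 4 documents,
# so the score is 2 * log(4 / 2).
print(tf_idf("data", corpus[0], corpus))
```

Note that with this "+1" variant a term that appears in every document gets a negative idf, since |D| / (|D| + 1) is less than 1.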
- Random walk: A random walk is a mathematical formalization of a path that consists of a succession of random steps.
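As a rough illustration, the simplest case is a one-dimensional walk on the integers where each step is +1 or -1 with equal probability. The function name and the `seed` parameter are choices made for this sketch:

```python
import random

def random_walk_1d(n_steps, seed=None):
    """Return the list of positions visited, starting at 0."""
    rng = random.Random(seed)  # local generator so runs can be reproduced
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += rng.choice((-1, 1))  # one random step left or right
        path.append(position)
    return path

path = random_walk_1d(10, seed=0)
print(path)
```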
- Monte Carlo methods: also known as statistical simulation methods; they rely on repeated random sampling, in contrast to deterministic methods.
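A classic small example of the Monte Carlo idea is estimating pi by sampling random points in the unit square and counting how many fall inside the quarter circle of radius 1. The sketch below assumes uniform sampling; `estimate_pi` and its parameters are names chosen for this illustration:

```python
import random

def estimate_pi(n_samples, seed=None):
    """Estimate pi as 4 * (fraction of random points inside the quarter circle)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

print(estimate_pi(100_000, seed=42))
```

The estimate is random, so it only approaches pi as the number of samples grows; a deterministic method would return the same fixed answer every time.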
- Heuristic: refers to experience-based techniques for problem solving, learning, and discovery. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, or common sense. The most fundamental heuristic is trial and error.
- recall and precision: precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. E.g.: suppose a program for recognizing dogs in scenes identifies 7 dogs in a scene containing 9 dogs and some cats. If 4 of the identifications are correct, but 3 are actually cats, the program's precision is 4/7 while its recall is 4/9.
- precision is 4/7 ≈ 0.57; recall is 4/9 ≈ 0.44
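The dog example can be checked with a few lines of code. This is a sketch with a hypothetical helper, using the counts from the example: 4 correct identifications, 3 cats mistaken for dogs, and 9 - 4 = 5 dogs that were missed.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# 7 identifications, 4 of them correct, 9 dogs actually in the scene.
p, r = precision_recall(true_positives=4, false_positives=3, false_negatives=5)
print(round(p, 2), round(r, 2))  # -> 0.57 0.44
```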
- 1 kilogram equals approximately 2.2 pounds.
- CMS: Content Management System.
- SERP: Search Engine Results Page, the page of results returned by a search engine in response to a keyword query.
- Web Graph: the directed graph formed by taking all World Wide Web pages as nodes and the hyperlinks between them as edges.
- Ad-hoc & A priori:
- Ad-hoc: generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be adapted to other purposes.
- A priori: a priori knowledge or justification is independent of experience.
- Heuristic: refers to experience-based techniques for problem solving, learning, and discovery that give a solution which is not guaranteed to be optimal. Where the exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution via mental shortcuts to ease the cognitive load of making a decision. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, stereotyping, or common sense.
- True positive, etc:
- Positive = identified and negative = rejected. Therefore:
- True positive = correctly identified
- False positive = incorrectly identified
- True negative = correctly rejected
- False negative = incorrectly rejected
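The four outcomes can be counted from paired lists of actual and predicted labels. This is a sketch only; `confusion_counts` and the sample labels are invented for illustration, with `True` meaning "positive":

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN from two equal-length lists of booleans."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1  # correctly identified
        elif p and not a:
            fp += 1  # incorrectly identified
        elif not p and not a:
            tn += 1  # correctly rejected
        else:
            fn += 1  # incorrectly rejected
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

actual    = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
print(confusion_counts(actual, predicted))
# -> {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```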
- Power law: a power law relationship between two quantities x and y can be written as y = ax^k, where a and k are constants.
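Taking logarithms of both sides of y = ax^k gives log y = log a + k log x, so a power law is a straight line on log-log axes with slope k. One common way to recover a and k from data, sketched here with ordinary least squares on the logs (the function name and sample data are assumptions for this example, and all x and y must be positive):

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**k by a least-squares line through (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the log-log line is the exponent k.
    k = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
         / sum((u - mean_x) ** 2 for u in lx))
    # Intercept of the log-log line is log a.
    a = math.exp(mean_y - k * mean_x)
    return a, k

# Synthetic data generated from y = 3 * x**2, so the fit should recover a ≈ 3, k ≈ 2.
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]
a, k = fit_power_law(xs, ys)
print(a, k)
```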
- A priori knowledge: a priori is Latin for "from the earlier"; a posteriori is "from the later".
- s.t.: abbreviation meaning "such that" or "subject to".