Data Mining Concepts

  1. Use case: A use case is a list of steps, typically defining interactions between a role (known in UML as an "actor") and a system, to achieve a goal. The actor can be a human or an external system.
  2. e.g.
  3. Stemming: In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.
  4. Stop words: In computing, stop words are words that are filtered out before or after processing of natural language data (text).
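As a rough illustration of the two preprocessing steps above, the sketch below drops stop words and then strips a few common English suffixes. The stop-word set, the suffix list, and the function names `remove_stop_words` and `naive_stem` are assumptions made for this example; real stemmers (for instance the Porter stemmer) apply far more careful rules.

```python
# Tiny, hand-picked stop-word set (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

# Suffixes tried in order, longest first (hypothetical and deliberately crude).
SUFFIXES = ("ing", "ed", "es", "s")


def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "the cats are walking in the gardens".split()
content_words = remove_stop_words(tokens)      # stop words filtered out
stems = [naive_stem(w) for w in content_words]  # crude stems of what remains
```

Here `content_words` keeps only "cats", "walking", and "gardens", and `stems` reduces them to "cat", "walk", and "garden".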
  5. Broad query and exact match: in the results of a broad query, the keywords can appear in any order; in an exact match, they must appear exactly as entered.
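One possible reading of the distinction above, sketched in code: a broad match only requires every keyword to be present somewhere, while an exact match requires the keywords to appear as one contiguous phrase in the same order. The function names `broad_match` and `exact_match` and the whitespace tokenization are assumptions for illustration; actual search engines define these match types in their own ways.

```python
def broad_match(query, text):
    """True if every query keyword occurs somewhere in the text, in any order."""
    words = set(text.lower().split())
    return all(k in words for k in query.lower().split())


def exact_match(query, text):
    """True if the query keywords occur in the text as one contiguous phrase."""
    q = query.lower().split()
    t = text.lower().split()
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


doc = "cheap shoes for running"
broad_hit = broad_match("running shoes", doc)  # both words present
exact_hit = exact_match("running shoes", doc)  # phrase not present in this order
```

With this document, `broad_hit` is true and `exact_hit` is false, because "running" and "shoes" both occur but never side by side in that order.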
  6. double-blind experiment: an experimental procedure in which neither the subjects of the experiment nor the persons administering the experiment know the critical aspects of the experiment; "a double-blind procedure is used to guard against both experimenter bias and placebo effects"
  7. tf-idf: term frequency-inverse document frequency is a numerical statistic that reflects how important a word is to a document in a collection or corpus. The number of times a term occurs in a document is called its term frequency. tf-idf is the product of two statistics, term frequency and inverse document frequency, and is calculated as:
    \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D), \quad \mathrm{idf}(t,D) = \log\frac{|D|}{1 + |\{d \in D : t \in d\}|}
    where \mathrm{tf}(t,d) can be the raw frequency of term t in document d, and |\{d \in D : t \in d\}| is the number of documents in which t appears.
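The formula above can be sketched directly in code. This is a minimal version that uses raw counts for tf and the "+1" in the idf denominator as written in the text; the toy corpus and the function names `tf`, `idf`, and `tfidf` are assumptions for illustration.

```python
import math


def tf(term, doc):
    """Raw term frequency: how many times the term occurs in the document."""
    return doc.count(term)


def idf(term, corpus):
    """log(|D| / (1 + number of documents containing the term))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + containing))


def tfidf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


# Toy corpus: each document is a list of tokens (made up for this example).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a fish swam".split(),
]
score_cat = tfidf("cat", corpus[0], corpus)  # 1 * log(4 / 3)
score_the = tfidf("the", corpus[0], corpus)  # 2 * log(4 / 4) = 0
```

A word such as "the" that appears in most documents gets a score of zero here, while the rarer "cat" gets a positive score. Note that with the "+1" in the denominator, a term occurring in every document would receive a negative idf.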

  8. Random walk: A random walk is a mathematical formalization of a path that consists of a succession of random steps.
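A minimal sketch of the idea, assuming the simplest case: a one-dimensional walk on the integers that starts at 0 and moves one step left or right with equal probability. The function name `random_walk_1d` and the `seed` parameter are hypothetical, chosen for this example.

```python
import random


def random_walk_1d(steps, seed=None):
    """Return the positions visited by a simple symmetric walk starting at 0."""
    rng = random.Random(seed)  # private generator so the seed is reproducible
    position = 0
    path = [position]
    for _ in range(steps):
        position += rng.choice((-1, 1))  # one random step left or right
        path.append(position)
    return path


path = random_walk_1d(10, seed=42)
```

The returned `path` has one more entry than the number of steps, and consecutive positions always differ by exactly 1.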
  9. Monte Carlo methods: also known as statistical simulation methods; the counterpart of deterministic methods.
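A classic illustration of statistical simulation, not taken from the original text, is estimating pi by sampling random points in the unit square and counting how many fall inside the quarter circle of radius 1. The function name `estimate_pi` and the default seed are assumptions for this sketch.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point lies within the quarter circle
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4.
    return 4.0 * inside / samples


pi_estimate = estimate_pi(100_000)
```

The result is only approximate and changes with the seed and sample count; a deterministic method would instead compute the value by a fixed formula or series.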
  10. Heuristic: refers to experience-based techniques for problem solving, learning, and discovery. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, or common sense. The most fundamental heuristic is trial and error.
  11. Recall and precision: precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. E.g., suppose a program for recognizing dogs in scenes identifies 7 dogs in a scene containing 9 dogs and some cats. If 4 of the identifications are correct, but 3 are actually cats, the program's precision is 4/7 (about 0.57) while its recall is 4/9 (about 0.44).
  12.  recall is 0.6; precision is 0.75
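The dog-recognition example in item 11 can be checked with a few lines of arithmetic. This is a small sketch; the function names `precision` and `recall` and their parameter names are assumptions for illustration.

```python
from fractions import Fraction


def precision(true_positives, retrieved):
    """Fraction of retrieved instances that are relevant."""
    return Fraction(true_positives, retrieved)


def recall(true_positives, relevant):
    """Fraction of relevant instances that are retrieved."""
    return Fraction(true_positives, relevant)


# 7 identifications, 4 of them correct, 9 dogs actually in the scene.
p = precision(4, 7)
r = recall(4, 9)
p_decimal = round(float(p), 3)
r_decimal = round(float(r), 3)
```

Using `Fraction` keeps the values exact (4/7 and 4/9); the rounded decimals are about 0.571 and 0.444.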
  13. 1 kilogram equals approximately 2.2 pounds.
  14. CMS: Content Management System.
  15. SERP: Search Engine Results Page, is the actual result returned by a search engine in response to a keyword query.
  16. Web graph: the directed graph formed by taking all World Wide Web pages as nodes and the hyperlinks between them as edges.
  17. Ad-hoc & A priori:
    • Ad-hoc: It generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be adapted to other purposes.
    • A priori: A priori knowledge or justification is independent of experience.
  18. Heuristic: refers to experience-based techniques for problem solving, learning, and discovery that give a solution which is not guaranteed to be optimal. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution via mental shortcuts that ease the cognitive load of making a decision. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, stereotyping, or common sense.
  19. True positive, etc:
  20. Positive = identified and negative = rejected.
    
    Therefore:
    True positive = correctly identified
    False positive = incorrectly identified
    True negative = correctly rejected
    False negative = incorrectly rejected
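The four outcomes above can be counted mechanically from a list of true labels and a list of predictions. The sketch below is one way to do it; the function name `confusion_counts` and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count (TP, FP, TN, FN) for two equal-length lists of booleans."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1  # correctly identified
        elif p and not a:
            fp += 1  # incorrectly identified
        elif not p and not a:
            tn += 1  # correctly rejected
        else:
            fn += 1  # incorrectly rejected
    return tp, fp, tn, fn


actual = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
tp, fp, tn, fn = confusion_counts(actual, predicted)
```

For these sample labels the counts are 2 true positives, 1 false positive, 2 true negatives, and 1 false negative.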
  21. Power law: a power-law relationship between two quantities x and y can be written as y = ax^k, where a and k are constants.
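Taking logarithms of y = ax^k gives log y = log a + k log x, a straight line in log-log coordinates, so a and k can be recovered with an ordinary least-squares line fit. The sketch below assumes positive x and y values; the function name `fit_power_law` and the synthetic data are assumptions for this example.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + k * log x; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the fitted line in log-log space is the exponent k.
    k = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    # Intercept of the fitted line is log a.
    a = math.exp(mean_y - k * mean_x)
    return a, k


xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]  # synthetic data generated with a = 3, k = 2
a, k = fit_power_law(xs, ys)
```

On this noise-free data the fit recovers a and k up to floating-point error; on real data the same fit gives only an estimate.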

  22. A priori knowledge: a priori means "from the earlier"; a posteriori means "from the later".
  23. s.t.: abbreviation meaning "such that" or "subject to".
posted @ 2013-04-03 23:05  wxwcase