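TF-IDF (Term Frequency-Inverse Document Frequency) Algorithm
I. Introduction
1. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining.
2. TF-IDF is a statistical method for evaluating how important a word is to one document within a document collection or corpus.
3. A word's importance increases with the number of times it appears in the document, but decreases with how frequently it appears across the corpus.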
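II. Term Frequency (TF)
Term frequency is the number of times a given word appears in a given document. The count is usually normalized so that it is not biased toward long documents (the same word is likely to have a higher raw count in a long document than in a short one, regardless of how important it is).
Formula:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
$n_{i,j}$: the number of times the word $t_i$ appears in document $d_j$. The denominator is the sum of the occurrence counts of all words in $d_j$.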
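The normalization above can be sketched in plain Scala, without Spark. This is only an illustration: the object name `TermFrequency` and the function `tf` are hypothetical, and the document is assumed to be already split into tokens.

```scala
object TermFrequency {
  /** Relative frequency of every token in one document:
    * tf(t, d) = (occurrences of t in d) / (total number of tokens in d).
    * `tokens` is assumed to be an already tokenized document. */
  def tf(tokens: Seq[String]): Map[String, Double] = {
    val total = tokens.size.toDouble
    tokens.groupBy(identity).map { case (term, occurrences) =>
      term -> occurrences.size / total
    }
  }
}
```

For the four-token document `Seq("spark", "is", "fast", "spark")`, `spark` gets 2/4 and the other two words get 1/4 each, so the values of one document always sum to 1.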
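III. Inverse Document Frequency (IDF)
Inverse document frequency is a measure of a word's general importance across the corpus. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents that contain the word, and then taking the logarithm of the quotient.
Formula:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
$|D|$: the total number of documents in the corpus
$|\{j : t_i \in d_j\}|$: the number of documents that contain $t_i$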
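A minimal plain-Scala sketch of this definition follows. The names `InverseDocumentFrequency` and `idf` are hypothetical, and the natural logarithm is an assumption, since the text does not fix a base (a different base only rescales every value by the same constant).

```scala
object InverseDocumentFrequency {
  /** idf(t) = log(|D| / number of documents containing t).
    * Uses the natural logarithm (an assumption; the base is not specified). */
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size.toDouble
    docs
      .flatMap(_.distinct) // count each term at most once per document
      .groupBy(identity)
      .map { case (term, hits) => term -> math.log(n / hits.size) }
  }
}
```

A word that occurs in every document gets log(1) = 0, while a word that occurs in only one of three documents gets log(3).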
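IV. TF-IDF
Formula: TF-IDF = TF * IDF
Characteristics: A high frequency of a word within a particular document, combined with a low document frequency of that word across the whole corpus, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.
Idea: If a word or phrase appears with high frequency (TF) in one article and rarely in other articles, it is considered to discriminate well between categories and to be suitable for classification.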
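The filtering effect described above can be seen in a small self-contained sketch. It is not Spark code: `TfIdfSketch` and `tfIdf` are hypothetical names, the corpus is assumed to be tokenized already, and the natural logarithm is an assumption.

```scala
object TfIdfSketch {
  /** TF-IDF weight of every term in every document of a tokenized corpus. */
  def tfIdf(corpus: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = corpus.size.toDouble
    // Document frequency: the number of documents each term occurs in.
    val df: Map[String, Int] =
      corpus.flatMap(_.distinct).groupBy(identity).map { case (t, hits) => t -> hits.size }
    corpus.map { doc =>
      val total = doc.size.toDouble
      doc.groupBy(identity).map { case (term, occurrences) =>
        val tf = occurrences.size / total
        val idf = math.log(n / df(term)) // natural log (assumption)
        term -> tf * idf
      }
    }
  }
}
```

With the three documents `spark is fast`, `spark is popular` and `spark runs everywhere`, the word `spark` occurs everywhere and gets weight 0 in the first document, `is` gets a small positive weight, and `fast`, which occurs in that document only, gets the highest weight.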
V. Code Implementation
    package big.data.analyse.tfidf

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession

    /**
      * Created by zhen on 2019/05/28.
      */
    object TF_IDF {
      // Set the log level
      Logger.getLogger("org").setLevel(Level.WARN)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession
          .builder()
          .appName("TF_IDF")
          .master("local[2]")
          .config("spark.sql.warehouse.dir", "file:///D://warehouse")
          .getOrCreate()
        val sc = spark.sparkContext

        // Compute TF
        val tf = sc.textFile("src/big/data/analyse/tfidf/TF.txt")
          .map(row => row.replace(",", " ").replace(".", " ")) // data cleaning: punctuation to spaces
          .flatMap(row => row.split(" ")) // split into words (empty tokens are not filtered)
          .map(row => (row, 1.0))
          .reduceByKey(_ + _) // number of occurrences of each word

        val tfSize = tf.map(row => row._2).sum() // total number of words

        val tfed = tf.map(row => (row._1, row._2 / tfSize)) // term frequency
        println("TF:")
        tfed.foreach(println)

        // Compute IDF: mark every distinct word of each document with 1.0
        val idf_0 = tf.map(row => (row._1, 1.0))
        println("Loading IDF1 file data...")
        val idf_1 = sc.textFile("src/big/data/analyse/tfidf/IDF1.txt")
          .map(row => row.replace(",", " ").replace(".", " "))
          .flatMap(row => row.split(" "))
          .distinct()
          .map(row => (row, 1.0))

        println("Loading IDF2 file data...")
        val idf_2 = sc.textFile("src/big/data/analyse/tfidf/IDF2.txt")
          .map(row => row.replace(",", " ").replace(".", " "))
          .flatMap(row => row.split(" "))
          .distinct()
          .map(row => (row, 1.0))

        // Merge the corpus data: summing the 1.0 markers gives the document frequency
        val idf = idf_0.union(idf_1).union(idf_2)
          .reduceByKey(_ + _)
          // 3 = number of documents in the corpus. This is the plain ratio |D| / df;
          // the logarithm from the IDF formula is not applied here.
          .map(row => (row._1, 3 / row._2))
        println("IDF:")
        idf.foreach(println)

        // Join TF with IDF and compute TF-IDF
        println("TF-IDF:")
        tfed.join(idf)
          .map(row => (row._1, "%.4f".format(row._2._1 * row._2._2)))
          .foreach(println)
      }
    }
VI. Results
TF:
(GraphX,0.011494252873563218) (are,0.011494252873563218) (learning,0.011494252873563218) (Python,0.011494252873563218) (provides,0.011494252873563218) (is,0.022988505747126436) (Please,0.011494252873563218) (higher-level,0.011494252873563218) (general,0.011494252873563218) (Security,0.034482758620689655) (R,0.011494252873563218) (fast,0.011494252873563218) (SQL,0.022988505747126436) (Apache,0.011494252873563218) (Java,0.011494252873563218) (data,0.011494252873563218) (attack,0.011494252873563218) (This,0.011494252873563218) (cluster,0.011494252873563218) (graph,0.011494252873563218) (execution,0.011494252873563218) (MLlib,0.011494252873563218) (Scala,0.011494252873563218) (computing,0.011494252873563218) (downloading,0.011494252873563218) (Streaming,0.011494252873563218) (supports,0.022988505747126436) (engine,0.011494252873563218) (set,0.011494252873563218) (running,0.011494252873563218) (Spark,0.08045977011494253) (you,0.011494252873563218) (Overview,0.011494252873563218) (general-purpose,0.011494252873563218) (rich,0.011494252873563218) (APIs,0.011494252873563218) (vulnerable,0.011494252873563218) (that,0.011494252873563218) (a,0.022988505747126436) (high-level,0.011494252873563218) (processing,0.022988505747126436) (OFF,0.011494252873563218) (before,0.011494252873563218) (including,0.011494252873563218) (could,0.011494252873563218) (optimized,0.011494252873563218) (in,0.022988505747126436) (to,0.011494252873563218) (see,0.011494252873563218) (graphs,0.011494252873563218) (of,0.011494252873563218) (also,0.011494252873563218) (by,0.022988505747126436) (structured,0.011494252873563218) (tools,0.011494252873563218) (It,0.022988505747126436) (for,0.034482758620689655) (mean,0.011494252873563218) (an,0.011494252873563218) (machine,0.011494252873563218) (and,0.06896551724137931) (system,0.011494252873563218) (default,0.022988505747126436)
Loading IDF1 file data...
Loading IDF2 file data...
IDF:
(running,1.5) (For,3.0) (visit,3.0) (The,3.0) (you,1.0) (website,1.5) (than,3.0) (7,3.0)
(PATH,3.0) (that,1.0) (was,1.5) (a,1.0) (main,3.0) (old,3.0) (high-level,1.5) (be,1.5) (quick,3.0) (processing,1.5) (could,1.5) (all,3.0) (augmenting,3.0) (optimized,1.5) (Downloads,3.0) (follow,3.0) (applications,3.0) (classpath,3.0) (structured,1.5) (like,1.5) (along,3.0) (support,3.0) (Spark’s,1.5) (If,3.0) (but,3.0) (and,1.0) (reference,3.0) (1,3.0) (g,3.0) (system,1.5) (your,3.0) (10,3.0) (It’s,3.0) (are,1.0) (learning,1.5) (download,1.5) (its,3.0) (After,3.0) (Building,3.0) (can,1.5) (Security,1.5) (have,3.0) (runs,3.0) (6,3.0) (build,3.0) (0,1.5) (SQL,1.0) (with,1.5) (locally,3.0) (projects,3.0) (their,3.0) (Get,3.0) (UNIX-like,3.0) (This,1.0) (,1.5) (first,3.0) (documentation,3.0) (Since,3.0) (still,3.0) (Downloading,3.0) (packaged,3.0) (better,3.0) (However,3.0) (switch,3.0) (hood,3.0) (Linux,3.0) (Streaming,1.5) (supports,1.5) (PyPI,3.0) ((2,3.0) (vulnerable,1.5) (RDD,3.0) (Dataset,3.0) (package,3.0) (this,3.0) (under,3.0) (Python,1.0) (provides,1.0) (API,1.5) (higher-level,1.5) (introduction,3.0) (Apache,1.5) (will,1.5) (Java,1.0) (2,1.5) (data,1.5) (as,3.0) (YARN,3.0) (installed,3.0) (pointing,3.0) (optimizations,3.0) (get,3.0) (cluster,1.5) (tutorial,3.0) (graph,1.5) (easy,3.0) (execution,1.5) (MLlib,1.5) (We,3.0) (you’d,3.0) (supported,3.0) (downloading,1.5) (shell,3.0) (handful,3.0) (1+,3.0) (Users,3.0) (engine,1.5) (version,1.5) (11,3.0) (set,1.5) (performance,3.0) (rich,1.5) (systems,3.0) (replaced,3.0) (Spark,1.0) (project,3.0) (Overview,1.5) (APIs,1.5) (Mac,3.0) (or,1.5) (popular,3.0) (Support,3.0) (richer,3.0) (downloads,3.0) (OFF,1.5) (future,3.0) (detailed,3.0) (GraphX,1.5) (removed,3.0) (4,3.0) (installation,3.0) (Please,1.5) (is,1.0) (guide,3.0) (recommend,3.0) (R,1.5) (general,1.5) (JAVA_HOME,3.0) (fast,1.5) (include,3.0) (need,3.0) (one,3.0) (attack,1.5) (how,3.0) (uses,3.0) (compatible,3.0) (information,3.0) (we,3.0) (interactive,3.0) (—,3.0) (using,1.5) (Note,1.5) (7+/3,3.0) (java,3.0) (pre-packaged,3.0) (Scala,1.0) (any,1.5) 
(computing,1.5) (variable,3.0) (users,3.0) (from,1.5) (has,3.0) (won’t,3.0) (through,3.0) (at,3.0) (more,3.0) (3,3.0) (versions,3.0) (of,1.0) (tools,1.5) (8+,3.0) (by,1.0) (mean,1.5) (RDDs,3.0) ((e,3.0) (It,1.5) (for,1.0) (To,3.0) (were,3.0) (both,3.0) (an,1.0) (12,3.0) (which,3.0) (machine,1.5) (libraries,3.0) (introduce,3.0) (environment,3.0) ((in,3.0) (programming,3.0) (See,3.0) (use,1.5) (default,1.5) (the,1.5) (write,3.0) (highly,3.0) (release,3.0) (Resilient,3.0) (interface,3.0) (strongly-typed,3.0) (about,3.0) (run,3.0) (general-purpose,1.5) (5,3.0) (Distributed,3.0) (on,3.0) (You,3.0) (source,3.0) (Scala),3.0) (show,3.0) (then,3.0) (before,1.0) (including,1.5) (to,1.0) (in,1.0) (client,3.0) (see,1.5) (HDFS,1.5) (graphs,1.5) (Hadoop’s,3.0) (also,1.5) (“Hadoop,3.0) (binary,3.0) (x),3.0) (free”,3.0) (Maven,3.0) (coordinates,3.0) (Windows,3.0) (deprecated,3.0) (install,3.0) ((RDD),3.0) (4+,3.0) (page,3.0) (OS),3.0) (Hadoop,1.5)
TF-IDF:
(you,0.0115) (that,0.0115) (a,0.0230) (high-level,0.0172) (processing,0.0345) (could,0.0172) (optimized,0.0172) (structured,0.0172) (and,0.0690) (system,0.0172) (are,0.0115) (learning,0.0172) (Security,0.0517) (SQL,0.0230) (This,0.0115) (Streaming,0.0172) (supports,0.0345) (vulnerable,0.0172) (Spark,0.0805) (Overview,0.0172) (APIs,0.0172) (OFF,0.0172) (of,0.0115) (tools,0.0172) (by,0.0230) (mean,0.0172) (It,0.0345) (for,0.0345) (an,0.0115) (machine,0.0172) (default,0.0345) (Python,0.0115) (provides,0.0115) (higher-level,0.0172) (Apache,0.0172) (GraphX,0.0172) (Please,0.0172) (is,0.0230) (R,0.0172) (general,0.0172) (fast,0.0172) (attack,0.0172) (Java,0.0115) (Scala,0.0115) (computing,0.0172) (data,0.0172) (cluster,0.0172) (graph,0.0172) (execution,0.0172) (MLlib,0.0172) (downloading,0.0172) (engine,0.0172) (set,0.0172) (rich,0.0172) (general-purpose,0.0172) (before,0.0115) (including,0.0172) (to,0.0115) (in,0.0230) (see,0.0172) (graphs,0.0172) (also,0.0172)
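The printed figures can be checked by hand. Every word that occurs once has TF 0.011494..., which is 1/87, so the TF document appears to contain 87 tokens. The printed IDF values (3.0, 1.5, 1.0) are the plain ratio of 3 documents to the document frequency, without a logarithm. The check below uses the `Security` entries from the output; the last line is an assumption about what a natural-log IDF would give for the same word, not something the program printed.

```latex
\begin{aligned}
tf_{\text{Security}} &= \frac{3}{87} \approx 0.034483 \\
\text{printed IDF (ratio, no log)} &= \frac{3}{2} = 1.5 \\
tf \times 1.5 &= \frac{3}{87} \times 1.5 \approx 0.0517 \\
\text{with } idf = \ln\frac{3}{2} \approx 0.4055: \quad tf \times idf &\approx 0.0140
\end{aligned}
```

Under the logarithmic definition, a word found in all three documents, such as `Spark`, would get IDF ln(3/3) = 0 and therefore a TF-IDF of 0, rather than the 0.0805 shown above.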