Hadoop Accessing a MySQL Database _2_ Counting How Often Each sid Is Cited
Suppose a reference table stores (uid, sid) pairs. We now want to count in MapReduce how many times each sid is referenced, following the previous post and the citation-count example in "Hadoop in Action".
Create the slist_8 table and import its data.
Create SidCitated.java.
A further step: take two MySQL tables as input and join each sid with its corresponding name.
Note: what are the key and value that DBInputFormat delivers to the mapper? (In the old mapred API, the key is a LongWritable record index produced by DBInputFormat, and the value is an instance of the configured DBWritable class.)
On the virtual machine:
javac -classpath hadoop-core-1.0.3.jar:lib/commons-cli-1.2.jar:lib/mysql-connector-java-5.1.18-bin.jar -d mysinfo/classes/ mysinfo/src/SidCitated.java
jar -cvf mysinfo/SidCitated.jar -C mysinfo/classes/ .
bin/hadoop fs -rmr dboutput    // the output path is hard-coded in the job
time bin/hadoop jar mysinfo/SidCitated.jar SidCitated
First version, which just echoes the input to the output (identity reduce):
import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.db.DBWritable;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;

import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;

public class SidCitated extends Configured implements Tool {

    // One row of slist_8. add "static" ?? (yes: Hadoop instantiates it reflectively, so it must be static)
    public static class SlistRecord implements Writable, DBWritable {
        int uid;
        int sid;

        public SlistRecord() { // add by brian
            System.out.println("SlistRecord()");
        }

        public SlistRecord(SlistRecord t) { // add by brian
            System.out.println("SlistRecord(SlistRecord t)");
            this.uid = t.uid;
            this.sid = t.sid;
        }

        public void readFields(DataInput in) throws IOException {
            this.uid = in.readInt();
            this.sid = in.readInt();
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(this.uid);
            out.writeInt(this.sid);
        }

        public void readFields(ResultSet result) throws SQLException {
            this.uid = result.getInt(1);
            this.sid = result.getInt(2);
        }

        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setInt(1, this.uid);
            stmt.setInt(2, this.sid);
        }

        public String toString() {
            return this.uid + "," + this.sid;
        }
    }

    // Identity-style mapper: emits (uid, "uid,sid") unchanged.
    public static class SidMapper extends MapReduceBase
            implements Mapper<LongWritable, SlistRecord, LongWritable, Text> {

        private final static IntWritable uno = new IntWritable(1); // not used in this identity version
        private IntWritable citationCount = new IntWritable();     // not used in this identity version

        public void map(LongWritable key, SlistRecord value,
                OutputCollector<LongWritable, Text> collector, Reporter reporter) throws IOException {
            collector.collect(new LongWritable(value.uid), new Text(value.toString()));
        }
    }

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        JobConf job = new JobConf(conf, SidCitated.class);
        // job.set("mapred.job.tracker", "172.19.102.12:9001");
        // DistributedCache.addFileToClassPath(new Path("/lib/mysql-connector-java-5.1.18-bin.jar"), job);
        job.setInputFormat(DBInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("dboutput"));
        // DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://172.19.102.12/tmp", "brian", "123456");
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/tmp", "root", "123456");
        String[] fields = {"uid_", "sid"};
        // DBInputFormat.setInput(job, SlistRecord.class, "slist_8", null, "uid_", fields);
        DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);
        job.setMapperClass(SidMapper.class);
        // job.setReducerClass(SidReducer.class);
        job.setReducerClass(IdentityReducer.class);
        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SidCitated(), args);
        System.exit(res);
    }
}
Running against the full table (the commented-out setInput with a null condition) fails with:
Error: Java heap space
With the condition restored:
DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);
that is, querying only sids below one million (at most six digits), it runs fine:
brian@ubuntu:~/Downloads/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/* | wc
cat: File does not exist: /user/brian/dboutput/_logs
56662 113324 1437963
The table is too big and memory overflows!! Presumably it is the Java heap that gets blown here?
A problem well worth taking seriously!!!
The VM runs 2 maps here. Is the overflow per map task or global? If per task, could increasing the number of maps avoid it?
Assuming DBInputFormat has each map read only its own split, and assuming maps on one CPU run serially, would it suffice to keep (number of CPUs) x (split size) below the Java heap space?
If so, how is the split size configured, and will different maps on different nodes automatically read only their own portion?
Worth digging into: how split sizes are configured, how to check from the web UI that the results are consistent, and how DBInputFormat keeps reads of large data volumes efficient (especially with few nodes and few maps; is few nodes with many maps different?).
What happens by default? A sketch of the relevant knobs follows.
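As of Hadoop 1.x, DBInputFormat issues a SELECT COUNT(*) and then carves the table into roughly numMapTasks chunks via LIMIT ... OFFSET, one chunk per split, so the overflow above is local to a single map task's JVM, and more map tasks mean a smaller result set per task. A separate likely culprit: MySQL Connector/J buffers the entire ResultSet in memory by default unless the statement is opened in streaming mode (fetch size Integer.MIN_VALUE), so even one split's query can exhaust the heap. A minimal sketch of the knobs, with purely illustrative values:

import org.apache.hadoop.mapred.JobConf;

// Sketch only: tuning knobs relevant to the heap-space question above.
public class SplitTuningSketch {
    static void tune(JobConf job) {
        // Hint for DBInputFormat's split count: the table is divided into
        // about this many LIMIT/OFFSET chunks, one chunk per map task.
        job.setNumMapTasks(8);
        // Per-task JVM heap; each map/reduce child gets its own -Xmx,
        // so the "Java heap space" error is per task, not global.
        job.set("mapred.child.java.opts", "-Xmx512m");
    }
}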
-----------------------------------------------
20120923
12/09/23 19:07:50 INFO mapred.JobClient: Task Id : attempt_201209201531_0020_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1019)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
at SidCitated$SidMapper.map(SidCitated.java:89)
at SidCitated$SidMapper.map(SidCitated.java:80)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
http://www.hadoopor.com/thread-2516-1-1.html says to declare the output types explicitly, e.g.:
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
The reason: in the old API the map output classes default to the job output classes, which in turn default to LongWritable/Text, so once the mapper emits LongWritable values they must be declared; in this job all four become LongWritable, as in the listing below.
brian@ubuntu:~/Downloads/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/part-00000 | grep 400000
400000 9252
This matches the count computed directly in MySQL:
mysql> select count(*),sid from slist_8 where sid = 400000 group by sid;
+----------+--------+
| count(*) | sid    |
+----------+--------+
|     9252 | 400000 |
+----------+--------+
@Question: if we move this example to the test machine and drop the "sid < 1000000" condition from DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields),
will it overflow there too?
The code at this point:
import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.util.Iterator;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.db.DBWritable;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;

import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;

public class SidCitated extends Configured implements Tool {

    // One row of slist_8. add "static" ?? (yes: Hadoop instantiates it reflectively, so it must be static)
    public static class SlistRecord implements Writable, DBWritable {
        int uid;
        int sid;

        public SlistRecord() { // add by brian
            System.out.println("SlistRecord()");
        }

        public SlistRecord(SlistRecord t) { // add by brian
            System.out.println("SlistRecord(SlistRecord t)");
            this.uid = t.uid;
            this.sid = t.sid;
        }

        public void readFields(DataInput in) throws IOException {
            this.uid = in.readInt();
            this.sid = in.readInt();
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(this.uid);
            out.writeInt(this.sid);
        }

        public void readFields(ResultSet result) throws SQLException {
            this.uid = result.getInt(1);
            this.sid = result.getInt(2);
        }

        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setInt(1, this.uid);
            stmt.setInt(2, this.sid);
        }

        public String toString() {
            return this.uid + "," + this.sid;
        }
    }

    // Emits (sid, 1) for every (uid, sid) row.
    public static class SidMapper extends MapReduceBase
            implements Mapper<LongWritable, SlistRecord, LongWritable, LongWritable> {

        private final static LongWritable uno = new LongWritable(1);

        public void map(LongWritable key, SlistRecord value,
                OutputCollector<LongWritable, LongWritable> collector, Reporter reporter) throws IOException {
            // collector.collect(new LongWritable(value.uid), new Text(value.toString()));
            collector.collect(new LongWritable(value.sid), uno);
        }
    }

    // Sums the 1s per sid.
    public static class SidReducer extends MapReduceBase
            implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {

        public void reduce(LongWritable key, Iterator<LongWritable> values,
                OutputCollector<LongWritable, LongWritable> output, Reporter reporter) throws IOException {
            long count = 0;
            while (values.hasNext()) {
                count += values.next().get();
            }
            output.collect(key, new LongWritable(count));
        }
    }

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        JobConf job = new JobConf(conf, SidCitated.class);
        // job.set("mapred.job.tracker", "172.19.102.12:9001");
        // DistributedCache.addFileToClassPath(new Path("/lib/mysql-connector-java-5.1.18-bin.jar"), job);
        job.setInputFormat(DBInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("dboutput"));
        // DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://172.19.102.12/tmp", "brian", "123456");
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/tmp", "root", "123456");
        String[] fields = {"uid_", "sid"};
        // DBInputFormat.setInput(job, SlistRecord.class, "slist_8", null, "uid_", fields);
        DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);

        job.setMapperClass(SidMapper.class);
        job.setReducerClass(SidReducer.class);
        // job.setReducerClass(IdentityReducer.class);

        // declare the map/job output types explicitly (the fix for the type mismatch above)
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SidCitated(), args);
        System.exit(res);
    }
}
Import the slist_8 table on the test machine:
use tmp;
set names utf8;
CREATE TABLE `slist_8` (
`uid_` bigint(19) NOT NULL DEFAULT '0',
`sid` int(10) NOT NULL DEFAULT '0',
`snameremark` varchar(255) DEFAULT NULL,
`__version` bigint(20) unsigned DEFAULT '0',
`__deleted` tinyint(4) DEFAULT '0',
PRIMARY KEY (`uid_`,`sid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
load data infile '/tmp/slist_8' into table slist_8 fields terminated by ',' enclosed by '\"';
Note: the table name must not be quoted, and the file to load has to sit under /tmp (is that because the earlier export was written to /tmp?? More likely it is simply that LOAD DATA INFILE is read server-side, and the mysqld process can read files under /tmp).
mysql> load data infile '/tmp/slist_8' into table slist_8 fields terminated by ',' enclosed by '\"';
Query OK, 815582 rows affected (7.27 sec)
Records: 815582 Deleted: 0 Skipped: 0 Warnings: 0
select * from slist_8 limit 10;
Check that the import succeeded and the rows display correctly.
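Before wiring the job to the test machine's database, it may also be worth checking the JDBC side with the same connector jar the job uses; a small sanity check (a sketch, using the test-machine URL and credentials that appear in the listing below):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: confirms the connector jar, URL, and credentials work outside Hadoop.
public class JdbcSanityCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // same driver class the job configures
        Connection conn = DriverManager.getConnection("jdbc:mysql://172.19.102.12/tmp", "brian", "123456");
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM slist_8");
        if (rs.next()) {
            System.out.println("rows: " + rs.getLong(1)); // expect 815582 after the import above
        }
        rs.close();
        st.close();
        conn.close();
    }
}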
Update the database and cluster configuration in the code. Everything outside run() is identical to the previous listing; the modified run() follows (note that the DistributedCache.addFileToClassPath call assumes the connector jar has been uploaded to HDFS at /lib/mysql-connector-java-5.1.18-bin.jar):
    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        JobConf job = new JobConf(conf, SidCitated.class);
        job.set("mapred.job.tracker", "172.19.102.12:9001");
        DistributedCache.addFileToClassPath(new Path("/lib/mysql-connector-java-5.1.18-bin.jar"), job);
        job.setInputFormat(DBInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("dboutput"));
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://172.19.102.12/tmp", "brian", "123456");
        // DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/tmp", "root", "123456");
        String[] fields = {"uid_", "sid"};
        DBInputFormat.setInput(job, SlistRecord.class, "slist_8", null, "uid_", fields);
        // DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);

        job.setMapperClass(SidMapper.class);
        job.setReducerClass(SidReducer.class);
        // job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        JobClient.runJob(job);

        return 0;
    }
On the test machine (.12):
nano mytest/src/SidCitated.java
javac -classpath hadoop-core-1.0.3.jar:lib/commons-cli-1.2.jar:lib/mysql-connector-java-5.1.18-bin.jar -d mytest/classes/ mytest/src/SidCitated.java
jar -cvf mytest/SidCitated.jar -C mytest/classes/ .
bin/hadoop fs -rmr dboutput    // the output path is hard-coded in the job
time bin/hadoop jar mytest/SidCitated.jar SidCitated
huangshaobin@backtest12:~/hadoop-1.0.3$ javac -classpath hadoop-core-1.0.3.jar:lib/commons-cli-1.2.jar:lib/mysql-connector-java-5.1.18-bin.jar -d mytest/classes/ mytest/src/SidCitated.java
Note: mytest/src/SidCitated.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
@!!! It ran successfully, and checking against MySQL confirms the counts are correct.
Here the full slist_8 table was processed without the "Error: Java heap space" seen on the VM, so no memory overflow occurred. This deserves proper investigation: how do splits and map counts behave, and what differs between pseudo-distributed and fully distributed mode??? (Note in the log below that 3 map tasks were launched here versus 2 on the VM, so each task presumably pulled a smaller result set.)
...
12/09/23 19:46:30 INFO mapred.JobClient:  map 50% reduce 0%
12/09/23 19:46:39 INFO mapred.JobClient:  map 100% reduce 16%
12/09/23 19:46:48 INFO mapred.JobClient:  map 100% reduce 100%
12/09/23 19:46:53 INFO mapred.JobClient: Job complete: job_201209201824_0017
12/09/23 19:46:53 INFO mapred.JobClient: Counters: 29
12/09/23 19:46:53 INFO mapred.JobClient:   Job Counters
12/09/23 19:46:53 INFO mapred.JobClient:     Launched reduce tasks=1
12/09/23 19:46:53 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=26945
12/09/23 19:46:53 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/23 19:46:53 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/09/23 19:46:53 INFO mapred.JobClient:     Launched map tasks=3
12/09/23 19:46:53 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16327
12/09/23 19:46:53 INFO mapred.JobClient:   File Input Format Counters
12/09/23 19:46:53 INFO mapred.JobClient:     Bytes Read=0
12/09/23 19:46:53 INFO mapred.JobClient:   File Output Format Counters
12/09/23 19:46:53 INFO mapred.JobClient:     Bytes Written=4650196
12/09/23 19:46:53 INFO mapred.JobClient:   FileSystemCounters
12/09/23 19:46:53 INFO mapred.JobClient:     FILE_BYTES_READ=29360982
12/09/23 19:46:53 INFO mapred.JobClient:     HDFS_BYTES_READ=150
12/09/23 19:46:53 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44110658
12/09/23 19:46:53 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=4650196
12/09/23 19:46:53 INFO mapred.JobClient:   Map-Reduce Framework
12/09/23 19:46:53 INFO mapred.JobClient:     Map output materialized bytes=14680488
12/09/23 19:46:53 INFO mapred.JobClient:     Map input records=815582
12/09/23 19:46:53 INFO mapred.JobClient:     Reduce shuffle bytes=7340244
12/09/23 19:46:53 INFO mapred.JobClient:     Spilled Records=2446746
12/09/23 19:46:53 INFO mapred.JobClient:     Map output bytes=13049312
12/09/23 19:46:53 INFO mapred.JobClient:     Total committed heap usage (bytes)=442302464
12/09/23 19:46:53 INFO mapred.JobClient:     CPU time spent (ms)=13030
12/09/23 19:46:53 INFO mapred.JobClient:     Map input bytes=815582
12/09/23 19:46:53 INFO mapred.JobClient:     SPLIT_RAW_BYTES=150
12/09/23 19:46:53 INFO mapred.JobClient:     Combine input records=0
12/09/23 19:46:53 INFO mapred.JobClient:     Reduce input records=815582
12/09/23 19:46:53 INFO mapred.JobClient:     Reduce input groups=428209
12/09/23 19:46:53 INFO mapred.JobClient:     Combine output records=0
12/09/23 19:46:53 INFO mapred.JobClient:     Physical memory (bytes) snapshot=513601536
12/09/23 19:46:53 INFO mapred.JobClient:     Reduce output records=428209
12/09/23 19:46:53 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2069061632
12/09/23 19:46:53 INFO mapred.JobClient:     Map output records=815582
huangshaobin@backtest12:~/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/part-00000 | wc
 428209  856418 4650196
huangshaobin@backtest12:~/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/part-00000 | grep ^400000
400000  9252
40000000        1
40000052        2
Next, add a combiner; since the long sum is associative, the reducer can serve as one:
job.setCombinerClass(SidReducer.class);
-------------------------------------------
20120924
To join each sid's count with its corresponding name, is a secondary sort needed? That is, name should consistently land in a fixed column of the record.
So that the finished counts can go through a second MapReduce join against sinfo, the count job's output value (V3) was changed from LongWritable to Text to match the other input. In hindsight this is probably unnecessary, since everything is written to the output file as text anyway.
With that change, job.setCombinerClass(SidReducer.class) no longer works: the combiner now emits Text values while the reducer still expects LongWritable values as input. (A combiner must consume and produce exactly the map output types; see the sketch below.)
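A minimal sketch of a combiner that stays type-correct under that change (hypothetical class name; assumes the map output remains LongWritable/LongWritable, only the reducer converts V3 to Text, and the class slots into SidCitated, which already has the needed imports):

    // Consumes and produces the map output types (LongWritable, LongWritable),
    // so it can run zero or more times without changing the final result.
    public static class SidCombiner extends MapReduceBase
            implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {

        public void reduce(LongWritable key, Iterator<LongWritable> values,
                OutputCollector<LongWritable, LongWritable> output, Reporter reporter) throws IOException {
            long count = 0;
            while (values.hasNext()) {
                count += values.next().get();
            }
            output.collect(key, new LongWritable(count)); // partial sum of this map's output
        }
    }

It would be registered with job.setCombinerClass(SidCombiner.class) instead of reusing SidReducer.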
@Also important: how do we set up two different inputs, one from a file and one from MySQL? And inside map, how do we tell which source a record came from?
The join example in "Hadoop Developer, Issue 1" does it like this:
// get the path of this record's input file
String path = ((FileSplit)reporter.getInputSplit()).getPath().toString();
...
if (path.indexOf("action") >= 0) { // the record comes from the goods table
...
But for one file-source input and one database-source input, joining such different sources raises these questions:
1. How to configure the different input sources.
2. How to tell in map which source a record came from.
3. How to order the map output, e.g., guarantee that name is the first column after the key sid.
One possible approach is sketched below.
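A plausible approach (a sketch under assumptions, not a worked solution from the book): first let a DB-input job like the one above dump each MySQL table to HDFS, then join plain files in a second job using org.apache.hadoop.mapred.lib.MultipleInputs, which binds a separate mapper to each input path. Each mapper tags its values, which answers question 2; sorting the tagged name record ahead of the counts in the reducer answers question 3 without a full secondary sort. The paths and class names below are hypothetical:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class SidNameJoinSketch {

    // Tags count records ("400000<TAB>9252" from the count job) as (sid, "C:9252").
    public static class CountTagMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String[] f = line.toString().split("\t");
            out.collect(new Text(f[0]), new Text("C:" + f[1]));
        }
    }

    // Tags name records from a hypothetical "sid,name" dump of sinfo as (sid, "N:name").
    public static class NameTagMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String[] f = line.toString().split(",", 2);
            out.collect(new Text(f[0]), new Text("N:" + f[1]));
        }
    }

    // Each path gets its own mapper, so the reducer can tell the sources apart
    // by the tag and emit name before count for every sid.
    static void configureInputs(JobConf job) {
        MultipleInputs.addInputPath(job, new Path("dboutput"),
                TextInputFormat.class, CountTagMapper.class);
        MultipleInputs.addInputPath(job, new Path("sinfo_dump"),
                TextInputFormat.class, NameTagMapper.class);
    }
}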