Hadoop accessing a MySQL database (2): counting how many times each sid is cited

Suppose a reference table stores rows of the form (uid, sid). We now want to count, in MapReduce, how many times each sid is referenced, following the previous post and the citation-counting example in Hadoop in Action.

Create the slist_8 table and import its data.

Create SidCitated.java.

 

 

Further on, we will consider taking two MySQL tables as input and joining each sid with its corresponding name.

Note: what are the key and value that DBInputFormat feeds to the mapper? (With the old mapred API, the key is a LongWritable row index within the query result, and the value is an instance of the DBWritable record class registered via DBInputFormat.setInput().)

 

On the virtual machine:

javac -classpath hadoop-core-1.0.3.jar:lib/commons-cli-1.2.jar:lib/mysql-connector-java-5.1.18-bin.jar -d mysinfo/classes/ mysinfo/src/SidCitated.java
jar -cvf mysinfo/SidCitated.jar -C mysinfo/classes/ .
bin/hadoop fs -rmr dboutput    # the output path is specified in the code
time bin/hadoop jar mysinfo/SidCitated.jar SidCitated


First version: simply echo the input records to the output unchanged.

  1 import java.io.IOException;
  2 import java.io.DataInput;
  3 import java.io.DataOutput;
  4 import java.sql.Connection;
  5 import java.sql.DriverManager;
  6 import java.sql.PreparedStatement;
  7 import java.sql.ResultSet;
  8 import java.sql.SQLException;
  9 
 10 import org.apache.hadoop.filecache.DistributedCache;
 11 import org.apache.hadoop.fs.Path;
 12 import org.apache.hadoop.io.IntWritable;
 13 import org.apache.hadoop.io.LongWritable;
 14 import org.apache.hadoop.io.Text;
 15 import org.apache.hadoop.io.Writable;
 16 import org.apache.hadoop.mapred.JobClient;
 17 import org.apache.hadoop.mapred.JobConf;
 18 import org.apache.hadoop.mapred.MapReduceBase;
 19 import org.apache.hadoop.mapred.Mapper;
 20 import org.apache.hadoop.mapred.OutputCollector;
 21 import org.apache.hadoop.mapred.FileOutputFormat;
 22 import org.apache.hadoop.mapred.Reporter;
 23 import org.apache.hadoop.mapred.lib.IdentityReducer;
 24 import org.apache.hadoop.mapred.lib.db.DBWritable;
 25 import org.apache.hadoop.mapred.lib.db.DBInputFormat;
 26 import org.apache.hadoop.mapred.lib.db.DBConfiguration;
 27 
 28 import org.apache.hadoop.util.Tool;
 29 import org.apache.hadoop.util.ToolRunner;
 30 import org.apache.hadoop.conf.Configuration;
 31 import org.apache.hadoop.conf.Configured;
 32 
 33 public class SidCitated extends Configured implements Tool {
 34 
 35 
 36 public static class SlistRecord implements Writable, DBWritable { // add "static" ??
 37 int uid;
 38 int sid;
 39 
 40 public SlistRecord(){ // add by brian
 41 System.out.println("SlistRecord()");
 42 }
 43 
 44 public SlistRecord(SlistRecord t){ // add by brian
 45 System.out.println("SlistRecord(SlistRecord t)");
 46 this.uid = t.uid;
 47 this.sid = t.sid;
 48 }
 49 
 50 public void readFields(DataInput in) throws IOException {
 51 // TODO Auto-generated method stub
 52 this.uid = in.readInt();
 53 this.sid = in.readInt();
 54 }
 55 
 56 public void write(DataOutput out) throws IOException {
 57 out.writeInt(this.uid);
 58 out.writeInt(this.sid);
 59 }
 60 
 61 public void readFields(ResultSet result) throws SQLException {
 62 this.uid = result.getInt(1);
 63 this.sid = result.getInt(2);
 64 }
 65 
 66 public void write(PreparedStatement stmt) throws SQLException{
 67 stmt.setInt(1, this.uid);
 68 stmt.setInt(2, this.sid);
 69 }
 70 
 71 public String toString() {
 72 return new String(this.uid + "," + this.sid); //
 73 }
 74 
 75 }
 76 
 77 
 78 public static class SidMapper extends MapReduceBase
 79 implements Mapper<LongWritable, SlistRecord, LongWritable, Text> {
 80 
 81 private final static IntWritable uno = new IntWritable(1);
 82 private IntWritable citationCount = new IntWritable();
 83 
 84 public void map(LongWritable key, SlistRecord value,
 85 OutputCollector<LongWritable, Text> collector, Reporter reporter) throws IOException {
 86 collector.collect(new LongWritable(value.uid), new Text(value.toString()));
 87 }
 88 } 
 89 
 90 
 91 
 92 public int run(String[] args) throws Exception {
 93 
 94 Configuration conf = getConf(); 
 95 JobConf job = new JobConf(conf, SidCitated.class);
 96 // job.set("mapred.job.tracker", "172.19.102.12:9001");//
 97 // DistributedCache.addFileToClassPath(new Path("/lib/mysql-connector-java-5.1.18-bin.jar"), job); //
 98 job.setInputFormat(DBInputFormat.class);
 99 FileOutputFormat.setOutputPath(job, new Path("dboutput"));
100 // DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://172.19.102.12/tmp", "brian", "123456");
101 DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/tmp", "root", "123456");
102 String [] fields = {"uid_", "sid"};
103 //DBInputFormat.setInput(job, SlistRecord.class, "slist_8", null, "uid_", fields);
104 DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);
105 job.setMapperClass(SidMapper.class);
106 // job.setReducerClass(SidReducer.class); 
107 job.setReducerClass(IdentityReducer.class); 
108 JobClient.runJob(job); 
109 
110 return 0;
111 }
112 
113 public static void main(String[] args) throws Exception { 
114 int res = ToolRunner.run(new Configuration(), 
115 new SidCitated(), 
116 args);
117 
118 System.exit(res);
119 }
120 }

 


Error: Java heap space
DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);
Restricting the query to sids below 1,000,000 (at most six digits) makes it run normally.

brian@ubuntu:~/Downloads/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/* | wc
cat: File does not exist: /user/brian/dboutput/_logs
56662 113324 1437963


The table is too large and memory overflows!! Presumably the Java heap is being blown out here?
This is a problem well worth taking seriously!!!
On this virtual machine there are 2 map tasks. Is the overflow per map task, or global? If it is per task, can it be avoided by increasing the number of map tasks?
Assuming DBInputFormat lets each map task read only its own split, and assuming that multiple map tasks on one CPU run serially, is it enough to keep (number of CPUs) × (split size) below the Java heap size?
If so, how is the split size configured, and will map tasks on different nodes automatically read only the portion they need?
Need to dig deeper into how the split size is configured, how to tell from the web UI whether the results are consistent, and how DBInputFormat manages to read large amounts of data efficiently (especially when there are few nodes and few map tasks; does few nodes with many map tasks behave differently?).

What is the default behavior?
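
A partial answer, as a sketch rather than an authoritative statement: in Hadoop 1.x, DBInputFormat.getSplits() issues a SELECT COUNT(*) against the table and divides the rows evenly over the requested number of map tasks, appending LIMIT/OFFSET to each split's query, so each map task does read only its own slice, and the heap error is per map-task JVM. A likely contributing factor is that MySQL Connector/J buffers a statement's entire result set in memory by default. Two knobs worth trying are sketched below; the helper class name and the numeric values are illustrative assumptions, not from the original post.

import org.apache.hadoop.mapred.JobConf;

public class DbInputTuning {
    // Hypothetical helper: shrink each split and enlarge the child heap for a
    // DBInputFormat job. getSplits() divides the SELECT COUNT(*) row count evenly
    // across the requested number of map tasks (each split query gets a
    // LIMIT/OFFSET), so more map tasks means fewer rows held per task.
    static void tune(JobConf job) {
        job.setNumMapTasks(8);                          // illustrative split count
        job.set("mapred.child.java.opts", "-Xmx512m");  // heap for each map/reduce child JVM
    }
}

Whether this is enough can be checked on the JobTracker web UI: the number of launched map tasks and each task's counters show how the splits were actually assigned.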

-----------------------------------------------

--------------------------------------------------

20120923


12/09/23 19:07:50 INFO mapred.JobClient: Task Id : attempt_201209201531_0020_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1019)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
at SidCitated$SidMapper.map(SidCitated.java:89)
at SidCitated$SidMapper.map(SidCitated.java:80)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

According to http://www.hadoopor.com/thread-2516-1-1.html, the output types must be declared explicitly. (If setMapOutputKeyClass/setMapOutputValueClass are not called, the map output types default to the job's output types, and the default output value type is Text, hence the mismatch when the mapper emits LongWritable values.) The forum's example:
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

brian@ubuntu:~/Downloads/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/part-00000 | grep 400000
400000 9252
This matches the count computed directly in MySQL:
mysql> select count(*),sid from slist_8 where sid = 400000 group by sid;
+----------+--------+
| count(*) | sid    |
+----------+--------+
|     9252 | 400000 |
+----------+--------+

@Question: if this example is moved to the test machine and the "sid < 1000000" condition is removed from DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields); will it overflow there as well?
The code at this point is:

  1 import java.io.IOException;
  2 import java.io.DataInput;
  3 import java.io.DataOutput;
  4 import java.util.Iterator;
  5 import java.sql.Connection;
  6 import java.sql.DriverManager;
  7 import java.sql.PreparedStatement;
  8 import java.sql.ResultSet;
  9 import java.sql.SQLException;
 10 
 11 import org.apache.hadoop.filecache.DistributedCache;
 12 import org.apache.hadoop.fs.Path;
 13 import org.apache.hadoop.io.IntWritable;
 14 import org.apache.hadoop.io.LongWritable;
 15 import org.apache.hadoop.io.Text;
 16 import org.apache.hadoop.io.Writable;
 17 import org.apache.hadoop.mapred.JobClient;
 18 import org.apache.hadoop.mapred.JobConf;
 19 import org.apache.hadoop.mapred.MapReduceBase;
 20 import org.apache.hadoop.mapred.Mapper;
 21 import org.apache.hadoop.mapred.Reducer; //
 22 import org.apache.hadoop.mapred.OutputCollector;
 23 import org.apache.hadoop.mapred.FileOutputFormat;
 24 import org.apache.hadoop.mapred.Reporter;
 25 import org.apache.hadoop.mapred.lib.IdentityReducer;
 26 import org.apache.hadoop.mapred.lib.db.DBWritable;
 27 import org.apache.hadoop.mapred.lib.db.DBInputFormat;
 28 import org.apache.hadoop.mapred.lib.db.DBConfiguration;
 29 
 30 import org.apache.hadoop.util.Tool;
 31 import org.apache.hadoop.util.ToolRunner;
 32 import org.apache.hadoop.conf.Configuration;
 33 import org.apache.hadoop.conf.Configured;
 34 
 35 public class SidCitated extends Configured implements Tool {
 36 
 37 
 38 public static class SlistRecord implements Writable, DBWritable { // add "static" ??
 39 int uid;
 40 int sid;
 41 
 42 public SlistRecord(){ // add by brian
 43 System.out.println("SlistRecord()");
 44 }
 45 
 46 public SlistRecord(SlistRecord t){ // add by brian
 47 System.out.println("SlistRecord(SlistRecord t)");
 48 this.uid = t.uid;
 49 this.sid = t.sid;
 50 }
 51 
 52 public void readFields(DataInput in) throws IOException {
 53 // TODO Auto-generated method stub
 54 this.uid = in.readInt();
 55 this.sid = in.readInt();
 56 }
 57 
 58 public void write(DataOutput out) throws IOException {
 59 out.writeInt(this.uid);
 60 out.writeInt(this.sid);
 61 }
 62 
 63 public void readFields(ResultSet result) throws SQLException {
 64 this.uid = result.getInt(1);
 65 this.sid = result.getInt(2);
 66 }
 67 
 68 public void write(PreparedStatement stmt) throws SQLException{
 69 stmt.setInt(1, this.uid);
 70 stmt.setInt(2, this.sid);
 71 }
 72 
 73 public String toString() {
 74 return new String(this.uid + "," + this.sid); //
 75 }
 76 
 77 }
 78 
 79 
 80 public static class SidMapper extends MapReduceBase
 81 implements Mapper<LongWritable, SlistRecord, LongWritable, LongWritable> {
 82 
 83 private final static LongWritable uno = new LongWritable(1);
 84 // private IntWritable citationCount = new IntWritable();
 85 
 86 public void map(LongWritable key, SlistRecord value,
 87 OutputCollector<LongWritable, LongWritable> collector, Reporter reporter) throws IOException {
 88 // collector.collect(new LongWritable(value.uid), new Text(value.toString()));
 89 collector.collect(new LongWritable(value.sid), uno);
 90 }
 91 
 92 } 
 93 
 94 
 95 public static class SidReducer extends MapReduceBase
 96 implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> 
 97 {
 98 public void reduce(LongWritable key, Iterator<LongWritable> values,
 99 OutputCollector<LongWritable, LongWritable> output, Reporter reporter) throws IOException {
100 long count = 0;
101 while (values.hasNext()) {
102 count += values.next().get();
103 }
104 output.collect(key, new LongWritable(count));
105 }
106 
107 }
108 
109 
110 public int run(String[] args) throws Exception {
111 
112 Configuration conf = getConf(); 
113 JobConf job = new JobConf(conf, SidCitated.class);
114 // job.set("mapred.job.tracker", "172.19.102.12:9001");//
115 // DistributedCache.addFileToClassPath(new Path("/lib/mysql-connector-java-5.1.18-bin.jar"), job); //
116 job.setInputFormat(DBInputFormat.class);
117 FileOutputFormat.setOutputPath(job, new Path("dboutput"));
118 // DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://172.19.102.12/tmp", "brian", "123456");
119 DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/tmp", "root", "123456");
120 String [] fields = {"uid_", "sid"};
121 // DBInputFormat.setInput(job, SlistRecord.class, "slist_8", null, "uid_", fields);
122 DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);
123 
124 job.setMapperClass(SidMapper.class);
125 job.setReducerClass(SidReducer.class); 
126 //     job.setReducerClass(IdentityReducer.class);
127 
128 
129 ///////////////////
130 job.setMapOutputKeyClass(LongWritable.class);
131 job.setMapOutputValueClass(LongWritable.class);
132 job.setOutputKeyClass(LongWritable.class);
133 job.setOutputValueClass(LongWritable.class);
134 
135 JobClient.runJob(job); 
136 
137 return 0;
138 }
139 
140 public static void main(String[] args) throws Exception { 
141 int res = ToolRunner.run(new Configuration(), 
142 new SidCitated(), 
143 args);
144 
145 System.exit(res);
146 }
147 }

 


Import the slist_8 table on the test machine:

use tmp;
set names utf8;
CREATE TABLE `slist_8` (
`uid_` bigint(19) NOT NULL DEFAULT '0',
`sid` int(10) NOT NULL DEFAULT '0',
`snameremark` varchar(255) DEFAULT NULL,
`__version` bigint(20) unsigned DEFAULT '0',
`__deleted` tinyint(4) DEFAULT '0',
PRIMARY KEY (`uid_`,`sid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

load data infile '/tmp/slist_8' into table slist_8 fields terminated by ',' enclosed by '\"';
Note: the table name must not be quoted, and the file being imported has to be under /tmp (is that because the earlier export was written to /tmp?).
mysql> load data infile '/tmp/slist_8' into table slist_8 fields terminated by ',' enclosed by '\"';
Query OK, 815582 rows affected (7.27 sec)
Records: 815582 Deleted: 0 Skipped: 0 Warnings: 0
select * from slist_8 limit 10;
Check that the import succeeded and that the rows display correctly.

Update the database configuration and related settings in the code accordingly; the modified code is as follows:

  1 import java.io.IOException;
  2 import java.io.DataInput;
  3 import java.io.DataOutput;
  4 import java.util.Iterator;
  5 import java.sql.Connection;
  6 import java.sql.DriverManager;
  7 import java.sql.PreparedStatement;
  8 import java.sql.ResultSet;
  9 import java.sql.SQLException;
 10 
 11 import org.apache.hadoop.filecache.DistributedCache;
 12 import org.apache.hadoop.fs.Path;
 13 import org.apache.hadoop.io.IntWritable;
 14 import org.apache.hadoop.io.LongWritable;
 15 import org.apache.hadoop.io.Text;
 16 import org.apache.hadoop.io.Writable;
 17 import org.apache.hadoop.mapred.JobClient;
 18 import org.apache.hadoop.mapred.JobConf;
 19 import org.apache.hadoop.mapred.MapReduceBase;
 20 import org.apache.hadoop.mapred.Mapper;
 21 import org.apache.hadoop.mapred.Reducer; //
 22 import org.apache.hadoop.mapred.OutputCollector;
 23 import org.apache.hadoop.mapred.FileOutputFormat;
 24 import org.apache.hadoop.mapred.Reporter;
 25 import org.apache.hadoop.mapred.lib.IdentityReducer;
 26 import org.apache.hadoop.mapred.lib.db.DBWritable;
 27 import org.apache.hadoop.mapred.lib.db.DBInputFormat;
 28 import org.apache.hadoop.mapred.lib.db.DBConfiguration;
 29 
 30 import org.apache.hadoop.util.Tool;
 31 import org.apache.hadoop.util.ToolRunner;
 32 import org.apache.hadoop.conf.Configuration;
 33 import org.apache.hadoop.conf.Configured;
 34 
 35 public class SidCitated extends Configured implements Tool {
 36 
 37 
 38 public static class SlistRecord implements Writable, DBWritable { // add "static" ??
 39 int uid;
 40 int sid;
 41 
 42 public SlistRecord(){ // add by brian
 43 System.out.println("SlistRecord()");
 44 }
 45 
 46 public SlistRecord(SlistRecord t){ // add by brian
 47 System.out.println("SlistRecord(SlistRecord t)");
 48 this.uid = t.uid;
 49 this.sid = t.sid;
 50 }
 51 
 52 public void readFields(DataInput in) throws IOException {
 53 // TODO Auto-generated method stub
 54 this.uid = in.readInt();
 55 this.sid = in.readInt();
 56 }
 57 
 58 public void write(DataOutput out) throws IOException {
 59 out.writeInt(this.uid);
 60 out.writeInt(this.sid);
 61 }
 62 
 63 public void readFields(ResultSet result) throws SQLException {
 64 this.uid = result.getInt(1);
 65 this.sid = result.getInt(2);
 66 }
 67 
 68 public void write(PreparedStatement stmt) throws SQLException{
 69 stmt.setInt(1, this.uid);
 70 stmt.setInt(2, this.sid);
 71 }
 72 
 73 public String toString() {
 74 return new String(this.uid + "," + this.sid); //
 75 }
 76 
 77 }
 78 
 79 
 80 public static class SidMapper extends MapReduceBase
 81 implements Mapper<LongWritable, SlistRecord, LongWritable, LongWritable> {
 82 
 83 private final static LongWritable uno = new LongWritable(1);
 84 // private IntWritable citationCount = new IntWritable();
 85 
 86 public void map(LongWritable key, SlistRecord value,
 87 OutputCollector<LongWritable, LongWritable> collector, Reporter reporter) throws IOException {
 88 // collector.collect(new LongWritable(value.uid), new Text(value.toString()));
 89 collector.collect(new LongWritable(value.sid), uno);
 90 }
 91 
 92 } 
 93 
 94 
 95 public static class SidReducer extends MapReduceBase
 96 implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> 
 97 {
 98 public void reduce(LongWritable key, Iterator<LongWritable> values,
 99 OutputCollector<LongWritable, LongWritable> output, Reporter reporter) throws IOException {
100 long count = 0;
101 while (values.hasNext()) {
102 count += values.next().get();
103 }
104 output.collect(key, new LongWritable(count));
105 }
106 
107 }
108 
109 
110 public int run(String[] args) throws Exception {
111 
112 Configuration conf = getConf(); 
113 JobConf job = new JobConf(conf, SidCitated.class);
114 job.set("mapred.job.tracker", "172.19.102.12:9001");//
115 DistributedCache.addFileToClassPath(new Path("/lib/mysql-connector-java-5.1.18-bin.jar"), job); //
116 job.setInputFormat(DBInputFormat.class);
117 FileOutputFormat.setOutputPath(job, new Path("dboutput"));
118 DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://172.19.102.12/tmp", "brian", "123456");
119 // DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/tmp", "root", "123456");
120 String [] fields = {"uid_", "sid"};
121 DBInputFormat.setInput(job, SlistRecord.class, "slist_8", null, "uid_", fields);
122 //     DBInputFormat.setInput(job, SlistRecord.class, "slist_8", "sid < 1000000", "uid_", fields);
123 
124 job.setMapperClass(SidMapper.class);
125 job.setReducerClass(SidReducer.class); 
126 //     job.setReducerClass(IdentityReducer.class);
127 
128 
129 ///////////////////
130 job.setMapOutputKeyClass(LongWritable.class);
131 job.setMapOutputValueClass(LongWritable.class);
132 job.setOutputKeyClass(LongWritable.class);
133 job.setOutputValueClass(LongWritable.class);
134 
135 JobClient.runJob(job); 
136 
137 return 0;
138 }
139 
140 public static void main(String[] args) throws Exception { 
141 int res = ToolRunner.run(new Configuration(), 
142 new SidCitated(), 
143 args);
144 
145 System.exit(res);
146 }
147 }

 

On the test machine .12:
nano mytest/src/SidCitated.java


javac -classpath hadoop-core-1.0.3.jar:lib/commons-cli-1.2.jar:lib/mysql-connector-java-5.1.18-bin.jar -d mytest/classes/ mytest/src/SidCitated.java
jar -cvf mytest/SidCitated.jar -C mytest/classes/ .
bin/hadoop fs -rmr dboutput    # the output path is specified in the code
time bin/hadoop jar mytest/SidCitated.jar SidCitated

huangshaobin@backtest12:~/hadoop-1.0.3$ javac -classpath hadoop-core-1.0.3.jar:lib/commons-cli-1.2.jar:lib/mysql-connector-java-5.1.18-bin.jar -d mytest/classes/ mytest/src/SidCitated.java
Note: mytest/src/SidCitated.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

@!!! It ran successfully, and querying MySQL confirms that the counts are correct.

Here the full slist_8 table was processed without the "Error: Java heap space" seen on the virtual machine, so no memory overflow occurred there. This deserves a closer look: with respect to splits, map tasks, and pseudo-distributed versus fully distributed mode, what exactly differs between the two setups???

...
12/09/23 19:46:30 INFO mapred.JobClient: map 50% reduce 0%
12/09/23 19:46:39 INFO mapred.JobClient: map 100% reduce 16%
12/09/23 19:46:48 INFO mapred.JobClient: map 100% reduce 100%
12/09/23 19:46:53 INFO mapred.JobClient: Job complete: job_201209201824_0017
12/09/23 19:46:53 INFO mapred.JobClient: Counters: 29
12/09/23 19:46:53 INFO mapred.JobClient: Job Counters 
12/09/23 19:46:53 INFO mapred.JobClient: Launched reduce tasks=1
12/09/23 19:46:53 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=26945
12/09/23 19:46:53 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/23 19:46:53 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/09/23 19:46:53 INFO mapred.JobClient: Launched map tasks=3
12/09/23 19:46:53 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16327
12/09/23 19:46:53 INFO mapred.JobClient: File Input Format Counters 
12/09/23 19:46:53 INFO mapred.JobClient: Bytes Read=0
12/09/23 19:46:53 INFO mapred.JobClient: File Output Format Counters 
12/09/23 19:46:53 INFO mapred.JobClient: Bytes Written=4650196
12/09/23 19:46:53 INFO mapred.JobClient: FileSystemCounters
12/09/23 19:46:53 INFO mapred.JobClient: FILE_BYTES_READ=29360982
12/09/23 19:46:53 INFO mapred.JobClient: HDFS_BYTES_READ=150
12/09/23 19:46:53 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44110658
12/09/23 19:46:53 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=4650196
12/09/23 19:46:53 INFO mapred.JobClient: Map-Reduce Framework
12/09/23 19:46:53 INFO mapred.JobClient: Map output materialized bytes=14680488
12/09/23 19:46:53 INFO mapred.JobClient: Map input records=815582
12/09/23 19:46:53 INFO mapred.JobClient: Reduce shuffle bytes=7340244
12/09/23 19:46:53 INFO mapred.JobClient: Spilled Records=2446746
12/09/23 19:46:53 INFO mapred.JobClient: Map output bytes=13049312
12/09/23 19:46:53 INFO mapred.JobClient: Total committed heap usage (bytes)=442302464
12/09/23 19:46:53 INFO mapred.JobClient: CPU time spent (ms)=13030
12/09/23 19:46:53 INFO mapred.JobClient: Map input bytes=815582
12/09/23 19:46:53 INFO mapred.JobClient: SPLIT_RAW_BYTES=150
12/09/23 19:46:53 INFO mapred.JobClient: Combine input records=0
12/09/23 19:46:53 INFO mapred.JobClient: Reduce input records=815582
12/09/23 19:46:53 INFO mapred.JobClient: Reduce input groups=428209
12/09/23 19:46:53 INFO mapred.JobClient: Combine output records=0
12/09/23 19:46:53 INFO mapred.JobClient: Physical memory (bytes) snapshot=513601536
12/09/23 19:46:53 INFO mapred.JobClient: Reduce output records=428209
12/09/23 19:46:53 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2069061632
12/09/23 19:46:53 INFO mapred.JobClient: Map output records=815582


huangshaobin@backtest12:~/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/part-00000 | wc
428209 856418 4650196

huangshaobin@backtest12:~/hadoop-1.0.3$ bin/hadoop fs -cat dboutput/part-00000 | grep ^400000
400000 9252
40000000 1
40000052 2

 

 

The combiner was also set, reusing SidReducer (summing partial counts is associative, so this is safe):

job.setCombinerClass(SidReducer.class);

-------------------------------------------
20120924
To then join the count results with each sid's name, is a secondary sort needed? That is, the name would have to end up consistently in a fixed column of each record.
To feed the finished counts into another MapReduce pass that joins them with sinfo, the count job's output value (V3) was changed from LongWritable to Text so that it matches the other input. In hindsight this is probably unnecessary, since everything is written to the output file as text anyway.
After that change, job.setCombinerClass(SidReducer.class) no longer works correctly, because the combiner now emits Text values while the reducer still expects LongWritable values as its input.
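
A minimal sketch of the constraint (the helper class name is hypothetical): a combiner's output key/value types must equal the reducer's input types, so SidReducer can double as the combiner only while both the map output and the reduce input are <LongWritable, LongWritable>.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobConf;

public class CombinerSetup {
    // Hypothetical helper: keep the combiner only while map output and reduce input
    // share the same <LongWritable, LongWritable> types; switching the value type
    // to Text breaks this, as observed above.
    static void configure(JobConf job) {
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setCombinerClass(SidCitated.SidReducer.class);  // safe: a sum of partial sums
        job.setReducerClass(SidCitated.SidReducer.class);
    }
}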
@Another important question: how do we configure two different inputs, one from a file and one from MySQL? And inside map, how do we tell which input source a record came from?
The join example in the first issue of "Hadoop Developer" does it like this:
// get the path of the input file for the current split
String path=((FileSplit)reporter.getInputSplit()).getPath().toString();
...
if (path.indexOf("action")>=0) { // the record comes from the product table
...

For joining heterogeneous sources like this, one file input and one database input, the open questions are (a sketch of one possible approach follows the list):

1. How to configure the different input sources

2. How to tell, inside map, which source a record came from

3. After map, how to order the output, e.g. guaranteeing that name appears as the first column after the key sid
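
A minimal sketch of one possible approach, under these assumptions (none of which come from the original post): the count job's output already sits on HDFS as sid<TAB>count lines in dboutput, sinfo has been dumped to an HDFS text file of sid,name lines at a hypothetical path sinfo_dump, and all class and path names below are made up for illustration. Rather than mixing DBInputFormat and file input in a single job, this runs a reduce-side join with MultipleInputs: each input path gets its own mapper, which is how records from the two sources are distinguished, and each mapper tags its value ("N:" for name, "C:" for count) so the reducer can emit the name as the first column after the sid key.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class SidNameJoin {

    // Reads the count job's output: "sid<TAB>count" per line.
    public static class CountFileMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String[] parts = value.toString().split("\t");
            out.collect(new Text(parts[0]), new Text("C:" + parts[1]));  // tag as count
        }
    }

    // Reads the hypothetical sinfo dump: "sid,name" per line.
    public static class SinfoFileMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String[] parts = value.toString().split(",", 2);
            out.collect(new Text(parts[0]), new Text("N:" + parts[1]));  // tag as name
        }
    }

    public static class JoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text sid, Iterator<Text> values,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String name = "";
            String count = "";
            while (values.hasNext()) {
                String v = values.next().toString();
                if (v.startsWith("N:")) name = v.substring(2);
                else if (v.startsWith("C:")) count = v.substring(2);
            }
            out.collect(sid, new Text(name + "," + count));  // name first, then count
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(SidNameJoin.class);
        // Each path gets its own mapper; this is how the two sources are told apart.
        MultipleInputs.addInputPath(job, new Path("dboutput"), TextInputFormat.class, CountFileMapper.class);
        MultipleInputs.addInputPath(job, new Path("sinfo_dump"), TextInputFormat.class, SinfoFileMapper.class);
        FileOutputFormat.setOutputPath(job, new Path("joinoutput"));
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        JobClient.runJob(job);
    }
}

Assembling the value inside the reducer sidesteps question 3; a true secondary sort would only be needed if the name had to arrive at the reducer already ordered ahead of the counts.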

 
