多表关联
输入是两个文件,一个代表工厂表,包含工厂名列和地址编号列;另一个代表地址表,包含地址名列和地址编号列。要求从输入数据中找出工厂名和地址名的对应关系,输出"工厂名——地址名"表。
样例输入如下所示。
1)factory:
factoryname addressed
Beijing Red Star 1
Shenzhen Thunder 3
Guangzhou Honda 2
Beijing Rising 1
Guangzhou Development Bank 2
Tencent 3
Back of Beijing 1
2)address:
addressID addressname
1 Beijing
2 Guangzhou
3 Shenzhen
4 Xian
样例输出如下所示。
factoryname addressname
Back of Beijing Beijing
Beijing Red Star Beijing
Beijing Rising Beijing
Guangzhou Development Bank Guangzhou
Guangzhou Honda Guangzhou
Shenzhen Thunder Shenzhen
Tencent Shenzhen
设计思路:和单表关联百分之九十九相似,因为单表关联也算是多表关联(自连接),最重要的还是找到”中间数据“,这里的中间数据
很明显是addressId,一句话概括(所有引用"xId"的factory都属于"xId所对应addressname " 。这里只需要按照ID分组,把
该组里所有的factory和addressname 做个笛卡尔积就得出了答案。
代码:
public static class Map extends Mapper<LongWritable, Text, Text,Text> { @Override public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { //输入的一行预处理文本 StringTokenizer itr = new StringTokenizer(value.toString(),"\t"); String[] values = new String[2]; int i = 0; while (itr.hasMoreElements()) { values[i] = itr.nextToken(); i++; } if (!value.toString().contains("factoryname")&&!value.toString().contains("addressed")) { if (values[0].charAt(0)>='0'&&values[0].charAt(0)<='9') {//说明是右表的ID context.write(new Text(values[0]),new Text("Right+"+values[1])); }else{ //说明是左表的factory context.write(new Text(values[1]),new Text("Left+"+values[0])); } } } } public static class IntSumReducer extends Reducer<Text, Text, Text, Text> { @Override public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException { List<String> childs = new ArrayList<String>(); List<String> parents = new ArrayList<String>(); for (Text value : values) { String[] split = value.toString().split("\\+"); if (split[0].equals("Right")) { parents.add(split[1]); }else{ childs.add(split[1]); } } for (String cds : childs) { for (String pts : parents) { context.write(new Text(cds), new Text(pts)); } } } }