Overview: The project consists of four modules: a crawl module, a parse module, an index module, and a search module.
What it does: crawls job postings from two sites, Zhaopin (智联招聘) and 51job (前程无忧), parses and stores them, builds a local index, and serves a web UI with a range of search features.
Stack: Heritrix 3.0, hbase-writer 0.9, HBase 0.90, Hadoop 0.20.2, HTMLParser 2.0, Lucene 3.3, Bobo-Browse 2.5, Struts 2.2, FreeMarker 2.3, jQuery
Development time: 3 weeks.
Environment: Linux (Fedora 15), Eclipse
Up-front study of Hadoop and Lucene: 2 months.
References: "Hadoop: The Definitive Guide, 2nd Edition"; "Lucene in Action, Second Edition"; 《开发自己的搜索引擎:Lucene+Heritrix(第2版)》 (a Chinese book on building a search engine with Lucene and Heritrix)
I. Crawl module:
1. Stack: Heritrix 3.0 + hbase-writer 0.9 + HBase 0.90
2. Overview: Heritrix crawls the job-related pages on 51job and Zhaopin and stores them in the HBase table rawjobs.
The rawjobs table:
key: Keying.createKey(url)
column families: content, curi
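For reference, the rawjobs table can be created up front with the HBase client API. A minimal sketch, using the table and family names above and default settings everywhere else (classes from org.apache.hadoop.hbase and org.apache.hadoop.hbase.client):

// Pre-create the rawjobs table written to by hbase-writer (a sketch).
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
if (!admin.tableExists("rawjobs")) {
    HTableDescriptor desc = new HTableDescriptor("rawjobs");
    desc.addFamily(new HColumnDescriptor("content")); // raw page bytes (plus charset, see the parse module)
    desc.addFamily(new HColumnDescriptor("curi"));    // crawl metadata: url, ip, via, ...
    admin.createTable(desc);
}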
3. Details:
(1) A custom DecideRule class that restricts the crawl to job-related URLs:

package com.qjqiao.modules.deciderules;

import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.DecideRule;

public class JobsRule extends DecideRule {

    private static final long serialVersionUID = 1L;

    // Accept DNS/robots prerequisites and job-related URLs on the two sites;
    // reject everything else.
    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        String u = uri.getURI();
        if (u.startsWith("dns") || u.startsWith("DNS")
                || u.endsWith("robots.txt")
                // www.zhaopin.com
                || u.contains("zhaopin.com/jobseeker")
                || u.contains("company.zhaopin.com")
                || u.contains("jobs.zhaopin.com")
                || u.contains("search.zhaopin.com/jobs")
                || u.contains("search.zhaopin.com/jobseeker")
                // www.51job.com
                || u.contains("search.51job.com")) {
            if (!u.contains("research.51job.com")) {
                return DecideResult.ACCEPT;
            }
        }
        return DecideResult.REJECT;
    }
}
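A note on how this fits into the scope: in a Heritrix DecideRuleSequence the rules run in order and the last non-NONE decision wins, so placing JobsRule after the generic accept/reject rules (see the configuration below) lets it narrow the crawl to the URL patterns above, while the negative SURT rule after it can still veto specific prefixes.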
(2) A custom queue-assignment policy class:
package org.archive.crawler.frontier;

import org.apache.commons.httpclient.URIException;
import org.archive.net.UURI;

public class ELFHashQueueAssignmentPolicy extends URIAuthorityBasedQueueAssignmentPolicy {

    private static final long serialVersionUID = 1L;

    // Classic ELF string hash, reduced modulo `number` so URIs are spread
    // evenly over that many work queues.
    public int ELFHash(String str, int number) {
        int hash = 0;
        long x = 0L;
        char[] array = str.toCharArray();
        for (int i = 0; i < array.length; i++) {
            hash = (hash << 4) + array[i];
            if ((x = (hash & 0xF0000000L)) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF) % number;
    }

    @Override
    protected String getCoreKey(UURI basis) {
        try {
            // Assign each URI to one of 50 queues.
            return String.valueOf(ELFHash(basis.getURI(), 50));
        } catch (URIException e) {
            e.printStackTrace();
            return "0";
        }
    }
}
(3) Changes to the job configuration file crawler-beans.xml:
<bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer"> <property name="properties"> <value> metadata.operatorContactUrl=http://www.51job.com metadata.jobName=51job metadata.description=jobs from 51job.com </value> </property> </bean> <bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName"> <property name="operatorContactUrl" value="[see override above]"/> <property name="jobName" value="[see override above]"/> <property name="description" value="[see override above]"/> <property name="userAgentTemplate" value="Mozilla/5.0 (compatible; heritrix/3.1.0 +@OPERATOR_CONTACT_URL@)"/> </bean> <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule"> <property name="textSource"> <bean class="org.archive.spring.ConfigFile"> <property name="path" value="seeds.txt" /> </bean> </property> <property name='sourceTagSeeds' value='false'/> </bean> <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence"> <property name="rules"> <list> <bean class="org.archive.modules.deciderules.RejectDecideRule"> </bean> <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"> </bean> <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule"> </bean> <bean class="org.archive.modules.deciderules.TransclusionDecideRule"> </bean> <bean class="com.qjqiao.modules.deciderules.JobsRule"> </bean> <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"> <property name="decision" value="REJECT"/> <property name="seedsAsSurtPrefixes" value="false"/> <property name="surtsDumpFile" value="negative-surts.dump" /> </bean> <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule"> </bean> <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule"> </bean> <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule"> </bean> <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule"> </bean> </list> </property> </bean> <bean id="hbaseParameterSettings" class="org.archive.io.hbase.HBaseParameters"> <property name="contentColumnFamily" value="content"></property> <property name="contentColumnName" value="raw-data"></property> <property name="charsetColumnName" value="charset"></property> <property name="curiColumnFamily" value="curi"></property> <property name="ipColumnName" value="ip"></property> <property name="pathFromSeedColumnName" value="path-from-seed"></property> <property name="isSeedColumnName" value="is-seed"></property> <property name="viaColumnName" value="via"></property> <property name="urlColumnName" value="url"></property> <property name="requestColumnName" value="request"></property> <!-- Overwrite more options here --> </bean> <bean id="hbaseWriterProcessor" class="org.archive.modules.writer.HBaseWriterProcessor"> <property name="zkQuorum" value="localhost"> </property> <property name="zkClientPort" value="2181"> </property> <property name="hbaseTable" value="rawjobs"> </property> <property name="onlyProcessNewRecords" value="false"> </property> <property name="onlyWriteNewRecords" value="false"> </property> <property name="hbaseParameters"> <ref bean="hbaseParameterSettings" /> </property> </bean> <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain"> <property name="processors"> <list> <ref bean="hbaseWriterProcessor" /> <ref bean="candidates"/> <ref bean="disposition"/> </list> </property> </bean> <bean id="queueAssignmentPolicy" 
class="org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy"> </bean>
(4) The seed file, seeds.txt:
http://search.51job.com/jobsearch/advance_search.php?lang=c&stype=2
http://www.zhaopin.com/jobseeker/index_industry.html
(5) Running the crawl (the original screenshots are not reproduced here):
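As a rough outline of this step: Heritrix 3 is started with bin/heritrix -a admin:admin, its web console is opened at https://localhost:8443, a new job is created there with the crawler-beans configuration and seeds.txt above, and the job is then built and launched from the console.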
II. Parse module:
1. Stack: HTMLParser 2.0 + Hadoop MapReduce + HBase
2. Overview:
The whole parse module runs as one MapReduce job, so pages are processed in parallel.
It reads the crawled pages from the HBase rawjobs table, extracts the job posting from each page with HTMLParser, normalizes the fields (the two sites use different value ranges for the same field), and writes the result to the HBase parsedjobs table.
The parsedjobs table:
key: Keying.createKey(url)
column family: job
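The parsedjobs table can be pre-created the same way as rawjobs, with its single column family; a three-line sketch:

HTableDescriptor desc = new HTableDescriptor("parsedjobs");
desc.addFamily(new HColumnDescriptor("job"));
new HBaseAdmin(HBaseConfiguration.create()).createTable(desc);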
3. A pitfall: garbled Chinese text.
Cause: different sites, and even different pages within one site, use different encodings, so the byte stream the crawler fetches is not always UTF-8, and hbase-writer stores those raw bytes in HBase verbatim. When the parse module later read the page content back, Bytes.toString(byte[]) always decodes as UTF-8, so non-UTF-8 pages came out garbled.
Fix, part 1: in hbase-writer's HBaseWriter, detect the charset of each fetched page and store it alongside the content in rawjobs:
String contentType = curi.getContentType();
String charset = "utf-8";
if (-1 != contentType.indexOf("charset=")) {
    // Trust the charset declared in the Content-Type header; 8 == "charset=".length().
    charset = contentType.substring(contentType.indexOf("charset=") + 8);
} else if (curi.getURI().contains("51job.com")) {
    // 51job pages default to gb2312 when no charset is declared.
    charset = "gb2312";
}
batchPut.add(Bytes.toBytes(getHbaseOptions().getContentColumnFamily()),
        Bytes.toBytes(getHbaseOptions().getCharsetColumnName()),
        Bytes.toBytes(charset));
Fix, part 2: when reading the pages back, the parse module uses a custom makeString(byte[], String) helper that decodes the bytes with the stored charset:
String charset = Bytes.toString(result.getValue(Bytes.toBytes("content"),
        Bytes.toBytes("charset")));
String rawData = JobsParser.makeString(result.getValue(Bytes.toBytes("content"),
        Bytes.toBytes("raw-data")), charset);

public static String makeString(byte[] b, String charset) {
    if (b == null) {
        return null;
    }
    if (b.length == 0) {
        return "";
    }
    try {
        return new String(b, 0, b.length, charset);
    } catch (UnsupportedEncodingException e) {
        System.out.println("charset not supported?");
        e.printStackTrace();
        return null;
    }
}
4. Details:
(1) The MapReduce job:
package com.jobsearcher.parser.mapreduce;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

import com.jobsearcher.parser.util.PageParser;

public class JobsParser {

    /** Name of this 'program'. */
    static final String NAME = "jobsparser";
    static final String FROM_TABLENAME = "rawjobs";
    static final String TO_TABLENAME = "parsedjobs";

    /** Mapper: identity; just forwards each raw row to the reducer. */
    static class JobsParserMapper extends TableMapper<ImmutableBytesWritable, Result> {
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    /** Reducer: parses each raw page and writes the extracted fields to parsedjobs. */
    static class JobsParserReducer
            extends TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {
        @Override
        protected void reduce(ImmutableBytesWritable row, Iterable<Result> results,
                Context context) throws IOException, InterruptedException {
            for (Result result : results) {
                Put put = new Put(result.getRow());
                String url = Bytes.toString(result.getValue(Bytes.toBytes("curi"),
                        Bytes.toBytes("url")));
                // Decode the page with the charset recorded by the crawler.
                String charset = Bytes.toString(result.getValue(Bytes.toBytes("content"),
                        Bytes.toBytes("charset")));
                String rawData = JobsParser.makeString(result.getValue(
                        Bytes.toBytes("content"), Bytes.toBytes("raw-data")), charset);
                if (null != url && null != rawData) {
                    Map<String, String> job = PageParser.parse(url, rawData);
                    if (null != job) {
                        byte[] jobFamily = Bytes.toBytes("job");
                        for (Map.Entry<String, String> kv : job.entrySet()) {
                            put.add(jobFamily, Bytes.toBytes(kv.getKey()),
                                    Bytes.toBytes(kv.getValue()));
                        }
                        // Average of the crawl timestamp and the posting date.
                        long timestamp = (result.getColumn(Bytes.toBytes("curi"),
                                Bytes.toBytes("url")).get(0).getTimestamp()
                                + Long.parseLong(job.get("job_time"))) / 2;
                        put.add(jobFamily, Bytes.toBytes("timestamp"),
                                Bytes.toBytes(timestamp));
                        context.write(row, put);
                    }
                }
            }
        }
    }

    public static Job createSubmittableJob(Configuration conf) throws IOException {
        Job job = new Job(conf, NAME + "_" + FROM_TABLENAME + "_" + TO_TABLENAME);
        job.setJarByClass(JobsParser.class);
        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob(FROM_TABLENAME, scan,
                JobsParserMapper.class, ImmutableBytesWritable.class, Result.class, job);
        TableMapReduceUtil.addDependencyJars(job);
        TableMapReduceUtil.initTableReducerJob(TO_TABLENAME, JobsParserReducer.class, job);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = createSubmittableJob(conf);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static String makeString(byte[] b, String charset) {
        if (b == null) {
            return null;
        }
        if (b.length == 0) {
            return "";
        }
        try {
            return new String(b, 0, b.length, charset);
        } catch (UnsupportedEncodingException e) {
            System.out.println("charset not supported?");
            e.printStackTrace();
            return null;
        }
    }
}
(2) The parser classes:
package com.jobsearcher.parser.util;

import java.util.Map;

// Dispatches to the site-specific parser based on the URL's domain.
public class PageParser {
    public static Map<String, String> parse(String url, String page) {
        if (url.contains("51job.com")) {
            return PageParser51job.parse(url, page);
        } else if (url.contains("zhaopin.com")) {
            return PageParserZhaopin.parse(url, page);
        } else {
            System.out.println("unexpected url : " + url);
            return null;
        }
    }
}
The 51job page parser, PageParser51job:
package com.jobsearcher.parser.util;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.StringFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

public class PageParser51job {

    public static Map<String, String> parse(String url, String page) {
        Map<String, String> job = new HashMap<String, String>();
        // Only job-detail pages are parsed; everything else is skipped.
        if (!url.matches("http://search.51job.com/job/.*html?")) {
            System.out.println("abandoned : " + url);
            return null;
        }
        Parser parser = Parser.createParser(page, "UTF-8");

        // CSS-selector filters locating each field on the page.
        NodeFilter jobTitleFilter = new CssSelectorNodeFilter(
                "div.s_txt_jobs table.jobs_1 td.sr_bt");
        NodeFilter companyNameFilter = new CssSelectorNodeFilter(
                "div.s_txt_jobs table.jobs_1 table td");
        NodeFilter companyPropsFilter = new AndFilter(
                new CssSelectorNodeFilter("div.s_txt_jobs table.jobs_1 td"),
                new HasChildFilter(new StringFilter("公司行业")));
        NodeFilter jobPropsNameFilter = new CssSelectorNodeFilter(
                "div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td.txt_1");
        NodeFilter jobPropsValueFilter = new CssSelectorNodeFilter(
                "div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td.txt_2");
        NodeFilter jobDescriptionFilter = new CssSelectorNodeFilter(
                "div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td div");
        NodeFilter jobCategoryFilter = new AndFilter(
                new CssSelectorNodeFilter(
                        "div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td"),
                new HasChildFilter(new StringFilter("职位职能")));
        NodeFilter companyDescriptionFilter = new AndFilter(
                new CssSelectorNodeFilter(
                        "div.s_txt_jobs div.jobs_com div.grayline div.jobs_txt"),
                new HasChildFilter(new TagNameFilter("p")));

        try {
            // Job title.
            NodeList nodeList = parser.parse(jobTitleFilter);
            Node job_title_node = nodeList.elementAt(0);
            job.put("job_title", job_title_node.toPlainTextString().trim());

            // Company name (strip anything from the first entity onwards).
            parser.reset();
            nodeList = parser.parse(companyNameFilter);
            Node company_name_node = nodeList.elementAt(0);
            String rawName = company_name_node.toPlainTextString();
            String cname = rawName.substring(0,
                    (-1 == rawName.indexOf("&")) ? rawName.length() : rawName.indexOf("&"));
            job.put("company_name", cname.trim());

            // Company properties: industry / type / scale.
            parser.reset();
            nodeList = parser.parse(companyPropsFilter);
            Node company_props_node = nodeList.elementAt(0);
            Parser company_props_parser = new Parser(company_props_node.toHtml());
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            company_props_parser.visitAllNodesWith(visitor);
            String company_props = visitor.getExtractedText();
            int s_industry = company_props.indexOf("公司行业:");
            int s_type = company_props.indexOf("公司性质:");
            int s_scale = company_props.indexOf("公司规模:");
            int len = company_props.length();
            if (-1 != s_industry) {
                job.put("company_industry", company_props.substring(s_industry + 5,
                        -1 != s_type ? s_type : (-1 != s_scale ? s_scale : len)).trim());
            }
            if (-1 != s_type) {
                job.put("company_type", adaptCType(company_props.substring(s_type + 5,
                        -1 != s_scale ? s_scale : len).trim()));
            }
            if (-1 != s_scale) {
                job.put("company_scale", company_props.substring(s_scale + 5).trim());
            }

            // Job properties come as aligned name/value cell lists.
            parser.reset();
            NodeList jobPropsNameNodeList = parser.parse(jobPropsNameFilter);
            parser.reset();
            NodeList jobPropsValueNodeList = parser.parse(jobPropsValueFilter);
            Node jobPropsNameNode;
            Node jobPropsValueNode;
            for (int i = 0; i < jobPropsNameNodeList.size(); i++) {
                jobPropsNameNode = jobPropsNameNodeList.elementAt(i);
                jobPropsValueNode = jobPropsValueNodeList.elementAt(i);
                String name = jobPropsNameNode.toPlainTextString().trim();
                if (name.contains("发布日期")) {
                    job.put("job_time",
                            adaptDate(jobPropsValueNode.toPlainTextString().trim()) + "");
                }
                if (name.contains("工作地点")) {
                    job.put("job_address", jobPropsValueNode.toPlainTextString().trim());
                }
                if (name.contains("招聘人数")) {
                    job.put("job_count", jobPropsValueNode.toPlainTextString().trim());
                }
                if (name.contains("工作年限")) {
                    job.put("job_experience", jobPropsValueNode.toPlainTextString().trim());
                }
                if (name.contains("学") && name.contains("历")) {
                    job.put("job_education", jobPropsValueNode.toPlainTextString().trim());
                }
                if (name.contains("语言要求")) {
                    job.put("job_language", jobPropsValueNode.toPlainTextString().trim());
                }
                if (name.contains("薪水范围")) {
                    job.put("job_salary", jobPropsValueNode.toPlainTextString().trim());
                }
            }

            // Job description.
            parser.reset();
            nodeList = parser.parse(jobDescriptionFilter);
            Node job_desc_node = nodeList.elementAt(0);
            job.put("job_description", job_desc_node.getChildren().toHtml().trim());

            // Company description (drop embedded links).
            parser.reset();
            nodeList = parser.parse(companyDescriptionFilter);
            Node company_desc_node = nodeList.elementAt(0);
            String rawComDesc = company_desc_node.getChildren().toHtml();
            job.put("company_description",
                    rawComDesc.replaceAll("<a.*>.*</a>", "").trim());

            // Job category.
            parser.reset();
            nodeList = parser.parse(jobCategoryFilter);
            Node job_category_node = nodeList.elementAt(0);
            if (null != job_category_node) {
                // Replace non-breaking spaces with regular spaces.
                String rawJobc = job_category_node.toPlainTextString()
                        .replaceAll("\u00a0", " ");
                job.put("job_category",
                        rawJobc.substring(rawJobc.indexOf("职位职能:") + 5).trim());
            }
        } catch (Exception e) {
            System.out.println("abandoned : " + url);
            e.printStackTrace();
            return null;
        }
        job.put("from", "前程无忧");
        job.put("url", url);
        System.out.println("parsed : " + url);
        return job;
    }

    private static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");

    private static long adaptDate(String date) throws ParseException {
        return df.parse(date).getTime();
    }

    // Normalize the company-type text into a fixed set of values.
    private static String adaptCType(String ctype) {
        if (ctype.contains("合资")) { return "合资"; }
        if (ctype.contains("民营")) { return "民营"; }
        if (ctype.contains("国企")) { return "国企"; }
        if (ctype.contains("外资")) { return "外商独资"; }
        if (ctype.contains("代表处")) { return "外企代表处"; }
        if (ctype.contains("机关")) { return "国家机关"; }
        if (ctype.contains("事业单位")) { return "事业单位"; }
        return "其他";
    }
}
The Zhaopin page parser, PageParserZhaopin:
package com.jobsearcher.parser.util;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

public class PageParserZhaopin {

    public static Map<String, String> parse(String url, String page) {
        Map<String, String> job = new HashMap<String, String>();
        if (!url.matches("http://jobs.zhaopin.com/.*html?")) {
            System.out.println("not a valid url, abandoned : " + url);
            return null;
        }
        Parser parser = Parser.createParser(page, "UTF-8");

        NodeFilter jobTitleFilter = new CssSelectorNodeFilter("#positionTitle h1");
        NodeFilter companyPropsFilter = new CssSelectorNodeFilter(
                "#zpcontent > table.companyInfoTab");
        NodeFilter jobPropsFilter = new CssSelectorNodeFilter(
                "#zpcontent table.jobInfoTab table.jobInfoItems");
        NodeFilter jobDesFilter = new CssSelectorNodeFilter(
                "#zpcontent table.jobInfoTab div.jobDes div div");
        NodeFilter companyDesFilter = new AndFilter(
                new CssSelectorNodeFilter("#zpcontent table.black12 td"),
                new HasChildFilter(new AndFilter(new TagNameFilter("p"),
                        new HasChildFilter(new TagNameFilter("br")))));

        try {
            // Job title.
            NodeList nodeList = parser.parse(jobTitleFilter);
            Node job_title_node = nodeList.elementAt(0);
            job.put("job_title", job_title_node.toPlainTextString().trim());

            // Company properties: name / industry / type / scale.
            parser.reset();
            nodeList = parser.parse(companyPropsFilter);
            Node cp_node = nodeList.elementAt(0);
            Parser cp_parser = new Parser(cp_node.toHtml());
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            cp_parser.visitAllNodesWith(visitor);
            String cp = visitor.getExtractedText();
            int s_industry = cp.indexOf("公司行业:");
            int s_type = cp.indexOf("公司类型:");
            int s_scale = cp.indexOf("公司规模:");
            int len = cp.length();
            job.put("company_name", cp.substring(0, s_industry).trim());
            if (-1 != s_industry) {
                job.put("company_industry", cp.substring(s_industry + 5,
                        -1 != s_type ? s_type : (-1 != s_scale ? s_scale : len)).trim());
            }
            if (-1 != s_type) {
                job.put("company_type",
                        cp.substring(s_type + 5, -1 != s_scale ? s_scale : len).trim());
            }
            if (-1 != s_scale) {
                job.put("company_scale", cp.substring(s_scale + 5).trim());
            }

            // Job properties: the labels appear in a fixed order, so each value
            // runs from its own label to the next label that is present.
            parser.reset();
            nodeList = parser.parse(jobPropsFilter);
            Node jp_node = nodeList.elementAt(0);
            Parser jp_parser = new Parser(jp_node.toHtml());
            TextExtractingVisitor visitor2 = new TextExtractingVisitor();
            jp_parser.visitAllNodesWith(visitor2);
            String jp = visitor2.getExtractedText();
            int sj[] = new int[10];
            int sj_category = sj[0] = jp.indexOf("职位类别");
            int sj_addr = sj[1] = jp.indexOf("工作地点");
            int sj_time = sj[2] = jp.indexOf("发布日期");
            int sj_experience = sj[3] = jp.indexOf("工作经验");
            int sj_education = sj[4] = jp.indexOf("最低学历");
            int sj_manage = sj[5] = jp.indexOf("管理经验");
            int sj_type = sj[6] = jp.indexOf("工作性质");
            int sj_count = sj[7] = jp.indexOf("招聘人数");
            int sj_salary = sj[8] = jp.indexOf("职位月薪");
            sj[9] = jp.length();
            if (-1 != sj_category) {
                job.put("job_category",
                        jp.substring(sj_category + 5, sjNext(sj, 1)).trim());
            }
            if (-1 != sj_addr) {
                job.put("job_address",
                        jp.substring(sj_addr + 5, sjNext(sj, 2)).trim().split(" ")[0]);
            }
            if (-1 != sj_time) {
                job.put("job_time",
                        adaptDate(jp.substring(sj_time + 5, sjNext(sj, 3)).trim()) + "");
            }
            if (-1 != sj_experience) {
                job.put("job_experience",
                        adaptExp(jp.substring(sj_experience + 5, sjNext(sj, 4)).trim()));
            }
            if (-1 != sj_education) {
                job.put("job_education",
                        adaptEdu(jp.substring(sj_education + 5, sjNext(sj, 5)).trim()));
            }
            if (-1 != sj_manage) {
                job.put("job_manage", jp.substring(sj_manage + 5, sjNext(sj, 6)).trim());
            }
            if (-1 != sj_type) {
                job.put("job_type", jp.substring(sj_type + 5, sjNext(sj, 7)).trim());
            }
            if (-1 != sj_count) {
                job.put("job_count", jp.substring(sj_count + 5, sjNext(sj, 8)).trim());
            }
            if (-1 != sj_salary) {
                job.put("job_salary", jp.substring(sj_salary + 5).trim());
            }

            // Job description: concatenate all matching blocks.
            parser.reset();
            nodeList = parser.parse(jobDesFilter);
            String jobDesc = "";
            NodeList nl;
            for (int i = 0; i < nodeList.size(); i++) {
                nl = nodeList.elementAt(i).getChildren();
                if (null != nl) {
                    jobDesc += nl.toHtml();
                }
            }
            job.put("job_description", jobDesc.trim());

            // Company description.
            parser.reset();
            nodeList = parser.parse(companyDesFilter);
            job.put("company_description",
                    nodeList.elementAt(0).getChildren().toHtml().trim());
        } catch (Exception e) {
            System.out.println("abandoned : " + url);
            e.printStackTrace();
            return null;
        }
        job.put("from", "智联招聘");
        job.put("url", url);
        System.out.println("parsed : " + url);
        return job;
    }

    // First label offset at or after `index` that is present on the page (-1 if none).
    private static int sjNext(int[] a, int index) {
        for (int i = index; i < a.length; i++) {
            if (-1 != a[i]) {
                return a[i];
            }
        }
        return -1;
    }

    // Normalize free-text experience requirements to the 51job value range.
    // "10" must be checked before "1", or ten-year postings match the one-year branch.
    private static String adaptExp(String exp) {
        if (exp.contains("10")) { return "十年以上"; }
        if (exp.contains("1")) { return "一年以上"; }
        if (exp.contains("2")) { return "二年以上"; }
        if (exp.contains("3")) { return "三年以上"; }
        if (exp.contains("4")) { return "四年以上"; }
        if (exp.contains("5")) { return "五年以上"; }
        if (exp.contains("6")) { return "六年以上"; }
        if (exp.contains("7")) { return "七年以上"; }
        if (exp.contains("8")) { return "八年以上"; }
        if (exp.contains("9")) { return "九年以上"; }
        return "不限";
    }

    // Normalize the education requirement to a fixed set of values.
    private static String adaptEdu(String edu) {
        if (edu.contains("初中")) { return "初中"; }
        if (edu.contains("高中")) { return "高中"; }
        if (edu.contains("中专")) { return "中专"; }
        if (edu.contains("中技")) { return "中技"; }
        if (edu.contains("大专")) { return "大专"; }
        if (edu.contains("本科")) { return "本科"; }
        if (edu.contains("硕士")) { return "硕士"; }
        if (edu.contains("博士")) { return "博士"; }
        return "其他";
    }

    private static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");

    private static long adaptDate(String date) throws ParseException {
        return df.parse(date).getTime();
    }
}
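Since the parsers are plain static methods, they can be smoke-tested outside MapReduce. A sketch of such a harness, where the file path, charset, and URL are placeholders to be replaced with a real saved page:

package com.jobsearcher.parser.util;

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.Map;

// Hypothetical harness: feed one locally saved page through the dispatcher.
public class ParserSmokeTest {
    public static void main(String[] args) throws Exception {
        File f = new File("/tmp/job.html");                 // placeholder path
        byte[] buf = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        in.readFully(buf);
        in.close();
        String html = new String(buf, "gb2312");            // match the page's real charset
        Map<String, String> job = PageParser.parse(
                "http://search.51job.com/job/00000000.html", html); // placeholder URL
        System.out.println(job);
    }
}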
III. Index module:
1. Stack: Lucene 3.3 + HBase 0.90
2. Overview: reads the parsed postings from the HBase parsedjobs table and builds a Lucene index, using the Paoding (庖丁解牛) analyzer for Chinese word segmentation.
3. Details:
package com.jobsearcher.indexer;

import java.io.File;
import java.io.IOException;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class JobsIndexer {

    private static final String INDEX_DIR = "/home/qingjie/projects/jobsearcher/index";

    public static void main(String... args) throws IOException {
        System.out.println("indexing");
        Analyzer analyzer = new PaodingAnalyzer();
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_33, analyzer);
        Directory dir = FSDirectory.open(new File(INDEX_DIR));
        IndexWriter writer = new IndexWriter(dir, conf);

        HTable t = new HTable(HBaseConfiguration.create(), Bytes.toBytes("parsedjobs"));
        Scan scan = new Scan();
        for (Result r : t.getScanner(scan)) {
            String row = Bytes.toString(r.getRow());
            String from = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("from")));
            String url = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("url")));
            String job_title = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("job_title")));
            String company_name = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("company_name")));
            String company_industry = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("company_industry")));
            String company_type = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("company_type")));
            String job_time = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("job_time")));
            String job_address = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("job_address")));
            String job_experience = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("job_experience")));
            String job_education = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("job_education")));
            long timestamp = Bytes.toLong(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("timestamp")));
            String job_description = Bytes.toString(r.getValue(Bytes.toBytes("job"),
                    Bytes.toBytes("job_description")));

            // Free-text fields are analyzed; facet fields are stored untokenized.
            Document doc = new Document();
            doc.add(new Field("row", row, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("from", from, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("job_title", job_title, Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
            doc.add(new Field("company_name", company_name, Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
            doc.add(new Field("url", url, Field.Store.YES, Field.Index.NO));
            if (null != company_industry) {
                doc.add(new Field("company_industry", company_industry, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
            }
            if (null != company_type) {
                doc.add(new Field("company_type", company_type, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
            }
            doc.add(new Field("job_time", job_time, Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            doc.add(new NumericField("timestamp", Field.Store.YES, true)
                    .setLongValue(timestamp));
            if (null != job_address) {
                doc.add(new Field("job_address", job_address, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
            }
            if (null != job_experience) {
                doc.add(new Field("job_experience", job_experience, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
            }
            if (null != job_education) {
                doc.add(new Field("job_education", job_education, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
            }
            if (null != job_description) {
                doc.add(new Field("job_description", job_description, Field.Store.YES,
                        Field.Index.ANALYZED));
            }
            // updateDocument keeps the index idempotent when re-run on the same rows.
            writer.updateDocument(new Term("row", row), doc);
        }
        writer.close();
        System.out.println("index finished!");
    }
}
IV. Search module:
1. Stack: Struts 2.2 + Lucene 3.3 + Bobo-Browse 2.5 + FreeMarker 2.3 + HBase 0.90 + jQuery
2. Overview: provides the web UI and the search features. The UI is built with Struts 2, FreeMarker, and jQuery; search is implemented with Lucene, with Bobo-Browse providing faceted (grouped) search.
3. Details:
(1) The Struts 2 action base class, BaseAction:
package com.jobsearcher.searcher.action.base;

import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.service.SearcherService;
import com.opensymphony.xwork2.ActionSupport;

// All actions share the singleton SearcherService through this base class.
public class BaseAction extends ActionSupport {

    private static final long serialVersionUID = 1L;

    protected SearcherService searcherService = SearcherService.getInstance();

    public BaseAction() throws JobsearcherException {
    }
}
(2) The search action:
package com.jobsearcher.searcher.action;

import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Map;

import com.browseengine.bobo.api.BrowseResult;
import com.browseengine.bobo.api.FacetAccessible;
import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.action.base.BaseAction;

public class SearchAction extends BaseAction {

    private static final long serialVersionUID = 1L;

    public SearchAction() throws JobsearcherException {
        super();
    }

    private int startPage = 1;
    private int pageSize = 10;
    private String keyword = "";
    private String job_time = "";          // 0 1 3 7 14 30 60
    private String city = "";
    private String from = "";              // 智联招聘 / 前程无忧
    private String company_name = "";
    private String company_industry = "";
    private String company_type = "";
    private String job_experience = "";
    private String job_education = "";
    private Map<String, FacetAccessible> facetMap;
    private int totalDocs;
    private String timeCost = "";
    private List<Map<String, String>> jobs = new ArrayList<Map<String, String>>();
    private long now;
    private long today;
    private long l3day;
    private long l1week;
    private long l2week;
    private long l1month;
    private long l2month;
    private String job_time_name = "";

    public String execute() throws JobsearcherException {
        long begin = System.nanoTime();
        BrowseResult result = searcherService.search(makeQuery(), startPage,
                pageSize, jobs);
        facetMap = result.getFacetMap();
        totalDocs = result.getNumHits();
        long end = System.nanoTime();
        NumberFormat format = NumberFormat.getInstance();
        format.setMaximumFractionDigits(3);
        timeCost = format.format((end * 1.0 - begin) / 1000000000.0);

        // Boundaries used to render the posting-date facet ranges.
        Calendar n = Calendar.getInstance();
        n.setTime(new Date());
        now = n.getTimeInMillis();
        n.set(Calendar.HOUR_OF_DAY, 0);
        today = n.getTimeInMillis();
        l3day = today - 3L * 24 * 3600 * 1000;
        l1week = today - 7L * 24 * 3600 * 1000;
        l2week = today - 14L * 24 * 3600 * 1000;
        l1month = today - 30L * 24 * 3600 * 1000;
        l2month = today - 60L * 24 * 3600 * 1000;
        return SUCCESS;
    }

    /*
     * query[0]: clauses on tokenized fields
     * query[1]: clauses on untokenized fields
     */
    private String[] makeQuery() {
        String[] query = { "", "" };
        // tokenized
        if (null != keyword && !keyword.equals("")) {
            query[0] = query[0] + "+job_title:(" + keyword + ")";
        }
        if (null != job_time && !job_time.equals("")) {
            query[0] = query[0] + "+job_time:(" + job_time + ")";
        }
        // not tokenized
        if (null != city && !city.equals("")) {
            query[1] = query[1] + "+job_address:(\"" + city + "\")";
        }
        if (null != from && !from.equals("")) {
            query[1] = query[1] + "+from:(\"" + from + "\")";
        }
        if (null != company_name && !company_name.equals("")) {
            query[1] = query[1] + "+company_name:(\"" + company_name + "\")";
        }
        if (null != company_industry && !company_industry.equals("")) {
            query[1] = query[1] + "+company_industry:(\"" + company_industry + "\")";
        }
        if (null != company_type && !company_type.equals("")) {
            query[1] = query[1] + "+company_type:(\"" + company_type + "\")";
        }
        if (null != job_experience && !job_experience.equals("")) {
            query[1] = query[1] + "+job_experience:(\"" + job_experience + "\")";
        }
        if (null != job_education && !job_education.equals("")) {
            query[1] = query[1] + "+job_education:(\"" + job_education + "\")";
        }
        return query;
    }

    public String getKeyword() {
        return keyword;
    }

    public void setKeyword(String keyword) {
        this.keyword = keyword;
    }

    // remaining getters/setters omitted
}
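In SearcherService below, the two halves of this query string are parsed with different analyzers: query[0] goes through the PaodingAnalyzer so the keyword is tokenized the same way as the indexed titles, while query[1] is parsed with a KeywordAnalyzer so facet values such as city, source site, or education level must match the untokenized fields exactly.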
(3) The action that shows one posting's details:
package com.jobsearcher.searcher.action;

import java.util.HashMap;
import java.util.Map;

import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.action.base.BaseAction;

// Loads the full posting for one row key straight from HBase.
public class ViewAction extends BaseAction {

    private static final long serialVersionUID = 1L;

    private String row;
    Map<String, String> job = new HashMap<String, String>();

    public ViewAction() throws JobsearcherException {
        super();
    }

    public String execute() throws JobsearcherException {
        job = searcherService.getJobByRow(row);
        return SUCCESS;
    }

    public String getRow() {
        return row;
    }

    public void setRow(String row) {
        this.row = row;
    }

    public Map<String, String> getJob() {
        return job;
    }

    public void setJob(Map<String, String> job) {
        this.job = job;
    }
}
(4) The service class, a singleton:
package com.jobsearcher.searcher.service;

import java.io.File;
import java.io.IOException;
import java.net.URLEncoder;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import com.browseengine.bobo.api.BoboBrowser;
import com.browseengine.bobo.api.BoboIndexReader;
import com.browseengine.bobo.api.Browsable;
import com.browseengine.bobo.api.BrowseException;
import com.browseengine.bobo.api.BrowseFacet;
import com.browseengine.bobo.api.BrowseHit;
import com.browseengine.bobo.api.BrowseRequest;
import com.browseengine.bobo.api.BrowseResult;
import com.browseengine.bobo.api.ComparatorFactory;
import com.browseengine.bobo.api.FacetSpec;
import com.browseengine.bobo.api.FacetSpec.FacetSortSpec;
import com.browseengine.bobo.api.FieldValueAccessor;
import com.browseengine.bobo.facets.FacetHandler;
import com.browseengine.bobo.facets.impl.RangeFacetHandler;
import com.browseengine.bobo.facets.impl.SimpleFacetHandler;
import com.jobsearcher.exception.JobsearcherException;

public class SearcherService {

    private static final String INDEX_DIR = "/home/qingjie/projects/jobsearcher/index";
    private static String PARSED_JOBS_HBASE = "parsedjobs";

    private static Directory dir = null;
    private static SearcherService searcherService = null;
    private static IndexSearcher searcher = null;
    private static IndexReader reader = null;
    private static Analyzer analyzer = new PaodingAnalyzer();

    // Display order for the job_experience and job_education facets.
    private static final Map<String, Integer> sort = new HashMap<String, Integer>();
    static {
        // job_experience
        sort.put("在读学生", 100);
        sort.put("应届毕业生", 101);
        sort.put("一年以上", 102);
        sort.put("二年以上", 103);
        sort.put("三年以上", 104);
        sort.put("四年以上", 105);
        sort.put("五年以上", 106);
        sort.put("六年以上", 107);
        sort.put("七年以上", 108);
        sort.put("八年以上", 109);
        sort.put("九年以上", 110);
        sort.put("十年以上", 111);
        sort.put("不限", 112);
        // job_education
        sort.put("初中", 30);
        sort.put("高中", 31);
        sort.put("中技", 32);
        sort.put("中专", 33);
        sort.put("大专", 34);
        sort.put("本科", 35);
        sort.put("硕士", 36);
        sort.put("博士", 37);
        sort.put("其他", 38);
    }

    private SearcherService() throws IOException {
        dir = FSDirectory.open(new File(INDEX_DIR));
    }

    public static synchronized SearcherService getInstance() throws JobsearcherException {
        if (null == searcherService) {
            try {
                searcherService = new SearcherService();
            } catch (IOException e) {
                e.printStackTrace();
                throw new JobsearcherException("Failed to open the index directory!");
            }
        }
        return searcherService;
    }

    public synchronized IndexSearcher getIndexSearcher() throws JobsearcherException {
        if (null == searcher) {
            try {
                searcher = new IndexSearcher(dir);
            } catch (CorruptIndexException e) {
                e.printStackTrace();
                throw new JobsearcherException("The index is corrupted!");
            } catch (IOException e) {
                e.printStackTrace();
                throw new JobsearcherException("Failed to open the index!");
            }
        }
        return searcher;
    }

    public synchronized IndexReader getIndexReader() throws JobsearcherException {
        if (null == reader) {
            try {
                reader = IndexReader.open(dir, true);
            } catch (CorruptIndexException e) {
                e.printStackTrace();
                throw new JobsearcherException("The index is corrupted!");
            } catch (IOException e) {
                e.printStackTrace();
                throw new JobsearcherException("Failed to open the index!");
            }
        }
        return reader;
    }

    // The newest `num` postings, ordered by the stored timestamp, for the home page.
    public List<Map<String, String>> getNewJobs(int num) throws JobsearcherException {
        List<Map<String, String>> jobs = new ArrayList<Map<String, String>>();
        Query q = NumericRangeQuery.newLongRange("timestamp", 0L, new Date().getTime(),
                true, true);
        IndexSearcher s = getIndexSearcher();
        try {
            // reverse=true so the newest postings come first.
            TopDocs hits = s.search(q, num,
                    new Sort(new SortField("timestamp", SortField.LONG, true)));
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = s.doc(scoreDoc.doc);
                Map<String, String> job = new HashMap<String, String>();
                job.put("job_title", doc.get("job_title"));
                job.put("row", URLEncoder.encode(doc.get("row"), "utf-8"));
                jobs.add(job);
            }
        } catch (Exception e) {
            e.printStackTrace();
            throw new JobsearcherException("Failed to fetch the newest jobs!");
        }
        return jobs;
    }

    public BrowseResult search(String[] query, int startPage, int pageSize,
            List<Map<String, String>> jobs) throws JobsearcherException {
        // One facet handler per field we group by.
        SimpleFacetHandler companyIndustryHandler = new SimpleFacetHandler("company_industry");
        SimpleFacetHandler companyTypeHandler = new SimpleFacetHandler("company_type");
        SimpleFacetHandler jobExperienceHandler = new SimpleFacetHandler("job_experience");
        SimpleFacetHandler jobEducationHandler = new SimpleFacetHandler("job_education");

        // Posting-date ranges: today / 3 days / 1-2 weeks / 1-2 months.
        Calendar n = Calendar.getInstance();
        n.setTime(new Date());
        long now = n.getTimeInMillis();
        n.set(Calendar.HOUR_OF_DAY, 0);
        long today = n.getTimeInMillis();
        long l3day = today - 3L * 24 * 3600 * 1000;
        long l1week = today - 7L * 24 * 3600 * 1000;
        long l2week = today - 14L * 24 * 3600 * 1000;
        long l1month = today - 30L * 24 * 3600 * 1000;
        long l2month = today - 60L * 24 * 3600 * 1000;
        RangeFacetHandler jobTimeHandler = new RangeFacetHandler("job_time", "job_time",
                Arrays.asList(timeRange(today, now), timeRange(l3day, now),
                        timeRange(l1week, now), timeRange(l2week, now),
                        timeRange(l1month, now), timeRange(l2month, now)));

        List<FacetHandler<?>> handlerList = Arrays.asList(new FacetHandler<?>[] {
                companyIndustryHandler, companyTypeHandler, jobTimeHandler,
                jobExperienceHandler, jobEducationHandler });
        try {
            BoboIndexReader boboReader = BoboIndexReader.getInstance(getIndexReader(),
                    handlerList);
            BrowseRequest br = new BrowseRequest();
            br.setCount(pageSize);
            br.setOffset((startPage - 1) * pageSize);

            // query[0] holds tokenized clauses, query[1] exact-match clauses,
            // hence the two parsers with different analyzers (see SearchAction.makeQuery).
            QueryParser parser0 = new QueryParser(Version.LUCENE_33, "job_title", analyzer);
            QueryParser parser1 = new QueryParser(Version.LUCENE_33, "job_title",
                    new KeywordAnalyzer());
            BooleanQuery q = new BooleanQuery();
            if (null != query[0] && !query[0].equals("")) {
                q.add(parser0.parse(query[0]), Occur.MUST);
            }
            if (null != query[1] && !query[1].equals("")) {
                q.add(parser1.parse(query[1]), Occur.MUST);
            }
            br.setQuery(q);

            // Facet output specs: by hit count, by value, or by the custom order above.
            FacetSpec generalSpec = new FacetSpec();
            generalSpec.setOrderBy(FacetSortSpec.OrderHitsDesc);
            generalSpec.setMaxCount(10);

            FacetSpec valueOrderSpec = new FacetSpec();
            valueOrderSpec.setMaxCount(10);
            valueOrderSpec.setOrderBy(FacetSortSpec.OrderByCustom);
            valueOrderSpec.setCustomComparatorFactory(new ComparatorFactory() {
                @Override
                public Comparator<Integer> newComparator(
                        FieldValueAccessor fieldValueAccessor, int[] counts) {
                    return new Comparator<Integer>() {
                        public int compare(Integer o1, Integer o2) {
                            return o2 - o1;
                        }
                    };
                }

                @Override
                public Comparator<BrowseFacet> newComparator() {
                    return new Comparator<BrowseFacet>() {
                        public int compare(BrowseFacet o1, BrowseFacet o2) {
                            return 0 - o1.getValue().compareTo(o2.getValue());
                        }
                    };
                }
            });

            FacetSpec customSortSpec = new FacetSpec();
            customSortSpec.setMaxCount(10);
            customSortSpec.setOrderBy(FacetSortSpec.OrderByCustom);
            customSortSpec.setCustomComparatorFactory(new ComparatorFactory() {
                @Override
                public Comparator<Integer> newComparator(
                        FieldValueAccessor fieldValueAccessor, int[] counts) {
                    return new Comparator<Integer>() {
                        public int compare(Integer o1, Integer o2) {
                            return o2 - o1;
                        }
                    };
                }

                @Override
                public Comparator<BrowseFacet> newComparator() {
                    return new Comparator<BrowseFacet>() {
                        public int compare(BrowseFacet o1, BrowseFacet o2) {
                            // Order by the ranks in the static `sort` map.
                            return sort.get(o1.getValue()).compareTo(sort.get(o2.getValue()));
                        }
                    };
                }
            });

            br.setFacetSpec("company_industry", generalSpec);
            br.setFacetSpec("company_type", generalSpec);
            br.setFacetSpec("job_time", valueOrderSpec);
            br.setFacetSpec("job_experience", customSortSpec);
            br.setFacetSpec("job_education", customSortSpec);

            SortField timeSort = new SortField("job_time", SortField.LONG);
            br.setSort(new SortField[] { timeSort });

            Browsable browser = new BoboBrowser(boboReader);
            BrowseResult result = browser.browse(br);

            // Highlight matches in the title and description.
            QueryScorer jobTitleScorer = new QueryScorer(q, "job_title");
            QueryScorer jobDesScorer = new QueryScorer(q, "job_description");
            SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(
                    "<span class=\"highlight\">", "</span>");
            Highlighter jobTitleHighlighter = new Highlighter(formatter, jobTitleScorer);
            Highlighter jobDesHighlighter = new Highlighter(formatter, jobDesScorer);
            jobTitleHighlighter.setTextFragmenter(new SimpleSpanFragmenter(jobTitleScorer));
            jobDesHighlighter.setTextFragmenter(new SimpleSpanFragmenter(jobDesScorer));

            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
            for (BrowseHit browseHit : result.getHits()) {
                Map<String, String> job = new HashMap<String, String>();
                Document doc = getIndexSearcher().doc(browseHit.getDocid());
                job.put("job_address", doc.get("job_address"));
                job.put("company_name", doc.get("company_name"));
                job.put("from", doc.get("from"));
                job.put("row", URLEncoder.encode(doc.get("row"), "utf-8"));
                job.put("job_time",
                        format.format(new Date(Long.parseLong(doc.get("job_time")))));

                String job_title = doc.get("job_title");
                TokenStream stream = TokenSources.getAnyTokenStream(getIndexReader(),
                        browseHit.getDocid(), "job_title", analyzer);
                String jobTitleFragment = jobTitleHighlighter.getBestFragment(stream,
                        job_title);
                if (null != jobTitleFragment && !jobTitleFragment.equals("")) {
                    job.put("job_title", jobTitleFragment);
                } else {
                    job.put("job_title", job_title);
                }

                String job_description = doc.get("job_description");
                TokenStream stream2 = TokenSources.getAnyTokenStream(getIndexReader(),
                        browseHit.getDocid(), "job_description", analyzer);
                String jobDescriptionFragment = jobDesHighlighter.getBestFragment(stream2,
                        job_description);
                String desc;
                if (null != jobDescriptionFragment && !jobDescriptionFragment.equals("")) {
                    desc = jobDescriptionFragment;
                } else {
                    desc = job_description;
                }
                if (desc.length() > 100) {
                    desc = desc.substring(0, 100) + "...";
                }
                job.put("job_description", desc);
                jobs.add(job);
            }
            return result;
        } catch (IOException e) {
            e.printStackTrace();
            throw new JobsearcherException("I/O error while searching!");
        } catch (org.apache.lucene.queryParser.ParseException e) {
            e.printStackTrace();
            throw new JobsearcherException("Failed to parse the query!");
        } catch (BrowseException e) {
            e.printStackTrace();
            throw new JobsearcherException("Failed to build the faceted result!");
        } catch (InvalidTokenOffsetsException e) {
            e.printStackTrace();
            throw new JobsearcherException("Failed to read the search results!");
        }
    }

    // utils
    private String timeRange(long from, long to) {
        return "[" + from + " TO " + to + "]";
    }

    // Fetch the full posting for one row key straight from the parsedjobs table.
    public Map<String, String> getJobByRow(String row) throws JobsearcherException {
        Map<String, String> job = new HashMap<String, String>();
        try {
            HTable t = new HTable(HBaseConfiguration.create(),
                    Bytes.toBytes(PARSED_JOBS_HBASE));
            Get get = new Get(Bytes.toBytes(row));
            Result r = t.get(get);
            byte[] fam = Bytes.toBytes("job");
            String[] fields = { "url", "from", "job_title", "job_category",
                    "company_name", "company_industry", "company_type", "company_scale",
                    "job_address", "job_count", "job_experience", "job_education",
                    "job_language", "job_salary", "job_description",
                    "company_description" };
            for (String field : fields) {
                job.put(field, Bytes.toString(r.getValue(fam, Bytes.toBytes(field))));
            }
            // job_time is stored as a millisecond value; render it as yyyy-MM-dd.
            String job_time = Bytes.toString(r.getValue(fam, Bytes.toBytes("job_time")));
            job.put("job_time", new SimpleDateFormat("yyyy-MM-dd")
                    .format(new Date(Long.parseLong(job_time))));
            return job;
        } catch (IOException e) {
            e.printStackTrace();
            throw new JobsearcherException("Failed to connect to HBase!");
        }
    }
}
4. UI (the original screenshots are not reproduced here):
(1) Home page
(2) Search-results page
(3) Job-detail page