Overview: the project is divided into four modules: a crawling module, a parsing module, an indexing module, and a search module.

Features: crawl job postings from the Zhaopin (智联招聘) and 51job (前程无忧) websites, parse and store them, build a local index, and finally provide a web UI with various search functions.

Technologies: Heritrix 3.0, hbase-writer 0.9, HBase 0.90, Hadoop 0.20.2, HTMLParser 2.0, Lucene 3.3, Bobo-Browse 2.5, Struts 2.2, FreeMarker 2.3, jQuery

Development time: 3 weeks.

Development environment: Linux (Fedora 15), Eclipse

Prior study of Hadoop and Lucene: 2 months.

Reference books: Hadoop: The Definitive Guide, 2nd Edition; Lucene in Action, Second Edition; 《开发自己的搜索引擎:Lucene+Heritrix(第2版)》 (Developing Your Own Search Engine: Lucene + Heritrix, 2nd Edition)

I. Crawling module:

1. Technologies: Heritrix 3.0 + hbase-writer 0.9 + HBase 0.90
2. Overview: use Heritrix to crawl the job-related pages of 51job and Zhaopin and save them to the HBase rawjobs table.
The rawjobs table:
    row key: Keying.createKey(url)
    column families: content, curi
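
The post does not show how the rawjobs table was created. If it is not created automatically, a minimal sketch using the HBase client API would look like the following (the table and family names come from above; the class name and the tableExists guard are just illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateRawJobsTable {
	public static void main(String[] args) throws Exception {
		Configuration conf = HBaseConfiguration.create();
		HBaseAdmin admin = new HBaseAdmin(conf);
		// the table hbase-writer writes crawled pages into
		HTableDescriptor desc = new HTableDescriptor("rawjobs");
		desc.addFamily(new HColumnDescriptor("content")); // raw-data, charset
		desc.addFamily(new HColumnDescriptor("curi"));    // url, ip, via, ...
		if (!admin.tableExists("rawjobs")) {
			admin.createTable(desc);
		}
	}
}

The parsedjobs table used by the parsing module can be created the same way, with a single column family named job.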
3. Details:
(1) Create a custom DecideRule class:
package com.qjqiao.modules.deciderules;

import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.DecideRule;

public class JobsRule extends DecideRule {
	
	private static final long serialVersionUID = 1L;


	/** Accept DNS/robots.txt prerequisites and job-related pages on zhaopin.com and 51job.com; reject everything else. */
	@Override
	protected DecideResult innerDecide(CrawlURI uri) {
		String u = uri.getURI();
		if (u.startsWith("dns")
				|| u.startsWith("DNS")
				|| u.endsWith("robots.txt")
				// www.zhaopin.com
				|| u.contains("zhaopin.com/jobseeker")
				|| u.contains("company.zhaopin.com")
				|| u.contains("jobs.zhaopin.com")
				|| u.contains("search.zhaopin.com/jobs")
				|| u.contains("search.zhaopin.com/jobseeker")
				// www.51job.com
				|| u.contains("search.51job.com")) {
			if (!u.contains("research.51job.com")) {
				return DecideResult.ACCEPT;
			}
		}
		return DecideResult.REJECT;
	}
}

(2) Create a queue assignment policy class:

package org.archive.crawler.frontier;

import org.apache.commons.httpclient.URIException;
import org.archive.modules.CrawlURI;
import org.archive.net.UURI;

/**
 * Assigns crawl URIs to one of 50 work queues using an ELF hash of the full URI,
 * instead of one queue per host, so a crawl covering only a couple of hosts is
 * still spread across many queues.
 */
public class ELFHashQueueAssignmentPolicy extends
        URIAuthorityBasedQueueAssignmentPolicy {

    /**
     * 
     */
    private static final long serialVersionUID = 1L;

    public int ELFHash(String str, int number) {
        int hash = 0;
        long x = 0l;
        char[] array = str.toCharArray();
        for (int i = 0; i < array.length; i++) {
            hash = (hash << 4) + array[i];
            if ((x = (hash & 0xF0000000L)) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        int result = (hash & 0x7FFFFFFF) % number;
        return result;
    }

    @Override
    protected String getCoreKey(UURI basis) {
        try {
            System.out.println(this.ELFHash(basis.getURI().toString(), 50)
                    + " |||| ELFHashQueueAssignmentPolicy : " + basis.getURI());
            return this.ELFHash(basis.getURI().toString(), 50) + "";
        } catch (URIException e) {
            e.printStackTrace();
            return "0";
        }
    }

}
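
For a quick sanity check of how the hash spreads URIs across the 50 queues, a small throwaway main can be run against the class above (the URLs below are made-up examples):

import org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy;

public class ELFHashDemo {
	public static void main(String[] args) {
		ELFHashQueueAssignmentPolicy policy = new ELFHashQueueAssignmentPolicy();
		// made-up job-page URLs, just to see which of the 50 queues each lands in
		String[] urls = {
				"http://search.51job.com/job/12345678,c.html",
				"http://jobs.zhaopin.com/P5/CC123456789.htm",
				"http://search.51job.com/jobsearch/advance_search.php" };
		for (String u : urls) {
			System.out.println(policy.ELFHash(u, 50) + "  <-  " + u);
		}
	}
}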

(3) Modify the crawler-beans.xml configuration file:

<bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
   <value>
metadata.operatorContactUrl=http://www.51job.com
metadata.jobName=51job
metadata.description=jobs from 51job.com
   </value>
  </property>
 </bean>

 <bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
       <property name="operatorContactUrl" value="[see override above]"/>
       <property name="jobName" value="[see override above]"/>
       <property name="description" value="[see override above]"/>
  	<property name="userAgentTemplate" 
         value="Mozilla/5.0 (compatible; heritrix/3.1.0 +@OPERATOR_CONTACT_URL@)"/>	
       
 </bean>


<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
   <bean class="org.archive.spring.ConfigFile">
    <property name="path" value="seeds.txt" />
   </bean>
  </property>
  <property name='sourceTagSeeds' value='false'/>
 </bean>


<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
   <list>
    <bean class="org.archive.modules.deciderules.RejectDecideRule">
    </bean>
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    </bean>
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
    </bean>
    <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
    </bean>
     <bean class="com.qjqiao.modules.deciderules.JobsRule">
    </bean>
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
          <property name="decision" value="REJECT"/>
          <property name="seedsAsSurtPrefixes" value="false"/>
          <property name="surtsDumpFile" value="negative-surts.dump" />
    </bean>
    <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    </bean>
    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
    </bean>
    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
    </bean>
    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
    </bean>
   </list>
  </property>
 </bean>

 <bean id="hbaseParameterSettings" class="org.archive.io.hbase.HBaseParameters">
	<property name="contentColumnFamily" value="content"></property>
	<property name="contentColumnName" value="raw-data"></property>
	<property name="charsetColumnName" value="charset"></property>
	<property name="curiColumnFamily" value="curi"></property>
	<property name="ipColumnName" value="ip"></property>
	<property name="pathFromSeedColumnName" value="path-from-seed"></property>
	<property name="isSeedColumnName" value="is-seed"></property>
	<property name="viaColumnName" value="via"></property>
	<property name="urlColumnName" value="url"></property>
	<property name="requestColumnName" value="request"></property>
	<!-- Overwrite more options here -->
</bean>

<bean id="hbaseWriterProcessor" class="org.archive.modules.writer.HBaseWriterProcessor">
	<property name="zkQuorum" value="localhost">
	</property>
	<property name="zkClientPort" value="2181">
	</property>
	<property name="hbaseTable" value="rawjobs">
	</property>
	<property name="onlyProcessNewRecords" value="false">
	</property>
	<property name="onlyWriteNewRecords" value="false">
	</property>
	<property name="hbaseParameters">
		<ref bean="hbaseParameterSettings" />
	</property>
</bean>

 <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
   <list>
    <ref bean="hbaseWriterProcessor" />
    <ref bean="candidates"/>
    <ref bean="disposition"/>
   </list>
  </property>
 </bean>

 <bean id="queueAssignmentPolicy" 
   		 class="org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy"> 	
 </bean> 
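
The post only declares the queueAssignmentPolicy bean; how it reaches the frontier depends on the Heritrix 3 configuration. If it is not picked up automatically, one option is to reference it from the frontier bean explicitly (a sketch; the property name on BdbFrontier is an assumption, not something shown in the original post):

<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
	<!-- assumption: wire the custom policy into the frontier by reference -->
	<property name="queueAssignmentPolicy">
		<ref bean="queueAssignmentPolicy" />
	</property>
</bean>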

(4) Seed file, seeds.txt:

http://search.51job.com/jobsearch/advance_search.php?lang=c&stype=2

http://www.zhaopin.com/jobseeker/index_industry.html

(5) Example run (screenshots omitted):


II. Parsing module:

1. Technologies: HTMLParser 2.0 + Hadoop MapReduce + HBase

2. Overview:

The entire parsing module runs as a single MapReduce job, so pages are processed in parallel.

It reads the crawled pages from the HBase rawjobs table, parses each page with HTMLParser to extract the job posting, normalizes the fields (the two sites use different value ranges for the same fields), and writes the result to the HBase parsedjobs table.

The parsedjobs table:

row key: Keying.createKey(url)

column family: job

3. Problem encountered: garbled Chinese text (mojibake).

Cause: different sites, and even different pages within one site, use different character encodings, so the byte stream the crawler fetches varies in encoding; hbase-writer writes that byte sequence into HBase as-is. When the parsing module later reads the page content, Bytes.toString(byte[]) always decodes as UTF-8, so non-UTF-8 pages come out garbled.

Fix: in hbase-writer's HBaseWriter, detect the charset of each fetched page and store it in the rawjobs table alongside the content:

String contentType = curi.getContentType();
String charset = "utf-8";
if (-1 != contentType.indexOf("charset=")) {
	// use the charset declared in the Content-Type response header
	charset = contentType.substring(contentType.indexOf("charset=") + 8);
} else {
	// no charset in the header: fall back to a per-site default
	if (curi.getURI().contains("51job.com")) {
		charset = "gb2312";
	} else {
		charset = "utf-8";
	}
}
batchPut.add(Bytes.toBytes(getHbaseOptions().getContentColumnFamily()),
		Bytes.toBytes(getHbaseOptions().getCharsetColumnName()),
		Bytes.toBytes(charset));

When the parsing module reads the page back, it uses a custom makeString(byte[], String) method to convert the bytes to a String with the corresponding charset:

String charset = Bytes.toString(result.getValue(Bytes.toBytes("content"),
		Bytes.toBytes("charset")));
String rawData = JobsParser.makeString(result.getValue(Bytes.toBytes("content"),
		Bytes.toBytes("raw-data")), charset);

public static String makeString(byte[] b, String charset) {
		if (b == null) {
			return null;
		}
		if (b.length == 0) {
			return "";
		}
		try {
			return new String(b, 0, b.length, charset);
		} catch (UnsupportedEncodingException e) {
			System.out.println("charset not supported?");
			e.printStackTrace();
			return null;
		}
	}

  

4. Details:

(1) The MapReduce job:

package com.jobsearcher.parser.mapreduce;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import com.jobsearcher.parser.util.PageParser;

public class JobsParser {

	/** Name of this 'program'. */
	static final String NAME = "jobsparser";
	static final String FROM_TABLENAME = "rawjobs";
	static final String TO_TABLENAME = "parsedjobs";

	/**
	 * Mapper.
	 */
	static class JobsParserMapper extends
			TableMapper<ImmutableBytesWritable, Result> {
		static int i = 0;

		// identity mapper: pass each HBase row straight through to the reducer
		protected void map(ImmutableBytesWritable key, Result value,
				Context context) throws IOException, InterruptedException {
				context.write(key, value);
		}
	}

	/*
	 * Reducer.
	 */
	static class JobsParserReducer
			extends
			TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {

		@Override
		protected void reduce(ImmutableBytesWritable row,
				Iterable<Result> results, Context context) throws IOException,
				InterruptedException {
			for (Result result : results) {
				Put put = new Put(result.getRow());

				String url = Bytes.toString(result.getValue(
						Bytes.toBytes("curi"), Bytes.toBytes("url")));

				String charset = Bytes.toString(result.getValue(
						Bytes.toBytes("content"), Bytes.toBytes("charset")));

				String rawData = JobsParser.makeString(
						result.getValue(Bytes.toBytes("content"),
								Bytes.toBytes("raw-data")), charset);

				if (null != url && null != rawData) {
					Map<String, String> job = PageParser.parse(url, rawData);
					if (null != job) {
						byte[] jobFamily = Bytes.toBytes("job");
						for (Map.Entry<String, String> kv : job.entrySet()) {
							put.add(jobFamily, Bytes.toBytes(kv.getKey()),
									Bytes.toBytes(kv.getValue()));
						}
						
						// timestamp = average of the crawl time (the HBase cell
						// timestamp of curi:url) and the posting date parsed from the page
						long timestamp = (result
								.getColumn(Bytes.toBytes("curi"),
										Bytes.toBytes("url")).get(0)
								.getTimestamp() + Long.parseLong(job
								.get("job_time"))) / 2;
						put.add(jobFamily, Bytes.toBytes("timestamp"),
								Bytes.toBytes(timestamp));
						context.write(row, put);
					}
				}
			}
		}
	}

	
	public static Job createSubmittableJob(Configuration conf)
			throws IOException {
		Job job = new Job(conf, NAME + "_" + FROM_TABLENAME + "_"
				+ TO_TABLENAME);
		job.setJarByClass(JobsParser.class);
		Scan scan = new Scan();
		TableMapReduceUtil.initTableMapperJob(FROM_TABLENAME, scan,
				JobsParserMapper.class, ImmutableBytesWritable.class,
				Result.class, job);
		
	    TableMapReduceUtil.addDependencyJars(job);

		TableMapReduceUtil.initTableReducerJob(TO_TABLENAME,
				JobsParserReducer.class, job);
		
		return job;
	}

	
	public static void main(String[] args) throws Exception {
		Configuration conf = HBaseConfiguration.create();
		Job job = createSubmittableJob(conf);
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

	
	public static String makeString(byte[] b, String charset) {
		if (b == null) {
			return null;
		}
		if (b.length == 0) {
			return "";
		}
		try {
			return new String(b, 0, b.length, charset);
		} catch (UnsupportedEncodingException e) {
			System.out.println("charset not supported?");
			e.printStackTrace();
			return null;
		}
	}
}

(2) The parser classes:

package com.jobsearcher.parser.util;

import java.util.Map;

public class PageParser {
	public static Map<String, String> parse(String url, String page) {
		if (url.contains("51job.com")) {
			return PageParser51job.parse(url,page);
		} else if (url.contains("zhaopin.com")) {
			return PageParserZhaopin.parse(url,page);
		} else {
			System.out.println("unexpected url : " + url);
			return null;
		}
	}
}

  

The 51job page parser, PageParser51job:

package com.jobsearcher.parser.util;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.StringFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

public class PageParser51job {
	public static Map<String, String> parse(String url, String page) {
		Map<String, String> job = new HashMap<String, String>();
		if (!url.matches("http://search.51job.com/job/.*html?")) {
			System.out.println("abandoned : " + url);
			return null;
		}

		Parser parser = Parser.createParser(page, "UTF-8");
		NodeFilter jobTitleFilter = new CssSelectorNodeFilter(
				"div.s_txt_jobs table.jobs_1 td.sr_bt");
		NodeFilter companyNameFilter = new CssSelectorNodeFilter(
				"div.s_txt_jobs table.jobs_1 table td");
		NodeFilter companyPropsFilter = new AndFilter(
				new CssSelectorNodeFilter("div.s_txt_jobs table.jobs_1 td"),
				new HasChildFilter(new StringFilter("公司行业")));
		NodeFilter jobPropsNameFilter = new CssSelectorNodeFilter(
				"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td.txt_1");
		NodeFilter jobPropsValueFilter = new CssSelectorNodeFilter(
				"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td.txt_2");
		NodeFilter jobDescriptionFilter = new CssSelectorNodeFilter(
				"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td div");
		NodeFilter jobCategoryFilter = new AndFilter(new CssSelectorNodeFilter(
				"div.s_txt_jobs div.jobs_com div.grayline div table.jobs_1 td"),new HasChildFilter(new StringFilter("职位职能")));
		NodeFilter companyDescriptionFilter = new AndFilter(
				new CssSelectorNodeFilter(
						"div.s_txt_jobs div.jobs_com div.grayline div.jobs_txt"),
				new HasChildFilter(new TagNameFilter("p")));

		try {
			NodeList nodeList = parser.parse(jobTitleFilter);
			Node job_title_node = nodeList.elementAt(0);
			job.put("job_title", job_title_node.toPlainTextString().trim());

			parser.reset();
			nodeList = parser.parse(companyNameFilter);
			Node company_name_node = nodeList.elementAt(0);
			String rawName = company_name_node.toPlainTextString();
			String cname = rawName.substring(
					0,
					(-1 == rawName.indexOf("&")) ? rawName.length() : rawName
							.indexOf("&"));
			job.put("company_name", cname.trim());

			parser.reset();
			nodeList = parser.parse(companyPropsFilter);
			Node company_props_node = nodeList.elementAt(0);
			Parser company_props_parser = new Parser(
					company_props_node.toHtml());
			TextExtractingVisitor visitor = new TextExtractingVisitor();
			company_props_parser.visitAllNodesWith(visitor);
			String company_props = visitor.getExtractedText();
			int s_industry = company_props.indexOf("公司行业:");
			int s_type = company_props.indexOf("公司性质:");
			int s_scale = company_props.indexOf("公司规模:");
			int len = company_props.length();
			if (-1 != s_industry) {
				job.put("company_industry",
						company_props.substring(s_industry + 5,
								-1 != s_type ? s_type : (-1 != s_scale? s_scale: len)).trim());
			}
			if (-1 != s_type) {
				job.put("company_type",
						adaptCType(company_props.substring(
								s_type + 5,
								-1 != s_scale ? s_scale : len).trim()));
			}
			if (-1 != s_scale) {
				job.put("company_scale",
						company_props.substring(
								s_scale + 5).trim());
			}

			parser.reset();
			NodeList jobPropsNameNodeList = parser.parse(jobPropsNameFilter);
			parser.reset();
			NodeList jobPropsValueNodeList = parser.parse(jobPropsValueFilter);
			Node jobPropsNameNode;
			Node jobPropsValueNode;
			for (int i = 0; i < jobPropsNameNodeList.size(); i++) {
				jobPropsNameNode = jobPropsNameNodeList.elementAt(i);
				jobPropsValueNode = jobPropsValueNodeList.elementAt(i);
				String name = jobPropsNameNode.toPlainTextString().trim();
				if (name.contains("发布日期")) {
					job.put("job_time", adaptDate(jobPropsValueNode.toPlainTextString()
							.trim()) + "");
				}
				if (name.contains("工作地点")) {
					job.put("job_address", jobPropsValueNode
							.toPlainTextString().trim());
				}
				if (name.contains("招聘人数")) {
					job.put("job_count", jobPropsValueNode.toPlainTextString()
							.trim());
				}
				if (name.contains("工作年限")) {
					job.put("job_experience", jobPropsValueNode
							.toPlainTextString().trim());
				}
				if (name.contains("学") && name.contains("历")) {
					job.put("job_education", jobPropsValueNode
							.toPlainTextString().trim());
				}
				if (name.contains("语言要求")) {
					job.put("job_language", jobPropsValueNode
							.toPlainTextString().trim());
				}
				if (name.contains("薪水范围")) {
					job.put("job_salary", jobPropsValueNode.toPlainTextString()
							.trim());
				}
			}

			parser.reset();
			nodeList = parser.parse(jobDescriptionFilter);
			Node job_desc_node = nodeList.elementAt(0);
			job.put("job_description", job_desc_node.getChildren().toHtml()
					.trim());

			parser.reset();
			nodeList = parser.parse(companyDescriptionFilter);
			Node company_desc_node = nodeList.elementAt(0);
			String rawComDesc = company_desc_node.getChildren().toHtml();
			job.put("company_description",
					rawComDesc.replaceAll("<a.*>.*</a>", "").trim());
			
			
			parser.reset();
			nodeList = parser.parse(jobCategoryFilter);
			Node job_category_node = nodeList.elementAt(0);
			if(null != job_category_node){
				String rawJobc = job_category_node.toPlainTextString()
						.replaceAll("&nbsp;|\u00A0", " "); // normalize non-breaking spaces (the original line replaced " " with " ", a no-op)
				job.put("job_category",
						rawJobc.substring(rawJobc.indexOf("职位职能:") + 5).trim());
			}

		} catch (Exception e) {
			System.out.println("abandoned : " + url);
			e.printStackTrace();
			return null;
		}

		job.put("from", "前程无忧");
		job.put("url", url);
		System.out.println("parsed : " + url);
		return job;
	}
	
	private static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
	private static long adaptDate(String date) throws ParseException{
		return df.parse(date).getTime();
	}
	private static String adaptCType(String ctype){
		if(ctype.contains("合资")){
			return "合资";
		}
		if(ctype.contains("民营")){
			return "民营";
		}
		if(ctype.contains("国企")){
			return "国企";
		}
		if(ctype.contains("外资")){
			return "外商独资";
		}
		if(ctype.contains("代表处")){
			return "外企代表处";
		}
		if(ctype.contains("机关")){
			return "国家机关";
		}
		if(ctype.contains("事业单位")){
			return "事业单位";
		}
		else{
			return "其他";
		}
	} 
}

The Zhaopin page parser, PageParserZhaopin:

package com.jobsearcher.parser.util;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

public class PageParserZhaopin {
	public static Map<String, String> parse(String url, String page) {
		Map<String, String> job = new HashMap<String, String>();
		if (!url.matches("http://jobs.zhaopin.com/.*html?")) {
			System.out.println("not a valid url,abandoned : " + url);
			return null;
		}

		Parser parser = Parser.createParser(page, "UTF-8");
		NodeFilter jobTitleFilter = new CssSelectorNodeFilter(
				"#positionTitle h1");
		NodeFilter companyPropsFilter = new CssSelectorNodeFilter(
				"#zpcontent > table.companyInfoTab");
		NodeFilter jobPropsFilter = new CssSelectorNodeFilter(
				"#zpcontent table.jobInfoTab table.jobInfoItems");
		NodeFilter jobDesFilter = new CssSelectorNodeFilter(
				"#zpcontent table.jobInfoTab div.jobDes div div");
		NodeFilter companyDesFilter = new AndFilter(new CssSelectorNodeFilter(
				"#zpcontent table.black12 td"), new HasChildFilter(
				new AndFilter(new TagNameFilter("p"), new HasChildFilter(
						new TagNameFilter("br")))));

		try {
			NodeList nodeList = parser.parse(jobTitleFilter);
			Node job_title_node = nodeList.elementAt(0);
			job.put("job_title", job_title_node.toPlainTextString().trim());

			// company props
			parser.reset();
			nodeList = parser.parse(companyPropsFilter);
			Node cp_node = nodeList.elementAt(0);
			Parser cp_parser = new Parser(cp_node.toHtml());
			TextExtractingVisitor visitor = new TextExtractingVisitor();
			cp_parser.visitAllNodesWith(visitor);
			String cp = visitor.getExtractedText();
			int s_industry = cp.indexOf("公司行业:");
			int s_type = cp.indexOf("公司类型:");
			int s_scale = cp.indexOf("公司规模:");
			int len = cp.length();
			job.put("company_name", cp.substring(0, s_industry).trim());
			if (-1 != s_industry) {
				job.put("company_industry",
						cp.substring(
								s_industry + 5,
								-1 != s_type ? s_type
										: (-1 != s_scale ? s_scale : len))
								.trim());
			}
			if (-1 != s_type) {
				job.put("company_type",
						cp.substring(s_type + 5, -1 != s_scale ? s_scale : len)
								.trim());
			}
			if (-1 != s_scale) {
				job.put("company_scale", cp.substring(s_scale + 5).trim());
			}

			// job props
			parser.reset();
			nodeList = parser.parse(jobPropsFilter);
			Node jp_node = nodeList.elementAt(0);
			Parser jp_parser = new Parser(jp_node.toHtml());
			TextExtractingVisitor visitor2 = new TextExtractingVisitor();
			jp_parser.visitAllNodesWith(visitor2);
			String jp = visitor2.getExtractedText();
			int sj[] = new int[10];
			int sj_category = sj[0] = jp.indexOf("职位类别");
			int sj_addr = sj[1] = jp.indexOf("工作地点");
			int sj_time = sj[2] = jp.indexOf("发布日期");
			int sj_experience = sj[3] = jp.indexOf("工作经验");
			int sj_education = sj[4] = jp.indexOf("最低学历");
			int sj_manage = sj[5] = jp.indexOf("管理经验");
			int sj_type = sj[6] = jp.indexOf("工作性质");
			int sj_count = sj[7] = jp.indexOf("招聘人数");
			int sj_salary = sj[8] = jp.indexOf("职位月薪");
			int jlen = sj[9] = jp.length();

			if (-1 != sj_category) {
				job.put("job_category",
						jp.substring(sj_category + 5,
								sjNext(sj,1)).trim());
			}
			if (-1 != sj_addr) {
				job.put("job_address",
						jp.substring(sj_addr + 5,
								sjNext(sj,2)).trim()
								.split(" ")[0]);
			}
			if (-1 != sj_time) {
				job.put("job_time",
						adaptDate(jp.substring(sj_time + 5,
								sjNext(sj,3))
								.trim()) + "");
			}
			if (-1 != sj_experience) {
				job.put("job_experience",
						adaptExp(jp.substring(sj_experience + 5,
								sjNext(sj,4))
								.trim()));
			}
			if (-1 != sj_education) {
				job.put("job_education",
						adaptEdu(jp.substring(sj_education + 5,
								sjNext(sj,5)).trim()));
			}
			if (-1 != sj_manage) {
				job.put("job_manage",
						jp.substring(sj_manage + 5,
								sjNext(sj,6)).trim());
			}
			if (-1 != sj_type) {
				job.put("job_type",
						jp.substring(sj_type + 5,
								sjNext(sj,7)).trim());
			}
			if (-1 != sj_count) {
				job.put("job_count",
						jp.substring(sj_count + 5,
								sjNext(sj,8)).trim());
			}
			if (-1 != sj_salary) {
				job.put("job_salary", jp.substring(sj_salary + 5).trim());
			}

			// job description
			parser.reset();
			nodeList = parser.parse(jobDesFilter);
			String jobDesc = "";
			NodeList nl;
			for (int i = 0; i < nodeList.size(); i++) {
				nl = nodeList.elementAt(i).getChildren();
				if (null != nl) {
					jobDesc += nl.toHtml();
				}
			}
			job.put("job_description", jobDesc.trim());

			// company desc
			parser.reset();
			nodeList = parser.parse(companyDesFilter);
			job.put("company_description", nodeList.elementAt(0).getChildren()
					.toHtml().trim());

		} catch (Exception e) {
			System.out.println("abandoned : " + url);
			e.printStackTrace();
			return null;
		}

		job.put("from", "智联招聘");
		job.put("url", url);
		System.out.println("parsed : " + url);
		return job;
	}

	private static int sjNext(int[] a, int index) {
		for(int i=index; i < a.length; i++ ){
			if(-1 != a[i])
				return a[i];
		}
		return -1;
	}
	
	private static String adaptExp(String exp){
		// check "10" before "1"; otherwise "10年以上" would fall into the "1" branch
		if(exp.contains("10")){
			return "十年以上";
		}
		else if(exp.contains("1")){
			return "一年以上";
		}
		else if(exp.contains("2")){
			return "二年以上";
		}
		else if(exp.contains("3")){
			return "三年以上";
		}
		else if(exp.contains("4")){
			return "四年以上";
		}
		else if(exp.contains("5")){
			return "五年以上";
		}
		else if(exp.contains("6")){
			return "六年以上";
		}
		else if(exp.contains("7")){
			return "七年以上";
		}
		else if(exp.contains("8")){
			return "八年以上";
		}
		else if(exp.contains("9")){
			return "九年以上";
		}
		else if(exp.contains("10")){
			return "十年以上";
		}
		else{
			return "不限";
		}
	}
	
	private static String adaptEdu(String edu){
		if(edu.contains("初中")){
			return "初中";
		}
		if(edu.contains("高中")){
			return "高中";
		}
		if(edu.contains("中专")){
			return "中专";
		}
		if(edu.contains("中技")){
			return "中技";
		}
		if(edu.contains("大专")){
			return "大专";
		}
		if(edu.contains("本科")){
			return "本科";
		}
		if(edu.contains("硕士")){
			return "硕士";
		}
		if(edu.contains("博士")){
			return "博士";
		}
		else {
			return "其他";
		}
	}
	
	private static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
	private static long adaptDate(String date) throws ParseException{
		return df.parse(date).getTime();
	}

}

  

III. Indexing module:

1. Technologies: Lucene 3.3 + HBase 0.90

2. Overview: read the job postings from the HBase parsedjobs table and build a Lucene index. Paoding Analysis (庖丁解牛) is used for Chinese word segmentation.
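
As a quick illustration of what the analyzer does at index time, a throwaway sketch segmenting a made-up job title (it assumes the Paoding dictionaries are configured, e.g. via a paoding-dic-home.properties file on the classpath):

import java.io.StringReader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
	public static void main(String[] args) throws Exception {
		Analyzer analyzer = new PaodingAnalyzer();
		// segment a sample job title the same way job_title is analyzed at index time
		TokenStream ts = analyzer.tokenStream("job_title",
				new StringReader("高级Java开发工程师"));
		CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
		ts.reset();
		while (ts.incrementToken()) {
			System.out.println(term.toString());
		}
	}
}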

3. Details:

package com.jobsearcher.indexer;

import java.io.File;
import java.io.IOException;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class JobsIndexer {
	private static final String INDEX_DIR = "/home/qingjie/projects/jobsearcher/index";

	public static void main(String... args) throws IOException {
		System.out.println("indexing");
		int count = 0;

		Analyzer analyzer = new PaodingAnalyzer();
		IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_33,
				analyzer);
		Directory dir = FSDirectory.open(new File(INDEX_DIR));
		IndexWriter writer = new IndexWriter(dir, conf);

		HTable t = new HTable(HBaseConfiguration.create(),
				Bytes.toBytes("parsedjobs"));
		// scan every row of parsedjobs and (re)index it, keyed by the HBase row key
		Scan scan = new Scan();
		for (Result r : t.getScanner(scan)) {
			String row = Bytes.toString(r.getRow());
			String from = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("from")));
			String url = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("url")));
			String job_title = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_title")));
			String company_name = Bytes.toString(r.getValue(
					Bytes.toBytes("job"), Bytes.toBytes("company_name")));
			String company_industry = Bytes.toString(r.getValue(
					Bytes.toBytes("job"), Bytes.toBytes("company_industry")));
			String company_type = Bytes.toString(r.getValue(
					Bytes.toBytes("job"), Bytes.toBytes("company_type")));
			String job_time = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_time")));
			String job_address = Bytes.toString(r.getValue(
					Bytes.toBytes("job"), Bytes.toBytes("job_address")));
			String job_experience = Bytes.toString(r.getValue(
					Bytes.toBytes("job"), Bytes.toBytes("job_experience")));
			String job_education = Bytes.toString(r.getValue(
					Bytes.toBytes("job"), Bytes.toBytes("job_education")));
			long timestamp = Bytes.toLong(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("timestamp")));
			String job_description = Bytes.toString(r.getValue(
					Bytes.toBytes("job"), Bytes.toBytes("job_description")));
			// index
			Document doc = new Document();
			doc.add(new Field("row", row, Field.Store.YES,
					Field.Index.NOT_ANALYZED));
			doc.add(new Field("from", from, Field.Store.YES,
					Field.Index.NOT_ANALYZED));
			doc.add(new Field("job_title", job_title, Field.Store.YES,
					Field.Index.ANALYZED,
					Field.TermVector.WITH_POSITIONS_OFFSETS));
			doc.add(new Field("company_name", company_name, Field.Store.YES,
					Field.Index.ANALYZED,
					Field.TermVector.WITH_POSITIONS_OFFSETS));
			doc.add(new Field("url", url, Field.Store.YES,
					Field.Index.NO));
			if (null != company_industry) {
				doc.add(new Field("company_industry", company_industry,
						Field.Store.YES, Field.Index.NOT_ANALYZED));
			}
			if (null != company_type) {
				doc.add(new Field("company_type", company_type,
						Field.Store.YES, Field.Index.NOT_ANALYZED));
			}
			doc.add(new Field("job_time",job_time,Field.Store.YES,Field.Index.NOT_ANALYZED));
			doc.add(new NumericField("timestamp",Field.Store.YES,true).setLongValue(timestamp));
			if (null != job_address) {
				doc.add(new Field("job_address", job_address, Field.Store.YES,
						Field.Index.NOT_ANALYZED));
			}
			if (null != job_experience) {
				doc.add(new Field("job_experience", job_experience,
						Field.Store.YES, Field.Index.NOT_ANALYZED));
			}
			if (null != job_education) {
				doc.add(new Field("job_education", job_education,
						Field.Store.YES, Field.Index.NOT_ANALYZED));
			}
			if (null != job_description) {
				doc.add(new Field("job_description", job_description,
						Field.Store.YES, Field.Index.ANALYZED));
			}
			writer.updateDocument(new Term("row", row), doc);
		}
		writer.close();
		System.out.println("index finished!");
	}
}

  


IV. Search module:

1. Technologies: Struts 2.2 + Lucene 3.3 + Bobo-Browse 2.5 + FreeMarker 2.3 + HBase 0.90 + jQuery

2. Overview: provides the web UI and the search functions. The web UI is built with Struts 2, FreeMarker, and jQuery; search is implemented with Lucene, with Bobo-Browse providing faceted (grouped) search.

3. Details:

(1) The Struts 2 base action class, BaseAction:

package com.jobsearcher.searcher.action.base;

import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.service.SearcherService;
import com.opensymphony.xwork2.ActionSupport;

public class BaseAction extends ActionSupport{
	private static final long serialVersionUID = 1L;
	protected SearcherService searcherService = SearcherService.getInstance();
	
	public BaseAction() throws JobsearcherException{
	}
}

  


(2) The search action:

package com.jobsearcher.searcher.action;

import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Map;

import com.browseengine.bobo.api.BrowseResult;
import com.browseengine.bobo.api.FacetAccessible;
import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.action.base.BaseAction;

public class SearchAction extends BaseAction {

	public SearchAction() throws JobsearcherException {
		super();
	}

	/**
	 * 
	 */
	private static final long serialVersionUID = 1L;

	private int startPage = 1;
	private int pageSize = 10;

	private String keyword = "";
	private String job_time = ""; // 0 1 3 7 14 30 60
	
	private String city = "";
	private String from = ""; // 智联招聘 前程无忧
	private String company_name = "";
	private String company_industry = "";
	private String company_type = "";
	private String job_experience = "";
	private String job_education = "";

	private Map<String, FacetAccessible> facetMap;
	private int totalDocs;

	private String timeCost = "";

	private List<Map<String, String>> jobs = new ArrayList<Map<String, String>>();

	private long now;
	private long today;
	private long l3day;
	private long l1week;
	private long l2week;
	private long l1month;
	private long l2month;
	
	private String job_time_name = "";


	public String execute() throws JobsearcherException {
		long begin = System.nanoTime();
		BrowseResult result = searcherService.search(makeQuery(), startPage,
				pageSize, jobs);
		facetMap = result.getFacetMap();
		totalDocs = result.getNumHits();
		long end = System.nanoTime();

		NumberFormat format = NumberFormat.getInstance();
		format.setMaximumFractionDigits(3);
		timeCost = format.format((end * 1.0 - begin) / 1000000000.0);

		// job_time
		Calendar n = Calendar.getInstance();
		n.setTime(new Date());
		now = n.getTimeInMillis();
		// truncate to local midnight
		n.set(Calendar.HOUR_OF_DAY, 0);
		n.set(Calendar.MINUTE, 0);
		n.set(Calendar.SECOND, 0);
		n.set(Calendar.MILLISECOND, 0);
		today = n.getTimeInMillis();
		l3day = today - 3l * 24 * 3600 * 1000;
		l1week = today - 7l * 24 * 3600 * 1000;
		l2week = today - 14l * 24 * 3600 * 1000;
		l1month = today - 30l * 24 * 3600 * 1000;
		l2month = today - 60l * 24 * 3600 * 1000;

		return SUCCESS;
	}

	/*
	 * query[0]: clauses to be tokenized (parsed with the Chinese analyzer)
	 * query[1]: clauses not to be tokenized (parsed with KeywordAnalyzer)
	 */
	private String[] makeQuery() {
		String[] query = {"",""};
		//tokenized
		if (null != keyword && !keyword.equals("")) {
			query[0] = query[0] + "+job_title:(" + keyword + ")";
		}
		if (null != job_time && !job_time.equals("")) {
			query[0] = query[0] + "+job_time:(" + job_time + ")";
		}
		
		//not tokenized 
		if (null != city && !city.equals("")) {
			query[1] = query[1] + "+job_address:(\"" + city + "\")";
		}
		if (null != from && !from.equals("")) {
			query[1] = query[1] + "+from:(\"" + from + "\")";
		}
		if (null != company_name && !company_name.equals("")) {
			query[1] = query[1] + "+company_name:(\"" + company_name + "\")";
		}
		if (null != company_industry && !company_industry.equals("")) {
			query[1] = query[1] + "+company_industry:(\"" + company_industry + "\")";
		}
		if (null != company_type && !company_type.equals("")) {
			query[1] = query[1] + "+company_type:(\"" + company_type + "\")";
		}
		if (null != job_experience && !job_experience.equals("")) {
			query[1] = query[1] + "+job_experience:(\"" + job_experience + "\")";
		}
		if (null != job_education && !job_education.equals("")) {
			query[1] = query[1] + "+job_education:(\"" + job_education + "\")";
		}
		return query;
	}

	public String getKeyword() {
		return keyword;
	}

	public void setKeyword(String keyword) {
		this.keyword = keyword;
	}

	// remaining getters and setters omitted
	
}
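
To make the two query strings concrete: for a hypothetical request with keyword="java", city="北京" and job_education="本科" (values invented for illustration), makeQuery() would return roughly:

query[0]:  +job_title:(java)
query[1]:  +job_address:("北京")+job_education:("本科")

query[0] is later parsed with the Paoding analyzer and query[1] with KeywordAnalyzer, as shown in the SearcherService below.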

(3) The action for viewing a job's details:

package com.jobsearcher.searcher.action;

import java.util.HashMap;
import java.util.Map;

import com.jobsearcher.exception.JobsearcherException;
import com.jobsearcher.searcher.action.base.BaseAction;

public class ViewAction extends BaseAction{

	public ViewAction() throws JobsearcherException {
		super();
	}

	private static final long serialVersionUID = 1L;
	
	private String row;
	Map<String,String> job = new HashMap<String,String>();
	
	public String execute() throws JobsearcherException {
		job = searcherService.getJobByRow(row);
		return SUCCESS;
	}

	public String getRow() {
		return row;
	}

	public void setRow(String row) {
		this.row = row;
	}

	public Map<String, String> getJob() {
		return job;
	}

	public void setJob(Map<String, String> job) {
		this.job = job;
	}
}
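
The struts.xml that maps these actions to FreeMarker templates is not included in the post; a minimal sketch might look like the following (the action names and template paths are assumptions, only the action classes come from the code above):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE struts PUBLIC
	"-//Apache Software Foundation//DTD Struts Configuration 2.0//EN"
	"http://struts.apache.org/dtds/struts-2.0.dtd">
<struts>
	<package name="jobsearcher" extends="struts-default">
		<!-- search results page -->
		<action name="search" class="com.jobsearcher.searcher.action.SearchAction">
			<result name="success" type="freemarker">/WEB-INF/ftl/search.ftl</result>
		</action>
		<!-- job detail page -->
		<action name="view" class="com.jobsearcher.searcher.action.ViewAction">
			<result name="success" type="freemarker">/WEB-INF/ftl/view.ftl</result>
		</action>
	</package>
</struts>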

  

(4) The service class, implemented as a singleton:

package com.jobsearcher.searcher.service;

import java.io.File;
import java.io.IOException;
import java.net.URLEncoder;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import com.browseengine.bobo.api.BoboBrowser;
import com.browseengine.bobo.api.BoboIndexReader;
import com.browseengine.bobo.api.Browsable;
import com.browseengine.bobo.api.BrowseException;
import com.browseengine.bobo.api.BrowseFacet;
import com.browseengine.bobo.api.BrowseHit;
import com.browseengine.bobo.api.BrowseRequest;
import com.browseengine.bobo.api.BrowseResult;
import com.browseengine.bobo.api.ComparatorFactory;
import com.browseengine.bobo.api.FacetSpec;
import com.browseengine.bobo.api.FacetSpec.FacetSortSpec;
import com.browseengine.bobo.api.FieldValueAccessor;
import com.browseengine.bobo.facets.FacetHandler;
import com.browseengine.bobo.facets.impl.RangeFacetHandler;
import com.browseengine.bobo.facets.impl.SimpleFacetHandler;
import com.jobsearcher.exception.JobsearcherException;

public class SearcherService {

	private static final String INDEX_DIR = "/home/qingjie/projects/jobsearcher/index";
	private static Directory dir = null;
	private static String PARSED_JOBS_HBASE = "parsedjobs";

	private static SearcherService searcherService = null;
	private static IndexSearcher searcher = null;
	private static IndexReader reader = null;
	private static Analyzer analyzer = new PaodingAnalyzer();

	private static final Map<String, Integer> sort = new HashMap<String, Integer>();

	static {
		
		// job_experience
		sort.put("在读学生", 100);
		sort.put("应届毕业生", 101);
		sort.put("一年以上", 102);
		sort.put("二年以上", 103);
		sort.put("三年以上", 104);
		sort.put("四年以上", 105);
		sort.put("五年以上", 106);
		sort.put("六年以上", 107);
		sort.put("七年以上", 108);
		sort.put("八年以上", 109);
		sort.put("九年以上", 110);
		sort.put("十年以上", 111);
		sort.put("不限", 112);

		// job_education
		sort.put("初中", 30);
		sort.put("高中", 31);
		sort.put("中技", 32);
		sort.put("中专", 33);
		sort.put("大专", 34);
		sort.put("本科", 35);
		sort.put("硕士", 36);
		sort.put("博士", 37);
		sort.put("其他", 38);
	}

	private SearcherService() throws IOException {
		dir = FSDirectory.open(new File(INDEX_DIR));
	}

	public static synchronized SearcherService getInstance()
			throws JobsearcherException {
		if (null == searcherService) {
			try {
				searcherService = new SearcherService();
			} catch (IOException e) {
				e.printStackTrace();
				throw new JobsearcherException("创建索引目录过程出错!");
			}
		}
		return searcherService;
	}

	public synchronized IndexSearcher getIndexSearcher()
			throws JobsearcherException {
		if (null == searcher) {
			try {
				searcher = new IndexSearcher(dir);
			} catch (CorruptIndexException e) {
				e.printStackTrace();
				throw new JobsearcherException("索引目录已经损坏!");
			} catch (IOException e) {
				e.printStackTrace();
				throw new JobsearcherException("打开索引目录过程出错!");
			}
		}
		return searcher;
	}

	public synchronized IndexReader getIndexReader()
			throws JobsearcherException {
		if (null == reader) {
			try {
				reader = IndexReader.open(dir, true);
			} catch (CorruptIndexException e) {
				e.printStackTrace();
				throw new JobsearcherException("索引目录已经损坏!");
			} catch (IOException e) {
				e.printStackTrace();
				throw new JobsearcherException("打开索引目录过程出错!");
			}
		}
		return reader;
	}

	public List<Map<String, String>> getNewJobs(int num)
			throws JobsearcherException {
		List<Map<String, String>> jobs = new ArrayList<Map<String, String>>();
		Query q = NumericRangeQuery.newLongRange("timestamp", 0l,
				new Date().getTime(), true, true);
		IndexSearcher s = getIndexSearcher();
		TopDocs hits;
		try {
			// reverse the sort on timestamp so the newest postings come first
			hits = s.search(q, num, new Sort(new SortField("timestamp",
					SortField.LONG, true)));
			for (ScoreDoc scoreDoc : hits.scoreDocs) {
				Document doc = s.doc(scoreDoc.doc);
				Map<String, String> job = new HashMap<String, String>();
				job.put("job_title", doc.get("job_title"));
				job.put("row", URLEncoder.encode(doc.get("row"), "utf-8"));
				jobs.add(job);
			}
		} catch (Exception e) {
			e.printStackTrace();
			throw new JobsearcherException("搜索最新工作过程出错!");
		}
		return jobs;
	}

	public BrowseResult search(String[] query, int startPage, int pageSize,
			List<Map<String, String>> jobs) throws JobsearcherException {
		SimpleFacetHandler companyIndustryHandler = new SimpleFacetHandler(
				"company_industry");
		SimpleFacetHandler companyTypeHandler = new SimpleFacetHandler(
				"company_type");
		SimpleFacetHandler jobExperienceHandler = new SimpleFacetHandler(
				"job_experience");
		SimpleFacetHandler jobEducationHandler = new SimpleFacetHandler(
				"job_education");

		Calendar n = Calendar.getInstance();
		n.setTime(new Date());
		long now = n.getTimeInMillis();
		// truncate to local midnight
		n.set(Calendar.HOUR_OF_DAY, 0);
		n.set(Calendar.MINUTE, 0);
		n.set(Calendar.SECOND, 0);
		n.set(Calendar.MILLISECOND, 0);
		long today = n.getTimeInMillis();
		long l3day = today - 3l * 24 * 3600 * 1000;
		long l1week = today - 7l * 24 * 3600 * 1000;
		long l2week = today - 14l * 24 * 3600 * 1000;
		long l1month = today - 30l * 24 * 3600 * 1000;
		long l2month = today - 60l * 24 * 3600 * 1000;

		RangeFacetHandler jobTimeHandler = new RangeFacetHandler("job_time",
				"job_time", Arrays.asList(timeRange(today, now),
						timeRange(l3day, now), timeRange(l1week, now),
						timeRange(l2week, now), timeRange(l1month, now),
						timeRange(l2month, now)));

		List<FacetHandler<?>> handlerList = Arrays
				.asList(new FacetHandler<?>[] { companyIndustryHandler,
						companyTypeHandler, jobTimeHandler,
						jobExperienceHandler, jobEducationHandler });

		try {
			BoboIndexReader boboReader = BoboIndexReader.getInstance(
					getIndexReader(), handlerList);

			BrowseRequest br = new BrowseRequest();
			br.setCount(pageSize);
			br.setOffset((startPage - 1) * pageSize);

			QueryParser parser0 = new QueryParser(Version.LUCENE_34,
					"job_title", analyzer);
			QueryParser parser1 = new QueryParser(Version.LUCENE_34,
					"job_title", new KeywordAnalyzer());
			BooleanQuery q = new BooleanQuery();
			if(null != query[0] && !query[0].equals("")){
				Query q0 = parser0.parse(query[0]);
				q.add(q0, Occur.MUST);
			}
			if(null != query[1] && !query[1].equals("")){
				Query q1 = parser1.parse(query[1]);
				q.add(q1, Occur.MUST);
			}
			
			br.setQuery(q);

			FacetSpec generalSpec = new FacetSpec();
			generalSpec.setOrderBy(FacetSortSpec.OrderHitsDesc);
			generalSpec.setMaxCount(10);

			FacetSpec valueOrderSpec = new FacetSpec();
			valueOrderSpec.setMaxCount(10);
			valueOrderSpec.setOrderBy(FacetSortSpec.OrderByCustom);
			valueOrderSpec.setCustomComparatorFactory(new ComparatorFactory() {

				@Override
				public Comparator<Integer> newComparator(
						FieldValueAccessor fieldValueAccessor, int[] counts) {
					return new Comparator<Integer>() {
						public int compare(Integer o1, Integer o2) {
							return o2 - o1;
						}
					};
				}

				@Override
				public Comparator<BrowseFacet> newComparator() {
					return new Comparator<BrowseFacet>() {
						public int compare(BrowseFacet o1, BrowseFacet o2) {
							return 0 - o1.getValue().compareTo(o2.getValue());
						}
					};
				}
			});

			FacetSpec customSortSpec = new FacetSpec();
			customSortSpec.setMaxCount(10);
			customSortSpec.setOrderBy(FacetSortSpec.OrderByCustom);
			customSortSpec.setCustomComparatorFactory(new ComparatorFactory() {

				@Override
				public Comparator<Integer> newComparator(
						FieldValueAccessor fieldValueAccessor, int[] counts) {
					return new Comparator<Integer>() {
						public int compare(Integer o1, Integer o2) {
							return o2 - o1;
						}
					};
				}

				@Override
				public Comparator<BrowseFacet> newComparator() {
					return new Comparator<BrowseFacet>() {
						public int compare(BrowseFacet o1, BrowseFacet o2) {
							return sort.get(o1.getValue()).compareTo(
									sort.get(o2.getValue()));
						}
					};
				}
			});

			br.setFacetSpec("company_industry", generalSpec);
			br.setFacetSpec("company_type", generalSpec);
			br.setFacetSpec("job_time", valueOrderSpec);
			br.setFacetSpec("job_experience", customSortSpec);
			br.setFacetSpec("job_education", customSortSpec);

			SortField timeSort = new SortField("job_time", SortField.LONG);

			br.setSort(new SortField[] { timeSort });

			Browsable browser = new BoboBrowser(boboReader);
			BrowseResult result = browser.browse(br);

			// highlight jobs
			QueryScorer jobTitleScorer = new QueryScorer(q, "job_title");
			// score description highlights against job_description, not job_title
			QueryScorer jobDesScorer = new QueryScorer(q, "job_description");
			SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(
					"<span class=\"highlight\">", "</span>");
			Highlighter jobTitleHighlighter = new Highlighter(formatter,
					jobTitleScorer);
			Highlighter jobDesHighlighter = new Highlighter(formatter,
					jobDesScorer);
			jobTitleHighlighter.setTextFragmenter(new SimpleSpanFragmenter(
					jobTitleScorer));
			jobDesHighlighter.setTextFragmenter(new SimpleSpanFragmenter(
					jobDesScorer));
			
			SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
			
			for (BrowseHit browseHit : result.getHits()) {
				Map<String, String> job = new HashMap<String, String>();
				Document doc = getIndexSearcher().doc(browseHit.getDocid());
				job.put("job_address", doc.get("job_address"));
				job.put("company_name", doc.get("company_name"));
				job.put("from", doc.get("from"));
				job.put("row", URLEncoder.encode(doc.get("row"), "utf-8"));
				job.put("job_time", format.format(new Date(Long.parseLong(doc.get("job_time")))));

				String job_title = doc.get("job_title");
				TokenStream stream = TokenSources.getAnyTokenStream(
						getIndexReader(), browseHit.getDocid(), "job_title",
						analyzer);
				String jobTitleFragment = jobTitleHighlighter.getBestFragment(
						stream, job_title);
				if (null != jobTitleFragment && !jobTitleFragment.equals("")) {
					job.put("job_title", jobTitleFragment);
				} else {
					job.put("job_title", job_title);
				}

				String job_description = doc.get("job_description");
				TokenStream stream2 = TokenSources.getAnyTokenStream(
						getIndexReader(), browseHit.getDocid(),
						"job_description", analyzer);
				String jobDescriptionFragment = jobDesHighlighter
						.getBestFragment(stream2, job_description);
				String desc;
				if (null != jobDescriptionFragment && !jobDescriptionFragment.equals("")) {
					desc = jobDescriptionFragment;
				} else {
					desc = job_description;
				}
				if(desc.length() > 100){
					desc = desc.substring(0,100) + "...";
				}
				job.put("job_description", desc);
				
				jobs.add(job);
			}

			return result;

		} catch (IOException e) {
			e.printStackTrace();
			throw new JobsearcherException("搜索过程中出错!");
		} catch (org.apache.lucene.queryParser.ParseException e) {
			e.printStackTrace();
			throw new JobsearcherException("查询解析过程中出错!");
		} catch (BrowseException e) {
			e.printStackTrace();
			throw new JobsearcherException("创建分类搜索结果过程中出错!");
		} catch (InvalidTokenOffsetsException e) {
			e.printStackTrace();
			throw new JobsearcherException("获取搜索结果过程中出错!");
		}
	}
	
	// utils
	private String timeRange(long from, long to) {
		return "[" + from + " TO " + to + "]";
	}
	
	public Map<String,String> getJobByRow(String row) throws JobsearcherException{
		Map<String,String> job = new HashMap<String,String>();
		try {
			HTable t = new HTable(HBaseConfiguration.create(),
					Bytes.toBytes(PARSED_JOBS_HBASE));
			Get get = new Get(Bytes.toBytes(row));
			Result r = t.get(get);
			String url = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("url")));
			String from = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("from")));
			String job_title = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_title")));
			String job_category = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_category")));
			String company_name = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("company_name")));
			String company_industry = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("company_industry")));
			String company_type = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("company_type")));
			String company_scale = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("company_scale")));
			String job_time = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_time")));
			String job_address = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_address")));
			String job_count = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_count")));
			String job_experience = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_experience")));
			String job_education = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_education")));
			String job_language = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_language")));
			String job_salary = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_salary")));
			String job_description = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("job_description")));
			String company_description = Bytes.toString(r.getValue(Bytes.toBytes("job"),
					Bytes.toBytes("company_description")));
			
			job.put("url", url);
			job.put("from", from);
			job.put("job_title", job_title);
			job.put("job_category", job_category);
			job.put("company_name", company_name);
			job.put("company_industry", company_industry);
			job.put("company_type", company_type);
			job.put("company_scale", company_scale);
			
			SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
			job.put("job_time", format.format(new Date(Long.parseLong(job_time))));
			job.put("job_address", job_address);
			job.put("job_count", job_count);
			job.put("job_experience", job_experience);
			job.put("job_education", job_education);
			job.put("job_language", job_language);
			job.put("job_salary", job_salary);
			job.put("job_description", job_description);
			job.put("company_description", company_description);
			return job;
		} catch (IOException e) {
			e.printStackTrace();
			throw new JobsearcherException("连接hbase过错出错!");
		}
	}
	
	
}

  


4. UI (screenshots omitted):

1. Home page

2. Search results page

3. Job detail page

posted on 2011-10-03 15:33 by 歪步