Web Data Mining (4): Data Collection

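I wrote a data collection program some time ago. Recently I dug it out and refactored the code, though it still has plenty of room for improvement.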

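By HTTP method, web data collection falls into GET requests and POST requests (HTTP defines other methods as well, but HTML forms only submit with these two). To handle both, I wrote a request-parameter wrapper class for each, both extending a common abstract parent class. The POST wrapper additionally carries the POST form parameters, stored in a Map member variable.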
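As a rough illustration of that class layout, here is a minimal sketch. The names Param, GetParam and PostParam are the ones used in the code later in this post, but the original classes are not shown, so the fields and accessors here (getUrl, getMethod, getFormFields) are assumptions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Abstract parent: what every request needs regardless of HTTP method.
abstract class Param {
    private final String url;

    protected Param(String url) {
        this.url = url;
    }

    public String getUrl() {
        return url;
    }

    // Each subclass reports the HTTP method it stands for.
    public abstract String getMethod();
}

// GET request: the URL alone carries all parameters.
class GetParam extends Param {
    public GetParam(String url) {
        super(url);
    }

    @Override
    public String getMethod() {
        return "GET";
    }
}

// POST request: the URL plus the form fields, kept in a map.
class PostParam extends Param {
    private final Map<String, String> formFields;

    public PostParam(String url, Map<String, String> formFields) {
        super(url);
        // Defensive copy that preserves insertion order.
        this.formFields = new LinkedHashMap<>(formFields);
    }

    public Map<String, String> getFormFields() {
        return Collections.unmodifiableMap(formFields);
    }

    @Override
    public String getMethod() {
        return "POST";
    }
}
```

Code that executes the request can then work against Param and only look at the subclass when it needs the form fields.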

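Originally, for a multi-page request against a given site or site section, I built the whole collection of request-parameter objects in one go, then iterated over that collection to fetch the pages when executing the HTTP requests.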

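I later found that this up-front initialization is slow when the number of pages is large, for example a listing with thousands or tens of thousands of pages, because every parameter object has to be built before the first request is sent. Overall performance was not satisfactory either.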

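For this scenario I decided to refactor with the Iterator pattern, so that the request-parameter object for a web request is constructed only when that request is about to be submitted.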
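To make the idea concrete, here is a minimal sketch of on-demand construction, simplified to plain URL strings. The class name LazyPageUrls and the example URL are hypothetical; the "{*}" placeholder for the page number follows the convention used in the code later in this post.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Lazily yields one page URL at a time instead of building the whole list first.
class LazyPageUrls implements Iterator<String> {
    private final String template; // "{*}" marks where the page number goes
    private final int lastPage;
    private int currentPage;

    LazyPageUrls(String template, int firstPage, int lastPage) {
        this.template = template;
        this.currentPage = firstPage;
        this.lastPage = lastPage;
    }

    @Override
    public boolean hasNext() {
        return currentPage <= lastPage;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // The URL string is created only now, when the caller asks for it.
        String url = template.replace("{*}", String.valueOf(currentPage));
        currentPage++;
        return url;
    }
}
```

Creating `new LazyPageUrls("http://example.com/list_{*}.html", 1, 10000)` costs the same as creating one for ten pages, since nothing is generated until `next()` is called.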

The prototype of the Iterator pattern looks like this:

public interface NodeIterator {
    /**
     * Check if more nodes are available.
     * @return <code>true</code> if a call to <code>nextHTMLNode()</code> will succeed.
     */
    public boolean hasMoreNodes() throws ParserException;

    /**
     * Get the next node.
     * @return The next node in the HTML stream, or null if there are no more nodes.
     */
    public Node nextNode() throws ParserException;

}

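Implementing this interface lets the objects be constructed and returned one at a time instead of pre-populating a List. A reference implementation: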

public class IteratorImpl implements NodeIterator
{
    Lexer mLexer;
    ParserFeedback mFeedback;
    Cursor mCursor;

    public IteratorImpl (Lexer lexer, ParserFeedback fb)
    {
        mLexer = lexer;
        mFeedback = fb;
        mCursor = new Cursor (mLexer.getPage (), 0);
    }

    /**
     * Check if more nodes are available.
     * @return <code>true</code> if a call to <code>nextNode()</code> will succeed.
     */
    public boolean hasMoreNodes() throws ParserException
    {
        boolean ret;

        mCursor.setPosition (mLexer.getPosition ());
        ret = Page.EOF != mLexer.getPage ().getCharacter (mCursor); // more characters?

        return (ret);
    }

    /**
     * Get the next node.
     * @return The next node in the HTML stream, or null if there are no more nodes.
     * @exception ParserException If an unrecoverable error occurs.
     */
    public Node nextNode () throws ParserException
    {
        Tag tag;
        Scanner scanner;
        NodeList stack;
        Node ret;

        try
        {
            ret = mLexer.nextNode ();
            if (null != ret)
            {
                // kick off recursion for the top level node
                if (ret instanceof Tag)
                {
                    tag = (Tag)ret;
                    if (!tag.isEndTag ())
                    {
                        // now recurse if there is a scanner for this type of tag
                        scanner = tag.getThisScanner ();
                        if (null != scanner)
                        {
                            stack = new NodeList ();
                            ret = scanner.scan (tag, mLexer, stack);
                        }
                    }
                }
            }
        }
        catch (ParserException pe)
        {
            throw pe; // no need to wrap an existing ParserException
        }
        catch (Exception e)
        {
            StringBuffer msgBuffer = new StringBuffer ();
            msgBuffer.append ("Unexpected Exception occurred while reading ");
            msgBuffer.append (mLexer.getPage ().getUrl ());
            msgBuffer.append (", in nextNode");
            // TODO: appendLineDetails (msgBuffer);
            ParserException ex = new ParserException (msgBuffer.toString (), e);
            mFeedback.error (msgBuffer.toString (), ex);
            throw ex;
        }
        
        return (ret);
    }
}

The code above comes from the source of the htmlparser library. It builds Node objects by moving a cursor through the page.

Following the same approach, I first declared an interface:

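public interface ParamIterator
    {
        //Returns true if another request parameter object is available
        public boolean hasMoreParams();
        //Builds and returns the next request parameter object
        public Param nextParam();
    }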

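The concrete implementation is below. It is an inner class (an intrinsic iterator), so it can read the fields of the enclosing WebCate instance directly: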

private class ConcreteIterator implements ParamIterator
    {
        private int currentIndex=0;
        private int start = 0;
        private int end = 0;
        private int step = 0;
        //Single-page links, separated by whitespace
        private StringTokenizer st = new StringTokenizer(WebCate.this.single_links.trim());
        //Paging URL expression; "{*}" is replaced by the page number for GET requests
        private String urlexp=WebCate.this.expression.trim();
        public ConcreteIterator()
        {
            //Parse the paging parameter only when a paging expression is configured
            if(StringUtils.hasLength(urlexp))
            {
                //Expected format: "start,end" (step defaults to 1) or "start,end:step"
                String pageparamstr=WebCate.this.pageparam.trim();
                if(StringUtils.hasLength(pageparamstr)&&pageparamstr.indexOf(",")>-1)
                {
                    String[] arr=pageparamstr.split(",");
                    if(arr.length==2)
                    {
                        start=Integer.parseInt(arr[0].trim());
                        String endstr=arr[1].trim();
                        step=1;
                        if(endstr.contains(":"))
                        {
                            String[] arr2=endstr.split(":");
                            end=Integer.parseInt(arr2[0].trim());
                            step=Integer.parseInt(arr2[1].trim());
                        }
                        else
                        {
                            end=Integer.parseInt(endstr);
                        }
                    }
                }
            }
            //Start iterating from the first page
            currentIndex=start;
        }
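        @Override
        public boolean hasMoreParams() {
            //Single-page links are only used for GET requests
            if(WebCate.this.httpmethod==0&&st.hasMoreTokens())
            {
                return true;
            }
            //Paged URLs: walk upwards or downwards depending on the sign of step
            if(StringUtils.hasLength(urlexp))
            {
                if(step>0)
                {
                    return currentIndex<=end;
                }
                if(step<0)
                {
                    return currentIndex>=end;
                }
            }
            return false;
        }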

        @Override
        public Param nextParam() {
            boolean isGet=WebCate.this.httpmethod==0;
            //Single-page links are only used for GET requests and are served first,
            //without consuming a page number
            if(isGet&&st.hasMoreTokens())
            {
                //StringTokenizer never returns empty tokens
                return new GetParam(st.nextToken());
            }
            Param param=null;
            if(StringUtils.hasLength(urlexp))
            {
                boolean inRange=(step>0&&currentIndex<=end)||(step<0&&currentIndex>=end);
                if(inRange)
                {
                    //Expand into a local variable so the expression in urlexp stays intact
                    String url=transfer(urlexp,currentIndex);
                    if(isGet)
                    {
                        param=new GetParam(url.replace("{*}", String.valueOf(currentIndex)));
                    }
                    else
                    {
                        param=new PostParam(url,buildmap(WebCate.this.postparam.trim(),currentIndex));
                    }
                }
                //Move on to the next page
                currentIndex=currentIndex+step;
            }
            //null means nothing is left to request
            return param;
        }
    }

Each call advances the current index (int currentIndex) to obtain the parameter object (Param) for the next request.
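The constructor above reads the paging parameter as "start,end" or "start,end:step". Pulled out on its own, that parsing might look like the following sketch. The class PageRange and its method names are hypothetical and not part of the original program, and unlike the original it rejects malformed input instead of silently leaving the range at zero.

```java
// Holds a parsed paging parameter: first page, last page and step.
final class PageRange {
    final int start;
    final int end;
    final int step;

    private PageRange(int start, int end, int step) {
        this.start = start;
        this.end = end;
        this.step = step;
    }

    // Accepts "start,end" (step defaults to 1) or "start,end:step".
    static PageRange parse(String text) {
        String[] parts = text.trim().split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected start,end[:step] but got: " + text);
        }
        int start = Integer.parseInt(parts[0].trim());
        String[] endAndStep = parts[1].trim().split(":");
        int end = Integer.parseInt(endAndStep[0].trim());
        int step = endAndStep.length > 1 ? Integer.parseInt(endAndStep[1].trim()) : 1;
        if (step == 0) {
            throw new IllegalArgumentException("step must not be zero");
        }
        return new PageRange(start, end, step);
    }

    // True while the walk has not passed the end, in either direction.
    boolean stillInRange(int page) {
        return step > 0 ? page <= end : page >= end;
    }
}
```

A negative step walks the pages downwards, so "20,2:-2" describes pages 20, 18, and so on down to 2.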

The request-parameter class then returns this iterator object:

public ParamIterator elements()
    {
        return new ConcreteIterator();
    }

When executing the HTTP requests, we can then obtain the Param objects one by one through iteration.
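The consuming loop might look like the following sketch. Param and ParamIterator here are pared-down stand-ins for the types in this post (Param is reduced to an interface with a hypothetical getUrl() accessor), and Crawler.crawl is a hypothetical name. The actual HTTP call is left out and the URL is simply recorded.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for the types from this post, reduced to what the loop needs.
interface Param {
    String getUrl();
}

interface ParamIterator {
    boolean hasMoreParams();
    Param nextParam();
}

class Crawler {
    // Pulls request parameters one at a time and hands each to the fetch step.
    static List<String> crawl(ParamIterator it) {
        List<String> visited = new ArrayList<>();
        while (it.hasMoreParams()) {
            Param param = it.nextParam();
            if (param == null) {
                continue; // tolerate an iterator that yields null for a skipped entry
            }
            // A real crawler would issue the HTTP request here.
            visited.add(param.getUrl());
        }
        return visited;
    }
}
```

Only one Param object is alive per loop iteration, which is the point of the refactoring: memory use no longer grows with the number of pages.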

The final collection result is as follows

The original source page is as follows

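Erratum:

"Each call advances the current index (int currentIndex) to obtain the parameter object (Param) for the next request."

This should say that the current page number is advanced; currentIndex would be better named currentPage.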

---------------------------------------------------------------------------

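This Web Data Mining series is my original work.

Author: 刺猬的温驯 (cnblogs)

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/05/27/3100883.html

Copyright belongs to the author. Reproduction or commercial distribution without the author's consent is strictly prohibited, and violations may be pursued legally.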

Posted on 2013-05-27 01:58 by 刺猬的温驯
