[Open-source cross-platform .NET crawler and data-collection framework: DotnetSpider] [4] Parsing JSON data


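Scenario

Continuing from the previous post: suppose we forgot to store the shop information for each JD SKU. Do we really have to re-crawl every SKU from scratch? A full re-crawl would mean the data already collected can no longer be used. So instead, let's look through JD's pages to see whether they expose an interface that returns this information.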

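Finding the API endpoint

  1. Install Fiddler and start it

  2. Open this URL in Chrome: http://list.jd.com/list.html?cat=1315,1343,9719

  3. Go through the requests captured in Fiddler one by one until you find the interface we need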

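Once an endpoint has been spotted in Fiddler, it helps to request it directly and look at the raw body before writing any selectors. The sketch below is only an illustration and is not part of DotnetSpider: `EndpointProbe`, `BuildUrl` and `FetchAsync` are hypothetical names, the URL template is the one used in the spider later in this post, the SKU value is a placeholder, and the endpoint may no longer respond the same way.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class EndpointProbe
{
    // URL template taken from the spider later in this post; {0} is the SKU.
    public const string Template =
        "http://chat1.jd.com/api/checkChat?my=list&pidList={0}&callback=json";

    // Fill the SKU into the template, escaping it for use in a query string.
    public static string BuildUrl(string sku)
    {
        return string.Format(Template, Uri.EscapeDataString(sku));
    }

    // Download the raw response body so it can be inspected by eye.
    public static async Task<string> FetchAsync(string sku)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(BuildUrl(sku));
        }
    }

    public static async Task Main(string[] args)
    {
        // "123456" is a placeholder SKU, not a real product id.
        var sku = args.Length > 0 ? args[0] : "123456";
        Console.WriteLine(BuildUrl(sku));
        try
        {
            Console.WriteLine(await FetchAsync(sku));
        }
        catch (Exception ex)
        {
            // Network errors and timeouts are reported rather than crashing the probe.
            Console.WriteLine("Request failed: " + ex.Message);
        }
    }
}
```

Running it with a SKU as the first argument prints the request URL followed by the raw body, which is where the `json(...)` wrapper handled later in this post becomes visible.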

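Writing the spider

  1. After analysing the returned data, we can first write the definition of the data object. Note that the Expression values are now JsonPath query expressions, so Type must be set to SelectorType.JsonPath. Also note that this is an update-type spider: the collected data is merged back into the original table, so a primary key must be declared on the data class.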

    // Each element of the top-level JSON array becomes one ProductUpdater entity.
    [Schema("jd", "sku_v2", TableSuffix.Monday)]
    [EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
    // Primary key: identifies which existing row each entity is merged into.
    [Indexes(Primary = "sku")]
    public class ProductUpdater : ISpiderEntity
    {
        [StoredAs("sku", DataType.String, 25)]
        [PropertySelector(Expression = "$.pid", Type = SelectorType.JsonPath)]
        public string Sku { get; set; }

        [StoredAs("shopname", DataType.String, 100)]
        [PropertySelector(Expression = "$.seller", Type = SelectorType.JsonPath)]
        public string ShopName { get; set; }

        [StoredAs("shopid", DataType.String, 25)]
        [PropertySelector(Expression = "$.shopId", Type = SelectorType.JsonPath)]
        public string ShopId { get; set; }
    }

  2. Set the pipeline's Mode to Update

    context.AddEntityPipeline(new MySqlEntityPipeline
    {
        ConnectString = "Database='taobao';Data Source= ;User ID=root;Password=1qazZAQ!;Port=4306",
        Mode = PipelineMode.Update
    });

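  3. The response is wrapped in json(...) padding (JSONP), so it has to be trimmed before it can be parsed. The framework provides a handler interface for this (IDownloadCompleteHandler in the code below) and ships many commonly used handlers for preprocessing a page before it is parsed. The complete code is as follows: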

        public class JdShopDetailSpider : EntitySpiderBuilder
        {
            protected override EntitySpider GetEntitySpider()
            {
                var context = new EntitySpider(new Site())
                {
                    TaskGroup = "JD SKU Weekly",
                    Identity = "JD Shop details " + DateTimeUtils.MondayRunId,
                    CachedSize = 1,
                    ThreadNum = 8,
                    Downloader = new HttpClientDownloader
                    {
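                        // Runs after each download, before parsing: trims the body to the text
                        // between "json(" and ");". StartOffset = 5 is the length of "json(".
                        DownloadCompleteHandlers = new IDownloadCompleteHandler[]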
                        {
                            new SubContentHandler
                            {
                                Start = "json(",
                                End = ");",
                                StartOffset = 5,
                                EndOffset = 0
                            }
                        }
                    },
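                    // Start URLs come from the database: the query picks the SKUs that still
                    // lack shop data, and each sku value fills {0} in the URL template.
                    PrepareStartUrls = new PrepareStartUrls[]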
                    {
                        new BaseDbPrepareStartUrls()
                        {
                            Source = DataSource.MySql,
                            ConnectString = "Database='test';Data Source= localhost;User ID=root;Password=1qazZAQ!;Port=3306",
                            QueryString = $"SELECT * FROM jd.sku_v2_{DateTimeUtils.MondayRunId} WHERE shopname is null or shopid is null order by sku",
                            Columns = new [] {new DataColumn { Name = "sku"} },
                            FormateStrings = new List<string> { "http://chat1.jd.com/api/checkChat?my=list&pidList={0}&callback=json" }
                        }
                    }
                };
                context.AddEntityPipeline(new MySqlEntityPipeline
                {
                    ConnectString = "Database='taobao';Data Source=localhost ;User ID=root;Password=1qazZAQ!;Port=4306",
                    Mode = PipelineMode.Update
                });
                context.AddEntityType(typeof(ProductUpdater), new TargetUrlExtractor
                {
                    Region = new Selector { Type = SelectorType.XPath, Expression = "//*[@id=\"J_bottomPage\"]" },
                    Patterns = new List<string> { @"&page=[0-9]+&" }
                });
                return context;
            }
    
            [Schema("jd", "sku_v2", TableSuffix.Monday)]
            [EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
            [Indexes(Primary = "sku")]
            public class ProductUpdater : ISpiderEntity
            {
                [StoredAs("sku", DataType.String, 25)]
                [PropertySelector(Expression = "$.pid", Type = SelectorType.JsonPath)]
                public string Sku { get; set; }
    
                [StoredAs("shopname", DataType.String, 100)]
                [PropertySelector(Expression = "$.seller", Type = SelectorType.JsonPath)]
                public string ShopName { get; set; }
    
                [StoredAs("shopid", DataType.String, 25)]
                [PropertySelector(Expression = "$.shopId", Type = SelectorType.JsonPath)]
                public string ShopId { get; set; }
            }
        }
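
To make the two parsing steps concrete, here is a standalone sketch of what trimming the `json(...)` wrapper and then applying the selectors amounts to. It does not use DotnetSpider: `JsonpSketch`, `StripPadding` and `ExtractShops` are hypothetical names, the sample body is invented to match the selectors `$.pid`, `$.seller` and `$.shopId`, and System.Text.Json with a plain loop stands in for a JsonPath engine.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class JsonpSketch
{
    // Return the text between the first "json(" and the last ");".
    public static string StripPadding(string body)
    {
        const string start = "json(";
        const string end = ");";
        int from = body.IndexOf(start, StringComparison.Ordinal);
        int to = body.LastIndexOf(end, StringComparison.Ordinal);
        if (from < 0 || to < from + start.Length)
        {
            throw new FormatException("Body is not wrapped in json(...);");
        }
        from += start.Length;
        return body.Substring(from, to - from);
    }

    // Walk the top-level array (what "$.[*]" selects) and read three fields per element.
    public static List<(string Sku, string ShopName, string ShopId)> ExtractShops(string json)
    {
        var rows = new List<(string Sku, string ShopName, string ShopId)>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement item in doc.RootElement.EnumerateArray())
            {
                rows.Add((Read(item, "pid"), Read(item, "seller"), Read(item, "shopId")));
            }
        }
        return rows;
    }

    // Read a property as text whether it is a JSON string or a number; "" if missing.
    private static string Read(JsonElement item, string name)
    {
        if (!item.TryGetProperty(name, out JsonElement value)) return "";
        return value.ValueKind == JsonValueKind.String ? value.GetString() : value.GetRawText();
    }

    public static void Main()
    {
        // Invented sample body; the real response may contain other fields.
        string body = "json([{\"pid\":1001,\"seller\":\"Shop A\",\"shopId\":42}]);";
        foreach (var row in ExtractShops(StripPadding(body)))
        {
            Console.WriteLine($"{row.Sku} {row.ShopName} {row.ShopId}");
        }
    }
}
```

Keeping the trimming separate from the field extraction mirrors the split in the spider, where the download handler cleans the body and the entity attributes describe which fields to read.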
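
Update mode combined with the primary key on sku means each extracted entity should overwrite columns on an existing row instead of inserting a new one. The post does not show the SQL that MySqlEntityPipeline actually issues, so the following is only a guess at its general shape: `UpdateSketch` and `BuildUpdate` are hypothetical, and the table name is passed in because the real suffix depends on the run date.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UpdateSketch
{
    // Build a parameterised UPDATE that sets every non-key column and matches on the key.
    public static string BuildUpdate(string table, string keyColumn, IEnumerable<string> columns)
    {
        var assignments = columns
            .Where(c => c != keyColumn)
            .Select(c => $"`{c}` = @{c}");
        return $"UPDATE {table} SET {string.Join(", ", assignments)} WHERE `{keyColumn}` = @{keyColumn};";
    }

    public static void Main()
    {
        // "jd.sku_v2_<suffix>" stands in for the Monday-suffixed table name.
        Console.WriteLine(BuildUpdate("jd.sku_v2_<suffix>", "sku", new[] { "sku", "shopname", "shopid" }));
    }
}
```

Rows whose sku is not already in the table would simply match nothing under this shape, which fits the goal of filling in missing shop columns without touching the rest of the historical data.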
posted @ 2017-04-14 10:26  网络蚂蚁