Crawling web pages with the Apify framework in Node.js

index.ts

   

import Apify from 'apify';

Apify.main(async () => {
    // Open the default request queue and enqueue the pages to crawl.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({
        url: 'https://onlinelibrary.wiley.com/doi/10.1002/ccr3.5509',
    });
    await requestQueue.addRequest({
        url: 'https://onlinelibrary.wiley.com/doi/10.1002/ccr3.5521',
    });

    // Crawl every queued request with a headless Playwright browser.
    const crawler = new Apify.PlaywrightCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}:\n ${title}`);

            // Extract the abstract paragraph from the Wiley article page.
            const abstract = await page.innerText('section.article-section__abstract > div.article-section__content > p');

            console.log(`Abstract of ${request.url}:\n`);
            console.log(abstract);
        },
    });

    await crawler.run();

    console.log('Crawler finished.');
});

  package.json:

{
  "name": "spider",
  "main": "build/index.js",
  "scripts": {
    "start": "tsc -p tsconfig.json && node ./build/index.js",
    "start:prune": "rm -rf apify_storage && tsc -p tsconfig.json && node ./build/index.js",
    "build": "tsc -p tsconfig.json"
  },
  "dependencies": {
    "apify": "^2.2.2",
    "playwright": "^1.19.2",
    "puppeteer": "^13.5.0"
  },
  "devDependencies": {
    "@types/node": "^17.0.21",
    "@types/puppeteer": "^5.4.5",
    "typescript": "^4.6.2"
  },
  "packageManager": "yarn@3.2.0"
}

  

 

Integrating this into my own crawler system:

    Since my Node version is 14, I installed "apify": "^1.3.1" instead.

    After the crawl has run once, running it again no longer fetches the page content and only a warning is printed. This is because Apify, to avoid crawling duplicates, persists the crawl state (the request queue and related data) to local storage after the first run, so requests that have already been handled are not processed again.
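    The deduplication can be seen directly in the return value of addRequest. A minimal sketch, assuming the same SDK version as above (wasAlreadyPresent / wasAlreadyHandled are the fields of the queue operation info; verify the exact shape against your installed version):

import Apify from 'apify';

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    // On the second run the queue is restored from apify_storage, so the same
    // URL is reported as already present and already handled.
    const info = await requestQueue.addRequest({
        url: 'https://onlinelibrary.wiley.com/doi/10.1002/ccr3.5509',
    });
    console.log(`wasAlreadyPresent: ${info.wasAlreadyPresent}`);
    console.log(`wasAlreadyHandled: ${info.wasAlreadyHandled}`);
});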

 

 You can also choose where this storage lives yourself, via the environment variable APIFY_LOCAL_STORAGE_DIR.
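 A minimal sketch of pointing the SDK at a custom storage folder, assuming the variable is set before any storage is opened (the ./my_apify_storage path is just an example):

import Apify from 'apify';

// Must be set before the first storage (request queue, dataset, ...) is opened;
// it can equally be set on the command line when starting the script.
process.env.APIFY_LOCAL_STORAGE_DIR = './my_apify_storage';

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // ... enqueue requests and run the crawler as in index.ts above
});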

 

  After deleting the apify_storage folder, the crawler can crawl the pages again.
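  Deleting the folder is the simplest reset. An alternative, if you would rather not wipe the whole storage, is to give each request a fresh uniqueKey so the queue stops treating the URL as a duplicate; a rough sketch (the timestamp suffix is just one way of making the key unique per run):

import Apify from 'apify';

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    const url = 'https://onlinelibrary.wiley.com/doi/10.1002/ccr3.5509';
    // The queue deduplicates by uniqueKey (derived from the URL by default),
    // so a run-specific suffix lets the same URL be enqueued and crawled again.
    await requestQueue.addRequest({
        url,
        uniqueKey: `${url}#${Date.now()}`,
    });

    // ... create and run the PlaywrightCrawler as in index.ts above
});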

posted @ 2022-03-10 10:22  伊娜陈