[Special Topic] Writing a Web Crawler with TypeScript
1. Crawler Overview and Obtaining the Official Secret Key
Page to crawl: www.dell-lee.com/typescript/demo.html?secret=secretKey
The secretKey value (it changes from time to time) is available at: https://git.imooc.com/coding-412/source-code
2. Setting Up the Basic TypeScript Environment
① Download and install Node.js
② In VSCode, click the gear icon in the lower-left corner and choose [Settings] from the pop-up menu to open the settings window, then adjust the following (see the settings.json sketch after this list):
Quotes: select single
Indentation: set Tab Size to 2
Format on save: check Format On Save
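These options end up in VSCode's settings.json. An equivalent snippet, assuming the Prettier extension handles the quote style (prettier.singleQuote was its setting at the time; newer versions prefer a .prettierrc file):

{
  "editor.tabSize": 2,
  "editor.formatOnSave": true,
  "prettier.singleQuote": true
}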
③ In VSCode, click the Extensions icon in the left sidebar, search for the Prettier plugin, then install and enable it
④ Install TypeScript
In VSCode, click Terminal in the menu bar and choose New Terminal to open the TERMINAL panel
npm install typescript@3.6.4 -g
⑤ Compile: use the TypeScript compiler (tsc) to compile demo.ts and generate a demo.js file
tsc demo.ts
Run demo.js
node demo.js
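For reference, a minimal demo.ts for these two steps (the content here is only an illustration; any TS file works the same way):

// demo.ts
const greeting: string = 'hello typescript';
console.log(greeting);

tsc demo.ts strips the type annotation and emits demo.js, which node then runs.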
⑥ To simplify the previous step, install the ts-node tool
npm install -g ts-node@8.4.1
Run demo.ts directly
ts-node demo.ts
3. Fetching Page Content with SuperAgent and Type Definition Files
① Generate a package.json file
npm init -y
② Generate a tsconfig.json file
tsc --init
③ Uninstall the globally installed ts-node and install it inside the project instead
npm uninstall -g ts-node
npm install -D ts-node@8.4.1 (-D is short for --save-dev)
④ Install TypeScript in the project
npm install typescript@3.6.4 -D
⑤ Create a src directory and a crowller.ts file in it; modify the command in package.json (a placeholder crowller.ts sketch follows below)
"scripts": { "dev": "ts-node ./src/crowller.ts" },
Run npm run dev
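To confirm the wiring before writing the real crawler, src/crowller.ts can start out as a single line (placeholder content only; it gets replaced in the next steps):

// src/crowller.ts
console.log('crowller is running');

npm run dev should print the message via ts-node.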
⑥ Install superagent to send ajax requests from Node and fetch the page data
npm install superagent@5.1.1 --save
superagent is written in plain JavaScript, so importing it in a TS file makes the editor flag it in red: TypeScript has no type information for the library and cannot check how it is used. A type definition file bridges the gap (a toy .d.ts sketch follows the install command):
ts -> .d.ts (translation file, i.e. type definition file) -> js
npm install @types/superagent@4.1.4 -D
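A type definition file contains no runtime code; it only describes types. A toy declaration for a hypothetical plain-JS module jsAdd (unrelated to superagent, purely to show what a .d.ts looks like):

// jsAdd.d.ts
declare function jsAdd(a: number, b: number): number;
export default jsAdd;

@types/superagent plays this role for superagent, so the red squiggles disappear and superagent.get() gets proper typings.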
⑦ Fetch the page content
import superagent from 'superagent';

class Crowller {
  private secret = 'secretKey';
  private url = `http://www.dell-lee.com/typescript/demo.html?secret=${this.secret}`;
  private rawHtml = '';

  async getRawHtml() {
    const result = await superagent.get(this.url);
    this.rawHtml = result.text;
  }

  constructor() {
    this.getRawHtml();
  }
}

const crowller = new Crowller();
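To verify the request actually succeeds, a temporary console.log inside getRawHtml works as a quick sanity check (not part of the course code):

async getRawHtml() {
  const result = await superagent.get(this.url);
  this.rawHtml = result.text;
  console.log(this.rawHtml); // print the fetched HTML once to confirm the secret is valid
}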
4. Extracting Data with cheerio
① Install the cheerio library to pull page sections out of the HTML with a jQuery-like API
npm install cheerio --save
② Install the type definition file @types/cheerio
npm install @types/cheerio -D
③ Extract the data
import superagent from 'superagent';
import cheerio from 'cheerio';

interface Course {
  title: string;
}

class Crowller {
  private secret = 'secretKey';
  private url = `http://www.dell-lee.com/typescript/demo.html?secret=${this.secret}`;

  getCourseInfo(html: string) {
    const $ = cheerio.load(html);
    const courseItems = $('.course-item');
    const courseInfos: Course[] = [];
    courseItems.map((index, element) => {
      const descs = $(element).find('.course-desc');
      const title = descs.eq(0).text();
      courseInfos.push({ title });
    });
    const result = {
      time: new Date().getTime(),
      data: courseInfos
    };
    console.log(result);
  }

  async getRawHtml() {
    const result = await superagent.get(this.url);
    this.getCourseInfo(result.text);
  }

  constructor() {
    this.getRawHtml();
  }
}

const crowller = new Crowller();
5. Designing the Structure of the Crawled Data and Storing It
// ts -> .d.ts (type definition file) -> js
import fs from 'fs';
import path from 'path';
import superagent from 'superagent';
import cheerio from 'cheerio';

interface Course {
  title: string;
}
interface CourseResult {
  time: number;
  data: Course[];
}
interface Content {
  [propName: number]: Course[];
}

class Crowller {
  private secret = 'secretKey';
  private url = `http://www.dell-lee.com/typescript/demo.html?secret=${this.secret}`;

  getCourseInfo(html: string) {
    const $ = cheerio.load(html);
    const courseItems = $('.course-item');
    const courseInfos: Course[] = [];
    courseItems.map((index, element) => {
      const descs = $(element).find('.course-desc');
      const title = descs.eq(0).text();
      courseInfos.push({ title });
    });
    return {
      time: new Date().getTime(),
      data: courseInfos
    };
  }

  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  generateJsonContent(courseInfo: CourseResult) {
    const filePath = path.resolve(__dirname, '../data/course.json');
    let fileContent: Content = {};
    // if the file exists, read the previous content first
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
    }
    fileContent[courseInfo.time] = courseInfo.data;
    return fileContent;
  }

  async initSpiderProcess() {
    const filePath = path.resolve(__dirname, '../data/course.json');
    const html = await this.getRawHtml();
    const courseInfo = this.getCourseInfo(html);
    const fileContent = this.generateJsonContent(courseInfo);
    // write the updated content back
    fs.writeFileSync(filePath, JSON.stringify(fileContent));
  }

  constructor() {
    this.initSpiderProcess();
  }
}

const crowller = new Crowller();
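After running the crawler a couple of times, data/course.json holds one entry per crawl, keyed by the crawl timestamp. The timestamps and titles below are made up; only the shape matters:

{
  "1595693416631": [{ "title": "Course A" }, { "title": "Course B" }],
  "1595693434420": [{ "title": "Course A" }, { "title": "Course B" }]
}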
6. Refactoring with the Composition Design Pattern
① Generic crawler class - crowller.ts
// ts -> .d.ts (type definition file) -> js
import fs from 'fs';
import path from 'path';
import superagent from 'superagent';
import DellAnalyzer from './dellAnaiyzer';

export interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

class Crowller {
  private filePath = path.resolve(__dirname, '../data/course.json');

  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  writeFile(content: string) {
    fs.writeFileSync(this.filePath, content);
  }

  async initSpiderProcess() {
    const html = await this.getRawHtml();
    const fileContent = this.analyzer.analyze(html, this.filePath);
    // write the updated content back
    this.writeFile(fileContent);
  }

  constructor(private url: string, private analyzer: Analyzer) {
    this.initSpiderProcess();
  }
}

const secret = 'secretKey';
const url = `http://www.dell-lee.com/typescript/demo.html?secret=${secret}`;
const analyzer = new DellAnalyzer();
new Crowller(url, analyzer);
② Page-specific analysis strategy - dellAnaiyzer.ts
import cheerio from 'cheerio';
import fs from 'fs';
import { Analyzer } from './crowller';

interface Course {
  title: string;
}
interface CourseResult {
  time: number;
  data: Course[];
}
interface Content {
  [propName: number]: Course[];
}

// analyzer
export default class DellAnalyzer implements Analyzer {
  private getCourseInfo(html: string) {
    const $ = cheerio.load(html);
    const courseItems = $('.course-item');
    const courseInfos: Course[] = [];
    courseItems.map((index, element) => {
      const descs = $(element).find('.course-desc');
      const title = descs.eq(0).text();
      courseInfos.push({ title });
    });
    return {
      time: new Date().getTime(),
      data: courseInfos
    };
  }

  generateJsonContent(courseInfo: CourseResult, filePath: string) {
    let fileContent: Content = {};
    // if the file exists, read the previous content first
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
    }
    fileContent[courseInfo.time] = courseInfo.data;
    return fileContent;
  }

  public analyze(html: string, filePath: string) {
    const courseInfo = this.getCourseInfo(html);
    const fileContent = this.generateJsonContent(courseInfo, filePath);
    return JSON.stringify(fileContent);
  }
}
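The payoff of composition: Crowller depends only on the Analyzer interface, so crawling a different page just means writing another analyzer. A minimal sketch (the class name and behaviour here are hypothetical, not part of the course code):

import { Analyzer } from './crowller';

// an analyzer that simply stores the raw HTML instead of extracting course data
export default class RawHtmlAnalyzer implements Analyzer {
  public analyze(html: string, filePath: string) {
    return JSON.stringify({ time: new Date().getTime(), html });
  }
}

Passing an instance of it to new Crowller(url, analyzer) swaps the extraction logic without touching crowller.ts.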
7. Singleton Pattern Review in Practice
① Convert DellAnalyzer to a singleton
private static instance: DellAnalyzer;

static getInstance() {
  if (!DellAnalyzer.instance) {
    DellAnalyzer.instance = new DellAnalyzer();
  }
  return DellAnalyzer.instance;
}

……

private constructor() {}
②crowller.ts
const analyzer = DellAnalyzer.getInstance();
new Crowller(url, analyzer);
8. A Deeper Look at the TypeScript Compile-and-Run Workflow
① Add a command to package.json
"build": "tsc -w" // compile the project's ts files as a whole; -w watches for changes and recompiles automatically
② In tsconfig.json, direct the compiled output into the build directory
"outDir": "./build"
③ Install nodemon, which watches project files and reruns a command when they change
npm install nodemon -D
④ Add another command to package.json
"start": "nodemon node ./build/crowller.js"
⑤ Configure nodemonConfig in package.json to ignore changes under data/, so that writing course.json does not trigger an endless restart loop
"nodemonConfig": { "ignore": [ "data/*" ] }
⑥ Install concurrently to run the build and start commands in parallel
npm install concurrently -D
⑦ Update the scripts in package.json
"scripts": {
  "dev:build": "tsc -w",
  "dev:start": "nodemon node ./build/crowller.js",
  "dev": "concurrently npm:dev:*"
}
Note: the project comes from an imooc (慕课网) course