Crawlers. Multi-machine parallel Weibo crawling. Distributed system design

Distributed System

  A program

    is the code you write

  A process

    is what you get when you run it

  A message

    is used to communicate between processes (inter-process communication)

  A packet

    is a fragment of a message that might travel on a wire

  A protocol

    is a formal description of the message formats and rules that two processes must follow in order to exchange those messages

  A network

    is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.

  A component

    can be a process or any piece of hardware required to run a process, support communications between processes, store data, etc.

  A distributed system

    is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components

    cooperate together to perform a single or small set of related tasks.

    

Advantages

  Fault-Tolerant: It can recover from component failures without performing incorrect actions.

  Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.

  Recoverable: Failed components can restart themselves and rejoin the system after the cause of failure has been repaired.

  Consistent: The system can coordinate actions by multiple components, often in the presence of concurrency and failure. This underlies

        the ability of a distributed system to act like a non-distributed system.

  Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size.

  Predictable Performance: The ability to provide desired responsiveness in a timely manner.

  Secure: The system authenticates access to data and services.

Challenges

  Replication and migration create the need to ensure consistency and to make decisions in a distributed way.

  Failure modes: Do not assume that the data received is the same as the data sent.

  Concurrency: Update/Replication/Cache/Failure...

  Heterogeneity: Network, hardware, OS, languages, developers

  Scalability: The architecture must be able to handle an increase in users, resources, etc., taking into account the cost of physical resources,

         performance loss, and bottlenecks.

  Security

Distributed Crawler System

    

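Master-Slave Architecture

  One master host manages all the servers. Most distributed systems use the master-slave model. The earlier crawlers were completely independent: each one took URLs from the URL queue in turn and fetched them.

  When there are many crawler servers, the slave nodes must be manageable through a central node

  The central node can control the crawlers as a whole

  It is the bridge for sharing information between crawlers

  It handles load control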
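
  The three roles of the central node listed above (overall control, shared state, load control) can be sketched in a single process. This is only an illustration under stated assumptions: the `Master` class, its method names, the `max_in_flight` limit, and the heartbeat timeout are all hypothetical, and a real deployment would expose these methods over the network rather than as local calls.

```python
import time
from collections import deque


class Master:
    """Hypothetical central node: tracks workers, hands out URLs, limits load."""

    def __init__(self, seed_urls, max_in_flight=2, heartbeat_timeout=30.0):
        self.pending = deque(seed_urls)   # URLs not yet assigned to any worker
        self.seen = set(seed_urls)        # shared de-duplication state
        self.in_flight = {}               # worker_id -> set of assigned URLs
        self.last_seen = {}               # worker_id -> time of last heartbeat
        self.max_in_flight = max_in_flight
        self.heartbeat_timeout = heartbeat_timeout

    def register(self, worker_id):
        self.in_flight.setdefault(worker_id, set())
        self.last_seen[worker_id] = time.monotonic()

    def heartbeat(self, worker_id):
        self.last_seen[worker_id] = time.monotonic()

    def next_url(self, worker_id):
        """Return a URL for the worker, or None if it is at its load limit
        or there is nothing left to assign."""
        assigned = self.in_flight[worker_id]
        if len(assigned) >= self.max_in_flight or not self.pending:
            return None
        url = self.pending.popleft()
        assigned.add(url)
        return url

    def report(self, worker_id, url, discovered=()):
        """The worker finished `url`; queue newly discovered links only once."""
        self.in_flight[worker_id].discard(url)
        for link in discovered:
            if link not in self.seen:
                self.seen.add(link)
                self.pending.append(link)

    def reap_dead_workers(self, now=None):
        """Requeue URLs held by workers whose heartbeat has timed out."""
        now = time.monotonic() if now is None else now
        for worker_id, last in list(self.last_seen.items()):
            if now - last > self.heartbeat_timeout:
                self.pending.extend(self.in_flight.pop(worker_id, ()))
                del self.last_seen[worker_id]
```

  Keeping the `seen` set on the master is what lets independent crawlers avoid fetching the same URL twice, and `max_in_flight` is the simplest possible form of load control.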

Remote Procedure Calls

  Specify the protocol for client-server communication

  Develop the client program

  Develop the server program
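
  The three steps above can be tried out with Python's standard `xmlrpc` modules. The notes do not say which RPC mechanism was used, so XML-RPC is an assumption made here for illustration; the `get_task` procedure and the example URLs are hypothetical as well.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def start_server(host="127.0.0.1", port=0):
    """Server program: expose one procedure and serve it on a background thread."""
    # allow_none=True lets the procedure return None over XML-RPC.
    server = SimpleXMLRPCServer((host, port), logRequests=False, allow_none=True)
    tasks = ["https://example.com/page/1", "https://example.com/page/2"]

    def get_task():
        # Hypothetical procedure: hand out the next URL, or None when empty.
        return tasks.pop(0) if tasks else None

    server.register_function(get_task, "get_task")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_tasks(url):
    """Client program: call the remote procedure as if it were a local one."""
    proxy = xmlrpc.client.ServerProxy(url, allow_none=True)
    found = []
    while (task := proxy.get_task()) is not None:
        found.append(task)
    return found
```

  Here the protocol is supplied by XML-RPC itself (HTTP plus an XML encoding of calls and results), so only the procedure name and its return value have to be agreed on by both sides.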

      

Protocol-Message Type

Protocol-Actions

Protocol-Key Definition

Socket

  

Create Client Socket

Create Server Socket
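
  A minimal sketch of both steps with Python's `socket` module follows. The function names, the loopback address, and the default values are assumptions chosen for the example, not something the notes prescribe.

```python
import socket


def create_server_socket(host="127.0.0.1", port=0, backlog=5):
    """Create a TCP socket, bind it to an address, and start listening."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))      # port 0 lets the OS pick a free port
    server.listen(backlog)
    return server


def create_client_socket(host, port, timeout=5.0):
    """Create a TCP socket and connect it to the server."""
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(timeout)
    client.connect((host, port))
    return client
```

  On the server side, `accept()` then returns a new socket for each client; the listening socket itself is never used to send or receive data.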

Ways to listen

  a new thread to handle the client socket

  a new process

  use non-blocking socket
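
  The first option, one thread per client socket, might look like the sketch below. The names `handle_client` and `serve`, the upper-case echo behaviour, and the `stop_event` shutdown mechanism are hypothetical choices for this example.

```python
import socket
import threading


def handle_client(conn):
    """Echo everything back in upper case until the client closes."""
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:           # empty bytes: the peer closed the connection
                break
            conn.sendall(data.upper())


def serve(server, stop_event):
    """Accept loop: each accepted client socket gets its own thread."""
    server.settimeout(0.2)         # wake up regularly to check stop_event
    while not stop_event.is_set():
        try:
            conn, _addr = server.accept()
        except socket.timeout:
            continue
        threading.Thread(target=handle_client, args=(conn,), daemon=True).start()
```

  A process per client follows the same shape with `multiprocessing.Process` in place of `threading.Thread`; threads are cheaper to start, while processes isolate clients from one another.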

Non-blocking mode listening

  connection.setblocking(False)

  send, recv, connect and accept return immediately

  connection.setblocking(False) is equivalent to connection.settimeout(0.0)

  asyncore

  provides the basic infrastructure for writing asynchronous

  socket service clients and servers
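
  A small sketch of non-blocking listening follows. Since `asyncore` was deprecated in Python 3.6 and removed in Python 3.12, this example swaps in the standard `selectors` module to wait for readiness; that substitution, along with the function names, is an assumption of the example rather than what the notes used.

```python
import selectors
import socket


def make_nonblocking_listener(host="127.0.0.1", port=0):
    """Create a listening socket whose accept() never blocks."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, port))
    listener.listen()
    listener.setblocking(False)    # same effect as listener.settimeout(0.0)
    return listener


def serve_once(listener, selector, timeout=1.0):
    """One pass of the event loop: accept new clients, echo data that is ready."""
    for key, _events in selector.select(timeout):
        sock = key.fileobj
        if sock is listener:
            conn, _addr = listener.accept()
            conn.setblocking(False)
            selector.register(conn, selectors.EVENT_READ)
        else:
            data = sock.recv(1024)
            if data:
                # Acceptable for small echoes; large sends on a non-blocking
                # socket would need a write buffer and EVENT_WRITE handling.
                sock.sendall(data)
            else:                  # empty bytes: the peer closed the connection
                selector.unregister(sock)
                sock.close()
```

  Calling `accept()` on the listener with no client waiting raises `BlockingIOError` instead of blocking, which is why the selector is asked first which sockets are ready.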

    

  Ways to end communication

    fixed-length message: while totalsent < MSGLEN:

    delimited: some message\0

    message length indicated at the beginning: LEN: 50;

    shut down the connection: the server calls close(), and the client's recv()

    returns 0 bytes
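
  As an illustration of the third option (the length stated at the beginning of the message), here is a minimal sketch using a 4-byte binary length prefix. The 4-byte big-endian header is an assumption of this example; the `LEN: 50;` text form in the notes would work the same way with a different header parser.

```python
import socket
import struct

HEADER = struct.Struct("!I")       # 4-byte big-endian unsigned length prefix


def send_message(sock, payload):
    """Send the payload length first, then the payload itself."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, size):
    """Loop until exactly `size` bytes have arrived (recv may return fewer)."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:              # b"" means the peer closed the connection
            raise ConnectionError("socket closed before the message was complete")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read one length-prefixed message and return its payload."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

  Compared with a delimiter, a length prefix lets the payload contain any byte value, including `\0`, at the cost of a fixed header on every message.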

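Serialized Processing

  To keep the database from being accessed too heavily, the front end can manage a task queue centrally. In that case, a task-queue layer is added between front-end requests and the data, which isolates part of the data update operations.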

      

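Message Queue

  A message queue is used to decouple services that do not need to be called synchronously, or to subscribe to notifications and system changes.

  A message queue makes service decoupling, asynchronous processing, traffic buffering, and so on possible.

    For example: Logging system: A logging system is usually synchronous and has its own buffer, which is written to disk when it fills up. At peak times, the disk write blocks the calling thread; instead, log entries can be sent as messages

            to a logger, which then handles them.

       Data synchronization: A message queue can be used to sync MySQL data changes into Redis, for example with the Databus solution released by LinkedIn, which applies the changes to the data in order, based on the log system.

       Task queue: For a crawler, updating the crawl status does not need to be an atomic operation, so the update tasks can be put on a separate message queue and applied one after another, which lets the thread that updates

            the crawler status return quickly.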
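
  The task-queue case can be sketched in a single process with `queue.Queue` and one consumer thread. The `StatusUpdater` class and the dictionary standing in for the database are hypothetical; a multi-machine crawler would put a real message broker in place of the in-memory queue.

```python
import queue
import threading


class StatusUpdater:
    """Applies crawl-status updates one at a time on a background thread."""

    def __init__(self, store):
        self.store = store                 # stand-in for the database
        self.tasks = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, url, status):
        """Called by crawler threads; returns as soon as the task is queued."""
        self.tasks.put((url, status))

    def _run(self):
        while True:
            item = self.tasks.get()
            if item is None:               # sentinel: stop the worker
                break
            url, status = item
            self.store[url] = status       # the slow write happens here, in order

    def close(self):
        """Process everything already queued, then stop the worker."""
        self.tasks.put(None)
        self.worker.join()
```

  Because a single consumer drains the queue, updates for the same URL are applied in the order they were submitted, while the crawler threads only pay the cost of a `put`.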

Synchronizing MySQL and Redis

  Databus can be used to propagate MySQL updates to Redis, so that data changes in MySQL are synced into Redis.

    

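Disaster Recovery

  In a cluster environment, when the Master service goes down, a backup Master needs to take over the Master's work automatically. ZooKeeper can be used here to set up automated cluster registration and management.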

      

 

posted @ 2019-03-14 16:27  jacky912