File Sharing 

These notes mainly cover file sharing in P2P topologies, with BitTorrent, HashTable and diskshare as the main topics. They can also be used to review BitTorrent from SE.

1. What is file sharing? (definition of file sharing, conventional FTP and HTTP, the evolution under P2P)

  - file sharing is the practice of making files available for other users to download over the Internet and smaller networks

  - conventional networks: FTP, HTTP

  - P2P: Napster, Gnutella, eMule, Kademlia, BitTorrent, Mute, Freenet, Gnu-Net

  - file sharing has made P2P popular

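2. Why does file sharing make P2P networking so popular? (why P2P: conventional centralized solutions make cost control hard; P2P is decentralized; a peer can be both client and server; other users can read files directly from a peer's local hard disk; bandwidth advantage; fault-tolerance advantage)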

  - conventional networks offer rather cost-intensive central mass store solutions

  - The P2P model enables a decentralized data management

  - Peers can work as clients and servers

  - Users can share their files with hundreds of thousands of users and access hundreds of millions of files

  - Files can be accessed directly from the local hard disc of other peers

 

3. First generation - Centralized P2P

  - The central directory server is the most important communication entity (it makes the list of files and their associated peers available)

  - decentralized file transfer

  1. Peers register at the central server and publish their IP address and a list of files to be shared

  2. Peers send queries to the central server; the server returns a list of peers holding the requested objects

  3. After pinging the selected peer, the file transfer happens directly between the peers

  - strengths:

    1. Consistent view of the network (the central server always knows who is available and who is not)

    2. Fast and efficient searching in small networks

      2.1 The central server makes all files offered by peers available to other peers

      2.2 The answer is guaranteed to be correct

  - weaknesses:

    1. Single point of failure

      if the central server crashes, the whole P2P network crashes

    2. Performance bottleneck

      in a P2P network with hundreds of thousands of connected users, the central server needs enough computation power to handle thousands of queries per second, so it is not scalable

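Summary: the first-generation, centralized P2P network sets up something like a meta server (the central directory server) to handle registration and queries. It keeps a list of the files that can be shared and the peers associated with each file. Every download request is handled by it: it returns a list of the peers holding the file, and the requester pings one of those peers directly and sets up the file transfer connection. This requires the central server to have enough computing resources and bandwidth to handle thousands upon thousands of requests.

Strengths: fast lookup, consistent view of the network, the result is guaranteed.

Weaknesses: the failure of this single node crashes the whole network; the central node's need for large bandwidth and computing resources makes it hard to scale.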
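
The register/query steps of the central directory can be sketched in a few lines. This is only an illustration under assumptions: the class name `DirectoryServer`, its methods, and the addresses are made up for these notes and do not correspond to any real Napster-style protocol.

```python
from collections import defaultdict


class DirectoryServer:
    """Toy central directory: maps file names to the peers offering them."""

    def __init__(self):
        self.index = defaultdict(set)   # file name -> {(ip, port), ...}

    def register(self, peer, files):
        # Step 1: a peer publishes its address and its list of shared files.
        for name in files:
            self.index[name].add(peer)

    def unregister(self, peer):
        # Keeps the view consistent: a departed peer is removed everywhere.
        for peers in self.index.values():
            peers.discard(peer)

    def query(self, name):
        # Step 2: the server answers with every peer holding the file.
        return sorted(self.index.get(name, set()))


server = DirectoryServer()
server.register(("10.0.0.1", 6699), ["song.mp3", "notes.pdf"])
server.register(("10.0.0.2", 6699), ["song.mp3"])
print(server.query("song.mp3"))
```

Step 3 (the actual transfer) never touches the server, which is why only the lookup, not the download, is the bottleneck.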

 

4. Decentralized P2P

  - Discovering new peers

    - sending a broadcast ping

    - connected peers answer with pong

  - Locating specific content 

    - Query messages are sent to all neighbours

    - Query messages contain a TTL

    - The TTL counter is decremented at each hop

    if a neighbour owns the file, the search is done; otherwise the neighbour forwards the request to its own neighbours.

  - Download files 

    directly from the source node, e.g. via HTTP

  - Strengths:

    - Absolutely decentralized network

      - No single point of failure

      - No performance bottleneck

    - Main part of communication is anonymous

  - weaknesses:

    - high network traffic from ping and pong packets

    - not protected against fire queries (an attacker broadcasts artificial queries)

    - request loops can form: overhead caused by message cycles

    - the query hit rate drops in large networks

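Summary, fully decentralized P2P: neighbours are discovered with a broadcast ping, and connected peers reply with a pong. To locate content, a query message is forwarded to the neighbours; if a neighbour has the file, the search ends and the hit is returned; if not, the neighbour forwards the query to its own neighbours. Every request carries a TTL. To download a file, a connection is set up directly with the source peer.

Strengths: fully P2P, no single point of failure and no performance bottleneck. The main part of the communication is anonymous.

Weaknesses: high network load because of ping and pong; not protected against fire queries; request loops can form; in large networks, finding a file to download becomes slower.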
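
A minimal sketch of TTL-limited flooding, to make the hop counting concrete. The function `flood_query` and the toy topology are assumptions for illustration. The shared `seen` set stands in for the per-peer duplicate detection a real client would need to break request loops.

```python
def flood_query(graph, files, start, wanted, ttl):
    """Flood a query from `start`; return the peers that report a hit.

    graph: peer -> list of neighbour peers
    files: peer -> set of file names the peer shares
    """
    seen = {start}                     # peers that already handled this query
    frontier = [start]
    hits = []
    while frontier and ttl > 0:
        next_frontier = []
        for peer in frontier:
            for neighbour in graph[peer]:
                if neighbour in seen:
                    continue           # duplicate suppression breaks message cycles
                seen.add(neighbour)
                if wanted in files.get(neighbour, set()):
                    hits.append(neighbour)          # owner answers, does not forward
                else:
                    next_frontier.append(neighbour)
        frontier = next_frontier
        ttl -= 1                       # one hop consumed
    return hits


graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
         "D": ["B", "C", "E"], "E": ["D"]}
files = {"E": {"movie.avi"}}
print(flood_query(graph, files, "A", "movie.avi", ttl=2))  # E is three hops away
print(flood_query(graph, files, "A", "movie.avi", ttl=3))
```

With `ttl=2` the query dies before it reaches E even though the file exists, which is the "no guarantee for a result" weakness of this generation.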

 

5. hybrid p2p

goal: combine efficiency properties of centralized model with robustness of decentralized model

- Servants are divided into SuperNodes and LeafNodes

- Client software promotes a servant to SuperNode based on specific criteria (e.g. connection speed)

- SuperNodes are temporary servers

- Every SuperNode can manage approx. 200-500 leaves

- Message traffic is reduced by flow control and pong caching

- Bootstrapping via WebCaches and HostCaches

- locating specific content

  - the query message is sent to the SuperNode

  - the SuperNode decides whether to forward the query to its leaves by means of a RouteTable that it manages; each LeafNode sends a RouteTableUpdate message to its SuperNode, containing all keywords that describe the content shared by this LeafNode

  - depending on the TTL, the SuperNode can forward the query to other connected SuperNodes

  - finally, the file is transferred directly between LeafNodes

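Summary: in hybrid P2P the upper-level network (between SuperNodes) is pure P2P, while each sub-network is centralized P2P. A SuperNode maintains a RouteTable containing the keywords of all shared files. A node may be selected as SuperNode because of its connection speed or bandwidth; all other nodes are LeafNodes. Message transfer is reduced by flow control and pong caching, and bootstrapping uses WebCaches and HostCaches. The file itself is transferred directly between two nodes.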
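
The keyword RouteTable can be sketched as follows. The class and method names are hypothetical, and matching a leaf when its keyword set covers all query keywords is a simplifying assumption rather than the actual matching rule of any client.

```python
class SuperNode:
    """Toy SuperNode: keeps a keyword route table for its leaves."""

    def __init__(self, name):
        self.name = name
        self.route_table = {}      # leaf id -> set of keywords
        self.neighbours = []       # other SuperNodes

    def route_table_update(self, leaf, keywords):
        # A leaf reports all keywords describing its shared content.
        self.route_table[leaf] = {k.lower() for k in keywords}

    def query(self, keywords, ttl, seen=None):
        seen = set() if seen is None else seen
        seen.add(self.name)
        wanted = {k.lower() for k in keywords}
        # Only leaves whose keywords cover the query would receive it.
        hits = [(self.name, leaf)
                for leaf, kws in sorted(self.route_table.items())
                if wanted <= kws]
        if ttl > 1:
            for sn in self.neighbours:
                if sn.name not in seen:
                    hits += sn.query(keywords, ttl - 1, seen)
        return hits


s1, s2 = SuperNode("S1"), SuperNode("S2")
s1.neighbours, s2.neighbours = [s2], [s1]
s1.route_table_update("leaf-a", ["linux", "iso"])
s1.route_table_update("leaf-c", ["holiday", "photos"])
s2.route_table_update("leaf-b", ["Linux", "ISO", "debian"])
print(s1.query(["linux", "iso"], ttl=2))
```

Leaves that cannot match (here `leaf-c`) never see the query, which is where the traffic saving over pure flooding comes from.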

 

6. Structured P2P networks

  - Problems:

  1. lookup problem in unstructured P2P networks

    - where to place and how to find data items in a distributed system with regard to scalability and efficiency?

      - centralized P2P: fast and efficient but not scalable; retrieving a data item is O(1)

      - decentralized P2P: the broadcast mechanism is not scalable; communication overhead increases linearly; no guarantee of a result when searching with a limited TTL

  2. Structured P2P networks

    - based on a distributed hash table: guaranteed correct results; quick search in O(log N)

The problem is how to place and find files in a distributed system so that it becomes more efficient and scalable. The answer is to build a structured P2P network based on a distributed hash table.

 

7. Distributed Hash Table 

  a data structure in which (key, value) pairs are distributed over the set of nodes as evenly as possible

  - key = hashed object identifier

  - value = IP address, NodeId, port

  1. NodeIds and keys are hashed into a common address space

  2. every node is responsible for its own part of the address space

  3. thereby every node is analogous to a bucket of a hash table

  4. the address space is viewed as a circle (Chord), a binary tree (Kademlia) or a quadratic area (CAN)

  5. if a node searches for the value of a key, it has to locate the NodeId whose address space contains the key

  6. each node n knows the NodeId of its next neighbour, its successor

  7. each node n manages the address space from predecessor(n)+1 to n, consisting of k,v pairs

  - Locating a value

    1. the initial node searches for the value of a key h("mydata")

    2. the initial node checks itself; if the key is not found, it forwards the request to its next node

      if the next node manages this key, it sends the value on to its own next node, and so on until the initial node is reached; otherwise it forwards the findvalue msg to its next node.

Note: the lookup algorithm here is a little unusual. The initial node sends the request to its next node; if that node has the value, it passes the value on to its own next node, and so on until the initial node is reached; if it does not have the value, it forwards the request to its next node.
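
A rough sketch of this successor-by-successor lookup on a ring. Everything here is an assumption for illustration: the 8-bit address space `M`, the helper names, and the choice of SHA-1 for `h`. Passing the value back around the ring to the initiator is not modelled, only the forwarding until the responsible node is found.

```python
import hashlib

M = 8                      # toy address space of 2**M identifiers (assumption)


def h(name):
    """Hash a name into the common address space."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (2 ** M)


def responsible(node, pred, key):
    """True if key lies in (pred, node], taking the wrap-around into account."""
    if pred < node:
        return pred < key <= node
    return key > pred or key <= node


def find_value(ring, store, start, key):
    """Walk successor pointers from `start`; return (owner, value, hops)."""
    ids = sorted(ring)
    node, hops = start, 0
    while True:
        pred = ids[ids.index(node) - 1]            # index -1 wraps to the last node
        if responsible(node, pred, key):
            return node, store[node].get(key), hops
        node = ids[(ids.index(node) + 1) % len(ids)]   # forward to the successor
        hops += 1


ring = [20, 80, 150, 230]
store = {n: {} for n in ring}
key = h("mydata")
owner, _, _ = find_value(ring, store, 20, key)
store[owner][key] = ("10.0.0.7", 4000, owner)      # value = IP address, port, NodeId
print(find_value(ring, store, 80, key))
```

Because each node knows only its successor, the number of hops grows linearly with the number of nodes; the O(log N) figure needs extra routing state such as Chord's finger table or Kademlia's k-buckets.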

  - Joining:

    1. after a new node joins the network, the node currently responsible for its address space has to partition that address space

    2. the new node gets succ(n), and its predecessor updates its own succ(n)

  - Leaving:

    1. if a node leaves the network, its next neighbour takes over its whole address space

    2. its predecessor updates its successor
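
Joining and leaving can be sketched as moving key ranges between neighbours. The `Ring` class is a hypothetical, single-process model with a global sorted list of NodeIds; a real DHT would do the same hand-over with messages between the nodes involved.

```python
import bisect


class Ring:
    """Toy DHT ring: node n stores the keys in (predecessor(n), n]."""

    def __init__(self, size=256):
        self.size = size
        self.ids = []                 # sorted NodeIds
        self.store = {}               # NodeId -> {key: value}

    def owner(self, key):
        # First NodeId >= key, wrapping around to the smallest NodeId.
        i = bisect.bisect_left(self.ids, key % self.size)
        return self.ids[i % len(self.ids)]

    def join(self, node):
        old_owner = self.owner(node) if self.ids else None
        bisect.insort(self.ids, node)
        self.store[node] = {}
        if old_owner is not None:
            # The previous manager hands over the part that now belongs to `node`.
            moved = [k for k in self.store[old_owner] if self.owner(k) == node]
            for key in moved:
                self.store[node][key] = self.store[old_owner].pop(key)

    def leave(self, node):
        self.ids.remove(node)
        items = self.store.pop(node)
        if self.ids:
            # The successor absorbs the whole address space of the leaving node.
            self.store[self.owner(node)].update(items)

    def put(self, key, value):
        self.store[self.owner(key)][key] = value


ring = Ring()
ring.join(50)
ring.join(200)
for key, value in [(30, "a"), (120, "b"), (230, "c")]:
    ring.put(key, value)
ring.join(150)            # 200 hands the range (50, 150] over to 150
print(ring.store)
```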

 

8. Kademlia - a specific DHT approach

  - Basic idea:

    - NodeIds or their prefixes are mapped onto the leaves of a binary tree

    - thereby, a prefix represents a single node managing all identifiers in its subtree

    - during a search, every hop leads into a smaller subtree, any node of which can be contacted

    - the distance between two binary NodeIds is calculated with the XOR metric

  - routing information

    - the internal routing table is structured as a binary tree

    - every node knows at least one node in each of its k-buckets

    - every node keeps a list (k-bucket) of triples for the nodes at a distance between 2^i and 2^(i+1)-1 from itself, for each 0 <= i < 160

    - every k-bucket keeps a maximum of k contact triples

      - default k = 20

      - contacts are kept sorted by time last seen (least-recently seen node at the head, most-recently seen at the tail)
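
The XOR metric and the bucket ranges can be checked with a few lines. The function names are made up for these notes, and the 4-bit NodeIds only keep the example readable (real NodeIds have 160 bits).

```python
def xor_distance(a, b):
    """Kademlia distance between two NodeIds given as integers."""
    return a ^ b


def bucket_index(own_id, other_id):
    """Index i of the k-bucket covering distances 2**i .. 2**(i+1) - 1."""
    d = xor_distance(own_id, other_id)
    if d == 0:
        raise ValueError("a node does not store itself in a k-bucket")
    return d.bit_length() - 1


own = 0b1011
for other in (0b1010, 0b1000, 0b0011):
    print(f"{other:04b}: distance={xor_distance(own, other)}, "
          f"bucket={bucket_index(own, other)}")
```

The bucket index equals the position of the highest bit in which the two NodeIds differ, so nodes sharing a long prefix with `own` land in the low-numbered buckets.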

  - update of the k-bucket

    - each received message updates the corresponding k-bucket

    - case differentiation

      - the sender node is known: move this node to the tail of the k-bucket

      - the sender node is unknown and the k-bucket is not full: insert the new node at the tail of the k-bucket

      - the sender node is unknown and the k-bucket is full: ping the node at the head of the k-bucket; if this node does not respond, remove it and insert the new node; otherwise the new node is dropped
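
The three cases map onto an ordered dictionary quite directly. This is a sketch under assumptions: `KBucket` and the `ping` callback are hypothetical names, and moving a head node that answers the ping to the tail follows the Kademlia paper rather than the notes above.

```python
from collections import OrderedDict


class KBucket:
    """One k-bucket: least-recently seen contact at the head, most recent at the tail."""

    def __init__(self, k=20):
        self.k = k
        self.contacts = OrderedDict()      # node_id -> (ip, udp_port)

    def update(self, node_id, address, ping):
        """Process a message from `node_id`; `ping(node_id)` returns True if alive."""
        if node_id in self.contacts:
            self.contacts.move_to_end(node_id)        # known: move to the tail
            self.contacts[node_id] = address
        elif len(self.contacts) < self.k:
            self.contacts[node_id] = address          # unknown, room left: append
        else:
            head = next(iter(self.contacts))
            if ping(head):
                self.contacts.move_to_end(head)       # head alive: newcomer is dropped
            else:
                del self.contacts[head]               # head dead: replace it
                self.contacts[node_id] = address


bucket = KBucket(k=2)
addr = ("10.0.0.1", 4000)
bucket.update(1, addr, ping=lambda n: True)
bucket.update(2, addr, ping=lambda n: True)
bucket.update(1, addr, ping=lambda n: True)    # 1 was seen again, moves to the tail
print(list(bucket.contacts))
```

Preferring old contacts that still answer is deliberate: long-lived nodes are more likely to stay online, and an attacker cannot flush the table simply by flooding it with new NodeIds.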

  - FIND_NODE - locating the k closest NodeIds to a NodeId:

    1. the initiator picks the n closest NodeIds from the matching k-bucket, then sends the FIND_NODE msg to these n NodeIds in parallel (default: n = 3)

    2. those n NodeIds answer with their k closest nodes

    3. in recursive steps the initiator selects the n closest NodeIds from the response set and resends the FIND_NODE msg to them; if one node does not answer, another one from the set is selected

    each returned contact is a triple (IP, UDP port, NodeId)

  - FIND_VALUE - locating the value of a key

    1. the process is similar to FIND_NODE

    if a requested node manages this key,

    then it answers with the key's value;

    else FIND_VALUE gets the k closest NodeIds to the key and asks them for the key's value

  - STORE - save a k,v pair on a node

    1. locate the k closest NodeIds for the key with FIND_NODE

    2. send the STORE msg to these k closest NodeIds

    3. these k nodes republish the k,v pair to new, closer nodes at hourly intervals

    4. after 24 hours the link expires

  - PING - checks whether a node is online

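Summary: the hard part is building and maintaining the k-buckets; every k-bucket needs to hold at least one node of its subtree! The lookup itself proceeds in recursive steps.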

 

9. Third generation - BitTorrent

  - components:

    1. web server : offering .torrent file

    2. a tracker keeping a list of clients downloading a specific file

    3. a client program acting as peer

  - two classes of peers:

    1. leechers: users who download files; these users provide their downloaded chunks for the upload

    2. seeders: users who have downloaded the complete file and only provide the upload

  - based on swarming:

    1. strengths: very efficient file distribution system; highly scalable due to swarming

    2. weaknesses: the tracker is a single point of failure; no explicit file search functionality

    3. to increase the efficiency of downloads, clients implement specific strategies

    4. the file is split into many chunks and can be retrieved by downloading the necessary chunks from different peers

    5. swarming: a file sharing client downloads a file from many sources at the same time

    6. a swarm is a set of clients downloading the same file

    7. while a peer is downloading a file, it already uploads its downloaded chunks to other peers

    8. goal: quickly replicate chunks to a large number of clients

  - .torrent file

    1. published on some web site

    2. contains:  

      - URL of the tracker (server)

      - name and description of the file

      - length of the chunks

      - number of chunks

      - SHA-1 hashes of each piece of the file, for reliability

    3. A user has to search for this file

    4. After the .torrent file has been retrieved by a user, it is imported by a BitTorrent client for getting necessary information for downloading the associated shared file
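
The role of the per-piece SHA-1 hashes can be shown with a small sketch. The helper names and the 512-byte piece length are assumptions chosen for the example. Real .torrent files are bencoded and use much larger pieces, and none of that is modelled here.

```python
import hashlib


def piece_hashes(data, piece_length):
    """Split `data` into chunks and return one SHA-1 digest per chunk."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]


def verify_piece(index, chunk, hashes):
    """True if a downloaded chunk matches the hash published for its index."""
    return hashlib.sha1(chunk).digest() == hashes[index]


data = b"x" * 1000 + b"y" * 500
hashes = piece_hashes(data, piece_length=512)
print(len(hashes), "pieces")
print(verify_piece(0, data[:512], hashes))        # intact chunk
print(verify_piece(0, b"corrupted", hashes))      # damaged or forged chunk
```

Since every chunk is checked on its own, a client can accept chunks from untrusted peers and re-request only the ones that fail the check.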

  - steps:

    1. the .torrent file is fetched and interpreted by a BitTorrent client

    2. the tracker specified for the desired file is contacted, and the tracker returns a list of peers offering parts of this file

    3. the peers are contacted, and each peer is asked for the list of chunks it offers

    4. after a client has retrieved the list of chunks from every peer, it has to decide which chunk it will request from which peer

  - chunk selection policy

    1. rarest first - download the rarest chunk first

      - each peer keeps statistics on the rarest chunks; the indicator is how frequently a chunk is communicated to other peers

      - increases the likelihood that all pieces are still available even if the original seeder leaves before any single node has downloaded the entire file

    2. random first

      - the first chunk is downloaded at random

      - risk: a new node downloads the rarest chunk first and logs off without sharing it, which blocks the download

    3. endgame mode

      - to avoid a slow download at the end, all peers are asked for the missing chunks
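
A minimal sketch of the rarest-first decision. It assumes rarity is measured by how many known peers offer a chunk, which is a simplification of the statistic described above. The function name and the tie-breaking by lowest index and lowest peer name are arbitrary choices that keep the example deterministic.

```python
from collections import Counter


def rarest_first(have, peer_chunks):
    """Return (chunk, peer) for the rarest missing chunk, or None.

    have: set of chunk indices already downloaded
    peer_chunks: peer -> set of chunk indices that peer offers
    """
    counts = Counter(c for chunks in peer_chunks.values() for c in chunks)
    missing = [c for c in counts if c not in have]
    if not missing:
        return None
    chunk = min(missing, key=lambda c: (counts[c], c))   # fewest holders, ties by index
    peer = min(p for p, chunks in peer_chunks.items() if chunk in chunks)
    return chunk, peer


peers = {"p1": {0, 1, 2}, "p2": {0, 1}, "p3": {0, 3}}
print(rarest_first({0}, peers))
```

Chunks 2 and 3 each have a single holder, so one of them is requested before the more common chunk 1.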

 

10. Third generation - Freenet

  - Decentralized, censorship-resistant distributed data store

  - each participant provides a part of his own hard disk to store files from other participants 

     1. a user has no control or knowledge of what kind of files his node stores

     2. all files on each node are encrypted

  - a node knows only its neighbour nodes, so requests can only be sent to neighbours

  - no global semantic search functionality

  - uses globally unique identifier keys for identifying shared files

  - several types of key generation mechanisms may be applied; the most important type of key is the signed-subspace key

  - every node has an ID and stores files whose key hash is similar to this ID

  - Key management (signed-subspace keys, SSK)

    1. public-private key pair and symmetric key are generated

    2. file is encrypted using the symmetric key and signed using the private key; nodes do not store the symmetric key, only the public key part of the SSK as an index to the data, thus a node hosting the file can plausibly deny knowledge of the stored data.

    3. The SSK is built from the hash of the public key and the symmetric key. The hash of the public key acts as the index to the data for searching purposes. Furthermore, the public key is stored with the data, so nodes can verify the signature when an SSK file enters a node, and clients can verify the signature when retrieving a file. The symmetric key is used by clients for file decryption.

  - Requesting files and data transfer

    1. a user gets an SSK

    2. the client software extracts the hash of the public key and the symmetric decryption key from this SSK

    3. it then sends a request message containing the hash of the public key, a TTL value and a unique request ID

    4. the initiator node checks its own file system

    5. if there is no match, the request message is sent on to the node with the closest ID until the TTL expires

    6. if the file is found, a reply message is sent to the requesting node (neighbours)

    7. the initiator decrypts the encrypted file

      file = decryption(encrypted file, decryption key)

 

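Summary: Freenet uses the users' local storage as a censorship-resistant distributed data store. The local user has no control over or knowledge of the stored files, all files are encrypted, a user can only communicate with its neighbours, there is no global search, and files are identified by globally unique keys. The key generation mechanism is the signed-subspace key, and every node stores the files whose key hash is close to its own ID. The crux is key management: a public key, a private key and a symmetric key are generated. The file is signed with the private key and encrypted with the symmetric key. Nodes do not store the symmetric key; the public key part serves as the index to the data, so a node knows nothing about the file. SSK = hash(public key) + symmetric key, index = hash(public key); the client extracts the hash of the public key and the symmetric key from the SSK.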
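
How an SSK bundles the two parts can be sketched as follows. This is only an illustration: the `SSK@index,key` string layout, the use of SHA-256, and the random placeholder bytes for the keys are assumptions for these notes and are not the real Freenet key format. Signing and encryption are left out entirely.

```python
import base64
import hashlib
import os


def b64(raw):
    """URL-safe Base64 without padding (contains no commas)."""
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")


def build_ssk(public_key, symmetric_key):
    """Combine hash(public key) and the symmetric key into one key string."""
    routing_key = hashlib.sha256(public_key).digest()
    return f"SSK@{b64(routing_key)},{b64(symmetric_key)}"


def parse_ssk(ssk):
    """Client side: split an SSK into (index for searching, decryption key)."""
    index, sym = ssk[len("SSK@"):].split(",")
    return index, sym


public_key = os.urandom(32)      # placeholder bytes, not a real key pair
symmetric_key = os.urandom(32)
ssk = build_ssk(public_key, symmetric_key)
index, sym = parse_ssk(ssk)
print("request carries only:", index)
```

Only `index` travels in request messages and is stored by the nodes; the symmetric key stays with whoever holds the full SSK, which is what lets a hosting node plausibly deny knowledge of the content.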