Scraping the Shanxi Province entry with a crawler: building and searching a three-level menu


```
#!/usr/bin/env python
# coding=utf-8
# Author: hto
import json
from io import StringIO

import pandas as pd
import requests

url = 'https://baike.baidu.com/item/%E5%B1%B1%E8%A5%BF/188460?fr=aladdin&fromid=365266&fromtitle=%E5%B1%B1%E8%A5%BF%E7%9C%81'

# How the cookies and headers are produced: capture a request in the browser,
# save the values to txt files, then read them back here. Note the role of
# strip() and split('=', 1).

shanxi = {}


# Read the request headers and cookies from the saved txt files.
def gethead():
    headers = {}
    cookie = {}
    with open('./cookies.txt', 'r', encoding='utf-8') as f:
        for line in f.read().strip().split(';'):
            if not line.strip():
                continue  # skip the empty piece left by a trailing ';'
            # maxsplit=1 because a cookie value may itself contain '='
            name, value = line.strip().split('=', 1)
            cookie[name] = value

    with open('./headers.txt', 'r', encoding='utf-8') as f:
        for line in f:
            if ':' not in line:
                continue  # skip blank or malformed lines
            # maxsplit=1 because a header value (a URL, say) may contain ':'
            key, value = line.split(':', 1)
            headers[key.strip()] = value.strip()
    return headers, cookie


# Download the page and fill the city -> counties/districts mapping.
def download():
    headers, cookie = gethead()
    html = requests.get(url, cookies=cookie, headers=headers).content.decode('utf-8')

    # Wrapping the markup in StringIO avoids the deprecation warning that
    # newer pandas versions emit for literal HTML strings.
    ft = pd.read_html(StringIO(html))
    # `global` is not strictly required here: shanxi[name] = piece changes the
    # existing dict in place and never rebinds the name. It is declared anyway
    # as a habit, to make the use of a global explicit.
    global shanxi
    for line in ft[0][1]:
        name, piece = line.split('市', 1)
        piece = piece.split('：', 1)[1]
        piece = piece.split('。', 1)[0]
        piece = piece.split('、')
        shanxi[name] = piece


# Save the result to a local txt file.
def tosave():
    # json.dumps writes double-quoted JSON that json.loads can read back;
    # str(shanxi) would write a single-quoted Python repr, which it cannot.
    # ensure_ascii=False keeps the Chinese characters readable instead of
    # escaping them as \uXXXX.
    with open('./shanxi.txt', 'w', encoding='utf-8') as f:
        f.write(json.dumps(shanxi, ensure_ascii=False))


# Read the txt file back and turn it into a dict with json.loads.
def readit():
    global shanxi  # required here: the name is rebound to a new object
    with open('./shanxi.txt', 'r', encoding='utf-8') as f:
        shanxi = json.loads(f.read())


# First run: uncomment the next two lines to scrape and cache the data.
# download()
# tosave()

readit()
print(shanxi)

choice = 'not'

while choice != 'q':
    choice = city = input('Enter a city to look up (q to quit): ')
    back = 'not'
    if city in shanxi:
        while back != 'b' and choice != 'q':
            back = choice = county = input('Enter a county or district (q to quit / b to go back): ')
            if county in shanxi[city]:
                print('This county or district exists')
            elif county != 'b' and county != 'q':
                print('No such county or district')
    elif city != 'b' and city != 'q':
        print('No such city...')
```

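The script reads cookies.txt and headers.txt as plain text copied from the browser. As a rough sketch of why strip() and split('=', 1) matter, the helpers below parse the same two formats from strings instead of files. The names `parse_cookie_string` and `parse_header_lines` are hypothetical, and the sample values are made up for illustration.

```python
def parse_cookie_string(raw):
    """Parse 'a=1; b=2' into a dict (hypothetical helper, illustration only)."""
    cookies = {}
    for pair in raw.strip().split(';'):
        pair = pair.strip()
        if not pair:
            continue  # a trailing ';' leaves an empty piece behind
        # maxsplit=1: only the first '=' separates name from value
        name, value = pair.split('=', 1)
        cookies[name] = value
    return cookies


def parse_header_lines(raw):
    """Parse 'Key: value' lines into a dict (hypothetical helper)."""
    headers = {}
    for line in raw.splitlines():
        if ':' not in line:
            continue  # skip blank or malformed lines
        # maxsplit=1: values such as URLs contain further ':' characters
        key, value = line.split(':', 1)
        headers[key.strip()] = value.strip()
    return headers


# Made-up sample data, not real Baidu cookies or headers.
sample_cookie = 'SESSION=abc123; token=a=b=c; '
sample_headers = 'User-Agent: ExampleBrowser/1.0\nReferer: https://example.com/page\n'

print(parse_cookie_string(sample_cookie))
print(parse_header_lines(sample_headers))
```

Without the maxsplit argument, the cookie value a=b=c and the Referer URL would each split into more than two pieces and the two-name unpacking would raise a ValueError.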

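Inside a function, assigning to a name creates a new local variable, even when a global variable has the same name; the global is left untouched. Reading a global, or changing it in place through indexing such as dict[key] = value or list[num] = value, works without any declaration, because the name itself is never rebound.
That is why global is not strictly required here: the function only assigns to items of the global dict. Still, it is a good habit to declare any global variable that a function modifies, to guard against mistakes.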
```
    global shanxi
    for line in ft[0][1]:
        name,piece=line.split('市',1)
        piece=piece.split('：',1)[1]
        piece=piece.split('。',1)[0]
        piece=piece.split('、')
        shanxi[name]=piece
```
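There is one small pitfall here. If the dict is saved to the txt file with str(dict), reading it back with json.loads() fails. str() writes the Python representation of the dict, which normally quotes strings with single quotes, while JSON only allows double-quoted strings, so parsing raises an error. Saving with json.dumps() avoids this: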
```
def tosave():
    f=open('./shanxi.txt','w+')
    json_shanxi=json.dumps(shanxi,ensure_ascii=False)
    
    f.write(json_shanxi)
    f.close()
```
If tosave() is changed to the version below, json.loads() will fail, because str() writes the string quotes as single quotes.
```
def tosave():
    f=open('./shanxi.txt','w+')
    json_shanxi=str(shanxi)
    
    f.write(json_shanxi)
    f.close()
```
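The difference between the two versions of tosave() can be reproduced without touching the file system. The sketch below uses a small sample dict in the same shape as shanxi; the use of ast.literal_eval to recover a str()-style dump is an addition here, not something the original script does.

```python
import ast
import json

# Small sample in the same shape as shanxi: city -> list of districts.
data = {'太原': ['小店区', '迎泽区']}

as_repr = str(data)                             # Python repr, single quotes
as_json = json.dumps(data, ensure_ascii=False)  # JSON text, double quotes

print(as_repr)
print(as_json)

try:
    json.loads(as_repr)
except json.JSONDecodeError as exc:
    # JSON requires double-quoted strings, so the repr is rejected.
    print('json.loads failed:', exc)

# The JSON text round-trips cleanly.
print(json.loads(as_json) == data)

# A repr that was already written to disk can still be recovered with
# ast.literal_eval, which safely parses Python literals.
print(ast.literal_eval(as_repr) == data)
```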
Another detail is encoding.
```
ensure_ascii=False  # if this is not set, the default is True
```
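json.dumps() converts a Python object (here the dict) into a JSON string. ensure_ascii defaults to True, which means the output contains only ASCII characters and every non-ASCII character is written as a \uXXXX escape.
Chinese text therefore shows up as escape codes in the file. Setting ensure_ascii=False writes the characters as they are; the actual encoding of the file (UTF-8, for example) is decided by how the file is opened.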
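A short comparison makes the effect of ensure_ascii visible. This is only a sketch with a made-up one-entry dict; both outputs are valid JSON and load back to the same data.

```python
import json

data = {'city': '太原'}  # made-up sample containing Chinese text

escaped = json.dumps(data)                       # ensure_ascii=True by default
readable = json.dumps(data, ensure_ascii=False)  # keep characters as they are

print(escaped)   # Chinese characters appear as \uXXXX escapes
print(readable)  # Chinese characters appear unchanged

# Both forms decode to the same Python object.
print(json.loads(escaped) == json.loads(readable) == data)
```

When writing the readable form to disk, it is safer to pass encoding='utf-8' to open() so the result does not depend on the platform's default encoding.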

posted on 2018-03-06 20:55 撞钟和尚cokeor
