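Python Web Scraping: From Getting Started to Giving Up (1)
I had nothing much to do over the summer break, so I spent some time learning Python. Of the basics, regular expressions felt like the hard part: all kinds of string-matching problems, and so on and so forth...
So I picked a website and got some simple hands-on practice: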
import re
import urllib.request

import xlwt
from bs4 import BeautifulSoup

url = 'http://www.mp4ba.net/'
# Download the raw HTML of the front page
Web = urllib.request.urlopen(url).read()
soup = BeautifulSoup(Web, "html.parser")
# The thread list sits inside the element with this id
Doc = soup.find(id="threadlisttableid")
# Each ordinary thread is a <tbody> whose id contains "normalthread"
nums = Doc.find_all('tbody', id=re.compile(r"normalthread"))

Workbook = xlwt.Workbook()
sheet = Workbook.add_sheet('sheet1')
for i, num in enumerate(nums):
    addr = num.em                      # first <em> in the row
    name = num.find('a', "s xst")      # title link, matched by its CSS class
    sheet.write(i, 0, addr.get_text())
    sheet.write(i, 1, name.get_text())
# Write one row per thread to an Excel file
Workbook.save("text.xls")
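Since regular expressions were the sticking point, here is a small sketch of what the `re.compile(r"normalthread")` filter in the script is doing. When Beautiful Soup is given a compiled pattern as an attribute filter, it tests each attribute value with the pattern's `search()` method, so the pattern only has to appear somewhere in the `id`. The id values and the `matching_ids` helper below are made up for illustration and are not taken from the real page.

```python
import re

# Same pattern as in the script above: no anchors, so it can match anywhere.
pattern = re.compile(r"normalthread")

# Hypothetical id values, invented for this example.
sample_ids = ["normalthread_1024", "stickthread_7", "separatorline", "normalthread_2048"]


def matching_ids(ids, pat):
    """Return the ids in which the pattern is found anywhere (re.search semantics)."""
    return [value for value in ids if pat.search(value)]


matched = matching_ids(sample_ids, pattern)
print(matched)

# A stricter, anchored variant (an assumption about the id format):
# the whole id must be "normalthread_" followed by digits.
strict = re.compile(r"^normalthread_\d+$")
print(matching_ids(["xnormalthread_1", "normalthread_1"], strict))
```

The unanchored pattern is enough as long as no other element on the page has "normalthread" in its id; the anchored version is the safer choice if that cannot be guaranteed.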
The scraped results are as follows:
Onward and upward... 666666666666666666