XPath notes
1. lxml.etree.parse() uses the XML parser by default, so parsing an HTML file that is not well-formed XML fails with an error like this:
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: meta line 3 and head, line 3, column 87
Fix:
Create an HTML parser yourself and pass it in through the parser argument:
from lxml import etree

parser = etree.HTMLParser(encoding="utf-8")  # custom HTML parser
htmlelement = etree.parse("baidu.html", parser=parser)
print(etree.tostring(htmlelement, encoding="utf-8").decode("utf-8"))
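The difference between the two parsers can be reproduced without any file on disk. The following is a minimal sketch, assuming a made-up snippet (broken_html is hypothetical, not taken from baidu.html) whose <meta> tag is never closed, which is fine in HTML but not in XML:

```python
from lxml import etree

# Hypothetical snippet: <meta> is never closed, which is legal HTML but not XML.
broken_html = (
    '<html><head><meta name="author" content="demo">'
    "<title>demo</title></head><body><p>hello</p></body></html>"
)

# The default XML parser rejects the unclosed <meta> tag.
try:
    etree.fromstring(broken_html)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

# The HTML parser tolerates it and still builds a usable tree.
root = etree.fromstring(broken_html, parser=etree.HTMLParser())
title = root.xpath("//title/text()")
print(title)
```

etree.fromstring() is used here only so the example stays self-contained; etree.parse() with a file path accepts the same parser argument.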
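Summary:
1. xpath() can only be called on a parsed document, not on a raw string. Use etree.HTML(text) to turn a string into an HTML element, and etree.parse() with the path of the .html file to load an HTML document.
2. First use //*[@id=""] to locate the rough area, then use / and // to narrow down to the exact node, and finally use text() to extract the text or @ to extract an attribute.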
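3. The extracted result is a list. Convert it to a string first, then use re.sub() to remove the extra characters.
4. Use eval to turn the cleaned string back into a list. For text scraped from the web, ast.literal_eval is the safer choice, because it only accepts Python literals and does not execute code.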
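The four points above can be sketched end to end on an in-memory string. This is only an illustration: the markup, the names and the URLs are made up, and ast.literal_eval is swapped in for eval as a safer stand-in.

```python
import ast
import re

from lxml import etree

# Hypothetical markup; the names and URLs are made up for illustration.
text = """
<div id="celebrities">
  <ul>
    <li><a href="/celebrity/1/"> Alice
    </a></li>
    <li><a href="/celebrity/2/"> Bob
    </a></li>
  </ul>
</div>
"""

# Point 1: turn the string into an HTML element (missing <html>/<body> are added).
html = etree.HTML(text)

# Point 2: anchor on the id, narrow with / and //, then take text() or an attribute.
names = html.xpath("//*[@id='celebrities']/ul//li/a/text()")
links = html.xpath("//*[@id='celebrities']/ul//li/a/@href")

# Point 3: the result is a list; turn it into a string and clean it with re.sub().
as_text = str(names)
as_text = re.sub(r"\s+", "", as_text)  # real whitespace characters
as_text = re.sub(r"\\n", "", as_text)  # the two-character "\n" escapes in the repr

# Point 4: parse the cleaned string back into a list (safer stand-in for eval).
names = ast.literal_eval(as_text)

print(names)
print(links)
```

Note that the first re.sub() removes every whitespace character, including spaces inside a value, so this approach suits text where inner spaces do not matter.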
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree
import requests
import re


def spider(url):
    # url = 'https://movie.douban.com/subject/26394152/'
    # html = etree.HTML(text) converts the text into an HTML element and fills in missing tags
    res = requests.request('GET', url)
    return res.text


def write_file(text):
    with open("Bumblebee.html", "wb") as f:
        f.write(text.encode('utf-8'))


def xpath_use():
    # Open the HTML file with etree; if loading fails, pass an HTMLParser as the second argument
    html = etree.parse("./Bumblebee.html", etree.HTMLParser(encoding="utf-8"))
    # //*[@id="results"]/tbody[2]/tr[1]
    # select = " //*[@id='celebrities']/ul//li/div/span[1]/a/text() |" \
    #          " //*[@id='celebrities']/ul//li/div/span[2]/text() "  # actors and their roles
    # select = "//*[@id='recommendations']/div//dl/dd/a/text()"  # films similar to this one
    select = "//*[@class='comment-item']/div/h3/span[2]/child::*/text()"  # child::* selects all child elements of the current node
    data = html.xpath(select)
    data = str(data)  # convert to a string so it is easy to clean
    # data = data.split(",")  # split on commas
    data = re.sub(r"\s+", "", data)  # remove extra whitespace
    data = re.sub(r"\\n", "", data)  # remove the literal "\n" escapes left in the string
    data = eval(data)  # eval parses the string back into a list; without it the [] are just characters
    print(data)
    for i in data:
        print(i)


if __name__ == '__main__':
    # Bumblebee.html must already exist; save it first with write_file(spider(url))
    xpath_use()
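As an alternative to the str() and eval round trip in the script above, each extracted string can be cleaned on its own. The sketch below is not the approach used in these notes; it assumes made-up markup that only loosely follows the comment-item selector, so the structure of the real page may differ.

```python
from lxml import etree

# Hypothetical comment markup, loosely modelled on the selector used above.
text = """
<div class="comment-item"><div><h3>
  <span class="votes">12</span>
  <span class="comment-info"><a href="/people/1/"> user one
  </a><span>2019-01-04</span></span>
</h3></div></div>
"""

html = etree.HTML(text)

# child::* selects every element child of the second <span>;
# text() then reads the text of each of those children.
raw = html.xpath("//*[@class='comment-item']/div/h3/span[2]/child::*/text()")

# Clean each string separately; strip() keeps the spaces inside the text.
cleaned = [item.strip() for item in raw if item.strip()]
print(cleaned)
```

Because the list is never turned into a string, nothing has to be parsed back, and quotes or brackets inside a comment cannot break the result.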