Common web scraping methods
1. Converting a Selenium page to BeautifulSoup:
from bs4 import BeautifulSoup
pageSource = driver.page_source                      # HTML of the page Selenium currently has loaded
soup = BeautifulSoup(pageSource, 'html.parser')      # hand the rendered source to bs4
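For reference, a minimal self-contained sketch of this pattern; it assumes Chrome/chromedriver are installed and uses a placeholder URL:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://example.com')                         # placeholder URL
soup = BeautifulSoup(driver.page_source, 'html.parser')   # rendered HTML into bs4
print(soup.title.text)
driver.quit()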
2. Searching page text with bs4 using a regex:
import re
resultPages = soup.find(text=re.compile(u'查询失败,请重新查询!$'))  # matching text node, or None
print('resultPages: ' + str(resultPages))
if resultPages == '查询失败,请重新查询!':
    driver.close()                                   # query failed, close the browser window
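A standalone sketch of the same find()-with-regex check against a made-up page fragment (no browser needed):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>查询失败,请重新查询!</p></div>', 'html.parser')  # invented fragment
msg = soup.find(text=re.compile(u'查询失败,请重新查询!$'))                      # the text node, or None
if msg == '查询失败,请重新查询!':
    print('query failed, would close the driver here')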
3. Finding the li nodes inside a ul selected by class:
resultPages = soup.find("ul", class_="pagination").find_all('li')
resultNum = len(resultPages) - 2                     # index of the second-to-last <li>
pageNum = int(resultPages[resultNum].text)           # get the text of the resultNum-th node
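A standalone sketch with invented pagination markup, mirroring a typical page bar where the second-to-last li holds the highest page number:
from bs4 import BeautifulSoup

html = ('<ul class="pagination">'
        '<li>上一页</li><li>1</li><li>2</li><li>3</li><li>下一页</li></ul>')   # invented markup
soup = BeautifulSoup(html, 'html.parser')
items = soup.find("ul", class_="pagination").find_all('li')
print(int(items[len(items) - 2].text))               # 3, the last page number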
4. Getting the text content of a node:
li.find('div', class_='time').text
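For example, with an invented list item that carries a publish time:
from bs4 import BeautifulSoup

li = BeautifulSoup('<li><div class="time">2024-01-01</div></li>', 'html.parser').li
print(li.find('div', class_='time').text)            # 2024-01-01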
5. Moving to the next node with bs4:
try:
    # element immediately after the matched "采购编号:" text node
    xmID = xmSoup.find(text=re.compile(u'采购编号:$')).next_element.text
except AttributeError:
    # fall back to the next sibling when next_element has no .text
    xmID = xmSoup.find(text=re.compile(u'采购编号:$')).next_sibling.text
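A standalone sketch with invented label/value markup; here the value tag directly follows the label text, so both navigation routes reach it:
import re
from bs4 import BeautifulSoup

xmSoup = BeautifulSoup('<p>采购编号:<span>ABC-2024-001</span></p>', 'html.parser')  # invented markup and value
label = xmSoup.find(text=re.compile(u'采购编号:$'))
print(label.next_element.text)   # the <span> that follows the label text in parse order
print(label.next_sibling.text)   # the same <span>, reached as the label's next sibling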