Scraping free IT-book PDFs with Python and Beautiful Soup
http://www.allitebooks.org/
This is the most generous site I have seen: every book is a free download.
With a free weekend on my hands, I tried to scrape all of the site's PDF books.
Tech stack
- Python 3.5
- Beautiful Soup
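Both libraries can be installed with pip; note that the package name for the `bs4` module imported below is `beautifulsoup4`, and the script also depends on `requests`:

```shell
# Install the scraper's dependencies.
# beautifulsoup4 provides the bs4 module; requests does the HTTP fetching.
pip install beautifulsoup4 requests
```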
The code
This is about the simplest possible crawler, with little error handling. If you try it, please be gentle; don't bring this generous site down.
# www.qingmiaokeji.cn
from bs4 import BeautifulSoup
import requests

siteUrl = 'http://www.allitebooks.org/'

def category():
    """Collect the category links from the site's sub-menu."""
    response = requests.get(siteUrl)
    soup = BeautifulSoup(response.text, "html.parser")
    categoryurl = []
    for a in soup.select('.sub-menu li a'):
        categoryurl.append({'name': a.get_text(), 'href': a.get("href")})
    return categoryurl

def bookUrlList(url):
    """Read the last page number of a category, then walk every page."""
    response = requests.get(url['href'])
    soup = BeautifulSoup(response.text, "html.parser")
    nums = 0
    for e in soup.select(".pagination a[title='Last Page →']"):
        nums = int(e.get_text())
    for i in range(1, nums + 1):
        bookList(url['href'] + "page/" + str(i))

def bookList(url):
    """Extract the detail-page link of every book on a listing page."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for a in soup.select(".main-content-inner article .entry-title a"):
        getBookDetail(a.get("href"))

def getBookDetail(url):
    """Grab the title, cover image, and PDF link, and append them to a text file."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.select(".single-title")[0].text
    imgurl = soup.select(".entry-body-thumbnail .attachment-post-thumbnail")[0].get("src")
    downLoadPdfUrl = soup.select(".download-links a")[0].get("href")
    with open('d:/booklist.txt', 'a+', encoding='utf-8') as f:
        # One line per book: title | cover image | download link
        f.write(title + " | " + imgurl + " | " + downLoadPdfUrl + "\n")

if __name__ == '__main__':
    for url in category():
        bookUrlList(url)
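Since the point above is to be gentle with the site, here is a minimal sketch of how the fetching could be throttled. The helper names, the one-second delay, and the retry count are my own assumptions, not part of the original script; a `polite_get` like this could replace the bare `requests.get` calls above.

```python
import time
import requests

def page_urls(category_url, last_page):
    """Build the paginated listing URLs for a category (pure helper)."""
    return [category_url + "page/" + str(i) for i in range(1, last_page + 1)]

def polite_get(url, delay=1.0, retries=3):
    """GET with a fixed delay before each request and basic retry on failure.

    The delay throttles the crawl; raise_for_status turns HTTP errors
    into exceptions so a failed page is retried instead of parsed.
    """
    for attempt in range(retries):
        time.sleep(delay)  # pause before every request, including retries
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
```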