Scraping course data from NetEase Cloud Classroom and NetEase Open Course
Code first, talk later~
```python
import requests
import json

def getdata(index):
    input("Entering getdata")   # debug pause: confirm the function was called
    print(f"Fetching page {index}...")
    payload = {
        "pageIndex": index,
        "pageSize": 700,
        "relativeOffset": 50,
        "frontCategoryId": 400000001295013,
        "searchTimeType": -1,
        "orderType": 50,
        "priceType": -1,
        "activityId": 0,
        "keyword": ""
    }
    payload = json.dumps(payload)
    headers = {
        "Accept": "application/json",
        "Host": "study.163.com",
        "Origin": "https://study.163.com",
        "Content-Type": "application/json",
        "Referer": "https://study.163.com/courses",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36"
    }
    req = requests.post("https://study.163.com/p/search/studycourse.json", data=payload, headers=headers)
    input("POST succeeded")     # debug pause: confirm the request went through
    print(type(req))
    res_json = json.loads(req.text)
    print(type(res_json))
    with open("C:/Users/Administrator/Desktop/wangyiCloud.json", "w") as f:
        json.dump(res_json, f)
    print("Finished writing the file...")

getdata(1)
input("Reached the end")        # debug pause
```
That's the code for scraping NetEase Cloud Classroom~ I'm a PHP developer, so if anything in the code above looks off, corrections are very welcome.
First, some business background: my leader asked me to scrape the course data from all the mainstream online-learning sites~
When I first got the task I had no idea where to start, since I'd never done this before.
I started by searching for how to scrape web data, and learned there are two approaches: one is to mimic the request headers and hit the data endpoint directly; the other is to fetch the whole HTML page and pull out the data you want with selectors.
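As a minimal sketch of the second approach, here's how the standard-library `html.parser` can pull text out of an already-fetched page. The markup and the `course-title` class name below are made up for illustration, not taken from any of the real sites:

```python
from html.parser import HTMLParser

class CourseTitleParser(HTMLParser):
    """Collects the text of every tag carrying class="course-title"."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if ("class", "course-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

# A made-up page fragment standing in for a real course listing
html = '<div><h3 class="course-title">Python 入门</h3><h3 class="course-title">PHP 进阶</h3></div>'
parser = CourseTitleParser()
parser.feed(html)
print(parser.titles)
```

In practice you'd feed it the HTML returned by `requests.get`; libraries like BeautifulSoup make the selector part less verbose, but the idea is the same.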
The first tool I reached for was the scrapy framework, which I planned to set up on Windows. Sure enough, installing it on Windows is anything but smooth; if you hit problems there, have a look at my other post: the various problems of installing scrapy on Windows.
Once it was installed, I followed its tutorial and quickly scraped CSDN, Geek, and Tencent Classroom~
When I moved on to NetEase Cloud Classroom, though, the scraped HTML contained no actual course data. Watching the whole page-load process in the browser showed the data is loaded via JS.
As you can see, the data all comes through studycourse.json. That makes things easy: just mimic the headers and the POSTed body~
The data is fetched with a POST whose submitted body is of the Payload type, formatted as JSON.
Picking out the POST fields: frontCategoryId reads literally as "front category", so my rough guess is it's the ID of a top-level course category, and keyword is presumably only used when searching.
pageSize is how many records to load per request, and pageIndex is which page to fetch.
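Putting those fields together, a small helper for building the request body. The field values other than pageIndex/pageSize are copied from the captured request; the exact semantics of fields like relativeOffset are my guess, not documented anywhere:

```python
import json
import math

def build_payload(page_index, page_size=50):
    """Build the JSON body that studycourse.json expects, mirroring the captured request."""
    return json.dumps({
        "pageIndex": page_index,
        "pageSize": page_size,
        "relativeOffset": 50,
        "frontCategoryId": 400000001295013,  # the one category ID observed in the capture
        "searchTimeType": -1,
        "orderType": 50,
        "priceType": -1,
        "activityId": 0,
        "keyword": "",
    })

def pages_needed(total_count, page_size):
    """How many requests it takes to cover total_count courses at page_size per page."""
    return math.ceil(total_count / page_size)

print(pages_needed(3600, 50))    # paging at the site default of 50 per page
print(pages_needed(3600, 2000))  # paging with an oversized pageSize
```

This is why bumping pageSize way up pays off: at the default 50 per page you'd need dozens of requests, while one oversized page covers a whole category.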
Since I write PHP, my first thought was to mimic the POST with curl.
The code:
```php
// Mimic the POST with curl to fetch the NetEase Cloud Classroom data
public function wangyiDataAction(){
    $url = "https://study.163.com/p/search/studycourse.json";
    $headers = array(
        "Accept"       => "application/json",
        "Host"         => "study.163.com",
        "Origin"       => "https://study.163.com",
        "Content-Type" => "application/json",
        "Referer"      => "https://study.163.com/courses",
        "User-Agent"   => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    );
    $payload = array(
        "pageIndex"       => 1,
        "pageSize"        => 700,
        "relativeOffset"  => 50,
        "frontCategoryId" => 400000001295013,
        "searchTimeType"  => -1,
        "orderType"       => 50,
        "priceType"       => -1,
        "activityId"      => 0,
        "keyword"         => "",
    );
    $payload = json_encode($payload);
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, $headers);
    curl_setopt($curl, CURLOPT_POST, 1);
    curl_setopt($curl, CURLOPT_POSTFIELDS, $payload);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $output = curl_exec($curl);
    curl_close($curl);
    echo "<pre>"; print_r($output);
    return $output;
}
```
After running it, though, the result was this:
I couldn't figure out what that was at the time; anyone who knows, please enlighten me~ My best guess in hindsight: the line `curl_setopt($curl, CURLOPT_HEADER, $headers);` is likely the culprit. CURLOPT_HEADER is a boolean that makes curl include the *response* headers in the output; sending custom *request* headers needs CURLOPT_HTTPHEADER, and it takes a plain array of strings like `"Content-Type: application/json"`, not an associative array. So the custom headers were probably never sent, and the raw response headers got mixed into the result.
No way around it, so I wrote it again in Python~
The code~
```python
import requests
import json

def getdata(index):
    input("Entering getdata")   # debug pause: confirm the function was called
    print(f"Fetching page {index}...")
    payload = {
        "pageIndex": index,
        "pageSize": 700,
        "relativeOffset": 50,
        "frontCategoryId": 400000001295013,
        "searchTimeType": -1,
        "orderType": 50,
        "priceType": -1,
        "activityId": 0,
        "keyword": ""
    }
    print(type(payload))        # <class 'dict'>
    payload = json.dumps(payload)
    print(type(payload))        # <class 'str'> after serializing
    headers = {
        "Accept": "application/json",
        "Host": "study.163.com",
        "Origin": "https://study.163.com",
        "Content-Type": "application/json",
        "Referer": "https://study.163.com/courses",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36"
    }
    print(type(headers))        # <class 'dict'>
    req = requests.post("https://study.163.com/p/search/studycourse.json", data=payload, headers=headers)
    input("POST succeeded")     # debug pause: confirm the request went through
    print(type(req))            # <class 'requests.models.Response'>
    res_json = json.loads(req.text)
    print(type(res_json))       # <class 'dict'>
    with open("C:/Users/Administrator/Desktop/wangyiPublic.json", "w") as f:
        json.dump(res_json, f)
    print("Finished writing the file...")

getdata(1)
input("Reached the end")        # debug pause
```
Since I don't really know Python, there are a lot of debug prints in there.
The output looks like this:
The point worth noting is the type of req, which prints as requests.models.Response.
A quick Baidu search explains:
The returned Response carries a lot of information; .text is the part we want, and after extracting it we save it to a local file.
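As a sketch of that step: `.text` holds the raw body as a string, and `json.loads` turns it into a dict. The body below is faked; its `result -> list` nesting is assumed from what the PHP processing code later reads, not verified against the live API:

```python
import json

# Fake response body with the same nesting the real API appears to use
fake_body = json.dumps({
    "result": {
        "totalCount": 2,
        "list": [
            {"productName": "Python 入门", "learnerCount": 1200},
            {"productName": "PHP 进阶", "learnerCount": 800},
        ],
    }
})

data = json.loads(fake_body)   # equivalent to json.loads(req.text)
courses = data["result"]["list"]
print([c["productName"] for c in courses])
```

For what it's worth, requests also offers `req.json()`, which does the same decode in one call.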
Two points in the code worth calling out:
1、frontCategoryId, the course category. NetEase Cloud Classroom won't list every course at once, only all the courses under one top-level category, and frontCategoryId is that top-level category's ID; you can look the values up yourself~
The ID has to be right, or you won't get the matching category's course data.
2、pageSize, the number of records fetched per request. NetEase defaults to 50, since each page shows 50 courses. We don't need to bother with paging: just set it large, say 2000. Each top-level category only has a few hundred to a thousand-odd courses, definitely under 2K.
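One way to sanity-check that an oversized pageSize really did capture everything is to compare it against the total the response reports. Note the `result.totalCount` field here is an assumption on my part; I haven't verified the live API exposes it:

```python
def covered_in_one_request(response_dict, page_size):
    """True if a single request with page_size was enough for the whole category.

    Assumes the response carries result.totalCount alongside result.list
    (an assumption, not verified against the live API).
    """
    total = response_dict["result"]["totalCount"]
    return page_size >= total

sample = {"result": {"totalCount": 860, "list": []}}
print(covered_in_one_request(sample, 2000))  # a few hundred courses, pageSize 2000 covers them
print(covered_in_one_request(sample, 50))    # the site default of 50 would need paging
```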
Here's the data we get. Ideally the code would output a CSV file directly, but since my Python is shaky, I handled that part in PHP.
```php
// Data fetched via the Python POST to https://study.163.com/p/search/studycourse.json
// is saved to a file, then processed with PHP
public function readJsonAction(){
    $wangyi = file_get_contents("C:/Users/Administrator/Desktop/wangyi.json");
    $wangyi = json_decode($wangyi);
    $wangyi = $wangyi->result->list;
    $size = sizeof($wangyi);
    print_r($size);
    for ($i = 0; $i < $size; $i++) {
        $courseInfo = (array)$wangyi[$i];
        $insertData = array(
            'title'        => $courseInfo['title'],
            'productName'  => $courseInfo['productName'],
            'lectorName'   => $courseInfo['lectorName'],
            'learnerCount' => $courseInfo['learnerCount'],
            'lessonCount'  => $courseInfo['lessonCount'],
            'description'  => $courseInfo['description'],
            'score'        => $courseInfo['score'],
            'type'         => $courseInfo['type'],
            'imgUrl'       => $courseInfo['imgUrl'],
            'addtime'      => date("Y-m-d H:i:s", time())
        );
        $this->addCsvFile($insertData);
        echo "<pre>{$insertData['title']} written";
    }
}
```
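For what it's worth, the same post-processing could have stayed in Python; here's a sketch using the standard csv module. The field names are taken from the PHP code above, and the sample row is made up (`addCsvFile` in the PHP is a project helper, so here we just render CSV text directly):

```python
import csv
import io
from datetime import datetime

FIELDS = ["title", "productName", "lectorName", "learnerCount",
          "lessonCount", "description", "score", "type", "imgUrl", "addtime"]

def courses_to_csv(courses):
    """Render the result->list entries as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for course in courses:
        row = {k: course.get(k, "") for k in FIELDS}
        row["addtime"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        writer.writerow(row)
    return buf.getvalue()

# Made-up sample record mirroring the fields the PHP code extracts
sample = [{"title": "示例课程", "productName": "示例课程", "lectorName": "某老师",
           "learnerCount": 100, "lessonCount": 12, "description": "demo",
           "score": 4.8, "type": 0, "imgUrl": "http://example.com/a.jpg"}]
csv_text = courses_to_csv(sample)
print(csv_text.splitlines()[0])
```

Swapping `io.StringIO` for an `open(..., "w", newline="")` file handle writes the CSV straight to disk.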
The result:
NetEase Cloud Classroom has about 3,600 courses in total.
Next I went after NetEase Open Course; again, scrapy shell couldn't get at the actual course data.
The browser's developer tools revealed that:
the data comes from https://vip.open.163.com/open/trade/pc/course/listByClassify.do?classifyId=-1&type=2&page=2&size=20, and via a GET request this time.
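Since it's a plain GET, the URL is just base plus query string; a Python sketch rebuilding it with the standard library (parameter names copied from the captured URL):

```python
from urllib.parse import urlencode

BASE = "https://vip.open.163.com/open/trade/pc/course/listByClassify.do"

def list_url(page, size=20, classify_id=-1, type_=2):
    """Rebuild the listByClassify.do URL with the captured parameter names."""
    params = {"classifyId": classify_id, "type": type_, "page": page, "size": size}
    return BASE + "?" + urlencode(params)

print(list_url(1, size=1000))
```

With requests you'd skip the manual string-building and pass the same dict as `requests.get(BASE, params=params)`.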
Mimicking that with curl and bumping size to 1000 pulled down every course in one go~~~
The full code:
```php
// NetEase Open Course data hides behind the URL below; fetch it with GET, then process it
public function wangyiPublicAction(){
    $url = "https://vip.open.163.com/open/trade/pc/course/listByClassify.do?classifyId=-1&type=2&page=1&size=1032";
    $res = $this->https_request($url);
    $wangyiPublic = json_decode($res);
    $wangyiPublic = $wangyiPublic->data->items;
    $size = sizeof($wangyiPublic);
    print_r($size);
    for ($i = 0; $i < $size; $i++) {
        $courseInfo = (array)$wangyiPublic[$i];
        $insertData = array(
            'title'      => $courseInfo['title'],
            'subtitle'   => $courseInfo['subtitle'],
            'authorName' => $courseInfo['authorName'],
            'authorDesc' => $courseInfo['authorDescription'],
            'price'      => $courseInfo['originPrice'] / 100,
            'chapter'    => $courseInfo['contentCount'],
            'purchase'   => $courseInfo['purchaseCount'],
            'interest'   => $courseInfo['interestCount'],
        );
        $this->addCsvFile($insertData);
        echo "<pre>{$insertData['title']} written";
    }
}
```
Part of the data:
And with that, scraping NetEase's courses is basically done~