正则表达式取值方法归纳
如,某站点接口请求返回值如下:
{"Code":0,"Msg":"获取成功","Data":
{"Total":1,"DataList":
[{"Id":11564XXXX4368,
"TopIsAgent":false,
"IsAnchor":true,
"FirstTwitterId":1156XXX368,
"FirstTwitterName":"贝克汉姆",
"FirstTwitterRemark":"小贝960",
"FirstTwitterTelephone":"1867XXXX640",
"IsTwitter":true,
"HeadImgUrl":"https://wx.qlogo.cn/mmopen/vi_32/DYAIOgqXXXXyXXX2ov4UzAxA/132",
"NickName":"贝克汉姆",
"Remark":"小贝960",
"WxRemark":null,
"Telephone":"1867XXXX640",
"LastBuyTime":"2021-03-06 11:25:38",
"Source":1,
"MemberSourceStr":"小程序授权登录",
"TotalPoints":185200,
"TotalFee":10030866.84,
"TotalFeeNum":292,
"SingleFee":34352.283698630136986301369863,
"CustomerType":"会员",
"Tags":[{"Id":11668,"Name":"规格","TagType":2}],
"SystemTagType":"成交客户",
"Cards":[{"Id":987,"Name":"会员卡测试会员卡测"}],
"CustomerTime":"2019-07-25 13:59:28","IsExistBlackList":false}]},
"TraceFlag":null,"ErrorDetail":null,"Pname":null}
案例1:直取法取"NickName"后的值”贝克汉姆“,但是输出会带“ ”双引号,value = eval(value)可去掉引号
value = re.findall('"NickName":(.*?),', response, re.S)[0]
或者直接在代码中把引号去除
value = re.findall('"NickName":”(.*?)“,', response, re.S)[0]
案列2:前后匹配法取"NickName"后的值”贝克汉姆“,经观察,”贝克汉姆“值前面为FirstTwitterName":"“,后面为"FirstTwitterRemark,则表达式可写成如下
value = re.findall('FirstTwitterName":"(.+?)","FirstTwitterRemark', response, re.S)[0]
同理,也可以取出HeadImgUrl参值网址中的任意一段内容,如取”wx.qlogo.cn“,写法如下
value = re.findall("https://(.+?)/mmopen", response, re.S)[0]
爬虫中经常要取”href“里的内容,也可以通过案例2中的前后匹配法取值,如取下图html内的href内容,但是要注意中间的空格、换行字符
这里面我们看不到任何空格和换行字符,但如果先取出class=”tal“内的数据,就可以看的出来了,写法如下
value = re.findall('<td class="tal" id="">(.+?)</td', response, re.S)
返回内容,可见href前有\r\n,还有空格等数据,所以基于这些数据,把正则再完善一下
value = re.findall('<td class="tal" id=""> \r\n\r\n\t\r\n\r\n\t<h3><a href="(.+?)" target=', response, re.S)
如此,便取出了想要的数据
案例3:如果返回值内即包含中文又包含数字,我们只需要数字或者中文呢?如"FirstTwitterRemark":"小贝960",取出960或者小贝
首先,取出数字
value = re.findall('"FirstTwitterRemark":"(.*?)",', response, re.S)[0] num = re.sub(r'\D', "", value) print(value) print(num)
然后,取出中文
value = re.findall('"FirstTwitterRemark":"(.*?)",', response, re.S)[0] zhong = re.findall('[\u4e00-\u9fa5]+', value) print(value) print(zhong[0])
如上,取中间,是通过写入汉字的unicode范围来取值,那么替换成下表格内诸如数字和字母的范围,同理也可以取出对应的内容
未完待续。。。