Scraping Youdao Translate [JSON]
    # Version 1 (requests): this is the version that fails
    '''
    @Modify Time      @Author
    ------------      -------
    2019/9/2 0:19     laoalo
    '''
    import requests
    import json
    import urllib.parse


    def get_translate_data(word):
        post_data = {
            'i': word,
            'from': 'AUTO',
            'to': 'AUTO',
            'smartresult': 'dict',
            'client': 'fanyideskweb',
            'salt': '15673547889901',
            'sign': '7ec51a2113e35502456742617b7cf37d',
            'ts': '1567354788990',
            'bv': 'a4f4c82afd8bdba188e568d101be3f53',
            'doctype': 'json',
            'version': '2.1',
            'keyfrom': 'fanyi.web',
            'action': 'FY_BY_REALTlME'
        }
        # URL-encode the form fields
        post_data = urllib.parse.urlencode(post_data).encode('utf-8')
        header = {
            'Origin': 'http://fanyi.youdao.com',
            'Referer': 'http://fanyi.youdao.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest'
        }
        youdao = r'http://fanyi.youdao.com/'

        resp_text = requests.post(url=youdao, headers=header, data=post_data).text
        response = resp_text.replace('"', "'")
        # print(resp_text)  # printing it directly shows no translation result
        # convert the JSON text into a dict
        # print(response)

        html = json.loads(resp_text)  # this is the call that raises JSONDecodeError
        result = html['translateResult'][0][0]['tgt']
        print(result)


    if __name__ == '__main__':
        get_translate_data('我来自四川')

    # Version 2 (urllib): this version works
    import urllib.request
    import urllib.parse
    import json

    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
    data = {}
    data['i'] = '我爱你'
    data['from'] = 'AUTO'
    data['to'] = 'AUTO'
    data['smartresult'] = 'dict'  # was [dict], i.e. a list holding the dict type
    data['client'] = 'fanyideskweb'
    data['salt'] = '1536832138651'
    data['sign'] = 'd01d0881f67f7d556a6c6d2bb441478e'
    data['doctype'] = 'json'
    data['version'] = '2.1'
    data['keyfrom'] = 'fanyi.web'
    data['action'] = 'FY_BY_CLICKBUTTION'
    data['typoResult'] = 'false'
    # URL-encode the form fields and convert them to bytes for the POST body
    data = urllib.parse.urlencode(data).encode('utf-8')
    response = urllib.request.urlopen(url, data)
    html = response.read().decode('utf-8')
    # print(html)
    target = json.loads(html)
    target = target['translateResult'][0][0]['tgt']
    print(target)
Reading the JSON with requests.post().text keeps failing with "json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)", whereas urlopen().read() works. Why?
First, capture the whole exchange:
urlopen().read() returns JSON (and, as I discovered along the way, getheaders() returns the response headers):
Printing the value returned by requests.post() shows that it is the HTML source of the Youdao web page:
<!DOCTYPE html> <html xmlns='http://www.w3.org/1999/xhtml'> <head> <meta http-equiv='Content-Type' content='text/html; charset=utf-8' /> <meta http-equiv='X-UA-Compatible' content='IE=edge,chrome=1' /> <title>在线翻译_有道</title> <meta name='keywords' content='在线翻译'/> <meta name='description' content='有道翻译提供即时免费的中文、英语、日语、韩语、法语、德语、俄语、西班牙语、葡萄牙语、越南语、印尼语、意大利语全文翻译、网页翻译、文档翻译服务。'/> <meta name='viewport' content='width=device-width, initial-scale=1.4, minimum-scale=1.0, maximum-scale=2.0'/> <link rel='canonical' href='http://fanyi.youdao.com'/> <link href='http://shared.ydstatic.com/plugins/search-provider.fanyi.xml' title='有道翻译' type='application/opensearchdescription+xml' rel='search'/> <link rel='shortcut icon' href='http://shared.ydstatic.com/images/favicon.ico' type='image/x-icon' /> <link href='http://shared.ydstatic.com/fanyi/newweb/v1.0.20/styles/newweb/fanyi-newweb.min.css' rel='stylesheet' type='text/css'/> <!--[if lte IE 8]> <script> window.onload = function(){ document.body.className = document.body.className + ' less-ie8'; }; </script> <![endif]--> </head> <body class='fanyi-page'> <div class='fanyi__nav'> <div class='fanyi__nav__container'> <ul class='fanyi__nav__list'> <li><a class='nav__tongchuan' target='_blank' href='http://tongchuan.youdao.com/?keyfrom=fanyi_web_tab'>同传<span class='tongchuan_new'>new</span></a></li> <li><a target='_blank' href='http://dict.youdao.com/appapi/redirect?redirectUrl=http%3A%2F%2Fyou.163.com%2Fitem%2Fdetail%3Fid%3D3394027%26from%3Dweb_hz_neibu_fyw_16&keyfrom=fyw_youdaofanyi&vendor=top'>翻译机</a></li> <li class='nav__rengong'> <a target='_blank' href='http://f.youdao.com/?vendor=fanyi-new-nav'>人工翻译</a> <div class='rengong__guide'> <a target='_blank' class='rengong__guide--con' href='http://f.youdao.com/?vendor=new-fanyicover'> <span class='tips__pointer tips__pointer--up'></span> <div class='rengong__guide--title'>网易自营人工翻译服务,专业、精准、地道!</div> <ul class='rengong__guide--list'> <li class='rengong__guide--sub'>快速翻译</li> <li>日常用语</li> <li 
class='right'>工作沟通</li> <li>地址信息</li> <li class='right'>商贸交流</li> <li>邮件往来</li> <li class='right'>文章节选</li> </ul> <div class='rengong__guide--line'></div> <ul class='rengong__guide--list rengong__guide--list--right'> <li class='rengong__guide--sub'>文档翻译</li> <li>专业论文</li> <li class='right'>产品介绍</li> <li>合同标书</li> <li class='right'>简历证件</li> <li>留学移民</li> <li class='right'>创意翻译</li> </ul> </a> <a class='i-know' href='javascript:;'>我知道了</a> </div> </li> <li><a target='_blank' href='http://ai.youdao.com/?keyfrom=fanyi-new-nav'>翻译API</a></li> <li><a target='_blank' href='http://fanyiguan.youdao.com/?vendor=fanyi-new-nav'>翻译APP</a></li> <li class='last'> <a class='login-link' href='javascript:;'>登录</a> </li> </ul> <a href='/?keyfrom=fanyi-new.logo' class='fanyi__nav__logo'></a> </div> </div> <div class='fanyi'> <div class='fanyi__operations'> <div class='fanyi__operations--right'> <span class='fanyi__operations--underline'> <label for='underlineWord'>划词</label> </span> </div> <div class='fanyi__operations--left'> <div id='langSelect' class='lang-select item-select'> <span class='select-text'>自动检测语言</span> <ul id='languageSelect' class='select clear'> <li class='default selected' data-value='AUTO'><a href='javascript:;'>自动检测语言</a></li> <li data-value='zh-CHS2en'><a href='javascript:;'>中文 » 英语</a></li> <li data-value='en2zh-CHS'><a href='javascript:;'>英语 » 中文</a></li> <li data-value='zh-CHS2ja'><a href='javascript:;'>中文 » 日语</a></li> <li data-value='ja2zh-CHS'><a href='javascript:;'>日语 » 中文</a></li> <li data-value='zh-CHS2ko'><a href='javascript:;'>中文 » 韩语</a></li> <li data-value='ko2zh-CHS'><a href='javascript:;'>韩语 » 中文</a></li> <li data-value='zh-CHS2fr'><a href='javascript:;'>中文 » 法语</a></li> <li data-value='fr2zh-CHS'><a href='javascript:;'>法语 » 中文</a></li> <li data-value='zh-CHS2de'><a href='javascript:;'>中文 » 德语</a></li> <li data-value='de2zh-CHS'><a href='javascript:;'>德语 » 中文</a></li> <li data-value='zh-CHS2ru'><a href='javascript:;'>中文 » 俄语</a></li> <li 
data-value='ru2zh-CHS'><a href='javascript:;'>俄语 » 中文</a></li> <li data-value='zh-CHS2es'><a href='javascript:;'>中文 » 西班牙语</a></li> <li data-value='es2zh-CHS'><a href='javascript:;'>西班牙语 » 中文</a></li> <li data-value='zh-CHS2pt'><a href='javascript:;'>中文 » 葡萄牙语</a></li> <li data-value='pt2zh-CHS'><a href='javascript:;'>葡萄牙语 » 中文</a></li> <li data-value='zh-CHS2it'><a href='javascript:;'>中文 » 意大利语</a></li> <li data-value='it2zh-CHS'><a href='javascript:;'>意大利语 » 中文</a></li> <li data-value='zh-CHS2vi'><a href='javascript:;'>中文 » 越南语</a></li> <li data-value='vi2zh-CHS'><a href='javascript:;'>越南语 » 中文</a></li> <li data-value='zh-CHS2id'><a href='javascript:;'>中文 » 印尼语</a></li> <li data-value='id2zh-CHS'><a href='javascript:;'>印尼语 » 中文</a></li> <li data-value='zh-CHS2ar'><a href='javascript:;'>中文 » 阿拉伯语</a></li> <li data-value='ar2zh-CHS'><a href='javascript:;'>阿拉伯语 » 中文</a></li> </ul> <input class='select-input' id='language' name='language' type='hidden' value='AUTO'> </div> <a class='fanyi__operations--machine' id='transMachine' href='javascript:;'>翻译</a> <a class='fanyi__operations--man clog-js' data-clog='AT_BUTTON_CLICK' data-pos='web.i.top' id='transMan' href='javascript:;'>人工翻译</a> <div class='tips__container fanyi__operations--man--tips'> <span class='tips__pointer tips__pointer--up'></span> <p>专业译员随时待命<br/>最快1分钟返回精准译文</p> <p class='man__tips--new'>【母语润色服务全新上线】</p> </div> </div> <div class='fanyi-error-message'></div> </div> <div class='fanyi__input'> <div class='input__original'> <div class='fanyi__input__bg'> <div id='docUploadBg' class='doc__upload--bg hidden'> <span class='doc-type'></span> <div class='doc-infos'> <p class='doc-name'></p> <span class='doc-error-msg hidden'></span> <span class='doc-size-msg hidden'></span> </div> <a class='doc-delete' href='javascript:;'></a> </div> <div id='docUploadCon' class='doc__upload--con'> <form id='docUploadForm' action1='http://ns013x.corp.youdao.com:13288/doc/upload' 
action='http://fanyi.youdao.com/trandoc/doc/upload' method='post' enctype='multipart/form-data'> <span>上传文档</span> <input name='your_file' disabled='disabled' type='file' id='docUploadFile' class='doc__upload--file'/> <div class='doc__upload--tip'> <span>全新中英文档互译</span> <a href='javascript:;' class='doc__upload--close'>×</a> </div> <div class='file__type--tips tips__container speaker__tips'> <span class='tips__pointer tips__pointer--down'></span> <span>全新文档翻译,支持docx /pdf 等格式中英互译,快来试用!</span> </div> </form> </div> <a id='inputDelete' class='input__original_delete'></a> <div id='inputOriginalCopy' class='input__original__area'></div> <textarea id='inputOriginal' dir='auto' class='input__original__area' placeholder='请输入你要翻译的文字或网址'></textarea> <div class='input__original__bar'> <div class='input__original__bar--fonts'> <span class='fonts__over'>0</span>/<span class='fonts__limited'>5000</span> </div> <a href='javascript:;' id='originalSpeaker' class='speaker'> <div class='tips__container speaker__tips'> <span class='tips__pointer tips__pointer--down'></span> <span class='tips__text--short'>朗读</span> </div> </a> </div> </div> <div class='fanyi__popularize'> <a href='javascript:;' target='_blank' class='clog-js' data-clog='AD_TEXT_CLICK'></a> </div> </div> <div class='input__target'> <div class='fanyi__input__bg'> <div id='docLangTip' class='doc-lang-tip'>选择翻译语言,然后点击翻译按钮,即可翻译文档</div> <div class='input__target__error' id='inputTargetError'></div> <div id='transTarget' dir='auto' class='input__target__text'></div> <textarea id='transTargetArea' class='input__target__text'></textarea> <div class='input__target__bar'> <a class='target__bar__update' id='updateResult'>修改翻译结果</a> <a href='javascript:;' id='targetSpeaker' class='speaker target__bar__parts'> <div class='tips__container speaker__tips'> <span class='tips__pointer tips__pointer--down'></span> <span class='tips__text--short'>朗读</span> </div> </a> <a href='javascript:;' id='targetCopy' class='copy target__bar__parts'> 
<div class='tips__container speaker__tips'> <span class='tips__pointer tips__pointer--down'></span> <span class='tips__text--short'>复制</span> </div> </a> <a href='javascript:;' id='targetStar' class='star target__bar__parts'> <div class='tips__container speaker__tips' id='targetStarTip'> <span class='tips__pointer tips__pointer--down'></span> <div class='tips__text--short'>翻译结果打分</div> <div class='star-con'> <span></span> <span></span> <span></span> <span></span> <span></span> </div> </div> </a> </div> <div class='input__update__suggest'>您提供的翻译将用于改善翻译质量,感谢您的建议!</div> <div class='input__target__dict'> <span class='resource'>来自有道词典结果</span> <div class='dict__word'> <span class='dict__word--letters'>美丽</span> <span class='dict__word--phonetic'>[měi lì]</span> </div> <div class='dict__relative'> <a>comeliness</a> <a>fairness</a> <a>goodliness</a> <a>loveliness</a> </div> <a class='dict__more clog-js' data-clog='RESULT_DICT_ALL_CLICK' href='javascript:;' target='_blank' >查看完整结果>></a> </div> </div> <div class='fanyi__update__tip'><span class='pointer'></span>点击可查看其他翻译结果,或修改结果</div> <div class='fanyi__suggest__container' id='fanyiSuggest'> <div class='suggest__title'> <div class='suggest__title--text'>以下为该句多个翻译结果:</div> <a class='suggest__title--close' href='javascript:;'></a> </div> <ul> </ul> <div class='suggest__update__con'> <a href='javascript:;' class='suggest__update__btn'>改进此翻译</a> </div> </div> <div class='input__target__update'> <a class='update__sure update-disable' href='javascript:;'>确认修改</a> <a class='update__cancel' href='javascript:;'>取消</a> </div> <div class='download__area'> <a target='_blank' href='http://f.youdao.com/?vendor=fanyibanner'> <div class='fanyi__banner--title'>试试有道人工翻译?</div> <div class='fanyi__banner--desc'>精选同行业资深译员,专家审校润色,让你尊享快捷又准确的人工翻译!</div> <span class='fanyi__banner--btn'>立即体验</span> </a> </div> </div> </div> <div class='inside__products'> <div class='inside__products__item inside__products__item--left'> <a target='_blank' 
href='http://f.youdao.com/?vendor=fanyi-new-bottom'> <div class='products__item--cell rengong'> <h4>有道人工翻译/母语润色</h4> <p class='products__item__desc'>全球最优秀的译员时刻待命<br/>专业、精准、地道!</p> <span class='rengong__intro'>了解更多></span> </div> </a> </div> <div class='inside__products__item'> <a target='_blank' href='http://tongchuan.youdao.com?keyfrom=fanyi_web_banner'> <div class=' products__item--half--cell products__tongchuan '> <h4>有道同传</h4> <p class='products__item__desc'>商务会议同传服务提供商<br/>专业、精准、可靠</p> </div> </a> <a target='_blank' href='http://fanyiguan.youdao.com/?vendor=fanyi-new-bottom'> <div class='products__item--half--cell products__fanyiguan'> <h4>有道翻译官 APP</h4> <p class='products__item__desc'>支持语音翻译和拍照翻译<br/>107种语言的随身翻译</p> </div> </a> </div> </div> </div> <form id='mapForm' target='_blank' method='POST' action='http://f.youdao.com/?path=fanyi&vendor=new-fanyiinput'> <input id='mapInput' type='hidden' name='text' value=''/> </form> <div class='fanyi__footer'> <a target='_blank' href='http://www.youdao.com/?keyfrom=fanyi-new.copyright'>有道首页</a><span class='c_fnl'>|</span><a target='_blank' href='http://dsp.youdao.com/?keyfrom=fanyi-new.copyright'>有道智选</a><span class='c_fnl'>|</span><a target='_blank' href='https://ke.youdao.com/?keyfrom=fanyi-new.copyright'>有道精品课</a><span class='c_fnl'>|</span><a target='_blank' href='http://www.youdao.com/about/index.html'>关于有道</a><span class='c_fnl'>|</span><a target='_blank' href='http://i.youdao.com'>官方博客</a> <p class='c_fcopyright'>© 2019 网易公司 京ICP证080268号</p> </div> <div class='side__nav'> <div class='rengong-weixin'> <span class='tips__pointer tips__pointer--right'></span> 扫描二维码<br/> 关注有道人工翻译 </div> <a href='http://f.youdao.com/?vendor=new-fanyientrance' target='_blank' class='side__nav__flow'>人工<br/>翻译</a> <a target='_blank' href='http://survey2.163.com/html/fanyis201103a2/paper.html?id=168824090@10.168.1.8@0' class='side__nav__feedback'>满意度<br/>反馈</a> <a href='javascript:;' class='side__nav__backtop'></a> </div> <div 
id='YOUDAO_SELECTOR_WRAPPER' bindTo='inputOriginal:transTarget' style='display:none; z-index: 101; margin:0; border:0; padding:0; width:320px; height:240px;'></div> <div class='less-ie8-tip'>请在IE8以上版本,或Chrome、火狐、Safari等浏览器中访问该网页。</div> <div class='dict-download-guide'> <div class='guide-con'> <a href='javascript:;' target='_blank' class='download-guide-link'> <img class='download-guide-img' src='http://shared.ydstatic.com/images/favicon.ico'/> </a> <span class='guide-close'></span> </div> </div> <audio id='playVoice' style='position:absolute;top:-999px;left:-999px;width:1px;height:1px;'></audio> <div class='upload__cover'> <div class='upload__cover--content'> <div class='upload__cover--title'><span class='upload__percent'></span>解析中</div> <div class='upload__filename'></div> <div class='upload__progress--con'> <div class='upload__progress'></div> </div> <a class='upload--cancel' href='javascript:;'>取消</a> </div> </div> <div id='dialogCover' class='dialog-cover'></div> <div id='loginAlert' class='dialog-alert'> <a href='javascript:;' class='dialog-alert--close'></a> <div class='title'>如需使用文档翻译功能,请先登录。</div> <div class='content'></div> <div class='btns-con'> <a href='javascript:;' class='cancel'>取消</a> <a href='javascript:;' class='ok'>确定</a></div> </div> <div id='loginWindow' class='dialog-alert'> <a href='javascript:;' class='dialog-alert--close'></a> <h3 class='login-title'>使用网易邮箱登录</h3> <div class='content urs-login-content' ></div> <div class='other third-login'> <a class='third-login-weixin'> <img src='http://shared.ydstatic.com/fanyi/login/images/weixin@2x.png' alt='微信登录' title='微信登录' height='44'/> </a> <a class='third-login-weibo'> <img src='http://shared.ydstatic.com/fanyi/login/images/weibo@2x.png' alt='新浪微博登录' title='新浪微博登录' height='44'/> </a> <a class='third-login-qq'> <img src='http://shared.ydstatic.com/fanyi/login/images/qq@2x.png' alt='QQ帐号登录' title='QQ帐号登录' height='44' /> </a> </div> </div> <!-- START rlog --> <script> var _rlog = _rlog || []; // 指定 
product id _rlog.push(['_setAccount' , 'fanyiweb']); </script> <script defer src='http://shared.ydstatic.com/js/rlog/v1.js'></script> <!-- END rlog --> </body> <script type='text/javascript'> var global = {}; </script><script type='text/javascript' src='http://shared.ydstatic.com/api/fanyi-web/assets/index.min.js' charset='utf-8'></script> <script type='text/javascript' src='http://shared.ydstatic.com/fanyi/newweb/v1.0.20/scripts/newweb/fanyi.min.js'></script> </html>
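That is why the parse error occurs: the body starts with <!DOCTYPE html> rather than a JSON value, so json.loads() gives up at the very first character (line 1 column 1, char 0). Note that the two versions do not post to the same URL. The requests version posts to the home page http://fanyi.youdao.com/, while the urllib version posts to http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule. The version below switches to the second URL.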
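The failure can be reproduced without touching the network. The following is a minimal sketch, not the original author's code: `parse_translate_response` is a hypothetical helper, and both sample bodies are made up for illustration (the JSON one only mimics the `translateResult` shape used in the scripts above).

```python
import json


def parse_translate_response(body):
    """Return the translated text, or raise ValueError with a readable reason.

    Hypothetical helper for illustration; `body` is the decoded response text.
    """
    stripped = body.lstrip()
    # A JSON object starts with '{'; an HTML page starts with '<'.
    if stripped.startswith('<'):
        raise ValueError('got HTML instead of JSON; check the request URL')
    data = json.loads(stripped)
    return data['translateResult'][0][0]['tgt']


# Made-up sample bodies (assumptions, not real server output).
html_page = "<!DOCTYPE html> <html><head><title>demo</title></head></html>"
json_body = '{"translateResult": [[{"src": "hello", "tgt": "bonjour"}]]}'

# Feeding an HTML page straight to json.loads reproduces the error from the post.
try:
    json.loads(html_page)
except json.JSONDecodeError as exc:
    error_message = str(exc)
    print(error_message)

print(parse_translate_response(json_body))
```

Checking the first character before parsing turns an opaque JSONDecodeError into a message that points at the likely cause, namely a wrong request URL.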
    '''
    @Modify Time      @Author
    ------------      -------
    2019/9/2 0:19     laoalo
    '''
    import json
    import urllib.parse
    import urllib.request


    def get_translate_data(word):
        post_data = {
            'i': word,
            'from': 'AUTO',
            'to': 'AUTO',
            'smartresult': 'dict',
            'client': 'fanyideskweb',
            'salt': '15673547889901',
            'sign': '7ec51a2113e35502456742617b7cf37d',
            'ts': '1567354788990',
            'bv': 'a4f4c82afd8bdba188e568d101be3f53',
            'doctype': 'json',
            'version': '2.1',
            'keyfrom': 'fanyi.web',
            'action': 'FY_BY_REALTlME'
        }
        # URL-encode the form fields
        post_data = urllib.parse.urlencode(post_data).encode('utf-8')
        # Not sent: urlopen() below is called without these headers
        header = {
            'Origin': 'http://fanyi.youdao.com',
            'Referer': 'http://fanyi.youdao.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest'
        }
        youdao = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'

        # Earlier attempt with requests, kept for reference:
        # re = requests.post(url=youdao,data=post_data).text
        # # print(re)  # printing it directly shows no translation result
        # # convert the JSON text into a dict
        # html = json.loads(re)
        # result = html['translateResult'][0][0]['tgt']
        # print(result)

        resp = urllib.request.urlopen(url=youdao, data=post_data)
        html = resp.read().decode('utf-8')
        target = json.loads(html)
        print(target['translateResult'][0][0]['tgt'])


    if __name__ == '__main__':
        get_translate_data(input("Enter text: "))
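- urllib.request.urlopen(url, data=None, [timeout, ...]): creates a file-like object representing the remote URL, which you then read like a local file to fetch the remote data.
- url is the location of the remote data, usually a web address (a string or a Request object);
- data is the payload submitted to the URL. When it is given, the request is sent as POST instead of GET (the two ways of submitting data that anyone who has done web work will know). It must be bytes, hence the urlencode(...).encode('utf-8') calls above;
- the proxies parameter belongs to the Python 2 urllib.urlopen(). In Python 3, which the code above uses, proxies are configured through urllib.request.ProxyHandler and build_opener() instead.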
- urlopen compared with requests.get()
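- urlopen opens a URL. The url argument may be a URL string or a Request object, and the return value is an http.client.HTTPResponse object. That object provides the methods read(), readinto(), getheader(), getheaders() and fileno(), plus the attributes msg, version, status, reason, debuglevel and closed. read() returns raw bytes that have not been decoded yet, so it is normally followed by decode() with the appropriate encoding passed as the argument (for example 'utf-8').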
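- requests.get() requests the site's URL and returns a Response object, from which you can print the result's type, status code (status_code), encoding, cookies and so on. Its .text attribute is already a decoded str.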
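The urlopen behaviour described in this list can be tried against a throwaway local server instead of the real site. This is only a sketch under stated assumptions: `EchoHandler` and its JSON reply are hypothetical stand-ins, not Youdao's API, and the empty `ProxyHandler({})` is installed purely so that the loopback request bypasses any system proxy.

```python
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Stand-in local endpoint: echoes the posted form field `i` back as JSON."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode('utf-8'))
        payload = json.dumps({'method': self.command, 'i': form['i'][0]}).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), EchoHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# Python 3 has no `proxies` argument; proxy settings go through an opener.
# An empty mapping disables proxies, so the local request is sent directly.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({})))

# Passing `data` (bytes) turns the request into a POST.
body = urllib.parse.urlencode({'i': '我爱你'}).encode('utf-8')
with urllib.request.urlopen(url, data=body, timeout=5) as resp:
    status = resp.status
    content_type = resp.getheader('Content-Type')
    raw = resp.read()        # bytes, not yet decoded
text = raw.decode('utf-8')   # decode explicitly before parsing
result = json.loads(text)

server.shutdown()
server.server_close()
print(status, content_type, type(raw).__name__, result)
```

With requests, the equivalent of `raw.decode('utf-8')` is done for you by the `.text` attribute, which is why the two approaches look different even when the server sends the same bytes.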