A multithreaded proxy-server fetcher, store and verifier written in Python. I hope fellow Python fans will help me improve it

Posted on 2007-06-01 14:17 by Go_Rush

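I once wrote one of these in PHP, but because PHP has no multithreading, both fetching and verifying were very slow
(libcurl can do multithreaded fetching, but fetching pages is all it does; post-processing the fetched data is awkward).

So I decided to rewrite it in Python, since Python supports multithreading.
I had not used Python for more than a year and had forgotten most of the syntax and language features. After three days of
spare-time tinkering, the program is finally ready to share.

The source code is below. I hope experts can help me improve it.
This is only the second program I have written since I started learning Python, so much of it is probably less concise than it could be. Pointers from experienced users are welcome.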

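Current features:
   1. Automatically fetches proxy lists from 12 sites and saves them to a database
   2. Automatically verifies whether each proxy works, and stores the response time measured during verification as an indicator of proxy speed
   3. Exports proxies by category (verified, unverified, high-anonymity, anonymous, transparent) to separate files
   4. Supported output formats: xml, htm, csv, txt, tab. Fields and layout can be customized for each format
   5. Easy to extend: adding a new site only requires changing one global variable and adding two functions (the interface is documented in detail)
   6. Uses sqlite as the database: small, convenient, simple, zero configuration, zero installation; it fits in your back pocket
   7. Multithreaded fetching, multithreaded verification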
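Features 2 and 7 combine into one pattern: several worker threads each take a share of the proxy list and record whether each proxy answered and how long it took. The following is only a minimal sketch of that pattern, not the program's own code. It assumes Python 3 rather than 2.4, `probe` is a hypothetical stand-in for the real HTTP request so that the sketch runs offline, and the strided split of the list is a simplification made for brevity.

```python
import threading
import time

def probe(ip, port):
    """Hypothetical stand-in for a real request sent through the proxy."""
    time.sleep(0.01)
    return port != 0  # pretend that port 0 never answers

def check_all(proxies, thread_count=4):
    """Check proxies on several threads; return [ip, port, active, seconds] rows."""
    results = []
    lock = threading.Lock()

    def worker(chunk):
        for ip, port in chunk:
            start = time.perf_counter()
            ok = probe(ip, port)
            elapsed = time.perf_counter() - start
            with lock:  # the result list is shared by all workers
                results.append([ip, port, 1 if ok else 0, elapsed])

    # Strided slices give every thread a roughly equal share of the list.
    chunks = [proxies[i::thread_count] for i in range(thread_count)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks if c]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

rows = check_all([("10.0.0.1", 8080), ("10.0.0.2", 0), ("10.0.0.3", 3128)])
```

Each row has the same shape as the entries the program collects before writing check results back to the database: address, port, an active flag and the elapsed time.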

My environment: Windows XP + Python v2.4; other versions are untested.
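On Python 2.4 the sqlite storage mentioned in feature 6 needs the separate pysqlite module. From Python 2.5 onwards the same DB-API ships in the standard library as `sqlite3`. The sketch below is an assumption-laden illustration for Python 3, not the program's code: it uses an in-memory database and a cut-down two-column table, and shows how a Python function can be registered so that SQL can call `unix_timestamp()`, which sqlite does not provide itself.

```python
import sqlite3
import time

def my_unix_timestamp():
    """Current time as whole seconds since the epoch."""
    return int(time.time())

db = sqlite3.connect(":memory:")  # in-memory database, for illustration only
# Register the Python function under the SQL name unix_timestamp (0 arguments).
db.create_function("unix_timestamp", 0, my_unix_timestamp)

cur = db.cursor()
cur.execute("CREATE TABLE proxier (ip TEXT PRIMARY KEY, time_added INTEGER)")
cur.execute(
    "INSERT INTO proxier (ip, time_added) VALUES (?, unix_timestamp())",
    ("10.0.0.1",),
)
cur.execute("SELECT ip, time_added FROM proxier")
row = cur.fetchone()
db.close()
```

Passing the address as a `?` parameter instead of concatenating it into the SQL string also avoids quoting problems.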

Download the program: click here (242 KB)

The code is commented in detail, so even Python beginners can follow it. The regular expressions that parse each of the 12 sites are all annotated.
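For readers new to annotated regular expressions, here is a small sketch of the style, assuming Python 3. The HTML sample, the name `parse_page_example` and the `TYPE_CODES` table are invented for illustration and do not correspond to any of the real sites. `re.VERBOSE` lets the pattern span several lines and carry a comment per field.

```python
import re

ROW = re.compile(r'''
    <td>((?:\d{1,3}\.){3}\d{1,3})</td>\s*   # ip
    <td>(\d{2,5})</td>\s*                   # port
    <td>([^<]*)</td>\s*                     # type label
    <td>([^<]*)</td>                        # area
''', re.VERBOSE)

# Numeric codes used throughout the program: 2 high anonymity, 1 anonymous,
# 0 transparent, -1 unknown.
TYPE_CODES = {'high anonymity': 2, 'anonymous': 1, 'transparent': 0}

def parse_page_example(html=''):
    """Return a list of [ip, port, type, area] entries found in html."""
    ret = []
    for ip, port, label, area in ROW.findall(html):
        code = TYPE_CODES.get(label.strip().lower(), -1)
        ret.append([ip, port, code, area.strip()])
    return ret

sample = """
<tr><td>10.0.0.1</td> <td>8080</td> <td>anonymous</td> <td>US</td></tr>
<tr><td>10.0.0.2</td>
    <td>3128</td> <td>elite</td> <td>JP</td></tr>
"""
parsed = parse_page_example(sample)
```

An unrecognized label such as `elite` falls back to -1, the code for an undetermined proxy type.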

# -*- coding: gb2312 -*-
# vi:ts=4:et

"""
The program can currently fetch proxy lists from the following sites:

http://www.cybersyndrome.net/
http://www.pass-e.com/
http://www.cnproxy.com/
http://www.proxylists.net/
http://www.my-proxy.com/
http://www.samair.ru/proxy/
http://proxy4free.com/
http://proxylist.sakura.ne.jp/
http://www.ipfree.cn/
http://www.publicproxyservers.com/
http://www.digitalcybersoft.com/
http://www.checkedproxylists.com/


Q: How do I add a new site of my own and have the program fetch it automatically?

A: Look at how the following functions are defined in the source code. The number at
the end of each function name starts at 1 and increases; it has currently reached 13.

def build_list_urls_1(page=5):
def parse_page_1(html=''):

def build_list_urls_2(page=5):
def parse_page_2(html=''):

.......

def build_list_urls_13(page=5):
def parse_page_13(html=''):

All you have to do is add two functions, build_list_urls_14 and parse_page_14.

Say you want to fetch from www.somedomain.com
    /somepath/showlist.asp?page=1
    ...  through
    /somepath/showlist.asp?page=8    (8 pages in total)

Then build_list_urls_14 should be defined as follows.
Set the default value of the page parameter to the number of pages to fetch (8), so that all 8 pages are fetched.

def build_list_urls_14(page=8):
    .....
    return [        # a flat list; each element is the absolute URL of a page to fetch
        'http://www.somedomain.com/somepath/showlist.asp?page=1',
        'http://www.somedomain.com/somepath/showlist.asp?page=2',
        'http://www.somedomain.com/somepath/showlist.asp?page=3',
        ....
        'http://www.somedomain.com/somepath/showlist.asp?page=8'
    ]


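Next, write a function parse_page_14(html='') that parses the HTML of the pages
returned by the function above and extracts the proxy addresses from it.

Note: the program calls this function once for every page returned by
build_list_urls_14; the html argument is the HTML text of that page.

ip:   must be a numeric IP in xxx.xxx.xxx.xxx form; host names such as www.xxx.com are not allowed
port: must be a number of 2 to 5 digits
type: must be one of the numbers 2, 1, 0, -1, which stand for the proxy type
      2: high-anonymity proxy  1: anonymous proxy  0: transparent proxy  -1: type cannot be determined
area: country or region of the proxy; must be converted to UTF-8

def parse_page_14(html=''):
    ....
    return [
        [ip,port,type,area],
        [ip,port,type,area],
        .....
        ....
        [ip,port,type,area]
    ]
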

Finally, and most importantly: increase the global variable web_site_count by 1, i.e. web_site_count=14


Q: I followed the instructions above and added a custom site. How do I add another one?

A: Since you already know how to add build_list_urls_14 and parse_page_14,
add the two functions

def build_list_urls_15(page=5):
def parse_page_15(html=''):

in the same way, and update the global variable: web_site_count=15

"""

import urllib,time,random,re,threading,string

web_site_count=13   # number of sites to fetch from
day_keep=2          # delete invalid proxies that have been in the database for more than day_keep days
indebug=1

thread_num=100                    # start thread_num threads to check proxies
check_in_one_call=thread_num*25   # maximum number of proxies checked in one run


skip_check_in_hour=1    # do not re-check the same proxy within skip_check_in_hour hours
skip_get_in_hour=8      # minimum interval between two fetches of new proxies (hours)

proxy_array=[]          # proxies waiting to be added to the database
update_array=[]         # check results waiting to be written to the database

db=None                 # global database object
conn=None
dbfile='proxier.db'     # database file name

target_url="http://www.baidu.com/"   # when verifying a proxy, request this URL through it
target_string="030173"               # if the returned html contains this string
target_timeout=30                    # and the response takes less than target_timeout seconds,
                                     # the proxy is considered valid


# File format for exporting proxy data. To disable export, set output_type=''

output_type='xml'                   # one of the formats below; the default is xml
                                    # xml
                                    # htm
                                    # tab    tab-separated, Excel compatible
                                    # csv    comma-separated, Excel compatible
                                    # txt    xxx.xxx.xxx.xxx:xx format

# Output file names; make sure this list has exactly six elements
output_filename=[
            'uncheck',             # proxies that have not been checked yet
            'checkfail',           # proxies that were checked and marked invalid
            'ok_high_anon',        # valid high-anonymity proxies, sorted by speed, fastest first
            'ok_anonymous',        # valid anonymous proxies, sorted by speed, fastest first
            'ok_transparent',      # valid transparent proxies, sorted by speed, fastest first
            'ok_other'             # valid proxies of unknown type, sorted by speed
            ]


# Data columns supported in the output format:
# _ip_ , _port_ , _type_ , _status_ , _active_ ,
# _time_added_, _time_checked_ ,_time_used_ , _speed_, _area_

output_head_string=''             # header string of the output file
output_format=''                  # format of each data record
output_foot_string=''             # footer string of the output file



if   output_type=='xml':
    output_head_string="<?xml version='1.0' encoding='gb2312'?><proxylist>\n"
    output_format="""<item>
            <ip>_ip_</ip>
            <port>_port_</port>
            <speed>_speed_</speed>
            <last_check>_time_checked_</last_check>
            <area>_area_</area>
        </item>
"""
    output_foot_string="</proxylist>"
elif output_type=='htm':
    output_head_string="""<table border=1 width='100%'>
        <tr><td>Proxy</td><td>Last checked</td><td>Speed</td><td>Area</td></tr>
"""
    output_format="""<tr>
    <td>_ip_:_port_</td><td>_time_checked_</td><td>_speed_</td><td>_area_</td>
    </tr>
"""
    output_foot_string="</table>"
else:
    output_head_string=''
    output_foot_string=''

if output_type=="csv":
    output_format="_ip_, _port_, _type_,  _speed_, _time_checked_,  _area_\n"

if output_type=="tab":
    output_format="_ip_\t_port_\t_speed_\t_time_checked_\t_area_\n"

if output_type=="txt":
    output_format="_ip_:_port_\n"

# Write the output files
def output_file():
    global output_filename,output_head_string,output_foot_string,output_type
    if output_type=='':
        return
    fnum=len(output_filename)
    content=[]
    for i in range(fnum):
        content.append([output_head_string])

    conn.execute("select * from `proxier` order by `active`,`type`,`speed` asc")
    rs=conn.fetchall()

    for item in rs:
        type,active=item[2],item[4]
        if   active is None:
            content[0].append(formatline(item))   # not checked yet
        elif active==0:
            content[1].append(formatline(item))   # invalid proxy
        elif active==1 and type==2:
            content[2].append(formatline(item))   # high anonymity
        elif active==1 and type==1:
            content[3].append(formatline(item))   # anonymous
        elif active==1 and type==0:
            content[4].append(formatline(item))   # transparent
        elif active==1 and type==-1:
            content[5].append(formatline(item))   # unknown type
        else:
            pass

    for i in range(fnum):
        content[i].append(output_foot_string)
        f=open(output_filename[i]+"."+output_type,'w')
        f.write(string.join(content[i],''))
        f.close()

# Format one record for output
def formatline(item):
    global output_format
    arr=['_ip_','_port_','_type_','_status_','_active_',
        '_time_added_','_time_checked_','_time_used_',
        '_speed_','_area_']
    s=output_format
    for i in range(len(arr)):
        s=string.replace(s,arr[i],str(formatitem(item[i],i)))
    return s

# Each database field needs some processing: Chinese text must be encoded, date fields converted
def formatitem(value,colnum):
    global output_type
    if (colnum==9):
        value=value.encode('cp936')
    elif value is None:
        value=''

    if colnum==5 or colnum==6 or colnum==7:      #time_xxxed
        value=string.atof(value)
        if value<1:
            value=''
        else:
            value=formattime(value)

    if value=='' and output_type=='htm':value='&nbsp;'
    return value

def check_one_proxy(ip,port):
    global update_array
    global check_in_one_call
    global target_url,target_string,target_timeout

    url=target_url
    checkstr=target_string
    timeout=target_timeout
    ip=string.strip(ip)
    proxy=ip+':'+str(port)
    proxies = {'http': 'http://'+proxy+'/'}
    opener = urllib.FancyURLopener(proxies)
    opener.addheaders = [
        ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')
        ]
    t1=time.time()

    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())

    try:
        f = opener.open(url)
        s= f.read()
        pos=s.find(checkstr)
    except:
        pos=-1
        pass
    t2=time.time()
    timeused=t2-t1
    if (timeused<timeout and pos>0):
        active=1
    else:
        active=0
    update_array.append([ip,port,active,timeused])
    print len(update_array),' of ',check_in_one_call,"",ip,':',port,'--',int(timeused)

def get_html(url=''):
    opener = urllib.FancyURLopener({})      # do not use a proxy
    # www.my-proxy.com needs the Cookie below to be fetched correctly
    opener.addheaders = [
            ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)'),
            ('Cookie','permission=1')
            ]
    t=time.time()
    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())
    try:
        f = opener.open(url)
        return f.read()
    except:
        return ''

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################

def build_list_urls_1(page=5):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://proxy4free.com/page%(num)01d.html'%{'num':i})
    return ret

def parse_page_1(html=''):
    matches=re.findall(r'''
            <td>([\d\.]+)<\/td>[\s\n\r]*   #ip
            <td>([\d]+)<\/td>[\s\n\r]*     #port
            <td>([^\<]*)<\/td>[\s\n\r]*    #type
            <td>([^\<]*)<\/td>             #area
            ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='anonymous'):
            type=1
        elif (type=='high anonymity'):
            type=2
        elif (type=='transparent'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '1',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################


def build_list_urls_2(page=1):
    return ['http://www.digitalcybersoft.com/ProxyList/fresh-proxy-list.shtml']

def parse_page_2(html=''):
    matches=re.findall(r'''
        ((?:[\d]{1,3}\.){3}[\d]{1,3})\:([\d]+)      #ip:port
        \s+(Anonymous|Elite Proxy)[+\s]+            #type
        (.+)\r?\n                                   #area
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='Anonymous'):
            type=1
        else:
            type=2
        ret.append([ip,port,type,area])
        if indebug:print '2',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################


def build_list_urls_3(page=15):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.samair.ru/proxy/proxy-%(num)02d.htm'%{'num':i})
    return ret

def parse_page_3(html=''):
    matches=re.findall(r'''
        <tr><td><span\sclass\="\w+">(\d{1,3})<\/span>\. #ip(part1)
        <span\sclass\="\w+">
        (\d{1,3})<\/span>                               #ip(part2)
        (\.\d{1,3}\.\d{1,3})                            #ip(part3,part4)

        \:\r?\n(\d{2,5})<\/td>                          #port
        <td>([^<]+)</td>                                #type
        <td>[^<]+<\/td>
        <td>([^<]+)<\/td>                               #area
        <\/tr>''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]+"."+match[1]+match[2]
        port=match[3]
        type=match[4]
        area=match[5]
        if (type=='anonymous proxy server'):
            type=1
        elif (type=='high-anonymous proxy server'):
            type=2
        elif (type=='transparent proxy'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '3',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################

def build_list_urls_4(page=3):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.pass-e.com/proxy/index.php?page=%(n)01d'%{'n':i})
    return ret

def parse_page_4(html=''):
    matches=re.findall(r"""
        list
        \('(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'        #ip
        \,'(\d{2,5})'                                   #port
        \,'(\d)'                                        #type
        \,'([^']+)'\)                                   #area
        \;\r?\n""",html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        area=unicode(area, 'cp936')
        area=area.encode('utf8')
        if (type=='1'):      # for the type codes, see the javascript in the fetched page
            type=1
        elif (type=='3'):
            type=2
        elif (type=='2'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '4',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################


def build_list_urls_5(page=12):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.ipfree.cn/index2.asp?page=%(num)01d'%{'num':i})
    return ret

def parse_page_5(html=''):
    matches=re.findall(r"<font color=black>([^<]*)</font>",html)
    ret=[]
    for index, match in enumerate(matches):
        if (index%3==0):
            ip=matches[index+1]
            port=matches[index+2]
            type=-1      # this site does not give the proxy type
            area=unicode(match, 'cp936')
            area=area.encode('utf8')
            if indebug:print '5',ip,port,type,area
            ret.append([ip,port,type,area])
        else:
            continue
    return ret


################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################

def build_list_urls_6(page=3):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.cnproxy.com/proxy%(num)01d.html'%{'num':i})
    return ret

def parse_page_6(html=''):
    matches=re.findall(r'''<tr>
        <td>([^&]+)                     #ip
        #8204‍
        \:([^<]+)                       #port
        </td>
        <td>HTTP</td>
        <td>[^<]+</td>
        <td>([^<]+)</td>                #area
        </tr>''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=-1          # this site does not give the proxy type
        area=match[2]
        area=unicode(area, 'cp936')
        area=area.encode('utf8')
        ret.append([ip,port,type,area])
        if indebug:print '6',ip,port,type,area
    return ret


################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################

def build_list_urls_7(page=1):
    return ['http://www.proxylists.net/http_highanon.txt']

def parse_page_7(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=2
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '7',ip,port,type,area
    return ret


################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################

def build_list_urls_8(page=1):
    return ['http://www.proxylists.net/http.txt']

def parse_page_8(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=-1
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '8',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################


def build_list_urls_9(page=6):
    page=page+1
    ret=[]
    for i in range(0,page):
        ret.append('http://proxylist.sakura.ne.jp/index.htm?pages=%(n)01d'%{'n':i})
    return ret

def parse_page_9(html=''):
    matches=re.findall(r'''
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})        #ip
        \:(\d{2,5})                                 #port
        <\/TD>[\s\r\n]*
        <TD>([^<]+)</TD>                            #area
        [\s\r\n]*
        <TD>([^<]+)</TD>                            #type
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[3]
        area=match[2]
        if (type=='Anonymous'):
            type=1
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '9',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################

def build_list_urls_10(page=5):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.publicproxyservers.com/page%(n)01d.html'%{'n':i})
    return ret

def parse_page_10(html=''):
    matches=re.findall(r'''
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})    #ip
        <\/td>[\s\r\n]*
        <td[^>]+>(\d{2,5})<\/td>                #port
        [\s\r\n]*
        <td>([^<]+)<\/td>                       #type
        [\s\r\n]*
        <td>([^<]+)<\/td>                       #area
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='high anonymity'):
            type=2
        elif (type=='anonymous'):
            type=1
        elif (type=='transparent'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '10',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################



def build_list_urls_11(page=10):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.my-proxy.com/list/proxy.php?list=%(n)01d'%{'n':i})

    ret.append('http://www.my-proxy.com/list/proxy.php?list=s1')
    ret.append('http://www.my-proxy.com/list/proxy.php?list=s2')
    ret.append('http://www.my-proxy.com/list/proxy.php?list=s3')
    return ret

def parse_page_11(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]

    if (html.find('(Level 1)')>0):
        type=2
    elif (html.find('(Level 2)')>0):
        type=1
    elif (html.find('(Level 3)')>0):
        type=0
    else:
        type=-1

    for match in matches:
        ip=match[0]
        port=match[1]
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '11',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################



def build_list_urls_12(page=4):
    ret=[]
    ret.append('http://www.cybersyndrome.net/plr4.html')
    ret.append('http://www.cybersyndrome.net/pla4.html')
    ret.append('http://www.cybersyndrome.net/pld4.html')
    ret.append('http://www.cybersyndrome.net/pls4.html')
    return ret

def parse_page_12(html=''):
    matches=re.findall(r'''
        onMouseOver\=
        "s\(\'(\w\w)\'\)"                           #area
        \sonMouseOut\="d\(\)"\s?c?l?a?s?s?\=?"?
        (\w?)                                       #type
        "?>
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})        #ip
        \:(\d{2,5})                                 #port
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[2]
        port=match[3]
        area=match[0]
        type=match[1]
        if (type=='A'):
            type=2
        elif (type=='B'):
            type=1
        else:
            type=0
        ret.append([ip,port,type,area])
        if indebug:print '12',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################


def build_list_urls_13(page=3):
    url='http://www.checkedproxylists.com/'
    html=get_html(url)
    matchs=re.findall(r"""
        href\='([^']+)'>(?:high_anonymous|anonymous|transparent)
        \sproxy\slist<\/a>""",html,re.VERBOSE)
    return map(lambda x: url+x, matchs)

def parse_page_13(html=''):
    html_matches=re.findall(r"eval\(unescape\('([^']+)'\)",html)
    conent=''    # stays empty when the page has no escaped block, so nothing is parsed
    if (len(html_matches)>0):
        conent=urllib.unquote(html_matches[0])
    matches=re.findall(r"""<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})<\/td>
            <td>(\d{2,5})<\/td><\/tr>""",conent,re.VERBOSE)
    ret=[]
    if   (html.find('<title>Checked Proxy Lists - proxylist_high_anonymous_')>0):
        type=2
    elif (html.find('<title>Checked Proxy Lists - proxylist_anonymous_')>0):
        type=1
    elif (html.find('<title>Checked Proxy Lists - proxylist_transparent_')>0):
        type=0
    else:
        type=-1

    for match in matches:
        ip=match[0]
        port=match[1]
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '13',ip,port,type,area
    return ret

################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################



# Worker thread class

class TEST(threading.Thread):
    def __init__(self,action,index=None,checklist=None):
        threading.Thread.__init__(self)
        self.index =index
        self.action=action
        self.checklist=checklist

    def run(self):
        if (self.action=='getproxy'):
            get_proxy_one_website(self.index)
        else:
            check_proxy(self.index,self.checklist)


def check_proxy(index,checklist=[]):
    for item in checklist:
        check_one_proxy(item[0],item[1])

def patch_check_proxy(threadCount,action=''):
    global check_in_one_call,skip_check_in_hour,conn
    threads=[]
    if   (action=='checknew'):        # check newly added proxies that have never been checked
        orderby=' `time_added` desc '
        strwhere=' `active` is null '
    elif (action=='checkok'):         # re-check proxies that passed verification before
        orderby=' `time_checked` asc '
        strwhere=' `active`=1 '
    elif (action=='checkfail'):       # re-check proxies that failed verification before
        orderby=' `time_checked` asc '
        strwhere=' `active`=0 '
    else:                             # check all proxies
        orderby=' `time_checked` asc '
        strwhere=' 1=1 '
    sql="""
           select `ip`,`port` FROM `proxier` where
                 `time_checked` < (unix_timestamp()-%(skip_time)01s)
                 and %(strwhere)01s
                 order by %(order)01s
                 limit %(num)01d
        """%{     'num':check_in_one_call,
             'strwhere':strwhere,
                'order':orderby,
            'skip_time':skip_check_in_hour*3600}
    conn.execute(sql)
    rows = conn.fetchall()

    check_in_one_call=len(rows)

    # work out how many proxies each thread will check
    if len(rows)>=threadCount:
        num_in_one_thread=len(rows)/threadCount
    else:
        num_in_one_thread=1

    threadCount=threadCount+1
    print "Now verifying the following proxy servers....."
    for index in range(1,threadCount):
        # give each thread its checklist; the last thread takes whatever is left over
        checklist=rows[(index-1)*num_in_one_thread:index*num_in_one_thread]
        if (index+1==threadCount):
            checklist=rows[(index-1)*num_in_one_thread:]

        t=TEST(action,index,checklist)
        t.setDaemon(True)
        t.start()
        threads.append((t))
    for thread in threads:
        thread.join(60)
    update_proxies()        # write all check results to the database


def get_proxy_one_website(index):
    global proxy_array
    func='build_list_urls_'+str(index)
    parse_func=eval('parse_page_'+str(index))
    urls=eval(func+'()')
    for url in urls:
        html=get_html(url)
        print url
        proxylist=parse_func(html)
        for proxy in proxylist:
            ip=string.strip(proxy[0])
            port=string.strip(proxy[1])
            if (re.compile("^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$").search(ip)):
                type=str(proxy[2])
                area=string.strip(proxy[3])
                proxy_array.append([ip,port,type,area])

def get_all_proxies():
    global web_site_count,conn,skip_get_in_hour

    # check when proxies were last added, to avoid fetching repeatedly within a short time
    rs=conn.execute("select max(`time_added`) from `proxier` limit 1")
    last_add=rs.fetchone()[0]
    if (last_add and my_unix_timestamp()-last_add<skip_get_in_hour*3600):
        print """
 Skipping the proxy list fetch!
 The most recent fetch was at: %(t)1s
 That is less than the minimum interval between fetches: %(n)1d hours
 To fetch right now anyway, change the global variable skip_get_in_hour
"""%{'t':formattime(last_add),'n':skip_get_in_hour}
        return

    print "Now fetching proxy lists from the following "+str(web_site_count)+" sites...."
    threads=[]
    count=web_site_count+1
    for index in range(1,count):
        t=TEST('getproxy',index)
        t.setDaemon(True)
        t.start()
        threads.append((t))
    for thread in threads:
        thread.join(60)
    add_proxies_to_db()

def add_proxies_to_db():
    global proxy_array
    count=len(proxy_array)
    for i in range(count):
        item=proxy_array[i]
        # the statement is built on one logical line so that no newline ends up inside the ip value
        sql="insert into `proxier` (`ip`,`port`,`type`,`time_added`,`area`) values('"+\
            item[0]+"',"+item[1]+","+item[2]+",unix_timestamp(),'"+clean_string(item[3])+"')"
        try:
            conn.execute(sql)
            # a literal percent sign in a format string is written %%
            print "%(num)2.1f%%\t"%{'num':100*(i+1)/count},item[0],":",item[1]
        except:
            pass


def update_proxies():
    global update_array
    for item in update_array:
        sql='''
             update `proxier` set `time_checked`=unix_timestamp(),
                `active`=%(active)01d,
                 `speed`=%(speed)02.3f
                 where `ip`='%(ip)01s' and `port`=%(port)01d
            '''%{'active':item[2],'speed':item[3],'ip':item[0],'port':item[1]}
        try:
            conn.execute(sql)
        except:
            pass

# sqlite has no unix_timestamp function, so we implement it ourselves
def my_unix_timestamp():
    return int(time.time())

def clean_string(s):
    tmp=re.sub(r"['\,\s\\\/]", '', s)
    return re.sub(r"\s+", '', tmp)

def formattime(t):
    # timestamps are shown in UTC+8
    return time.strftime('%c',time.gmtime(t+8*3600))

def open_database():
    global db,conn,day_keep,dbfile

    try:
        from pysqlite2 import dbapi2 as sqlite
    except:
        print """
        This program stores its data in sqlite and needs pysqlite to run.
        To access sqlite from Python, download the pysqlite module (272kb) from
        http://initd.org/tracker/pysqlite/wiki/pysqlite#Downloads
        and choose "Windows binaries for Python 2.x"
        """
        raise SystemExit

    try:
        db = sqlite.connect(dbfile,isolation_level=None)
        db.create_function("unix_timestamp", 0, my_unix_timestamp)
        conn  = db.cursor()
    except:
        print "Could not open the sqlite database; make sure the script's directory is writable"
        raise SystemExit

    sql="""
       /* ip:     only proxies with a plain IP address (xxx.xxx.xxx.xxx) */
       /* type:   proxy type  2: high anonymity  1: anonymous  0: transparent  -1: unknown */
       /* status: not used by the program yet; reserved for future extension */
       /* active: whether the proxy works  1: yes  0: no  */
       /* speed:  response time of the request; the smaller the value, the faster the proxy */

        CREATE TABLE IF NOT EXISTS  `proxier` (
          `ip` varchar(15) NOT NULL default '',
          `port` int(6)  NOT NULL default '0',
          `type` int(11) NOT NULL default '-1',
          `status` int(11) default '0',
          `active` int(11) default NULL,
          `time_added` int(11)  NOT NULL default '0',
          `time_checked` int(11) default '0',
          `time_used` int(11)  default '0',
          `speed` float default NULL,
          `area` varchar(120) default '--',      /*  location of the proxy server */
          PRIMARY KEY (`ip`)
        );
        /*
        CREATE INDEX IF NOT EXISTS `type`        ON proxier(`type`);
        CREATE INDEX IF NOT EXISTS `time_used`   ON proxier(`time_used`);
        CREATE INDEX IF NOT EXISTS `speed`       ON proxier(`speed`);
        CREATE INDEX IF NOT EXISTS `active`      ON proxier(`active`);
        */
        PRAGMA encoding = "utf-8";      /* store the database in utf-8 */
    """
    conn.executescript(sql)
    conn.execute("""DELETE FROM `proxier`
                        where `time_added`< (unix_timestamp()-?)
                        and `active`=0""",(day_keep*86400,))

    conn.execute("select count(`ip`) from `proxier`")
    m1=conn.fetchone()[0]
    if m1 is None:return

    conn.execute("""select count(`time_checked`)
                        from `proxier` where `time_checked`>0""")
    m2=conn.fetchone()[0]

    if m2==0:
        m3,m4,m5=0,"not checked yet","not checked yet"
    else:
        conn.execute("select count(`active`) from `proxier` where `active`=1")
        m3=conn.fetchone()[0]
        conn.execute("""select max(`time_checked`), min(`time_checked`)
                             from `proxier` where `time_checked`>0 limit 1""")
        rs=conn.fetchone()
        m4,m5=rs[0],rs[1]
        m4=formattime(m4)
        m5=formattime(m5)
    print """
    %(m1)1d proxies in total; %(m2)1d have been checked and %(m3)1d of them are valid.
            Most recent check: %(m4)1s
            Oldest check:      %(m5)1s
    Tip: proxies that were last checked more than 24 hours ago should be checked again
    """%{'m1':m1,'m2':m2,'m3':m3,'m4':m4,'m5':m5}

def close_database():
    global db,conn
    conn.close()
    db.close()
    conn=None
    db=None

if __name__ == '__main__':
    open_database()
    get_all_proxies()
    patch_check_proxy(thread_num)
    output_file()
    close_database()
    print "All done"
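The listing finds each site's two functions by building their names and passing them to `eval`, which is why `web_site_count` has to be kept in step by hand. One possible alternative, shown here only as a sketch under assumptions (Python 3, a hypothetical `SITES` registry, and a placeholder site on the reserved `example.invalid` domain), is to store the function pairs in a dictionary and look them up by index.

```python
def build_list_urls_1(page=2):
    # Hypothetical site: example.invalid is a placeholder domain.
    return ['http://example.invalid/list?page=%d' % i for i in range(1, page + 1)]

def parse_page_1(html=''):
    # Stub parser: a real one would extract [ip, port, type, area] from html.
    return [['10.0.0.1', '8080', -1, '--']] if html else []

# Registry of (url builder, page parser) pairs, keyed by site index.
# Adding a site means adding one entry; len(SITES) plays the role of web_site_count.
SITES = {
    1: (build_list_urls_1, parse_page_1),
}

def site_functions(index):
    """Return the (build_list_urls, parse_page) pair for a site, without eval."""
    return SITES[index]

build, parse = site_functions(1)
urls = build()
```

With this layout a missing site raises `KeyError` at lookup time instead of a `NameError` inside `eval`.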
