把wikipedia中文数据库导入mysql

1.在官网上下最新版的page-article.xml

http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

2.在mysql中创建库和表(例如:wikipedia_zh_20121123)

mysql> create database wikipedia_zh_20121123 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> use wikipedia_zh_20121123
Database changed

mysql>  CREATE TABLE  page (         page_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,             page_namespace int NOT NULL,          page_title varchar(255) binary NOT NULL,          page_restrictions tinyblob NOT NULL,       page_counter bigint unsigned NOT NULL default 0,       page_is_redirect tinyint unsigned NOT NULL default 0,          page_is_new tinyint unsigned NOT NULL default 0,       page_random real unsigned NOT NULL,                   page_touched binary(14) NOT NULL default '',             page_latest int unsigned NOT NULL,       page_len int unsigned NOT NULL )  ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.10 sec)

mysql> CREATE TABLE  revision (      rev_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,       rev_page int unsigned NOT NULL,                rev_text_id int unsigned NOT NULL,             rev_comment tinyblob NOT NULL,          rev_user int unsigned NOT NULL default 0,       rev_user_text varchar(255) binary NOT NULL default '',       rev_timestamp binary(14) NOT NULL default '',          rev_minor_edit tinyint unsigned NOT NULL default 0,       rev_deleted tinyint unsigned NOT NULL default 0,       rev_len int unsigned,          rev_parent_id int unsigned default NULL,       rev_sha1 varbinary(32) NOT NULL default ''  )  ENGINE=InnoDB DEFAULT CHARSET=utf8 MAX_ROWS=10000000 AVG_ROW_LENGTH=1024;
Query OK, 0 rows affected (0.09 sec)

mysql> CREATE TABLE  text (                  old_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,          old_text mediumblob NOT NULL,                               old_flags tinyblob NOT NULL )  ENGINE=InnoDB DEFAULT CHARSET=utf8 MAX_ROWS=10000000 AVG_ROW_LENGTH=10240;
Query OK, 0 rows affected (0.07 sec)


mysql> ALTER TABLE revision MODIFY COLUMN rev_comment BLOB NOT null
    -> ;
Query OK, 0 rows affected (0.20 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> ALTER TABLE text DEFAULT CHARACTER SET BINARY;
Query OK, 0 rows affected (0.17 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE page DEFAULT CHARACTER SET BINARY; 
Query OK, 0 rows affected (0.18 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql>  ALTER TABLE revision DEFAULT CHARACTER SET BINARY;
Query OK, 0 rows affected (0.18 sec)
Records: 0  Duplicates: 0  Warnings: 0

 

mwdumper修复了一个bug,但是第三方提供的mwdumper.jar没有及时更新,直接用jar包导入最新的wikipedia数据运行到一半就挂了,所以需要自己编译

可以先用mwdumper.jar试试看,如果出错了再自行编译。

3.下载mwdumper工程

git clone https://gerrit.wikimedia.org/r/p/mediawiki/tools/mwdumper.git

4.编译mwdumper:在mwdumper目录下
mvn install

5.用eclipse导入mwdumper工程,运行Dumper类

run configuration:

program argument:--output=mysql://localhost/wikipedia_zh_20121123?user=root&password=xxxx&characterEncoding=utf8 --format=sql:1.5 zhwiki-latest-pages-articles.xml.bz2
vm argument:-Xms128m -Xmx1000

  

posted @ 2012-11-23 15:29  iaac  阅读(1451)  评论(0编辑  收藏  举报