把wikipedia中文数据库导入mysql
1.在官网上下最新版的page-article.xml
http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
2.在mysql中创建库和表(例如:wikipedia_zh_20121123)
mysql> create database wikipedia_zh_20121123 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci; Query OK, 1 row affected (0.00 sec) mysql> use wikipedia_zh_20121123 Database changed mysql> CREATE TABLE page ( page_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, page_namespace int NOT NULL, page_title varchar(255) binary NOT NULL, page_restrictions tinyblob NOT NULL, page_counter bigint unsigned NOT NULL default 0, page_is_redirect tinyint unsigned NOT NULL default 0, page_is_new tinyint unsigned NOT NULL default 0, page_random real unsigned NOT NULL, page_touched binary(14) NOT NULL default '', page_latest int unsigned NOT NULL, page_len int unsigned NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8; Query OK, 0 rows affected (0.10 sec) mysql> CREATE TABLE revision ( rev_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, rev_page int unsigned NOT NULL, rev_text_id int unsigned NOT NULL, rev_comment tinyblob NOT NULL, rev_user int unsigned NOT NULL default 0, rev_user_text varchar(255) binary NOT NULL default '', rev_timestamp binary(14) NOT NULL default '', rev_minor_edit tinyint unsigned NOT NULL default 0, rev_deleted tinyint unsigned NOT NULL default 0, rev_len int unsigned, rev_parent_id int unsigned default NULL, rev_sha1 varbinary(32) NOT NULL default '' ) ENGINE=InnoDB DEFAULT CHARSET=utf8 MAX_ROWS=10000000 AVG_ROW_LENGTH=1024; Query OK, 0 rows affected (0.09 sec) mysql> CREATE TABLE text ( old_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, old_text mediumblob NOT NULL, old_flags tinyblob NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8 MAX_ROWS=10000000 AVG_ROW_LENGTH=10240; Query OK, 0 rows affected (0.07 sec) mysql> ALTER TABLE revision MODIFY COLUMN rev_comment BLOB NOT null -> ; Query OK, 0 rows affected (0.20 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> ALTER TABLE text DEFAULT CHARACTER SET BINARY; Query OK, 0 rows affected (0.17 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> ALTER TABLE page DEFAULT CHARACTER SET BINARY; Query OK, 0 rows affected (0.18 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> ALTER TABLE revision DEFAULT CHARACTER SET BINARY; Query OK, 0 rows affected (0.18 sec) Records: 0 Duplicates: 0 Warnings: 0
mwdumper修复了一个bug,但是第三方提供的mwdumper.jar没有及时更新,直接用jar包导入最新的wikipedia数据运行到一半就挂了,所以需要自己编译
可以先用mwdumper.jar试试看,如果出错了再自行编译。
3.下载mwdumper工程
git clone https://gerrit.wikimedia.org/r/p/mediawiki/tools/mwdumper.git
4.编译mwdumper:在mwdumper目录下
mvn install
5.用eclipse导入mwdumper工程,运行Dumper类
run configuration:
program argument:--output=mysql://localhost/wikipedia_zh_20121123?user=root&password=xxxx&characterEncoding=utf8 --format=sql:1.5 zhwiki-latest-pages-articles.xml.bz2 vm argument:-Xms128m -Xmx1000