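Nutch 2.2.1 + MySQL: garbled characters (encoding problems)
I recently set up a distributed Nutch 2.2.1 + MySQL + Solr 4.5 environment and ran into problems saving data to MySQL, which caused Hadoop to exit abnormally.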
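1. An earlier post covered the MySQL configuration, which requires changes to gora.properties and other settings; see that post for details. Small crawls ran without problems and saved normally, but after crawling more pages an exception appeared. Even with the encoding already set to utf8, MySQL still reported this error: Incorrect string value: '\xF0\x90\x8D\x83\xF0\x90...' for column
I looked through several posts, and one of them explains the cause: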
MySQL's utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8. Here you have a character that needs 4 bytes: \xF0\x90\x8D\x83 (U+10343 GOTHIC LETTER SAUIL).
If you have MySQL 5.5 or later you can change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.
You may also have to set the server property character_set_server to utf8mb4 in the MySQL configuration file. It seems that Connector/J defaults to 3-byte Unicode otherwise.
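To see which characters trigger this error, it helps to check how many bytes each one takes in UTF-8. The following is only a minimal sketch: the helper name `four_byte_chars` is made up for illustration and is not part of Nutch, Gora, or MySQL.

```python
def four_byte_chars(text):
    """Return the characters in text that need 4 bytes in UTF-8.

    These are the code points above U+FFFF, which MySQL's 3-byte utf8
    cannot store but utf8mb4 can.
    """
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# U+10343 GOTHIC LETTER SAUIL, the character from the error message.
sauil = "\U00010343"
print(sauil.encode("utf-8"))           # b'\xf0\x90\x8d\x83'
print(four_byte_chars("abc" + sauil))  # a list with the one offending character
print(four_byte_chars("中文 text"))     # [] because these CJK characters fit in 3 bytes
```

Running a check like this over a sample of crawled page text would show whether a crawl contains characters that a utf8 column will reject.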
For how to change the encoding to utf8mb4, see this post: http://www.cnblogs.com/vincentchan/archive/2012/09/25/2701266.html
So when creating the webpage table, I defined the text column as follows: `text` longtext CHARACTER SET utf8mb4 DEFAULT NULL. That solved the problem.
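2. Next came a "too long" error on the id. By default, InnoDB limits an index key to 767 bytes. In Nutch, the id of the webpage table is the URL, and URLs can contain Unicode characters, so the id column also has to be utf8mb4. Since utf8mb4 reserves up to 4 bytes per character, the id can be at most 767/4 = 191 characters. Some URLs are longer than that, which raises a "too long" exception on save. This can be fixed by changing the database settings.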
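The arithmetic behind the 191-character limit can be sketched as follows. The function names and the example URL are hypothetical and only illustrate the calculation.

```python
def max_indexed_chars(limit_bytes, bytes_per_char):
    """Longest VARCHAR, in characters, whose worst case fits the index limit."""
    return limit_bytes // bytes_per_char


INDEX_LIMIT = 767  # bytes, the default index key limit described above

print(max_indexed_chars(INDEX_LIMIT, 3))  # 255 with 3-byte utf8
print(max_indexed_chars(INDEX_LIMIT, 4))  # 191 with utf8mb4


def fits_in_index(url, limit_chars=191):
    """True if the URL is short enough to be stored as an indexed utf8mb4 id."""
    return len(url) <= limit_chars


# A made-up URL of 219 characters, longer than the 191-character cap.
long_url = "http://example.com/" + "a" * 200
print(fits_in_index(long_url))  # False
```

This also shows why the same column worked as varchar(255) under utf8 but no longer fits once the table is switched to utf8mb4.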
Open the configuration file with vim /etc/mysql/my.cnf and add the following settings under [mysqld] to raise the maximum length of the primary key id.
[mysqld]
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
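Before restarting MySQL, one way to double-check the file is a small script like the sketch below. It assumes the configuration can be read with Python's configparser, which is not a full my.cnf parser: it does not follow include directives, and it compares values literally, so `ON` or `1` would be reported as different from `true`. The function name `missing_settings` is made up for illustration.

```python
import configparser

# The [mysqld] settings listed above. Hyphens are written as underscores,
# since MySQL accepts both spellings in option names.
REQUIRED = {
    "innodb_file_format": "barracuda",
    "innodb_file_per_table": "true",
    "innodb_large_prefix": "true",
    "character_set_server": "utf8mb4",
    "collation_server": "utf8mb4_unicode_ci",
}


def missing_settings(cnf_text):
    """Return the required [mysqld] settings that are absent or different."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    section = parser["mysqld"] if parser.has_section("mysqld") else {}
    # Normalize option names so character-set-server matches character_set_server.
    found = {k.replace("-", "_"): (v or "") for k, v in section.items()}
    return {
        key: value
        for key, value in REQUIRED.items()
        if found.get(key, "").strip().lower() != value
    }


sample = """
[mysqld]
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
"""
print(missing_settings(sample))  # {} when every setting is present
```

An empty result means all five settings were found with the expected values; anything else lists what still needs to be added.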
The final webpage table definition is as follows:
CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` longtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `prevModifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8mb4;
posted on 2013-11-14 20:02 fengjiaoan