A Simple Word Segmentation Algorithm

2008-08-05 09:10 Virus-BeautyCode

A simple English word segmentation program

Reposted from: http://west263.com/info/html/chengxusheji/Javajishu/20080404/57978.html

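My first task in the lab was to write an English word segmentation program. It has to turn input such as: Books in tuneBoxes are for Chinese-Children! into: Book in tune Box are for Chinese child. In other words, it converts plurals to singular, splits run-together words at each capitalized initial letter, and so on. The plural-to-singular handling should be fairly thorough and covers the vast majority of cases. The splitting on capital letters is less well thought out: a word like NEC clearly should be kept whole, but here it is split into three words. I am trying to improve this.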
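
One possible way to handle the NEC case is to split only where a lowercase letter is followed by a capital, or where a run of capitals ends and a normal capitalized word begins. The following is a minimal sketch under that assumption, not the original author's fix; the class name CapitalSplitSketch and the method splitOnCapitals are hypothetical names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CapitalSplitSketch {
    /**
     * Splits a delimiter-free token at lowercase-to-uppercase boundaries,
     * and before the last capital of an uppercase run that is followed by
     * a lowercase letter. A run of capitals such as "NEC" stays whole.
     */
    static List<String> splitOnCapitals(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            char prev = token.charAt(i - 1);
            char cur = token.charAt(i);
            // "tuneBoxes": boundary between 'e' and 'B'
            boolean lowerToUpper = Character.isLowerCase(prev)
                    && Character.isUpperCase(cur);
            // "NECBoxes": boundary between 'C' and 'B', because 'B' starts "Boxes"
            boolean endOfAcronym = Character.isUpperCase(prev)
                    && Character.isUpperCase(cur)
                    && i + 1 < token.length()
                    && Character.isLowerCase(token.charAt(i + 1));
            if (lowerToUpper || endOfAcronym) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        if (start < token.length()) {
            parts.add(token.substring(start));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnCapitals("tuneBoxes")); // [tune, Boxes]
        System.out.println(splitOnCapitals("NEC"));       // [NEC]
        System.out.println(splitOnCapitals("NECBoxes"));  // [NEC, Boxes]
    }
}
```

The trade-off is that a single capital in front of a capitalized word, as in "ABook", is still split off as its own token. The original method, which splits NEC into three words, follows.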

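// Imports required by this method:
import java.util.Enumeration;
import java.util.StringTokenizer;
import java.util.Vector;

/**
 * Splits an English string into words.
 *
 * @param source
 *            the string to split
 * @return String[] the resulting words
 */
public String[] fenci(String source) {
    /* The set of delimiter characters */
    String delimiters = " \t\n\r\f~!@#$%^&*()_+|`1234567890-=\\{}[]:\";'<>?,./'";

    /* Split on the delimiters */
    StringTokenizer stringTokenizer = new StringTokenizer(source,
            delimiters);
    Vector vector = new Vector();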
    /* Split on capitalized initial letters */
    while (stringTokenizer.hasMoreTokens()) {
        String token = stringTokenizer.nextToken();
        while (token.length() > 0) {
            /* Skip the first character, then advance past lowercase letters */
            int index = 1;
            while (index < token.length()
                    && Character.isLowerCase(token.charAt(index))) {
                index++;
            }
            vector.addElement(token.substring(0, index));
            // System.out.println("Recognized: " + token.substring(0, index));
            token = token.substring(index);
            // System.out.println("Remaining: " + token);
        }
    }

    /*
     * The plural-to-singular rules follow this document:
     * http://ftp.haie.edu.cn/Resource/GZ/GZYY/DCYFWF/NJSYYY/421b0061ZW_0015.htm
     */
    for (int i = 0; i < vector.size(); i++) {
        String token = (String) vector.elementAt(i);
        if (token.equalsIgnoreCase("feet")) {
            token = "foot";
        } else if (token.equalsIgnoreCase("geese")) {
            token = "goose";
        } else if (token.equalsIgnoreCase("lice")) {
            token = "louse";
        } else if (token.equalsIgnoreCase("mice")) {
            token = "mouse";
        } else if (token.equalsIgnoreCase("teeth")) {
            token = "tooth";
        } else if (token.equalsIgnoreCase("oxen")) {
            token = "ox";
        } else if (token.equalsIgnoreCase("children")) {
            token = "child";
        } else if (token.endsWith("men")) {
            token = token.substring(0, token.length() - 3) + "man";
        } else if (token.endsWith("ies")) {
            token = token.substring(0, token.length() - 3) + "y";
        } else if (token.endsWith("ves")) {
            if (token.equalsIgnoreCase("knives")
                    || token.equalsIgnoreCase("wives")
                    || token.equalsIgnoreCase("lives")) {
                token = token.substring(0, token.length() - 3) + "fe";
            } else {
                token = token.substring(0, token.length() - 3) + "f";
            }
        } else if (token.endsWith("oes") || token.endsWith("ches")
                || token.endsWith("shes") || token.endsWith("ses")
                || token.endsWith("xes")) {
            token = token.substring(0, token.length() - 2);
        } else if (token.endsWith("s")) {
            token = token.substring(0, token.length() - 1);
        }

        /* Store the processed word back */
        vector.setElementAt(token, i);
    }

    /* Convert to an array */
    String[] array = new String[vector.size()];
    Enumeration enumeration = vector.elements();
    int index = 0;
    while (enumeration.hasMoreElements()) {
        array[index] = (String) enumeration.nextElement();
        index++;
    }

    /* Print the result */
    for (int i = 0; i < array.length; i++) {
        System.out.println(array[i]);
    }

    /* Return */
    return array;
}
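
The chain of irregular plurals in the method above could also be kept in a lookup table, which makes new entries a one-line change. The following is a rough sketch of that idea, applying the same rules in the same order as the method above; the class name SingularizeSketch and the method singularize are hypothetical and do not appear in the original program.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SingularizeSketch {
    /** Irregular plurals from the method above, keyed in lowercase. */
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    /** Words ending in "ves" whose singular ends in "fe" rather than "f". */
    private static final Set<String> FE_WORDS =
            new HashSet<>(Arrays.asList("knives", "wives", "lives"));

    static {
        IRREGULAR.put("feet", "foot");
        IRREGULAR.put("geese", "goose");
        IRREGULAR.put("lice", "louse");
        IRREGULAR.put("mice", "mouse");
        IRREGULAR.put("teeth", "tooth");
        IRREGULAR.put("oxen", "ox");
        IRREGULAR.put("children", "child");
    }

    static String singularize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        String irregular = IRREGULAR.get(lower);
        if (irregular != null) {
            return irregular; // always lowercase, as in the method above
        }
        int n = token.length();
        if (token.endsWith("men")) {
            return token.substring(0, n - 3) + "man";
        }
        if (token.endsWith("ies")) {
            return token.substring(0, n - 3) + "y";
        }
        if (token.endsWith("ves")) {
            return token.substring(0, n - 3) + (FE_WORDS.contains(lower) ? "fe" : "f");
        }
        if (token.endsWith("oes") || token.endsWith("ches")
                || token.endsWith("shes") || token.endsWith("ses")
                || token.endsWith("xes")) {
            return token.substring(0, n - 2);
        }
        if (token.endsWith("s")) {
            return token.substring(0, n - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        // The words of the sample sentence after delimiter and capital splitting
        String[] words = {"Books", "in", "tune", "Boxes", "are", "for", "Chinese", "Children"};
        for (String word : words) {
            System.out.println(singularize(word));
        }
    }
}
```

Because the suffix rules are unchanged, this sketch shares the limitations of the method above: any word that merely ends in "s", such as "this", is shortened as well.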
