Autocorrect

A summary of https://medium.com/@sarthfrey/https-medium-com-prcobol-the-anatomy-of-autocorrect-9671cecad4b1#.gthtpsfo9

Background knowledge:

1、Edit distance
2、Bayes' rule:

P(right|error) = P(error|right) * P(right) / P(error)

3、Assuming that the intended word lies within 2 edits of the error word is not a bad assumption: approximately 75% of errors are within edit distance 1, and nearly all of them are within edit distance 2. As a simple estimate, 75% accuracy for one suggestion gives 98.4% accuracy for 3 suggestions (100*(1-0.25^3)), if each suggestion is treated as missing independently with probability 0.25.
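
To make items 1 and 3 concrete, here is a minimal sketch of edit distance. It assumes the plain Levenshtein definition (insert, delete, substitute); the original article may also count a swap of two adjacent letters as one edit, which this sketch does not. The function name edit_distance is my own.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(edit_distance("speling", "spelling"))  # 1
print(edit_distance("kitten", "sitting"))    # 3

# The arithmetic from item 3: three suggestions, each missing with probability 0.25.
print(round(100 * (1 - 0.25 ** 3), 1))       # 98.4
```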

Attempt 1:

1、Check whether the error word is valid English; if so, return it, otherwise proceed.
2、Find the valid word at edit distance 1 from the error word that occurs most often in the corpus and return it; if none can be found, proceed.
3、Find the valid word within edit distance 2 of the error word that occurs most often in the corpus and return it; if none can be found, proceed.
4、The spelling corrector has failed; return the error word.
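
The four steps above can be sketched roughly as follows. This is only an illustration: the word counts in WORD_COUNTS are made up, the helper names (edits1, known, correct) are my own, and I assume a swap of two adjacent letters counts as one edit.

```python
from collections import Counter

# Made-up counts for illustration; a real corrector would count words in a large corpus.
WORD_COUNTS = Counter({"the": 500, "spelling": 20, "spewing": 2,
                       "corrected": 8, "connected": 15})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the strings that are valid words in the corpus."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    # Step 1: a valid word is returned unchanged.
    if word in WORD_COUNTS:
        return word
    # Step 2: valid words at edit distance 1.
    candidates = known(edits1(word))
    # Step 3: otherwise, valid words reachable with 2 edits.
    if not candidates:
        candidates = known(e2 for e1 in edits1(word) for e2 in edits1(e1))
    # Step 4: the corrector has failed, return the input.
    if not candidates:
        return word
    # Among the candidates, return the one that occurs most in the corpus.
    return max(candidates, key=WORD_COUNTS.get)


print(correct("speling"))    # spelling (beats "spewing" on corpus count)
print(correct("korrectud"))  # corrected (found at 2 edits)
```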

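Attempt 2:

Use item 2 above (Bayes' rule): among the candidate words w for the error word x, return the one that maximizes P(x|w)*P(w), where P(x|w) is the error model and P(w) is the language model.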
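
A minimal sketch of this idea, with made-up probabilities (P_WORD, P_ERROR and best_correction are hypothetical names and numbers, not taken from the article). The denominator P(error) from Bayes' rule is the same for every candidate, so it can be dropped when we only compare candidates.

```python
# Hypothetical toy numbers, not from the article or from any real corpus.
P_WORD = {"the": 0.05, "thaw": 0.00001, "threw": 0.0001}  # language model P(w)
P_ERROR = {                                               # error model P(x|w)
    ("thew", "the"): 0.0001,    # an extra "w" was typed
    ("thew", "thaw"): 0.001,    # "e" was typed instead of "a"
    ("thew", "threw"): 0.0002,  # the "r" was dropped
}


def best_correction(x, candidates):
    """Return the candidate w that maximizes P(x|w) * P(w)."""
    return max(candidates, key=lambda w: P_ERROR[(x, w)] * P_WORD[w])


print(best_correction("thew", ["the", "thaw", "threw"]))  # the
```

With these numbers "thaw" has the most likely typo, but "the" is so much more common that it still wins.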

Attempt 3:

Add a parameter α that weights the language model: we raise P(w) to the power α, so that we are now finding the w that maximizes P(x|w)*P(w)^α.
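
To see what α does, here is a small sketch with made-up probabilities (all names and numbers are hypothetical). It scores in log space, where P(x|w)*P(w)^α becomes log P(x|w) + α*log P(w); this is just a common way to avoid multiplying very small numbers and gives the same ranking.

```python
import math

# Hypothetical toy numbers, for illustration only.
P_WORD = {"the": 0.05, "thaw": 0.00001}                      # language model P(w)
P_ERROR = {("thew", "the"): 0.0001, ("thew", "thaw"): 0.001}  # error model P(x|w)


def score(x, w, alpha):
    """log of P(x|w) * P(w)^alpha."""
    return math.log(P_ERROR[(x, w)]) + alpha * math.log(P_WORD[w])


def best_correction(x, candidates, alpha=1.0):
    return max(candidates, key=lambda w: score(x, w, alpha))


print(best_correction("thew", ["the", "thaw"], alpha=1.0))  # the
print(best_correction("thew", ["the", "thaw"], alpha=0.1))  # thaw
```

A small α makes the corrector trust the error model more, a large α makes it prefer common words. In practice α would be picked by trying several values on test data.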

What's more:

What if the suitable correction for our error word is at edit distance 2? The error model multiplies the first edit probability by the second, so we almost never select corrections at more than 1 edit distance. We can raise the second edit probability to the power β and choose β by testing, as we did for α.
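
A small sketch of the β idea, again with made-up numbers (EDIT_PROBS, P_WORD and the function names are hypothetical). Since an edit probability is below 1, raising it to a power β smaller than 1 makes it larger, so two-edit corrections are penalized less.

```python
# Hypothetical toy numbers, for illustration only.
P_WORD = {"acres": 0.0001, "address": 0.0005}  # language model P(w)
EDIT_PROBS = {                                 # per-edit probabilities for x = "adres"
    "acres": [0.001],          # 1 edit: "d" typed instead of "c"
    "address": [0.01, 0.01],   # 2 edits: a "d" and an "s" were dropped
}


def error_prob(edit_probs, beta=1.0):
    """P(x|w): first edit probability times the second one raised to beta."""
    p = edit_probs[0]
    if len(edit_probs) > 1:
        p *= edit_probs[1] ** beta
    return p


def best_correction(candidates, beta=1.0):
    return max(candidates,
               key=lambda w: error_prob(EDIT_PROBS[w], beta) * P_WORD[w])


print(best_correction(["acres", "address"], beta=1.0))  # acres
print(best_correction(["acres", "address"], beta=0.5))  # address
```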


Future attempt:

Use context information, with models such as Markov chains or RNNs.
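
As a rough idea of what using context could look like, here is a sketch of a first-order Markov chain over words (a bigram model) that picks the candidate most likely to follow the previous word. The counts are made up, the names are hypothetical, and the error model is left out to keep the example short.

```python
from collections import Counter

# Hypothetical bigram counts; a real model would be estimated from a corpus.
BIGRAMS = Counter({("in", "the"): 100, ("in", "thaw"): 1,
                   ("spring", "thaw"): 8, ("spring", "the"): 2})
VOCAB = {"in", "spring", "the", "thaw"}


def p_next(prev, w):
    """Add-one smoothed bigram probability P(w | prev)."""
    total = sum(c for (p, _), c in BIGRAMS.items() if p == prev)
    return (BIGRAMS[(prev, w)] + 1) / (total + len(VOCAB))


def best_in_context(prev, candidates):
    """Pick the candidate that most likely follows the previous word."""
    return max(candidates, key=lambda w: p_next(prev, w))


print(best_in_context("in", ["the", "thaw"]))      # the
print(best_in_context("spring", ["the", "thaw"]))  # thaw
```

The same error word can get different corrections depending on the word before it, which is something the earlier attempts cannot do.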

posted @ 2016-12-08 15:58  StevenLuke