Automatically Generating Short Texts

Let's start with a passage of English:

This is the promise that our bravest warriors have protected for all of our citizens in centuries since, a long time ago. Sadly, many around the world celebrate the beginning of Holy Week, ISIS murdered at least 45 people and injured over 100 others at two Christian churches in Egypt. We condemn this barbaric attack. We mourn for those who lost loved ones. And we pray for the strength and wisdom to achieve a better tomorrow—one where good people of all citizens to enjoy safety and peace—and to work and live with the dignity that all Children of God are entitled to know. As long as we have faith in each other, and trust in God, we will succeed. Thank you. Have a Happy Passover. God bless America.

This passage, which reads fairly fluently, was not written by a person: a computer generated it automatically after learning from a speech transcript.

So how can we get a computer to learn from human-written articles and speech, and then generate new articles or sentences of its own?

Tip: for simplicity, we use only English for input and output. Chinese works similarly, but it introduces the more complex problem of word segmentation, which we will not cover here.

To make a computer generate a short text, the first idea is to output basic characters. Let's try emitting random letters, spaces, and punctuation:

  • Random letters

    qujdge.nzycdps.r.udtm p x., mthrjwef.glclexuctezqt

This looks meaningless; there is not a single recognizable word in it.
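The random-character experiment above can be sketched in a few lines of Python (a minimal sketch; the exact character set and length are arbitrary choices):

```python
import random
import string

# Candidate characters: lowercase letters plus a space and a period.
alphabet = string.ascii_lowercase + " ."

def random_chars(n, seed=None):
    """Emit n characters drawn uniformly at random from the alphabet."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(n))

print(random_chars(50))
```

Every character is equally likely, so real words appear only by rare coincidence.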

In that case, let's supply a dictionary and output words picked from it at random:

  • Random words

    pneumology exponents tribune's uninhibitedly musicophobia chronomancy moost interchapter polypoid nondiffident ullages skyriding cohabiter castratory greenishness donnerd solidarized austenitized subarticulateness ablutions

This time we get complete words, but the result is ungrammatical, reads poorly, and expresses nothing at all.
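A sketch of the random-word experiment, assuming Python. The tiny `lexicon` list here is a stand-in; a real run would load a full dictionary file:

```python
import random

# A small stand-in word list; in practice you would load a real
# dictionary file (e.g. a system word list) into this variable.
lexicon = ["pneumology", "exponents", "tribune's", "ablutions", "ullages"]

def random_words(n, seed=None):
    """Emit n words drawn uniformly at random from the lexicon."""
    rng = random.Random(seed)
    return " ".join(rng.choice(lexicon) for _ in range(n))

print(random_words(20))
```

Swapping the word list for a list of complete sentences gives the third experiment below with no other changes.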

Upgrading once more: let's collect a set of common sentences and output complete sentences picked from it at random:

  • Random sentences

    Not to advance is to go back. What do you want to do about supper? Don't press the wrong button. It doesn't really matter. Do you enjoy making me feeling like the dead? You sit here, Tom, and you sit over there, Mary. A drop of sweat ran down Tom's back. I am thinking of changing my job. Don't bite the hand that feeds you. I couldn't disagree with you more. The distinction is not always precise. Tom tried to catch the rabbit. Tom was ashamed of himself. It was nice of him to help. It's wrong to trick people like that. Don't read too much into this. I forgot. Have you been out at all? I'm glad to have passed at all. Why are we voting for Tom?

This time each sentence is fluent, but the sentences stand completely apart from one another with no real connection; the granularity is too coarse and the combinations are not flexible enough.

Is there a better approach? Let's reason from everyday experience: the things we say are, at bottom, learned from other people. We pick up new words, and then, following the way everyone around us speaks (as children we certainly knew no grammar, and even now we hardly pay attention to grammar when learning new expressions), we combine characters, words, and phrases by imitation into the speech we use today. Can we have a computer mimic this process? Let's try:

  • First, from childhood onward, we all start by learning individual words and short phrases (e.g. "mom", "eat", "I like"), so let's set the learning granularity at the word level.

    • Learning source: any short English text

    • Granularity: split the text into words (English makes this easy: just split on whitespace such as spaces and newlines)

      • This is the promise that our bravest warriors have protected for all of our citizens in centuries since, a long time ago.
    • Split it into a word list (think about it: what data structure would you use to store this list?):

      • ['This', 'is', 'the', 'promise', 'that', 'our', 'bravest', 'warriors', 'have', 'protected', 'for', 'all', 'of', 'our', 'citizens', 'in', 'centuries', 'since,', 'a', 'long', 'time', 'ago.']
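The splitting step above can be sketched in Python; a plain list is a natural answer to the data-structure question:

```python
speech = ("This is the promise that our bravest warriors have protected "
          "for all of our citizens in centuries since, a long time ago.")

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines), which is exactly what we need for English.
words = speech.split()

print(words[:5])   # ['This', 'is', 'the', 'promise', 'that']
```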
  • Next, consider that when we learn new words and phrases, we rarely learn them in isolation: they come embedded in a particular environment (context). So while the computer learns words, it should also analyze their contextual relationships.

    • This sounds a bit complicated: human language is endlessly varied, has countless rules, and occasionally invents new expressions (Chinese internet slang like "么么哒" or "厉害了word哥"). What to do? Keep everything simple: we only care about a word's relationship with its neighbors; anything beyond that is too much to handle.
    • Going further: when the computer produces a text, it is repeatedly choosing what to output next, so we simplify even more and care only about which words can follow the current word.
    • With this plan, let's again take the speech transcript as our example and see how it works:
      • 1. Split the article into a word list

      • 2. Learn: build the word-to-word relationships (which words can follow the current word)

      • Note: to handle punctuation reasonably, we treat punctuation marks as part of the word (so "America" and "America." are two different words).

      • Partial results:

      ...
      have -> ['lived', 'a', 'protected', 'felt', 'faith']
      in -> ['houses', 'our', 'centuries', 'Egypt.', 'the', 'each', 'God,']
      beginning, -> ['America']
      Easter -> ['Sunday,']
      Week, -> ['ISIS']
      since, -> ['a']
      Israel -> ['stands']
      strength -> ['and']
      Christian -> ['churches']
      want -> ['to', 'you']
      incredible -> ['people']
      Jewish -> ['families', 'people.', 'People', 'and']
      saw -> ['in']
      45 -> ['people']
      As -> ['families', 'long']
      Nation -> ['with', 'of']
      also -> ['upon', 'want']
      ...
      
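A minimal sketch of this learning step, assuming Python: scan the text in adjacent word pairs and record, for each word, the list of words seen directly after it (duplicates are kept on purpose). The `sample` string here is a made-up fragment, not the full speech:

```python
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text.
    Punctuation stays attached, so 'America' and 'America.' differ."""
    words = text.split()
    chain = defaultdict(list)
    # zip pairs each word with its immediate successor.
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

sample = "we have faith in each other and trust in God"
chain = build_chain(sample)
print(chain["in"])   # ['each', 'God']
```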
  • Finally, using the word relationships built above, we emit a random short text:

    • 1. Pick an initial word as the current word; let's pick "This"

    • 2. Output the current word

    • 3. If the list of words that can follow the current word is empty, stop!

    • 4. Look at which words can follow the current word, pick one at random, and make it the new current word

    • 5. Repeat from step 2

    • Example of step 4:

    Successor list when the current word is "This":
    This -> ['is', 'week,', 'Easter', 'is']
    Suppose "Easter" is chosen this time;
    then look at its possible successors:
    Easter -> ['Sunday,']
    ...
    

    Note that "is" appears twice here. We do not merge duplicates or do anything else; simply keeping them means "is" has a proportionally higher chance of being chosen.
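The five generation steps above can be sketched as follows (a minimal Python version; the small `chain` dict is a hand-made excerpt for illustration):

```python
import random

def generate(chain, start, max_words=100, seed=None):
    """Walk the chain from `start`, picking a random successor each step.
    Duplicate successors are kept, so frequent followers win more often."""
    rng = random.Random(seed)
    word = start
    out = [word]
    # Stop when the current word has no successors (or at the length cap).
    while word in chain and len(out) < max_words:
        word = rng.choice(chain[word])
        out.append(word)
    return " ".join(out)

chain = {
    "This": ["is", "week,", "Easter", "is"],
    "is": ["the"],
    "the": ["promise"],
    "Easter": ["Sunday,"],
}
print(generate(chain, "This", seed=0))
```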

    • Here is the result of one random run:

    This Easter Sunday, as Christians around the promise the Jewish families across the beginning of terror. On Palm Sunday, Christians around the story of an amazing people with the tremendous blessings of hardship. I also upon us. This is a long as a Nation of reverence and wonderful future. I also upon us. This is the spirit of God bless America.

    This starts to look interesting: individual phrases read fine, but taken as a whole, the first half of a sentence may have little to do with the second half.

    Can we optimize this? Read on...

  • Optimization

    • Looking at the results above: we chose a single word as the prefix and, by analyzing the article, collected the words that can follow it, then used that as the basis for generating new text. Since a single-word prefix turns out to be too random, what if we try two words as the prefix?

    • Example:

      • This week, Jewish families across our country, and around the world, celebrate Passover and retell the story of God’s deliverance of the Jewish people.
    • The prefix-to-suffix analysis proceeds as follows:

      • This week, Jewish families across our country, and around the world, celebrate Passover and retell the story of God’s deliverance of the Jewish people.
      This week, -> ['Jewish']
      
      • This week, Jewish families across our country, and around the world, celebrate Passover and retell the story of God’s deliverance of the Jewish people.
      This week, -> ['Jewish']
      week, Jewish -> ['families']
      
      • This week, Jewish families across our country, and around the world, celebrate Passover and retell the story of God’s deliverance of the Jewish people.
      This week, -> ['Jewish']
      week, Jewish -> ['families']
      Jewish families -> ['across']
      
      • Continue until the whole article has been scanned; partial results:
      ...
      Happy Passover. -> ['God']
      the world, -> ['celebrate']
      We have -> ['a']
      an amazing -> ['people']
      and endurance. -> ['Another']
      and worship -> ['according']
      the threat -> ['of']
      have a -> ['beautiful']
      the promise -> ['of', 'the', 'that']
      also upon -> ['us.']
      our people. -> ['America']
      that has -> ['cherished']
      ...
      
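The two-word-prefix scan can be sketched the same way, keying the table on a tuple of two consecutive words (Python sketch; the `sample` string is only a fragment of the speech):

```python
from collections import defaultdict

def build_chain2(text):
    """Map each pair of consecutive words to the list of words
    that can follow that pair in the text."""
    words = text.split()
    chain = defaultdict(list)
    # Slide a window of three words: the first two form the prefix,
    # the third is a possible successor.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        chain[(w1, w2)].append(w3)
    return chain

sample = "This week, Jewish families across our country"
chain2 = build_chain2(sample)
print(chain2[("This", "week,")])   # ['Jewish']
```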
    • Likewise, generate random output following the earlier steps:

      • 1. Pick an initial word pair as the current prefix; let's pick "This is"

      • 2. Output the current prefix

      • 3. If the list of words that can follow the current prefix is empty, stop!

      • 4. Look at which words can follow the current prefix, pick one at random, and output the new word

      • 5. The new word pushes out the first word of the current prefix from the left, forming the new current prefix

      • 6. Repeat from step 3

      • Example of steps 4 and 5:

      Successor list when the current prefix is "This is":
      This is -> ['a', 'the']
      Suppose "the" is randomly chosen; output it, and the current prefix becomes
      "is the"
      then look at its possible successors:
      is the -> ['story', 'promise', 'promise', 'source']
      ...
      
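A sketch of the sliding-window generation described above (the tiny `chain` dict here is hand-made for illustration, not derived from the full speech):

```python
import random

def generate2(chain, start_pair, max_words=120, seed=None):
    """Generate text from a two-word-prefix chain: emit the start pair,
    then repeatedly append a random successor and slide the two-word
    window one word to the right."""
    rng = random.Random(seed)
    w1, w2 = start_pair
    out = [w1, w2]
    # Stop when the current prefix has no successors (or at the cap).
    while (w1, w2) in chain and len(out) < max_words:
        nxt = rng.choice(chain[(w1, w2)])
        out.append(nxt)
        w1, w2 = w2, nxt   # the new word pushes out the left word
    return " ".join(out)

chain = {
    ("This", "is"): ["the"],
    ("is", "the"): ["story", "promise"],
}
print(generate2(chain, ("This", "is"), seed=0))
```

With a longer prefix, each step is more constrained by context, which is why the output below reads more coherently than the one-word version.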
    • Following the steps above, here is one random run:

    This is the promise of eternal salvation. It is the promise that our bravest warriors have protected for all of our citizens in centuries since, a long time ago. Sadly, many around the world, celebrate Passover and retell the story of freedom. It is the source of our people. America is a story of the Jewish People have lived through one persecution after another—and yet, they persevered and thrived and uplifted the world celebrate the resurrection of Christ and the promise the first settlers saw in our vast continent—and it is the source of our Nation with the dignity that all Children of God are entitled to know. As long as we have faith in each other, and trust in God, we will succeed. Thank you. Have a Happy Easter, and a Happy Passover. God bless America.

Have you picked up this new trick? Imagine: with cleverer methods and enough good articles for the computer to learn from, it could automatically write piles of new articles. Isn't that fun!

Of course, this is not the only approach; if you have different ideas, you are very welcome to share them.


Tip: this exercise is a simple application of a Markov chain. The current word above corresponds to the current state on the chain; each random choice moves with some probability into one of the following states, and the next state depends only on the current state. Interested readers are encouraged to dig deeper.

posted @ 2017-04-18 02:02 0x1000