Columbia University Natural Language Processing Open Course: Lecture Transcript Translation (Part 3)

February 26, 2013

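Preface: On a whim I watched an open course on natural language processing taught by Collins, a leading figure in the field. I thought it was very good, so I set out to translate its lecture transcripts into Chinese myself. On the one hand, I hope the translation process helps me understand the lectures better and exercises my translation skills. On the other hand, it might benefit others, haha. Text in parentheses is my own explanatory note.

Corrections to any inaccurate translation are welcome.

Course URL: https://www.coursera.org/course/nlangp

Columbia University Natural Language Processing Open Course: Lecture Transcript Translation (Part 3): The Language Modeling Problem, 1

Overview: In this lecture Professor Collins introduces what a language model is and gives its precise definition.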

>> Okay. So, the first topic we're going to cover in this course is the problem of language modeling. Language modeling is one of the oldest problems studied in statistical natural language processing. It's a very basic problem,
and it's a very useful problem. Language models are used in a very wide range of natural language applications. So, we're going to cover a number of things. I'm firstly going to define the basic problem. We'll then talk about a very important class of language models. These are called the trigram language models. These are extremely widely used. We'll talk about how to evaluate different language models, how to measure the effectiveness of different language models. And then finally, we'll talk about a couple of estimation techniques for language modeling. Firstly, something called linear interpolation, and secondly, something called discounting methods. Both of these methods are widely used within language modeling and, as we will see later in the class, they're also useful in many other problems in natural language processing. So, these basic estimation techniques are widely used in other areas.
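Okay, the first topic we cover in this course is the problem of language modeling. Language modeling is one of the oldest problems studied in statistical natural language processing. It is a basic but very useful problem. Language models are widely used across natural language applications. So, we will cover a number of things. First, I will define the basic problem. Then we will talk about an important class of language models, called trigram language models, which are extremely widely used. Next, we will talk about how to evaluate different language models and how to measure their effectiveness. Finally, we will talk about a couple of estimation techniques for language modeling: one called linear interpolation, the other called discounting. Both methods are widely used in language modeling and, as we will see later in the course, in many other problems in natural language processing. So these basic estimation techniques are widely used in other areas too.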

So, to get us started, here are a couple of definitions. We're going to assume that we have some set V. And this is a finite set, and this is going to include all of the words in our language of interest. So, imagine we're constructing a language model for English, for example. We might have a set containing words such as the, a, man, telescope, and so on and so on. And it's not uncommon for this set to be really quite large, it might easily contain thousands or tens of thousands of possible words in the language. So, given this underlying set V, I'm going to use V dagger, this symbol here, to refer to the set of all possible sentences or strings in this language. And a well-formed sentence takes the following form. It has zero or more words, where each word is drawn from the set V, followed by a special symbol, the STOP symbol, okay? So, the use of this STOP symbol at the end of each sentence is initially going to look a little peculiar. But we'll see soon why it's very convenient to include this symbol when we start to develop a probabilistic model for the language modeling problem. So, just to recap, a sentence could have any sequence of words. It could be a sentence that makes sense, for example, this sentence here, or it might be some sentence that really, really doesn't make sense. We get to have sequences like the, the, the, STOP. So, any sequence of words drawn from this vocabulary followed by STOP. And we'll also include a sentence where we have the STOP symbol alone. This is the case where the sentence is basically of zero length, there are no words before STOP, just to be completely precise.
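Okay, let us start with a few definitions. Assume we have a set V, and that it is a finite set (the number of words in a language is finite). It contains all the words in our language of interest. For example, imagine we are building a language model for English. We might have a set containing words such as the, a, man, telescope, and so on. This set is often very large; it can easily contain thousands or tens of thousands of words. Given the set V, we use the symbol V† (V dagger) to denote the set of all possible sentences, or strings, in the language. A well-formed sentence has the following form: it has zero or more words, each drawn from the set V, and (the sentence) ends with a special symbol, STOP. Right now the STOP symbol at the end of every sentence may look a little odd, but we will see later that including it is very convenient when we build a probabilistic model for the language modeling problem. To recap, a sentence can be any sequence of words. It can be a sentence that makes sense, such as this one (the fan saw Beckham play for Real Madrid STOP), or one that makes no sense at all (the fan saw saw STOP). We can also have (meaningless) sequences like the the the STOP. Any sequence of words drawn from the vocabulary and followed by STOP is a sentence. And (the set V†) also includes the sentence consisting of STOP alone, which is the sentence of length zero: there are no words before the STOP symbol. /** "just to be completely precise": I do not understand what this phrase means here, so I will not guess at a translation. /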
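The definition of a well-formed sentence can be checked mechanically. Below is a minimal Python sketch, not something from the lecture: the function name `is_well_formed`, the tiny vocabulary, and the choice to represent a sentence as a list of strings are all my own assumptions for illustration. It also assumes that STOP itself is not a member of V.

```python
STOP = "STOP"  # assumed spelling of the special end-of-sentence symbol

def is_well_formed(sentence, vocabulary):
    """Return True if `sentence` is zero or more words from `vocabulary` followed by STOP."""
    # A sentence must be non-empty and must end with STOP.
    if not sentence or sentence[-1] != STOP:
        return False
    # Every symbol before the final STOP must be a word in V.
    return all(word in vocabulary for word in sentence[:-1])

# A toy vocabulary (hypothetical; a real V has thousands of words).
V = {"the", "a", "man", "telescope", "fan", "saw",
     "Beckham", "play", "for", "Real", "Madrid"}

print(is_well_formed(["the", "fan", "saw", "Beckham", "play", "for", "Real", "Madrid", STOP], V))
print(is_well_formed(["the", "the", "the", STOP], V))  # nonsense, but still in V dagger
print(is_well_formed([STOP], V))                       # the zero-length sentence
print(is_well_formed(["the", "fan"], V))               # no STOP, so not a sentence
```

Note that the check says nothing about whether a sentence makes sense: membership in V† only depends on the words coming from V and on the final STOP.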

So, given these definitions, we can now define the language modeling problem. So, I'm going to assume that we have a training sample of example sentences in the language we're interested in. Let's just assume that's English for now. So, for example, you might collect all sentences that you've seen in the New York Times over the last 10 years. Or you might collect a very large set of example sentences from the world wide web, and you can think of many other examples. And this training sample can again be quite large. So to be concrete, in the mid 90s, for example, it was pretty common to make use of, you know, roughly 20 million words of data in these training samples. And by the end of the 90s, it wasn't uncommon to use maybe a billion words, often, again, chosen from newspaper data, for example. And more recently, over the last several years, people have started using web data to construct language models. We might even get into a scenario where we have hundreds of billions of words of potential training data. And the main point here is just that these training samples can get quite large. So, given a training sample, our task is the following. We want to learn a distribution p over sentences in a language. Okay. So, p is going to be a function and it satisfies 2 conditions. So firstly, for any sentence x, remember, V dagger is the set of possible sentences in the language, for any sentence x, we have p of x is greater than or equal to 0. And secondly, if we sum over all sentences in the language, we have something that sums to the value 1, okay? So, p is a well-formed distribution over sentences in the language. So, our task is going to be to take a training sample, the example sentences, as input, and to output a function p as the output of this process.
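Okay, with these definitions in place, we can now define the language modeling problem. Assume we have a training sample made up of example sentences in the language we are interested in. Let us assume it is English. You might collect every sentence from the New York Times over the last ten years, or you might collect a very large set of example sentences from the web, and so on. This training sample can be very large. For example, in the mid 1990s it was common to use training samples of roughly 20 million words. By the end of the 1990s, with the data often drawn from newspapers, the size had grown to a billion words. In the last several years, people have started using web data to train language models. We can even imagine a scenario where the potential training data reaches hundreds of billions of words. The main point here is that these training samples can get very large. So, given a training sample, our task is as follows: we want to learn from these sentences a distribution p over the sentences of the language. Okay, p is going to be a function that satisfies two conditions. The first condition: for any sentence x (recall that V† is the set of all possible sentences in the language), that is, for any sentence x in V†, p(x) (its probability) is greater than or equal to 0. The second condition: if we sum (the probability p over) all sentences in the language (that is, in V†), the sum is 1 (the probabilities of all sentences add up to 1). Okay, p is (now) a well-formed distribution over the sentences of the language. Our task is to take the training sample, the example sentences, as input and to output the function p.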
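The set of sentences and the two conditions on p can be written compactly. The following is a sketch in LaTeX; the set-builder form of V† is my own restatement of the spoken definition (zero or more words from V followed by STOP), not notation quoted from the lecture.

```latex
V^{\dagger} = \{\, x_1 x_2 \ldots x_n \;:\; n \ge 1,\ x_i \in V \text{ for } i = 1, \ldots, n-1,\ x_n = \text{STOP} \,\}

p(x) \ge 0 \quad \text{for all } x \in V^{\dagger},
\qquad
\sum_{x \in V^{\dagger}} p(x) = 1
```

Here n counts the STOP symbol, so n = 1 corresponds to the zero-length sentence consisting of STOP alone.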

So, here are some examples. We might, for example, assign the probability 10 to the minus 12 to the sentence consisting of just the word, the, followed by STOP. We might assign 2 times 10 to the minus 8 to this particular sentence, and so on and so on. We just assign a probability to every sentence in the language. Now roughly speaking, we would like a good language model to assign high probability to sentences which are likely in English and low probability to sentences which are unlikely in English. So, for example, this sentence here is pretty ill-formed. You're [unknown] unlikely to see this as a sentence and that has relatively low probability.

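Here are some examples. We might, for example, assign a probability of 10^-12 to the sentence made up of just the word the followed by STOP (the STOP). We might assign a probability of 2 × 10^-8 to this particular sentence (the fan saw Beckham STOP), and so on. We assign a probability to every sentence in the language. Roughly speaking, we would like a good language model to assign high probability to sentences that are likely in English and low probability to sentences that are unlikely in English. For example, this sentence here (the fan saw saw STOP) is ill-formed; you would not expect to see it as a sentence, (so) it has a relatively low probability.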
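To make "a probability for every sentence" concrete, here is a toy Python sketch. It is not a model from the lecture: the generation rule (at each position emit STOP with probability `stop_prob`, otherwise pick a word uniformly from V), the function names, and the three-word vocabulary are all assumptions of mine. The functions assume their input is already a well-formed sentence ending in STOP.

```python
from itertools import product

STOP = "STOP"  # assumed spelling of the end-of-sentence symbol

def toy_probability(sentence, vocabulary, stop_prob=0.5):
    """p(x) under a toy model: each position is STOP with probability stop_prob,
    otherwise a word drawn uniformly from the vocabulary."""
    num_words = len(sentence) - 1                    # words before the final STOP
    word_prob = (1.0 - stop_prob) / len(vocabulary)  # probability of one specific word
    return (word_prob ** num_words) * stop_prob

def total_mass(vocabulary, max_words, stop_prob=0.5):
    """Sum p(x) over every sentence with at most max_words words before STOP."""
    total = 0.0
    for n in range(max_words + 1):
        for words in product(sorted(vocabulary), repeat=n):
            total += toy_probability(list(words) + [STOP], vocabulary, stop_prob)
    return total

V = {"the", "fan", "saw"}  # hypothetical three-word vocabulary

print(toy_probability([STOP], V))
print(toy_probability(["the", "fan", "saw", "saw", STOP], V))
print(total_mass(V, max_words=8))  # partial sum; approaches 1 as max_words grows
```

Every sentence gets a non-negative probability, and the partial sums approach 1 as longer sentences are included, so the two conditions hold. This toy model is still a poor language model: it gives "the fan saw saw STOP" exactly the same probability as any other four-word sentence, whereas a good model should rank the ill-formed sentence lower. It also hints at why the STOP symbol is convenient, since the probability of stopping is what lets the total over sentences of every length add up to 1.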

<End of Part 3>

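Columbia University Natural Language Processing Open Course: Lecture Transcript Translation (Part 1): Introduction to Natural Language Processing, 1

Columbia University Natural Language Processing Open Course: Lecture Transcript Translation (Part 2): Introduction to Natural Language Processing, 2

Columbia University Natural Language Processing Open Course: Lecture Transcript Translation (Part 4): Introduction to Language Models, 2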