NLP Study Notes: Selected Excerpts on Word Segmentation (reposted from 52nlp)


December 30, 2012 ⁄ General

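Word segmentation
a) Tokenization
i. Goal: divide text into a sequence of words
ii. A word is a string of contiguous alphanumeric characters with whitespace on both sides; it may contain hyphens and apostrophes but no other punctuation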
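
The definition above can be tried out directly as a regular expression. The following is a minimal sketch, assuming ASCII letters and digits only; the names `WORD` and `tokenize` are illustrative and do not come from the original notes.

```python
import re

# A word: a run of letters/digits that may contain internal hyphens or
# apostrophes, with whitespace (or the edge of the text) on both sides.
WORD = re.compile(r"(?<!\S)[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*(?!\S)")

def tokenize(text):
    """Return every substring of `text` that satisfies the word definition."""
    return WORD.findall(text)

tokens = tokenize("John's 85-year-old grandmother won't wash.")
print(tokens)
```

Note that "wash." is not returned, because the trailing period violates the "no other punctuation" clause. This is exactly the kind of case ("Wash." vs. "wash") that makes the simple definition insufficient in practice.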

b) What's a word?
i. English:
1. “Wash.” vs. “wash”
2. “won’t”, “John’s”
3. “pro-Arab”, “the idea of a child-as-required-yuppie-possession must be motivating them”, “85-year-old grandmother”
ii. East Asian languages
1. No spaces between words

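c) Word segmentation
i. Rule-based approach: morphological analysis based on lexicons and grammatical knowledge
ii. Corpus-based approach: learn from a corpus
iii. Issues to consider: coverage, ambiguity, accuracy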

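d) Motivation for statistical segmentation
i. The unknown-word problem:
     - domain-specific terms and proper names occur
ii. Grammatical constraints may not be sufficient
     - Example: alternative segmentations of a noun phrase
iii. Example 1
   1. Segmentation: sha-choh/ken/gyoh-mu/bu-choh
   2. Translation: “president/and/business/general manager”
iv. Example 2
   1. Segmentation: sha-choh/ken-gyoh/mu/bu-choh
   2. Translation: “president/subsidiary business/Tsutomi[a name]/general manager”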
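
To get a feel for how much ambiguity a segmenter faces, the sketch below enumerates every possible segmentation of the example sequence. It assumes, purely for illustration, that each romanized syllable stands for one character, so there are seven units and six candidate boundaries; the function name `segmentations` is hypothetical.

```python
from itertools import product

def segmentations(units):
    """Yield every way to split a non-empty list of units into contiguous groups.

    Units inside a group are joined with "-" and groups with "/",
    following the notation used in the examples above.
    """
    for cuts in product([False, True], repeat=len(units) - 1):
        groups, current = [], [units[0]]
        for unit, cut in zip(units[1:], cuts):
            if cut:
                groups.append("-".join(current))
                current = [unit]
            else:
                current.append(unit)
        groups.append("-".join(current))
        yield "/".join(groups)

# Assumption: one romanized syllable per character.
units = ["sha", "choh", "ken", "gyoh", "mu", "bu", "choh"]
all_segs = list(segmentations(units))
print(len(all_segs))
```

Both segmentations from the two examples appear among the candidates, and grammar alone does not rule either one out, which is the motivation for bringing corpus statistics in.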

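e) A segmentation algorithm:
i. Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the boundary with the frequency of the n-grams that straddle it.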
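
One possible way to turn this key idea into code is sketched below. The notes do not spell out the scoring rule, so the voting scheme here (the fraction of adjacent-versus-straddling comparisons won by the adjacent n-gram), the `threshold` parameter, and all function names are assumptions made for illustration, and the toy corpus is invented.

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count every character n-gram in a list of unsegmented strings."""
    counts = Counter()
    for line in corpus:
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return counts

def _gram(seq, start, n):
    """Return seq[start:start+n], or None if it would run off either end."""
    if start < 0 or start + n > len(seq):
        return None
    return seq[start:start + n]

def boundary_score(seq, k, counts, n):
    """Fraction of comparisons in which an n-gram adjacent to position k
    is strictly more frequent than an n-gram straddling position k."""
    adjacent = [_gram(seq, k - n, n), _gram(seq, k, n)]
    straddling = [_gram(seq, k - j, n) for j in range(1, n)]
    pairs = [(a, s) for a in adjacent for s in straddling
             if a is not None and s is not None]
    if not pairs:
        return 0.0
    wins = sum(counts[a] > counts[s] for a, s in pairs)
    return wins / len(pairs)

def segment(seq, counts, n=2, threshold=0.5):
    """Place a boundary wherever the score exceeds the (assumed) threshold."""
    pieces, start = [], 0
    for k in range(1, len(seq)):
        if boundary_score(seq, k, counts, n) > threshold:
            pieces.append(seq[start:k])
            start = k
    pieces.append(seq[start:])
    return pieces

# Invented toy corpus in which "ab" and "cd" behave like words.
corpus = ["abab", "cdcd", "abcd", "cdab", "ab", "cd"]
counts = ngram_counts(corpus, 2)
print(segment("abcd", counts, n=2))
```

In the toy corpus, "ab" and "cd" each occur five times while "bc" occurs once, so the position between "b" and "c" wins every comparison and is the only boundary placed.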

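f) Experimental framework
i. Corpus: 150 MB of 1993 Nikkei newswire text
ii. Manual segmentation: 50 sequences for the development set (parameter tuning) and 50 sequences for the test set
iii. Baseline algorithms: the Chasen and Juman morphological analyzers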

g) Evaluation measures
i. tp: true positive (TP), a positive example that the model predicts as positive;
ii. fp: false positive (FP), a negative example that the model predicts as positive;
iii. tn: true negative (TN), a negative example that the model predicts as negative;
iv. fn: false negative (FN), a positive example that the model predicts as negative;
v. Precision: the proportion of selected items that the system got right:
         P = tp / (tp + fp)
vi. Recall: the proportion of target items that the system selected:
         R = tp / (tp + fn)
vii. F-measure:
         F = 2 ∗ PR / (R + P)
viii. Word precision (P) is the percentage of proposed brackets that match word-level brackets in the annotation;
ix. Word recall (R) is the percentage of word-level brackets that are proposed by the algorithm.
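
The word-level measures can be computed by turning each segmentation into a set of brackets and comparing the sets. This is a minimal sketch, assuming a bracket is a (start, end) character span and that both segmentations cover the same string; the helper names and the toy data are illustrative.

```python
def brackets(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def evaluate(proposed, gold):
    """Return word precision, word recall and F-measure."""
    p_spans, g_spans = brackets(proposed), brackets(gold)
    tp = len(p_spans & g_spans)   # proposed brackets that match the annotation
    fp = len(p_spans - g_spans)   # proposed brackets absent from the annotation
    fn = len(g_spans - p_spans)   # annotated brackets the algorithm missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f

gold = ["ab", "cd", "e"]        # invented annotation
proposed = ["ab", "cde"]        # invented system output
p, r, f = evaluate(proposed, gold)
print(p, r, f)
```

Here one of the two proposed brackets matches (P = 1/2) and one of the three annotated brackets is recovered (R = 1/3), giving F = 0.4 up to floating-point rounding. True negatives are not needed for any of the three measures.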

Full original text: see http://www.52nlp.cn/mit-nlp-second-lesson-word-counting-third-part


Author: gentile

This post was published by gentile in the General category and last updated on December 30, 2012.