Stanza-0027 - Neural Pipeline - 10 - Named Entity Recognition

易嘉 · Posted 2023-3-5 08:53:28

0. Background

A look at Stanza, the multilingual Python natural language processing package.

(1) Articles in this series

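格瑞图: Stanza-0001 - Overview
格瑞图: Stanza-0002~0016 - User Manual
格瑞图: Stanza-0017 - Neural Pipeline - 00 - Table of Contents
格瑞图: Stanza-0018 - Neural Pipeline - 01 - Pipeline and Processors
格瑞图: Stanza-0019 - Neural Pipeline - 02 - Data Objects and Annotations
格瑞图: Stanza-0020 - Neural Pipeline - 03 - Data Conversion
格瑞图: Stanza-0021 - Neural Pipeline - 04 - Tokenization and Sentence Segmentation
格瑞图: Stanza-0022 - Neural Pipeline - 05 - Multi-Word Token Expansion
格瑞图: Stanza-0023 - Neural Pipeline - 06 - Part-of-Speech Tagging and Morphological Features
格瑞图: Stanza-0024 - Neural Pipeline - 07 - Lemmatization
格瑞图: Stanza-0025 - Neural Pipeline - 08 - Dependency Parsing
格瑞图: Stanza-0026 - Neural Pipeline - 09 - Constituency Parsing

1. Named Entity Recognition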

(0) Table of Contents

01. Description
02. Options
03. Example Usage
03.01. Accessing Named Entities for Sentence and Document
03.02. Accessing Named Entity Recognition (NER) Tags for Token
03.03. Using multiple models
04. Training-Only Options

(1) Description

The named entity recognition (NER) module recognizes mention spans of a particular entity type (e.g., Person or Organization) in the input sentence. NER is widely used in many NLP applications such as information extraction or question answering systems. In Stanza, NER is performed by the NERProcessor and can be invoked by the name ner.

Note: The NERProcessor currently supports 23 languages. All supported languages, along with their training datasets, are listed in the Stanza documentation's list of available NER models.

Name: ner
Annotator class name: NERProcessor
Requirement: tokenize, mwt
Generated Annotation: Named entities are accessible through the entities or ents properties of a Document or Sentence. Token-level NER tags are accessible through the ner property of a Token.
Description: Recognizes named entities for all token spans in the corpus.

(2) Options

Option name: ner_batch_size
Type: int
Default: 32
Description: When annotating, this argument specifies the maximum number of sentences to process as a minibatch for efficient processing. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computing device).

Option name: ner_pretrain_path
Type: str
Default: model-specific
Description: Where to get the pretrained word embedding. If you trained your own NER model with a different pretrain from the default, you will need to set this flag to use the model.
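
Both options are passed to the pipeline as keyword arguments next to lang and processors. The sketch below only assembles those arguments so the effect of each option is visible. build_ner_kwargs and load_pipeline are hypothetical helper names, not part of Stanza, and the batch size of 16 is an arbitrary illustration.

```python
def build_ner_kwargs(batch_size=32, pretrain_path=None):
    """Assemble keyword arguments for a tokenize+ner pipeline."""
    kwargs = {
        "lang": "en",
        "processors": "tokenize,ner",
        "ner_batch_size": batch_size,  # max sentences per minibatch
    }
    if pretrain_path is not None:
        # only needed for a model trained with a non-default pretrain
        kwargs["ner_pretrain_path"] = pretrain_path
    return kwargs


def load_pipeline(kwargs):
    """Build the pipeline; stanza is imported lazily so the helper
    above can be inspected without the library installed."""
    import stanza
    return stanza.Pipeline(**kwargs)


# a smaller minibatch trades speed for lower memory use
small_batch = build_ner_kwargs(batch_size=16)
print(small_batch)
```

Calling load_pipeline(small_batch) would then build the actual pipeline, provided Stanza and the English models are installed.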

(3) Example Usage

Running the NERProcessor simply requires the TokenizeProcessor. After the pipeline is run, the Document will contain a list of Sentences, and the Sentences will contain lists of Tokens. Named entities can be accessed through the entities or ents properties of a Document or Sentence. Alternatively, token-level NER tags can be accessed via the ner field of each Token.
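
The nesting described above (a Document holding Sentences, each Sentence holding its entities) can be sketched without loading any model. In the example below, entities_by_type is a hypothetical helper, and the SimpleNamespace objects are hand-built stand-ins that only mimic the sentences, ents, text and type attributes; they are not real Stanza objects.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_type(doc):
    """Walk doc.sentences, then each sentence's ents, grouping entity text by type."""
    grouped = defaultdict(list)
    for sent in doc.sentences:
        for ent in sent.ents:
            grouped[ent.type].append(ent.text)
    return dict(grouped)


# stand-in for a processed document (assumption: same attribute names as Stanza)
fake_doc = SimpleNamespace(sentences=[
    SimpleNamespace(ents=[
        SimpleNamespace(text="Chris Manning", type="PERSON"),
        SimpleNamespace(text="Stanford University", type="ORG"),
    ]),
    SimpleNamespace(ents=[
        SimpleNamespace(text="the Bay Area", type="LOC"),
    ]),
])

grouped = entities_by_type(fake_doc)
print(grouped)
```

Because the helper relies only on attribute names, it should work unchanged on a document returned by a real pipeline.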

Accessing Named Entities for Sentence and Document

Here is an example of performing named entity recognition on a piece of text and accessing the named entities in the entire document:

import stanza

# build an English pipeline with the tokenizer and the NER processor
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
# doc.ents holds every entity found in the document
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

Instead of accessing entities in the entire document, you can also access the named entities in each sentence of the document. The following example produces the same result as the one above, by accessing entities from sentences instead of the entire document:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
# iterate over sentences first, then over each sentence's entities
print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

As can be seen in the output, Stanza correctly identifies that Chris Manning is a person, Stanford University is an organization, and the Bay Area is a location.

entity: Chris Manning	type: PERSON
entity: Stanford University	type: ORG
entity: the Bay Area	type: LOC

Accessing Named Entity Recognition (NER) Tags for Token

It might sometimes be useful to access the BIOES NER tags for each token. Here is an example of how to do so:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
# token.ner holds the BIOES tag assigned to each token
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')
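
In the BIOES scheme, B, I and E mark the beginning, inside and end of a multi-token entity, S marks a single-token entity, and O marks a token outside any entity. As a rough illustration of how such tags map back to entity spans, here is a minimal decoder. decode_bioes is a hypothetical helper, not a Stanza function, the input sequence is hand-written rather than model output, and joining tokens with a single space is a simplification.

```python
def decode_bioes(tokens, tags):
    """Collapse a BIOES tag sequence into (entity text, entity type) pairs."""
    spans = []
    current = []         # tokens of the entity being built
    current_type = None  # type of the entity being built
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # single-token entity
            spans.append((token, label))
            current, current_type = [], None
        elif prefix == "B":
            # start a new multi-token entity
            current, current_type = [token], label
        elif prefix in ("I", "E") and current and label == current_type:
            current.append(token)
            if prefix == "E":
                # entity is complete
                spans.append((" ".join(current), label))
                current, current_type = [], None
        else:
            # "O" or a malformed sequence: discard any partial entity
            current, current_type = [], None
    return spans


# hand-written example input (not produced by running Stanza)
tokens = ["Stanford", "hired", "Chris", "Manning"]
tags = ["S-ORG", "O", "B-PERSON", "E-PERSON"]
spans = decode_bioes(tokens, tags)
print(spans)
```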

The Stanza example prints the BIOES representation of the entities we saw above:

token: Chris	ner: B-PERSON
token: Manning	ner: E-PERSON
token: teaches	ner: O
token: at	ner: O
token: Stanford	ner: B-ORG
token: University	ner: E-ORG
token: .	ner: O
token: He	ner: O
token: lives	ner: O
token: in	ner: O
token: the	ner: B-LOC
token: Bay	ner: I-LOC
token: Area	ner: E-LOC
token: .	ner: O

Using multiple models

NEW IN V1.4.0
When creating the pipeline, it is possible to use multiple NER models at once by specifying a list in the package dict. Here is a brief example:

import stanza

# the "ner" entry of the package dict lists the NER models to load
pipe = stanza.Pipeline("en", processors="tokenize,ner", package={"ner": ["ncbi_disease", "ontonotes"]})
doc = pipe("John Bauer works at Stanford and has hip arthritis.  He works for Chris Manning")
print(doc.ents)

Output:
[{
  "text": "John Bauer",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 10
}, {
  "text": "Stanford",
  "type": "ORG",
  "start_char": 20,
  "end_char": 28
}, {
  "text": "hip arthritis",
  "type": "DISEASE",
  "start_char": 37,
  "end_char": 50
}, {
  "text": "Chris Manning",
  "type": "PERSON",
  "start_char": 66,
  "end_char": 79
}]
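
Judging from this sample, start_char and end_char are character offsets into the input string, with end_char exclusive (for instance 0 to 10 for "John Bauer"). The quick check below slices the input with those offsets. The entities list is copied by hand from the output above rather than produced by running Stanza, and slice_matches is a hypothetical helper.

```python
text = "John Bauer works at Stanford and has hip arthritis.  He works for Chris Manning"

# copied by hand from the printed output, not generated by a pipeline
entities = [
    {"text": "John Bauer", "type": "PERSON", "start_char": 0, "end_char": 10},
    {"text": "Stanford", "type": "ORG", "start_char": 20, "end_char": 28},
    {"text": "hip arthritis", "type": "DISEASE", "start_char": 37, "end_char": 50},
    {"text": "Chris Manning", "type": "PERSON", "start_char": 66, "end_char": 79},
]


def slice_matches(text, entity):
    """True if slicing text with the entity's offsets gives back the entity text."""
    return text[entity["start_char"]:entity["end_char"]] == entity["text"]


for entity in entities:
    print(entity["type"], entity["text"], slice_matches(text, entity))
```

The offsets make it possible to highlight or replace entities in the original string without re-tokenizing it.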

Furthermore, the multi_ner field will have the outputs of each NER model in order.

# results truncated for legibility
# token ids start from 1 while list indices start from 0,
# so the tokens with ids 8 and 9 are tokens[7:9]
print(doc.sentences[0].tokens[0:2])
print(doc.sentences[0].tokens[7:9])

Output:
[[
  {
    "id": 1,
    "text": "John",
    "start_char": 0,
    "end_char": 4,
    "ner": "B-PERSON",
    "multi_ner": [
      "O",
      "B-PERSON"
    ]
  }
], [
  {
    "id": 2,
    "text": "Bauer",
    "start_char": 5,
    "end_char": 10,
    "ner": "E-PERSON",
    "multi_ner": [
      "O",
      "E-PERSON"
    ]
  }
]]
[[
  {
    "id": 8,
    "text": "hip",
    "start_char": 37,
    "end_char": 40,
    "ner": "B-DISEASE",
    "multi_ner": [
      "B-DISEASE",
      "O"
    ]
  }
], [
  {
    "id": 9,
    "text": "arthritis",
    "start_char": 41,
    "end_char": 50,
    "ner": "E-DISEASE",
    "multi_ner": [
      "E-DISEASE",
      "O"
    ]
  }
]]

(4) Training-Only Options

Most training-only options are documented in the argument parser of the NER tagger.

N. Postscript

Spring is coming: Alpine marmot ~
