October 2, 2020

Active learning enabled Annotation System

Why build an annotation system backed by active learning

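  • Sequence labeling involves many entity types and relation types, and a single document often needs annotations in many places, so the manual annotation workload is heavy
  • When several annotators collaborate online, they understand the business domain differently, so quality control usually requires a certain amount of redundant annotation
  • Connecting the annotation system to a sequence labeling model usually requires data format preprocessing, which takes a lot of manual work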

What active learning can do

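Starting from the initial golden labels, a seed model is trained and all unlabeled documents are sorted by prediction confidence from low to high. After annotators finish a certain number of documents, the model is retrained automatically and the remaining documents are sorted again. As the model's uncertainty decreases, its accuracy improves quickly, which reduces the time needed to annotate all documents.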
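
The ranking step of this loop can be sketched in a few lines of Python. This is only a minimal illustration, assuming each document already has a confidence score from the current model; the names `rank_by_confidence`, `next_batch`, and the document ids are hypothetical and not part of any specific annotation system.

```python
from typing import Dict, List, Set


def rank_by_confidence(confidences: Dict[str, float]) -> List[str]:
    """Return document ids ordered from least to most confident.

    Ties are broken by document id so the order is deterministic.
    """
    return sorted(confidences, key=lambda doc_id: (confidences[doc_id], doc_id))


def next_batch(confidences: Dict[str, float],
               labeled: Set[str],
               batch_size: int) -> List[str]:
    """Pick the next documents for annotators, skipping labeled ones."""
    ranked = rank_by_confidence(confidences)
    return [doc_id for doc_id in ranked if doc_id not in labeled][:batch_size]


# Hypothetical confidence scores produced by a seed model.
scores = {"doc_a": 0.9, "doc_b": 0.2, "doc_c": 0.5}
print(next_batch(scores, labeled={"doc_b"}, batch_size=1))
```

After each batch is annotated, the model would be retrained, the scores recomputed, and `next_batch` called again.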

System architecture

[Figure: system architecture, 12911_2017_466_Fig1_HTML.jpg]

Query strategy

The ranking of texts waiting for annotation needs to take two criteria into account:

  1. The uncertainty of the prediction
  2. The representativeness of the document content

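The Clustering And Uncertainty Sampling Engine (CAUSE) method first clusters documents by content using a topic model, then samples the documents to annotate according to confidence. This ensures that the top-ranked documents come from different content groups and have low similarity to one another.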

The pseudocode of the algorithm is as follows:

Input

  1. Clustering results of sentences;
  2. Uncertainty scores of sentences;
  3. Batch size (x);

Steps

  1. Cluster ranking: score each cluster based on the uncertainty scores of sentences and select the top x cluster(s) based on the cluster scores, where x is the batch size; (e.g. the score of a cluster could be the average uncertainty score of sentences in this cluster.)

  2. Representative sampling: in each selected cluster, find a sentence with the highest uncertainty score as the cluster representative.

Output

x cluster representative sentences in the order of their cluster ranking.
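
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original implementation: the cluster score is taken to be the average uncertainty of its sentences (the example given in the pseudocode), ties are broken by cluster id, and the function name `cause_select` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def cause_select(clusters: Dict[str, int],
                 uncertainty: Dict[str, float],
                 batch_size: int) -> List[str]:
    """Select up to batch_size sentences, one per top-ranked cluster.

    clusters maps sentence id -> cluster id (the clustering result).
    uncertainty maps sentence id -> uncertainty score.
    """
    # Group sentence ids by cluster.
    members = defaultdict(list)
    for sentence_id, cluster_id in clusters.items():
        members[cluster_id].append(sentence_id)

    # Step 1, cluster ranking: score each cluster by the average
    # uncertainty of its sentences (assumed scoring function).
    cluster_scores = {
        cluster_id: mean(uncertainty[s] for s in sentence_ids)
        for cluster_id, sentence_ids in members.items()
    }
    top_clusters = sorted(
        cluster_scores, key=lambda c: (-cluster_scores[c], c)
    )[:batch_size]

    # Step 2, representative sampling: the most uncertain sentence
    # of each selected cluster, in cluster ranking order.
    return [max(members[c], key=lambda s: uncertainty[s]) for c in top_clusters]


# Hypothetical clustering result and uncertainty scores.
clusters = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 2}
uncertainty = {"s1": 0.9, "s2": 0.1, "s3": 0.6, "s4": 0.7, "s5": 0.2}
print(cause_select(clusters, uncertainty, batch_size=2))
```

Because at most one sentence is taken per cluster, the returned batch can be smaller than the batch size when there are fewer clusters than requested sentences.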

Initial sampling

When the NER model and uncertainty scores of sentences are not available, we used random sampling to select a cluster and the representative within the selected cluster.
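
A possible sketch of this cold-start step is shown below. It assumes that the randomly chosen clusters should be distinct so the first batch still covers different content groups; that choice, the function name `initial_sample`, and the `seed` parameter are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def initial_sample(clusters: Dict[str, int],
                   batch_size: int,
                   seed: Optional[int] = None) -> List[str]:
    """Randomly pick clusters, then one random sentence from each.

    clusters maps sentence id -> cluster id. No model or uncertainty
    scores are needed, so this can run before the seed model exists.
    """
    rng = random.Random(seed)

    # Group sentence ids by cluster.
    members = defaultdict(list)
    for sentence_id, cluster_id in clusters.items():
        members[cluster_id].append(sentence_id)

    # Assumption: sample distinct clusters, at most one sentence each.
    k = min(batch_size, len(members))
    chosen_clusters = rng.sample(sorted(members), k)
    return [rng.choice(members[c]) for c in chosen_clusters]


# Hypothetical clustering result.
clusters = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 2}
print(initial_sample(clusters, batch_size=2, seed=0))
```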

Link to this post: http://57km.cc/post/Active learning enabled Annotation System.html
