Chinese Natural Language Understanding (NLU) in Dialogue Systems (3.1): Methods from Academia (Joint Training)
In this section, we cover two tasks in detail: intent classification and slot filling. We also review some previous work to see how these two tasks have been tackled. You will notice that some of these studies use the Chinese-specific features introduced in the previous article.
Note that this series of articles does not cover end-to-end methods (e.g. feeding an utterance to a language model and having it directly generate the intent classification result, the slot-filling result, or both).
Two Tasks
Intent Classification. Similar to text classification, this task assigns each utterance to a category. The categories need to be defined in advance, e.g. Play News, Play Music, Turn Up Volume.
Slot Filling. Once the model has understood the intent, it needs further details in order to actually carry out the task. For example, when the intent is Play News, the model wants to know the location and the date of the news. In the example in the figure, Date and Location are the slots, and the corresponding pieces of information are the values that fill them.
Methods
There are roughly two types of popular approaches.
Joint training: a shared model body is designed for the two tasks, with specially designed output layers. Typically each task gets its own output layer on top of the shared encoder.
Separate training: the two tasks are treated as if they were unrelated to each other. That is, a separate method is designed for each task.
Since this article focuses on methods for Chinese, we briefly mention here how methods for Chinese relate to methods for English. As the figure illustrates, the two sets of methods overlap: some methods work for Chinese as well as for English. Note, however, that there are also methods designed for one language that cannot be applied directly to the other.
3.1 Joint Training
BERT for Joint Intent Classification and Slot Filling, 2019, DAMO Academy, Alibaba
Let's start with this paper. The figure above shows how the model handles English text; its framework is quite popular at the moment.
The input consists of two parts: the special token [CLS] and the ordinary tokens. The output also consists of two parts: the intent classification result for the whole utterance (Play News in the figure above) and a sequence-labelling tag for each token (today corresponds to Date, Beijing to Location).
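A minimal PyTorch sketch of this kind of joint architecture is given below. The encoder is a small stand-in Transformer rather than pretrained BERT, and the vocabulary size, hidden size and label counts are invented for illustration; only the overall wiring ([CLS] → intent head, remaining tokens → slot head) reflects the idea described above.

```python
import torch
import torch.nn as nn

class JointIntentSlotModel(nn.Module):
    """Sketch of the joint setup: one shared encoder, two output heads
    (intent from the [CLS] position, slot tags for every other token)."""
    def __init__(self, vocab_size, hidden_size, num_intents, num_slot_labels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.intent_head = nn.Linear(hidden_size, num_intents)
        self.slot_head = nn.Linear(hidden_size, num_slot_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); position 0 is assumed to be [CLS]
        hidden = self.encoder(self.embedding(token_ids))
        intent_logits = self.intent_head(hidden[:, 0])   # [CLS] -> intent
        slot_logits = self.slot_head(hidden[:, 1:])      # tokens -> slot tags
        return intent_logits, slot_logits

model = JointIntentSlotModel(vocab_size=21128, hidden_size=128,
                             num_intents=5, num_slot_labels=9)
tokens = torch.randint(0, 21128, (2, 16))                # dummy batch
intent_logits, slot_logits = model(tokens)
print(intent_logits.shape, slot_logits.shape)            # (2, 5) (2, 15, 9)
```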
As the table in the top-right corner shows, the model achieves good results on two publicly available datasets. So what should we do if we want to apply it to Chinese without changing the model structure?
A simple, brute-force option is to treat each Chinese character as an individual input token, with one embedding vector per character.
Alternatively, we can first run Chinese word segmentation on the text (i.e. split it into Chinese words) and then feed in word vectors.
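The two input options can be sketched as follows. jieba is used here as an example segmenter; this is our own choice (the papers do not prescribe a specific tool), and the actual segmentation output may differ.

```python
# Two ways to tokenize a Chinese utterance before the embedding lookup.
import jieba

utterance = "播放今天北京的新聞"   # "Play today's news about Beijing"

# Option 1: character-level tokens, one embedding per character
char_tokens = list(utterance)
print(char_tokens)   # ['播', '放', '今', '天', '北', '京', '的', '新', '聞']

# Option 2: word-level tokens after Chinese word segmentation
word_tokens = list(jieba.cut(utterance))
print(word_tokens)   # e.g. ['播放', '今天', '北京', '的', '新聞'] (output may vary)
```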
有同學要問了,這里為什么一定要使用BERT模型呢?如果說我的業務場景或者任務需要更加輕量級的模型,但是又需要還可以的效果,又該怎么辦呢?You may ask, what if my business scenario or task requires a more lightweight model with acceptable performance? Bert is somehow too heavy for me.
In that case BERT is not strictly necessary. Even if you swap in another encoder, you can still achieve decent performance as long as the task is not too difficult.
If you are new to NLP and not yet familiar with how a CRF layer works, you can refer to another series of articles I wrote earlier: CRF Layer on the Top of BiLSTM (https://createmomo.github.io/2017/09/12/CRF_Layer_on_the_Top_of_BiLSTM_1/). Of course, you can also read other authors' excellent notes. I am sure you will eventually figure out how it works.
A joint model of intent determination and slot filling for spoken language understanding, IJCAI 2016, Peking University
The structure presented in this paper is: input vectors → bidirectional GRU → a CRF layer for sequence labelling (i.e. slot filling), while the output at the last token is used to predict the intent.
需要注意的是,這里的輸入是經過小小的精心設計的。首先,每一個詞會擁有自己的向量。雖然如此,真正直接進入模型的,并不是每個詞的向量。而是每連續3個詞向量,組成新的向量。而這個新的向量才是與GRU部分直接接觸的輸入。It should be noted that the input here is carefully designed. Firstly, each word will have its own vector. Despite this, it is not the vector of each word that goes directly into the model. Rather, it is every three consecutive word vectors that make up a new vector, and this new vector goes into directly the GRU part.
The figure below shows the results the model can achieve. As a small reminder noted in the figure, the word vectors used in this paper were pre-trained on a large Chinese corpus.
Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling, Interspeech 2016, Carnegie Mellon University
This paper proposes two models.
模型1:基于注意力機制的RNN (Model1: Attention-Based RNN)
Shared part: input → BiLSTM → outputs → compute attention weight scores (in this example there are 16 scores in total; at the position "From", for instance, four weights are derived, and the same holds at every other position, i.e. each step requires four attention weight scores) → compute context vectors as weighted combinations based on those scores.
有的小伙伴可能要問了,這個注意力分數是如何計算的呢?現在注意力分數的計算其實已經有了很多不同的做法。它既可以是簡單的矩陣、向量之間的操作,也可以是由一個小型的神經網絡來決定的。如果感興趣,可以仔細閱讀論文,來研究一下這篇文章是如何計算的。Some of you may be asking, how is this attention score calculated? Nowadays the calculation of the attention score has actually been studied in many different ways. It can either be as simple as manipulating matrices and vectors or it can be determined by a small neural network. If interested, you can read the paper carefully to figure out how this is calculated in this paper.
Intent classification part: if there are, say, four vectors in the figure, they are stacked into a matrix and mean-pooled into a single new vector, which is then used for the classification task.
Slot-filling part: the vectors are used to predict the label at each position (e.g. From → O, LA → B-From, to → O, Seattle → B-To).
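The following rough sketch illustrates Model 1's two heads on top of attention-weighted BiLSTM outputs. Dot-product attention is used here purely for illustration (the paper's actual scoring function may differ), and all dimensions and label counts are made up.

```python
import torch
import torch.nn.functional as F

hidden = torch.randn(4, 64)            # BiLSTM outputs for 4 tokens
scores = hidden @ hidden.T             # 4 x 4 = 16 attention scores
weights = F.softmax(scores, dim=-1)    # 4 weights per position
context = weights @ hidden             # attention-weighted vectors, one per position

intent_head = torch.nn.Linear(64, 5)
slot_head = torch.nn.Linear(64, 9)

# Intent head: mean-pool the four vectors into one, then classify
intent_logits = intent_head(context.mean(dim=0, keepdim=True))
# Slot head: one label prediction per position
slot_logits = slot_head(context)
print(intent_logits.shape, slot_logits.shape)   # (1, 5) (4, 9)
```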
模型2:基于注意力機制的編碼-解碼器結構 (Model2: Attention-Based Encoder-Decoder)
Shared part: identical to Model 1. Input → BiLSTM → outputs → attention weight scores (16 in total in this example, four per position) → context vectors computed from those scores.
Intent classification part: the last output of the BiLSTM is combined with a context vector (the weighted sum of the four BiLSTM outputs) → a simple feed-forward neural network → the classification result.
槽位填充部分: 這一部分相當于是一個文字生成過程,只不過這里生成的是序列標注的標簽。每一步的輸入由2部分組成:BiLSTM相關的輸出結果(和)+ 前一時刻LSTM的輸出。每一步根據輸入的內容來預測序列標注的標簽。
Slot-Filling Part: This part is equivalent to a token-by-token sentencegeneration process, except that here the labels are generated for the sequence labelling. The input to each step consists of 2 parts: the output results associated with the BiLSTM ( and ) + the output of the LSTM from the previous step. Each step predicts the label based on the input.
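A simplified sketch of such a label-decoding loop is shown below. How exactly the aligned hidden state and the context vector are combined, and all dimensions, are our assumptions; the sketch only conveys the step-by-step structure.

```python
import torch
import torch.nn as nn

# At step i, an LSTM cell consumes the aligned encoder hidden state together
# with the attention context vector, plus its own previous state, and emits
# one slot label.
hidden_size, num_labels, seq_len = 64, 9, 4
encoder_states = torch.randn(seq_len, hidden_size)    # BiLSTM outputs
context_vectors = torch.randn(seq_len, hidden_size)   # attention contexts

decoder_cell = nn.LSTMCell(2 * hidden_size, hidden_size)
label_head = nn.Linear(hidden_size, num_labels)

state = (torch.zeros(1, hidden_size), torch.zeros(1, hidden_size))
for i in range(seq_len):
    step_input = torch.cat([encoder_states[i], context_vectors[i]]).unsqueeze(0)
    state = decoder_cell(step_input, state)
    label_logits = label_head(state[0])               # one label per step
    print(label_logits.argmax(dim=-1))
```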
CM-Net: A Novel Collaborative Memory Network for Spoken Language Understanding, EMNLP 2019, Beijing Jiaotong University, WeChat AI, Tencent
This study introduces two memory modules: a slot memory and an intent memory. The slot memory contains one vector per slot type in the task, and likewise the intent memory contains one vector per intent. The expectation is that these two modules will remember useful information, and that this memory will help improve accuracy when the model makes its predictions.
那么記憶模塊有了,該怎么讓他們在模型做預測的時候提供幫助呢?這篇論文還設計了一個特別的神經網絡塊:CM-block。它的作用就是聰明地綜合輸入的文本以及記憶模塊中的信息,來輔助最終輸出層(Inference Layer)做決策。如果你想增強整個模型的推理能力,可以考慮疊加更多的這種神經網絡塊。The problem is how they work when the model is making predictions. This paper designed a special neural network block: the CM-block. This block is able to intelligently combine input text and the information from the memory block to improve the Inference Layer in making decisions. One way which may increase the reasoning ability of this model is to stack more of these neural network blocks.
需要注意的是,針對序列標注任務,這篇論文采用的不是BIO2序列標簽框架。而是采用了BIOES的標簽框架。在下面這個圖中可以看出,針對同一句話,這兩種不同的標注框架會有哪些不同。It should be noticed that for the slot-filling task (a type of sequence labelling task), this paper did not use the BIO2 sequence tagging framework. Instead, the BIOES was used. The picture below explains how these two different labelling frameworks would differ for the same sentence.
Injecting Word Information with Multi-Level Word Adapter for Chinese Spoken Language Understanding, ICASSP 2021, Harbin Institute of Technology
As you will see, this work uses not only basic Chinese character vectors: it also pre-processes the utterances with word segmentation and then incorporates the resulting word information into the model.
下圖中模型看起來有一些復雜。我們不會展開說模型結構的細節。這里是想為大家展示,這個工作綜合了字符和分詞兩者的信息。作者們認為,雖然偶爾可能會有錯誤的分詞結果,但是這不會影響模型從分詞結果中獲取對理解中文有幫助的信息。The model architecture below looks somewhat complex. We will not describe the details of the model structure. The point here is to show that this work combines information from both characters and words. Although there may be occasional incorrect word segmentation results, this does not affect the model's ability to capture useful information from the words for a better understanding of Chinese.
As we can see in the figure, the basic input to the model is Chinese characters (我, 洗, 手, 了).
At the same time, the model also takes the word segmentation results into account: 我, 洗手, 了.
The paper also proposes a small module called the Word Adapter. You can think of it as a general-purpose gadget whose job is to combine two different kinds of information cleverly; "cleverly" here means it can judge how much weight each source deserves (a weighted sum), and this judgement is learnable during training. This smart little module is what fuses the character-level and word-level information.
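A sketch of the adapter idea as a learnable gated weighted sum is shown below; the real module in the paper may differ in its details, and the dimensions are invented.

```python
import torch
import torch.nn as nn

class WordAdapter(nn.Module):
    """Learn a gate that decides how much to trust the character-level
    vector versus the word-level vector at each position."""
    def __init__(self, hidden_size):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, char_repr, word_repr):
        # char_repr, word_repr: (seq_len, hidden_size), already aligned
        weight = torch.sigmoid(self.gate(torch.cat([char_repr, word_repr], dim=-1)))
        return weight * char_repr + (1 - weight) * word_repr  # learnable weighted sum

adapter = WordAdapter(hidden_size=128)
fused = adapter(torch.randn(4, 128), torch.randn(4, 128))
print(fused.shape)   # torch.Size([4, 128])
```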
This table shows the model's performance. As you can see, the improvement over several strong baselines is significant.
Next
The next article will review several studies that address "Intent Classification" or "Slot Filling" individually.