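Throughout this book, we will often work with text data represented as sequences of words, characters, or word pieces. To get started, we need some basic tools for converting raw text into sequences of the appropriate form. A typical preprocessing pipeline executes the following steps:
- Load the text into memory as strings.
- Split the strings into tokens (e.g., words or characters).
- Build a vocabulary dictionary that associates each vocabulary element with a numerical index.
- Convert the text into sequences of numerical indices.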
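The four steps above can be sketched on a tiny in-memory string. This is only an illustration using the standard library: the sample sentence and the variable names are made up here, and a plain dictionary stands in for a proper vocabulary object.

```python
import re

# Step 1: the text as a string (a made-up sample instead of a file on disk).
raw = "The Time Traveller smiled. 'Time travel!'"

# Step 2: split into tokens; here, lowercase words with punctuation dropped.
tokens = re.sub('[^A-Za-z]+', ' ', raw).lower().split()

# Step 3: build a vocabulary mapping each distinct token to an index.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Step 4: convert the text into a sequence of numerical indices.
indices = [vocab[tok] for tok in tokens]

print(tokens)
print(vocab)
print(indices)
```

Repeated words such as "time" map to the same index, which is what lets a model treat them as the same symbol.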
import collections
import random
import re
import torch
from d2l import torch as d2l
# TensorFlow users: replace the last two imports with
# import tensorflow as tf
# from d2l import tensorflow as d2l
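9.2.1. Reading the Dataset
Here, we will work with H. G. Wells' The Time Machine, a book of just over 30,000 words. While real applications typically involve much larger datasets, this is enough to demonstrate the preprocessing pipeline. The following _download method reads the raw text into a string.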
class TimeMachine(d2l.DataModule): #@save
    """The Time Machine dataset."""
    def _download(self):
        # Download the file into self.root and verify it against the SHA-1 hash
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        # Read the whole book into a single string
        with open(fname) as f:
            return f.read()

data = TimeMachine()
raw_text = data._download()
raw_text[:60]
'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'
For simplicity, we ignore punctuation and capitalization when preprocessing the raw text.
@d2l.add_to_class(TimeMachine) #@save
def _preprocess(self, text):
    # Replace each run of non-letter characters with a space, then lowercase
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]
'the time machine by h g wells i the time traveller for so it'
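To see what the regular expression in `_preprocess` does, here is a standalone sketch on a made-up string. The helper name `preprocess` and the sample text are illustrative and not part of the d2l library. Every run of characters other than A-Z and a-z, including digits, punctuation, and newlines, collapses into a single space before lowercasing.

```python
import re

def preprocess(text):
    """Replace each run of non-letter characters with one space, then lowercase."""
    return re.sub('[^A-Za-z]+', ' ', text).lower()

# A made-up sample containing brackets, digits, newlines, and an apostrophe.
sample = "Wells [1898]\n\nI\n\nThe Time-Traveller's tale."
cleaned = preprocess(sample)
print(repr(cleaned))
```

Note that the apostrophe splits "Traveller's" into two pieces and that the year disappears entirely. That is acceptable for a demonstration, but a pipeline that cares about numbers or contractions would likely need gentler rules.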
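9.2.2. Tokenization
Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what exactly constitutes a token is a design choice. For example, we could represent the sentence "Baby needs a new pair of shoes" as a sequence of 7 words, where the set of all words forms a large vocabulary (typically tens or hundreds of thousands of words). Alternatively, we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (ASCII defines only 128 distinct characters, and a single byte can take at most 256 values). Below, we tokenize our preprocessed text into a sequence of characters.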
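The trade-off can be made concrete with a short standalone sketch. The helper names `tokenize_chars` and `tokenize_words` are hypothetical stand-ins introduced here, not the book's own methods; the character variant is simply `list(text)`, and the word variant splits on whitespace.

```python
def tokenize_chars(text):
    """Character-level tokens: every character, spaces included."""
    return list(text)

def tokenize_words(text):
    """Word-level tokens: split on whitespace."""
    return text.split()

sentence = "Baby needs a new pair of shoes"
print(len(tokenize_words(sentence)))   # number of word tokens
print(len(tokenize_chars(sentence)))   # number of character tokens

# First 30 character tokens of the preprocessed text shown earlier.
text = 'the time machine by h g wells i the time traveller for so it'
tokens = tokenize_chars(text)
print(','.join(tokens[:30]))
```

Joining the first 30 character tokens with commas gives the following string: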
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
9.2.3. Vocabulary
These tokens are still strings. However, the inputs to our models must ultimately be numerical. Next, we introduce a class for constructing vocabularies, i.e., objects that associate each distinct token value with a unique index. First, we determine the set of unique tokens in our training corpus. We then assign a numerical index to each unique token. Rare vocabulary elements are often dropped for convenience. Whenever we encounter a token at training or test time that was not seen before or was dropped from the vocabulary, we represent it by a special "<unk>" token, signifying that this is an unknown value.
class Vocab: #@save
    """Vocabulary for text."""
    def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens
        self.idx_to_token = list(sorted(set(['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs if freq >= min_freq])))
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens,
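Because the listing above breaks off partway through `__getitem__`, the following is a separate, minimal sketch of the same idea rather than a reconstruction of the original class. The class name `SimpleVocab`, the `to_tokens` method, and the choice to fall back to the index of `'<unk>'` for unseen tokens are assumptions made for illustration.

```python
import collections

class SimpleVocab:
    """A minimal token <-> index mapping with an unknown-token fallback."""

    def __init__(self, tokens, min_freq=0, unk='<unk>'):
        counter = collections.Counter(tokens)
        # Keep tokens that occur at least min_freq times; always keep unk.
        kept = {tok for tok, freq in counter.items() if freq >= min_freq}
        self.idx_to_token = sorted(kept | {unk})
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}
        self.unk = unk

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token maps to one index; a list or tuple maps elementwise.
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.token_to_idx[self.unk])
        return [self[tok] for tok in tokens]

    def to_tokens(self, indices):
        """Map a list of indices back to their tokens."""
        return [self.idx_to_token[i] for i in indices]

# Build a character vocabulary, dropping characters seen fewer than 2 times.
vocab = SimpleVocab(list('the time machine'), min_freq=2)
indices = vocab[list('the cat')]
print(len(vocab))
print(indices)
print(vocab.to_tokens(indices))
```

In this sketch the characters "c" and "a" each occur only once in the training string, so they are dropped and both map to the unknown token when looked up.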