
Introduction to RAKE

RAKE stands for Rapid Automatic Keyword Extraction. It is a highly efficient keyword extraction algorithm that operates on individual documents, which makes it suitable for dynamic collections. It can also be applied easily to new domains and works well across many types of documents.

Algorithm idea

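RAKE is used for keyword extraction, but what it actually extracts are key phrases, with a preference for longer ones. In English, a keyword usually consists of several words but rarely contains punctuation or stop words such as and, the, and of, or other words that carry no semantic information.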

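RAKE first uses punctuation marks (such as the ASCII period, question mark, exclamation mark, and comma) to split a document into clauses. Within each clause, it then uses stop words as delimiters to split the clause into phrases. These phrases are the candidates for the keywords that are finally extracted.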
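
The two-stage split described above can be sketched as follows. This is a minimal illustration, not the implementation given later in this article: the function name candidate_phrases, the tiny STOPWORDS set, and the choice of punctuation marks in the regular expression are all assumptions made for the example.

```python
import re

# Illustrative stop-word list (an assumption); real use needs a much larger one.
STOPWORDS = {"and", "the", "of", "is", "a", "in", "for", "to"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into clauses at punctuation, then into phrases at stop words."""
    phrases = []
    # Stage 1: punctuation marks end a clause.
    for clause in re.split(r"[.?!,;:]+", text):
        current = []
        # Stage 2: inside a clause, every stop word closes the current phrase.
        for word in clause.split():
            if word.lower() in stopwords:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


print(candidate_phrases("Keyword extraction is fast, and the algorithm is simple."))
# -> ['Keyword extraction', 'fast', 'algorithm', 'simple']
```

Note that the stop words themselves never appear in the result; they only mark where one candidate phrase ends and the next begins.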

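Finally, each phrase can be split on whitespace into words. Every word is assigned a score, and the score of a phrase is the sum of the scores of its words. A key point is that the co-occurrence of the words inside a phrase is taken into account. The final formula is:

    score(w) = deg(w) / freq(w)

where deg(w) is the degree of word w and freq(w) is the number of times w occurs in the text.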

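Algorithm steps

(1) The algorithm first tokenizes each sentence and then uses the stop words as delimiters to split the tokens into phrases, discarding the stop words themselves;

(2) It then counts, for every word, the words it co-occurs with inside a phrase, and builds a word co-occurrence matrix;

(3) The sum of each column of the co-occurrence matrix is the degree deg of the corresponding word (a concept from graph theory: the degree grows by 1 for every word that shares a phrase with it, the word itself included), and the number of times each word occurs in the text is its frequency freq;

(4) The score is the quotient of deg and freq; the larger the score, the more important the word;

(5) Finally, the phrases containing the words are output in descending order of score.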
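
Steps (2) to (4) can be made concrete with an explicit matrix. The sketch below is illustrative only: the three phrases are hypothetical inputs, and it assumes that no word is repeated inside a single phrase, so that the diagonal of the matrix equals the word frequency.

```python
phrases = ["some words", "other words", "mary"]  # hypothetical candidate phrases

# Fix a word order so that rows and columns of the matrix line up.
vocab = sorted({word for phrase in phrases for word in phrase.split()})
index = {word: i for i, word in enumerate(vocab)}

# matrix[i][j] counts how often word i and word j share a phrase;
# the diagonal counts every word together with itself.
matrix = [[0] * len(vocab) for _ in vocab]
for phrase in phrases:
    words = phrase.split()
    for first in words:
        for second in words:
            matrix[index[first]][index[second]] += 1

deg = {w: sum(row[index[w]] for row in matrix) for w in vocab}  # column sum
freq = {w: matrix[index[w]][index[w]] for w in vocab}           # diagonal entry
score = {w: deg[w] / freq[w] for w in vocab}

print(deg)    # -> {'mary': 1, 'other': 2, 'some': 2, 'words': 4}
print(freq)   # -> {'mary': 1, 'other': 1, 'some': 1, 'words': 2}
print(score)  # -> {'mary': 1.0, 'other': 2.0, 'some': 2.0, 'words': 2.0}
```

The word "words" occurs in two phrases, so its degree is high, but dividing by its frequency keeps frequent words from dominating purely by repetition.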

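Below we use a Chinese example to explain how RAKE works. Take the sentence “系統有聲音，但系統托盤的音量小喇叭圖標不見了” (“The system has sound, but the volume speaker icon in the system tray has disappeared”). After word segmentation and stop-word removal, the word set is W = {system, sound, tray, volume, speaker, icon, missing} and the phrase set is D = {system, sound, system tray, volume speaker icon missing}. The word co-occurrence matrix is built from D.

The degree of each word is deg = {system: 2, sound: 1, tray: 1, volume: 3, speaker: 3, icon: 3, missing: 3}, the frequency is freq = {system: 2, sound: 1, tray: 1, volume: 1, speaker: 1, icon: 1, missing: 1}, and the score is score = {system: 1, sound: 1, tray: 1, volume: 3, speaker: 3, icon: 3, missing: 3}. The output is {volume speaker icon missing, system tray, system, sound}.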
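
The example can be re-run in a few lines. This is a sketch under stated assumptions: the Chinese words are replaced by English glosses, the phrases are given as pre-segmented word lists, and the degree counts the word itself, as in step (3) and in the implementation below. Under that convention the degrees come out higher than the figures quoted above (3 for "system", 2 for "tray", 4 for each word of the longest phrase), while the order of the four phrases stays the same.

```python
from collections import Counter

# Phrase set D from the example, with English glosses (an assumption of this sketch).
phrases = [
    ["system"],
    ["sound"],
    ["system", "tray"],
    ["volume", "speaker", "icon", "missing"],
]

# freq: how many times each word occurs over all phrases.
freq = Counter(word for phrase in phrases for word in phrase)

# deg: each occurrence adds the length of its phrase, i.e. the word itself
# plus every word it shares the phrase with.
deg = Counter()
for phrase in phrases:
    for word in phrase:
        deg[word] += len(phrase)

score = {word: deg[word] / freq[word] for word in freq}

# A phrase's score is the sum of its words' scores; rank in descending order.
ranked = sorted(phrases, key=lambda phrase: -sum(score[w] for w in phrase))
for phrase in ranked:
    print(" ".join(phrase), sum(score[w] for w in phrase))
# -> volume speaker icon missing 16.0
# -> system tray 3.5
# -> system 1.5
# -> sound 1.0
```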

Code implementation

import string
from typing import Dict, List, Optional, Set, Tuple

PUNCTUATION = string.punctuation.replace("'", "")  # Do not use apostrophe as a delimiter

ENGLISH_WORDS_STOPLIST: List[str] = [
    '(', ')', 'and', 'of', 'the', 'amongst', 'with', 'from', 'after', 'its', 'it', 'at', 'is',
    'this', ',', '.', 'be', 'in', 'that', 'an', 'other', 'than', 'also', 'are', 'may', 'suggests',
    'all', 'where', 'most', 'against', 'more', 'have', 'been', 'several', 'as', 'before',
    'although', 'yet', 'likely', 'rather', 'over', 'a', 'for', 'can', 'these', 'considered',
    'used', 'types', 'given', 'precedes',
]


def split_to_tokens(text: str) -> List[str]:
    '''
    Split text string to tokens.
    Behavior is similar to str.split(),
    but empty lines are omitted and punctuation marks are separated from words.
    Example:
    split_to_tokens('John said "Hey!" (and some other words.)') ->
    -> ['John', 'said', '"', 'Hey', '!', '"', '(', 'and', 'some', 'other', 'words', '.', ')']
    '''
    result = []
    for item in text.split():
        # Detach leading punctuation marks one by one.
        while item and item[0] in PUNCTUATION:
            result.append(item[0])
            item = item[1:]
        if not item:
            # The item consisted of punctuation only.
            continue
        # Count trailing punctuation marks; item[0] is not punctuation here,
        # so the loop always reaches the break and i is always defined.
        for i in range(len(item)):
            if item[-i - 1] not in PUNCTUATION:
                break
        if i == 0:
            result.append(item)
        else:
            result.append(item[:-i])
            result.extend(item[-i:])
    return [item for item in result if item]


def split_tokens_to_phrases(tokens: List[str], stoplist: Optional[List[str]] = None) -> List[str]:
    """
    Merge tokens into phrases, delimited by items from stoplist.
    Phrase is a sequence of tokens that has the following properties:
    - phrase contains 1 or more tokens
    - tokens from phrase go in a row
    - phrase does not contain delimiters from stoplist
    - either the previous (not in a phrase) token belongs to stoplist or it is the beginning of tokens list
    - either the next (not in a phrase) token belongs to stoplist or it is the end of tokens list
    Example:
    split_tokens_to_phrases(
        tokens=['Mary', 'and', 'John', ',', 'some', 'words', '(', 'and', 'other', 'words', ')'],
        stoplist=['and', ',', '.', '(', ')']) ->
    -> ['Mary', 'John', 'some words', 'other words']
    """
    if stoplist is None:
        stoplist = ENGLISH_WORDS_STOPLIST

    current_phrase: List[str] = []
    all_phrases: List[str] = []
    # Punctuation marks always act as delimiters. Build a new set instead of
    # extending stoplist in place, so the caller's list is never modified.
    stoplist_set: Set[str] = {stopword.lower() for stopword in stoplist} | set(PUNCTUATION)
    for token in tokens:
        if token.lower() in stoplist_set:
            if current_phrase:
                all_phrases.append(' '.join(current_phrase))
            current_phrase = []
        else:
            current_phrase.append(token)
    if current_phrase:
        all_phrases.append(' '.join(current_phrase))
    return all_phrases


def get_cooccurrence_graph(phrases: List[str]) -> Dict[str, Dict[str, int]]:
    """
    Get graph that stores cooccurrence of tokens in phrases.
    Matrix is stored as dict,
    where key is token, value is dict (key is second token, value is number of cooccurrences).
    Example:
    get_cooccurrence_graph(['Mary', 'John', 'some words', 'other words']) -> {
        'mary': {'mary': 1},
        'john': {'john': 1},
        'some': {'some': 1, 'words': 1},
        'words': {'some': 1, 'words': 2, 'other': 1},
        'other': {'other': 1, 'words': 1}
    }
    """
    graph: Dict[str, Dict[str, int]] = {}
    for phrase in phrases:
        for first_token in phrase.lower().split():
            for second_token in phrase.lower().split():
                if first_token not in graph:
                    graph[first_token] = {}
                graph[first_token][second_token] = graph[first_token].get(second_token, 0) + 1
    return graph


def get_degrees(cooccurrence_graph: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """
    Get degrees for all tokens by cooccurrence graph.
    Result is stored as dict,
    where key is token, value is degree (sum of lengths of phrases that contain the token).
    Example:
    get_degrees(
        {
            'mary': {'mary': 1},
            'john': {'john': 1},
            'some': {'some': 1, 'words': 1},
            'words': {'some': 1, 'words': 2, 'other': 1},
            'other': {'other': 1, 'words': 1}
        }
    ) -> {'mary': 1, 'john': 1, 'some': 2, 'words': 4, 'other': 2}
    """
    return {token: sum(cooccurrence_graph[token].values()) for token in cooccurrence_graph}


def get_frequencies(cooccurrence_graph: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """
    Get frequencies for all tokens by cooccurrence graph.
    Result is stored as dict,
    where key is token, value is frequency (number of times the token occurs).
    Example:
    get_frequencies(
        {
            'mary': {'mary': 1},
            'john': {'john': 1},
            'some': {'some': 1, 'words': 1},
            'words': {'some': 1, 'words': 2, 'other': 1},
            'other': {'other': 1, 'words': 1}
        }
    ) -> {'mary': 1, 'john': 1, 'some': 1, 'words': 2, 'other': 1}
    """
    return {token: cooccurrence_graph[token][token] for token in cooccurrence_graph}


def get_ranked_phrases(phrases: List[str], *,
                       degrees: Dict[str, int],
                       frequencies: Dict[str, int]) -> List[Tuple[str, float]]:
    """
    Get RAKE measure for every phrase.
    Result is stored as list of tuples, every tuple contains a phrase and its RAKE measure.
    Items are sorted non-ascending by RAKE measure, then alphabetically by phrase.
    """
    processed_phrases: Set[str] = set()
    ranked_phrases: List[Tuple[str, float]] = []
    for phrase in phrases:
        lowered_phrase = phrase.lower()
        if lowered_phrase in processed_phrases:
            continue
        # Phrase score is the sum of deg/freq over its tokens.
        score: float = sum(degrees[token] / frequencies[token] for token in lowered_phrase.split())
        ranked_phrases.append((lowered_phrase, round(score, 2)))
        processed_phrases.add(lowered_phrase)
    # Sort by score, then by phrase alphabetically.
    ranked_phrases.sort(key=lambda item: (-item[1], item[0]))
    return ranked_phrases


def rake_text(text: str) -> List[Tuple[str, float]]:
    """
    Get RAKE measure for every phrase in text string.
    Result is stored as list of tuples, every tuple contains a phrase and its RAKE measure.
    Items are sorted non-ascending by RAKE measure, then alphabetically by phrase.
    """
    tokens: List[str] = split_to_tokens(text)
    phrases: List[str] = split_tokens_to_phrases(tokens)
    cooccurrence: Dict[str, Dict[str, int]] = get_cooccurrence_graph(phrases)
    degrees: Dict[str, int] = get_degrees(cooccurrence)
    frequencies: Dict[str, int] = get_frequencies(cooccurrence)
    ranked_result: List[Tuple[str, float]] = get_ranked_phrases(phrases, degrees=degrees, frequencies=frequencies)
    return ranked_result

Running the code:

if __name__ == '__main__':
    text = 'Mercy-class includes USNS Mercy and USNS Comfort hospital ships. Credit: US Navy photo Mass Communication Specialist 1st Class Jason Pastrick. The US Naval Air Warfare Center Aircraft Division (NAWCAD) Lakehurst in New Jersey is using an additive manufacturing process to make face shields.........'
    ranked_result = rake_text(text)
    print(ranked_result)

The extracted key phrases are as follows:

[
    ('additive manufacturing process to make face shields.the 3d printing face shields', 100.4),
    ('us navy photo mass communication specialist 1st class jason pastrick', 98.33),
    ('us navy’s mercy-class hospital ship usns comfort.currently stationed', 53.33),
    ...
]
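
Fragments such as "shields.the" appear in the output because split_to_tokens only detaches punctuation at the start and end of each whitespace-separated item, so a period with no space after it stays inside the token. One possible alternative, shown here only as a sketch, is a regular-expression tokenizer that also splits on punctuation inside an item. The function name split_on_all_punctuation and the decision to keep hyphens and apostrophes inside words are assumptions of this example, not part of the code above.

```python
import re


def split_on_all_punctuation(text):
    """Tokenize so that every punctuation mark becomes its own token."""
    # \w+(?:[-'’]\w+)* keeps hyphenated words and apostrophe forms together;
    # [^\w\s] emits any other single punctuation mark as a separate token.
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", text)


print(split_on_all_punctuation("face shields.The 3D printing"))
# -> ['face', 'shields', '.', 'The', '3D', 'printing']
print(split_on_all_punctuation("Mercy-class hospital ship"))
# -> ['Mercy-class', 'hospital', 'ship']
```

With tokens like these, the period would act as a delimiter in the phrase-splitting step and the two phrases around it would be scored separately.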

Reviewing editor: Peng Jing
