
Supervised Contrastive Learning Text Classification Model Based on Double-Layer Data Augmentation


Abstract: To address the non-selective expansion and the training shortcomings of the DoubleMix algorithm during data augmentation, we propose a supervised contrastive learning text classification model based on double-layer data augmentation, which effectively improves text classification accuracy when training data are scarce. First, keyword-based data augmentation is applied to the original data at the input layer, enhancing the data selectively without relying on sentence structure. Second, the original and augmented data are interpolated in the BERT hidden layers and then fed into a TextCNN for further feature extraction. Finally, the model is trained with the Wasserstein distance and a dual contrastive loss, thereby improving text classification accuracy. Comparative experiments show that the method achieves classification accuracies of 93.41%, 93.55%, 97.61%, and 95.27% on the SST-2, CR, TREC, and PC datasets, respectively, outperforming classical algorithms.

Keywords: data augmentation; text classification; contrastive learning; supervised learning

CLC number: TP39  Document code: A  Article ID: 1671-5489(2024)05-1179-09

Supervised Contrastive Learning Text Classification Model Based on Double-Layer Data Augmentation

WU Liang, ZHANG Fangfang, CHENG Chao, SONG Shinan

(College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China)

Abstract: Aiming at the non-selective expansion and training deficiencies of the DoubleMix algorithm during data augmentation, we proposed a supervised contrastive learning text classification model based on double-layer data augmentation, which effectively improved the accuracy of text classification when training data was scarce. Firstly, keyword-based data augmentation was applied to the original data at the input layer, selectively enhancing the data without considering sentence structure. Secondly, we interpolated the original and augmented data in the BERT hidden layers, and then sent them to the TextCNN for further feature extraction. Finally, the model was trained by using the Wasserstein distance and a double contrastive loss to enhance text classification accuracy. The comparative experimental results on the SST-2, CR, TREC, and PC datasets show that the classification accuracy of the proposed method is 93.41%, 93.55%, 97.61%, and 95.27%, respectively, which is superior to classical algorithms.

Keywords: data augmentation; text classification; contrastive learning; supervised learning
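To make the double-layer design summarized in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the BERT encoder is replaced by random hidden states, keyword-based input augmentation is assumed to have already produced the augmented view, and the interpolation coefficient, layer sizes, and the `TextCNNHead` class are illustrative assumptions.

```python
# A self-contained sketch of the double-layer pipeline described in the abstract:
# an augmented view of each sentence is produced at the input level, the original
# and augmented hidden states are interpolated, and a TextCNN head classifies the
# mixed representation. All names and sizes here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNHead(nn.Module):
    """TextCNN over a sequence of hidden states: parallel convolutions + max pooling."""
    def __init__(self, hidden: int = 768, n_classes: int = 2,
                 kernel_sizes=(3, 4, 5), channels: int = 100):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.fc = nn.Linear(channels * len(kernel_sizes), n_classes)

    def forward(self, h):                      # h: (batch, seq_len, hidden)
        x = h.transpose(1, 2)                  # -> (batch, hidden, seq_len)
        feats = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))

def mix_hidden(h_orig, h_aug, lam: float = 0.7):
    """Interpolate original and augmented hidden states (the second augmentation layer)."""
    return lam * h_orig + (1.0 - lam) * h_aug

if __name__ == "__main__":
    torch.manual_seed(0)
    batch, seq_len, hidden = 4, 32, 768
    h_orig = torch.randn(batch, seq_len, hidden)   # stand-in for BERT hidden states of original text
    h_aug = torch.randn(batch, seq_len, hidden)    # stand-in for hidden states of keyword-augmented text
    head = TextCNNHead()
    logits = head(mix_hidden(h_orig, h_aug))
    print(logits.shape)                            # torch.Size([4, 2])
```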

Text classification is one of the fundamental tasks of natural language processing (NLP) and is widely used in news filtering, paper classification, sentiment analysis, and other applications [2]. Deep learning models have achieved great success in text classification, but they are usually built on large amounts of high-quality training data, which are not easy to obtain in practical applications. Therefore, data augmentation techniques have attracted wide attention as a way to improve the generalization ability of text classification models when training data are limited [3]. Good feature representations and well-trained classifiers are also crucial for achieving high classification accuracy [4].

在自然語(yǔ)言處理領(lǐng)域中,存在標(biāo)記級(jí)別增強(qiáng)(token-level augment)、句子級(jí)別增強(qiáng)(sentence-level augment)、隱藏層增強(qiáng)(hidden-level augment)等類型[5].EDA6](easy data augmentation)是最常見(jiàn)的標(biāo)記級(jí)別數(shù)據(jù)增強(qiáng),通過(guò)對(duì)句子中的單詞進(jìn)行隨機(jī)替換、刪除、插入等操作實(shí)現(xiàn)數(shù)據(jù)增強(qiáng).句子級(jí)別的增強(qiáng)通過(guò)修改句子的語(yǔ)法或結(jié)構(gòu)實(shí)現(xiàn),最常見(jiàn)的是反向翻譯技術(shù),隱藏層數(shù)據(jù)增強(qiáng)的方法是基于對(duì)數(shù)據(jù)插值(interpolation)實(shí)現(xiàn)的,Mixup是最早出現(xiàn)的一種基于插值的增強(qiáng)方式,TMix(interpolation in textual hidden space)是在其基礎(chǔ)上發(fā)展的線性插值數(shù)據(jù)增強(qiáng)方式.Ssmix(saliency-based span mixup)是一種輸入級(jí)的混合插值方式.上述幾種插值方式都伴隨偽標(biāo)簽(softlabel)生成,會(huì)限制數(shù)據(jù)增強(qiáng)的有效性.DoubleMix5增強(qiáng)方法的提出避免了偽標(biāo)簽生成,首先利用EDA與回譯技術(shù)從原始數(shù)據(jù)中生成幾個(gè)擾動(dòng)樣本,然后在隱藏空間中混合擾動(dòng)樣本與原始樣本,最后采用JSD(Jensen-Shannon divergence)散度為正則項(xiàng)與交叉熵?fù)p失一起訓(xùn)練,但DoubleMix生成擾動(dòng)樣本的方式有的對(duì)句子結(jié)構(gòu)要求較高,有的對(duì)文本進(jìn)行非選擇性的補(bǔ)充。(剩余11705字)
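As a rough illustration of the DoubleMix-style training just described, the sketch below mixes original and perturbed hidden representations and adds a Jensen-Shannon divergence consistency term to the cross-entropy loss; the single-step mixing, the Beta-distributed coefficient, and the toy linear classifier are simplifying assumptions rather than the exact procedure of the cited paper.

```python
# Minimal PyTorch sketch of hidden-space mixing with a JSD regularizer plus
# cross-entropy, as in DoubleMix-style training. Tensor shapes and the toy
# classifier are assumptions for illustration only.
import torch
import torch.nn.functional as F

def js_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two categorical distributions given as logits."""
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def doublemix_step(classifier, h_orig, h_perturbed, labels, alpha=0.75, beta=1.0):
    """One training step: mix hidden states, then CE on the mixed view + JSD consistency."""
    # Beta-distributed mixing coefficient (a common Mixup convention; an assumption here).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    h_mixed = lam * h_orig + (1.0 - lam) * h_perturbed   # interpolation in hidden space

    logits_orig = classifier(h_orig)
    logits_mixed = classifier(h_mixed)

    loss_ce = F.cross_entropy(logits_mixed, labels)       # hard labels, no soft-label targets
    loss_jsd = js_divergence(logits_orig, logits_mixed)   # consistency regularizer
    return loss_ce + beta * loss_jsd

# Toy usage with random "hidden states" standing in for BERT layer outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    classifier = torch.nn.Linear(768, 2)                  # placeholder sentence classifier
    h = torch.randn(8, 768)                               # original sentence representations
    h_aug = h + 0.01 * torch.randn_like(h)                # perturbed views (e.g. from EDA / back-translation)
    y = torch.randint(0, 2, (8,))
    loss = doublemix_step(classifier, h, h_aug, y)
    loss.backward()
    print(float(loss))
```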
