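Abstract: To address class imbalance in text classification data, this paper proposes LMSBA (Large Model Sample Balancing Algorithm), a new data-level sample balancing algorithm based on large models. LMSBA balances the training set by generating samples for minority classes and filtering samples from majority classes, using specific prompts to guide the model so that generation and filtering work together. Experimental results show that, across four text classification models (FastText, TextCNN, TextRNN, and TextRCNN), LMSBA raises the macro-average F1 score by about 37.37 percentage points on average, demonstrating its effectiveness on imbalanced samples.
Keywords: large models; text classification; sample imbalance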
CLC number: TP391
Document code: A
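Funding: National Key Research and Development Program of China (2022YFC3302300); Pre-research Special Project (7090201050307); National 242 Information Security Program (2023A105)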
Optimization Method for Imbalanced Text Classification Based on Large Models
ZHANG Dawei1,2, MI Rongxin3, ZHOU Peiyao1,2, JIN Dawei2, ZHANG Manman2, SONG Tianhang2
(1. School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212100, China; 2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100190, China)
zhangdawei_just@126.com; mirongxin@cert.org.cn; zpy1690934380@163.com; dwjin0930@163.com; zhangmm6270@163.com; sth@gs.zzu.edu.cn
Abstract: This paper proposes LMSBA, a novel Large Model Sample Balancing Algorithm that addresses imbalanced text classification data at the data level. LMSBA balances samples by generating samples for minority classes and filtering samples from majority classes, using specific prompts to guide the model in combining generation and filtering. Experimental results show that, across four text classification models (FastText, TextCNN, TextRNN, and TextRCNN), LMSBA improves the macro-average F1-score by approximately 37.37 percentage points on average, validating its effectiveness on imbalanced sample distributions.
Keywords: large models; text classification; sample imbalance
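
The macro-average F1 score used as the headline metric is the unweighted mean of the per-class F1 scores, so a minority class counts as much as a majority class. The following is a minimal illustrative sketch of that standard definition, not the evaluation code used in this paper; the function names `macro_f1` and `accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined here as 0 when the denominator is 0
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced data (hypothetical): 9 majority samples, 1 minority sample,
# scored against a classifier that always predicts the majority class.
y_true = ["majority"] * 9 + ["minority"]
y_pred = ["majority"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

On this toy data the always-majority classifier reaches an accuracy of 0.9, while its macro-average F1 is below 0.5 because the minority class contributes an F1 of 0. This is why macro-average F1 is a more informative metric than accuracy for imbalanced classification.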