Software Engineering (软件工程)

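Cite this article:

ZHANG Dawei, MI Rongxin, ZHOU Peiyao, JIN Dawei, ZHANG Manman, SONG Tianhang. Optimization Method for Imbalanced Text Classification Based on Large Models[J]. Software Engineering, 2025, 28(3): 47-50.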

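Optimization Method for Imbalanced Text Classification Based on Large Models

ZHANG Dawei^1,2, MI Rongxin^3, ZHOU Peiyao^1,2, JIN Dawei^2, ZHANG Manman^2, SONG Tianhang^2

(1. School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212100, China;
2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100190, China)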
zhangdawei_just@126.com; mirongxin@cert.org.cn; zpy1690934380@163.com; dwjin0930@163.com; zhangmm6270@163.com; sth@gs.zzu.edu.cn

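Abstract: To address class imbalance in text classification data, this paper proposes LMSBA (Based on Large Model Sample Balancing Algorithm), a new data-level sample balancing algorithm built on large models. The algorithm balances the training set by generating samples for minority classes and filtering samples for majority classes, using specific prompts to guide the model so that generation and filtering work together. Experiments on four text classification models (FastText, TextCNN, TextRNN, and TextRCNN) show that LMSBA raises the macro-averaged F1 score by about 37.37 percentage points on average, demonstrating its effectiveness on imbalanced samples.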

Keywords: large models; text classification; sample imbalance

CLC number: TP391 Document code: A

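Funding: National Key Research and Development Program of China (2022YFC3302300); Pre-research Special Project (7090201050307); National 242 Information Security Program (2023A105)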
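
The abstract describes balancing a dataset by generating minority-class samples and filtering majority-class samples. The following is only a minimal sketch of that general idea, not the LMSBA implementation: the abstract does not give the prompts or the selection procedure, so `balance`, `generate`, `relevance`, and `target` are hypothetical names, and the two callables are simple stand-ins for what would be prompted large-model calls.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def balance(samples: List[Sample], target: int,
            generate: Callable[[str, str], str],
            relevance: Callable[[str, str], float],
            seed: int = 0) -> List[Sample]:
    """Bring every class to exactly `target` samples.

    `generate(text, label)` and `relevance(text, label)` are hypothetical
    stand-ins for large-model calls (assumptions, not the paper's API).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    balanced: List[Sample] = []
    for label, texts in by_label.items():
        if len(texts) > target:
            # Majority class: keep the `target` texts with the highest score.
            kept = sorted(texts, key=lambda t: relevance(t, label),
                          reverse=True)[:target]
        else:
            # Minority class: keep all originals, then add generated variants
            # of randomly chosen originals until the class reaches `target`.
            kept = list(texts)
            while len(kept) < target:
                kept.append(generate(rng.choice(texts), label))
        balanced.extend((t, label) for t in kept)
    return balanced


# Toy data: 6 "news" samples (majority) and 2 "spam" samples (minority).
data = [(f"news item {i}", "news") for i in range(6)]
data += [("win a prize", "spam"), ("free offer", "spam")]

balanced = balance(
    data,
    target=4,
    generate=lambda text, label: f"{text} (variant)",  # placeholder generator
    relevance=lambda text, label: len(text),           # placeholder scorer
)
print(Counter(label for _, label in balanced))
```

With these placeholders each class ends up with four samples, and the original minority texts are all retained. In a real pipeline the two callables would wrap model prompts, and the generated texts would need quality checks before training.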
