Software Engineering

Cite this article:

LIANG Junwei, GU Yiran, HUANG Liya. Patent Similarity Search Method Based on Semantic Distance[J]. Software Engineering, 2026, 29(3): 66-72.


Patent Similarity Search Method Based on Semantic Distance

LIANG Junwei¹, GU Yiran¹, HUANG Liya²

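(1. College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China;
2. College of Electronic and Optical Engineering & College of Flexible Electronics (Future Technology), Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China)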
liangjunwei23@163.com; guyr@njupt.edu.cn; huangly@njupt.edu.cn

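Abstract: As demand for intellectual property applications keeps rising, traditional patent examination methods face challenges in processing massive volumes of text and in precise semantic matching. Existing deep-learning-based patent similarity search methods struggle with the large amount of information in patent texts and match complex semantics imprecisely. This work constructs a large-scale patent dataset and proposes a patent similarity analysis model based on a hybrid encoding architecture. First, the model uses a cross-encoder to generate high-confidence pseudo-labels, compensating for the shortage of annotated patent sentence pairs. Second, a dual-encoder architecture encodes texts independently and in parallel to produce semantic vectors for efficient retrieval. Finally, the pseudo-labels are combined with manually annotated data for co-training, which improves the model's adaptability to complex technical descriptions in patents. Experimental results show that the model outperforms baseline models such as PatentSBERTa and PatentBERT in both precision and F1 score.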

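Keywords: patent examination; patent similarity detection; hybrid encoding architecture; domain-specific data augmentation strategy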

CLC number: TP350  Document code: A

Funding: Supported by the National Natural Science Foundation of China (61977039)

Patent Similarity Search Method Based on Semantic Distance

LIANG Junwei¹, GU Yiran¹, HUANG Liya²

(1. College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
2. College of Electronic and Optical Engineering & College of Flexible Electronics (Future Technology), Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
liangjunwei23@163.com; guyr@njupt.edu.cn; huangly@njupt.edu.cn

Abstract: The growing volume of intellectual property applications poses challenges for traditional patent examination in large-scale text processing and accurate semantic matching. This study proposes a patent similarity search model based on a hybrid encoding framework. A large-scale patent dataset was constructed. High-confidence pseudo-labels were first generated using a cross-encoder to compensate for the lack of labeled sentence pairs. Then, a dual-encoder architecture encoded patent texts independently and generated semantic vectors for efficient retrieval. Finally, pseudo-labeled data and manually annotated data were used for co-training, improving the model’s ability to handle complex technical content. Experimental results show that the proposed model achieves higher precision and F1 scores than baseline models such as PatentSBERTa and PatentBERT.

Keywords: patent examination; patent similarity detection; hybrid encoding framework; domain-specific data augmentation strategy
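
The sketch below illustrates two steps the abstract describes: keeping only high-confidence cross-encoder scores as pseudo-labels, and ranking candidates by the distance between independently encoded vectors. It is a minimal illustration, not the authors' implementation. The function names, the 0.9 and 0.1 confidence thresholds, and the choice of cosine distance are all assumptions; the abstract does not state the distance measure or the thresholds. Scores and vectors are passed in as plain numbers, standing in for the outputs of real cross-encoder and dual-encoder models.

```python
import math
from typing import Dict, List, Sequence, Tuple


def select_pseudo_labels(
    scored_pairs: Sequence[Tuple[str, str, float]],
    low: float = 0.1,   # assumed threshold, not from the paper
    high: float = 0.9,  # assumed threshold, not from the paper
) -> List[Tuple[str, str, int]]:
    """Keep pairs whose cross-encoder score is confidently similar or dissimilar.

    Each input item is (text_a, text_b, score) with score in [0, 1].
    Pairs scored between the two thresholds are dropped as uncertain.
    """
    selected = []
    for text_a, text_b, score in scored_pairs:
        if score >= high:
            selected.append((text_a, text_b, 1))
        elif score <= low:
            selected.append((text_a, text_b, 0))
    return selected


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed here as the semantic distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank pre-encoded documents by ascending distance to the query vector."""
    ranked = sorted(
        (cosine_distance(query_vec, vec), doc_id) for doc_id, vec in index.items()
    )
    return [(doc_id, dist) for dist, doc_id in ranked[:top_k]]


# Toy usage with hand-written scores and 2-D vectors (hypothetical data).
pairs = [("claim A", "claim B", 0.95), ("claim A", "claim C", 0.50)]
pseudo_labeled = select_pseudo_labels(pairs)
toy_index = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
nearest = rank_by_distance([1.0, 0.1], toy_index, top_k=2)
```

Because each document is encoded independently in a dual-encoder setup, the vectors in `toy_index` could be computed once and reused for every query, which is what makes this style of retrieval efficient compared with scoring every pair through a cross-encoder.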
