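Cite this article: LIANG Junwei, GU Yiran, HUANG Liya. Patent Similarity Search Method Based on Semantic Distance[J]. Software Engineering (软件工程), 2026, 29(3): 66-72.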
Patent Similarity Search Method Based on Semantic Distance
LIANG Junwei1, GU Yiran1, HUANG Liya2
(1. College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China;
2. College of Electronic and Optical Engineering & College of Flexible Electronics (Future Technology), Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China)
liangjunwei23@163.com; guyr@njupt.edu.cn; huangly@njupt.edu.cn
Abstract: The growing volume of intellectual property applications challenges traditional patent examination in both large-scale text processing and accurate semantic matching. Existing deep-learning-based patent similarity search methods have to cope with the large amount of information in patent texts and match complex semantics imprecisely. This study constructs a large-scale patent dataset and proposes a patent similarity analysis model based on a hybrid encoding framework. First, a cross-encoder generates high-confidence pseudo-labels to compensate for the shortage of labeled patent sentence pairs. Second, a dual-encoder architecture encodes patent texts independently and in parallel, producing semantic vectors for efficient retrieval. Finally, the pseudo-labeled data and the manually annotated data are combined for co-training, improving the model's adaptability to complex technical descriptions in patents. Experimental results show that the proposed model achieves higher precision and F1 scores than baseline models such as PatentSBERTa and PatentBERT.
Keywords: patent examination  patent similarity detection  hybrid encoding framework  domain-specific data augmentation strategy
CLC number: TP350    Document code: A
Funding: Supported by the National Natural Science Foundation of China (61977039)
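The three stages named in the abstract (pseudo-labeling with a cross-encoder, independent encoding with a dual encoder, and training on pseudo-labeled plus manually annotated pairs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `toy_cross_score` (token overlap) stands in for a trained cross-encoder, `toy_encode` (hashed bag of words) stands in for a trained dual encoder, and the thresholds `low=0.2` and `high=0.6` are hypothetical values; the abstract does not state how confidence is measured.

```python
import hashlib
import math
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
LabeledPair = Tuple[str, str, int]


def toy_cross_score(a: str, b: str) -> float:
    """Hypothetical stand-in for a cross-encoder: Jaccard token overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pseudo_label(pairs: List[Pair],
                 score_fn: Callable[[str, str], float],
                 low: float = 0.2,
                 high: float = 0.6) -> List[LabeledPair]:
    """Keep only pairs scored confidently; thresholds are illustrative assumptions."""
    labeled = []
    for a, b in pairs:
        score = score_fn(a, b)
        if score >= high:
            labeled.append((a, b, 1))   # confident "similar"
        elif score <= low:
            labeled.append((a, b, 0))   # confident "dissimilar"
        # pairs with scores strictly between low and high are discarded
    return labeled


def toy_encode(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for a dual encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Semantic distance between two unit vectors: 1 minus cosine similarity."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def search(query: str, corpus: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Rank documents by distance to the query; each text is encoded independently."""
    q = toy_encode(query)
    scored = [(cosine_distance(q, toy_encode(doc)), doc) for doc in corpus]
    return sorted(scored)[:top_k]


if __name__ == "__main__":
    unlabeled = [
        ("rotor blade cooling channel", "rotor blade cooling channel design"),
        ("rotor blade cooling channel", "coffee filter paper holder"),
        ("rotor blade cooling channel", "blade cooling for turbine engines"),
    ]
    manual = [("lithium battery electrode", "electrode for a lithium battery", 1)]
    # The combined set is what a dual encoder would be trained on (training not shown).
    train_set = manual + pseudo_label(unlabeled, toy_cross_score)
    print(len(train_set), "training pairs")
    for dist, doc in search("rotor blade cooling channel", [p[1] for p in unlabeled]):
        print(round(dist, 3), doc)
```

Because each document is encoded without reference to the query, the corpus vectors could be computed once and reused across queries, which is the efficiency argument the abstract makes for the dual encoder. A cross-encoder scores each pair jointly, so the sketch uses it only offline to produce labels.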

