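| Abstract: As demand for intellectual property applications keeps rising, traditional patent examination methods face challenges in processing massive volumes of text and in precise semantic matching. Existing deep-learning-based patent similarity retrieval methods struggle with the large amount of information in patent texts and match complex semantics imprecisely. A large-scale patent dataset is constructed, and a patent similarity analysis model based on a hybrid encoding architecture is proposed. First, the model uses a cross-encoder to generate high-confidence pseudo-labels, compensating for the shortage of annotated patent sentence pairs. Second, a dual-encoder architecture encodes the texts independently and in parallel and generates semantic vectors for efficient retrieval. Finally, the pseudo-labels are combined with manually annotated data for co-training, which improves the model's adaptability to complex technical descriptions in patents. Experimental results show that the model outperforms baseline models such as PatentSBERTa and PatentBERT in both precision and F1 score. |
| Keywords: patent examination; patent similarity detection; hybrid encoding architecture; domain data augmentation strategy |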
|
CLC number: TP350
Document code: A
|
| Funding: Supported by the National Natural Science Foundation of China (61977039) |
|
| Patent Similarity Search Method Based on Semantic Distance |
|
LIANG Junwei1, GU Yiran1, HUANG Liya2
|
(1. College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 2. College of Electronic and Optical Engineering & College of Flexible Electronics (Future Technology), Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
liangjunwei23@163.com; guyr@njupt.edu.cn; huangly@njupt.edu.cn
|
| Abstract: The growing volume of intellectual property applications poses challenges for traditional patent examination in large-scale text processing and accurate semantic matching. This study proposes a patent similarity search model based on a hybrid encoding framework. A large-scale patent dataset was constructed. High-confidence pseudo-labels were first generated using a cross-encoder to compensate for the lack of labeled sentence pairs. Then, a dual-encoder architecture encoded patent texts independently and generated semantic vectors for efficient retrieval. Finally, pseudo-labeled data and manually annotated data were used for co-training, improving the model’s ability to handle complex technical content. Experimental results show that the proposed model achieves higher precision and F1 scores than baseline models such as PatentSBERTa and PatentBERT. |
| Keywords: patent examination; patent similarity detection; hybrid encoding framework; domain-specific data augmentation strategy |
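
The three stages summarized in the abstract (cross-encoder pseudo-labeling, independent dual-encoder encoding for retrieval, and merging pseudo-labeled with manually annotated pairs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: a token-overlap score stands in for the cross-encoder, a bag-of-words vector stands in for the dual encoder, and the thresholds `hi` and `lo`, the function names, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (toy stand-in for a real tokenizer)."""
    return text.lower().split()


def cross_score(a, b):
    """Stand-in for a cross-encoder: looks at both texts jointly.

    Here it is just the Jaccard overlap of the two token sets.
    """
    ta, tb = set(tokenize(a)), set(tokenize(b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pseudo_label(pairs, hi=0.6, lo=0.1):
    """Keep only confidently scored pairs; drop the uncertain middle band."""
    labeled = []
    for a, b in pairs:
        score = cross_score(a, b)
        if score >= hi:
            labeled.append((a, b, 1))  # confident positive
        elif score <= lo:
            labeled.append((a, b, 0))  # confident negative
    return labeled


def encode(text):
    """Stand-in for a dual encoder: each text is embedded on its own."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def search(query, corpus_vectors, top_k=1):
    """Rank precomputed corpus vectors against a single query vector."""
    q = encode(query)
    order = sorted(range(len(corpus_vectors)),
                   key=lambda i: cosine(q, corpus_vectors[i]),
                   reverse=True)
    return order[:top_k]


# Hypothetical toy corpus of patent-like sentences.
corpus = [
    "a battery pack with liquid cooling channels",
    "a method for image segmentation using neural networks",
    "a battery pack with air cooling channels",
]

# Stage 1: pseudo-label unlabeled pairs with the joint scorer.
unlabeled_pairs = [(corpus[0], corpus[2]), (corpus[0], corpus[1])]
pseudo = pseudo_label(unlabeled_pairs)

# Stage 2: encode the corpus once, then retrieve by vector similarity.
corpus_vectors = [encode(doc) for doc in corpus]
top = search("battery pack liquid cooling", corpus_vectors, top_k=1)

# Stage 3: merge pseudo-labels with (hypothetical) manual annotations
# to form the training set that a real dual encoder would be trained on.
manual = [(corpus[1], "segmenting images with a convolutional network", 1)]
training_set = manual + pseudo
```

The design point the sketch tries to capture is the cost asymmetry: the joint scorer must be run once per pair, so it is used offline to produce labels, while the independent encoder lets corpus vectors be computed once and reused for every query.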