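Abstract: To address the low quality and high redundancy of data collected in weaving workshops, a comprehensive data cleaning method based on cluster analysis is proposed. First, the energy consumption of the textile enterprise's workshop is analyzed hierarchically, and an identification method based on the bisecting K-means algorithm is proposed for abnormal data. Second, missing data are imputed with a variety of imputation methods suited to the characteristics of each kind of data; to address the high redundancy, the coefficient of determination is introduced to deduplicate the dataset. Finally, simulation experiments are carried out on operating data from a textile enterprise's workshop. The results show that deduplication reduces the data volume of the dataset by 83%, while the mean absolute percentage error of the prediction experiments on the dataset fluctuates within a range of less than 2%; the method therefore reduces data redundancy while preserving prediction reliability.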
			
	         
Keywords: data cleaning; clustering; anomaly detection; deduplication
		
			 
                     
			
Chinese Library Classification (CLC) number: TP111.8    Document code: A
		
	   
            
Funding: Zhejiang Provincial Science and Technology Program (2022C01202)
	     
           
Cleaning of Energy Consumption Data in Weaving Workshop Based on Clustering Analysis Method
           
			
HUANG Qihang¹, RU Xin¹, DAI Ning¹, YU Bo¹, CHEN Wei², XU Yushan³
           
		   
(1. School of Mechanical Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. Zhejiang Tianheng Information Technology Co., Ltd., Shaoxing 312500, China; 3. Zhejiang Kangli Automatic Control Technology Co., Ltd., Shaoxing 312500, China)
2801554196@qq.com; zhitingna@126.com; 990713260@qq.com; angle_xb@163.com; 287270195@qq.com; 1193570378@qq.com
             
Abstract: To address the low data quality and high data redundancy of data collected in weaving workshops, this paper proposes a comprehensive data cleaning method based on cluster analysis. First, the energy consumption of the textile enterprise's workshop is analyzed hierarchically, and a method based on the bisecting K-means algorithm is proposed for identifying abnormal data. Second, missing data are imputed with a variety of imputation methods suited to the characteristics of each kind of data; to address high redundancy, the coefficient of determination is introduced to deduplicate the dataset. Finally, simulation experiments are conducted on operating data from a textile enterprise's workshop. The results show that deduplication reduces the data volume of the dataset by 83%, while the mean absolute percentage error (MAPE) of the prediction experiments on the dataset fluctuates within a range of less than 2%. The method thus reduces data redundancy while preserving the reliability of prediction.
	       
Keywords: data cleaning; clustering; anomaly detection; deduplication
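The abstract does not spell out how the coefficient of determination is applied during deduplication. As an illustration only, the following minimal Python sketch assumes that the data are split into equal-length segments and that a segment counts as redundant when an already retained segment reproduces it with R² at or above a threshold. The function names, the segment layout, and the 0.98 threshold are hypothetical and are not taken from the paper; a small MAPE helper is included because the abstract reports prediction quality with that metric.

```python
def r_squared(reference, candidate):
    """Coefficient of determination of `candidate` as a predictor of `reference`."""
    mean_ref = sum(reference) / len(reference)
    ss_tot = sum((y - mean_ref) ** 2 for y in reference)
    ss_res = sum((y - p) ** 2 for y, p in zip(reference, candidate))
    if ss_tot == 0:
        # Constant reference: only an exact match counts as fully explained.
        return 1.0 if ss_res == 0 else 0.0
    return 1.0 - ss_res / ss_tot


def deduplicate(segments, threshold=0.98):
    """Keep a segment only if no retained segment explains it with R^2 >= threshold.

    The threshold is an assumed value chosen for illustration.
    """
    kept = []
    for seg in segments:
        if all(r_squared(seg, ref) < threshold for ref in kept):
            kept.append(seg)
    return kept


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical energy readings: the second segment nearly repeats the first.
    segments = [
        [1.0, 2.0, 3.0, 4.0],
        [1.0, 2.0, 3.0, 4.1],
        [4.0, 3.0, 2.0, 1.0],
    ]
    print(deduplicate(segments))
    print(mape([100.0, 200.0], [110.0, 190.0]))
```

In this sketch the comparison is greedy and order-dependent: whichever segment appears first is retained as the representative. Whether the paper compares segments, individual records, or fitted models is not stated in the abstract.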