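Multi-source Heterogeneous Data Governance Based on Semi-supervised Learning

Affiliations:

1. School of Software Engineering, Tongji University, Shanghai 201804, China; 2. National Key Laboratory for Complex Systems Simulation, Beijing 100101, China

Author biography:

Rao Weixiong (饶卫雄), born 1974, male, professor, doctoral supervisor, Doctor of Engineering. Main research interests: applications of deep reinforcement learning, spatio-temporal data science, and mobile computing. E-mail: wxrao@tongji.edu.cn

Corresponding author:

Zhao Qinpei (赵钦佩), born 1982, female, associate professor, master's supervisor, Doctor of Engineering. Main research interests: machine learning, data mining, and pattern recognition. E-mail: qinpeizhao@tongji.edu.cn

CLC number:

TP391

Funding:

Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100); Fundamental Research Funds for the Central Universities
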
    Abstract:

    To enable interoperability between different data management systems, this paper proposes a multi-source heterogeneous data governance framework based on semi-supervised learning and, on that basis, designs, implements, and tests an automated method for aligning unstructured data with structured data. Named entity recognition (NER) first converts the unstructured data into structured data. Schema matching on the structured data is then performed with a string-similarity-based method and a supervised-learning-based method, respectively. A semi-supervised learning method matches and merges the structured data with the corresponding entities in database records. Finally, natural language processing (NLP) techniques and deep learning methods impute missing values in the merged dataset. The results show that alignment achieves F1-scores of 89.70% on the paper dataset and 96.50% on the video metadata dataset, and that after missing-value imputation on different attributes the overall imputation accuracy exceeds 78%, substantially better than the baseline methods.
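The string-similarity-based schema matching step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which similarity measure, threshold, or assignment strategy is used. The sketch assumes Python's standard `difflib.SequenceMatcher` as the similarity measure, a hypothetical cutoff of 0.6, and a greedy one-to-one assignment; the function names `similarity`, `match_schema`, and `f1_score` are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schema(source_attrs, target_attrs, threshold=0.6):
    """Greedily map each source attribute to its most similar unused target.

    Pairs scoring below `threshold` (an assumed cutoff, not taken from the
    paper) are left unmatched.
    """
    # Score every source/target pair and visit the best-scoring pairs first.
    candidates = sorted(
        ((similarity(s, t), s, t) for s in source_attrs for t in target_attrs),
        reverse=True,
    )
    mapping, used_targets = {}, set()
    for score, s, t in candidates:
        if score < threshold:
            break  # all remaining pairs score lower still
        if s not in mapping and t not in used_targets:
            mapping[s] = t
            used_targets.add(t)
    return mapping


def f1_score(predicted: dict, gold: dict) -> float:
    """F1 of predicted attribute pairs against a hand-labelled gold mapping."""
    true_positives = sum(1 for s, t in predicted.items() if gold.get(s) == t)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A call such as `match_schema(["title", "author_name", "pub_year"], ["Title", "authors", "year", "venue"])` returns a dictionary from source attribute names to target attribute names, which `f1_score` can then compare against a manually labelled mapping. A greedy assignment is chosen here only for brevity; an optimal one-to-one assignment or a learned matcher, as in the supervised variant the abstract mentions, would be a different design.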

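Cite this article

Rao Weixiong (饶卫雄), Gao Hongye (高宏业), Lin Cheng (林程), Zhao Qinpei (赵钦佩), Ye Feng (叶丰). Multi-source Heterogeneous Data Governance Based on Semi-supervised Learning[J]. Journal of Tongji University (Natural Science), 2022, 50(10): 1392-1404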

History
  • Received: 2022-05-10
  • Published online: 2022-11-03