Multi-source Heterogeneous Data Governance Based on Semi-supervised Learning
Affiliation:

1. School of Software Engineering, Tongji University, Shanghai 201804, China; 2. National Key Laboratory for Complex Systems Simulation, Beijing 100101, China

CLC Number:

TP391

    Abstract:

    To enable interoperability between different data management systems, we propose a framework for multi-source heterogeneous data governance based on semi-supervised learning, and we design, implement, and test a method that automatically aligns unstructured data with structured data. The framework first applies named entity recognition (NER) to convert unstructured data into structured data, and then performs schema matching on the structured data with a string-similarity-based method and a supervised-learning-based method. A semi-supervised learning method then matches each structured record to its corresponding entity in the database and integrates the two. Finally, natural language processing (NLP) and deep learning methods impute missing values in the integrated dataset. The results show that the alignment F1-scores on the paper dataset and the video metadata dataset are 89.70% and 96.50%, respectively, and that the accuracy of missing-value imputation exceeds 78% on every attribute, a substantial improvement over the baseline methods.
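
    The string-similarity-based schema matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the similarity measure (Python's standard-library `difflib.SequenceMatcher`), the function names `name_similarity` and `match_schema`, the threshold value, and the column names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two column names.

    Names are lowercased and underscores are treated as spaces before
    comparison (an assumed normalization, not taken from the paper).
    """
    def norm(s: str) -> str:
        return s.lower().replace("_", " ").strip()

    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def match_schema(source_cols, target_cols, threshold=0.6):
    """Map each source column to its most similar target column.

    A source column whose best score is below `threshold` maps to None.
    The mapping is greedy and does not enforce a one-to-one assignment.
    """
    mapping = {}
    for src in source_cols:
        best = max(target_cols, key=lambda tgt: name_similarity(src, tgt))
        score = name_similarity(src, best)
        mapping[src] = best if score >= threshold else None
    return mapping


# Hypothetical column names, not taken from the paper's datasets.
source = ["paper_title", "author_names", "pub_year"]
target = ["title", "authors", "year", "venue"]
mapping = match_schema(source, target, threshold=0.5)
print(mapping)
```

    A purely name-based matcher like this one fails when two schemas use unrelated names for the same attribute, which is one plausible reason the abstract pairs it with a supervised-learning-based method.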

Get Citation

RAO Weixiong, GAO Hongye, LIN Cheng, ZHAO Qinpei, YE Feng. Multi-source Heterogeneous Data Governance Based on Semi-supervised Learning[J]. Journal of Tongji University (Natural Science), 2022, 50(10): 1392-1404

History
  • Received: May 10, 2022
  • Online: November 03, 2022