首页 > 资讯 > 结构化数据清洗技术综述

结构化数据清洗技术综述

泰然健康网
2024-12-07 15:31

结构化数据清洗技术综述1. 北京交通大学计算机与信息技术学院, 北京 100044;
2. 清华大学计算机科学与技术系, 数据库组, 北京 100084Survey of structured data cleaning methodsHAO Shuang1,2, LI Guoliang2, FENG Jianhua2, WANG Ning11. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China;
2. Database Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 摘要图/表参考文献相关文章Metrics 摘要数据清洗是对脏数据进行检测和纠正的过程，是进行数据分析和管理的基础。该文对经典和新兴的数据清洗技术进行分类和总结，为进一步的研究工作提供方向。形式化定义了数据清洗问题，对数据缺失、数据冗余、数据冲突和数据错误这4种数据噪声的检测技术进行详细阐述。按照数据清洗方式对数据噪声的消除技术进行分类概述，包括基于完整性约束的数据清洗算法、基于规则的数据清洗算法、基于统计的数据清洗算法和人机结合的数据清洗算法。介绍了常用的测评数据集和噪声注入工具，并对未来重点的研究方向进行了探讨和展望。服务把本文推荐给朋友加入引用管理器 E-mail AlertRSS收藏文章作者相关文章郝爽李国良冯建华王宁关键词：数据清洗, 数据噪声, 噪声检测, 噪声消除 Abstract：Data cleaning is the process of detecting and repairing dirty data which is often needed in data analysis and management. This paper classifies and summarizes the traditional and advanced data cleaning techniques and identifies potential directions for further work. This study first formally defines the cleaning problem for structured data and then describes error detection methods for missing data, redundant data, conflicting data and erroneous data. The data cleaning methods are then summarized based on their error elimination method, including constraint-based data cleaning, rule-based data cleaning, statistical data cleaning and human-in-the-loop data cleaning. Some important datasets and noise injection tools are introduced as well. Open research problems and future research directions are also discussed.Key words：data cleaning dirty data error detection error elimination收稿日期: 2018-07-31 出版日期: 2018-12-13基金资助:国家重点研发计划项目（2018YFC0809800）；国家自然科学基金项目（61373024，61632016，61422205，61521002）通讯作者:李国良,副教授,E-mail:liguoliang@tsinghua.edu.cn E-mail: liguoliang@tsinghua.edu.cn引用本文: 郝爽, 李国良, 冯建华, 王宁. 结构化数据清洗技术综述[J]. 清华大学学报（自然科学版）, 2018, 58(12): 1037-1050.
HAO Shuang, LI Guoliang, FENG Jianhua, WANG Ning. Survey of structured data cleaning methods. Journal of Tsinghua University(Science and Technology), 2018, 58(12): 1037-1050.链接本文: http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.22.053 或 http://jst.tsinghuajournals.com/CN/Y2018/V58/I12/1037

[1] SHALLCROSS S. 2 Reasons why your data is lying to you[R/OL]. (2016-09-15)[2018-06-25]. http://hollistibbetts.sys-con.com/node/1975126.
[2] BUCKLEY J. Causes of dirty data and how to combat them[R/OL]. (2018-07-13)[2018-06-25]. https://www.qubole.com/blog/causes-of-dirty-data-and-how-to-combat-them.
[3] GALHARDAS H, FLORESCU D, SHASHA D, et al. An extensible framework for data cleaning[C]//IEEE 16th International Conference on Data Engineering. San Diego, USA, 2000.
[4] REKATSINAS T, CHU X, ILYAS I F, et al. HoloClean:Holistic data repairs with probabilistic inference[J]. Proceedings of the VLDB Endowment, 2017, 10(11):1190-1201.
[5] MAYFIELD C, NEVILLE J, PRABHAKAR S. ERACER:A database approach for statistical inference and data cleaning[C]//2010 ACM SIGMOD International Conference on Management of Data. Indianapolis, USA, 2010:75-86.
[6] YAKOUT M, BERTI-ÉQUILLE L, ELMAGARMID A K. Don't be SCAREd:Use scalable automatic repairing with maximal likelihood and bounded changes[C]//2013 ACM SIGMOD International Conference on Management of Data. New York, USA, 2013:553-564.
[7] WANG J N, KRASKA T, FRANKLIN M J, et al. CrowdER:Crowdsourcing entity resolution[J]. Proceedings of the VLDB Endowment, 2012, 5(11):1483-1494.
[8] CHAI C, LI G L, LI J, et al. Cost-effective crowdsourced entity resolution:A partial-order approach[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:969-984.
[9] 郭志懋, 周傲英. 数据质量和数据清洗研究综述[J]. 软件学报, 2002, 13(11):2076-2082. GUO Z M, ZHOU A Y. Research on data quality and data cleaning:A survey[J]. Journal of Software, 2002, 13(11):2076-2082. (in Chinese)
[10] 杨辅祥, 刘云超, 段智华. 数据清理综述[J]. 计算机应用研究, 2002, 19(3):3-5. YANG F X, LIU Y C, DUAN Z H. An overview of data cleaning[J]. Application Research of Computers, 2002, 19(3):3-5. (in Chinese)
[11] 王曰芬, 章成志, 张蓓蓓, 等. 数据清洗研究综述[J]. 现代图书情报技术, 2007, 2(12):50-56. WANG Y F, ZHANG C Z, ZHANG B B, et al. A survey of data cleaning[J]. New Technology of Library and Information Service, 2007, 2(12):50-56. (in Chinese)
[12] 赵一凡, 卞良, 丛昕. 数据清洗方法研究综述[J]. 软件导刊, 2017(12):222-224. ZHAO Y F, BIAN L, CONG X. Survey of data cleaning method in data preprocessing[J]. Software Guide, 2017(12):222-224. (in Chinese)
[13] ELMAGARMID A K, IPEIROTIS P G, VERYKIOS V S. Duplicate record detection:A survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):1-16.
[14] CHU X, ILYAS I F, KRISHNAN S, et al. Data cleaning:Overview and emerging challenges[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:2201-2206.
[15] ILYAS I F, CHU X. Trends in cleaning relational data:Consistency and deduplication[J]. Foundations and Trends in Databases, 2015, 5(4):281-393.
[16] FAN W F, GEERTS F, JIA X B, et al. Conditional functional dependencies for capturing data inconsistencies[J]. ACM Transactions on Database Systems (TODS), 2008, 33(2):6.
[17] BOHANNON P, FAN W F, GEERTS F, et al. Conditional functional dependencies for data cleaning[C]//IEEE 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007:746-755.
[18] ABITEBOUL S, HULL R, VIANU V. Foundations of databases:The logical level[M]. Boston, USA:Addison-Wesley Longman Publishing, 1995.
[19] KOUDAS N, SAHA A, SRIVASTAVA D, et al. Metric functional dependencies[C]//IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:1275-1278.
[20] SONG S X, CHEN L. Differential dependencies:Reasoning and discovery[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):16.
[21] WANG J N, TANG N. Towards dependable data repairing with fixing rules[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:457-468.
[22] WANG J N, TANG N. Dependable data repairing with fixing rules[J]. Journal of Data and Information Quality (JDIQ), 2017, 8(3-4):16.
[23] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):173-184.
[24] FAN W F, LI J Z, MA S, et al. Towards certain fixes with editing rules and master data[J]. The VLDB Journal, 2012, 21(2):213-238.
[25] INTERLANDI M, TANG N. Proof positive and negative in data cleaning[C]//IEEE 31st International Conference on Data Engineering. Seoul, Republic of Korea, 2015:18-29.
[26] HAO S, TANG N, LI G L, et al. Cleaning relations using knowledge bases[C]//IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:933-944.
[27] HAO S, TANG N, LI G L, et al. Distilling relations using knowledge bases[J]. The VLDB Journal, 2018, 27(4):497-519.
[28] HUA M, PEI J. DiMaC:A disguised missing data cleaning tool[C]//14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, 2008:1077-1080.
[29] QAHTAN A A, ELMAGARMID A, CASTRO FERNANDEZ R, et al. FAHES:A robust disguised missing values detector[C]//24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London, UK, 2018:2100-2109.
[30] KÖPCKE H, RAHM E. Frameworks for entity matching:A comparison[J]. Data & Knowledge Engineering, 2010, 69(2):197-210.
[31] HE Q L, LI Z H, ZHANG X. Data deduplication techniques[C]//2010 IEEE International Conference on Future Information Technology and Management Engineering (FITME). Changzhou, China, 2010:430-433.
[32] BAXTER R, CHRISTEN P, CHURCHES T. A comparison of fast blocking methods for record linkage[C]//2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation. Washington, DC, USA, 2003, 3:25-27.
[33] TEJADA S, KNOBLOCK C A, MINTON S. Learning object identification rules for information integration[J]. Information Systems, 2001, 26(8):607-633.
[34] ELFEKY M G, VERYKIOS V S, ELMAGARMID A K. TAILOR:A record linkage toolbox[C]//18th International Conference on Data Engineering. San Jose, USA, 2002.
[35] TEJADA S, KNOBLOCK C A, MINTON S. Learning domain-independent string transformation weights for high accuracy object identification[C]//International Conference on Knowledge Discovery and Data Mining. Edmonton, Canada, 2002:350-359.
[36] CHRISTEN P. Febrl:A freely available record linkage system with a graphical user interface[C]//2nd Australasian Workshop on Health Data and Knowledge Management. Wollongong, Australia, 2008:17-25.
[37] MCCALLUM A, NIGAM K, UNGAR L H. Efficient clustering of high-dimensional data sets with application to reference matching[C]//6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:169-178.
[38] BAYARDO R J, MA Y M, SRIKANT R. Scaling up all pairs similarity search[C]//16th International Conference on World Wide Web. Banff, Canada, 2007:131-140.
[39] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//22nd International Conference on Data Engineering. Atlanta, USA, 2006.
[40] XIAO C, WANG W, LIN X M, et al. Efficient similarity joins for near-duplicate detection[J]. ACM Transactions on Database Systems (TODS), 2011, 36(3):15.
[41] ARASU A, RÉ C, SUCIU D. Large-scale deduplication with constraints using Dedupalog[C]//2009 IEEE 25th International Conference on Data Engineering. Shanghai, China, 2009:952-963.
[42] HERNÁNDEZ M A, STOLFO S J. The merge/purge problem for large databases[C]//1995 ACM SIGMOD International Conference on Management of Data. San Jose, USA, 1995:127-138.
[43] LEE M L, LING T W, LOW W L. IntelliClean:A knowledge-based intelligent data cleaner[C]//Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA, 2000:290-294.
[44] WANG J N, LI G L, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 4(10):622-633.
[45] BILENKO M, MOONEY R J. Adaptive duplicate detection using learnable string similarity measures[C]//Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA, 2003:39-48.
[46] PAPENBROCK T, EHRLICH J, MARTEN J, et al. Functional dependency discovery:An experimental evaluation of seven algorithms[J]. Proceedings of the VLDB Endowment, 2015, 8(10):1082-1093.
[47] LIU J X, LI J Y, LIU C F, et al. Discover dependencies from data:A review[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(2):251-264.
[48] HUHTALA Y, KÄRKKÄINEN J, PORKKA P, et al. TANE:An efficient algorithm for discovering functional and approximate dependencies[J]. The Computer Journal, 1999, 42(2):100-111.
[49] YAO H, HAMILTON H J. Mining functional dependencies from data[J]. Data Mining and Knowledge Discovery, 2008, 16(2):197-219.
[50] NOVELLI N, CICCHETTI R. FUN:An efficient algorithm for mining functional and embedded dependencies[C]//International Conference on Database Theory. London, UK, 2001:189-203.
[51] WYSS C, GIANNELLA C, ROBERTSON E. FastFDs:A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances extended abstract[C]//International Conference on Data Warehousing and Knowledge Discovery. Munich, Germany, 2001:101-110.
[52] LOPES S, PETIT J M, LAKHAL L. Efficient discovery of functional dependencies and Armstrong relations[C]//International Conference on Extending Database Technology. Konstanz, Germany, 2000:350-364.
[53] FAN W F, GEERTS F, LI J Z, et al. Discovering conditional functional dependencies[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(5):683-698.
[54] DE MARCHI F, LOPES S, PETIT J M. Efficient algorithms for mining inclusion dependencies[C]//8th International Conference on Extending Database Technology. Prague, Czech Republic, 2002:464-476.
[55] BOHANNON P, FAN W F, FLASTER M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//2005 ACM SIGMOD International Conference on Management of Data. Baltimore, USA, 2005:143-154.
[56] CHIANG F, MILLER R J. A unified model for data and constraint repair[C]//2011 IEEE 27th International Conference on Data Engineering. Hannover, Germany, 2011:446-457.
[57] CHU X, ILYAS I F, PAPOTTI P. Holistic data cleaning:Putting violations into context[C]//2013 IEEE 29th International Conference on Data Engineering (ICDE). Brisbane, Australia, 2013:458-469.
[58] KOLAHI S, LAKSHMANAN L V S. On approximating optimum repairs for functional dependency violations[C]//12th International Conference on Database Theory. St. Petersburg, Russia, 2009:53-62.
[59] HAO S, TANG N, LI G L, et al. A novel cost-based model for data repairing[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(4):727-742.
[60] ARENAS M, BERTOSSI L, CHOMICKI J. Consistent query answers in inconsistent databases[C]//18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Philadelphia, USA, 1999:68-79.
[61] BRAVO L, BERTOSSI L. Logic programs for consistently querying data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:10-15.
[62] CALÍ A, LEMBO D, ROSATI R. On the decidability and complexity of query answering over inconsistent and incomplete databases[C]//22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. San Diego, USA, 2003:260-271.
[63] CALI A, LEMBO D, ROSATI R. Query rewriting and answering under constraints in data integration systems[C]//18th International Joint Conference on Artificial Intelligence. Acapulco, Mexico, 2003:16-21.
[64] CHOMICKI J, MARCINKOWSKI J. Minimal-change integrity maintenance using tuple deletions[J]. Information and Computation, 2005, 197(1-2):90-121.
[65] GERTZ M, LIPECK U W. A diagnostic approach for repairing constraint violations in databases[M]//JAJODIA S, LIST W, MCGREGOR G, et al, eds. Integrity and internal control in information systems. ⅡCIS 1997. Boston, USA:Springer, 1997, 1:89-111
[66] GRECO G, GRECO S, ZUMPANO E. A logical framework for querying and repairing inconsistent databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(6):1389-1408.
[67] YAKOUT M, ELMAGARMID A K, NEVILLE J, et al. Guided data repair[J]. Proceedings of the VLDB Endowment, 2011, 4(5):279-289.
[68] CHU X, MORCOS J, ILYAS I F, et al. KATARA:A data cleaning system powered by knowledge bases and crowdsourcing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1247-1261.
[69] COHEN W, RAVIKUMAR P, FIENBERG S. A comparison of string metrics for matching names and records[C]//KDD Workshop on Data Cleaning and Object Consolidation. Washington, DC, USA, 2003:73-78.
[70] JIANG Y, LI G L, FENG J H, et al. String similarity joins:An experimental evaluation[J]. Proceedings of the VLDB Endowment, 2014, 7(8):625-636.
[71] EBAID A, ELMAGARMID A, ILYAS I F, et al. NADEEF:A generalized data cleaning system[J]. Proceedings of the VLDB Endowment, 2013, 6(12):1218-1221.
[72] CONG G, FAN W F, GEERTS F, et al. Improving data quality:Consistency and accuracy[C]//33rd International Conference on Very Large Data Bases. Vienna, Austria, 2007:315-326.
[73] GEERTS F, MECCA G, PAPOTTI P, et al. The LLUNATIC data-cleaning framework[J]. Proceedings of the VLDB Endowment, 2013, 6(9):625-636.
[74] BESKALES G, ILYAS I F, GOLAB L. Sampling the repairs of functional dependency violations under hard constraints[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):197-207.
[75] KHAYYAT Z, ILYAS I F, JINDAL A, et al. BIGDANSING:A system for big data cleansing[C]//2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 2015:1215-1230.
[76] BESKALES G, ILYAS I F, GOLAB L, et al. On the relative trust between inconsistent data and inaccurate constraints[C]//2013 IEEE 29th International Conference on Data Engineering. Brisbane, Australia, 2013:541-552.
[77] VOLKOVS M, CHIANG F, SZLICHTA J, et al. Continuous data cleaning[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:244-255.
[78] RISSANEN J. Modeling by shortest data description[J]. Automatica, 1978, 14(5):465-471.
[79] COVER T M, THOMAS J A. Elements of information theory[M]. New York, USA:John Wiley & Sons, 1991.
[80] HE J, VELTRI E, SANTORO D, et al. Interactive and deterministic data cleaning[C]//2016 International Conference on Management of Data. San Francisco, USA, 2016:893-907.
[81] AROCENA P C, GLAVIC B, MECCA G, et al. Messing up with BART:Error generation for evaluating data-cleaning algorithms[J]. Proceedings of the VLDB Endowment, 2015, 9(2):36-47.
[82] ANANTHAKRISHNA R, CHAUDHURI S, GANTI V. Eliminating fuzzy duplicates in data warehouses[C]//28th International Conference on Very Large Databases. Hong Kong, China, 2002:586-597.
[83] WANG J N, KRISHNAN S, FRANKLIN M J, et al. A sample-and-clean framework for fast and accurate query processing on dirty data[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:469-480.
[84] KRISHNAN S, WANG J N, FRANKLIN M J, et al. SampleClean:Fast and reliable analytics on dirty data[C/OL]//Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015:59-75[2018-06-19]. https://www.ocf.berkeley.edu/~sanjayk/old/pubs/sc.pdf.
[85] LI G L, WANG J N, ZHENG Y D, et al. Crowdsourced data management:A survey[C]//2017 IEEE 33rd International Conference on Data Engineering. San Diego, USA, 2017:39-40.
[86] PARK H, WIDOM J. CrowdFill:Collecting structured data from the crowd[C]//2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 2014:577-588.
[87] WHANG S E, LOFGREN P, GARCIA-MOLINA H. Question selection for crowd entity resolution[J]. Proceedings of the VLDB Endowment, 2013, 6(6):349-360.
[88] TONG Y X, CAO C C, ZHANG C J, et al. CrowdCleaner:Data cleaning for multi-version data on the web via crowdsourcing[C]//2014 IEEE 30th International Conference on Data Engineering. Chicago, USA, 2014:1182-1185.
[89] MICHALSKI R S. A theory and methodology of inductive learning[M]//MICHALSKI R S, CARBONELL J G, MITCHELL T M. Machine learning. Berlin, Germany:Springer, 1983:83-134.