Data science researcher builds tools to speed up data preparation

April 08, 2021

Data preparation is widely regarded as the most time-consuming part of data science, taking up a large share of a data scientist’s time. SFU computing science professor Jiannan Wang’s mission is to speed up data science by greatly reducing the time spent on data preparation. To do this, he develops innovative technologies and open-source tools for data scientists to use.

Data preparation refers to the process of collecting, exploring, cleaning, transforming and integrating data into a form for downstream analysis and modeling. By 2025, it is estimated that the market for data preparation will be over .

“Data preparation is not a single problem,” says Wang.

“It consists of many challenging problems such as discovering, understanding, cleaning and integrating the data.”

These problems are more easily solved with crowdsourcing and human intelligence than with full automation. For example, entity resolution is the task of identifying records that refer to the same real-world entity. It is central to data cleaning and integration, but algorithmic solutions are far from perfect. Wang built the first crowdsourced entity resolution system able to outperform the best human-only and machine-only systems. To reduce the human cost, he also developed the first quality-aware task assignment system for various data preparation tasks.

Another of Wang’s projects was proposed to scale the expensive data cleaning process. The main idea is to have a human clean a small sample of the data, and then have the machine use these results to learn the cleaning process and lessen the impact of unclean data on query results. This system was incorporated into what was then one of the world’s most popular big data stacks.
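
The sampling idea can be illustrated with a minimal Python sketch. It is not the project's actual method: it covers only the "clean a small sample" step and leaves out the part where the machine learns the cleaning process. The data, the `fix_units` rule standing in for a human cleaner, and the function names are all hypothetical.

```python
import random
import statistics


def estimate_clean_mean(dirty_values, clean_fn, sample_size, seed=0):
    """Estimate the mean of the fully cleaned data from a cleaned random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(dirty_values, sample_size)
    # clean_fn stands in for the human cleaning step (hypothetical)
    cleaned_sample = [clean_fn(value) for value in sample]
    return statistics.mean(cleaned_sample)


def fix_units(price):
    """Hypothetical cleaning rule: prices above 1000 were entered in cents."""
    return price / 100 if price > 1000 else price


# Made-up data: every tenth price was entered in cents instead of dollars.
dirty_prices = [5000.0 if i % 10 == 0 else 50.0 for i in range(1000)]

dirty_mean = statistics.mean(dirty_prices)  # skewed by the uncleaned values
estimated_mean = estimate_clean_mean(dirty_prices, fix_units, sample_size=50)
```

Only 50 of the 1,000 values are cleaned here, yet the estimate reflects the cleaned data rather than the skewed average of the raw data.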

His mission to speed up data science can perhaps best be seen in his work on string similarity joins. A string similarity join finds all pairs of strings whose similarity is above a user-specified threshold. Wang’s algorithms made several major breakthroughs and ran 10 to 100 times faster than all other algorithms at the String Similarity Join/Search Competition in 2013, reducing run time from hours to minutes.
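
To make the definition concrete, here is a naive Python sketch of a string similarity join. It is not one of Wang's algorithms: it compares every pair of strings, which is the slow baseline that faster algorithms improve on. The choice of Jaccard similarity over word sets, the function names, and the treatment of the threshold as inclusive are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def similarity_join(strings, threshold):
    """Return every pair (s, t, similarity) whose similarity reaches the threshold."""
    results = []
    # Brute force: one comparison per pair, so the cost grows quadratically.
    for s, t in combinations(strings, 2):
        sim = jaccard(s, t)
        if sim >= threshold:  # inclusive threshold is an assumption of this sketch
            results.append((s, t, sim))
    return results


names = [
    "Simon Fraser University",
    "simon fraser university burnaby",
    "University of British Columbia",
]
matches = similarity_join(names, threshold=0.7)
```

With these three made-up names, only the first two strings share enough words to be reported as a similar pair.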

In recognition of these research breakthroughs, Wang recently received an award from CS-Can|Info-Can. Its stated aim is “To foster excellence in Computer Science research and higher education in Canada, drive innovation and benefit society.” The award follows another honour that Wang received in 2018.

“This award recognizes my past achievements, but in my opinion, research is a marathon,” says Wang, who also serves as the program director for the Master of Science in Professional Computer Science Program at SFU.

His long-term research goal can be found in his new project DataPrep, an all-in-one data preparation system that aims to provide the easiest way for data scientists to prepare data in Python. Since the project began in May 2019, DataPrep has been downloaded by data scientists and has received positive feedback on online forums.

Through his research, Wang hopes to build a community that is “equal, diverse and inclusive” while saving data scientists time during the crucial stage of data preparation.

“This award gives me more motivation to do excellent research and to focus on the impact of this research,” says Wang.

“There are millions of data scientists in the world that spend a lot of time on data preparation, so solving this problem could have a huge impact on society.”
