Loved to talk about “Data Quality” at the Second Data Science MeetUp Twente on Responsible Data Analytics. Thanks to Susan Kamies of MoneyBird and Sven van Munster of Green Orange for organizing the event.
[Slides on Slideshare] [2nd Data Science MeetUp Twente]
Our faculty (EEMCS) will form a new interdisciplinary research group around computational statistics, data science, and statistical signal processing. Researchers within the faculty with a shared interest for data science, including my group, will join forces to address problems in the areas of image analysis, spatial statistics, statistics of networks, biometric pattern recognition, machine learning, big data, search, natural language processing, data integration, data quality, and trustworthy analytics.
To strengthen the activities, we have three vacancies (first two in my group, latter in the department of applied mathematics):
- Assistant/Associate Professor Data Science & Technology
- Assistant Professor Big data and Data Science in Creative Technology
- Full Professor in Computational Statistics
Starting January 1st, 2017, I’m the head of the group “Databases”
Today, a PhD student of mine, Mohammad S. Khelghati, defended his thesis.
Deep Web Content Monitoring [Download]
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also facilitate users to access a complete collection of relevant data to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles: finding methods that can achieve the best coverage in harvesting data for a topic; reducing the cost of harvesting a website regarding the number of submitted requests by estimating its actual size; monitoring data changes over time in web data repositories; and we combine our experiences in harvesting with the studies in the literature to suggest a general designing and developing framework for a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings. These steps bring further improvements to data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
After a guest lecture in the second year module “Data:From the Source to the Senses“, I was asked to organize a data wrangling workshop. After a short introduction and a demonstration of TriFacta, the data wrangling tool that I proposed them to use, I let them work on data from the university’s timetabling system for compliance checking, exploration, or trend analysis purposes, or their own data that they found for their self-chosen data visualization project in the module. After the workshop, I asked who would use Trifacta in their project and a majority of the hands went up. Most of them were using Excel for their data wrangling tasks before that. Quite a success for Trifacta that the majority of the students were convinced of its value and strengths after just a 1.5 hour workshop.
I also like to share two example cases of the students that came up and that I found most convincing.
Example case 1:
One group of students found some data that had six columns of the form “Male A”, “Female A”, “Male B”, “Female B”, “Male C”, “Female C”. They wanted to reshape this into a form that had two rows per original row, one for “Male” and one for “Female”, with both a column for “Gender”, three columns “A”, “B”, and “C”, and all the values in the other columns duplicated. We of course recognized it as a case for unpivot
, but not a trivial one, because we needed to unpivot two sets of three columns at the same time! We achieved it by first nest
ing the three columns for both “Male” and “Female”, then doing the unpivot
, and then unnest
ing the result. Some rename
-ing of columns, some replace
-ing of values, and drop
ping of irrelevant columns inbetween and done … for the whole data set! This particular group of students was quite impressed by this feat.
Example case 2:
Another group of students found data on countries that they wanted to use for a kind of network analysis. The data included 6 columns with up to 6 languages that were spoken in those countries. For the network analysis, they wanted to reshape this data into a form with one row per combination of countries where the same language was spoken. Obviously, they were at a loss how to achieve that. As a database-person myself, I recognized this as a self-join on language. What we did was first unpivot
on the six language columns to obtain 6 rows per country, one for each language. For countries with less than 6 languages, some of the rows had an empty cell in the new “Language” column. They were easily drop
ped with a few clicks. Then we generated the resulting data set to have the file twice. The we did a join with the other file on their respective “Language” columns. As you can imagine, this particular group was also quite impressed that this could be achieved in a matter of a few minutes for the whole data set.
One must know that these students typically do not have much programming or database experience. But with the suggestions made by Trifacta, possibly modifying the commands behind these suggestions a bit, they were quickly able to do their own wrangling without any instruction beforehand. In my opinion, this suggests that the tool is suitable for a wide audience of not-that-technical users that need to do data-driven analyses.
Another observation I want to make is about data quality. Trifacta throws it in your face that your data is dirty (which is a good thing!). There are histograms above all columns as well as a bar indicating the amount of trustful (green), suspicious (red), and missing (black) values. I have seen no case of a data set, that didn’t have red and black parts for one or more columns. Of course, there are data quality problems! Always! So, I think it is very good that this is so in-your-face making users aware that their data is dirty and they need to do something about it to make their visualizations and analytic results reliable. I often use the term responsible analytics: a data scientist should know about data quality problems in their data and how the affect the results (and be open and tell you about it). Furthermore, the functionality and suggestions for data cleaning are quite good. I use the ETL tool Pentaho data integration, aka Kettle, in another course, but I definitely think that Trifacta is better for detecting and solving data quality problems (as well as for doing such transformations as above).
In conclusion, good work Trifacta! A valuable addition to a data scientist’s toolbox.
The project proposal “Time To Care: Using sensor technology to dynamically model social interactions of healthcare professionals at work in relation to healthcare quality” has been accepted in our university’s Tech4People program. The project is a cooperation with Educational Sciences (chair OWK) and Psychology of Conflict, Risk and Safety (chair PCRS) with whom the funded PhD student will be shared.
What I am particulary enthusiastic about in this project is that it is not only interdisciplinary cooperation towards a shared goal, but that also disciplinary research questions from each of the participating disciplines can be answered. For me, it is a unique opportunity to test whether probabilistic modeling of the data quality problems / noise in the social interaction data obtained from the sensors indeed provide significantly different results when predicting team performance.
My PhD student Mohammad Khelgathi released his web harvesting software, called HarvestED.
I’ve been interview by NRCQ about natural language processing, in particular, about computers learning to understand the more subtle aspects of language use such as sarcasm
Súperhandig hoor, computers die sarcasme kunnen herkennen
Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]
Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or “Uncertain Database”. Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data.
Dolf Trieschnigg and I got some subsidy to valorize some of the research results of the COMMIT/ TimeTrails, PayDIBI, and FedSS projects. Company involved is Mydatafactory.
SmartCOPI: Smart Consolidation of Product Information
[download public version of project proposal]
Maintaining the quality of detailed product data, ranging from data about required raw materials to detailed specifications of tools and spare parts, is of vital importance in many industries. Ordering or using wrong spare parts (based on wrong or incomplete information) may result in significant production loss or even impact health and safety. The web provides a wealth of information on products provided in various formats, detail levels, targeted at at a variety of audiences. Semi- automatically locating, extracting and consolidating this information would be a “killer app” for enriching and improving product data quality with a significant impact on production cost and quality. The new to COMMIT/ industry partner Mydatafactory is interested in both the web harvesting and data cleansing technologies developed in COMMIT/-projects P1/Infiniti and P19/TimeTrails for this potential and for improving Mydatafactory’s data cleansing services. The ICT science questions behind data cleansing and web harvesting are how noise can be detected and reduced in discrete structured data, and how human cognitive skills in information navigation and extraction can be mimicked. Research results on these questions may benefit a wide range of applications from various domains such as fraud detection and forensics, creating a common operational picture, and safety in food and pharmaceuticals.