CIS Seminar Series
Does Data Format Matter? A Study of a Domain-Agnostic Data Format for Domain-Specific - Michael Wyatt, PhD Student, CIS Department
In this age of information, ever larger volumes of data are becoming available in every domain. The growing scale and heterogeneity of data promise insights that can revolutionize our world. Yet size and diversity also introduce new challenges for extracting knowledge effectively and meaningfully. In High Performance Computing (HPC) centers, for example, user-submitted scripts vary from user to user and performance log files vary from system to system. In other domains, such as preventative healthcare, Electronic Medical Records (EMRs) have gone through multiple format specifications. Extracting knowledge from data in both domains requires manual, time-consuming, and error-prone work. Without a way to learn from data consistently, independent of format, the promises of Big Data may never be realized. In light of these facts, we seek to answer two key questions: (1) Is the format of data relevant to the task of extracting knowledge, or can we leverage a general, domain-agnostic data representation within and across different domains? (2) How effective is the combination of a domain-agnostic representation and machine learning techniques at extracting domain-specific knowledge automatically?
In this thesis, we answer these two questions with the following contributions. First, we propose a domain-agnostic data format based on an image-like representation suitable for deep learning methods, and we describe the process by which hybrid data (i.e., an amalgam of text and numerical data) are transformed into image-like representations. Second, we show that our domain-agnostic representation is general: it can be fed into machine learning tools (i.e., deep learning) for domain-specific knowledge extraction from diverse data types in two distinct domains, job scripts in HPC centers and EMRs in healthcare. Specifically, for job script data, we show how our image-like representations, when fed into deep learning, allow us to predict the usage of new resources (e.g., IO bandwidth), and these predictions are leveraged for new advanced scheduling policies. For EMR data, we show how our image-like representations, when fed into deep learning, allow us to extrapolate clinical, dietary, and behavioral patterns of patients that practitioners can leverage to determine the likelihood of disease occurrence. Our approach to data representation and analytics opens new opportunities for automatic knowledge discovery in ever larger datasets across domains.
Monday, May 14, 2018 at 4:00pm
Smith Hall, Room 102A
Smith Hall, University of Delaware, Newark, DE 19716, USA
- Event Type: Academics, College of Engineering, Students, Lectures & Programs, Community, Lectures and Programs
- Group: ENGR - Computer & Information Sciences
- Contact Name: Michela Taufer