What is data science?

The concept of data science is not a recent development; it has existed for decades. However, to discuss a field, one first needs to give it a name. One of the pioneers in naming this discipline was Peter Naur, who introduced the term "datalogy" in the 1960s, defining it as "the science of the nature and use of data." Over time, this term evolved into "science of data" and eventually became known as "data science." Notably, in 1997, Jeff Wu delivered an inaugural lecture provocatively titled "Statistics = Data Science?", a signal of the significant shift in data analysis approaches. Furthermore, one of the earliest institutions dedicated to research in this domain was the Research Center for Dataology and Data Science at Fudan University, Shanghai (China), established in 2007.

Data science combines the skill sets and expert knowledge of many different fields, including machine learning, artificial intelligence, statistics and network science. This makes the field inherently interdisciplinary, and each of the contributing fields brings its own history.

More detailed background information about this can be found in the following publications:

In general, the availability of data provides opportunities in all fields of science to gain new information and to tackle difficult problems. However, data alone do not provide information; first, they need to be analyzed. This is what we call learning from data, and it is at the heart of data science. There are many quite different data types. In the following, I want to describe in more detail three data types that are also studied in my Lab:

   1. Gene expression data

   2. Text data

   3. Network data

The first data type is called gene expression data. Such data provide information from biological cells of animals or plants about the activity of genes by measuring the concentration of mRNAs. Humans have roughly 20,000 genes, which means that the profiles obtained from such genomics experiments correspond to high-dimensional arrays of numbers. Gene expression data allow us not only to learn about molecular biological laws but also to gain a better understanding of diseases and to help develop treatments. The latter forms the basis of what is called personalized medicine or precision medicine.

The following image visualizes the connection between molecular biology and the resulting measurements, where rows correspond to patients and columns to the measurements of gene activity.
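The row and column layout described above can also be sketched in a few lines of Python. This is only an illustration with made-up numbers: the sizes, the function name simulate_expression and the log-normal values are assumptions chosen for the example, not data or methods from our work.

```python
import random

# Hypothetical sizes for illustration; the gene count follows the rough
# figure of 20,000 genes mentioned in the text.
N_PATIENTS = 5
N_GENES = 20_000


def simulate_expression(n_patients, n_genes, seed=0):
    """Return a patients x genes matrix of made-up expression values."""
    rng = random.Random(seed)
    # Log-normal draws are only a stand-in for measured mRNA concentrations.
    return [
        [rng.lognormvariate(0.0, 1.0) for _ in range(n_genes)]
        for _ in range(n_patients)
    ]


expression = simulate_expression(N_PATIENTS, N_GENES)

# One row is the full expression profile of a single patient.
profile = expression[0]

# One column holds the activity of a single gene across all patients.
gene_column = [row[42] for row in expression]

print(len(expression), "patients x", len(profile), "genes")
```

Even with only a handful of patients, each profile has thousands of entries, which is what makes such data high-dimensional.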

More details about gene expression data can be found in the following publication:

Science of Data, the oil of the twenty-first century

Data are everywhere, and every field of science or industry generates data in a seemingly effortless manner. Examples are genomic measurements of patients, transactions on the stock market or communications on social media. For this reason, data have been called the “oil of the twenty-first century”. To deal with this flood of data, a new field called data science has been established.

A more detailed description of our work can be found on the research page.