Intro

What is data science?

The concept of data science is not a recent development; it has existed for decades. However, to discuss a field, it is essential to give it a name. One of the pioneers in naming this discipline was Peter Naur, who introduced the term "datalogy" in the 1960s, defining it as "the science of the nature and use of data." Over time, this term evolved into "science of data" and eventually into "data science." Notably, in 1997, Jeff Wu delivered an inaugural lecture provocatively titled "Statistics = Data Science?", a signal of the significant shift in approaches to data analysis. Furthermore, one of the earliest institutions dedicated to research in this domain was the Research Center for Dataology and Data Science at Fudan University, Shanghai (China), established in 2007.

Data science combines the skill sets and expert knowledge of many different fields, including machine learning, artificial intelligence, statistics and network science. Because each of these fields has its own history, data science is inherently interdisciplinary.

More detailed background information about this can be found in the following publications:

Emmert-Streib, F., Yli-Harja, O., & Dehmer, M. (2020). Artificial intelligence: A clarification of misconceptions, myths and desired status. Frontiers in artificial intelligence, 3, 524339.
Emmert-Streib, F. (2021). From the digital data revolution toward a digital society: Pervasiveness of artificial intelligence. Machine Learning and Knowledge Extraction, 3(1), 284-298.
Emmert-Streib, F., & Dehmer, M. (2018). Defining data science by a data-driven quantification of the community. Machine Learning and Knowledge Extraction, 1(1), 235-251.

In general, the availability of data provides opportunities in all fields of science to gain new information and to tackle difficult problems. However, data alone do not provide information; first, they need to be analyzed. This is what we call learning from data, which is at the heart of data science. There are many quite different data types; in the following, I want to highlight three that are also studied in my lab. Specifically, I will describe:

1. Gene expression data

2. Text data

3. Network data

The first data type is called gene expression data. Such data provide information from biological cells of animals or plants about the activity of genes by measuring the concentration of mRNAs. Humans have about 20,000 genes, which means that the profiles obtained from such genomics experiments correspond to high-dimensional arrays of numbers. Using gene expression data not only allows us to learn about molecular biological laws but also to gain a better understanding of diseases and to help develop treatments. The latter forms the basis of what is called personalized medicine or precision medicine.

The following image visualizes the connection between molecular biology and the resulting measurements, where rows correspond to patients and columns to the measurements of gene activity.
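To make this layout concrete, here is a minimal sketch in Python. The gene names, patient labels and expression values are invented purely for illustration and do not come from any real experiment; the helper functions are hypothetical as well. A real profile would have roughly 20,000 columns instead of three.

```python
# Toy gene expression matrix: rows are patients, columns are genes.
# All labels and values below are invented for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C"]
patients = ["patient_1", "patient_2", "patient_3", "patient_4"]
expression = [
    [2.0, 8.0, 5.0],  # patient_1
    [4.0, 6.0, 5.0],  # patient_2
    [6.0, 4.0, 5.0],  # patient_3
    [8.0, 2.0, 5.0],  # patient_4
]


def column_means(matrix):
    """Return the mean expression of each gene (column) across patients."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


def column_variances(matrix):
    """Return the sample variance (n - 1 denominator) of each gene."""
    n_rows = len(matrix)
    means = column_means(matrix)
    return [
        sum((value - mean) ** 2 for value in column) / (n_rows - 1)
        for column, mean in zip(zip(*matrix), means)
    ]


gene_means = column_means(expression)
gene_variances = column_variances(expression)

for name, mean, variance in zip(genes, gene_means, gene_variances):
    print(f"{name}: mean={mean:.2f}, variance={variance:.2f}")
```

In this toy example all three genes have the same mean, but only GENE_A and GENE_B vary across patients. Summaries of this kind are often a first step before any more elaborate analysis.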

More details about gene expression data can be found in the following publication:

Emmert-Streib, F., Tripathi, S., & Matos Simoes, R. D. (2012). Harnessing the complexity of gene expression data from cancer: from single gene to structural pathway methods. Biology Direct, 7(1), 1-25.

The second data type I would like to mention is text data. Text data differ significantly from gene expression data because they are not presented as numerical values, but as symbol sequences composed of letters from an alphabet. That means that before any analysis of text data can be conducted, a transformation, or mapping, of the symbol sequences to numbers needs to be performed. However, even this mapping is very challenging, and no optimal solution currently exists. Still, examples like ChatGPT impressively demonstrate the capabilities of modern deep learning methods.
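As an illustration of what such a mapping can look like, the following Python sketch shows one of the simplest options: assigning each word an integer index and counting word occurrences (a "bag of words"). This is only an assumed, minimal example; it is not the mapping used by modern deep learning systems, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct lowercase word an integer index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in document.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def encode_tokens(text, vocabulary):
    """Map a text to a sequence of word indices, skipping unknown words."""
    return [vocabulary[token] for token in text.lower().split() if token in vocabulary]


def bag_of_words(text, vocabulary):
    """Map a text to a fixed-length vector of word counts (word order is lost)."""
    counts = Counter(text.lower().split())
    return [counts.get(token, 0) for token in vocabulary]


# Invented example sentences, for illustration only.
documents = ["the gene is active", "the gene is silent"]
vocabulary = build_vocabulary(documents)

print(vocabulary)
print(encode_tokens("the gene is silent", vocabulary))
print(bag_of_words("the gene the", vocabulary))
```

Even this small sketch shows why the mapping is hard: the index sequence keeps word order but carries no meaning, while the count vector has a fixed length but discards order entirely.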

More details about text data and their analysis can be found in the following publications:

Perera, N., Dehmer, M., & Emmert-Streib, F. (2020). Named entity recognition and relation detection for biomedical information extraction. Frontiers in cell and developmental biology, 673.
Yang, Z., Dehmer, M., Yli-Harja, O., & Emmert-Streib, F. (2020). Combining deep learning with token selection for patient phenotyping from electronic health records. Scientific reports, 10(1), 1432.
Farea, A., Yang, Z., Duong, K., Perera, N., & Emmert-Streib, F. (2022). Evaluation of Question Answering Systems: Complexity of judging a natural language. arXiv preprint arXiv:2209.12617.

The third data type is called network data. In general, networks, also called graphs, provide an elegant way to represent the connectivity among discrete units. Such discrete units are called nodes, and the connections between them are called edges or links (see the image below). While the simplest network consists of only two nodes and one edge, the combination of many nodes and many edges can result in very complex networks. An example of such a network is the gene regulatory network of breast cancer shown in the image below. This network, inferred from data, is like a blueprint of the disease that shows a global dependency structure among the genes. In general, the interrogation of such networks allows us to gain insights into the causal mechanisms that are present in disorders.
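The notions of nodes and edges can be sketched in a few lines of Python. The example below stores an undirected network as an adjacency list and counts the number of connections (the degree) of each node. The node labels and edges are invented for illustration and are not taken from any inferred gene regulatory network; the function names are hypothetical.

```python
def build_network(edges):
    """Build an undirected network as a mapping from each node to its set of neighbors."""
    adjacency = {}
    for node_u, node_v in edges:
        adjacency.setdefault(node_u, set()).add(node_v)
        adjacency.setdefault(node_v, set()).add(node_u)
    return adjacency


def node_degrees(adjacency):
    """Return the number of neighbors (degree) of every node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


# The simplest network: two nodes connected by one edge.
simplest_network = build_network([("A", "B")])

# A slightly larger toy network with invented node labels.
toy_network = build_network([("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g3", "g4")])

print(node_degrees(simplest_network))
print(node_degrees(toy_network))
```

In the toy network, node g1 has the most connections, which is the kind of structural observation that becomes informative when the network represents dependencies among genes.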

More details about gene regulatory networks, their inference and the analysis of networks can be found in the following publications:

Emmert-Streib, F., de Matos Simoes, R., Mullan, P., Haibe-Kains, B., & Dehmer, M. (2014). The gene regulatory network for breast cancer: integrated regulatory landscape of cancer hallmarks. Frontiers in genetics, 5, 15.
Moore, D., de Matos Simoes, R., Dehmer, M., & Emmert-Streib, F. (2019). Prostate cancer gene regulatory network inferred from RNA-seq data. Current Genomics, 20(1), 38-48.
Emmert-Streib, F., Dehmer, M., & Haibe-Kains, B. (2014). Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks. Frontiers in cell and developmental biology, 2, 38.
Emmert-Streib, F., Dehmer, M., & Shi, Y. (2016). Fifty years of graph matching, network alignment and network comparison. Information sciences, 346, 180-197.

These examples show that data science needs to be very versatile, providing different methods for the efficient analysis of many different data types.

Despite the diversity of data and of the methods for their analysis, there is a common underlying thread describing the purpose of data science: “data science allows us to make predictions”. While this sounds simple, the challenge is to make systematic predictions while minimizing prediction errors. Here, statistical thinking plays a key role, complementing approaches from artificial intelligence and machine learning.
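To illustrate what minimizing prediction errors can mean in the simplest case, the Python sketch below compares two constant predictions using the mean squared error, one common (but by no means the only) error measure. The observed values are invented for illustration, and the function name is hypothetical.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Invented observations, for illustration only.
observed = [1.0, 2.0, 3.0, 6.0]

# Prediction 1: always predict the mean of the observations.
mean_value = sum(observed) / len(observed)
error_mean = mean_squared_error(observed, [mean_value] * len(observed))

# Prediction 2: always predict some other constant, here 2.0.
error_other = mean_squared_error(observed, [2.0] * len(observed))

print(f"error when predicting the mean ({mean_value}): {error_mean}")
print(f"error when predicting 2.0: {error_other}")
```

Among all constant predictions, the mean gives the smallest mean squared error on these observations. Practical methods follow the same principle with richer prediction rules and, importantly, evaluate the error on data not used for fitting.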

Science of Data, the oil of the twenty-first century

Data are everywhere, and every field of science or industry generates data in a seemingly effortless manner. Examples are genomic measurements of patients, transactions on the stock market or communications on social media. For this reason, data have been called the “oil of the twenty-first century”. To deal with this flood of data, a new field called data science has been established.

A more detailed description of our work can be found on the research page.