Data science and artificial intelligence are inherently interdisciplinary fields that transcend traditional academic boundaries by placing data at the core of analysis. This paradigm shift moves us beyond the constraints of parametric models, which rely on predefined theoretical assumptions, and toward data-driven approaches in which the data themselves shape the analytical models.
By integrating methodologies from data science and network science, we effectively tackle two fundamental challenges: uncertainty and explainability. This integration serves as a unifying principle across our research, ensuring a more robust and interpretable understanding of complex systems. In essence, our approach can be summarized as the seamless combination of these two fields.
Data science and AI are diverse and multifaceted fields rather than monolithic disciplines. This diversity arises from two key aspects. First, the methods originate from various communities, including machine learning and statistics, each bringing different perspectives and strengths. Second, the complexity stems from the wide variety of data types, the intricacy of data sources, and the diverse domains that generate the data. A major advantage of this plurality is the ability to develop flexible, domain-specific solutions by leveraging techniques from multiple disciplines. However, a potential downside is the challenge of integrating methods that were developed under different assumptions and conventions. Additionally, the increasing complexity of data sources demands careful preprocessing and interpretation to avoid biases and misrepresentations. For instance, the data we use in our studies come from various fields, including biology, medicine, healthcare, economics, finance, industry, and social media. As a consequence, data science can be divided into a number of sub-fields.
In the following, we give a brief overview of our research.
Genomic data provide important information for understanding complex phenotypes, including disorders such as cancer or diabetes. Such data have great potential for diagnostic and prognostic purposes. Two areas we are particularly interested in are the study of biomarkers and the repurposing of drugs. Specifically, based on gene expression data, we studied the selection and stability of biomarkers in breast cancer and other cancer types, because such biomarkers need to be reliable and interpretable to be useful for prognostic predictions [1,2]. For a number of different cancers we found that especially the latter is not guaranteed, leading to limits of explainability [3]. For drug-perturbation profiles, we developed a systems pharmacogenomic model that allows the exploitation of the large-scale repository LINCS [4]. This enabled the summarization of millions of gene expression perturbation profiles into a coherent landscape of drug similarities, including FDA-approved drugs [5].
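To make the notion of biomarker stability concrete, the following minimal R sketch (simulated data and illustrative parameters only, not the exact procedure used in [1,2]) measures how strongly a top-k gene signature, selected here by a simple t-statistic ranking, changes under bootstrap resampling.

## simulated expression matrix: rows = genes, columns = samples in two classes
set.seed(1)
n.genes   <- 1000
n.samples <- 60
X <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)
y <- rep(c(0, 1), each = n.samples / 2)

top.k <- 50
rank.genes <- function(X, y, k) {
  # score each gene by the absolute t-statistic between the two classes
  scores <- apply(X, 1, function(g) abs(t.test(g[y == 0], g[y == 1])$statistic))
  order(scores, decreasing = TRUE)[1:k]
}

# repeat the selection on bootstrap samples of the data
B <- 30
selections <- lapply(1:B, function(b) {
  idx <- sample(seq_along(y), replace = TRUE)
  rank.genes(X[, idx], y[idx], top.k)
})

# pairwise Jaccard index of the selected gene sets as a simple stability measure
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
pairs <- combn(B, 2)
stability <- mean(apply(pairs, 2, function(p) jaccard(selections[[p[1]]], selections[[p[2]]])))
stability  # close to 0 for pure noise; stable signatures give values near 1

For pure noise, as simulated here, the selected signatures barely overlap, which illustrates why stability has to be checked explicitly before biomarkers are interpreted.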
The inference and analysis of gene regulatory networks (GRNs) is of fundamental importance for gaining a basic understanding of the functioning of biological cells. The reason is that genes do not work in isolation but cooperate with each other in a concerted manner. For the inference of GRNs from expression data, we developed two methods: C3Net and BC3Net. BC3Net [2] is an ensemble method based on bagging the C3Net algorithm [1], which means it corresponds to a Bayesian approach with noninformative priors. Furthermore, we developed methods that allow us to detect aberrant pathways, where pathways correspond to (small) functional modules within a GRN. For example, in [3] a statistical hypothesis test based on the graph edit distance was introduced, whereas in [4] a multivariate differential coexpression test (Gene Sets Net Correlations Analysis, GSNCA) was developed that accounts for the complete correlation structure between genes.
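A minimal usage sketch for the network inference step is given below. It assumes the CRAN packages c3net and bc3net with their main functions c3net() and bc3net() called with default settings, and an expression matrix with genes in rows and samples in columns; the data here are simulated and purely illustrative.

# install.packages(c("c3net", "bc3net"))   # if not yet installed
library(c3net)
library(bc3net)

# expression matrix with genes in rows and samples in columns (simulated, illustrative)
set.seed(1)
expr <- matrix(rnorm(200 * 30), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:30)))

# C3Net: for each gene, keep only its most significant interaction partner
net.c3 <- c3net(expr)

# BC3Net: bagging of C3Net over bootstrap datasets, aggregated into an ensemble network
net.bc3 <- bc3net(expr)

The number of bootstrap datasets and the multiple-testing corrections used in the aggregation step can be adjusted via additional arguments of bc3net(); the defaults are kept here to keep the sketch generic.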
Text data are unique in the sense that they are presented as symbol sequences rather than numerical arrays. In recent years, several groundbreaking methods have been introduced, most notably ChatGPT. Our group studies named entity recognition to capture food, nutrition and phytochemical entities [1], as well as multi-label classification [2], because such methods have important applications for the diagnosis of patients based on electronic health records (EHRs). We are also interested in relation detection, because such methods allow us to construct different types of networks from text data. Furthermore, we are interested in the use of transfer learning for general text-related tasks [3].
Industrial data from production, manufacturing or services provide many opportunities for optimizing processes, improving efficiency, and ultimately driving overall business growth and competitiveness [1]. At the same time, the high dimensionality of the data, together with a complex correlation structure, constitutes a severe challenge. In cooperation with Cargotec (Finland), a manufacturer of cargo-handling machinery for ships and ports, we modeled the prognostics of machine components from loading cranes [2]. This allows the estimation of survival curves and hazard rates for characterizing time-dependent failure probabilities of machine components. Insights from such an analysis have the potential to optimize supply-chain management.
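This type of analysis can be sketched with the standard R survival package; the simulated failure times, censoring indicators and component labels below are purely illustrative and not taken from the Cargotec study [2].

library(survival)

# simulated component-failure data (illustrative only)
set.seed(1)
n <- 200
failure.time <- rweibull(n, shape = 1.5, scale = 1000)    # operating hours until failure
observed     <- rbinom(n, 1, 0.7)                         # 1 = failure observed, 0 = censored
component    <- sample(c("componentA", "componentB"), n, replace = TRUE)

# Kaplan-Meier survival curves per component type
fit <- survfit(Surv(failure.time, observed) ~ component)
summary(fit, times = c(250, 500, 1000))
plot(fit, xlab = "Operating hours", ylab = "Survival probability")

# parametric alternative: Weibull regression, from which hazard rates can be derived
wreg <- survreg(Surv(failure.time, observed) ~ component, dist = "weibull")
summary(wreg)

The nonparametric Kaplan-Meier curves describe the empirical failure behaviour, while the parametric Weibull model provides smooth hazard rates that can be extrapolated for maintenance planning.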
The concept of a digital twin (DT) has gained significant attention in academia and industry because it uses simulations efficiently to make predictions. Originally introduced in manufacturing, the approach has in recent years been extended to other fields, including climate research and medicine. Currently, we are involved in a collaborative project with the Institute for Systems Biology (USA) and the Institute for Molecular Medicine Finland at the University of Helsinki (Finland), jointly funded by the NIH (National Institutes of Health, US) and the AoF (Academy of Finland), to develop a digital twin system for a personalised and preventive medicine approach to Acute Myeloid Leukaemia (AML). In this project, we explore virtual tests corresponding to 'what-if' scenarios for identifying drugs that have the potential for use in treating patients.
Our group has developed a number of software solutions for the statistical analysis of data and the visualization of networks. The following software packages have been developed for the programming language R (see the installation sketch after the list):
BC3Net: Inference of causal networks (from CRAN)
NetBioV: Network visualization (from Bioconductor)
samExploreR: Preprocessing for RNA-seq data (from Bioconductor)
mvgraphnorm: Generating constrained covariance matrices for Gaussian graphical models (from CRAN)
sgnesR: Simulating gene expression data from an underlying gene network structure (from GitHub)
GSAR: Gene Set Analysis in R (from Bioconductor)
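A hedged installation sketch is given below; the package names follow the list above, but their exact capitalization on CRAN/Bioconductor and their current availability should be checked against the respective repositories.

# CRAN packages
install.packages(c("bc3net", "mvgraphnorm"))

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("netbiov", "samExploreR", "GSAR"))

# GitHub package (the repository path below is a placeholder, not the actual location)
# remotes::install_github("<user-or-organisation>/sgnesR")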
In addition to software packages, we also develop web portals. A web portal allows the interactive exploration of complex and high-dimensional results:
Drug association network: Network visualization of significant similarities of drugs
L1000 Viewer: Access to the LINCS L1000 data repository