Data science and artificial intelligence are inherently interdisciplinary fields that transcend traditional academic boundaries by placing data at the core of analysis. This paradigm shift moves us beyond the constraints of parametric models, which rely on predefined theoretical assumptions, and toward data-driven approaches in which the data themselves shape the analytical models.
By integrating methodologies from data science and network science, we effectively tackle two fundamental challenges: uncertainty and explainability. This integration serves as a unifying principle across our research, ensuring a more robust and interpretable understanding of complex systems. In essence, our approach can be summarized as the seamless combination of these two fields.
Data science and AI are diverse and multifaceted fields rather than monolithic disciplines. This diversity arises from two key aspects. First, the methods originate from various communities, including machine learning and statistics, each bringing different perspectives and strengths. Second, the complexity stems from the wide variety of data types, the intricacy of data sources, and the diverse domains that generate the data. A major advantage of this plurality is the ability to develop flexible, domain-specific solutions by leveraging techniques from multiple disciplines. However, a potential downside is the challenge of integrating methods that were developed under different assumptions and conventions. Additionally, the increasing complexity of data sources demands careful preprocessing and interpretation to avoid biases and misrepresentations. For instance, the data we use in our studies come from various fields, including biology, medicine, healthcare, economics, finance, industry, and social media. As a consequence, data science can be divided into a number of sub-fields.
In the following, we give a brief overview of our research.
Genomic data provide important information for understanding complex phenotypes, including disorders such as cancer or diabetes. Such data have great potential for diagnostic and prognostic purposes. Two areas we are particularly interested in are the study of biomarkers and the repurposing of drugs. Specifically, based on gene expression data, we studied the selection and stability of biomarkers in breast cancer and other cancer types, because such biomarkers need to be reliable and interpretable to be useful for prognostic predictions [1,2]. For a number of different cancers we found that especially the latter is not guaranteed, leading to limits of explainability [3]. For drug-perturbation profiles, we developed a systems pharmacogenomic model that allows the exploitation of the large-scale repository LINCS [4]. This enabled the summarization of millions of gene expression perturbation profiles into a coherent landscape of drug similarities, including FDA-approved drugs [5].
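To make the notion of biomarker stability concrete, the following minimal R sketch (simulated data and illustrative parameters only, not the exact procedure used in [1,2]) measures how strongly a top-k gene signature, selected here by a simple t-statistic ranking, changes under bootstrap resampling.

## simulated expression matrix: rows = genes, columns = samples in two classes
set.seed(1)
n.genes   <- 1000
n.samples <- 60
X <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)
y <- rep(c(0, 1), each = n.samples / 2)

top.k <- 50
rank.genes <- function(X, y, k) {
  # score each gene by the absolute t-statistic between the two classes
  scores <- apply(X, 1, function(g) abs(t.test(g[y == 0], g[y == 1])$statistic))
  order(scores, decreasing = TRUE)[1:k]
}

# repeat the selection on bootstrap samples of the data
B <- 30
selections <- lapply(1:B, function(b) {
  idx <- sample(seq_along(y), replace = TRUE)
  rank.genes(X[, idx], y[idx], top.k)
})

# pairwise Jaccard index of the selected gene sets as a simple stability measure
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
pairs <- combn(B, 2)
stability <- mean(apply(pairs, 2, function(p) jaccard(selections[[p[1]]], selections[[p[2]]])))
stability  # close to 0 for pure noise; stable signatures give values near 1

For pure noise, as simulated here, the selected signatures barely overlap, which illustrates why stability has to be checked explicitly before biomarkers are interpreted.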
The inference and analysis of gene regulatory networks (GRNs) is of fundamental importance for gaining a basic understanding of the functioning of biological cells. The reason is that genes do not work in isolation but cooperate with each other in a concerted manner. For the inference of GRNs from expression data, we developed two methods: C3Net and BC3Net. BC3Net [2] is an ensemble method based on bagging the C3Net algorithm [1], which means it corresponds to a Bayesian approach with noninformative priors. Furthermore, we developed methods that allow us to detect aberrant pathways, where pathways correspond to (small) functional modules within a GRN. For example, in [3] a statistical hypothesis test based on the graph edit distance was introduced, whereas in [4] a multivariate differential coexpression test (Gene Sets Net Correlations Analysis, GSNCA) was developed that accounts for the complete correlation structure between genes.
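A minimal usage sketch for the network inference step is given below. It assumes the CRAN packages c3net and bc3net with their main functions c3net() and bc3net() called with default settings, and an expression matrix with genes in rows and samples in columns; the data here are simulated and purely illustrative.

# install.packages(c("c3net", "bc3net"))   # if not yet installed
library(c3net)
library(bc3net)

# expression matrix with genes in rows and samples in columns (simulated, illustrative)
set.seed(1)
expr <- matrix(rnorm(200 * 30), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:30)))

# C3Net: for each gene, keep only its most significant interaction partner
net.c3 <- c3net(expr)

# BC3Net: bagging of C3Net over bootstrap datasets, aggregated into an ensemble network
net.bc3 <- bc3net(expr)

The number of bootstrap datasets and the multiple-testing corrections used in the aggregation step can be adjusted via additional arguments of bc3net(); the defaults are kept here to keep the sketch generic.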
Text data are unique in the sense that they are presented as symbol sequences rather than numerical arrays. In recent years, several groundbreaking methods have been introduced, most notably ChatGPT. Our group studies named entity recognition to capture food, nutrition and phytochemical entities [1], as well as multi-label classification [2], because such methods have important applications for the diagnosis of patients based on electronic health records (EHRs). We are also interested in relation detection, because such methods allow us to construct different types of networks from text data. Furthermore, we are interested in the use of transfer learning for general text-related tasks [3].
Industrial data from production, manufacturing or services provide many opportunities for optimizing processes, improving efficiency, and ultimately driving overall business growth and competitiveness [1]. At the same time, the high dimensionality of the data, together with a complex correlation structure, constitutes a severe challenge. In cooperation with Cargotec (Finland), a manufacturer of cargo-handling machinery for ships and ports, we modeled the prognostics of machine components from loading cranes [2]. This allows the estimation of survival curves and hazard rates for characterizing time-dependent failure probabilities of machine components. Insights from such an analysis have the potential to optimize supply-chain management.
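This type of analysis can be sketched with the standard R survival package; the simulated failure times, censoring indicators and component labels below are purely illustrative and not taken from the Cargotec study [2].

library(survival)

# simulated component-failure data (illustrative only)
set.seed(1)
n <- 200
failure.time <- rweibull(n, shape = 1.5, scale = 1000)    # operating hours until failure
observed     <- rbinom(n, 1, 0.7)                         # 1 = failure observed, 0 = censored
component    <- sample(c("componentA", "componentB"), n, replace = TRUE)

# Kaplan-Meier survival curves per component type
fit <- survfit(Surv(failure.time, observed) ~ component)
summary(fit, times = c(250, 500, 1000))
plot(fit, xlab = "Operating hours", ylab = "Survival probability")

# parametric alternative: Weibull regression, from which hazard rates can be derived
wreg <- survreg(Surv(failure.time, observed) ~ component, dist = "weibull")
summary(wreg)

The nonparametric Kaplan-Meier curves describe the empirical failure behaviour, while the parametric Weibull model provides smooth hazard rates that can be extrapolated for maintenance planning.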
The concept of a digital twin (DT) has gained significant attention in academia and industry because it uses simulations efficiently to make predictions. Originally introduced in manufacturing, the approach has in recent years been extended to other fields, including climate research and medicine. Currently, we are involved in a collaborative project with the Institute for Systems Biology (USA) and the Institute for Molecular Medicine Finland at the University of Helsinki (Finland), jointly funded by the NIH (National Institutes of Health, US) and the AoF (Academy of Finland), to develop a digital twin system for a personalised and preventive medicine approach to Acute Myeloid Leukaemia (AML). In this project, we explore virtual tests corresponding to 'what-if' scenarios for identifying drugs that have the potential for use in treating patients.
Our group has developed a number of software solutions for the statistical analysis of data and the visualization of networks. The following software packages have been developed for the programming language R (see the installation sketch after the list):
BC3Net: Inference of causal networks (from CRAN)
NetBioV: Network visualization (from Bioconductor)
samExploreR: Preprocessing for RNA-seq data (from Bioconductor)
mvgraphnorm: Generating constrained covariance matrices for Gaussian graphical models (from CRAN)
sgnesR: Simulating gene expression data from an underlying gene network structure (from GitHub)
GSAR: Gene Set Analysis in R (from Bioconductor)
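A hedged installation sketch is given below; the package names follow the list above, but their exact capitalization on CRAN/Bioconductor and their current availability should be checked against the respective repositories.

# CRAN packages
install.packages(c("bc3net", "mvgraphnorm"))

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("netbiov", "samExploreR", "GSAR"))

# GitHub package (the repository path below is a placeholder, not the actual location)
# remotes::install_github("<user-or-organisation>/sgnesR")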
In addition to software packages, we also develop web portals. A web portal allows the interactive exploration of complex and high-dimensional results:
Drug association network: Network visualization of significant similarities of drugs
L1000 Viewer: Access to the LINCS L1000 data repository