
Avoiding the pharma data graveyard: an interview with Michel Dumontier, co-founder of the FAIR data principles

In this interview, Michel Dumontier, distinguished professor at Maastricht University and co-founder of the FAIR data principles, discusses the intersection of FAIR, knowledge graphs, GenAI, and pharma data management.  

ONTOFORCE team
25 November | 6 minute read

Michel Dumontier is a distinguished professor at Maastricht University and co-founder of the FAIR data principles. We recently invited Michel to be a keynote speaker at our 2024 Customer Advisory Board. We also sat down with him to discuss the intersection of the FAIR data principles, knowledge graphs, GenAI, and pharma data management.  

What are the FAIR data principles?  

The FAIR data principles were developed to support individuals and organizations wishing to enhance the reusability of their data. Since their formalization in 2016, the principles have proliferated across the life sciences industry, becoming a major tenet of many organizations’ data management strategies.  

[Image: The FAIR data principles: Findable, Accessible, Interoperable, Reusable]


What prompted the creation of the FAIR data principles? 

The FAIR principles emerged out of a workshop at the Lorentz Center, spearheaded by Professor Barend Mons and the Netherlands eScience Center. This workshop brought together people from all walks of life, including librarians, researchers, people from industry, and publishers, to talk about the problem of data sharing.  

There was a lot of discussion around licensing and making data available, especially in the context of medical research. In this workshop, we hashed out the FAIR acronym, which emerged from the data interests of the workshop attendees. Together we hammered out the minimal criteria, or principles, for making data available for reuse. We didn't want 100 points; we wanted as few as possible while still capturing the essential characteristics of what makes data easier to reuse.  

How has FAIR evolved since its inception? 

I don’t think any of us could have predicted or anticipated the FAIR principles’ global uptake. The influential network at that meeting played a significant role, spreading the concept to scientists, policymakers, and so on. The timing was also right, with existing discussions around data accessibility and the return on investment of publicly funded research. As the G20, G7, EU, NIH, and others adopted them, the principles quickly became a dominant ideology and were codified, embedded in policies, guides, conferences, and communities. From there, it just skyrocketed.  

Why is FAIR important for life sciences organizations, particularly pharmaceutical companies? 

In the early days of my consulting work with pharma organizations, it was clear that many companies were struggling with data governance: it was hard to find data, and once you did find it, it was unclear how to get access rights and so on. Organizations, by design or by accident, have segregated their various data spaces. This makes conducting data-driven research very difficult. FAIR addresses this by promoting accessible data, not necessarily open, but available to those who need it within proper governance structures.  

The value of FAIR data may be clear for large pharma companies, but achieving it requires substantial initial work. Organizations need to consider producing FAIR data from the start, integrating the right vocabularies, terminologies, and schemas into existing processes. When they do, data users benefit immediately rather than wrangling data for days on end.  
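
As a rough illustration from our side (not a prescribed method), the sketch below uses Python and rdflib to show what producing FAIR metadata at creation time can look like. The dataset identifier, file location, and vocabulary choices are assumptions made for the example; DCAT and Dublin Core are common options rather than terms mandated by the FAIR principles themselves.

```python
# A rough sketch of "FAIR from the start": describing a dataset with machine-readable
# metadata at the moment it is produced. The dataset URI and file location are
# hypothetical, and DCAT / Dublin Core are common vocabulary choices rather than ones
# mandated by the FAIR principles.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

dataset = URIRef("https://example.org/dataset/compound-screen-0001")  # hypothetical ID

g = Graph()
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Compound screening results, assay batch 1")))
g.add((dataset, DCTERMS.identifier, Literal("compound-screen-0001")))                          # Findable
g.add((dataset, DCAT.accessURL, URIRef("https://example.org/data/compound-screen-0001.csv")))  # Accessible
g.add((dataset, DCTERMS.conformsTo, URIRef("http://purl.obolibrary.org/obo/obi.owl")))         # Interoperable
g.add((dataset, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))      # Reusable

# Emit the metadata as Turtle so downstream catalogues and pipelines can index it.
print(g.serialize(format="turtle"))
```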

How important is FAIR/quality data for implementing AI in an organization? Where do knowledge graphs come into that? 

Data quality remains a key issue, as quality and utility are often subjective. A dataset that appears high-quality may be inadequate for certain uses. FAIR principles encourage communities to establish metrics that codify quality in machine-readable terms, enabling automatic checks. Provenance is also crucial for FAIR, as it allows users to trace how data was generated, ensuring it meets community standards. 
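
Building on the sketch above, the snippet below shows what such an automatic, machine-readable check might look like in practice; the list of required terms stands in for a hypothetical community agreement and is not an official FAIR metric.

```python
# A hypothetical machine-readable quality check: does a dataset description carry the
# metadata terms a community has agreed to require? The required terms are illustrative,
# not an official FAIR metric.
from rdflib import Graph, URIRef
from rdflib.namespace import DCAT, DCTERMS

REQUIRED_TERMS = (DCTERMS.title, DCTERMS.identifier, DCTERMS.license, DCAT.accessURL)

def metadata_report(graph: Graph, dataset: URIRef) -> dict:
    """Map each required metadata term to whether the dataset declares it."""
    return {term: (dataset, term, None) in graph for term in REQUIRED_TERMS}

# Example, using the metadata graph sketched earlier:
# missing = [term for term, present in metadata_report(g, dataset).items() if not present]
```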

For AI, data quality directly impacts model performance. A model can show high accuracy on poor data yet fail in testing. This emphasizes the need not only for high-quality data but also for justification abilities, so a model can explain its predictions. Knowledge graphs can support this by offering structured explanations, such as the evidence for a drug treating a disease. Language models, while capable of generating explanations, often lack factual accuracy or logical consistency in their reasoning.  

How essential are knowledge graphs for AI applications? 

Knowledge graphs are especially relevant in the context of large language models. Databases have traditionally been reliable sources for retrieving precise answers through structured queries. You ask a specific question and receive accurate information from stored data. Knowledge graphs are a step beyond, allowing users to ask questions based on concepts rather than locating data in tables. This shift enables domain experts to query data semantically, asking, for instance, about proteins interacting with a specific molecule without needing technical SQL skills. 

Knowledge graphs embed meaning, making data more accessible and allowing for the aggregation of data from various sources under common labels, like "protein data" or "drug data." This provides a powerful intermediate layer between raw data and human questions, ideal for precise information retrieval. 
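
To make the contrast with SQL concrete, here is a toy example from our side of the concept-level question Michel mentions, expressed as a SPARQL query over a small rdflib graph. The vocabulary (ex:Protein, ex:interactsWith) and the interaction facts are invented purely for illustration.

```python
# A toy illustration of a concept-level query: "which proteins interact with this
# molecule?" The ex: vocabulary and the facts are invented for illustration; a real
# knowledge graph would use community ontologies and curated interaction data.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.TP53, RDF.type, EX.Protein))
g.add((EX.TP53, RDFS.label, Literal("Cellular tumor antigen p53")))
g.add((EX.BRCA1, RDF.type, EX.Protein))
g.add((EX.MoleculeX, RDF.type, EX.SmallMolecule))
g.add((EX.TP53, EX.interactsWith, EX.MoleculeX))  # invented interaction

# The question is phrased in terms of concepts (Protein, interactsWith),
# not tables, joins, or column names.
query = """
PREFIX ex: <http://example.org/>
SELECT ?protein WHERE {
    ?protein a ex:Protein ;
             ex:interactsWith ex:MoleculeX .
}
"""
for row in g.query(query):
    print(row.protein)  # -> http://example.org/TP53
```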

Large language models, on the other hand, generate responses based on probabilistic predictions rather than factual databases, which can yield plausible but not always accurate answers. Combining language models' interpretative abilities with structured data sources offers the best of both worlds: human-like interpretation with access to verified information.  

In your opinion, what is on the horizon for data and AI? 

I think we need AI systems that are more responsive and human-centered for scientific inquiry, particularly in pharma. These applications require AI to provide detailed, justified arguments with data—qualities lacking in current systems that tend to fabricate information. 

At the same time, creative tasks benefit from AI that can "hallucinate," generating imaginative content like images, music, and videos. This flexibility can inspire new ideas and directions for any industry. So, we need both: rigorous, critical-thinking AIs for scientific work and creative, unbounded AIs for inspiration.  

What advice would you give for an organization struggling to implement FAIR? 

My motivation for FAIR is rooted in the goal of making data usable, so, my advice is: “Don’t create data graveyards. Instead, have a purpose-driven vision for data, building systems that enable analysts, scientists, or whomever to access and use data effectively.” 

This requires more than individual enthusiasm—it demands organizational governance. FAIR must be part of the company’s vision and strategy, with clear structures that ensure innovations, such as ontologies or data repositories, are integrated organization-wide. Without this vision and governance, FAIR becomes ineffective, and data initiatives won’t lead to meaningful outcomes. 


Hear more from Michel during our upcoming fireside chat on the business case for FAIR on 11 December 2024. Michel, along with fellow industry experts, will discuss how the FAIR principles enhance data utility and accessibility, how they impact decision-making and organizational success, and how to balance costs and value for robust ROI measurement. Register now!