
Knowledge graphs to enhance your AI and machine learning endeavors: best practices

Life sciences research generates a staggering amount of data. Learn how to harness its power through effective modeling, how to ensure its quality, and how to use it effectively to reveal new insights for your research.

ONTOFORCE team

Introduction 

Knowledge graphs are structured representations of information that form a graph of interconnected concepts. They consist of nodes, which represent entities, and edges, which represent the relationships between those nodes. In the life sciences, knowledge graphs integrate diverse datasets, including publications and genomic, proteomic, and clinical data. This diversity of information enables comprehensive understanding and analysis of biological systems, informing drug discovery, personalized medicine, and disease research by revealing hidden relationships, predicting outcomes, and facilitating the identification of novel therapeutic targets. By providing a holistic view of complex biological data, knowledge graphs enhance decision-making and accelerate scientific discoveries in the life sciences.

Best practices start with FAIR data principles 

Any best practice surrounding knowledge graphs must be guided throughout by the FAIR data principles of findability, accessibility, interoperability, and reusability. These principles aim to enhance the utility of data by ensuring it is properly managed and shared. When applied to knowledge graphs, data becomes easily discoverable through unique identifiers and metadata (findable), retrievable through standardized protocols (accessible), expressed in common formats and vocabularies that facilitate integration across systems (interoperable), and annotated with rich metadata and usage licenses to support reuse in various contexts (reusable). Normalizing and annotating data within knowledge graphs according to FAIR principles improves data quality, consistency, and utility, fostering more effective data integration, analysis, and sharing across the life sciences community.

Starting with a solid foundation of FAIR data allows us to focus on further best practices surrounding modeling, data quality, and data utilization. The remainder of this blog will discuss some concrete practicalities to take into account when modeling a knowledge graph, specifically one for artificial intelligence (AI) and machine learning (ML) purposes. 

Modeling best practices

Firstly, in terms of modeling best practices, it is important to consider which types of data, relationships, ontologies, and descriptions are most important for your specific needs. Sometimes, superfluous data, especially too much of the same kind, can introduce biases into the ML. At other times, building a specialized knowledge graph may be the best fit for your needs. Together, these considerations will help you build a robust model that retrieves the most accurate and complete answers possible to your queries.

1. Consider typing your relationships

A knowledge graph contains different entities; for example, these could be drugs and targets. These entities can be linked to each other in multiple different ways: a drug could be an inhibitor of a target, a modulator of a target, and so on. In most situations where machine learning (ML) is used on top of your knowledge graph, it is best to specify and use the most important type of relationship for your use case, i.e. whether to consider inhibitors, modulators, or both. This distinction is important; otherwise, the ML model will be fed noise and may reach wrong conclusions.

There are, however, some situations where it may be sensible to ignore the types and simply feed the existence of a relationship to your machine learning.


Michael Vanhoutte, Vice President Engineering at ONTOFORCE, elaborates: “Suppose you want to find key opinion leaders on rare diseases using information from clinical trials and publications. It may not really matter whether the key opinion leader is a principal investigator or what role they played in a publication, so you would just use any type of relationship.”

It is always worth considering whether typing your relationships is useful in each case, but generally it is best to do so.
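As an illustration, here is a minimal sketch, using networkx and invented drug and target names, of how the same graph can be consumed in a typed or an untyped way depending on the use case:

```python
# A minimal sketch of typed vs. untyped relationship handling; the drug and
# target names and the "rel" attribute are invented for illustration.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("aspirin", "PTGS2", rel="inhibits")
kg.add_edge("ibuprofen", "PTGS2", rel="inhibits")
kg.add_edge("caffeine", "ADORA2A", rel="modulates")

# Typed view: keep only the relationship type that matters for the use case,
# e.g. training a model on inhibition evidence alone.
inhibition_edges = [
    (u, v) for u, v, d in kg.edges(data=True) if d["rel"] == "inhibits"
]

# Untyped view: for use cases like finding key opinion leaders, the mere
# existence of a link may be enough, so collapse all types into one edge set.
any_edges = {(u, v) for u, v, _ in kg.edges(data=True)}

print(inhibition_edges)  # [('aspirin', 'PTGS2'), ('ibuprofen', 'PTGS2')]
print(any_edges)
```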

2. How to define your canonical types?

Generally, your entities will be well defined. However, in some cases it may make sense to merge certain concepts into one, for example when considering phenotype and disease. In Alzheimer's research, phenotypes such as memory impairment or neurological decline are clearly distinct from the disease itself. For cancer research, however, the line between phenotype and disease is less clear. In that case it may make more sense to merge both types into one entity and use that to feed the ML algorithm.

Just like with defining relationship types, the decision of whether to merge or not will be different depending on your use case and the questions you want to answer. It is important to consider this on a case-by-case basis to get the best out of ML.
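As a sketch of what such a merge might look like in practice, the snippet below uses an invented mapping to collapse phenotype and disease into a single canonical type only where the use case calls for it:

```python
# A hypothetical sketch of use-case-dependent canonical typing; node names
# and use-case labels are made up for illustration.
nodes = [
    ("memory impairment", "phenotype"),
    ("Alzheimer's disease", "disease"),
    ("HER2 overexpression", "phenotype"),
    ("breast carcinoma", "disease"),
]

# For oncology the phenotype/disease boundary is blurry, so both types are
# collapsed into one canonical "condition" type; for Alzheimer's they stay apart.
MERGE_FOR_USE_CASE = {
    "alzheimers": {},
    "oncology": {"phenotype": "condition", "disease": "condition"},
}

def canonical_type(node_type: str, use_case: str) -> str:
    """Return the canonical entity type for a given use case."""
    return MERGE_FOR_USE_CASE[use_case].get(node_type, node_type)

for name, node_type in nodes:
    print(name, "->", canonical_type(node_type, "oncology"))
```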

3. Consider simplifying your links


Typically for drug screening, entities such as an active substance and a protein will be linked by activity data, like bioassays, and this information will be used as evidence to feed into your ML algorithm for screening drugs. This is great if all of that evidence is genuinely useful for your research, but sometimes too much superfluous data can confuse the ML. In that case, it may be more useful to collapse the evidence into a direct relation from the active substance to the protein. The less unnecessary information that is fed into the ML algorithm, the less noise is created, improving the quality of your output.
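A simplified illustration of this idea, with made-up substance, bioassay, and protein names: two-hop evidence paths are joined on the shared bioassay node to derive direct substance-to-protein edges.

```python
# Collapse evidence nodes into direct links: two-hop paths
# (substance -> bioassay -> protein) become one direct edge, so the ML
# model sees the relationship without the intermediate noise.
evidence_triples = [
    ("substance_A", "measured_in", "bioassay_1"),
    ("bioassay_1", "targets", "protein_X"),
    ("substance_B", "measured_in", "bioassay_2"),
    ("bioassay_2", "targets", "protein_X"),
]

measured = {(s, o) for s, p, o in evidence_triples if p == "measured_in"}
targets = {(s, o) for s, p, o in evidence_triples if p == "targets"}

# Join on the shared bioassay node to derive a direct edge.
direct_edges = {
    (substance, protein)
    for substance, assay in measured
    for assay2, protein in targets
    if assay == assay2
}
print(direct_edges)  # {('substance_A', 'protein_X'), ('substance_B', 'protein_X')}
```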

4. Explain your model with descriptions

This fourth practice is most relevant for large language models (LLMs) using natural language processing (NLP). For example, to find all clinical trials for Huntington’s disease that involve blood tests, you might put that statement to the LLM as a natural language query. To carry this out effectively, the LLM needs to understand how that information is modeled in the knowledge graph. This can be achieved by making sure that every entity and property in the knowledge graph has a clear description or definition, then feeding those descriptions to the LLM as part of your context, to teach it the types of knowledge stored in the knowledge graph.

ONTOFORCE’s DISQOVER platform has descriptions of all the entities and properties in the system. This information can then be used by the LLM to make the correct assessment and establish the right links when translating your natural language queries.

“A collaborator is an organization other than the Sponsor that provides support for a clinical study. This support may include funding, design, implementation, data analysis, or reporting.” (Example definition from DISQOVER’s public data)
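The snippet below sketches this idea with a hypothetical descriptions dictionary and a stubbed-out call_llm function; it is not the DISQOVER Assistant implementation, just an illustration of feeding schema descriptions into the LLM's context.

```python
# A minimal sketch of supplying entity/property descriptions to an LLM as
# context; the descriptions and call_llm() are illustrative stand-ins.
descriptions = {
    "ClinicalTrial": "A research study in which human participants are "
                     "assigned to interventions to evaluate health outcomes.",
    "collaborator": "An organization other than the sponsor that provides "
                    "support for a clinical study, such as funding, design, "
                    "implementation, data analysis, or reporting.",
}

def build_context(question: str) -> str:
    """Prepend the knowledge graph's schema descriptions to the user query
    so the LLM knows how the information is modeled."""
    schema = "\n".join(f"- {name}: {text}" for name, text in descriptions.items())
    return (
        "The knowledge graph contains the following entities and properties:\n"
        f"{schema}\n\n"
        f"Translate this question into a query against that model: {question}"
    )

prompt = build_context(
    "Find all clinical trials for Huntington's disease that involve blood tests."
)
# response = call_llm(prompt)  # hypothetical LLM client call
print(prompt)
```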

5. Select and use the right ontologies

Using the right ontology may seem obvious, but it is especially important in an AI/ML context. There are a number of industry-standard ontologies, as well as others that are approved or used by the EMA and FDA, and these are regularly updated and maintained. Even within these ontologies, there are certain situations where it is beneficial to swap from one to another. It is important to consider whether the ontology you are using is the right one for your case.

As an example that demonstrates the importance of this, ONTOFORCE recently switched from UMLS (Unified Medical Language System) to Mondo, because some synonyms in UMLS overlapped with drug names, which led to incorrect conclusions in the ML analysis. You must really consider what kind of granularity your chosen ontology provides for your use case, and whether that is appropriate for the kind of research you are doing. Sometimes this may be Mondo, for example; other times it may be something like UMLS.

6. Consider building specialized graphs

Most of ONTOFORCE’s customers are trying to build an enterprise knowledge graph specifically for AI/ML, where all the information is linked together, enabling as many use cases as possible. Naturally, if it is possible to start directly from your enterprise knowledge graph, that will be the easiest and cheapest way forward. However, in some cases it may be beneficial to build a specialized knowledge graph for that particular use case. This will allow you to optimize for that use case and apply all the other best practices already described.

Data quality best practices

Data quality is obviously a very important factor to consider for any analysis, particularly in the life sciences, where human health could be at risk from decisions based on inaccurate results. When AI and ML are added to the mix, data quality becomes especially important, because they can produce credible-sounding but ultimately wrong conclusions that may be difficult to identify. It is crucial to catch these errors early to avoid carrying incorrect data through to your analyses.

Investing in the correct tools with embedded QA and data visualization systems can assist with identifying a range of potential errors in your data, including:

  • NLP errors: a well-known NLP technology recognized glioma as a malignant neoplasm of the frontal lobe, whereas gliomas are not limited to the frontal lobe.
  • Double entries: “knee pain” and “pain in knee” are considered two different symptoms from an ML perspective if they are fed as two separate entities from your ontology (see the sketch below).
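As a toy illustration of the second category, a simple token-based normalization can catch trivially reordered duplicates before they reach the ML pipeline; real deduplication would rely on your ontology's synonym lists.

```python
# Catch double entries such as "knee pain" vs. "pain in knee" by normalizing
# labels to a bag of tokens before they reach the ML pipeline.
def normalize(label: str) -> frozenset:
    """Reduce a label to lowercase tokens, dropping stop words, so
    trivially reordered duplicates collapse to the same key."""
    stop_words = {"in", "of", "the"}
    return frozenset(w for w in label.lower().split() if w not in stop_words)

symptoms = ["knee pain", "pain in knee", "headache", "pain of the knee"]

seen: dict[frozenset, str] = {}
for s in symptoms:
    key = normalize(s)
    if key in seen:
        print(f"possible duplicate: {s!r} ~ {seen[key]!r}")
    else:
        seen[key] = s
```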

1. Invest in built-in QA tooling

Our first recommendation for addressing these issues and more is no small task, but it is an important one: to build QA tooling that is specific to your use case.

A demonstration of the work involved in developing such QA tooling is the new DISQOVER Assistant that will be released in ONTOFORCE’s DISQOVER 7 later this year. This capability was developed with three main considerations in mind, the first of which was, obviously, to make sure it works from both a coding and an LLM perspective. Secondly, because the system will be dealing with public data, the descriptions mentioned above were crucial to ensure the LLM understands how the information is modeled. Finally, a significant amount of time was spent building a system to measure the quality.

Continuously running thousands of queries against any new improvement to the system reveals not only the quality of the output but also the success rate, while identifying where any issues arise. Such issues might be in recognizing diseases, in less easily recognized data types like numbers or dates, or in how certain phrases or sentences are formulated. These queries yield metrics that build confidence in the process and give a full understanding of the impact of any change to the system.
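The sketch below shows, in skeletal form and with an invented run_query interface, what such a regression harness can look like: known queries with expected outcomes are replayed after every change, and failures are bucketed by category.

```python
# A bare-bones regression harness sketch; test cases, expected results, and
# run_query() are hypothetical, not the DISQOVER Assistant's actual tooling.
from collections import Counter

test_cases = [
    {"query": "trials for glioma started after 2020", "expect": "Q1", "category": "dates"},
    {"query": "drugs that inhibit PTGS2", "expect": "Q2", "category": "entities"},
    # ...in practice, thousands of cases covering diseases, numbers, phrasing
]

def evaluate(run_query) -> None:
    """Replay every test case against the system under test and report the
    success rate plus a breakdown of failures by category."""
    failures = Counter()
    passed = 0
    for case in test_cases:
        if run_query(case["query"]) == case["expect"]:
            passed += 1
        else:
            failures[case["category"]] += 1
    print(f"success rate: {passed / len(test_cases):.1%}")
    print("failures by category:", dict(failures))

# evaluate(run_query)  # plug in the system under test
```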

2. Use tools to visualize data issues

DISQOVER’s data ingestion engine already has quality control capabilities that allow you to quickly identify and respond to changes or errors in your data. However, this type of analysis tends to discover trends, not necessarily an outlier in one piece of data. This is where the visualization tools provided by DISQOVER really help.

For example, if you are analyzing clinical trial data, a single view can show start and end dates, the number of people enrolled, and any missing data. If you suddenly see that 500 million people are enrolled in a study, something has clearly gone wrong. Or you may notice that your data has been doubled because “rabbit” and “rabbits” have been identified as two different organisms. These kinds of errors become obvious using DISQOVER’s visualization capabilities, so you can find outliers in your data and respond appropriately.
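A minimal sketch of the same idea in code, using pandas and illustrative column names rather than DISQOVER's actual schema: implausible enrollment numbers and missing dates are flagged for review.

```python
# Flag implausible values in clinical trial data; the column names and
# threshold are illustrative only.
import pandas as pd

trials = pd.DataFrame({
    "trial_id": ["T1", "T2", "T3"],
    "enrollment": [250, 500_000_000, 1200],   # T2 is clearly wrong
    "start_date": ["2021-03-01", "2020-07-15", None],
})

# Flag enrollments far outside a plausible range and missing start dates.
suspicious = trials[(trials["enrollment"] > 1_000_000) | trials["start_date"].isna()]
print(suspicious)
```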

Knowledge utilization best practices

Finally, there are also best practices to take into account surrounding the way you use your data for AI, including how much data to use in your analysis, which data formats to choose, and which tool is most appropriate for your needs.

1. Use only what you need

It may seem that using all the available knowledge will make your result more complete and save you the time of deciding what to use. However, doing so will substantially increase your runtime and development costs, introduce noise into your ML algorithm, and decrease overall performance, whether in runtime or in quality. Being very selective about the data that is fed into your machine learning, and making sure it has been modeled correctly, will greatly benefit the quality and efficiency of your result.
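One way to express this selectivity, assuming an RDF export and an invented predicate, is a SPARQL CONSTRUCT query that pulls only the edges the model actually needs rather than the whole graph:

```python
# A hedged sketch of selective extraction with rdflib; the prefix and
# predicate names are invented for illustration.
from rdflib import Graph

g = Graph()
# g.parse("knowledge_graph_export.ttl")  # load your exported graph here

SELECTIVE_QUERY = """
PREFIX ex: <http://example.org/kg/>
CONSTRUCT { ?drug ex:inhibits ?target }
WHERE {
    ?drug ex:inhibits ?target .
}
"""

subgraph = g.query(SELECTIVE_QUERY).graph  # only the edges the ML needs
print(len(subgraph), "triples selected")
```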

2. Use the right tool for the job

If you do choose to use DISQOVER, we have two sets of technologies that offer ML analysis on top of knowledge graph data, and it is important to select the right one. For more ad hoc analyses, for example in Jupyter interactive notebooks, we highly recommend using the REST API to do your analysis in a web-based environment. For bulk machine learning analyses, for example using SageMaker, Azure Machine Learning, or TensorFlow, it is more appropriate to carry out a bulk export of all the information in your knowledge graph and feed it into whatever system you want to use for the analysis.
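For the ad hoc path, the sketch below shows the general shape of a REST call from a notebook using the requests library; the endpoint and parameters are placeholders, so consult the DISQOVER API documentation for the real interface.

```python
# An illustrative ad hoc query from a Jupyter notebook; the URL, path, and
# parameters are hypothetical, not the actual DISQOVER REST API.
import requests

BASE_URL = "https://disqover.example.com/api"  # placeholder endpoint

response = requests.get(
    f"{BASE_URL}/entities",
    params={"type": "ClinicalTrial", "q": "Huntington"},
    timeout=30,
)
response.raise_for_status()
trials = response.json()
print(len(trials), "results")
```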

3. Use the right format for the job

The final consideration is the correct format for your particular use. After deciding how to use the data from DISQOVER in an ML context, you need to know which system to feed it into. It is not enough to simply assume you will be pushing information into a data repository like Amazon Neptune, Snowflake, or another similar tool. There are a number of implications to bear in mind when deciding on a format; we will discuss the two main ones here.

The first consideration is the richness of the data. For example, exporting information from a DISQOVER knowledge graph as comma-separated values (CSV) or Parquet files yields formats that are very easy to process. However, the tabular structure of these file types means that only the tabular data is transferred, not the descriptions assigned to entities or the meanings of relationships; these would have to be re-established manually. The Resource Description Framework (RDF) is slightly more complicated to deal with, but it is a richer format that carries much more information and gives you more possibilities for downstream processing.

The second consideration is cost and capability, which will be affected by the platform to which you are pushing your information. The further you move towards a data lake (a single store of data from various sources) or SQL and graph databases, the more capabilities you gain out of the box, because you can run SQL statements or semantic SPARQL queries against these systems, albeit at a greater cost. Provided it is suitable for your needs, feeding RDF or Parquet files directly to Azure Machine Learning or AWS SageMaker could be much more cost effective.
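The contrast can be seen in a few lines, using illustrative in-memory data rather than a real DISQOVER export: the tabular route transfers values only, while the RDF route carries descriptions along with the data.

```python
# A short sketch contrasting the two formats; data and file names are
# illustrative, not a real DISQOVER export.
from rdflib import Graph, RDFS

# Tabular export: easy to process, but descriptions and edge semantics are
# lost and must be re-established manually.
# df = pd.read_parquet("kg_export.parquet")  # hypothetical Parquet export

# RDF export: richer; e.g. human-readable descriptions travel with the data.
g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/kg/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Collaborator rdfs:comment "An organization other than the sponsor that supports a clinical study." .
""", format="turtle")

for subject, _, comment in g.triples((None, RDFS.comment, None)):
    print(subject, "->", comment)
```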

To learn more about best practices for AI/ML, watch our webinar “Elevate your AI and Machine Learning analytics with knowledge graphs”.

You can also check out the blog "The value of knowledge graphs for AI & ML applications".

To learn more about DISQOVER or to request a demo, please see here.

How about discussing your project with our experts?

Book a meeting