Approaches toward insights into pathogen evolution

Key Points

  • Microbial genomics leverages high-throughput sequencing techniques to study genomic sequences of microorganisms.
  • The ability to provide comprehensive insights into pathogen identity, diversity, and evolution is critical to accelerating vaccine development for human and animal health.
  • Bioinformatics workflows that implement approaches such as phylogenetic analysis and taxonomic classification are key to generating valuable insights.


Background: Importance of microbial genomics

Microbial genomics is a high-throughput OMICS-based technique that entails the study of the genomic sequences of microorganisms. Research in microbial genomics has provided many insights into microbiome functioning and has refined perturbation methods, ultimately improving both human and animal health.1

The ability to provide comprehensive insights into pathogen identity, diversity, and evolution is invaluable in modern infectious disease research and has profound implications for the development of vaccines.

Challenges in analyzing microbial genomics data for vaccine development

Given the many benefits of OMICS-based technologies, the analysis and interpretation of omics data to understand the spread and evolution of pathogens is increasingly important for vaccine development. As ever more varied omics data is generated, numerous tools have been developed to process it, from quality control to downstream analysis. This creates both experimental and computational challenges in data acquisition and analysis2. Bioinformatics workflows that leverage existing tools and are tailored to business requirements are therefore urgently needed.

Approaches to generate insights into pathogen evolution:

Here, we summarize a few approaches for analyzing pathogen evolution from microbial genomics data and highlight our experience and expertise in implementing them as workflows.

  • Phylogenetic analysis:
    • This technique is used to study evolutionary relatedness among groups of organisms. A traditional workflow involves a multiple sequence alignment tool such as MAFFT, MUSCLE, or ClustalW, followed by a phylogenetic tree-building tool such as PHYLIP or FastTree (a minimal sketch of such a chain is shown below Figure 1).
    • Recently, we leveraged Nextstrain3, an open-source tool for tracking pathogen evolution, to develop a custom solution for a large pharma customer. The solution helped our customer explore and visualize publicly available data alongside company-internal data. It was deployed in the company's environment so that business users can use it freely and securely. Figure 1 shows a visualization of publicly available SARS-CoV-2 data on Nextstrain's dashboard.

Figure 1: Nextstrain's dashboard based on publicly available SARS-CoV-2 data (adapted from https://nextstrain.org/ncov/global)
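As a concrete illustration of the traditional alignment-and-tree-building chain described above, the sketch below calls MAFFT and FastTree from Python. It assumes both tools are installed on the PATH and that `sequences.fasta` is a placeholder input; exact flags and tool choices depend on the dataset and scientific question, so treat this as a minimal sketch rather than a production workflow.

```python
import subprocess
from pathlib import Path

def build_tree(fasta: str, workdir: str = "phylo_out") -> Path:
    """Align sequences with MAFFT, then infer an approximate ML tree with FastTree.

    Assumes MAFFT and FastTree are installed on the PATH; file names are
    illustrative placeholders, not a fixed convention.
    """
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    aligned = out / "aligned.fasta"
    tree = out / "tree.nwk"

    # Multiple sequence alignment (MAFFT writes the alignment to stdout)
    with aligned.open("w") as fh:
        subprocess.run(["mafft", "--auto", fasta], stdout=fh, check=True)

    # Approximate maximum-likelihood tree for nucleotide sequences
    with tree.open("w") as fh:
        subprocess.run(["FastTree", "-nt", str(aligned)], stdout=fh, check=True)

    return tree

if __name__ == "__main__":
    print(build_tree("sequences.fasta"))
```

The resulting Newick tree can then be inspected with standard viewers or passed to richer toolkits (for example, Nextstrain's Augur/Auspice components) for annotation and interactive exploration.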

  • Taxonomic classification:
    • One approach to analyzing novel pathogens is to assign sequences to taxonomic groups based on sequence similarity. Tools such as Kraken2 work by matching exact k-mers between input sequences and a database of reference sequences with taxonomic information (a minimal invocation is sketched below Figure 2).
    • We implemented a previously published workflow4 to automate the analysis of multiple sequencing runs of animal samples, from raw reads to interactive visualization, for the classification of novel pathogens. The workflow filters for high-quality reads and efficiently reports confidence scores for the classification results. With this workflow, our customer could generate insights into novel pathogens at a much higher pace than before and accelerate vaccine development by focusing on pathogens of interest during downstream validation. An exemplary report is provided in Figure 2, which depicts the classification of sequences found in a human gut sample.

Figure 2: Exemplary interactive visualization performed on microbial genomics data (Source: Interactive visualization of taxonomic classification in Krona)
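To make the classification step more concrete, here is a minimal, hedged sketch of running Kraken2 against a pre-built reference database and loading its standard report into a table. The database path, file names, and the top-hits query are placeholders; the published workflow4 we implemented adds read filtering, confidence scoring, and interactive Krona reports on top of this.

```python
import subprocess
import pandas as pd

def classify_reads(reads: str, db: str, report: str = "kraken2_report.txt") -> pd.DataFrame:
    """Run Kraken2 on a FASTQ file and return the per-taxon report (paths are illustrative)."""
    subprocess.run(
        [
            "kraken2",
            "--db", db,                        # pre-built reference database with taxonomy
            "--report", report,                # per-taxon summary report
            "--output", "kraken2_output.txt",  # per-read assignments
            reads,
        ],
        check=True,
    )
    # Standard Kraken2 report columns: % of reads, reads in clade, reads assigned
    # directly, rank code, NCBI taxonomy ID, scientific name
    cols = ["pct_reads", "clade_reads", "direct_reads", "rank", "taxid", "name"]
    df = pd.read_csv(report, sep="\t", header=None, names=cols)
    df["name"] = df["name"].str.strip()
    return df

if __name__ == "__main__":
    report_df = classify_reads("sample_reads.fastq", db="/path/to/kraken2_db")
    # Example: top species-level hits by clade read count
    print(report_df[report_df["rank"] == "S"].nlargest(10, "clade_reads"))
```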

Our expertise and learnings to boost vaccine development initiatives

Figure 3: Our expertise and learnings to boost vaccine development initiatives

How can OSTHUS support you in your vaccine development initiatives?

  1. We have bioinformatics domain knowledge to interpret existing datasets and to assess their relevance and potential impact based on the literature.
  2. We provide technology-agnostic scientific advisory and data management solutions, from vision and strategy, to market analysis, to implementation.
  3. We develop and/or automate customized workflows that not only fit into the existing tool ecosystem but also scale to meet dynamic data processing demands in terms of compute, storage, and performance.
  4. We can spearhead the development of standardized workflows and best practices for reuse in order to avoid siloed solutions within the company.

Contact us to get the details on how we are helping our customers in accelerating their vaccine development programs.

References:


Disclaimer

The contents of this blog are solely the opinion of the author and do not represent the opinions of PharmaLex GmbH or its parent Cencora Inc. PharmaLex and Cencora strongly encourage readers to review the references provided with this blog and all available information related to the topics mentioned herein and to rely on their own experience and expertise in making decisions related thereto.

OMICs technologies uncovered several unknown aspects of the central dogma

Several scientific and technological advances have uncovered new knowledge about each step of the central dogma, the unidirectional flow of information from DNA to RNA to proteins (and, in many cases, directly from RNA to proteins). Simultaneously, these advances revealed the epigenetic regulation of the central dogma and the importance of probing other biomolecules such as lipids and metabolites.

Due to rapidly evolving technologies and the decreasing cost of generating OMICs data (quantitative, high-throughput data on biomolecules), we require specific tools to analyze it rapidly. Data analysis is a key factor in R&D processes because of the increasing spatio-temporal resolution of these measurements, from whole organisms to tissues and even individual cells. This is also evident from the large investments OMICs data attracts across the R&D industry and academia.1 In January 2023, the EU launched a joint program worth 16.5 million euros for large-scale analysis of OMICs data for drug-target discovery in neurodegenerative diseases alone.

Figure 1: OMICs technologies centered around the central dogma in biology

Reproducibility and Ease of Operations are major challenges for OMICs Data Analysis

Scientific workflows or pipelines (Figure 2), series of software tools executed one after the other in a stepwise manner, are important for rapidly analyzing and interpreting the vast amounts of data generated by various OMICs techniques. For example, RNA-seq data analysis (transcriptomics) involves trimming, alignment, quantification, normalization, and differential gene expression analysis, where the output of one tool serves as the input for the next tool in the workflow. Permutations of these tools can lead to over 150 sequential workflows or pipelines, so reproducing and comparing their results can be challenging2.
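As a simplified illustration of this chaining, the sketch below runs read trimming (fastp) followed by transcript quantification (Salmon) for a single-end sample, with each step consuming the previous step's output. The tool choice, flags, file names, and index path are assumptions for illustration; real transcriptomics pipelines add QC reporting, alignment, normalization, and differential expression downstream.

```python
import subprocess
from pathlib import Path

def rnaseq_single_sample(fastq: str, salmon_index: str, outdir: str = "rnaseq_out") -> Path:
    """Trim reads with fastp, then quantify transcripts with Salmon (single-end sketch).

    Tools, flags, and paths are illustrative; a production pipeline would add
    QC, alignment, normalization, and differential expression steps.
    """
    out = Path(outdir)
    out.mkdir(exist_ok=True)
    trimmed = out / "trimmed.fastq.gz"

    # Step 1: adapter/quality trimming; this output becomes the next step's input
    subprocess.run(["fastp", "-i", fastq, "-o", str(trimmed)], check=True)

    # Step 2: transcript quantification against a pre-built Salmon index
    quant_dir = out / "quant"
    subprocess.run(
        ["salmon", "quant", "-i", salmon_index, "-l", "A",
         "-r", str(trimmed), "-o", str(quant_dir)],
        check=True,
    )
    return quant_dir / "quant.sf"  # per-transcript abundance table
```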

Below are some examples of pipelines and software tools to perform different analysis steps in different OMICs experiments.

Integrative frameworks can help facilitate the execution of these pipelines. Typical no/low-code frameworks for non-programmers are Galaxy, Unipro UGENE NGS, and MIGNON. For developers, suitable analysis frameworks include Snakemake, Nextflow, and Bpipe. Community-driven platforms like nf-core provide peer-reviewed, best-practice analysis pipelines written in Nextflow.
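For teams adopting such frameworks, launching a community-maintained pipeline is often a single command. The sketch below wraps an nf-core RNA-seq run in Python; the profile, sample sheet, and parameters are assumptions to be adapted to your environment, and a reference genome parameter would also be required in practice.

```python
import subprocess

# Illustrative launch of a community-maintained nf-core pipeline; adapt the
# profile, input sheet, and reference parameters to your own environment.
cmd = [
    "nextflow", "run", "nf-core/rnaseq",
    "-profile", "docker",          # or singularity/conda, depending on the site setup
    "--input", "samplesheet.csv",  # sample sheet in the pipeline's expected format
    "--outdir", "results",
    "-resume",                     # reuse cached results from previous runs
]
# In practice a reference must also be supplied (e.g. --genome or --fasta/--gtf).
subprocess.run(cmd, check=True)
```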

One way to ensure efficient, streamlined R&D is to establish and follow standard practices for OMICs data analysis. Figure 2 shows gold-standard software tools for performing various intermediate steps in different OMICs data analysis pipelines. Standardized OMICs practices, both tools and frameworks, will facilitate accessibility (the A in FAIR) and the reproducibility of high-quality results. They will also enhance business value (please also refer to the following blog post: Multi-Omics Data Integration in Drug Discovery). We would also like to point to the analysis pipelines for major OMICs assay types developed by the ENCODE Data Coordinating Center (DCC).

Although uniformity is valuable in OMICs data analyses, customization is equally valuable for specific scientific contexts. Different OMICs experiments require different handling of the data and analyses. For example, the high variation in signal-to-noise ratio in ChIP-seq experiments to identify transcription factor (TF) binding sites necessitates a wide range of quality thresholds. RNA-seq data analyses are driven by factors such as read length, polyadenylation status, and strandedness, and require different parameters or settings. Hence, there are multiple "generic" factors that can be standardized, while individual parameters and settings can be customized to suit specific scientific questions.

Factors to consider while standardizing OMICs data analysis workflows:

  • Knowledge and understanding of central dogma and design of various high throughput experiments
  • Standard data files and formats (e.g., using a common reference genome across labs/departments wherever possible)
  • Domain-specific languages such as Workflow Description Language (WDL) and/or Common Workflow Language (CWL) to enhance interoperability (the I of the FAIR principles)
  • Flexible frameworks that can run locally (e.g., on HPC) and in the cloud (e.g., AWS, Azure, Google Cloud)
  • Common framework for testing the workflows
  • Wrappers for automated file handling (import and export of data, parameters); a minimal sketch follows this list
  • User friendly interfaces for interactive usage
  • Platform-agnostic installation of frameworks, packages, libraries
  • Common package managers like Conda
  • Intuitive and interactive visualization of workflows, their progress and their results
  • Metadata tracking
  • Common and portable sharing mechanism for pipelines (e.g. Docker images), data and results
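As an example of the wrapper and metadata-tracking points above, the sketch below records basic provenance (parameters, input/output checksums, timestamps) for each workflow step in a JSON sidecar file. The schema and file layout are assumptions chosen for illustration, not an established standard; mature frameworks such as Nextflow and Snakemake provide their own reporting and provenance mechanisms.

```python
import datetime
import hashlib
import json
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum used to verify that inputs/outputs have not changed between runs."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_step(step: str, inputs: list[str], outputs: list[str],
                params: dict, log: str = "run_metadata.json") -> None:
    """Append one workflow step's provenance to a JSON log (illustrative schema)."""
    entry = {
        "step": step,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "parameters": params,
        "inputs": {f: sha256(Path(f)) for f in inputs},
        "outputs": {f: sha256(Path(f)) for f in outputs},
    }
    log_path = Path(log)
    history = json.loads(log_path.read_text()) if log_path.exists() else []
    history.append(entry)
    log_path.write_text(json.dumps(history, indent=2))

# Usage: record_step("trimming", ["raw.fastq.gz"], ["trimmed.fastq.gz"], {"tool": "fastp"})
```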

Uniform OMICs Data Analysis Workflows to Empower your Future:

Relatively well-established OMICs data generation techniques demand standardized ways of data analysis while allowing customization necessary for a specific scientific context. To accelerate discovery and actionable data-driven decisions within your organization, take a step closer to FAIR OMICs data by establishing gold standard workflows and frameworks for OMICs data analyses.

How Can OSTHUS Help?

  • Scientific advice (in experimental design)
  • Tool selection for your specific scientific question
  • Developing an automated workflow using different workflow management frameworks
  • Customizing an existing workflow by developing scripts for individual steps
  • Visualizing analysis results using available business intelligence tools and/or developing bespoke interfaces suitable to answer specific questions

Contact us to learn more about how we help our customers derive scientific insights from their OMICs data.


References:

  1. https://biotechfinance.org/q2/
  2. Corchete LA et al, Scientific Reports, 2020

Disclaimer

The contents of this blog are solely the opinion of the author and do not represent the opinions of PharmaLex GmbH or its parent Cencora Inc. PharmaLex and Cencora strongly encourage readers to review the references provided with this blog and all available information related to the topics mentioned herein and to rely on their own experience and expertise in making decisions related thereto.

Unlocking Precision Medicine: Omics Data Challenges & Solutions

Table of Contents:

  • Key points
  • Keywords
  • Background
  • Business value of integrated omics data
  • We at OSTHUS have done it before: Offerings

Key points:

  • Omics technologies are a fast-paced field producing large amounts of data; effective data management brings opportunities as well as challenges1.
  • Deep omics data analysis capabilities and bioinformatics expertise are becoming central to drug development.
  • Multi-omics data strategies and a long-term vision of enterprise-wide needs are key to realizing the full business value of omics data.

Background:

The first complete, gapless human reference genome was published in 2022 by the Telomere-to-Telomere consortium (the draft genome published in 2003 was incomplete), adding 200 million more base pairs and 1,956 new gene predictions in the process2. It unlocked further potential for functional studies to find new therapeutic targets.

Multi-omics (also called panomics or integrative omics) is the integration of omics datasets from subfields such as genomics, transcriptomics, proteomics, and metabolomics, aimed at increasing our understanding of biological systems3. As the pharmaceutical industry increasingly embraces the era of precision medicine, fast-paced omics technologies are becoming a significant driver of this transformation journey. However, in our experience, gaps remain with respect to data integration, data harmonization, design considerations, and data management strategies for realizing the full potential of omics data.

Genomic databases like GenBank and the Sequence Read Archive (SRA) collectively hold 100+ petabytes of data and are predicted to exceed 2.5 exabytes by 2025. Collecting, integrating, and systematically analyzing heterogeneous big data with distinct characteristics is a challenging task that may lead to data mismanagement. For instance, DNA sequencing data often comes from various platforms like Illumina, Pacific Biosciences, and Oxford Nanopore, each producing data with unique quality thresholds and file types. One specific issue involves the use of multiple identifiers: a protein can have several identifiers depending on the database used, such as UniProt, PDB, or internal source systems. Discrepancies in mapping these identifiers may lead to confusion or misinterpretation of results coming from multiple systems, hindering downstream data analysis.
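To make the identifier problem concrete, the sketch below maps heterogeneous protein identifiers onto a single internal key via a curated lookup table. The table contents, column names, and internal IDs are hypothetical; in practice such mappings are typically generated and maintained from resources such as UniProt's ID mapping service rather than hard-coded.

```python
import pandas as pd

# Hypothetical curated mapping between identifier namespaces; in practice this
# would be built from services such as UniProt ID mapping, not hard-coded.
id_map = pd.DataFrame({
    "internal_id": ["PRT-0001", "PRT-0002"],
    "uniprot":     ["P69905",   "P68871"],
    "pdb":         ["1HHO",     "4HHB"],
})

def harmonize(results: pd.DataFrame, id_col: str, namespace: str) -> pd.DataFrame:
    """Attach the internal identifier to a result table keyed by an external namespace."""
    merged = results.merge(
        id_map[["internal_id", namespace]],
        left_on=id_col, right_on=namespace, how="left",
    )
    unmapped = merged["internal_id"].isna().sum()
    if unmapped:
        # Flag unmapped records instead of silently dropping them
        print(f"Warning: {unmapped} record(s) could not be mapped from '{namespace}'")
    return merged

# Example: expression results keyed by UniProt accession (values are made up)
expression = pd.DataFrame({"uniprot_acc": ["P69905", "Q99999"], "score": [0.9, 0.4]})
print(harmonize(expression, id_col="uniprot_acc", namespace="uniprot"))
```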

Our understanding of the business value of multi-omics data (a holistic view):

  • Streamline R&D processes: The use of omics data can streamline R&D processes. For example, it can help identify high-confidence biomarkers that support predictive models of disease progression.
  • Accelerate drug discovery and development: Omics data, when integrated, can provide in-depth molecular insights that help businesses save time and resources in drug discovery research and predict drug efficacy and safety more quickly.
  • Gain deeper insights: Integrated omics data allows for a more detailed understanding of individual genetic and molecular profiles, driving personalized healthcare solutions.
  • Cost reduction: Efficient data strategies, enabled by streamlined storage, processing, and analysis of omics data, reduce the costs associated with these processes.

Figure 1: Illustration of our understanding and approaches for leveraging multi-omics data

Our exemplary approaches and considerations for certain challenges in omics data management:

Figure 2: OSTHUS' exemplary approaches and considerations for certain challenges in omics data management

Leveraging Bioinformatics Expertise for Optimizing Omics Data Management

How can OSTHUS help?

Figure 3: OSTHUS consulting approach from vision to implementation

In a recent project, a pharmaceutical company was struggling to efficiently manage its genomic and protein sequence data. We are implementing a bespoke, cloud-based, centralized data lake solution that not only consolidates the different data and metadata but also offers an intuitive user interface for quickly extracting insights from similar sequences in both in-house and publicly available resources.

OSTHUS offers end-to-end services from vision and strategy, to market analysis, to implementation. Recognizing that one size doesn't fit all, we offer technology-agnostic consulting. Our bioinformatics experts understand the available technologies and their strengths and weaknesses, from open-source solutions to commercial offerings, which allows us to recommend and implement the best-fit solution for specific objectives.

Conclusion:

To realize the full potential of these approaches and transform raw omics data into meaningful insights, a robust, long-term data strategy is critical.

With strategic planning and expert guidance, these challenges can be effectively managed, unlocking the immense potential of integrated omics data for accelerated drug development.

Contact us today to revolutionize your bioinformatics journey and empower data-driven decision-making in your drug development efforts.

References:

  1. Omics data science โ€“ an interdisciplinary solution
  2. The complete sequence of a human genome
  3. Integrated Omics: Tools, Advances, and Future Approaches
  4. Undisclosed, unmet and neglected challenges in multi-omics studies

Disclaimer:

OSTHUS GmbH is a subsidiary of AmerisourceBergen Corporation. OSTHUS GmbH and AmerisourceBergen strongly encourage readers to review all available information about the topics contained in this blog and to rely on their own experience and expertise in making decisions related thereto.

Not all data is created equal. Understand the different levels of data quality and how they can influence your organization.

In the race to leverage data as a valuable asset to the enterprise, we are constantly balancing the need for immediate access with the need to refine the information as much as possible. Ideally, we would like to extract the real nuggets of gold from our data, the ones that give us important strategic insights. However, it is also natural (and necessary!) to handle data at various stages of refinement on our way there.

So what does data look like at different levels of quality?

Data quality levels

A useful way of thinking about data quality is by splitting it into three distinct levels (a small sketch of how data might move through them follows the list):

  • Bronze: data at this level is raw and unmodified. It is often created or loaded from different heterogeneous source systems, and thus it is mostly useful in the short term with limited scope.
  • Silver: this type of data has been filtered, cleansed, and structured before use, allowing it to be used in the medium to long term. As this data is somewhat standardized, it can be used by multiple team members and fit into a wider enterprise context.
  • Gold: data at this level is processed until it is highly reusable and trustworthy. This data has been refined taking the needs of multiple team members from multiple parts of the organization into account. Generally, this data can be used for sophisticated analyses that increase the teamโ€™s understanding of the business.
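As a minimal sketch of how a dataset might move through these levels, the example below takes a raw (bronze) table, cleanses and standardizes it (silver), and then produces a curated aggregate (gold). The column names, cleaning rules, and aggregation are assumptions chosen purely for illustration.

```python
import pandas as pd

# Bronze: raw export from a source system, taken as-is (illustrative columns)
bronze = pd.DataFrame({
    "sample_id": ["S1", "S2", "S2", "S3", None],
    "assay":     ["rna-seq", "RNA-Seq", "RNA-Seq", "chip-seq", "rna-seq"],
    "yield_ng":  ["120", "95", "95", "80", "n/a"],
})

# Silver: filtered, cleansed, and structured (consistent types and vocabularies)
silver = (
    bronze
    .dropna(subset=["sample_id"])   # drop records without a usable key
    .drop_duplicates()              # remove verbatim duplicates
    .assign(
        assay=lambda d: d["assay"].str.lower().str.replace("-", "_"),
        yield_ng=lambda d: pd.to_numeric(d["yield_ng"], errors="coerce"),
    )
)

# Gold: curated, reusable aggregate that multiple teams can trust for analytics
gold = (
    silver.groupby("assay", as_index=False)
          .agg(samples=("sample_id", "nunique"), mean_yield_ng=("yield_ng", "mean"))
)
print(gold)
```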

How to improve overall data quality?

While many organizations have data at both ends of the spectrum, moving as much data as possible to the gold level (i.e. improving its quality) is a goal that will pay dividends for years to come. To do so, we rely on a number of strategies:

  • Structure and integrate: the low-hanging fruit of data quality is standardization. Having cohesively structured datasets means that more people across the organization can consume the information from their own perspectives. Additionally, this structuring makes it easier to spot any potential issue in data points.
  • Understand the user base: another effective measure is to understand how team members are consuming data, then tailor datasets to suit their needs. For example, if a subset of users habitually processes data in a specific way to produce a visualization, then giving them the ability to retrieve pre-processed information can significantly boost their productivity.
  • Curate manually: while doing this for an entire dataset is not feasible, manually curating portions of the data can give the team an in-depth understanding of the overall data condition. Based on what we observe during this exercise, we can apply changes to data validation and structuring processes accordingly.

The pros of gold-level data

Once data transitions to a gold-level status, it offers some additional advantages to the organization. For instance, gold data has a high level of accuracy, which means that it can be used as a trusted input for training machine learning models and business intelligence tools. These techniques allow businesses to better understand their current state and to create segments for current behavior or forecasts that help inform decisions about the future.

Storing this heterogeneous data

As we can see, the data across the organization is handled and accessed by multiple user groups with varied needs. The question then becomes: how can we effectively store these heterogeneous data categories into a single, unified repository?

A potential answer is to employ a Data Lakehouse, which is a flexible repository that combines the scalability of a Data Lake with the solid integration power of a Data Warehouse. A Data Lakehouse has the advantage of allowing data to be organized in layers, which can be directly mapped to the bronze, silver, and gold quality categories above.
