Methodology

Discoverant Internal Documentation

Patent Collision Map

The Patent Collision Map provides a patent landscape analysis of a query compound against 24.5 million SureChEMBL patent compounds. It classifies hits into 5 zones based on structural similarity to assess novelty risk.

Algorithm

5-Zone Classification

Zone | Name | Criteria | Colour
1 | Literal Match | Exact InChI match with a patent compound | Red
2 | Chemical Isostere | F↔Cl, CH₃↔CF₃, OH↔SH, NH₂↔OH substitutions | Orange
3 | Biological Isostere | COOH↔tetrazole, phenyl↔thiophene, ester↔amide | Amber
4 | Close Analog | Tanimoto ≥ 0.85 and MCS ≥ 80% | Yellow
5 | Novel | Below all thresholds; structurally distinct | Green

Important Disclaimer: The Patent Collision Map is a patent landscape analysis tool. It is NOT legal advice, NOT a Freedom-to-Operate (FTO) opinion, and NOT a patentability assessment. Results should be reviewed by qualified patent counsel before making any legal or business decisions.
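
As a rough illustration, the zone table can be read as a precedence check from Zone 1 down to Zone 5. The sketch below assumes the exact-match and isostere tests have already been run and arrive as boolean flags. `HitEvidence`, `classify_zone`, the field names, and the lowest-zone-wins ordering are hypothetical and are not taken from the Discoverant code. It also assumes MCS coverage is expressed as a fraction between 0 and 1.

```python
from dataclasses import dataclass

# Thresholds from the zone table; everything else in this sketch is illustrative.
TANIMOTO_MIN = 0.85
MCS_MIN = 0.80


@dataclass(frozen=True)
class HitEvidence:
    """Precomputed comparison results for one query/patent-compound pair (hypothetical)."""
    exact_inchi_match: bool
    chemical_isostere: bool
    biological_isostere: bool
    tanimoto: float
    mcs_fraction: float  # assumed: MCS coverage as a fraction in [0, 1]


def classify_zone(hit: HitEvidence) -> tuple[int, str, str]:
    """Return (zone, name, colour); the lowest matching zone number wins (assumption)."""
    if hit.exact_inchi_match:
        return 1, "Literal Match", "Red"
    if hit.chemical_isostere:
        return 2, "Chemical Isostere", "Orange"
    if hit.biological_isostere:
        return 3, "Biological Isostere", "Amber"
    if hit.tanimoto >= TANIMOTO_MIN and hit.mcs_fraction >= MCS_MIN:
        return 4, "Close Analog", "Yellow"
    return 5, "Novel", "Green"
```

Note that Zone 4 requires both conditions: a hit with Tanimoto 0.90 but only 70% MCS coverage falls through to Zone 5 under this reading.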

Structure Search

Chemical structure search powered by the Bingo SQL cartridge (EPAM Indigo engine), enabling substructure and similarity queries directly within PostgreSQL.

Search Modes

Substructure Search

The query molecule is treated as a fragment. Returns all compounds in the database that contain the query substructure.

Engine: Bingo bingo.searchSub()
Input: SMILES or Molfile
Stereochemistry: Exact match by default

Similarity Search

Computes Tanimoto similarity on ECFP4 fingerprints between the query and all database compounds.

Engine: Bingo bingo.searchSim()
Fingerprint: ECFP4 (2048-bit)
Default threshold: 0.7
Stereochemistry: Exact match by default
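
For readers unfamiliar with the metric, the Tanimoto coefficient is the number of fingerprint bits two compounds share divided by the number of bits set in either. The sketch below is a plain-Python illustration only: in the system described here the computation happens inside Bingo, and the function names and the representation of a fingerprint as a set of on-bit positions are assumptions made for the example.

```python
DEFAULT_THRESHOLD = 0.7  # default similarity threshold stated above


def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def is_similar(bits_a: set[int], bits_b: set[int],
               threshold: float = DEFAULT_THRESHOLD) -> bool:
    """True when the pair meets the similarity threshold."""
    return tanimoto(bits_a, bits_b) >= threshold
```

For example, two fingerprints with 8 bits each that share 7 of them have 9 bits in their union, giving 7/9 ≈ 0.78, which passes the default threshold of 0.7.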

Databases Searched

Database | Compounds | Schema
SureChEMBL | 24.5M | patent_chem.surechembl_compounds
ChEMBL | 2.4M | compounds.chembl
PubChem | 444K | compounds.pubchem

Total searchable compounds: roughly 27.3M (the sum of the counts listed above). Typical query time: 2–5 seconds across all databases.

Target Enrichment

Maps a compound to its known biological targets using bioactivity data from ChEMBL, then enriches with protein interaction networks and pathway data.

Pipeline

  1. Compound → ChEMBL: Lookup via InChI key to find bioactivity records
  2. Bioactivity filtering: IC50, EC50, Ki, Kd values filtered by threshold
  3. Target identification: Active protein targets extracted from assay data
  4. STRING interactions: Protein–protein interaction network expansion
  5. Reactome pathways: Pathway mapping for identified targets

Parameters

Default threshold: IC50 ≤ 10,000 nM (10 μM)
Activity types: IC50, EC50, Ki, Kd
Target types: Single protein targets from ChEMBL
Interaction source: STRING v12 (13.7M human interactions)
Pathway source: Reactome curated human pathways
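
Steps 2 and 3 of the pipeline, combined with the default parameters above, can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the `Activity` record, its field names, and `active_targets` are hypothetical, values are assumed to be normalised to nM already, the single 10,000 nM cut-off is assumed to apply to every activity type, and the target identifiers in the usage note are made up.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVITY_TYPES = {"IC50", "EC50", "Ki", "Kd"}  # activity types listed above
DEFAULT_THRESHOLD_NM = 10_000.0                # 10 µM default threshold


@dataclass(frozen=True)
class Activity:
    """One bioactivity record (hypothetical shape, values already in nM)."""
    target_id: str
    target_type: str
    activity_type: str
    value_nm: Optional[float]


def active_targets(records: list[Activity],
                   threshold_nm: float = DEFAULT_THRESHOLD_NM) -> list[str]:
    """Return sorted, de-duplicated IDs of single-protein targets with a qualifying activity."""
    hits = set()
    for rec in records:
        if rec.value_nm is None:
            continue  # no measured value to compare
        if rec.activity_type not in ACTIVITY_TYPES:
            continue  # unsupported activity type
        if rec.target_type != "SINGLE PROTEIN":
            continue  # keep single protein targets only
        if rec.value_nm <= threshold_nm:
            hits.add(rec.target_id)
    return sorted(hits)
```

With records for `T1` (IC50 of 50 nM), `T2` (Ki of 25,000 nM), and `T3` (a percent-inhibition value), only `T1` would be passed on to the STRING and Reactome steps.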

Pathway Analysis

Identifies enriched biological pathways for a set of target proteins, combining Reactome curated pathways with STRING protein interaction networks.

Data Sources

Statistical Method

Enrichment test: Fisher's exact test
Output: p-value per pathway
Hierarchy: Child → parent pathway traversal (Reactome tree)
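
The child-to-parent traversal can be pictured as a walk up a parent map. The sketch below is an assumption-laden illustration: the `parents` mapping, the `ancestors` function, and the pathway names are hypothetical, and a pathway is allowed more than one parent, so the walk is breadth-first with a visited set.

```python
def ancestors(pathway_id: str, parents: dict[str, list[str]]) -> list[str]:
    """List every ancestor of a pathway, nearest first, visiting each one once."""
    ordered: list[str] = []
    seen = {pathway_id}
    frontier = [pathway_id]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    ordered.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier  # move one level up the hierarchy
    return ordered
```

The visited set keeps a shared ancestor from being reported twice when two branches converge on the same top-level pathway.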

Workflow

  1. Input set of target proteins (from enrichment or manual selection)
  2. Map proteins to Reactome pathways
  3. Compute enrichment p-values (Fisher's exact test)
  4. Traverse pathway hierarchy from specific to general
  5. Report enriched pathways with statistical significance
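
Step 3 can be illustrated with a small, dependency-free version of the test. The document does not say whether the test is one-sided or two-sided, or what background set is used, so the sketch below assumes a one-sided (over-representation) test against a fixed protein universe. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap: int, n_targets: int,
                       pathway_size: int, universe: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    Probability of seeing at least `overlap` pathway members among
    `n_targets` proteins drawn from `universe` proteins, of which
    `pathway_size` belong to the pathway (hypergeometric upper tail).
    """
    max_overlap = min(n_targets, pathway_size)
    total = comb(universe, n_targets)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, n_targets - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total
```

As a hand-checkable case, with a universe of 4 proteins, a pathway of 2, and 2 targets that both fall in the pathway, only 1 of the 6 possible target pairs gives that overlap, so the p-value is 1/6. In practice the per-pathway p-values would usually also be corrected for multiple testing, which the workflow above does not specify.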

Unit Conversion Engine

Three-type scientific conversion system integrated into the data grid, supporting 62 canonical field-unit mappings and 30 ADMET/PK fields.

Conversion Types

Type | Method | Example
Linear | Multiply by scale factor | nM → µM (divide by 1000)
Cross-system (molar↔mass) | Uses molecular weight: mass = molarity × MW | nM → ng/mL (requires MW column)
Logarithmic | pIC50 = −log10(IC50 × 10⁻⁹), with IC50 in nM | IC50 (nM) ↔ pIC50

Algorithm

  1. Identify source column and current unit from field metadata
  2. Determine conversion type from unit pair lookup table
  3. For cross-system: locate MW column, validate values are present
  4. Apply conversion function row-by-row, preserving null/empty values
  5. Write result to display (in-place) or new column (additive)
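
The three conversion types and the null-preserving row loop in step 4 can be sketched as follows. This is a minimal illustration, not the engine's code: the function names are hypothetical, and the logarithmic case assumes IC50 is supplied in nM, so that pIC50 = 9 − log10(IC50 in nM).

```python
import math
from typing import Callable, Optional


def nm_to_um(value_nm: Optional[float]) -> Optional[float]:
    """Linear: nM to µM."""
    return None if value_nm is None else value_nm / 1000.0


def nm_to_ng_per_ml(value_nm: Optional[float], mw: Optional[float]) -> Optional[float]:
    """Cross-system: nmol/L times g/mol gives ng/L; divide by 1000 for ng/mL."""
    if value_nm is None or mw is None:
        return None  # a missing MW makes the conversion impossible
    return value_nm * mw / 1000.0


def ic50_nm_to_pic50(ic50_nm: Optional[float]) -> Optional[float]:
    """Logarithmic: pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    if ic50_nm is None or ic50_nm <= 0:
        return None  # log of zero or a negative value is undefined
    return 9.0 - math.log10(ic50_nm)


def pic50_to_ic50_nm(pic50: Optional[float]) -> Optional[float]:
    """Inverse of ic50_nm_to_pic50."""
    return None if pic50 is None else 10.0 ** (9.0 - pic50)


def convert_column(values: list[Optional[float]],
                   fn: Callable[[Optional[float]], Optional[float]]) -> list[Optional[float]]:
    """Apply a conversion row by row; None values stay None."""
    return [fn(v) for v in values]
```

As a check on the cross-system case, 1000 nM of a compound with MW 500 g/mol is 1 µmol/L × 500 g/mol = 500 µg/L, which is 500 ng/mL. An IC50 of 1000 nM corresponds to a pIC50 of 6.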

Structure Deduplication

Indigo-based chemical structure matching for identifying duplicate compounds across datasets.

Precision Levels

Level | Method | Matches
Exact | Canonical SMILES string equality | Identical structures only
Canonical | Indigo canonical form comparison | Same structure, different input representations
InChIKey | First 14 characters (connectivity layer) | Same connectivity, different stereochemistry
Scaffold | Murcko scaffold comparison | Same core ring system

Workflow

  1. Select structure column and comparison precision
  2. Generate canonical representations for all rows
  3. Group rows by canonical key at chosen precision
  4. Present duplicate groups for user review and resolution
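
Steps 2 to 4 amount to computing a key per row and grouping rows that share it. The sketch below shows this for the InChIKey precision level, assuming InChIKeys are already available as strings. `connectivity_key` and `duplicate_groups` are hypothetical names, and the keys in the usage note are made-up placeholders rather than real compounds. Other precision levels would plug in a different key function, for example one backed by Indigo canonicalisation.

```python
from collections import defaultdict
from typing import Callable, Optional


def connectivity_key(inchikey: str) -> str:
    """First 14 characters of an InChIKey (the connectivity block)."""
    return inchikey[:14]


def duplicate_groups(values: list[Optional[str]],
                     key_fn: Callable[[str], str]) -> list[list[int]]:
    """Group row indices by key and keep only groups with more than one row."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for index, value in enumerate(values):
        if value:  # skip empty or missing structures
            buckets[key_fn(value)].append(index)
    return [rows for rows in buckets.values() if len(rows) > 1]
```

Given the placeholder keys `AAAAAAAAAAAAAA-BBBBBBBBSA-N`, `AAAAAAAAAAAAAA-CCCCCCCCSA-N`, and `DDDDDDDDDDDDDD-UHFFFAOYSA-N`, grouping with `connectivity_key` puts rows 0 and 1 in one duplicate group, while grouping on the full key finds no duplicates.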