Methodology - Discoverant Docs

Patent Collision Map

The Patent Collision Map compares a query compound against 24.5 million SureChEMBL patent compounds. Each hit is assigned to one of 5 zones, from exact match to structurally novel, as an indicator of novelty risk.

Algorithm

Fingerprint: ECFP4 (Extended Connectivity, radius 2, 2048-bit)
Similarity metric: Tanimoto coefficient
Default similarity threshold: 0.7
MCS (Maximum Common Substructure) computed for zone classification
InChI comparison for exact match detection
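
The similarity step above can be illustrated with a minimal sketch. It assumes each fingerprint is available as a set of on-bit indices (0 to 2047 for a 2048-bit ECFP4); the function name and the example bit sets are hypothetical, and fingerprint generation itself is left to the cheminformatics toolkit.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # assumed convention for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical on-bit indices, not taken from real molecules.
query_bits = {3, 17, 256, 1024, 2047}
patent_bits = {3, 17, 256, 1500}

score = tanimoto(query_bits, patent_bits)
print(score)         # 3 shared bits out of 6 distinct bits -> 0.5
print(score >= 0.7)  # below the default threshold -> False
```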

5-Zone Classification

Zone	Name	Criteria	Colour
1	Literal Match	Exact InChI match with a patent compound	Red
2	Chemical Isostere	F↔Cl, CH₃↔CF₃, OH↔SH, NH₂↔OH substitutions	Orange
3	Biological Isostere	COOH↔tetrazole, phenyl↔thiophene, ester↔amide	Amber
4	Close Analog	Tanimoto ≥ 0.85 and MCS ≥ 80%	Yellow
5	Novel	Below all thresholds — structurally distinct	Green
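
As a rough illustration of how the table could be applied to a single hit, the sketch below checks the zones in numerical order and returns the first that matches. The precedence order, the `Hit` fields, and the assumption that isostere relationships are detected upstream are all illustrative choices, not details confirmed by this page.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    exact_inchi_match: bool     # Zone 1 criterion
    chemical_isostere: bool     # Zone 2 criterion, detected upstream (assumed)
    biological_isostere: bool   # Zone 3 criterion, detected upstream (assumed)
    tanimoto: float             # ECFP4 Tanimoto similarity, 0.0 to 1.0
    mcs_fraction: float         # MCS coverage as a fraction, 0.0 to 1.0


ZONES = {
    1: ("Literal Match", "Red"),
    2: ("Chemical Isostere", "Orange"),
    3: ("Biological Isostere", "Amber"),
    4: ("Close Analog", "Yellow"),
    5: ("Novel", "Green"),
}


def classify_zone(hit: Hit) -> int:
    """Return the zone number, testing the most severe zone first (assumed order)."""
    if hit.exact_inchi_match:
        return 1
    if hit.chemical_isostere:
        return 2
    if hit.biological_isostere:
        return 3
    if hit.tanimoto >= 0.85 and hit.mcs_fraction >= 0.80:
        return 4
    return 5


zone = classify_zone(Hit(False, False, False, tanimoto=0.90, mcs_fraction=0.85))
print(zone, *ZONES[zone])
```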

Important Disclaimer: The Patent Collision Map is a patent landscape analysis tool. It is NOT legal advice, NOT a Freedom-to-Operate (FTO) opinion, and NOT a patentability assessment. Results should be reviewed by qualified patent counsel before making any legal or business decisions.

Structure Search

Chemical structure search powered by the Bingo SQL cartridge (EPAM Indigo engine), enabling substructure and similarity queries directly within PostgreSQL.

Search Modes

Substructure Search

The query molecule is treated as a fragment, and the search returns every compound in the database that contains it as a substructure.

Engine: Bingo bingo.searchSub()
Input: SMILES or Molfile
Stereochemistry: Exact match by default

Similarity Search

Computes Tanimoto similarity on ECFP4 fingerprints between the query and all database compounds.

Engine: Bingo bingo.searchSim()
Fingerprint: ECFP4 (2048-bit)
Default threshold: 0.7
Stereochemistry: Exact match by default

Databases Searched

Database	Compounds	Schema
SureChEMBL	24.5M	`patent_chem.surechembl_compounds`
ChEMBL	2.4M	`compounds.chembl`
PubChem	444K	`compounds.pubchem`

Total searchable compounds: 27.3M+ (sum of the three databases above). Typical query performance: 2–5 seconds across all databases.

Target Enrichment

Maps a compound to its known biological targets using bioactivity data from ChEMBL, then enriches with protein interaction networks and pathway data.

Pipeline

Compound → ChEMBL: Lookup via InChI key to find bioactivity records
Bioactivity filtering: IC50, EC50, Ki, Kd values filtered by threshold
Target identification: Active protein targets extracted from assay data
STRING interactions: Protein–protein interaction network expansion
Reactome pathways: Pathway mapping for identified targets

Parameters

Default threshold: IC50 ≤ 10,000 nM (10 μM)
Activity types: IC50, EC50, Ki, Kd
Target types: Single protein targets from ChEMBL
Interaction source: STRING v12 (13.7M human interactions)
Pathway source: Reactome curated human pathways
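
The bioactivity filtering step might look like the following sketch. The record layout (`type`, `value_nm`, `target`) is hypothetical, and applying the 10,000 nM cutoff to all four activity types is an assumption, since the default above is stated for IC50.

```python
ACTIVITY_TYPES = {"IC50", "EC50", "Ki", "Kd"}
THRESHOLD_NM = 10_000.0  # 10 uM default; assumed to apply to every activity type


def active_targets(records, threshold_nm=THRESHOLD_NM):
    """Return sorted target names with at least one qualifying activity record."""
    targets = set()
    for rec in records:
        value = rec.get("value_nm")
        if rec.get("type") not in ACTIVITY_TYPES or value is None:
            continue  # skip unsupported activity types and missing values
        if value <= threshold_nm:
            targets.add(rec["target"])
    return sorted(targets)


# Hypothetical records, for illustration only.
records = [
    {"target": "Target A", "type": "IC50", "value_nm": 35.0},
    {"target": "Target B", "type": "Ki", "value_nm": 25_000.0},
    {"target": "Target C", "type": "Potency", "value_nm": 10.0},
    {"target": "Target D", "type": "Kd", "value_nm": None},
]
print(active_targets(records))
```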

Pathway Analysis

Identifies enriched biological pathways for a set of target proteins, combining Reactome curated pathways with STRING protein interaction networks.

Data Sources

Reactome: Curated human biological pathways with hierarchical structure
STRING: Protein–protein interaction networks (confidence-scored)

Statistical Method

Enrichment test: Fisher's exact test
Output: p-value per pathway
Hierarchy: Child → parent pathway traversal (Reactome tree)
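
For readers who want to see the enrichment test concretely, here is a minimal sketch of a one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution. Whether the tool uses a one-sided or two-sided test, and which background set it uses, is not stated here; the one-sided form and the parameter names below are assumptions.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact p-value for over-representation.

    k: targets in the input set that belong to the pathway
    n: size of the input target set
    K: pathway members in the background
    N: size of the background (assumed, e.g. all annotated proteins)
    """
    total = comb(N, n)
    # Probability of observing k or more pathway members in a random draw of n.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Toy numbers: 3 of 3 input targets fall in a 3-member pathway, background of 10.
print(fisher_enrichment_p(k=3, n=3, K=3, N=10))
```

A pathway would then be reported as enriched when this p-value falls below the chosen significance level; multiple-testing correction is outside the scope of this sketch.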

Workflow

Input set of target proteins (from enrichment or manual selection)
Map proteins to Reactome pathways
Compute enrichment p-values (Fisher's exact test)
Traverse pathway hierarchy from specific to general
Report enriched pathways with statistical significance
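
The specific-to-general traversal in step 4 can be sketched as a walk up a child-to-parent map. This simplified version assumes each pathway has a single parent; the mapping and the pathway names are hypothetical placeholders rather than real Reactome identifiers.

```python
def ancestors(pathway, parent_of):
    """Return the chain of parent pathways, from most specific to most general."""
    chain = []
    seen = {pathway}
    current = parent_of.get(pathway)
    while current is not None and current not in seen:  # guard against cycles
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


# Hypothetical three-level hierarchy.
parent_of = {
    "specific pathway": "intermediate pathway",
    "intermediate pathway": "top-level pathway",
}
print(ancestors("specific pathway", parent_of))
```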

Unit Conversion Engine

Three-type scientific conversion system integrated into the data grid, supporting 62 canonical field-unit mappings and 30 ADMET/PK fields.

Conversion Types

Type	Method	Example
Linear	Multiply by scale factor	nM → µM (multiply by 0.001)
Cross-system (molar↔mass)	Uses molecular weight: `mass concentration = molar concentration × MW`	nM → ng/mL (ng/mL = nM × MW ÷ 1000; requires MW column)
Logarithmic	`pIC50 = −log10(IC50 × 10⁻⁹)` with IC50 in nM, i.e. `9 − log10(IC50)`	IC50 (nM) ↔ pIC50
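
The three conversion types can be written out as small functions. This is a sketch with hypothetical function names. It assumes MW is in g/mol, and it uses the fact that pIC50 is the negative base-10 logarithm of IC50 in mol/L, which for an IC50 in nM reduces to 9 − log10(IC50).

```python
import math


def nm_to_um(value_nm: float) -> float:
    """Linear: 1 uM = 1000 nM."""
    return value_nm / 1000.0


def nm_to_ng_per_ml(value_nm: float, mw: float) -> float:
    """Cross-system: nmol/L x g/mol = ng/L, then divide by 1000 for ng/mL."""
    return value_nm * mw / 1000.0


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Logarithmic: pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)


def pic50_to_ic50_nm(pic50: float) -> float:
    """Inverse of the logarithmic conversion."""
    return 10.0 ** (9.0 - pic50)


print(nm_to_um(250.0))                 # 0.25
print(nm_to_ng_per_ml(100.0, 500.0))   # 50.0
print(round(ic50_nm_to_pic50(1000.0), 6))
```

An IC50 of 1000 nM (1 µM) corresponds to a pIC50 of 6, which is a quick sanity check for the sign of the exponent.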

Algorithm

Identify source column and current unit from field metadata
Determine conversion type from unit pair lookup table
For cross-system: locate MW column, validate values are present
Apply conversion function row-by-row, preserving null/empty values
Write result to display (in-place) or new column (additive)
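
Steps 4 and 5 could be sketched as below, assuming rows are plain dictionaries. The function and column names are hypothetical. Passing `target` writes a new column (additive), while omitting it overwrites the source column in the returned copy (in-place display).

```python
def convert_column(rows, source, convert, target=None):
    """Apply `convert` to one column row by row, preserving null/empty values."""
    out_key = target or source
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are left untouched
        value = row.get(source)
        if value is None or value == "":
            new_row[out_key] = value  # carry null/empty through unchanged
        else:
            new_row[out_key] = convert(float(value))
        result.append(new_row)
    return result


rows = [{"ic50_nm": 500.0}, {"ic50_nm": None}, {"ic50_nm": ""}]
converted = convert_column(rows, "ic50_nm", lambda v: v / 1000.0, target="ic50_um")
for row in converted:
    print(row)
```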

Structure Deduplication

Indigo-based chemical structure matching for identifying duplicate compounds across datasets.

Precision Levels

Level	Method	Matches
Exact	Canonical SMILES string equality	Identical structures only
Canonical	Indigo canonical form comparison	Same structure, different input representations
InChIKey	First 14 characters (connectivity layer)	Same connectivity, different stereochemistry
Scaffold	Murcko scaffold comparison	Same core ring system

Workflow

Select structure column and comparison precision
Generate canonical representations for all rows
Group rows by canonical key at chosen precision
Present duplicate groups for user review and resolution
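
The grouping step can be illustrated with a short sketch that takes the key function as a parameter, so the same code serves any precision level. The example uses the InChIKey level, where the key is the first 14 characters. The InChIKeys shown are made-up placeholders, not real compounds, and the row layout is hypothetical.

```python
from collections import defaultdict


def connectivity_key(row):
    """InChIKey precision: the first 14 characters encode connectivity."""
    return row["inchikey"][:14]


def duplicate_groups(rows, key_func):
    """Group row indices by key and keep only groups with more than one row."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[key_func(row)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Placeholder keys: rows 0 and 2 share the connectivity block but differ in
# the second block, as stereoisomers would.
rows = [
    {"inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"inchikey": "CCCCCCCCCCCCCC-UHFFFAOYSA-N"},
    {"inchikey": "AAAAAAAAAAAAAA-DDDDDDDDSA-N"},
]
print(duplicate_groups(rows, connectivity_key))  # [[0, 2]]
```

Swapping in a key function that returns the full canonical SMILES or a Murcko scaffold string would give the other precision levels, with key generation delegated to the structure toolkit.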