BioPredict, Inc.             


QSAR Analysis and HTS Interpretation


High-throughput screening (HTS) has become a core component of drug discovery. HTS produces large amounts of data that must be analyzed to guide follow-on screens and lead optimization chemistry. QSAR (quantitative structure-activity relationship) methods attempt to describe and correlate structural or property descriptors of compounds with compound activities. In both cases compounds are represented by physicochemical descriptors. The goal of an interpretation package is to predict the activity of new compounds based on the observed actives.

Many learning and data mining methods now emerging from the artificial intelligence community are ideally suited to this task. These include Bayesian classifiers and probabilistic models, kernel-based methods, decision tree methods, and scaling methods that map results into lower-dimensional spaces through selection of appropriate molecular descriptors. We have collected these methods into an extensive internal software package for application to HTS and to other chemoinformatics tasks associated with maintenance of a screening library. Here we briefly describe some problems and opportunities associated with HTS interpretation and with the associated QSAR.

Typical predictions based on classifications of HTS results ignore the prior distribution of compounds that were screened. When one analyzes a typical screening library – even one selected for diversity – compounds fall into ‘islands’ that are far apart and have, to an adequate approximation, a continuous distribution of compounds within each island. This allows us to create a Bayesian classification scheme that measures the size of these cluster islands and then uses them as “prior” distributions in the classification of hits. The classification samples the immediate environment of each positive compound and, by taking the ratio of positives to negatives and formulating statistical priors, infers a probability that the compound is a true positive. This approach allows the identification of inferred true positives, inferred false positives, and inferred false negatives. For example, most HTS efforts discard isolated hits (clusters of one compound). The use of priors can differentiate between hits that are isolated because their chemotype was underrepresented in the original library and those that are isolated because they are candidates for false positives. Isolated hits of the under-represented kind are worth pursuing.
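As a rough illustration of this kind of neighborhood-plus-prior scoring, the sketch below estimates a posterior hit probability from the local ratio of positives to negatives, weighted by the size of the island a hit belongs to. The function name, the log-weighting of cluster size, and the Beta-style pseudo-counts are illustrative assumptions, not the actual scheme used in our package.

```python
# Minimal sketch of neighborhood-based hit scoring with cluster-size priors.
# All names and the exact prior/posterior form are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hit_posteriors(X, is_hit, cluster_size, k=15, a0=1.0, b0=1.0):
    """Estimate P(true positive) for each primary-screen hit.

    X            : (n, d) array of compound descriptor vectors
    is_hit       : (n,) boolean array, True for primary-screen positives
    cluster_size : (n,) int array, size of the descriptor-space 'island'
                   each compound belongs to (from a prior clustering step)
    k            : number of nearest neighbours defining the local environment
    a0, b0       : pseudo-counts of a Beta prior on the local hit rate
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[:, 0] is the compound itself
    neighbors = idx[:, 1:]

    post = np.zeros(len(X))
    for i in np.where(is_hit)[0]:
        pos = is_hit[neighbors[i]].sum()   # positives in the local environment
        neg = k - pos                      # negatives in the local environment
        # Larger islands were sampled more densely, so their local evidence is
        # weighted more strongly; tiny islands fall back on the prior.
        w = np.log1p(cluster_size[i])
        post[i] = (a0 + w * pos) / (a0 + b0 + w * (pos + neg))
    return post
```

Hits with high posteriors would be treated as inferred true positives; hits whose posterior stays near the prior because their island is tiny would be flagged as under-represented rather than discarded.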

In the same vein, the distances between compound vectors can be defined through their local environments. Straightforward Cartesian distances between compound descriptor vectors do not reflect well the similarity between compounds as perceived by chemists. We define non-Euclidean distances between objects by confining the analysis to local neighborhoods and then defining the distance through the conditional probability that two objects belong to the same chemotype.
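One way such a neighborhood-conditioned metric could be realized is sketched below: the probability that two compounds belong to the same chemotype is approximated by the overlap of their k-nearest-neighbor sets, and the distance is taken as its negative log. The Jaccard-style overlap estimate and all parameter choices are assumptions made for the example, not the formula used internally.

```python
# Illustrative sketch of a neighborhood-conditioned, non-Euclidean distance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def chemotype_distance(X, k=20, eps=1e-9):
    """Return an (n, n) matrix of -log P(same chemotype | local environments)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    hoods = [set(row[1:]) for row in idx]        # k-NN set of each compound

    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(hoods[i] & hoods[j])
            union = len(hoods[i] | hoods[j])
            p_same = (shared + eps) / (union + eps)   # crude conditional estimate
            D[i, j] = D[j, i] = -np.log(p_same)
    return D
```

For a large library the full pairwise loop would be restricted to compounds within the same local neighborhood, consistent with confining the analysis as described above.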

A key problem in HTS data mining and in QSAR is the selection of the chemical descriptors used to represent each compound. For any given dataset, a large number of descriptors are noisy or simply irrelevant to the problem at hand. These descriptors can impede subsequent clustering and classification procedures. Before applying any of the classification or clustering schemes we examine how relevant a descriptor is to the data set. A well-chosen set of descriptors makes subsequent classification more sensitive, and therefore more precise. This is done by applying scaling methods such as PCA (principal component analysis) and ICA (independent component analysis), by computing descriptor characteristics such as entropy over the data set, and by applying the non-Euclidean metrics mentioned above.
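As a simple illustration of descriptor pre-filtering, the sketch below drops low-entropy (nearly constant) descriptors before projecting with PCA. The histogram binning, the entropy threshold, and the use of scikit-learn are illustrative choices, not the package's actual procedure.

```python
# Hedged sketch: remove low-information descriptors, then reduce dimension.
import numpy as np
from sklearn.decomposition import PCA

def filter_and_project(X, n_bins=20, min_entropy=0.5, n_components=10):
    """Drop low-entropy descriptors from X (n_compounds x n_descriptors), then apply PCA."""
    entropies = []
    for col in X.T:
        counts, _ = np.histogram(col, bins=n_bins)
        p = counts / counts.sum()
        p = p[p > 0]
        entropies.append(-(p * np.log2(p)).sum())   # Shannon entropy of the descriptor
    keep = np.array(entropies) >= min_entropy

    X_kept = X[:, keep]
    Z = PCA(n_components=min(n_components, X_kept.shape[1])).fit_transform(X_kept)
    return Z, keep
```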

Classification and regression models can be derived directly from the training set by kernel-based classification methods. Unfortunately, training sets are often erroneous and noisy. Kernel-based methods comprise a novel and sensitive class of “instance” methods that are well suited to such data: they construct a hypersurface separating positive and negative points, built from the data points closest to the hypersurface, called support vectors. By separating these support vectors the methods apply a “worst case” analysis, and as such they generalize well (i.e., they can be trained on small data sets). The methods are typically robust, in the sense that the choice and normalization of compound descriptors is usually not critical.
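The sketch below shows a conventional kernel classifier of this type: an RBF-kernel support vector machine fit on descriptor vectors with scikit-learn. The scaling step, kernel, and parameters are ordinary defaults assumed for the example, not the settings of our internal package.

```python
# Minimal support-vector sketch standing in for the kernel-based classifiers above.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_hit_classifier(X_train, y_train):
    """Fit an RBF-kernel SVM on descriptor vectors and 0/1 activity labels."""
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, probability=True))
    clf.fit(X_train, y_train)
    return clf

# Usage (hypothetical arrays):
#   clf = train_hit_classifier(X_train, y_train)
#   p_active = clf.predict_proba(X_new)[:, 1]          # probability a new compound is active
#   sv = clf.named_steps["svc"].support_vectors_       # the support vectors defining the hypersurface
```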

There is a need to rapidly analyze and classify large compound data sets. Decision trees work by analyzing molecular descriptors, which may be binary or numerical, and categorizing compounds using a top-down recursive procedure applied to the descriptor values. Decision tree methods are used to classify and to rapidly cluster compounds coming from a high-throughput screen, identifying the recurring compound classes that hold the most promise for lead optimization. The method can also be used to create descriptor-based profiles of ligands for target classes. These profiles can then be used to rapidly search existing compound databases for other compounds that fit a profile such as “drug-like”, “kinase-like”, or active.
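A minimal example of this use of decision trees is sketched below: a shallow scikit-learn tree is fit to screening labels and its descriptor rules are exported as a readable profile that could be matched against an existing compound database. The depth and leaf-size settings are arbitrary assumptions for illustration.

```python
# Illustrative decision-tree sketch for building a descriptor-based profile.
from sklearn.tree import DecisionTreeClassifier, export_text

def profile_from_screen(X, y, descriptor_names, max_depth=4):
    """Fit a shallow tree to HTS labels and return it with a human-readable rule profile."""
    tree = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=20)
    tree.fit(X, y)
    # The exported rules (descriptor thresholds along each branch) serve as the
    # profile that can be applied to other compound databases.
    rules = export_text(tree, feature_names=list(descriptor_names))
    return tree, rules
```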