Dennis Bahler, C. Wellington, B. Stone, and D. Bristol, Symbolic, Neural, and Bayesian Machine Learning Models for Predicting Carcinogenicity of Chemical Compounds, J. Chemical Information and Computer Sciences, 40(4), July/August 2000, 906-914.

Experimental programs to determine the environmental effects of chemical compounds and mixtures have been underway for several years. Among them is the National Toxicology Program (NTP) effort on rodent carcinogenicity. Because these experiments are costly and time-consuming, the rate at which test articles (i.e., chemicals) can be tested is limited. The ability to predict the outcome of the analysis at various points in the process would facilitate informed decisions about the allocation of testing resources. To assist human experts in organizing an empirical testing regime, and to try to shed light on mechanisms of toxicity, we constructed toxicity models using various machine learning and data mining methods, both existing and of our own devising. These models took the form of decision trees, rule sets, neural networks, rules extracted from trained neural networks, and Bayesian classifiers. As a training set we used recent results from NTP rodent carcinogenicity bioassays on 226 test articles. The data set consists of physicochemical parameters of the test articles, alerting chemical substructures, Salmonella mutagenicity assay results, subchronic histopathology data, and information on route, strain, and sex/species for 744 individual experiments. We performed 10-way cross-validation on each of our models to approximate their expected error rates on unseen data. These results contribute to the ongoing process of evaluating and interpreting the data collected from chemical toxicity studies.
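
The abstract's 10-way (10-fold) cross-validation can be illustrated with a minimal sketch. It is not the authors' code. The helper names (`k_fold_indices`, `cross_val_error`, `fit_majority`, `predict_majority`), the majority-class baseline, and the synthetic two-feature data are all assumptions made for illustration. Only the fold count and the sample size of 226 echo the abstract. The reported error is the mean of the per-fold error rates.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_error(fit, predict, X, y, k=10, seed=0):
    """Estimate the error rate on unseen data by k-fold cross-validation.

    Each fold is held out once; the model is trained on the other k-1 folds
    and scored on the held-out fold. Returns the mean per-fold error rate.
    """
    folds = k_fold_indices(len(X), k, seed)
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k


# Hypothetical baseline classifier: always predict the majority training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Synthetic placeholder data (NOT the NTP data): 226 examples, two features.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(226)]
y = [int(a + b > 1.0) for a, b in X]

folds = k_fold_indices(len(X), 10)
err = cross_val_error(fit_majority, predict_majority, X, y, k=10)
print(f"estimated error rate: {err:.3f}")
```

Any of the model families named in the abstract could be substituted for the baseline by supplying a different `fit`/`predict` pair; the fold logic stays the same.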