Dennis Bahler, C. Wellington, B. Stone, and D. Bristol,
Symbolic, Neural, and Bayesian Machine Learning
Models for Predicting Carcinogenicity of Chemical Compounds,
J. Chemical Information and Computer Sciences, 40(4),
July/August 2000, 906-914.
Experimental programs have been underway for several years
to determine the environmental effects of chemical compounds, mixtures,
and the like.
Among these programs is
the National Toxicology Program (NTP) on rodent carcinogenicity.
Because these experiments are costly and time-consuming,
the rate at which test articles (i.e., chemicals) can be tested is limited.
The ability to predict
the outcome of the analysis
at various points in the process would facilitate informed decisions about the
allocation of testing
resources.
To assist human experts in organizing an empirical testing regime,
and to try to shed light on mechanisms of toxicity,
we constructed toxicity models
using various machine learning and data mining methods, both
existing and of our own devising.
These models took the
form of decision trees, rule sets,
neural networks, rules extracted from trained neural networks, and Bayesian
classifiers.
As a training set we used recent results from rodent
carcinogenicity bioassays conducted by the NTP
on 226 test articles.
We performed 10-way cross-validation on each of our models to approximate
their expected error rates on unseen data.
The data set consists of physicochemical parameters of test articles,
alerting chemical substructures,
Salmonella mutagenicity assay results, subchronic histopathology data,
and information on route, strain, and sex/species
for 744 individual experiments.
These results
contribute to the ongoing
process of evaluating and interpreting the data collected from
chemical toxicity studies.
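
The 10-way (often called 10-fold) cross-validation mentioned above can be
illustrated with a minimal sketch. It is not the authors' code: the function
names, the majority-class placeholder learner, and the fixed random seed are
all assumptions made for illustration, and any of the model types named in the
abstract could be substituted for the placeholder.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_majority(features, labels):
    """Hypothetical placeholder learner: always predicts the most common
    training label. Stands in for a decision tree, neural network, etc."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda example: most_common


def cross_validated_error(features, labels, train, k=10, seed=0):
    """Mean held-out error rate over k folds (assumes len(labels) >= k)."""
    fold_errors = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        # Fit on the other k-1 folds only; the held-out fold stays unseen.
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        wrong = sum(model(features[i]) != labels[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k
```

Each example is held out exactly once, so the averaged fold error serves as an
estimate of the error rate on unseen data. With 226 examples and k = 10, the
folds contain 22 or 23 examples each.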