Developing Optimal Prediction Models
for Cancer Classification
Using Gene Expression Data


This page contains supplemental material and code for the manuscript to appear in the Journal of Bioinformatics and Computational Biology (JBCB). A copy of the manuscript in .pdf format is available here.

Links
  • R software
  • Leukemia Data Set
  • Colon Cancer Data Set


  • R-code
    The following code was used to analyze the leukemia data set. The data can be obtained by following the Leukemia Data Set link above and scrolling down to Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. All script files are saved as text documents and can be run in the free statistical software package R (link above). The first script file, DataPrep, prepares the data for analysis in the later script files. The second script file, TwoSets, contains the code for the case where there is a clear distinction between the training and test data sets. The third script file, OneSet, contains the code to run the analysis with randomly chosen training and independent data sets (refer to Section 2.4 of the manuscript for more information about these two methods).
  • Preparing the Data (DataPrep) (.txt)
  • Both a Training and Test Set (TwoSets) (.txt)
  • A Single Data Set (OneSet) (.txt)
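
    As a rough illustration of the single-data-set approach described above, the following R sketch randomly divides one set of samples into a training set and an independent set. It is not taken from the OneSet script: the simulated expression matrix, the sample and gene counts, the half-and-half split, and all object names (expr, tumor_class, train_idx, test_idx) are assumptions made only for this example.

```r
# Hypothetical illustration only; this is not the author's OneSet script.
set.seed(1)                       # make the random split reproducible

n_samples <- 20                   # assumed number of tissue samples
n_genes   <- 5                    # assumed number of genes

# Simulated expression matrix (samples in rows, genes in columns)
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

# Simulated two-class labels, one per sample
tumor_class <- factor(rep(c("A", "B"), each = n_samples / 2))

# Randomly choose half of the samples for training;
# the remaining samples form the independent set
train_idx <- sample(seq_len(n_samples), size = n_samples / 2)
test_idx  <- setdiff(seq_len(n_samples), train_idx)

train_expr  <- expr[train_idx, , drop = FALSE]
test_expr   <- expr[test_idx, , drop = FALSE]
train_class <- tumor_class[train_idx]
test_class  <- tumor_class[test_idx]

# Class counts in each part of the split
table(train_class)
table(test_class)
```

    Any model would then be fitted on train_expr and train_class only and evaluated on test_expr and test_class. When a study already supplies separate training and test sets, as in the TwoSets case, no random split is needed.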


  • Tables
    Below are some output tables from the analysis of the leukemia and colon cancer data sets.
  • One-Gene Model (Zyxin) for the Leukemia Data Set (.pdf)
  • Two-Gene Model (Zyxin + LAMP7-E1) for the Leukemia Data Set (.pdf)
  • Two-Gene Model (hCRP + hmgI) for the Colon Cancer Data Set (.pdf)



  • If you have any questions about the above code, email: Mat Soukup.