Data Analytics Question
- (4 points) Which of the following statements are true? Briefly explain your answer.
- Training a k-nearest-neighbors classifier takes less computational time than testing it.
- The more training examples, the more accurate the predictions of a k-nearest-neighbors classifier become.
- k-nearest-neighbors cannot be used for regression.
- A k-nearest-neighbors classifier is sensitive to the number of features.
- (6 points) Figure 2 presents the performance of several algorithms applied to the problem of classifying molecules into two classes: those that inhibit Human Respiratory Syncytial Virus (HRSV), and those that do not. HRSV is the most frequent cause of respiratory tract infections in small children, with an estimated worldwide prevalence of about 34 million cases per year among children under 5 years of age.
- Which method gives the best performance? Explain your answer.
- The goal of this study is to develop an algorithm that can be used to suggest, among a large collection of several million molecules, those that should be experimentally tested for activity against HRSV. Compounds that are active against HRSV are good leads from which to develop new medical treatments against infections caused by this virus. In this context, is it preferable to have a high sensitivity or a high specificity? Which part of the ROC curve is the most interesting?
- In this study, the authors have represented the molecules based on 777 descriptors. These descriptors include the number of oxygen atoms, the molecular weight, the number of rotatable bonds, and the estimated solubility of the molecule. They have fewer samples (216) than descriptors. What is the danger here? How would you solve this issue?
Problem 2 (20 points):
Download the letter recognition data from: http://archive.ics.uci.edu/ml/datasets/Letter+Reco…
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. Below is the attribute information, but more information on the data and how it was used for data mining research can be found in the paper:
Create a classification model for letter recognition using decision trees as a classification method with a holdout partitioning technique for splitting the data into training versus testing.
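One way to approach this is a minimal scikit-learn sketch of the decision-tree-with-holdout workflow. The UCI letter-recognition file is not bundled here, so scikit-learn's built-in digits dataset (which also has small integer pixel-derived features) stands in for it; with the real file downloaded, you would instead load it with `pd.read_csv(...)`, using the first column as the letter label and the remaining 16 columns as attributes.

```python
# Sketch: decision-tree classifier evaluated with a holdout partition.
# The digits dataset stands in for the UCI letter-recognition data,
# which must be downloaded separately.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)

# Holdout partitioning: 70% of the samples for training, 30% for testing,
# stratified so each class appears in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
acc = accuracy_score(y_test, tree.predict(X_test))
print(f"holdout accuracy: {acc:.3f}")
```

The holdout split keeps the test partition completely unseen during training, so the reported accuracy estimates generalization rather than memorization; the `random_state` values are arbitrary and only make the split and tree reproducible.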
Problem 3 (20 points):
On the same data from Problem 2, apply a k-nearest-neighbors classifier to the data.
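A minimal k-nearest-neighbors sketch, under the same assumption that the UCI letter-recognition file is obtained separately: scikit-learn's built-in digits dataset is used here as a self-contained stand-in so the example runs as-is.

```python
# Sketch: k-nearest-neighbors classifier with the same holdout evaluation.
# The digits dataset stands in for the UCI letter-recognition data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# k = 5 neighbors with Euclidean distance (the scikit-learn defaults).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # "training" mostly just stores the data
acc = accuracy_score(y_test, knn.predict(X_test))
print(f"kNN holdout accuracy: {acc:.3f}")
```

Because all 16 attributes in the letter data are already scaled to the same 0–15 integer range, distance-based kNN can be applied directly; with features on different scales, a standardization step before fitting would normally be needed.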