Data Analytics Question
- (4 points) Which of the following statements are true? Briefly explain your answer.
- Training a k-nearest-neighbors classifier takes less computational time than testing it.
- The more training examples, the more accurate the prediction of a k-nearest-neighbors.
- k-nearest-neighbors cannot be used for regression.
- A k-nearest-neighbors is sensitive to the number of features.
- (6 points) Figure 2 presents the performance of several algorithms applied to the problem of classifying molecules in two classes: those that inhibit Human Respiratory Syncytial Virus (HRSV), and those that do not. HRSV is the most frequent cause of respiratory tract infections in small children, with a worldwide estimated prevalence of about 34 million cases per year among children under 5 years of age.
-
- Which method gives the best performance? Explain your answer.
- The goal of this study is to develop an algorithm that can be used to suggest, among a large collection of several millions of molecules, those that should be experimentally tested for activity against HRSV. Compounds that are active against HSRV are good leads from which to develop new medical treatments against infections caused by this virus. In this context, is it preferable to have a high sensitivity or a high specificity? Which part of the ROC curve is the most interesting?
- In this study, the authors have represented the molecules based on 777 descriptors. Those descriptors include the number of oxygen atoms, the molecular weights, the number of rotatable bonds, or the estimated solubility of the molecule. They have fewer samples (216) than descriptors. What is the danger here? How would you solve this issue?
Problem 2 (20 points):
Download the letter recognition data from: http://archive.ics.uci.edu/ml/datasets/Letter+Reco…
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. Below is the attribute information, but more information on the data and how it was used for data mining research can be found in the paper:
Create a classification model for letter recognition using decision trees as a classification method with a holdout partitioning technique for splitting the data into training versus testing.
Problem 3 (20points):
On the same data from Problem 2, apply a K-nearest neighbor classifier to classify the data.

We've got everything to become your favourite writing service
Money back guarantee
Your money is safe. Even if we fail to satisfy your expectations, you can always request a refund and get your money back.
Confidentiality
We don’t share your private information with anyone. What happens on our website stays on our website.
Our service is legit
We provide you with a sample paper on the topic you need, and this kind of academic assistance is perfectly legitimate.
Get a plagiarism-free paper
We check every paper with our plagiarism-detection software, so you get a unique paper written for your particular purposes.
We can help with urgent tasks
Need a paper tomorrow? We can write it even while you’re sleeping. Place an order now and get your paper in 8 hours.
Pay a fair price
Our prices depend on urgency. If you want a cheap essay, place your order in advance. Our prices start from $11 per page.