### The complete Machine Learning with Big Data course is currently offered by UC San Diego through the Coursera platform.

This course is part of the Big Data Specialization.

This course provides an overview of machine learning techniques to explore, analyze, and leverage data.  You will be introduced to tools and algorithms you can use to create machine learning models that learn from data, and to scale those models up to big data problems.

At the end of the course, you will be able to:

- Design an approach to leverage data using the steps in the machine learning process.
- Apply machine learning techniques to explore and prepare data for modeling.
- Identify the type of machine learning problem in order to apply the appropriate set of techniques.
- Construct models that learn from data using widely available open source tools.
- Analyze big data problems using scalable machine learning algorithms on Spark.

SKILLS YOU WILL GAIN

- Machine Learning Concepts
- KNIME
- Machine Learning
- Apache Spark

### Course Link: https://www.coursera.org/learn/big-data-machine-learning?specialization=big-data

Quiz 6 Answers - Classification

1. Which of the following is a TRUE statement about classification?

• Classification is a supervised task.
• Classification is an unsupervised task.
• In a classification problem, the target variable has only two possible outcomes.

2. In which phase are model parameters adjusted?

• Testing phase
• Training phase
• Data preparation phase
• Model parameters are constant throughout the modeling process.

3. Which classification algorithm uses a probabilistic approach?

• naive bayes
• none of the above
• decision tree
• k-nearest-neighbors

4. What does the 'k' stand for in k-nearest-neighbors?

• the number of samples in the dataset
• the number of nearest neighbors to consider in classifying a sample
• the distance between neighbors: All neighboring samples that are 'k' distance apart from the sample are considered in classifying that sample.
• the number of training datasets
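
To make the role of 'k' concrete, here is a minimal sketch of k-nearest-neighbors in plain Python. The function name `knn_predict` and the toy data are hypothetical and are not taken from the course material; the sketch assumes Euclidean distance and a simple majority vote.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Sort training samples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda sample: math.dist(sample[0], query))
    # Keep the labels of the k closest samples and return the most common one.
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: (features, class label)
train = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "not_low"),
    ((5.2, 4.9), "not_low"),
]

query = (4.0, 4.0)
print(knn_predict(train, query, k=1))  # not_low (the single closest sample decides)
print(knn_predict(train, query, k=5))  # low (all five samples vote, 3 to 2)
```

The same query receives a different label depending on how many neighbors are allowed to vote, which is why 'k' is a parameter you choose.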

5. During construction of a decision tree, there are several criteria that can be used to determine when a node should no longer be split into subsets. Which one of the following is NOT applicable?

• The tree depth reaches a maximum threshold.
• The number of samples in the node reaches a minimum threshold.
• All (or X% of) samples have the same class label.
• The value of the Gini index reaches a maximum threshold.

6. Which statement is true of tree induction?

• You want to split the data in a node into subsets that are as homogeneous as possible
• All of these statements are true of tree induction.
• An impurity measure is used to determine the best split for a node.
• For each node, splits on all variables are tested to determine the best split for the node.
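
The ideas in this question (an impurity measure, testing splits on every variable, and preferring homogeneous subsets) can be sketched in a few lines of plain Python. This is only an illustration using the Gini index on hypothetical toy data; the names `gini` and `best_split` are made up here, and real tools such as KNIME or Spark handle many more details.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every threshold on every feature; keep the lowest weighted impurity."""
    best = None  # (weighted_impurity, feature_index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty subsets
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical toy data: two numeric features per sample
rows = [(20.0, 1.0), (22.0, 5.0), (30.0, 2.0), (35.0, 4.0)]
labels = ["low", "low", "not_low", "not_low"]

print(gini(labels))              # 0.5 (evenly mixed node)
print(best_split(rows, labels))  # (0.0, 0, 22.0): feature 0 separates the classes
```

A weighted impurity of 0.0 means both subsets are pure, so neither child node needs to be split further.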

7. What does 'naive' mean in Naive Bayes?

• The full Bayes' Theorem is not used. The 'naive' in Naive Bayes specifies that a simplified version of Bayes' Theorem is used.
• Bayes' Theorem makes estimating the probabilities easier. The 'naïve' in the name of the classifier comes from this ease of probability calculation.
• The model assumes that the input features are statistically independent of one another. The 'naïve' in the name of the classifier comes from this naïve assumption.

8. The feature independence assumption in Naive Bayes simplifies the classification problem by

• assuming that the prior probabilities of all classes are independent of one another.
• assuming that classes are independent of the input features.
• ignoring the prior probabilities altogether.
• allowing the probability of each feature given the class to be estimated individually.
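
As a rough illustration of how the independence assumption is used, the sketch below multiplies the class prior by one separately estimated probability per feature. The function names and the toy weather data are hypothetical, and the sketch deliberately omits smoothing, so an unseen feature value drives a class score to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and, per class, the values seen for each feature."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature_index) -> value counts
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(y, i)][value] += 1
    return class_counts, feature_counts

def predict(model, row):
    """Score each class as prior * product of per-feature likelihoods."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total  # prior P(class)
        for i, value in enumerate(row):
            # P(feature_i = value | class), estimated one feature at a time
            score *= feature_counts[(y, i)][value] / n_y
        scores[y] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (humidity, sky) -> class label
rows = [("humid", "cloudy"), ("humid", "clear"), ("dry", "clear"),
        ("dry", "clear"), ("humid", "cloudy")]
labels = ["rain", "rain", "no_rain", "no_rain", "rain"]

model = train_naive_bayes(rows, labels)
label, scores = predict(model, ("humid", "clear"))
print(label)  # rain
```

Because each feature is handled on its own, the model never has to estimate the joint probability of every feature combination.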

Quiz 7 Answers - Classification in KNIME and Spark

1. KNIME: In configuring the Numeric Binner node, what would happen if the definition for the humidity_low bin is changed from

] -infinity ... 25.0 [

to

] -infinity ... 25.0 ]

(i.e., the last bracket is changed from [ to ])?

• The definition for the humidity_low bin would change from excluding 25.0 to including 25.0
• The definition for the humidity_low bin would change from having 25.0 as the endpoint to having 25.1 as the endpoint
• Nothing would change
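
The bracket notation in this question describes whether an interval endpoint is included. The plain-Python sketch below mirrors that idea; it is not KNIME's API, and the function name `in_low_bin` and its parameters are hypothetical.

```python
def in_low_bin(value, upper=25.0, include_upper=False):
    """Return True if `value` falls in ]-inf ... upper[ (open) or ]-inf ... upper] (closed)."""
    if include_upper:
        return value <= upper  # closing bracket "]": the endpoint is part of the bin
    return value < upper       # closing bracket "[": the endpoint is excluded

print(in_low_bin(25.0))                      # False
print(in_low_bin(25.0, include_upper=True))  # True
print(in_low_bin(24.9))                      # True
```

Only a sample whose value is exactly 25.0 is affected by the change of bracket.
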

2. KNIME: Considering the Numeric Binner node again, what would happen if the “Append new column” box is not checked?

• The relative_humidity_3pm variable will become a categorical variable
• The relative_humidity_3pm variable will remain unchanged, and a new unnamed categorical variable will be created
• The relative_humidity_3pm variable will become undefined, and an error will occur

3. KNIME: How many samples had a missing value for air_temp_9am before missing values were addressed?

• 5
• 3
• 0

4. KNIME: How many samples were placed in the test set after the dataset was partitioned into training and test sets?

• 213
• 851
• 20

5. KNIME: What are the target and predicted class labels for the first sample in the test set?

• Both are humidity_not_low
• Target class label is humidity_not_low, and predicted class label is humidity_low
• Target class label is humidity_low, and predicted class label is humidity_not_low

6. Spark: What values are in the number column?

• Integer values starting at 0
• Time and date values
• Random integer values

7. Spark: With the original dataset split into 80% for training and 20% for test, how many of the first 20 samples from the test set were correctly classified?

• 19
• 10
• 1

8. Spark: If we split the data using 70% for training data and 30% for test data, how many samples would the training set have (using seed 13234)?

• 730
• 334
• 70
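
The last question depends on a weighted, seeded random split. The sketch below shows the general idea in plain Python, assuming each row is assigned to a split by an independent random draw. It is not Spark's implementation, the function name `random_split` is hypothetical, and the counts it produces will not match the ones Spark reports for the course dataset.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by a seeded random draw against the weights."""
    total = sum(weights)
    rng = random.Random(seed)  # same seed -> same sequence of draws
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge.
            splits[-1].append(row)
    return splits

rows = list(range(1000))  # hypothetical stand-in for the dataset rows
train, test = random_split(rows, [0.7, 0.3], seed=13234)
print(len(train), len(test))
```

Since every row is drawn independently, the training set holds roughly 70% of the rows rather than exactly 70%, and the exact count is reproducible only for a fixed seed.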