### The complete Machine Learning with Big Data course is currently offered by UC San Diego through the Coursera platform.

This course is part of the Big Data Specialization.

This course provides an overview of machine learning techniques to explore, analyze, and leverage data.  You will be introduced to tools and algorithms you can use to create machine learning models that learn from data, and to scale those models up to big data problems.

At the end of the course, you will be able to:

- Design an approach to leverage data using the steps in the machine learning process.
- Apply machine learning techniques to explore and prepare data for modeling.
- Identify the type of machine learning problem in order to apply the appropriate set of techniques.
- Construct models that learn from data using widely available open source tools.
- Analyze big data problems using scalable machine learning algorithms on Spark.

SKILLS YOU WILL GAIN

- Machine Learning Concepts
- KNIME
- Machine Learning
- Apache Spark

### Course Link: https://www.coursera.org/learn/big-data-machine-learning?specialization=big-data

### Quiz 10 Answers - Regression, Cluster Analysis, & Association Analysis

1. What is the main difference between classification and regression?

• In classification, you're predicting a number, and in regression, you're predicting a category.
• There is no difference since you're predicting a numeric value from the input variables in both tasks.
• In classification, you're predicting a category, and in regression, you're predicting a number.
• In classification, you're predicting a categorical variable, and in regression, you're predicting a nominal variable.

2. Which of the following is NOT an example of regression?

• Predicting the price of a stock
• Estimating the amount of rain
• Determining whether power usage will rise or fall
• Predicting the demand for a product

3. In linear regression, the least squares method is used to

• Determine the distance between two pairs of samples.
• Determine whether the target is categorical or numerical.
• Determine the regression line that best fits the samples.
• Determine how to partition the data into training and test sets.
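
To make the role of least squares concrete, here is a minimal sketch of fitting a line to samples with the closed-form least squares formulas for simple linear regression. The function name `fit_line` and the sample values are illustrative assumptions, not part of the course material.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes the
    sum of squared vertical distances to the samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical samples that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the fitted line would no longer pass through every sample, but it would still be the line with the smallest total squared error.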

4. How does simple linear regression differ from multiple linear regression?

• In simple linear regression, the input has only categorical variables. In multiple linear regression, the input can be a mix of categorical and numerical variables.
• In simple linear regression, the input has only one variable. In multiple linear regression, the input has more than one variable.
• In simple linear regression, the input has only categorical variables. In multiple linear regression, the input has only numerical variables.
• They are just different terms for linear regression with one input variable.

5. The goal of cluster analysis is

• To segment data so that differences between samples in the same cluster are maximized and differences between samples of different clusters are minimized.
• To segment data so that all samples are evenly divided among the clusters.
• To segment data so that all categorical variables are in one cluster, and all numerical variables are in another cluster.
• To segment data so that differences between samples in the same cluster are minimized and differences between samples of different clusters are maximized.

6. Cluster results can be used to

• Determine anomalous samples
• Segment the data into groups so that each group can be analyzed further
• Classify new samples
• Create labeled samples for a classification task
• All of these choices are valid uses of the resulting clusters.

7. A cluster centroid is

• The mean of all the samples in the two closest clusters.
• The mean of all the samples in the cluster
• The mean of all the samples in the two farthest clusters.
• The mean of all the samples in all clusters

8. The main steps in the k-means clustering algorithm are

• Assign each sample to the closest centroid, then calculate the new centroid.
• Calculate the centroids, then determine the appropriate stopping criterion depending on the number of centroids.
• Calculate the distances between the cluster centroids, then find the two closest centroids.
• Count the number of samples, then determine the initial centroids.
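
The assign-then-update loop of k-means, and the idea of a centroid as the mean of a cluster's samples, can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function name `kmeans`, the random choice of initial centroids, and the toy points are all hypothetical, and this is not how Spark implements the algorithm.

```python
import random


def kmeans(points, k, seed=1, max_iter=100):
    """Tiny k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k samples as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each sample goes to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid becomes the mean of its samples
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

The stopping rule used here (no change in centroids) is one common choice; a cap on iterations or a threshold on centroid movement are alternatives.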

9. The goal of association analysis is

• To find the most complex rules to explain associations between as many items as possible in the data.
• To find the number of outliers in the data
• To find rules to capture associations between items or events
• To find the number of clusters for cluster analysis

10. In association analysis, an item set is

• A transaction or set of items that occur together
• A set of transactions that occur a certain number of times in the data
• A set of items that two rules have in common
• A set of items that infrequently occur together

11. The support of an item set

• Captures the frequency of that item set
• Captures how many times that item set is used in a rule
• Captures the number of items in that item set
• Captures the correlation between the items in that item set

12. Rule confidence is used to

• Identify frequent item sets
• Determine the rule with the most items
• Measure the intuitiveness of a rule
• Prune rules by eliminating rules with low confidence
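
Support and confidence are easier to follow with numbers. The sketch below computes both on a handful of made-up transactions; the item names, the helper functions `support` and `confidence`, and the 0.7 threshold are assumptions chosen only for illustration.

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(item_set, transactions):
    """Fraction of transactions that contain every item in item_set."""
    hits = sum(1 for t in transactions if item_set <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


support_bread_milk = support({"bread", "milk"}, transactions)
conf_bread_to_milk = confidence({"bread"}, {"milk"}, transactions)

# Keep the rule only if its confidence clears a chosen threshold
min_confidence = 0.7  # assumed value for illustration
keep_rule = conf_bread_to_milk >= min_confidence
print(support_bread_milk, conf_bread_to_milk, keep_rule)
```

Here {bread, milk} appears in 3 of 5 transactions and bread in 4 of 5, so the rule bread -> milk has a confidence of about 0.75 and survives pruning at the assumed threshold.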

### Quiz 11 Answers - Cluster Analysis in Spark

1. What percentage of samples have 0 for rain_accumulation?

• 157812 / 158726 = 99.4%
• 157237 / 158726 = 99.1%
• There is not enough information to determine this

2. Why is it necessary to scale the data (Step 4)?

• Since the values of the features are on different scales, all features need to be scaled so that all values will be positive.
• Since the values of the features are on different scales, all features need to be scaled so that no one feature dominates the clustering results.
• Since the values of the features are on different scales, all features need to be scaled so that the cluster centers can be displayed on the same plot for easier analysis.
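
As a rough illustration of why scaling matters for distance-based clustering, the sketch below standardizes two features that live on very different scales. The feature values and the helper `standardize` are hypothetical, and the population standard deviation is used for simplicity; the scaler used in the course notebook may differ in detail.

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population std, an assumption here
    return [(x - mean) / std for x in column]


# Hypothetical raw values: pressure is in the hundreds, wind speed is not
air_pressure = [912.0, 915.0, 918.0]
wind_speed = [1.0, 3.0, 5.0]

scaled_pressure = standardize(air_pressure)
scaled_wind = standardize(wind_speed)
print(scaled_pressure)
print(scaled_wind)
```

Before scaling, a difference of a few units in pressure would weigh as much in a Euclidean distance as the full range of wind speed; after scaling, both features contribute on comparable terms.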

3. If we wanted to create a data subset by taking every 5th sample instead of every 10th sample, how many samples would be in that subset?

• 317,452
• 1,587,257
• 158,726
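
The subset sizes can be checked with simple arithmetic. The sketch below assumes the full dataset has 1,587,257 rows (a figure taken from the options above) and that rows are numbered from 0, so "every n-th sample" means keeping row indices divisible by n. Both points are assumptions for this illustration.

```python
total_rows = 1_587_257  # assumed size of the full dataset

# Count the row indices 0, step, 2*step, ... below total_rows
every_10th = len(range(0, total_rows, 10))
every_5th = len(range(0, total_rows, 5))
print(every_10th, every_5th)
```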

4. This line of code creates a k-means model with 12 clusters:

kmeans = KMeans(k=12, seed=1)

What is the significance of “seed=1”?

• This sets the seed to a specific value, which is necessary to reproduce the k-means results
• This means that this is the first iteration of k-means. The seed value is incremented by 1 every time k-means is executed
• This specifies that the first cluster centroid is set to sample #1
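
The effect of fixing a seed can be shown without Spark. In this sketch, the hypothetical helper `pick_initial_centroids` chooses random sample indices as starting centroids. Spark's KMeans uses its own initialization strategy, which this does not reproduce; the point is only that the same seed yields the same random choices.

```python
import random


def pick_initial_centroids(n_samples, k, seed):
    """Choose k distinct sample indices using a seeded random generator."""
    rng = random.Random(seed)
    return rng.sample(range(n_samples), k)


run_a = pick_initial_centroids(1000, 12, seed=1)
run_b = pick_initial_centroids(1000, 12, seed=1)  # same seed as run_a
run_c = pick_initial_centroids(1000, 12, seed=2)  # different seed

print(run_a == run_b)  # identical starting points
print(run_a == run_c)  # very likely different starting points
```

Because k-means results depend on where the centroids start, repeating a run with the same seed is what makes cluster numbering and cluster centers comparable across runs.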

5. Just by looking at the values for the cluster centers, which cluster contains samples with the lowest relative humidity?

• Cluster 4
• Cluster 3
• Cluster 9

6. What do clusters 7, 8, and 11 have in common?

• They capture weather patterns associated with warm and dry days
• They capture weather patterns associated with high air pressure
• They capture weather patterns associated with very strong winds

7. If we perform clustering with 20 clusters (and seed = 1), which cluster appears to identify Santa Ana conditions (lowest humidity and highest wind speeds)?

• Cluster 12
• Cluster 1
• Cluster 16

8. We did not include the minimum wind measurements in the analysis since they are highly correlated with the average wind measurements. What is the correlation between min_wind_speed and avg_wind_speed (to two decimals)? (Compute this using one-tenth of the original dataset, and dropping all rows with missing values.)

• 0.97
• -0.12
• 0.62
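
For reference, a correlation like this one can be computed by dropping incomplete rows and applying the Pearson formula. The sketch below uses a few made-up (min_wind_speed, avg_wind_speed) pairs, not the course dataset, and the helper name `pearson` is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical (min_wind_speed, avg_wind_speed) rows; None marks a missing value
rows = [(1.0, 1.5), (None, 2.0), (2.0, 2.6), (3.0, 3.4)]

# Drop every row that has a missing value before computing the correlation
complete = [r for r in rows if None not in r]
min_wind = [r[0] for r in complete]
avg_wind = [r[1] for r in complete]

r_value = pearson(min_wind, avg_wind)
print(round(r_value, 2))
```

A value close to 1 means the two measurements carry nearly the same information, which is the usual reason for leaving one of them out of the clustering features.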