The complete Machine Learning with Big Data course is currently offered by UC San Diego through the Coursera platform.

This course is part of the Big Data Specialization.

About this Course

This course provides an overview of machine learning techniques to explore, analyze, and leverage data. You will be introduced to tools and algorithms you can use to create machine learning models that learn from data, and to scale those models up to big data problems.

At the end of the course, you will be able to:

Design an approach to leverage data using the steps in the machine learning process.
Apply machine learning techniques to explore and prepare data for modeling.
Identify the type of machine learning problem in order to apply the appropriate set of techniques.
Construct models that learn from data using widely available open source tools.
Analyze big data problems using scalable machine learning algorithms on Spark.


- Machine Learning Concepts
- KNIME
- Machine Learning
- Apache Spark

Machine Learning with Big Data Week 2 Quiz Answers!

Quiz 2 Answers - Data Exploration

1. Which of these statements is true about samples and variables?

  • A sample is an instance or example of an entity in your data.
  • All of these statements are true.
  • A sample can have many variables to describe it.
  • A variable describes a specific characteristic of an entity in your data.

2. Other names for 'variable' are

  • categorical, nominal
  • feature, column, attribute
  • sample, row, observation
  • numerical, quantitative

3. What is the purpose of exploring data?

  • To gain a better understanding of your data.
  • To gather your data into one repository.
  • To digitize your data.
  • To generate labels for your data.

4. What are the two main categories of techniques for exploring data? Choose two.

  • Histogram
  • Outliers
  • Visualization
  • Trends
  • Correlations
  • Summary statistics

5. Which of the following are NOT examples of summary statistics?

  • mean, median, mode
  • data sources, data locations
  • standard deviation, range, variation
  • skewness, kurtosis

6. What are the two measures for measuring shape as mentioned in the lecture? Choose two.

  • Kurtosis
  • Skewness
  • Contingency Table
  • Range
  • Mode
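To make the summary statistics from questions 5 and 6 concrete, here is a minimal Python sketch that computes measures of location, spread, and shape. The `shape_statistics` helper and the `wind_speeds` sample are made up for illustration and are not part of the course material; the shape measures use population moments, which is only one of several common conventions.

```python
import statistics

def shape_statistics(values):
    """Return (skewness, kurtosis) computed from population moments."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # a normal distribution has kurtosis 3 under this convention
    return skewness, kurtosis

# Hypothetical sample, not taken from the course dataset.
wind_speeds = [2.0, 3.0, 3.0, 4.0, 13.0]

print("mean:", statistics.mean(wind_speeds))
print("median:", statistics.median(wind_speeds))
print("mode:", statistics.mode(wind_speeds))
print("std dev:", statistics.pstdev(wind_speeds))
print("range:", max(wind_speeds) - min(wind_speeds))
print("skewness, kurtosis:", shape_statistics(wind_speeds))
```

The single large value pulls the mean above the median and gives a positive skewness, which is the kind of asymmetry the shape measures are meant to reveal.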

7. Which of the following would NOT be a good reason to use a box plot?

  • To show and compare distribution values
  • To show data distribution shapes such as asymmetry and skewness.
  • To show correlations between two variables.
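A box plot is drawn from a five-number summary of a single variable. The sketch below computes that summary in plain Python for two made-up samples so their distributions can be compared side by side; `five_number_summary` and both lists are hypothetical, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from what KNIME or Spark report.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), the values a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Hypothetical measurements, not taken from the course dataset.
morning = [1, 2, 3, 4, 5, 6, 7]
afternoon = [2, 4, 4, 5, 9, 12, 20]

for name, data in (("morning", morning), ("afternoon", afternoon)):
    print(name, five_number_summary(data))
```

Each summary describes one variable on its own, so comparing the two tuples says something about spread and asymmetry but nothing about how the variables move together.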

8. All of the following are true about data visualization EXCEPT

  • Is more important than summary statistics for data exploration
  • Should be used with summary statistics for data exploration.
  • Is useful for communicating results.
  • Provides an intuitive way to look at data.

Quiz 3 Answers - Data Exploration in KNIME and Spark

1. What is the maximum of the average wind speed measurements at 9am (to 2 decimal places)?

  • 23.55
  • 29.84
  • 5.50
  • 4.55

2. How many rows containing rain accumulation at 9am measurements have missing values?

  • 6
  • 4
  • 3
  • 2

3. What is the correlation between the relative humidity at 9am and at 3pm (to 2 decimal places, and without removing or imputing missing values)?

  • 0.88
  • 1.00
  • -0.45
  • 0.19
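For readers who want to see what a correlation value like the ones above measures, here is a small Python sketch of the Pearson correlation coefficient. The function name and the sample readings are assumptions for illustration; the sketch also assumes both lists are complete, so it does not reproduce how KNIME or Spark treat missing values in the course exercise.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length lists without missing values."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical humidity readings, not taken from the course dataset.
humidity_9am = [42.0, 55.0, 61.0, 70.0, 88.0]
humidity_3pm = [35.0, 50.0, 52.0, 66.0, 80.0]

print(round(pearson_correlation(humidity_9am, humidity_3pm), 2))
```

The result always lies between -1 and 1, with values near 1 meaning the two variables rise and fall together.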

4. If the histogram for air temperature at 9am has 50 bins, what is the number of elements in the bin with the most elements (without removing or imputing missing values)?

  • 57
  • 224
  • 49
  • 166
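The histogram question boils down to splitting the value range into equal-width bins and counting the elements in each. A rough Python sketch of that idea follows; `histogram_counts` and the sample temperatures are hypothetical, and skipping `None` entries is an assumption about how missing values are handled, not a statement about the KNIME Histogram node.

```python
def histogram_counts(values, num_bins):
    """Count values per equal-width bin, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        raise ValueError("all values are identical, so bin width would be zero")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in present:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical temperatures, not taken from the course dataset.
air_temps = [1, 2, 2, 3, 3, 3, 4, None, 10]

counts = histogram_counts(air_temps, 3)
print(counts, "largest bin:", max(counts))
```

With the real data and 50 bins, `max(counts)` is the quantity the question asks about.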

5. What is the approximate maximum max_wind_direction_9am when the maximum max_wind_speed_9am occurs?

  • 70
  • 30
  • 312


Quiz 4 Answers - Data Preparation

1. Which of the following is NOT a data quality issue?

  • Inconsistent data
  • Scaled data
  • Missing values
  • Duplicate data

2. Imputing missing data means to

  • replace missing values with something reasonable.
  • drop samples with missing values.
  • replace missing values with outliers.
  • merge samples with missing values.
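As an illustration of the two usual ways to deal with missing values, the Python sketch below contrasts dropping them with imputing them. Both helper functions and the sample list are made up for this example, and the mean is just one possible "reasonable" replacement value.

```python
def drop_missing(values):
    """Remove missing entries, which shrinks the dataset."""
    return [v for v in values if v is not None]

def impute_with_mean(values):
    """Replace missing entries with the mean of the observed values."""
    present = drop_missing(values)
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

# Hypothetical readings with gaps, not taken from the course dataset.
air_pressure = [918.0, None, 920.0, 916.0, None]

print(drop_missing(air_pressure))      # fewer samples remain
print(impute_with_mean(air_pressure))  # same number of samples, gaps filled
```

Dropping keeps only observed data but loses samples, while imputing keeps every sample at the cost of introducing estimated values.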

3. A data sample with values that are considerably different than the rest of the other data samples in the dataset is called an/a _____________.

  • Outlier
  • Invalid data
  • Noise
  • Inconsistent data

4. Which one of the following examples illustrates the use of domain knowledge to address a data quality issue?

  • Simply discard the samples that lie significantly outside the distribution of your data
  • Drop samples with missing values
  • Merge duplicate records while retaining relevant data
  • None of these

5. Which of the following is NOT an example of feature selection?

  • Adding an in-state feature based on an applicant's home state.
  • Re-formatting an address field into separate street address, city, state, and zip code fields.
  • Removing a feature with a lot of missing values.
  • Replacing a missing value with the variable mean.

6. Which one of the following is the best feature set for your analysis?

  • Feature set with the smallest set of features that best capture the characteristics of the data for the intended application
  • Feature set with the smallest number of features
  • Feature set with the largest number of features
  • Feature set that contains exclusively re-coded features

7. The mean value and the standard deviation of a zero-normalized feature are

  • mean = 0 and standard deviation = 0
  • mean = 1 and standard deviation = 0
  • mean = 0 and standard deviation = 1
  • mean = 1 and standard deviation = 1
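Zero-normalization (also called z-score scaling) subtracts the mean from every value and divides by the standard deviation. The sketch below checks the resulting mean and standard deviation on a made-up sample; `zero_normalize` is a hypothetical helper and it uses the population standard deviation, which is an assumption about the convention.

```python
import statistics

def zero_normalize(values):
    """Scale values by subtracting the mean and dividing by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical feature values, not taken from the course dataset.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = zero_normalize(feature)

print("mean after scaling:", round(statistics.fmean(scaled), 6))
print("std dev after scaling:", round(statistics.pstdev(scaled), 6))
```

Running it on other samples is a quick way to confirm which answer option matches the definition.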

8. Which of the following is NOT true about PCA?

  • PCA stands for principal component analysis
  • PC1 and PC2, the first and second principal components, respectively, are always orthogonal to each other.
  • PC1, the first principal component, captures the largest amount of variance in the data along a single dimension.
  • PCA is a dimensionality reduction technique that removes a feature that is very correlated with another feature.
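To see what the principal components in this question look like, here is a minimal two-dimensional PCA sketch in plain Python. It relies on the closed-form eigen-decomposition of a 2x2 covariance matrix, so it only works for two features; `pca_2d`, `unit`, and the sample points are hypothetical names and data, and real work would use a library such as Spark MLlib instead.

```python
import math

def unit(vector):
    """Scale a 2-D vector to length 1."""
    length = math.hypot(vector[0], vector[1])
    return (vector[0] / length, vector[1] / length)

def pca_2d(points):
    """Return ((variance1, pc1), (variance2, pc2)) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / n
    c = sum((p[1] - mean_y) ** 2 for p in points) / n
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    if b == 0:
        # Features are uncorrelated, so the original axes are the components.
        pc1, pc2 = ((1.0, 0.0), (0.0, 1.0)) if a >= c else ((0.0, 1.0), (1.0, 0.0))
    else:
        pc1 = unit((b, lam1 - a))
        pc2 = unit((b, lam2 - a))
    return (lam1, pc1), (lam2, pc2)

# Hypothetical, strongly correlated points.
points = [(1, 2), (2, 3.5), (3, 5), (4, 8.5), (5, 9)]
(var1, pc1), (var2, pc2) = pca_2d(points)

print("variance along PC1:", var1, "direction:", pc1)
print("variance along PC2:", var2, "direction:", pc2)
print("dot product of PC1 and PC2:", pc1[0] * pc1[1 - 1 + 1] * 0 + pc1[0] * pc2[0] + pc1[1] * pc2[1])
```

Note that each component is a new axis built as a combination of both original features, and the variance along PC1 is never smaller than the variance along PC2.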

Quiz 5 Answers - Handling Missing Values in KNIME and Spark

1. If we remove all missing values from the data, how many air pressure at 9am measurements have values between 911.736 and 914.67?

  • 77
  • 287
  • 80

2. If we impute the missing values with the minimum value, how many air temperature at 9am measurements are less than 42.292?

  • 28
  • 23
  • 1
  • 5

3. How many samples have missing values for air_pressure_9am?

  • 3
  • 5
  • 1092
  • 0

4. Which column in the weather dataset has the largest number of missing values?

  • rain_accumulation_9am
  • number
  • They are all the same
  • air_temp_9am

5. When we remove all the missing values from the dataset, the number of rows is 1064, yet the variable with most missing values has 1089 rows. Why did the number of rows decrease so much?

  • Because the missing values in each column are not necessarily in the same row
  • Because rows with missing values as well as rows with 0s are removed
  • Because rows with missing values as well as rows with duplicate values are removed
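The effect described in question 5 can be reproduced on a tiny table. The Python sketch below counts missing values per column and then drops every row that has at least one gap; the column names come from the weather dataset mentioned above, but the four rows and both helper functions are invented for illustration.

```python
# Hypothetical rows, not taken from the course dataset.
rows = [
    {"air_temp_9am": 60.1, "rain_accumulation_9am": None},
    {"air_temp_9am": None, "rain_accumulation_9am": 0.0},
    {"air_temp_9am": 71.3, "rain_accumulation_9am": 0.2},
    {"air_temp_9am": None, "rain_accumulation_9am": None},
]

def missing_per_column(rows):
    """Count missing (None) values for each column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def drop_rows_with_missing(rows):
    """Keep only rows in which every column has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]

per_column = missing_per_column(rows)
complete = drop_rows_with_missing(rows)

print("missing per column:", per_column)
print("rows removed:", len(rows) - len(complete))
```

In this toy table no single column is missing more than two values, yet three rows are removed, because a row is dropped as soon as any one of its columns is empty.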
