Course 4: Process Data from Dirty to Clean. This article provides all of the weekly challenge quiz answers for this course, from week 1 to week 5, to help students prepare for the exams.

Process Data from Dirty to Clean Weekly Challenge 1 Answers

Q1. Which of the following conditions are necessary to ensure data integrity? Select all that apply.

• Statistical power
• Completeness
• Accuracy
• Privacy

Q2. What is one potential problem associated with data manipulation that analysts must be aware of?

• Data manipulation can help organize a dataset.
• Data manipulation can separate a dataset among different locations.
• Data manipulation can make a dataset easier to read.
• Data manipulation can introduce errors.

Q3. A data analyst is given a dataset for analysis. It includes data about the total population of every country in the previous 20 years. Based on the available data, an analyst will be able to determine which country was the most populous from 2016 to 2017.

• True
• False

Q4. A data analyst is given a dataset for analysis.

Which of the following has duplicate data?

• Data for Valando on 2/18/2014
• Data for Valando on 1/1/2014
• Data for Symteco on 5/20/2014
• Data for Symteco on 2/21/2014

Q5. A data analyst is working on a project about the global supply chain. They have a dataset with lots of relevant data from Europe and Asia. However, they decide to generate new data that represents all continents. What type of insufficient data does this scenario describe?

• Data that keeps updating
• Data that’s outdated
• Data that’s geographically limited
• Data from only one source

Q6. A car manufacturer wants to learn more about the brand preferences of electric car owners. There are millions of electric car owners in the world. Who should the company survey?

• A sample of car owners who most recently bought an electric car
• A sample of all electric car owners
• A sample of car owners who have owned more than one electric car
• The entire population of electric car owners

Q7. Fill in the blank: Sampling bias in data collection happens when a sample isn’t representative of _____.

• a dataset about the population
• the population most affected by the data
• a subset of the population
• the population as a whole

Q8. Which of the following processes helps ensure a close alignment of data and business objectives?

• Completing data replication
• Transferring data multiple times
• Having data update automatically during analysis
• Maintaining data integrity

Process Data from Dirty to Clean Weekly Challenge 2 Answers

Q1. Which of the following terms describe dirty data? Select all that apply.

• Irrelevant
• Incomplete
• Infallible
• Incorrect

Q2. Field length is a spreadsheet tool for determining if a field has been duplicated.

• True
• False

Q3. A data analyst notices that the customer in row 2 shares the same Customer ID as the customer in row 6. What does this scenario describe?

     A           B            C                D
1    Last name   First name   Middle initial   Customer ID
2    Smith       Leonardo     R.               64078
3    Lee         Natasha      E.               92862
4    Wallace     Luciana      M.               55107
5    Xiao        Hua          A.               88492
6    Smith       Leo          R.               64078
7    Chaudhuri   Toby         T.               34694
8    Lee         Tasha        P.               18295
9    Walton      Mason        Q.               58239
10   Richards    Felix        S.               12765
11   Guillermo   Beth         I.               27593
12   Walton      Nadine       J.               67292

• Duplicate data
• Mislabeled data
• Inconsistent data
• Obsolete data

Q4. Fill in the blank: Conditional formatting is a spreadsheet tool that changes how _____ appear when values meet a specific condition.

• filters
• cells
• queries
• charts

Q5. A data analyst uses the SPLIT function to divide a text string around a specified character and put each fragment into a new, separate cell. What is the specified character separating each item called?

• Delimiter
• Unit
• Partition
• Substring
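
As a rough illustration of splitting text around a specified character, here is a minimal Python sketch. It is an analogue of the spreadsheet SPLIT function, not the function itself, and the sample text and comma separator are made up for the example.

```python
# Rough Python analogue of a spreadsheet SPLIT call.
# The sample text and the comma separator are illustrative assumptions.
text = "blue,green,off-white"
separator = ","

# str.split breaks the string at every occurrence of the separator
# and returns the pieces as a list (one piece per "cell").
fragments = text.split(separator)

print(fragments)  # ['blue', 'green', 'off-white']
```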

Q6. For a function to work properly, data analysts must follow each function’s predetermined structure. What is this structure called?

• Syntax
• Validation
• Summary
• Algorithm

Q7. You are working with the following selection of a spreadsheet:

     A               B
1    Customer        Address
2    Sally Stewart   9912 School St. North Wales, PA 19454
3    Lorenzo Price   8621 Glendale Dr. Burlington, MA 01803
4    Stella Moss     372 W. Addison Street Brandon, FL 33510
5    Paul Casey      9069 E. Brickyard Road Chattanooga, TN 37421

In order to extract the five-digit postal code from Burlington, MA, what is the correct function?

• =LEFT(5,B3)
• =RIGHT(B3,5)
• =RIGHT(5,B3)
• =LEFT(B3,5)
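
To see how taking characters from the right or left end of a string behaves, here is a small Python sketch. The helper names `right` and `left` are hypothetical stand-ins for the spreadsheet functions, written only for this example; the address is the value of cell B3 in the question.

```python
# Hypothetical helpers that mimic spreadsheet RIGHT(text, count) and LEFT(text, count).
def right(text: str, count: int) -> str:
    """Return the last `count` characters of `text`."""
    return text[-count:] if count > 0 else ""


def left(text: str, count: int) -> str:
    """Return the first `count` characters of `text`."""
    return text[:count]


address = "8621 Glendale Dr. Burlington, MA 01803"  # cell B3 in the question

print(right(address, 5))  # 01803
print(left(address, 5))   # '8621 ' (street number plus a space)
```

Note that the text comes first and the character count second, which mirrors the argument order the spreadsheet functions expect.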

Q8. A data analyst in a human resources department is working with the following selection of a spreadsheet:

     A            B               C            D
1    Year Hired   Last 4 of SS#   Department   Employee ID
2    2019         1192            Marketing
3    2014         2683            Operations
4    2020         1939            Strategy
5    2009         3208            Graphics

They want to create employee identification numbers (IDs) in column D. The IDs should include the year hired plus the last four digits of the employee’s Social Security Number (SS#). What function will create the ID 20093208 for the employee in row 5?

• =CONCATENATE(A5,B5)
• =CONCATENATE(A5+B5)
• =CONCATENATE(A5:B5)
• =CONCATENATE(A5*B5)
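
The difference between joining two values as text and combining them arithmetically can be sketched in Python. This is only an analogy for the spreadsheet behavior; the variable names are made up, and the values come from row 5 of the table in the question.

```python
# Values from row 5 of the question's table, kept as text.
year_hired = "2009"
last_four = "3208"

# Joining the two values as text places them side by side.
employee_id = year_hired + last_four

# Adding them as numbers gives a sum, which is not a usable ID.
arithmetic_sum = int(year_hired) + int(last_four)

print(employee_id)     # 20093208
print(arithmetic_sum)  # 5217
```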

Q9. An analyst is cleaning a new dataset containing 500 rows. They want to make sure the data contained from cell B2 through cell B300 does not contain a number greater than 50. Which of the following COUNTIF function syntaxes could be used to answer this question? Select all that apply.

• =COUNTIF(B2:B300,>50)
• =COUNTIF(B2:B300,"<=50")
• =COUNTIF(B2:B300,<=50)
• =COUNTIF(B2:B300,">50")
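
A conditional count of this kind can be sketched in Python. The list of values below is invented for illustration and stands in for the range B2:B300; the two counts correspond to the conditions ">50" and "<=50".

```python
# Made-up sample values standing in for the range B2:B300.
values = [12, 50, 51, 7, 88, 49]

# Count the entries that break the rule (greater than 50).
over_50 = sum(1 for v in values if v > 50)

# Count the entries that satisfy the rule (50 or less).
at_most_50 = sum(1 for v in values if v <= 50)

print(over_50)     # 2
print(at_most_50)  # 4
```

Either count answers the question: the range is clean when the first count is zero, or when the second count equals the number of entries in the range.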

Q10. The V in VLOOKUP stands for what?

• Virtual
• Vertical
• Visual
• Variable

Q11. Fill in the blank: Data mapping is the process of _____ fields from one data source to another.

• matching
• merging
• extracting

Q12. Describe the relationship between a primary key and a foreign key.

• A primary key references a row in which each value is unique. A foreign key is a column within a table that is a primary key in another table.
• A primary key is a field within a table that is a foreign key in another table. A foreign key references a column in which each value is unique.
• A primary key references a column in a table in which each value is unique. A foreign key is a field within a table that is a primary key in another table.
• A primary key references a field within a table that is a foreign key in another table. A foreign key references a row in which each value is unique.
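
The relationship between the two kinds of key can be sketched with Python's built-in `sqlite3` module. The `customers` and `orders` tables and their columns are hypothetical, invented for this example, and SQLite is an assumption since the course does not prescribe a database engine here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# customer_id is the primary key: every value in the column must be unique.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, last_name TEXT)")
# orders.customer_id is a foreign key: it points at the primary key of customers.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id))"
)

conn.execute("INSERT INTO customers VALUES (64078, 'Smith')")
conn.execute("INSERT INTO orders VALUES (1, 64078)")

# A second customer with the same primary key value is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (64078, 'Smith')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# An order that references a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (2, 99999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(duplicate_rejected, orphan_rejected)  # True True
```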

Process Data from Dirty to Clean Weekly Challenge 3 Answers

Q1. Data analysts choose SQL for which of the following reasons? Select all that apply.

• SQL is a programming language that can also create web apps
• SQL is a powerful software program
• SQL is a well-known standard in the professional community
• SQL can handle huge amounts of data

Q2. In which of the following situations would a data analyst use spreadsheets instead of SQL? Select all that apply.

• When visually inspecting data
• When working with a dataset with more than 1,000,000 rows
• When working with a small dataset
• When using a language to interact with multiple database programs

Q3. A data analyst creates many new tables in their company’s database. When the project is complete, the analyst wants to remove the tables so they don’t clutter the database. What SQL commands can they use to delete the tables?

• INSERT INTO
• CREATE TABLE IF NOT EXISTS
• UPDATE
• DROP TABLE IF EXISTS

Q4. A data analyst is cleaning customer data for an online retail company. They are working with the following section of a database:

The analyst wants to find out if the state data is consistent and if any text strings contain more than two characters. What is the correct SQL clause to use to find any text strings containing more than two characters?

• WHERE(state) > 2
• DISTINCT(state) > 2
• SUBSTR(state) > 2
• LENGTH(state) > 2
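
The database section referenced in this question is not reproduced in this article, so the following sketch uses a made-up `customers` table. It shows, using Python's built-in `sqlite3` module as an assumed engine, how a length condition in a WHERE clause can surface state values that are longer than two characters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, state TEXT)")

# Invented rows: one state is spelled out instead of abbreviated.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Customer A", "OH"), ("Customer B", "Ohio"), ("Customer C", "PA")],
)

# LENGTH returns the number of characters in the string,
# so the filter keeps only rows whose state has more than two.
rows = conn.execute(
    "SELECT name, state FROM customers WHERE LENGTH(state) > 2"
).fetchall()

print(rows)  # [('Customer B', 'Ohio')]
```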

Q5. Fill in the blank: The _____ function counts the number of characters a string contains.

• SUBSTR
• CAST
• LENGTH
• TRIM

Q6. In SQL databases, what data type refers to a number that contains a decimal?

• Integer
• String
• Boolean
• Float

Q7. Fill in the blank: In SQL databases, the _____ function can be used to convert data from one datatype to another.

• TRIM
• LENGTH
• SUBSTR
• CAST
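
A datatype conversion can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under the assumption of SQLite as the engine; type names and conversion rules differ between database systems, and the literal values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Convert a text value to a floating-point number,
# and a floating-point number to an integer.
row = conn.execute(
    "SELECT CAST('19.99' AS REAL), CAST(19.99 AS INTEGER)"
).fetchone()

text_as_float, float_as_int = row
print(type(text_as_float).__name__, float_as_int)  # float 19
```

In SQLite, converting a decimal number to an integer drops the fractional part rather than rounding.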

Q8. Fill in the blank: The _____ function can be used to return non-null values in a list.

• CONCAT
• TRIM
• COALESCE
• CAST
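
Returning the first non-null value from a list can be demonstrated with Python's built-in `sqlite3` module, assumed here as the engine. The literal values are placeholders chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# COALESCE walks its arguments left to right and
# returns the first one that is not NULL.
row = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback'), COALESCE(NULL, 5, 7)"
).fetchone()

print(row)  # ('fallback', 5)
```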

Process Data from Dirty to Clean Weekly Challenge 4 Answers

Q1. The data collected for an analysis project has just been cleaned. What are the next steps for a data analyst? Select all that apply.

• Verification
• Reporting
• Certification
• Validation

Q2. A data analyst is in the verification step. They consider the business problem, the goal, and the data involved in their analytics project. What scenario does this describe?

• Reporting on the data
• Seeing the big picture
• Considering the stakeholders
• Visualizing the data

Q3. Which function removes leading, trailing, and repeated spaces in data?

• CUT
• CROP
• TRIM
• TIDY
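
Removing leading, trailing, and repeated spaces can be approximated in Python as follows. This is a sketch of the behavior described in the question, not the spreadsheet function itself: `str.split()` with no arguments also collapses tabs and newlines, which a spreadsheet may treat differently. The sample text is invented.

```python
raw = "  Meer-Kitty   Interior  Design "

# Split on runs of whitespace, then rejoin with single spaces:
# this drops leading, trailing, and repeated spaces in one step.
cleaned = " ".join(raw.split())

# strip() alone only removes the spaces at the two ends.
edges_only = raw.strip()

print(repr(cleaned))     # 'Meer-Kitty Interior Design'
print(repr(edges_only))  # 'Meer-Kitty   Interior  Design'
```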

Q4. A data analyst uses the COUNTA function to count which of the following?

• The total number of headers in a specific range.
• The total number of values within a specified range.
• The total number of entries in a changelog.
• The specific numbers in a dataset.

Q5. A WHEN statement considers one or more conditions and returns a value as soon as that condition is met.

• True
• False

Q6. What is the process of tracking changes, additions, deletions, and errors during data cleaning?

• Recording
• Documentation
• Observation

Q7. Fill in the blank: A changelog contains a _____ list of modifications made to a project.

• approximate
• random
• chronological
• synchronized

Q8. Reviewing version history is an effective way to view a changelog in SQL.

• True
• False

Process Data from Dirty to Clean Course Week 05 Challenge Answers

Scenario 1, questions 1-5

Q1. You are a data analyst at a small analytics company. Your company is hosting a project kick-off meeting with a new client, Meer-Kitty Interior Design. The agenda includes reviewing their goals for the year, answering any questions, and discussing their available data.

Meer-Kitty Interior Design About Us Page.pdf

Meer-Kitty Interior Design has two goals. They want to expand their online audience, which means getting their company and brand known by as many people as possible. They also want to launch a line of high-quality indoor paint to be sold in-store and online. You decide to consider the data about indoor paint first.

Kitty Survey Feedback – Meer-Kitty survey feedback.csv

You are pleased to find that the available data is aligned to the business objective. However, you do some research about confidence level for this type of survey and learn that you need at least 120 unique responses for the survey results to be useful. Therefore, the dataset has two limitations: First, there are only 40 responses; second, a Meer-Kitty superfan, User 588, completed the survey 11 times.

As the survey has too few responses and numerous duplicates that are skewing results, what are your options? Select all that apply.

• Repeat the survey in order to create a new, improved dataset.
• Locate another dataset about indoor paint.
• Remove the duplicates from the data and proceed with analysis.
• Talk with stakeholders and ask for more time.

Q2. During the meeting, you also learn that Meer-Kitty videos are hosted on their website. For each product offered, there is an accompanying video for customers to learn more. So, more views for a video suggests greater consumer interest.

Your goal is to identify which videos are most popular, so Meer-Kitty knows what topics to explore in the future. Unfortunately, Meer-Kitty has just three months of data available because they only recently launched the videos on their site.

Without enough data to identify long-term trends about the video subjects that people prefer, what should you do?

• Find an alternate data source that will still enable you to meet your objective.
• Watch the videos and use your gut instinct to identify which are most successful.
• Tell the client you’re sorry, but there is no way to meet their objective.
• Move ahead with the data you have to determine the top video subjects.

Q3. Now that you’ve identified some limitations with Meer-Kitty’s data, you want to communicate your concerns to stakeholders. In addition to insufficient video trend data, your main concern with the indoor paint survey is that the data isn’t representative of the population as a whole.

Clearly, one particular respondent, the superfan, is overrepresented. This means the data doesn’t represent the population as a whole.

When surveying people for Meer-Kitty in the future, what are some best practices you can use to address some of the issues associated with sampling bias? Select all that apply.

• Increase sample size
• Use data that keeps updating
• Use data from only one source
• Use random sampling
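
Random sampling can be sketched with Python's standard `random` module. The population of 1,000 respondent labels, the sample size of 150, and the fixed seed are all assumptions made for the example; none of them come from the Meer-Kitty data.

```python
import random

# Hypothetical population of potential survey respondents.
population = [f"respondent_{i}" for i in range(1, 1001)]

# A seeded generator makes the example repeatable; real surveys would not fix the seed.
rng = random.Random(42)

# sample() draws without replacement, so nobody is picked twice
# and every member has the same chance of selection.
sample = rng.sample(population, 150)

print(len(sample), len(set(sample)))  # 150 150
```

Drawing without replacement also rules out the situation in the scenario where one respondent appears many times.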

Q4. The stakeholders understand your concerns and agree to repeat the indoor paint survey. In a few weeks, you have a much better dataset with more than 150 responses and no duplicates.

Kitty Survey Feedback – New Meer-Kitty survey feedback.csv

You notice that questions 4 and 5 are dependent on the respondent’s answer to question 3. So, you need to determine how many people answered Yes to question 3, then compare that to responses to questions 4 and 5. That way, you will know if questions 4 and 5 have any nulls.

You decide to use a spreadsheet tool that changes how cells appear when they contain the word Yes. Which tool do you use?

• Data validation
• Conditional formatting
• Filtering
• CONCATENATE

Q5. You continue cleaning the data. You use tools such as remove duplicates and COUNTIF to ensure the dataset is complete, correct, and relevant to the problem you’re trying to solve. Then, you complete the verification and reporting processes to share the details of your data-cleaning effort with your team.

While reviewing, your team notes one aspect of data cleaning that would improve the dataset even more. They point out that the new survey also has a new question in Column G: “What are your favorite indoor paint colors?” This was a free-response question, so respondents typed in their answers. Some people included multiple different colors of paint. In order to determine which colors are most popular, it will be necessary to put each color in its own cell.

What spreadsheet function enables you to put each of the colors in Column G into a new, separate cell?

• Delimit
• MID
• Divide
• SPLIT

Scenario 2, questions 6-10

Q6. You’ve completed this program and are interviewing for a junior data scientist position. The job is at B.Spoke Market Research, a company that analyzes market conditions using customer surveys and other research methods. The detailed job description can be found below:

C4 B.Spoke Market Research Job Description.pdf

So far, you’ve had a phone interview with a recruiter and you’ve secured a second interview with the B.Spoke team. The recruiter’s email can be found below:

C4 S2 Email from Recruiter.pdf

You arrive 15 minutes early for your interview. Soon, you are escorted into a conference room, where you meet Jodie Choi, the data science lead. After welcoming you, the behavioral interview begins.

For your first question, your interviewer wants to learn about your experience with spreadsheets. She says: Sometimes the team needs data that is stored in different spreadsheets. So, we use a spreadsheet function to find the information we need.

There is a spreadsheet function that searches for a value in the first column of a given range and returns the value of a specified cell in the row in which it is found. It is called SEARCH.

• True
• False
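
The lookup behavior described in this question (search the first column of a range, then return a value from the row where the key is found) can be sketched in Python. The `vlookup` helper below is a hypothetical, exact-match-only stand-in for the spreadsheet function, and the rows are invented sample data.

```python
def vlookup(search_key, rows, index):
    """Scan the first column of `rows` for `search_key` and return the
    value in the 1-based column `index` of the matching row, or None."""
    for row in rows:
        if row[0] == search_key:
            return row[index - 1]
    return None


# Invented sample range with the lookup key in the first column.
customers = [
    ("64078", "Smith", "Leonardo"),
    ("92862", "Lee", "Natasha"),
    ("55107", "Wallace", "Luciana"),
]

found = vlookup("92862", customers, 2)
missing = vlookup("00000", customers, 2)

print(found, missing)  # Lee None
```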

Q7. Next, your interviewer wants to know more about your understanding of tools that work in both spreadsheets and SQL. She explains that the data her team receives from customer surveys sometimes has many duplicate entries.

She says: Spreadsheets have a great tool for that called remove duplicates. In SQL, you can include DISTINCT to do the same thing. In which part of the SQL statement do you include DISTINCT?

• The FROM statement
• The WHERE statement
• The UPDATE statement
• The SELECT statement
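
Here is a minimal sketch of DISTINCT in a query, using Python's built-in `sqlite3` module as an assumed engine. The `responses` table and its rows are invented; user 588 is repeated to imitate duplicate survey entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (user_id INTEGER)")

# Invented survey rows in which user 588 answered three times.
conn.executemany(
    "INSERT INTO responses VALUES (?)",
    [(588,), (588,), (102,), (588,), (231,)],
)

# DISTINCT follows the SELECT keyword and removes repeated values from the result.
rows = conn.execute(
    "SELECT DISTINCT user_id FROM responses ORDER BY user_id"
).fetchall()

print(rows)  # [(102,), (231,), (588,)]
```

The query only changes what is returned; the duplicate rows remain in the table.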

Q8. Now, your interviewer explains that the data team usually works with very large amounts of customer survey data. After receiving the data, they import it into a SQL table. But sometimes, the new dataset imports incorrectly and they need to change the format.

She asks: What function would you use to convert data in a SQL table from one datatype to another?

• CONVERT
• CHANGE
• CAST
• COALESCE

Q9. Next, your interviewer explains that one of their clients is an online retailer that needs to create product numbers for a vast inventory. Her team does this by combining the text strings for product number, manufacturing date, and color.

She asks: Which SQL function would you use to add strings together to create new text strings?

• COMBINE
• CREATE
• COALESCE
• CONCAT
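
Combining text strings into a new string can be tried with Python's built-in `sqlite3` module. One caveat, stated as an assumption about the environment: older SQLite builds do not ship a CONCAT function, so this sketch uses SQLite's `||` operator, which joins strings in the same way. The product number, date, and color are invented values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The || operator joins text values end to end in SQLite;
# many other databases offer a CONCAT function for the same purpose.
row = conn.execute(
    "SELECT 'P1001' || '-' || '20230115' || '-' || 'blue'"
).fetchone()

product_code = row[0]
print(product_code)  # P1001-20230115-blue
```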

Q10. For your final question, your interviewer explains that her team often comes across data with extra spaces.

She asks: Which function would enable you to eliminate those extra spaces? You respond: To eliminate extra spaces for consistency, use the TRIM function.

• True
• False