The complete Big Data Integration and Processing course is currently offered by UC San Diego through the Coursera platform and is Course 3 of 6 in the Big Data Specialization.
About this Course: This course is for those new to data science. Completion of Intro. to Big Data is recommended. No prior programming experience is needed, although the ability to install applications and utilize a virtual machine is necessary to complete the hands-on assignments. Refer to the specialization technical requirements for complete hardware and software specifications.

Big Data Integration & Processing Coursera Quiz 1 Answers – Retrieving Big Data Quiz!
Q1. What does it mean for a query language to be declarative?
- The language specifies the process of how to obtain the data.
- The language specifies both the process of how to obtain the data and specifies what data to obtain.
- The language specifies what data to obtain.
- A language-specific declaration of data types in order to define the method of data retrieval.
Q2. Use the following table named “user_table” to answer the next 2 problems.
userId  username  email
1       admin     admin@corporate.moe
2       h4xor     1337@rawr.cte
How would you go about querying the entire username column (however many rows)?
- SELECT user_table FROM username
- SELECT username FROM user_table
- SELECT username FROM user_table WHERE userId=1
- SELECT username FROM userId WHERE *
Q3. How would you go about querying the entire database table (please refer to question 2’s table)?
- SELECT user_table FROM *
- SELECT * FROM * WHERE user_table
- SELECT username, email FROM userId
- SELECT * FROM user_table
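The two query shapes in Q2 and Q3 can be tried out with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for the course's Postgres setup and rebuilds the two-row user_table shown in Q2. The ORDER BY clause is added here only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for the course's Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (userId INTEGER, username TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO user_table VALUES (?, ?, ?)",
    [(1, "admin", "admin@corporate.moe"), (2, "h4xor", "1337@rawr.cte")],
)

# One column, every row.
usernames = [
    row[0]
    for row in conn.execute("SELECT username FROM user_table ORDER BY userId")
]

# Every column, every row.
all_rows = conn.execute("SELECT * FROM user_table ORDER BY userId").fetchall()

print(usernames)
print(all_rows)
```

Both statements are declarative in the sense of Q1: they name the data wanted and leave the retrieval steps to the database engine.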
Q4. What is the global indexing table?
- A global table that uses a specific technique called indexing and the table uses an index as the primary key.
- An index table in order to keep track of a given data type that might exist within multiple machines.
- An index table in order to keep track of data records within one machine.
- An index table in order to keep track of a given data type that might exist within one machine.
Q5. What are the three computing steps of a semi-join?
- Project, Ship, Reduce
- Project, Decompose, Send
- Index, Join, Display
- Query, Join, Display
- None Applicable
Q6. What is the purpose of a semi-join?
- Another name for join: an operation to combine two tables by column.
- Increase the efficiency of sending data across multiple machines.
- Increase the speed of the join for trade-off of increased data transmission cost.
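For readers who want to see a semi-join in motion, here is a rough sketch in plain Python, following the usual project, ship, reduce description of the technique. The two lists stand in for tables stored on two different machines. The table contents and variable names are hypothetical, and the "ship" step is simulated by a simple copy.

```python
# Table held on machine A: (customer_id, name). Hypothetical data.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Di")]
# Table held on machine B: (order_id, customer_id, amount). Hypothetical data.
orders = [(100, 2, 9.5), (101, 4, 3.0), (102, 2, 1.25)]

# Step 1, project: on machine B keep only the distinct join-column values.
join_keys = {customer_id for _, customer_id, _ in orders}

# Step 2, ship: send the small key set to machine A (simulated by a copy).
shipped_keys = set(join_keys)

# Step 3, reduce: on machine A keep only the rows that can take part in the join.
reduced_customers = [row for row in customers if row[0] in shipped_keys]

# Only the reduced rows would then need to cross the network for the final join.
joined = [
    (name, order_id, amount)
    for cid, name in reduced_customers
    for order_id, order_cid, amount in orders
    if cid == order_cid
]

print(reduced_customers)
print(joined)
```

In this toy data, two of the four customer rows are dropped before any full rows are transferred.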
Q7. What is a subquery?
- A query statement within another query.
- A shorter query than normal.
- An alternative query that acts as a substitute for another query.
Q8. What is a correlated subquery?
- A type of query that contains a subquery that requires information from a query one level up.
- A type of query that contains a relationship between a variable attribute x and a variable attribute y. The two variables have a dependent relationship causing a correlation.
- A type of query that requires two tables in order to calculate values.
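A subquery that reads a value from the enclosing query can be illustrated with a small sketch. It again assumes Python's sqlite3 module in place of Postgres. The scores table, its columns, and the aliases outer_row and inner_row are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (team TEXT, player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("red", "ann", 10), ("red", "bob", 30), ("blue", "cy", 5), ("blue", "di", 7)],
)

# The inner query refers to outer_row.team, a value from the query one level up,
# so it is evaluated against each outer row's team.
above_team_average = conn.execute(
    """
    SELECT player
    FROM scores AS outer_row
    WHERE points > (
        SELECT AVG(points)
        FROM scores AS inner_row
        WHERE inner_row.team = outer_row.team
    )
    ORDER BY player
    """
).fetchall()

print(above_team_average)
```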
Q9. What is the purpose of GROUP BY queries?
- Enables calculations based on specific columns of the table.
- Enables queries within queries.
- Required before you can use functions like AVG, SUM, MIN, MAX, COUNT.
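The behaviour of GROUP BY can be checked with a short sketch, assuming sqlite3 as a stand-in for Postgres. The gameclicks table here is a hypothetical five-row miniature, not the course data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gameclicks (userId INTEGER, teamLevel INTEGER)")
conn.executemany(
    "INSERT INTO gameclicks VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 2), (3, 2), (3, 3)],
)

# GROUP BY: one aggregate value per distinct teamLevel.
per_level = conn.execute(
    "SELECT teamLevel, COUNT(*) FROM gameclicks GROUP BY teamLevel ORDER BY teamLevel"
).fetchall()

# An aggregate without GROUP BY covers the whole table as a single group.
overall_max = conn.execute("SELECT MAX(teamLevel) FROM gameclicks").fetchone()[0]

print(per_level)
print(overall_max)
```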
Q10. Consider the following generic statement for questions 10-12:
db.<collection>.find(<query filter>, <projection>).<cursor modifier>
Which part of the statement would reflect that of the FROM statement in SQL as illustrated in the lecture?
- <query filter>
- <collection>
- <cursor modifier>
- <projection>
Q11. Which part of the statement would reflect that of the SELECT statement in SQL as illustrated in the lecture?
- <query filter>
- <projection>
- <cursor modifier>
- <collection>
Q12. Which part of the statement would reflect that of the WHERE statement in SQL as illustrated in the lecture?
- <projection>
- <cursor modifier>
- <query filter>
- <collection>
Q13. A sample part of the data structure is as follows:
{ _id:1, userIndex: 10, email: "arealeamil@notreallu.asd", retainRate:2}
What would be the most likely statement that we would need to grab email info for user indexes greater than 24?
- db.userIndex.find({email:{$gt:24}}, {_id:0})
- db.email.find({userIndex:{$gt:24}}, {email:1, _id:0})
- db.userIndex.find({email:{$lte:24}}, {_id:0})
- db.email.find({userIndex:{$lte:24}}, {email:1, _id:0})
Q14. What does it mean to have a _id:0 within our query statement?
- Grab the first object in the results.
- Grab as many objects as possible.
- Does not have an effect, simple convention left for compatibility issues.
- Tell MongoDB not to return a document id.
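The roles of the query filter and the projection can be mimicked without a MongoDB server. The sketch below is a deliberately tiny stand-in written in plain Python, not the MongoDB driver: the find function here is hypothetical, supports only $gt filters and inclusion projections, and the documents are made up to match the shape of the Q13 sample.

```python
# Hypothetical documents shaped like the sample in Q13.
email_collection = [
    {"_id": 1, "userIndex": 10, "email": "a@example.com", "retainRate": 2},
    {"_id": 2, "userIndex": 25, "email": "b@example.com", "retainRate": 1},
    {"_id": 3, "userIndex": 40, "email": "c@example.com", "retainRate": 3},
]


def find(collection, query_filter, projection):
    """Toy stand-in for find(): only {"field": {"$gt": n}} filters and
    inclusion projections with an optional "_id": 0 are supported."""
    results = []
    for doc in collection:
        # Query filter: decides which documents are returned.
        if all(doc[field] > cond["$gt"] for field, cond in query_filter.items()):
            # Projection: decides which fields of each document are returned.
            keep = {field for field, flag in projection.items() if flag == 1}
            # _id is returned by default unless the projection sets it to 0.
            if projection.get("_id", 1) != 0:
                keep.add("_id")
            results.append({k: v for k, v in doc.items() if k in keep})
    return results


emails = find(email_collection, {"userIndex": {"$gt": 24}}, {"email": 1, "_id": 0})
with_ids = find(email_collection, {"userIndex": {"$gt": 24}}, {"email": 1})

print(emails)
print(with_ids)
```

Comparing the two results shows the effect of _id:0: the same documents match, but the second call also carries each document's _id.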
Big Data Integration & Processing Coursera Quiz 2 Answers: Postgres, MongoDB and Pandas
Q1. What is the highest level that the team has reached in game-clicks? (Hint: use the MAX operation in Postgres).
- 6
- 8
- 9
- 10
- 7
Q2. How many user IDs (repeats allowed) have reached the highest level as found in the previous question? (Hint: for Postgres, you may either use two queries or use a sub-query).
- 106436
- 67271
- 122757
- 51294
- 98823
Q3. How many user IDs (repeats allowed) reached the highest level in game-clicks and also clicked the highest costing price in buy-clicks? Hint: Refer to question 4 for ideas.
- 66887
- 32747
- 23301
- 73226
Q4. What does the following line of code do in Postgres?
SELECT count(userid) FROM (SELECT buyclicks.userId, teamLevel, price FROM buyclicks JOIN gameclicks on buyclicks.userId = gameclicks.userId) temp WHERE price=3 and teamLevel=5;
- Displays the users who have bought items worth $3 and have had a team with level 5.
- This is an invalid line of code, the subquery is not formatted properly.
- Counts the users who exist in both the gameclicks and buyclicks files.
- Finds the total number of user ids (repeats allowed) in buy-clicks that have bought items with prices worth $3 and were in a team with level 5 at some point in time.
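The statement in Q4 can be run on a toy data set to see how repeats are counted. This sketch assumes sqlite3 in place of Postgres and uses hypothetical miniature buyclicks and gameclicks tables. The subquery alias is renamed from temp to joined to steer clear of SQLite's TEMP keyword, and userId is aliased explicitly; otherwise the query mirrors the quiz.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buyclicks (userId INTEGER, price REAL)")
conn.execute("CREATE TABLE gameclicks (userId INTEGER, teamLevel INTEGER)")
# User 1 bought a $3 item twice; user 2 never reached level 5; user 3 paid $20.
conn.executemany(
    "INSERT INTO buyclicks VALUES (?, ?)", [(1, 3), (1, 3), (2, 3), (3, 20)]
)
conn.executemany(
    "INSERT INTO gameclicks VALUES (?, ?)", [(1, 5), (2, 4), (3, 5)]
)

# Inner query: join the two tables on userId.
# Outer query: count the joined rows with price 3 and team level 5.
count = conn.execute(
    """
    SELECT count(userId) FROM (
        SELECT buyclicks.userId AS userId, teamLevel, price
        FROM buyclicks JOIN gameclicks ON buyclicks.userId = gameclicks.userId
    ) AS joined
    WHERE price = 3 AND teamLevel = 5
    """
).fetchone()[0]

print(count)
```

User 1 contributes two rows to the count, which is what "repeats allowed" means in the surrounding questions.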
Q5. In the MongoDB data set, what is the username of the Twitter account that has a tweet_followers_count of exactly 8973882?
- CreateImga
- Autocenterit
- FIFAcom
- SasSpear
Big Data Integration & Processing Coursera Quiz 3 Answers – Information Integration
Q1. What is the main problem with big data information integration?
- Pay-as-you-go model
- Probabilistic Schema Mapping
- Many sources
- Mediated Schema
Q2. What would be the two possible solutions associated with “big data” information integration as mentioned in lecture? (Choose 2)
- Probabilistic Schema Mapping
- Customer Transactions
- Pay-as-you-go Model
- Mediated Schema
- Attribute Grouping
Q3. What are mediated schemas?
- Schemas created from customer info.
- Schemas created entirely from attribute grouping.
- A type of probabilistic schema mapping.
- Schema created from integrating two or more schemas.
Q4. In attribute grouping, how would one evaluate if two attributes should go together? (Choose 2)
- Probability of Two Attributes Co-occurring
- Integrated Views
- Similarity of Attributes
- Customer Interaction
- Candidate Designs
Q5. What is a data item?
- Data found in a customer transaction.
- Data that represents an aspect of a real-world entity.
- The real worth of a data value.
- Data found in a mediated schema.
Q6. What is data fusion?
- Extracting a global value from a data source.
- Extracting true sources from a data source.
- Extracting the true value of a data item.
- Another term for customer analytics.
Q7. What is a potential problem of having too many data sources as mentioned in lecture?
- Too much data processing required for compression.
- Too many data values.
- Schema mapping becomes impossible.
- None, the problem is not a problem when using big data methodologies.
Q8. What do we mean when we say “the true value of a data item”?
- Extrapolated data from a data item that represents the worth of that item.
- Data created from statistical estimations.
- Another term for data fusion.
Q9. What is a potential method to deal with too many data sources as mentioned in lecture?
- Compare and weigh each source by their trustworthiness.
- Randomly select a sample of sources to represent the various data sources.
- None, the more the better.
- Take fewer samples per tick.
Big Data Integration & Processing Coursera Quiz 4 Answers – Hands-On with Splunk
Q1. Which of the queries below will return the average population of the counties in Georgia (be careful not to include the population of the state of Georgia itself)?
- None of the above
- source="census.csv" CTYNAME != "Georgia" STNAME="Georgia" | stats sum(CENSUS2010POP)
- source="census.csv" CTYNAME != "Georgia" STNAME="Georgia" | stats mean(CENSUS2010POP)
- source="census.csv" STNAME="Georgia" | stats mean(CENSUS2010POP)
Q2. What is the average population of the counties in the state of Georgia (be careful not to include the population of the state of Georgia itself)?
- 394383.53786
- 45373.454788
- 243767.4564
- 60928.635220
Q3. Of the options below, which query allows you to find the state with the most counties?
- source="census.csv" | stats count by CENSUS2010POP | sort count
- stats count by STNAME | sort -count
- source="census.csv" | stats count by CTYNAME | sort num(count)
- source="census.csv" | stats count by STNAME | sort count desc
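The idea behind counting rows per state and then sorting the counts in descending order can be sketched outside Splunk. The plain-Python version below is only an analogy to that pipeline, not Splunk syntax, and the handful of (STNAME, CTYNAME) rows are hypothetical rather than taken from census.csv.

```python
from collections import Counter

# Hypothetical rows shaped like census.csv: (STNAME, CTYNAME).
rows = [
    ("Texas", "Harris County"),
    ("Texas", "Dallas County"),
    ("Texas", "Travis County"),
    ("Georgia", "Fulton County"),
    ("Georgia", "Cobb County"),
    ("Alaska", "Juneau City and Borough"),
]

# Analogue of a count grouped by STNAME: one count per state.
count_by_state = Counter(state for state, _ in rows)

# Analogue of sorting by count, largest first.
ranked = sorted(count_by_state.items(), key=lambda item: item[1], reverse=True)
top_state = ranked[0][0]

print(ranked)
print(top_state)
```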
Q4. What state contains the most counties?
- Texas
- California
- Georgia
- Alaska
Q5. Of the options below, which query allows you to find the most populated counties in the state of Texas?
- STNAME="Texas" CENSUS2010POP > 100000 | sort -CENSUS2010POP | table CENSUS2010POP,CTYNAME
- STNAME="Texas" CENSUS2010POP > 100000 | sort CENSUS2010POP desc | table CENSUS2010POP,CTYNAME
- Both
- Neither
Q6. What is the most populated county in the state of Texas?
- Harris
- Dallas
- Travis
- Bexar
Big Data Integration & Processing Coursera Quiz 5 Answers – Pipeline and Tools
Q1. What is data-parallelism as defined in the lecture?
- Having multiple data pipelines at the same time.
- Simultaneously processing input data from multiple cores.
- Running the same function simultaneously for the partitions of a data set on multiple cores.
- At each step of the data pipeline, process values simultaneously by using multiple cores.
Q2. Of the following, which procedure best generalizes big data procedures such as (but not limited to) the map-reduce process?
- split->sort->merge
- split->do->merge
- split->map->shuffle and sort->reduce
- split->shuffle and sort->map->reduce
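A split, do, merge flow that runs the same function on every partition of a data set can be sketched in a few lines. This is an illustration under assumptions, not the lecture's code: the helper names split, do, and merge are hypothetical, and a thread pool is used for portability (a process pool would be the usual choice for spreading CPU-bound work over multiple cores in Python).

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut the data into at most `parts` contiguous partitions."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def do(partition):
    """The same function runs on every partition: a partial sum of squares."""
    return sum(x * x for x in partition)


def merge(partials):
    """Combine the per-partition results into one answer."""
    return sum(partials)


data = list(range(1, 11))
partitions = split(data, 3)

# Each partition is handed to a worker that applies the same function.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(do, partitions))

total = merge(partials)
print(partitions, partials, total)
```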
Q3. What are the three layers for the Hadoop Ecosystem? (Choose 3)
- Data Manipulation and Integration
- Data Management and Storage
- Data Integration and Processing
- Coordination and Workflow Management
- Data Creation and Storage
Q4. What are the 5 key points in order to categorize big data systems?
- Execution model, Latency, Scalability, Programming Language, Fault Tolerance
- Coordination, Latency, Productivity, Speed, Fault Tolerance
- Execution model, Speed, Scalability, Flexibility, Fault Tolerance
- Coordination, Latency, Productivity, Flexibility, Fault Tolerance
Q5. What is the lambda architecture as shown in lecture?
- A type of hybrid data processing architecture.
- A type of architecture that only contains part of the data processing method.
- A type of swappable data processing layer.
- An architecture that natively supports lambda calculus.
Q6. Which of the following scenarios is NOT an aggregation operation?
- Counting the total number of data per type.
- Averaging the total number of data per type.
- Removing undefined values.
- Counting the total number of data.
Q7. What usually happens to data when aggregated as mentioned in lecture?
- Data becomes organized.
- Data becomes smaller.
- Data becomes personalized.
- Data becomes faster to process.
Q8. What is K-means clustering?
- Divide samples using k lines.
- Classify data by k decisions.
- Group samples into k clusters.
- Classify data by k actions.
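To make the idea of k-means concrete, here is a minimal one-dimensional sketch of the standard assign-then-update loop (Lloyd's algorithm). The function name k_means_1d, the sample values, and the fixed starting centroids are all assumptions chosen for illustration; a real workload would use a library such as Spark MLlib.

```python
def k_means_1d(samples, centroids, iterations=10):
    """Group 1-D samples into len(centroids) clusters."""
    for _ in range(iterations):
        # Assignment step: each sample joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in samples:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


samples = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(samples, centroids=[0.0, 5.0])

print(centroids)
print(clusters)
```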
Q9. Why is Hadoop not a good platform for machine learning as mentioned in lecture? (Choose 4)
- Too massive.
- Requires nodes and multiple machines.
- Bottleneck using HDFS.
- Map and Reduce Based Computation.
- Unable to support machine learning.
- No interactive shell and streaming.
- Java support only.
Q10. What are the layers (parts) of Spark? (Choose 5)
- SparkSQL
- GraphX
- MLlib
- Spark Graph
- Spark Core
- Spark RDD
- Spark Streaming
- Worker Node
Q11. What is in-memory processing?
- Having the pipeline completely in disk.
- Writing data to disk between pipeline steps.
- Writing data to memory between pipeline steps.
- Having the pipeline completely in memory.
- Having the input completely in disk.
- Having the input completely in memory.
Big Data Integration & Processing Coursera Quiz 6 Answers – Word Count in Spark
Q1. What does the following line of code do?
words = lines.flatMap(lambda line: line.split(" "))
- Each line in the document is split up into words.
- Each line in the document is split into various Spark partitions.
- Each word in each line is counted.
- Each word is merged into lines to be counted later.
Q2. What does the following line of code imply about the state of partitions before the action is performed?
words = lines.flatMap(lambda line: line.split(" "))
- Each Spark partition corresponds to a line in the document.
- Each Spark partition corresponds to a word in the document.
- There is only one single partition containing the full document.
Q3. When the following command is executed, where is the file written and how can it be accessed?
counts.coalesce(1).saveAsTextFile('hdfs:/user/cloudera/wordcount/outputDir')
- HDFS and through the system directory with the “cd” terminal command.
- HDFS and through the “hadoop fs” command.
- The local file system and through the “hadoop fs” command.
- The local file system and through the directory with the “cd” terminal command.
Q4. What does the number one (1) allow us to do in the following line of code?
tuples = words.map(lambda word: (word,1))
- The number represents the number of partitions in charge of counting each line.
- The number represents the number of partitions in charge of keeping track of each word.
- None, completely arbitrary in order to apply an algorithm that requires a tuple.
- Treat each word with a weight of one during the counting process.
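The word-count steps quoted in this quiz can be followed without a Spark cluster. The sketch below mimics them in plain Python rather than PySpark: itertools stands in for flatMap, a list comprehension for map, and a dictionary sum for a reduceByKey-style step that the quiz itself does not show. The two input lines are hypothetical.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or not", "to be"]

# flatMap analogue: split each line into words and flatten into one sequence.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map analogue: pair every word with 1 so each occurrence carries a weight of one.
tuples = [(word, 1) for word in words]

# reduceByKey-style analogue: add up the weights for each distinct word.
totals = defaultdict(int)
for word, weight in tuples:
    totals[word] += weight
counts = dict(totals)

print(words)
print(counts)
```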
Big Data Integration & Processing Coursera Quiz 7 Answers – More on Spark
Q1. Which part of Spark is in charge of creating RDDs?
- Driver Program
- Local CPU
- Storage
- Spark Executor
- Worker Node
Q2. How does lazy evaluation work in Spark?
- Transformations are queued and executed at a certain threshold.
- Transformations are not executed until the action stage.
- Actions are queued and executed at a certain threshold.
- Actions are not executed until the transformation stage.
Q3. What are the consequences of lazy evaluation as mentioned in lecture?
- Errors sometimes do not show up until the action stage.
- Hiccups within the system during queue execution.
- There are no consequences.
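Lazy evaluation can be observed in plain Python as well, since generators also defer their work until something consumes them. The sketch below is an analogy to Spark's transformation and action split, not Spark code. The parse function, the log list, and the input records are hypothetical.

```python
log = []


def parse(value):
    log.append(value)  # record when the work actually happens
    return int(value)


raw = ["1", "2", "oops"]

# "Transformation": building the generator runs nothing yet.
parsed = (parse(v) for v in raw)
ran_before_action = list(log)  # still empty at this point

# "Action": consuming the generator finally runs parse(), so the bad record
# only fails here, away from the line that defined the transformation.
try:
    result = list(parsed)
except ValueError as exc:
    error = str(exc)

print(ran_before_action)
print(log)
print(error)
```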
Q4. What is a wide transformation?
- A transformation that requires data shuffling across node partitions.
- Transformations that take a lot of nodes to complete.
- A longer time-taking transformation compared to narrow transformations.
- The name for the most used transformations.
Q5. Where does the data for each worker node get sent to after a collect function is called?
- Other Worker Nodes
- Spark Streaming
- Spark Context
- None; Stays in the Same Node
- Spark SQL
Q6. What are DataFrames?
- A special type of data node that contains framework to manipulate SQL.
- A column-like data format that can be read by Spark SQL.
- A type of narrow transformation.
Q7. Can RDDs be converted into DataFrames directly without manipulation?
- Yes
- No: lines have to be converted into rows.
- No: RDDs need to be made relational first.
- No: RDDs cannot be converted into DataFrames.
Q8. What is the function of Spark SQL as mentioned in lecture? (Choose 3)
- Efficient data manipulation using SQL-like structure.
- Enables relational queries on Spark.
- Deploy business intelligence tools over Spark.
- Connect to variety of databases.
- Better ability to manipulate big data.
- Better worker node interpolation.
Q9. What is a triplet in GraphX?
- A type of data to contain vertex info.
- A type of data to contain the information on connections between vertices and edges.
- A type of data to contain both edge and vertex info.
- A type of data to contain edge info.
Big Data Integration & Processing Coursera Quiz 8 Answers – SparkSQL and Spark Streaming
Q1. What does the following filter line of code do?
df.filter(df["teamlevel"] > 1)
- Filter each row to show only team levels larger than 1.
- Filter each column to show only team levels larger than 1.
- Select the first two columns of the data and filter each column to show only team levels larger than 1.
- Select the first two columns of the data and displays only team levels greater than 1.
Q2. What does the following do?
df.select("userid", "teamlevel").show(5)
- Select the rows named “userid” and “teamlevel” and display first 5 rows.
- Display all rows except “userid” and “teamlevel”.
- Select the columns named “userid” and “teamlevel” and display first 5 rows.
- Display all columns except “userid” and “teamlevel”.
Q3. What does the 1 represent in the following line of code?
ssc = StreamingContext(sc,1)
- To create only one partition to manage the stream.
- To specify debug output.
- To create one single context.
- A batch interval of 1 second.
Q4. What does the following code do?
window = vals.window(10, 5)
- Creates a window that combines 10 seconds worth of data and moves by 5 seconds.
- Creates 10 windows with 5 seconds worth of data in them.
- Creates 10 windows with 5 batch intervals in between.
- Creates a batch interval between 10 seconds and 5 seconds.
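A sliding window over a stream of 1-second batches can be simulated in plain Python. The sketch below is an analogy to a window with a length of 10 seconds and a slide of 5 seconds, not Spark Streaming code. The windows function, its parameter names, and the 20 one-record batches are hypothetical.

```python
def windows(batches, window_length, slide_interval):
    """Yield (end_time, combined_records) for a sliding window over 1-second batches.

    batches[t] holds the records that arrived during second t + 1.
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        # Combine every batch that falls inside the current window.
        combined = [rec for batch in batches[start:end] for rec in batch]
        yield end, combined


# Twenty 1-second batches with one record each (the record is the second number).
batches = [[t] for t in range(1, 21)]
result = list(windows(batches, window_length=10, slide_interval=5))

for end, combined in result:
    print(end, combined)
```

In this simulation a window is emitted every 5 seconds, and once the stream has run for at least 10 seconds each window holds the last 10 seconds of records, so consecutive windows overlap by 5 seconds.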
Big Data Integration & Processing Coursera Quiz 9 Answers – Check Your Query Results
Q1. How many tweets have location not null?
- 6937
- 6945
- No option applicable.
- 5957
- 6973
Q2. How many people have more followers than friends? (Hint: use this.user instead of user).
- 6238
- 5809
- 5590
- 6673
- 5206
Q3. Perform a query that returns the text of tweets which have the string “http://”. Which of the following substrings do NOT occur in the results? (Choose all that apply)
- @Infosmessi_
- @DundalkFC
- @Ass0Star
- @espn
- @TerraceImages
Q4. Query: Return all the tweets which contain text “England” but not “UEFA”. In these results the string “Euro 2016” appears in…
- 2 tweets
- 3 tweets
- 0 tweets
- More than 6 tweets.
- 5 tweets
Q5. Query: Get all the tweets from the location “Ireland” which also contain the string “UEFA”. In this result the user with the highest friends count is…
- Pauldonaghue
- ProfitwatchInfo
- irishexaminer
- DerekRantsGames
- Insight4News4
Big Data Integration & Processing Coursera Quiz 10 Answers – Check your Analysis Results
Q1. How many different countries are mentioned in at least one tweet?
- 44
- 112
- 211
- 64
Q2. How many times is any country mentioned in a tweet?
- 52
- 211
- 397
- 26634
Q3. What are the three countries with the highest mention count?
- Nigeria, Slovakia, Germany
- Thailand, Iceland, Mexico
- Norway, Nigeria, France
- Thailand, Mexico, Denmark
Q4. How many times was France mentioned in a tweet?
- 25
- 8
- 42
- 30
Q5. Which country was mentioned most: Kenya, Wales, or the Netherlands?
- Netherlands
- Wales
- Kenya
Q6. What is the average number of times a country is mentioned? (Round to the nearest integer)
- 44
- 15
- 9
- 3