DATA SCIENCE FOR SOLUTION ARCHITECTS

Course Overview

This training course helps Solution Architects and other IT practitioners understand the value proposition, methodology, and techniques of the emerging discipline of Data Science. The class also introduces students to several production-ready technologies and capabilities that enable enterprises to build cost-efficient Big Data processing solutions.

Course Objectives

This intensive training course covers the theoretical and technical aspects of Data Science and Business Analytics, including fundamental and advanced concepts and methods for deriving business insights from raw data using cost-effective data processing solutions. Hands-on labs supplement the course and help attendees reinforce the material they have learned.

Course Prerequisites

Participants should have a general knowledge of statistics and programming.

Target Audience

Enterprise Architects, Solution Architects, Information Technology Architects, Business Analysts, Senior Developers, and Team Leads

Course Outline

1. APPLIED DATA SCIENCE

What is Data Science?
Data Science Ecosystem
Data Mining vs. Data Science
Business Analytics vs. Data Science
Who is a Data Scientist?
Data Science Skill Sets Venn Diagram
Data Scientists at Work
Examples of Data Science Projects
An Example of a Data Product
Applied Data Science at Google
Data Science Gotchas
Summary

2. DATA ANALYTICS LIFE-CYCLE PHASES

Big Data Analytics Pipeline
Data Discovery Phase
Data Harvesting Phase
Data Priming Phase
Exploratory Data Analysis
Model Planning Phase
Model Building Phase
Communicating the Results
Production Roll-out
Summary

3. GETTING STARTED WITH R

Introduction
Positioning of R in the Data Science Arena
R Integrated Development Environments
Running R
Running RStudio
Ending the Current R Session
Getting Help
Getting System Information
General Notes on R Commands and Statements
R Data Structures
R Objects and Workspace
Assignment Operators
Assignment Example
Arithmetic Operators
Logical Operators
System Date and Time
Operations
User-defined Functions
User-defined Function Example
R Code Example
Type Conversion (Coercion)
Control Statements
Conditional Execution
Repetitive Execution
Built-in Functions
Reading Data from Files into Vectors
Example of Reading Data from a File
Writing Data to a File
Example of Writing Data to a File
Logical Vectors
Character Vectors
Matrix Data Structure
Creating Matrices
Working with Data Frames
Matrices vs Data Frames
A Data Frame Sample
Accessing Data Cells
Getting Info About a Data Frame
Selecting Columns in Data Frames
Selecting Rows in Data Frames
Getting a Subset of a Data Frame
Sorting (ordering) Data in Data Frames by Attribute(s)
Applying Functions to Matrices and Data Frames
Using the apply() Function
Example of Using apply()
Executing External R commands
Loading External Scripts in RStudio
Listing Objects in Workspace
Removing Objects in Workspace
Saving Your Workspace in R
Saving Your Workspace in RStudio
Saving Your Workspace in R GUI
Loading Your Workspace
Hands-on Exercises
Getting and Setting the Working Directory
Getting the List of Files in a Directory
Diverting Output to a File
Batch (Unattended) Processing
Importing Data into R
Exporting Data from R
Hands-on Exercise
Standard R Packages
Extending R
Extending R in R GUI
Extending R in RStudio
CRAN Page
Summary

4. R STATISTICAL COMPUTING FEATURES

Statistical Computing Features
Descriptive Statistics
Basic Statistical Functions
Examples of Using Basic Statistical Functions
Using the summary() Function
Math Functions Used in Data Analysis
Examples of Using Math Functions
Correlations
Correlation Example
Summary

5. DATA SCIENCE ALGORITHMS AND ANALYTICAL METHODS

Supervised vs Unsupervised Machine Learning
Supervised Machine Learning Algorithms
Unsupervised Machine Learning Algorithms
Choose the Right Algorithm
Life-cycles of Machine Learning Development
Classifying with k-Nearest Neighbors (SL)
k-Nearest Neighbors Algorithm
The Error Rate
Hands-on Exercise
Decision Trees (SL)
Using Decision Trees
Random Forests
Naive Bayes Classifier (SL)
Classification of Documents with Naive Bayes
Unsupervised Learning Type: Clustering
K-Means Clustering (UL)
K-Means Clustering in a Nutshell
Regression Analysis
Types of Regression
Simple Linear Regression Model
Linear Regression Illustration
Least-Squares Method (LSM)
LSM Assumptions
Fitting Linear Regression Models in R
Example of Using R’s lm() Function
Example of Using lm() with a Data Frame
Regression Models in Excel
Hands-on Exercise
Logistic Regression
Regression vs Classification
Time-Series Analysis
Decomposing Time-Series
Summary

6. TEXT MINING

What is Text Mining?
The Common Text Mining Tasks
What is Natural Language Processing (NLP)?
Some of the NLP Use Cases
Machine Learning in Text Mining and NLP
Machine Learning in NLP
TF-IDF
The Feature Hashing Trick
Stemming
Example of Stemming
Stop Words
Popular Text Mining and NLP Libraries and Packages
Summary

7. WHAT IS NOSQL?

Limitations of Relational Databases
Limitations of Relational Databases (Cont’d)
Defining NoSQL
What are NoSQL (Not Only SQL) Databases?
The Past and Present of the NoSQL World
NoSQL Database Properties
NoSQL Benefits
NoSQL Database Storage Types
The CAP Theorem
NoSQL Systems CAP Triangle
Mechanisms to Guarantee a Single CAP Property
Limitations of NoSQL Databases
Big Data Sharding
Sharding Example
Quiz
Quiz Answers
Summary

8. MAPREDUCE OVERVIEW

The Client-Server Processing Pattern
Distributed Computing Challenges
MapReduce Defined
Google’s MapReduce
The Map Phase of MapReduce
The Reduce Phase of MapReduce
MapReduce Explained
MapReduce Word Count Job
MapReduce Shared-Nothing Architecture
Similarity with SQL Aggregation Operations
Example of Map & Reduce Operations using JavaScript
Problems Suitable for Solving with MapReduce
Typical MapReduce Jobs
Fault-tolerance of MapReduce
Distributed Computing Economics
MapReduce Systems
Summary

9. HADOOP OVERVIEW

Apache Hadoop
Apache Hadoop Logo
Typical Hadoop Applications
Hadoop Clusters
Hadoop Design Principles
Hadoop Versions
Hadoop’s Main Components
Hadoop Simple Definition
Side-by-Side Comparison: Hadoop 1 and Hadoop 2
Hadoop-based Systems for Data Analysis
Other Hadoop Ecosystem Projects
Hadoop Caveats
Hadoop Distributions
Cloudera Distribution of Hadoop (CDH)
Cloudera Distributions
Hortonworks Data Platform (HDP)
MapR
Summary

10. HADOOP DISTRIBUTED FILE SYSTEM OVERVIEW

Hadoop Distributed File System (HDFS)
HDFS High Availability
HDFS “Fine Print”
Storing Raw Data in HDFS
Hadoop Security
HDFS Rack-awareness
Data Blocks
Data Block Replication Example
HDFS NameNode Directory Diagram
Accessing HDFS
Examples of HDFS Commands
Other Supported File Systems
WebHDFS
Examples of WebHDFS Calls
Client Interactions with HDFS for the Read Operation
Read Operation Sequence Diagram
Client Interactions with HDFS for the Write Operation
Communication inside HDFS
Summary

11. MAPREDUCE WITH HADOOP

Hadoop’s MapReduce
MapReduce 1 and MapReduce 2
Why Discuss the Old MapReduce?
MapReduce v1 (“Classic MapReduce”)
JobTracker and TaskTracker (the “Classic MapReduce”)
YARN (MapReduce v2)
YARN vs MR1
YARN As Data Operating System
MapReduce Programming Options
Hadoop’s Streaming MapReduce
Python Word Count Mapper Program Example
Python Word Count Reducer Program Example
Setting up Java Classpath for Streaming Support
Streaming Use Cases
The Streaming API vs Java MapReduce API
Amazon Elastic MapReduce
Apache Tez
Summary

12. APACHE PIG SCRIPTING PLATFORM

What is Pig?
Pig Latin
Apache Pig Logo
Pig Execution Modes
Local Execution Mode
MapReduce Execution Mode
Running Pig
Running Pig in Batch Mode
What is Grunt?
Pig Latin Statements
Pig Programs
Pig Latin Script Example
SQL Equivalent
Differences between Pig and SQL
Statement Processing in Pig
Comments in Pig
Supported Simple Data Types
Supported Complex Data Types
Arrays
Defining Relation’s Schema
Not Matching the Defined Schema
The bytearray Generic Type
Using Field Delimiters
Loading Data with TextLoader()
Referencing Fields in Relations
Summary

13. APACHE PIG RELATIONAL AND EVAL OPERATORS

Pig Relational Operators
Example of Using the JOIN Operator
Example of Using the Order By Operator
Caveats of Using Relational Operators
Pig Eval Functions
Caveats of Using Eval Functions (Operators)
Example of Using Single-column Eval Operations
Example of Using Eval Operators For Global Operations
Summary

14. HIVE

What is Hive?
Apache Hive Logo
Hive’s Value Proposition
Who uses Hive?
Hive’s Main Sub-Systems
Hive Features
The “Classic” Hive Architecture
The New Hive Architecture
HiveQL
Where are the Hive Tables Located?
Hive Command-line Interface (CLI)
The Beeline Command Shell
Summary

15. HIVE COMMAND-LINE INTERFACE

Hive Command-line Interface (CLI)
The Hive Interactive Shell
Running Host OS Commands from the Hive Shell
Interfacing with HDFS from the Hive Shell
The Hive in Unattended Mode
The Hive CLI Integration with the OS Shell
Executing HiveQL Scripts
Comments in Hive Scripts
Variables and Properties in Hive CLI
Setting Properties in CLI
Example of Setting Properties in CLI
Hive Namespaces
Using the SET Command
Setting Properties in the Shell
Setting Properties for the New Shell Session
Setting Alternative Hive Execution Engines
The Beeline Shell
Connecting to the Hive Server in Beeline
Beeline Command Switches
Beeline Internal Commands
Summary

16. HIVE DATA DEFINITION LANGUAGE

Hive Data Definition Language
Creating Databases in Hive
Using Databases
Creating Tables in Hive
Supported Data Type Categories
Common Numeric Types
String and Date / Time Types
Miscellaneous Types
Example of the CREATE TABLE Statement
Working with Complex Types
Table Partitioning
Table Partitioning on Multiple Columns
Viewing Table Partitions
Row Format
Data Serializers / Deserializers
File Format Storage
File Compression
More on File Formats
The ORC Data Format
Converting Text to ORC Data Format
The EXTERNAL DDL Parameter
Example of Using EXTERNAL
Creating an Empty Table
Dropping a Table
Table / Partition(s) Truncation
Alter Table/Partition/Column
Views
Create View Statement
Why Use Views?
Restricting Amount of Viewable Data
Examples of Restricting Amount of Viewable Data
Creating and Dropping Indexes
Describing Data
Summary

17. APACHE SQOOP

What is Sqoop?
Apache Sqoop Logo
Sqoop Import / Export
Sqoop Help
Examples of Using Sqoop Commands
Data Import Example
Fine-tuning Data Import
Controlling the Number of Import Processes
Data Splitting
Helping Sqoop Out
Example of Executing Sqoop Load in Parallel
A Word of Caution: Avoid Complex Free-Form Queries
Using Direct Export from Databases
Example of Using Direct Export from MySQL
More on Direct Mode Import
Data Export from HDFS
Export Tool Common Arguments
Data Export Control Arguments
Data Export Example
INSERT and UPDATE Statements
INSERT Operations
UPDATE Operations
Example of the Update Operation
Failed Exports
Sqoop2
Summary

18. INTRODUCTION TO FUNCTIONAL PROGRAMMING

What is Functional Programming (FP)?
Terminology: Higher-Order Functions
Terminology: Lambda vs Closure
A Short List of Languages that Support FP
FP with Java
FP With JavaScript
Imperative Programming in JavaScript
The JavaScript map (FP) Example
The JavaScript reduce (FP) Example
Using reduce to Flatten an Array of Arrays (FP) Example
The JavaScript filter (FP) Example
Common Higher-Order Functions in Python
Common Higher-Order Functions in Scala
Elements of FP in R
Summary

19. INTRODUCTION TO APACHE SPARK

What is Apache Spark?
A Short History of Spark
Where to Get Spark?
The Spark Platform
Spark Logo
Common Spark Use Cases
Languages Supported by Spark
Running Spark on a Cluster
The Driver Process
Spark Applications
Spark Shell
The spark-submit Tool
The spark-submit Tool Configuration
The Executor and Worker Processes
The Spark Application Architecture
Interfaces with Data Storage Systems
Limitations of Hadoop’s MapReduce
Spark vs MapReduce
Spark as an Alternative to Apache Tez
The Resilient Distributed Dataset (RDD)
Spark Streaming (Micro-batching)
Spark SQL
Example of Spark SQL
Spark Machine Learning Library
GraphX
Spark vs R
Summary

20. THE SPARK SHELL

The Spark Shell
The Spark Shell UI
Spark Shell Options
Getting Help
The Spark Context (sc) and SQL Context (sqlContext)
The Shell Spark Context
Loading Files
Saving Files
Basic Spark ETL Operations
Summary

21. SPARK RDDS

The Resilient Distributed Dataset (RDD)
Ways to Create an RDD
Custom RDDs
Supported Data Types
RDD Operations
RDDs are Immutable
Spark Actions
RDD Transformations
Other RDD Operations
Chaining RDD Operations
RDD Lineage
The Big Picture
What May Go Wrong
Checkpointing RDDs
Local Checkpointing
Parallelized Collections
More on parallelize() Method
The Pair RDD
Where do I use Pair RDDs?
Example of Creating a Pair RDD with Map
Example of Creating a Pair RDD with keyBy
Miscellaneous Pair RDD Operations
RDD Caching
RDD Persistence
The Tachyon Storage
Summary

22. PARALLEL DATA PROCESSING WITH SPARK

Running Spark on a Cluster
Spark Stand-alone Option
The High-Level Execution Flow in Stand-alone Spark Cluster
Data Partitioning
Data Partitioning Diagram
Single Local File System RDD Partitioning
Multiple File RDD Partitioning
Special Cases for Small-sized Files
Parallel Data Processing of Partitions
Spark Application, Jobs, and Tasks
Stages and Shuffles
The “Big Picture”
Summary

23. THE SPARK MACHINE LEARNING LIBRARY

What is MLlib?
Supported Languages
MLlib Packages
Dense and Sparse Vectors
Labeled Point
Python Example of Using the LabeledPoint Class
LIBSVM Format
An Example of a LIBSVM File
Loading LIBSVM Files
Local Matrices
Example of Creating Matrices in MLlib
Distributed Matrices
Example of Using a Distributed Matrix
Classification and Regression Algorithms
Clustering
Summary

LAB EXERCISES

Lab 1. Getting Started with R
Lab 2. Working with R
Lab 3. Data Import and Export in R
Lab 4. k-Nearest Neighbors Algorithm
Lab 5. Simple Linear Regression
Lab 6. Common Text Mining Tasks with the tm Library
Lab 7. Learning the Lab Environment
Lab 8. The Hadoop Distributed File System
Lab 9. Getting Started with Apache Pig
Lab 10. Working with Data Sets in Apache Pig
Lab 11. The Hive and Beeline Shells
Lab 12. Hive Data Definition Language
Lab 13. The Spark Shell
Lab 14. Spark ETL and HDFS Interface
Lab 15. Using k-means Algorithm from MLlib

Frequently Asked Questions

None

"As a client of Makintouch, I would recommend the company as a leading PC-based training school. When we needed half day courses to fit the schedules of our employees they were able to customize them to meet our company’s needs" – Java Struts

Taiwo Alaka

MTN

"I would recommend this course as it is up to date with the current release, which is quite rare as things move on so fast. The classroom setup worked well and the labs were good and relevant to the course." – Oracle

Kayode Akinpelu

Stanbic IBTC
