Key Features:

Primary Audience: Services organizations building a Spark consulting practice, and product development start-ups and companies

Total Duration: 20 Full Days (10 Days Instruction + 10 Days Mentored Project)

Prerequisites: Basic familiarity with HDFS and some exposure to Java programming.

Minimum Machine Requirements: Laptops with 8 GB RAM, Mac OS X or Linux, Hadoop installed and configured, Apache Spark downloaded, and a WiFi connection.

Note: All training exercises and hands-on sessions will be done on individual laptops.

Training Equipment Required: Whiteboard and marker pens. Ability to project from a Mac (HDMI connector or Chromecast).

Teaching Style: >80% hands-on content (we sit and code a lot).

Course Contents: The content and duration of the various modules can be adjusted to an organization's specific requirements.

Note: This is a data engineering course focused on the tools, programming, and architecture of a Spark-based application / product using Scala; it is not a data science course.

Course Content: Spark & Scala Learning Modules

Module I: High-Speed Hands-on Introduction to Spark (using Scala) – 2 Days

  • Scala Basics: Arithmetic, Strings, Variables, Functions, Functional programming, Arrays, Lists
  • RDDs: Parallelize, Filter, Collect, Map, FlatMap, Reduce, Read from files, Subset, Count, Union, Persist / Cache, CSV files, countByValue, toList
  • Programs: Writing Spark-Scala programs, Using SBT
  • Spark SQL: Understanding DataFrames / Datasets, Displaying DataFrames / Datasets, Specifying schema, Reading from CSV files, Selecting specific columns for display, filtering, distinct, Defining case classes, toDF, Registering as a table and executing SQLs, describe, computing basic stats (a starter sketch combining the RDD and Spark SQL topics follows this list)
  • Spark Streaming: Using Kafka with Spark
  • Spark MLlib: Transformers, Estimators, Building a sample Spark ML pipeline using a Tokenizer, HashingTF, and Logistic Regression
  • Spark GraphX: Creating a Graph, Querying, Running Graph Algorithms
  • Using User-Defined Functions (UDFs), User-Defined Aggregate Functions (UDAFs), and Window Functions
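
To indicate the level of these hands-on sessions, here is a minimal starter sketch combining the RDD and Spark SQL topics above. It is illustrative only: the Person class, the object name, and the sample data are assumptions, not course material.

  import org.apache.spark.sql.SparkSession

  // Case classes used with Datasets are best defined at the top level,
  // outside the method that uses them.
  case class Person(name: String, age: Int)

  object IntroSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("intro-sketch")
        .master("local[*]")   // local mode, suitable for laptop exercises
        .getOrCreate()
      import spark.implicits._

      // RDD basics: parallelize, filter, map, reduce
      val nums = spark.sparkContext.parallelize(1 to 10)
      val sumOfEvenSquares = nums.filter(_ % 2 == 0).map(n => n * n).reduce(_ + _)
      println(s"Sum of squares of even numbers: $sumOfEvenSquares")

      // Spark SQL basics: case class, toDS, register as a table, run SQL
      val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()
      people.createOrReplaceTempView("people")
      spark.sql("SELECT name FROM people WHERE age > 40").show()

      spark.stop()
    }
  }

The sketch can be built and run with SBT (as covered in the Programs topic) by declaring the spark-core and spark-sql dependencies.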

Module II: Using Spark for Processing Structured and Semi-Structured Data (Basics only) – 1 Day

  • Using JDBC to work with relational databases
  • Using Spark with MongoDB (NoSQL Database)
  • Working with JSON data
  • Using Spark with Avro and Parquet datasets (a combined sketch covering JSON, Parquet, and JDBC follows this list)
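
As a flavor of this module, the following sketch reads JSON, round-trips it through Parquet, and reads a relational table over JDBC. The file paths, database URL, table name, and credentials are placeholder assumptions, and the JDBC driver jar must be on the classpath.

  import org.apache.spark.sql.SparkSession

  object FormatsSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("formats-sketch")
        .master("local[*]")
        .getOrCreate()

      // JSON: Spark expects one JSON object per line by default
      val events = spark.read.json("data/events.json")
      events.printSchema()

      // Parquet: columnar storage; the schema travels with the data
      events.write.mode("overwrite").parquet("data/events.parquet")
      val restored = spark.read.parquet("data/events.parquet")
      restored.show(5)

      // JDBC: the connection details below are hypothetical
      val orders = spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/shop")
        .option("dbtable", "orders")
        .option("user", "spark")
        .option("password", "secret")
        .load()
      orders.show(5)

      spark.stop()
    }
  }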

Module III: Using Spark for Data Exploration & Data Munging – 2 Days

  • Using Spark SQL and SparkR for data analysis and visualization
  • Creating pivot tables using Spark SQL (see the sketch after this list)
  • Exploring graphs using GraphFrames, including executing common graph algorithms, and visualizing graphs using SparkR
  • Exploring time-series data
  • Very basic data munging with Spark SQL and SparkR
  • Very basic data munging on time-series data
  • Very basic data munging on textual data
  • Working with Spark Notebooks – Apache Zeppelin
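
The pivot-table topic above, for example, reduces to a short Spark SQL idiom. The sales data below is an invented illustration:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.sum

  object PivotSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("pivot-sketch")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      // Illustrative sales data: (region, quarter, revenue)
      val sales = Seq(
        ("North", "Q1", 100.0), ("North", "Q2", 120.0),
        ("South", "Q1",  90.0), ("South", "Q2", 130.0)
      ).toDF("region", "quarter", "revenue")

      // Pivot: one row per region, one column per quarter
      sales.groupBy("region").pivot("quarter").agg(sum("revenue")).show()

      spark.stop()
    }
  }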

Module IV: Focus Area: Spark ML – 5 Days

  • The ML Pipelines API: Estimators, Transformers, and Pipelines; Model Selection (a minimal pipeline sketch follows this list)
  • Extracting, Transforming and Selecting Features (Feature Extractors, Feature Transformers, Feature Selectors)
  • Selected topics from Classification, Regression, Decision Trees, Tree Ensembles, Clustering
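
A minimal end-to-end example of the pipeline concepts in this module, in the spirit of the official Spark ML examples; the toy labeled text dataset is an assumption made for illustration:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import org.apache.spark.sql.SparkSession

  object PipelineSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("pipeline-sketch")
        .master("local[*]")
        .getOrCreate()

      // Tiny labeled text dataset (contents invented for illustration)
      val training = spark.createDataFrame(Seq(
        (0L, "spark rdd dataframe", 1.0),
        (1L, "cooking pasta recipe", 0.0),
        (2L, "spark sql catalyst",  1.0),
        (3L, "gardening tips",      0.0)
      )).toDF("id", "text", "label")

      // Two Transformers and an Estimator chained into a Pipeline
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
      val lr = new LogisticRegression().setMaxIter(10)
      val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

      // Fitting the Pipeline (an Estimator) yields a PipelineModel (a Transformer)
      val model = pipeline.fit(training)
      model.transform(training).select("id", "text", "prediction").show(false)

      spark.stop()
    }
  }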

Mini-Project Modules (Duration: 10 Days)

Module V: Spark Product Architecture, Development and Deployment Guidelines

  • Product Architectures: Lambda, SaaS Multi-Tenant, and Service-Oriented Architectures
  • AWS Cloud Considerations for Product Scalability, Availability, and Security, and for Minimizing Costs

Module VI: High-Level Overview of Monitoring, Performance, and Troubleshooting

  • SparkUI, Catalyst Optimizer, Project Tungsten, Garbage Collection
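
These topics are easiest to demonstrate live. As a small taste, the sketch below prints the query plans produced by the Catalyst optimizer (Tungsten's whole-stage code generation appears in the physical plan), while the running application's Spark UI (by default at http://localhost:4040) exposes job, stage, storage, and executor/GC metrics. The data is an invented example:

  import org.apache.spark.sql.SparkSession

  object ExplainSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("explain-sketch")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "tag")

      // extended = true prints the parsed, analyzed, optimized,
      // and physical plans; look for WholeStageCodegen in the last one
      df.groupBy("tag").count().explain(true)

      spark.stop()
    }
  }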