Primary Audience: Experienced Java Developers

Total Duration: 4 Full Days

Prerequisites: Prior experience in Java programming.

Minimum Machine Requirements: Laptops with 8 GB RAM, Mac OS X or Linux, Hadoop installed and configured, Apache Spark downloaded, and a WiFi connection.

Note: All training exercises and hands-on sessions will be done on individual laptops.

Training Equipment Required: Whiteboard and marker pens. Ability to project from a Mac (HDMI connector or Chromecast).

Teaching Style: >80% hands-on content (we sit and we code).

Course Content:

Part A: Scala Programming

  1. Introduction to Scala: Scala Interpreter; Scala Basics: Values and Variables; Arithmetic and Operator Overloading; Calling Methods; Control Structures and Functions: Conditional Expressions; Statement Termination; Input and Output; Loops; Functions; Default and Named Arguments; Variable Arguments; Lazy Values; Exceptions; Maps and Tuples; Exercises
  2. Classes and Objects: Simple Classes, Constructors; Nested Classes; Objects: Singletons; Companion Objects; Packages and Imports: Packages; Scope Rules; Package Visibility; Imports; Renaming and Hiding Members; Implicit Imports; Inheritance: Extending a Class; Overriding Methods; Type Checks and Casts; Protected Fields and Methods; Superclass Construction; Overriding Fields; Anonymous Subclasses; Abstract Classes; Abstract Fields; Object Equality; Value Classes; Traits: Traits as Interfaces; Traits with Concrete Implementations; Objects with Traits; Layered Traits; Overriding Abstract Methods in Traits; Traits for Rich Interfaces; Self Types; Exercises
  3. Collections: Mutable and Immutable Collections; Sequences; Lists; Sets; Mapping a Function; Reducing, Folding, and Scanning; Zipping; Iterators; Interoperability with Java Collections; Parallel Collections; Exercises
  4. Pattern Matching and Case Classes: Switch; Guards; Variables in Patterns; Type Patterns; Extractors; Patterns in Variable Declarations; Patterns in for Expressions; Case Classes; Matching Nested Structures; Sealed Classes; The Option Type; Partial Functions; Exercises
  5. Implicits: Implicit Conversions; Using Implicits for Enriching Existing Classes; Importing Implicits; Rules for Implicit Conversions; Implicit Parameters; Implicit Conversions with Implicit Parameters; Exercises
  6. Higher-Order Functions: Functions as Values; Anonymous Functions; Functions with Function Parameters; Parameter Inference; Useful Higher-Order Functions; Closures; Currying; Exercises (a short illustrative sketch follows this list)
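
To give a flavour of the Part A material, here is a minimal, self-contained Scala sketch combining case classes, pattern matching, the Option type, and higher-order functions. The names and data are illustrative only and are not part of the course exercises.

    // Illustrative only: a small standalone program combining several Part A topics.
    object PartAPreview {

      // Case class: immutable data with built-in pattern-matching support.
      case class Restaurant(name: String, city: String, rating: Option[Double])

      // Higher-order (curried) function: takes a predicate and returns matching names.
      def namesWhere(rs: List[Restaurant])(p: Restaurant => Boolean): List[String] =
        rs.filter(p).map(_.name)

      // Pattern matching with a guard and the Option type.
      def describe(r: Restaurant): String = r match {
        case Restaurant(n, _, Some(score)) if score >= 4.0 => s"$n: highly rated"
        case Restaurant(n, _, Some(_))                      => s"$n: rated"
        case Restaurant(n, _, None)                         => s"$n: not yet rated"
      }

      def main(args: Array[String]): Unit = {
        val sample = List(
          Restaurant("Alpha", "Oslo", Some(4.5)),
          Restaurant("Beta", "Bergen", None))
        sample.map(describe).foreach(println)          // mapping over a collection
        println(namesWhere(sample)(_.city == "Oslo"))  // anonymous function argument
      }
    }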

Part B: Spark Streaming Programming (Apache Spark; Using DataFrame / Dataset APIs; Structured Streaming; SBT; Public Datasets)

  1. Quick tour of Spark: SparkSession; Spark Shell; Understanding RDDs, DataFrames & Datasets; Overview of the Catalyst Optimizer; Continuous Applications & Structured Streaming
  2. Hands-on Exercises:
    1. Defining a schema; Creating a DataFrame; Displaying the contents of a DataFrame; Creating a Temporary View and executing a simple SQL statement; Creating a Dataset; Defining a case class; Creating UDFs; Using SparkSession (a minimal sketch of these steps appears after this outline).
    2. Basic DataFrame / Dataset operations. For this purpose, we use the restaurants dataset that is typically used for evaluating duplicate detection and record linkage systems.
    3. Measuring the difference in execution times between Spark 1.6 and Spark 2.0.
    4. Defining the schema of the input records and creating a streaming input DataFrame; then defining a query with a time interval of 20 seconds and the output mode set to Complete.
  3. What is a streaming application? Typical streaming use cases; Using the Spark SQL DataFrame / Dataset APIs to build streaming applications; Using Kafka in Structured Streaming applications; Creating a receiver for a custom data source.
  4. Hands-on Exercises:
    1. Building Continuous Applications using Structured Streaming; Using window operations; Joining a streaming dataset with a static dataset; Using the Dataset API in Structured Streaming; Using the Foreach, Memory, and File Sinks.
    2. Using Kafka with Spark Structured Streaming; a simple Kafka-Spark example (sketched after this outline).
    3. Writing a receiver for a custom data source: we define a custom data source that reads from public APIs.
  5. Large-Scale Application Architectures: Understanding Spark-based batch and stream processing architectures; Understanding Lambda and Kappa architectures; Implementing scalable stream processing with Structured Streaming; Building robust ETL (Extract-Transform-Load) pipelines; Implementing a scalable monitoring solution.
  6. Hands-on Exercises:
    1. Building robust ETL pipelines: Choosing appropriate data formats; transforming data in ETL pipelines; examples of Spark SQL-based transformations on Twitter data.
    2. Addressing errors in ETL pipelines: ETL tasks are often complex, expensive, slow, and error-prone. We examine typical challenges in ETL processes and how Spark SQL features help address them.
    3. Implementing a scalable monitoring solution: Voluminous logs collected from applications, servers, network devices, and so on are processed to provide real-time monitoring that helps detect errors, warnings, failures, and other issues (see the final sketch below).
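
As a preview of the first set of Part B hands-on exercises (defining a schema, creating a DataFrame, a temporary view with SQL, a Dataset, and a UDF), here is a minimal sketch. The file path, column names, and the Restaurant case class are placeholders rather than the actual course dataset.

    // Sketch of the basic DataFrame / Dataset exercise; paths and columns are placeholders.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions.udf

    object DataFrameBasics {
      case class Restaurant(id: Long, name: String, city: String)   // for the typed Dataset API

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("DataFrameBasics").master("local[*]").getOrCreate()
        import spark.implicits._

        // Defining a schema explicitly instead of relying on inference.
        val schema = StructType(Seq(
          StructField("id", LongType, nullable = false),
          StructField("name", StringType, nullable = true),
          StructField("city", StringType, nullable = true)))

        // Creating a DataFrame from a CSV file (placeholder path).
        val df = spark.read.schema(schema).option("header", "true").csv("data/restaurants.csv")
        df.show(5)

        // Creating a temporary view and executing a simple SQL statement.
        df.createOrReplaceTempView("restaurants")
        spark.sql("SELECT city, COUNT(*) AS n FROM restaurants GROUP BY city").show()

        // Converting to a strongly typed Dataset and applying a UDF.
        val ds = df.as[Restaurant]
        val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
        ds.select($"name", upperUdf($"city").as("city_uc")).show(5)

        spark.stop()
      }
    }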
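
The streaming exercises (a streaming input DataFrame, window operations, the Complete output mode, and a Kafka source) can be previewed with the sketch below. The broker address, topic name, and JSON event layout are assumptions, and the spark-sql-kafka-0-10 package must be on the classpath.

    // Sketch of a windowed count over a Kafka source; broker, topic, and schema are assumed.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object StreamingWordCounts {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("StreamingWordCounts").master("local[*]").getOrCreate()
        import spark.implicits._

        // Assumed layout of the JSON events arriving on the topic.
        val eventSchema = new StructType()
          .add("word", StringType)
          .add("ts", TimestampType)

        // Streaming input DataFrame from Kafka (placeholder broker and topic).
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS json")
          .select(from_json($"json", eventSchema).as("e"))
          .select($"e.*")

        // Count per word over 20-second windows, written to the console sink in Complete mode.
        val counts = events.groupBy(window($"ts", "20 seconds"), $"word").count()

        counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
          .awaitTermination()
      }
    }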
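
For the monitoring exercise above, the following sketch watches a directory of log files, keeps only ERROR and WARN lines, and counts them per one-minute window. The directory path and log format are assumptions; a real pipeline would parse timestamps and severities from the log lines themselves.

    // Sketch of a log-monitoring query; the input directory and log format are assumed.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object LogMonitor {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("LogMonitor").master("local[*]").getOrCreate()
        import spark.implicits._

        // Each new file dropped into the directory becomes part of the stream.
        val lines = spark.readStream.text("logs/incoming")

        // Keep only warning/error lines and tag them with a severity column.
        val issues = lines
          .filter($"value".contains("ERROR") || $"value".contains("WARN"))
          .withColumn("severity", when($"value".contains("ERROR"), "ERROR").otherwise("WARN"))
          .withColumn("ts", current_timestamp())   // processing-time stand-in for a parsed timestamp

        // One-minute tumbling-window counts per severity, printed to the console.
        val counts = issues.groupBy(window($"ts", "1 minute"), $"severity").count()

        counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
      }
    }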