Primary Audience: Experienced Java Developers
Total Duration: 4 Full Days
Pre-requisites: Prior experience in Java programming.
Minimum Machine Requirements: Laptops with 8 GB RAM, Mac OS X or Linux, Hadoop installed and configured, Apache Spark downloaded, and a WiFi connection.
Note: All training exercises and hands-on sessions will be done on individual laptops.
Training Equipment Required: Whiteboard and marker pens. Ability to project from Mac (HDMI connector or Chromecast).
Teaching Style: >80% hands-on content (we sit and we code).
Part A: Scala Programming
- Introduction to Scala: Scala Interpreter; Scala Basics: Values and Variables; Arithmetic and Operator Overloading; Calling Methods; Control Structures and Functions: Conditional Expressions; Statement Termination; Input and Output; Loops; Functions; Default and Named Arguments; Variable Arguments; Lazy Values; Exceptions; Maps and Tuples; Exercises
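A minimal sketch touching several of this module's topics; all names and values are illustrative:

```scala
// Illustrative snippets for the basics module
object Basics extends App {
  val greeting = "Hello"               // immutable value
  var counter  = 0                     // mutable variable
  counter += 1                         // '+' is an ordinary method (operator overloading)

  // Conditional expressions yield values
  val parity = if (counter % 2 == 0) "even" else "odd"
  println(s"$greeting, counter is $parity")

  // Default and named arguments
  def greet(name: String, punctuation: String = "!"): String = s"Hi, $name$punctuation"
  println(greet("Scala"))                            // uses the default
  println(greet(name = "Spark", punctuation = "?"))  // named arguments

  // Variable arguments
  def join(sep: String, parts: String*): String = parts.mkString(sep)
  println(join(" | ", "a", "b", "c"))

  // Lazy values are evaluated on first access
  lazy val expensive = { println("computing..."); 42 }
  println(expensive)

  // Maps, tuples, and exceptions
  val scores = Map("alice" -> 10, "bob" -> 7)
  val (who, score) = ("alice", scores("alice"))
  try Integer.parseInt("not a number")
  catch { case e: NumberFormatException => println(s"bad input: ${e.getMessage}") }
}
```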
- Classes and Objects: Simple Classes, Constructors; Nested Classes; Objects: Singletons; Companion Objects; Packages and Imports: Packages; Scope Rules; Package Visibility; Imports; Renaming and Hiding Members; Implicit Imports; Inheritance: Extending a Class; Overriding Methods; Type Checks and Casts; Protected Fields and Methods; Superclass Construction; Overriding Fields; Anonymous Subclasses; Abstract Classes; Abstract Fields; Object Equality; Value Classes; Traits: Traits as Interfaces; Traits with Concrete Implementations; Objects with Traits; Layered Traits; Overriding Abstract Methods in Traits; Traits for Rich Interfaces; Self Types; Exercises
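A hypothetical class hierarchy illustrating constructors, companion objects, traits with concrete implementations, layered traits, and type checks; all names are made up for the example:

```scala
abstract class Shape {                // abstract class with an abstract method
  def area: Double
}

class Circle(val radius: Double) extends Shape {   // primary constructor
  override def area: Double = math.Pi * radius * radius
  override def toString: String = s"Circle($radius)"
}

trait Logging {                       // trait with a concrete implementation
  def log(msg: String): Unit = println(s"[log] $msg")
}

trait TimestampedLogging extends Logging {         // layered trait
  override def log(msg: String): Unit =
    super.log(s"${java.time.Instant.now()} $msg")
}

object Circle {                       // companion object as a factory
  def unit: Circle = new Circle(1.0)
}

object ShapesDemo extends App {
  // Mixing a trait into an object at instantiation time
  val c = new Circle(2.0) with TimestampedLogging
  c.log(f"area = ${c.area}%.2f")

  // Type checks and casts
  val s: Shape = c
  if (s.isInstanceOf[Circle]) println(s.asInstanceOf[Circle].radius)
}
```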
- Collections: Mutable and Immutable Collections; Sequences; Lists; Sets; Mapping a Function; Reducing, Folding, and Scanning; Zipping; Iterators; Interoperability with Java Collections; Parallel Collections; Exercises
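A short sketch of the collection operations listed above (the `JavaConverters` import targets the Scala 2.11/2.12 versions current for this course):

```scala
object CollectionsDemo extends App {
  val xs = List(1, 2, 3, 4, 5)

  // Mapping a function over a sequence
  val doubled = xs.map(_ * 2)

  // Reducing, folding, and scanning
  val sum          = xs.reduce(_ + _)            // 15
  val product      = xs.foldLeft(1)(_ * _)       // 120
  val runningTotal = xs.scanLeft(0)(_ + _)       // List(0, 1, 3, 6, 10, 15)

  // Zipping two sequences into pairs
  val labelled = List("a", "b", "c").zip(xs)     // List((a,1), (b,2), (c,3))

  // Mutable vs. immutable collections
  val buf = scala.collection.mutable.ArrayBuffer(1, 2, 3)
  buf += 4

  // Interoperability with Java collections
  import scala.collection.JavaConverters._
  val javaList: java.util.List[Int] = xs.asJava

  println((sum, product, runningTotal, labelled, buf, javaList))
}
```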
- Pattern Matching and Case Classes: Switch; Guards; Variables in Patterns; Type Patterns; Extractors; Patterns in Variable Declarations; Patterns in for Expressions; Case Classes; Matching Nested Structures; Sealed Classes; The Option Type; Partial Functions; Exercises
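An illustrative sketch of sealed hierarchies, guards, the Option type, and partial functions; the event types are invented for the example:

```scala
sealed trait Event                                  // sealed: the compiler checks match exhaustiveness
case class Click(x: Int, y: Int) extends Event
case class KeyPress(key: Char)   extends Event

object MatchDemo extends App {
  def describe(e: Event): String = e match {
    case Click(x, y) if x == y => s"diagonal click at ($x, $y)"   // guard
    case Click(x, _)           => s"click in column $x"           // variable pattern
    case KeyPress(k)           => s"key '$k' pressed"
  }
  println(describe(Click(3, 3)))

  // The Option type instead of null
  val lookup = Map("a" -> 1)
  val v: Option[Int] = lookup.get("b")
  println(v.getOrElse(0))

  // Partial functions are defined only for some inputs
  val half: PartialFunction[Int, Int] = { case n if n % 2 == 0 => n / 2 }
  println(List(1, 2, 3, 4).collect(half))          // List(1, 2)
}
```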
- Implicits: Implicit Conversions; Using Implicits for Enriching Existing Classes; Importing Implicits; Rules for Implicit Conversions; Implicit Parameters; Implicit Conversions with Implicit Parameters; Exercises
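A minimal sketch of enriching an existing class and of implicit parameters; all names are illustrative:

```scala
object ImplicitsDemo extends App {
  // An implicit class enriches Int without modifying it
  implicit class RichInt(n: Int) {
    def times(action: => Unit): Unit = (1 to n).foreach(_ => action)
  }
  3.times(println("hello"))

  // Implicit parameters are supplied by the compiler from the enclosing scope
  case class Greeting(text: String)
  def greet(name: String)(implicit g: Greeting): String = s"${g.text}, $name"

  implicit val defaultGreeting: Greeting = Greeting("Hi")
  println(greet("Scala"))   // expands to greet("Scala")(defaultGreeting)
}
```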
- Higher-Order Functions: Functions as Values; Anonymous Functions; Functions with Function Parameters; Parameter Inference; Useful Higher-Order Functions; Closures; Currying; Exercises
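A short sketch of the function-related topics in this module:

```scala
object HigherOrderDemo extends App {
  // Functions as values and anonymous functions
  val square: Int => Int = x => x * x

  // A function with a function parameter
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
  println(applyTwice(square, 3))          // 81

  // Closures capture variables from the enclosing scope
  var factor = 2
  val scale = (x: Int) => x * factor
  factor = 10
  println(scale(3))                       // 30: the closure sees the updated factor

  // Currying: a function with two argument lists
  def add(a: Int)(b: Int): Int = a + b
  val addFive: Int => Int = add(5) _      // partially applied
  println(addFive(7))                     // 12

  // Useful higher-order functions from the standard library
  println(List(1, 2, 3).map(square).filter(_ > 1))
}
```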
Part B: Spark Streaming Programming (Apache Spark; Using DataFrame / Dataset APIs; Structured Streaming; SBT; Public Datasets)
- Quick tour of Spark: SparkSession; Spark Shell; Understanding RDDs, DataFrames & Datasets; Overview of the Catalyst Optimizer; Continuous Applications & Structured Streaming
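A minimal sketch of the Spark 2.x entry point; the local master URL and the data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object QuickTour {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("QuickTour")
      .master("local[*]")          // local mode for the training laptops
      .getOrCreate()
    import spark.implicits._

    // RDD: the low-level distributed collection
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
    println(rdd.sum())

    // Dataset: typed; a DataFrame is a Dataset[Row]
    val ds = Seq(("alice", 10), ("bob", 7)).toDS()
    val df = ds.toDF("name", "score")

    // The Catalyst optimizer plans this query; explain() shows the plan
    df.filter($"score" > 8).explain()
    df.show()

    spark.stop()
  }
}
```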
- Hands-on Exercises:
- Defining a schema; Creating a DataFrame; Displaying the contents of a DataFrame; Creating a Temporary View and executing a simple SQL statement; Creating a Dataset; Defining a case class; Creating UDFs; Using SparkSession.
- Basic DataFrame / Dataset operations. For this purpose, we use the restaurants dataset that is commonly used for evaluating duplicate detection and record linkage systems.
- Measuring the difference in execution times between Spark 1.6 and Spark 2.0.
- Defining the schema of our input records and creating a streaming input DataFrame. Next, we define our query with a 20-second trigger interval and Complete output mode; a combined sketch of these exercises follows this list.
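A combined sketch of these exercises, assuming Spark 2.2+ (for `Trigger.ProcessingTime`) in local mode; the file path, socket source, column names, and the `Restaurant` case class are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

case class Restaurant(id: Long, name: String, city: String)   // case class for a typed Dataset

object ExercisesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Exercises").master("local[*]").getOrCreate()
    import spark.implicits._

    // Defining a schema explicitly
    val schema = StructType(Seq(
      StructField("id",   LongType,   nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("city", StringType, nullable = true)))

    // Creating a DataFrame from a CSV file (path is a placeholder)
    val df = spark.read.schema(schema).csv("data/restaurants.csv")
    df.show(5)

    // Creating a Temporary View and executing a simple SQL statement
    df.createOrReplaceTempView("restaurants")
    spark.sql("SELECT city, count(*) AS n FROM restaurants GROUP BY city").show()

    // DataFrame -> typed Dataset, and a simple UDF
    val ds = df.as[Restaurant]
    ds.show(3)
    val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
    df.select(toUpper($"name").alias("name_upper")).show(5)

    // Streaming input DataFrame over a socket source; 20-second trigger, Complete mode
    val lines  = spark.readStream.format("socket")
      .option("host", "localhost").option("port", "9999").load()
    val counts = lines.groupBy($"value").count()
    counts.writeStream
      .outputMode("complete")
      .trigger(Trigger.ProcessingTime("20 seconds"))
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

For the timing exercise, the batch portion of this sketch can be run against both Spark 1.6 and Spark 2.0 to compare execution times.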
- What is a streaming application? Typical streaming use cases; Using Spark SQL DataFrame / Dataset APIs to build streaming applications; Using Kafka in Structured Streaming applications; Creating a receiver for a custom data source.
- Hands-on Exercises:
- Building Continuous Applications using Structured Streaming; Using window operations; Joining a streaming dataset with a static dataset; Using Dataset API in Structured Streaming; Using the Foreach, Memory, and File Sinks.
- Using Kafka with Spark Structured Streaming; a simple Kafka-Spark example (sketched after this list).
- Writing a receiver for a custom data source: we define a custom data source for public APIs.
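A sketch combining a Kafka source, a stream-static join, and a window operation. It assumes the `spark-sql-kafka-0-10` connector is on the classpath; the broker address, topic name, and stop-word list are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object KafkaWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaWindow").master("local[*]").getOrCreate()
    import spark.implicits._

    // Streaming source: a Kafka topic of word events
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "words")
      .load()
      .selectExpr("CAST(value AS STRING) AS word", "timestamp")

    // Static dataset joined against the stream (an illustrative stop-word list)
    val stopWords = Seq("the", "a", "of").toDF("word").withColumn("stop", lit(true))
    val enriched  = events.join(stopWords, Seq("word"), "left_outer")
                          .filter($"stop".isNull)

    // Window operation: counts per 1-minute window, sliding every 20 seconds
    val counts = enriched
      .groupBy(window($"timestamp", "1 minute", "20 seconds"), $"word")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .trigger(Trigger.ProcessingTime("20 seconds"))
      .start()
      .awaitTermination()
  }
}
```

The console sink here can be swapped for the Foreach, Memory, or File sinks covered in the exercise.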
- Large-Scale Application Architectures: Understanding Spark-based batch and stream processing architectures; Understanding Lambda and Kappa architectures; Implementing scalable stream processing with Structured Streaming; Building robust ETL (Extract-Transform-Load) pipelines; Implementing a scalable monitoring solution.
- Hands-on Exercises:
- Building robust ETL pipelines: Choosing appropriate data formats; transforming data in ETL pipelines; we present examples of Spark SQL-based transformations on Twitter data.
- Addressing errors in ETL pipelines: ETL tasks are often complex, expensive, slow, and error-prone. Here, we examine typical challenges in ETL processes and how Spark SQL features help address them.
- Implementing a scalable monitoring solution: Voluminous logs collected from applications, servers, network devices, and so on are processed to provide real-time monitoring that helps detect errors, warnings, failures, and other issues; a sketch of such a pipeline follows this list.
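A minimal monitoring-pipeline sketch; the log format, input directory, and parsing rules are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LogMonitorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LogMonitor").master("local[*]").getOrCreate()
    import spark.implicits._

    // File source: new log files landing in the directory are picked up as a stream
    val raw = spark.readStream.format("text").load("logs/incoming")

    // ETL step: parse "<timestamp> <LEVEL> <message>" lines; non-matching lines are dropped
    val pattern = "^(\\S+)\\s+(ERROR|WARN|INFO)\\s+(.*)$"
    val parsed = raw.select(
        regexp_extract($"value", pattern, 1).alias("ts"),
        regexp_extract($"value", pattern, 2).alias("level"),
        regexp_extract($"value", pattern, 3).alias("message"))
      .filter($"level" =!= "")

    // Monitoring: running counts of warnings and errors across the stream
    val alerts = parsed
      .filter($"level".isin("ERROR", "WARN"))
      .groupBy($"level")
      .count()

    alerts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```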