Primary Audience: Services organizations building a Spark consulting practice, and product-development start-ups and companies
Total Duration: 20 Full Days (10 Days Instruction + 10 Days Mentored Project)
Pre-requisites: Basic familiarity with HDFS and some exposure to Java programming.
Minimum Machine Requirements: Laptops with 8 GB RAM running Mac OS X or Linux, with Hadoop installed and configured, Apache Spark downloaded, and a Wi-Fi connection.
Note: All training exercises and hands-on sessions will be done on individual laptops.
Training Equipment Required: Whiteboard and marker pens. Ability to project from Mac (HDMI connector or Chromecast).
Teaching Style: >80% hands-on content (we sit and we code a lot).
Course Contents: The content and duration of various modules can be adjusted to the specific requirements of the organization.
Note: This is a data engineering course focused on tools, programming and architecting a Spark-based application / product using Scala (and not a data science course).
Course Content: Spark & Scala Learning Modules
Module I: High-Speed Hands-on Introduction to Spark (using Scala) – 2 Days
- Scala Basics: Arithmetic, Strings, Variables, Functions, Functional programming, Arrays, Lists
- RDDs: Parallelize, Filter, Collect, Map, FlatMap, Reduce, Read from files, Subset, Count, Union, Persist / Cache, CSV files, countByValue, toList (see the RDD sketch after this list)
- Programs: Writing Spark-Scala programs, Using SBT
- Spark SQL: Understanding DataFrames / Datasets, Displaying DataFrames / Datasets, Specifying schema, Reading from CSV files, Selecting specific columns for display, filtering, distinct, Defining case classes, toDF, Registering as table and executing SQLs, describe, computing basic stats (see the DataFrame sketch after this list)
- Spark Streaming: Using Kafka with Spark (see the streaming sketch after this list)
- Spark MLlib: Transformers, Estimators, Building a sample Spark ML pipeline using a Tokenizer, HashingTF and Logistic Regression (see the pipeline sketch after this list)
- Spark GraphX: Creating a Graph, Querying, Running Graph Algorithms (see the GraphX sketch after this list)
- Using User-Defined Functions (UDFs), User-Defined Aggregate Functions (UDAFs), and Window Functions (see the UDF sketch after this list)
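The sketch below brings together several of the RDD operations from this module. It assumes a local SparkSession and a placeholder input file (data/sample.txt); treat it as a starting point rather than a definitive program.

```scala
import org.apache.spark.sql.SparkSession

object RddBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RddBasics").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Parallelize a local collection into an RDD
    val nums = sc.parallelize(1 to 10)

    // Transformations: filter, map
    val evens = nums.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // Actions: collect, count, reduce
    println(squares.collect().mkString(", "))   // 4, 16, 36, 64, 100
    println(nums.count())                       // 10
    println(nums.reduce(_ + _))                 // 55

    // flatMap plus cache on an RDD that will be reused
    // (data/sample.txt is a placeholder path)
    val words = sc.textFile("data/sample.txt").flatMap(_.split("\\s+")).cache()
    println(words.countByValue())

    spark.stop()
  }
}
```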
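A minimal DataFrame / Dataset sketch, assuming a placeholder CSV file (data/people.csv) and a hypothetical Person case class:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class used with toDS / toDF
case class Person(name: String, age: Int, city: String)

object DataFrameBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DataFrameBasics").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read a CSV file with a header row, inferring the schema
    val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/people.csv")
    df.printSchema()

    // Select specific columns, filter, distinct
    df.select("name", "age").filter($"age" > 30).distinct().show()

    // Build a Dataset from case classes and register it as a SQL view
    val ds = Seq(Person("Ada", 36, "London"), Person("Linus", 29, "Helsinki")).toDS()
    ds.toDF().createOrReplaceTempView("people")
    spark.sql("SELECT city, AVG(age) FROM people GROUP BY city").show()

    // Basic statistics with describe
    df.describe("age").show()

    spark.stop()
  }
}
```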
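One way to wire Kafka into Spark, sketched here with Structured Streaming (the older DStream API is an alternative). The broker address and topic name are placeholders, and the spark-sql-kafka package must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaWordCount").master("local[*]").getOrCreate()
    import spark.implicits._

    // Subscribe to a hypothetical "events" topic on a local broker
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING)").as[String]

    // Running word count over the stream
    val counts = lines.flatMap(_.split("\\s+")).groupBy("value").count()

    // Print the full aggregation to the console on each trigger
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```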
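A compact sketch of the Tokenizer, HashingTF and Logistic Regression pipeline named above, trained on toy data:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object SimplePipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SimplePipeline").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy training data: (id, text, label)
    val training = Seq(
      (0L, "spark rdd dataframe", 1.0),
      (1L, "cooking with garlic", 0.0),
      (2L, "spark streaming kafka", 1.0),
      (3L, "gardening in spring", 0.0)
    ).toDF("id", "text", "label")

    // Two Transformers and an Estimator chained into a Pipeline
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

    // Score an unseen document
    model.transform(Seq((4L, "spark graphx")).toDF("id", "text"))
      .select("text", "prediction").show()

    spark.stop()
  }
}
```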
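A tiny GraphX sketch: build a three-vertex graph, query its size, and run PageRank. The vertex and edge data are made up:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GraphBasics").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Vertices carry names; edges carry relationship labels
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
    val graph = Graph(vertices, edges)

    // Query the graph, then run a graph algorithm
    println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
    graph.pageRank(0.001).vertices.collect().foreach(println)

    spark.stop()
  }
}
```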
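A short sketch of a UDF and a window function on toy data; UDAFs are omitted here for brevity:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, udf}

object UdfAndWindows {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("UdfAndWindows").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("east", "a", 100), ("east", "b", 250), ("west", "c", 90))
      .toDF("region", "item", "amount")

    // A simple UDF that buckets amounts
    val bucket = udf((amount: Int) => if (amount >= 100) "high" else "low")

    // A window function ranking items within each region by amount
    val byRegion = Window.partitionBy("region").orderBy(col("amount").desc)

    sales.withColumn("bucket", bucket(col("amount")))
      .withColumn("rank", rank().over(byRegion))
      .show()

    spark.stop()
  }
}
```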
Module II: Using Spark for Processing Structured and Semi-Structured Data (Basics only) – 1 Day
- Using JDBC to work with relational databases
- Using Spark with MongoDB (NoSQL Database)
- Working with JSON data
- Using Spark with Avro and Parquet datasets (JDBC, JSON, and Parquet are sketched after this list)
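A minimal sketch covering three of the sources above. The file paths, JDBC URL, and credentials are placeholders, and the matching JDBC driver must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object StructuredSources {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredSources").master("local[*]").getOrCreate()

    // JSON: one JSON object per line (placeholder path)
    val events = spark.read.json("data/events.json")
    events.printSchema()

    // Parquet: write a columnar copy and read it back
    events.write.mode("overwrite").parquet("data/events.parquet")
    val fromParquet = spark.read.parquet("data/events.parquet")
    fromParquet.show(5)

    // JDBC: read a table from a relational database (placeholder URL,
    // table name and credentials)
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/shop")
      .option("dbtable", "orders")
      .option("user", "spark")
      .option("password", "secret")
      .load()
    orders.show(5)

    spark.stop()
  }
}
```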
Module III: Using Spark for Data Exploration & Data Munging – 2 Days
- Using Spark SQL and SparkR for data analysis and visualization
- Creating pivot tables using Spark SQL (see the pivot sketch after this list)
- Exploring graphs using GraphFrames, including running common graph algorithms, and visualizing graphs with SparkR
- Exploring time-series data
- Very basic data munging with Spark SQL and SparkR
- Very basic data munging on time-series data
- Very basic data munging on textual data
- Working with Spark Notebooks – Apache Zeppelin
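A pivot-table sketch in Spark SQL, run on toy sales data:

```scala
import org.apache.spark.sql.SparkSession

object PivotExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PivotExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy sales data: region, quarter, revenue
    val sales = Seq(
      ("east", "Q1", 100), ("east", "Q2", 120),
      ("west", "Q1", 90),  ("west", "Q2", 140)
    ).toDF("region", "quarter", "revenue")

    // Pivot quarters into columns, one row per region
    sales.groupBy("region").pivot("quarter").sum("revenue").show()

    spark.stop()
  }
}
```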
Module IV: Focus Area: Spark ML – 5 Days
- APIs for ML pipelines, Estimators, Transformers and Pipelines, Model Selection
- Extracting, Transforming and Selecting Features (Feature Extractors, Feature Transformers, Feature Selectors) – see the feature sketch after this list
- Selected topics from Classification, Regression, Decision Trees, Tree Ensembles, Clustering
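A brief feature-engineering sketch using two of the transformers this module covers (StringIndexer and VectorAssembler) on made-up data:

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object FeatureBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FeatureBasics").master("local[*]").getOrCreate()
    import spark.implicits._

    val raw = Seq(("red", 1.0, 3.2), ("blue", 0.0, 1.1), ("red", 1.0, 2.7))
      .toDF("colour", "label", "score")

    // Feature transformer: encode a string column as a numeric index
    val indexer = new StringIndexer().setInputCol("colour").setOutputCol("colourIdx")
    val indexed = indexer.fit(raw).transform(raw)

    // Assemble numeric columns into the single vector column ML algorithms expect
    val assembler = new VectorAssembler()
      .setInputCols(Array("colourIdx", "score"))
      .setOutputCol("features")

    assembler.transform(indexed).select("features", "label").show(truncate = false)

    spark.stop()
  }
}
```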
Mini-Project Modules (Duration: 10 Days)
Module V: Spark Product Architecture, Development and Deployment Guidelines
- Product Architectures: Lambda, SaaS Multi-Tenant, and Service-Oriented Architectures
- AWS Cloud Considerations for Product Scalability, Availability, Security, and Cost Minimization
Module VI: High-level overview of Monitoring, Performance, and Troubleshooting
- Spark UI, Catalyst Optimizer, Project Tungsten, Garbage Collection (see the configuration sketch below)
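An illustrative configuration sketch only; the values shown are placeholders, and the right settings depend on the workload and cluster:

```scala
import org.apache.spark.sql.SparkSession

object TunedApp {
  def main(args: Array[String]): Unit = {
    // Illustrative settings commonly revisited while tuning (placeholder values)
    val spark = SparkSession.builder
      .appName("TunedApp")
      .config("spark.executor.memory", "4g")                      // executor heap size
      .config("spark.sql.shuffle.partitions", "200")              // default shuffle parallelism
      .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // GC choice, visible in executor logs
      .getOrCreate()

    // While the application runs, the Spark UI (port 4040 by default) exposes
    // jobs, stages, storage, and the SQL plans produced by the Catalyst optimizer.
    spark.stop()
  }
}
```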