Data Science with Python Course Syllabus
Apache Spark is the next-generation successor to MapReduce: a powerful, open-source processing engine for data in Hadoop clusters, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs.
This Apache Spark course enables participants to build complete, unified Big Data applications that combine batch, streaming, and interactive analytics over all their data. With Spark, developers can write sophisticated parallel applications that support faster business decisions and better user outcomes across a wide variety of use cases, architectures, and industries.
This course is best suited to developers and software engineers. Course examples and exercises are presented in Python and Scala, so knowledge of one of these programming languages is required. Basic knowledge of Linux is assumed. Prior knowledge of Hadoop is not required.
1. Why Spark?
2. Spark Basics
3. Working with RDDs
4. The Hadoop Distributed File System
5. Running Spark on a Cluster
6. Parallel Programming with Spark
7. Caching and Persistence
8. Writing Spark Applications
9. Spark, Hadoop, and the Enterprise Data Center
10. Spark Streaming
11. Common Spark Algorithms
12. Improving Spark Performance
Multivariate data interpolation (griddata)
Using radial basis functions for smoothing/interpolation
Fast Fourier transforms
Discrete Cosine Transforms
Discrete Sine Transforms
Sparse Eigenvalue Problems with ARPACK
Compressed Sparse Graph Routines
Spatial data structures and algorithms
Statistics: Random Variables
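The interpolation topics above can be illustrated with `scipy.interpolate.griddata`, which interpolates scattered samples onto a regular grid. A minimal sketch, with sample data invented for illustration:

```python
import numpy as np
from scipy.interpolate import griddata

# Scattered samples of the (known) function f(x, y) = x + y.
rng = np.random.default_rng(0)
points = rng.random((200, 2))          # 200 random (x, y) locations in [0, 1)^2
values = points[:, 0] + points[:, 1]   # f evaluated at those locations

# Interpolate onto a small regular grid inside the sampled region.
grid_x, grid_y = np.mgrid[0.2:0.8:5j, 0.2:0.8:5j]
interp = griddata(points, values, (grid_x, grid_y), method='linear')
```

Since linear interpolation reproduces a linear function exactly inside the convex hull of the samples, `interp` should match `grid_x + grid_y` almost to machine precision here.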
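The transform topics (FFT, DCT, DST) all follow the same forward/inverse pattern in `scipy.fft`. A short round-trip sketch on a made-up signal:

```python
import numpy as np
from scipy.fft import fft, ifft, dct, idct

x = np.array([1.0, 2.0, 1.0, -1.0, 1.5])

# Forward FFT; the zero-frequency bin X[0] is simply the sum of the samples.
X = fft(x)

# The inverse transform recovers the original (real) signal.
x_back = ifft(X).real

# DCT/IDCT round-trip works the same way (type-II transform by default).
y = idct(dct(x))
```

The DST (`scipy.fft.dst`/`idst`) follows the identical pattern.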
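The statistics topic centers on SciPy's random-variable objects, which bundle a distribution's pdf, cdf, and quantile function behind one interface. A minimal sketch with a frozen standard normal:

```python
from scipy import stats

# A "frozen" normal random variable with mean 0 and standard deviation 1.
rv = stats.norm(loc=0, scale=1)

# pdf, cdf, and ppf are the workhorse methods on any rv object.
half = rv.cdf(0)   # by symmetry, P(X <= 0) = 0.5
q = rv.ppf(0.5)    # the median of a standard normal is 0
```

Every continuous and discrete distribution in `scipy.stats` exposes this same method set, which is what makes the random-variable abstraction worth a syllabus entry of its own.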