Apache Spark & Scala
The success of most applications today depends on their data story. Not just large applications: even small applications collect huge amounts of data, then process, analyze, and learn from that data to provide essential features. No application is a small application anymore, and this has several implications:
1. Complexity – Large applications imply complexity. Most mainstream programming languages use techniques that make it extremely hard, if not impossible, to reason about correctness, let alone prove it. Proving even the smallest components (objects, functions) correct is often hard; reasoning about software composed of such components is nearly impossible.
2. Scalability – Many of these applications have to grow quickly, from a single virtual machine in the cloud to hundreds of machines with multiple CPUs. You have probably used many such applications, from audio/video streaming services like Netflix to chat applications like WhatsApp. Twitter (initially written in Ruby) and Facebook (initially written in PHP) started relatively small.
3. Resiliency – At this scale, failures happen far more often than we expect. Our applications need to recover quickly and continue to work without issues.
4. Performance – Utilizing all available resources efficiently is critical for keeping costs down, and sometimes for keeping a solution feasible at all.
This course explains four important technologies for building modern data-oriented applications that solve the above problems.
1. Scala: Scala has a state-of-the-art type system, far more capable than Java’s, and is one of the best functional programming languages (functional, not procedural, programming). Scala’s type system not only helps you write safe, efficient code, but, more importantly, supports functional programming, where code can be proved correct with mathematical techniques, or at least reasoned about rigorously, so you can be far more confident that it always works. If your software needs to be resilient, the first goal is to make your code provably, or at least demonstrably, correct.
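As a small illustration of why functional code is easy to reason about, here is a minimal Scala sketch (the function and value names are my own, invented for this example): pure functions depend only on their inputs, and `Option` makes a possibly-missing value explicit in the type, so the compiler forces callers to handle the missing case.

```scala
// A pure function: same input always gives the same output,
// so its correctness can be reasoned about in isolation.
def total(prices: List[BigDecimal]): BigDecimal =
  prices.foldLeft(BigDecimal(0))(_ + _)

// Option encodes "there may be no discount" in the type itself;
// the compiler will not let callers silently ignore that case.
def discount(code: String): Option[BigDecimal] =
  code match {
    case "SAVE10" => Some(BigDecimal("0.10"))
    case _        => None
  }

val cart  = List(BigDecimal("9.99"), BigDecimal("20.00"))
val base  = total(cart)
val price = discount("SAVE10")
  .map(d => base * (1 - d))   // apply the discount if present
  .getOrElse(base)            // otherwise keep the base total
```

Because `total` and `discount` have no hidden state, each can be tested and understood on its own, and composing them stays safe.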
2. Akka: One of the most difficult tasks for programmers is concurrency. It is so difficult that even experts find it nearly impossible to get right; I would highly encourage you to read ‘Java Concurrency in Practice’ to get an idea. Writing distributed systems is at least as difficult, and writing a concurrent, distributed system that scales well is extremely hard. Akka is written in Scala and implements the actor model (pioneered by Erlang) for the JVM, allowing you to write highly concurrent, distributed systems with relative ease. The actor model scales extremely well, to the extent that most of your application’s scalability will be determined by your hardware platform. The best part about the actor model, in my opinion, is not scalability but resiliency: Akka supports self-healing and continues to work even in the face of unexpected errors.
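A minimal sketch of the actor model with Akka Typed (this assumes the `akka-actor-typed` library on the classpath; the message and system names are invented for the example). An actor processes one message at a time, so its logic needs no locks.

```scala
import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// Messages are plain immutable values.
final case class Greet(name: String)

// A behavior describes how the actor reacts to each message.
// Only one message is processed at a time, so no locking is needed.
def greeter: Behavior[Greet] =
  Behaviors.receive { (context, message) =>
    context.log.info(s"Hello, ${message.name}!")
    Behaviors.same // keep the same behavior for the next message
  }

val system = ActorSystem(greeter, "greeter-system")
system ! Greet("Scala") // fire-and-forget: sends are asynchronous
system.terminate()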
3. Spark: When you need to process terabytes of data, you need a fast computing system. Spark is among the fastest general-purpose cluster computing systems: it can be up to 100 times faster than Hadoop MapReduce for in-memory operations, and is often at least 10 times faster. Apache Spark is written in Scala and lets you describe data processing with a type-safe, high-level, declarative Scala API (a DSL), which makes Spark far easier to program than something like Hadoop with Java. Spark also provides bindings for languages such as Python, R, and Java 8, making it accessible to data scientists, data analysts, statisticians, and mathematicians.
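To give a flavor of that declarative Scala API, here is a word-count sketch (it assumes the `spark-sql` library on the classpath; `input.txt` is a hypothetical file). Each step is a typed transformation the compiler can check.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WordCount")
  .master("local[*]") // run locally, using all available cores
  .getOrCreate()

import spark.implicits._ // enables the typed Dataset API

val counts = spark.read.textFile("input.txt") // Dataset[String], one line per row
  .flatMap(_.split("\\s+"))                   // one word per row
  .groupByKey(identity)                       // type-safe grouping by the word itself
  .count()                                    // Dataset[(String, Long)]

counts.show()
spark.stop()
```

Note how the pipeline reads as a description of *what* to compute; Spark decides how to distribute the work across the cluster.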
4. Kafka: Data-oriented applications are written as a collection of disparate systems, and these systems need to talk to each other, sending and receiving huge amounts of data efficiently and reliably. Apache Kafka, again written in Scala, solves exactly this problem. Many alternatives to Kafka exist, some very efficient and some very resilient, but rarely both. Kafka is also simple to learn, partly thanks to Scala, and can be accessed from many languages, including Java and Clojure.
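A minimal producer sketch using Kafka’s Java client from Scala (it assumes the `kafka-clients` library; the broker address and the topic name `events` are hypothetical). One system publishes a record to a topic; any number of other systems can consume it independently.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // hypothetical broker address
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// send is asynchronous: the record is appended to the "events" topic,
// and consumers read it at their own pace.
producer.send(new ProducerRecord[String, String]("events", "user-42", "signed_up"))
producer.close()
```

Decoupling producers from consumers this way is what lets the disparate systems above exchange data reliably without knowing about each other.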
Data-oriented applications are written with requirements of high scalability and high resiliency. The distributed parts of such systems can be written with Scala and Akka. All data processing needs, whether simple analysis or real-time machine learning, can be handled with Spark, and Apache Kafka lets all the components exchange data. All of it highly scalable and resilient, with a simplicity and safety not possible before with any existing mainstream technology.
Please do not assume any of these technologies is easy. Yes, they are far simpler than the alternatives, but still be prepared to work really hard to understand them well.
Important Note: This course will not cover configuration, administration, or DevOps. It is designed specifically for programmers.