This sponsored post is produced by Databricks. 

With the influx of data, companies of all sizes share a goal: extracting insights and creating value. Yet data science projects and their promised returns remain out of reach for many, whether due to a lack of resources, the size of data sets, disparate sources, or other factors. The goal of the Apache Spark project is to dramatically simplify the process of extracting value from data.

What is Spark?

Spark is a rapidly growing open source data processing framework in the big data space. It was first created as part of a research project at UC Berkeley and was open-sourced just over five years ago. The framework, built for sophisticated analytics, speed, and ease of use, caught on quickly. Spark has already become the most active open source project in the big data ecosystem, with over 350 contributors added in the last year alone. Spark also has hundreds of production use cases within large enterprises like Yahoo, Baidu, and Tencent, for everything from batch analytics to stream processing.

In a nutshell, Spark enables enterprises to achieve high-performance computing at scale while simplifying their data infrastructure, because it avoids the difficult integration of a set of disparate and complex tools. That's because Spark is a parallel execution engine for big data that provides three highly desirable properties:

  1. Spark goes beyond batch computation and provides a unified platform that supports streaming, interactive analytics, and sophisticated data processing such as machine learning and graph algorithms.
  2. It is fast, as it has been built from the ground up to process data in memory. Spark's optimizations, however, extend beyond memory. Spark currently holds the record for the Terasort benchmark, having beaten the previous record, held by a Hadoop cluster, by running 3x faster and using 10x fewer nodes.
  3. It makes it much easier to write big data applications by exposing a rich and expressive API in a variety of languages, including Python, Java, and Scala. In particular, it exposes over 100 APIs, with map and reduce being just two of them.
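
To make the map and reduce operations mentioned in point 3 concrete, here is a minimal sketch of a word count written in that style. It deliberately uses only the Python standard library rather than Spark itself, so it runs on a single machine; the sample input and the names `lines`, `count_line`, and `word_counts` are illustrative assumptions, not part of any Spark API.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample input; in a real job this would come from a
# distributed data set rather than an in-memory list.
lines = [
    "spark makes big data simple",
    "big data needs fast engines",
]


def count_line(line):
    """Map step: turn one line of text into its own word counts."""
    return Counter(line.split())


# Map: apply count_line to every line independently.
partial_counts = map(count_line, lines)

# Reduce: merge the per-line counts pairwise into one total.
word_counts = reduce(lambda a, b: a + b, partial_counts, Counter())

print(word_counts.most_common(2))
```

Because each line is counted independently and the merge step is associative, the same two-step pattern can be split across many machines, which is the idea an engine like Spark builds on.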

Spark is compatible with the existing Hadoop stack. Broadly speaking, the Hadoop stack consists of three layers: a storage layer (HDFS), a resource management layer (YARN), and an execution layer (Hadoop MapReduce, or MR). Spark sits at the execution layer, runs on top of YARN, and can consume data from HDFS.

Compared with Hadoop MR, Spark is up to 100x faster and requires 2-5x fewer lines of code to write big data applications. Functionally, it can replace not only MR but also other systems in the Hadoop ecosystem, such as Storm, Mahout, and Giraph.

How do workloads interoperate within Spark?

Spark supports a variety of workloads, including batch, streaming, interactive, and iterative processing, through a powerful set of libraries: Spark Streaming, Spark SQL, MLlib, GraphX, and now SparkR.

All of these libraries use the same execution engine and the same storage abstraction, which makes it straightforward to combine the functionality they provide. For instance, one can easily invoke machine learning algorithms from Spark Streaming or from Spark, or use Spark SQL to query live streaming data.

This tight integration will enable new applications that were not possible before, such as online fraud detection and real-time large-scale optimization.

Data Science from ingest to production with Spark

While the Spark execution engine is a great start, it alone is not sufficient to solve the big data challenges facing enterprises today. Companies of all sizes are finding that there are many challenges on the journey to operationalize their data pipelines. These challenges include managing clusters; deploying, upgrading, and configuring Spark; interactively exploring data to get insights; and ultimately building data products.

A paradigm shift is required to address these challenges. Enterprises need a data platform that enables them to unlock the value of their data and to transition seamlessly from data ingest to exploration and production in one place. Databricks is one such platform, helping to free enterprises from today's constraints so they can focus on finding answers in their data, building data products, and ultimately capturing the value promised by big data. All in all, the ideal data platform will use Spark but will also go one step beyond to include the critical components that organizations need to connect to a wide variety of data sources, gain productivity with user-friendly tools, collaborate more effectively, and serve data products to a broad audience.

Ion Stoica is the CEO and Co-Founder of Databricks.


Sponsored posts are content that has been produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. The content of news stories produced by our editorial team is never influenced by advertisers or sponsors in any way. For more information, contact sales@venturebeat.com.