Apache Spark’s missing link for Realtime Interactive Data Analysis

Spark and Interactive Data Analysis

Interactive data analysis is a scenario in which a human asks a question of the data and expects an answer in human time. Another characteristic of interactive data analysis is that a series of questions is usually asked – an operations analyst investigating site traffic might first group by geographic location, then drill down into other demographics like device type and user agent, and finally filter by a suspicious IP. A key requirement here is the ability to cache the data, since multiple queries are fired on the same data set – this is where Apache Spark fits naturally. Spark’s RDDs can be cached in memory with graceful fallback, which is many times faster than reading from disk and selecting the relevant data set every time.
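To make the drill-down pattern concrete, here is a minimal sketch of that query sequence, using a plain Python list to stand in for a cached RDD (a runnable Spark example would need a cluster; the field names and sample events are illustrative, not from any real data set):

```python
from collections import Counter

# "Cached" dataset: click-stream events parsed and held in memory once,
# the way a cached RDD would be -- no re-read between queries.
events = [
    {"geo": "US", "device": "mobile", "ip": "10.0.0.1"},
    {"geo": "US", "device": "desktop", "ip": "10.0.0.2"},
    {"geo": "IN", "device": "mobile", "ip": "10.0.0.3"},
]

# Query 1: group traffic by geographic location.
by_geo = Counter(e["geo"] for e in events)

# Query 2: drill down into US traffic by device type.
by_device = Counter(e["device"] for e in events if e["geo"] == "US")

# Query 3: filter by a suspicious IP -- same in-memory data again.
suspicious = [e for e in events if e["ip"] == "10.0.0.3"]
```

Each successive question hits the same in-memory data set; only the first load pays the disk-read cost, which is the advantage Spark's RDD caching provides.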

Adding a “Realtime” scenario

The word “realtime” has become a little confusing lately. There are two kinds of realtime here: first, the data is ingested in realtime and becomes available for action or querying immediately; second, a user asks a query and expects an immediate answer in real time. The second case is the same as interactive analysis; the first case is what we’ll focus on now.

So, the use case I wish to solve with Spark is realtime and interactive analysis. At first glance, Spark looks great: Spark SQL for simplifying access, Spark Streaming for realtime data, and core Spark for data on a Hadoop-compatible source. The catch is how to view and query both streaming data and historical data as a single RDD. In many cases, like log files or click-stream events, we have a realtime data stream and historical data that functionally form a single table. However, the design of Spark and Spark Streaming is similar to the lambda architecture, where you have a separate speed layer and a separate batch layer, and querying a merged view is a challenge.
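The lambda-architecture split being described can be sketched in plain Python, with lists standing in for the two layers (the layer names and merge helper are illustrative, not Spark API):

```python
from collections import Counter

# Batch layer: a view precomputed over historical data (e.g. files on HDFS).
historical = ["US", "US", "IN"]

# Speed layer: events that arrived since the last batch recomputation.
streaming = ["IN", "UK"]

def merged_count():
    # The "single table" the analyst wants only exists after an explicit
    # merge of both layers at query time -- this merge step is the gap:
    # neither layer alone answers the question correctly.
    return Counter(historical) + Counter(streaming)
```

In Spark terms, the batch layer is a cached RDD and the speed layer is a DStream's batches; the burden of producing the merged view on every query falls on the application.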

The workaround I find here is to keep ingesting the data into Hadoop in realtime and recompute the RDDs for each query, or at a particular frequency, but this takes away the advantage of caching RDDs for future queries. I do understand this is an intentional design limitation of RDDs. Well, a problem or a limitation is an opportunity to improvise, and I am looking to prototype a solution for this use case. I will be glad to hear any ideas in this space.
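A minimal sketch of that recompute-at-a-frequency workaround, assuming a hypothetical `load_from_storage` helper that stands in for re-reading and re-parsing the data from Hadoop:

```python
import time

def load_from_storage():
    # Stand-in for the expensive step the workaround repeats: re-reading
    # the full data set from storage and rebuilding the in-memory view.
    return ["US", "IN", "UK"]

class RefreshingView:
    """Rebuild the in-memory view whenever it is older than a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.loaded_at = 0.0
        self.data = None
        self.reloads = 0  # tracked here only to make the cost visible

    def get(self):
        now = time.monotonic()
        if self.data is None or now - self.loaded_at > self.ttl:
            # Full recompute: any benefit of the previous cached copy
            # is discarded, which is exactly the drawback noted above.
            self.data = load_from_storage()
            self.loaded_at = now
            self.reloads += 1
        return self.data
```

Queries between refreshes stay fast, but every refresh pays the full load cost again, and freshly ingested events are invisible until the next rebuild.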

Existing solutions: In-memory DBs

Druid Architecture

The existing solution for this use case is to use in-memory DBs like MemSQL (not open source) or Druid. These DBs are columnar and designed from the ground up for analytics. A point to note, however, is that these in-memory DBs expect structured data, so we cannot ingest a plain-text log file directly into these systems and extract fields at query time the way we can with Spark. If you are dealing with structured data, though, these in-memory DBs should be a great fit.

Abishek, LogBase