Analyzing a large JSON in memory – Part I

I have been doing a comparative study on different ways to analyse a large JSON in memory, and wanted to share the results I see.

I intend to do a wider set of comparisons and this post is going to be the first part. In this post will discuss about the use case and the test results while analysing in a simple row vs columnar fashion.

Use Case

Let’s say you got a big chunk of JSON data from a source like twitter API and you want to do some analysis on it by  firing multiple and random queries on this large JSON. To speed up the queries it’s preferred to load the JSON in memory and answer the queries instead of reaching out to the disk every time (assuming it fits in memory). Please note, I did not want to bother throwing distributed computing or a real time requirement  at this point.

The test data I took was 100000 records from twitter firehose that made up a 264MB JSON array on disk. The test query was to pull up a tweet that came from a particular timezone and from a particular user name – select text where user.screen_name = <> and user.timezone = <>.

Approaches

1. Row by row: Parse the twitter stream data in a traditional row by row fashion and filter by both user name and time zone parameters, select the tweet text.

2. NoSQL DB: Post the test data into a NoSQL DB like mongodb and run the query. This is not an always in-memory analysis but with mongodb nearing cache speeds, I would like to get some stats here. I will do this test by my next post.

3. Columnar: Convert the JSON to columnar format so that it’s better for analytics. (Later I intend to add compression as well). There are known techniques to convert nested JSON to columnar like Parquet format. I found that Parquet intended to solve more problems than I needed and could potentially perform better, I will revisit Parquet probably later. Another new player – MemSQL converts JSON to columnar and allows SQL querying on the same. MemSQL appears a good fit for our use case but it’s not a free software. Also, it’s a much bigger product that includes real-time and scalability solutions. I also found treasure data has a technology called Plazma,  again it did not look like an open sourced one.

I decided to code my own algorithm to flatten the JSON and query in an SQL fashion. My approach is a little different to Parquet in that I did not go by repetition level and definition level. Instead, I traversed the JSON assigning rowids to columns values based on the level and iteration.

With this basic set up, I ran my test query and obtained below stats:

Columnar

Time stats:

MilliSecs for file load: 30432

MilliSecs for query exec: 33

Space stats:

Increase in heap after file load: 303 MB

Total memory: 1517

Free memory after program: 431

Used (Total – Free): 1086 MB

Row by row

Time stats:

MilliSecs for file load: 34494

MilliSecs for query exec: 192

Space stats:

Increase in heap after file load: 241 MB

Total memory: 1652

Free memory after program: 303

Used (Total – Free): 1349 MB

Inference

We can see that right away with this basic columnar implementation, the query execution time is more than 5x faster.

I am unable to conclude well on the memory footprint due to other variables, but looks like the columnar format is lighter on memory by a small margin – Not bad considering I have not done any compression yet.

Columnar will be faster and lighter than Row is not a new finding. This test is to affirm the same and apply to JSON which is a nested data format.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s