Distribute with Hazelcast, Persist into HBase

Hbase_logo

In this article I will implement a solution for a Big Data scenario. 
I will use HBase as persistence layer, and Hazelcast as distributed cache.
So the resulting project will be a "Getting Started Sample" for ones who wants to use HBase as persistent storage for their Hazelcast application.

The Scenario
Suppose you have (or hope to have:) ) "User" data with billions of records. -> Big Data
People will reach the data from your web application; query them, search them... -> Real-time Access
Some records will be reached more frequently -> Cache them in memory, serve faster.
Can add/remove columns, no strict schema -> Sparse data
Given the main requirements, the solution "NoSQL + Distributed Cache" fits to our scenario.
I will persist user data to the HBase:
A no-sql key-value datastore based on Hadoop technology and specialized for Big Data requirements.
It is modeled after Google's Big table and used by Yahoo and Facebook.
Facebook prefered HBase over Cassandra for its messaging system.
To learn more 
I will cache and distribute the data with Hazelcast.

HBase Setup
HBase is intended to be used in cluster but it has a standalone mode that you can try and use for development purposes.
For HBase setup follow:
If you use Ubuntu, you will encounter problems.
Although windows is not recommended for production, still you can try HBase on Windows.

Hazelcast Setup
Hazelcast is deadly simple to use. Just download and add hazelcast.jar to your classpath.
If you are new to hazelcast have a look at:

Project Setup
Create a maven Java project with dependencies:
Create a User pojo:
Create the user table in HBase:
Run your hbase by,
HBASE_DIR> ./bin/start-hbase.sh
Here it will be good to check the logs, to be sure it is installed and started properly.
Then open the HBASE shell by,
HBASE_DIR> ./bin/hbase shell
Create the user table
hbase(main):008:0> create 'user', 'cf_basic', 'cf_text'
Here I should tell more about 'cf_basic' and 'cf_text'. These are column families. 
Column families are stored together in the disk with the same storage specifications.
For example if you want some type of data (e.g. images) to be compressed then make them the same column family so you can define the same storage rule for them.
Here we have two column families: 'cf_basic' is for simple types, numbers, strings and 'cf_text' is for long text columns.
Notice that we have done nothing about schema, column types etc.
In the HBase intro video, you will recall Todd uses the term "datastore" instead "database" defining HBase.
HBase (and other key-value stores) is more like a persisted HashMap than a database.
You gain scalability but lose complex queries.

Create HBaseMapStore
This is the class where hazelcast will call at each map operation.
And a singleton service for getting HBase table.
On map.get hazelcast will look at HBase if it can not find the key in memory. Similarly when you put an element to map, hazelcast will persist it to HBase.
Why have not we implemented the loadAll? loadAll and loadAllKeys methods are for initially filling the hazelcast map from database. As we expect millions of records, it is not feasible to load db to memory. So we left them empty.
Unfortunately HTable is not thread safe, so you have to handle concurrency.

Configure Hazelcast
Here is hazelcast.xml that we put to classpath.
First difference from default one is I have added mapstore declaration to map config part.
Secondly I have enabled the eviction on maps. You can use hazelcast as a distributed cache by enabling eviction. So hazelcast evicts (removes) expired entries. To enable eviction set eviction-policy to LRU (or LFU) and max-size. For more information about hazelcast eviction see: 

Run The Code
Now let's test it. 
And see the records in database:

hbase(main):055:0> get 'user', 'u-6'
COLUMN                CELL                                                      
 cf_basic:age         timestamp=1334320415281, value=\x00\x00\x00\x1D           
 cf_basic:location    timestamp=1334320415281, value=Istanbul                   
 cf_basic:name        timestamp=1334320415281, value=Mehmet Dogan               
 cf_text:details      timestamp=1334320415281, value=software developer .....   
4 row(s) in 0.0150 seconds

Write-Through and Write-Behind
The default configuration of map-store is write-through: records are synchronously persisted to datastore.
If you set write-delay-seconds in hazelcast.xml to a positive value then the behaviour will be write-behind.
The entries added will be persisted after n seconds. 
deleteAll and storeAll methods implemented in mapstore are used in write-behind mode.

POJO Mapping
If you do not want to map your objects manually; you can use Kundera.
It is JPA compliant ORM for Big data.

Source Code
You can reach the example project code:

Getting started with Spring and Hazelcast

This article is a getting-started tutorial on integrating hazelcast into a Spring project.

Part 1: First Create Spring project (if you have already skip to Part 2)
1- Create a maven project.
mvn archetype:generate -DgroupId=spring_hazelcast -DartifactId=spring_hazelcast  

2- Add spring dependencies into pom.

3- Create a TestBean.

3- Create beans.xml in your source root (src/main/java/beans.xml)

4- Test your Spring app.

If you see "success" printed, now we can integrate hazelcast.

Part 2: Integrate Hazelcast
1- Add hazelcast dependencies

2- Add hz name spaces, hazelcast configuration and an hazelcast map bean to beans xml.
In hz:map definition, "id" is spring bean id, name refers to hazelcast map's name.

3- Test hazelcast.

You can download and browse the getting started project here:

Distribute Grails with Hazelcast

Hazelgrails

 

In this article I will try to integrate my two favorite technology: grails and hazelcast.
(Bias: I am currently work for Hazelcast)

Ruby on Rails gained popularity among people who seeks productivity on web programming.
Java is often criticisized on being heavy for rapid development.
But richness of Java community has given birth to flexible and dynamic JVM languages like Groovy.
Grails is somewhat synthesis of power of Java (with the help of Groovy) and productivity of Rails with philosophy "convention over configuration".

Another technology which amazes me is Hazelcast.
I remember the days which I first meet socket programming, RMI; in university.
And when I first tried the Hazelcast my first reaction is "How "Distributing your data over machines" could be so easy.
Single jar, no dependency, distribute your data over maps, queues, topics...

So I have decided to integrate these two technology, write a simple plugin so anyone can easily distribute data over memory by hazelcast.
I called it hazelgrails and pushed it to GitHub: https://github.com/enesakar/hazelgrails
Here introductory on using this plugin.

How to Install Plugin

Run the command:
install-plugin hazelgrails

Configuration

You will see hazelcast.xml in conf directory under plugins directory.
You can configure hazelcast in details. 
For available options have a look at:

To see hazelcast logs add following to Config.groovy:
info 'com.hazelcast'

Use Hazelcast as Hibernate 2nd Level Cache
In DataSource.groovy replace the following line in hibernate configuration block.
cache.region.factory_class = 'com.hazelcast.hibernate.HazelcastCacheRegionFactory'
For more details about 2nd level cache configuration have a look at:

Test The  Plugin

Create an Grails application and install the plugin. Then create a domain and two controllers.
create-domain-class com.hazelgrails.Customer
create-controller com.hazelgrails.Server1
create-controller com.hazelgrails.Server2

As you see, Customer is serializable. Hazelcast requires the objects to be serializable in order to distribute them in cluster.

Now create the war file (command "grails war") but copy the file with different name (app2.war). 
You may deploy the wars into different machines in the same network, or to different servers (Tomcat, Jetty) in same machine or even into the same Tomcat.
For simplicity I have run the current app by "grails run-app" and I have deployed the war to an external Tomcat.
And test them:

http://localhost:9091/TestGrails/server1/index
Output:
Cities:[2:New York, 1:London] 
Timestamps:[1333447087796, 1333447112863]
http://localhost:8080/app2/server2/index
Output:
Cities:[2:New York, 1:London] 
First customer name:tom, age:20 
Timestamps:[1333447087796, 1333447112863]

In practice, if you see the following then you can conclude the nodes formed a cluster succesfully. (you should add "info 'com.hazelcast'" into Config.groovy)
Members [2] {
        Member [127.0.0.1:5701]
        Member [127.0.0.1:5702] this
}

Usage Examples

There are two new methods defined for domain classes.
saveHz() method, first persists the domain object (like original save()) then puts it to hazelcast map.
getHz() method tries to find object with given id first in hazelcast map, if not found then tries to find it in datasource.
Hazelcast create a distributed map for each domain class. 
So by using saveHz() and getHz() you can get your objects from distributed memory instead of getting by database operations.
Also by injecting hazelService, you can create hazelcast instances.
Here the usage exampples:

Hazelcast: Advantage of using MultiMap

One of the distributed data structures supported by Hazelcast is MultiMap.
It is very useful container as it stores a collection of values mapped to given key.
But as you know you can always use an Map as MultiMap.
Just create a set or list for each new key, and add to this collection in each put operation.

But in Hazelcast's distributed world, MultiMap has extra benefits.
Hazelcast optimized the MultiMap operations so that operations (put, remove, containsValue and containsEntry) do not serialize, deserialize the whole collection.
Instead just the added or retrieved object is processed.

Note: This information is extracted from a conversation in Hazelcast mail group.
For more you can join the group.

Eviction and Hazelcast

Eviction

 

This article briefly describes "what is eviction?" then evaluates eviction capabilities of Hazelcast.

 


What is eviction?
It is a common practice to cache commonly accessed data in memory.
But as memory is still an expensive resource, you probably want to limit the size of your cache. 
When your cache reaches the limit, you have to remove (evict) some objects to get a place for new objects.

Hazelcast and Eviction
You can use Hazelcast as distributed cache for your application.
Although distributed systems have advantage on memory, you still need to think about eviction, how to put limit on cache size.

Hazelcast is powerful with its eviction capabilities; you can set global eviction policy also set lifetime for values seperately on maps.
Let's try and test Hazelcast eviction.

1- Simple Java Project

What is most charming about Hazelcast is its simplicity. 
Distributing data over nodes seem so complex but using Hazelcast is not different than using a simple StringUtils library.
So just create a Java Project in your ide and add hazelcast.jar to your classpath.
Or if you use maven:

2- Let's code

Let's fill our distributed map with default hazelcast settings.

And the output:
Current Map Size:1000
Current Map Size:2000
Current Map Size:3000
Current Map Size:4000
Current Map Size:5000
Current Map Size:6000
java.lang.OutOfMemoryError

 

In default configuration hazelcast do not have eviction policy, as you see cache grows without limit. That is not the case we favour.

 

3. Global Eviction Policy: Limit Map Size

 

What we want to limit the maximum size of the map, so eviction should occur when size reaches the limit.
So we should add needed configuration to hazelcast.xml. You can either copy hazelcast-default.xml and put as hazelcast.xml to your classpath or you may specify config file location in code as follows.

So here the config part we changed: 

We set the max size 3000 for maps, and as policy we have set Least Recently Used items to be evicted.
You may choose LFU (Least Frequently Used) depending on your scenario.
Also we set 25 as eviction percentage, so when the map size reaches limit, 25% of items will be removed from Map.

Now execute the same just with the custom hazelcast.xml.
Here is the output:
Current Map Size:1000
Current Map Size:2000
Current Map Size:3000
Current Map Size:2313
Current Map Size:2329
Current Map Size:2333
Current Map Size:2334
Current Map Size:2335
...............................
As you see hazelcast starts eviction when size reaches 3000. 

4. Global Eviction Policy: Limit Entry Life Span
Now we will set maximum seconds to live for each entries in the map.
We will make the following changes to hazelcast.xml

Also let's track the time in our test code.

Here is the output:
Current Map Size:1000 Seconds:3
Current Map Size:2000 Seconds:6
Current Map Size:3000 Seconds:9
Current Map Size:3085 Seconds:13
Current Map Size:3052 Seconds:16
Current Map Size:2992 Seconds:19
Current Map Size:3024 Seconds:23
Current Map Size:2883 Seconds:26
We see that hazelcast starts to evict entries older than 10 seconds.

5. Limiting Life Span of Specific Entries
Besides setting global lifespan we can also set different life span values for each entries.
For this purpose we can use default hazelcast xml but change the code.
In this test we will set 10 seconds time-to-live for half of the entries.

And the output:

Current Map Size:1000 Seconds:3
Current Map Size:2000 Seconds:6
Current Map Size:3000 Seconds:9
Current Map Size:3539 Seconds:13
Current Map Size:4025 Seconds:16
Current Map Size:4493 Seconds:19
Current Map Size:4938 Seconds:23

So half of entries are started to be evicted after tenth seconds.

To sum-up we have covered the different eviction capabilities supported by hazelcast, in the simplest manner..

 

Good Practice: How to perform application wide initialization actions before web application startup

Use ServletContextListener


public class SampleListener  implements ServletContextListener {
    public void contextInitialized(ServletContextEvent servletContextEvent) {
              // init actions
    }
    public void contextDestroyed(ServletContextEvent servletContextEvent) {
    }
}

Do not forget web xml:

  <listener>
        <listener-class>
            com.package.SampleListener
        </listener-class>
    </listener>

Compile Groovy 1.8 With Maven (GMaven)

Use GMaven.

But in default configuration there is problem.

The workaround:

            <plugin>
                <groupId>org.codehaus.gmaven</groupId>
                <artifactId>gmaven-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>generateStubs</goal>
                            <goal>compile</goal>
                            <goal>generateTestStubs</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <!-- 1.8 not yet supported by plugin but 1.7 works here so long as we provide explicit version -->
                            <providerSelection>1.7</providerSelection>
                        </configuration>
                    </execution>
                </executions>
                <dependencies>
                    <dependency>
                        <groupId>org.codehaus.gmaven.runtime</groupId>
                        <artifactId>gmaven-runtime-1.7</artifactId>
                        <version>1.3</version>
                        <exclusions>
                            <exclusion>
                                <groupId>org.codehaus.groovy</groupId>
                                <artifactId>groovy-all</artifactId>
                            </exclusion>
                        </exclusions>
                    </dependency>
                    <dependency>
                        <groupId>org.codehaus.groovy</groupId>
                        <artifactId>groovy-all</artifactId>
                        <version>1.8.5</version>
                        <scope>compile</scope>
                    </dependency>
                </dependencies>
            </plugin>

via: