<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog4Java</title>
	<atom:link href="http://malsolo.com/blog4java/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://malsolo.com/blog4java</link>
	<description>A personal and Java blog, likely only for me</description>
	<lastBuildDate>Tue, 31 Mar 2015 15:52:42 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.1.1</generator>
	<item>
		<title>Getting started with Docker</title>
		<link>http://malsolo.com/blog4java/?p=794</link>
		<comments>http://malsolo.com/blog4java/?p=794#comments</comments>
		<pubDate>Tue, 31 Mar 2015 15:51:23 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Docker]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=794</guid>
		<description><![CDATA[What is Docker? Docker is a platform. Docker runs natively on Linux or on OS X and Windows through a helper application called boot2docker that creates a Linux Virtual Machine, by using only RAM, to run Docker. Docker&#8216;s main goal &#8230; <a href="http://malsolo.com/blog4java/?p=794">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1>What is Docker?</h1>
<p><a href="https://www.docker.com/whatisdocker/" title="What is Docker?" target="_blank">Docker</a> is a platform.</p>
<p><a href="https://docs.docker.com/introduction/understanding-docker/" title="Understanding Docker" target="_blank">Docker</a> runs natively on Linux or on OS X and Windows through a helper application called <a href="https://github.com/boot2docker/boot2docker" title="boot2docker at GitHub" target="_blank">boot2docker</a> that creates a Linux Virtual Machine, by using only RAM, to run Docker.</p>
<p><a href="https://docs.docker.com/" title="About Docker" target="_blank">Docker</a>&#8216;s main goal is to allow you to ship distributed applications, that you&#8217;ve created previously, by running them as isolated processes in what is known as a <em>container</em>; this avoids the need for Virtual Machines, which means saving resources on the involved machines. The isolation also allows you to run several containers simultaneously.</p>
<p>So, <em><strong>Docker</strong> is an open platform for developing, shipping, and running applications</em>.</p>
<p>The key point is that you can separate your application from the infrastructure and also treat the infrastructure as an application. Everything can be packaged, distributed and deployed anywhere, and quickly. It&#8217;s said that Docker eliminates the friction between development, QA, and production environments.</p>
<p>Docker has two major <strong>components</strong>:</p>
<ul>
<li><a href="https://github.com/docker/docker" title="Docker at GitHub" target="_blank">Docker</a>, the container virtualization platform, that has a <em>daemon</em> running on a host server and a client (the <em>docker</em> binary) to talk to the Docker <em>daemon</em>.</li>
<li><a href="https://hub.docker.com/" title="Docker Hub" target="_blank">Docker Hub</a>, the <a href="http://en.wikipedia.org/wiki/Software_as_a_service" title="Software as a service at Wikipedia" target="_blank">SaaS</a> platform for sharing and managing Docker containers.</li>
</ul>
<h2>Docker concepts</h2>
<ul>
<li><strong>Image</strong> (<em>build</em> component of Docker): a Docker environment template; it consists of files (a copy of what the environment is expected to contain) and metadata (information such as environment variables, port mappings and so on), and it has a name (<em>&#8220;ubuntu&#8221;</em>, for instance)</li>
<li><strong>Registry</strong> (<em>distribution</em> component of Docker): a public or private store that holds images. The public Docker registry is called <a href="http://hub.docker.com/" title="Docker Hub" target="_blank">Docker Hub</a>.</li>
<li><strong>Container</strong> (<em>run</em> component of Docker): a running instance of a Docker image. It&#8217;s created from an image, and it can be run, started, stopped, moved, and deleted. Each container is an isolated and secure application platform.</li>
</ul>
<h2>How does Docker work?</h2>
<p>An image is a series of layers that are combined into a single image by Docker using <em><strong>union file systems</strong></em> (<a href="http://en.wikipedia.org/wiki/UnionFS" title="UnionFS at Wikipedia" target="_blank">UnionFS</a>, <em>a file system service that allows files and directories of separate file systems, known as branches, to be transparently overlaid, forming a single coherent file system</em>)</p>
<p>From a base image, you can add and modify additional layers, and there&#8217;s no need to rebuild the entire image when something changes; you only need to replace the added or updated layer.</p>
<p>A container is built from an image, which is read-only. To run the container, Docker adds a read-write layer on top of the image (using UnionFS), then allocates a network/bridge interface and an IP address, and finally executes the specified process, capturing its input and providing its output.</p>
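<p>You can see the layers that make up an image with the <em><strong>docker history</strong></em> command. As a sketch (the IDs, dates and sizes will differ on your machine, and the output is abbreviated here):</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker history ubuntu
IMAGE               CREATED             CREATED BY                                      SIZE
d0955f21bf24        11 days ago         /bin/sh -c #(nop) CMD [/bin/bash]               0 B
...</pre><p></p>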
<p>The container is isolated thanks to a technology called <em><strong>namespaces</strong></em> (<a href="http://en.wikipedia.org/wiki/Cgroups#NAMESPACE-ISOLATION" title="Namespaces at Wikipedia" target="_blank">Namespace isolation</a>, <em>where groups of processes are separated such that they cannot &#8220;see&#8221; resources in other groups</em>)</p>
<p>Docker uses these namespaces:</p>
<ul>
<li>The <strong>pid</strong> namespace, for process isolation.</li>
<li>The <strong>net</strong> namespace, for managing network interfaces.</li>
<li>The <strong>ipc</strong> namespace, for Inter-Process Communication.</li>
<li>The <strong>mnt</strong> namespace, for managing mount-points.</li>
<li>The <strong>uts</strong> namespace, for changing the hostname.</li>
</ul>
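<p>You can see the <strong>pid</strong> namespace in action: inside a container, only the container&#8217;s own processes are visible. As a sketch (the exact PIDs and timestamps will vary):</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run ubuntu:14.04 ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 12:00 ?        00:00:00 ps -ef</pre><p></p>
<p>The <em>ps</em> command itself runs as PID 1, and no host processes are listed.</p>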
<p>In order to achieve isolation of running applications, Docker uses another technology called <em><strong>control groups</strong></em> (<a href="http://en.wikipedia.org/wiki/Cgroups" title="cgroups at Wikipedia" target="_blank">cgroups</a>, <em>which limit, account for, and isolate the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes</em>)</p>
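<p>These limits can be set when starting a container. As a sketch (the values are just an illustration), the <strong>-m</strong> flag caps the container&#8217;s memory and <strong>-c</strong> sets its relative CPU shares:</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run -i -t -m 256m -c 512 ubuntu:14.04 /bin/bash</pre><p></p>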
<h1>Installing Docker</h1>
<p>I&#8217;m going to use <a href="http://www.ubuntu.com/" title="Ubuntu: The leading OS for PC, tablet, phone and cloud" target="_blank">Ubuntu</a> (as <a href="http://en.wikipedia.org/wiki/Sheldon_Cooper" title="Sheldon Cooper at Wikipedia" target="_blank">Sheldon Cooper</a> said, <a href="http://www.imdb.com/title/tt1648756/quotes?item=qt1224150" title="Sheldon Cooper and Ubuntu" target="_blank">my favorite Linux-based operating system</a>)</p>
<p>There are <u>three packages available</u>, two of them from Ubuntu: an older KDE3/GNOME2 system-tray application called <em>docker</em> (warning, this is the package suggested when you type <em>docker version</em> before installing it: <em>&#8220;The program &#8216;docker&#8217; is currently not installed. You can install it by typing: sudo apt-get install docker&#8221;</em>) and a newer one called <em>docker.io</em>, which is still not the most recent Docker release. The third one is a PPA (Personal Package Archive for Ubuntu) for <a href="http://www.ubuntuupdates.org/ppa/docker" title="3rd Party Repository: Docker" target="_blank">Docker</a>, and it&#8217;s called <a href="http://www.ubuntuupdates.org/package/docker/docker/main/base/lxc-docker" title="Ubuntu Package &quot;lxc-docker&quot;" target="_blank"><em>lxc-docker</em></a>.</p>
<p>If you don&#8217;t want extra repositories and you don&#8217;t need the latest version, just install <em>docker.io</em> (if you install it, I recommend enabling tab-completion of Docker commands in BASH):</p>
<p></p><pre class="crayon-plain-tag">$ sudo apt-get update
$ sudo apt-get install docker.io
$ source /etc/bash_completion.d/docker.io</pre><p></p>
<p>To validate the installation, type:</p><pre class="crayon-plain-tag">$ sudo docker version
Client version: 1.0.1
Client API version: 1.12
Go version (client): go1.2.1
Git commit (client): 990021a
Server version: 1.0.1
Server API version: 1.12
Go version (server): go1.2.1
Git commit (server): 990021a
$</pre><p></p>
<p>You need to use <strong>sudo</strong> because the <em>docker</em> daemon always runs as the <em>root</em> user. This is because the <em>docker</em> daemon binds to a Unix socket (instead of a TCP port, as it did until version 0.5.2), and by default that Unix socket is owned by the user <em>root</em>, so you need to run the <em>docker</em> command with <em>sudo</em>. You can solve it by <a href="http://docs.docker.com/v1.4/installation/ubuntulinux/#giving-non-root-access" title="Giving non-root access" target="_blank">giving non-root access to Docker</a> or, even better, you can <a href="https://docs.docker.com/installation/ubuntulinux/#create-a-docker-group" title="Create a docker group" target="_blank">create a docker group and add users to it</a>.</p>
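<p>As a sketch of the second option (the service may be called <em>docker.io</em> instead of <em>docker</em>, depending on the package you installed):</p>
<p></p><pre class="crayon-plain-tag">$ sudo groupadd docker
$ sudo gpasswd -a ${USER} docker
$ sudo service docker restart</pre><p></p>
<p>After logging out and back in (so the group change takes effect), you can run the <em>docker</em> command without <em>sudo</em>.</p>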
<p>Otherwise, if you run docker without sudo you will obtain this error message:</p><pre class="crayon-plain-tag">$ docker version
Client version: 1.0.1
Client API version: 1.12
Go version (client): go1.2.1
Git commit (client): 990021a
2015/03/20 11:40:30 Get http:///var/run/docker.sock/v1.12/version: dial unix /var/run/docker.sock: permission denied
$</pre><p></p>
<p>In order to have the latest version, use the <em>lxc-docker</em> package, which is the one maintained by Docker itself.</p>
<p>There is a <a href="https://get.docker.com/ubuntu/" title="Get Docker on Ubuntu" target="_blank">script to install the Docker package on Ubuntu</a>, but I&#8217;d rather do it manually. Still, you can try it with:</p>
<p></p><pre class="crayon-plain-tag">$ curl -sSL https://get.docker.com/ubuntu/ | sudo sh</pre><p></p>
<p>To install lxc-docker, your APT system has to be able to deal with HTTPS (it can if the file <em>/usr/lib/apt/methods/https</em> exists; if you don&#8217;t have it, install the <em><strong>apt-transport-https</strong></em> package)</p>
<p>Then, just add the Docker repository key to the local keychain, and to the apt sources list. Finally, update and install the <em><strong>lxc-docker</strong></em> package:</p>
<p></p><pre class="crayon-plain-tag">$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 36A1D7869245C8950F966E92D8576A8BA88D21E9
$ sudo sh -c "echo deb https://get.docker.com/ubuntu docker main > /etc/apt/sources.list.d/docker.list"
$ sudo apt-get update
$ sudo apt-get install lxc-docker</pre><p></p>
<p>Now we have the latest version:</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.4.1
Git commit (client): a8a31ef
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.4.1
Git commit (server): a8a31ef
$</pre><p></p>
<p>And we can run the basic example, which downloads the <em>ubuntu</em> image and then starts <em>bash</em> in a container. When you&#8217;re done, type <em>exit</em>.</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run -i -t ubuntu /bin/bash
Unable to find image 'ubuntu:latest' locally
511136ea3c5a: Pull complete 
f3c84ac3a053: Pull complete 
511136ea3c5a: Download complete 
f3c84ac3a053: Download complete 
a1a958a24818: Download complete 
9fec74352904: Download complete 
d0955f21bf24: Download complete 
Status: Downloaded newer image for ubuntu:latest
root@0f5e5a64c583:/# ls
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr
root@0f5e5a64c583:/# exit
exit
$</pre><p></p>
<p>Having said that, Docker has changed the instructions for <a href="https://docs.docker.com/installation/ubuntulinux/#installing-docker-on-ubuntu" title="Installing Docker on Ubuntu" target="_blank">Installing Docker on Ubuntu</a>&#8230; today! <img src="http://malsolo.com/blog4java/wp-includes/images/smilies/icon_neutral.gif" alt=":|" class="wp-smiley" /></p>
<p>It seems easier, but I don’t like it very much. If you have wget installed, you can get the latest Docker package with:</p>
<p></p><pre class="crayon-plain-tag">$ wget -qO- https://get.docker.com/ | sh</pre><p></p>
<p>It prompts for the password and then it downloads and installs&#8230; the <em>lxc-docker</em> package. </p>
<p>And then, verify the installation with:</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run hello-world
[sudo] password for Javier: 
Unable to find image 'hello-world:latest' locally
31cbccb51277: Pull complete 
e45a5af57b00: Pull complete 
511136ea3c5a: Already exists 
hello-world:latest: The image you are pulling has been verified. Important: image verification is a tech preview feature and should not be relied on to provide security.
Status: Downloaded newer image for hello-world:latest
Hello from Docker.
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (Assuming it was not already locally available.)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

For more examples and ideas, visit:
 http://docs.docker.com/userguide/
$</pre><p></p>
<h1>Using Docker</h1>
<h2>Running applications inside containers</h2>
<p>Docker allows you to run applications inside containers with the command <strong><em>docker run</em></strong>.</p>
<p>The command <em>docker run</em> creates a new container from the image name that you specify (the mandatory parameter for <em>run</em>) and then runs the given command in it, performing these steps: Docker looks for the image on this computer; if it isn&#8217;t installed yet, Docker searches for the image at Docker Hub and downloads and installs it. Once the image is installed, Docker creates a new container and starts the program.</p>
<p>We&#8217;ve seen above one example:</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run -t -i ubuntu:14.04.2 /bin/bash</pre><p></p>
<p>It creates a container from the <a href="https://registry.hub.docker.com/_/ubuntu/" title="Docker Hub: Official Ubuntu base image" target="_blank">Official Ubuntu base image</a> (tag 14.04.2) and then it runs a Bash shell command, with a terminal assigned (flag <strong>-t</strong>) and the container&#8217;s standard input attached (flag <strong>-i</strong>)</p>
<p>To execute it in the background (or daemonized), use the <strong>-d</strong> flag:</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run -d --name="Javier" ubuntu:14.04 /bin/sh -c "while true; do echo hello world; sleep 1; done"
e84ae64138881b9eaf6ac743c6f0076cfc414f017daef9e59c3d2e1d591eb7b9</pre><p></p>
<p>Here, the command will run forever. Besides, I have assigned a name to the container (OK, my very own name, some egocentricity here) to easily discover it later (Docker automatically names any container that you start, but I&#8217;d rather specify the name myself). Furthermore, Docker returns the container ID (<em>e84ae6413888&#8230;</em>).</p>
<p>You can find both the ID and the name when listing containers with the command <em><strong>docker ps</strong></em> (flag <strong>-a</strong> to show all of them, regardless of whether they are running or not)</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND                CREATED             STATUS              PORTS               NAMES
e84ae6413888        ubuntu:14.04        "/bin/sh -c 'while t   5 days ago          Up 2 minutes                            Javier              
$</pre><p></p>
<p>The <strong>ports</strong> in the container can be exposed randomly with the <strong>-P</strong> flag or manually with <strong>-p</strong>. In either case, you can see the mapping in the <em>PORTS</em> column of the <em>docker ps</em> output or with the <em>docker port</em> command. Let&#8217;s see an example with a <a href="https://registry.hub.docker.com/u/training/webapp/" title="The Docker Fundamentals repository contains the example Hello World Python WebApp" target="_blank">sample web application image</a>.</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run -d -P training/webapp python app.py
Unable to find image 'training/webapp:latest' locally
Pulling repository training/webapp
31fa814ba25a: Download complete 
511136ea3c5a: Download complete 
f10ebce2c0e1: Download complete 
82cdea7ab5b5: Download complete 
5dbd9cb5a02f: Download complete 
74fe38d11401: Download complete 
64523f641a05: Download complete 
0e2afc9aad6e: Download complete 
e8fc7643ceb1: Download complete 
733b0e3dbcee: Download complete 
a1feb043c441: Download complete 
e12923494f6a: Download complete 
a15f98c46748: Download complete 
Status: Downloaded newer image for training/webapp:latest
a03ec94ea8087789d605ca91d6689d8026e7806e9138b8b9b7ed7f5a1295db85
$ sudo docker ps
CONTAINER ID        IMAGE                    COMMAND             CREATED             STATUS              PORTS                     NAMES
a03ec94ea808        training/webapp:latest   "python app.py"     25 seconds ago      Up 24 seconds       0.0.0.0:49153->5000/tcp   happy_pare          
$ sudo docker port happy_pare
5000/tcp -> 0.0.0.0:49153
$ curl http://localhost:49153
Hello world!
$</pre><p></p>
<p>This means that you can access the application running in the container by using the port 49153 locally.</p>
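<p>With the <strong>-p</strong> flag, you choose the host port yourself, in the form <em>hostPort:containerPort</em>. As a sketch, the same application bound to port 8080 of the host:</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run -d -p 8080:5000 training/webapp python app.py
$ curl http://localhost:8080
Hello world!</pre><p></p>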
<p>To manage the container, you can use the following commands:</p>
<ul>
<li><em><strong>docker logs</strong></em> to see the standard output of a container (<strong>-f</strong> to follow new output; press Ctrl+C to exit)</li>
<li><em><strong>docker top</strong></em> to see the processes in the container.</li>
<li><em><strong>docker inspect</strong></em> to see configuration and status information about a container.</li>
<li><em><strong>docker stop/kill</strong></em> to stop or kill (respectively) a running container.</li>
<li><em><strong>docker start</strong></em> to restart a stopped container (remember the command <em>docker ps -a</em>).</li>
<li><em><strong>docker rm</strong></em> to remove a container.</li>
<li><em><strong>docker version</strong></em> to see the current version of the program, its programming language (<a href="https://golang.org/" title="The Go Programming Language" target="_blank">Go</a>) and so on.</li>
</ul>
<p></p><pre class="crayon-plain-tag">$ sudo docker logs happy_pare
 * Running on http://0.0.0.0:5000/
172.17.42.1 - - [31/Mar/2015 09:51:38] "GET / HTTP/1.1" 200 -
$ sudo docker stop happy_pare
happy_pare
$ sudo docker start happy_pare
happy_pare
$ sudo docker kill happy_pare
happy_pare
$ sudo docker rm happy_pare
happy_pare
$ sudo docker top happy_pare
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                8580                1372                0                   12:00               ?                   00:00:00            python app.py
$ sudo docker inspect hungry_kirch
[{
    "AppArmorProfile": "",
    "Args": [
        "app.py"
    ],
    "Config": {
        "AttachStderr": false,
        "AttachStdin": false,
        "AttachStdout": false,
        "Cmd": [
            "python",
            "app.py"
        ],
        "CpuShares": 0,
        "Cpuset": "",
        "Domainname": "",
        "Entrypoint": null,
        "Env": [
            "HOME=/",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ],
        "ExposedPorts": {
            "5000/tcp": {}
        },
        "Hostname": "468cd593eada",
        "Image": "training/webapp",
        "MacAddress": "",
        "Memory": 0,
        "MemorySwap": 0,
        "NetworkDisabled": false,
        "OnBuild": null,
        "OpenStdin": false,
        "PortSpecs": null,
        "StdinOnce": false,
        "Tty": false,
        "User": "",
        "Volumes": null,
        "WorkingDir": "/opt/webapp"
    },
    "Created": "2015-03-31T10:00:32.697111567Z",
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": null,
        "CapAdd": null,
        "CapDrop": null,
        "ContainerIDFile": "",
        "Devices": [],
        "Dns": null,
        "DnsSearch": null,
        "ExtraHosts": null,
        "IpcMode": "",
        "Links": null,
        "LxcConf": [],
        "NetworkMode": "bridge",
        "PidMode": "",
        "PortBindings": {},
        "Privileged": false,
        "PublishAllPorts": true,
        "ReadonlyRootfs": false,
        "RestartPolicy": {
            "MaximumRetryCount": 0,
            "Name": ""
        },
        "SecurityOpt": null,
        "VolumesFrom": null
    },
    "HostnamePath": "/var/lib/docker/containers/468cd593eadaea6d18441a33ca6c1ea42f1b398d6fd028fa5b557181a5cf36f3/hostname",
    "HostsPath": "/var/lib/docker/containers/468cd593eadaea6d18441a33ca6c1ea42f1b398d6fd028fa5b557181a5cf36f3/hosts",
    "Id": "468cd593eadaea6d18441a33ca6c1ea42f1b398d6fd028fa5b557181a5cf36f3",
    "Image": "31fa814ba25ae3426f8710df7a48d567d4022527ef2c14964bb8bc45e653417c",
    "MountLabel": "",
    "Name": "/hungry_kirch",
    "NetworkSettings": {
        "Bridge": "docker0",
        "Gateway": "172.17.42.1",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "172.17.0.10",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "LinkLocalIPv6Address": "fe80::42:acff:fe11:a",
        "LinkLocalIPv6PrefixLen": 64,
        "MacAddress": "02:42:ac:11:00:0a",
        "PortMapping": null,
        "Ports": {
            "5000/tcp": [
                {
                    "HostIp": "0.0.0.0",
                    "HostPort": "49155"
                }
            ]
        }
    },
    "Path": "python",
    "ProcessLabel": "",
    "ResolvConfPath": "/var/lib/docker/containers/468cd593eadaea6d18441a33ca6c1ea42f1b398d6fd028fa5b557181a5cf36f3/resolv.conf",
    "RestartCount": 0,
    "State": {
        "Error": "",
        "ExitCode": 0,
        "FinishedAt": "0001-01-01T00:00:00Z",
        "OOMKilled": false,
        "Paused": false,
        "Pid": 8580,
        "Restarting": false,
        "Running": true,
        "StartedAt": "2015-03-31T10:00:32.898205331Z"
    },
    "Volumes": {},
    "VolumesRW": {}
}
]
$</pre><p></p>
<h2>Working with Docker Images</h2>
<p>As we have explained, Docker runs containers by using images that are either already installed on your system or available at <a href="https://hub.docker.com/" title="Docker Hub" target="_blank">Docker Hub</a> (from which they are downloaded and installed on your system)</p>
<p>You can see what images are already installed with the <strong><em>docker images</em></strong> command.  </p>
<p></p><pre class="crayon-plain-tag">$ sudo docker images
REPOSITORY              TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
ubuntu                  trusty-20150320     d0955f21bf24        11 days ago         188.3 MB
ubuntu                  14.04               d0955f21bf24        11 days ago         188.3 MB
ubuntu                  14.04.2             d0955f21bf24        11 days ago         188.3 MB
ubuntu                  latest              d0955f21bf24        11 days ago         188.3 MB
ubuntu                  trusty              d0955f21bf24        11 days ago         188.3 MB
training/webapp         latest              31fa814ba25a        10 months ago       278.8 MB
$</pre><p></p>
<p>If you want to manually install an image before running it, you can use the <strong><em>docker pull</em></strong> command with images that you find at <a href="https://hub.docker.com/" title="Docker Hub" target="_blank">Docker Hub</a>, either by browsing it or with the <strong><em>docker search</em></strong> command.</p>
<p>As a curiosity, if you look for images at Docker Hub, you can find Official repos, for instance, the <a href="https://registry.hub.docker.com/_/java/" title="Java OFFICIAL REPO" target="_blank">Java OFFICIAL REPO</a>.</p>
<p>You can add a new name to an image with the <strong><em>docker tag</em></strong> command.</p>
<p>If you want to remove an image, you can do so with the <strong><em>docker rmi</em></strong> command.</p>
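<p>As a quick sketch of these commands (using the <em>centos</em> image and a hypothetical <em>my-centos</em> name; the search output is omitted here):</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker search centos
$ sudo docker pull centos
$ sudo docker tag centos my-centos
$ sudo docker rmi my-centos</pre><p></p>
<p>Note that removing the <em>my-centos</em> name with <em>docker rmi</em> only untags the image, since the <em>centos</em> name still points to it.</p>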
<h3>A note about the official images</h3>
<p>The images and relevant files are maintained at GitHub by an organization called <a href="https://github.com/docker-library" title="docker-library" target="_blank">docker-library</a> (Docker is open source, and it&#8217;s maintained at GitHub by an organization called <a href="https://github.com/docker" title="Docker" target="_blank">Docker</a>, that has several repositories, including the one for <a href="https://github.com/docker/docker" title="Docker - the open-source application container engine" target="_blank">docker</a>).</p>
<p>The official images exist in a repository, <a href="https://github.com/docker-library/official-images" title="Docker Official Images" target="_blank">Docker Official Images</a>, that contains a folder for the <a href="https://github.com/docker-library/official-images/tree/master/library" title="official-images/library" target="_blank">library definitions</a>, for instance, the one for <a href="https://github.com/docker-library/official-images/blob/master/library/java" title="official-images/library/java" target="_blank">java</a>.</p>
<p>The image packages are also maintained by docker-library in each corresponding repository, for instance, the <a href="https://github.com/docker-library/java" title="Docker Official Image packaging for Java (openJDK)" target="_blank">Docker Official Image packaging for Java</a> (<a href="http://openjdk.java.net/" title="OpenJDK" target="_blank">openJDK</a>)</p>
<p>It&#8217;s worth mentioning because there is another organization at GitHub called <a href="https://github.com/dockerfile" title="Dockerfile Project" target="_blank">dockerfile</a> (<em>Trusted Automated Docker Builds</em>) that has repositories for several Docker builds, for instance, the one for <a href="https://github.com/dockerfile/java" title="Java Dockerfile for trusted automated Docker builds" target="_blank">Java</a>, which includes an image for the <a href="http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html" title="Java SE Development Kit 8 Downloads" target="_blank">Oracle Java 8 JDK</a> (which I was looking for)</p>
<h3>Creating your own images</h3>
<p>There are two ways of updating and creating images.</p>
<ol>
<li>Run a container from an image (<em>docker run</em>), then update it, and finally commit the results to an image, with the <em><strong>docker commit</strong></em> command.</li>
<li>Use a <em><strong>Dockerfile</strong></em> to create an image.</li>
</ol>
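<p>The second option can be sketched like this (the image name and the installed package are just an illustration): you write the instructions in a file called <em>Dockerfile</em> and then build the image with <em><strong>docker build</strong></em>:</p>
<p></p><pre class="crayon-plain-tag">$ cat Dockerfile
FROM ubuntu:14.04
MAINTAINER Javier (@jbbarquero)
RUN apt-get update && apt-get install -y nano
CMD ["/bin/bash"]
$ sudo docker build -t jbbarquero/ubuntu-nano .</pre><p></p>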
<p>When you&#8217;re done, you can push the created image to Docker Hub with the <strong><em>docker push</em></strong> command.</p>
<p>Finally, you can remove images with the <strong><em>docker rmi</em></strong> command.</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker run -t -i ubuntu:latest /bin/bash
root@98bec5327540:/# sudo apt-get install --reinstall software-properties-common
root@98bec5327540:/# sudo add-apt-repository ppa:webupd8team/java
root@98bec5327540:/# sudo apt-get update
root@98bec5327540:/# sudo apt-get install oracle-java8-installer
root@98bec5327540:/# sudo apt-get install nano
root@98bec5327540:/# wget http://apache.rediris.es/maven/maven-3/3.3.1/binaries/apache-maven-3.3.1-bin.tar.gz
root@98bec5327540:/# tar -xvf apache-maven-3.3.1-bin.tar.gz 
root@98bec5327540:/# cp -r apache-maven-3.3.1 /usr/local/apache-maven
root@98bec5327540:/# sudo nano /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/apache-maven/bin"
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
MAVEN_OPTS="-Xms256m -Xmx512m"
root@98bec5327540:/# source /etc/environment 
root@98bec5327540:/# echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle
root@98bec5327540:/# java -version
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
root@98bec5327540:/# mvn -version
Apache Maven 3.3.1 (cab6659f9874fa96462afef40fcf6bc033d58c1c; 2015-03-13T20:10:27+00:00)
Maven home: /usr/local/apache-maven
Java version: 1.8.0_40, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: ANSI_X3.4-1968
OS name: "linux", version: "3.13.0-48-generic", arch: "amd64", family: "unix"
root@98bec5327540:/# exit
$ sudo docker ps -a
CONTAINER ID        IMAGE                    COMMAND                CREATED             STATUS                        PORTS               NAMES
98bec5327540        ubuntu:14.04             "/bin/bash"            47 minutes ago      Exited (0) 4 minutes ago                          determined_carson   
468cd593eada        training/webapp:latest   "python app.py"        2 hours ago         Exited (137) 2 minutes ago                        hungry_kirch        
e84ae6413888        ubuntu:14.04             "/bin/sh -c 'while t   5 days ago          Exited (137) 3 hours ago                          Javier              
0f5e5a64c583        ubuntu:14.04             "/bin/bash"            11 days ago         Exited (0) 11 days ago                            dreamy_hopper     

$ sudo docker commit -m "Ubuntu latest (14.04) with  Oracle Java 8 JDK and Apache Maven 3.3.1 (and nano editor)" -a "Javier Beneito Barquero" 98bec5327540 jbbarquero/ubuntu-java8_oracle-maven
e262639379c46afa53eca74de9df9bd2f81e0ab7839a3472cafb67d7e199d85e
$

$ sudo docker images
REPOSITORY                             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
jbbarquero/ubuntu-java8_oracle-maven   latest              e262639379c4        56 seconds ago      828 MB
ubuntu                                 14.04               d0955f21bf24        11 days ago         188.3 MB
ubuntu                                 14.04.2             d0955f21bf24        11 days ago         188.3 MB
ubuntu                                 latest              d0955f21bf24        11 days ago         188.3 MB
ubuntu                                 trusty              d0955f21bf24        11 days ago         188.3 MB
ubuntu                                 trusty-20150320     d0955f21bf24        11 days ago         188.3 MB
centos                                 latest              88f9454e60dd        3 weeks ago         210 MB
hello-world                            latest              e45a5af57b00        12 weeks ago        910 B
training/webapp                        latest              31fa814ba25a        10 months ago       278.8 MB
$ sudo docker tag jbbarquero/ubuntu-java8_oracle-maven my-java8
$ sudo docker images
REPOSITORY                             TAG                 IMAGE ID            CREATED              VIRTUAL SIZE
jbbarquero/ubuntu-java8_oracle-maven   latest              e262639379c4        About a minute ago   828 MB
my-java8                               latest              e262639379c4        About a minute ago   828 MB
ubuntu                                 trusty-20150320     d0955f21bf24        11 days ago          188.3 MB
ubuntu                                 14.04               d0955f21bf24        11 days ago          188.3 MB
ubuntu                                 14.04.2             d0955f21bf24        11 days ago          188.3 MB
ubuntu                                 latest              d0955f21bf24        11 days ago          188.3 MB
ubuntu                                 trusty              d0955f21bf24        11 days ago          188.3 MB
centos                                 latest              88f9454e60dd        3 weeks ago          210 MB
hello-world                            latest              e45a5af57b00        12 weeks ago         910 B
training/webapp                        latest              31fa814ba25a        10 months ago        278.8 MB

$ sudo docker run -t -i my-java8 /bin/bash
root@da1045b88138:/# mvn -version
Apache Maven 3.3.1 (cab6659f9874fa96462afef40fcf6bc033d58c1c; 2015-03-13T20:10:27+00:00)
Maven home: /usr/local/apache-maven
Java version: 1.8.0_40, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: ANSI_X3.4-1968
OS name: "linux", version: "3.13.0-48-generic", arch: "amd64", family: "unix"
root@da1045b88138:/# exit
exit
$ 

$ sudo docker push jbbarquero/ubuntu-java8_oracle-maven
The push refers to a repository [jbbarquero/ubuntu-java8_oracle-maven] (len: 1)
Sending image list

Please login prior to push:
Username: jbbarquero
Password: 
Email: jbbarquero@gmail.com
Login Succeeded
The push refers to a repository [jbbarquero/ubuntu-java8_oracle-maven] (len: 1)
Sending image list
Pushing repository jbbarquero/ubuntu-java8_oracle-maven (1 tags)
511136ea3c5a: Image already pushed, skipping 
f3c84ac3a053: Image already pushed, skipping 
a1a958a24818: Image already pushed, skipping 
9fec74352904: Image already pushed, skipping 
d0955f21bf24: Image already pushed, skipping 
e262639379c4: Image successfully pushed 
Pushing tag for rev [e262639379c4] on {https://cdn-registry-1.docker.io/v1/repositories/jbbarquero/ubuntu-java8_oracle-maven/tags/latest}
$</pre><p></p>
<p>You can find this image at <a href="https://registry.hub.docker.com/u/jbbarquero/ubuntu-java8_oracle-maven/" title="jbbarquero / ubuntu-java8_oracle-maven" target="_blank">jbbarquero / ubuntu-java8_oracle-maven</a></p>
<p>How to <a href="https://docs.docker.com/reference/builder/" title="Dockerfile Reference" target="_blank">write a Dockerfile</a> is beyond the scope of this first post, but I can create a very basic one, just for fun.</p>
<p>Having this Dockerfile:</p>
<p></p><pre class="crayon-plain-tag"># This is a comment
FROM jbbarquero/ubuntu-java8_oracle-maven
MAINTAINER Javier Beneito Barquero &lt;jbbarquero@gmail.com&gt;
RUN apt-get update &amp;&amp; apt-get install -y git</pre><p></p>
<p>Just build it (and then push it):</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker build -t jbbarquero/ubuntu-java8_oracle-maven-git .
Sending build context to Docker daemon 2.048 kB
Sending build context to Docker daemon 
Step 0 : FROM jbbarquero/ubuntu-java8_oracle-maven
 ---> e262639379c4
Step 1 : MAINTAINER Javier Beneito Barquero <jbbarquero@gmail.com>
 ---> Running in d1be5fc2ed6e
 ---> 162fdb8b3a3d
Removing intermediate container d1be5fc2ed6e
Step 2 : RUN apt-get update && apt-get install -y git
 ---> Running in b7c2cf318638

... Lot of installation messages here

 ---> f63b88cf14ab
Removing intermediate container b7c2cf318638
Successfully built f63b88cf14ab
$ sudo docker images
REPOSITORY                                 TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
jbbarquero/ubuntu-java8_oracle-maven-git   latest              f63b88cf14ab        17 seconds ago      876.5 MB
...

$ sudo docker run -t -i jbbarquero/ubuntu-java8_oracle-maven-git /bin/bash
root@bb9caa298372:/# git --version
git version 1.9.1
root@bb9caa298372:/# mvn -version
Apache Maven 3.3.1 (cab6659f9874fa96462afef40fcf6bc033d58c1c; 2015-03-13T20:10:27+00:00)
Maven home: /usr/local/apache-maven
Java version: 1.8.0_40, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: ANSI_X3.4-1968
OS name: "linux", version: "3.13.0-48-generic", arch: "amd64", family: "unix"
root@bb9caa298372:/# exit
exit</pre><p></p>
<p>The new image can also be found on Docker Hub, at <a href="https://registry.hub.docker.com/u/jbbarquero/ubuntu-java8_oracle-maven-git/" title="jbbarquero / ubuntu-java8_oracle-maven-git" target="_blank">jbbarquero / ubuntu-java8_oracle-maven-git</a>.</p>
<h1>Closing words</h1>
<p>That&#8217;s enough for now. There are a couple of interesting topics, but we&#8217;ll leave them for another time:</p>
<ul>
<li><a href="https://docs.docker.com/userguide/dockerlinks/" title="Linking Containers Together" target="_blank">Linking Containers</a>, for sending information between containers in Docker.</li>
<li><a href="https://docs.docker.com/userguide/dockervolumes/" title="Managing Data in Containers" target="_blank">Data in containers</a>, for managing data volumes.</li>
</ul>
<h1>Resources</h1>
<ul>
<li><a href="https://docs.docker.com/" title="Docker docs" target="_blank">Docker official documentation</a></li>
<li><a href="https://www.docker.com/tryit/" title="Docker, try it!" target="_blank">Docker 10-minute tutorial</a></li>
<li><a href="https://docs.docker.com/installation/ubuntulinux/" title="Docker installation: Ubuntu" target="_blank">Docker installation: Ubuntu</a></li>
<li><a href="http://www.manning.com/nickoloff/" title="Docker in Action" target="_blank">Docker in Action</a>. By Jeff Nickoloff (Manning)</li>
<li><a href="http://www.manning.com/miell/" title="Docker in Practice" target="_blank">Docker in Practice</a>. By Ian Miell and Aidan Hobson Sayers (Manning)</li>
<li><a href="http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html" title="install oracle java 8 in ubuntu via ppa repository [jdk8]" target="_blank">Install oracle java 8 in ubuntu via ppa repository [jdk8]</a></li>
<li><a href="http://askubuntu.com/questions/38021/how-to-add-a-ppa-on-a-server" title="How to add a PPA on a server?" target="_blank">How to add a PPA on a server?</a></li>
<li><a href="https://help.ubuntu.com/community/Nano" title="Nano" target="_blank">Nano at Ubuntu</a></li>
<li><a href="https://maven.apache.org/download.cgi" title="Download Apache Maven 3.3.1" target="_blank">Download Apache Maven 3.3.1</a></li>
</ul>
<h1>Post scríptum</h1>
<p>When restarting the container jbbarquero/ubuntu-java8_oracle-maven, <em>source /etc/environment</em> is not run, so the values I wrote there are not applied. To solve this issue, I edited /etc/bash.bashrc and put the environment variables there.</p>
<p></p><pre class="crayon-plain-tag">$ sudo docker start -i 98bec5327540
root@98bec5327540:/# sudo nano /etc/bash.bashrc

JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export JAVA_HOME

M2_HOME=/usr/local/apache-maven
export M2_HOME
M2=$M2_HOME/bin
export M2

PATH=$PATH:$JAVA_HOME
PATH=$PATH:$M2
export PATH

root@98bec5327540:/# exit</pre><p></p>
<p>Maybe a <strong>Dockerfile</strong> is a better way to create images with <a href="https://docs.docker.com/reference/builder/#env" title="Dockerfile Reference: ENV instruction" target="_blank">environment variables, using ENV</a>.</p>
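<p>For illustration, such a Dockerfile might look like the following untested sketch; it reuses the image built above as the base, and the values mirror the variables I put in /etc/bash.bashrc:</p>
<p></p><pre class="crayon-plain-tag"># Untested sketch: bake the environment variables into the image with ENV
FROM jbbarquero/ubuntu-java8_oracle-maven
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
ENV M2_HOME /usr/local/apache-maven
ENV M2 $M2_HOME/bin
ENV PATH $PATH:$JAVA_HOME/bin:$M2</pre><p></p>
<p>Variables set with ENV apply to every <em>docker run</em> on the resulting image, without relying on bash startup files.</p>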
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=794</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Webinar Confirmation: Java, Ubuntu and browsers hell</title>
		<link>http://malsolo.com/blog4java/?p=812</link>
		<comments>http://malsolo.com/blog4java/?p=812#comments</comments>
		<pubDate>Tue, 24 Mar 2015 09:19:29 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Personal]]></category>
		<category><![CDATA[Chrome]]></category>
		<category><![CDATA[Java Plug-in]]></category>
		<category><![CDATA[Ubuntu]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=812</guid>
		<description><![CDATA[Hello Javier, Your registration is confirmed for the webinar&#8230; We are looking forward to having you join us. To help maximize your webinar experience we recommend that you join a test meeting before the session to check your system and &#8230; <a href="http://malsolo.com/blog4java/?p=812">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><em>Hello Javier,</em></p>
<p><em>Your registration is confirmed for the webinar&#8230; We are looking forward to having you join us.</em></p>
<p><img src="http://malsolo.com/blog4java/wp-includes/images/smilies/icon_biggrin.gif" alt=":D" class="wp-smiley" /></p>
<p><em>To help maximize your webinar experience we recommend that you join a test meeting before the session to check your system and browser compatibility at <u>http://www.webex.com/test-meeting.html</u>.</em></p>
<p>Let&#8217;s try&#8230;</p>
<div id="attachment_816" style="width: 969px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-094340.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-094340.png" alt="Java is not working." width="959" height="511" class="size-full wp-image-816" /></a><p class="wp-caption-text">Java is not working.</p></div>
<p>Dammit!</p>
<p>Now I have to waste half an hour discovering the problem and solving it.</p>
<p>Let&#8217;s see, <a href="https://java.com/en/download/faq/chrome.xml" title="How do I use Java with the Google Chrome browser?" target="_blank">How do I use Java with the Google Chrome browser?</a></p>
<p><em><strong>Chrome and Linux</strong><br />
Starting with Chrome version 35, NPAPI (Netscape Plug-in API) support was removed for the Linux platform. For more information, see Chrome and NPAPI (<a href="http://blog.chromium.org/2013/09/saying-goodbye-to-our-old-friend-npapi.html" title="Saying Goodbye to Our Old Friend NPAPI" target="_blank">blog.chromium.org</a>).</em></p>
<p><em><u>Firefox is the recommended browser for Java on Linux.</u></em></p>
<p>No problem, let&#8217;s use Firefox. But&#8230;</p>
<div id="attachment_814" style="width: 761px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-094112.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-094112.png" alt="Expired or not-yet-valid certificate" width="751" height="364" class="size-full wp-image-814" /></a><p class="wp-caption-text">Java Security</p></div>
<p>Sometimes I miss Windows and its closed environment.</p>
<p>Don&#8217;t give up. On the error page, the details show the Java Console:</p>
<div id="attachment_818" style="width: 529px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-094812.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-094812.png" alt="Java Console screenshot" width="519" height="434" class="size-full wp-image-818" /></a><p class="wp-caption-text">Java Plug-in 11.40.2.25</p></div>
<p>And the <a href="http://java.com/en/download/help/jcp_security.xml" title="How do I control when an untrusted applet or application runs in my web browser? " target="_blank">more information</a> link leads to the instructions to handle the Java Security via the Control Panel. Including a link to the configuration of the <a href="http://java.com/en/download/faq/exception_sitelist.xml" title="How can I configure the Exception Site List? " target="_blank">Exception Site List</a>.</p>
<p>But&#8230;</p>
<p><em><strong>Find the Java Control Panel</strong><br />
» <a href="http://java.com/en/download/help/win_controlpanel.xml" title="Where is the Java Control Panel on Windows? " target="_blank">Windows</a><br />
» <a href="http://java.com/en/download/help/mac_controlpanel.xml" title="Where is the Java Control Panel on my Mac? " target="_blank">Mac OS X</a></em> </p>
<p><strong>Where is the Java Control Panel on Linux?!!!</strong> <img src="http://malsolo.com/blog4java/wp-includes/images/smilies/icon_neutral.gif" alt=":|" class="wp-smiley" /></p>
<p><u>Keep calm and open the Terminal</u>:</p>
<p></p><pre class="crayon-plain-tag">$ whereis java
java: /usr/bin/java
$ ls -la /usr/bin/java
lrwxrwxrwx 1 root root 22 may  8  2014 /usr/bin/java -> /etc/alternatives/java
$ cd /usr/lib/jvm/java-8-oracle/jre/bin
$ ls
ControlPanel  javaws.real  keytool  policytool   servertool
java          jcontrol     orbd     rmid         tnameserv
javaws        jjs          pack200  rmiregistry  unpack200
$ ./ControlPanel</pre><p></p>
<p>Now, you can follow the instructions:</p>
<div id="attachment_821" style="width: 637px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-095904.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-095904.png" alt="Java Control Panel" width="627" height="641" class="size-full wp-image-821" /></a><p class="wp-caption-text">Java Control Panel</p></div>
<p>Go to the Security tab:</p>
<div id="attachment_822" style="width: 637px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-095934.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-095934.png" alt="Java Control Panel: Security tab" width="627" height="641" class="size-full wp-image-822" /></a><p class="wp-caption-text">Java Control Panel: Security tab</p></div>
<p>Press Edit <u>S</u>ite List&#8230;</p>
<div id="attachment_823" style="width: 560px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100040.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100040.png" alt="Exception Site List" width="550" height="379" class="size-full wp-image-823" /></a><p class="wp-caption-text">Exception Site List</p></div>
<p>Add the location:</p>
<div id="attachment_824" style="width: 560px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100150.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100150.png" alt="Exception Site List: add URL" width="550" height="379" class="size-full wp-image-824" /></a><p class="wp-caption-text">Exception Site List: add URL</p></div>
<p>Ok. Ok.</p>
<p>Now re-try:</p>
<div id="attachment_825" style="width: 604px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100356.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100356.png" alt="I&#039;m really tired" width="594" height="463" class="size-full wp-image-825" /></a><p class="wp-caption-text">I&#8217;m really tired</p></div>
<p>I&#8217;ve spent too much time with you, so <em>I accept the risk</em> <img src="http://malsolo.com/blog4java/wp-includes/images/smilies/icon_sad.gif" alt=":(" class="wp-smiley" /></p>
<div id="attachment_826" style="width: 630px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100624.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/Screenshot-from-2015-03-24-100624-1024x576.png" alt="Horray!" width="620" height="349" class="size-large wp-image-826" /></a><p class="wp-caption-text">Horray!</p></div>
<p>Piece of cake.</p>
<p>It&#8217;s been fun and annoying at the same time.</p>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=812</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting started with Spark</title>
		<link>http://malsolo.com/blog4java/?p=679</link>
		<comments>http://malsolo.com/blog4java/?p=679#comments</comments>
		<pubDate>Mon, 02 Mar 2015 15:27:16 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Spark]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=679</guid>
		<description><![CDATA[Spark Introduction Apache Spark is a cluster computing platform designed to be fast, expressive, high-level, general-purpose, fault-tolerant and compatible with Hadoop (Spark can work directly with HDFS, S3 and so on). Spark can also be defined as a framework &#8230; <a href="http://malsolo.com/blog4java/?p=679">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1>Spark Introduction</h1>
<p>Apache Spark is a cluster <strong>computing platform</strong> designed to be fast, expressive, high-level, general-purpose, fault-tolerant and compatible with Hadoop (Spark can work directly with HDFS, S3 and so on). Spark can also be defined as a framework for distributed processing and analysis of large amounts of data. The people at Databricks (the company behind Spark) call it a distributed execution engine for large-scale analytics.</p>
<p>Spark improves efficiency over Hadoop because it uses in-memory computing primitives. According to the <a title="Apache Spark™" href="https://spark.apache.org/" target="_blank">Apache Spark site</a>, it can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.</p>
<p>It also claims to improve usability through rich Scala, Python and Java APIs, as well as interactive shells for Scala and Python. Spark itself is written in Scala.</p>
<h2>Spark Architecture</h2>
<p>Spark has three main components:</p>
<div id="attachment_688" style="width: 630px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/02/apache_spark_stack.png"><img class="size-large wp-image-688" src="http://malsolo.com/blog4java/wp-content/uploads/2015/02/apache_spark_stack-1024x534.png" alt="The Apache Spark stack" width="620" height="323" /></a><p class="wp-caption-text">Apache Spark stack</p></div>
<h3>Spark Core (API)</h3>
<p>A high-level programming framework that allows programmers to focus on the logic and not the plumbing of distributed programming; that is, on the steps to be done, without worrying about coordinating tasks, networking and so on.</p>
<p>These steps are defined with RDDs (Resilient Distributed Datasets), the main programming abstraction. An RDD represents a collection of items distributed across many compute nodes that can be manipulated in parallel.</p>
<h3>Spark clustering</h3>
<p>Spark itself doesn&#8217;t manage the cluster, but it supports three cluster managers:</p>
<ul>
<li>Standalone: a simple cluster manager included in Spark itself called the Standalone Scheduler.</li>
<li><a title="Apache Hadoop YARN" href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html" target="_blank">Hadoop YARN</a>: see my <a title=" Getting started with Hadoop" href="http://malsolo.com/blog4java/?p=516" target="_blank">introduction to Apache Hadoop</a>.</li>
<li><a title="Apache Mesos" href="http://mesos.apache.org/" target="_blank">Apache Mesos</a>.</li>
</ul>
<h3>Spark stack</h3>
<p>Finally, Spark provides high level specialized components that are closely integrated in order to provide one great platform.</p>
<p>The current components are:</p>
<ul>
<li>Spark SQL: for querying data via SQL.</li>
<li>Spark Streaming: for real-time processing of live streams of data.</li>
<li>GraphX: a library for manipulating graphs and performing graph-parallel computations.</li>
<li>MLlib: a library of machine learning algorithms (classification, regression, &#8230;).</li>
</ul>
<h2>Spark Usage</h2>
<p>There are two ways to work with Spark:</p>
<ul>
<li>The Spark interactive shells</li>
<li>Spark standalone applications</li>
</ul>
<h3>Spark Shell</h3>
<p>It&#8217;s an interactive shell for the command line with two implementations, one in Python and the other in Scala: a <a title="REPL: Read–eval–print loop" href="http://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop" target="_blank">REPL</a> that is very useful for learning the API or for data exploration.</p>
<p>Spark’s shells allow you to interact with data not only on your single machine, but on disk or in memory across many machines, thanks to the distributed nature of Spark.</p>
<h3>Spark Applications</h3>
<p>The other way to work with Spark is by creating standalone applications in Python, Scala or Java. Use them for large-scale data processing.</p>
<h2>Spark main concepts</h2>
<h3>Driver program</h3>
<p>It&#8217;s the program that launches the distributed operations on a cluster.</p>
<p>The Spark shell is a driver program.</p>
<p>The application that you write, with its <em>main</em> function that defines the datasets and applies operations on them, is a driver program.</p>
<h3>Spark Context (sc)</h3>
<p>It&#8217;s the main entry point to the Spark API.</p>
<p>When using the shell, a preconfigured SparkContext is automatically created and it&#8217;s available in the variable called <strong><em>sc</em></strong>.</p>
<p>When writing applications, the first thing that you need to create is your own instance of the SparkContext.</p>
<h3>Resilient Distributed Dataset (RDD)</h3>
<p>The goal of Spark is to let you operate on datasets on a single machine and have those operations work in the same way on a distributed cluster.</p>
<p>To achieve this, Spark offers the <strong>Resilient Distributed Dataset</strong> (RDD): <span style="text-decoration: underline;">immutable collections</span> (<em>dataset</em>) of objects that Spark distributes (<em>distributed</em>) across the cluster. RDDs are loaded from a data source and, since they are immutable, they are also created as the result of transformations on existing RDDs (map, filter, etc.). Finally, Spark automatically rebuilds an RDD on another node if the node holding it fails (<em>resilient</em>).</p>
<p>There are two types of operations on RDDs:</p>
<ul>
<li><strong>Transformations</strong>: lazy operations to build RDDs based on the current RDD.</li>
<li><strong>Actions</strong>: return a result or write the RDD to storage. An action triggers a computation that actually applies the transformations that were lazily defined.</li>
</ul>
<p>In Spark jargon, this is called a Directed Acyclic Graph (DAG) of operations. RDDs track the series of transformations used to build them by maintaining a pointer to their parents.</p>
<h1>Spark Installation</h1>
<p>Go to <a title="Download Spark" href="https://spark.apache.org/downloads.html" target="_blank">https://spark.apache.org/downloads.html</a> and then:</p>
<ol>
<li>Choose a Spark release (1.2.1 is the latest at the time of this writing)</li>
<li>Choose a package type: select the package type of <em>“Pre-built for Hadoop 2.4 and later”</em></li>
<li>Choose a download type: <em>Direct Download</em> is OK, but the default <em>Apache Mirror</em> works well.</li>
<li>Click on the link after <em>Download Spark</em>, for instance <strong><em>spark-1.2.1.tgz</em></strong>, to download Spark.</li>
</ol>
<p>Unpack the downloaded file and move into that directory in order to use the interactive shell:</p><pre class="crayon-plain-tag">$ tar -xf spark-1.2.0-bin-hadoop2.4.tgz
$ cd spark-1.2.0-bin-hadoop2.4</pre><p></p>
<h2>Using the Shell</h2>
<p>The Python version of the Spark shell is available via the command <strong>bin/pyspark</strong> and the Scala version of the shell by using <strong>bin/spark-shell</strong>.</p>
<p>Note: the shells support code completion with the Tab key.</p>
<p>Let&#8217;s try the <strong>Scala</strong> shell:</p><pre class="crayon-plain-tag">$ bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/26 17:23:45 INFO SecurityManager: Changing view acls to: Javier
15/02/26 17:23:45 INFO SecurityManager: Changing modify acls to: Javier
15/02/26 17:23:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Javier); users with modify permissions: Set(Javier)
15/02/26 17:23:45 INFO HttpServer: Starting HTTP Server
15/02/26 17:23:45 INFO Utils: Successfully started service 'HTTP class server' on port 46130.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/26 17:23:50 WARN Utils: Your hostname, xxx resolves to a loopback address: 127.0.1.1; using 192.168.2.49 instead (on interface eth0)
15/02/26 17:23:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/02/26 17:23:50 INFO SecurityManager: Changing view acls to: Javier
15/02/26 17:23:50 INFO SecurityManager: Changing modify acls to: Javier
15/02/26 17:23:50 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Javier); users with modify permissions: Set(Javier)
15/02/26 17:23:51 INFO Slf4jLogger: Slf4jLogger started
15/02/26 17:23:51 INFO Remoting: Starting remoting
15/02/26 17:23:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@xxx.malsolo.lan:55248]
15/02/26 17:23:51 INFO Utils: Successfully started service 'sparkDriver' on port 55248.
15/02/26 17:23:51 INFO SparkEnv: Registering MapOutputTracker
15/02/26 17:23:51 INFO SparkEnv: Registering BlockManagerMaster
15/02/26 17:23:52 INFO DiskBlockManager: Created local directory at /tmp/spark-1420fe71-6907-408a-b44c-9547ba1a2c49/spark-909fad01-a3df-484b-bd30-1ea6006396e9
15/02/26 17:23:52 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/02/26 17:23:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/26 17:23:52 INFO HttpFileServer: HTTP File server directory is /tmp/spark-82586699-a230-47e4-8148-2cc4dcc741ec/spark-72f09be4-797a-4612-a845-e4fd1e578e76
15/02/26 17:23:52 INFO HttpServer: Starting HTTP Server
15/02/26 17:23:52 INFO Utils: Successfully started service 'HTTP file server' on port 41493.
15/02/26 17:23:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/26 17:23:53 INFO SparkUI: Started SparkUI at http://xxx.malsolo.lan:4040
15/02/26 17:23:53 INFO Executor: Starting executor ID  on host localhost
15/02/26 17:23:53 INFO Executor: Using REPL class URI: http://192.168.2.49:46130
15/02/26 17:23:53 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@xxx.malsolo.lan:55248/user/HeartbeatReceiver
15/02/26 17:23:53 INFO NettyBlockTransferService: Server created on 40938
15/02/26 17:23:53 INFO BlockManagerMaster: Trying to register BlockManager
15/02/26 17:23:53 INFO BlockManagerMasterActor: Registering block manager localhost:40938 with 265.1 MB RAM, BlockManagerId(, localhost, 40938)
15/02/26 17:23:53 INFO BlockManagerMaster: Registered BlockManager
15/02/26 17:23:53 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala&gt;</pre><p>To exit either shell, press Ctrl-D.</p><pre class="crayon-plain-tag">scala&gt; Stopping spark context.
15/02/26 17:27:40 INFO SparkUI: Stopped Spark web UI at http://xxx.malsolo.lan:4040
15/02/26 17:27:40 INFO DAGScheduler: Stopping DAGScheduler
15/02/26 17:27:41 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/02/26 17:27:41 INFO MemoryStore: MemoryStore cleared
15/02/26 17:27:41 INFO BlockManager: BlockManager stopped
15/02/26 17:27:41 INFO BlockManagerMaster: BlockManagerMaster stopped
15/02/26 17:27:41 INFO SparkContext: Successfully stopped SparkContext
15/02/26 17:27:41 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/02/26 17:27:41 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/02/26 17:27:41 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
$</pre><p>It&#8217;s possible to control the verbosity of the logging by creating a <em>conf/log4j.properties</em> file (use the existing <em>conf/log4j.properties.template</em> as a starting point; currently, Spark uses log4j 1.2.17, so you can find more details at the <a title="Apache log4j™ 1.2" href="http://logging.apache.org/log4j/1.2/" target="_blank">Apache log4j™ 1.2</a> website) and then changing the line:</p>
<p><strong>log4j.rootCategory=INFO, console</strong></p>
<p>To:</p>
<p><strong>log4j.rootCategory=WARN, console</strong></p>
<p>Now, with the shell, we can try some commands, like examining the sc variable, creating RDDs, filtering them and so on.</p><pre class="crayon-plain-tag">scala&gt; sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@76af34b5

scala&gt; val lines = sc.textFile("README.md")
lines: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12

scala&gt; lines.count()
res1: Long = 98                                                                 

scala&gt; lines.first()
res2: String = # Apache Spark</pre><p></p>
<p>There is an INFO message that gives the URL of the Spark UI (INFO SparkUI: Started SparkUI at http://[ipaddress]:4040), so you can use it to see information about the tasks and clusters.</p>
<div id="attachment_725" style="width: 904px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/03/SparkUI-2.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/03/SparkUI-2.png" alt="Spark UI at 4040" width="894" height="574" class="size-full wp-image-725" /></a><p class="wp-caption-text">Spark UI</p></div>
<h1>Spark Operations</h1>
<p>Once we have the Spark shell, let&#8217;s use it to take a look at the available operations before we dive into creating applications.</p>
<h2>Creating RDDs</h2>
<p>You can turn an existing collection into an RDD (parallelize it), you can load an external file (in several formats: text, JSON, CSV, SequenceFiles, objects) or you can even use an existing Hadoop InputFormat (with <em>sc.hadoopFile()</em>).</p>
<p></p><pre class="crayon-plain-tag">scala> val numbers = sc.parallelize(List(1,2,3))
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> val lines = sc.textFile("README.md")
lines: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[2] at textFile at <console>:12

scala></pre><p></p>
<h2>Transformations</h2>
<p>As we said earlier, transformations are lazily evaluated operations on RDDs that return a new RDD.</p>
<p>You can pass each element through a function (with <em>map()</em>) or keep elements that pass a predicate (with <em>filter()</em>) or produce zero or more elements for each element (with <em>flatMap()</em>) and so on.</p>
<p></p><pre class="crayon-plain-tag">scala> val squares = numbers.map(x => x*x)
squares: org.apache.spark.rdd.RDD[Int] = MappedRDD[3] at map at <console>:14

scala> val spark = lines.filter(line => line.contains("Spark"))
spark: org.apache.spark.rdd.RDD[String] = FilteredRDD[4] at filter at <console>:14

scala> val sequences = numbers.flatMap(x => 1 to x)
sequences: org.apache.spark.rdd.RDD[Int] = FlatMappedRDD[5] at flatMap at <console>:14

scala> val words = lines.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[6] at flatMap at <console>:14

scala></pre><p></p>
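<p>The laziness of transformations can be illustrated outside Spark with plain Java 8 streams, whose intermediate operations behave the same way: nothing runs until a terminal operation is invoked. This is a hedged, Spark-free sketch (the class and counter names are invented for the example):</p>

```java
import java.util.List;
import java.util.stream.Stream;

public class LazyDemo {
    // counts how many times the map function has actually executed
    public static int mapCalls = 0;

    // builds a pipeline analogous to numbers.map(x => x*x); nothing executes yet
    public static Stream<Integer> squares(List<Integer> numbers) {
        return numbers.stream().map(x -> { mapCalls++; return x * x; });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = squares(List.of(1, 2, 3));
        System.out.println("map calls after building the pipeline: " + mapCalls); // 0
        List<Integer> result = pipeline.toList(); // terminal operation, like a Spark action
        System.out.println(result + " after " + mapCalls + " map calls"); // [1, 4, 9] after 3 map calls
    }
}
```

<p>Just like in Spark, building the pipeline costs nothing; only the terminal operation (the analogue of an action) forces each element through the mapping function.</p>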
<h2>Actions</h2>
<p>Actions are the operations that return a final value to the driver program or write data to an external storage system; they trigger the evaluation of the transformations on the RDD.</p>
<p>For instance, retrieve the contents (<em>collect()</em>), return the first n elements (<em>take()</em>), count the number of elements (<em>count()</em>), combine elements with an associative function (<em>reduce()</em>), or write elements to a text file (<em>saveAsTextFile()</em>).</p>
<p></p><pre class="crayon-plain-tag">scala> sequences.collect()
res0: Array[Int] = Array(1, 1, 2, 1, 2, 3)                                      

scala> squares.take(2)
res1: Array[Int] = Array(1, 4)

scala> words.count()
res2: Long = 524

scala> numbers.reduce((x, y) => x + y)
res4: Int = 6

scala> spark.saveAsTextFile("borrar.txt")

scala></pre><p></p>
<h2>Key/Value Pairs</h2>
<p>There is a special type of RDD, <strong>Pair RDDs</strong>, whose elements are tuples, that is, key-value pairs, where both key and value can be of any type.</p>
<p>They are very useful for performing aggregations, grouping, and counting. They are often produced by some initial ETL (extract, transform, load) operations.</p>
<p>Pair RDDs can be partitioned across nodes to improve speed by making similar keys accessible on the same node.</p>
<p>Regarding operations, Spark offers special operations for Pair RDDs that let you act on each key in parallel, for instance, <em>reduceByKey()</em> to aggregate data by key, <em>join()</em> to merge two RDDs by grouping elements with the same key, or even <em>sortByKey()</em>.</p>
<p></p><pre class="crayon-plain-tag">scala> val pets = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 2)))
pets: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:12

scala> pets.collect()
res12: Array[(String, Int)] = Array((cat,1), (dog,1), (cat,2))

scala> pets.reduceByKey((a, b) => a + b).collect()
res9: Array[(String, Int)] = Array((dog,1), (cat,3))

scala> pets.groupByKey().collect()
res10: Array[(String, Iterable[Int])] = Array((dog,CompactBuffer(1)), (cat,CompactBuffer(1, 2)))

scala> pets.sortByKey().collect()
res11: Array[(String, Int)] = Array((cat,1), (cat,2), (dog,1))

scala></pre><p></p>
<p>Now let&#8217;s use the Shell to see how easily you can implement the MapReduce WordCount example in a single line:</p>
<p></p><pre class="crayon-plain-tag">scala> val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((x, y) => x + y)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[18] at reduceByKey at <console>:14

scala> counts.collect()
res16: Array[(String, Int)] = Array((package,1), (this,1), (Because,1), (Python,2), (cluster.,1), (its,1), ([run,1), (general,2), (YARN,,1), (have,1), (pre-built,1), (locally.,1), (changed,1), (locally,2), (sc.parallelize(1,1), (only,1), (several,1), (This,2), (basic,1), (first,1), (documentation,3), (Configuration,1), (learning,,1), (graph,1), (Hive,2), (["Specifying,1), ("yarn-client",1), (page](http://spark.apache.org/documentation.html),1), ([params]`.,1), (application,1), ([project,2), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (are,1), (systems.,1), (params,1), (scala>,1), (provides,1), (refer,2), (configure,1), (Interactive,2), (distribution.,1), (can,6), (build,3), (when,1), (Apache,1),...
scala></pre><p></p>
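<p>For comparison, and as a preview of the Java version below, the same flatMap&#8211;map&#8211;reduceByKey shape can be expressed locally with Java 8 streams (no Spark involved; the class and method names here are invented for the example):</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StreamWordCount {
    // flatMap the lines into words, then group identical words and count them
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of("to be or not to be"));
        System.out.println("to appears " + counts.get("to") + " times"); // to appears 2 times
    }
}
```

<p>The difference, of course, is that Spark distributes this computation across a cluster and shuffles data between nodes, while the stream runs in a single JVM.</p>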
<h1>Spark Applications</h1>
<p>For writing a Spark application you can use Scala, Python, or Java. I&#8217;m going to use Java 8 to take advantage of the new features of the language and get a less verbose syntax.</p>
<h2>Word count Java application</h2>
<p>First, use the appropriate dependency. For instance, with maven:</p>
<p></p><pre class="crayon-plain-tag">&lt;dependency&gt;
			&lt;groupId&gt;org.apache.spark&lt;/groupId&gt;
			&lt;artifactId&gt;spark-core_2.10&lt;/artifactId&gt;
			&lt;version&gt;1.2.1&lt;/version&gt;
		&lt;/dependency&gt;</pre><p></p>
<p>Then, you have to instantiate your own <strong>SparkContext</strong>, which is done via a <strong>SparkConf</strong> object. We use the minimal configuration: a cluster URL (&#8220;local&#8221; to run on a local cluster) and an application name to identify the application on the cluster:</p>
<p></p><pre class="crayon-plain-tag">SparkConf conf = new SparkConf().setMaster(&quot;local&quot;).setAppName(&quot;Word Count with Spark&quot;);
		JavaSparkContext sc = new JavaSparkContext(conf);</pre><p></p>
<p>Now, before writing Java code, it&#8217;s necessary to explain the differences with Scala.</p>
<p>Spark is written in Scala and takes full advantage of its features, but Java lacks some of them, so Spark provides alternatives in the form of interfaces or concrete classes.</p>
<p>Let&#8217;s see the Word Count example in Spark written in Scala:</p>
<p></p><pre class="crayon-plain-tag">val file = spark.textFile(&quot;file&quot;)
val counts = file.flatMap(line =&gt; line.split(&quot; &quot;))
                 .map(word =&gt; (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile(&quot;out&quot;)</pre><p></p>
<p>Before Java 8, Java didn&#8217;t accept functions as parameters, so Spark provides interfaces in the <em>org.apache.spark.api.java.function</em> package to be implemented, either as anonymous inner classes or as named classes, and passed as arguments to the functions (<em>flatMap()</em>, <em>map()</em>, <em>reduceByKey()</em>, &#8230;)</p>
<p>In our case, these are the functions that are needed:</p>
<ul>
<li><strong>FlatMapFunction<T, R></strong> with the method <em>Iterable<R> call(T t)</em> to return zero or more output records from each input record (t).</li>
<li><strong>PairFunction<T, K, V></strong> with the method <em>Tuple2<K, V> call(T t)</em> to return key-value pairs (Tuple2<K, V>), and can be used to construct PairRDDs.</li>
<li><strong>Function2<T1, T2, R></strong> with the method <em>R call(T1 v1, T2 v2)</em>, a two-argument function that takes arguments of type T1 and T2 and returns an R.</li>
</ul>
<p>Java doesn&#8217;t have a native Tuple implementation (as <a href="https://twitter.com/lukaseder" title="Lukas Eder twitter account" target="_blank">Lukas Eder</a> noted on a side-note <a href="http://blog.jooq.org/2015/01/23/how-to-translate-sql-group-by-and-aggregations-to-java-8/" title="How to Translate SQL GROUP BY and Aggregations to Java 8" target="_blank">here</a>: &#8220;Why the JDK doesn’t ship with built-in tuples like C#’s or Scala’s escapes me.&#8221; In other words, <strong><em>&#8220;Functional programming without tuples is like coffee without sugar: A bitter punch in your face.&#8221;</em></strong>)</p>
<p>For that reason, Spark provides several implementations for Tuple in the <em>scala</em> package.</p>
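<p>To make these shapes concrete without pulling in the Spark dependency, here is a sketch with simplified stand-ins for the three interfaces and for Scala&#8217;s Tuple2 (the real types live in the <em>org.apache.spark.api.java.function</em> and <em>scala</em> packages; everything below is a local imitation wired into a sequential word count):</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FunctionShapes {
    // simplified stand-ins for the Spark/Scala types (illustration only)
    interface FlatMapFunction<T, R> { Iterable<R> call(T t); }
    interface PairFunction<T, K, V> { Tuple2<K, V> call(T t); }
    interface Function2<T1, T2, R> { R call(T1 v1, T2 v2); }
    record Tuple2<K, V>(K _1, V _2) { }

    // a sequential word count wired through the three interfaces,
    // mimicking flatMap().mapToPair().reduceByKey()
    public static Map<String, Integer> wordCount(List<String> lines) {
        FlatMapFunction<String, String> split = line -> Arrays.asList(line.split(" "));
        PairFunction<String, String, Integer> pair = word -> new Tuple2<>(word, 1);
        Function2<Integer, Integer, Integer> sum = (x, y) -> x + y;

        List<String> words = new ArrayList<>();
        for (String line : lines) split.call(line).forEach(words::add);

        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            Tuple2<String, Integer> t = pair.call(word);
            counts.merge(t._1(), t._2(), sum::call); // "reduceByKey" on one node
        }
        return counts;
    }
}
```

<p>Each interface has a single abstract method, which is exactly why Java 8 lambdas can replace the anonymous inner classes in the Spark code below.</p>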
<p>But Java has evolved, and now functions are first-class citizens, so it’s possible to pass them as parameters to other functions. Since the provided interfaces have a single public method, it’s very easy to write the Java 8 version of the word count in Spark using lambdas, and the result is almost as clear as the Scala version:</p>
<p></p><pre class="crayon-plain-tag">JavaRDD&lt;String&gt; lines = sc.textFile(&quot;file&quot;);
		JavaPairRDD&lt;String, Integer&gt; counts = lines.flatMap(line -&gt; Arrays.asList(line.split(&quot; &quot;)))
			.mapToPair(word -&gt; new Tuple2&lt;String, Integer&gt;(word, 1))
			.reduceByKey((x, y) -&gt; x + y);
		counts.saveAsTextFile(&quot;out&quot;);</pre><p></p>
<p>The complete source code is <a href="https://github.com/jbbarquero/spark-examples" title="spark-examples" target="_blank">available at GitHub</a>.</p>
<h2>Build and run</h2>
<p>Now, we only have to build the project (with maven) and submit it to Spark (with <strong>bin/spark-submit</strong>). From the root directory of the application (note: the out directory must not exist, so remove it first if needed with <strong><em>rm -r out</em></strong>):</p>
<p></p><pre class="crayon-plain-tag">$ mvn clean install
$ ~/Applications/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class com.malsolo.spark.examples.WordCount target/spark-examples-0.0.1-SNAPSHOT.jar</pre><p></p>
<p>Finally, we can see the results to compare with the ones obtained <a href="http://malsolo.com/blog4java/?p=516" title=" Getting started with Hadoop" target="_blank">using Hadoop</a>:</p>
<p></p><pre class="crayon-plain-tag">$ cat out/part-00000 | grep President
(President,,26)
(President,72)
(President.,8)
(Vice-President,5)
(Vice-President,,5)
(Vice-President;,1)
(President;,3)
(Vice-President.,1)

$ cat out/part-00000 | grep United
(United,85)

$ cat out/part-00000 | grep State
(State,47)
(States,46)
(States.",1)
(State.,6)
(States,,55)
(State,,20)
(States:,2)
(States.,8)
(Statement,1)
(States;,13)
(State;,4)</pre><p></p>
<h1>Shared Variables</h1>
<p>Spark closures and the variables they use are sent separately to the tasks running on the cluster, so the variables created in the driver program are received in the tasks as new copies, and updates on these copies are not propagated back to the driver.</p>
<p>Spark has two kinds of shared variables, <strong>accumulators</strong> and <strong>broadcast variables</strong>, to solve that problem as well as for solving issues related with the amount of data that is sent across the cluster.</p>
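<p>The copy problem can be simulated in plain Java: if each &#8220;task&#8221; works on its own copy of a driver variable (as happens when a closure is serialized and shipped to a worker), the driver never sees the increments. A contrived, Spark-free sketch:</p>

```java
import java.util.Arrays;
import java.util.List;

public class LostUpdates {
    // each simulated task receives a copy of the driver's counter,
    // like a serialized closure variable shipped to a worker node
    public static int countBlanksNaively(List<String> lines) {
        int[] driverCounter = {0};
        for (String line : lines) {
            int[] taskCopy = Arrays.copyOf(driverCounter, 1); // the "serialization"
            if (line.isEmpty()) {
                taskCopy[0]++; // the update lands on the copy only
            }
        }
        return driverCounter[0]; // still 0: the tasks' updates never came back
    }
}
```

<p>Accumulators exist precisely to route such updates back to the driver.</p>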
<h2>Accumulators</h2>
<p>Accumulators are variables that can be used to aggregate values from worker nodes back to the driver program. In a nutshell:</p>
<ul>
<li>They are created with <em>SparkContext.accumulator(initialValue)</em> that returns an <em>org.apache.spark.Accumulator[T]</em> (with T, the type of initialValue)</li>
<li>Worker code adds values with <em>+=</em> in Scala or the function <em>add()</em> in Java.</li>
<li>The driver program can access it with <em>value</em> in Scala or <em>value()</em>/<em>setValue()</em> in Java (accessing it from worker code throws an exception)</li>
<li>The right value will be obtained after calling an <em><strong>action</strong></em> (remember that <u><em><strong>transformations</strong></em> are lazy operations</u>)</li>
</ul>
<p>Spark has built-in support for accumulators of type Integer, but you can create custom Accumulators by extending <a href="http://spark.apache.org/docs/1.2.1/api/java/index.html?org/apache/spark/AccumulatorParam.html" title="AccumulatorParam API" target="_blank">AccumulatorParam</a>.</p>
<p>Let&#8217;s see an example that counts the empty lines in the file that we use to count words:</p>
<p></p><pre class="crayon-plain-tag">public static void main(String[] args) {
		SparkConf conf = new SparkConf().setMaster(&quot;local&quot;).setAppName(&quot;Word Count with Spark&quot;);
		JavaSparkContext sc = new JavaSparkContext(conf);
		
		JavaRDD&lt;String&gt; lines = sc.textFile(INPUT_FILE_TEXT);
		
		final Accumulator&lt;Integer&gt; blankLines = sc.accumulator(0);
		
		@SuppressWarnings(&quot;resource&quot;)
		JavaPairRDD&lt;String, Integer&gt; counts = lines.flatMap(line -&gt; 
			{
				if (&quot;&quot;.equals(line)) {
					blankLines.add(1);
				}
				return Arrays.asList(line.split(&quot; &quot;));
			})
			.mapToPair(word -&gt; new Tuple2&lt;String, Integer&gt;(word, 1))
			.reduceByKey((x, y) -&gt; x + y);

		counts.saveAsTextFile(OUTPUT_FILE_TEXT);
		
		System.out.println(&quot;Blank lines: &quot; + blankLines.value());
		
		sc.close();
	}</pre><p></p>
<ul>
<li>In line 5 we create an Accumulator<Integer> initialized to 0</li>
<li>In lines 11 to 16 we modify the FlatMapFunction to add 1 if the input line is empty</li>
<li>In line 22 we print the value of the accumulator, after the <em>saveAsTextFile()</em> action.</li>
</ul>
<p>Let&#8217;s try:</p>
<p></p><pre class="crayon-plain-tag">$ rm -r out
$ mvn clean install
$ ~/Applications/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class com.malsolo.spark.examples.WordCount target/spark-examples-0.0.1-SNAPSHOT.jar

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/02 15:26:22 WARN Utils: Your hostname, xxx resolves to a loopback address: 127.0.1.1; using yyy.yyy.y.yy instead (on interface eth0)
15/03/02 15:26:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/03/02 15:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Blank lines: 169
$</pre><p></p>
<h2>Broadcast variables</h2>
<p>Broadcast variables are shared variables used to efficiently distribute large read-only values to all the worker nodes.</p>
<p>If you need to use the same variable in multiple parallel operations, it&#8217;s likely you’d rather share it instead of letting Spark send it separately for each operation.</p>
<p>In a nutshell:</p>
<ul>
<li>They are created with SparkContext.broadcast(initValue) on an object of type T, which has to be Serializable.</li>
<li>Access its value with <em>value</em> in Scala or <em>value()</em> in Java.</li>
<li>The value shouldn&#8217;t be modified after creation, because the change will only happen in one node.</li>
</ul>
<p>Let’s see an example with a list of words that must not be included in the count (a short list, but enough to illustrate the concept):</p>
<p></p><pre class="crayon-plain-tag">public static void main(String[] args) {
		SparkConf conf = new SparkConf().setMaster(&quot;local&quot;).setAppName(&quot;Word Count with Spark&quot;);
		JavaSparkContext sc = new JavaSparkContext(conf);
		
		JavaRDD&lt;String&gt; lines = sc.textFile(INPUT_FILE_TEXT);
		
		final Accumulator&lt;Integer&gt; blankLines = sc.accumulator(0);
		
		final Broadcast&lt;List&lt;String&gt;&gt; wordsToIgnore = sc.broadcast(getWordsToIgnore());
		
		@SuppressWarnings(&quot;resource&quot;)
		JavaPairRDD&lt;String, Integer&gt; counts = lines.flatMap(line -&gt; 
			{
				if (&quot;&quot;.equals(line)) {
					blankLines.add(1);
				}
				return Arrays.asList(line.split(&quot; &quot;));
			})
			.filter(word -&gt; !wordsToIgnore.value().contains(word))
			.mapToPair(word -&gt; new Tuple2&lt;String, Integer&gt;(word, 1))
			.reduceByKey((x, y) -&gt; x + y);
		
		counts.saveAsTextFile(OUTPUT_FILE_TEXT);
		
		System.out.println(&quot;Blank lines: &quot; + blankLines.value());
		
		sc.close();
	}

	private static List&lt;String&gt; getWordsToIgnore() {
		return Arrays.asList(&quot;the&quot;, &quot;of&quot;, &quot;and&quot;, &quot;for&quot;);
	}</pre><p></p>
<ul>
<li>In line 9 we create the broadcast variable: a list of words to ignore. In lines 30 to 31 we only return 4 words, but it&#8217;s easy to see that the list could be much larger.</li>
<li>In line 19 we access the broadcast variable with the <em>value()</em> method and use it in a filter method.</li>
</ul>
<h1>Resources</h1>
<ul>
<li>Source code at <a href="https://github.com/jbbarquero/spark-examples" title="spark-examples" target="_blank">GitHub</a></li>
<li><a title="Learning Spark" href="http://shop.oreilly.com/product/0636920028512.do" target="_blank">Learning Spark</a>. By Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia (O&#8217;Reilly Media)</li>
<li><a title="Cloudera Developer Training for Apache Spark" href="http://cloudera.com/content/cloudera/en/training/courses/spark-training.html" target="_blank">Cloudera Developer Training for Apache Spark</a>. By Diana Carroll (Cloudera training)</li>
<li><a title="Parallel Programming with Spark" href="https://www.youtube.com/watch?v=7k4yDKBYOcw" target="_blank">Parallel Programming with Spark (Part 1 &amp; 2)</a>. By Matei Zaharia ((UC Berkeley AMPLab YouTube channel))</li>
<li><a title="Advanced Spark Features" href="https://www.youtube.com/watch?v=w0Tisli7zn4" target="_blank">Advanced Spark Features</a>. By Matei Zaharia (UC Berkeley AMPLab YouTube channel)</li>
<li><a title="A Deeper Understanding of Spark Internals" href="https://www.youtube.com/watch?v=dmL0N3qfSc8" target="_blank">A Deeper Understanding of Spark Internals</a>. By Aaron Davidson (Apache Spark YouTube channel)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=679</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting started with Hadoop</title>
		<link>http://malsolo.com/blog4java/?p=516</link>
		<comments>http://malsolo.com/blog4java/?p=516#comments</comments>
		<pubDate>Wed, 25 Feb 2015 15:36:14 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=516</guid>
		<description><![CDATA[Hadoop Introduction Hadoop is an open source framework for distributed fault-tolerant data storage and batch processing. It allows you to write applications for processing really huge data sets across clusters of computers using simple programming model with linear scalability on &#8230; <a href="http://malsolo.com/blog4java/?p=516">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1>Hadoop Introduction</h1>
<p>Hadoop is an open source framework for distributed fault-tolerant data storage and batch processing. It allows you to write applications for processing really huge data sets across clusters of computers using a simple programming model with linear scalability on commodity hardware. Commodity hardware means cheaper hardware than the dedicated servers sold by many vendors. Linear scalability means that you only have to add more machines (nodes) to the Hadoop cluster.</p>
<p>The key concept for Hadoop is <strong><em>move-code-to-data</em></strong>, that is, data is distributed across the nodes of the Hadoop cluster and the applications (the jar files) are later sent to those nodes instead of vice versa (as in Java EE, where applications are centralized in an application server and the data is brought to it over the network) in order to process the data locally.</p>
<p>At its core, Hadoop has two parts:</p>
<ul>
<li><strong>Hadoop Distributed File System</strong> (<strong>HDFS™</strong>): a distributed file system that provides high-throughput access to application data.</li>
<li><strong>YARN</strong> (<strong>Yet Another Resource Negotiator</strong>): a framework for job scheduling and cluster resource management.</li>
</ul>
<p>As you can see in the very definition of the Apache Hadoop website (<a href="http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F" title="What is Apache Hadoop?" target="_blank">what is Apache Hadoop?</a>), Hadoop offers as a third component <strong>Hadoop MapReduce</strong>, a batch-based, distributed computing framework modeled after Google’s paper on MapReduce. It allows you to parallelize work over a large amount of raw data by splitting the input dataset into independent chunks which are processed by the map tasks (initial ingestion and transformation) in parallel, whose outputs are sorted and then passed to the reduce tasks (aggregation or summarization).</p>
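<p>The map&#8211;shuffle&#8211;reduce flow just described can be sketched at toy scale in plain Java (no Hadoop APIs; all names below are invented for illustration), using word counting as the classic example:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    // map phase: each line of input emits (word, 1) pairs
    static List<Map.Entry<String, Integer>> mapPhase(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // shuffle/sort groups the pairs by key; reduce sums each group
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // keys end up sorted, like the shuffle
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : mapPhase(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> reduced = new TreeMap<>();
        grouped.forEach((word, ones) ->
                reduced.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }
}
```

<p>In real Hadoop, the map tasks run on many nodes, the shuffle sorts and moves the grouped values across the network, and a Reducer receives each key with its list of values.</p>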
<p>In the previous version of Hadoop (Hadoop 1), the implementation of MapReduce was based on a master <em>JobTracker</em>, for resource management and job scheduling/monitoring, and per-node slaves called <em>TaskTracker</em>s to launch/tear down tasks. But it had scalability problems, especially with very large clusters (more than 4,000 nodes).</p>
<p>So, MapReduce has undergone a complete overhaul and is now called MapReduce 2.0 (MRv2). It is no longer a standalone part: currently, <u><strong>MapReduce</strong> is a YARN-based system</u>. That&#8217;s why we can say that Hadoop has two main parts: HDFS and YARN.</p>
<h1>Hadoop ecosystem</h1>
<p>Besides the two core technologies, the distributed file system (HDFS) and Map Reduce (MR), there are a lot of projects that expand Hadoop with additional useful technologies, in such a way that we can consider all of them an ecosystem around Hadoop.</p>
<p>Next, a list of some of these projects, organized into rough categories:</p>
<ul>
<li><strong>Data Ingestion:</strong> to move data from and into HDFS
<ul>
<li><u>Flume</u>: a system for moving data into HDFS from remote systems using configurable memory-resident daemons that watch for data on those systems and then forward the data to Hadoop. For example, weblogs from multiple servers to HDFS.</li>
<li><u>Sqoop</u>: a tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.</li>
</ul>
</li>
<li><strong>Data Processing:</strong>
<ul>
<li><u>Pig</u>: a procedural language for querying and transforming data with scripts in a data flow language called PigLatin.</li>
<li><u>Hive</u>: a declarative SQL-like language.</li>
<li><u>Spark</u>: an in-memory distributed data processing engine that breaks problems up over all of the Hadoop nodes but keeps the data in memory for better performance; the data can be rebuilt from an external store (usually HDFS) with the details kept in the Resilient Distributed Dataset (RDD).</li>
<li><u>Storm</u>: a distributed real-time computation system for processing fast, large streams of data.</li>
</ul>
</li>
<li><strong>Data Formats:</strong>
<ul>
<li><u>Avro</u>: a language-neutral data serialization system whose schemas are expressed in JSON.</li>
<li><u>Parquet</u>: a compressed columnar storage format that can efficiently store nested data</li>
</ul>
</li>
<li><strong>Storage:</strong>
<ul>
<li><u>HBase</u>: a scalable, distributed database that supports structured data storage for large tables.</li>
<li><u>Accumulo</u>: a scalable, distributed database that supports structured data storage for large tables.</li>
</ul>
</li>
<li><strong>Coordination:</strong>
<ul>
<li><u>Zookeeper</u>: a high-performance coordination service for distributed applications.</li>
</ul>
</li>
<li><strong>Machine Learning:</strong>
<ul>
<li><u>Mahout</u>: a scalable machine learning and data mining library: classification, clustering, pattern mining, collaborative filtering and so on.</li>
</ul>
</li>
<li><strong>Workflow Management:</strong>
<ul>
<li><u>Oozie</u>: a service for running and scheduling workflows of Hadoop jobs (including Map-Reduce, Pig, Hive, and Sqoop jobs).</li>
</ul>
</li>
</ul>
<h1>Hadoop installation</h1>
<p>To install Hadoop on a single machine to try it out, just download the compressed file for the desired version and unpack it on the filesystem.</p>
<h2>Prerequisites</h2>
<p>There is some required software for running Apache Hadoop:</p>
<ul>
<li>Java. It&#8217;s also necessary to inform Hadoop where Java is via the environment variable JAVA_HOME</li>
<p></p><pre class="crayon-plain-tag">$ java -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
$ echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle</pre><p></p>
<li>ssh: Ubuntu 14.04 comes with the ssh client, but I had to install the server manually.</li>
<p></p><pre class="crayon-plain-tag">$ which ssh
/usr/bin/ssh
$ which sshd
/usr/sbin/sshd</pre><p></p>
<li>On Mac OSX, make sure <strong>Remote Login</strong> (under <strong>System Preferences</strong> -> <strong>Sharing</strong>) is enabled for the current user or for all users.</li>
<li>On Windows, the best option is to follow the instructions at the Wiki: <a href="http://wiki.apache.org/hadoop/Hadoop2OnWindows" title="Build and Install Hadoop 2.x or newer on Windows" target="_blank">Build and Install Hadoop 2.x or newer on Windows</a>.</li>
</ul>
<h2>Download and install</h2>
<p>To get a Hadoop distribution, download a recent stable release from one of the <a href="http://www.apache.org/dyn/closer.cgi/hadoop/common/" title="Apache Download Mirrors" target="_blank">Apache Download Mirrors</a>.</p>
<p>There are several directories, for the current, last stable, last v1 stable version and so on. Basically, you&#8217;ll download a tar gzipped file named <strong>hadoop-x.y.z.tar.gz</strong>, for instance: hadoop-2.6.0.tar.gz.</p>
<p>You can unpack it wherever you want and then point the PATH to that directory. For example:</p>
<p></p><pre class="crayon-plain-tag">$ tar xzf hadoop-2.6.0.tar.gz
$
$ export HADOOP_HOME=~/Applications/hadoop-2.6.0
$ export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin</pre><p></p>
<p>Now you can verify the installation by typing <strong>hadoop version</strong>:</p>
<p></p><pre class="crayon-plain-tag">$ hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using ~/Applications/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
$</pre><p></p>
<h2>Configuration</h2>
<p>Hadoop has three supported modes:</p>
<ul>
<li>Local (Standalone) Mode: Hadoop runs as a single Java process with no daemons. For development, testing, and debugging.</li>
<li>Pseudo-Distributed Mode: each Hadoop daemon runs in a separate Java process. For simulating a cluster on a small scale.</li>
<li>Fully-Distributed Mode: the Hadoop daemons run on a cluster of machines. If you want to take a look, see the official documentation: <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html" title="Hadoop Cluster Setup" target="_blank">Hadoop MapReduce Next Generation &#8211; Cluster Setup</a>.</li>
</ul>
<p>In standalone mode, there is no further action to take: the default properties are enough and there are no daemons to run.</p>
<p>In pseudodistributed mode, you have to set up your computer as described at <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html" title="Hadoop MapReduce Next Generation - Setting up a Single Node Cluster." target="_blank">Hadoop MapReduce Next Generation &#8211; Setting up a Single Node Cluster</a>. But let&#8217;s review the steps needed.</p>
<p>You need at least a minimum configuration with four files in <strong>HADOOP_HOME/etc/hadoop/</strong>:</p>
<ul>
<li><strong>core-site.xml</strong>. Common configuration, default values at <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml" title="core-default.xml" target="_blank">Configuration: core-default.xml</a></li>
<p></p><pre class="crayon-plain-tag">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;
&lt;configuration&gt;
	&lt;property&gt;
		&lt;name&gt;fs.defaultFS&lt;/name&gt;
		&lt;value&gt;hdfs://localhost:8020&lt;/value&gt;
	&lt;/property&gt;
&lt;/configuration&gt;</pre><p></p>
<p><em>fs.defaultFS</em> replaces the deprecated <em>fs.default.name</em>, whose default value is <em>file:///</em>.</p>
<li><strong>hdfs-site.xml</strong>. HDFS configuration, default values at <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml" title="hdfs-default.xml" target="_blank">Configuration: hdfs-default.xml</a></li>
<p></p><pre class="crayon-plain-tag">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;
&lt;configuration&gt;
	&lt;property&gt;
		&lt;name&gt;dfs.replication&lt;/name&gt;
		&lt;value&gt;1&lt;/value&gt;
	&lt;/property&gt;
&lt;/configuration&gt;</pre><p></p>
<p><em>dfs.replication</em> is the default block replication factor, unless otherwise specified at creation time. The default value is 3, but we use 1 because we have only one node.</p>
<p>Other useful values are:</p>
<p><em>dfs.namenode.name.dir</em>, local path for storing the fsimage by the NN (defaults to file://${hadoop.tmp.dir}/dfs/name with hadoop.tmp.dir configurable at core-site.xml with default value /tmp/hadoop-${user.name})</p>
<p><em>dfs.datanode.data.dir</em>, local path for storing blocks by the DN (defaults to file://${hadoop.tmp.dir}/dfs/data)</p>
<li><strong>mapred-site.xml</strong>. MapReduce configuration, default values at <a href="http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml" title="mapred-default.xml" target="_blank">Configuration: mapred-default.xml</a></li>
<p></p><pre class="crayon-plain-tag">&lt;?xml version=&quot;1.0&quot;?&gt;
&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;
&lt;configuration&gt;
	&lt;property&gt;
		&lt;name&gt;mapreduce.framework.name&lt;/name&gt;
		&lt;value&gt;yarn&lt;/value&gt;
	&lt;/property&gt;
&lt;/configuration&gt;</pre><p></p>
<p><em>mapreduce.framework.name</em>, the runtime framework for executing MapReduce jobs: local, classic or yarn.</p>
<p>Other useful values are:</p>
<p><em>mapreduce.jobtracker.system.dir</em>, the directory where MapReduce stores control files (defaults to ${hadoop.tmp.dir}/mapred/system).</p>
<p><em>mapreduce.cluster.local.dir</em>, the local directory where MapReduce stores intermediate data files (defaults to ${hadoop.tmp.dir}/mapred/local)</p>
<li><strong>yarn-site.xml</strong>. YARN configuration, default values at <a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml" title="yarn-default.xml" target="_blank">Configuration: yarn-default.xml</a></li>
</ul>
<p></p><pre class="crayon-plain-tag">&lt;?xml version=&quot;1.0&quot;?&gt;
&lt;configuration&gt;
	&lt;property&gt;
		&lt;name&gt;yarn.resourcemanager.hostname&lt;/name&gt;
		&lt;value&gt;localhost&lt;/value&gt;
	&lt;/property&gt;
	&lt;property&gt;
		&lt;name&gt;yarn.nodemanager.aux-services&lt;/name&gt;
		&lt;value&gt;mapreduce_shuffle&lt;/value&gt;
	&lt;/property&gt;
&lt;/configuration&gt;</pre><p></p>
<p><em>yarn.resourcemanager.hostname</em>, the host name of the Resource Manager.</p>
<p><em>yarn.nodemanager.aux-services</em>, the list of auxiliary services executed by the Node Manager. The value mapreduce_shuffle enables the Shuffle/Sort phase of MapReduce, which is an auxiliary service in Hadoop 2.x.</p>
<h2>Configuring SSH</h2>
<p>Pseudodistributed mode is like fully distributed mode with a single host: localhost. In order to start the daemons on the set of hosts in the cluster, SSH is used. So we&#8217;ll configure SSH to log in without password.</p>
<p>Remember that you need to have SSH installed and a server running. On Ubuntu, run this if needed:</p>
<p></p><pre class="crayon-plain-tag">$ sudo apt-get install ssh</pre><p></p>
<p>Now create an SSH key with an empty passphrase:</p>
<p></p><pre class="crayon-plain-tag">$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_hadoop
$ cat ~/.ssh/id_rsa_hadoop.pub >> ~/.ssh/authorized_keys</pre><p></p>
<p>Finally, test that you can connect without a password:</p>
<p></p><pre class="crayon-plain-tag">$ ssh localhost</pre><p></p>
<h2>First steps with HDFS</h2>
<p>Before using HDFS for the first time, some steps must be performed:</p>
<h3>Formatting the HDFS filesystem</h3>
<p>Just run the following command:</p>
<p></p><pre class="crayon-plain-tag">$ hdfs namenode -format</pre><p></p>
<h3>Starting the daemons</h3>
<p>To start the HDFS, YARN, and MapReduce daemons, type:</p>
<p></p><pre class="crayon-plain-tag">$ start-dfs.sh
$ start-yarn.sh
$ mr-jobhistory-daemon.sh start historyserver</pre><p></p>
<p>You can check which processes are running with Java&#8217;s <strong>jps</strong> command:</p>
<p></p><pre class="crayon-plain-tag">$ jps
25648 NodeManager
25521 ResourceManager
25988 JobHistoryServer
25355 SecondaryNameNode
25180 DataNode
26025 Jps
$</pre><p></p>
<h3>Stopping the daemons</h3>
<p>Once you&#8217;re done, you can stop the daemons with:</p>
<p></p><pre class="crayon-plain-tag">$ mr-jobhistory-daemon.sh stop historyserver
$ stop-yarn.sh
$ stop-dfs.sh</pre><p></p>
<h3>Creating A User Directory</h3>
<p>You can create a home directory for a user with the following command:</p>
<p></p><pre class="crayon-plain-tag">$ hadoop fs -mkdir -p ~/Documents/hadoop-home/</pre><p></p>
<h1>Other Hadoop installations</h1>
<p>There are other ways to get Hadoop installed, namely through companies that provide products that include Apache Hadoop or some derivative of it:</p>
<ul>
<li><a href="http://aws.amazon.com/elasticmapreduce/" title="Amazon EMR" target="_blank">Amazon Elastic MapReduce (Amazon EMR)</a></li>
<li><a href="http://www.cloudera.com/content/cloudera/en/downloads.html" title="CDH" target="_blank">Cloudera&#8217;s Distribution including Apache Hadoop (CDH)</a></li>
<li><a href="http://hortonworks.com/hdp/" title="HDP" target="_blank">Hortonworks Data Platform Powered by Apache Hadoop (HDP)</a></li>
<li><a href="https://www.mapr.com/" title="MapR" target="_blank">MapR Technologies</a></li>
<li><a href="http://pivotal.io/big-data/pivotal-hd" title="Pivotal HD" target="_blank">Pivotal HD</a></li>
</ul>
<h1>Hadoop Distributed File System (HDFS)</h1>
<p>HDFS is a filesystem designed for distributed storage of very large files (hundreds of megabytes, gigabytes, or terabytes in size) and for distributed processing using commodity hardware. It is a hierarchical, UNIX-like file system, but internally it splits large files into blocks (from 32MB to 128MB in size, with 64MB as the default) in order to distribute and replicate these blocks among the nodes of the Hadoop cluster. The applications that use HDFS usually write data once and read it many times.</p>
<p>The HDFS has two types of nodes:</p>
<ul>
<li>The master <strong>NameNode</strong> (NN), that stores the filesystem tree and the metadata for locating the files and directories in the tree that are actually located in the DataNodes. It stores this information in memory, however, to ensure against data loss, it&#8217;s also saved to disk using two files: the namespace image and the edit log.
<ul>
<li>fsimage: a point in time snapshot of what HDFS looks like.</li>
<li>edit log: the deltas or changes to HDFS since the last snapshot.</li>
</ul>
<p>Both are periodically merged.</p>
</li>
<li>The <strong>DataNode</strong>s (DN), which are responsible for serving the actual file data (once the client knows which one to use after contacting the NameNode). They also send heartbeats every 3 seconds (by default) and block reports every hour (by default) to the NN, both for maintenance purposes.</li>
</ul>
<p>There is also a node poorly named <strong>Secondary NameNode</strong>. It is neither a failover node nor a backup node: it periodically merges the namespace image with the edit log to prevent the edit log from becoming too large. Thus, a better name for it is <strong>Checkpoint Node</strong>.</p>
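<p>The checkpoint can be pictured with a small, Hadoop-free Java sketch (the structures and operations here are invented for the illustration): the current namespace is the last fsimage snapshot plus the replayed edits, and merging clears the log:</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the Checkpoint Node's job: the namespace is the fsimage
// snapshot plus every change recorded in the edit log since the snapshot
// was taken. Checkpointing replays the edits onto the snapshot and
// empties the log. Hypothetical names, not Hadoop classes.
public class CheckpointDemo {
    static Map<String, String> fsimage = new HashMap<>();
    static List<String[]> editLog = new ArrayList<>(); // {op, path}

    static void applyEdit(String op, String path) {
        editLog.add(new String[] {op, path}); // deltas accumulate here
    }

    static void checkpoint() {
        for (String[] edit : editLog) {
            if (edit[0].equals("mkdir")) fsimage.put(edit[1], "dir");
            else if (edit[0].equals("delete")) fsimage.remove(edit[1]);
        }
        editLog.clear(); // the edit log starts over after the merge
    }

    public static void main(String[] args) {
        fsimage.put("/user", "dir");      // state captured in the last snapshot
        applyEdit("mkdir", "/user/data"); // changes since the snapshot
        applyEdit("delete", "/user");
        checkpoint();
        System.out.println(fsimage.keySet()); // [/user/data]
        System.out.println(editLog.size());   // 0
    }
}
```

Without this merge the edit log would grow without bound and a NameNode restart would take longer and longer, which is exactly the problem the Checkpoint Node prevents.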
<h2>The Command-Line Interface</h2>
<p>Once you have installed Hadoop, you can interact with HDFS, as well as other file systems that Hadoop supports (local filesystem, HFTP FS, S3 FS, and others), using the command line. The FS shell is invoked by:</p>
<p></p><pre class="crayon-plain-tag">$ hadoop fs &lt;args&gt;</pre><p></p>
<p>Provided you have hadoop on the PATH, as we saw above.</p>
<p>You can find a list of available commands at <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html" title="FS shell" target="_blank">File System Shell</a>.</p>
<p>You can perform operations like:</p>
<ul>
<li><a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#copyFromLocal" title="copyFromLocal" target="_blank">copyFromLocal</a> (putting files into HDFS)</li>
<li><a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#copyToLocal" title="copyToLocal" target="_blank">copyToLocal</a> (getting files from HDFS)</li>
<li><a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#mkdir" title="mkdir" target="_blank">mkdir</a> (creating directories in HDFS)</li>
<li><a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#ls" title="ls" target="_blank">ls</a> (list files in HDFS)</li>
</ul>
<h2>Data exchange with HDFS</h2>
<p>Hadoop is mainly written in Java, and the core class for HDFS is the abstract class <a href="https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java" title="fs.FileSystem" target="_blank">org.apache.hadoop.fs.FileSystem</a>, which represents a filesystem in Hadoop. Several concrete subclasses provide implementations, from the local filesystem (<a href="https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/LocalFileSystem.java" title="fs.LocalFileSystem" target="_blank">fs.LocalFileSystem</a>) to HDFS (<a href="https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java" title="hdfs.DistributedFileSystem" target="_blank">hdfs.DistributedFileSystem</a>) or Amazon S3 (<a href="https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java" title="fs.s3native.NativeS3FileSystem" target="_blank">fs.s3native.NativeS3FileSystem</a>), and many more (read-only HTTP, FTP server, &#8230;)</p>
<h3>Reading data</h3>
<p>Reading data using the Java API involves obtaining the abstract <a href="https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java" title="fs.FileSystem" target="_blank">FileSystem</a> via one of the factory methods (<strong><em>get()</em></strong>) or the convenience method for retrieving the local filesystem (<strong><em>getLocal()</em></strong>):</p>
<p><strong>public static FileSystem get(Configuration conf) throws IOException<br />
public static FileSystem get(URI uri, Configuration conf) throws IOException<br />
public static FileSystem get(URI uri, Configuration conf, String user)<br />
throws IOException<br />
public static LocalFileSystem getLocal(Configuration conf) throws IOException</strong></p>
<p>And then obtaining an input stream for a file (that can later be closed):</p>
<p><strong>public FSDataInputStream open(Path f) throws IOException<br />
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException</strong></p>
<p>With these methods at hand, the flow of data read from HDFS is as follows:</p>
<div id="attachment_601" style="width: 650px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/11/HDFS_Client_Read_File.jpg"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/11/HDFS_Client_Read_File.jpg" alt="HDFS read" width="640" height="367" class="size-full wp-image-601" /></a><p class="wp-caption-text">Reading from HDFS</p></div>
<ol>
<li>The client calls the <strong>open()</strong> method to read a file. A <strong>DistributedFileSystem</strong> is returned.</li>
<li>The DistributedFileSystem asks the <strong>NameNode</strong> for the block locations. The NameNode returns an ordered list of the DataNodes that have a copy of the block (sorted by proximity to the client). The DistributedFileSystem returns a <strong>FSDataInputStream</strong> to the client for it to read data from.</li>
<li>The client calls <strong>read()</strong> on the input stream.</li>
<li>The FSDataInputStream reads data for the client from the DataNode until there is no more data in that node.</li>
<li>The FSDataInputStream manages the closing and opening of connections to DataNodes transparently while serving data to the client. It also manages validation (checksums) and errors (by trying to read data from a replica).</li>
<li>When the client has finished reading, it calls <strong>close()</strong> on the FSDataInputStream.</li>
</ol>
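<p>The transparent fallback to a replica in step 5 can be mimicked with a tiny, Hadoop-free Java sketch (the names and data here are invented for the illustration; the real client reads blocks over the network, not strings from a map):</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the read path: the "namenode" has returned an ordered list
// of replica locations for a block; the client reads from the first one and
// falls back to the next replica if that read fails (e.g. a checksum error),
// just as FSDataInputStream does transparently.
public class ReadPathDemo {
    // replica location -> block data (null simulates a corrupt/unreachable copy)
    static String readBlock(Map<String, String> replicas) {
        for (Map.Entry<String, String> replica : replicas.entrySet()) {
            String data = replica.getValue();
            if (data != null) return data; // first healthy replica wins
        }
        throw new IllegalStateException("all replicas failed");
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the proximity ordering chosen by the NameNode
        Map<String, String> locations = new LinkedHashMap<>();
        locations.put("datanode-1", null);        // closest replica is corrupt
        locations.put("datanode-2", "block-data");
        System.out.println(readBlock(locations)); // block-data
    }
}
```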
<h3>Writing data</h3>
<p>The Java API allows you to create files with create methods (which, by the way, also create any parent directories of the file that don&#8217;t already exist). The API also includes a <strong>Progressable</strong> interface to be notified of the progress of the data being written to the datanodes. It&#8217;s also possible to append data to an existing file, but this functionality is optional (S3 doesn&#8217;t support it for the time being).</p>
<p><strong>public FSDataOutputStream create(Path f) throws IOException<br />
public FSDataOutputStream append(Path f) throws IOException</strong></p>
<p>The output stream will be used for writing the data. Furthermore, the FSDataOutputStream can inform of the current position in the file.</p>
<p>The flow of data written to HDFS with these methods is as follows:</p>
<div id="attachment_602" style="width: 650px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/11/HDFS_Client_Write_File.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/11/HDFS_Client_Write_File.png" alt="HDFS write" width="640" height="416" class="size-full wp-image-602" /></a><p class="wp-caption-text">Writing to HDFS</p></div>
<ol>
<li>The client creates the file by calling <strong>create()</strong> on <strong>DistributedFileSystem</strong>.</li>
<li>DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem&#8217;s namespace, with no blocks associated with it. The NameNode checks file existence and permissions, throwing an IOException if there is any problem; otherwise, it returns a FSDataOutputStream to write data to.</li>
<li>The data written by the client is split into packets that are sent to a <em>data queue</em>.</li>
<li>The data queue is consumed by the Data Streamer, which streams the packets to a pipeline of DataNodes (one per replication factor). Each DataNode stores the packet and sends it to the next DataNode in the pipeline.</li>
<li>There is another queue, the <em>ack queue</em>, which contains the packets that are waiting to be acknowledged by all the datanodes in the pipeline. If a DataNode fails during a write operation, the pipeline will be rearranged transparently for the client.</li>
<li>When the client has finished writing data, it calls <strong>close()</strong> on the stream.</li>
<li>The remaining packets are flushed and, after receiving all the acknowledgments, the NameNode is notified that the write to the file is completed.</li>
</ol>
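<p>The interplay between the data queue, the replica pipeline, and the ack queue can be sketched in plain Java (a conceptual toy with invented names, not the real DataStreamer, which runs asynchronously and handles failures):</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the HDFS write path: packets leave the data queue, are
// streamed through a pipeline of "datanodes" (one list per replica), and
// sit on the ack queue until every node in the pipeline has stored them.
public class WritePipelineDemo {
    static List<List<String>> write(List<String> packets, int replicationFactor) {
        List<List<String>> datanodes = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) datanodes.add(new ArrayList<>());

        Deque<String> dataQueue = new ArrayDeque<>(packets);
        Deque<String> ackQueue = new ArrayDeque<>();

        while (!dataQueue.isEmpty()) {
            String packet = dataQueue.poll();
            ackQueue.add(packet);               // waits for all replicas to confirm
            for (List<String> node : datanodes) // each node stores and forwards
                node.add(packet);
            ackQueue.remove(packet);            // all replicas acknowledged
        }
        return datanodes;
    }

    public static void main(String[] args) {
        List<List<String>> datanodes = write(List.of("pkt1", "pkt2", "pkt3"), 3);
        System.out.println(datanodes.get(2)); // [pkt1, pkt2, pkt3]
    }
}
```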
<h1>Apache YARN (Yet Another Resource Negotiator)</h1>
<p>YARN is Hadoop’s cluster resource management system. It provides APIs for requesting and working with cluster resources, to be used not by user code but by higher-level frameworks, like MapReduce v2, Spark, Tez&#8230;</p>
<p>YARN separates resource management and job scheduling/monitoring into separate daemons. In Hadoop 1.x these two functions were performed by the JobTracker, which was a bottleneck for scaling the number of Hadoop nodes in the cluster.</p>
<h2>YARN Components</h2>
<p>There are five major component types in a YARN cluster:</p>
<ul>
<li><strong>Resource Manager (RM)</strong>: a global per-cluster daemon that is solely responsible for allocating and managing resources available within the cluster.</li>
<li><strong>Node Manager (NM)</strong>: a per-node daemon that is responsible for creating, monitoring, and killing containers.</li>
<li><strong>Application Master (AM)</strong>: This is a per-application daemon whose duty is the negotiation of resources from the ResourceManager and to work with the NodeManager(s) to execute and monitor the tasks.</li>
<li><strong>Container</strong>: This is an abstract representation of a resource set that is given to a particular application: memory and CPU. It&#8217;s a computational unit (one node runs several containers, but a container cannot cross a node boundary). The AM is a specialized container that is used to bootstrap and manage the entire application&#8217;s life cycle.</li>
<li><strong>Application Client</strong>: it submits applications to the RM and it specifies the type of AM needed to execute the application (for instance, MapReduce).</li>
</ul>
<h2>Anatomy of a YARN Request</h2>
<p>These are the steps involved in the submission of a job to the YARN framework.</p>
<div id="attachment_649" style="width: 632px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2015/02/yarn_architecture.gif"><img src="http://malsolo.com/blog4java/wp-content/uploads/2015/02/yarn_architecture.gif" alt="Anatomy of a YARN Request" width="622" height="385" class="size-full wp-image-649" /></a><p class="wp-caption-text">YARN architecture</p></div>
<ol>
<li>The client submits a job to the RM asking to run an AM process (Job Submission in the picture above).</li>
<li>The RM looks for resources to acquire a container on a node to launch an instance of the AM.</li>
<li>The AM registers with the RM to enable the client to query the RM for details about the AM.</li>
<li>Now the AM is running: it could run the computation itself and return the result to the client, or it could request more containers from the RM to run a distributed computation (Resource Request in the picture above).</li>
<li>The application code executing in the launched containers (tasks) reports its status to the AM through an application-specific protocol (MapReduce status in the picture above, which assumes that the YARN application being executed is MapReduce).</li>
<li>Once the application completes execution, the AM deregisters with the RM, and the containers used are released back to the system.</li>
</ol>
<p>This process applies to each client that submits jobs. In the picture above there are two clients (the red one and the blue one).</p>
<h1>Hadoop first program: WordCount MapReduce</h1>
<p>MapReduce is a paradigm for data processing that uses two key phases:</p>
<ol>
<li><strong>Map</strong>: it performs a transformation on input key-value pairs to generate intermediate key-value pairs.</li>
<li><strong>Reduce</strong>: it performs a summarize function on intermediate key-value groups to generate the final output of key-value pairs.</li>
<li>The groups that are the input of the Reduce phase are created by sorting the output of the Map phase, in an operation called <strong>Shuffle/Sort</strong> (in YARN, it is an auxiliary service).</li>
</ol>
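<p>Before diving into the Hadoop classes, the whole paradigm can be sketched in a few lines of plain Java, with no Hadoop dependency (a conceptual illustration of the phases only, not how Hadoop implements them):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of word count as MapReduce: map emits (word, 1) pairs,
// the Shuffle/Sort groups them by key in sorted order, and reduce sums
// each group. Invented class name; no Hadoop types involved.
public class MapReduceSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit one (word, 1) pair per token
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
                pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }

        // Shuffle/Sort phase: group the intermediate pairs by key, in sorted order
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs)
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

        // Reduce phase: sum the values of each group
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, ones) -> counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

The Hadoop classes below play exactly these roles: the Mapper is the first loop, the framework performs the grouping, and the Reducer is the summing step.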
<h2>Writing the program</h2>
<p>To write a MapReduce program in Java and run it in Hadoop, you need to provide a Mapper class, a Reducer class, and a driver program to run the job.</p>
<p>Let&#8217;s begin with <u>the Mapper class</u>; it will emit each word with a count of 1:</p>
<p></p><pre class="crayon-plain-tag">public class WordCountMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
	
	private final static IntWritable ONE = new IntWritable(1);
	private Text word = new Text();
	
	@Override
	protected void map(LongWritable key, Text value,
			Mapper&lt;LongWritable, Text, Text, IntWritable&gt;.Context context)
			throws IOException, InterruptedException {
		String line = value.toString();
		StringTokenizer tokenizer = new StringTokenizer(line);
		while (tokenizer.hasMoreTokens()) {
			word.set(tokenizer.nextToken());
			context.write(word, ONE);
		}
	}

}</pre><p></p>
<p>Highlights here are the parameters of the Mapper class, in this case:</p>
<ol>
<li>The input key, a long that will be ignored</li>
<li>The input value, a line of text</li>
<li>The output key, the word to be counted</li>
<li>The output value, the count for the word, always one, as we said before.</li>
</ol>
<p>As you can see, instead of using Java types, it&#8217;s better to use the Hadoop basic types, which are optimized for network serialization (available in the <em>org.apache.hadoop.io</em> package).</p>
<p>The basic approach is to override the <em>map()</em> method and make use of the key and value input parameters, as well as the instance of Context, to write the output: the words with their count (one, for the time being).</p>
<p>Let&#8217;s continue with <u>the Reducer class</u>.</p>
<p></p><pre class="crayon-plain-tag">public class WordCountReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
	@Override
	protected void reduce(Text key, Iterable&lt;IntWritable&gt; values,
			Reducer&lt;Text, IntWritable, Text, IntWritable&gt;.Context context)
			throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		context.write(key, new IntWritable(sum));
	}
	
}</pre><p></p>
<p>The intermediate result from the Mapper will be partitioned by MapReduce in such a way that the same reducer will receive all output records containing the same key. MapReduce will also sort all the map output keys and will call each reducer only once for each output key along with a list of all the output values for this key.</p>
<p>Thus, to write a Reducer class, you override the <em>reduce()</em> method, which receives the key, the list of values as an Iterable, and an instance of the Context to write the final result to.</p>
<p>In our case, the reducer will sum the counts that each word carries (always one) and write the result to the context.</p>
<p>Finally, <u>the Driver class</u>, the class that runs the MapReduce job.</p>
<p></p><pre class="crayon-plain-tag">public class WordCountDriver {
	
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		String[] myArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (myArgs.length != 2) {
			System.err.println(&quot;Usage: WordCountDriver &lt;input path&gt; &lt;output path&gt;&quot;);
			System.exit(-1);
		}
		Job job = Job.getInstance(conf, &quot;Classic WordCount&quot;);
		job.setJarByClass(WordCountDriver.class);
		
		FileInputFormat.addInputPath(job, new Path(myArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(myArgs[1]));
		
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		
		//job.setMapOutputKeyClass(Text.class);
		//job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}</pre><p></p>
<p>First, create a Hadoop Configuration (the default values are enough for this example) and use the <em>GenericOptionsParser</em> class to parse only the generic Hadoop arguments. </p>
<p>To configure, submit and control the execution of the job, as well as to monitor its progress, use a <em>Job</em> object. Take care of configuring it (via its <em>set()</em> methods) before submitting the job or an <em>IllegalStateException</em> will be thrown.</p>
<p>In a Hadoop cluster the JAR package will be distributed around the cluster; to allow Hadoop to locate this JAR we pass a class in the Job&#8217;s <em>setJarByClass()</em> method.</p>
<p>Next, we specify the input and output paths by calling the static <em>addInputPath()</em> (or <em>setInputPaths()</em>) method on <em>FileInputFormat</em> (with a file, directory or file pattern) and the static <em>setOutputPath()</em> method on <em>FileOutputFormat</em> (with a non-existing directory, in order to avoid data loss from another job), respectively.</p>
<p>Then, the job is configured with the Mapper class and the Reducer class.</p>
<p>There is no need to specify the map output types because they are the same as the ones produced by the Reducer class, but we do need to indicate the output types for the reduce function.</p>
<p>Finally, the <em>waitForCompletion()</em> method on Job submits the job and waits for it to finish. The argument is a flag for verbosity in the generated output. The return value indicates success (true) or failure (false), and we use it for the program&#8217;s exit code (0 or 1).</p>
<h2>Running the program</h2>
<p>The source code is available at <a href="https://github.com/jbbarquero/mapreduce" title="Mapreduce first program" target="_blank">github</a>. You can download it, go to the directory and run the following commands:</p>
<p></p><pre class="crayon-plain-tag">$ mvn clean install
$ export HADOOP_CLASSPATH=target/mapreduce-0.0.1-SNAPSHOT.jar
$ hadoop com.malsolo.hadoop.mapreduce.WordCountDriver data/the_constitution_of_the_united_states.txt out</pre><p></p>
<p>You will see something like this:</p>
<p></p><pre class="crayon-plain-tag">15/02/25 15:30:47 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/02/25 15:30:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/02/25 15:30:47 INFO input.FileInputFormat: Total input paths to process : 1
15/02/25 15:30:47 INFO mapreduce.JobSubmitter: number of splits:1
15/02/25 15:30:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local284822998_0001
15/02/25 15:30:47 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/02/25 15:30:47 INFO mapreduce.Job: Running job: job_local284822998_0001
15/02/25 15:30:47 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/02/25 15:30:47 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/02/25 15:30:48 INFO mapred.LocalJobRunner: Waiting for map tasks
15/02/25 15:30:48 INFO mapred.LocalJobRunner: Starting task: attempt_local284822998_0001_m_000000_0
15/02/25 15:30:48 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/02/25 15:30:48 INFO mapred.MapTask: Processing split: file:.../mapreduce/data/the_constitution_of_the_united_states.txt:0+45119
15/02/25 15:30:48 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/02/25 15:30:48 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/02/25 15:30:48 INFO mapred.MapTask: soft limit at 83886080
15/02/25 15:30:48 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/02/25 15:30:48 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/02/25 15:30:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/02/25 15:30:48 INFO mapred.LocalJobRunner: 
15/02/25 15:30:48 INFO mapred.MapTask: Starting flush of map output
15/02/25 15:30:48 INFO mapred.MapTask: Spilling map output
15/02/25 15:30:48 INFO mapred.MapTask: bufstart = 0; bufend = 75556; bufvoid = 104857600
15/02/25 15:30:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26183792(104735168); length = 30605/6553600
15/02/25 15:30:48 INFO mapred.MapTask: Finished spill 0
15/02/25 15:30:48 INFO mapred.Task: Task:attempt_local284822998_0001_m_000000_0 is done. And is in the process of committing
15/02/25 15:30:48 INFO mapred.LocalJobRunner: map
15/02/25 15:30:48 INFO mapred.Task: Task 'attempt_local284822998_0001_m_000000_0' done.
15/02/25 15:30:48 INFO mapred.LocalJobRunner: Finishing task: attempt_local284822998_0001_m_000000_0
15/02/25 15:30:48 INFO mapred.LocalJobRunner: map task executor complete.
15/02/25 15:30:48 INFO mapred.LocalJobRunner: Waiting for reduce tasks
15/02/25 15:30:48 INFO mapred.LocalJobRunner: Starting task: attempt_local284822998_0001_r_000000_0
15/02/25 15:30:48 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/02/25 15:30:48 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1bd34bf7
15/02/25 15:30:48 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
15/02/25 15:30:48 INFO reduce.EventFetcher: attempt_local284822998_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
15/02/25 15:30:48 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local284822998_0001_m_000000_0 decomp: 90862 len: 90866 to MEMORY
15/02/25 15:30:48 INFO reduce.InMemoryMapOutput: Read 90862 bytes from map-output for attempt_local284822998_0001_m_000000_0
15/02/25 15:30:48 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 90862, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->90862
15/02/25 15:30:48 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
15/02/25 15:30:48 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/02/25 15:30:48 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
15/02/25 15:30:48 INFO mapred.Merger: Merging 1 sorted segments
15/02/25 15:30:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 90857 bytes
15/02/25 15:30:48 INFO reduce.MergeManagerImpl: Merged 1 segments, 90862 bytes to disk to satisfy reduce memory limit
15/02/25 15:30:48 INFO reduce.MergeManagerImpl: Merging 1 files, 90866 bytes from disk
15/02/25 15:30:48 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
15/02/25 15:30:48 INFO mapred.Merger: Merging 1 sorted segments
15/02/25 15:30:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 90857 bytes
15/02/25 15:30:48 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/02/25 15:30:48 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
15/02/25 15:30:48 INFO mapred.Task: Task:attempt_local284822998_0001_r_000000_0 is done. And is in the process of committing
15/02/25 15:30:48 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/02/25 15:30:48 INFO mapred.Task: Task attempt_local284822998_0001_r_000000_0 is allowed to commit now
15/02/25 15:30:48 INFO output.FileOutputCommitter: Saved output of task 'attempt_local284822998_0001_r_000000_0' to file:.../out/_temporary/0/task_local284822998_0001_r_000000
15/02/25 15:30:48 INFO mapred.LocalJobRunner: reduce > reduce
15/02/25 15:30:48 INFO mapred.Task: Task 'attempt_local284822998_0001_r_000000_0' done.
15/02/25 15:30:48 INFO mapred.LocalJobRunner: Finishing task: attempt_local284822998_0001_r_000000_0
15/02/25 15:30:48 INFO mapred.LocalJobRunner: reduce task executor complete.
15/02/25 15:30:48 INFO mapreduce.Job: Job job_local284822998_0001 running in uber mode : false
15/02/25 15:30:48 INFO mapreduce.Job:  map 100% reduce 100%
15/02/25 15:30:48 INFO mapreduce.Job: Job job_local284822998_0001 completed successfully
15/02/25 15:30:48 INFO mapreduce.Job: Counters: 33
	File System Counters
		FILE: Number of bytes read=283490
		FILE: Number of bytes written=809011
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=872
		Map output records=7652
		Map output bytes=75556
		Map output materialized bytes=90866
		Input split bytes=175
		Combine input records=0
		Combine output records=0
		Reduce input groups=1697
		Reduce shuffle bytes=90866
		Reduce input records=7652
		Reduce output records=1697
		Spilled Records=15304
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=8
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=525336576
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=45119
	File Output Format Counters 
		Bytes Written=17405
$</pre><p></p>
<p>And you can take a look at the result using:</p>
<p></p><pre class="crayon-plain-tag">$ sort -k2 -h -r out/part-r-00000 | head -20
the	663
of	494
shall	293
and	256
to	183
be	178
or	157
in	139
by	101
a	94
United	85
for	81
any	79
President	72
The	64
have	64
as	64
States,	55
such	52
State	47
$</pre><p></p>
<p>Regarding this example, I have to mention a couple of things:</p>
<ol>
<li>
	The code includes a data directory containing a text file (yes, the constitution of the USA)
<ul>
<li>Yes, the program needs some improvements, like not counting punctuation such as commas as part of a word.</li>
<li>It&#8217;s funny to see the most repeated words (the, of, shall, and, to, be, or, in, by, a) and the most important words (United, States, President)</li>
</ul>
</li>
<li>Due to a problem with <a href="http://wiki.apache.org/hadoop/HadoopIPv6" title="Hadoop and IPv6" target="_blank">Hadoop and IPv6</a>, it doesn&#8217;t work in pseudodistributed mode because of a connection exception (<font color="red"><em>java.net.ConnectException: Call [&#8230;] to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused</em></font>). For this example it&#8217;s enough to use local mode (just restore the original core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml; you can also stop the HDFS, YARN, and MapReduce daemons. See above)</li>
</ol>
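<p>As for the punctuation improvement mentioned above (&#8220;States,&#8221; is counted separately from &#8220;States&#8221;), one possible sketch is a hypothetical helper (not part of the posted code) that normalizes each token before the map function emits it:</p>

```java
// Hypothetical token normalization for the word count: lowercase the token
// and strip anything that isn't a letter or digit, so "States," and "The"
// collapse into "states" and "the". A sketch only, not code from the
// repository linked above.
public class TokenCleaner {
    static String normalize(String token) {
        return token.toLowerCase().replaceAll("[^\\p{L}\\p{N}]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("States,")); // states
        System.out.println(normalize("The"));     // the
    }
}
```

In the Mapper this would mean calling something like <em>word.set(normalize(tokenizer.nextToken()))</em> and skipping tokens that end up empty.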
<h1>Resources</h1>
<ul>
<li><a href="http://shop.oreilly.com/product/0636920033448.do" title="Hadoop: The Definitive Guide" target="_blank">Hadoop: The Definitive Guide, 4th Edition</a>. By Tom White (O&#8217;Reilly Media)</li>
<li><a href="http://www.apress.com/9781430248637?gtmf=s" title="Pro Apache Hadoop" target="_blank">Pro Apache Hadoop, 2nd Edition</a>. By Sameer Wadkar, Madhu Siddalingaiah, Jason Venner (Apress)</li>
<li><a href="https://www.packtpub.com/big-data-and-business-intelligence/mastering-hadoop" title="Mastering Hadoop" target="_blank">Mastering Hadoop</a>. By Sandeep Karanth (Packt publishing)</li>
<li><a href="http://www.manning.com/holmes2/" title="Hadoop in Practice, Second Edition" target="_blank">Hadoop in Practice, Second Edition</a>. By Alex Holmes (Manning publications)</li>
<li><a href="http://www.manning.com/lam2/" title="Hadoop in Action, Second Edition" target="_blank">Hadoop in Action, Second Edition</a>. By Chuck P. Lam and Mark W. Davis (Manning publications)</li>
<li><a href="https://www.youtube.com/watch?v=xYnS9PQRXTg" title="Hadoop - Just the Basics for Big Data Rookies" target="_blank">Hadoop &#8211; Just the Basics for Big Data Rookies</a>. By  Adam Shook (SpringDeveloper YouTube channel)</li>
<li><a href="https://www.youtube.com/watch?v=tIPA6vMZomQ" title="Getting started with Spring Data and Apache Hadoop" target="_blank">Getting started with Spring Data and Apache Hadoop</a>. By Thomas Risberg, Janne Valkealahti (SpringDeveloper YouTube channel)</li>
<li><a href="https://www.youtube.com/watch?v=IcuTdJgUFmo" title="Hadoop 201 -- Deeper into the Elephant" target="_blank">Hadoop 201 &#8212; Deeper into the Elephant</a>. By Roman Shaposhnik (SpringDeveloper YouTube channel)</li>
<li><a href="http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/SingleCluster.html" title="Setting up a Single Node Cluster" target="_blank">Hadoop MapReduce Next Generation &#8211; Setting up a Single Node Cluster</a>.</li>
<li><a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html" title="FileSystemShell" target="_blank">The File System (FS) shell</a></li>
<li><a href="http://www.adictosaltrabajo.com/tutoriales/tutoriales.php?pagina=mapreduce_basic" title="Primeros pasos con Hadoop: instalación y configuración en Linux" target="_blank">Primeros pasos con Hadoop: instalación y configuración en Linux</a>. By Juan Alonso Ramos (Adictos al trabajo)</li>
<li><a href="http://blog.cloudera.com/blog/2014/06/how-to-install-a-virtual-apache-hadoop-cluster-with-vagrant-and-cloudera-manager/" title="Hadoop virtual cluster: Vagrant and CDH" target="_blank">How-to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager</a>. By Justin Kestelyn (@kestelyn, Cloudera blog)</li>
<li><a href="http://hortonworks.com/blog/building-hadoop-vm-quickly-ambari-vagrant/" title="Hadoop VM: Ambari and Vagrant, HDP" target="_blank">How to build a Hadoop VM with Ambari and Vagrant</a>. By Saptak Sen (Hortonworks blog)</li>
<li><a href="http://hadoopguide.blogspot.com.es/2013/05/hadoop-hdfs-data-flow-io-classes.html" title="Hadoop HDFS Data Flow IO Classes" target="_blank">Hadoop HDFS Data Flow IO Classes</a>. By Shrey Mehrotra (Hadoop Ecosystem : Hadoop 2.x)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=516</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Orika time!</title>
		<link>http://malsolo.com/blog4java/?p=465</link>
		<comments>http://malsolo.com/blog4java/?p=465#comments</comments>
		<pubDate>Thu, 23 Oct 2014 15:52:56 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Springsource]]></category>
		<category><![CDATA[Mapping]]></category>
		<category><![CDATA[Orika]]></category>
		<category><![CDATA[POJO]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=465</guid>
		<description><![CDATA[I love good software, and Orika is a really interesting project. In this post I&#8217;m going to talk about this Java bean mapping framework that I&#8217;ve used recently and I highly recommend. Orika (formerly hosted at google code) claims to &#8230; <a href="http://malsolo.com/blog4java/?p=465">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I love good software, and Orika is a really interesting project. In this post I&#8217;m going to talk about this Java bean mapping framework that I&#8217;ve used recently and I highly recommend.</p>
<p><a href="http://orika-mapper.github.io/orika-docs/" title="Orika User Guide" target="_blank">Orika</a> (formerly hosted at <a href="https://code.google.com/p/orika/" title="Orika at google code" target="_blank">google code</a>) claims to be a <em>simpler, lighter and faster Java bean mapping</em>. And it really is.</p>
<p>It allows you to convert objects by copying their attributes from one to another. It does this using Java introspection, instead of XML configuration or the like. It uses code generation to create the mappers, and it has an interesting strategy for its core classes that allows you to optimize performance.</p>
<p>It is completely configurable: for instance, the default Java code generator is Javassist, but you can switch to EclipseJdt or plug in another provider by implementing the appropriate interface.</p>
<p>Finally, it&#8217;s a very well documented project, with clear explanations and useful code.</p>
<p>I really like it.</p>
<p>If you want to try it, just add the following dependency to your Maven project:</p>
<p></p><pre class="crayon-plain-tag">&lt;dependency&gt;
			&lt;groupId&gt;ma.glasnost.orika&lt;/groupId&gt;
			&lt;artifactId&gt;orika-core&lt;/artifactId&gt;
			&lt;version&gt;1.4.5&lt;/version&gt;&lt;!-- or latest version --&gt;
		&lt;/dependency&gt;</pre><p></p>
<p>Or <a href="https://github.com/orika-mapper/orika/releases" title="Orika releases at GitHub" target="_blank">download the library</a> along with three required libraries:</p>
<ul>
<li>javassist (v 3.12.0+)</li>
<li>slf4j (v 1.5.6+)</li>
<li>paranamer (v 2.0+)</li>
</ul>
<h1>Main concepts</h1>
<p>There are two core classes, the first one is <strong>MapperFactory</strong>, the class to configure the mappings and to obtain the second core class: <strong>MapperFacade</strong>, that actually provides the service of a Java bean mapping.</p>
<h3>MapperFactory</h3>
<p>By using a fluent API, you can construct a <strong>MapperFactory</strong> via its main implementation, <strong>DefaultMapperFactory</strong> and its static class helpers intended for building the <strong>MapperFactory</strong>.</p>
<p>Then you can obtain a <strong>ClassMapBuilder</strong> from the <strong>MapperFactory</strong> in order to declare the mapping configuration in a fluent style: mapping fields (bi-directional by default, or in one direction if you&#8217;d rather), excluding fields, specifying constructors, and customizing mappings of Lists, Sets, nested fields, and so on.</p>
<p>There is a default mapping that matches fields by name between the two classes, and you can also register the mapping in the MapperFactory (the ClassMapBuilder holds the very MapperFactory as an attribute). Both operations have their corresponding methods, as the next example shows.</p>
<p>Let&#8217;s see an example:</p>
<p>We have two classes: <a href="https://github.com/jbbarquero/orika-test/blob/master/src/main/java/com/malsolo/orika/test/dto/PersonDTO.java" title="PersonDTO at github" target="_blank">PersonDTO</a> and <a href="https://github.com/jbbarquero/orika-test/blob/master/src/main/java/com/malsolo/orika/test/domain/Person.java" title="Person at github" target="_blank">Person</a>, and we want to convert an object of the former class to an object of the latter, so let&#8217;s configure the MapperFactory:</p>
<p></p><pre class="crayon-plain-tag">MapperFactory mapperFactory = new DefaultMapperFactory.Builder().build();
mapperFactory.classMap(PersonDTO.class, Person.class) //A ClassMapBuilder&lt;PersonDTO, Person&gt;
        .field(&quot;lastNames&quot;, &quot;surnames&quot;) //Register field mappings
        .field(&quot;streetAddress&quot;, &quot;address.street&quot;)
        .field(&quot;city&quot;, &quot;address.city&quot;)
        .field(&quot;postalCode&quot;, &quot;address.zipCode&quot;)
        .byDefault() //the remaining fields on both classes should be mapped matching the fields by name
        .register(); //register the mapping with the MapperFactory.</pre><p></p>
<p>The <strong>ClassMapBuilder</strong> provides interesting mappings for excluding fields, mapping fields in one direction only (as I mentioned above, mapping is by default bi-directional) and so on. See the <a href="http://orika-mapper.github.io/orika-docs/mappings-via-classmapbuilder.html" title="Declarative Mapping Configuration: using the fluent-style ClassMapBuilder API " target="_blank">Declarative Mapping Configuration</a> and <a href="http://orika-mapper.github.io/orika-docs/advanced-mappings.html" title="Advanced Mapping Configurations: beyond the basics of ClassMapBuilder API" target="_blank">Advanced Mapping Configurations</a> for more options.</p>
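<p>For illustration, here is a hedged sketch of some of those options, reusing the classes above (the excluded field name is made up for the example):</p>
<p></p><pre class="crayon-plain-tag">mapperFactory.classMap(PersonDTO.class, Person.class)
        .exclude(&quot;internalCode&quot;) //never map this (hypothetical) field
        .fieldAToB(&quot;lastNames&quot;, &quot;surnames&quot;) //map in one direction only, from A to B
        .fieldBToA(&quot;surnames&quot;, &quot;lastNames&quot;) //or from B to A
        .byDefault()
        .register();</pre><p></p>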
<h3>MapperFacade and BoundMapperFacade</h3>
<p>The <strong>MapperFacade</strong> performs the actual mapping work. To do so, you obtain the MapperFacade from the MapperFactory and then use its map methods. There are two general options: <strong><em>map(objectA, B.class)</em></strong>, which creates a new instance of B.class and maps the property values of objectA onto it, or <strong><em>map(objectA, objectB)</em></strong>, which copies properties from objectA onto objectB, both of which must be non-null.</p>
<p>Let&#8217;s see an example: </p>
<p></p><pre class="crayon-plain-tag">MapperFacade mapper = mapperFactory.getMapperFacade();
PersonDTO personDTO = new PersonDTO();
//Set values here
Person person = mapper.map(personDTO, Person.class);
//Alternatively:
//Person person = new Person();
//mapper.map(personDTO, person);</pre><p></p>
<p>The <strong>BoundMapperFacade</strong> is intended to be used with a pair of types to map and no further customization needed. It provides improved performance over use of the standard <strong>MapperFacade</strong> and it has the same usage plus a <strong><em>mapReverse</em></strong> method for mapping in the reverse direction.</p>
<p>See below an example:</p>
<p></p><pre class="crayon-plain-tag">BoundMapperFacade&lt;PersonDTO, Person&gt; boundMapper = mapperFactory.getMapperFacade(PersonDTO.class, Person.class);
PersonDTO personDTO = new PersonDTO();
//Set values here
Person person = boundMapper.map(personDTO);
//Alternatively:
//Person person = new Person();
//boundMapper.map(personDTO, person);
//Map in the reverse
PersonDTO personDTO2 = boundMapper.mapReverse(person);
//Alternatively with two non-null objects:
//boundMapper.mapReverse(person, personDTO2); //curiously, it returns an A object instead of void</pre><p></p>
<p>If you’re primarily concerned with the mapping of a particular type to another, and your object graph has no cycles, use <strong>BoundMapperFacade</strong> because it has <u>better performance</u>: you avoid the overhead, incurred on every call to the map method, of looking up an appropriate mapping strategy for the fields of the objects to be mapped.</p>
<h3>Performance</h3>
<p>The most expensive operations are instantiation and initialization of the MapperFactory, and the MapperFacade which is obtained from it.</p>
<p>Both of them are thread-safe objects, so sharing them as singletons, or through static access, is the recommended approach.</p>
<p>For instance, you can provide static access to the MapperFactory:</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.domain.mappers;

import ma.glasnost.orika.MapperFactory;
import ma.glasnost.orika.impl.DefaultMapperFactory;

public class BaseMapper {
    static final MapperFactory MAPPER_FACTORY = new DefaultMapperFactory.Builder().build();
}</pre><p></p>
<p>And then use this class to register class maps and access the MapperFacade in a customized way:</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.domain.mappers;

import com.malsolo.orika.test.domain.Customer;
import com.malsolo.orika.test.dto.CustomerDTO;
import ma.glasnost.orika.MapperFacade;

public enum CustomerMapper {
    
    INSTANCE;
    
    private final MapperFacade mapperFacade;
    
    private CustomerMapper() {
        BaseMapper.MAPPER_FACTORY.classMap(CustomerDTO.class, Customer.class)
                .byDefault()
                .register();
        mapperFacade = BaseMapper.MAPPER_FACTORY.getMapperFacade();
    }
    
    public Customer map(CustomerDTO customerDTO) {
        return this.mapperFacade.map(customerDTO, Customer.class);
    }
    
    public CustomerDTO map(Customer customer) {
        return this.mapperFacade.map(customer, CustomerDTO.class);
    }
}</pre><p></p>
<p>In this case I&#8217;m using two new classes: <a href="https://github.com/jbbarquero/orika-test/blob/master/src/main/java/com/malsolo/orika/test/domain/Customer.java" title="Customer at GitHub" target="_blank">Customer</a> and <a href="https://github.com/jbbarquero/orika-test/blob/master/src/main/java/com/malsolo/orika/test/dto/CustomerDTO.java" title="CustomerDTO at GitHub" target="_blank">CustomerDTO</a>.</p>
<p>Finally, just use these helper classes:</p>
<p></p><pre class="crayon-plain-tag">CustomerDTO customerDTO = new CustomerDTO();
//Setters at will
Customer customer = CustomerMapper.INSTANCE.map(customerDTO);</pre><p></p>
<h1>How it works and Customization</h1>
<p>The best way to understand how Orika works is to read the <a href="http://orika-mapper.github.io/orika-docs/faq.html" title="Orika FAQ" target="_blank">FAQ</a>, but the source code is very clear and easy to follow, just in case you want to take a deeper look at the details.</p>
<p><em>&#8220;Orika uses reflection to investigate the metadata of the objects to be mapped and then generates byte-code at runtime which is used to map those classes [&#8230;]&#8221;</em></p>
<p><em>&#8220;Orika mapping is configured via Java code at runtime, via a fluent-style ClassMapBuilder API. An instance is obtained from a MapperFactory which is then used to describe the mapping from one type to another. Finally, the specified ClassMap is registered with the MapperFactory for later use in generating a byte-code mapper.&#8221;</em></p>
<p>To configure Orika, the author recommends extending the ConfigurableMapper class, and I will use this approach for using Orika with Spring (see below).</p>
<p>If you want to configure the code generation, compilation and so on, see how to do this using the MapperFactory in the next topic.</p>
<h3>MapperFactory Configuration</h3>
<p>The core class <strong>MapperFactory</strong> is instantiated as you&#8217;ve seen above, by building the provided implementation: <strong>DefaultMapperFactory</strong>. It uses a set of strategy and factory objects for resolving constructors, for compiling code (actually, they belong to its inner static class <strong>MapperFactoryBuilder</strong>, which is extended by the other inner static class you use to instantiate the MapperFactory: the <strong>Builder</strong>), for resolving properties, and especially for building class maps.</p>
<p>So you can customize all of them after calling <strong><em>Builder</em>()</strong> and before the <strong><em>build()</em></strong> method:</p>
<p></p><pre class="crayon-plain-tag">new DefaultMapperFactory.Builder()
.constructorResolverStrategy(new SimpleConstructorResolverStrategy()) //By default, you can extend it or implement your own ConstructorResolverStrategy
.compilerStrategy(new JavassistCompilerStrategy()) //By default, you can extend it, implement your own CompilerStrategy or use EclipseJdtCompilerStrategy
.propertyResolverStrategy(new IntrospectorPropertyResolver()) //By default, you can extend it or implement your own PropertyResolverStrategy
.classMapBuilderFactory(new ClassMapBuilder.Factory()) //By default; you can supply your own ClassMapBuilderFactory
.useAutoMapping(true) //enabled by default
.mapNulls(true) //enabled by default
.build();</pre><p></p>
<h3>Custom Mappers</h3>
<p>A <strong><em>Mapper&lt;A, B&gt;</em></strong> copies the properties from one object onto another, and it&#8217;s used internally by Orika (MapperFacade, MapperGenerator and MapperFactory use it, but that isn&#8217;t really relevant for this post).</p>
<p>If you want to take control of how the properties of one object instance are copied to another (non-null) object instance, just extend the abstract class <strong><em>CustomMapper&lt;A,B&gt;</em></strong>.</p>
<p>Then, you can register your custom mapper in the <strong>MapperFactory</strong> via its method <strong><em>registerMapper(Mapper&lt;A, B&gt; mapper)</em></strong>.</p>
<p>It&#8217;s intended for when the mapping behavior configured via the ClassMapBuilder API doesn’t suit your purposes, but only for copying properties; that is, the destination object must already exist. If you want to control instantiation, use a CustomConverter or an ObjectFactory, both of which return an instance. See below.</p>
<p>If you need to map nested members of the given object, you can do so by using the current <strong>MapperFacade</strong>, which is accessible via the protected <strong><em>mapperFacade</em></strong> variable.</p>
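<p>As a minimal sketch, assuming the PersonDTO and Person classes from the earlier examples, a custom mapper can be registered like this (CustomMapper provides empty default implementations, so you only override the direction you need):</p>
<p></p><pre class="crayon-plain-tag">mapperFactory.registerMapper(new CustomMapper&lt;PersonDTO, Person&gt;() {
    @Override
    public void mapAtoB(PersonDTO a, Person b, MappingContext context) {
        b.setSurnames(a.getLastNames()); //custom copying logic
        //nested members could be mapped with the inherited, protected mapperFacade
    }
});</pre><p></p>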
<p>I&#8217;ve used converters in Spring applications as components that can be scanned and used to configure the MapperFactory, by extending ConfigurableMapper as I mentioned above and I&#8217;ll show below.</p>
<h3>Object Factories</h3>
<p>An <strong><em>ObjectFactory&lt;D&gt;</em></strong> instantiates objects. You can implement this interface and register your implementation in the MapperFactory via its <strong><em>registerObjectFactory</em></strong> methods (there are a couple of overloads).</p>
<p>I&#8217;ve never used this option.</p>
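<p>Still, for completeness, a hedged sketch of what an implementation might look like, reusing the Person class above (signatures as of Orika 1.4.x; check your version):</p>
<p></p><pre class="crayon-plain-tag">public class PersonFactory implements ObjectFactory&lt;Person&gt; {
    @Override
    public Person create(Object source, MappingContext mappingContext) {
        return new Person(); //could dispatch on the runtime type of source
    }
}
//Registered with the MapperFactory:
//mapperFactory.registerObjectFactory(new PersonFactory(), TypeFactory.valueOf(Person.class));</pre><p></p>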
<h3>Custom Converters</h3>
<p>A <strong><em>Converter&lt;S, D&gt;</em></strong> combines both ObjectFactory and Mapper together, returning a new instance of the destination type with all properties copied from the source instance.</p>
<p>When you want to take complete control over the instantiation and mapping, just extend <strong><em>CustomConverter&lt;S, D&gt;</em></strong> to create the new instance and populate it with the properties from the source object.</p>
<p>Then, you can register your custom converter in the MapperFactory by obtaining its <strong>ConverterFactory</strong> via the <strong><em>getConverterFactory()</em></strong> method and then using one of its <strong><em>registerConverter()</em></strong> methods.</p>
<p>I&#8217;ve found this option useful when you need a hard conversion between classes that have the same meaning in your code but are not really related in terms of Java: a conversion, let&#8217;s say, from a date represented in a String to a date that has to be a Calendar. You have to use some formatter to do so.</p>
<p>It&#8217;s also very easy to configure converters as Spring components if you need to.</p>
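<p>As a hedged sketch of that String-to-Calendar case (the date pattern is made up for the example, and the convert signature is the Orika 1.4.x one):</p>
<p></p><pre class="crayon-plain-tag">public class StringToCalendarConverter extends CustomConverter&lt;String, Calendar&gt; {

    //Note: SimpleDateFormat is not thread-safe; synchronize or create per call in real code
    private final DateFormat format = new SimpleDateFormat(&quot;yyyy-MM-dd&quot;);

    @Override
    public Calendar convert(String source, Type&lt;? extends Calendar&gt; destinationType) {
        try {
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(format.parse(source));
            return calendar;
        } catch (ParseException e) {
            throw new IllegalArgumentException(&quot;Unparseable date: &quot; + source, e);
        }
    }
}
//Registration:
//mapperFactory.getConverterFactory().registerConverter(new StringToCalendarConverter());</pre><p></p>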
<h1>Using Orika with Spring</h1>
<p>Since the Orika core classes <strong>MapperFactory</strong> and the <strong>MapperFacade</strong> obtained from it are very expensive to instantiate and initialize, but at the same time are thread-safe, they should be shared as singletons. That is, they are good candidates for being Spring beans with singleton scope.</p>
<p>Let&#8217;s start by using the Spring Framework in a Maven project; just add the following dependency to the pom.xml file:</p>
<p></p><pre class="crayon-plain-tag">&lt;dependency&gt;
			&lt;groupId&gt;org.springframework&lt;/groupId&gt;
			&lt;artifactId&gt;spring-context&lt;/artifactId&gt;
			&lt;version&gt;4.1.1.RELEASE&lt;/version&gt;&lt;!-- or latest version --&gt;
		&lt;/dependency&gt;</pre><p></p>
<h3>The Spring application</h3>
<p>Let&#8217;s create a couple of beans to obtain a Person and a PersonDTO, let&#8217;s call them service and repository respectively:</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import com.malsolo.orika.test.domain.Person;

public interface PersonService {
    
    public Person obtainPerson();
    
}</pre><p></p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import com.malsolo.orika.test.dto.PersonDTO;

public interface PersonRepository {
    
    public PersonDTO findPerson();

}</pre><p></p>
<p>And their implementations, the one for the service will use the repository interface, and the one for the repository will be a simple in-memory implementation.</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import com.google.common.collect.ImmutableList;
import com.malsolo.orika.test.dto.PersonDTO;
import java.util.Random;
import org.springframework.stereotype.Repository;

@Repository
public class PersonRepositoryImpl implements PersonRepository {
    
    Random random = new Random();

    @Override
    public PersonDTO findPerson() {
        return this.createPersonDTO(random.nextLong());
    }

    private PersonDTO createPersonDTO(long l) {
        PersonDTO dto = new PersonDTO();
        dto.setId(l);
        //More setters here
        return dto;
    }
}</pre><p></p>
<p>Note that the service will need to convert from the DTO to the domain object, so let&#8217;s inject a mapper.</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import com.malsolo.orika.test.domain.Person;
import ma.glasnost.orika.MapperFacade;
import ma.glasnost.orika.MapperFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class PersonServiceImpl implements PersonService {
    
    @Autowired
    private PersonRepository personRepository;
    
    private MapperFacade mapper;

    @Autowired
    public void setMapperFactory(MapperFactory mapperFactory) {
        this.mapper = mapperFactory.getMapperFacade();
    }
    
    @Override
    public Person obtainPerson() {
        return mapper.map(this.personRepository.findPerson(), Person.class);
    }

}</pre><p></p>
<p>As you can see, I used setter injection (via autowiring) to provide a MapperFactory, in order to obtain a MapperFacade from it. I&#8217;m following the suggestion of having the MapperFactory as a singleton, which is easy to achieve with Spring.</p>
<p>Since MapperFactory has a particular way to be instantiated, the best option is to create a factory for exposing it, a FactoryBean:</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import ma.glasnost.orika.MapperFactory;
import ma.glasnost.orika.impl.DefaultMapperFactory;
import org.springframework.beans.factory.FactoryBean;
import org.springframework.stereotype.Component;

@Component
public class MapperFactoryFactory implements FactoryBean&lt;MapperFactory&gt; {

    @Override
    public MapperFactory getObject() throws Exception {
        return new DefaultMapperFactory.Builder().build();
    }

    @Override
    public Class&lt;?&gt; getObjectType() {
        return MapperFactory.class;
    }

    @Override
    public boolean isSingleton() {
        return true;
    }

}</pre><p></p>
<p>This approach was selected just in case I want to use a BoundMapperFacade, but it&#8217;s possible to expose a MapperFacade directly by creating a MapperFacadeFactory.</p>
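<p>Such a hypothetical MapperFacadeFactory would mirror the factory above; note that, as explained later in the post, exposing a MapperFacade bean conflicts with the ConfigurableMapper approach, so use one or the other:</p>
<p></p><pre class="crayon-plain-tag">@Component
public class MapperFacadeFactory implements FactoryBean&lt;MapperFacade&gt; {

    @Autowired
    private MapperFactory mapperFactory; //the singleton exposed by MapperFactoryFactory

    @Override
    public MapperFacade getObject() throws Exception {
        return mapperFactory.getMapperFacade();
    }

    @Override
    public Class&lt;?&gt; getObjectType() {
        return MapperFacade.class;
    }

    @Override
    public boolean isSingleton() {
        return true;
    }
}</pre><p></p>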
<p>Now, let&#8217;s configure the Spring application:</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;

@Configuration
@ComponentScan
public class OrikaSpringTest {
    
    public static void main(String... args) {
        ApplicationContext context = 
          new AnnotationConfigApplicationContext(OrikaSpringTest.class);
        PersonService personService = context.getBean(PersonService.class);
        Person person = personService.obtainPerson();
        System.out.printf(&quot;%s\n&quot;, person.toString());
    }

}</pre><p></p>
<p>This is enough for the project to run:</p>
<p></p><pre class="crayon-plain-tag">Person{id=0, name=Name 0, surnames=null, address=null}</pre><p></p>
<p>But <strong><u>pay attention</u></strong>: the mapper didn&#8217;t fully work, because there are differences between the two classes (two attributes with different names, and the same concept, address, represented with different types).</p>
<h3>Plugging in Orika Mappers and Converters with Spring</h3>
<p>Orika supports custom converters and mappers, so the only thing we need is to configure them as Spring beans in the application context. Let&#8217;s see how to do this.</p>
<p>First, we need a custom mapper. We annotate it as component to be discovered (via component scanning) by Spring.</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import com.malsolo.orika.test.domain.Address;
import com.malsolo.orika.test.domain.Person;
import com.malsolo.orika.test.dto.PersonDTO;
import ma.glasnost.orika.CustomMapper;
import ma.glasnost.orika.MappingContext;
import org.springframework.stereotype.Component;

@Component
public class PersonDtoToPersonMapper extends CustomMapper&lt;PersonDTO, Person&gt; {

    @Override
    public void mapAtoB(PersonDTO a, Person b, MappingContext context) {
        b.setId(a.getId());
        b.setName(a.getName());
        b.setSurnames(a.getLastNames());
        //I could use the protected mapperFacade if I need to map a particular field
        //this.mapperFacade.map(sourceObject, destinationClass);
        Address address = new Address();
        address.setStreet(a.getStreetAddress());
        address.setCity(a.getCity());
        address.setZipCode(a.getPostalCode());
        b.setAddress(address);
    }

    @Override
    public void mapBtoA(Person b, PersonDTO a, MappingContext context) {
        a.setId(b.getId());
        a.setName(b.getName());
        a.setLastNames(b.getSurnames());
        Address address = b.getAddress();
        if (address != null) {
            a.setStreetAddress(address.getStreet());
            a.setCity(address.getCity());
            a.setPostalCode(address.getZipCode());
        }
    }

}</pre><p></p>
<p>Finally, we have to plug the mappers and converters that belong to the Spring application context into the MapperFactory. If we extend <strong>ConfigurableMapper</strong>, we will have a MapperFactory created for us and, later on, the chance to access the application context itself to look up the custom mappers and converters that are Spring components and register them in the MapperFactory.</p>
<p>Let&#8217;s show the code:</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import java.util.Map;
import ma.glasnost.orika.Converter;
import ma.glasnost.orika.Mapper;
import ma.glasnost.orika.MapperFactory;
import ma.glasnost.orika.impl.ConfigurableMapper;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.stereotype.Component;

@Component
public class SpringConfigurableMapper extends ConfigurableMapper {

    private ApplicationContext applicationContext;

    private MapperFactory mapperFactory;

    @Autowired
    public void setApplicationContext(ApplicationContext applicationContext) {
        this.applicationContext = applicationContext;
        addSpringMappers();
        addSpringConverter();
    }

    @Override
    protected void configure(MapperFactory factory) {
        super.configure(factory);
        this.mapperFactory = factory;
    }

    private void addSpringMappers() {
        @SuppressWarnings(&quot;rawtypes&quot;)
        final Map&lt;String, Mapper&gt; mappers = applicationContext
                .getBeansOfType(Mapper.class);
        for (final Mapper&lt;?, ?&gt; mapper : mappers.values()) {
            addMapper(mapper);
        }
    }

    private void addMapper(Mapper&lt;?, ?&gt; mapper) {
        this.mapperFactory.registerMapper(mapper);
    }

    private void addSpringConverter() {
        @SuppressWarnings(&quot;rawtypes&quot;)
		final Map&lt;String, Converter&gt; converters = applicationContext
                .getBeansOfType(Converter.class);
        for (final Converter&lt;?, ?&gt; converter : converters.values()) {
            addConverter(converter);
        }
    }

    private void addConverter(Converter&lt;?, ?&gt; converter) {
        this.mapperFactory.getConverterFactory().registerConverter(converter);
    }

}</pre><p></p>
<p>Some important things to note here:</p>
<ul>
<li>Our class SpringConfigurableMapper is a @Component</li>
<p>When Spring instantiates it, the constructor inherited from <strong>ConfigurableMapper</strong> calls an init method that creates a MapperFactory and lets you configure it via the overridden method.</p>
<p>So, the first method called is <strong><em>configure(MapperFactory)</em></strong>.</p>
<p>Furthermore, don&#8217;t create your own MapperFactory as I did above, because it won&#8217;t be the one used to register the mappers and converters; this class creates its own MapperFactory instead.</p>
<li>Our class SpringConfigurableMapper has the ApplicationContext @Autowired</li>
<p>It&#8217;s the same as implementing ApplicationContextAware, but I prefer this approach.</p>
<p>Besides, it implies that the method will be called after all the beans have been created. Since you already have a MapperFactory, you can then search for custom components and converters that are spring beans and register them in the MapperFactory.</p>
<li>Our class extends ConfigurableMapper that implements MapperFacade</li>
<p>So, you have to slightly change the PersonServiceImpl, and it&#8217;s better not to have a MapperFacadeFactory, or you will get an autowiring exception because there would be two beans of type MapperFacade:</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.orika.test.spring;

import ma.glasnost.orika.MapperFacade;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import com.malsolo.orika.test.domain.Person;

@Service
public class PersonServiceImpl implements PersonService {
    
    @Autowired
    private PersonRepository personRepository;
    
    @Autowired
    private MapperFacade mapper;

    @Override
    public Person obtainPerson() {
        return mapper.map(this.personRepository.findPerson(), Person.class);
    }

}</pre><p></p>
</ul>
<p>Now, if you run the Orika with Spring main class, <a href="https://github.com/jbbarquero/orika-test/blob/master/src/main/java/com/malsolo/orika/test/spring/OrikaSpringTest.java" title="Orika Spring test at GitHub" target="_blank">OrikaSpringTest</a>, everything will run as expected:</p>
<p></p><pre class="crayon-plain-tag">Person{id=85, name=Name 85, surnames=[S., Surname 85], address=Address{street=My street 85, city=City 85, zipCode=code 85}}</pre><p></p>
<h1>Resources</h1>
<ul>
<li><a href="https://github.com/jbbarquero/orika-test" title="Source code at GitHub" target="_blank">Source code</a> for this post</li>
<li><a href="http://orika-mapper.github.io/orika-docs/" title="Orika user guide" target="_blank">Orika User Guide</a></li>
<li><a href="http://kenblair.net/orika-spring-easy-bean-mapping/" title="Orika and Spring Framework = easy bean mapping" target="_blank">Orika and Spring Framework = easy bean mapping</a> by Ken Blair</li>
<li><a href="http://projects.spring.io/spring-framework/#quick-start" title="Spring Framework Quick Start" target="_blank">Spring Framework Quick Start</a></li>
<li><a href="http://docs.spring.io/spring/docs/current/spring-framework-reference/htmlsingle/#extensible-xml-meat" title="Spring Framework Reference Documentation: 35.7. Meatier examples" target="_blank">Spring Framework Reference Documentation (Meatier examples)</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=465</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>10 useful tools for me as Java EE developer</title>
		<link>http://malsolo.com/blog4java/?p=428</link>
		<comments>http://malsolo.com/blog4java/?p=428#comments</comments>
		<pubDate>Fri, 05 Sep 2014 08:10:18 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Personal]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=428</guid>
		<description><![CDATA[Yes, I love trivial things in my computers as terminal appearance, desktop backgrounds (from time to time I look for wallpapers at Wallbase.cc) and some programs as Rainmeter or Fences (on Windows) See an example: But now let&#8217;s see what &#8230; <a href="http://malsolo.com/blog4java/?p=428">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Yes, I love trivial things in my computers, such as terminal appearance, desktop backgrounds (from time to time I look for wallpapers at <a href="http://wallbase.cc/" title="Wallbase" target="_blank">Wallbase.cc</a>), and programs such as <a href="http://rainmeter.net/" title="Rainmeter" target="_blank">Rainmeter</a> or <a href="http://www.stardock.com/products/fences" title="Fences" target="_blank">Fences</a> (on Windows).</p>
<p>See an example:<br />
<div id="attachment_429" style="width: 310px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/09/Terminal-as-console-2.jpg"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/09/Terminal-as-console-2-300x168.jpg" alt="Terminal img" width="300" height="168" class="size-medium wp-image-429" /></a><p class="wp-caption-text">Terminal kermit-green and black on Ubuntu</p></div></p>
<p>But now let&#8217;s see what tools I want to have installed on any computer I use for my daily work as a Java EE developer, regardless of the particular project.</p>
<p>As a side note, I alternate operating systems whenever I can. Windows is the preferred one at most companies, I like Ubuntu, and I love my MacBook Pro, so I make the effort of using all of them. Besides, it allows me to test whether my applications are really <a href="http://en.wikipedia.org/wiki/Write_once,_run_anywhere" title="Write Once, Run Anywhere" target="_blank">WORA</a>.</p>
<ul>
<li>A Java IDE: <a href="https://www.eclipse.org/" title="Eclipse" target="_blank">Eclipse</a> (<a href="http://spring.io/tools/sts" title="STS" target="_blank">Spring Tool Suite</a>), <a href="https://netbeans.org/" title="NetBeans" target="_blank">NetBeans</a> or <a href="http://www.jetbrains.com/idea/" title="IntelliJ IDEA" target="_blank">IntelliJ IDEA</a>. Now I&#8217;m trying all of them just for fun.</li>
<li>Text editor: <a href="http://www.sublimetext.com/" title="Sublime Text" target="_blank">Sublime Text</a> is my choice, but on Windows I used to love <a href="http://www.ultraedit.com/" title="UltraEdit" target="_blank">UltraEdit</a> and I use <a href="http://notepad-plus-plus.org/" title="Notepad++" target="_blank">Notepad++</a> a lot.</li>
<li>A web browser, mainly <a href="https://www.google.com/chrome/browser/" title="Chrome" target="_blank">Chrome</a>. I use <a href="https://www.mozilla.org/en-US/firefox/new/" title="Firefox" target="_blank">Firefox</a> a lot too, and from time to time, my beloved <a href="http://www.opera.com/" title="Opera" target="_blank">Opera</a>. There&#8217;s also time for <strong>IE</strong> on Windows and <strong>Safari</strong> on OS X.</li>
<li>A password manager: <a href="http://keepass.info/" title="KeePass" target="_blank">KeePass</a> or <a href="https://agilebits.com/onepassword" title="1Password" target="_blank">1Password</a> on OS X.</li>
<li>I love to use the <strong>terminal</strong>. On Windows I use <a href="http://sourceforge.net/projects/console/files/" title="Console 2" target="_blank">Console 2</a>.</li>
<li><a href="http://www.oracle.com/technetwork/java/index.html" title="Java" target="_blank">Java</a></li>
<li><a href="http://spring.io/" title="Spring" target="_blank">Spring Framework</a></li>
<li><a href="http://maven.apache.org/" title="Maven" target="_blank">Maven</a></li>
<li><a href="http://git-scm.com/" title="Git" target="_blank">Git</a>. I love <a href="https://github.com/jbbarquero" title="GitHub" target="_blank">GitHub</a>.</li>
<li><a href="https://www.youtube.com/" title="YouTube" target="_blank">YouTube</a>. It provides me tutorials, music and some fun when I need a little rest.</li>
</ul>
<p>I also have to mention <a href="http://www.google.com" title="Google" target="_blank">Google</a> and <a href="http://stackoverflow.com/" title="stackoverflow" target="_blank">stackoverflow</a>; nowadays I can&#8217;t imagine working without them.</p>
<p>And what about application servers? Since I wanted to title the post <em>“10 tools…”</em>, they fell off the list. But to say a little about them: in each project I have to use the application server the application will be deployed on, although I always use <a href="http://tomcat.apache.org/" title="Tomcat" target="_blank">Tomcat</a> for local tests. For anything beyond servlets and JSPs I use open source projects (<a href="http://activemq.apache.org/" title="ActiveMQ" target="_blank">ActiveMQ</a> if I need JMS, for instance). There is always a target server in the development environment, so I&#8217;ve never used <a href="https://glassfish.java.net/" title="GlassFish" target="_blank">GlassFish</a> or <a href="http://tomee.apache.org/apache-tomee.html" title="Apache TomEE" target="_blank">TomEE</a>, but I want to try <a href="http://wildfly.org/" title="WildFly" target="_blank">WildFly</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=428</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introduction to Spring Batch. Part II: more on running a Job</title>
		<link>http://malsolo.com/blog4java/?p=375</link>
		<comments>http://malsolo.com/blog4java/?p=375#comments</comments>
		<pubDate>Thu, 04 Sep 2014 15:45:23 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Springsource]]></category>
		<category><![CDATA[JSR-352]]></category>
		<category><![CDATA[Spring Batch]]></category>
		<category><![CDATA[Spring Framework]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=375</guid>
		<description><![CDATA[In the previous blog post entry, we introduced Spring Batch with a simple exposition of its features, main concepts both for configuring and running Batch Jobs. We also saw a sample application and two ways of running it: by invoking &#8230; <a href="http://malsolo.com/blog4java/?p=375">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In the previous blog post entry, <a href="http://malsolo.com/blog4java/?p=260" title="Introduction to Spring Batch" target="_blank">we introduced Spring Batch</a> with a simple exposition of its features and of the main concepts for configuring and running batch jobs.</p>
<p>We also saw a sample application and two ways of running it: by invoking a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobLauncher.html" title="Interface JobLauncher" target="_blank">JobLauncher</a> bean or by using <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/support/CommandLineJobRunner.html" title="Class CommandLineJobRunner" target="_blank">CommandLineJobRunner</a> from the command line.</p>
<p>In this blog entry, we&#8217;ll see two additional ways to run a Spring Batch job:</p>
<ol>
<li>Using <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobOperator.html" title="Interface JobOperator" target="_blank">JobOperator</a>, in order to have control of the batch process, from starting a job to monitoring tasks such as stopping, restarting, or summarizing a Job. We&#8217;ll only pay attention to the start operation, but once a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobOperator.html" title="Interface JobOperator" target="_blank">JobOperator</a> is configured, you can use it for the remaining monitoring tasks.</li>
<li>Using <a href="http://projects.spring.io/spring-boot/" title="Spring Boot" target="_blank">Spring Boot</a>, the new convention-over-configuration centric framework from the Spring team, which lets you create applications that &#8220;just run&#8221; with only a few lines of code, because Spring Boot provides a lot of features based on what you have in your classpath.</li>
</ol>
<p>As usual, all the source code is available at <a href="https://github.com/jbbarquero/spring-batch-sample" title="My Spring Batch sample at GitHub" target="_blank">GitHub</a>.</p>
<h3>Running the sample: JobOperator</h3>
<p><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobOperator.html" title="Interface JobOperator" target="_blank">JobOperator</a> is an interface that provides operations <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#JobOperator" title="JobOperator" target="_blank">for inspecting and controlling jobs</a>, mainly for a command-line client or a remote launcher like a JMX console.</p>
<p>The implementation that Spring Batch provides, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/support/SimpleJobOperator.html" title="Class SimpleJobOperator" target="_blank">SimpleJobOperator</a>, uses <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobLauncher.html" title="Interface JobLauncher" target="_blank">JobLauncher</a>, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/repository/JobRepository.html" title="Interface JobRepository" target="_blank">JobRepository</a>, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/explore/JobExplorer.html" title="Interface JobExplorer" target="_blank">JobExplorer</a>, and <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/JobRegistry.html" title="Interface JobRegistry" target="_blank">JobRegistry</a> to perform its operations. These beans are created by the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/EnableBatchProcessing.html" title="Annotation Type EnableBatchProcessing" target="_blank">@EnableBatchProcessing</a> annotation, so we can create an <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/AdditionalBatchConfiguration.java" title="AdditionalBatchConfiguration.java" target="_blank">additional batch configuration</a> file with these dependencies autowired and later import it into the <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/BatchConfiguration.java" title="BatchConfiguration.java" target="_blank">batch configuration</a> file (not without issues, due to the way Spring loads the application context; we&#8217;ll see this shortly):</p>
<p></p><pre class="crayon-plain-tag">@Configuration
public class AdditionalBatchConfiguration {

    @Autowired
    JobRepository jobRepository;
    @Autowired
    JobRegistry jobRegistry;
    @Autowired
    JobLauncher jobLauncher;
    @Autowired
    JobExplorer jobExplorer;

    @Bean
    public JobOperator jobOperator() {
        SimpleJobOperator jobOperator = new SimpleJobOperator();
        jobOperator.setJobExplorer(jobExplorer);
        jobOperator.setJobLauncher(jobLauncher);
        jobOperator.setJobRegistry(jobRegistry);
        jobOperator.setJobRepository(jobRepository);
        return jobOperator;
    }

}</pre><p></p>
<p>And the <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/context/annotation/Import.html" title="Annotation Type Import" target="_blank">@Import</a>:</p>
<p></p><pre class="crayon-plain-tag">@Configuration
@EnableBatchProcessing
@Import(AdditionalBatchConfiguration.class)
public class BatchConfiguration {

	// Omitted

}</pre><p></p>
<p>Now it seems easy to run the job with a <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/MainJobOperator.java" title="MainJobOperator.java" target="_blank">main class</a>:</p>
<p></p><pre class="crayon-plain-tag">@Component
public class MainJobOperator {

    @Autowired
    JobOperator jobOperator;

    @Autowired
    Job importUserJob;

    public static void main(String... args) throws JobParametersInvalidException, JobInstanceAlreadyExistsException, NoSuchJobException, DuplicateJobException, NoSuchJobExecutionException {

        AnnotationConfigApplicationContext context = new AnnotationConfigApplicationContext(ApplicationConfiguration.class);

        MainJobOperator main = context.getBean(MainJobOperator.class);
        long executionId = main.jobOperator.start(main.importUserJob.getName(), null);

        MainHelper.reportResults(main.jobOperator, executionId);
        MainHelper.reportPeople(context.getBean(JdbcTemplate.class));

        context.close();

        System.out.printf(&quot;\nFIN %s&quot;, main.getClass().getName());

    }
}</pre><p></p>
<p>But there&#8217;s a little problem&#8230; it doesn&#8217;t work:</p>
<p></p><pre class="crayon-plain-tag">Exception in thread "main" org.springframework.batch.core.launch.NoSuchJobException: No job configuration with the name [importUserJob] was registered
	at org.springframework.batch.core.configuration.support.MapJobRegistry.getJob(MapJobRegistry.java:66)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
	at org.springframework.batch.core.configuration.annotation.SimpleBatchConfiguration$PassthruAdvice.invoke(SimpleBatchConfiguration.java:127)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:207)
	at com.sun.proxy.$Proxy14.getJob(Unknown Source)
	at org.springframework.batch.core.launch.support.SimpleJobOperator.start(SimpleJobOperator.java:310)
	at com.malsolo.springframework.batch.sample.MainJobOperator.main(MainJobOperator.java:15)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

Process finished with exit code 1</pre><p></p>
<p>The problem here, <em><font color=red>No job configuration with the name [<strong>importUserJob</strong>] was registered</font></em>, is due to the way that <a href="http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/core/launch/JobOperator.html#start-java.lang.String-java.lang.String-" title="API" target="_blank">JobOperator.start(String jobName, String parameters)</a> works.</p>
<p>The main difference with <a href="http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/core/launch/JobLauncher.html#run-org.springframework.batch.core.Job-org.springframework.batch.core.JobParameters-" title="API" target="_blank">JobLauncher.run(Job job, JobParameters jobParameters)</a> is that the former takes Strings as parameters while the latter uses objects directly.</p>
<p>So <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobOperator.html" title="Interface JobOperator" target="_blank">JobOperator</a>, actually <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/support/SimpleJobOperator.html" title="Class SimpleJobOperator" target="_blank">SimpleJobOperator</a>, has to obtain a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/Job.html" title="Interface Job" target="_blank">Job</a> with the provided name. In order to do so, it uses the <a href="http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/core/configuration/JobLocator.html#getJob-java.lang.String-" title="API" target="_blank">JobRegistry.getJob(String name)</a> method. The implementation Spring Batch provides is <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/support/MapJobRegistry.html" title="Class MapJobRegistry" target="_blank">MapJobRegistry</a>, which uses a ConcurrentMap, keyed by job name, to store a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/JobFactory.html" title="Interface JobFactory" target="_blank">JobFactory</a> that creates the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/Job.html" title="Interface Job" target="_blank">Job</a> when requested.</p>
<p>The problem is that this map has not been populated.</p>
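<p>To make the failure concrete, here is a hypothetical plain-Java sketch of the registry idea (no Spring classes involved, all names made up): a ConcurrentMap keyed by job name stores a factory, and the lookup fails for any name nobody registered, just like MapJobRegistry does above.</p>
<p></p><pre class="crayon-plain-tag">import java.util.concurrent.ConcurrentHashMap;

// Hypothetical mini-registry mimicking the MapJobRegistry mechanism.
// The map stores a factory under the job name; getJob() fails for
// unregistered names, which is exactly the NoSuchJobException scenario.
public class MiniJobRegistry {

    interface JobFactory {
        String createJob();
    }

    private final ConcurrentHashMap factories = new ConcurrentHashMap();

    public void register(String jobName, JobFactory factory) {
        factories.put(jobName, factory);
    }

    public String getJob(String jobName) {
        JobFactory factory = (JobFactory) factories.get(jobName);
        if (factory == null) {
            throw new IllegalStateException("No job configuration with the name ["
                    + jobName + "] was registered");
        }
        return factory.createJob(); // the factory creates the job on demand
    }

    public static void main(String[] args) {
        MiniJobRegistry registry = new MiniJobRegistry();
        try {
            registry.getJob("importUserJob"); // nothing registered yet: it fails
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        registry.register("importUserJob", () -> "importUserJob instance");
        System.out.println(registry.getJob("importUserJob")); // now it works
    }
}</pre><p></p>
<p>Registering first and looking up afterwards is precisely what we have to arrange for the real registry, either by hand or automatically.</p>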
<p>The first solution is easy: the JobRegistry allows you to register a JobFactory at runtime and later obtain the Job from it, as explained above. So we only need to create this JobFactory&#8230;</p>
<p></p><pre class="crayon-plain-tag">@Configuration
public class AdditionalBatchConfiguration {
//    Rest omitted
    @Autowired
    Job importUserJob;

    @Bean
    public JobFactory jobFactory() {
        return new ReferenceJobFactory(importUserJob);
    }
}</pre><p></p>
<p>&#8230;and register it in the main method:</p>
<p></p><pre class="crayon-plain-tag">@Component
public class MainJobOperator {

    @Autowired
    JobFactory jobFactory;
    @Autowired
    JobRegistry jobRegistry;
    @Autowired
    JobOperator jobOperator;

    @Autowired
    Job importUserJob;

    public static void main(String... args) throws JobParametersInvalidException, JobInstanceAlreadyExistsException, NoSuchJobException, DuplicateJobException, NoSuchJobExecutionException {

        AnnotationConfigApplicationContext context = new AnnotationConfigApplicationContext(ApplicationConfiguration.class);

        MainJobOperator main = context.getBean(MainJobOperator.class);
        main.jobRegistry.register(main.jobFactory);
        long executionId = main.jobOperator.start(main.importUserJob.getName(), null);

        MainHelper.reportResults(main.jobOperator, executionId);
        MainHelper.reportPeople(context.getBean(JdbcTemplate.class));

        context.close();

        System.out.printf(&quot;\nFIN %s&quot;, main.getClass().getName());

    }
}</pre><p></p>
<p>And now it works:</p>
<p></p><pre class="crayon-plain-tag">***********************************************************
JobExecution: id=0, version=2, startTime=2014-09-04 13:03:37.964, endTime=2014-09-04 13:03:38.141, lastUpdated=2014-09-04 13:03:38.141, status=COMPLETED, exitStatus=exitCode=COMPLETED;exitDescription=, job=[JobInstance: id=0, version=0, Job=[importUserJob]], jobParameters=[{}]
* Steps executed:
StepExecution: id=0, version=3, name=step1, status=COMPLETED, exitStatus=COMPLETED, readCount=5, filterCount=0, writeCount=5 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=1, rollbackCount=0, exitDescription=
***********************************************************
***********************************************************
* People found:

* Found firstName: JILL, lastName: DOE in the database

* Found firstName: JOE, lastName: DOE in the database

* Found firstName: JUSTIN, lastName: DOE in the database

* Found firstName: JANE, lastName: DOE in the database

* Found firstName: JOHN, lastName: DOE in the database
***********************************************************</pre><p></p>
<p>But I don&#8217;t like this approach; it&#8217;s too manual.</p>
<p>I&#8217;d rather populate the JobRegistry automatically, and Spring Batch provides two mechanisms for doing so: <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#d4e1228" title="JobRegistryBeanPostProcessor" target="_blank">a bean post-processor</a>, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/support/JobRegistryBeanPostProcessor.html" title="Class JobRegistryBeanPostProcessor" target="_blank">JobRegistryBeanPostProcessor</a>, and <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#d4e1233" title="AutomaticJobRegistrar" target="_blank">a component</a>, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/support/AutomaticJobRegistrar.html" title="Class AutomaticJobRegistrar" target="_blank">AutomaticJobRegistrar</a>, that loads and unloads Jobs by creating child contexts and registering jobs from those contexts as they are created.</p>
<p>We&#8217;ll see the post-processor approach, because it&#8217;s very easy: just declare the bean in the batch configuration and run the original main class.</p>
<p></p><pre class="crayon-plain-tag">@Configuration
@EnableBatchProcessing
@Import(AdditionalBatchConfiguration.class)
public class BatchConfiguration {

	// Omitted

    @Bean
    public JobRegistryBeanPostProcessor jobRegistryBeanPostProcessor(JobRegistry jobRegistry) {
        JobRegistryBeanPostProcessor jobRegistryBeanPostProcessor = new JobRegistryBeanPostProcessor();
        jobRegistryBeanPostProcessor.setJobRegistry(jobRegistry);
        return jobRegistryBeanPostProcessor;
    }

}</pre><p></p>
<p>The bean post-processor has to be declared in this configuration file so that the job is registered when it&#8217;s created (this is the issue I mentioned before: if you declare the post-processor in another Java configuration file, for instance in <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/AdditionalBatchConfiguration.java" title="AdditionalBatchConfiguration.java" target="_blank">AdditionalBatchConfiguration</a>, it will never receive the job bean). It uses the same <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/JobRegistry.html" title="Interface JobRegistry" target="_blank">JobRegistry</a> that the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobOperator.html" title="Interface JobOperator" target="_blank">JobOperator</a> uses to launch the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/Job.html" title="Interface Job" target="_blank">Job</a>. Actually, it&#8217;s the only one that exists, but it&#8217;s good to know this.</p>
<p>It also works:</p>
<p></p><pre class="crayon-plain-tag">***********************************************************
JobExecution: id=0, version=2, startTime=2014-09-04 13:20:07.343, endTime=2014-09-04 13:20:07.522, lastUpdated=2014-09-04 13:20:07.522, status=COMPLETED, exitStatus=exitCode=COMPLETED;exitDescription=, job=[JobInstance: id=0, version=0, Job=[importUserJob]], jobParameters=[{}]
* Steps executed:
StepExecution: id=0, version=3, name=step1, status=COMPLETED, exitStatus=COMPLETED, readCount=5, filterCount=0, writeCount=5 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=1, rollbackCount=0, exitDescription=
***********************************************************
***********************************************************
* People found:

* Found firstName: JILL, lastName: DOE in the database

* Found firstName: JOE, lastName: DOE in the database

* Found firstName: JUSTIN, lastName: DOE in the database

* Found firstName: JANE, lastName: DOE in the database

* Found firstName: JOHN, lastName: DOE in the database
***********************************************************</pre><p></p>
<h3>Running the sample: Spring Boot</h3>
<p>We&#8217;d like to see how quick and easy Spring Boot is for launching Spring Batch applications once we already have a working configuration.</p>
<p>And it seems to be a piece of cake (the tricky part is knowing what is happening under the hood):</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.springframework.batch.sample;
@ComponentScan(excludeFilters = {@ComponentScan.Filter(type = FilterType.ASSIGNABLE_TYPE, value = ApplicationConfiguration.class)})
@EnableAutoConfiguration
public class MainBoot {

    public static void main(String... args) {

        ApplicationContext context = SpringApplication.run(MainBoot.class);

        MainHelper.reportPeople(context.getBean(JdbcTemplate.class));

    }
}</pre><p></p>
<p>Actually, Spring Boot is beyond the scope of this topic (it deserves its own entry, or even an entire book), but we can summarize the important code here:</p>
<ul>
<li>Line 3: <a href="http://docs.spring.io/spring-boot/docs/current/api/index.html?org/springframework/boot/autoconfigure/EnableAutoConfiguration.html" title="Annotation Type EnableAutoConfiguration" target="_blank">@EnableAutoConfiguration</a>; with this annotation you ask Spring Boot to instantiate the beans you&#8217;re going to need, based on the libraries on your classpath.</li>
<li>Line 8: <a href="http://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/SpringApplication.html#run(java.lang.Object[], java.lang.String[])" title="API" target="_blank">the run method</a>, to bootstrap the application by passing the class itself (in our case, <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/MainBoot.java" title="MainBoot.java" target="_blank">MainBoot</a>) that serves as the primary Spring component.</li>
</ul>
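<p>The &#8220;based on the libraries on your classpath&#8221; part can be illustrated with a tiny, hypothetical probe (this is only an analogy of the idea, not Spring Boot&#8217;s actual auto-configuration machinery): decide what to configure by checking which classes can be loaded.</p>
<p></p><pre class="crayon-plain-tag">// Hypothetical illustration of the classpath-detection idea behind
// @EnableAutoConfiguration: check whether a class is loadable and
// configure accordingly. A rough analogy, not Spring Boot's mechanism.
public class ClasspathProbe {

    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With spring-batch on the classpath, the first check would succeed
        // and batch-related beans would be auto-configured.
        System.out.println("Batch available: "
                + isPresent("org.springframework.batch.core.Job"));
        System.out.println("JDBC available: "
                + isPresent("java.sql.DriverManager"));
    }
}</pre><p></p>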
<p>It&#8217;s enough to run the Batch application:</p>
<p></p><pre class="crayon-plain-tag">.   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v1.1.4.RELEASE)

2014-09-04 16:58:55.109  INFO 15646 --- [           main] c.m.s.batch.sample.MainBoot              : Starting MainBoot on jbeneito-Latitude-3540 with PID 15646 (/home/jbeneito/Documents/git/spring-batch-sample/target/classes started by jbeneito in /home/jbeneito/Documents/git/spring-batch-sample)
2014-09-04 16:58:55.237  INFO 15646 --- [           main] s.c.a.AnnotationConfigApplicationContext : Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@6366ebe0: startup date [Thu Sep 04 16:58:55 CEST 2014]; root of context hierarchy
2014-09-04 16:58:56.039  INFO 15646 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Overriding bean definition for bean 'jdbcTemplate': replacing [Root bean: class [null]; scope=; abstract=false; lazyInit=false; autowireMode=3; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=batchConfiguration; factoryMethodName=jdbcTemplate; initMethodName=null; destroyMethodName=(inferred); defined in class path resource [com/malsolo/springframework/batch/sample/BatchConfiguration.class]] with [Root bean: class [null]; scope=; abstract=false; lazyInit=false; autowireMode=3; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=mainForSpringInfo; factoryMethodName=jdbcTemplate; initMethodName=null; destroyMethodName=(inferred); defined in class path resource [com/malsolo/springframework/batch/sample/MainForSpringInfo.class]]
2014-09-04 16:58:56.691  WARN 15646 --- [           main] o.s.c.a.ConfigurationClassEnhancer       : @Bean method ScopeConfiguration.stepScope is non-static and returns an object assignable to Spring's BeanFactoryPostProcessor interface. This will result in a failure to process annotations such as @Autowired, @Resource and @PostConstruct within the method's declaring @Configuration class. Add the 'static' modifier to this method to avoid these container lifecycle issues; see @Bean Javadoc for complete details
2014-09-04 16:58:56.724  WARN 15646 --- [           main] o.s.c.a.ConfigurationClassEnhancer       : @Bean method ScopeConfiguration.jobScope is non-static and returns an object assignable to Spring's BeanFactoryPostProcessor interface. This will result in a failure to process annotations such as @Autowired, @Resource and @PostConstruct within the method's declaring @Configuration class. Add the 'static' modifier to this method to avoid these container lifecycle issues; see @Bean Javadoc for complete details
2014-09-04 16:58:56.729  INFO 15646 --- [           main] f.a.AutowiredAnnotationBeanPostProcessor : JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
2014-09-04 16:58:56.975  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'batchConfiguration' of type [class com.malsolo.springframework.batch.sample.BatchConfiguration$$EnhancerBySpringCGLIB$$c3ec56ab] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:57.145  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'org.springframework.transaction.annotation.ProxyTransactionManagementConfiguration' of type [class org.springframework.transaction.annotation.ProxyTransactionManagementConfiguration$$EnhancerBySpringCGLIB$$5a15b25b] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:57.238  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'transactionAttributeSource' of type [class org.springframework.transaction.annotation.AnnotationTransactionAttributeSource] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:57.294  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'transactionInterceptor' of type [class org.springframework.transaction.interceptor.TransactionInterceptor] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:57.301  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'org.springframework.transaction.config.internalTransactionAdvisor' of type [class org.springframework.transaction.interceptor.BeanFactoryTransactionAttributeSourceAdvisor] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:57.374  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'dataSourceConfiguration' of type [class com.malsolo.springframework.batch.sample.DataSourceConfiguration$$EnhancerBySpringCGLIB$$18a97d02] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:57.455  INFO 15646 --- [           main] o.s.j.d.e.EmbeddedDatabaseFactory        : Creating embedded database 'testdb'
2014-09-04 16:58:58.156  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executing SQL script from class path resource [schema-all.sql]
2014-09-04 16:58:58.178  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executed SQL script from class path resource [schema-all.sql] in 20 ms.
2014-09-04 16:58:58.178  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executing SQL script from class path resource [org/springframework/batch/core/schema-hsqldb.sql]
2014-09-04 16:58:58.189  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executed SQL script from class path resource [org/springframework/batch/core/schema-hsqldb.sql] in 11 ms.
2014-09-04 16:58:58.198  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'dataSource' of type [class org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseFactory$EmbeddedDataSourceProxy] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:58.206  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'org.springframework.boot.autoconfigure.jdbc.DataSourceAutoConfiguration$DataSourceInitializerConfiguration' of type [class org.springframework.boot.autoconfigure.jdbc.DataSourceAutoConfiguration$DataSourceInitializerConfiguration$$EnhancerBySpringCGLIB$$c0608242] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:58.257  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'spring.datasource.CONFIGURATION_PROPERTIES' of type [class org.springframework.boot.autoconfigure.jdbc.DataSourceProperties] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:58.260  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executing SQL script from URL [file:/home/jbeneito/Documents/git/spring-batch-sample/target/classes/schema-all.sql]
2014-09-04 16:58:58.262  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executed SQL script from URL [file:/home/jbeneito/Documents/git/spring-batch-sample/target/classes/schema-all.sql] in 2 ms.
2014-09-04 16:58:58.263  WARN 15646 --- [           main] o.s.b.a.jdbc.DataSourceInitializer       : Could not send event to complete DataSource initialization (ApplicationEventMulticaster not initialized - call 'refresh' before multicasting events via the context: org.springframework.context.annotation.AnnotationConfigApplicationContext@6366ebe0: startup date [Thu Sep 04 16:58:55 CEST 2014]; root of context hierarchy)
2014-09-04 16:58:58.263  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'dataSourceInitializer' of type [class org.springframework.boot.autoconfigure.jdbc.DataSourceInitializer] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:58.277  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'org.springframework.batch.core.configuration.annotation.SimpleBatchConfiguration' of type [class org.springframework.batch.core.configuration.annotation.SimpleBatchConfiguration$$EnhancerBySpringCGLIB$$85a27e41] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:58.320  INFO 15646 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'jobRegistry' of type [class com.sun.proxy.$Proxy25] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2014-09-04 16:58:58.795  INFO 15646 --- [           main] o.s.b.c.r.s.JobRepositoryFactoryBean     : No database type set, using meta data indicating: HSQL
2014-09-04 16:58:59.056  INFO 15646 --- [           main] o.s.b.c.l.support.SimpleJobLauncher      : No TaskExecutor has been set, defaulting to synchronous executor.
2014-09-04 16:58:59.361  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executing SQL script from class path resource [org/springframework/batch/core/schema-hsqldb.sql]
2014-09-04 16:58:59.381  INFO 15646 --- [           main] o.s.jdbc.datasource.init.ScriptUtils     : Executed SQL script from class path resource [org/springframework/batch/core/schema-hsqldb.sql] in 20 ms.
2014-09-04 16:58:59.890  INFO 15646 --- [           main] o.s.j.e.a.AnnotationMBeanExporter        : Registering beans for JMX exposure on startup
2014-09-04 16:58:59.926  INFO 15646 --- [           main] o.s.b.a.b.JobLauncherCommandLineRunner   : Running default command line with: []
2014-09-04 16:59:00.023  INFO 15646 --- [           main] o.s.b.c.l.support.SimpleJobLauncher      : Job: [FlowJob: [name=importUserJob]] launched with the following parameters: [{run.id=1}]
2014-09-04 16:59:00.060  INFO 15646 --- [           main] o.s.batch.core.job.SimpleStepHandler     : Executing step: [step1]
Converting (firstName: Jill, lastName: Doe) into (firstName: JILL, lastName: DOE)
Converting (firstName: Joe, lastName: Doe) into (firstName: JOE, lastName: DOE)
Converting (firstName: Justin, lastName: Doe) into (firstName: JUSTIN, lastName: DOE)
Converting (firstName: Jane, lastName: Doe) into (firstName: JANE, lastName: DOE)
Converting (firstName: John, lastName: Doe) into (firstName: JOHN, lastName: DOE)
2014-09-04 16:59:00.162  INFO 15646 --- [           main] o.s.b.c.l.support.SimpleJobLauncher      : Job: [FlowJob: [name=importUserJob]] completed with the following parameters: [{run.id=1}] and the following status: [COMPLETED]
2014-09-04 16:59:00.164  INFO 15646 --- [           main] c.m.s.batch.sample.MainBoot              : Started MainBoot in 5.963 seconds (JVM running for 8.019)
***********************************************************
* People found:

* Found firstName: JILL, lastName: DOE in the database

* Found firstName: JOE, lastName: DOE in the database

* Found firstName: JUSTIN, lastName: DOE in the database

* Found firstName: JANE, lastName: DOE in the database

* Found firstName: JOHN, lastName: DOE in the database
***********************************************************
2014-09-04 16:59:00.258  INFO 15646 --- [       Thread-1] s.c.a.AnnotationConfigApplicationContext : Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@6366ebe0: startup date [Thu Sep 04 16:58:55 CEST 2014]; root of context hierarchy
2014-09-04 16:59:00.262  INFO 15646 --- [       Thread-1] o.s.j.e.a.AnnotationMBeanExporter        : Unregistering JMX-exposed beans on shutdown

Process finished with exit code 0</pre><p></p>
<p>As a side note, this class already scans for components, so we don&#8217;t need an additional component scan; instead, we exclude the <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/ApplicationConfiguration.java" title="ApplicationConfiguration.java" target="_blank">ApplicationConfiguration</a> class with a filter.</p>
<p>Finally, we use the <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/MainHelper.java" title="MainHelper.java" target="_blank">MainHelper</a> class to show a summary of the results.</p>
<p>That&#8217;s all for now, because one more time this entry is growing really fast. Thus, we&#8217;ll see in the next post the last topic of Spring Batch that I want to talk about: <a href="https://jcp.org/en/jsr/detail?id=352" title="JSR 352: Batch Applications for the Java Platform" target="_blank">JSR 352</a>.</p>
<h3>Resources</h3>
<ul>
<li><a href="https://www.youtube.com/watch?v=lHCPppMlylY" title="Youtube" target="_blank">Webinar: Spring Batch 3.0.0</a> by <strong>Michael Minella</strong>. Published on Jun 18, 2014.</li>
<li><a href="https://www.youtube.com/watch?v=yKs4yPs-5yU" title="youtube" target="_blank">JSR-352, Spring Batch and You</a> by <strong>Michael Minella</strong>. Published on Feb 3, 2014.</li>
<li><a href="https://www.youtube.com/watch?v=8tiqeV07XlI" title="youtube" target="_blank">Integrating Spring Batch and Spring Integration</a> by <strong>Gunnar Hillert</strong>, <strong>Michael Minella</strong>. Published on Jul 9, 2014.</li>
<li><a href="http://www.amazon.com/Pro-Spring-Batch-Experts-Voice/dp/1430234520/ref=sr_1_1?ie=UTF8&#038;qid=1409752293&#038;sr=8-1&#038;keywords=pro+spring+batch" title="Amazon" target="_blank">Pro Spring Batch (Expert&#8217;s Voice in Spring)</a> by <strong>Michael Minella</strong>. Published on July 12, 2011 by <a href="http://www.apress.com/9781430234524" title="Apress" target="_blank">Apress</a>.</li>
<li>Spring Batch in Action by <strong>Arnaud Cogoluegnes</strong>, <strong>Thierry Templier</strong>, <strong>Gary Gregory</strong>, <strong>Olivier Bazoud</strong>. Published on October 10, 2011 by <a href="http://www.manning.com/templier/" title="Manning" target="_blank">Manning Publications</a>.</li>
<li>Spring.io GETTING STARTED GUIDE: <a href="http://spring.io/guides/gs/batch-processing/" title="GETTING STARTED: Creating a Batch Service" target="_blank">Creating a Batch Service</a>.</li>
<li><a href="http://docs.spring.io/spring-batch/trunk/reference/html/" title="Spring Batch - Reference Documentation" target="_blank">Spring Batch &#8211; Reference Documentation</a></li>
<li><a href="http://docs.spring.io/spring-batch/apidocs/index.html?overview-summary.html" title="Spring Batch 3.0.1.RELEASE API" target="_blank">Spring Batch &#8211; API specification</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=375</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introduction to Spring Batch</title>
		<link>http://malsolo.com/blog4java/?p=260</link>
		<comments>http://malsolo.com/blog4java/?p=260#comments</comments>
		<pubDate>Wed, 03 Sep 2014 10:58:02 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Springsource]]></category>
		<category><![CDATA[JSR-352]]></category>
		<category><![CDATA[Spring Batch]]></category>
		<category><![CDATA[Spring Framework]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=260</guid>
		<description><![CDATA[Spring Batch is the Spring Project aimed to write Java Batch applications by using the foundations of Spring Framework. Michael T. Minella, project lead of Spring Batch and also a member of the JSR 352 (Batch Applications for the Java &#8230; <a href="http://malsolo.com/blog4java/?p=260">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://projects.spring.io/spring-batch/" title="Spring Batch" target="_blank">Spring Batch</a> is the <a href="http://spring.io/projects" title="Spring Projects" target="_blank">Spring Project</a> aimed at writing Java batch applications by using the foundations of the Spring Framework.</p>
<p><a href="http://spring.io/team/mminella" title="Michael Minella" target="_blank">Michael T. Minella</a>, project lead of Spring Batch and also a member of the <a href="https://jcp.org/en/jsr/detail?id=352" title="JSR 352" target="_blank">JSR 352 (Batch Applications for the Java Platform)</a> expert group, wrote the following definition in his book <a href="http://www.amazon.com/Pro-Spring-Batch-Experts-Voice/dp/1430234520" title="Pro Spring Batch" target="_blank">Pro Spring Batch</a>: &#8220;<em>Batch processing [&#8230;] is defined as the processing of data without interaction or interruption. Once started, a batch process runs to some form of completion without any intervention</em>&#8221;.</p>
<p>Typically, batch jobs are long-running and non-interactive, and they process volumes of data larger than fit in memory or in a single transaction. Thus they usually run outside office hours and include logic for handling errors and restarting when necessary.</p>
<p>Spring Batch provides, among others, the following features:</p>
<ul>
<li>Transaction management, to allow you to focus on business processing.</li>
<li>Chunk-based processing, to handle large volumes of data by dividing them into small pieces.</li>
<li>Declarative I/O, by providing readers and writers for many common scenarios.</li>
<li>Start/Stop/Restart/Skip/Retry capabilities, to handle non-interactive management of the process.</li>
<li>A web-based administration interface (Spring Batch Admin) that provides an API for administering tasks.</li>
<li>Based on Spring framework, so it includes all the configuration options, including Dependency Injection.</li>
<li>Compliance with JSR 352: Batch Applications for the Java Platform.</li>
</ul>
<h3>Spring Batch concepts</h3>
<div id="attachment_266" style="width: 310px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/08/spring-batch-reference-model.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/08/spring-batch-reference-model-300x119.png" alt="The Domain Language of Batch" width="300" height="119" class="size-medium wp-image-266" /></a><p class="wp-caption-text">Batch Stereotypes</p></div>
<ul>
<li><a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#domainJob" title="Job" target="_blank">Job</a>: an entity that encapsulates an entire batch process. It is composed of one or more ordered <strong>Steps</strong> and it has some properties such as restartability.</li>
<li><a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#domainStep" title="Step" target="_blank">Step</a>: a domain object that encapsulates an independent, sequential phase of a batch job.</li>
<li><a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#readersAndWriters" title="ItemReaders and ItemWriters and  ItemProcessors" target="_blank">Item</a>: the individual piece of data that is being processed.</li>
<li><a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#chunkOrientedProcessing" title="Chunk-Oriented Processing" target="_blank">Chunk</a>: the processing style used by Spring Batch: items are read and processed one at a time, then aggregated until a given number of them, the &#8220;<em>chunk</em>&#8221;, is reached; the whole chunk is then written in one operation.</li>
<div id="attachment_271" style="width: 310px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/08/chunk-oriented-processing.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/08/chunk-oriented-processing-300x165.png" alt="Chunk-Oriented Processing" width="300" height="165" class="size-medium wp-image-271" /></a><p class="wp-caption-text">Chunk-Oriented Processing</p></div>
<li><a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#domainJobLauncher" title="JobLauncher" target="_blank">JobLauncher</a>: the entry point to launch Spring Batch jobs with a given set of JobParameters.</li>
<li><a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#domainJobRepository" title="JobRepository" target="_blank">JobRepository</a>: maintains all metadata related to job executions and provides CRUD operations for JobLauncher, Job, and Step implementations.</li>
</ul>
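<p>The chunk-oriented style can be sketched outside the framework as a plain loop. This is an illustrative simplification only (the real framework also wraps each chunk in a transaction and handles restart and skip logic), and every name in it is made up for the sketch:</p>

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ChunkSketch {
    // Reads items one by one; null signals end of input (as an ItemReader does).
    static Iterator<String> source = List.of("a", "b", "c", "d", "e").iterator();

    static String read() { return source.hasNext() ? source.next() : null; }

    // The "processing" step: transform each item individually.
    static String process(String item) { return item.toUpperCase(); }

    // Each call receives a whole chunk, mimicking ItemWriter's list-based write.
    static List<List<String>> writes = new ArrayList<>();
    static void write(List<String> chunk) { writes.add(List.copyOf(chunk)); }

    public static void main(String[] args) {
        int chunkSize = 2;
        List<String> buffer = new ArrayList<>();
        String item;
        while ((item = read()) != null) {
            buffer.add(process(item));          // read and process each item...
            if (buffer.size() == chunkSize) {   // ...until the chunk size is reached,
                write(buffer);                  // then write the chunk in one go
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) write(buffer);   // flush the final partial chunk
        System.out.println(writes);             // [[A, B], [C, D], [E]]
    }
}
```

Writing per chunk rather than per item is what keeps transactions small while still amortizing the cost of each write.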
<h3>Running a Job</h3>
<p>The <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobLauncher.html" title="JobLauncher API" target="_blank">JobLauncher</a> interface has a basic implementation, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/support/SimpleJobLauncher.html" title="SimpleJobLauncher API" target="_blank">SimpleJobLauncher</a>, whose only required dependency is a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/repository/JobRepository.html" title="JobRepository API" target="_blank">JobRepository</a>, needed to obtain an execution, so that you can use it for executing the Job.</p>
<div id="attachment_275" style="width: 310px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/08/job-launcher-sequence-sync.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/08/job-launcher-sequence-sync-300x229.png" alt="JobLauncher" width="300" height="229" class="size-medium wp-image-275" /></a><p class="wp-caption-text">JobLauncher</p></div>
<p>You can also launch a Job asynchronously by configuring a <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/core/task/TaskExecutor.html" title="TaskExecutor API" target="_blank">TaskExecutor</a>. You can also use this configuration <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#runningJobsFromWebContainer" title="Running Jobs from within a Web Container" target="_blank">for running Jobs from within a Web Container</a>.</p>
<div id="attachment_277" style="width: 310px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/08/job-launcher-sequence-async.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/08/job-launcher-sequence-async-300x229.png" alt="Job launcher sequence async" width="300" height="229" class="size-medium wp-image-277" /></a><p class="wp-caption-text">Job launcher sequence async</p></div>
<p>A <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/JobLauncher.html" title="JobLauncher API" target="_blank">JobLauncher</a> uses the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/repository/JobRepository.html" title="JobRepository API" target="_blank">JobRepository</a> to create new <strong>JobExecution</strong> objects and run them.</p>
<div id="attachment_281" style="width: 310px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/08/job-repository.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/08/job-repository-300x265.png" alt="Job Repository" width="300" height="265" class="size-medium wp-image-281" /></a><p class="wp-caption-text">Job Repository</p></div>
<h3>Running Jobs: concepts</h3>
<p>The main concepts related to Job execution are:</p>
<div id="attachment_283" style="width: 310px" class="wp-caption alignnone"><a href="http://malsolo.com/blog4java/wp-content/uploads/2014/08/jobHeirarchyWithSteps.png"><img src="http://malsolo.com/blog4java/wp-content/uploads/2014/08/jobHeirarchyWithSteps-300x220.png" alt="Job hierarchy with steps" width="300" height="220" class="size-medium wp-image-283" /></a><p class="wp-caption-text">Job hierarchy with steps</p></div>
<ul>
<li><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/JobInstance.html" title="JobInstance API" target="_blank">JobInstance</a>: a logical run of a Job.</li>
<li><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/JobParameter.html" title="JobParameter API" target="_blank">JobParameters</a>: a set of parameters used to start a batch job; they identify each JobInstance.</li>
<li><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/JobExecution.html" title="JobExecution API" target="_blank">JobExecution</a>: a single physical run of a JobInstance, which records what actually happened during the execution.</li>
<li><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/StepExecution.html" title="StepExecution API" target="_blank">StepExecution</a>: a single attempt to execute a Step, that is created each time a Step is run and it also provides information regarding the result of the processing.</li>
<li><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/item/ExecutionContext.html" title="ExecutionContext API" target="_blank">ExecutionContext</a>: a collection of key/value pairs that are persisted and controlled by the framework in order to allow developers a place to store persistent state that is scoped to a StepExecution or JobExecution.</li>
</ul>
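<p>To illustrate the role of the ExecutionContext, here is a minimal, framework-free sketch in which a &#8220;step&#8221; checkpoints its progress into a key/value map, so a second execution of the same instance resumes where a failed one stopped. The key name <code>read.count</code> and all the classes here are hypothetical stand-ins, not the Spring Batch API:</p>

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RestartSketch {
    // Stand-in for Spring Batch's ExecutionContext: key/value state that the
    // framework persists between executions (the key name is illustrative).
    static Map<String, Object> executionContext = new HashMap<>();

    static StringBuilder written = new StringBuilder();

    // Processes items starting from the saved offset; returns false on "failure".
    static boolean runStep(List<String> items, int failAt) {
        int start = (int) executionContext.getOrDefault("read.count", 0);
        for (int i = start; i < items.size(); i++) {
            if (i == failAt) return false;             // simulate a crash mid-step
            written.append(items.get(i));
            executionContext.put("read.count", i + 1); // checkpoint progress
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d");
        boolean first = runStep(items, 2);   // fails before writing "c"
        boolean second = runStep(items, -1); // restart resumes at the checkpoint
        System.out.println(first + " " + second + " " + written); // false true abcd
    }
}
```

No item is written twice on restart because the second execution starts from the persisted offset, which is exactly what the restart capability listed above relies on.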
<h3>Sample application</h3>
<p>Now we are going to see a simple sample application that reads Person POJOs from a file containing people data, processes each of them (simply uppercasing its attributes), and saves them in a database.</p>
<p>All the code is available at <a href="https://github.com/jbbarquero/spring-batch-sample" title="My Spring Batch sample at GitHub" target="_blank">GitHub</a>. </p>
<p>Let&#8217;s begin with the basic domain class: <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/Person.java" title="Person.java" target="_blank">Person</a>, just a POJO.</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.springframework.batch.sample;

public class Person {
    private String lastName;
    private String firstName;
    //...
}</pre><p></p>
<p>Then, let&#8217;s see the simple processor, <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/PersonItemProcessor.java" title="PersonItemProcessor.java" target="_blank">PersonItemProcessor</a>. It implements an <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/item/ItemProcessor.html" title="ItemProcessor API" target="_blank">ItemProcessor</a>, with a Person both as Input and Output.</p>
<p>It provides a single method to be overridden, <strong>process</strong>, where you write the custom transformation.</p>
<p></p><pre class="crayon-plain-tag">package com.malsolo.springframework.batch.sample;

import org.springframework.batch.item.ItemProcessor;

public class PersonItemProcessor implements ItemProcessor&lt;Person, Person&gt; {

    @Override
    public Person process(final Person person) throws Exception {
        final String firstName = person.getFirstName().toUpperCase();
        final String lastName = person.getLastName().toUpperCase();

        final Person transformedPerson = new Person(firstName, lastName);

        System.out.println(&quot;Converting (&quot; + person + &quot;) into (&quot; + transformedPerson + &quot;)&quot;);

        return transformedPerson;
    }

}</pre><p></p>
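<p>Because the processor contains no framework dependency beyond the ItemProcessor interface, its logic is easy to exercise in isolation. The following self-contained sketch uses minimal stand-ins for the real Person class and the Spring Batch interface (both simplified here, so details differ from the GitHub code):</p>

```java
public class ProcessorSketch {
    // Minimal stand-in for org.springframework.batch.item.ItemProcessor.
    interface ItemProcessor<I, O> {
        O process(I item) throws Exception;
    }

    // Minimal stand-in for the sample's Person POJO.
    static class Person {
        final String firstName, lastName;
        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
        @Override public String toString() {
            return "firstName: " + firstName + ", lastName: " + lastName;
        }
    }

    // Same logic as the sample's PersonItemProcessor: uppercase both attributes.
    static class PersonItemProcessor implements ItemProcessor<Person, Person> {
        @Override public Person process(Person person) throws Exception {
            return new Person(person.firstName.toUpperCase(),
                              person.lastName.toUpperCase());
        }
    }

    public static void main(String[] args) throws Exception {
        Person out = new PersonItemProcessor().process(new Person("Jill", "Doe"));
        System.out.println(out); // firstName: JILL, lastName: DOE
    }
}
```

Returning a new Person instead of mutating the input keeps the processor side-effect free, which is why it composes cleanly with any reader and writer.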
<p>Once this is done, we can configure the Spring Batch application. For doing so, we&#8217;ll use Java annotations in a <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/BatchConfiguration.java" title="BatchConfiguration.java" target="_blank">BatchConfiguration</a> class:</p>
<p></p><pre class="crayon-plain-tag">// Imports and package omitted
@Configuration
@EnableBatchProcessing
@Import(AdditionalBatchConfiguration.class)
public class BatchConfiguration {

	// Input, processor, and output definition
	
	@Bean
    public ItemReader&lt;Person&gt; reader() {
		FlatFileItemReader&lt;Person&gt; reader = new FlatFileItemReader&lt;Person&gt;();
		reader.setResource(new ClassPathResource(&quot;sample-data.csv&quot;));
		reader.setLineMapper(new DefaultLineMapper&lt;Person&gt;() {{
			setLineTokenizer(new DelimitedLineTokenizer() {{
				setNames(new String[] {&quot;firstName&quot;, &quot;lastName&quot;});
			}});
			setFieldSetMapper(new BeanWrapperFieldSetMapper&lt;Person&gt;() {{
				setTargetType(Person.class);
			}});
			
		}});
		return reader;
	}
	
	@Bean
    public ItemProcessor&lt;Person, Person&gt; processor() {
        return new PersonItemProcessor();
    }
	
	@Bean
    public ItemWriter&lt;Person&gt; writer(DataSource dataSource) {
		JdbcBatchItemWriter&lt;Person&gt; writer = new JdbcBatchItemWriter&lt;Person&gt;();
		writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider&lt;Person&gt;());
		writer.setSql(&quot;INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)&quot;);
		writer.setDataSource(dataSource);
		return writer;
	}
	
	//  Actual job configuration
	
	@Bean
    public Job importUserJob(JobBuilderFactory jobs, Step s1) {
		return jobs.get(&quot;importUserJob&quot;)
				.incrementer(new RunIdIncrementer())
				.flow(s1)
				.end()
				.build();
	}
	
	@Bean
    public Step step1(StepBuilderFactory stepBuilderFactory, ItemReader&lt;Person&gt; reader,
            ItemWriter&lt;Person&gt; writer, ItemProcessor&lt;Person, Person&gt; processor) {
		return stepBuilderFactory.get(&quot;step1&quot;)
				.&lt;Person, Person&gt; chunk(10)
				.reader(reader)
				.processor(processor)
				.writer(writer)
				.build();
	}
	
	@Bean
    public JdbcTemplate jdbcTemplate(DataSource dataSource) {
        return new JdbcTemplate(dataSource);
    }
	
}</pre><p></p>
<p>Highlights for this class are:</p>
<ul>
<li>Line 2: <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/context/annotation/Configuration.html" title="Configuration API" target="_blank">@Configuration</a>, this class will be processed by the <a href="http://docs.spring.io/spring/docs/current/spring-framework-reference/htmlsingle/#beans-java-basic-concepts" title="Java-based container configuration, basic concepts." target="_blank">Spring container to generate bean definitions</a>.</li>
<li>Line 3: <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/EnableBatchProcessing.html" title="EnableBatchProcessing API" target="_blank">@EnableBatchProcessing</a>, provides a base configuration for building batch jobs <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#javaConfig" title="Spring Batch Java Config" target="_blank">by creating the following beans, available to be autowired</a>:
<ul>
<li>JobRepository &#8211; bean name &#8220;jobRepository&#8221;</li>
<li>JobLauncher &#8211; bean name &#8220;jobLauncher&#8221;</li>
<li>JobRegistry &#8211; bean name &#8220;jobRegistry&#8221;</li>
<li>PlatformTransactionManager &#8211; bean name &#8220;transactionManager&#8221;</li>
<li>JobBuilderFactory &#8211; bean name &#8220;jobBuilders&#8221;</li>
<li>StepBuilderFactory &#8211; bean name &#8220;stepBuilders&#8221;</li>
</ul>
We&#8217;ll see shortly how it works.</li>
<li>Line 10: the <strong>reader bean</strong>, an instance of a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/item/file/FlatFileItemReader.html" title="Class FlatFileItemReader&lt;T&gt;" target="_blank">FlatFileItemReader</a>, which implements the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/item/ItemReader.html" title="Interface ItemReader&lt;T&gt;" target="_blank">ItemReader</a> interface to read each <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/Person.java" title="Person.java" target="_blank">Person</a> from the file containing people. Spring Batch provides several implementations of this interface, and this one, which reads lines from a Resource, is one of them. You know: <u>no need for custom code</u>.</li>
<li>Line 26: the <strong>processor bean</strong>, an instance of the previously defined <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/PersonItemProcessor.java" title="PersonItemProcessor.java" target="_blank">PersonItemProcessor</a>. See above.</li>
<li>Line 31: the <strong>writer bean</strong>, an instance of a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/item/database/JdbcBatchItemWriter.html" title="Class JdbcBatchItemWriter&lt;T&gt;" target="_blank">JdbcBatchItemWriter</a>, which implements the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/item/ItemWriter.html" title="Interface ItemWriter&lt;T&gt;" target="_blank">ItemWriter interface</a> to write the already processed people to the database. It&#8217;s also an implementation provided by Spring Batch, so <u>no need for custom code</u> again. In this case, you only have to provide the SQL and a callback for the parameters. Since we are using named parameters, we&#8217;ve chosen a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/item/database/BeanPropertyItemSqlParameterSourceProvider.html" title="Class BeanPropertyItemSqlParameterSourceProvider&lt;T&gt;" target="_blank">BeanPropertyItemSqlParameterSourceProvider</a>. This bean also needs a DataSource, which we provide as a method parameter so that Spring injects the instance it has registered.</li>
<li>Line 42: a <strong><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/Job.html" title="Interface Job" target="_blank">Job</a> bean</strong>, built using the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/JobBuilderFactory.html" title="Class JobBuilderFactory" target="_blank">JobBuilderFactory</a>, which is autowired by passing it as a parameter of this @Bean method. When you call its get method, Spring Batch creates a <strong>job builder</strong> and initializes its job repository; the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/job/builder/JobBuilder.html" title="Class JobBuilder" target="_blank">JobBuilder</a> is a convenience class for building jobs of various kinds, as you can see in the code above. We also use a <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/Step.html" title="Interface Step" target="_blank">Step</a>, configured as the next Spring bean.</li>
<li>Line 51: a <strong><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/Step.html" title="Interface Step" target="_blank">Step</a> bean</strong>, built using the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/StepBuilderFactory.html" title="Class StepBuilderFactory" target="_blank">StepBuilderFactory</a>, which is autowired by passing it as a parameter of this @Bean method, as well as the other dependencies: the <strong>reader</strong>, the <strong>processor</strong>, and the <strong>writer</strong> previously defined. When you call the get method of the StepBuilderFactory, Spring Batch creates a step builder and initializes its job repository and transaction manager; the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/step/builder/StepBuilder.html" title="Class StepBuilder" target="_blank">StepBuilder</a> is an entry point for building all kinds of steps, as you can see in the code above.</li>
</ul>
<p>This configuration is almost everything needed to configure a Batch process as defined in the concepts above.</p>
<p>Actually, only one configuration class needs to have the @EnableBatchProcessing annotation in order to have the base configuration for building batch jobs. Then you can define the job with their steps and the readers/processors/writers that they need.</p>
<p>But an additional data source is needed for the <strong>JobRepository</strong>. For this sample we&#8217;ll use an in-memory one:</p>
<p></p><pre class="crayon-plain-tag">@Configuration
public class DataSourceConfiguration {

    @Bean
    public DataSource dataSource() {
        EmbeddedDatabaseBuilder builder = new EmbeddedDatabaseBuilder();
        return builder
                .setType(HSQL)
                .addScript(&quot;schema-all.sql&quot;)
                .addScript(&quot;org/springframework/batch/core/schema-hsqldb.sql&quot;)
                .build();
    }

}</pre><p></p>
<p>In this case we&#8217;ll use the same in-memory database, HSQL, with the schema for the application (line 9) and the schema for the job repository (line 10). The former is available as a resource of the application, in the file called <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/resources/schema-all.sql" title="src/main/resources/schema-all.sql" target="_blank">schema-all.sql</a>, and the latter in the spring-batch-core jar (spring-batch-core-3.0.1.RELEASE.jar at the time of this writing).</p>
<h3>Alternate Configuration</h3>
<p>The official documentation shows a slightly different <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#javaConfig" title="Java Config" target="_blank">configuration</a> that uses the <a href="http://docs.spring.io/spring/docs/4.0.6.RELEASE/javadoc-api/index.html?org/springframework/beans/factory/annotation/Autowired.html" title="Annotation Type Autowired" target="_blank">@Autowired</a> annotation for the beans that <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/EnableBatchProcessing.html" title="Annotation Type EnableBatchProcessing" target="_blank">@EnableBatchProcessing</a> will create. Use whichever you like most. In this case it also imports the database configuration.</p>
<p></p><pre class="crayon-plain-tag">@Configuration
@EnableBatchProcessing
@Import(DataSourceConfiguration.class)
public class AppConfig {

    @Autowired
    private JobBuilderFactory jobs;

    @Autowired
    private StepBuilderFactory steps;

    // Input, processor, and output definition omitted

    @Bean
    public Job importUserJob() {
        return jobs.get(&quot;importUserJob&quot;).incrementer(new RunIdIncrementer()).flow(step1()).end().build();
    }

    @Bean
    protected Step step1(ItemReader&lt;Person&gt; reader, ItemProcessor&lt;Person, Person&gt; processor, ItemWriter&lt;Person&gt; writer) {
        return steps.get(&quot;step1&quot;)
            .&lt;Person, Person&gt; chunk(10)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
    }

}</pre><p></p>
<p>We chose another approach: we load it when configuring the application in the main method, as you&#8217;ll see shortly. Besides, we imported an additional batch configuration (see line 28 at <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/BatchConfiguration.java" title="BatchConfiguration.java" target="_blank">BatchConfiguration.java</a>) to provide an alternate way to launch the application.</p>
<h3>Enable Batch Processing: how it works</h3>
<p>As we said before, we will go a little deeper into how the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/EnableBatchProcessing.html" title="EnableBatchProcessing API" target="_blank">@EnableBatchProcessing</a> annotation works.</p>
<p>As a reminder of its goal: this annotation provides a base configuration for building batch jobs by creating a list of beans available to be autowired. An extract of the source code gives us a lot of information:</p>
<p></p><pre class="crayon-plain-tag">@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@Documented
@Import(BatchConfigurationSelector.class)
public @interface EnableBatchProcessing {

	/**
	 * Indicate whether the configuration is going to be modularized into multiple application contexts. If true then
	 * you should not create any &amp;#64;Bean Job definitions in this context, but rather supply them in separate (child)
	 * contexts through an {@link ApplicationContextFactory}.
	 */
	boolean modular() default false;

}</pre><p></p>
<p>As you can see at line 4, this annotation <a href="http://docs.spring.io/spring/docs/4.0.6.RELEASE/javadoc-api/index.html?org/springframework/context/annotation/Import.html" title="Annotation Type Import" target="_blank">imports</a> an implementation of an <a href="http://docs.spring.io/spring/docs/4.0.6.RELEASE/javadoc-api/index.html?org/springframework/context/annotation/ImportSelector.html" title="Interface ImportSelector" target="_blank">ImportSelector</a>, one of the options for importing beans into a configuration class; in particular, for selectively importing beans according to certain criteria.</p>
<p>This particular implementation, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/BatchConfigurationSelector.html" title="Class BatchConfigurationSelector" target="_blank">BatchConfigurationSelector</a>, instantiates the beans that provide the common structure for enabling and using Spring Batch, based on the value of EnableBatchProcessing&#8217;s <strong>modular</strong> attribute.</p>
<p>There are two implementations, depending on whether or not you want the configuration to be modularized into multiple application contexts so that jobs don&#8217;t interfere with each other through the naming and uniqueness of beans (for instance, beans named <strong>reader</strong>). They are <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/ModularBatchConfiguration.html" title="Class ModularBatchConfiguration" target="_blank">ModularBatchConfiguration</a> and <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/SimpleBatchConfiguration.html" title="Class SimpleBatchConfiguration" target="_blank">SimpleBatchConfiguration</a>, respectively. They mostly do the same thing, but the former uses an <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/support/AutomaticJobRegistrar.html" title="Class AutomaticJobRegistrar" target="_blank">AutomaticJobRegistrar</a>, which is responsible for creating separate <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/context/ApplicationContext.html" title="Interface ApplicationContext" target="_blank">ApplicationContext</a>s to register isolated jobs that are later accessible via the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/JobRegistry.html" title="Interface JobRegistry" target="_blank">JobRegistry</a>, while the latter just creates the main components as lazy proxies that are only initialized when a method is called (in order to prevent configuration cycles).</p>
<p>The key concept here is that both extend <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/AbstractBatchConfiguration.html" title="Class AbstractBatchConfiguration" target="_blank">AbstractBatchConfiguration</a>, which uses the core interface for this configuration: <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/BatchConfigurer.html" title="Interface BatchConfigurer" target="_blank">BatchConfigurer</a>.</p>
<p>The default implementation, <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/configuration/annotation/DefaultBatchConfigurer.html" title="Class DefaultBatchConfigurer" target="_blank">DefaultBatchConfigurer</a>, provides the beans mentioned above (jobRepository, jobLauncher, jobRegistry, transactionManager, jobBuilders and stepBuilders). To do so, it <u>doesn&#8217;t require a dataSource</u>: the dataSource is @Autowired with required set to false, and a Map-based JobRepository is used when it is null. But take care if you have a dataSource eligible for autowiring that doesn&#8217;t contain the expected database schema for the job repository: the batch process will fail in that case.</p>
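<p>If you are in that situation, one option (this is a hedged sketch; the class name and the dedicated in-memory database are illustrative, not part of the sample code) is to declare your own BatchConfigurer bean, reusing DefaultBatchConfigurer with a dataSource prepared for the job repository:</p>
<p></p><pre class="crayon-plain-tag">@Configuration
public class CustomBatchConfiguration {

    // Illustrative sketch: give DefaultBatchConfigurer its own dataSource,
    // already initialized with the job repository schema, so the
    // application's main dataSource is left untouched.
    @Bean
    public BatchConfigurer batchConfigurer() {
        DataSource batchDataSource = new EmbeddedDatabaseBuilder()
                .setType(EmbeddedDatabaseType.HSQL)
                .addScript(&quot;org/springframework/batch/core/schema-hsqldb.sql&quot;)
                .build();
        return new DefaultBatchConfigurer(batchDataSource);
    }

}</pre><p></p>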
<p>Spring Boot provides another implementation, <a href="http://docs.spring.io/spring-boot/docs/current/api/index.html?org/springframework/boot/autoconfigure/batch/BasicBatchConfigurer.html" title="Class BasicBatchConfigurer" target="_blank">BasicBatchConfigurer</a>, but this is out of the scope of this entry.</p>
<p>With all this information, we already have a Spring Batch application configured, and we more or less know how this configuration is achieved using Java.</p>
<p>Now it&#8217;s time to run the application. </p>
<h3>Running the sample: JobLauncher</h3>
<p>We have everything we need to launch a batch job: the Job to be launched and a JobLauncher. So wait no more and execute this main class: <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/MainJobLauncher.java" title="MainJobLauncher.java" target="_blank">MainJobLauncher</a>.</p>
<p></p><pre class="crayon-plain-tag">@Component
public class MainJobLauncher {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    Job importUserJob;

    public static void main(String... args) throws JobParametersInvalidException, JobExecutionAlreadyRunningException, JobRestartException, JobInstanceAlreadyCompleteException {

        AnnotationConfigApplicationContext context = new AnnotationConfigApplicationContext(ApplicationConfiguration.class);

        MainJobLauncher main = context.getBean(MainJobLauncher.class);

        JobExecution jobExecution = main.jobLauncher.run(main.importUserJob, new JobParameters());

        MainHelper.reportResults(jobExecution);
        MainHelper.reportPeople(context.getBean(JdbcTemplate.class));

        context.close();

    }

}</pre><p></p>
<p>First things first. This is the way I like to write main classes. Some people from Spring are used to writing main classes annotated with <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/context/annotation/Configuration.html" title="Annotation Type Configuration" target="_blank">@Configuration</a>, but I&#8217;d rather annotate them as <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/stereotype/Component.html" title="Annotation Type Component" target="_blank">@Component</a>s in order to separate the actual application and its configuration from the classes that test the functionality.</p>
<p>As a Spring component (line 1), it only needs to have its dependencies <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/beans/factory/annotation/Autowired.html" title="Annotation Type Autowired" target="_blank">@Autowired</a>.</p>
<p>That&#8217;s the reason for the <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/ApplicationConfiguration.java" title="ApplicationConfiguration.java" target="_blank">ApplicationConfiguration</a> class. It&#8217;s a <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/context/annotation/Configuration.html" title="Annotation Type Configuration" target="_blank">@Configuration</a> class that also performs a @ComponentScan from its own package, which will find <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/MainJobLauncher.java" title="MainJobLauncher.java" target="_blank">MainJobLauncher</a> itself and the remaining <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/context/annotation/Configuration.html" title="Annotation Type Configuration" target="_blank">@Configuration</a> classes, because they are also <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/stereotype/Component.html" title="Annotation Type Component" target="_blank">@Component</a>s: <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/BatchConfiguration.java" title="BatchConfiguration.java" target="_blank">BatchConfiguration</a> and <a href="https://github.com/jbbarquero/spring-batch-sample/blob/master/src/main/java/com/malsolo/springframework/batch/sample/DataSourceConfiguration.java" title="DataSourceConfiguration.java" target="_blank">DataSourceConfiguration</a>.</p>
<p>As a main class, it creates the Spring application context (line 12), gets the component as a Spring bean (line 14) and then uses its methods (or attributes, in this example; line 16).</p>
<p>Let&#8217;s get back to the batch application: line 16 is the call to the JobLauncher that runs the Spring Batch process.</p>
<p>The remaining lines are intended to show the results, both from the job execution and the results in the database.</p>
<p>It will be something like this:</p>
<p></p><pre class="crayon-plain-tag">***********************************************************
importUserJob finished with a status of  (COMPLETED).
* Steps executed:
	step1 : exitCode=COMPLETED;exitDescription=
StepExecution: id=0, version=3, name=step1, status=COMPLETED, exitStatus=COMPLETED, readCount=5, filterCount=0, writeCount=5 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=1, rollbackCount=0
***********************************************************

***********************************************************
* People found:

* Found firstName: JILL, lastName: DOE in the database

* Found firstName: JOE, lastName: DOE in the database

* Found firstName: JUSTIN, lastName: DOE in the database

* Found firstName: JANE, lastName: DOE in the database

* Found firstName: JOHN, lastName: DOE in the database
***********************************************************</pre><p></p>
<h3>Running the sample: CommandLineJobRunner</h3>
<p><a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/launch/support/CommandLineJobRunner.html" title="Class CommandLineJobRunner" target="_blank">CommandLineJobRunner</a> is a main class provided by Spring Batch as the primary entry point to <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#commandLineJobRunner" title="The CommandLineJobRunner" target="_blank">launch a Spring Batch Job</a>.</p>
<p>It requires at least two arguments: <strong>JobConfigurationXmlPath/JobConfigurationClassName</strong> and <strong>jobName</strong>. With the first, it will create an <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/context/ApplicationContext.html" title="Interface ApplicationContext" target="_blank">ApplicationContext</a>, either by loading the Java configuration class with that name or by loading the XML configuration file at that path.</p>
<p>It has a <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#domainJobLauncher" title="JobLauncher" target="_blank">JobLauncher</a> attribute that is autowired from the application context via the <a href="http://docs.spring.io/spring/docs/current/javadoc-api/index.html?org/springframework/beans/factory/config/AutowireCapableBeanFactory.html" title="Interface AutowireCapableBeanFactory" target="_blank">AutowireCapableBeanFactory</a> it exposes, which is used to autowire bean properties by type.</p>
<p>It accepts some options (&#8220;-restart&#8221;, &#8220;-next&#8221;, &#8220;-stop&#8221;, &#8220;-abandon&#8221;) as well as parameters for the <a href="http://docs.spring.io/spring-batch/trunk/reference/htmlsingle/#domainJobLauncher" title="JobLauncher" target="_blank">JobLauncher</a>, which are converted into job parameters by the <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/converter/DefaultJobParametersConverter.html" title="Class DefaultJobParametersConverter" target="_blank">DefaultJobParametersConverter</a> (the default <a href="http://docs.spring.io/spring-batch/apidocs/index.html?org/springframework/batch/core/converter/JobParametersConverter.html" title="Interface JobParametersConverter" target="_blank">JobParametersConverter</a>), which expects a &#8216;name=value&#8217; format.</p>
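<p>As an illustration of that &#8216;name=value&#8217; contract (a simplified sketch, not the actual DefaultJobParametersConverter implementation, which also handles typed parameters), the arguments can be split like this:</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the 'name=value' argument format expected by
// CommandLineJobRunner; the class and method names here are made up for
// the example and are not part of Spring Batch.
public class JobParametersParser {

    public static Map<String, String> parse(String... args) {
        Map<String, String> parameters = new LinkedHashMap<>();
        for (String arg : args) {
            int separator = arg.indexOf('=');
            if (separator < 0) {
                throw new IllegalArgumentException("Expected name=value, got: " + arg);
            }
            parameters.put(arg.substring(0, separator), arg.substring(separator + 1));
        }
        return parameters;
    }

    public static void main(String[] args) {
        System.out.println(parse("run.date=2015-03-31", "user=malsolo"));
    }
}
```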
<p>You can declare this main class in the manifest file, directly or by using a Maven plugin such as maven-jar-plugin, maven-shade-plugin or even exec-maven-plugin.</p>
<p>That is, you can invoke from your command line something like this:</p>
<p><strong>$ java CommandLineJobRunner job.xml jobName parameter=value</strong></p>
<p>Well, the sample code is a maven project that you can install (packaging the application is enough), and it lets you manage the dependencies (the mvn dependency:copy-dependencies command copies all the dependencies into the target/dependency directory).</p>
<p>To simplify, I&#8217;ll also copy the generated jar to the same directory as the dependencies in order to invoke the java command more easily:</p>
<p></p><pre class="crayon-plain-tag">~/Documents/git/spring-batch-sample$mvn clean install
[INFO] Scanning for projects...
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
...
~/Documents/git/spring-batch-sample$ mvn dependency:copy-dependencies
[INFO] Scanning for projects...
...
[INFO] Copying spring-batch-core-3.0.1.RELEASE.jar to ~/Documents/git/spring-batch-sample/target/dependency/spring-batch-core-3.0.1.RELEASE.jar
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
...
~/Documents/git/spring-batch-sample$ cp target/spring-batch-sample-0.0.1-SNAPSHOT.jar ./target/dependency/
~/Documents/git/spring-batch-sample$ java -classpath "./target/dependency/*" org.springframework.batch.core.launch.support.CommandLineJobRunner com.malsolo.springframework.batch.sample.ApplicationConfiguration importUserJob
...
12:32:17.039 [main] INFO  o.s.b.c.l.support.SimpleJobLauncher 
- Job: [FlowJob: [name=importUserJob]] 
completed with the following parameters: [{}] 
and the following status: [COMPLETED]
...</pre><p></p>
<p>That&#8217;s all for now.</p>
<p>Since this entry is becoming very long, I&#8217;ll explain other ways to run Spring Batch jobs in a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=260</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tomcat and JSTL: sometimes I feel tired</title>
		<link>http://malsolo.com/blog4java/?p=318</link>
		<comments>http://malsolo.com/blog4java/?p=318#comments</comments>
		<pubDate>Wed, 13 Aug 2014 11:45:13 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Personal]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=318</guid>
		<description><![CDATA[Bazinga! tomcat 7 org.apache.jasper.JasperException: The absolute uri: http://java.sun.com/jsp/jstl/core cannot be resolved in either web.xml or the jar files deployed with this application You can dive into our beloved stackoverflow.com to find out that Tomcat doesn&#8217;t include JSTL,not even in Tomcat &#8230; <a href="http://malsolo.com/blog4java/?p=318">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><strong>Bazinga!</strong></p>
<p><font color="red">tomcat 7 org.apache.jasper.JasperException: The absolute uri: http://java.sun.com/jsp/jstl/core cannot be resolved in either web.xml or the jar files deployed with this application</font></p>
<p>You can dive into our beloved <a href="http://stackoverflow.com/" title="Stackoverflow" target="_blank">stackoverflow.com</a> to find out that <a href="http://tomcat.apache.org/whichversion.html" title="Apache Tomcat" target="_blank">Tomcat</a> doesn&#8217;t include JSTL, not even Tomcat 8, despite the fact that they have an <a href="https://tomcat.apache.org/taglibs/standard/" title="Apache Taglibs" target="_blank">implementation of the JSP Standard Tag Library (JSTL)</a> specification, versions 1.0, 1.1 and 1.2.</p>
<p>Of course, I have the correct taglib in the JSPs (note the /jsp within the URI):</p><pre class="crayon-plain-tag">&lt;%@ taglib uri=&quot;http://java.sun.com/jsp/jstl/core&quot; prefix=&quot;c&quot; %&gt;</pre><p></p>
<p>I don&#8217;t want to include the JSTL jar in the war, because the application works in <a href="https://glassfish.java.net/" title="GlassFish - World's first Java EE 7 Application Server" target="_blank">GlassFish</a> (time to move to this server, or even try <a href="http://wildfly.org/" title="The new and improved JBoss Application Server!" target="_blank">WildFly</a>). Find <a href="http://stackoverflow.com/a/10674054/825336" title="Tomcat 7 and JSTL at Stackoverflow" target="_blank">here</a> some interesting instructions if that is not your case.</p>
<p>Thus, it&#8217;s time to copy the JSTL jar (<a href="http://mvnrepository.com/artifact/jstl/jstl/1.2" title="jstl.jstl.1.2" target="_blank">available</a> <a href="http://mvnrepository.com/artifact/javax.servlet.jsp.jstl/jstl/1.2" title="javax.servlet.jsp.jstl.jstl.1.2" target="_blank">almost</a> <a href="http://mvnrepository.com/artifact/javax.servlet/jstl/1.2" title="javax.servlet.jstl.1.2" target="_blank">anywhere</a>).</p>
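<p>For reference, these are the Maven coordinates of the artifact I ended up using (any of the equivalent ones linked above should work too):</p>
<p></p><pre class="crayon-plain-tag">&lt;dependency&gt;
    &lt;groupId&gt;javax.servlet&lt;/groupId&gt;
    &lt;artifactId&gt;jstl&lt;/artifactId&gt;
    &lt;version&gt;1.2&lt;/version&gt;
&lt;/dependency&gt;</pre><p></p>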
<p>In NetBeans, you can go to Services (Window -> Services), find the Servers entry, and right-click on Apache Tomcat 8.0.3.0 to see its properties. This is the long way to discover where Tomcat is installed. In my case: <em><strong>/usr/local/apache-tomcat-8.0.3</strong></em>.</p>
<p>So I copied one of the available maven dependencies that I already had, and after restarting Tomcat everything went OK.</p>
<p></p><pre class="crayon-plain-tag">$ cd /usr/local/apache-tomcat-8.0.3
$ sudo cp ~/.m2/repository/javax/servlet/jstl/1.2/jstl-1.2.jar .</pre><p></p>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=318</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Spring logging with SLF4J and Logback</title>
		<link>http://malsolo.com/blog4java/?p=309</link>
		<comments>http://malsolo.com/blog4java/?p=309#comments</comments>
		<pubDate>Tue, 12 Aug 2014 12:30:26 +0000</pubDate>
		<dc:creator><![CDATA[Javier (@jbbarquero)]]></dc:creator>
				<category><![CDATA[Personal]]></category>

		<guid isPermaLink="false">http://malsolo.com/blog4java/?p=309</guid>
		<description><![CDATA[As you already know, Spring framework uses Commons Logging (JCL, the J stands for Jakarta, the former house for Apache Java solutions) as the framework for logging, mainly for historical reasons and backward compatibility. But it&#8217;s possible to use another &#8230; <a href="http://malsolo.com/blog4java/?p=309">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>As you already know, <a href="http://projects.spring.io/spring-framework/" title="Spring Framework" target="_blank">Spring framework</a> uses <a href="http://commons.apache.org/proper/commons-logging/" title="Apache Commons Logging" target="_blank">Commons Logging</a> (JCL, the J stands for <a href="http://jakarta.apache.org/" title="2011/12/21 - Jakarta has been retired." target="_blank">Jakarta</a>, the former house for Apache Java solutions) as the framework for logging, mainly for historical reasons and backward compatibility.</p>
<p>But it&#8217;s possible to use another framework easily thanks to the existing binding process for most of the popular frameworks.</p>
<h3>Log4J</h3>
<p>If you want to use this classic and popular framework, it&#8217;s very easy, since <a href="http://logging.apache.org/log4j/1.2/" title="Apache log4j 1.2" target="_blank">Log4J</a> can be used directly with JCL (I&#8217;d rather call it commons-logging, BTW).</p>
<p>Just add the dependency; there is no need to exclude anything from Spring. For instance, in a maven project:</p><pre class="crayon-plain-tag">&lt;dependencies&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.springframework&lt;/groupId&gt;
            &lt;artifactId&gt;spring-context&lt;/artifactId&gt;
            &lt;version&gt;4.0.6.RELEASE&lt;/version&gt;
        &lt;/dependency&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;log4j&lt;/groupId&gt;
            &lt;artifactId&gt;log4j&lt;/artifactId&gt;
            &lt;version&gt;1.2.17&lt;/version&gt;
        &lt;/dependency&gt;
&lt;/dependencies&gt;</pre><p></p>
<p>Don&#8217;t forget to put a configuration file (log4j.properties or log4j.xml) in the <a href="https://github.com/jbbarquero/spring-mvc-sample/blob/master/src/main/resources/log4j.xml" title="log4j.xml sample" target="_blank">root of the classpath</a>.</p>
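<p>A minimal log4j.properties could look like this (the appender and pattern are just an example):</p>
<p></p><pre class="crayon-plain-tag"># Console appender with a simple timestamped pattern
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss.SSS} [%t] %-5p %c - %m%n
# Raise Spring's own logging if you need to debug the container
log4j.logger.org.springframework=DEBUG</pre><p></p>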
<h3>SLF4J with Logback</h3>
<p>These two frameworks have become my favourites for Java logging.</p>
<p>However, in order to use them, you have to make a few more changes:</p>
<ol>
<li>Exclude commons-logging (JCL) from Spring</li>
<li><a href="http://www.slf4j.org/legacy.html" title="Bridging legacy APIs" target="_blank">Bridging legacy logging APIs</a>, that is, <a href="http://www.slf4j.org/manual.html#summary" title="SLF4J Executive summary" target="_blank">redirect</a> log4j and java.util.logging calls to SLF4J.</li>
<li>Include the <a href="http://www.slf4j.org/manual.html#swapping" title="Binding SLF4J with a logging framework at deployment time" target="_blank">SLF4J API dependency</a></li>
<li>Include the <a href="http://logback.qos.ch/manual/index.html" title="Logback" target="_blank">Logback dependency</a></li>
</ol>
<p>Since Logback implements SLF4J natively, there is no need for further binding.</p><pre class="crayon-plain-tag">&lt;dependency&gt;
            &lt;groupId&gt;org.springframework&lt;/groupId&gt;
            &lt;artifactId&gt;spring-context&lt;/artifactId&gt;
            &lt;version&gt;4.0.6.RELEASE&lt;/version&gt;
            &lt;exclusions&gt;
                &lt;exclusion&gt;
                    &lt;groupId&gt;commons-logging&lt;/groupId&gt;
                    &lt;artifactId&gt;commons-logging&lt;/artifactId&gt;
                &lt;/exclusion&gt;
            &lt;/exclusions&gt;
        &lt;/dependency&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.slf4j&lt;/groupId&gt;
            &lt;artifactId&gt;jcl-over-slf4j&lt;/artifactId&gt;
            &lt;version&gt;1.7.7&lt;/version&gt;
        &lt;/dependency&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.slf4j&lt;/groupId&gt;
            &lt;artifactId&gt;log4j-over-slf4j&lt;/artifactId&gt;
            &lt;version&gt;1.7.7&lt;/version&gt;
        &lt;/dependency&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.slf4j&lt;/groupId&gt;
            &lt;artifactId&gt;slf4j-api&lt;/artifactId&gt;
            &lt;version&gt;1.7.7&lt;/version&gt;
        &lt;/dependency&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;ch.qos.logback&lt;/groupId&gt;
            &lt;artifactId&gt;logback-classic&lt;/artifactId&gt;
            &lt;version&gt;1.1.2&lt;/version&gt;
        &lt;/dependency&gt;</pre><p></p>
<p>Finally, you can configure Logback with a logback.groovy in the classpath, a logback-test.xml in the classpath, a <a href="https://github.com/jbbarquero/spring-core-sample/blob/master/src/main/resources/logback.xml" title="logback.xml" target="_blank">logback.xml in the classpath</a>, or using the <a href="http://logback.qos.ch/manual/configuration.html#auto_configuration" title="Configuration in logback" target="_blank">BasicConfigurator</a>.</p>
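<p>For instance, a minimal logback.xml (the pattern and levels are just an example) could be:</p>
<p></p><pre class="crayon-plain-tag">&lt;configuration&gt;
    &lt;appender name=&quot;STDOUT&quot; class=&quot;ch.qos.logback.core.ConsoleAppender&quot;&gt;
        &lt;encoder&gt;
            &lt;pattern&gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&lt;/pattern&gt;
        &lt;/encoder&gt;
    &lt;/appender&gt;
    &lt;logger name=&quot;org.springframework&quot; level=&quot;debug&quot;/&gt;
    &lt;root level=&quot;info&quot;&gt;
        &lt;appender-ref ref=&quot;STDOUT&quot;/&gt;
    &lt;/root&gt;
&lt;/configuration&gt;</pre><p></p>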
<p><a href="http://docs.spring.io/spring/docs/current/spring-framework-reference/htmlsingle/#overview-logging-slf4j" title="Using SLF4J with Spring" target="_blank">Spring provides instructions</a> for using SLF4J with Log4J, and there is a great explanation of the bindings at <a href="http://blog.espenberntsen.net/2010/06/06/slf4j-logging-with-log4j-and-jcl/" title="SLF4J logging with Log4J and JCL" target="_blank">SLF4J logging with Log4J and JCL</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://malsolo.com/blog4java/?feed=rss2&#038;p=309</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
