Decrypting another JPA benchmark

Posted in Persistance / Données by apatricio on October 23, 2012

Reviewing a benchmark is always an interesting task. Unfortunately, it is often useless, because benchmarks rarely represent real-life conditions.

Last month, someone told me a new JPA implementation was trying to make some noise. The author was creating buzz almost everywhere on social networks, commenting on popular articles and writing 2 articles claiming that this implementation was over 15 times faster than Hibernate. And here we are again. This sort of buzz happens every 2 years; it’s a cycle, and there is nothing we can do about it.

The benchmark

Reading more about the implementation, the focus is on fine-grained tuning of every single line of code, which is a really brave approach. However, announcing « The New JPA Implementation That Runs Over 15 Times Faster… » is not a good way to promote it. Anyway, let’s have a closer look at the benchmark that allowed the author to conclude « 15 times faster ».

The benchmark essentially consists of massive inserts, updates, deletes and queries, in batches of 250 iterations. The domain model is composed of:

4 entities, yes 4 entities,
no inheritance (it seems odd that a fabulous JPA implementation does not include more mapping configurations in its benchmark),
eager fetching enabled (no subselect fetching for example),
cascading enabled,
no versioning

Last but not least, the benchmark targets an in-memory database. Yes, this is where it becomes suspicious.

15 times faster… faster what?

The big problem behind the title is that « faster » has nothing to do with response time. The author tries to explain that it relates to CPU load. Fair enough, but could he clearly explain the consequences for a full system?

The fact is that the results are really hard to read, and the benchmark offers no relative view. You are told that Batoo might take 15 times less CPU time than Hibernate. So what? The danger here is to confuse the CPU load of the internal APIs with the full persistence cycle, which is composed of JPA internals, the JDBC driver, networking and the database engine. We all know, at least I hope so, that most of the time is spent on the network and in the database, so is it really important to focus _that much_ on JPA internals? Is Hibernate that slow? Will Batoo solve your performance issues?

Let’s be extremely clear here: most of the performance issues users face are NOT caused by the CPU load of JPA internals. The problems are almost always related to a slow database, bloated entity managers or inefficient queries. There are 3 general ways to solve these issues:

check your database!!! JPA is NOT going to fix a database internal problem
make better use of the database and JPA
tune your own code, which is going to have a really big impact on the global performance numbers. To do so, the frameworks you are using must offer various features, and this is exactly what Hibernate does: it allows you to tune every single interaction you trigger with the database.

Impacts: the Global view

I’m not going to rely on the timers provided by the benchmark. People are interested in checking how a system reacts globally when switching from one software option to another. I asked my friend and mate what tool a sysadmin would use to get a good view of the resources consumed by a process. « Easy », he told me, « use the time command ».

The time command runs the specified program command with the given arguments. When command finishes, time writes a message to standard error giving timing statistics about this program run. These statistics consist of (i) the elapsed real time between invocation and termination, (ii) the user CPU time (the sum of the tms_utime and tms_cutime values in a struct tms as returned by times(2)), and (iii) the system CPU time (the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2)).

So I’m going to use time mvn test with the right format to grab the CPU numbers.

Some would say it’s not accurate. I would answer that it measures the real relative impact on the whole system, not at the thread level.

We’ll see what percentage of CPU the benchmark consumes.

The good thing with mvn test is that we’ll also get the total execution time.
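As a rough illustration of how a CPU percentage relates to the three figures that time reports, the sketch below divides the CPU time (user + system) by the elapsed real time, which is how GNU time computes its percentage. The class name, method name and sample values are hypothetical; they are not taken from the benchmark.

```java
public class CpuShare {

    /**
     * CPU percentage in the style of the `time` command:
     * (user CPU time + system CPU time) / elapsed real time, times 100.
     */
    public static double cpuPercent(double userSeconds, double systemSeconds, double realSeconds) {
        if (realSeconds <= 0) {
            throw new IllegalArgumentException("realSeconds must be positive");
        }
        return 100.0 * (userSeconds + systemSeconds) / realSeconds;
    }

    public static void main(String[] args) {
        // Made-up sample: 20 s of user time and 6 s of system time over 13 s of wall-clock time.
        System.out.println(cpuPercent(20.0, 6.0, 13.0) + "% CPU");
    }
}
```

A value above 100% simply means that, on average, more than one core was busy during the run.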

Well-used Hibernate is way faster than noob Hibernate

In order to do some tuning, I’ll start with a low BENCHMARK_LENGTH iteration number, let’s say 5. That is 5 runs of each subtest, each subtest doing 250 iterations overall.

Since the in-memory DB will be started by the root process, the time command will also catch the DB load, which we don’t want, but we are just tuning the Hibernate config for the moment.

Shot 1: In-memory Derby, BENCHMARK_LENGTH 5

Hibernate 13 seconds 200% CPU

Batoo 9 seconds 210% CPU

The Hibernate-based setup _globally_ consumes less CPU but is also slower. Wait, I said that this measurement includes the database engine load. Bearing in mind that Batoo claims to consume 15 times less CPU than Hibernate, what would that mean? It would mean that the JPA internal CPU load is extremely low compared to the in-memory database CPU load… and still, we have an unsolvable equation here; we need more tests. Anyway, I already ~~think~~ know that the CPU time consumed by JPA may not be a critical issue at all. But wait, a small benchmark size and an in-memory database? This is irrelevant.

Let’s move to a local MySQL instance. In-memory databases might be good for development, but not for production. If I want a memory-based config, I’ll go for Hibernate + Infinispan + a real database instance, not an in-memory one.

Shot 2: Local MySQL, BENCHMARK_LENGTH 10

Hibernate 44 seconds 64% CPU

Batoo 36 seconds 55% CPU

Batoo is the winner. Should we stop here? Certainly not.

Something is wrong. The good thing with Hibernate, a _mature_ JPA implementation, is that it allows tons of tuning: in the mapping, in the queries, well, everywhere.

I remember, 7 years ago, one guy at work was insulting Hibernate, saying it was a piece of crap because he was observing memory issues and/or extremely long response times. One quick look at the Hibernate documentation, 2 mapping parameter updates and more care with the entity manager (well, the Hibernate Session at that time) solved the memory issue and made execution 10,000 times faster. I promise it’s true: 10,000 times faster in response time.

A quick look at the documentation informs us that we should use the following global config parameters:

<property name="hibernate.jdbc.batch_size" value="50"/>
<property name="hibernate.order_updates" value="true"/>
<property name="hibernate.id.new_generator_mappings" value="true"/>

More important: checking the INFO-level logs, you’ll notice that the author of the benchmark left auto-commit mode enabled. Someone who claims to be implementing the fastest JPA implementation ever should review the basics.

So let’s add

<property name="hibernate.connection.autocommit" value="false" />

and rerun the test with something bigger: I also raise BENCHMARK_LENGTH to 100.
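For setups that configure the persistence unit from code rather than from persistence.xml, the same four settings could be gathered in a map, as in the minimal sketch below. The class name is hypothetical, and the sketch only builds the map so that it stays free of any JPA dependency. Note that Hibernate spells the update-ordering key hibernate.order_updates, with a trailing « s ».

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TunedHibernateSettings {

    /** Collects the tuning properties discussed above as plain key/value pairs. */
    public static Map<String, String> properties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hibernate.jdbc.batch_size", "50");             // group statements into JDBC batches of 50
        props.put("hibernate.order_updates", "true");             // order updates so that batching is effective
        props.put("hibernate.id.new_generator_mappings", "true"); // use the newer identifier generators
        props.put("hibernate.connection.autocommit", "false");    // do not commit after every statement
        return props;
    }

    public static void main(String[] args) {
        // With JPA on the classpath, this map could be passed to
        // javax.persistence.Persistence.createEntityManagerFactory(unitName, props),
        // where unitName is the name of your own persistence unit.
        properties().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```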

Shot 3: Local MySQL, BENCHMARK_LENGTH 100

Hibernate 5 min 1 seconds 29% CPU

Batoo 4 min 52 seconds 22% CPU

29% versus 22% CPU… We are talking about a production server. In my case I used a 3-year-old server with 8 CPU cores.

29% of one CPU …

This benchmark is using

3.625% of my global system CPU capacity with Hibernate
2.75% of my global system CPU capacity with Batoo

It seems we are really far from the 15 times faster!
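The arithmetic behind these two figures can be sketched as follows: the percentage reported by time is relative to one core, so the share of the whole machine is that percentage divided by the number of cores. The class and method names are hypothetical.

```java
public class GlobalCpuShare {

    /** Converts a percentage of a single core into a percentage of the whole machine. */
    public static double shareOfMachine(double percentOfOneCore, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return percentOfOneCore / cores;
    }

    public static void main(String[] args) {
        int cores = 8; // the 8-core server used for this run
        System.out.println("Hibernate: " + shareOfMachine(29.0, cores) + "%"); // 29 / 8
        System.out.println("Batoo: " + shareOfMachine(22.0, cores) + "%");     // 22 / 8
    }
}
```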

A more realistic environment

I won’t stop here. The author says that his motivation is to reduce cluster size thanks to his optimized implementation. So let’s say I don’t have 8 cores, only 2. Here I might be interested in reducing my CPU load.

Whoever says cluster also says remote database. And even without a cluster, a production system will host the database on a different server.

So I set up a really fast LAN, excellent ping and bandwidth.

Here are the new results

Shot 4: Remote MySQL, BENCHMARK_LENGTH 100

Hibernate 16 min 30 seconds 6% CPU

Batoo 17 min 55 seconds 5% CPU

(Is it me, or is Hibernate faster?)

Which, for a 4-core system, means

1.5% of my global system CPU capacity with Hibernate
1.25% of my global system CPU capacity with Batoo

or, for an 8-core system, means

0.75% of my global system CPU capacity with Hibernate
0.62% of my global system CPU capacity with Batoo

Yes Batoo is better in terms of CPU load but with such low levels, who cares? The persistence engine is always waiting for the database!

On the middleware side we have a very low CPU load (at least for the JPA part). What about the DB server side? We have one CPU out of the 4 constantly above 80%, and a second one above 40%.

Putting all of this together, Batoo MAY consume slightly less CPU than Hibernate, but the actual database load on these benchmarks far outweighs the load imposed by either Batoo or Hibernate.

And, since the internal ORM performance is a relatively minor issue, we should think about the products’ features.

Why is that important?

write stupid queries and you’ll get network, database and possibly memory issues (no matter if you use Hibernate or Batoo) -> query-oriented advanced features are welcome
be an idiot with the entity manager and you’ll observe slowness and memory issues (no matter if you use Hibernate or Batoo) -> get some training and certification, persistence is not something easy, you need SKILLS
not to mention DB slowness… if you do not understand RDBMSs, you MUST work with a DBA; this is not advice, this is a requirement.
if you are an expert in JPA, and your application is slow because of the database, you may have hit the limitations of the RDBMS. Give Hibernate OGM a try with a supported NoSQL database.
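To illustrate the entity manager point in the list above: one common way to keep the persistence context from growing without bound during bulk work is to flush and clear it at regular intervals, typically matching the JDBC batch size. The sketch below shows only the shape of that loop. PersistenceContext is a hypothetical stand-in for the persist, flush and clear methods of a JPA EntityManager, so that the example runs without any library.

```java
import java.util.List;

public class BatchedPersist {

    /** Hypothetical stand-in for the three EntityManager methods used in this sketch. */
    public interface PersistenceContext {
        void persist(Object entity);
        void flush();
        void clear();
    }

    /**
     * Persists every entity, flushing and clearing the context each time
     * batchSize entities have accumulated. Returns the number of flushes performed.
     */
    public static int persistInBatches(PersistenceContext ctx, List<?> entities, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int flushes = 0;
        int pending = 0;
        for (Object entity : entities) {
            ctx.persist(entity);
            pending++;
            if (pending == batchSize) {
                ctx.flush(); // push the pending statements to the database
                ctx.clear(); // detach the entities so memory does not keep growing
                flushes++;
                pending = 0;
            }
        }
        if (pending > 0) {   // flush whatever is left in the last, partial batch
            ctx.flush();
            ctx.clear();
            flushes++;
        }
        return flushes;
    }
}
```

With a real EntityManager the loop body would be the same, and the whole loop would run inside a single transaction.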

Shot 5: Remote MySQL, BENCHMARK_LENGTH 100, 2 concurrent runs

I’m not going to provide the results, but response times are impacted by 50%. CPU load is close to 2%.

Conclusion: I don’t care about these CPU optimizations, I prefer FEATURES!

I honestly cannot imagine JPA being the cause of a CPU problem in a real-life scenario. I’m also not forgetting that Hibernate is a 12-year-old project. Talented people are working full time on it, and the major version is 4, which means it has been heavily reworked 4 times. That’s enough to understand that it is well written and that efforts are put into the most critical area: offering features.

I did online community support, then official Red Hat support. I’ve seen query response time issues (1) and sometimes memory issues (2), but I can’t remember a CPU issue related to JPA internals.

To solve 1 and 2, you need skills but also features that help you tune your code, like being able to use a specific database syntax (dialects) or to tune association loading. I could continue for 300 pages here, so I simply recommend that you consult the index of the Hibernate documentation.

Tagged with: hibernate, jpa, persistence, tuning

Hibernate Search: the icing on the cake

Posted in Persistance / Données by apatricio on July 15, 2009

Survey

Do you use Hibernate? yes
Do you use annotations to define your metadata? yes
You don’t know Hibernate Search? shame on you!

R.O.I. HB-Search

Some frameworks offer quite an impressive ROI, and Hibernate Search is one of them.

As you have probably guessed, Hibernate Search lets you implement an efficient full-text search engine. It builds on Lucene, Hibernate and annotations.

Lucene is a very mature, polished and efficient Java indexing and search technology. The value of HB Search lies in its integration with Hibernate, which makes it impressively easy to set up.

We regularly face the problem of implementing a search engine in our enterprise applications. Several issues come up:

design: we are very good at offering search forms that target every imaginable piece of data in our models, sometimes going as far as implementing search forms with 36 fields. The problem with these forms is that they are inaccessible to the proverbial housewife under 50, that is, to the general public

relevance: we know how to be relevant and precise on numbers, dates and booleans, but when we are asked to take spelling mistakes or synonyms in character strings into account, we generally find ourselves helpless

With HB Search you can offer your customers, at a lower cost, a first step toward a user-friendly full-text engine (typically a single form field, Google-style). They will be pleasantly surprised and will have no trouble widening the scope of the specifications to consolidate this engine.

The example

Imagine a Produit class with various String properties, such as the label and the brand, or libellePrincipal and libelleSecondaire.

You want the search to target these two properties.

Below is the entity, annotated the way you are used to:

@Entity
public class Produit {

	@Id
	private int codeProduit;

	private String libellePrincipal;

	private String libelleSecondaire;
 	...
}

And here is a search performed, for example, via HQL:

javax.persistence.Query q =
	em.createQuery(
		"select produit " +
		"from Produit produit " +
		"where produit.libellePrincipal = :param");
q.setParameter("param", "café");
List results = q.getResultList();

Metadata

What needs to be added so that the entity and its 2 fields can be targeted by the full-text engine?

@Entity
@Indexed
public class Produit {

	@Id
	@DocumentId
	private int codeProduit;

	@Field
	private String libellePrincipal;

	@Field
	private String libelleSecondaire;
	...
}

@org.hibernate.search.annotations.Indexed states that the annotated entity can be indexed. This annotation enables the Lucene/Hibernate integration.

Among many other features, automatic indexing makes your life easier: when you act on an entity of this type, the Lucene index is maintained automatically.

org.hibernate.search.annotations.Field declares that a property is indexed. The annotation offers various levers to define how the property is indexed. For now, let’s apply the default settings.

Pretty easy, right? Let’s now turn to the API side.

Search API

Before we start, note that equivalent APIs exist for the Hibernate Session.

Here, several steps are needed. First obtain a full-text EntityManager, then create a Lucene query. Finally, creating a JPA search query from the Lucene query brings us back to a familiar and convenient API for handling the entities returned by the search.

Here is what it looks like:

// literal expression of the Lucene query
String searchQuery = "cafe~";

org.hibernate.search.jpa.FullTextEntityManager fullTextEm =
	Search.getFullTextEntityManager(entityManager);
SearchFactory sf = fullTextEm.getSearchFactory();

// build a QueryParser: set the default field
// and fetch the analyzer bound to the entity
org.apache.lucene.queryParser.QueryParser parser = new QueryParser(
	"libellePrincipal",
	sf.getAnalyzer( Produit.class )
);

// build the Lucene query
org.apache.lucene.search.Query luceneQuery = parser.parse(searchQuery);

// create the full-text JPA query
org.hibernate.search.jpa.FullTextQuery ftq =
	fullTextEm.createFullTextQuery(luceneQuery, Produit.class);

// run the query
List results = ftq.getResultList();

The subtlety here lies in the Lucene query « cafe~ ». The tilde enables an approximate (fuzzy) search. This kind of search avoids the accent and typo problems that come up so often. Likewise, if users type spelling mistakes, this search will cope with them easily.
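Lucene’s fuzzy search is based on edit distance: two terms match when few enough single-character changes separate them. As a minimal sketch of that idea, and not Lucene’s own implementation, the hypothetical class below computes the Levenshtein distance between two strings; a missing accent or a doubled letter each counts as a single edit.

```java
public class EditDistance {

    /** Number of single-character insertions, deletions or substitutions turning a into b. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café" written with a Unicode escape to avoid source-encoding issues.
        System.out.println(levenshtein("caf\u00e9", "cafe"));
        System.out.println(levenshtein("cafe", "caffe"));
    }
}
```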

Many more features

This article does not aim to cover all the power of Hibernate Search, only to show how easy and quick it is to set up. Exploiting an object graph (and its associations) for search, weighting certain fields and tuning search relevance are all possible and easy to use.
Of course, other aspects must be taken into account, in particular the use of analyzers (approximation, phonetics, synonyms,…) and the management / maintenance of the indexes.
I therefore recommend reading the reference guide and, above all, the book by Emmanuel Bernard and John Griffin.

Tagged with: fulltext, hibernate, indexation, jpa, lucene, recherche

Teiid: the data you want from the data you have

Posted in Persistance / Données by apatricio on June 12, 2009

Teiid

It could not be clearer: the data you want from the data you have.

Teiid lets you aggregate several data sources into a single one. Picture it: you have 2 relational databases, 3 web services and some flat files; you throw everything into the blender and, in the end, you handle only one data source.

If the client application is written in Java, you will use this data source via JDBC or, even better, via Hibernate or JPA, since the latest release includes a Hibernate dialect.

Of course, Teiid will dispatch each statement to the underlying data sources.

Teiid’s strength lies, in my opinion, in 3 points:

maturity: Teiid is in fact built on MetaMatrix
tooling: simply impressive
a feature list far richer and subtler than the summary in this article

Head over to the blog and to the Teiid and Teiid Designer sites as soon as you can.

Tagged with: datastore, données, hibernate, metamatrix, teiid

Anthony Patricio’s Blog