23 September 2010

Oh what a tangled web, or, Maven dependency management

I was at a presentation last year by Arnaud Heritier, of the Maven core team, who advised the following best practices for dependency management (slide 23 in the presentation):

  • Define all dependencies you are using - and no more!

  • Cleanup your dependencies with mvn dependency:analyze

The first item is really two pieces of advice, both excellent.

  1. Your POM's dependencies should list all of the artifacts that your code uses directly. This apparently obvious rule is very easy to violate unintentionally---and indeed unknowingly. Here's an example: suppose you are writing a toString() method, and you ask your IDE's autocomplete facility to look for ToStringBuilder. Your IDE looks on its classpath, which includes all your dependencies both direct and transitive. You haven't defined a dependency on commons-lang, but you have defined one on spring-context, which depends on commons-lang. So your IDE finds ToStringBuilder in the project classpath, and adds the import statement. It doesn't warn you, because it doesn't know the difference between a direct dependency and a transitive dependency. Bam! You've got a used but undeclared dependency in your project, a small but unpredictable landmine(*) waiting for the right trigger to set it off. Your build will fail the day that your dependency on spring-context disappears, or changes to a version that no longer pulls in commons-lang.

    (* In any project that's been around a while, a cluster bomb would be a better analogy for what's likely to be lurking in the dependencies section of its POM.)

  2. Your POM's dependencies should not list any artifacts that your code does not need. This rule is also obvious and easily broken. It happens whenever you cease to use any classes from a library, but fail to remove the dependency from your POM. You don't necessarily realise when you've removed the very last reference to a library from your project, and your IDE won't tell you. If you break this rule, it won't fail your build directly, but it will pollute your POM with unused artifacts. Since you have to define the version that you want of each of these unused artifacts, it also increases the chance of a version conflict. This happens when you declare dependencies on A, which you really do use, and on X, version v, which you don't use, and A also declares a dependency on X, version v'. Maven will enforce version v of X, which may or may not be compatible with A.
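Returning to the ToStringBuilder example: the cure for the first rule is simply to declare the direct dependency yourself, even though something else already drags it in transitively. A sketch (version numbers are illustrative, not a recommendation):

```xml
<dependencies>
  <!-- Declared because our own code imports ToStringBuilder directly -->
  <dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.5</version>
  </dependency>
  <!-- Declared in its own right; we no longer rely on it to
       pull in commons-lang for us -->
  <dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-context</artifactId>
    <version>3.0.4.RELEASE</version>
  </dependency>
</dependencies>
```

Now the build survives a change or removal of spring-context, because commons-lang is pinned independently.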

So what about the advice to use dependency:analyze? This will compare the classes referenced from your Java source with the dependencies declared in your (effective) POM, and flag up any discrepancies between the two: that is, artifacts that you have declared in your POM but do not use, and artifacts that you use in your code but do not declare (and have got away without declaring because they are pulled in as transitive dependencies of something else).
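To see it in action, run the goal from the project directory. The output (illustrative here, not taken from a real build) flags the two kinds of discrepancy separately:

```shell
$ mvn dependency:analyze
...
[WARNING] Used undeclared dependencies found:
[WARNING]    commons-lang:commons-lang:jar:2.5:compile
[WARNING] Unused declared dependencies found:
[WARNING]    commons-collections:commons-collections:jar:3.2.1:compile
```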

The dependency:analyze goal can be a useful tool, but it gives inadequate protection against the problems mentioned above.

  • One aims to catch mistakes as early as possible. The solution needs to be sought, not in an analyser run post-hoc, but in IDE tooling. If the IDE were aware of the distinction between direct and transitive dependencies, it would know not to import a class without checking that its JAR was declared in the POM. And it could do that yellow wavy underlining thing when you imported something that was not directly listed in the POM. It could also, perhaps, warn you when you excised from your project's source code the last reference to an artifact. This would be far superior to occasionally running, or more likely forgetting to run, a dependency analysis tool and hand-processing its output. (The dependency analysis tool has to be launched manually. There is no point running it automatically as part of the build, since no-one will read its output unless it fails the build upon warnings, and you cannot let it fail the build, since it generates spurious warnings -- see below.)

  • On top of that, it simply doesn't work very well.

    • It will tell you that you have declared a dependency but not used it, when in fact the class is referenced in a configuration file (e.g. Spring), and failing to declare the dependency would cause a runtime ClassNotFoundException (or is it NoClassDefFoundError? I forget. Anyway, you know the one I mean.)

    • Conversely, it will fail to tell you that you have used a dependency without declaring it, if the only reference to the class is in a configuration file.

    • It will tell you that you are using a dependency without having declared it, when the dependency is referenced only from generated code. As an example, if you use YFWSF (Your Favourite Web Services Framework) to call a webservice, you'll probably use yfwsf-maven-plugin to generate the client-side source code from the WSDL during the generate-sources phase. This source code will reference classes from, say, jaxb-api. The dependency:analyze goal will therefore give a warning, unless you put a jaxb-api dependency in your POM. However, you should not put that dependency in your POM, since the generated source code is effectively an artifact of YFWSF and not of your project, and the transitive dependency on jaxb-api declared by YFWSF is sufficient.

    • Conversely, it won't warn you if you've declared a dependency that is only referenced by generated code.

If the Maven meta-model allows it, fixing the problem with generated code would be relatively simple. It would be enough to add a flag to make it ignore either generated code, or code under /target (which should come to the same thing; some people generate source code under /src, but they deserve all that is coming to them).

Detecting non-Java references to classes is an altogether hairier proposition. It isn't feasible to understand the configuration formats of every single tool capable of instantiating a class referenced in a non-compile-checked manner. It might be possible to run a simple plain-text search across certain text-based files (XML, properties), looking for the fully qualified names of any of the classes contained in direct dependencies (to rule out apparent "declared, not used" errors), or any of the classes contained in transitive dependencies but not in direct dependencies (to catch "not declared, but used" errors).

In all of that, I haven't even talked about the <dependencyManagement> section. The above only applies to the <dependencies> section, which is where you define what your project really uses. What Maven calls dependencyManagement serves to define the versions that you want for artifacts. It's a bit confusing: an artifact can be listed

  • in both dependencyManagement and dependencies: it's a dependency of the project, and will be propagated transitively to projects that use this project; the version number must be given under dependencyManagement, but should not be given under dependencies; the version used will be the one specified under dependencyManagement.

  • in dependencies but not dependencyManagement: it's a dependency of the project, and will be propagated transitively to projects that use this project; the version must be given. This is poor practice because it's preferable to group all the version management together in dependencyManagement.

  • in dependencyManagement but not dependencies: it's not a dependency of the project; if one of the project's dependencies requests it, the version requested will be overridden by this one; if it's not even a transitive dependency, there will still be no error nor warning from dependency:analyze (or anything else).
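Putting the first case into practice, the conventional split looks something like this (coordinates and versions are illustrative): the parent POM pins the version, and the child declares what it actually uses, version-free:

```xml
<!-- parent POM -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
      <version>2.5</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- child POM: no <version> element; it is inherited from the parent -->
<dependencies>
  <dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
  </dependency>
</dependencies>
```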

Our architect put me onto the practice of keeping the version numbers in the dependencyManagement section only, and keeping them out of dependencies. On a multi-module project (i.e. most projects), you have a single dependencyManagement in the parent project, which ensures that all modules use the same versions of their dependencies. The downside of this is that you have to keep skipping between child and parent POMs when you add or remove dependencies, and this further burdens the task of keeping track of which dependencies you are really using.

It is even worse when child and parent are on a different release cycle: when you change a dependency, you have to (a) change the child's parent to the latest snapshot of the parent, (b) add the artifact to the parent's dependencyManagement, (c) build the parent, (d) commit the parent's POM, (e) add the artifact to the child's dependencies, (f) code what you needed, (g) integrate the changes to the child into the version-control trunk (i.e. make sure tests pass), (h) perform a release of the parent, (i) change the child's parent to the newly released version, (j) commit the child's POM again. You could leave the child inheriting from a snapshot version of the parent, but you won't be able to release as long as that's the case, and I've learnt that it's a bad idea to put impediments in the way of a release.

It is worse still when the parent's release cycle includes other sub-modules, which may have work in progress on them. If you do want to share a dependencyManagement section between projects that are related but on separate release cycles, I strongly recommend either having a grand-parent project that contains only the dependencyManagement but no modules, and exists on its own release cycle (so that you can modify and release it quickly when you need, without impacting sibling projects which can carry on inheriting from the previous version), or using the <type>pom</type> <scope>import</scope> technique.
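The import technique mentioned above lets a project pull in another POM's dependencyManagement section without inheriting from it. A sketch (the coordinates are made up):

```xml
<dependencyManagement>
  <dependencies>
    <!-- Merges the dependencyManagement of the referenced POM into
         this one, with no parent-child relationship required -->
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>shared-dependencies</artifactId>
      <version>1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```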

16 September 2010

How to ignore accents (diacritics) in a JPA query?

This is a work in progress. Please feel free to contribute in the comments. In what follows, I'll talk about JPA, but it applies equally to native Hibernate queries (Criteria or HQL), and indeed to plain JDBC.

JPQL provides for case-insensitive searches using the upper() or lower() functions (though it's not clear to me whether the JPA Criteria API has an equivalent). However, if you work with non-English-speaking clients, it's very likely that they will also need to perform accent-insensitive searches. For example, a search for "noel" in the surname field should find anyone named Noël.

I can think of two ways to do this, but neither is very satisfactory.

The first is to call convert(input_text, 'US7ASCII') upon both your search string and the column you are searching. This will have the same effect upon accents that lower() will have upon case. However, it only works on Oracle. Also, if you use it on an indexed column, you had better index convert(my_column, 'US7ASCII') too, otherwise the index won't help in that search.
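On Oracle, the query and the matching function-based index might look like this (table and column names are made up; combining convert() with lower() handles case and accents together):

```sql
-- Function-based index so the accent-insensitive search can avoid a full scan
CREATE INDEX idx_person_surname_ascii
  ON person (CONVERT(LOWER(surname), 'US7ASCII'));

SELECT *
  FROM person
 WHERE CONVERT(LOWER(surname), 'US7ASCII') = CONVERT(LOWER(:search), 'US7ASCII');
```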

The second way is to maintain a second, accent-free copy of every text column on which you need to perform accent-insensitive searches. You would have to include this extra column in the JPA mapping, otherwise you wouldn't be able to refer to it in your JPQL (or JPA Criteria) searches. I don't much like this either, because it pollutes the entities, the mapping and the database schema with fields that have no business meaning, but are essentially technical. But is it not a lesser evil than using a non-portable database function?
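If you do go the shadow-column route, at least the accent-free copy doesn't have to be maintained by hand: java.text.Normalizer (standard since Java 6) can derive it, for instance from a JPA @PrePersist/@PreUpdate callback. A minimal sketch:

```java
import java.text.Normalizer;

public class AccentStripper {

    // Decompose each accented character into its base letter plus
    // combining marks (NFD), then strip out the combining marks.
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Noël")); // prints: Noel
    }
}
```

Combined with lower(), searching the shadow column then gives you both case- and accent-insensitivity in portable JPQL.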

The ideal solution would be to use a JPA-provided API or syntax, as one can do for case-sensitivity. That would be portable across JPA providers and across database providers. But I know of no such API. Please leave a comment if you do.

23 July 2010

Move Windows XP installation to a new drive

Oh dear. I am going to seem like a serious Windows geek, kicking off my blog with a post like this. I am no Windows geek, otherwise I wouldn't have struggled with this.

Also, this has nothing to do with the blog's name. Never mind.

Here's the scenario. You've got this Windows machine (for whatever bizarre reason) that's a bit slow, and you decide that putting in a faster hard drive, maybe an SSD, should speed it up. So you get the new drive, and install it physically, but it won't make your machine much faster unless you transfer your operating system and programs onto it from the existing drive. This post is about how to do so under Windows XP - I haven't tried under Vista or Windows 7 but some details definitely won't apply under those versions.

You're not going to bother re-installing and re-configuring absolutely everything. What you want is to clone the Windows partition from the old drive onto the new one. Oddly, although all the information here is fairly easy to find on the net, I haven't found a page that brings it all together.

You need to do two things: clone the existing Windows partition onto the new drive, and get Windows to boot and run from the new drive. However, if you go about it in that order, it won't work. The second part is trickier than it sounds.

Booting from the correct partition

There are several factors that affect which drive is which on your system, and the BIOS and operating system don't identify them in the same way:
  • The BIOS identifies physical drives (not partitions on those drives) according to which SATA socket or IDE channel they're plugged into. For SATA drives, the drive in the first socket is drive 0, the one in the second socket is drive 1, and so on.
  • The "boot order" setting within the BIOS determines which of the disks the BIOS will first look at for a boot sector from which to launch the OS. (For reasons I cannot divulge, I was unable to modify the BIOS settings when I did this operation, but as it turned out, there was no need to.)
  • The boot.ini file in the root of the boot partition lists the Windows installations you can load. Usually there is just the one, but it is identified by the drive's BIOS number and by the partition's sequence number within the drive, not by its Windows drive letter. This won't need changing either, though, as we'll see.
  • Windows generates a unique identifier for each hard drive partition it sees, and stores the association between drive letter and partition in the registry. You will need to modify this.

(The same principle applies under Linux, except that you have a mount point instead of a drive letter, and the association between partition and mount point is in /etc/fstab instead of in the registry.)

Here's the problem. As explained here, Windows remembers which physical partition was assigned which drive letter. If you've cloned your drive using a Windows program, that means the new drive has been recognised by Windows and assigned a drive letter... maybe D: or E:, say. When you boot onto the new drive, it'll start Windows alright, except that you'll be running off drive D: or E:, and all your program shortcuts reference drive C:. And you can't change the letter of your system drive (the one off which you're currently running).

I used Method 2 from the previously-cited page. Before cloning the Windows partition, run regedit, open [HKEY_LOCAL_MACHINE\System\MountedDevices], and delete the key named after the new drive (\DosDevices\D: or \DosDevices\E:, or whatever letter Windows has given the new drive).
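Equivalently, the same deletion can be done from a command prompt with reg.exe (substitute whatever letter Windows actually assigned to the new drive):

```
reg delete "HKLM\SYSTEM\MountedDevices" /v "\DosDevices\D:" /f
```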

Then clone the Windows partition onto the new drive - see below.

After cloning, shut down, then unplug the old drive and plug the new drive into its place, so that the new drive is in position 0 (assuming that's where the old drive was previously). Do not plug the old drive back in. Then restart the machine. By doing it this way,
  • you don't have to modify the boot order in the BIOS, nor modify boot.ini, since both of these work from the drive's physical position, and you're still booting from drive 0.
  • when it starts up, Windows will notice that the new partition isn't in its list of drive letter assignments, and will assign it to drive C:. This will be possible since the letter C:, although it's in Windows' list, is available since it currently refers to a non-existent drive (the old drive, which you unplugged).
If you plugged the old drive back in before restarting, Windows would see that its assignment for C: was a valid drive, and would attribute a different letter to the new partition, and you'd have to start all over again.

Cloning the existing partition

This is well explained on a number of sites. I did it with a product called DriveImage XML which was free for home use, and worked fine, but there are others (Wikipedia has a list of drive cloning software). The limitation with the free cloning programs is that they'll only clone towards a partition that's at least as big as the source partition. If you're installing an SSD, there's a good chance that it'll be smaller than your existing HDD partition, in which case you'll need to
  • Make sure the volume of data on it is less than the capacity of the new drive
  • Shrink the existing partition with some partition editing software - I used this one which is again free for home use.

Linux/Windows dual-boot machines

I haven't tried this, but you'd probably be best off cloning the existing Linux and Windows partitions onto the new drive from Linux or from an install CD. I'm not sure whether that would copy the bootloader for you, or whether you'd need to install the bootloader (which you could do from within the existing Linux installation) as a separate step. The only Windows-related precautions you'd need to take would be making sure you never booted into the old Windows partition with the new drive connected, and that you unplugged the old drive before booting into the new Windows partition for the first time.

Windows Vista or Windows 7

The only thing I know about these is that the method of assigning boot letters has changed, and no longer involves registry keys. I don't know if the problem of assigning the letter C: to the new partition persists, but if it does, and you can still delete drive assignments with the new mechanism, then doing the Windows 7/Vista equivalent of the above actions in the same order should work.


I make no warranty that the above instructions will not cause data loss, or indeed any other kind of loss. The only assurance I can give is that they worked for me.