March 26, 2010

Having Fun with Encoding!

Compiler Plugin

Recently, I edited our company's root POM to upgrade some plugins to new versions. Of course, we follow the best practice of locking down plugin versions, so when a new version becomes available we only need to adjust the parent POM. Nearly all updates affected only the last digit, the z in an x.y.z version string, so I didn't expect many difficulties.

However, for the compiler plugin it was a jump from version 2.0.2 to 2.1, and indeed some of the test cases failed to compile with strange encoding errors under the new compiler plugin version.

Specify Encoding

We follow the suggestion to specify a POM property for the source file encoding, so that we don't have to configure the encoding for each relevant plugin individually. In fact, we were using exactly what's shown in the example:

<project>
  ...
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    ...
  </properties>
  ...
</project>

That is, we assumed our source files were UTF-8 encoded, which is the most widely used encoding for Unicode characters. For some of the projects, however, that's actually not the case, since we use Eclipse with its default text file encoding, which is Cp1252 (Western European) on our German Windows installations.

Why didn't we ever notice that? Well, it happens that both UTF-8 and Cp1252 are backwards compatible with ASCII. We write most of our code in English (package, class, method, attribute and parameter names, and even Javadoc comments), so for those files the resulting byte stream is identical in both encodings. However, some files contain German umlauts in line comments, and these are exactly the files that can no longer be compiled with the new compiler plugin version.
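To make the ASCII overlap concrete, here is a minimal sketch that compares the bytes produced by both encodings. The class and method names are made up for illustration, and it assumes the JRE provides the windows-1252 charset (the IANA name for Cp1252). The umlauts are written as Unicode escapes so the example does not depend on how it is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiOverlapDemo {
    // Assumption: this JRE ships the windows-1252 charset (Cp1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the text encodes to the same bytes in UTF-8 and Cp1252. */
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(CP1252));
    }

    public static void main(String[] args) {
        // Pure ASCII comment: identical bytes, prints "true".
        System.out.println(sameBytes("// compute the total"));
        // Comment with umlauts: one byte each in Cp1252, two in UTF-8, prints "false".
        System.out.println(sameBytes("// Summe f\u00FCr alle Eintr\u00E4ge"));
    }
}
```

As long as a file stays within ASCII, nothing reveals which of the two encodings it was saved in.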

When looking at the debug output of the compiler plugin 2.0.2 mojo configuration, you can see that the encoding is not explicitly set, which probably means that the platform default encoding is used (again Cp1252 on all our build machines):
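If you want to check which platform default a given machine would fall back to, a small sketch like the following prints it. The class name is hypothetical. Note that the result depends on the JVM version and operating system settings, and JDK 18 and later default to UTF-8 regardless of the platform.

```java
import java.nio.charset.Charset;

final class DefaultEncodingDemo {
    /** Describes the charset this JVM uses when none is given explicitly. */
    static String describeDefault() {
        return "default charset = " + Charset.defaultCharset().name()
             + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefault());
    }
}
```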

[DEBUG] Configuring mojo 'org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile' -->
[DEBUG] (f) basedir = ...
[DEBUG] (f) buildDirectory = ...
[DEBUG] (f) classpathElements = [...]
[DEBUG] (f) compileSourceRoots = [...]
[DEBUG] (f) compilerId = javac
[DEBUG] (f) debug = true
[DEBUG] (f) failOnError = true
[DEBUG] (f) fork = false
[DEBUG] (f) optimize = true
[DEBUG] (f) outputDirectory = ...
[DEBUG] (f) outputFileName = xxx-0.2.0-SNAPSHOT
[DEBUG] (f) projectArtifact = xxx:jar:0.2.0-SNAPSHOT
[DEBUG] (f) showDeprecation = false
[DEBUG] (f) showWarnings = false
[DEBUG] (f) source = 1.6
[DEBUG] (f) staleMillis = 0
[DEBUG] (f) target = 1.6
[DEBUG] (f) verbose = false
[DEBUG] -- end configuration --

Version 2.1 of the compiler plugin now honors what has been configured in the project.build.sourceEncoding property, and hence tries to read the Cp1252-encoded source files as UTF-8, which doesn't work when umlauts are used.
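The mismatch can be reproduced outside of Maven. The following sketch (hypothetical class name, and again assuming the windows-1252 charset is available) encodes a comment as Cp1252 and then decodes it strictly as UTF-8. It only illustrates the decoding problem and is not what the compiler plugin or javac does internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MisdecodeDemo {
    // Assumption: this JRE ships the windows-1252 charset (Cp1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A line comment with an umlaut, as Eclipse would save it in Cp1252.
        byte[] asSaved = "// Pr\u00FCfung".getBytes(CP1252);
        // The single byte 0xFC is not valid UTF-8, so this prints "false".
        System.out.println(isValidUtf8(asSaved));
        // A lenient decode substitutes the replacement character U+FFFD instead.
        System.out.println(new String(asSaved, StandardCharsets.UTF_8));
    }
}
```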

Specify Correct Encoding

Of course, the solution is to specify the correct encoding in the project.build.sourceEncoding property, matching the encoding that is used in the development environment when writing the source files.

Oh, yes, Cp1252 is quite similar to ISO 8859-1 (only the special characters at positions 0x80–0x9F differ, and we don't use those), so in fact we are now using ISO 8859-1 to allow builds on non-Windows platforms as well.
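For the curious, here is a small sketch of where the two encodings agree and where they part ways. The class and method names are invented for this example, and it assumes the JRE provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1VsCp1252Demo {
    // Assumption: this JRE ships the windows-1252 charset (Cp1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the text encodes to the same bytes in Cp1252 and ISO 8859-1. */
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(CP1252),
                             text.getBytes(StandardCharsets.ISO_8859_1));
    }

    /** Decodes a single byte value with the given charset. */
    static String decode(int byteValue, Charset charset) {
        return new String(new byte[] {(byte) byteValue}, charset);
    }

    public static void main(String[] args) {
        // German umlauts and sharp s sit above 0x9F: prints "true".
        System.out.println(sameBytes("\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF"));
        // Byte 0x80 is the euro sign in Cp1252: prints "20ac".
        System.out.println(Integer.toHexString(decode(0x80, CP1252).codePointAt(0)));
        // In ISO 8859-1 the same byte is a control character: prints "80".
        System.out.println(Integer.toHexString(
            decode(0x80, StandardCharsets.ISO_8859_1).codePointAt(0)));
    }
}
```

So the switch is safe only as long as nobody types a euro sign, typographic quotes, or another character from that range into a source file.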

Certainly, it would be nice if the plugins had a change history on their site where you could find this kind of change for new versions, without having to search in Jira...