Of course Java developers, and certainly Unix administrators, know commands like uptime, ps and top. Here are a few tools to help you drill down to the core of the problem when Java gets into trouble in a Unix environment.
The case
In an actual production situation, a Java process was using a LOT of I/O time, seemed to hang (no logging output), and the machine had a load average of about 20, which is way too high for this machine. The environment is relatively complex: multiple JBoss instances running on one server, shared file systems, and so on. We needed to know which process was doing what, and why. Quickly. These were the tools used to discover the root cause.
jps – Show Java process information
jps is part of the Oracle JDK installation and is also present in OpenJDK. It lists all Java processes, with a caveat as presented in an earlier post. Use options like -m and -v to see more details than just the process id and jar file. When top just shows java as busy, while the environment runs multiple JBoss instances, that is not very helpful. Matching the “offending” Java pid against the output of jps -v helped drill the I/O-hogging issue down to one of the running JBoss instances.
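For comparison, the process list that jps prints can also be obtained programmatically through the Attach API (the jdk.attach module on modern JDKs, tools.jar on JDK 8). This is a minimal sketch for illustration only, not something used in the investigation described here:

```java
import com.sun.tools.attach.VirtualMachine;
import com.sun.tools.attach.VirtualMachineDescriptor;

// Prints the id (pid) and display name (main class or jar plus arguments)
// of every JVM on this machine that is visible to the current user,
// which is roughly the information jps shows.
public class ListJvms {
    public static void main(String[] args) {
        for (VirtualMachineDescriptor vm : VirtualMachine.list()) {
            System.out.println(vm.id() + " " + vm.displayName());
        }
    }
}
```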
lsof – List of open files
lsof -p <pid> is a Unix administrator's friend. Given the process id, it shows every file that is opened by that process. The JBoss instance that was causing trouble is responsible for processing files, so we needed to know which files it was processing. It turned out that it had the same, large file open 20 times, as many times as there were worker threads available. This was caused by a retry mechanism: the caller had lost its patience because the processing simply took unexpectedly long… More info about lsof.
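As a side note: on Linux, the information lsof reports for a single process is also visible under /proc/<pid>/fd, where every open file descriptor is a symbolic link to the file it refers to. A minimal sketch, assuming Linux and permission to read that process's /proc entries (same user or root):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Lists the open file descriptors of a process by resolving the symlinks
// under /proc/<pid>/fd, a subset of what "lsof -p <pid>" reports.
public class OpenFiles {
    public static void main(String[] args) throws IOException {
        String pid = args[0]; // pid to inspect, e.g. taken from jps
        try (DirectoryStream<Path> fds =
                Files.newDirectoryStream(Paths.get("/proc", pid, "fd"))) {
            for (Path fd : fds) {
                System.out.println(fd.getFileName() + " -> " + Files.readSymbolicLink(fd));
            }
        }
    }
}
```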
What, then, was taking so long? We now knew which process was processing which files; next we needed to find out where the time was going.
jstack – Dump all stacks for all runtime threads of a process
jstack, too, is part of the Oracle JDK and OpenJDK. When given the process id of a running Java process, as determined with jps, it dumps all threads and what they are doing: a list of stack traces and the state of each thread (RUNNABLE, WAITING, …). This helped the investigation a lot. It turned out that during the processing of a file, the JVM was waiting a lot on calls to File.length(). Apparently this was called very often, or each call took rather long. The OS caches this value and can respond quickly, so this had to be something in the code. The stack trace pointed to the right line…
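To give an idea of what such a dump contains, the same kind of information (thread name, state and stack trace) can also be produced from inside a JVM with the standard ThreadMXBean API. This is only an illustration of the data involved; jstack itself attaches to another process from the outside:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dumps name, state and stack trace of every live thread in the current JVM,
// the same kind of information a jstack dump shows for another process.
// Note: ThreadInfo.toString() truncates long stack traces.
public class ThreadDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}
```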
Conclusions
The look of amazement on the face of the co-worker who saw that File.length() got called for each byte of a file, inside a loop, is one to remember… These files don’t change during processing, so this call should be made once, outside the loop. And while we were at it, the code could be optimised in a few other ways as well, so the story has a happy ending after all.
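For illustration, a hypothetical reconstruction of the pattern (not the actual production code): the loop condition calls File.length() for every byte read, and the fix is simply to read the length once, before the loop.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class LengthInLoop {

    // Slow variant: File.length() is re-evaluated on every iteration, i.e.
    // once per byte, and each call goes to the file system.
    static long checksumSlow(File file) throws IOException {
        long sum = 0;
        try (FileInputStream in = new FileInputStream(file)) {
            for (long i = 0; i < file.length(); i++) {
                sum += in.read();
            }
        }
        return sum;
    }

    // Fixed variant: the file does not change during processing, so the
    // length is read once, outside the loop. (Wrapping the stream in a
    // BufferedInputStream would be one of the other obvious optimisations.)
    static long checksumFast(File file) throws IOException {
        long length = file.length();
        long sum = 0;
        try (FileInputStream in = new FileInputStream(file)) {
            for (long i = 0; i < length; i++) {
                sum += in.read();
            }
        }
        return sum;
    }
}
```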
This issue could be found quickly with the help of runtime analysis tools available on the command line. jps, lsof and jstack, along with other tools like jmap and jstat, are essential tools in the toolbox of devs and ops in a Unix-based environment.