Problem statement
The CPU gets maxed out in a JVM-based production system. The issue recurs at least once every 3 days, with no consistent pattern. Once CPU consumption gets close to 100%, it stays there and the system requires a restart. The issue is not reproducible on lower-environment servers.
Solution approach
High-level architectural aspects were learned first. The run-time details of the system were then analyzed before getting to the application design and code.
Server details
- 8 CPU cores
- 32 GB memory
- RHEL OS
- JDK 8
Gathered data
- Thread dumps – a series of them (at least 5) separated by minutes.
- GC logs.
- Application log.
- Webserver request log.
- Top command CPU screenshots – a series of them (at least 5) separated by minutes.
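The thread dumps and top snapshots are most useful when read together. On Linux, top -H lists per-thread IDs in decimal, while HotSpot thread dumps report the same native ID as a hexadecimal nid field. A minimal sketch of that conversion follows. The class name ThreadIdMapper and the sample ID are hypothetical, and the original investigation may have matched threads differently.

```java
public class ThreadIdMapper {
    // Convert a decimal native thread ID (as listed by `top -H`) into the
    // hexadecimal "nid" form that appears in HotSpot thread dumps.
    static String toNid(long nativeThreadId) {
        return "0x" + Long.toHexString(nativeThreadId);
    }

    public static void main(String[] args) {
        // 12345 is a made-up thread ID, used only for illustration.
        long tid = args.length > 0 ? Long.parseLong(args[0]) : 12345L;
        System.out.println(toNid(tid));
    }
}
```

Searching the thread dump for the resulting nid value shows which Java thread (application or GC) corresponds to a busy entry in top.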
Driving rationale
Experience shows that a hung CPU is usually caused by an infinite loop in the code, garbage collection activity, or a prolonged network wait.
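One rough way to check the garbage collection hypothesis from inside the JVM is to compare accumulated GC time with uptime, using the standard java.lang.management beans. The sketch below is an illustration, not the method used in this investigation, and the class name GcOverhead is hypothetical. Note that getCollectionTime() reports approximate elapsed time since JVM start, so a recent spike is easier to see by sampling twice and comparing the deltas.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead {
    // Fraction of elapsed time spent in GC, given cumulative GC time and uptime in ms.
    static double gcFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    // Sum the accumulated collection time over all collectors in this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gc, uptime, 100 * gcFraction(gc, uptime));
    }
}
```

A fraction that keeps climbing toward 1.0 between samples points at GC rather than application threads as the CPU consumer.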
Troubleshooting
- The top command, run from the OS terminal during the issue, showed low CPU I/O wait, low CPU idle and high CPU user time. This means that user processes were consuming the CPU. Among the processes consuming CPU, Java ranked at the top, confirming that the JVM was maxing out the CPU. Follow this link to learn how to use the top command.
- As the Java process was hogging the CPU, the next step was to find out which of its threads was responsible. A series of thread dumps were taken and analyzed. Analysis of the Running user threads raised no suspicion, so the focus shifted to the GC threads. The JVM was configured to use Parallel GC, and 8 GC threads (equal to the number of CPU cores) were running. Good article on analyzing thread dumps: link. The htop command can also help to find the CPU consumption of individual threads.
- GC logs were pulled and visualized using GCeasy.io, and there the issue was found. GC threads were occupying the CPU cycles and choking the JVM’s throughput, because the heap had filled up and the full GCs were failing to reclaim the allocated memory. That’s a memory leak!
- It was time to gather evidence to support the newfound hypothesis. Application logs showed the OutOfMemory error, confirming the finding.
- Once the issue was confirmed, the hunt for the cause began. A heap dump was taken and the top objects occupying the heap were identified. They pointed to something related to database session usage.
- The code was reviewed, and the programming mistake of an unclosed session was found. The conclusion was that unclosed sessions were bloating the heap and full GC attempts were failing to reclaim them, which kept the GC threads busy and hung the CPU.
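The unclosed-session mistake and its usual fix can be sketched as follows. The Session class here is a hypothetical stand-in, not the actual database API used in the system, and the static OPEN set is an assumption that simulates whatever registry kept the real sessions reachable. The point of the sketch is that try-with-resources (available since Java 7) closes the session even when the query throws.

```java
import java.util.HashSet;
import java.util.Set;

public class SessionLeakDemo {
    // Hypothetical stand-in for a database session; not a real library class.
    static class Session implements AutoCloseable {
        // Open sessions stay reachable from this set until closed,
        // so the garbage collector cannot reclaim them.
        static final Set<Session> OPEN = new HashSet<>();
        private final byte[] buffer = new byte[1024]; // simulated per-session state

        Session() {
            OPEN.add(this);
        }

        String query(String sql) {
            return "result of " + sql;
        }

        @Override
        public void close() {
            OPEN.remove(this);
        }
    }

    // Leaky version: the session is never closed, so it stays in OPEN forever.
    static String leakyLookup(String sql) {
        Session s = new Session();
        return s.query(sql);
    }

    // Fixed version: try-with-resources calls close() on every exit path.
    static String safeLookup(String sql) {
        try (Session s = new Session()) {
            return s.query(sql);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            leakyLookup("SELECT 1");
        }
        System.out.println("open after leaky calls: " + Session.OPEN.size());

        Session.OPEN.clear();
        for (int i = 0; i < 1000; i++) {
            safeLookup("SELECT 1");
        }
        System.out.println("open after safe calls: " + Session.OPEN.size());
    }
}
```

In the leaky variant the number of retained sessions grows with every call, which is the same shape of growth that eventually fills the heap and leaves full GCs with nothing to reclaim.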