Multiple full GCs per hour pause all application threads and hurt system performance and reliability

news 2025/8/5 8:35:01

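Background:

One day the system started reporting request timeouts. The logs showed that execution reached some point and then simply stopped, or more precisely, every thread stopped making progress at once, and request handling was slow during that window. We checked the related services and found no deadlocks, exceptions, or out-of-memory conditions. The timeouts were not tied to a particular endpoint, and their timing looked scattered. A colleague then asked whether the system might be running a full GC, so we started investigating in that direction.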

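Confirming the cause of the request timeouts

Step 1: get the GC log. Comparing the GC log with the time windows of the request timeouts showed that the system was running a full GC at exactly the moments the requests timed out.

Why a full GC stops the world: a full GC has to scan and clean the entire heap (young generation, old generation, and metaspace). To keep object reference relationships consistent during collection, the JVM must pause all application threads. This global pause is called STW (Stop-The-World).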

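Symptoms: all user requests stall, task queues back up, and monitoring may show threads as BLOCKED or WAITING.
Duration: typically from a few hundred milliseconds to several seconds. The larger the heap and the more live objects it holds, the longer the pause (for example, a 10 GB heap may pause for 5 seconds or more).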

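Each full GC lasted 5-6 s, which means the application threads were paused for 5-6 s. Every timeout coincided with a full GC, so we could be fairly confident that full GC was the cause. GC-induced stalls have three tell-tale signs: 1. no particular service or endpoint is singled out; 2. the application logs show all threads stalling at the same moment; 3. the GC log confirms a collection in that window.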
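When the GC log is not at hand, the same correlation can be approximated from inside the JVM. The following is a minimal sketch using the standard `GarbageCollectorMXBean` API; the class name `GcPauseSnapshot` and its helper methods are made up for illustration. It reports cumulative counts and times per collector, not individual pauses, so it complements the GC log rather than replacing it.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcPauseSnapshot {

    /** Sums the collection counts reported by every collector in this JVM. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 means "undefined" for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Sums the approximate accumulated collection time in milliseconds. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 means "undefined" for this collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector; collector names differ between JVM vendors and GC policies.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
        System.out.println("total GC time (ms): " + totalCollectionMillis());
    }
}
```

Logging these two totals next to each slow request makes it easy to see whether the accumulated GC time jumped by several seconds during the request.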

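Finding the cause of the full GC

Having established that GC was lowering throughput and raising latency, the next step was to fix the full GC problem itself.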

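What usually causes a full GC?
1. First, read the GC log carefully, because a good GC log tells you the reason for each full GC directly.
2. Second, you need to know which conditions can trigger a full GC: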

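First: insufficient memory
1. The old generation is full, metaspace overflows, or promotion pressure from the young generation
Second: explicit trigger
System.gc()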

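In our case the GC log had already stated the reason: reason: sys, which means the collection was triggered by System.gc(). We have also seen the insufficient-memory case, where the log reports reason: af (allocation failure), meaning an allocation failed because there was not enough space left in the old generation.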

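You can cross-check this with the old-generation figures in the GC log. Just before each System.gc() collection, the heap still had plenty of free space and the old generation was nowhere near full. That rules out the first cause and points to the second.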

But who was calling System.gc()? A global search showed that our business code never calls System.gc() directly, so what was triggering it? And could we simply disable System.gc()?

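Can System.gc() simply be disabled?

Disabling System.gc() is not recommended, because a third-party framework or a component you depend on may rely on System.gc() to clean up off-heap memory.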

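For example:
Off-heap memory reclamation depends on it
Off-heap memory such as DirectByteBuffer: it is only freed once the associated Cleaner object has been processed by the GC. If System.gc() is disabled, off-heap memory may not be released in time, leading to an OutOfMemoryError even when the heap itself has plenty of room. A typical case: NIO-based frameworks such as Netty may depend on a full GC to release direct buffers that were not freed explicitly. With System.gc() disabled, they have to wait until the JVM starts a full GC on its own, and the delayed release can end in memory exhaustion.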
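Before deciding either way, it helps to know whether explicit GC is already disabled on the running JVM. A minimal sketch, assuming the option was passed as a JVM argument: `-XX:+DisableExplicitGC` is the HotSpot spelling and `-Xdisableexplicitgc` is the IBM Java spelling mentioned in the IBM text quoted later in this article. The class and method names are made up for illustration, and the check deliberately ignores the case where a later argument re-enables explicit GC.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class ExplicitGcFlagCheck {

    /** Returns true if any of the given JVM arguments disables System.gc(). */
    static boolean disablesExplicitGc(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if ("-XX:+DisableExplicitGC".equals(arg)      // HotSpot
                    || "-Xdisableexplicitgc".equals(arg)) { // IBM Java
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with (not the application's own arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("explicit GC disabled: " + disablesExplicitGc(jvmArgs));
    }
}
```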

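Tracking down the source of System.gc()

If we do not disable it, what should we do instead? The previous section mentioned direct memory, so could the full GCs be caused by direct memory running short? If so, how would we prove it, and how much direct memory was actually in use? We decided to log direct memory usage and compare it with the GC log. If you know direct memory well, you may not even need the extra logging: the GC log reports the number of phantom references. So how are phantom references related to direct memory?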

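Phantom references are the "monitoring trigger" that manages direct memory

What phantom references do: a phantom reference (PhantomReference) is the weakest reference type in Java. You cannot obtain the referent through get() (it always returns null); its main purpose is to track when an object is garbage collected. A phantom reference must be associated with a ReferenceQueue. When the object is collected, the phantom reference is added to the queue, which triggers the follow-up cleanup. In plain terms: when the target object is reclaimed by GC, its phantom reference is enqueued, and that triggers whatever has to happen next.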

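What makes direct memory special: direct memory (such as memory allocated through DirectByteBuffer) lives outside the JVM heap and is managed by the operating system. Its lifecycle is not directly controlled by the garbage collector, but the DirectByteBuffer object on the heap is managed by GC. When that object is collected, the associated direct memory has to be released explicitly (through the Cleaner mechanism); otherwise the off-heap memory leaks.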
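The split between the small heap object and the large off-heap block can be observed directly. The sketch below, assuming the JVM exposes a buffer pool named "direct" (as in the monitoring code later in this article), allocates one direct buffer and reads the pool usage before and after. The class name `DirectBufferAccounting` is illustrative.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectBufferAccounting {

    /** Returns bytes currently used by the "direct" buffer pool, or -1 if no such pool is exposed. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();

        // The ByteBuffer object itself is a small heap object;
        // the 16 MB it points to is allocated outside the heap.
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);

        long after = directMemoryUsed();
        System.out.println("isDirect: " + buffer.isDirect());
        System.out.println("direct pool grew by " + (after - before) + " bytes");
    }
}
```

The heap usage barely changes after the allocation, which is why heap-only monitoring does not reveal this kind of pressure.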

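Monitoring object collection: a phantom reference is bound to the DirectByteBuffer object. When the object is reclaimed by GC, the phantom reference is added to the ReferenceQueue.
Triggering resource release: by polling the ReferenceQueue, the program can release the off-heap memory (for example by calling Unsafe.freeMemory()).
Preventing memory leaks: this mechanism ensures that collecting the heap object does not leave its off-heap memory behind unreleased.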
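The three steps above can be sketched with plain `PhantomReference` and `ReferenceQueue`. This is not the JDK's Cleaner implementation: `PhantomCleanupSketch`, `NativeBlockRef`, and the `nativeBytesInUse` counter are hypothetical stand-ins, and the "native memory" here is only a number, so that the flow can be run safely. Note that `System.gc()` is only a request, so the release step may not happen on every JVM or configuration.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class PhantomCleanupSketch {

    /** Hypothetical stand-in for off-heap bytes that have not been released yet. */
    static final AtomicLong nativeBytesInUse = new AtomicLong();

    /** Keeps the references themselves strongly reachable, otherwise they could be collected too. */
    static final Set<NativeBlockRef> live = ConcurrentHashMap.newKeySet();

    /** Phantom reference that remembers how many "native" bytes belong to its owner. */
    static final class NativeBlockRef extends PhantomReference<Object> {
        final long size;

        NativeBlockRef(Object owner, ReferenceQueue<Object> queue, long size) {
            super(owner, queue);
            this.size = size;
        }
    }

    /** Creates an owner object (the role of DirectByteBuffer) and returns only its phantom reference. */
    static NativeBlockRef allocate(ReferenceQueue<Object> queue, long size) {
        Object owner = new Object();
        nativeBytesInUse.addAndGet(size); // the role of the native allocation
        NativeBlockRef ref = new NativeBlockRef(owner, queue, size);
        live.add(ref);
        return ref; // 'owner' is unreachable once this method returns
    }

    /** Requests GC and waits for one enqueued reference; returns true if its block was released. */
    static boolean collectAndRelease(ReferenceQueue<Object> queue, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (System.currentTimeMillis() < deadline) {
                System.gc(); // a request only; the JVM may ignore it
                Reference<?> enqueued = queue.remove(100);
                if (enqueued instanceof NativeBlockRef) {
                    NativeBlockRef ref = (NativeBlockRef) enqueued;
                    nativeBytesInUse.addAndGet(-ref.size); // the role of Unsafe.freeMemory()
                    live.remove(ref);
                    ref.clear();
                    return true;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        NativeBlockRef ref = allocate(queue, 1024);
        System.out.println("ref.get() = " + ref.get()); // prints "ref.get() = null"
        System.out.println("in use before GC: " + nativeBytesInUse.get());
        boolean released = collectAndRelease(queue, 5000);
        System.out.println("released = " + released + ", in use after: " + nativeBytesInUse.get());
    }
}
```

The design point to notice is that nothing is released until a collection actually discovers that the owner object is unreachable, which is the link between GC frequency and off-heap memory usage.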

On-heap DirectByteBuffer object – phantom reference – off-heap memory. The phantom reference can be thought of as the bridge between the two.

DBBs use a PhantomReference which is essentially a more flexible finalizer and they allow the native memory of the DBB to be freed once there are no longer any live Java references. Finalizers and their ilk are generally not recommended because their cleanup time by the garbage collector is non-deterministic.

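At this point the picture is clear: GC collects the DirectByteBuffer object, its phantom reference is added to the ReferenceQueue, and the off-heap memory is then released. This is how the JVM keeps control over off-heap memory. It also means the number of phantom references tells you roughly how many DirectByteBuffer objects exist. It is only a count, but it gives an indirect view of off-heap memory usage.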

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectMemoryMonitor {

    public static void main(String[] args) {
        // The platform exposes one BufferPoolMXBean per buffer pool; we only want "direct".
        List<BufferPoolMXBean> pools = ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                System.out.println("Direct memory used: " + formatBytes(pool.getMemoryUsed())
                        + " / " + formatBytes(pool.getTotalCapacity()));
            }
        }
    }

    // Formats a byte count with a binary unit suffix, e.g. 1536 -> "1.5 KB".
    private static String formatBytes(long bytes) {
        if (bytes < 1024) return bytes + " B";
        int exp = (int) (Math.log(bytes) / Math.log(1024));
        char unit = "KMGTPE".charAt(exp - 1);
        return String.format("%.1f %sB", bytes / Math.pow(1024, exp), unit);
    }
}

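With this logging in place we found that a full GC occurred whenever direct memory usage reached 64 MB, 96 MB, and 128 MB. Based on the direct memory usage, our initial suspicion was that the full GCs were caused by direct memory running short. We run IBM JDK 1.8, so we also asked colleagues at IBM, who pointed us to these links: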

https://publib.boulder.ibm.com/httpserv/cookbook/WebSphere_Application_Server-WAS_traditional-HTTP.html

https://publib.boulder.ibm.com/httpserv/cookbook/Troubleshooting-Troubleshooting_Java.html#Troubleshooting-Troubleshooting_Java-Excessive_Direct_Byte_Buffers

There are two main types of problems with Direct Byte Buffers:

Excessive native memory usage
Excessive performance overhead due to System.gc calls by the DBB code

This section primarily discusses issue 1. For issue 2, note that IBM Java starts with a soft limit of 64MB and increases by 32MB chunks with a System.gc each time, so consider setting -XX:MaxDirectMemorySize=$BYTES (e.g. -XX:MaxDirectMemorySize=1024m) to avoid this upfront cost (although read on for how to size this).

This type of problem is particularly bad with generational collectors because the whole purpose of a generational collector is to minimize the collection of the tenured space (ideally never needing to collect it). If a DBB is tenured, because the size of the Java object is very small, it puts little pressure on the tenured heap. Even if the DBB is ready to be garbage collected, the PhantomReference can only become ready during a tenured collection. Here is a description of this problem (which also talks about native classloader objects, but the principle is the same): (In plain terms: once a DBB has been promoted to the old generation, it is not reclaimed until a full GC collects the old generation, so its native memory is effectively leaked until then.)

In most cases, something like -XX:MaxDirectMemorySize=1024m (and ensuring -Xdisableexplicitgc is not set) is a reasonable solution to the problem.

A system dump or HPROF dump may be loaded in the IBM Memory Analyzer Tool & the IBM Extensions for Memory Analyzer DirectByteBuffer plugin may be run to show how much of the DBB native memory is available for garbage collection. For example:

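This matches the pattern we observed, so we can conclude that the explicit GCs were triggered by off-heap (direct) memory hitting its limit.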
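The match can be checked with a little arithmetic. The sketch below takes the two numbers from the IBM text quoted above (a 64 MB initial soft limit, raised in 32 MB steps) and lists the limits that usage reaches on the way to a given peak. It is a simplified model, assuming one System.gc() per limit reached; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SoftLimitSteps {

    static final long INITIAL_SOFT_LIMIT_MB = 64; // from the IBM cookbook text
    static final long STEP_MB = 32;               // from the IBM cookbook text

    /** Lists the soft-limit values (in MB) that direct memory usage reaches on its way to peakMb. */
    static List<Long> limitsReached(long peakMb) {
        List<Long> limits = new ArrayList<>();
        for (long limit = INITIAL_SOFT_LIMIT_MB; limit <= peakMb; limit += STEP_MB) {
            limits.add(limit);
        }
        return limits;
    }

    public static void main(String[] args) {
        // The values we observed in our own logs: 64 MB, 96 MB, 128 MB.
        System.out.println(limitsReached(128)); // prints [64, 96, 128]

        // Under this model, growing to 1024 MB from the default would cost this many System.gc() calls.
        System.out.println(limitsReached(1024).size()); // prints 31
    }
}
```

Under this model, starting with a larger limit (for example the -XX:MaxDirectMemorySize=1024m setting from the quoted text) avoids paying for each intermediate step.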

So what we need to solve from here on is the DBB (Direct Byte Buffer) problem: