I posted a little quiz yesterday, and here is the answer.
The short answer is that InputStream needs to be closed. It’s easy to see why if it’s FileInputStream because you know the file handle needs to be released. But in this case, it’s just ByteArrayInputStream. We can just let GC recycle all the memory, right?
Turns out GZIPInputStream (or more precisely Deflater that it uses internally) uses native zlib code to perform decompression, so it’s actually occupying more memory (about 32K-64K depending on the compression level, I believe) on the native side, while its Java heap footprint is small. So if you allocate enough of those, you can end up eating a lot of native memory, while Java heap is still mostly idle. Even though those GZipInputStreams are no longer referenced, it just doesn’t create enough heap pressure to cause the GC to run.
And eventually you eat up all the native memory, and zlib’s malloc fails, and you get OutOfMemoryError (or your system starts to swap like crazy and your system effectively becomes unusable first.)
The other interesting thing to note is that -XX:HeapDumpOnOutOfMemoryError doesn’t do anything in this case. I read the JVM source code and I learned that heap dump only happens when OOME is caused during 3 or 4 specific memory allocation operations, like allocating a Java object, array, GC saturation, and a few other things. There are many other code passes in JVM that reports OOME, like this zlib malloc failure, that doesn’t trigger heap dump. There’s no question HeapDumpOnOutOfMemoryError is useful, but just beware that in some cases it doesn’t get created.
I knew that GZipInputStream is using native code internally, but I didn’t think about it too much when I was putting this original code together. Humans can’t think about all the transitive object graph and its implications.
The other lesson is that now I know why ps sometimes report such a big memory footprint for JVM while jmap reports only a modest usage. The difference is native memory outside Java heap, although unfortunately I don’t think there’s any easy way to check what’s eating the native memory.
My colleague and friend Paul Sandoz pointed out that if GZipInputStream was nice enough to free them up at EOF, it would have saved a lot of hassle, and I think he’s right — one still needs to consider the case where IOException causes the processing to abort before hitting EOF, but it would have helped, because those abnormal cases would be rare. I mean, there’s no harm in doing so, and anything that makes the library more robust in the face of abuse is a good thing, especially when the failure mode is this cryptic.