Summary
A memory leak in cloudgene forces frequent restarts; the most recent deployment lasted ~3 weeks before reaching the danger threshold for intervention. Symptoms include decreased export performance, elevated 5xx error rates to users, and slow site responsiveness (90th-percentile response time > 10 seconds).
We have consistently seen evidence of memory leaks and related issues across our last several deploys. As usage increases, these leaks force restarts more and more often, potentially more than once per month.
Details / evidence
Initial limited exploration suggests that there may be a memory leak in the S3 upload functionality, among other places. Below are the top 50 entries from jmap -histo <pid>, taken just before a recent restart. Note that there are 842k instances of com.amazonaws.services.s3.model.UploadPartRequest, even after the queue was completely drained and no jobs/uploads were in progress; a hypothetical retention pattern consistent with these counts is sketched after the listing.
num #instances #bytes class name
----------------------------------------------
1: 177150799 5668825568 java.util.Hashtable$Entry
2: 173521201 5552678432 java.util.concurrent.ConcurrentHashMap$Node
3: 370453 2292378272 [Ljava.util.concurrent.ConcurrentHashMap$Node;
4: 280325 1744738896 [Ljava.util.Hashtable$Entry;
5: 57184250 1376673256 [Ljava.lang.String;
6: 5696954 1116222200 [B
7: 9562157 1030570944 [C
8: 1291734 913105408 [I
9: 9087848 218108352 java.lang.String
10: 2188009 110985472 [Ljava.lang.Object;
11: 842030 94307360 com.amazonaws.services.s3.model.UploadPartRequest
12: 1590783 76357584 java.util.HashMap
13: 545638 65116480 java.lang.Class
14: 1600795 38419080 com.google.protobuf.ByteString$LiteralByteString
15: 842326 33693040 java.util.EnumMap
16: 957209 30630688 java.util.HashMap$Node
17: 930378 29772096 java.util.concurrent.FutureTask
18: 461692 29548288 java.util.concurrent.ConcurrentHashMap
19: 395224 28456128 org.apache.hadoop.yarn.proto.YarnProtos$ResourceInformationProto
20: 1745499 27927984 java.lang.Object
21: 132294 26458800 org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage
22: 1075438 25810512 java.util.ArrayList
23: 249083 24671232 [Ljava.util.HashMap$Node;
24: 929288 22302912 com.amazonaws.services.s3.model.PartETag
25: 842216 20213184 com.amazonaws.RequestClientOptions
26: 841642 20199408 com.amazonaws.services.s3.transfer.internal.UploadPartCallable
27: 840771 20178504 java.util.concurrent.LinkedBlockingQueue$Node
28: 1032731 16523696 com.mysql.cj.result.StringValueFactory
29: 263529 14757624 org.apache.hadoop.yarn.proto.YarnProtos$StringLongMapProto
30: 127995 14335440 java.util.GregorianCalendar
31: 144545 13886256 [[B
32: 278233 13355184 java.util.Properties
33: 197610 12647040 org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto
34: 128288 12315648 sun.util.calendar.Gregorian$Date
35: 479362 11504688 java.util.Collections$UnmodifiableRandomAccessList
36: 286710 11468400 java.util.LinkedHashMap$Entry
37: 65857 11063976 org.apache.hadoop.yarn.proto.YarnProtos$ApplicationReportProto
38: 193168 10817408 org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto
39: 125224 10017920 java.net.URI
40: 138024 8833536 org.apache.hadoop.ipc.Client$ConnectionId
41: 136587 8741568 cloudgene.mapred.core.User
42: 264358 8459456 sun.nio.fs.UnixPath
43: 173599 8332752 java.nio.HeapByteBuffer
44: 65865 7376880 org.apache.hadoop.yarn.proto.YarnProtos$ApplicationResourceUsageReportProto
45: 125767 7042952 java.util.Calendar$Builder
46: 166697 6667880 java.lang.ref.Finalizer
47: 138024 6625152 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker
48: 199409 6381088 java.util.LinkedList
49: 173590 6182232 [J
50: 128666 6175968 freemarker.core.Token
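For reference, one retention pattern that would line up with these counts (FutureTask, UploadPartCallable, and LinkedBlockingQueue$Node instances track the UploadPartRequest count almost one-to-one) is completed part-upload futures being accumulated in a long-lived collection. The sketch below is purely hypothetical and is not taken from cloudgene or the AWS SDK; the class and method names are made up for illustration.

// Hypothetical sketch of a retention pattern consistent with the histogram above.
// If futures for finished part uploads are kept in a long-lived list, each FutureTask
// keeps its Callable alive, and the Callable keeps the captured UploadPartRequest
// (and any buffers it references) alive, long after the work queue has drained.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

class UploadPartTracker {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // The leak: every submitted part is remembered here and never removed, so the
    // futures and their captured requests stay reachable for the life of the server.
    private final List<Future<PartETag>> completedParts = new ArrayList<>();

    void submitPart(AmazonS3 s3, UploadPartRequest request) {
        Callable<PartETag> task = () -> s3.uploadPart(request).getPartETag();
        completedParts.add(pool.submit(task));
    }
}

In a pattern like this, the part requests remain reachable even after every upload has finished and the queue is empty, which is the behaviour the histogram shows.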
This appears to be putting pressure on the garbage collector. (When evaluating total GC run time, reported in seconds, keep in mind that the site was redeployed to fresh hardware ~1 month ago.)
[user@ ~]$ sudo jstat -gc pid
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
1024.0 2048.0 896.0 0.0 2750464.0 1239804.9 20970496.0 19587020.8 3558952.0 3167623.3 444208.0 333253.0 446473 16369.459 1701 6032.675 22402.134
[user@ ~]$ sudo jstat -gccapacity pid
NGCMN NGCMX NGC S0C S1C EC OGCMN OGCMX OGC OC MCMN MCMX MC CCSMN CCSMX CCSC YGC FGC
698880.0 10484736.0 8955392.0 1024.0 2048.0 2750464.0 1398272.0 20970496.0 20970496.0 20970496.0 0.0 4163584.0 3558952.0 0.0 1048576.0 444208.0 446473 1701
[user@ ~]$ sudo jstat -gccause pid
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT LGCC GCC
87.50 0.00 46.19 93.40 89.00 75.02 446473 16369.459 1701 6032.675 22402.134 Allocation Failure No GC
[user@ ~]$ sudo jstat -gcold pid
MC MU CCSC CCSU OC OU YGC FGC FGCT GCT
3558952.0 3167623.3 444208.0 333253.0 20970496.0 19587020.8 446473 1701 6032.675 22402.134
[user@ ~]$ sudo jstat -gcoldcapacity pid
OGCMN OGCMX OGC OC YGC FGC FGCT GCT
1398272.0 20970496.0 20970496.0 20970496.0 446473 1701 6032.675 22402.134
[user@ ~]$ sudo jstat -gcutil pid
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
87.50 0.00 46.20 93.40 89.00 75.02 446473 16369.459 1701 6032.675 22402.134
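Reading these numbers: the old generation is ~93.4% full (OU 19587020.8 KB of OC 20970496.0 KB), young collections average 16369.459 s / 446473 ≈ 37 ms each, but the 1701 full collections have taken 6032.675 s in total, i.e. roughly 3.5 s per full GC. With the old generation this close to capacity and the last GC cause reported as "Allocation Failure", full collections can be expected to become more frequent as the leaked objects accumulate, and a few clustered multi-second full-GC pauses are consistent with the >10 s 90th-percentile response times described above.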
The cloudgene server process is by far the largest memory user on the system:
[hadoop@ip-10-120-4-95 ~]$ ps aux --sort=%mem,%cpu | grep 20283
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
user pid 487 19.2 50717600 37628456 ? Sl Feb26 242354:41 java <redacted> cloudgene.mapred.CommandLineInterface server
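A full heap dump taken just before the next restart (e.g. jmap -dump:live,format=b,file=cloudgene.hprof <pid>) and analyzed in a tool such as Eclipse MAT should show which GC roots are retaining the UploadPartRequest instances and confirm or rule out a retention pattern like the one sketched above.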