Project

General

Profile

Task #563

Memory management, possible leaks

Added by Marcel Poul almost 4 years ago. Updated about 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Feedback
Target version:
-
Start date:
07/02/2017
Due date:
% Done:

100%

Estimated time:
Milestones:

Description

This ticket closely correspond with #561 ticket.

We made a workflow that resave all entities and started it for identities. It fell after like 200 entities.
Filip then made a LRT that resave all entities too, but runs in 4 threads - this was not because we believed that it could address our problem, but we wanted to make the run quicker. It was quite a surprise, that it manage all of 5600 identities. But after the end of LRT the memory of java is constantly on 6GB, even if IdM do nothing all day. I guess garbage collector run several times since the end of LRT.

So there are several possible problems with memory management in CzechIdM and it deserves some analytic work.

result.png (27.6 KB) result.png Ondřej Kopr, 07/03/2017 09:22 AM
dominator_tree.png (176 KB) dominator_tree.png Ondřej Kopr, 07/03/2017 09:27 AM
threads.png (172 KB) threads.png Ondřej Kopr, 07/03/2017 09:28 AM
result_unreachable.png (27 KB) result_unreachable.png Ondřej Kopr, 07/03/2017 09:51 AM
domitaro_tree_unreachable.png (162 KB) domitaro_tree_unreachable.png Ondřej Kopr, 07/03/2017 09:53 AM
max_class_usage.png (126 KB) max_class_usage.png Ondřej Kopr, 07/03/2017 09:54 AM
output_complete.png (348 KB) output_complete.png Ondřej Kopr, 07/03/2017 01:34 PM
groovy_test_leak.java (1.51 KB) groovy_test_leak.java Ondřej Kopr, 07/04/2017 06:43 AM
overview_sync.png (22.6 KB) overview_sync.png Ondřej Kopr, 07/04/2017 08:12 AM
histogram_sync_idm_objects.png (120 KB) histogram_sync_idm_objects.png Ondřej Kopr, 07/04/2017 08:12 AM
histogram_sync_idm_objects.png (120 KB) histogram_sync_idm_objects.png Ondřej Kopr, 07/04/2017 08:12 AM
dominator_sync.png (168 KB) dominator_sync.png Ondřej Kopr, 07/04/2017 08:12 AM
histogram_sync.png (120 KB) histogram_sync.png Ondřej Kopr, 07/04/2017 08:15 AM
postgres-caching.png (108 KB) postgres-caching.png Jan Helbich, 07/12/2017 05:21 PM
munin_classes_loaded.png (34.1 KB) munin_classes_loaded.png Petr Fišer, 07/28/2017 12:03 PM
munin_heap_alloc.png (57.6 KB) munin_heap_alloc.png Petr Fišer, 07/28/2017 12:03 PM
munin_metaspace_peak.png (43 KB) munin_metaspace_peak.png Petr Fišer, 07/28/2017 12:03 PM
munin_nonheap_alloc.png (39.4 KB) munin_nonheap_alloc.png Petr Fišer, 07/28/2017 12:03 PM
run_2h50m_top.png (56.5 KB) run_2h50m_top.png Petr Fišer, 07/28/2017 12:03 PM
visualvm_heap.png (33.9 KB) visualvm_heap.png Petr Fišer, 07/28/2017 12:03 PM
visualvm_metaspace.png (30.6 KB) visualvm_metaspace.png Petr Fišer, 07/28/2017 12:03 PM
dominator_drops.png (46.6 KB) dominator_drops.png Petr Fišer, 08/01/2017 12:57 PM
dominator_cake.png (84.7 KB) dominator_cake.png Petr Fišer, 08/01/2017 12:57 PM
leakhunter_class_dominator.png (78 KB) leakhunter_class_dominator.png Petr Fišer, 08/01/2017 12:57 PM
leakhunter_accu_by_class.png (171 KB) leakhunter_accu_by_class.png Petr Fišer, 08/01/2017 12:57 PM
leakhunter_obj_dominator.png (99.9 KB) leakhunter_obj_dominator.png Petr Fišer, 08/01/2017 12:57 PM
leakhunter_overview.png (53.5 KB) leakhunter_overview.png Petr Fišer, 08/01/2017 12:57 PM
leakhunter_path_to_accupoint.png (53.1 KB) leakhunter_path_to_accupoint.png Petr Fišer, 08/01/2017 12:57 PM
classes_loaded_result.png (122 KB) classes_loaded_result.png Ondřej Kopr, 08/08/2017 06:22 AM

Related issues

Related to CzechIdM - Task #561: resave user performanceClosed06/29/2017

Related to CzechIdM - Task #816: Implement connector poolingClosed11/03/2017

History

#1 Updated by Marcel Poul almost 4 years ago

  • Related to Task #561: resave user performance added

#2 Updated by Marcel Poul almost 4 years ago

  • Description updated (diff)

#3 Updated by Marcel Poul almost 4 years ago

  • Subject changed from Memory management, possilbe leaks to Memory management, possible leaks

#6 Updated by Radek Tomiška almost 4 years ago

Memory on project machine:

                     total        used        free      shared  buff/cache   available
Mem:           7633        6341         139         568        1151         450

After dump memory
jmap  -dump:format=b,file=/tmp/aopk_heap.bin <PID>

file has ~2.3GB, after analyze in Memory Analyzer in eclise get this result (See pic result) memory ussage: 870,4 MB:
  • Remainder: 491,9 MB,
  • org.codehaus.groovy.reflection.ClassInfo: 141,3 MB,
  • java.beans.ThreadGroupContext: 128,9 MB,
  • java.lang.ref.Finalizer: 108,2 MB.
Dominator Tree (children alive by node) (see dominator_tree pic)
  • org.codehaus.groovy.reflection.ClassInfo: 16.24%,
  • java.beans.ThreadGroupContext: 14,81%,
  • java.lang.ref.Finalizer: 12,44%.

#7 Updated by Ondřej Kopr almost 4 years ago

Compare with unreachable object:
pics: result_unreachable, domitaro_tree_unreachable, max_class_usage

#8 Updated by Ondřej Kopr almost 4 years ago

After search in open tickets in groovy project found this: https://issues.apache.org/jira/browse/GROOVY-6704
I have created the same test as in ticket: after 30min of running unit test java consume still only about ~200mb and 7000 interactions and still continuing.

#9 Updated by Ondřej Kopr almost 4 years ago

Output from jemalloc

#10 Updated by Ondřej Kopr almost 4 years ago

Test for memory leak in groovy script - test did not find any errors or leaks.

#11 Updated by Ondřej Kopr almost 4 years ago

During synchronization I dump again memory and verified the behavior during synchronization:

Memory during synchronization:

                     total        used           free      shared  buff/cache   available
Mem:           7633        4918         133         568        2582        1858
Swap:             0           0                0

for detail info see pics.

#13 Updated by Marcel Poul almost 4 years ago

Ondřej Kopr wrote:

During synchronization I dump again memory and verified the behavior during synchronization:

Memory during synchronization:
[...]

for detail info see pics.

I recommend to get the workflow that caused the out of memory error from Filip and run it again and see why it crashed if it happen again.

#15 Updated by Ondřej Kopr almost 4 years ago

  • Status changed from Needs feedback to Closed
  • % Done changed from 90 to 100

#16 Updated by Jan Helbich almost 4 years ago

I think there might be something wrong with postgres configuration.
I 10 threads resave entity task it seems to consume even more memory then configured - see attached pic.

#17 Updated by Jan Helbich almost 4 years ago

  • Assignee changed from Ondřej Kopr to Petr Fišer

Postgres memory seems to be holding on reported level. Configuration:

max_connections = 100
shared_buffers = 256MB
effective_cache_size = 768MB
work_mem = 2621kB
maintenance_work_mem = 64MB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 7864kB
default_statistics_target = 100

It seems like every postmaster process has its own shared_buffers cache or something like that.
Or am I wrong here?
With described setting the database actually takes over 3.5G of memory, which is definitely not supposed to...
Petr, can you take a look at this? Thx

#18 Updated by Radek Tomiška almost 4 years ago

  • Status changed from Closed to In Progress
  • % Done changed from 100 to 70

#19 Updated by Jan Helbich almost 4 years ago

I've tried to limit metaspace to 1G -> resulted to outofmemory exception.

#20 Updated by Petr Fišer almost 4 years ago

  • Assignee changed from Petr Fišer to Jan Helbich

Jan Helbich wrote:

Postgres memory seems to be holding on reported level. Configuration:
[...]

It seems like every postmaster process has its own shared_buffers cache or something like that.

I wouldn't expect Postgres to be a culprit here. Databases usually have those things very thoroughly tested. It is possible, but not very probable.
The total sum of shared_buffers does not make sense. As by name, this part of memory is shared between the processes, so there is only one instance of it. Because the "top" tool is process-based you can see the memory listed at every postgres process - it is true, after all, that every postgres process has those shared_buffers mapped into their memory range. But physically, there is only one "instance" of shared_buffers in the memory.

Looking at jeprof output from Ondra: https://redmine.czechidm.com/attachments/download/69/output_complete.png
The strange thing here is so much allocation in JVM_FindSignal, I have never seen an application have massive allocations on this. Also I managed to google couple of bugs related to JVM_FindSignal, (most of them would probably be regression bugs) in the JVM itself. This would also explain, why limiting heap and/or metaspace has no effect.

Currently we are running on os-packaged version of jdk, that means OpenJDK. The first thing I would to would be to change to OracleJDK to see if it behaves the same way.

#22 Updated by Jan Helbich almost 4 years ago

  • Assignee changed from Jan Helbich to Petr Fišer

#24 Updated by Petr Fišer almost 4 years ago

I got the visualvm up and running on the remote and traced what is happening. Also added "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/idm_oomdump.dmp" to the JAVA_OPTS so we will eventually have a dump from the time of crash.
I have "resave all" scheduled tasks running on roles and identities. I started the "resave all roles" again (like 5th time now) every time it finished. The "resave all identities" is still going through its first run.

Looks like we are leaking classes somewhere. I started with roughly 60k loaded classes in the morning.
By now, there are about 130k loaded classes. Almost none (~5k) got unloaded.
Heap seems to be holding fine - JVM tries really hard to make it fit into -Xmx:3500M limit we gave it. On the other hand, Metaspace grows steadily - which makes sense if we are really leaking classes. Munin monitoring, installed by Zdenda, confirms this.

Attached screenshots contain more info. Especially note correlation between non-heap memory and metaspace usage.

Now we know what is happenning, but I do not know the culprit yet.
I am going to let the JVM OOM so it will produce a dump. I hope I will find out more from the dump.

#25 Updated by Vít Švanda almost 4 years ago

Maybe is problem still in Groovy scripts (as Ondra wrote). We have minimal three places where we using Groovy scripts:

  • GroovyScirptService - but here is now using cache for same scirpts.
  • Workflow - for example process for change user permissions use many of Groovy scripts and there is not any cache for same scripts. In standard situation (on AOPK may be yes), is not call any workflow after save identity.
  • JDBC scripted connector - This is my favorite, because here is not cache for script too and connector is used after identity is saved.

#26 Updated by Petr Fišer almost 4 years ago

Can be, thanks for hints. My personal tip is the JAXB we use for deserializing XMLs. JAXB itself uses WeakHashMap and such things. Those, when handled improperly, can very easily create a leak.
But first I need to collect the dump - IdM has been chugging memory away almost whole day and I do not want to throw the invested time away. :)

#27 Updated by Ondřej Kopr almost 4 years ago

After some discussion with colleagues, we think problem may be in the script connector. For our scripts is cache (from Filip), but for script connector isn't. Please try deactivate system with script connector.

#28 Updated by Petr Fišer almost 4 years ago

There is a bit better view on the issue. I started on IdM with ~30k loaded classes, I ran "resave all identities" and "resave all roles" until the count of loaded classes was about ~60k - this means potentially one half of 60k classes leaked. Then I examined heap dump with leakhunter from eclipse memory analyzer.
One identified issue which occupies about one half of the heap (so this neatly corresponds with "one half of loaded classes leaked"). See leakhunter_overview.png.

So there is obviously problem with org.codehaus.groovy.reflection.ClassInfo and (as by other screenshots) with groovy.lang.MetaClassImpl classes. This really points out that classes loaded through groovy do not get unloaded anywhere.

#29 Updated by Ondřej Kopr almost 4 years ago

  • Assignee changed from Petr Fišer to Ondřej Kopr

#30 Updated by Ondřej Kopr almost 4 years ago

Possibly a final judgment: as you se in picture (classes_loaded_result.png) after disable scripted connector, or change this connector to table is loaded class apparently less than when we use scripted connector. So problem is definitely in scripted connector, we tried to find some solution, but we did not come to anything yet.

#31 Updated by Ondřej Kopr almost 4 years ago

  • Assignee changed from Ondřej Kopr to Jan Helbich

#32 Updated by Jan Helbich over 3 years ago

  • Assignee changed from Jan Helbich to Ondřej Kopr

#34 Updated by Ondřej Kopr over 3 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

I get hit from Alca about connector configuration reloadScriptOnExecution when this configuration is set to true is all scripts loaded and compiled every time when connector call script, but on project is this configuration property set to false :(

Also exists new version of this connector 2.2.5 but in this version is only add connector configuration to script and another minor fixes.

Found some discussion on https://stackoverflow.com/questions/36407119/groovyshell-in-java8-memory-leak-duplicated-classes-src-code-load-test-pr

on our script evaluator is already similar cache exists, but in https://github.com/Evolveum/openicf/blob/release1.4.2.0/framework/java/connector-framework-internal/src/main/java/org/identityconnectors/common/script/groovy/GroovyScriptExecutorFactory.java#L67-L71 isn't.

If we forked connector-framework-internal it is possible update this class with similar cache like in our DefaultGroovyScriptService.

For now I recommend to use script connector on remote connector server, because when will be connector server shut down by memory leak is less problem than shut down IdM.

I write some resume of this ticket into documentation and for now I close this ticket (in anoher ticket is possible fork connector-framework-internal).

documentation: https://wiki.czechidm.com/devel/dev/system/supported-connectors#probable_memory_leak_in_scripted_sql_connector

#35 Updated by Marcel Poul over 3 years ago

Ondřej Kopr wrote:

I get hit from Alca about connector configuration reloadScriptOnExecution when this configuration is set to true is all scripts loaded and compiled every time when connector call script, but on project is this configuration property set to false :(

So should we do something on projects? If so, please discuss with relevant people...

Thx

#36 Updated by Ondřej Kopr over 3 years ago

Marcel Poul wrote:

Ondřej Kopr wrote:

I get hit from Alca about connector configuration reloadScriptOnExecution when this configuration is set to true is all scripts loaded and compiled every time when connector call script, but on project is this configuration property set to false :(

So should we do something on projects? If so, please discuss with relevant people...

Thx

This configuration property is by default turn off, so it not necessary set anything on any projects.

#37 Updated by Alena Peterová over 3 years ago

Ondřej Kopr wrote:

This configuration property is by default turn off, so it not necessary set anything on any projects.

Yes, it is turned off, but does it really work? In the log it looked like the scripts were compiled every time. Definitely when we made the changes to groovy scripts, they were loaded immediately.

#38 Updated by Ondřej Kopr over 3 years ago

Alena Peterová wrote:

Ondřej Kopr wrote:

This configuration property is by default turn off, so it not necessary set anything on any projects.

Yes, it is turned off, but does it really work? In the log it looked like the scripts were compiled every time. Definitely when we made the changes to groovy scripts, they were loaded immediately.

On my local build i set up new Scripted SQL Connector, create all necessary scripts (create, update and search) and debug this class: https://github.com/Tirasa/ConnIdDBBundle/blob/db-2.2.4/scriptedsql/src/main/java/net/tirasa/connid/bundles/db/scriptedsql/ScriptedSQLConnector.java

and check every place that has similar if statement like this:

 if (config.isReloadScriptOnExecution()) {
     searchExecutor = getScriptExecutor(config.getSearchScript(), config.getSearchScriptFileName());
     LOG.ok("Search script loaded");
}

if is this configuration property turn off it not reached body of this if statement, so this property works well.

#39 Updated by Alena Peterová over 3 years ago

Then maybe the problem is in multiple calling the Init method:

Thread Id: 52   Time: 2017-11-02 11:02:07.797   Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector    Method: init(ScriptedSQLConnector.java:155)     Level: OK       Message: 
Create script loaded
Thread Id: 52   Time: 2017-11-02 11:02:08.207   Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector    Method: init(ScriptedSQLConnector.java:158)     Level: OK       Message: 
Update script loaded
Thread Id: 52   Time: 2017-11-02 11:02:08.352   Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector    Method: init(ScriptedSQLConnector.java:161)     Level: OK       Message: 
Delete script loaded
Thread Id: 52   Time: 2017-11-02 11:02:08.649   Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector    Method: init(ScriptedSQLConnector.java:164)     Level: OK       Message: Search script loaded
Thread Id: 52   Time: 2017-11-02 11:02:08.650   Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector    Method: init(ScriptedSQLConnector.java:167)     Level: OK       Message: Sync script loaded
Thread Id: 52   Time: 2017-11-02 11:02:08.650   Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector    Method: init(ScriptedSQLConnector.java:170)     Level: OK       Message: Sync script loaded
Thread Id: 52   Time: 2017-11-02 11:02:08.651   Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector    Method: init(ScriptedSQLConnector.java:173)     Level: OK       Message: Test script loaded

#40 Updated by Ondřej Kopr over 3 years ago

  • Assignee changed from Ondřej Kopr to Vít Švanda
  • % Done changed from 100 to 90

#41 Updated by Vít Švanda over 3 years ago

  • Status changed from Closed to In Progress
  • Target version set to Forsterite (7.6.0)
Alena Peterová wrote: Then maybe the problem is in multiple calling the Init method.


It makes sense, because pooling for ConnId (in IC modul) was not never implemented.
I will try to simulating the problem and use the pooling for this case.

#42 Updated by Vít Švanda over 3 years ago

  • Related to Task #816: Implement connector pooling added

#43 Updated by Vít Švanda over 3 years ago

I did some tests and prototype for using pool in JDBC scripted connector.
When is pool used and connector facade are cached in IC module and checkbox 'reloadScriptOnExecution' is unchecked, then are GroovyScriptExecutors reusing (only GroovyScriptExecutor.execute is calling everytime).
I can`t now confirm that problem with memory leak will be definitally solved (I tested it only on 1000 accounts), but I think this is correct way.

I created task #816 for implement pool and cache in IC module. I will consulted with Zdeněk, when will be this task implemented.

#44 Updated by Vít Švanda over 3 years ago

  • Target version deleted (Forsterite (7.6.0))

#45 Updated by Vít Švanda about 3 years ago

  • Assignee changed from Vít Švanda to Ondřej Kopr

#46 Updated by Ondřej Kopr about 3 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

Thanks Vitek for give me this hot apple! :)

The ticket will be probably solved by task #816. Thank all of interested people for help and debug.

Also available in: Atom PDF

Go to top