Task #563
Memory management, possible leaks (Closed)
Added by Marcel Poul over 7 years ago. Updated over 6 years ago.
100%
Description
This ticket closely corresponds with ticket #561.
We made a workflow that resaves all entities and started it for identities. It crashed after about 200 entities.
Filip then made an LRT that also resaves all entities, but runs in 4 threads - not because we believed it could address our problem, but because we wanted to make the run quicker. It was quite a surprise that it managed all 5600 identities. But after the LRT finished, the Java memory usage stays constantly at 6 GB, even though IdM has been doing nothing all day. I guess the garbage collector has run several times since the end of the LRT.
So there are several possible problems with memory management in CzechIdM and it deserves some analytic work.
Files
result.png (27.6 KB) | Ondřej Kopr, 07/03/2017 09:22 AM
dominator_tree.png (176 KB) | Ondřej Kopr, 07/03/2017 09:27 AM
threads.png (172 KB) | Ondřej Kopr, 07/03/2017 09:28 AM
result_unreachable.png (27 KB) | Ondřej Kopr, 07/03/2017 09:51 AM
domitaro_tree_unreachable.png (162 KB) | Ondřej Kopr, 07/03/2017 09:53 AM
max_class_usage.png (126 KB) | Ondřej Kopr, 07/03/2017 09:54 AM
output_complete.png (348 KB) | Ondřej Kopr, 07/03/2017 01:34 PM
groovy_test_leak.java (1.51 KB) | Ondřej Kopr, 07/04/2017 06:43 AM
overview_sync.png (22.6 KB) | Ondřej Kopr, 07/04/2017 08:12 AM
histogram_sync_idm_objects.png (120 KB) | Ondřej Kopr, 07/04/2017 08:12 AM
dominator_sync.png (168 KB) | Ondřej Kopr, 07/04/2017 08:12 AM
histogram_sync.png (120 KB) | Ondřej Kopr, 07/04/2017 08:15 AM
postgres-caching.png (108 KB) | Jan Helbich, 07/12/2017 05:21 PM
munin_classes_loaded.png (34.1 KB) | Petr Fišer, 07/28/2017 12:03 PM
munin_heap_alloc.png (57.6 KB) | Petr Fišer, 07/28/2017 12:03 PM
munin_metaspace_peak.png (43 KB) | Petr Fišer, 07/28/2017 12:03 PM
munin_nonheap_alloc.png (39.4 KB) | Petr Fišer, 07/28/2017 12:03 PM
run_2h50m_top.png (56.5 KB) | Petr Fišer, 07/28/2017 12:03 PM
visualvm_heap.png (33.9 KB) | Petr Fišer, 07/28/2017 12:03 PM
visualvm_metaspace.png (30.6 KB) | Petr Fišer, 07/28/2017 12:03 PM
dominator_drops.png (46.6 KB) | Petr Fišer, 08/01/2017 12:57 PM
dominator_cake.png (84.7 KB) | Petr Fišer, 08/01/2017 12:57 PM
leakhunter_class_dominator.png (78 KB) | Petr Fišer, 08/01/2017 12:57 PM
leakhunter_accu_by_class.png (171 KB) | Petr Fišer, 08/01/2017 12:57 PM
leakhunter_obj_dominator.png (99.9 KB) | Petr Fišer, 08/01/2017 12:57 PM
leakhunter_overview.png (53.5 KB) | Petr Fišer, 08/01/2017 12:57 PM
leakhunter_path_to_accupoint.png (53.1 KB) | Petr Fišer, 08/01/2017 12:57 PM
classes_loaded_result.png (122 KB) | Ondřej Kopr, 08/08/2017 06:22 AM
Related issues
Updated by Marcel Poul over 7 years ago
- Related to Task #561: resave user performance added
Updated by Marcel Poul over 7 years ago
- Subject changed from Memory management, possilbe leaks to Memory management, possible leaks
Updated by Ondřej Kopr over 7 years ago
- File result.png added
- File dominator_tree.png added
- File threads.png added
Updated by Radek Tomiška over 7 years ago
Memory on project machine:
       total   used   free   shared   buff/cache   available
Mem:    7633   6341    139      568         1151         450
After dumping the memory with
jmap -dump:format=b,file=/tmp/aopk_heap.bin <PID>
the file has ~2.3 GB. After analyzing it in Eclipse Memory Analyzer I get this result (see pic result), memory usage 870.4 MB:
- Remainder: 491.9 MB
- org.codehaus.groovy.reflection.ClassInfo: 141.3 MB
- java.beans.ThreadGroupContext: 128.9 MB
- java.lang.ref.Finalizer: 108.2 MB
As a share of the heap:
- org.codehaus.groovy.reflection.ClassInfo: 16.24%
- java.beans.ThreadGroupContext: 14.81%
- java.lang.ref.Finalizer: 12.44%
Updated by Ondřej Kopr over 7 years ago
- File result_unreachable.png added
- File domitaro_tree_unreachable.png added
- File max_class_usage.png added
Comparison with unreachable objects:
pics: result_unreachable, domitaro_tree_unreachable, max_class_usage
Updated by Ondřej Kopr over 7 years ago
After searching the open tickets in the Groovy project, I found this: https://issues.apache.org/jira/browse/GROOVY-6704
I have created the same test as in that ticket: after 30 minutes of running the unit test, Java still consumes only about ~200 MB after 7000 iterations, and it is still continuing.
Updated by Ondřej Kopr over 7 years ago
- File output_complete.png added
Output from jemalloc
Updated by Ondřej Kopr over 7 years ago
- File groovy_test_leak.java added
Test for the memory leak in a Groovy script - the test did not find any errors or leaks.
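The attached groovy_test_leak.java is not reproduced here; the sketch below is only an assumption of what such a test looks like, modeled on the GROOVY-6704 reproducer: evaluating a script through a fresh GroovyShell in a long loop while watching heap usage. If ClassInfo entries leaked, the reported usage would climb steadily instead of levelling off after GC.

import groovy.lang.GroovyShell;

public class GroovyLeakTest {

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            // every evaluation compiles the script into a brand new class, so leaked
            // ClassInfo/MetaClass entries would show up as steadily growing memory
            Object result = new GroovyShell().evaluate("1 + " + i);

            if (i % 1_000 == 0) {
                Runtime rt = Runtime.getRuntime();
                long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                System.out.printf("iteration %d, result %s, heap used ~%d MB%n", i, result, usedMb);
            }
        }
    }
}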
Updated by Ondřej Kopr over 7 years ago
- File overview_sync.png added
- File histogram_sync_idm_objects.png added
- File dominator_sync.png added
- Status changed from New to Needs feedback
- Priority changed from High to Normal
- % Done changed from 0 to 90
During synchronization I dumped the memory again and verified the behavior:
Memory during synchronization:

       total   used   free   shared   buff/cache   available
Mem:    7633   4918    133      568         2582        1858
Swap:      0      0      0

For detailed info see the pics.
Updated by Ondřej Kopr over 7 years ago
- File histogram_sync.png added
Updated by Marcel Poul over 7 years ago
Ondřej Kopr wrote:
During synchronization I dumped the memory again and verified the behavior:
Memory during synchronization:
[...] For detailed info see the pics.
I recommend getting the workflow that caused the out-of-memory error from Filip, running it again, and seeing why it crashed, if it happens again.
Updated by Ondřej Kopr over 7 years ago
- Status changed from Needs feedback to Closed
- % Done changed from 90 to 100
Updated by Jan Helbich over 7 years ago
- File postgres-caching.png added
I think there might be something wrong with the Postgres configuration.
In a 10-thread resave entity task it seems to consume even more memory than configured - see the attached pic.
Updated by Jan Helbich over 7 years ago
- Assignee changed from Ondřej Kopr to Petr Fišer
Postgres memory seems to be holding at the reported level. Configuration:
max_connections = 100
shared_buffers = 256MB
effective_cache_size = 768MB
work_mem = 2621kB
maintenance_work_mem = 64MB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 7864kB
default_statistics_target = 100
It seems like every postmaster process has its own shared_buffers cache or something like that.
Or am I wrong here?
With the described settings the database actually takes over 3.5 GB of memory, which it is definitely not supposed to...
Petr, can you take a look at this? Thx
Updated by Radek Tomiška over 7 years ago
- Status changed from Closed to In Progress
- % Done changed from 100 to 70
Updated by Jan Helbich over 7 years ago
I've tried limiting the metaspace to 1 GB -> it resulted in an OutOfMemoryError.
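(For reference: a 1 GB metaspace cap on HotSpot is typically set with a JVM option along the lines of -XX:MaxMetaspaceSize=1g; the exact flag and value used in this test are an assumption.)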
Updated by Petr Fišer over 7 years ago
- Assignee changed from Petr Fišer to Jan Helbich
Jan Helbich wrote:
Postgres memory seems to be holding at the reported level. Configuration:
[...] It seems like every postmaster process has its own shared_buffers cache or something like that.
I wouldn't expect Postgres to be the culprit here. Databases usually have these things very thoroughly tested. It is possible, but not very probable.
The total sum of shared_buffers does not make sense. As the name suggests, this part of memory is shared between the processes, so there is only one instance of it. Because the "top" tool is process-based, you see the memory listed at every postgres process - it is true, after all, that every postgres process has the shared_buffers mapped into its memory range. But physically, there is only one "instance" of shared_buffers in memory.
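A rough worked example of that double counting (the process count is illustrative, not measured here): with shared_buffers = 256MB and, say, 14 resident postgres backend processes, a per-process view in top can show up to 14 × 256 MB ≈ 3.5 GB of apparent usage, while physically there is only one 256 MB shared segment plus each backend's private memory (work_mem allocations and so on).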
Looking at jeprof output from Ondra: https://redmine.czechidm.com/attachments/download/69/output_complete.png
The strange thing here is so much allocation in JVM_FindSignal; I have never seen an application with massive allocations there. I also managed to google a couple of bugs related to JVM_FindSignal (most of them would probably be regression bugs) in the JVM itself. This would also explain why limiting the heap and/or metaspace has no effect.
Currently we are running on the OS-packaged version of the JDK, which means OpenJDK. The first thing I would do is switch to OracleJDK and see if it behaves the same way.
Updated by Jan Helbich over 7 years ago
- Assignee changed from Jan Helbich to Petr Fišer
Updated by Petr Fišer over 7 years ago
- File munin_classes_loaded.png added
- File munin_heap_alloc.png added
- File munin_metaspace_peak.png added
- File munin_nonheap_alloc.png added
- File run_2h50m_top.png added
- File visualvm_heap.png added
- File visualvm_metaspace.png added
I got VisualVM up and running on the remote machine and traced what is happening. I also added "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/idm_oomdump.dmp" to the JAVA_OPTS, so we will eventually have a dump from the time of the crash.
I have the "resave all" scheduled tasks running on roles and identities. I started the "resave all roles" again every time it finished (it is on about its 5th run now). The "resave all identities" is still going through its first run.
It looks like we are leaking classes somewhere. I started with roughly 60k loaded classes in the morning.
By now, there are about 130k loaded classes. Almost none (~5k) got unloaded.
The heap seems to be holding fine - the JVM tries really hard to make it fit into the -Xmx3500M limit we gave it. On the other hand, the metaspace grows steadily - which makes sense if we really are leaking classes. The Munin monitoring, installed by Zdenda, confirms this.
The attached screenshots contain more info. Especially note the correlation between non-heap memory and metaspace usage.
Now we know what is happening, but I do not know the culprit yet.
I am going to let the JVM hit OOM so it produces a dump. I hope I will find out more from that dump.
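For anyone who wants to watch the class counts without VisualVM or Munin, a minimal sketch using the JDK's ClassLoadingMXBean could look like the following; it would have to run inside the IdM JVM (e.g. as a small scheduled task) or be adapted to a remote JMX connection.

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassCountProbe {

    public static void main(String[] args) throws InterruptedException {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        while (true) {
            // a steadily growing "loaded" count with almost no unloads is the same
            // symptom observed here (60k -> 130k loaded classes, only ~5k unloaded)
            System.out.printf("loaded=%d totalLoaded=%d unloaded=%d%n",
                    classes.getLoadedClassCount(),
                    classes.getTotalLoadedClassCount(),
                    classes.getUnloadedClassCount());
            Thread.sleep(60_000L);
        }
    }
}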
Updated by Vít Švanda over 7 years ago
Maybe the problem is still in the Groovy scripts (as Ondra wrote). We have at least three places where we use Groovy scripts:
- GroovyScriptService - but a cache for identical scripts is now used here.
- Workflow - for example, the process for changing user permissions uses many Groovy scripts and there is no cache for identical scripts there. In the standard situation no workflow is called after saving an identity (on AOPK maybe there is).
- JDBC scripted connector - this is my favorite, because there is no cache for scripts here either, and the connector is used after an identity is saved.
Updated by Petr Fišer over 7 years ago
Could be, thanks for the hints. My personal tip is the JAXB we use for deserializing XMLs. JAXB itself uses WeakHashMap and similar structures. Those, when handled improperly, can very easily create a leak.
But first I need to collect the dump - IdM has been chugging memory away almost the whole day and I do not want to throw the invested time away. :)
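To illustrate the "handled improperly" case (a generic sketch, not a claim about what JAXB actually does): if the values stored in a WeakHashMap hold strong references back to their keys, the weakly referenced keys can never be collected and every cached class stays pinned.

import java.util.Map;
import java.util.WeakHashMap;

public class WeakMapPinningSketch {

    // value type that (incorrectly) keeps a strong reference to its key
    static class CachedInfo {
        final Class<?> type; // strong reference back to the key defeats the WeakHashMap
        CachedInfo(Class<?> type) {
            this.type = type;
        }
    }

    // keys should become collectable once a class is otherwise unreachable,
    // but each value strongly references its key, so entries are never cleared
    private static final Map<Class<?>, CachedInfo> CACHE = new WeakHashMap<>();

    static void remember(Class<?> type) {
        CACHE.put(type, new CachedInfo(type));
    }
}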
Updated by Ondřej Kopr over 7 years ago
After some discussion with colleagues, we think the problem may be in the script connector. There is a cache for our scripts (from Filip), but not for the script connector. Please try deactivating the system with the script connector.
Updated by Petr Fišer over 7 years ago
- File dominator_cake.png added
- File dominator_drops.png added
- File leakhunter_accu_by_class.png added
- File leakhunter_class_dominator.png added
- File leakhunter_obj_dominator.png added
- File leakhunter_overview.png added
- File leakhunter_path_to_accupoint.png added
Here is a somewhat better view of the issue. I started IdM with ~30k loaded classes, then ran "resave all identities" and "resave all roles" until the count of loaded classes was about ~60k - this means potentially one half of the 60k classes leaked. Then I examined the heap dump with the leak hunter from Eclipse Memory Analyzer.
It identified one issue which occupies about one half of the heap (so this neatly corresponds with "one half of loaded classes leaked"). See leakhunter_overview.png.
So there is obviously a problem with org.codehaus.groovy.reflection.ClassInfo and (as the other screenshots show) with groovy.lang.MetaClassImpl classes. This really points to classes loaded through Groovy never getting unloaded.
Updated by Ondřej Kopr over 7 years ago
- Assignee changed from Petr Fišer to Ondřej Kopr
Updated by Ondřej Kopr over 7 years ago
- File classes_loaded_result.png added
- % Done changed from 70 to 90
Possibly a final verdict: as you can see in the picture (classes_loaded_result.png), after disabling the scripted connector, or changing this connector to a table connector, noticeably fewer classes are loaded than when we use the scripted connector. So the problem is definitely in the scripted connector; we tried to find a solution, but we have not come up with anything yet.
Updated by Ondřej Kopr over 7 years ago
- Assignee changed from Ondřej Kopr to Jan Helbich
Updated by Jan Helbich about 7 years ago
- Assignee changed from Jan Helbich to Ondřej Kopr
Updated by Ondřej Kopr about 7 years ago
- Status changed from In Progress to Closed
- % Done changed from 90 to 100
I got a hint from Alca about the connector configuration reloadScriptOnExecution: when this configuration is set to true, all scripts are loaded and compiled every time the connector calls a script - but on the project this configuration property is set to false :(
There is also a new version of this connector, 2.2.5, but that version only adds the connector configuration to the script and some other minor fixes.
Found some discussion at https://stackoverflow.com/questions/36407119/groovyshell-in-java8-memory-leak-duplicated-classes-src-code-load-test-pr
In our script evaluator a similar cache already exists, but in https://github.com/Evolveum/openicf/blob/release1.4.2.0/framework/java/connector-framework-internal/src/main/java/org/identityconnectors/common/script/groovy/GroovyScriptExecutorFactory.java#L67-L71 it doesn't.
If we forked connector-framework-internal, it would be possible to update this class with a cache similar to the one in our DefaultGroovyScriptService (a rough sketch follows).
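A sketch of what such a cache could look like after a fork (this is not the actual DefaultGroovyScriptService or connector-framework code, just an illustration of compiling each distinct script once and reusing the compiled class):

import groovy.lang.Binding;
import groovy.lang.GroovyShell;
import groovy.lang.Script;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedGroovyExecutor {

    // compiled script classes keyed by their source text, so each distinct
    // script is compiled (and its class loaded) only once
    private final Map<String, Class<? extends Script>> cache = new ConcurrentHashMap<>();
    private final GroovyShell shell = new GroovyShell();

    public Object execute(String scriptText, Map<String, Object> variables) throws Exception {
        Class<? extends Script> scriptClass =
                cache.computeIfAbsent(scriptText, src -> shell.parse(src).getClass());
        // a new Script instance per call keeps execution thread-safe,
        // while the compiled class itself is shared and stays loaded only once
        Script script = scriptClass.getDeclaredConstructor().newInstance();
        script.setBinding(new Binding(variables));
        return script.run();
    }
}

The key point is that a given script text maps to exactly one loaded class; without such a cache every execution loads a fresh class, which is exactly the metaspace/ClassInfo growth observed above.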
For now I recommend running the script connector on a remote connector server, because if the connector server is shut down by the memory leak, it is less of a problem than IdM shutting down.
I will write a summary of this ticket into the documentation and for now I am closing this ticket (forking connector-framework-internal can be done in another ticket).
documentation: https://wiki.czechidm.com/devel/dev/system/supported-connectors#probable_memory_leak_in_scripted_sql_connector
Updated by Marcel Poul about 7 years ago
Ondřej Kopr wrote:
I got a hint from Alca about the connector configuration reloadScriptOnExecution: when this configuration is set to true, all scripts are loaded and compiled every time the connector calls a script - but on the project this configuration property is set to false :(
So should we do something on the projects? If so, please discuss it with the relevant people...
Thx
Updated by Ondřej Kopr about 7 years ago
Marcel Poul wrote:
Ondřej Kopr wrote:
I got a hint from Alca about the connector configuration reloadScriptOnExecution: when this configuration is set to true, all scripts are loaded and compiled every time the connector calls a script - but on the project this configuration property is set to false :(
So should we do something on the projects? If so, please discuss it with the relevant people...
Thx
This configuration property is turned off by default, so it is not necessary to set anything on any project.
Updated by Alena Peterová about 7 years ago
Ondřej Kopr wrote:
This configuration property is turned off by default, so it is not necessary to set anything on any project.
Yes, it is turned off, but does it really work? In the log it looked like the scripts were compiled every time. Definitely when we made changes to the Groovy scripts, they were loaded immediately.
Updated by Ondřej Kopr about 7 years ago
Alena Peterová wrote:
Ondřej Kopr wrote:
This configuration property is turned off by default, so it is not necessary to set anything on any project.
Yes, it is turned off, but does it really work? In the log it looked like the scripts were compiled every time. Definitely when we made changes to the Groovy scripts, they were loaded immediately.
On my local build I set up a new Scripted SQL Connector, created all the necessary scripts (create, update and search) and debugged this class: https://github.com/Tirasa/ConnIdDBBundle/blob/db-2.2.4/scriptedsql/src/main/java/net/tirasa/connid/bundles/db/scriptedsql/ScriptedSQLConnector.java
and checked every place that has an if statement similar to this:

if (config.isReloadScriptOnExecution()) {
    searchExecutor = getScriptExecutor(config.getSearchScript(), config.getSearchScriptFileName());
    LOG.ok("Search script loaded");
}

If this configuration property is turned off, the body of this if statement is never reached, so the property works as intended.
Updated by Alena Peterová about 7 years ago
Then maybe the problem is in the init method being called multiple times:

Thread Id: 52 Time: 2017-11-02 11:02:07.797 Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector Method: init(ScriptedSQLConnector.java:155) Level: OK Message: Create script loaded
Thread Id: 52 Time: 2017-11-02 11:02:08.207 Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector Method: init(ScriptedSQLConnector.java:158) Level: OK Message: Update script loaded
Thread Id: 52 Time: 2017-11-02 11:02:08.352 Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector Method: init(ScriptedSQLConnector.java:161) Level: OK Message: Delete script loaded
Thread Id: 52 Time: 2017-11-02 11:02:08.649 Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector Method: init(ScriptedSQLConnector.java:164) Level: OK Message: Search script loaded
Thread Id: 52 Time: 2017-11-02 11:02:08.650 Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector Method: init(ScriptedSQLConnector.java:167) Level: OK Message: Sync script loaded
Thread Id: 52 Time: 2017-11-02 11:02:08.650 Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector Method: init(ScriptedSQLConnector.java:170) Level: OK Message: Sync script loaded
Thread Id: 52 Time: 2017-11-02 11:02:08.651 Class: net.tirasa.connid.bundles.db.scriptedsql.ScriptedSQLConnector Method: init(ScriptedSQLConnector.java:173) Level: OK Message: Test script loaded
Updated by Ondřej Kopr about 7 years ago
- Assignee changed from Ondřej Kopr to Vít Švanda
- % Done changed from 100 to 90
Updated by Vít Švanda about 7 years ago
- Status changed from Closed to In Progress
- Target version set to Forsterite (7.6.0)
Alena Peterová wrote: Then maybe the problem is in the init method being called multiple times.
It makes sense, because pooling for ConnId (in the IC module) was never implemented.
I will try to simulate the problem and use pooling for this case.
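A rough sketch of what enabling the ConnId connector pool looks like at the framework level (the surrounding IdM/IC-module wiring is omitted, the values are placeholders, and pooling only takes effect for connectors implementing PoolableConnector):

import org.identityconnectors.common.pooling.ObjectPoolConfiguration;
import org.identityconnectors.framework.api.APIConfiguration;
import org.identityconnectors.framework.api.ConnectorFacade;
import org.identityconnectors.framework.api.ConnectorFacadeFactory;

public class PooledFacadeSketch {

    // apiConfig would come from ConnectorInfo.createDefaultAPIConfiguration() elsewhere
    public static ConnectorFacade buildPooledFacade(APIConfiguration apiConfig) {
        ObjectPoolConfiguration pool = apiConfig.getConnectorPoolConfiguration();
        pool.setMaxObjects(10); // placeholder pool size
        pool.setMinIdle(1);
        // reusing one facade (and its pooled connector instances) avoids
        // re-initializing the connector - and recompiling its scripts - on every call
        return ConnectorFacadeFactory.getInstance().newInstance(apiConfig);
    }
}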
Updated by Vít Švanda about 7 years ago
- Related to Task #816: Implement connector pooling added
Updated by Vít Švanda about 7 years ago
I did some tests and a prototype of using a pool in the JDBC scripted connector.
When the pool is used, the connector facades are cached in the IC module and the checkbox 'reloadScriptOnExecution' is unchecked, the GroovyScriptExecutors are reused (only GroovyScriptExecutor.execute is called every time).
I can't yet confirm that the memory leak problem will definitely be solved (I tested it only on 1000 accounts), but I think this is the correct way.
I created task #816 for implementing the pool and cache in the IC module. I will consult with Zdeněk about when this task will be implemented.
Updated by Vít Švanda about 7 years ago
- Target version deleted (Forsterite (7.6.0))
Updated by Vít Švanda over 6 years ago
- Assignee changed from Vít Švanda to Ondřej Kopr
Updated by Ondřej Kopr over 6 years ago
- Status changed from In Progress to Closed
- % Done changed from 90 to 100
Thanks, Vítek, for giving me this hot apple! :)
The ticket will probably be solved by task #816. Thanks to everyone involved for the help and debugging.