Very high memory consumption #133

Closed
karlism opened this issue Aug 13, 2019 · 18 comments

Labels: bug (Something isn't working)

Comments

karlism (Contributor) commented Aug 13, 2019

Hello,

vmware_exporter seems to be consuming a very large amount of RAM on our systems. Here's an example of memory consumption on the VM that runs vmware_exporter; the sharp drops are where I restarted vmware_exporter manually or systemd did it for me:
[image: memory usage graph of the exporter VM]

vmware_exporter is installed from PyPI:

# pip3 list | grep vmware-exporter
vmware-exporter (0.9.8)

Running on the latest CentOS version:

# uname -a                   
Linux hostname 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
# uname -r
3.10.0-957.21.3.el7.x86_64

Metrics are collected from a single vCenter instance that manages 26 ESXi hosts with about 550 virtual machines.

Please let me know if I can provide any additional information.

pryorda (Owner) commented Aug 13, 2019 via email

karlism (Contributor, Author) commented Aug 14, 2019

@pryorda, metrics are scraped every 90 seconds from two Prometheus servers.

karlism (Contributor, Author) commented Aug 23, 2019

I've just noticed that we had a very old version of the Python prometheus-client module installed (0.0.19). I've updated it to 0.7.1 and will report back next week on whether this has solved the issue.
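
For completeness, the upgrade was just something along these lines (the exact pip invocation is an assumption on my part, not copied from the shell history):

# pip3 install --upgrade prometheus-client==0.7.1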

pryorda (Owner) commented Aug 23, 2019

@karlism Thanks for the update. I was going to ask if you could run this in a container and let me know.

pryorda added the bug label on Aug 23, 2019
pryorda (Owner) commented Aug 23, 2019

Here is our consumption on v0.9.0

[image: memory consumption graph on v0.9.0]

I'm going to bump it to the latest and see.

pryorda (Owner) commented Aug 23, 2019

Here is our consumption on v0.9.9

[image: memory consumption graph on v0.9.9]

Keep an eye on it and let me know.

karlism (Contributor, Author) commented Aug 28, 2019

Hello,

Unfortunately, it seems the problem has gone away on its own, even though nothing was changed on the server:
[image: memory usage graph]
The blue line marks the point when I updated the prometheus_client Python library to 0.7.1. The update also had no negative (or positive) effect on Prometheus scrape duration:
[image: Prometheus scrape duration graph]

So I'd say it is safe to remove the dependency on a specific prometheus_client library version here:
https://github.com/pryorda/vmware_exporter/blob/master/requirements.txt#L1

As for the original memory consumption issue, let's leave this ticket open for another week and see if the problem returns. I will report back as soon as I have any updates.
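
A hedged sketch of what relaxing that pin could look like (the exact version string currently on that line is an assumption on my part):

# requirements.txt, first line
# before:
prometheus-client==0.0.19
# after:
prometheus-client>=0.0.19,<1.0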

pryorda (Owner) commented Aug 28, 2019

Are you running in a container or installed at the OS level?

karlism (Contributor, Author) commented Aug 29, 2019

@pryorda, not a container, it's a VM running CentOS 7.

karlism (Contributor, Author) commented Sep 20, 2019

The issue has recurred:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                          
  1144 vmware-+  20   0 6081604   2.3g   1732 S 100.7 81.5   4597:46 vmware_exporter                                  
  1145 prometh+  20   0  116832  42172   4200 S  11.0  1.5   1517:48 blackbox_export                                  
 86930 prometh+  20   0  113832  28268   1432 S   9.3  1.0  72:22.47 snmp_exporter                                    
  1139 prometh+  20   0  115724  10192   2620 S   0.0  0.4  24:00.96 node_exporter                                    
   560 root      20   0   48720   4624   4112 S   0.0  0.2   9:39.88 systemd-journal    

$ ps aux | grep vmwa
vmware-+   1144 41.4 82.8 6114372 2398296 ?     Ssl  Sep12 4598:10 /usr/bin/python3.6 /usr/local/bin/vmware_exporter --config /etc/vmware_exporter/config.yml --loglevel WARNING

$ pip3 list installed
asn1crypto (0.24.0)
attrs (19.1.0)
Automat (0.7.0)
cffi (1.12.2)
chardet (3.0.4)
constantly (15.1.0)
cryptography (2.6.1)
hyperlink (19.0.0)
idna (2.7)
incremental (17.5.0)
pip (8.1.2)
prometheus-client (0.7.1)
pyasn1 (0.4.5)
pyasn1-modules (0.2.4)
pycparser (2.19)
PyHamcrest (1.9.0)
PySocks (1.6.8)
pytz (2019.1)
pyvmomi (6.7.1)
PyYAML (3.12)
requests (2.12.5)
service-identity (18.1.0)
setuptools (39.2.0)
six (1.11.0)
Twisted (19.2.0)
urllib3 (1.19.1)
vmware-exporter (0.9.8)
yamlconfig (0.3.1)
zope.interface (4.6.0)
# cat /etc/vmware_exporter/config.yml 
---
#
# Ansible managed
#

default:
    collect_only:
        datastores: true
        hosts: true
        snapshots: true
        vmguests: true
        vms: true
    ignore_ssl: true
    vsphere_host: vcenter_hostname
    vsphere_password: password
    vsphere_user: username
# free
              total        used        free      shared  buff/cache   available
Mem:        2895428     2709824       68456        2456      117148       21164
Swap:       2097148     2091328        5820
# systemctl restart vmware_exporter
# free
              total        used        free      shared  buff/cache   available
Mem:        2895428      232584     2514000        1976      148844     2482852
Swap:       2097148      209352     1887796

[images: memory usage graphs]

A couple of weeks ago the ESXi hosts and vCenter were updated to 6.7U3, to rule out the possibility that an old software version on that side was causing the issues.

karlism (Contributor, Author) commented Sep 20, 2019

I've also noticed errors in the vmware_exporter logs. This is after a service restart; other than these errors in the logs, the exporter seems to be working fine:

Sep 20 07:22:31 hostname vmware_exporter: 2019-09-20 07:22:31,205 ERROR:Traceback (most recent call last):
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
Sep 20 07:22:31 hostname vmware_exporter: result = g.send(result)
Sep 20 07:22:31 hostname vmware_exporter: StopIteration
Sep 20 07:22:31 hostname vmware_exporter: During handling of the above exception, another exception occurred:
Sep 20 07:22:31 hostname vmware_exporter: Traceback (most recent call last):
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
Sep 20 07:22:31 hostname vmware_exporter: result = g.send(result)
Sep 20 07:22:31 hostname vmware_exporter: StopIteration: [Metric(vmware_datastore_capacity_size, VMWare Datasore capacity in bytes, gauge, , [removed...]
Sep 20 07:22:31 hostname vmware_exporter: During handling of the above exception, another exception occurred:
Sep 20 07:22:31 hostname vmware_exporter: Traceback (most recent call last):
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 868, in _async_render_GET
Sep 20 07:22:31 hostname vmware_exporter: yield self.generate_latest_metrics(request)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
Sep 20 07:22:31 hostname vmware_exporter: result = g.send(result)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 915, in generate_latest_metrics
Sep 20 07:22:31 hostname vmware_exporter: request.write(output)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/server.py", line 238, in write
Sep 20 07:22:31 hostname vmware_exporter: http.Request.write(self, data)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/http.py", line 1118, in write
Sep 20 07:22:31 hostname vmware_exporter: self.channel.writeHeaders(version, code, reason, headers)
Sep 20 07:22:31 hostname vmware_exporter: AttributeError: 'NoneType' object has no attribute 'writeHeaders'
Sep 20 07:22:50 hostname vmware_exporter: Unhandled error in Deferred:
Sep 20 07:22:50 hostname vmware_exporter: Traceback (most recent call last):
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 501, in errback
Sep 20 07:22:50 hostname vmware_exporter: self._startRunCallbacks(fail)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
Sep 20 07:22:50 hostname vmware_exporter: self._runCallbacks()
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
Sep 20 07:22:50 hostname vmware_exporter: current.result = callback(current.result, *args, **kw)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
Sep 20 07:22:50 hostname vmware_exporter: _inlineCallbacks(r, g, status)
Sep 20 07:22:50 hostname vmware_exporter: --- <exception caught here> ---
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
Sep 20 07:22:50 hostname vmware_exporter: result = result.throwExceptionIntoGenerator(g)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
Sep 20 07:22:50 hostname vmware_exporter: return g.throw(self.type, self.value, self.tb)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 872, in _async_render_GET
Sep 20 07:22:50 hostname vmware_exporter: request.write(b'# Collection failed')
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/server.py", line 238, in write
Sep 20 07:22:50 hostname vmware_exporter: http.Request.write(self, data)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/http.py", line 1133, in write
Sep 20 07:22:50 hostname vmware_exporter: self.channel.writeSequence(toChunk(data))
Sep 20 07:22:50 hostname vmware_exporter: builtins.AttributeError: 'NoneType' object has no attribute 'writeSequence'

Jc2k (Collaborator) commented Sep 20, 2019

That error normally happens when the exporter tries to write to a socket that has already been closed - usually VMware was slow to respond, so Prometheus timed out its connection to the exporter. A longer timeout on the Prometheus side will help with this, but you might still see the errors if the VMware API has a slow blip.
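
Concretely, the knob on the Prometheus side is scrape_timeout on the scrape job; a minimal sketch, with a job name and values that are illustrative rather than taken from your config:

scrape_configs:
  - job_name: 'vmware_exporter'
    scrape_interval: 90s
    scrape_timeout: 60s          # must not exceed scrape_interval
    metrics_path: /metrics
    params:
      target: ['vcenter_hostname']
    static_configs:
      - targets: ['exporter:9272']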

karlism (Contributor, Author) commented Sep 20, 2019

@Jc2k, you're right, the scrape timeout for that job is set to 30 seconds and the scrape duration averages around 20 seconds for that environment:
[image: scrape duration graph]
I will increase scrape timeout, thanks!

karlism (Contributor, Author) commented Sep 23, 2019

I did some additional testing regarding the errors in the log files.

$ for i in `jot 10 1`; do echo -n "$i: ";  time curl -s 'http://exporter:9272/metrics?target=vcenter' | wc -l; sleep 5; done 
1:     7418
    0m04.85s real     0m00.04s user     0m00.05s system
2:     7418
    0m09.56s real     0m00.03s user     0m00.03s system
3:     7418
    0m08.47s real     0m00.05s user     0m00.01s system
4:     7418
    0m04.77s real     0m00.01s user     0m00.07s system
5:     7418
    0m04.12s real     0m00.02s user     0m00.05s system
6:     7418
    0m13.35s real     0m00.02s user     0m00.04s system
7:     7418
    0m05.33s real     0m00.01s user     0m00.03s system
8:     7418
    0m11.61s real     0m00.01s user     0m00.03s system
9:     7418
    0m13.31s real     0m00.03s user     0m00.02s system
10:     7418
    0m05.13s real     0m00.03s user     0m00.02s system

I also disabled vmguests metric collection, since doing so improved the scrape time quite a lot and we don't care about those metrics anyway, as the VMs are monitored by node_exporter.

# cat /etc/vmware_exporter/config.yml
---
#
# Ansible managed
#

default:
    collect_only:
        datastores: true
        hosts: true
        snapshots: true
        vmguests: false
        vms: true
    ignore_ssl: true
    vsphere_host: vcenter_hostname
    vsphere_password: password
    vsphere_user: username

While running the curl query loop, whenever the scrape time reached about 10 seconds we would get errors in the vmware_exporter log file. Other than the increased scrape time and the errors in the logs, it seems to be working fine. The VMware metric scrape jobs were disabled on the Prometheus servers during the tests so that they wouldn't interfere with the results.
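
If the closed-socket theory above is right, the same errors should also be reproducible on demand by making the client give up before the exporter finishes writing its response, e.g. (untested sketch; the 5-second limit is arbitrary):

$ curl -s --max-time 5 'http://exporter:9272/metrics?target=vcenter' | wc -l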

pryorda (Owner) commented Sep 26, 2019

To make sure I understand the issue:

  1. You run a curl loop.
  2. If the scrape time goes above 10 s, you get errors plus the memory leak.

If that is the case, we might be able to look at setting higher timeouts on the connection. However, how many VMs/hosts are you scraping? We average about 1-3 s per scrape with 30 hosts and 800 VMs. I'm also wondering whether your vCenter instance is undersized for your current environment.

karlism (Contributor, Author) commented Sep 30, 2019

@pryorda, there are currently two issues:

  1. A memory leak, which happens occasionally and is hard to reproduce; see this memory usage graph for the last 30 days:
     [image: 30-day memory usage graph]
  2. Errors in the log files when the scrape time exceeds ~10 seconds (even though the metrics are still returned properly).

I'm not entirely sure whether these two are related in any way.

As for the vCenter, it currently has 26 ESXi hosts across 6 different sites, with 587 VMs running on them. One thing to note is that some of the datacenters are quite far away, with latency reaching 180 ms to some of them.
The vCenter instance has 24 GB of RAM (reported memory usage is below 4 GB) and 8 CPUs (reported CPU usage is below 2 GHz), and nothing indicates that it doesn't have enough resources assigned. The vCenter web interface is responsive and all operations complete quickly.

We also have a lab vCenter instance, which is scraped by vmware_exporter with the same versions and configuration as the production one, and we do not experience any issues there at all. It's worth mentioning that this lab vCenter instance has only 2 ESXi hosts and 41 VMs, all located in one datacenter.

Can you please point me to where I can set higher timeouts in vmware_exporter for the vCenter connection?

Thanks!

pryorda (Owner) commented Oct 3, 2019

I think we will have to do a code fix to set the timeout.
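
Roughly the kind of change that would be involved; this is only a sketch, under the assumption that a global default socket timeout is honoured by the pyvmomi connection the exporter opens, not the actual patch:

# sketch: raise the network timeout before the pyvmomi connection is created
import socket
import ssl

from pyVim.connect import SmartConnect, Disconnect

# default timeout for any new socket, in seconds (value is illustrative)
socket.setdefaulttimeout(120)

# unverified context mirrors the exporter's ignore_ssl: true option
context = ssl._create_unverified_context()

si = SmartConnect(
    host='vcenter_hostname',   # placeholder values, as in the config above
    user='username',
    pwd='password',
    sslContext=context,
)
try:
    content = si.RetrieveContent()   # normal pyvmomi calls proceed as usual
finally:
    Disconnect(si)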

pryorda (Owner) commented Aug 17, 2020

Where can we access the recordings from the sessions?

pryorda closed this as completed Aug 17, 2020