Very high memory consumption #133

Closed
karlism opened this issue Aug 13, 2019 · 18 comments

Labels: bug (Something isn't working)

Comments

karlism (Contributor) commented Aug 13, 2019

Hello,

vmware_exporter seems to be consuming a very large amount of RAM on our systems. Here's an example of memory consumption on the VM that runs vmware_exporter; the sharp drops are where I restarted vmware_exporter manually or systemd did it for me:
[image: memory usage graph of the exporter VM]

vmware_exporter is installed from PyPI:

# pip3 list | grep vmware-exporter
vmware-exporter (0.9.8)

Running on the latest CentOS version:

# uname -a                   
Linux hostname 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
# uname -r
3.10.0-957.21.3.el7.x86_64

Metrics are collected from a single vCenter instance that manages 26 ESXi hosts with about 550 virtual machines.

Please let me know if I can provide any additional information.

pryorda (Owner) commented Aug 13, 2019 via email

karlism (Contributor, Author) commented Aug 14, 2019

@pryorda, metrics are scraped every 90 seconds from two Prometheus servers.

karlism (Contributor, Author) commented Aug 23, 2019

I've just noticed that we had a very old version of the Python prometheus-client module installed (0.0.19). I've updated it to 0.7.1 and will report back next week on whether this has solved the issue.
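
For completeness, the upgrade was just something along these lines (the exact pip invocation is an assumption on my part, not copied from the shell history):

# pip3 install --upgrade prometheus-client==0.7.1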

pryorda (Owner) commented Aug 23, 2019

@karlism Thanks for the update. I was going to ask if you could run this in a container and let me know.

pryorda added the bug label on Aug 23, 2019
pryorda (Owner) commented Aug 23, 2019

Here is our consumption on v0.9.0

[image: memory consumption graph on v0.9.0]

I'm going to bump it to the latest and see.

pryorda (Owner) commented Aug 23, 2019

Here is our consumption on v0.9.9

[image: memory consumption graph on v0.9.9]

Keep an eye on it and let me know.

karlism (Contributor, Author) commented Aug 28, 2019

Hello,

Unfortunately, it seems the problem has gone away on its own, even though nothing was changed on the server:
[image: memory usage graph]
The blue line marks the point when I updated the prometheus_client Python library to 0.7.1. The update also had no negative (or positive) effect on Prometheus scrape duration:
[image: Prometheus scrape duration graph]

So I'd say it is safe to remove the dependency on a specific prometheus_client library version here:
https://github.com/pryorda/vmware_exporter/blob/master/requirements.txt#L1

As for the original memory consumption issue, let's leave this ticket open for another week and see if the problem returns. I will report back as soon as I have any updates.
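
A hedged sketch of what relaxing that pin could look like (the exact version string currently on that line is an assumption on my part):

# requirements.txt, first line
# before:
prometheus-client==0.0.19
# after:
prometheus-client>=0.0.19,<1.0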

pryorda (Owner) commented Aug 28, 2019

Are you running in a container or installed at the OS level?

karlism (Contributor, Author) commented Aug 29, 2019

@pryorda, not a container, it's a VM running CentOS 7.

karlism (Contributor, Author) commented Sep 20, 2019

The issue has recurred:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                          
  1144 vmware-+  20   0 6081604   2.3g   1732 S 100.7 81.5   4597:46 vmware_exporter                                  
  1145 prometh+  20   0  116832  42172   4200 S  11.0  1.5   1517:48 blackbox_export                                  
 86930 prometh+  20   0  113832  28268   1432 S   9.3  1.0  72:22.47 snmp_exporter                                    
  1139 prometh+  20   0  115724  10192   2620 S   0.0  0.4  24:00.96 node_exporter                                    
   560 root      20   0   48720   4624   4112 S   0.0  0.2   9:39.88 systemd-journal    

$ ps aux | grep vmwa
vmware-+   1144 41.4 82.8 6114372 2398296 ?     Ssl  Sep12 4598:10 /usr/bin/python3.6 /usr/local/bin/vmware_exporter --config /etc/vmware_exporter/config.yml --loglevel WARNING

$ pip3 list installed
asn1crypto (0.24.0)
attrs (19.1.0)
Automat (0.7.0)
cffi (1.12.2)
chardet (3.0.4)
constantly (15.1.0)
cryptography (2.6.1)
hyperlink (19.0.0)
idna (2.7)
incremental (17.5.0)
pip (8.1.2)
prometheus-client (0.7.1)
pyasn1 (0.4.5)
pyasn1-modules (0.2.4)
pycparser (2.19)
PyHamcrest (1.9.0)
PySocks (1.6.8)
pytz (2019.1)
pyvmomi (6.7.1)
PyYAML (3.12)
requests (2.12.5)
service-identity (18.1.0)
setuptools (39.2.0)
six (1.11.0)
Twisted (19.2.0)
urllib3 (1.19.1)
vmware-exporter (0.9.8)
yamlconfig (0.3.1)
zope.interface (4.6.0)
# cat /etc/vmware_exporter/config.yml 
---
#
# Ansible managed
#

default:
    collect_only:
        datastores: true
        hosts: true
        snapshots: true
        vmguests: true
        vms: true
    ignore_ssl: true
    vsphere_host: vcenter_hostname
    vsphere_password: password
    vsphere_user: username
# free
              total        used        free      shared  buff/cache   available
Mem:        2895428     2709824       68456        2456      117148       21164
Swap:       2097148     2091328        5820
# systemctl restart vmware_exporter
# free
              total        used        free      shared  buff/cache   available
Mem:        2895428      232584     2514000        1976      148844     2482852
Swap:       2097148      209352     1887796

[images: memory usage graphs]

A couple of weeks ago the ESXi hosts and vCenter were updated to 6.7U3, to rule out the possibility that an old software version on that side was causing the issues.

karlism (Contributor, Author) commented Sep 20, 2019

I've also noticed errors in the vmware_exporter logs. This is after a service restart; other than these errors in the logs, the exporter seems to be working fine:

Sep 20 07:22:31 hostname vmware_exporter: 2019-09-20 07:22:31,205 ERROR:Traceback (most recent call last):
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
Sep 20 07:22:31 hostname vmware_exporter: result = g.send(result)
Sep 20 07:22:31 hostname vmware_exporter: StopIteration
Sep 20 07:22:31 hostname vmware_exporter: During handling of the above exception, another exception occurred:
Sep 20 07:22:31 hostname vmware_exporter: Traceback (most recent call last):
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
Sep 20 07:22:31 hostname vmware_exporter: result = g.send(result)
Sep 20 07:22:31 hostname vmware_exporter: StopIteration: [Metric(vmware_datastore_capacity_size, VMWare Datasore capacity in bytes, gauge, , [removed...]
Sep 20 07:22:31 hostname vmware_exporter: During handling of the above exception, another exception occurred:
Sep 20 07:22:31 hostname vmware_exporter: Traceback (most recent call last):
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 868, in _async_render_GET
Sep 20 07:22:31 hostname vmware_exporter: yield self.generate_latest_metrics(request)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
Sep 20 07:22:31 hostname vmware_exporter: result = g.send(result)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 915, in generate_latest_metrics
Sep 20 07:22:31 hostname vmware_exporter: request.write(output)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/server.py", line 238, in write
Sep 20 07:22:31 hostname vmware_exporter: http.Request.write(self, data)
Sep 20 07:22:31 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/http.py", line 1118, in write
Sep 20 07:22:31 hostname vmware_exporter: self.channel.writeHeaders(version, code, reason, headers)
Sep 20 07:22:31 hostname vmware_exporter: AttributeError: 'NoneType' object has no attribute 'writeHeaders'
Sep 20 07:22:50 hostname vmware_exporter: Unhandled error in Deferred:
Sep 20 07:22:50 hostname vmware_exporter: Traceback (most recent call last):
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 501, in errback
Sep 20 07:22:50 hostname vmware_exporter: self._startRunCallbacks(fail)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
Sep 20 07:22:50 hostname vmware_exporter: self._runCallbacks()
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
Sep 20 07:22:50 hostname vmware_exporter: current.result = callback(current.result, *args, **kw)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
Sep 20 07:22:50 hostname vmware_exporter: _inlineCallbacks(r, g, status)
Sep 20 07:22:50 hostname vmware_exporter: --- <exception caught here> ---
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
Sep 20 07:22:50 hostname vmware_exporter: result = result.throwExceptionIntoGenerator(g)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
Sep 20 07:22:50 hostname vmware_exporter: return g.throw(self.type, self.value, self.tb)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 872, in _async_render_GET
Sep 20 07:22:50 hostname vmware_exporter: request.write(b'# Collection failed')
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/server.py", line 238, in write
Sep 20 07:22:50 hostname vmware_exporter: http.Request.write(self, data)
Sep 20 07:22:50 hostname vmware_exporter: File "/usr/local/lib64/python3.6/site-packages/twisted/web/http.py", line 1133, in write
Sep 20 07:22:50 hostname vmware_exporter: self.channel.writeSequence(toChunk(data))
Sep 20 07:22:50 hostname vmware_exporter: builtins.AttributeError: 'NoneType' object has no attribute 'writeSequence'

Jc2k (Collaborator) commented Sep 20, 2019

That error normally happens when the exporter tries to write to a socket that has already been closed - usually VMware was slow to respond, so Prometheus timed out its connection to the exporter. A longer timeout on the Prometheus side will help with this, but you might still see the errors if the VMware API has a slow blip.
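
Concretely, the knob on the Prometheus side is scrape_timeout on the scrape job; a minimal sketch, with a job name and values that are illustrative rather than taken from your config:

scrape_configs:
  - job_name: 'vmware_exporter'
    scrape_interval: 90s
    scrape_timeout: 60s          # must not exceed scrape_interval
    metrics_path: /metrics
    params:
      target: ['vcenter_hostname']
    static_configs:
      - targets: ['exporter:9272']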

karlism (Contributor, Author) commented Sep 20, 2019

@Jc2k, you're right, the scrape timeout for that job is set to 30 seconds and the scrape duration averages around 20 seconds for that environment:
[image: scrape duration graph]
I will increase scrape timeout, thanks!

karlism (Contributor, Author) commented Sep 23, 2019

I did some additional testing regarding the errors in the log files.

$ for i in `jot 10 1`; do echo -n "$i: ";  time curl -s 'http://exporter:9272/metrics?target=vcenter' | wc -l; sleep 5; done 
1:     7418
    0m04.85s real     0m00.04s user     0m00.05s system
2:     7418
    0m09.56s real     0m00.03s user     0m00.03s system
3:     7418
    0m08.47s real     0m00.05s user     0m00.01s system
4:     7418
    0m04.77s real     0m00.01s user     0m00.07s system
5:     7418
    0m04.12s real     0m00.02s user     0m00.05s system
6:     7418
    0m13.35s real     0m00.02s user     0m00.04s system
7:     7418
    0m05.33s real     0m00.01s user     0m00.03s system
8:     7418
    0m11.61s real     0m00.01s user     0m00.03s system
9:     7418
    0m13.31s real     0m00.03s user     0m00.02s system
10:     7418
    0m05.13s real     0m00.03s user     0m00.02s system

I also disabled vmguests metric collection, since doing so improved the scrape time quite a lot and we don't care about those metrics anyway, as the VMs are monitored by node_exporter.

# cat /etc/vmware_exporter/config.yml
---
#
# Ansible managed
#

default:
    collect_only:
        datastores: true
        hosts: true
        snapshots: true
        vmguests: false
        vms: true
    ignore_ssl: true
    vsphere_host: vcenter_hostname
    vsphere_password: password
    vsphere_user: username

While running the curl query loop, whenever the scrape time reached about 10 seconds we would get errors in the vmware_exporter log file. Other than the increased scrape time and the errors in the logs, it seems to be working fine. The VMware metric scrape jobs were disabled on the Prometheus servers during the tests so that they wouldn't interfere with the results.
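
If the closed-socket theory above is right, the same errors should also be reproducible on demand by making the client give up before the exporter finishes writing its response, e.g. (untested sketch; the 5-second limit is arbitrary):

$ curl -s --max-time 5 'http://exporter:9272/metrics?target=vcenter' | wc -l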

pryorda (Owner) commented Sep 26, 2019

To make sure I understand the issue:

  1. You run a curl loop.
  2. If the scrape time goes above 10 s, you get errors plus the memory leak.

If that is the case, we might be able to look at setting higher timeouts on the connection. However, how many VMs/hosts are you scraping? We average about 1-3 s per scrape with 30 hosts and 800 VMs. I'm also wondering whether your vCenter instance is undersized for your current environment.

karlism (Contributor, Author) commented Sep 30, 2019

@pryorda, there are currently two issues:

  1. A memory leak, which happens occasionally and is hard to reproduce; see this memory usage graph for the last 30 days:
     [image: 30-day memory usage graph]
  2. Errors in the log files when the scrape time exceeds ~10 seconds (even though the metrics are still returned properly).

I'm not entirely sure whether these two are related in any way.

As for the vCenter, it currently has 26 ESXi hosts across 6 different sites, with 587 VMs running on them. One thing to note is that some of the datacenters are quite far away, with latency reaching 180 ms to some of them.
The vCenter instance has 24 GB of RAM (reported memory usage is below 4 GB) and 8 CPUs (reported CPU usage is below 2 GHz), and nothing indicates that it doesn't have enough resources assigned. The vCenter web interface is responsive and all operations complete quickly.

We also have a lab vCenter instance, which is scraped by vmware_exporter with the same versions and configuration as the production one, and we do not experience any issues there at all. It's worth mentioning that this lab vCenter instance has only 2 ESXi hosts and 41 VMs, all located in one datacenter.

Can you please point me to where I can set higher timeouts in vmware_exporter for the vCenter connection?

Thanks!

pryorda (Owner) commented Oct 3, 2019

I think we will have to do a code fix to set the timeout.
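
Roughly the kind of change that would be involved; this is only a sketch, under the assumption that a global default socket timeout is honoured by the pyvmomi connection the exporter opens, not the actual patch:

# sketch: raise the network timeout before the pyvmomi connection is created
import socket
import ssl

from pyVim.connect import SmartConnect, Disconnect

# default timeout for any new socket, in seconds (value is illustrative)
socket.setdefaulttimeout(120)

# unverified context mirrors the exporter's ignore_ssl: true option
context = ssl._create_unverified_context()

si = SmartConnect(
    host='vcenter_hostname',   # placeholder values, as in the config above
    user='username',
    pwd='password',
    sslContext=context,
)
try:
    content = si.RetrieveContent()   # normal pyvmomi calls proceed as usual
finally:
    Disconnect(si)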

pryorda (Owner) commented Aug 17, 2020

Where can we access the recordings from the sessions?

pryorda closed this as completed Aug 17, 2020