Improve monitoring information of WMAgent components #12302
Conversation
Without taking a deep look into the details of the worker threads implementation, I see that each component worker thread inherits from BaseWorkerThread:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py
and runs as a daemon, spawning a Daemon.xml file, e.g.:
(WMAgent-2.3.9.2) [xxx@vocms0xxx:current]$ cat install/WorkQueueManager/Daemon.xml
Can't we consolidate most of this development into this module?
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Daemon/Details.py
Or perhaps we could have the relevant worker thread monitoring logic in the BaseWorkerThread itself:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py
In addition, I see the Harness module is used for starting up a component:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Harness.py
but I think it sits at a higher layer and would not fit well for worker thread monitoring.
Alan (@amaltaro), please go through the review process to avoid triggering unnecessary work. In my view, what you suggest does not fit the existing architectural design because we have no ability to communicate with threads from an external process. In other words, our components and pollers are Python processes rather than (HTTP) services where you can request some monitoring metrics. With an external process and a thread model, what can be done is to push metrics somewhere, e.g. to an internal database or the Monit infrastructure, and then use them afterwards. At the moment we rely on the process ID (extracted from Daemon.xml) to inspect the run-time status of components/daemons. I provided standalone functions for that and they are used in
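For illustration, a minimal sketch of this PID-based inspection. The Daemon.xml element and attribute names below are assumptions for the example; the real schema is whatever WMCore/Agent/Daemon writes:

```python
# Sketch only: the Daemon.xml element/attribute names here are assumptions,
# not necessarily what WMCore/Agent/Daemon actually writes.
import xml.etree.ElementTree as ET
import psutil

def daemonStatus(daemonXml="install/WorkQueueManager/Daemon.xml"):
    """Extract the component PID from Daemon.xml and inspect its run-time state."""
    root = ET.parse(daemonXml).getroot()
    node = root.find("ProcessID")            # hypothetical element name
    pid = int(node.attrib["Value"])          # hypothetical attribute name
    if not psutil.pid_exists(pid):
        return {"pid": pid, "status": "not running"}
    proc = psutil.Process(pid)
    return {"pid": pid,
            "status": proc.status(),         # e.g. 'sleeping', 'running'
            "nthreads": proc.num_threads()}
```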
My point is that without a global picture with all the details of the monitoring-metrics workflow, it is very hard to understand what you have in mind, and simply mentioning different parts of WMCore does not help me judge whether it is feasible with the current architectural design.
@vkuznet I am puzzled by the fragmentation of process-related code. Despite the extensive PR description (thanks!), please help me understand the following: checking the information already available in WMStats, I would be in favor of always reporting the state of all worker threads (+ main component), instead of only publishing when one of the threads is down. What do you think?
Before I leave a review, I have a question, mainly for my own education, and if it has already been discussed I am sorry I missed it. Is there any reason why you read the content of the Daemon.xml file directly?
@amaltaro , here are my responses to your questions:
b) we can merge
@mapellidario , answering your question: even though
Thank you for these clarifications, Valentin.
Given that we are still checking information for processes, I would merge them all together (probably keeping the ProcessStats module).
Yes, generic functions are better placed under Utils, but WMCore/Agent/Daemon/Details.py exists exactly to deal with that Daemon.xml file. So it is somehow natural to me that we would use that module to retrieve the node that you are looking for. If not yet available, then we can extend that class with the required method.
A generic module does the job and can be used by anything else that needs to parse an XML file. The inconsistency in accessing Daemon.xml is what bugs me. Just my 2 cents; stick to whatever you/people prefer.
Data in the database will only be stale if a process dies, which is actually our clue that the component/process is no longer running.
The data in the database will not be correct if ANY thread dies while the main process stays alive. If the main process is fine and all of its threads are too, then your statement is correct. But if the main process is fine while some of its threads are not (it is possible for a thread to die while the main process stays alive), then the content of the database does not represent the actual state of the component. For that reason I think we cannot rely on the database content for any component (in this regard I don't know why we need the database table for them either). Before making any changes I want you to review this statement and decide how to proceed. In my view the database entries are not useful for representing the state of a component, and we must use either the proc filesystem or psutil to determine the state of all component threads on each polling cycle. I'm tagging everyone to raise awareness of this situation: @amaltaro , @anpicci , @todor-ivanov , @mapellidario , @khurtado , @d-ylee
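For illustration, a minimal sketch of what such a per-cycle check could look like, reading thread states straight from the Linux proc filesystem (the function name `threadStates` is hypothetical; the PR's actual helpers live in `Utils/ProcessStats.py`):

```python
import os

def threadStates(pid):
    """Return {tid: state} for every thread of `pid`, empty if the process is gone."""
    states = {}
    taskDir = "/proc/%s/task" % pid
    if not os.path.isdir(taskDir):
        return states  # main process is gone
    for tid in os.listdir(taskDir):
        try:
            with open(os.path.join(taskDir, tid, "status")) as fd:
                for line in fd:
                    if line.startswith("State:"):
                        # line looks like: "State:\tS (sleeping)"
                        states[int(tid)] = line.split(None, 2)[2].strip("()\n")
                        break
        except OSError:
            states[int(tid)] = "vanished"  # thread exited while we were scanning
    return states
```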
And this is exactly the expected behavior. Let me expand on the algorithm that was probably considered when these tables were designed (which makes sense to me):
I guess here is the miscommunication. The database is not meant to track the state of components and worker threads; it is only meant to track the expected (registered) PID for them. Monitoring of their state needs to happen "outside" of the table, which implies the table is not updated, other than the heartbeat, if we want to.
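A small sketch of that split, under the assumption described above (names are hypothetical): the table only stores the registered PID plus a heartbeat, while actual liveness is checked outside the table, against the OS:

```python
import psutil

def componentAlive(registeredPid):
    """True if the PID registered in the table still maps to a live process."""
    try:
        proc = psutil.Process(registeredPid)
        # a zombie process still "exists" but is effectively dead
        return proc.status() != psutil.STATUS_ZOMBIE
    except psutil.NoSuchProcess:
        return False
```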
Fixes #12145
Status
ready
Description
To improve WMAgent component monitoring, we introduced the following changes:
- a new `processStatus` function based on `/proc` info for a given PID. This function returns information about process threads, which can later be used by WMStats, as an addition to the `agentInfo` dictionary we supply to CouchDB. It shows the main process and its threads; in this example the main process PID is 1921054 and its single additional thread has PID 1921059, which is in the sleeping state (see the first sketch after this list)
- a new `processThreadsInfo` function in the `Utils/ProcessStats.py` module, which provides process and thread monitoring information
- updated `WMComponent/AnalyticsDataCollector/DataCollectAPI.py` parts with information about dead threads, via the `threadsDetails` function, within `agentInfo['down_component_detail']` used by the WMStats web UI
- updated `bin/wmcoreD`, where we provide the status of a component's process along with its threads. Each component now reports all of its threads, and if a thread is missing/died/stuck it will report a proper message about it; e.g. a component can be in a `partially-running` state where its main process and some of its threads are OK but other threads misbehave
- introduced a `threads.json` file in the component area, alongside `Daemon.xml`, to capture the initial conditions of the component state. This information is used by the aforementioned new functionality to determine whether the original threads are still running or have died (see the second sketch after this list)
- propagated the thread information into `WMComponent/AnalyticsDataCollector/DataCollectAPI.py`, which uses it to analyze the state of a specific component and properly report its state (with the new thread information) to CouchDB.
Note: the information reported in the WMStats web UI agentInfo tab can be accessed via the following URL:
https://xxxx.cern.ch/couchdb/wmstats/_design/WMStats/_view/agentInfo
which returns a nested dictionary used by the WMStats JavaScript engine to render this information in the agentInfo tab.
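For reference, a small sketch of fetching that view programmatically (the host is the placeholder from above; authentication is omitted):

```python
# Sketch: fetch the agentInfo view from CouchDB (placeholder host, no auth shown).
import json
from urllib.request import urlopen

url = "https://xxxx.cern.ch/couchdb/wmstats/_design/WMStats/_view/agentInfo"
with urlopen(url) as resp:
    view = json.load(resp)
for row in view.get("rows", []):
    print(row["key"], "->", row["value"])   # nested per-agent dictionary
```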
Pylint note: I noticed degradation in the modified `bin/wmcoreD` codebase which is not related to the proposed changes; for instance, pylint underlines (flags) some variables used in this codebase. I explicitly did not touch those errors/warnings, which most likely come from stricter rules imposed by pylint over time. If they should be addressed, I suggest making an explicit request for them and I'll be happy to fix those issues.
Is it backward compatible (if not, which system does it affect?)
YES
Related PRs
External dependencies / deployment changes