Improve monitoring information of WMAgent components #12302
Conversation
Without taking a deep look into the details of the worker threads implementation, I see that each component worker thread inherits from BaseWorkerThread:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py
and runs as a daemon, spawning a Daemon.xml file, e.g.:
(WMAgent-2.3.9.2) [xxx@vocms0xxx:current]$ cat install/WorkQueueManager/Daemon.xml
Can't we consolidate most of this development into this module?
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Daemon/Details.py
Or perhaps we could have the relevant worker thread monitoring logic in the BaseWorkerThread itself:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py
In addition, I see the Harness module is used for starting up a component:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Harness.py
but I think it sits at a higher layer and would not fit well for worker thread monitoring.
Alan (@amaltaro), please go through the review process to avoid triggering unnecessary work. In my view, what you suggest does not fit the existing architectural design because we have no ability to communicate with threads from an external process. In other words, our components and pollers are Python processes rather than (HTTP) services where you can request some monitoring metrics. With an external process and a thread model, what can be done is to push metrics somewhere, e.g. to an internal database or the Monit infrastructure, and then use them afterwards. At the moment we rely on the process ID (extracted from Daemon.xml) to inspect the run-time status of components/daemons. I provided standalone functions for that and they are used in
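For illustration, a minimal sketch of this PID-based inspection. The Daemon.xml element and attribute names below are assumptions for the example; the real schema is whatever WMCore/Agent/Daemon writes:

```python
# Sketch only: the Daemon.xml element/attribute names here are assumptions,
# not necessarily what WMCore/Agent/Daemon actually writes.
import xml.etree.ElementTree as ET
import psutil

def daemonStatus(daemonXml="install/WorkQueueManager/Daemon.xml"):
    """Extract the component PID from Daemon.xml and inspect its run-time state."""
    root = ET.parse(daemonXml).getroot()
    node = root.find("ProcessID")            # hypothetical element name
    pid = int(node.attrib["Value"])          # hypothetical attribute name
    if not psutil.pid_exists(pid):
        return {"pid": pid, "status": "not running"}
    proc = psutil.Process(pid)
    return {"pid": pid,
            "status": proc.status(),         # e.g. 'sleeping', 'running'
            "nthreads": proc.num_threads()}
```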
My point is that without a global picture with all the details of the monitoring-metrics workflow, it is very hard to understand what you have in mind, and simply mentioning different parts of WMCore does not help me judge whether it is feasible with the current architectural design.
@vkuznet I am puzzled by the fragmentation of process-related code. Despite the extensive PR description (thanks!), please help me understand the following: checking the information already available in WMStats, I would be in favor of always reporting the state of all worker threads (+ main component), instead of only publishing when one of the threads is down. What do you think?
Before I leave a review, I have a question, mainly for my own education, and if it has already been discussed I am sorry I missed it. Is there any reason why you read the content of the Daemon.xml file directly?
@amaltaro , here are my responses to your questions:
b) we can merge
@mapellidario , answering your question: even though
Thank you for these clarifications, Valentin.
Given that we are still checking information for processes, I would merge them all together (probably keeping the ProcessStats module).
Yes, generic functions are better placed under Utils, but WMCore/Agent/Daemon/Details.py exists exactly to deal with that Daemon.xml file. So it is somehow natural to me that we would use that module to retrieve the node that you are looking for. If not yet available, then we can extend that class with the required method.
A generic module does the job and can be used by anything else that needs to parse an XML file. The inconsistency in accessing Daemon.xml is what bugs me. Just my 2 cents; stick to whatever you/people prefer.
Data in the database will only be stale if a process dies, which is actually our clue that the component/process is no longer running.
The data in the database will not be correct if ANY thread dies while the main process stays alive. If the main process is fine and all of its threads are too, then your statement is correct. But if the main process is fine while some of its threads are not (it is possible for a thread to die while the main process stays alive), then the content of the database does not represent the actual state of the component. For that reason I think we cannot rely on the database content for any component (in this regard I don't know why we need the database table for them either). Before making any changes I want you to review this statement and decide how to proceed. In my view the database entries are not useful for representing the state of a component, and we must use either the proc filesystem or psutil to determine the state of all component threads on each polling cycle. I'm tagging everyone to raise awareness of this situation: @amaltaro , @anpicci , @todor-ivanov , @mapellidario , @khurtado , @d-ylee
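For illustration, a minimal sketch of what such a per-cycle check could look like, reading thread states straight from the Linux proc filesystem (the function name `threadStates` is hypothetical; the PR's actual helpers live in `Utils/ProcessStats.py`):

```python
import os

def threadStates(pid):
    """Return {tid: state} for every thread of `pid`, empty if the process is gone."""
    states = {}
    taskDir = "/proc/%s/task" % pid
    if not os.path.isdir(taskDir):
        return states  # main process is gone
    for tid in os.listdir(taskDir):
        try:
            with open(os.path.join(taskDir, tid, "status")) as fd:
                for line in fd:
                    if line.startswith("State:"):
                        # line looks like: "State:\tS (sleeping)"
                        states[int(tid)] = line.split(None, 2)[2].strip("()\n")
                        break
        except OSError:
            states[int(tid)] = "vanished"  # thread exited while we were scanning
    return states
```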
And this is exactly the expected behavior. Let me expand on the algorithm that was probably considered when these tables were designed (which makes sense to me):
I guess here is the miscommunication. The database is not meant to track the state of components and worker threads; it is only meant to track the expected (registered) PID for them. Monitoring of their state needs to happen "outside" of the table, which implies the table is not updated, other than the heartbeat, if we want to.
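A small sketch of that split, under the assumption described above (names are hypothetical): the table only stores the registered PID plus a heartbeat, while actual liveness is checked outside the table, against the OS:

```python
import psutil

def componentAlive(registeredPid):
    """True if the PID registered in the table still maps to a live process."""
    try:
        proc = psutil.Process(registeredPid)
        # a zombie process still "exists" but is effectively dead
        return proc.status() != psutil.STATUS_ZOMBIE
    except psutil.NoSuchProcess:
        return False
```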
Fixes #12145
Status
ready
Description
To improve WMAgent component monitoring, we introduced the following changes:
- a new `processStatus` function based on `/proc` info for a given PID. This function returns information about process threads, which can later be used by WMStats, as an addition to the `agentInfo` dictionary we supply to CouchDB. It shows the main process and its threads; in this example the main process PID is 1921054 and its single additional thread has PID 1921059, which is in the sleeping state (see the first sketch after this list)
- a new `processThreadsInfo` function in the `Utils/ProcessStats.py` module, which provides process and thread monitoring information
- updated `WMComponent/AnalyticsDataCollector/DataCollectAPI.py` parts with information about dead threads, via the `threadsDetails` function, within `agentInfo['down_component_detail']` used by the WMStats web UI
- updated `bin/wmcoreD`, where we provide the status of a component's process along with its threads. Each component now reports all of its threads, and if a thread is missing/died/stuck it will report a proper message about it; e.g. a component can be in a `partially-running` state where its main process and some of its threads are OK but other threads misbehave
- introduced a `threads.json` file in the component area, alongside `Daemon.xml`, to capture the initial conditions of the component state. This information is used by the aforementioned new functionality to determine whether the original threads are still running or have died (see the second sketch after this list)
- propagated the thread information into `WMComponent/AnalyticsDataCollector/DataCollectAPI.py`, which uses it to analyze the state of a specific component and properly report its state (with the new thread information) to CouchDB.
Note: the information reported in the WMStats web UI agentInfo tab can be accessed via the following URL:
https://xxxx.cern.ch/couchdb/wmstats/_design/WMStats/_view/agentInfo
which returns a nested dictionary used by the WMStats JavaScript engine to render this information in the agentInfo tab.
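For reference, a small sketch of fetching that view programmatically (the host is the placeholder from above; authentication is omitted):

```python
# Sketch: fetch the agentInfo view from CouchDB (placeholder host, no auth shown).
import json
from urllib.request import urlopen

url = "https://xxxx.cern.ch/couchdb/wmstats/_design/WMStats/_view/agentInfo"
with urlopen(url) as resp:
    view = json.load(resp)
for row in view.get("rows", []):
    print(row["key"], "->", row["value"])   # nested per-agent dictionary
```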
Pylint note: I noticed degradation in the modified `bin/wmcoreD` codebase which is not related to the proposed changes; for instance, pylint underlines (flags) some variables used in this codebase. I explicitly did not touch those errors/warnings, which most likely come from stricter rules imposed by pylint over time. If they should be addressed, I suggest making an explicit request for them and I'll be happy to fix those issues.
Is it backward compatible (if not, which system does it affect?)
YES
Related PRs
External dependencies / deployment changes