Too many tailed files collected #3783

pmoravec · 2024-09-24T12:30:48Z

We noticed that some specific files get tailed very often across different sosreports. Below is a list of the most frequently tailed files, with my suggestion for each. Any comments or suggestions are welcome. Possible options are "leave as is", "increase sizelimit", or "drop that file, or drop some of its data to shrink it".

* `postgresql/var.lib.pgsql.data.log.postgresql-*.log`: this most probably comes from Satellite / foreman systems that log bigger postgres queries. Probably worth increasing the sizelimit; I will raise a PR for it.
* `sar/sa*.xml`: we collect these files for legacy reasons only (imho). I would vote for dropping them (until somebody needs them). If that isn't welcome, let's increase the sizelimit, since an incomplete/broken XML file is of little use.
* various `var/log/*` files, namely `messages*`, `audit.log` or `secure`: probably leave as is, though maybe audit and secure logs should be collected for the past X days instead of up to a given file size?
* `pacemaker/var.log.pacemaker.pacemaker.log`: any suggestion from pacemaker plugin authors @TurboTurtle , @nrwahl2 ?
* `pulpcore/core_task`: we collect all details about the tasks. Since many of the details are now encrypted to prevent password leaks, a lot of the data is useless and I should improve the query. TODO item for me.
* `crio/journalctl_--no-pager_--unit_crio`: any suggestion from crio plugin authors @TurboTurtle , @vteratipally , @haircommander ?
* `openshift/journalctl_--no-pager_--unit_kubelet`: any suggestion from openshift plugin authors @TurboTurtle , @vwalek ?
* `logs/journalctl_--no-pager`: that is expected and reasonable, no action.
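To illustrate the point about truncated `sa*.xml` files, here is a minimal sketch (not sos code) of what a size-limited tail does to an XML document. The `tail_bytes` helper and the `<sysstat>`/`<sample>` element names are hypothetical stand-ins; the only assumption is that tailing keeps the last `sizelimit` bytes of the file.

```python
import xml.etree.ElementTree as ET


def tail_bytes(data: bytes, sizelimit: int) -> bytes:
    """Keep only the last `sizelimit` bytes, as a size-limited tail would."""
    if sizelimit <= 0:
        return b""
    return data[-sizelimit:]


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if `data` parses as a complete XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Hypothetical stand-in for a sar XML export; the element names are made up.
records = "".join(f"<sample id='{i}'>{i * 10}</sample>" for i in range(1000))
full = f"<sysstat>{records}</sysstat>".encode()

# The tail loses the opening root element, so it no longer parses.
tailed = tail_bytes(full, 1024)
print(is_well_formed_xml(full))
print(is_well_formed_xml(tailed))
```

A line-oriented log survives the same treatment, because each remaining line is still readable on its own.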


jcastill · 2024-09-24T13:26:00Z

* `sar/sa*.xml`: we collect these files for legacy reasons only (imho). I would vote for dropping them (until somebody needs them). If that isn't welcome, let's increase the sizelimit, since an incomplete/broken XML file is of little use.

I'm not sure these files are needed at all, but instead of dropping we could add an option to collect them if needed, in case anyone relies on them for any scripts. "Interpreted/decoded" ones in plain text are more useful.

* various `var/log/*` files, namely `messages*` or `audit.log` or `secure` - probably let it be, maybe audits or secure should be collected for past X days instead of given filesize..?

Agreed, maybe two/three days should be enough by default, or even just one day.
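A rough sketch of what "collect the past X days" could look like, as opposed to a size-based tail. This is not how sos implements anything; `files_from_last_days` is a hypothetical helper that selects whole files by modification time, and the file names in the demo are made up.

```python
import os
import tempfile
import time
from pathlib import Path


def files_from_last_days(directory, pattern, days, now=None):
    """Return files matching `pattern` whose mtime is within the last `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    recent = [
        p for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_mtime >= cutoff
    ]
    # Newest first, so the most recent logs are listed first.
    return sorted(recent, key=lambda p: p.stat().st_mtime, reverse=True)


# Demo with throwaway files whose ages are set explicitly.
with tempfile.TemporaryDirectory() as tmp:
    now = time.time()
    for name, age_days in [("messages", 0), ("messages-1", 1), ("messages-9", 9)]:
        path = Path(tmp) / name
        path.write_text("log line\n")
        mtime = now - age_days * 86400
        os.utime(path, (mtime, mtime))
    selected = [p.name for p in files_from_last_days(tmp, "messages*", days=3, now=now)]

print(selected)
```

Selecting by mtime keeps or drops whole rotated files; it does not trim individual lines, so a single very large current log would still need a size limit.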

* `logs/journalctl_--no-pager` - that is expected and reasonable, no action

Agreed

These columns are either empty, containing passwords or some encoded data. Get the *remaining* column names and query for them. If the query for column names fail, failover to current "SELECT *". Relevant: sosreport#3783 Resolves: sosreport#3784 Signed-off-by: Pavel Moravec <[email protected]>
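The approach described in that commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `build_select` is a hypothetical helper, and the column names in the usage lines are invented for the example.

```python
def build_select(table, all_columns, excluded):
    """Build a SELECT that skips excluded columns.

    `all_columns` is the result of a prior column-name lookup; pass None
    (or an empty list) when that lookup failed, to fall back to SELECT *.
    """
    if not all_columns:
        # Column lookup failed: keep the previous "collect everything" query.
        return f"SELECT * FROM {table}"
    wanted = [col for col in all_columns if col not in excluded]
    if not wanted:
        raise ValueError("every column is excluded; nothing left to query")
    return f"SELECT {', '.join(wanted)} FROM {table}"


# Hypothetical column names, for illustration only.
print(build_select("core_task", ["pulp_id", "state", "args"], {"args"}))
print(build_select("core_task", None, {"args"}))
```

Falling back to `SELECT *` trades a smaller output for robustness: if the lookup breaks, the collection still happens, just with the unwanted columns included.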

pmoravec · 2024-09-24T15:03:38Z

* `postgresql/var.lib.pgsql.data.log.postgresql-*.log` : this is most probably from Satellite / foreman systems with bigger postgres queries logged. Probably worth increasing the sizelimit, I will raise PR for it

This happens for Satellite / foreman, where we already increased the sizelimit to 100MB via a preset, and I confirm it is applied to these files. Raising it higher is possible, but not really worth it. Usually only the files from previous days get tailed, so what we collect is sufficient.

haircommander · 2024-09-24T15:03:44Z

From my perspective as a node team member, crio and kubelet logs are the most important pieces for us to debug issues. We don't need them if they're caught in the overall journal though. Is bumping the size limit an option for those? Or we bump the size limit for the overall journal and drop the crio/kubelet specific journals. What do folks think?

TurboTurtle · 2024-09-24T15:24:58Z

I'd prefer increasing the size limit of unit-specific journals and/or log files over increasing the system journal collection. It gives us granularity without enforcing potentially very large system journal collections across the board. Granted, I get the point of "well it's going to be the majority of the system journal anyway...", but I think this is the least-bad option overall.

As far as the sar/sa files go, I'd defer to support teams on how often they're used. I know there's been a general shift away from sar but there's a lot of knowledge built around the use of these, at least the plaintext translations. I'd be open to dropping the binary collections since you need to use the same version to translate those as which generated them (hence why we do that during collection at all), but I'd be wary of dropping them entirely.

jcastill · 2024-09-24T15:56:06Z

The plain text ones are used a lot, even though they are not the most accurate output you could get... but as a first step when looking into performance issues, they are good enough.
I've searched internally and I haven't found any reference to the xmls or any tool that may use them, but "absence of proof..." . I don't remember using them for any support case.
I think there's an old tool, kSar, abandoned now, that used to read the xmls, but other than that nothing.

nrwahl2 · 2024-09-24T23:24:00Z

Pacemaker: It's been a couple of years since I've worked in support, so I would defer to any support engineers. Whether the limit is sufficient will always depend on how promptly the user opens a support ticket after an issue occurs, and on whether additional verbosity has been configured (it usually hasn't been).

We could increase the size limit to some arbitrary higher number. I don't know what fraction of sosreports have truncated Pacemaker log files currently and whether this would be worth doing.

Support engineers should not hesitate to request the full pacemaker.log file if the relevant timestamps are not present. Ideally, that should introduce only a small delay in investigation, though that depends on both the support team and the user.

pafernanr · 2024-09-25T07:37:40Z

Hello all,

+1 to removing the sa*.xml files. They are redundant: the binary saXX files are also included and contain the full day's dump. Those are sometimes truncated too, though not often; it can happen if the sampling interval is too short.

I'd also like to suggest increasing the size limit for the foreman plugin. Its CSV files are sometimes truncated, which leads to missing important dynflow steps. Note that the plugin already limits the output to the last 14 days, which should be enough for any support case. That said, although I fully agree a limit is mandatory, in this specific plugin the file size limit is somewhat "redundant". IMO increasing it to 150M or 200M could be a good choice, so that the 14-day limit is what bounds the output in as many cases as possible.

pmoravec · 2024-09-25T18:09:37Z

SAR data: I would drop the XML as rarely-if-at-all used (I am asking internally, either way), while I would keep the binary data (the "source of truth" that we can copy to another system with the same sysstat version and get whatever we want) and also the text sarXX files (a concise enough text interpretation of the binary data).

Increasing the 100M limit for foreman's dynflow* tables: no strong opinion. Can you @pafernanr evaluate the impact? I.e. generate enough foreman tasks to have 200M of data in each such table, and compare execution time and tarball size for sizelimits of 100MB, 150MB and 200MB? On one hand, we would get some more history of tasks. On the other hand, the data is already ordered by time, so the most recent is always present, and I am torn on whether it is worth paying the extra cost in time and tarball size to get that info. This sizelimit has only rarely affected my own investigation of foreman/Satellite support cases, hence my reluctance. But if others hit it more often, no objections.

pmoravec · 2024-09-26T08:22:33Z

SAR: Feedback from two groups of support engineers in Red Hat: "we don't use the XML format, but we heavily use the binary saXX and text sarXX formats". So I would vote for dropping the XML format, with a note in the release notes. Maybe it is worth waiting for the 4.8.2 tag so we can mention it in the "more major" 4.9 release notes?


pmoravec mentioned this issue Sep 24, 2024

[pulpcore] Don't collect args columns from tasks tables #3784

Merged
