Solr-operator and prometheus-exporter #760

Closed
aloosnetmatch opened this issue Feb 26, 2025 · 9 comments

Comments

@aloosnetmatch

We upgraded our test env from solr-operator 0.8.1 to 0.9.0 and from solr 9.7.0 to 9.8.0.

The prometheus exporter does not seem to work anymore.

The logging tells me this:

ERROR - 2025-02-26 12:42:40.917; org.apache.solr.prometheus.exporter.SolrExporter; Must provide either --base-url or --zk-host
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.apache.solr.prometheus.exporter.SolrScrapeConfiguration.getType()" because "configuration" is null
	at org.apache.solr.prometheus.exporter.SolrExporter.createScraper(SolrExporter.java:127)
	at org.apache.solr.prometheus.exporter.SolrExporter.<init>(SolrExporter.java:90)
	at org.apache.solr.prometheus.exporter.SolrExporter.main(SolrExporter.java:426)

I already found this issue, which looks similar to mine:
https://issues.apache.org/jira/browse/SOLR-17638

In the prometheus-exporter pod, the env variable for "ZK_HOST" seems to have no value.

  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: PORT
      value: '8080'
    - name: NUM_THREADS
      value: '6'
    - name: ZK_HOST
    - name: CONFIG_FILE
      value: /opt/solr/contrib/prometheus-exporter/conf/solr-exporter-config.xml
  resources:

If I set the correct value for ZK_HOST in the Deployment, an additional container starts up, and that one does seem to work.

What can I do to fix this issue?

@gerlowskija
Contributor

I spent some time this morning trying to trigger this using the release-0.9 branch (Operator 0.9.0, essentially) and Solr 9.8.0, and couldn't reproduce. Since @aloosnetmatch saw the behavior change after adding the missing "ZK_HOST" value, I tried telling the Prometheus exporter about its ZK_HOST a few different ways, including:

spec:
  solrReference:
    cloud:
      zkConnectionInfo:
        internalConnectionString: "example-solrcloud-zookeeper-0.example-solrcloud-zookeeper-headless.default.svc.cluster.local:2181"
        chroot: "/this/will/be/auto/created"

and

spec:
  solrReference:
    cloud:
      name: "example"

Neither one allowed me to reproduce, out of the box.

@aloosnetmatch - could you share some more details about your solrcloud and solrprometheusexporter, that might help folks here reproduce? The output of commands like kubectl get solrcloud <solrcloud-name> -o yaml and kubectl get solrprometheusexporter <prom-name> -o yaml would be a huge help.

@HoustonPutman
Contributor

Yeah, the only way for ZK_HOST to not actually have a value is to do some really weird stuff, so we would need to see what your solrcloud and solrprometheusexporter specs look like.

@aloosnetmatch
Author

Thanks for your reply.

I looked up the requested info:

This time I compared the PROD environment (which runs solr operator 0.8.1 and solr 9.7.0)
to the NONP environment (which runs solr operator 0.9.0 and solr 9.8.0).

spec:
  solrReference:
    cloud:
      name: solr-cluster-netm
      namespace: netm-solr-operator

We use the solrReference.
So I assume that at some point the solr operator queries the solr cluster for the zookeeper info, which fails because of the error:

ERROR - 2025-03-14 12:45:49.909; org.apache.solr.prometheus.exporter.SolrExporter; Must provide either --base-url or --zk-host
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.apache.solr.prometheus.exporter.SolrScrapeConfiguration.getType()" because "configuration" is null
	at org.apache.solr.prometheus.exporter.SolrExporter.createScraper(SolrExporter.java:127)
	at org.apache.solr.prometheus.exporter.SolrExporter.<init>(SolrExporter.java:90)
	at org.apache.solr.prometheus.exporter.SolrExporter.main(SolrExporter.java:336)

The attached files for the solrprometheusexporter:
solrprometheusexporter_nonp.txt
solrprometheusexporter_prod.txt

Also attached are the files for the solrcloud specs:
solrcloud_nonp.txt
solrcloud_prod.txt

@gerlowskija
Contributor

@aloosnetmatch - is this a transient error that goes away as the operator/pods retry, or does the error keep happening once it arises?

Assuming it's transient, I could imagine a timing issue where solrprometheusexporter ("SPE") creates its deployment after the solrcloud exists but before the operator has tried to reconcile it and populated solrCloud.status. In that window, the ZK_HOST env var would get the value of solrConnectionInfo.CloudZkConnnectionInfo.ZkConnectionString(), which (I think) would be an empty string.

But that only makes sense if it's transient behavior...
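The window described above can be sketched in Go. This is only an illustration: the types and the `zkHostEnv` helper are hypothetical, simplified stand-ins and not the operator's real structs, and the ZooKeeper address is made up. It shows how a status that has not been written yet would produce a `ZK_HOST` env var with an empty value.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the operator's types.
// The real solr-operator structs have more fields and may differ in naming.
type zkConnectionInfo struct {
	InternalConnectionString string
	ChRoot                   string
}

// zkConnectionString joins the connection string and the chroot.
// On a zero-value struct it returns the empty string.
func (z zkConnectionInfo) zkConnectionString() string {
	return z.InternalConnectionString + z.ChRoot
}

type solrCloudStatus struct {
	ZookeeperConnectionInfo zkConnectionInfo
}

type envVar struct {
	Name  string
	Value string
}

// zkHostEnv builds the ZK_HOST env var from whatever the status currently holds.
func zkHostEnv(status solrCloudStatus) envVar {
	return envVar{
		Name:  "ZK_HOST",
		Value: status.ZookeeperConnectionInfo.zkConnectionString(),
	}
}

func main() {
	// Status before the operator has written it: all fields are zero values.
	unreconciled := solrCloudStatus{}

	// Status after a successful reconcile (address is an assumed example).
	reconciled := solrCloudStatus{ZookeeperConnectionInfo: zkConnectionInfo{
		InternalConnectionString: "zk-0.zk-headless:2181",
		ChRoot:                   "/solr",
	}}

	fmt.Printf("before status is written: ZK_HOST=%q\n", zkHostEnv(unreconciled).Value)
	fmt.Printf("after status is written:  ZK_HOST=%q\n", zkHostEnv(reconciled).Value)
}
```

An env var with an empty value is rendered in the pod spec as a bare `- name: ZK_HOST` entry, which matches what the exporter pod showed.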

@HoustonPutman
Contributor

@aloosnetmatch can you provide the status of your solrcloud objects as well?

@HoustonPutman
Contributor

HoustonPutman commented Mar 14, 2025

Also could you provide your solr operator logs? The most likely culprit here is that there is an error in the SolrCloud reconciling and it cannot set its status.

@aloosnetmatch
Author

Here is the solr operator log.

solr-operator_nonp_logs.txt

For which "solrcloud objects" would you like to have the status?

@HoustonPutman
Contributor

So if you look at those logs, the Solr Operator cannot create an Ingress for you because your ingress controller is rejecting it. Because of this, the SolrCloud cannot finish reconciling, and the SolrCloud status is not written. It's ultimately an issue; the status should probably be written, but it's very difficult to know which errors to ignore and which not to ignore when writing the status.

I would recommend fixing your ingress settings so the error doesn't happen.
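To make the chain of events concrete, here is a minimal Go sketch of a reconcile function that returns early when Ingress creation fails and therefore never reaches the status write. The `cloud` type, `reconcile` function, and ZooKeeper address are hypothetical simplifications for illustration, not the operator's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// cloud is a hypothetical stand-in for a SolrCloud resource.
// statusZkHost stays empty until the status is written.
type cloud struct {
	statusZkHost string
}

// errIngressRejected simulates the ingress step failing.
var errIngressRejected = errors.New("ingress rejected")

// reconcile is a heavily simplified reconcile pass: it creates the Ingress
// first and writes the status last, so an Ingress error skips the status write.
func reconcile(c *cloud, createIngress func() error) error {
	if err := createIngress(); err != nil {
		// Early return: the status update below is never reached.
		return fmt.Errorf("reconcile ingress: %w", err)
	}
	// Assumed example value; written only when every earlier step succeeded.
	c.statusZkHost = "zk-0.zk-headless:2181/solr"
	return nil
}

func main() {
	c := &cloud{}

	// Every pass fails the same way while the ingress problem persists.
	err := reconcile(c, func() error { return errIngressRejected })
	fmt.Printf("failing ingress: err=%v status=%q\n", err, c.statusZkHost)

	// Once the ingress step succeeds (or is no longer needed), status is set.
	err = reconcile(c, func() error { return nil })
	fmt.Printf("working ingress: err=%v status=%q\n", err, c.statusZkHost)
}
```

In this sketch the empty status is not transient: it persists for as long as the ingress step keeps failing, which would leave anything that reads the status, such as the exporter's ZK_HOST, without a value.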

@aloosnetmatch
Author

Hi.

I managed to fix the issue.
We tried the ingress in the past, but we didn't get it to work back then.
So the solr-operator tried to configure the ingress, but there is no ingress controller present.

I fixed it in our environment by switching to "ExternalDNS"

addressability:
  external:
    method: ExternalDNS

We use an Azure Loadbalancer to handle the external traffic.

As soon as this issue was fixed, the prometheus exporter started working.

Thanks for your support.
