Don't return CSS files as results #147

Closed
dev-nicolaos opened this issue May 6, 2024 · 3 comments

Comments

@dev-nicolaos

It appears searchmysite is not filtering out CSS files from its results. This means if you perform a search that is filled with CSS keywords and values, a lot of the results page is just a bunch of random CSS files with no title.

Example: https://searchmysite.net/search/?q=box+shadow+100%25+height


@m-i-l closed this as completed in c6e7abc May 7, 2024
@m-i-l
Contributor

m-i-l commented May 7, 2024

@dev-nicolaos Many thanks for spotting this issue and taking the time to report it.

I've modified the search query to restrict results to the following content_types:
(content_type:text/html OR content_type:text/plain)

i.e. only HTML and plain text results will be shown, so no CSS, JS, JSON, XML, RSS or any kind of binary files.
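
As a minimal sketch of how a clause like this could be built and attached to a search request: the helper names below are hypothetical, and passing the clause in a Solr-style `fq` filter parameter next to `q` is an assumption, since the comment above doesn't show how the filter is actually applied. Depending on the query parser, the `/` in a media type may also need quoting or escaping; the sketch reproduces the clause exactly as written above.

```python
# Hypothetical helpers; not taken from the searchmysite code base.
ALLOWED_CONTENT_TYPES = ["text/html", "text/plain"]


def content_type_filter(allowed):
    """Join the allowed media types into one OR clause on content_type."""
    clauses = [f"content_type:{media_type}" for media_type in allowed]
    return "(" + " OR ".join(clauses) + ")"


def build_search_params(user_query):
    """Return request parameters with the content-type restriction attached.

    Assumption: the backend accepts the user query in `q` and a separate
    filter query in `fq`.
    """
    return {
        "q": user_query,
        "fq": content_type_filter(ALLOWED_CONTENT_TYPES),
    }


if __name__ == "__main__":
    print(build_search_params("box shadow 100% height"))
```

Keeping the allowed types in one list means adding or removing a type later is a one-line change rather than an edit to the query string.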

For reference, the search index currently contains the following content_types:
"text/html",91583
"application/json",1845
"text/xml",1690
"application/rss+xml",1526
"application/xml",1512
"application/javascript",648
"text/plain",595
"application/xhtml+xml",256
"application/octet-stream",242
"application/atom+xml",235
"application/manifest+json",135
"text/css",90
"application/feed+json",32
"text/javascript",26
"application/pgp-keys",24
"application/opensearchdescription+xml",20
"application/x-tex",17
"text/gemini",17
"text/csv",15
"application/json+oembed",14
"text/markdown",14
"application/pgp-signature",12
"text/x-bibtex",11
"text/calendar",10
"application/pgp-encrypted",8
"application/rdf+xml",7
"text/x-csrc",7
"text/x-python",6
"text/x-c",5
"application/stream+json",4
"application/x-mspublisher",4
"image/svg+xml",4
"text/vcard",4
"text/x-diff",4
"text/x-opml+xml",4
And a whole bunch of other stuff, mostly application/, but some others like binary/octet-stream, text/x-c++src, x-world/x-vrml and audio/x-pn-realaudio.
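
To put the list above in perspective, here is a rough sketch that parses the counts exactly as listed and adds up how many of those documents the new filter keeps. It only covers the content types enumerated above (the unlisted "other stuff" is not included), and the variable and function names are illustrative.

```python
import csv
import io

# Counts copied from the list above, in the same "type",count format.
FACET_COUNTS = '''\
"text/html",91583
"application/json",1845
"text/xml",1690
"application/rss+xml",1526
"application/xml",1512
"application/javascript",648
"text/plain",595
"application/xhtml+xml",256
"application/octet-stream",242
"application/atom+xml",235
"application/manifest+json",135
"text/css",90
"application/feed+json",32
"text/javascript",26
"application/pgp-keys",24
"application/opensearchdescription+xml",20
"application/x-tex",17
"text/gemini",17
"text/csv",15
"application/json+oembed",14
"text/markdown",14
"application/pgp-signature",12
"text/x-bibtex",11
"text/calendar",10
"application/pgp-encrypted",8
"application/rdf+xml",7
"text/x-csrc",7
"text/x-python",6
"text/x-c",5
"application/stream+json",4
"application/x-mspublisher",4
"image/svg+xml",4
"text/vcard",4
"text/x-diff",4
"text/x-opml+xml",4
'''

# The two content types the search filter now allows.
SHOWN_TYPES = {"text/html", "text/plain"}


def parse_counts(text):
    """Parse '"type",count' lines into a dict of content type -> count."""
    reader = csv.reader(io.StringIO(text))
    return {row[0]: int(row[1]) for row in reader if row}


counts = parse_counts(FACET_COUNTS)
listed_total = sum(counts.values())
retained = sum(n for media_type, n in counts.items() if media_type in SHOWN_TYPES)
excluded = listed_total - retained

print(f"listed: {listed_total}, shown: {retained}, filtered out: {excluded}")
print(f"share shown: {retained / listed_total:.1%}")
```

Of the 100,626 documents in the listed types, 92,178 are text/html or text/plain, so the filter hides roughly 8,400 of them from results.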

There may be a case for blocking indexing of anything that isn't text/html or text/plain, to leave more space in the index for useful content. Possible exceptions are application/json, text/xml, application/rss+xml, application/xml, application/xhtml+xml, application/atom+xml and application/feed+json, because these may contain useful data. If I decide to do this I'll log another ticket.
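
An index-time allowlist along these lines could be sketched as follows. This is only an illustration of the idea, not the indexer's actual code: the function name is hypothetical, and treating a missing Content-Type header as "do not index" is an assumption.

```python
# Media types named above as worth keeping in the index.
INDEXABLE_TYPES = frozenset({
    "text/html",
    "text/plain",
    # Data and feed formats that may contain useful data.
    "application/json",
    "text/xml",
    "application/rss+xml",
    "application/xml",
    "application/xhtml+xml",
    "application/atom+xml",
    "application/feed+json",
})


def should_index(content_type_header):
    """Return True if a response's Content-Type is on the allowlist.

    Parameters such as '; charset=utf-8' are ignored and the comparison is
    case-insensitive. A missing or empty header is rejected (assumption).
    """
    if not content_type_header:
        return False
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in INDEXABLE_TYPES


if __name__ == "__main__":
    for header in ("text/html; charset=utf-8", "text/css", "application/rss+xml", None):
        print(header, should_index(header))
```

Checking at index time rather than query time means rejected files never take up space in the index, which is the benefit described above.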

For reference, there are a couple of previous modifications to what type of content is shown:

And there's a related open ticket:

@m-i-l
Contributor

m-i-l commented May 7, 2024

Now deployed. A search like https://searchmysite.net/search/?q=box+shadow+100%25+height still looks slightly problematic at first because most of the content snippets on the results page are just CSS, but it is better because (a) each result has a proper title, and (b) clicking into each result gets to a web page which includes CSS snippets in the content, i.e. expected behaviour.

@m-i-l
Contributor

m-i-l commented May 8, 2024

Additional info: of the 101,032 pages currently in the system, 102, i.e. around 0.1%, don't have a content_type set at all, so none of those 102 pages will be returned via the new filter. That's a small percentage, and looking at some of the pages they're not likely to be useful search results, e.g. .opml, .webmanifest, .v, .ly, .py, .pub, .gpg, .xml, .awk, .bib, .hex etc. files, so not an issue.

I think that does add weight to the case for cleaning up the index, by only indexing known content types. I've raised #149 for this.
