This repository has been archived by the owner on May 5, 2020. It is now read-only.

TLD Whitelists to limit domain searches #11

Open
knowtheory opened this issue Jan 24, 2015 · 22 comments

Comments

@knowtheory
Contributor

We're going to need two whitelists, one for file types (which we already have) and one for government domains.

Regrettably, the domain issue is going to be an inconsistent solution at best. There are plenty of government sites at .com domains, but we don't want people to be able to search google.com.

The proposed solution is to let any search that validates against the TLD whitelist go through automatically, and to show a captcha and a form asking for more information about the site when it doesn't match the whitelist.

@knowtheory changed the title from "Whitelists to limit searches" to "Whitelists to limit domain searches" on Jan 24, 2015
@knowtheory changed the title from "Whitelists to limit domain searches" to "TLD Whitelists to limit domain searches" on Jan 24, 2015
@knowtheory
Contributor Author

To be clear, this is basically just a set of regexps to match the end of a domain.

So for example, we know that any site that ends with .gov or .gov.uk is okay to go on the whitelist. Sites that end in .com need to be validated with additional information.
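
A minimal sketch of that idea, assuming the whitelist is kept as a plain list of suffixes. The names `WHITELISTED_SUFFIXES`, `build_suffix_pattern`, and `is_whitelisted` are made up for illustration and are not taken from the project's code:

```python
import re

# Hypothetical whitelist entries; the thread only names .gov and .gov.uk.
WHITELISTED_SUFFIXES = ["gov", "gov.uk"]


def build_suffix_pattern(suffixes):
    """Compile one regexp that matches any hostname ending in a listed suffix."""
    alternatives = "|".join(re.escape(s) for s in suffixes)
    # Require a dot before the suffix so "notgov" does not match "gov".
    return re.compile(r"\.(?:" + alternatives + r")$", re.IGNORECASE)


PATTERN = build_suffix_pattern(WHITELISTED_SUFFIXES)


def is_whitelisted(hostname):
    """Return True when the hostname ends with a whitelisted suffix."""
    # Drop surrounding whitespace and a trailing root dot before matching.
    return PATTERN.search(hostname.strip().rstrip(".")) is not None
```

Anchoring on the end of the string matters here: a hostname like `example.gov.com` has `.gov` in the middle but still ends in `.com`, so it should fall through to the extra validation step.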

@knowtheory
Contributor Author

Oh, let's get some checkboxes :)

  • site has whitelist for domain endings that can be modified
  • site checks search requests against the whitelist
  • if a search request fails against the whitelist, the user is asked for details about the site they want to search.
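
The three checkboxes could be sketched roughly like this. It is only an illustration: the `Whitelist` class, `handle_search`, and the returned step names are hypothetical, and the real site may structure this quite differently:

```python
from dataclasses import dataclass, field


@dataclass
class Whitelist:
    """Editable set of domain endings, stored lowercased without a leading dot."""
    endings: set = field(default_factory=set)

    def add(self, ending):
        self.endings.add(ending.lower().lstrip("."))

    def remove(self, ending):
        self.endings.discard(ending.lower().lstrip("."))

    def allows(self, hostname):
        host = hostname.lower().rstrip(".")
        # A hostname passes when it ends with "." plus any listed ending.
        return any(host.endswith("." + e) for e in self.endings)


def handle_search(hostname, whitelist):
    """Return the next step for a search request (hypothetical step names)."""
    if whitelist.allows(hostname):
        return "run_search"
    # Not on the whitelist: ask the user about the site (captcha plus form).
    return "ask_for_site_info"
```

For example, after `wl = Whitelist()` and `wl.add(".gov")`, a request for `www.nasa.gov` would go straight to the search, while `google.com` would be routed to the form.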

@pavel-i-am
Contributor

I think it should be enough to simply leave a contact email at the bottom of our website that users can use to contact us if they feel a certain domain should be whitelisted.

@waldoj
Member

waldoj commented Feb 12, 2015

We can use gman as the whitelist. Here is the actual list.

@knowtheory
Contributor Author

Oh man. If only I had known. I should have put my faith in Balter.

@waldoj
Member

waldoj commented Feb 12, 2015

I can't figure out where the whitelist is, though. A search for “gov” doesn't yield anything.

@pavel-i-am
Contributor

Hi, Waldo,
The whitelist is not hardcoded. Instead, it is managed through /admin on the
server. This way you can add any domain without changing the code.

Waldo Jaquith wrote:

I can't figure out where the whitelist is, though. A search for “gov”
https://github.com/opendata/lmgtdfy/search?utf8=%E2%9C%93&q=gov&type=Code
doesn't yield anything.



@knowtheory
Contributor Author

Ah yep, @waldoj just set you up with a user/pass & fired the details your way.

@waldoj
Member

waldoj commented Feb 12, 2015

Oh, an admin section! Great. :)

@knowtheory
Contributor Author

btw, the format for the TLDs is just gov or gov.uk; we automatically prepend a *. to that.
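
To illustrate the format, here is a rough sketch. The thread doesn't say how the `*.` pattern is evaluated, so shell-style glob matching via Python's `fnmatch` is an assumption, and `to_pattern` and `matches_any` are hypothetical helper names:

```python
from fnmatch import fnmatchcase


def to_pattern(entry):
    """Turn a stored entry such as "gov" into the glob "*.gov"."""
    return "*." + entry.strip().lower()


def matches_any(hostname, entries):
    """Return True when the hostname matches the glob built from any entry."""
    host = hostname.strip().lower()
    # fnmatchcase compares case-sensitively, so both sides are lowercased first.
    return any(fnmatchcase(host, to_pattern(e)) for e in entries)
```

One consequence of prepending `*.` is that the bare entry itself does not match: with the entry `gov`, the hostname `www.nasa.gov` matches `*.gov`, but the string `gov` on its own does not.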

@waldoj
Member

waldoj commented Feb 12, 2015

Do you have any sense of at what point this will become bogged down? That is, will adding thousands of domain names be problematic? (Of course, I'll bulk load them directly into MySQL.)

@pavel-i-am
Contributor

Waldo, an entry does not have to be a single domain.
Rather, you can use a suffix.
So by adding .gov to the whitelist, you'll whitelist all .gov
websites, and so on.


@waldoj
Member

waldoj commented Feb 12, 2015

Sure, I follow, but only 1% of the domains that I'm concerned with are .gov domains.

@pavel-i-am
Contributor

Do you think we should remove the whitelist altogether? The danger of that
is abuse of the search, which is not free to begin with.

Waldo Jaquith wrote:

Sure, I follow, but only 1% of the domains that I'm concerned with
https://github.com/benbalter/gman/blob/master/config/domains.txt are
.gov domains.



@waldoj
Member

waldoj commented Feb 12, 2015

Nope—I think it's a fine idea. I just want to expand it to include the domain names for all governments within the United States.

@pavel-i-am
Contributor

Waldo, if there is such a list somewhere, it will be extremely easy to
add them all to the list. Too bad we haven't found a reliable source so
far. They all use custom domains, not necessarily .gov, indeed.


@waldoj
Member

waldoj commented Feb 12, 2015

This is the list, right here:

https://github.com/benbalter/gman/blob/master/config/domains.txt

@pavel-i-am
Contributor

Do you want me to hardcode them or make a script that will add them to
the database?
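
For reference, the script option might look something like the sketch below. Several things here are assumptions: that the list is one domain per line with blank lines and `//` comment lines to skip, that the table is called `whitelist` with a single `ending` column (both names invented), and that SQLite from the standard library stands in for the MySQL database mentioned in this thread:

```python
import sqlite3


def parse_domain_list(text):
    """Return unique, lowercased domains, skipping blanks and // comment lines."""
    seen = set()
    domains = []
    for line in text.splitlines():
        entry = line.strip().lower()
        if not entry or entry.startswith("//"):
            continue
        if entry not in seen:
            seen.add(entry)
            domains.append(entry)
    return domains


def load_domains(conn, domains):
    """Insert domains into a hypothetical whitelist table; return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS whitelist (ending TEXT PRIMARY KEY)")
    # INSERT OR IGNORE is SQLite syntax; MySQL spells this INSERT IGNORE.
    conn.executemany(
        "INSERT OR IGNORE INTO whitelist (ending) VALUES (?)",
        [(d,) for d in domains],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM whitelist").fetchone()[0]
```

Because duplicates are ignored on insert, running the import a second time against the same list should leave the table unchanged.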


@waldoj
Member

waldoj commented Feb 12, 2015

Oh, it's OK—you don't need to do anything with it. I'm happy to take care of this. :) I'm just wondering if this number of domains is going to be problematic for the software.

@pavel-i-am
Contributor

I don't think it is going to give you any problems. 10000
strings with strict comparison should not take much time. I can write a
basic script for importing such data for you if you need it.
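
As a rough illustration of why the size should not matter much (this is not the project's actual lookup code; `candidate_suffixes` and `is_listed` are hypothetical names, and an in-memory set is assumed rather than a database query):

```python
def candidate_suffixes(hostname):
    """List the hostname and each parent domain, e.g. a.b.c -> a.b.c, b.c, c."""
    labels = hostname.strip().lower().rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def is_listed(hostname, listed_domains):
    """Return True when the hostname or any parent domain is in the set."""
    # Each membership test is a hash lookup, so the cost depends on the number
    # of labels in the hostname rather than the number of listed domains.
    return any(s in listed_domains for s in candidate_suffixes(hostname))
```

Note that this sketch also accepts a hostname that equals a listed entry exactly, which is slightly more permissive than a strict `*.` prefix match.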


@waldoj
Member

waldoj commented Feb 12, 2015

Nah, that's fine—since it's MySQL, I'm happy to do it. I only need this for my own installation of this software—for the base package, there's no need to include this, so no standardized, replicable process is required.

@pavel-i-am
Contributor

Got it. In case you need any sort of help, feel free to message me and
we'll get this resolved quickly.

