Parsers for robots.txt (aka Robots Exclusion Standard / Robots Exclusion Protocol), Robots Meta Tag, and X-Robots-Tag

Toimik.RobotsProtocol

A .NET 8 C# robots.txt parser and a C# Robots Meta Tag / X-Robots-Tag parser.

Features

RobotsTxt.cs

  • Creates instance via string or stream
  • Parses standard, extended, and custom fields:
    • User-agent
    • Disallow
    • Crawl-delay
    • Sitemap
    • Allow (toggleable; can be ignored if needed)
    • Others (e.g. Host)
  • Tolerates misspellings of field names
  • Matches wildcards in paths (* and $)
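The wildcard matching can be sketched as follows. This is a hedged example: the robots.txt content and the user agent name are illustrative, and the commented expectations assume the standard Robots Exclusion Protocol semantics for `*` and `$` rather than verified library output.

```csharp
// Sketch of wildcard matching, assuming Toimik.RobotsProtocol is installed.
using Toimik.RobotsProtocol;

var robotsTxt = new RobotsTxt();

// '*' matches any sequence of characters; '$' anchors the end of a path.
_ = robotsTxt.Load(
    "User-agent: *\n"
    + "Disallow: /private*\n"  // any path starting with /private
    + "Disallow: /*.gif$\n");  // any path ending in .gif

// Under standard Robots Exclusion Protocol wildcard semantics:
// IsAllowed("autobot", "/private/file.htm") should be false
// IsAllowed("autobot", "/images/photo.gif") should be false
// IsAllowed("autobot", "/public/page.htm") should be true
var isPublicPageAllowed = robotsTxt.IsAllowed("autobot", "/public/page.htm");
```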

RobotsTag.cs

  • Parses custom fields

Quick Start

Installation

Package Manager

PM> Install-Package Toimik.RobotsProtocol

.NET CLI

> dotnet add package Toimik.RobotsProtocol

Usage

Snippets are shown below. Refer to the demo programs in the samples folder for the complete source code.

RobotsTxt.cs (for parsing robots.txt)

var robotsTxt = new RobotsTxt();

// Load content of a robots.txt from a String
var content = "...";
_ = robotsTxt.Load(content);

// Load content of a robots.txt from a Stream
// Stream stream = ...;
// _ = await robotsTxt.Load(stream);

var isAllowed = robotsTxt.IsAllowed("autobot", "/folder/file.htm");

RobotsTag.cs (for parsing robots meta tag / x-robots-tag)

var robotsTag = new RobotsTag();

// This data is either retrieved from a Robots Meta Tag (e.g. <meta name="badbot"
// content="none">) or an X-Robots-Tag HTTP response header (e.g. X-Robots-Tag: otherbot:
// index, nofollow).
var data = ...;

// Words treated as names of directives that take values (e.g. max-snippet: 10).
var specialWords = new HashSet<string>
{
    "max-snippet",
    "max-image-preview",

    // ... Add accordingly
};

// Load the data to parse. This extracts every directive into its own Tag instance.
_ = robotsTag.Load(data, specialWords);

var hasNone = robotsTag.HasTag("autobot", "none");
var hasNoIndex = robotsTag.HasTag("autobot", "noindex");
var isIndexable = !hasNone && !hasNoIndex;
