A Go port of Mozilla's Readability.js, the algorithm that powers Firefox Reader View. It relies on heuristics (e.g. link density, text similarity, number of images) that work surprisingly well in practice, and it offers a distraction-free reading experience for articles, blog posts, and other text-heavy web pages by removing ads, GDPR-compliant cookie banners, and other unsolicited junk.
The source code is aligned with the latest commit (97db40b) on the main branch.
Readability.js, maintained by Mozilla, is based on a JavaScript bookmarklet developed by Arc90, a consulting firm that was experimenting with web technology at the time and used to release some of its work as open source software. The company's site has long since disappeared, but it can still be found on the Wayback Machine.
The source code was released in 2009 under the Apache 2.0 license on Google Code. It was abandoned in 2010 and repackaged as a web service called Readability.com, which was in turn discontinued in 2016. The main source code contributor was Chris Dary (@umbrae).
Most modern browsers still use one of the available forks of the original Arc90 implementation when displaying web pages in reading mode.
For a historical and detailed analysis of this topic, please read this excellent series of articles by Daniel Aleksandersen.
Add a dependency for the package:
go get -u github.com/giulianopz/go-readability
Get text content from a web page article:
package main

import (
	"fmt"

	"github.com/giulianopz/go-readability"
)

func main() {
	var htmlSource = `<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>
Redis will remain BSD licensed - &lt;antirez&gt;
</title>
<link href="/rss" rel="alternate" type="application/rss+xml" />
</head>
<body>
<div id="container">
<header>
<h1><a href="/">&lt;antirez&gt;</a></h1>
</header>
<div id="content">
<section id="newslist">
<article data-news-id="120">
<h2><a href="/news/120">Redis will remain BSD licensed</a></h2>
</article>
</section>
<article class="comment" style="margin-left:0px" data-comment-id="120-" id="120-"><span class="info"><span
class="username"><a href="/user/antirez">antirez</a></span> 2095 days ago.
170643 views. </span>
<pre>Today a page about the new Common Clause license in the Redis Labs web site was interpreted as if Redis itself switched license. This is not the case, Redis is, and will remain, BSD licensed. However in the era of [edit] uncontrollable spreading of information, my attempts to provide the correct information failed, and I’m still seeing everywhere “Redis is no longer open source”. The reality is that Redis remains BSD, and actually Redis Labs did the right thing supporting my effort to keep the Redis core open as usually.
[...]
We at Redis Labs are sorry for the confusion generated by the Common Clause page, and my colleagues are working to fix the page with better wording.</pre>
</article>
</div>
</div>
</body>
</html>`
	// Quick check: does the page look like it has enough content to be worth parsing?
	isReaderable := readability.IsProbablyReaderable(htmlSource)
	fmt.Printf("Contains any text?: %t\n", isReaderable)

	// Build a reader from the HTML source and the URL of the page.
	reader, err := readability.New(
		htmlSource,
		"http://antirez.com/news/120",
		readability.ClassesToPreserve("caption"),
	)
	if err != nil {
		panic(err)
	}

	// Extract the article and its metadata.
	result, err := reader.Parse()
	if err != nil {
		panic(err)
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Byline)
	fmt.Printf("Length: %d\n", result.Length)
	fmt.Printf("Excerpt: %s\n", result.Excerpt)
	fmt.Printf("SiteName: %s\n", result.SiteName)
	fmt.Printf("Lang: %s\n", result.Lang)
	fmt.Printf("PublishedTime: %s\n", result.PublishedTime)
	fmt.Printf("Content: %s\n", result.Content)
	fmt.Printf("TextContent: %s\n", result.TextContent)
}