- Support for rdflib 6.0.0 (PR #177)
- Fix for when jsonld is null (PR #56)
- Documentation fixes (PR #173)
- Support for rdflib 5.0.0 and up. When upgrading, we recommend switching to latest versions of rdflib, mf2py, rdflib-jsonld and pyrdfa3. (PR #161)
- Support for Python 3.8 and 3.9
- Show full README on PyPI (PR #162)
- README rendering fixed and tested (PR #170)
- Use GitHub Actions instead of Travis CI
- support Dublin Core Metadata (DC-HTML-2003) (PR #101)
- support the non-standard `product` Open Graph namespace (PR #152)
- move release documentation to the wiki (PR #150)
- support Open Graph arrays via `with_og_array=True` (PR #138)
- support "expanded" Open Graph metadata based on og:type (PR #140)
- parse JSON with JS comments for json-ld (PR #137)
- preserve order for duplicated properties for RDFa (PR #139)
- improve microdata parser performance with large number of items (PR #148)
- spelling fixes (PR #145)
- REST API `extruct.service` removed
- rdflib dependency restricted to <5.0.0, as parsers used by extruct were removed in 5.0.0
- Python 3.4 support is dropped;
- in case of duplicate OpenGraph definitions (e.g. multiple `og:image`), empty results are de-prioritized now, to do the same as Facebook;
- text content of microdata attributes is now extracted using html-text library, which fixes badly extracted text in some cases (words glued together, etc.)
- In case of duplicate OpenGraph definitions (e.g. multiple `og:image`), extruct now keeps the first one, not the last one, to do the same as Facebook.
- Cover all possible exception cases handled by the `extruct()` `errors` attribute for the values `strict`, `log` and `ignore`
- avoid including `itemprop` from child `itemscope` when using `itemref` for microdata
- proper processing order for `itemref` for microdata
- json-ld parsing issue is fixed;
- deprecation warning for `url` argument points to caller code;
- better Python 3.7 support (fixed warnings, setup running 3.7 tests on CI).
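
The duplicate OpenGraph handling described in the notes above (keep the first definition, de-prioritize empty ones) can be pictured with a small standalone sketch. It assumes properties arrive as `(name, content)` pairs in document order; `pick_first_non_empty` is a hypothetical helper for illustration, not extruct's actual implementation.

```python
def pick_first_non_empty(pairs):
    """Collapse duplicate properties into one value per name.

    Assumption: `pairs` is an iterable of (name, content) tuples in
    document order. The first non-empty value wins; an empty value is
    kept only while nothing better has been seen for that name.
    """
    result = {}
    for name, value in pairs:
        value = (value or "").strip()
        if name not in result:
            # First occurrence, even if empty, reserves the slot.
            result[name] = value
        elif not result[name] and value:
            # An earlier empty value gives way to a later non-empty one.
            result[name] = value
    return result


properties = [
    ("og:image", ""),
    ("og:image", "a.png"),
    ("og:image", "b.png"),
    ("og:title", "Title"),
]
merged = pick_first_non_empty(properties)
```

Here `merged["og:image"]` ends up as `"a.png"`: the empty first definition is skipped and the later `"b.png"` does not override it.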
In this release OpenGraph parsing is improved:
- known OpenGraph namespaces (og, music, video, article, book, profile) work without an explicitly defined prefix;
- prefix is extracted both from `<head>` and `<html>` element attributes, not only from `<head>`;
- prefix parsing is more permissive.
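
As a rough illustration of the prefix handling listed above, the sketch below merges `prefix` attribute values over a table of default namespaces. The namespace URIs, the regular expression and the `parse_prefixes` name are assumptions made for this example; extruct's real parser may differ.

```python
import re

# Assumed defaults for the namespaces named above (URIs as published
# by the Open Graph protocol; treat them as illustrative here).
KNOWN_PREFIXES = {
    "og": "http://ogp.me/ns#",
    "music": "http://ogp.me/ns/music#",
    "video": "http://ogp.me/ns/video#",
    "article": "http://ogp.me/ns/article#",
    "book": "http://ogp.me/ns/book#",
    "profile": "http://ogp.me/ns/profile#",
}

# Matches "name: uri" pairs inside a prefix attribute value.
_PREFIX_RE = re.compile(r"([\w-]+):\s+(\S+)")


def parse_prefixes(*attr_values):
    """Merge prefix declarations (e.g. from <html> and <head>) over the defaults."""
    prefixes = dict(KNOWN_PREFIXES)
    for value in attr_values:
        if not value:
            continue  # the attribute may be missing on either element
        for name, uri in _PREFIX_RE.findall(value):
            prefixes[name] = uri
    return prefixes


html_prefix = "og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#"
head_prefix = None  # no prefix attribute on <head> in this example
prefixes = parse_prefixes(html_prefix, head_prefix)
```

With this input, `prefixes` contains the declared `fb` namespace alongside the built-in defaults such as `music`.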
Other changes:
- pypi version badge is added to the README;
- html parsing code is cleaned up.
- JSON-LD parsing is less strict now: control characters are allowed.
- Add OpenGraph and Microformat extractors.
- Add argument `syntaxes` to `extract` and command line function; it allows selecting which syntaxes to extract.
- Add argument `uniform` to `extract` and command line function; if True, it maps the output of Microdata, OpenGraph, Microformat and Json-ld to the same template.
- Add argument `errors` to `extract` and command line function; it allows defining whether errors should be raised, logged or ignored.
- Fix RDFa memory leak, now `RDFaExtractor` resets `_lookups` after each extraction.
- Fixed regex pattern in `JsonLdExtractor` to avoid removing comments from within valid JSON.
- In `w3microdata` strip whitespaces, newlines, etc from urls extracted from html nodes.
- `base_url` substitutes `url` in `MicroformatExtractor`, `JsonLdExtractor`, `OpenGraphExtractor`, `RDFaExtractor` and `MicrodataExtractor`
- individual extractors accept `base_url` instead of `url`, unused keyword arguments are removed.
- In `w3microdata.extract_items` `items_seen` and `url` are no longer class variables but are passed as arguments.
- In `w3microdata` the following functions are now private: `extract_item`, `extract_property_value`, `extract_textContent`, `_extract_property`, `_extract_properties`, `_extract_property_refs` and `_extract_textContent`.
- In `w3microdata` `_extract_properties`, `_extract_property_refs`, `_extract_property`, `_extract_property_value` and `_extract_item` now need `items_seen` and `url` to be passed as arguments.
- Add argument `return_html_node` to `extract`; it allows returning the HTML node with the result of metadata extraction. It is supported only by microdata syntax.
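
The three behaviours of the `errors` argument mentioned above (raise, log or ignore) follow a common pattern that can be sketched without extruct itself. The wrapper below is a hypothetical illustration: `run_with_policy` and the empty-list fallback are assumptions, and only the `strict`, `log` and `ignore` value names come from these notes.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_policy(func, errors="strict"):
    """Call `func` and handle failures according to `errors`.

    Hypothetical helper: 'strict' re-raises, 'log' logs the exception
    and returns an empty result, 'ignore' silently returns an empty result.
    """
    if errors not in ("strict", "log", "ignore"):
        raise ValueError("errors must be 'strict', 'log' or 'ignore'")
    try:
        return func()
    except Exception:
        if errors == "strict":
            raise
        if errors == "log":
            logger.exception("extraction failed")
        return []  # assumed fallback for 'log' and 'ignore'


def broken_extractor():
    # Stand-in for a parser that fails on bad markup.
    raise RuntimeError("bad markup")


ignored = run_with_policy(broken_extractor, errors="ignore")
```

Passing `errors="strict"` to the same call would propagate the `RuntimeError` to the caller instead.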
Warning: backward-incompatible change:
- `base_url` is used instead of `url` in `extruct.extract`, `url` is still supported but deprecated.
- In `extruct.extract` default `base_url` is now `None` to avoid wrong results with `urljoin`.
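
One plausible reading of the `urljoin` remark above is that joining relative URLs against a base that is not the page's real address yields absolute URLs that look valid but are wrong. The sketch below shows the standard-library behaviour; `resolve` is a hypothetical helper, not part of extruct.

```python
from urllib.parse import urljoin


def resolve(url, base_url=None):
    """Resolve `url` against `base_url` only when a base is actually known."""
    if base_url is None:
        # Without a trustworthy base, leave the URL untouched.
        return url
    return urljoin(base_url, url)


unresolved = resolve("/img/a.png")
absolute = resolve("/img/a.png", base_url="http://example.com/page/")
relative = resolve("a.png", base_url="http://example.com/dir/page.html")
```

With no base the relative path is returned unchanged, which makes a missing `base_url` visible to the caller instead of hiding it behind a guessed host.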
- New `extruct` command line tool to fetch a page and extract its metadata. Works either via `extruct` directly or `python -m extruct`.
- Accept leading HTML comment in JSON-LD payload.
- rdflib log messages were silenced to avoid the noise when importing extruct.
- Fix dependencies and support RDFa by default (hence depend on rdflib by default).
- Update README with all-in-one extractor examples.
- All extractors have a `.extract_items()` method, taking an lxml-parsed document as input, if you want to reuse one you already have.
- Add generic extraction: use `extruct.extract()` to call all extractors at once.
Warning: backward-incompatible change:
`.extract()` methods now return a list of Python dicts (the items) instead of a dict with an "items" key having this list as value.
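
Code that has to run against both the old and the new return shape could normalise the result with a small shim. This is a minimal sketch; `items_from_result` is a hypothetical helper name, and the sample data is made up.

```python
def items_from_result(result):
    """Return the list of items from either return shape.

    Old shape: {"items": [...]}; new shape: [...] (a list of dicts).
    """
    if isinstance(result, dict) and "items" in result:
        return result["items"]
    return list(result)


old_style = {"items": [{"type": "http://schema.org/Product"}]}
new_style = [{"type": "http://schema.org/Product"}]

old_items = items_from_result(old_style)
new_items = items_from_result(new_style)
```

Both calls produce the same list, so downstream code can iterate over the items without caring which version produced them.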
- Use rdflib's pyRdfa directly instead of pyRdfa3 code copy.
- (Very) Experimental support for RDFa extraction using rdflib+lxml
- Web service response content-type set to 'application/json'
- Web service Python 3 compatibility
- Code coverage reports
- Fix extraction of `<object>` "data" URL with microdata
- Handle textContent mixed with `<script>` and `<style>` tags
- Add JSON-LD extraction example to README
- Tests added for non-nested microdata output
- Tests added for text content option
- Tests added for "meter" and "data" attributes
- First release on PyPI.