This is a command line tool for scanning an IMAP account.
Its goal is to find existing messages which contain Schema.org markup (JSON-LD or Microdata) and to optionally dump the findings as JSON-LD.
Please consider donating test data (in anonymized/pseudonymous form) to the schema-org-examples dataset.
To build the project, use the following command to create the sml-account-scan.jar file under dist/:
ant jar-scanner
Requires Java 8+.
You'll also need your IMAP host name and corresponding login credentials to get started.
The following command will traverse all IMAP folders and output all structured data found in the emails to the console:
java -jar sml-account-scan.jar -h <host> -u <user> -p <password>
Optionally, you can dump structured data to a directory (one file per source message):
java -jar sml-account-scan.jar -h <host> -u <user> -p <password> -d <output-directory>
Example:
java -jar sml-account-scan.jar -h imap.example.com -u [email protected] -p "G37-5CHW1F7Y" -d /tmp/scanner-output/
See also additional config options below.
Some email providers, such as FastMail, Google, and Microsoft, recommend OAuth as the default authentication mechanism. Since this scanner does not currently support OAuth, you can instead set up an "app-specific password" for those providers.
Please see the corresponding provider documentation for details:
- FastMail: Adding a new third-party app
- Gmail: Sign in with app passwords
- Microsoft (Outlook.com / Microsoft 365): How to get and use app passwords
Use the generated app-specific password for the <password> parameter when configuring the scanner CLI application.
You can specify a comma-separated list of IMAP folders to scan using the -i option.
Example:
java -jar sml-account-scan.jar -h imap.example.com -u [email protected] -p "G37-5CHW1F7Y" -d /tmp/scanner-output/ -i git,INBOX,lists.sml
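If the scanner is driven from a script, the folder list can be assembled programmatically. The following is a minimal sketch in Python, not part of the tool: the helper name `build_scan_command` and the default jar path are assumptions, and only the options documented in this README are used.

```python
from typing import List, Optional, Sequence


def build_scan_command(
    host: str,
    user: str,
    password: str,
    output_dir: Optional[str] = None,
    folders: Sequence[str] = (),
    jar: str = "sml-account-scan.jar",  # assumed location of the built jar
) -> List[str]:
    """Return an argument list for the scanner CLI (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "-h", host, "-u", user, "-p", password]
    if output_dir:
        # -d dumps structured data to a directory
        cmd += ["-d", output_dir]
    if folders:
        # -i expects a single comma-separated value, e.g. "git,INBOX,lists.sml"
        cmd += ["-i", ",".join(folders)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing an argument list rather than a shell string avoids quoting problems with passwords that contain special characters.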
By default, the tool will use IMAPS (IMAP over SSL) to connect to the email server. You can override this behavior and specify the connection type and port using the following options:
- -f, --force-no-ssl: Disable SSL for the connection (use plain IMAP).
- -o, --override-port <arg>: Override the default remote system port.
- STARTTLS is currently not supported.
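How the two options interact can be sketched as follows. This is an illustration only, not the scanner's implementation: the function name is hypothetical, and the defaults 993 (IMAPS) and 143 (plain IMAP) are the standard registered ports, assumed here rather than taken from the scanner's source.

```python
from typing import Optional, Tuple

IMAPS_PORT = 993  # standard port for IMAP over SSL/TLS
IMAP_PORT = 143   # standard port for plain IMAP


def resolve_connection(force_no_ssl: bool = False,
                       override_port: Optional[int] = None) -> Tuple[str, int]:
    """Map -f / -o style settings to a (protocol, port) pair (hypothetical)."""
    protocol = "imap" if force_no_ssl else "imaps"
    default_port = IMAP_PORT if force_no_ssl else IMAPS_PORT
    # An explicit override wins over the protocol's default port.
    port = default_port if override_port is None else override_port
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    return protocol, port
```

Under these assumptions, calling the function without arguments yields IMAPS on port 993, and setting `force_no_ssl=True` switches to plain IMAP on port 143 unless a port is given explicitly.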
Each log line corresponds to an email that contains structured data. The format is as follows:
<date> | <sender> | data objects: <structured-object-count> | <message-id>
Example: 09/27/24 16:09 | Noah Baumbach <[email protected]> | data objects: 1 | <pr-audriga/jsonld2html-javascript/40/updated/[email protected]>
The meaning of each part is:
- date: The date and time of the email.
- sender: The sender of the email.
- structured-object-count: The number of structured data objects found in the email.
- message-id: The message ID of the email.
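For post-processing, a log line in this format can be split into its four parts. The sketch below is a Python illustration under stated assumptions: the names `LogEntry` and `parse_log_line` are hypothetical, and the date is kept as a plain string because the README does not specify whether its format depends on the locale.

```python
import re
from typing import NamedTuple, Optional

# <date> | <sender> | data objects: <count> | <message-id>
LOG_LINE = re.compile(
    r"^(?P<date>.+?) \| (?P<sender>.+) \| "
    r"data objects: (?P<count>\d+) \| (?P<message_id>.+)$"
)


class LogEntry(NamedTuple):
    date: str
    sender: str
    object_count: int
    message_id: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line has a different shape."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["date"], m["sender"], int(m["count"]), m["message_id"])
```

Matching on the literal `data objects:` marker instead of splitting on every `|` keeps the parse stable even if a sender's display name happens to contain a pipe character.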
If -d has been used, the output directory will contain one file for each structured data object found in an email. The naming convention is:
<main-schema-org-type>-<date>-<folder>-<sender>-<messageid>.<syntax>.[json|html]
Example: emailmessage-2024_04_12-git-gfaudriga__notifications_github_com-audriga_nextcloud_mail_pull_1_push_1797267404_github_com.jsonld.json
More details:
- main-schema-org-type: The schema.org type found in the email, derived from something like EmailMessage
- date: The date of the email
- folder: The folder the email was found in
- messageid: The message ID of the email, derived from something like <audriga/nextcloud-mail/pull/1/push/[email protected]>
- sender: The sender of the email, derived from something like gfaudriga <[email protected]>
- syntax: The syntax of the structured data, derived from the content type of the structured data. Supports jsonld or microdata.
If the scanner encounters an error during parsing, the output file contains the full HTML body of the email instead, and the syntax is "unknown".
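When working with a dump directory, the file names can be taken apart again, for example to group files by type or to find the ones that failed to parse. The following Python sketch is hypothetical (`DumpName` and `parse_dump_name` are not part of the tool). It assumes that the type and date parts never contain a hyphen, as in the example above, and it deliberately leaves the folder, sender, and message ID parts unsplit, because the README does not say whether those parts can contain hyphens.

```python
from pathlib import Path
from typing import NamedTuple


class DumpName(NamedTuple):
    schema_type: str
    date: str
    rest: str       # folder, sender and message ID parts, left unsplit
    syntax: str     # "jsonld", "microdata" or "unknown"
    extension: str  # "json" or "html"


def parse_dump_name(path: str) -> DumpName:
    """Split a dump file name into its parts (hypothetical helper).

    Raises ValueError if the name does not have the expected shape.
    """
    name = Path(path).name
    # The last two dot-separated parts are <syntax> and the file extension.
    stem, syntax, extension = name.rsplit(".", 2)
    # Assumption: type and date contain no hyphen, so two splits are safe.
    schema_type, date, rest = stem.split("-", 2)
    return DumpName(schema_type, date, rest, syntax, extension)
```

Files whose `syntax` is `unknown` are the ones for which the scanner stored the full HTML body after a parsing error.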