Commit b140405

Initial commit
Author: Chris Newell
1 parent 82c4dee, commit b140405

52 files changed, with 17573 additions and 2 deletions

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -127,3 +127,10 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# Eclipse
+.project
+.pydevproject
+
+# Misc
+.DS_Store

CONTRIBUTING.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Contributing to Citron
+
+## Questions
+
+Questions can be raised on the [discussion board](https://github.com/bbc/citron/discussions/categories/q-a).
+
+## Reporting bugs
+
+Please report bugs on the [issue tracker](https://github.com/bbc/citron/issues), including as much detail as possible so that your bug is reproducible.
+
+## Feature suggestions
+
+Please suggest new features on the [discussion board](https://github.com/bbc/citron/discussions/categories/ideas), describing the intended usage.
+
+## Contributing code
+
+Except for small changes and docs, it is best to [suggest a feature](https://github.com/bbc/citron/discussions/categories/ideas) before submitting a pull request.

README.md

Lines changed: 78 additions & 2 deletions
@@ -1,2 +1,78 @@
-# citron
-Citron is a quote extraction and attribution system developed by BBC R&D
+# Citron #
+
+Citron is an experimental quote extraction and attribution system created by [BBC R&D](https://www.bbc.co.uk/rd), based on a [paper](https://aclanthology.org/D13-1101/) and a [dataset](https://aclanthology.org/L16-1619/) developed by the School of Informatics at the University of Edinburgh.
+
+It can be used to extract quotes from text documents, attributing them to the appropriate speaker and resolving pronouns where necessary. Note that there can be a significant number of errors and omissions. Extracted quotes should be checked against the input text.
+
+You can run Citron using the [pre-trained model](./models/en_2021-11-15) or [train your own model](./scripts/train). You can also [evaluate its performance](./scripts/evaluate).
+
+Training and evaluating models requires data using [Citron's Annotation Format](./docs/data_format.md). Citron provides [pre-processing scripts](./scripts/preprocess) to extract suitable data from the [PARC 3.0 Corpus of Attribution Relations](https://aclanthology.org/L16-1619/). Alternatively, you can create your own data using the [Citron Annotator](./scripts/annotator) app.
+
+Technical details and potential applications are discussed in: ["Quote Extraction and Analysis for News"](./docs/DSJM_2018_paper_1.pdf).
+
+## Installation ##
+Requires Python 3.7.2 or above. The package versions shown should be installed when using the [pre-trained model](./models/en_2021-11-15).
+
+- [Install scikit-learn (1.0.*)](https://scikit-learn.org/stable/install.html)
+- [Install spaCy (3.*) and download a model](https://spacy.io/usage) (e.g. "en_core_web_sm")
+- Download the source code: ```git clone git@github.com:bbc/citron.git```
+
+Then from the citron root directory:
+
+    python3 -m pip install -r requirements.txt
+
+Then from python3:
+
+    import nltk
+    nltk.download("names")
+
+## Usage ##
+
+Scripts to run Citron are available in the [bin/](./bin/) directory.
+
+All scripts require the citron root directory in the PYTHONPATH.
+
+    $ export PYTHONPATH=$PYTHONPATH:/path/to/citron_root_directory
+
+### Run the Citron REST API and demonstration server ###
+
+    $ citron-server
+      --model-path Path to Citron model directory
+      --logfile Path to logfile (Optional)
+      --port Port for the Citron API (Optional: default is 8080)
+      -v Verbose mode (Optional)
+
+### Run Citron on the command-line ###
+
+    $ citron-extract
+      --model-path Path to Citron model directory
+      --input-file Path to input file (Optional: Otherwise read from stdin)
+      --output-file Path to output file (Optional: Otherwise write to stdout)
+      -v Verbose mode (Optional)
+
+### Use Citron in Python ###
+
+    from citron.citron import Citron
+    from citron import utils
+
+    nlp = utils.get_parser()
+    citron = Citron(model_path, nlp)
+    doc = nlp(text)
+    quotes = citron.get_quotes(doc)
+
+## Issues and Questions ##
+Issues can be reported on the [issue tracker](https://github.com/bbc/citron/issues) and questions can be raised on the [discussion board](https://github.com/bbc/citron/discussions/categories/q-a).
+
+## Contributing ##
+
+Contributions would be welcome. Please refer to the [contributing guidelines](./CONTRIBUTING.md).
+
+## License ##
+
+Licensed under the [Apache License, Version 2.0](./LICENSE). The [pre-trained model](./models/en_2021-11-15) includes data from VerbNet 3.3 which is licensed under the [VerbNet 3.0 license](./verbnet-license.3.0.txt).
+
+## Contact ##
+
+For more information please contact: [[email protected]](mailto:[email protected])
+
+Copyright 2021 British Broadcasting Corporation.
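
Reviewer note: the README's usage section (the PYTHONPATH requirement plus the `citron-extract` flags) could be driven from Python roughly as follows. This is only a minimal sketch under stated assumptions: `build_extract_command` and `build_env` are hypothetical helper names that are not part of Citron, the script is assumed to be reachable as `citron-extract`, and only the flags listed in the README are used.

```python
import os


def build_extract_command(model_path, input_file=None, output_file=None,
                          verbose=False, script="citron-extract"):
    """Assemble an argv list for citron-extract (hypothetical helper, not part of Citron)."""
    cmd = [script, "--model-path", model_path]
    if input_file is not None:
        cmd += ["--input-file", input_file]    # otherwise the script reads stdin
    if output_file is not None:
        cmd += ["--output-file", output_file]  # otherwise the script writes stdout
    if verbose:
        cmd.append("-v")
    return cmd


def build_env(citron_root, base_env=None):
    """Return a copy of the environment with citron_root appended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = existing + os.pathsep + citron_root if existing else citron_root
    return env
```

The two results could then be handed to `subprocess.run(cmd, env=env, check=True)`; all paths here are placeholders.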

bin/citron-extract

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+# Copyright 2021 BBC
+# Authors: Chris Newell <[email protected]>
+#
+# License: Apache-2.0
+
+"""
+This script runs Citron on the command line.
+
+"""
+
+import argparse
+import logging
+import json
+import sys
+
+from citron.citron import Citron
+from citron.logger import logger
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Extract quotes from text",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+    parser.add_argument("-v",
+        action = "store_true",
+        default = False,
+        help = "Verbose mode"
+    )
+    parser.add_argument("--model-path",
+        metavar = "model_path",
+        type = str,
+        required=True,
+        help = "Path to Citron model directory"
+    )
+    parser.add_argument("--input-file",
+        metavar = "input_file",
+        type = str,
+        help = "Optional: Otherwise read from stdin"
+    )
+    parser.add_argument("--output-file",
+        metavar = "output_file",
+        type = str,
+        help = "Optional: Otherwise write to stdout"
+    )
+    args = parser.parse_args()
+
+    if args.v:
+        logger.setLevel(logging.DEBUG)
+
+    citron = Citron(args.model_path)
+
+    if args.input_file is None:
+        text = ""
+
+        while True:
+            line = sys.stdin.readline()
+
+            if not line:
+                break
+
+            text += " " + line
+
+    else:
+        with open(args.input_file, encoding="utf-8") as infile:
+            text = infile.read()
+
+    results = citron.extract(text)
+    output = json.dumps(results, indent=4, sort_keys=False, ensure_ascii=False)
+
+    if args.output_file is None:
+        print(output)
+
+    else:
+        with open(args.output_file, "w", encoding="utf-8") as outfile:
+            outfile.write(output + "\n")
+
+
+if __name__ == "__main__":
+    main()
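
Reviewer note: the input handling in `bin/citron-extract` (read a file if `--input-file` is given, otherwise accumulate stdin line by line) can be isolated for testing along these lines. This is a sketch, not the committed code: `read_text` is a hypothetical helper name, and the `stdin` parameter is an assumption added so that a stream can be injected in place of `sys.stdin`.

```python
import io
import sys


def read_text(input_file=None, stdin=None):
    """Return the input text from a file, or from a stream when no file is given."""
    if input_file is None:
        stream = sys.stdin if stdin is None else stdin
        # Mirrors the script's loop: every line is prefixed with one space.
        return "".join(" " + line for line in stream)
    with open(input_file, encoding="utf-8") as infile:
        return infile.read()


# Example with an in-memory stream standing in for stdin.
sample = io.StringIO('"It works," she said.\nHe agreed.\n')
text = read_text(stdin=sample)
```

Note that the stdin path and the file path are not byte-for-byte equivalent: the stdin path adds a leading space to every line, while the file path returns the content unchanged.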

bin/citron-server

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+#!/usr/bin/env python3
+# Copyright 2021 BBC
+# Authors: Chris Newell <[email protected]>
+#
+# License: Apache-2.0
+
+"""
+This script starts a web server which supports the Citron
+REST API and demonstration.
+
+"""
+
+import argparse
+import logging
+
+from citron.citron import Citron
+from citron.citron import CitronWeb
+from citron.logger import logger
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Run the Citron REST API",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+    parser.add_argument("-v",
+        action = "store_true",
+        default = False,
+        help = "Verbose mode"
+    )
+    parser.add_argument("--model-path",
+        metavar = "model_path",
+        type = str,
+        required=True,
+        help = "Path to Citron model directory"
+    )
+    parser.add_argument("--logfile",
+        metavar = "logfile",
+        type = str,
+        default = None,
+        help = "Logfile for output"
+    )
+    parser.add_argument("--port",
+        metavar = "port",
+        type = int,
+        default = 8080,
+        help = "Port for the Citron API"
+    )
+    args = parser.parse_args()
+
+    if args.v:
+        logger.setLevel(logging.DEBUG)
+
+    citron = Citron(args.model_path)
+    web = CitronWeb(citron)
+    web.start(args.port, args.logfile)
+
+
+if __name__ == "__main__":
+    main()
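
Reviewer note: the argument handling in `bin/citron-server` can be exercised without loading a model by building the same parser in a function and passing it an explicit argument list. This is a sketch under assumptions: `build_server_parser` is a hypothetical name, and the model path is a placeholder string that is never opened.

```python
import argparse


def build_server_parser():
    """Rebuild the citron-server argument parser (same flags as the script)."""
    parser = argparse.ArgumentParser(
        description="Run the Citron REST API",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-v", action="store_true", default=False,
                        help="Verbose mode")
    parser.add_argument("--model-path", metavar="model_path", type=str,
                        required=True, help="Path to Citron model directory")
    parser.add_argument("--logfile", metavar="logfile", type=str,
                        default=None, help="Logfile for output")
    parser.add_argument("--port", metavar="port", type=int,
                        default=8080, help="Port for the Citron API")
    return parser


# Only the required flag is supplied, so the other options take their defaults.
args = build_server_parser().parse_args(["--model-path", "models/en_2021-11-15"])
print(args.port, args.logfile, args.v)  # prints: 8080 None False
```

Keeping parser construction separate from `main()` is a design choice made for this sketch; the committed script builds the parser inline.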

citron/__init__.py

Whitespace-only changes.
