Commit c952cb6

cpuhrsch authored and facebook-github-bot committed

multiline get_line / model.test / unit test

Summary: See title.

Reviewed By: kahne

Differential Revision: D6629548

fbshipit-source-id: 89e0b04097d54845f8c1264a3f1fa72678de9587

1 parent eeddd0d commit c952cb6

File tree

13 files changed: +482 −202 lines changed

.circleci/config.yml (+50)

```diff
@@ -34,6 +34,7 @@ jobs:
       . .circleci/setup_circleimg.sh
       . .circleci/python_test.sh
+
   "py353":
     docker:
       - image: circleci/python:3.5.3
@@ -67,6 +68,51 @@ jobs:
       . .circleci/setup_circleimg.sh
       . .circleci/python_test.sh
+
+  "py361-pypi":
+    docker:
+      - image: circleci/python:3.6.1
+    working_directory: ~/repo
+    steps:
+      - checkout
+      - run:
+          command: |
+            . .circleci/setup_circleimg.sh
+            . .circleci/pip_test.sh
+
+  "py353-pypi":
+    docker:
+      - image: circleci/python:3.5.3
+    working_directory: ~/repo
+    steps:
+      - checkout
+      - run:
+          command: |
+            . .circleci/setup_circleimg.sh
+            . .circleci/pip_test.sh
+
+  "py346-pypi":
+    docker:
+      - image: circleci/python:3.4.6
+    working_directory: ~/repo
+    steps:
+      - checkout
+      - run:
+          command: |
+            . .circleci/setup_circleimg.sh
+            . .circleci/pip_test.sh
+
+  "py2713-pypi":
+    docker:
+      - image: circleci/python:2.7.13
+    working_directory: ~/repo
+    steps:
+      - checkout
+      - run:
+          command: |
+            . .circleci/setup_circleimg.sh
+            . .circleci/pip_test.sh
+
   "gcc5":
     docker:
       - image: gcc:5
@@ -184,6 +230,10 @@ workflows:
       - "py353"
       - "py346"
       - "py2713"
+      - "py361-pip"
+      - "py353-pip"
+      - "py346-pip"
+      - "py2713-pip"
       - "gcc5"
       - "gcc6"
       - "gcc7"
```

.circleci/pip_test.sh (+12)

```diff
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+#
+# Copyright (c) 2016-present, Facebook, Inc.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree. An additional grant
+# of patent rights can be found in the PATENTS file in the same directory.
+#
+
+sudo pip install --index-url https://test.pypi.org/simple/ fasttext
+python runtests.py -u
```

python/README.md (+41 −104)

````diff
@@ -4,148 +4,85 @@
 
 ## Requirements
 
-**fastText** builds on modern Mac OS and Linux distributions.
+[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
 Since it uses C\++11 features, it requires a compiler with good C++11 support.
 These include :
 
 * (gcc-4.8 or newer) or (clang-3.3 or newer)
 
 You will need
 
-* python 2.7 or newer
-* numpy & scipy
+* [Python](https://www.python.org/) version 2.7 or >=3.4
+* [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/)
 * [pybind11](https://github.com/pybind/pybind11)
 
-## Building fastTextpy
+## Building fastText
 
-In order to build `fastTextpy`, do the following:
+The easiest way to get the latest version of [fastText is to use pip](https://pypi.python.org/pypi/fasttext).
 
 ```
-$ python setup.py install
+$ pip install fasttext
 ```
 
-This will add the module fastTextpy to your python interpreter.
-Depending on your system you might need to use 'sudo', for example
-
-```
-$ sudo python setup.py install
-```
+If you want to use the latest unstable release you will need to build from source using setup.py.
 
 Now you can import this library with
 
 ```
 import fastText
 ```
 
-
 ## Examples
 
-If you're already largely familiar with fastText you could skip this section
-and take a look at the examples within the doc folder.
-
-## Using models
-
-First, you'll need to train a model with fastText. For example
-
-```
-./fasttext skipgram -input data/fil9 -output result/fil9
-```
-
-You can see more examples within the scripts in the [fastText repository](https://github.com/facebookresearch/fastText).
-
-Next, you can load this model from Python and query it.
+In general it is assumed that the reader already has good knowledge of fastText. For this consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervised-tutorial.html).
 
-```
-from fastText import load_model
-
-f = load_model('result/model.bin')
-words, frequency = f.get_words()
-subwords = f.get_subwords("Paris")
-```
-
-If you trained an unsupervised model, you can get word vectors with
-
-```
-vector = f.get_word_vector("London")
-```
+We recommend you look at the [examples within the doc folder](https://github.com/facebookresearch/fastText/tree/master/python/doc/examples).
 
-If you trained a supervised model, you can get the top k labels and get their probabilities with
+As with any package you can get help on any Python function using the help function.
 
-```
-k = 5
-labels, probabilities = f.predict("I like this Product", k)
-```
-
-A more advanced application might look like this:
-
-Getting the word vectors of all words:
-
-```
-words, frequency = f.get_words()
-for w in words:
-    print((w, f.get_word_vector(w))
-```
-
-## Training models
-
-Training a model is easy. For example
+For example
 
 ```
-from fastText import train_supervised
-from fastText import train_unsupervised
-
-model_unsup = train_unsupervised(
-    input=<data>,
-    epoch=1,
-    model="cbow",
-    thread=10
-)
-model_unsup.save_model(<path>)
-
-model_sup = train_supervised(
-    input=<labeled_data>
-    epoch=1,
-    thread=10
-)
-```
-
-You can then use the model objects just as exemplified above.
-
-To get extended help on these functions use the python help functions.
+>>> import fastText
+>>> help(fastText.FastText)
 
-For example
+Help on module fastText.FastText in fastText:
 
-```
-Help on function train_unsupervised in module fastText.FastText:
+NAME
+    fastText.FastText
 
-train_unsupervised(input, model=u'skipgram', lr=0.05, dim=100, ws=5, epoch=5, minCount=5, minCountLabel=0, minn=3, maxn=6, neg=5, wordNgrams=1, loss=u'ns', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label=u'__label__', verbose=2, pretrainedVectors=u'', saveOutput=0)
-    Train an unsupervised model and return a model object.
+DESCRIPTION
+    # Copyright (c) 2017-present, Facebook, Inc.
+    # All rights reserved.
+    #
+    # This source code is licensed under the BSD-style license found in the
+    # LICENSE file in the root directory of this source tree. An additional grant
+    # of patent rights can be found in the PATENTS file in the same directory.
 
-    input must be a filepath. The input text does not need to be tokenized
-    as per the tokenize function, but it must be preprocessed and encoded
-    as UTF-8. You might want to consult standard preprocessing scripts such
-    as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
+FUNCTIONS
+    load_model(path)
+        Load a model given a filepath and return a model object.
 
-    The input file must not contain any labels or use the specified label prefix
-    unless it is ok for those words to be ignored. For an example consult the
-    dataset pulled by the example script word-vector-example.sh, which is
-    part of the fastText repository.
+    tokenize(text)
+        Given a string of text, tokenize it and return a list of tokens
+    [...]
 ```
 
-## Processing data
+## IMPORTANT: Preprocessing data / encoding conventions
 
-You can tokenize using the fastText Dictionary method readWord.
+In general it is important to properly preprocess your data. In particular our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.
 
-This will give you a list of tokens split on the same whitespace characters that fastText splits on.
+fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
 
-It will also add the EOS character as necessary, which is exposed via fastText.EOS
+fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.
 
-Then resulting text is then stored entirely in memory.
+* space
+* tab
+* vertical tab
+* carriage return
+* formfeed
+* the null character
 
-For example:
+The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks with MAX\_LINE\_SIZE of tokens and the EOS token is not appended.
 
-```
-from fastText import tokenize
-with open(<PATH>, 'r') as f:
-    tokens = tokenize(f.read())
-```
+The length of a token is the number of UTF-8 characters, obtained by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
````
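The tokenization and UTF-8 conventions documented in the new README text can be illustrated with a short Python sketch. The helper names here are hypothetical; fastText's actual implementation is the C++ code in src/dictionary.cc:

```python
# Hypothetical helpers illustrating the conventions documented above;
# fastText's real tokenizer lives in src/dictionary.cc (C++).

# The ASCII bytes fastText splits tokens on (UTF-8 whitespace is NOT included).
DELIMITERS = {" ", "\t", "\v", "\r", "\f", "\0"}

def simple_tokenize(line):
    """Split a line of text on fastText's delimiter characters."""
    tokens, current = [], []
    for ch in line:
        if ch in DELIMITERS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

def utf8_char_count(token):
    """Token length in UTF-8 characters: count the bytes that are NOT
    continuation bytes (continuation bytes look like 0b10xxxxxx)."""
    return sum(1 for b in token.encode("utf-8") if (b & 0xC0) != 0x80)

print(simple_tokenize("hello\tworld  caf\u00e9"))  # ['hello', 'world', 'café']
print(utf8_char_count("caf\u00e9"))                # 4 (even though it is 5 bytes)
```

Note how `utf8_char_count` matches the README's point about subword lengths: "café" counts as 4 characters, not 5 bytes, because the two-byte sequence for "é" contributes only one non-continuation byte.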

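The line-delimiting behavior the README describes (EOS appended at a newline, overlong lines broken into MAX\_LINE\_SIZE chunks) can be sketched in Python. This is only an illustration of the documented behavior, not fastText's C++ code; the constant values below are taken from src/dictionary.h as assumptions:

```python
# Illustration of the line-reading convention documented above; the real
# logic is in fastText's C++ Dictionary class (src/dictionary.cc).
# Both values below are assumptions based on src/dictionary.h.
MAX_LINE_SIZE = 1024
EOS = "</s>"

def chunk_line(line):
    """Split one newline-terminated line into chunks of at most
    MAX_LINE_SIZE tokens; EOS is appended only to the final chunk,
    i.e. when the newline itself is reached."""
    tokens = line.split()
    chunks = [tokens[i:i + MAX_LINE_SIZE]
              for i in range(0, len(tokens), MAX_LINE_SIZE)]
    if chunks:
        chunks[-1] = chunks[-1] + [EOS]
    return chunks

print(chunk_line("the quick fox"))  # [['the', 'quick', 'fox', '</s>']]
```

Under this sketch, text without newlines (such as fil9) would yield many full-size chunks and only the very last one would carry EOS, which is the caveat the README warns about.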