## Requirements

[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support.

These include:

* (gcc-4.8 or newer) or (clang-3.3 or newer)

You will need

* [Python](https://www.python.org/) version 2.7 or >=3.4
* [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/)
* [pybind11](https://github.com/pybind/pybind11)

## Building fastText

The easiest way to get the latest version of fastText is to use [pip](https://pypi.python.org/pypi/fasttext).

```
$ pip install fasttext
```

If you want to use the latest unstable release, you will need to build from source using setup.py.

Now you can import this library with

```
import fastText
```

## Examples

In general, it is assumed that the reader already has good knowledge of fastText. For this, consult the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervised-tutorial.html).

We recommend you look at the [examples within the doc folder](https://github.com/facebookresearch/fastText/tree/master/python/doc/examples).

As with any package, you can get help on any Python function using the `help` function.

For example

```
>>> import fastText
>>> help(fastText.FastText)

Help on module fastText.FastText in fastText:

NAME
    fastText.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the BSD-style license found in the
    # LICENSE file in the root directory of this source tree. An additional grant
    # of patent rights can be found in the PATENTS file in the same directory.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]
```

## IMPORTANT: Preprocessing data / encoding conventions

In general, it is important to properly preprocess your data. In particular, our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.

fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
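
To illustrate the conversion pybind11 performs, here is a small pure-Python sketch (independent of fastText): a Python 3 `str` holds unicode text, and it is the UTF-8 byte encoding of that text that reaches the C++ side.

```python
# A Python 3 str holds unicode text; pybind11 encodes it to UTF-8
# bytes before handing it to the fastText C++ library.
text = "Münich"              # unicode string (str in Python 3)
raw = text.encode("utf-8")   # the byte sequence the C++ side receives

# "ü" occupies two bytes in UTF-8, so the byte length exceeds
# the character length.
print(len(text))  # 6 characters
print(len(raw))   # 7 bytes
```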

fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

* space
* tab
* vertical tab
* carriage return
* formfeed
* the null character
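
As a rough pure-Python sketch (not fastText's actual C++ implementation), splitting on exactly those ASCII bytes could look like this:

```python
# The ASCII delimiter characters listed above; this is an
# illustrative sketch, not the actual fastText implementation.
FASTTEXT_DELIMS = " \t\v\r\f\0"

def split_tokens(line):
    """Split a line into tokens on fastText's delimiter characters."""
    tokens, current = [], []
    for ch in line:
        if ch in FASTTEXT_DELIMS:
            # Delimiter: close the current token, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(split_tokens("hello\tworld  foo"))  # ['hello', 'world', 'foo']
```

Note that a UTF-8 non-breaking space, for instance, would not split tokens here, which is exactly why converting such characters beforehand matters.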

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is not appended.
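
The chunking behaviour can be sketched in plain Python. The `EOS` string (`</s>`) and the value of `MAX_LINE_SIZE` (1024) used below are assumptions based on the fastText source; consult the Dictionary header for the authoritative values.

```python
# Illustrative sketch of how a long line of tokens is chunked.
# EOS ("</s>") and MAX_LINE_SIZE (1024) are assumed values taken
# from src/dictionary.h; check the header to confirm them.
EOS = "</s>"
MAX_LINE_SIZE = 1024

def chunk_line(tokens, max_line_size=MAX_LINE_SIZE):
    """Yield chunks of at most max_line_size tokens; EOS is appended
    only to the final chunk, where the newline was encountered."""
    for start in range(0, len(tokens), max_line_size):
        chunk = tokens[start:start + max_line_size]
        if start + max_line_size >= len(tokens):
            chunk = chunk + [EOS]  # newline reached: append EOS
        yield chunk
```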

The length of a token is the number of UTF-8 characters, obtained by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
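
Counting characters by inspecting the leading two bits can be sketched as follows: in UTF-8, continuation bytes have the bit pattern `10xxxxxx`, so every byte that is *not* a continuation byte starts a new character.

```python
def utf8_len(raw):
    """Count UTF-8 characters in a byte string by skipping
    continuation bytes (those matching the bit pattern 10xxxxxx)."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)

word = "naïve"
print(utf8_len(word.encode("utf-8")))  # 5, matching len(word)
```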