Commit 195aab5

committed
leaky splits
1 parent dc28213 commit 195aab5

1 file changed

+78
-76
lines changed


docs/source/brain.rst

Lines changed: 78 additions & 76 deletions
@@ -74,9 +74,9 @@ workflow:
7474
examples to train on in your data and for visualizing common modes of the
7575
data.
7676

77-
* :ref:`Leaky Splits <brain-image-leaky-splits>`:
77+
* :ref:`Leaky splits <brain-image-leaky-splits>`:
7878
Often when sourcing data en masse, duplicates and near duplicates can slip
79-
through the cracks. The FiftyOne Brain offers a *leaky-splits analysis* that
79+
through the cracks. The FiftyOne Brain offers a *leaky splits analysis* that
8080
can be used to find potential leaks between dataset splits. These splits can
8181
be misleading when evaluating a model, giving an overly optimistic measure
8282
for the quality of training.
@@ -1766,25 +1766,21 @@ samples being less representative and closer samples being more representative.
17661766
:alt: representativeness
17671767
:align: center
17681768

1769-
17701769
.. _brain-image-leaky-splits:
17711770

1772-
Leaky Splits
1771+
Leaky splits
17731772
____________
17741773

17751774
Despite our best efforts, duplicates and other forms of non-IID samples
17761775
show up in our data. When these samples end up in different splits, this
17771776
can have consequences when evaluating a model. It can often be easy to
1778-
overestimate model capability due to this issue. The FiftyOne Brain offers a way
1779-
to identify such cases in dataset splits.
1777+
overestimate model capability due to this issue. The FiftyOne Brain offers a
1778+
way to identify such cases in dataset splits.
17801779

1781-
The leaks of a |Dataset| or |DatasetView| can be computed directly without the need
1782-
for the predictions of a pre-trained model via the
1780+
The leaks of a |Dataset| or |DatasetView| can be computed directly without the
1781+
need for the predictions of a pre-trained model via the
17831782
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`
1784-
method:. The splits of a dataset can be defined in three ways. Through tags, by
1785-
tagging samples with their corresponding split. Through a field, by giving each
1786-
split a unique value in that field. And finally through views, by having views
1787-
corresponding to each split.
1783+
method:
17881784

17891785
.. code-block:: python
17901786
:linenos:
@@ -1793,115 +1789,121 @@ corresponding to each split.
17931789
import fiftyone.brain as fob
17941790
17951791
dataset = fo.load_dataset(...)
1796-
1797-
# splits via tags
1798-
split_tags = ['train', 'test']
1799-
index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)
1800-
1801-
# splits via field
1802-
split_field = ['split'] # holds split values e.g. 'train' or 'test'
1803-
index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)
1804-
1805-
# splits via views
1806-
split_views = {
1807-
'train' : some_view
1808-
'test' : some_other_view
1809-
}
1810-
index, leaks = fob.compute_leaky_splits(dataset, split_views=split_views)
18111792
1812-
Here is a sample snippet to run this on the `COCO <https://cocodataset.org/#home>`_ dataset.
1813-
Try it for yourself and see what you may find.
1793+
# Splits defined via tags
1794+
split_tags = ["train", "test"]
1795+
index = fob.compute_leaky_splits(dataset, splits=split_tags)
1796+
leaks = index.leaks_view()
1797+
1798+
# Splits defined via field
1799+
split_field = "split" # holds split values e.g. 'train' or 'test'
1800+
index = fob.compute_leaky_splits(dataset, splits=split_field)
1801+
leaks = index.leaks_view()
1802+
1803+
# Splits defined via views
1804+
split_views = {"train": train_view, "test": test_view}
1805+
index = fob.compute_leaky_splits(dataset, splits=split_views)
1806+
leaks = index.leaks_view()
1807+
1808+
Notice how the splits of the dataset can be defined in three ways: through
1809+
sample tags, through a string field that assigns each split a unique value in
1810+
the field, or by directly providing views that define the splits.
1811+
1812+
Here is a sample snippet to run this on the
1813+
`COCO dataset <https://cocodataset.org/#home>`_. Try it for yourself and see
1814+
what you find:
18141815

18151816
.. code-block:: python
18161817
:linenos:
18171818
18181819
import fiftyone as fo
1820+
import fiftyone.brain as fob
18191821
import fiftyone.zoo as foz
18201822
import fiftyone.utils.random as four
1821-
from fiftyone.brain import compute_leaky_splits
18221823
1823-
# load coco
1824+
# Load some COCO data
18241825
dataset = foz.load_zoo_dataset("coco-2017", split="test")
1825-
1826-
# set up splits via tags
1826+
1827+
# Set up splits via tags
18271828
dataset.untag_samples(dataset.distinct("tags"))
18281829
four.random_split(dataset, {"train": 0.7, "test": 0.3})
18291830
1830-
# compute leaks
1831-
index, leaks = compute_leaky_splits(dataset, split_tags=['train', 'test'])
1831+
# Find leaks
1832+
index = fob.compute_leaky_splits(dataset, splits=["train", "test"])
1833+
leaks = index.leaks_view()
18321834
1833-
Once you have these leaks, it is wise to look through them. You may gain some insight
1834-
into the source of the leaks.
1835+
The
1836+
:meth:`leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.leaks_view>`
1837+
method returns a view that contains only the leaks in the input splits. Once
1838+
you have these leaks, it is wise to look through them. You may gain some
1839+
insight into the source of the leaks:
18351840

18361841
.. code-block:: python
18371842
:linenos:
18381843
18391844
session = fo.launch_app(leaks)
18401845
18411846
Before evaluating your model on your test set, consider getting a version of it
1842-
with the leaks removed. This can be easily done with the built in method
1843-
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.
1847+
with the leaks removed. This can be easily done via
1848+
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`:
18441849

18451850
.. code-block:: python
18461851
:linenos:
18471852
1848-
# if you already have it
1849-
test_set = some_view
1853+
# The original test split
1854+
test_set = index.split_views["test"]
18501855
1851-
# can also be found with the variable `split_views` from the index
1852-
# make sure to put in the right string based on the field/tag/key in view dict
1853-
# passed when building the index
1854-
test_set = index.split_views['test']
1856+
# The test set with leaks removed
1857+
test_set_no_leaks = index.no_leaks_view(test_set)
18551858
1856-
test_set_no_leaks = index.no_leaks_view(test_set) # return a view with leaks removed
18571859
session.view = test_set_no_leaks
18581860
1859-
# do evaluations on test_set_no_leaks rather than test_set
1860-
18611861
Performance on the clean test set will likely be closer to the performance of the
18621862
model in the wild. If you found some leaks in your dataset, consider comparing
18631863
performance on the base test set against the clean test set.
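
As a rough illustration of that comparison, the sketch below computes accuracy on a base test set and on the same set with leaked samples removed. It uses plain Python rather than the FiftyOne API, and the records, the `leak_ids` set, and the `accuracy()` helper are all hypothetical stand-ins for your own data and evaluation routine.

```python
def accuracy(records):
    """Fraction of records whose prediction matches the ground truth label."""
    if not records:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for r in records if r["prediction"] == r["label"])
    return correct / len(records)


# Hypothetical test-set records; in practice these come from your dataset
test_set = [
    {"id": "a", "label": "cat", "prediction": "cat"},
    {"id": "b", "label": "dog", "prediction": "dog"},
    {"id": "c", "label": "cat", "prediction": "dog"},
    {"id": "d", "label": "dog", "prediction": "dog"},
]

# Hypothetical IDs of test samples that were flagged as leaks
leak_ids = {"b", "d"}

# Keep only the test samples that were not flagged
clean_test_set = [r for r in test_set if r["id"] not in leak_ids]

base_accuracy = accuracy(test_set)
clean_accuracy = accuracy(clean_test_set)

print(f"base: {base_accuracy:.2f}, clean: {clean_accuracy:.2f}")
```

In this toy data the leaked samples are exactly the ones the model gets right, so the clean score is lower than the base score. A gap in that direction is the kind of signal that suggests the base test set was overstating model quality.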
18641864

18651865
**Input**: A |Dataset| or |DatasetView|, and a definition of splits through one
18661866
of tags, a field, or views.
18671867

1868-
**Output**: An index that will allow you to look through your leaks and
1869-
provides some useful actions once they are discovered such as automatically
1870-
cleaning the dataset with
1868+
**Output**: An index that will allow you to look through your leaks with
1869+
:meth:`leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.leaks_view>`
1870+
and provides some useful actions once leaks are discovered, such as
1871+
automatically cleaning the dataset with
18711872
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
1872-
or tagging them for the future with
1873+
or tagging the leaks for future action with
18731874
:meth:`tag_leaks() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.tag_leaks>`.
1874-
Besides this, a view with all leaks is returned. Visualization of this view
1875-
can give you an insight into the source of the leaks in your dataset.
18761875

1877-
**What to expect**: Leakiness find leaks by embedding samples with a powerful
1876+
**What to expect**: Leaky splits works by embedding samples with a powerful
18781877
model and finding very close samples in different splits in this space. Large,
18791878
powerful models that were *not* trained on a dataset can provide insight into
18801879
visual and semantic similarity between images, without creating further leaks
18811880
in the process.
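
The idea described above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Brain's implementation: it assumes cosine distance as the metric, uses a brute-force pairwise comparison instead of a similarity index, and the embeddings, split assignments, and `find_cross_split_leaks()` helper are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def find_cross_split_leaks(embeddings, splits, threshold):
    """Return (id_a, id_b, distance) for pairs of samples that are in
    different splits and whose embedding distance is <= threshold."""
    ids = sorted(embeddings)
    leaks = []
    for i, id_a in enumerate(ids):
        for id_b in ids[i + 1:]:
            # Pairs within the same split are not leaks
            if splits[id_a] == splits[id_b]:
                continue
            dist = cosine_distance(embeddings[id_a], embeddings[id_b])
            if dist <= threshold:
                leaks.append((id_a, id_b, dist))
    return leaks


# Hypothetical 2D embeddings; real embeddings have far more dimensions
embeddings = {
    "img1": [1.0, 0.0],
    "img2": [0.99, 0.05],
    "img3": [0.0, 1.0],
    "img4": [1.0, 0.01],
}
splits = {"img1": "train", "img2": "test", "img3": "test", "img4": "train"}

leaks = find_cross_split_leaks(embeddings, splits, threshold=0.01)
for id_a, id_b, dist in leaks:
    print(f"{id_a} ({splits[id_a]}) ~ {id_b} ({splits[id_b]}): {dist:.4f}")
```

Note that `img1` and `img4` are nearly identical but both belong to the train split, so they are not reported. Only close pairs that straddle two splits count as leaks.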
18821881

1883-
**Similarity**: At its core, the leaky-splits module is a wrapper for the brain's
1884-
:class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>`. Any similarity
1885-
backend, (see :ref:`similarity backends <brain-similarity-backends>`) that implements
1886-
the :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be used
1887-
to compute leaky splits. You can either pass an existing similarity index by passing
1888-
its brain key to the argument `similarity_brain_key`, or have the method create one on
1889-
the fly for you. If there is a specific configuration for `Similarity` you would like
1890-
to use, pass it in the argument `similarity_config_dict`.
1891-
1892-
**Models and Embeddings**: If you opt for the method to create a `SimilarityIndex`
1893-
for you, you can still bring you own model by passing it in the `model` argument.
1894-
Alternatively, compute embeddings and pass the field that they reside on. We will
1895-
handle the rest.
1896-
1897-
**Thresholds**: The leaky-splits module uses a threshold to decide what samples
1898-
are 'too close' and mark them as potential leaks. This threshold can be changed
1899-
either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
1900-
method, or by using the
1882+
**Similarity index**: Under the hood, leaky splits leverages the brain's
1883+
:class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>` to detect
1884+
leaks. Any :ref:`similarity backend <brain-similarity-backends>` that
1885+
implements the
1886+
:class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be
1887+
used to compute leaky splits. You can either pass an existing similarity index
1888+
by passing its brain key to the argument `similarity_brain_key`, or have the
1889+
method create one on the fly for you.
1890+
1891+
**Embeddings**: You can customize the model used to compute embeddings via the
1892+
`model` argument of
1893+
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`. You can
1894+
also precompute embeddings and tell leaky splits to use them by passing them
1895+
via the `embeddings` argument.
1896+
1897+
**Thresholds**: Leaky splits uses a threshold to decide what samples are
1898+
too close and thus mark them as potential leaks. This threshold can be
1899+
customized either by passing a value to the `threshold` argument of
1900+
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>` or after
1901+
the fact via the
19011902
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
1902-
method. The best value for your use-case may vary depending on your dataset, as well
1903-
as the embeddings used. A threshold that's too big will have a lot of false positives,
1904-
a threshold that's too small will have a lot of false negatives.
1903+
method. The best value for your use case may vary depending on your dataset, as
1904+
well as the embeddings used. A threshold that's too big may have a lot of
1905+
false positives, while a threshold that's too small may have a lot of false
1906+
negatives.
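
To make the tradeoff concrete, here is a small plain-Python sketch that counts how many candidate pairs are flagged at different thresholds. The pair names, distances, and threshold values are hypothetical and chosen only for illustration; sensible values depend on your embeddings and distance metric.

```python
# Hypothetical distances between cross-split pairs (smaller = more similar)
pair_distances = {
    ("train_1", "test_7"): 0.01,  # same image with light compression
    ("train_4", "test_2"): 0.08,  # same scene, slightly different crop
    ("train_9", "test_5"): 0.35,  # same object class, different image
    ("train_3", "test_8"): 0.80,  # unrelated images
}


def flagged_pairs(distances, threshold):
    """Return the pairs whose distance is at or below the threshold."""
    return sorted(pair for pair, d in distances.items() if d <= threshold)


# Sweep a few thresholds and count how many pairs each one flags
counts = {t: len(flagged_pairs(pair_distances, t)) for t in (0.005, 0.1, 0.5)}

for threshold, count in counts.items():
    print(f"threshold={threshold}: {count} pair(s) flagged")
```

In this toy data the smallest threshold misses even the near-exact duplicate (a false negative), while the largest one also flags a pair that merely shares an object class (a likely false positive). Sweeping a few values and inspecting the flagged samples is one reasonable way to pick a threshold.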
19051907

19061908
.. image:: /images/brain/brain-leaky-splits.png
19071909
:alt: leaky-splits
