@@ -74,9 +74,9 @@ workflow:
examples to train on in your data and for visualizing common modes of the
data.

- * :ref:`Leaky Splits <brain-image-leaky-splits>`:
+ * :ref:`Leaky splits <brain-image-leaky-splits>`:
Often when sourcing data en masse, duplicates and near duplicates can slip
- through the cracks. The FiftyOne Brain offers a *leaky-splits analysis* that
+ through the cracks. The FiftyOne Brain offers a *leaky splits analysis* that
can be used to find potential leaks between dataset splits. These splits can
be misleading when evaluating a model, giving an overly optimistic measure
for the quality of training.
@@ -1766,25 +1766,21 @@ samples being less representative and closer samples being more representative.
:alt: representativeness
:align: center

-
.. _brain-image-leaky-splits:

- Leaky Splits
+ Leaky splits
____________

Despite our best efforts, duplicates and other forms of non-IID samples
show up in our data. When these samples end up in different splits, this
can have consequences when evaluating a model. It can often be easy to
- overestimate model capability due to this issue. The FiftyOne Brain offers a way
- to identify such cases in dataset splits.
+ overestimate model capability due to this issue. The FiftyOne Brain offers a
+ way to identify such cases in dataset splits.

- The leaks of a |Dataset| or |DatasetView| can be computed directly without the need
- for the predictions of a pre-trained model via the
+ The leaks of a |Dataset| or |DatasetView| can be computed directly without the
+ need for the predictions of a pre-trained model via the
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`
- method:. The splits of a dataset can be defined in three ways. Through tags, by
- tagging samples with their corresponding split. Through a field, by giving each
- split a unique value in that field. And finally through views, by having views
- corresponding to each split.
+ method:

.. code-block:: python
:linenos:
@@ -1793,115 +1789,121 @@ corresponding to each split.
import fiftyone.brain as fob

dataset = fo.load_dataset(...)
-
- # splits via tags
- split_tags = ['train', 'test']
- index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)
-
- # splits via field
- split_field = ['split'] # holds split values e.g. 'train' or 'test'
- index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)
-
- # splits via views
- split_views = {
- 'train': some_view
- 'test': some_other_view
- }
- index, leaks = fob.compute_leaky_splits(dataset, split_views=split_views)

- Here is a sample snippet to run this on the `COCO <https://cocodataset.org/#home>`_ dataset.
- Try it for yourself and see what you may find.
+ # Splits defined via tags
+ split_tags = ["train", "test"]
+ index = fob.compute_leaky_splits(dataset, splits=split_tags)
+ leaks = index.leaks_view()
+
+ # Splits defined via field
+ split_field = "split"  # holds split values e.g. 'train' or 'test'
+ index = fob.compute_leaky_splits(dataset, splits=split_field)
+ leaks = index.leaks_view()
+
+ # Splits defined via views
+ split_views = {"train": train_view, "test": test_view}
+ index = fob.compute_leaky_splits(dataset, splits=split_views)
+ leaks = index.leaks_view()
+
+ Notice how the splits of the dataset can be defined in three ways: through
+ sample tags, through a string field whose values identify each sample's split,
+ or by directly providing views that define the splits.
+
+ Here is a sample snippet to run this on the
+ `COCO dataset <https://cocodataset.org/#home>`_. Try it for yourself and see
+ what you find:

.. code-block:: python
:linenos:

import fiftyone as fo
+ import fiftyone.brain as fob
import fiftyone.zoo as foz
import fiftyone.utils.random as four
- from fiftyone.brain import compute_leaky_splits

- # load coco
+ # Load some COCO data
dataset = foz.load_zoo_dataset("coco-2017", split="test")
-
- # set up splits via tags
+
+ # Set up splits via tags
dataset.untag_samples(dataset.distinct("tags"))
four.random_split(dataset, {"train": 0.7, "test": 0.3})

- # compute leaks
- index, leaks = compute_leaky_splits(dataset, split_tags=['train', 'test'])
+ # Find leaks
+ index = fob.compute_leaky_splits(dataset, splits=["train", "test"])
+ leaks = index.leaks_view()

- Once you have these leaks, it is wise to look through them. You may gain some insight
- into the source of the leaks.
+ The
+ :meth:`leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.leaks_view>`
+ method returns a view that contains only the leaks in the input splits. Once
+ you have these leaks, it is wise to look through them. You may gain some
+ insight into the source of the leaks:

.. code-block:: python
:linenos:

session = fo.launch_app(leaks)

Before evaluating your model on your test set, consider getting a version of it
- with the leaks removed. This can be easily done with the built in method
- :meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.
+ with the leaks removed. This can be easily done via
+ :meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`:

.. code-block:: python
:linenos:

- # if you already have it
- test_set = some_view
+ # The original test split
+ test_set = index.split_views["test"]

- # can also be found with the variable `split_views` from the index
- # make sure to put in the right string based on the field/tag/key in view dict
- # passed when building the index
- test_set = index.split_views['test']
+ # The test set with leaks removed
+ test_set_no_leaks = index.no_leaks_view(test_set)

- test_set_no_leaks = index.no_leaks_view(test_set) # return a view with leaks removed

session.view = test_set_no_leaks

- # do evaluations on test_set_no_leaks rather than test_set
-

Performance on the clean test set can be closer to the performance of the
model in the wild. If you found some leaks in your dataset, consider comparing
performance on the base test set against the clean test set.
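+
+ As a concrete sketch of such a comparison, assuming your samples have
+ predictions in a `predictions` field and ground truth in a `ground_truth`
+ field (adjust these names to match your dataset):
+
+ .. code-block:: python
+ :linenos:
+
+ # The original test split and its leak-free counterpart
+ test_set = index.split_views["test"]
+ test_set_no_leaks = index.no_leaks_view(test_set)
+
+ base_results = test_set.evaluate_detections(
+     "predictions", gt_field="ground_truth", compute_mAP=True
+ )
+ clean_results = test_set_no_leaks.evaluate_detections(
+     "predictions", gt_field="ground_truth", compute_mAP=True
+ )
+
+ print("mAP on the full test split: %.4f" % base_results.mAP())
+ print("mAP with leaks removed: %.4f" % clean_results.mAP())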

**Input**: A |Dataset| or |DatasetView|, and a definition of splits through one
of tags, a field, or views.

- **Output**: An index that will allow you to look through your leaks and
- provides some useful actions once they are discovered such as automatically
- cleaning the dataset with
+ **Output**: An index that will allow you to look through your leaks with
+ :meth:`leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.leaks_view>`
+ and also provides some useful actions once they are discovered, such as
+ automatically cleaning the dataset with
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
- or tagging them for the future with
+ or tagging the leaks for future action with
:meth:`tag_leaks() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.tag_leaks>`.
- Besides this, a view with all leaks is returned. Visualization of this view
- can give you an insight into the source of the leaks in your dataset.
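+
+ For example, a minimal sketch of tagging leaks for later review (this assumes
+ the method can be called with its default arguments; check its signature for
+ the tag that gets applied):
+
+ .. code-block:: python
+ :linenos:
+
+ # Tag the leaky samples so they can be revisited later
+ index.tag_leaks()
+
+ # Review the leaks in the App
+ session = fo.launch_app(index.leaks_view())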

- **What to expect**: Leakiness find leaks by embedding samples with a powerful
+ **What to expect**: Leaky splits works by embedding samples with a powerful
model and finding very close samples in different splits in this space. Large,
powerful models that were *not* trained on a dataset can provide insight into
visual and semantic similarity between images, without creating further leaks
in the process.

- **Similarity**: At its core, the leaky-splits module is a wrapper for the brain's
- :class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>`. Any similarity
- backend, (see :ref:`similarity backends <brain-similarity-backends>`) that implements
- the :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be used
- to compute leaky splits. You can either pass an existing similarity index by passing
- its brain key to the argument `similarity_brain_key`, or have the method create one on
- the fly for you. If there is a specific configuration for `Similarity` you would like
- to use, pass it in the argument `similarity_config_dict`.
-
- **Models and Embeddings**: If you opt for the method to create a `SimilarityIndex`
- for you, you can still bring you own model by passing it in the `model` argument.
- Alternatively, compute embeddings and pass the field that they reside on. We will
- handle the rest.
-
- **Thresholds**: The leaky-splits module uses a threshold to decide what samples
- are 'too close' and mark them as potential leaks. This threshold can be changed
- either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
- method, or by using the
+ **Similarity index**: Under the hood, leaky splits leverages the brain's
+ :class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>` to detect
+ leaks. Any :ref:`similarity backend <brain-similarity-backends>` that
+ implements the
+ :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be
+ used to compute leaky splits. You can either pass an existing similarity index
+ by passing its brain key to the argument `similarity_brain_key`, or have the
+ method create one on the fly for you.
+
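+ For example, a sketch of reusing an existing similarity index could look like
+ the following, where the `img_sim` brain key is just an illustrative name for
+ an index you created earlier (e.g. via
+ :meth:`compute_similarity() <fiftyone.brain.compute_similarity>`):
+
+ .. code-block:: python
+ :linenos:
+
+ # Reuse an existing similarity index by its brain key
+ index = fob.compute_leaky_splits(
+     dataset, splits=["train", "test"], similarity_brain_key="img_sim"
+ )
+ leaks = index.leaks_view()
+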
+ **Embeddings**: You can customize the model used to compute embeddings via the
+ `model` argument of
+ :meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`. You can
+ also precompute embeddings and tell leaky splits to use them by passing them
+ via the `embeddings` argument.
+
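+ For example, a sketch of both options, where the zoo model name and the
+ `embeddings` field are illustrative choices rather than required values:
+
+ .. code-block:: python
+ :linenos:
+
+ # Use a specific model to compute embeddings
+ index = fob.compute_leaky_splits(
+     dataset, splits=["train", "test"], model="clip-vit-base32-torch"
+ )
+
+ # Or point leaky splits at embeddings stored in a field of your dataset
+ index = fob.compute_leaky_splits(
+     dataset, splits=["train", "test"], embeddings="embeddings"
+ )
+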
+ **Thresholds**: Leaky splits uses a threshold to decide which samples are
+ too close and thus marks them as potential leaks. This threshold can be
+ customized either by passing a value to the `threshold` argument of
+ :meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>` or after
+ the fact via the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
- method. The best value for your use-case may vary depending on your dataset, as well
- as the embeddings used. A threshold that's too big will have a lot of false positives,
- a threshold that's too small will have a lot of false negatives.
+ method. The best value for your use case may vary depending on your dataset, as
+ well as the embeddings used. A threshold that's too big may have a lot of
+ false positives, while a threshold that's too small may have a lot of false
+ negatives.
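+
+ For example, both approaches might look like the following sketch, where the
+ threshold value is an arbitrary placeholder that you should tune for your
+ dataset and embeddings:
+
+ .. code-block:: python
+ :linenos:
+
+ # Set the threshold when computing leaky splits
+ index = fob.compute_leaky_splits(
+     dataset, splits=["train", "test"], threshold=0.1
+ )
+
+ # Or adjust it after the fact and re-inspect the leaks
+ index.set_threshold(0.1)
+ leaks = index.leaks_view()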

.. image:: /images/brain/brain-leaky-splits.png
:alt: leaky-splits