checkpoint - several content updates
erikhatcher committed Jan 14, 2025
1 parent 2161325 commit 7c21a23
Showing 15 changed files with 183 additions and 41 deletions.
4 changes: 3 additions & 1 deletion docs/01_Welcome.mdx
Original file line number Diff line number Diff line change
@@ -30,6 +30,8 @@ Fill in any other blanks for welcoming and setting the stage for the workshop.

## The Exercises

This workshop uses the Atlas Search Playground for the exercises. All you need is a browser and network connectivity. This handy developer tool allows us to work in an isolated, focused environment
This workshop uses the Atlas Search Playground for the exercises.
All you need is a browser and network connectivity.
This handy developer tool allows us to work in an isolated, focused environment
with no setup.

13 changes: 10 additions & 3 deletions docs/10_About_workshop/1_intro.mdx
@@ -1,16 +1,23 @@
# 📘 Atlas Search Playground

The Atlas Search Playground (or just Playground) is used for the exercises in this workshop. The Playground is a self-contained lightweight, yet feature-rich Atlas Search environment which does not require an Atlas account to use.
The Atlas Search Playground (or just Playground) is used for the exercises in this workshop.
The Playground is a self-contained, lightweight, yet feature-rich Atlas Search environment
that does not require an Atlas account to use.

There are two tools in the Playground:
* `Code Sandbox`: your data, an index configuration, an aggregation pipeline, and results
* `Search Demo Builder`: a configurable search UI on your data

The exercises will only use the Code Sandbox, as it allows saving and sharing links to the full environment and allows us to work on one topic at a time. We'll cover the Search Demo Builder briefly near the end of thw workshop.
The exercises will only use the Code Sandbox, as it allows saving and sharing links to
the full environment and allows us to work on one topic at a time.

We'll cover the Search Demo Builder briefly near the end of the workshop.

## Code Sandbox layout

To begin, navigate to the [Atlas Search Playground](https://search-playground.mongodb.com/). In the next section, you'll work through the first exercise to get familiar with the Playground's Code Sandbox.
To begin, navigate to the [Atlas Search Playground](https://search-playground.mongodb.com/).
In the next section, you'll work through the first exercise to get familiar with the
Playground's Code Sandbox.

Let's dive into the world of Atlas Search using this convenient and powerful playground!

9 changes: 7 additions & 2 deletions docs/20_Intro_to_Atlas_Search/1_system.mdx
@@ -9,9 +9,14 @@ or dynamically mapping any and all fields supported.
![system diagram](/img/system_diagram.png)

Changes to a collection via updates, deletes, or additions are *eventually consistent*, meaning the
index is updated independently of changes to the collection in a separate process, asynchronously. The lag between a change made to the database and refelected in a subsequent search is dependent on many factors such as deployment tier and architecture, the complexity of the index mapping, the other changes that are also queued, and the laws of physics.
index is updated independently of changes to the collection in a separate process, asynchronously.
The lag between a change made to the database and its reflection in a subsequent search depends
on many factors such as deployment tier and architecture, the complexity of the index mapping,
the other changes that are also queued, and the laws of physics.

The Atlas Search process can be deployed either coupled alongside the database nodes, or on separate dedicated nodes. Dedicated nodes provide separation of concerns, alleviating resource contention. Dedicated search nodes are recommended for production workloads.
The Atlas Search process can be deployed either coupled alongside the database nodes,
or on separate dedicated nodes. Dedicated nodes provide separation of concerns, alleviating
resource contention. Dedicated search nodes are recommended for production workloads.

## Coupled nodes
![coupled nodes](/img/coupled.png)
56 changes: 54 additions & 2 deletions docs/20_Intro_to_Atlas_Search/2_aggregation_stages.mdx
@@ -6,6 +6,58 @@ where the magic happens

## $searchMeta

The `$searchMeta` stage performs the same search that `$search` does, but only returns the results metadata, not actual matching documents. Results metadata includes the count of matching results and facets. This same metadata is available when using `$search` too, accessible in the $$SEARCH_META context variable.
The `$searchMeta` stage performs the same search that `$search` does,
but only returns the results metadata, not actual matching documents.
Results metadata includes the count of matching results and facets.
This same metadata is available when using `$search` too,
accessible in the `$$SEARCH_META` context variable.
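
As a rough sketch of the difference, the two pipelines below run the same search definition through each stage. The index name `default`, the `title` field, and the query text are assumptions made for illustration; they are not taken from the workshop's Playground snapshot.

```js
// Hypothetical search definition shared by both stages; the index name,
// field path, and query text are placeholders.
const searchDefinition = {
  index: "default",
  text: { query: "playground", path: "title" },
};

// $search returns the matching documents. The results metadata can be
// attached to each returned document through the $$SEARCH_META variable.
const searchPipeline = [
  { $search: searchDefinition },
  { $addFields: { meta: "$$SEARCH_META" } },
];

// $searchMeta returns only a metadata document (count, facets),
// not the matching documents themselves.
const searchMetaPipeline = [{ $searchMeta: searchDefinition }];
```

Either array can be pasted into the Query pane of the Code Sandbox, adjusted to the fields of your own documents.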

**TODO**: Add exercise using $search and have developer switch to $searchMeta to see the results
## Exercises: search pipeline stages

### Step 1
1. Navigate to the original Playground used in the last section's exercise
https://search-playground.mongodb.com/tools/code-sandbox/snapshots/6782aea0667feaaf06324b87
2. Press Run. Got the empty `[]` array of results?
3. Change `$search` to `$searchMeta` (in the Query pane), and press Run again.

<details>
<summary>Here are the expected results...</summary>
<div>
```js
[
{
"count": {
"lowerBound": 0
}
}
]
```
</div>
</details>

### Step 2

1. Now fix the query to match the document as you did previously
2. Press Run again
3. Did the `$searchMeta` results change?

<details>
<summary>Here are the expected results...</summary>
<div>
```js
[
{
"count": {
"lowerBound": 1
}
}
]
```
</div>
</details>

## Post-`$search` stages

* Take care with stages such as `$sort` and `$group`, or any stage that consumes **all** documents from the previous stage.
* `$limit`, `$addFields`, `$project`, and the like are fine, as they only operate on one document at a time
or cut off the results.
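
One way to read the notes above is sketched below: a pipeline that follows `$search` only with per-document or cut-off stages, plus a small helper that flags stages needing every input document. The helper `findBlockingStages`, the stage list it checks, the index name, and the field names are all hypothetical and exist only for this illustration.

```js
// Hypothetical pipeline; the index name, field path, and query are placeholders.
const pipeline = [
  { $search: { index: "default", text: { query: "lucene", path: "title" } } },
  // $limit cuts the result stream off early.
  { $limit: 10 },
  // $project operates on one document at a time.
  { $project: { _id: 0, title: 1 } },
];

// Illustrative only: stages named in the notes as consuming all documents.
const BLOCKING_STAGES = new Set(["$sort", "$group"]);

// Returns the names of stages in the pipeline that appear in BLOCKING_STAGES.
function findBlockingStages(stages) {
  return stages
    .map((stage) => Object.keys(stage)[0])
    .filter((name) => BLOCKING_STAGES.has(name));
}
```

Calling `findBlockingStages(pipeline)` on the pipeline above returns an empty array; appending a `$sort` stage would make it return `["$sort"]`.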
27 changes: 27 additions & 0 deletions docs/20_Intro_to_Atlas_Search/3_lucene.mdx
@@ -0,0 +1,27 @@
# 📘 Powered by Lucene

https://lucene.apache.org

## Anatomy of a Lucene index

A Lucene index encapsulates specialized data structures unique to each type of data indexed.

* Numbers and dates: ...
* Geo-spatial: ...
* Text: via inverted indexes

Each field is indexed independently.

Segmented architecture, append-only, for fast indexing. Background processes to optimize the index
segments.

## Inverted Index

![inverted index](/img/analysis_lucene_standard.png)

## Search algorithms

* "index intersection" using skip lists
* link to Adrien's presentation

Atlas Search translates its search operators to Lucene's `Query` API.
4 changes: 4 additions & 0 deletions docs/30_Index_configuration/2_index_config.mdx
@@ -8,9 +8,13 @@ Documents are mapped to an index through a flexible configuration.

## Dynamic mapping

You can configure an entire index to use dynamic mappings, or specify individual fields,
such as fields of type `document`, to be dynamically mapped.
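
The two options might look like the index definitions below. This is a minimal sketch: the field names `title` and `metadata` are assumptions, not fields from the workshop data.

```js
// Entire index dynamically mapped: every supported field is indexed.
const dynamicIndex = {
  mappings: { dynamic: true },
};

// Static mapping where only the hypothetical `metadata` sub-document
// is dynamically mapped.
const mixedIndex = {
  mappings: {
    dynamic: false,
    fields: {
      title: { type: "string" },
      metadata: { type: "document", dynamic: true },
    },
  },
};
```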

## Configuring a real Atlas Search index

* Atlas Search Visual Editor or JSON Editor
* via Compass
* Atlas CLI
* Driver commands

46 changes: 13 additions & 33 deletions docs/30_Index_configuration/6_string.mdx
@@ -14,36 +14,16 @@ See also: Relevant As-You-Type Suggestions Search Solution
Basic examples
lucene.standard matching case-insensitive: https://search-playground.mongodb.com/tools/code-playground/snapshots/664738af4e0a3f240a5de9d9

## Analysis matters
“searches” query does not match “Search” with lucene.standard: https://search-playground.mongodb.com/tools/code-playground/snapshots/664739964e0a3f240a5de9db
“searches” matches “Search” using lucene.english: https://search-playground.mongodb.com/tools/code-playground/snapshots/66473aa64e0a3f240a5de9dd

## Custom analyzers
Last 4 digit of phone number matching (regex extraction during indexing, keyword analysis at query time): https://search-playground.corp.mongodb.com/tools/code-playground/snapshots/669e6c98d49ef6fad98118ba
Example of being able to do ‘startsWith’ and ‘endsWith’ using wildcard and ‘reverse’ token filter:
https://search-playground.mongodb.com/tools/code-playground/snapshots/6683c8bc4a45448733549bbc

Example of being able to do ‘startsWith’, ‘endsWith’ and ‘contains’ using nGrams: https://search-playground.mongodb.com/tools/code-playground/snapshots/6683c999934a05d9b585b6e7
Relevancy
Example of an as-you-type suggest configuration; sophisticated use of multi and several weighted query clauses: https://search-playground.mongodb.com/tools/code-playground/snapshots/66473b744e0a3f240a5de9e1
$project score?
$project scoreDetails?

# multi
Why?
Relevancy example: boost each multi uniquely
Multiple language example: may not know the language of the content and each document could be different - multi across all possible languages, query across them as desired at query-time, let relevancy sort it out
Example of being able to do ‘startsWith’ and ‘endsWith’ using wildcard and ‘reverse’ token filter:
https://search-playground.mongodb.com/tools/code-playground/snapshots/6683c8bc4a45448733549bbc


* Text: the heart and soul of your content
* Strings are analyzed, tokenized into terms
* Multiple analyzers can be used for a single string field (multi)
* Terms: words, fragments, atomic searchable units
* An inverted index structure organizes terms lexicographically/alphabetically for quick lookup (aka a dictionary)
* Term statistics:
* Posting list: document identifiers
* Term frequency (tf): how many occurrences of the term per document
* Document frequency (df): how many documents contain the term
* Positions: where in the document does this term occur
Query operators:
* text: matches any of the query terms; can include synonyms and fuzziness
* phrase: matches query terms that occur in proximity
* regex: pattern matching
* wildcard: matches across missing characters
* moreLikeThis: matches documents that overlap important terms

Analysis occurs on query values
* Except on regex and wildcard operators: partial strings not analyzable
* Index-time and search-time analyzers can be different, if needed

Remember: it’s a dictionary
* Index it how you’d like to find it; search for it how you indexed it
* Leverage analyzers to index text efficiently for searching
* Index statistics factor into score computations
7 changes: 7 additions & 0 deletions docs/40_Analysis/custom.mdx
@@ -0,0 +1,7 @@
Composed from analyzer building blocks:
* `charFilters`: pre-process characters of text for filtering/replacing (optional)
  * htmlStrip, icuNormalize, mapping, persian
* `tokenizer`: splits text into tokens
  * edgeGram, keyword, nGram, regexCaptureGroup, regexSplit, standard, uaxUrlEmail, whitespace
* `tokenFilters`: processes individual tokens (optional)
  * asciiFolding, daitchMokotoffSoundex, edgeGram, englishPossessive, flattenGraph, icuFolding, icuNormalizer, kStemming, length, lowercase, nGram, porterStemming, regex, reverse, shingle, snowballStemming, spanishPluralStemming, stempel, stopword, trim, wordDelimiterGraph
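
The building blocks above could be combined as in the following sketch of an index definition. The analyzer name `htmlLowercase` and the `body` field are made up for this example; the three components are picked from the lists above.

```js
// Hypothetical index definition with one custom analyzer.
const indexDefinition = {
  analyzers: [
    {
      name: "htmlLowercase", // made-up name for this sketch
      charFilters: [{ type: "htmlStrip" }], // strip HTML markup before tokenizing
      tokenizer: { type: "standard" }, // split text into word tokens
      tokenFilters: [{ type: "lowercase" }], // normalize each token's case
    },
  ],
  mappings: {
    dynamic: false,
    fields: {
      // `body` is an assumed field name; it is analyzed with the custom analyzer.
      body: { type: "string", analyzer: "htmlLowercase" },
    },
  },
};
```

The order matters conceptually: character filters run first on the raw text, then exactly one tokenizer, then token filters in the order listed.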
39 changes: 39 additions & 0 deletions docs/40_Analysis/index.mdx
@@ -8,3 +8,42 @@ https://www.mongodb.com/docs/atlas/atlas-search/analyzers/

![Visual Editor standard analyzer output](/img/editor_analysis.png)

## Analysis matters
“searches” query does not match “Search” with lucene.standard: https://search-playground.mongodb.com/tools/code-playground/snapshots/664739964e0a3f240a5de9db
“searches” matches “Search” using lucene.english: https://search-playground.mongodb.com/tools/code-playground/snapshots/66473aa64e0a3f240a5de9dd

## Custom analyzers
Last 4 digits of phone number matching (regex extraction during indexing, keyword analysis at query time): https://search-playground.corp.mongodb.com/tools/code-playground/snapshots/669e6c98d49ef6fad98118ba
Example of being able to do ‘startsWith’ and ‘endsWith’ using wildcard and ‘reverse’ token filter:
https://search-playground.mongodb.com/tools/code-playground/snapshots/6683c8bc4a45448733549bbc

Example of being able to do ‘startsWith’, ‘endsWith’ and ‘contains’ using nGrams: https://search-playground.mongodb.com/tools/code-playground/snapshots/6683c999934a05d9b585b6e7
Relevancy
Example of an as-you-type suggest configuration; sophisticated use of multi and several weighted query clauses: https://search-playground.mongodb.com/tools/code-playground/snapshots/66473b744e0a3f240a5de9e1
$project score?
$project scoreDetails?

# multi
Why?
Relevancy example: boost each multi uniquely
Multiple language example: may not know the language of the content and each document could be different - multi across all possible languages, query across them as desired at query-time, let relevancy sort it out
Example of being able to do ‘startsWith’ and ‘endsWith’ using wildcard and ‘reverse’ token filter:
https://search-playground.mongodb.com/tools/code-playground/snapshots/6683c8bc4a45448733549bbc
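
To make the `multi` idea concrete, here is a minimal sketch of a field indexed with several analyzers and queried with a different boost per analysis. The `title` field, the choice of English and French, the query text, and the boost values are assumptions for illustration only.

```js
// Hypothetical index definition: `title` is indexed three ways.
const indexDefinition = {
  mappings: {
    dynamic: false,
    fields: {
      title: {
        type: "string",
        analyzer: "lucene.standard",
        multi: {
          english: { type: "string", analyzer: "lucene.english" },
          french: { type: "string", analyzer: "lucene.french" },
        },
      },
    },
  },
};

// Query every analysis of the field and let relevancy sort it out.
// Boost values are arbitrary placeholders.
const searchStage = {
  $search: {
    compound: {
      should: [
        { text: { query: "searches", path: "title", score: { boost: { value: 3 } } } },
        { text: { query: "searches", path: { value: "title", multi: "english" }, score: { boost: { value: 2 } } } },
        { text: { query: "searches", path: { value: "title", multi: "french" } } },
      ],
    },
  },
};
```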


* Text: the heart and soul of your content
* Strings are analyzed, tokenized into terms
* Multiple analyzers can be used for a single string field (multi)
* Terms: words, fragments, atomic searchable units
* An inverted index structure organizes terms lexicographically/alphabetically for quick lookup (aka a dictionary)
* Term statistics:
* Posting list: document identifiers
* Term frequency (tf): how many occurrences of the term per document
* Document frequency (df): how many documents contain the term
* Positions: where in the document does this term occur
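
The term statistics listed above can be illustrated with a toy inverted index. This is a teaching sketch only: the `analyze` function is a crude stand-in for a real analyzer, and none of this reflects how Lucene actually stores its data.

```js
// Toy analyzer: lowercase, then split on anything that is not a letter or digit.
function analyze(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Builds term -> Map(docId -> positions[]) from an array of document strings.
function buildInvertedIndex(docs) {
  const index = new Map();
  docs.forEach((text, docId) => {
    analyze(text).forEach((term, position) => {
      if (!index.has(term)) index.set(term, new Map());
      const postings = index.get(term);
      if (!postings.has(docId)) postings.set(docId, []);
      postings.get(docId).push(position);
    });
  });
  return index;
}

// Reports document frequency plus per-document term frequency and positions.
function termStats(index, term) {
  const postings = index.get(term) ?? new Map();
  return {
    df: postings.size,
    postings: [...postings.entries()].map(([docId, positions]) => ({
      docId,
      tf: positions.length,
      positions,
    })),
  };
}

const index = buildInvertedIndex(["Search the search index", "Atlas Search Playground"]);
// The "dictionary": terms in lexicographic order.
const sortedTerms = [...index.keys()].sort();
const stats = termStats(index, "search");
```

Here `stats.df` is 2 because both documents contain "search"; in the first document the term has a frequency of 2 at positions 0 and 2, and in the second a frequency of 1 at position 1.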

* `lucene.standard` (default): tokenizes at word break characters, removes punctuation, and lowercases
* `lucene.english`: standard tokenization plus de-pluralization, stop word removal, and stemming
* `lucene.keyword`: tokenizes text as a single term; suitable for wildcard or regex matching over the entire value
* Many language-specific analyzers are built in: (lucene.)arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, japanese, korean, kuromoji, latvian, lithuanian, morfologik, nori, norwegian, persian, polish, portuguese, romanian, russian, smartcn, sorani, spanish, swedish, thai, turkish, ukrainian

4 changes: 4 additions & 0 deletions docs/50_Operators/1_index.mdx
@@ -3,3 +3,7 @@

* Search operators w/ quick labs for each one
* compound: filter, mustNot, must, should

# TODO
* "For string type, the moreLikeThis and queryString operators don't support an array of strings."
huh? Really?
8 changes: 8 additions & 0 deletions docs/50_Operators/compound.mdx
@@ -0,0 +1,8 @@
# `compound` operators

* should
* must
* mustNot
* filter

`minimumShouldMatch`
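
A minimal sketch of how the four clause types and `minimumShouldMatch` might fit together in one `$search` stage follows. Every field name and query value is an assumption for illustration.

```js
// Hypothetical compound query; fields and values are placeholders.
const compoundStage = {
  $search: {
    index: "default",
    compound: {
      // must: required to match, and contributes to the score.
      must: [{ text: { query: "search", path: "title" } }],
      // mustNot: documents matching this clause are excluded.
      mustNot: [{ text: { query: "deprecated", path: "status" } }],
      // filter: required to match, but does not affect the score.
      filter: [{ range: { path: "year", gte: 2020 } }],
      // should: optional clauses that raise the score when they match.
      should: [
        { text: { query: "atlas", path: "description" } },
        { text: { query: "lucene", path: "description" } },
      ],
      // Require at least one of the `should` clauses to match.
      minimumShouldMatch: 1,
    },
  },
};
```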
5 changes: 5 additions & 0 deletions docs/60_Relevancy/index.mdx
@@ -5,3 +5,8 @@
* TF/IDF, BM25
* compound: additive clause scoring
* scoreDetails w/ demonstrative lab

Relevancy
Example of an as-you-type suggest configuration; sophisticated use of multi and several weighted query clauses: https://search-playground.mongodb.com/tools/code-playground/snapshots/66473b744e0a3f240a5de9e1
$project score?
$project scoreDetails?
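
On the two `$project` questions above, a pipeline along these lines could surface both values. It is a sketch under assumptions: the index name, the `title` field, and the query text are placeholders.

```js
// Hypothetical pipeline that projects the score and its explanation.
const pipeline = [
  {
    $search: {
      index: "default",
      text: { query: "search", path: "title" },
      // Ask for a per-document breakdown of the score computation.
      scoreDetails: true,
    },
  },
  { $limit: 5 },
  {
    $project: {
      _id: 0,
      title: 1,
      score: { $meta: "searchScore" },
      scoreDetails: { $meta: "searchScoreDetails" },
    },
  },
];
```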
2 changes: 2 additions & 0 deletions docs/99_TODO.md
@@ -14,3 +14,5 @@ https://search-playground.corp.mongodb.com/tools/code-playground/snapshots/669e8
Which embedded document matched?
https://search-playground.corp.mongodb.com/tools/code-playground/snapshots/669e850dd49ef6fad98118d6
Using scoreDetails to glimpse analysis in action:

Synonyms: https://search-playground.mongodb.com/tools/code-sandbox/snapshots/6785d30eb6487c1cfd0bb817
Binary file added static/img/analysis_lucene_english.png
Binary file added static/img/analysis_lucene_standard.png