Search Request
Query
Search allows multiple query types to be performed on Full Text Indexes. Each of these query types helps enhance the search and retrievability of the indexed data.
These capabilities include:
-
Input-text and target-text can be analyzed: this transforms input-text into token-streams, according to different specified criteria, allowing richer and more finely controlled forms of text-matching.
-
The fuzziness of a query can be specified, so that the scope of matches can be constrained to a particular level of exactitude. A high degree of fuzziness means that a large number of partial matches may be returned.
-
Multiple queries can be specified for simultaneous processing, with one given a higher boost than another, ensuring that its results are returned at the top of the set.
-
Regular expressions and wildcards can be used in text-specification for search-input.
-
Compound queries can be designed such that an appropriate conjunction or disjunction of the total result-set can be returned.
-
Geospatial queries can be used for finding the nearest neighbor or points of interest in a bounded region.
All the above options, and others, are explained in detail in Supported Queries.
Consistency
Consistency is a mechanism that ensures the Full Text Search (FTS) index can obtain the most up-to-date version of a document written to a collection or a bucket.
The consistency mechanism accepts consistency vectors as objects in the search query; these ensure that the FTS index searches the latest data you wrote to the vBucket.
The search service does not respond to the query until the designated vBucket receives the correct sequence number.
The search query remains blocked while continuously polling the vBucket for the requested data. Once the sequence number of the data is obtained, the query is executed over the data written to the vBucket.
When using this consistency mode, the search service ensures that the indexes are synchronized with the data service before querying.
Workflow to understand Consistency
-
Create an FTS index in Couchbase.
-
Write a document to the Couchbase cluster.
-
Couchbase returns the associated vector to the application, which then issues a query request with the vector.
-
The FTS index starts searching the data written to the vBucket.
In this workflow, it is possible that the document written to the vBucket is not yet indexed. So, when FTS starts searching that document, the most up-to-date document versions are not retrieved, and only the indexed versions are queried.
Therefore, the Couchbase server provides a consistency mechanism to overcome this issue and ensures that the FTS index can search the most up-to-date document written to the vBucket.
Consistency Level
The consistency level is a parameter that either takes an empty string indicating the unbounded (not_bounded) consistency or at_plus indicating the bounded consistency.
at_plus
Executes the query, requiring indexes first to be updated to the timestamp of the last update.
This implements bounded consistency. The request includes a scan_vector parameter and value, which is used as a lower bound. This can be used to implement read-your-own-writes (RYOW).
If index-maintenance is running behind, the query waits for it to catch up.
not_bounded
Executes the query immediately, without requiring any consistency for the query. No timestamp vector is used in the index scan.
This is the fastest mode, because it avoids the costs of obtaining the vector and waiting for the index to catch up to the vector.
If index-maintenance is running behind, out-of-date results may be returned.
Consistency Vectors
The consistency vectors supporting the consistency mechanism in Couchbase contain the mapping of the vBucket and sequence number of the data stored in the vBucket.
For more information about the consistency mechanism, see Consistency.
Example
{
"ctl": {
"timeout": 10000,
"consistency": {
"vectors": {
"index1": {
"607/205096593892159": 2,
"640/298739127912798": 4
}
},
"level": "at_plus"
}
},
"query": {
"match": "jack",
"field": "name"
}
}
In the example above, the set of consistency vectors is:
"index1": { "607/205096593892159": 2, "640/298739127912798": 4 }
Within the FTS index "index1", the query looks for:
-
vbucket 607 (with UUID 205096593892159) to contain sequence number 2
-
vbucket 640 (with UUID 298739127912798) to contain sequence number 4
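A request body like the one above can be assembled programmatically. The following is a minimal Python sketch, assuming the application already holds the vBucket ID, vBucket UUID, and sequence number of each write; the function name and the tuple layout are hypothetical and not part of any Couchbase SDK.

```python
import json

def build_at_plus_ctl(index_name, tokens, timeout_ms=10000):
    """Build a "ctl" object that requests at_plus consistency.

    tokens: iterable of (vbucket_id, vbucket_uuid, seqno) tuples,
    a hypothetical layout chosen for this sketch.
    """
    vectors = {}
    for vbucket_id, vbucket_uuid, seqno in tokens:
        key = f"{vbucket_id}/{vbucket_uuid}"
        # Keep the highest sequence number seen for each vBucket.
        vectors[key] = max(seqno, vectors.get(key, 0))
    return {
        "timeout": timeout_ms,
        "consistency": {"vectors": {index_name: vectors}, "level": "at_plus"},
    }

request = {
    "ctl": build_at_plus_ctl(
        "index1", [(607, 205096593892159, 2), (640, 298739127912798, 4)]
    ),
    "query": {"match": "jack", "field": "name"},
}
print(json.dumps(request, indent=2))
```

When several writes hit the same vBucket, only the highest sequence number needs to be sent, which is why the sketch keeps the maximum per key.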
Consistency Timeout
The timeout is the amount of time (in milliseconds) that the search service allows for a query to execute at the index partition level.
If the query execution exceeds this timeout value, the query is canceled. However, if some of the index partitions have already responded at that point, you might see partial results; otherwise, no results are returned.
{
"ctl": {
"timeout": 10000,
"consistency": {
"vectors": {
"index1": {
"607/205096593892159": 2,
"640/298739127912798": 4
}
},
"level": "at_plus"
}
},
"query": {
"match": "jack",
"field": "name"
}
}
Consistency Results
The consistency result is the attribute that you use to set the query result option, such as complete.
The "Complete" option
The complete option marks the query result as "complete": if any of the index partitions are unavailable because their node is unreachable, the query returns an error in the response instead of partial results.
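As a rough illustration, the Python sketch below adds the complete option to a request and checks a response for partition failures. Both the placement of a "results" key inside "ctl.consistency" and the "status" object with a "failed" counter in the response are assumptions of this sketch; verify them against the REST API reference for your server version.

```python
def with_complete_results(request):
    """Return a copy of the request that asks for complete results only."""
    updated = dict(request)
    ctl = dict(updated.get("ctl", {}))
    consistency = dict(ctl.get("consistency", {}))
    consistency["results"] = "complete"  # assumed key name and placement
    ctl["consistency"] = consistency
    updated["ctl"] = ctl
    return updated

def is_partial(response):
    """True if the response reports that some index partitions failed."""
    status = response.get("status", {})  # assumed response shape
    return status.get("failed", 0) > 0

request = with_complete_results({"query": {"match": "jack", "field": "name"}})
print(request["ctl"])
print(is_partial({"status": {"total": 6, "failed": 1, "successful": 5}}))
```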
Consistency Tips and Recommendations
Consistency vectors provide "read your own writes" functionality, where the read operation waits for a specific time until the write operation is finished.
When users know that their queries are complex and that the write operations need more time to complete, they can set the timeout value higher than the default timeout of 10 seconds so that consistency can be obtained in the search operations.
However, if this consistency is not required, users can optimize their search operations by using the default timeout of 10 seconds.
Size/From/Pages
The number of results obtained for a Full Text Search request can be large. Pagination of these results becomes essential for sorting and displaying a subset of these results.
There are multiple ways to achieve pagination with settings within a search request. Pagination will fetch a deterministic set of results when the results are sorted in a certain fashion.
Pagination provides the following options:
Size/from or offset/limit
These pagination settings can be used to obtain a subset of results, and they work deterministically when combined with a certain sort order.
Using size/limit and offset/from fetches at least size + from ordered results from a partition and then returns size results starting at offset from.
Deep pagination can therefore get expensive when using size + from on a sharded index, because each shard may have to return a large result set (at least size + from) over the network for merging at the coordinating node before the size results starting at offset from are returned.
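The cost described above can be illustrated with a small in-memory simulation. This is only a sketch of the general scatter-gather idea, not the search service's implementation; the partition contents and function names are made up for the example.

```python
import heapq

def top_k_per_partition(partition_scores, k):
    """Each partition returns its k highest-scoring hits, best first."""
    return heapq.nlargest(k, partition_scores)

def paginate(partitions, size, offset):
    """Coordinator-side merge for a size/from request."""
    k = size + offset  # every partition must return this many candidates
    candidates = []
    for scores in partitions:
        candidates.extend(top_k_per_partition(scores, k))
    merged = sorted(candidates, reverse=True)
    page = merged[offset:offset + size]
    return page, len(candidates)

partitions = [
    [9.1, 7.4, 3.3, 1.0],
    [8.8, 8.1, 2.5, 0.7],
    [6.9, 5.5, 4.2, 0.3],
]
page, transferred = paginate(partitions, size=2, offset=1)
print(page, transferred)
```

Here nine candidates cross the network to produce a page of two results, and the number of candidates grows with the offset even though the page size stays fixed.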
The default sort order is based on score (relevance) where the results are ordered from the highest to the lowest score.
Search after/before
For efficient pagination, you can use the search_after/search_before settings.
search_after is designed to fetch the size number of results after the key specified, and search_before is designed to fetch the size number of results before the key specified.
These settings allow the client to maintain state while paginating: the sort key of the last result (for search_after) or the first result (for search_before) in the current page.
Both attributes accept an array of strings (sort keys). The length of this array must be the same as the length of the "sort" array within the search request.
You cannot use both search_after and search_before in the same search request.
Example
Here are some examples using search_after/search_before over the sort key "_id" (an internal field that carries the document ID).
{
"query": {
"match": "California",
"field": "state"
},
"sort": ["_id"],
"search_after": ["hotel_10180"],
"size": 3
}
{
"query": {
"match": "California",
"field": "state"
},
"sort": ["_id"],
"search_before": ["hotel_17595"],
"size": 4
}
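A client typically derives the next request from the page it just received. The Python sketch below assumes that each hit in the response carries a "sort" array holding its sort key values; the function name and the sample hits are hypothetical.

```python
import copy

def next_page_request(request, current_hits):
    """Build the request for the page that follows current_hits."""
    if not current_hits:
        return None  # nothing left to page through
    last_sort_key = current_hits[-1]["sort"]
    if len(last_sort_key) != len(request["sort"]):
        raise ValueError("sort key length must match the request's sort array")
    following = copy.deepcopy(request)
    # search_after and search_before cannot be combined in one request.
    following.pop("search_before", None)
    following["search_after"] = [str(value) for value in last_sort_key]
    return following

request = {
    "query": {"match": "California", "field": "state"},
    "sort": ["_id"],
    "size": 3,
}
hits = [
    {"id": "hotel_10025", "sort": ["hotel_10025"]},
    {"id": "hotel_10026", "sort": ["hotel_10026"]},
    {"id": "hotel_10180", "sort": ["hotel_10180"]},
]
following = next_page_request(request, hits)
print(following["search_after"])
```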
A Full Text Search request that doesn't carry any pagination settings returns the first 10 results ("size": 10, "from": 0), ordered by score from highest to lowest.
Pagination tips and recommendations
The pagination of search results can be done using the from and size parameters in the search request. But as the search gets into deeper pages, it starts consuming more resources.
To safeguard against arbitrarily high memory requirements, FTS provides a configurable limit, bleveMaxResultWindow (default 10000), on the maximum allowable page offsets. However, raising this limit is not a scalable solution.
To circumvent this problem, FTS introduces the concept of keyset pagination.
Instead of providing from as a number of search results to skip, the user provides the sort value of a previously seen search result (usually, the last result shown on the current page). The idea is that to show the next page of results, we just want the top N results of that sort after the last result from the previous page.
This solution requires a few preconditions to be met:
-
The search request must specify a sort order.
The sort order must impose a total order on the results. Without this, any results which share the same sort value might be left out when handling the page navigation boundaries.
A common solution to this is to always include the document ID as the final sort criterion.
For example, if you want to sort by ["name", "-age"], sort by ["name", "-age", "_id"] instead.
With search_after/search_before pagination, the heap memory requirement of deeper page searches is proportional to the requested page size alone. This reduces the heap memory requirement of deeper page searches significantly, down from the size + from values.
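The total-order precondition can be enforced on the client before a request is sent. The helper below is a minimal sketch with a hypothetical name; it only handles string sort keys and leaves sort objects untouched.

```python
def with_id_tiebreaker(sort):
    """Return a sort array that uses the document ID to break ties."""
    keys = list(sort)
    has_id = any(
        isinstance(key, str) and key.lstrip("-") == "_id" for key in keys
    )
    if not has_id:
        keys.append("_id")
    return keys

print(with_id_tiebreaker(["name", "-age"]))
```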
Sorting
The FTS results are returned as objects. An FTS query includes options to order the results.
Sorting Result Data
By default, FTS results are sorted in descending order of relevance. Sorting can, however, be customized to use different fields, depending on the application.
On query-completion, sorting allows specified members of the result-set to be displayed prior to others: this facilitates a review of the most significant data.
Within a JSON query object, the required sort-type is specified by using the sort field.
This takes an array of strings, objects, or numeric values as its value.
Sorting with Strings
You can specify the value of the sort field as an array of strings.
These can be of three types:
-
field name: Specifies the name of a field.
If multiple fields are included in the array, the sorting of documents begins according to their values for the field whose name is first in the array.
If any of these values are identical, their documents are sorted again according to their values for the field whose name is second; any remaining ties are then broken by the field whose name is third, and so on.
Any document-field may be specified to hold the value on which sorting is to be based, provided that the field has been indexed in some way, whether dynamically or specifically.
The default sort-order is ascending. If a field-name is prefixed with the - character, that field's results are sorted in descending order.
-
_id: Refers to the document identifier. Whenever encountered in the array, causes sorting to occur by document identifier.
-
_score: Refers to the score assigned to the document in the result-set. Whenever encountered in the array, causes sorting to occur by score.
Example
"sort": ["country", "state", "city","-_score"]
This sort statement specifies that results will first be sorted by country.
If some documents are then found to have the same value in their country fields, they are re-sorted by state.
Next, if some of these documents are found to have the same value in their state fields, they are re-sorted by city.
Finally, if some of these documents are found to have the same value in their city fields, they are re-sorted by score, in descending order.
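The tie-breaking behavior described above can be mimicked on a list of in-memory records. This Python sketch only illustrates the ordering semantics of a string sort array; it is not how the search service sorts, and the sample records are invented.

```python
def sort_hits(hits, sort_keys):
    """Order hits by several keys; a leading "-" means descending."""
    ordered = list(hits)
    # Apply the keys from last to first. Python's sort is stable, so
    # earlier passes survive as tie-breakers and the first key ends up
    # as the primary ordering.
    for key in reversed(sort_keys):
        descending = key.startswith("-")
        name = key.lstrip("-")
        ordered.sort(key=lambda hit: hit[name], reverse=descending)
    return ordered

hits = [
    {"country": "US", "state": "CA", "city": "LA", "_score": 1.2},
    {"country": "FR", "state": "IDF", "city": "Paris", "_score": 0.4},
    {"country": "US", "state": "CA", "city": "LA", "_score": 2.5},
    {"country": "US", "state": "AZ", "city": "Tucson", "_score": 0.9},
]
ordered = sort_hits(hits, ["country", "state", "city", "-_score"])
print([(h["country"], h["state"], h["_score"]) for h in ordered])
```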
The following JSON query demonstrates how and where the sort property can be specified:
{
"explain": false,
"fields": [
"title"
],
"highlight": {},
"sort": ["country", "-_score","-_id"],
"query":{
"query": "beautiful pool"
}
}
The following example shows how the sort field accepts combinations of strings and objects as its value.
{
...
"sort": [
"country",
{
"by" : "field",
"field" : "reviews.ratings.Overall",
"mode" : "max",
"missing" : "last",
"type": "number"
},
{
"by" : "field",
"field" : "reviews.ratings.Location",
"mode" : "max",
"missing" : "last",
"type": "number"
},
"-_score"
]
}
Sorting with Objects
Fine-grained control over sort-procedure can be achieved by specifying objects as array-values in the sort field.
Each object can have the following fields:
-
by: Sorts results on id, score, or a specified field in the Full Text Index.
-
field: Specifies the name of a field on which to sort. Used only if field has been specified as the value for the by field; otherwise ignored.
-
missing: Specifies the sort-procedure for documents with a missing value in a field specified for sorting. The value of missing can be first, in which case results with missing values appear before other results; or last (the default), in which case they appear after.
-
mode: Specifies the search-order for index-fields that contain multiple values (in consequence of arrays or multi-token analyzer-output). The default order is undefined but deterministic, allowing the paging of results from from (offset), with reliable ordering. To sort using the minimum or maximum value, the value of mode should be set to either min or max.
-
type: Specifies the type of the search-order field value. For example, string for text fields, date for DateTime fields, or number for numeric/geo fields.
To fetch more accurate sort results, we strongly recommend specifying the type of the sort fields in the sort section of the search request.
Example
The example below shows how to specify the object-sort.
The sample below assumes that the travel-sample bucket has been loaded, and a default index has been created on it.
{
"explain": false,
"fields": [
"*"
],
"highlight": {},
"query": {
"match": "bathrobes",
"field": "reviews.content",
"analyzer": "standard"
},
"size" : 10,
"sort": [
{
"by" : "field",
"field" : "reviews.ratings.Overall",
"mode" : "max",
"missing" : "last",
"type": "number"
}
]
}
For information on loading sample buckets, see Sample Buckets. For instructions on creating a default Full Text Index by means of the Couchbase Web Console, see Creating Index from UI.
This query sorts search-results based on reviews.ratings.Overall, a field that is normally multi-valued because it contains an array of different users' ratings.
When there are multiple values, the highest Overall rating is used for sorting.
Hotels with no Overall rating are placed at the end.
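To make the effect of "mode": "max" and "missing": "last" concrete, the Python sketch below applies the same rule to a few invented hotel records. It assumes the default ascending order and is an illustration of the semantics only.

```python
def sort_by_max_overall(hotels):
    """Ascending sort on the highest Overall rating; hotels without one go last."""
    def sort_key(hotel):
        ratings = [
            review["ratings"]["Overall"]
            for review in hotel.get("reviews", [])
            if "Overall" in review.get("ratings", {})
        ]
        if not ratings:
            return (1, 0)  # "missing": "last"
        return (0, max(ratings))  # "mode": "max"
    return sorted(hotels, key=sort_key)

hotels = [
    {"id": "hotel_a", "reviews": [{"ratings": {"Overall": 3}},
                                  {"ratings": {"Overall": 5}}]},
    {"id": "hotel_b", "reviews": []},
    {"id": "hotel_c", "reviews": [{"ratings": {"Overall": 4}}]},
]
print([hotel["id"] for hotel in sort_by_max_overall(hotels)])
```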
The following example shows how the sort field accepts combinations of strings and objects as its value.
{
"sort": [
"country",
{
"by" : "field",
"field" : "reviews.ratings.Overall",
"mode" : "max",
"missing" : "last",
"type": "number"
},
{
"by" : "field",
"field" : "reviews.ratings.Location",
"mode" : "max",
"missing" : "last",
"type": "number"
},
"-_score"
]
}
Sorting with Numeric
You can sort on a numeric value by using the type field in the object that you specify within the sort field.
With the type field, you can set the type of the search order to number, string, or date.
Example
The example below shows how to specify the object-sort with the type field set to number.
{
"explain": false,
"fields": [
"*"
],
"highlight": {},
"query": {
"match": "bathrobes",
"field": "reviews.content",
"analyzer": "standard"
},
"size" : 10,
"sort": [
{
"by" : "field",
"field" : "reviews.ratings.Overall",
"mode" : "max",
"missing" : "last",
"type": "number"
}
]
}
Tips for Sorting with fields
When you sort results on a field that is not indexed, or when a particular document is missing a value for that field, you will see the following series of Unicode non-printable characters appear in the sort field:
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd
The same characters may render differently when using a graphic tool or command line tools like jq.
"sort": [
"����������",
"hotel_9723",
"_score"
]
Check your index definition to confirm that you are indexing all the fields you intend to sort by. You can control the sort behavior for missing attributes using the missing field.
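A client can flag affected hits by looking for the placeholder in each hit's sort array. The Python sketch below assumes the placeholder is exactly the ten-character sequence shown above; the helper name and sample hits are hypothetical.

```python
MISSING_SORT_VALUE = "\ufffd" * 10  # the ten-character placeholder shown above

def hits_with_missing_sort_value(hits):
    """Return the IDs of hits whose sort array contains the placeholder."""
    return [
        hit["id"] for hit in hits
        if MISSING_SORT_VALUE in hit.get("sort", [])
    ]

hits = [
    {"id": "hotel_9723", "sort": [MISSING_SORT_VALUE, "hotel_9723", "_score"]},
    {"id": "hotel_1234", "sort": ["France", "hotel_1234", "_score"]},
]
print(hits_with_missing_sort_value(hits))
```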
Scoring
Search result scoring occurs at query time. The results of the search request are ordered by score (relevance), in descending order, unless explicitly set otherwise.
Couchbase uses a slightly modified version of the standard tf-idf algorithm. The modification normalizes the score.
For more details on tf-idf, refer to tf-idf.
By selecting the explain score option within the search request, you can obtain an explanation of how the score was calculated alongside each result.
A search query scores all the qualified documents for relevance and applies relevant filters.
In a search request, you can set score to none to disable scoring. See Score:none.
Example
The following sample query response shows the score field for each document retrieved for the query request:
"hits": [
{
"index": "DemoIndex_76059e8b3887351c_4c1c5584",
"id": "hotel_10064",
"score": 10.033205341869529,
"sort": [
"_score"
],
"fields": {
"_$c": "hotel"
}
},
{
"index": "DemoIndex_76059e8b3887351c_4c1c5584",
"id": "hotel_10063",
"score": 10.033205341869529,
"sort": [
"_score"
],
"fields": {
"_$c": "hotel"
}
}
],
"total_hits": 2,
"max_score": 10.033205341869529,
"took": 284614211,
"facets": null
}
tf-idf
tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that reflects how important a word is to a document in a collection or scope.
tf-idf is used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, and it is offset by the number of documents in the collection or scope that contain the word.
Search engines often use variations of the tf-idf weighting scheme as a tool in scoring and ranking a document's relevance for a given query. The tf-idf scoring of document relevancy is done on a per-partition basis, which means that documents across different partitions may have different scores.
When bleve scores a document, it sums a set of sub scores to reach the final score. Scores across different searches are not directly comparable, because they depend directly on the search criteria. So changing the search criteria, such as the terms or the boost factor, can change the score.
The number of conjuncts, disjuncts, and sub-clauses in a query can also influence the scoring. In addition, the score of a particular search result is not absolute, which means you can only use the score as a comparison to the highest score from the same search result.
FTS does not provide any predefined range for valid scores.
In the Couchbase application, you can explore the score computations for any search in FTS.
On the Search page, you can search for a term in any index. The search result displays the matching records along with the Explain Scoring option, which shows how the score of each search hit was derived using the tf-idf algorithm.
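For intuition, here is a minimal Python sketch of the textbook tf-idf weighting. It is not the modified formula that FTS uses, so the numbers will not match FTS scores; the tokenized documents are invented for the example.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook tf-idf of a term in one document, relative to a corpus."""
    # Term frequency: share of the document's tokens that are the term.
    tf = document.count(term) / len(document)
    # Document frequency: how many documents contain the term at all.
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf

corpus = [
    ["clean", "pool", "and", "beautiful", "pool"],
    ["beautiful", "view"],
    ["quiet", "room"],
]
pool_score = tf_idf("pool", corpus[0], corpus)
beautiful_score = tf_idf("beautiful", corpus[0], corpus)
print(pool_score, beautiful_score)
```

"pool" scores higher than "beautiful" in the first document because it occurs more often there and appears in fewer documents overall.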
Score:none
You can disable scoring by setting score to none in the search request. This is recommended in situations where scoring (document relevancy) is not needed by the application.
Using "score": "none" is expected to boost query performance in certain situations.
Scoring Tips and Recommendations
For a select term, FTS calculates the relevancy score, so documents with a higher relevancy score automatically appear at the top of the results.
Users often use Full Text Search for exact-match queries with a bit of fuzziness, or for other search-specific capabilities like geo.
The text relevancy score does not matter when the user is looking for exact or more targeted searches with many predicates, or when the dataset size is small.
In such cases, FTS unnecessarily uses more resources in calculating the relevancy score. Users can, however, optimize query performance by skipping the scoring, which is done by passing the "score": "none" option in the search request.
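A small helper can apply this option to any request body. The sketch below is illustrative only; the function name is hypothetical, and it assumes "score" is set at the top level of the search request.

```python
import json

def without_scoring(request):
    """Return a copy of the search request with relevance scoring disabled."""
    updated = dict(request)
    updated["score"] = "none"
    return updated

request = without_scoring({
    "query": {"match": "California", "field": "state"},
    "sort": ["_id"],
    "size": 10,
})
print(json.dumps(request))
```

Pairing "score": "none" with an explicit sort such as "_id" keeps the result order well defined when no relevance score is computed.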
Highlighting
The highlight object indicates whether highlighting was requested.
As a prerequisite, the term vectors and store options must be enabled at the field level to support highlighting.
The highlight object contains the following fields:
-
style - (Optional) Specifies the name of the highlighter. For example, "html" or "ansi".
-
fields - Specifies an array of field names to which highlighting is restricted.
Example 1
In the following example, when you search the content in the index, the matched content in the address field is highlighted in the search response.
curl -u username:password -XPOST -H "Content-Type: application/json" \
http://localhost:8094/api/index/travel-sample-index/query \
-d '{
"explain": true,
"fields": [
"*"
],
"highlight": {
"style":"html",
"fields": ["address"]
},
"query": {
"query": "address:farm"
}
}'
Example 2
In the following example, when you search the content in the index, the matched content in the description field is highlighted in the search response.
curl -u username:password -XPOST -H "Content-Type: application/json" \
http://localhost:8094/api/index/travel-sample-index/query \
-d '{
"explain": true,
"fields": [
"*"
],
"highlight": {
"style":"html",
"fields": ["description"]
},
"query": {
"query": "description:complementary breakfast"
}
}'
Fields
You can store specific document fields within FTS and retrieve those as a part of the search results.
It involves the following two-step process:
-
Indexing
You need to specify the desired fields of the matching documents to be retrieved as a part of the index definition. To do so, select the "store" option checkbox in the field mapping definition for the desired fields. The FTS index will store the original field contents intact (without applying any text analysis) as a part of its internal storage.
For example, if you want to retrieve the field "description" in the document, enable the "store" option for it.
-
Searching
You need to specify the fields to be retrieved in the "fields" setting within the search request. This setting takes an array of field names which will be returned as part of the search response. The field names must be specified as strings. While there is no field name pattern matching available, you can use an asterisk ("*") to specify that all stored fields be returned with the response.
To retrieve the contents of the aforementioned "description" field, you may use the following search request.
curl -XPOST -H "Content-Type: application/json" -u <username>:<password> \
http://host:port/api/index/FTS/query \
-d '{
"fields": ["description"],
"query": {
"field": "queryFieldName",
"match": "query text"
}
}'
Search Facets
Facets are aggregate information collected on a particular result set. For any search, the user can collect additional facet information along with it.
All the facet examples below are for the query "water" on the beer-sample dataset.
FTS supports the following types of facets:
Term Facet
A term facet counts how many matching documents have a particular term for a specific field.
When building a term facet, use the keyword analyzer. Otherwise, multi-term values get tokenized, and the user gets unexpected results.
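What a term facet counts can be reproduced on a handful of in-memory documents. The Python sketch below is an illustration only; it assumes "total" counts the values seen, "missing" counts matching documents without the field, and "other" counts values outside the top size terms.

```python
from collections import Counter

def term_facet(docs, field, size):
    """Count field values across matching docs, keeping the top terms."""
    values = [doc[field] for doc in docs if field in doc]
    top = Counter(values).most_common(size)
    return {
        "field": field,
        "total": len(values),
        "missing": len(docs) - len(values),
        "other": len(values) - sum(count for _, count in top),
        "terms": [{"term": term, "count": count} for term, count in top],
    }

docs = [
    {"type": "beer"}, {"type": "beer"}, {"type": "beer"},
    {"type": "brewery"},
    {"name": "no type field"},
]
facet = term_facet(docs, "type", size=5)
print(facet)
```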
Example
-
Term Facet - computes a facet on the type field, which has two values: beer and brewery.
curl -X POST -H "Content-Type: application/json" -u <username>:<password> \
http://localhost:8094/api/index/bix/query -d \
'{
"size": 10,
"query": {
"boost": 1,
"query": "water"
},
"facets": {
"type": {
"size": 5,
"field": "type"
}
}
}'
The result snippet below only shows the facet section for clarity. Run the curl command to see the HTTP response containing the full results.
"facets": {
"type": {
"field": "type",
"total": 91,
"missing": 0,
"other": 0,
"terms": [
{ "term": "beer", "count": 70 },
{ "term": "brewery", "count": 21 }
]
}
}
Numeric Range Facet
A numeric range facet works by the users defining their own buckets (numeric ranges).
The facet then counts how many of the matching documents fall into a particular bucket for a particular field.
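The bucket counting can be sketched in Python as follows. The boundary handling (min inclusive, max exclusive) is an assumption of this sketch, as is the meaning given to "total", "missing", and "other"; the ranges are expected not to overlap.

```python
def numeric_range_facet(docs, field, ranges):
    """Count matching docs per named numeric range."""
    values = [doc[field] for doc in docs if field in doc]
    buckets = []
    for rng in ranges:
        low = rng.get("min", float("-inf"))
        high = rng.get("max", float("inf"))
        # Assumed boundaries: min is inclusive, max is exclusive.
        count = sum(1 for value in values if low <= value < high)
        buckets.append({**rng, "count": count})
    counted = sum(bucket["count"] for bucket in buckets)
    return {
        "field": field,
        "total": len(values),
        "missing": len(docs) - len(values),
        "other": len(values) - counted,
        "numeric_ranges": buckets,
    }

docs = [{"abv": 9.0}, {"abv": 7.0}, {"abv": 4.5}, {"name": "a brewery"}]
ranges = [{"name": "high", "min": 7}, {"name": "low", "max": 7}]
facet = numeric_range_facet(docs, "abv", ranges)
print(facet)
```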
Example
-
Numeric Range Facet - computes a facet on the abv field with 2 buckets describing high (greater than 7) and low (less than 7).
curl -X POST -H "Content-Type: application/json" -u <username>:<password> \
http://localhost:8094/api/index/bix/query -d \
'{
"size": 10,
"query": {
"boost": 1,
"query": "water"
},
"facets": {
"abv": {
"size": 5,
"field": "abv",
"numeric_ranges": [
{ "name": "high", "min": 7 },
{ "name": "low", "max": 7 }
]
}
}
}'
Results:
"facets": {
"abv": {
"field": "abv",
"total": 70,
"missing": 21,
"other": 0,
"numeric_ranges": [
{ "name": "high", "min": 7, "count": 13 },
{ "name": "low", "max": 7, "count": 57 }
]
}
}
Date Range Facet
The date range facet is the same as the numeric range facet, but it works on dates instead of numbers.
Full Text Search and Bleve expect dates to be in the format specified by RFC-3339, which is a specific, more restrictive profile of ISO-8601.
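Date buckets can be emulated in Python by parsing RFC 3339 timestamps into timezone-aware values. In this sketch, treating a bare date as midnight UTC and using an inclusive start with an exclusive end are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timezone

def parse_rfc3339(text):
    """Parse a date or an RFC 3339 timestamp into an aware UTC datetime."""
    if len(text) == 10:  # date only, for example "2010-08-01"
        text += "T00:00:00Z"
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc)

def date_range_counts(timestamps, ranges):
    """Count timestamps per named range (start inclusive, end exclusive)."""
    values = [parse_rfc3339(stamp) for stamp in timestamps]
    counts = {}
    for rng in ranges:
        start = parse_rfc3339(rng["start"]) if "start" in rng else None
        end = parse_rfc3339(rng["end"]) if "end" in rng else None
        counts[rng["name"]] = sum(
            1 for value in values
            if (start is None or value >= start)
            and (end is None or value < end)
        )
    return counts

timestamps = [
    "2010-07-22T20:00:20Z",
    "2010-08-01T00:00:00Z",
    "2011-01-05T10:30:00+01:00",
]
ranges = [
    {"name": "old", "end": "2010-08-01"},
    {"name": "new", "start": "2010-08-01"},
]
counts = date_range_counts(timestamps, ranges)
print(counts)
```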
Example
-
Date Range Facet - computes a facet on the updated field with 2 buckets, old and new.
curl -XPOST -H "Content-Type: application/json" -u <username>:<password> \
http://<node>:8094/api/index/bix/query \
-d '{
"ctl": { "timeout": 0 },
"from": 0,
"size": 0,
"query": {
"field": "country",
"term": "united"
},
"facets": {
"types": {
"size": 10,
"field": "updated",
"date_ranges": [
{ "name": "old", "end": "2010-08-01" },
{ "name": "new", "start": "2010-08-01" }
]
}
}
}'
Results
"facets": {
"types": {
"field": "updated",
"total": 954,
"missing": 0,
"other": 0,
"date_ranges": [
{ "name": "old", "end": "2010-08-01T00:00:00Z", "count": 934 },
{ "name": "new", "start": "2010-08-01T00:00:00Z", "count": 20 }
]
}
}
Collections
The collections field lets the user specify an optional list of collection names.
This helps users scope their search request to only the specified collections within a multi-collection index.
This is useful with multi-collection indexes because it can speed up searches, and it lets the user manage Role-Based Access Control more granularly: the search user only needs permissions for the requested collections, and not for every other collection indexed within the index.
In the absence of any collection names, the search request is treated as a normal search request and retrieves documents from all the indexed collections within the index.
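As a sketch of how a client might scope a request, the Python helper below adds a list of collection names to a request body. The "collections" key name is assumed from the field described above, and the helper name and collection names are hypothetical.

```python
def scoped_to_collections(request, collections):
    """Return a copy of the request restricted to the given collections."""
    updated = dict(request)
    if collections:
        updated["collections"] = list(collections)  # assumed key name
    else:
        # No names given: the request searches all indexed collections.
        updated.pop("collections", None)
    return updated

base = {"query": {"match": "pool", "field": "description"}}
scoped = scoped_to_collections(base, ["hotel", "landmark"])
print(scoped)
```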
IncludeLocations
Search is capable of returning the array positions for the search term relative to the whole document hierarchical structure. If the user sets IncludeLocations to true, the search service returns the array_positions of the search term occurrences inside the document. The user has to enable the term_vector field option for the relevant field during indexing in order to fetch the location details at search time.
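The Python sketch below shows one way a client might read these positions from a hit. The response layout it assumes (a "locations" object keyed by field and then by term, with an "array_positions" entry per occurrence) and the sample hit are assumptions for illustration; check an actual response from your cluster before relying on them.

```python
def array_positions(hit, field):
    """Map each matched term in a field to the array positions of its occurrences."""
    by_term = hit.get("locations", {}).get(field, {})  # assumed response shape
    return {
        term: [occurrence.get("array_positions") for occurrence in occurrences]
        for term, occurrences in by_term.items()
    }

hit = {
    "id": "hotel_10064",
    "locations": {
        "reviews.content": {
            "pool": [
                {"pos": 4, "start": 20, "end": 24, "array_positions": [0]},
                {"pos": 9, "start": 51, "end": 55, "array_positions": [2]},
            ]
        }
    },
}
positions = array_positions(hit, "reviews.content")
print(positions)
```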