From bf028fa8a653d6379a3257176ce43873f5163798 Mon Sep 17 00:00:00 2001 From: Teodor Sigaev Date: Tue, 31 Oct 2006 16:23:05 +0000 Subject: [PATCH] Add description of new features --- contrib/tsearch2/docs/tsearch-V2-intro.html | 6 +- contrib/tsearch2/docs/tsearch2-guide.html | 52 ++- contrib/tsearch2/docs/tsearch2-ref.html | 535 ++++++++++++++++++++++++---- 3 files changed, 503 insertions(+), 90 deletions(-) diff --git a/contrib/tsearch2/docs/tsearch-V2-intro.html b/contrib/tsearch2/docs/tsearch-V2-intro.html index b9cb80574e..8b2514e5be 100644 --- a/contrib/tsearch2/docs/tsearch-V2-intro.html +++ b/contrib/tsearch2/docs/tsearch-V2-intro.html @@ -427,9 +427,9 @@ concatenation also works with NULL fields.

We need to create the index on the column idxFTI. Keep in mind that the database will update the index when some action is taken. In this case we _need_ the index (The whole point of Full Text -INDEXINGi ;-)), so don't worry about any indexing overhead. We will -create an index based on the gist function. GiST is an index -structure for Generalized Search Tree.

+INDEXING ;-)), so don't worry about any indexing overhead. We will +create an index based on the gist or gin function. GiST is an index +structure for Generalized Search Tree, GIN is an inverted index (see The tsearch2 Reference: Indexes).

         CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
         VACUUM FULL ANALYZE;
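For the GIN alternative mentioned above (available since PostgreSQL 8.2), only the index method changes; this is a sketch reusing the same table and column as the GiST example:

        CREATE INDEX idxFTI_idx ON tblMessages USING gin(idxFTI);
        VACUUM FULL ANALYZE;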
diff --git a/contrib/tsearch2/docs/tsearch2-guide.html b/contrib/tsearch2/docs/tsearch2-guide.html
index 5540e5d323..d2d764580c 100644
--- a/contrib/tsearch2/docs/tsearch2-guide.html
+++ b/contrib/tsearch2/docs/tsearch2-guide.html
@@ -1,7 +1,6 @@
 
 
 
-
 tsearch2 guide
 
 
@@ -9,16 +8,13 @@
 
 

Brandon Craig Rhodes
30 June 2003 +
Updated to 8.2 release by Oleg Bartunov, October 2006

This Guide introduces the reader to the PostgreSQL tsearch2 module, version 2. More formal descriptions of the module's types and functions are provided in the tsearch2 Reference, which is a companion to this document. -You can retrieve a beta copy of the tsearch2 module from the -GiST for PostgreSQL -page — look under the section entitled Development History -for the current version.

First we will examine the tsvector and tsquery types and how they are used to search documents; @@ -32,15 +28,40 @@ you should be able to run the examples here exactly as they are typed.


Table of Contents

+Introduction to FTS with tsearch2
Vectors and Queries
A Simple Search Engine
Ranking and Position Weights
Casting Vectors and Queries
Parsing and Lexing
+Additional information

+ +

Introduction to FTS with tsearch2

+The purpose of FTS is to
+find documents which satisfy a query and optionally return
+them in some order.
+The most common case: find documents containing all query terms and return them in order
+of their similarity to the query. A document in the database can be
+any text attribute, or a combination of text attributes from one or many tables
+(using joins).
+Text search operators have existed for years; in PostgreSQL they are
+~, ~*, LIKE, ILIKE, but they lack linguistic support,
+tend to be slow and have no relevance ranking. The idea behind tsearch2
+is rather simple - preprocess the document at index time to save time at the search stage.
+Preprocessing includes parsing the document into tokens, converting the tokens to lexemes,
+and storing the preprocessed document in a form optimized for searching.
+
+Tsearch2, in a nutshell, provides the FTS operator (contains) for two new data types,
+which represent a document and a query - tsquery @@ tsvector.
+
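+As a quick illustration of the contains operator (a sketch; the same example appears in the Reference section on the Full Text Search operator):
+
+=# SELECT 'cat & rat'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+----------
+ t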

Vectors and Queries

@@ -79,6 +100,8 @@ Preparing your document index involves two steps: on the tsvector column of a table, which implements a form of the Berkeley Generalized Search Tree. + Since PostgreSQL 8.2, tsearch2 supports the GIN index, + which is an inverted index commonly used in search engines. It adds scalability to tsearch2. Once your documents are indexed, performing a search involves: @@ -251,7 +274,7 @@ and give you an error to prevent this mistake:
 =# SELECT to_tsquery('the')
-NOTICE:  Query contains only stopword(s) or doesn't contain lexeme(s), ignored
+NOTICE:  Query contains only stopword(s) or doesn't contain lexem(s), ignored
  to_tsquery 
 ------------
  
@@ -483,8 +506,8 @@ The rank() function existed in older versions of OpenFTS,
 and has the feature that you can assign different weights
 to words from different sections of your document.
 The rank_cd() uses a recent technique for weighting results
-but does not allow different weight to be given
-to different sections of your document.
+and also allows different weights to be given
+to different sections of your document (since 8.2).
 

Both ranking functions allow you to specify, as an optional last argument, @@ -511,9 +534,6 @@ for details see the section on ranking in the Reference.

-The rank() function offers more flexibility -because it pays attention to the weights -with which you have labelled lexeme positions. Currently tsearch2 supports four different weight labels: 'D', the default weight; and 'A', 'B', and 'C'. @@ -730,7 +750,7 @@ The main problem is that the apostrophe and backslash are important both to PostgreSQL when it is interpreting a string, and to the tsvector conversion function. You may want to review section -1.1.2.1, + “String Constants” in the PostgreSQL documentation before proceeding.

@@ -1051,6 +1071,14 @@ using the same scheme to determine the dictionary for each token, with the difference that the query parser recognizes as special the boolean operators that separate query words. + +

Additional information

+More information about tsearch2 is available from +the tsearch2 page. +Also, it's worth checking +the tsearch2 wiki pages. + + diff --git a/contrib/tsearch2/docs/tsearch2-ref.html b/contrib/tsearch2/docs/tsearch2-ref.html index 85401e83e7..7edcc55a9b 100644 --- a/contrib/tsearch2/docs/tsearch2-ref.html +++ b/contrib/tsearch2/docs/tsearch2-ref.html @@ -1,53 +1,74 @@ - -tsearch2 reference + + + +tsearch2 reference

The tsearch2 Reference

Brandon Craig Rhodes
30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003). -

+
Massive update for 8.2 release by Oleg Bartunov, October 2006 +

+

This Reference documents the user types and functions of the tsearch2 module for PostgreSQL. An introduction to the module is provided -by the tsearch2 Guide, +by the tsearch2 Guide, a companion document to this one. -You can retrieve a beta copy of the tsearch2 module from the -GiST for PostgreSQL -page -- look under the section entitled Development History -for the current version. +

+ +

Table of Contents

+
+Vectors and Queries
+Vector Operations
+Query Operations
+Full Text Search Operator
+Configurations
+Testing
+Parsers
+Dictionaries
+Ranking
+Headlines
+Indexes
+Thesaurus dictionary
+
+ + -

Vectors and Queries

-Vectors and queries both store lexemes, +

Vectors and Queries

+ +Vectors and queries both store lexemes, but for different purposes. A tsvector stores the lexemes of the words that are parsed out of a document, and can also remember the position of each word. A tsquery specifies a boolean condition among lexemes. -

-Any of the following functions with a configuration argument +

+Any of the following functions with a configuration argument can use either an integer id or textual ts_name to select a configuration; if the option is omitted, then the current configuration is used. For more information on the current configuration, read the next section on Configurations. +

-

Vector Operations

+

Vector Operations

- to_tsvector( [configuration,] - document TEXT) RETURNS tsvector -
- Parses a document into tokens, +to_tsvector( [configuration,] + document TEXT) RETURNS TSVECTOR +
+ Parses a document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. For the best description of this process, - see the section on Parsing and Stemming + see the section on Parsing and Stemming in the accompanying tsearch2 Guide.
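+A usage sketch (assuming the default configuration; the exact output formatting may differ slightly):
+
+=# select to_tsvector('default', 'a fat cat sat on a mat and ate a fat rat');
+                     to_tsvector
+----------------------------------------------------
+ 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4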
- strip(vector tsvector) RETURNS tsvector + strip(vector TSVECTOR) RETURNS TSVECTOR
Return a vector which lists the same lexemes as the given vector, @@ -56,10 +77,10 @@ read the next section on Configurations. While the returned vector is thus useless for relevance ranking, it will usually be much smaller.
- setweight(vector tsvector, letter) RETURNS tsvector + setweight(vector TSVECTOR, letter) RETURNS TSVECTOR
This function returns a copy of the input vector - in which every location has been labelled + in which every location has been labeled with either the letter 'A', 'B', or 'C', or the default label 'D' @@ -68,11 +89,11 @@ read the next section on Configurations. These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions. -
- vector1 || vector2 -
- concat(vector1 tsvector, vector2 tsvector) - RETURNS tsvector + +
+ vector1 || vector2
+ concat(vector1 TSVECTOR, vector2 TSVECTOR) + RETURNS TSVECTOR
Returns a vector which combines the lexemes and position information in the two vectors given as arguments. @@ -95,27 +116,81 @@ read the next section on Configurations. to the rank() function that assigns different weights to positions with different labels.
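+A short sketch of how setweight() and concatenation interact (positions from the second vector are shifted past those of the first; output formatting may vary):
+
+=# select setweight('fat:1 cat:2'::tsvector, 'A') || 'rat:1'::tsvector;
+         ?column?
+---------------------------
+ 'cat':2A 'fat':1A 'rat':3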
- tsvector_size(vector tsvector) RETURNS INT4 + length(vector TSVECTOR) RETURNS INT4
Returns the number of lexemes stored in the vector.
- text::tsvector RETURNS tsvector + text::TSVECTOR RETURNS TSVECTOR
Directly casting text to a tsvector allows you to directly inject lexemes into a vector, with whatever positions and position weights you choose to specify. The text should be formatted like the vector would be printed by the output of a SELECT. - See the Casting + See the Casting section in the Guide for details. -
+
+ tsearch2(vector_column_name[, (my_filter_name | text_column_name1) [...] ], text_column_nameN) +
+The tsearch2() trigger is used to automatically update vector_column_name; my_filter_name
+is the name of a function used to preprocess text_column_name. There can be many
+functions and text columns specified in the tsearch2() trigger.
+The following rule is used:
+a function is applied to all subsequent text columns until the next function occurs.
+For example, the function dropatsymbol replaces all entries of the @
+sign by a space.
+CREATE FUNCTION dropatsymbol(text) RETURNS text 
+AS 'select replace($1, ''@'', '' '');'
+LANGUAGE SQL;
+
+CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT 
+ON tblMessages FOR EACH ROW EXECUTE PROCEDURE 
+tsearch2(tsvector_column,dropatsymbol, strMessage);
+
+
-

Query Operations

+
+stat(sqlquery text [, weight text]) RETURNS SETOF statinfo +
+Here statinfo is a type, defined as + +CREATE TYPE statinfo as (word text, ndoc int4, nentry int4) + and sqlquery is a query which returns a tsvector column.

+This returns statistics (the number of documents ndoc and the total number of entries nentry of each word +in the collection) about a tsvector column. +It is useful to check how good your configuration is and +to find stop-word candidates. For example, to find the top 10 most frequent words:

+=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
+
+Optionally, one can specify weight to obtain statistics about words with specific weight. +
+=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
+
-
- to_tsquery( [configuration,] - querytext text) RETURNS tsvector +
+
+TSVECTOR < TSVECTOR
+TSVECTOR <= TSVECTOR
+TSVECTOR = TSVECTOR
+TSVECTOR >= TSVECTOR
+TSVECTOR > TSVECTOR
+All btree operations defined for tsvector type. tsvectors compares +with each other using lexicographical order. +
+ + +

Query Operations

+ +
+
+ to_tsquery( [configuration,] + querytext text) RETURNS TSQUERY
+
Parses a query, which should be single words separated by the boolean operators "&" and, @@ -123,14 +198,27 @@ read the next section on Configurations. and "!" not, which can be grouped using parenthesis. Each word is reduced to a lexeme using the current - or specified configuration. - + or specified configuration. + Weight class can be assigned to each lexeme entry + to restrict search region + (see setweight for explanation), for example + "fat:a & rats". +
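+A sketch of the weighted form (assuming the default configuration; output formatting may differ slightly):
+
+=# select to_tsquery('default', 'fat:a & rats');
+   to_tsquery
+-----------------
+ 'fat':A & 'rat'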
+
+ plainto_tsquery( [configuration,] + querytext text) RETURNS TSQUERY +
+
+Transforms unformatted text to a tsquery. It is the same as to_tsquery, +but assumes an implicit "&" boolean operator between words and doesn't +recognize weight classes.
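+A sketch (again assuming the default configuration, where 'the' is a stop word):
+
+=# select plainto_tsquery('default', 'The Fat Rats');
+ plainto_tsquery
+-----------------
+ 'fat' & 'rat'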
- querytree(query tsquery) RETURNS text + + querytree(query TSQUERY) RETURNS text
- This might return a textual representation of the given query. +This returns the query which is actually used when searching with a GiST index.
- text::tsquery RETURNS tsquery + text::TSQUERY RETURNS TSQUERY
Directly casting text to a tsquery allows you to directly inject lexemes into a query, @@ -139,7 +227,117 @@ read the next section on Configurations. like the query would be printed by the output of a SELECT. See the Casting section in the Guide for details. -
+ +
+ numnode(query TSQUERY) RETURNS INTEGER +
+This returns the number of nodes in the query tree
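+For example (a sketch; lexemes and operators are both counted as nodes):
+
+=# select numnode('(fat & rat) | cat'::tsquery);
+ numnode
+---------
+       5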
+ TSQUERY && TSQUERY RETURNS TSQUERY +
+AND-ed TSQUERY +
+ TSQUERY || TSQUERY RETURNS TSQUERY +
+ OR-ed TSQUERY +
+ !! TSQUERY RETURNS TSQUERY +
+ negation of TSQUERY +
+
+TSQUERY < TSQUERY
+TSQUERY <= TSQUERY
+TSQUERY = TSQUERY
+TSQUERY >= TSQUERY
+TSQUERY > TSQUERY +
+All btree operations are defined for the tsquery type. tsqueries are compared +with each other using lexicographical order.
+ + +

Query rewriting

+Query rewriting is a set of functions and operators for the tsquery type. +It allows you to control search at query time without reindexing (in contrast to the thesaurus dictionary), for example, +to expand a search using synonyms (new york, big apple, nyc, gotham).

+The rewrite() function changes the original query by replacing target with sample. +There are three ways to use the rewrite() function. Notice that the arguments of the rewrite() +function can be column names of type tsquery.

+create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
+insert into rw values('a & b','a', 'c');
+
+
+
rewrite (query TSQUERY, target TSQUERY, sample TSQUERY) RETURNS TSQUERY +
+
+
+=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
+  rewrite
+  -----------
+   'c' & 'b'
+
+
+
rewrite (ARRAY[query TSQUERY, target TSQUERY, sample TSQUERY]) RETURNS TSQUERY +
+
+
+=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
+  rewrite
+  -----------
+   'c' & 'b'
+
+
+
rewrite (query TSQUERY,'select target ,sample from test'::text) RETURNS TSQUERY +
+
+
+=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
+  rewrite
+  -----------
+   'c' & 'b'
+
+
+
+Two operators are defined for the tsquery type:
+
TSQUERY @ TSQUERY
+
+ Returns TRUE if the right argument might be contained in the left argument.
+
TSQUERY ~ TSQUERY
+
+ Returns TRUE if the left argument might be contained in the right argument.
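+For example (a sketch; per the descriptions above, 'a' is contained in 'a & b'):
+
+=# select 'a & b'::tsquery @ 'a'::tsquery, 'a'::tsquery ~ 'a & b'::tsquery;
+ ?column? | ?column?
+----------+----------
+ t        | t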
+
+To speed up these operators one can use GiST index with gist_tp_tsquery_ops opclass. +
+create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
+
+ +

Full Text Search operator

+ +
+TSQUERY @@ TSVECTOR
+TSVECTOR @@ TSQUERY +
+
+Returns TRUE if the TSQUERY is contained in the TSVECTOR and +FALSE otherwise.
+=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+ ----------
+  t
+=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+ ----------
+  f
+
+
+

Configurations

@@ -147,7 +345,7 @@ A configuration specifies all of the equipment necessary to transform a document into a tsvector: the parser that breaks its text into tokens, and the dictionaries which then transform each token into a lexeme. -Every call to to_tsvector() (described above) +Every call to to_tsvector(), to_tsquery() (described above) uses a configuration to perform its processing. Three configurations come with tsearch2: @@ -157,7 +355,10 @@ Three configurations come with tsearch2: and the simple dictionary for all others.
  • default_russian -- Indexes words and numbers, using the en_stem English Snowball stemmer for Latin-alphabet words - and the ru_stem Russian Snowball dictionary for all others. + and the ru_stem Russian Snowball dictionary for all others. It's default + for ru_RU.KOI8-R locale. +
  • utf8_russian -- the same as default_russian but +for ru_RU.UTF-8 locale.
  • simple -- Processes both words and numbers with the simple dictionary, which neither discards any stop words nor alters them. @@ -239,7 +440,8 @@ Here:
  • description - human readable name of tok_type
  • token - parser's token
  • dict_name - dictionary used for the token -
  • tsvector - final result
  • +
  • tsvector - final result
  • +

    Parsers

    @@ -300,20 +502,40 @@ the current parser is used when this argument is omitted.

    Dictionaries

-Dictionaries take textual tokens as input, -usually those produced by a parser, -and return lexemes which are usually some reduced form of the token. +A dictionary is a program which accepts lexeme(s), usually those produced by a parser, +as input and returns:
      +
    • array of lexeme(s) if input lexeme is known to the dictionary +
• void array - the dictionary knows the lexeme, but it is a stop word. +
• NULL - the dictionary doesn't recognize the input lexeme +
+Usually, dictionaries are used for normalization of words (ispell, stemmer dictionaries),
+but see, for example, the intdict dictionary (available from
+the Tsearch2 home page),
+which controls indexing of integers.
+
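+For a quick look at what an individual dictionary returns, tsearch2's lexize() function can be used (a sketch; the result depends on the dictionaries installed):
+
+=# select lexize('en_stem', 'stars');
+ lexize
+--------
+ {star}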

    Among the dictionaries which come installed with tsearch2 are:

    • simple simply folds uppercase letters to lowercase before returning the word. -
    • en_stem runs an English Snowball stemmer on each word +
    • +
    • ispell_template - template for ispell dictionaries. +
    • +
    • en_stem runs an English Snowball stemmer on each word that attempts to reduce the various forms of a verb or noun to a single recognizable form. -
    • ru_stem runs a Russian Snowball stemmer on each word. -
    - +
  • ru_stem_koi8, ru_stem_utf8 runs a Russian Snowball stemmer on each word. +
  • +
  • synonym - simple lexeme-to-lexeme replacement +
  • +
  • thesaurus_template - template for thesaurus dictionary. It's +phrase-to-phrase replacement +
  • + + +

    Each dictionary is defined by an entry in the pg_ts_dict table:

    CREATE TABLE pg_ts_dict (
    @@ -332,6 +554,12 @@ it specifies a file from which stop words should be read.
     The dict_comment is a human-readable description of the dictionary.
     The other fields are internal function identifiers
     useful only to developers trying to implement their own dictionaries.
    +
    +
+WARNING: Data files used by dictionaries should be in the server_encoding to +avoid possible problems!
    +

    The argument named dictionary in each of the following functions @@ -355,6 +583,27 @@ if omitted then the current dictionary is used. from which an inflected form could arise. +

    Using dictionaries template

+Templates are used to define new dictionaries, for example:
    +INSERT INTO pg_ts_dict
    +               (SELECT 'en_ispell', dict_init,
    +                       'DictFile="/usr/local/share/dicts/ispell/english.dict",'
    +                       'AffFile="/usr/local/share/dicts/ispell/english.aff",'
    +                       'StopFile="/usr/local/share/dicts/english.stop"',
    +                       dict_lexize
    +               FROM pg_ts_dict
    +               WHERE dict_name = 'ispell_template');
    +
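+Once created, the new dictionary still has to be assigned to lexeme types before it is used; a sketch following the pg_ts_cfgmap example shown later for the thesaurus dictionary (adjust ts_name and tok_alias to your configuration):
+
+update pg_ts_cfgmap set dict_name='{en_ispell,en_stem}'
+ where ts_name = 'default' and tok_alias in ('lhword', 'lword', 'lpart_hword');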
    + +

    Working with stop words

    +Ispell and snowball stemmers treat stop words differently: +
      +
• ispell - normalizes the word and then looks up the normalized form in the stop-word file +
• snowball stemmer - first it looks up the word in the stop-word file and then does its job. +The reason is to minimize possible 'noise'. +
    +

    Ranking

    Ranking attempts to measure how relevant documents are to particular queries @@ -364,26 +613,18 @@ Note that this information is only available in unstripped vectors -- ranking functions will only return a useful result for a tsvector which still has position information!

    -Both of these ranking functions -take an integer normalization option -that specifies whether a document's length should impact its rank. -This is often desirable, -since a hundred-word document with five instances of a search word -is probably more relevant than a thousand-word document with five instances. -The option can have the values: - -

      -
    • 0 (the default) ignores document length. -
    • 1 divides the rank by the logarithm of the length. -
    • 2 divides the rank by the length itself. -
+Notice that the ranking functions supplied are just examples and +don't belong to the tsearch2 core; you can +write your very own ranking function and/or combine additional +factors to fit your specific interest. +

    The two ranking functions currently available are:
    CREATE FUNCTION rank(
    [ weights float4[], ] - vector tsvector, query tsquery, + vector TSVECTOR, query TSQUERY, [ normalization int4 ]
    ) RETURNS float4
    @@ -399,8 +640,8 @@ The two ranking functions currently available are: and make them more or less important than words in the document body.
    CREATE FUNCTION rank_cd(
    - [ K int4, ] - vector tsvector, query tsquery, + [ weights float4[], ] + vector TSVECTOR, query TSQUERY, [ normalization int4 ]
    ) RETURNS float4
    @@ -409,20 +650,51 @@ The two ranking functions currently available are: as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the 1999 Information Processing and Management. - The value K is one of the values from their formula, - and defaults to K=4. - The examples in their paper K=16; - we can roughly describe the term - as stating how far apart two search terms can fall - before the formula begins penalizing them for lack of proximity. -
    + +
    + CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text +
    +
+ Returns extents, which are the shortest non-nested sequences of words satisfying the query. + Extents (covers) are used in the rank_cd algorithm for fast calculation of proximity ranking. + In the example below there are two extents - {1 ...}1 and {2 ...}2. +
    +=# select get_covers('1:1,2,10 2:4'::tsvector,'1& 2');
    +get_covers
    +----------------------
    +1 {1 1 {2 2 }1 1 }2
    +
    +
    + + + +

+Both of these (rank(), rank_cd()) ranking functions
+take an integer normalization option
+that specifies whether a document's length should impact its rank.
+This is often desirable,
+since a hundred-word document with five instances of a search word
+is probably more relevant than a thousand-word document with five instances.
+The option can take the following values, which can be combined using "|" (e.g. 2|4) to
+take several factors into account:
+

    +
      +
    • 0 (the default) ignores document length.
    • +
• 1 divides the rank by 1 + the logarithm of the length
    • +
    • 2 divides the rank by the length itself.
    • +
    • 4 divides the rank by the mean harmonic distance between extents
    • +
    • 8 divides the rank by the number of unique words in document
    • +
    • 16 divides the rank by 1 + logarithm of the number of unique words in document +
    • +
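+A usage sketch (hypothetical: it assumes the apod table with the tsvector column fts used in the Indexes section below, plus a title column):
+
+=# select title, rank_cd(fts, q, 2|4) as rank
+   from apod, to_tsquery('neutrino & star') q
+   where q @@ fts
+   order by rank desc limit 5;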

    Headlines

    CREATE FUNCTION headline(
    [ id int4, | ts_name text, ] - document text, query tsquery, + document text, query TSQUERY, [ options text ]
    ) RETURNS text
    @@ -448,10 +720,123 @@ The two ranking functions currently available are: with a word which has this many characters or less. The default value of 3 should eliminate most English conjunctions and articles. +
• HighlightAll -- + boolean flag; if TRUE, then the whole document will be highlighted.
  • Any unspecified options receive these defaults: -
    StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3
    + 
    StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
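+For example (a sketch using the default options; the exact trimming of the output may differ):
+
+=# select headline('default', 'a fat cat sat on a mat and ate a fat rat', to_tsquery('fat & rat'));
+                           headline
+---------------------------------------------------------------
+ a <b>fat</b> cat sat on a mat and ate a <b>fat</b> <b>rat</b>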
      
    + +

    Indexes

+Tsearch2 supports indexed access to tsvector columns in order to further speed up FTS. Notice that indexes are not mandatory for FTS!
      +
    • RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree) +
          
      +    =# create index fts_idx on apod using gist(fts);
      +
      +
    • GIN - Generalized Inverted Index +
             
      +        =# create index fts_idx on apod using gin(fts);
      +
      +
+A GiST index is very good for online updates, but is not as scalable as a GIN index, +which, in turn, isn't good for updates. Both indexes support concurrency and recovery. +

    Thesaurus dictionary

    + +

+A thesaurus is a collection of words with information about the relationships of words and phrases, +i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.

    +

Basically, the thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, +preserves them for indexing. The thesaurus is used during indexing, so any change in the thesaurus requires reindexing. +Tsearch2's thesaurus dictionary (TZ) is an extension of the synonym dictionary +with phrase support. A thesaurus is a plain file of the following format:

    +# this is a comment 
    +sample word(s) : indexed word(s)
    +...............................
    +
    +
      +
• The colon (:) symbol is used as a delimiter.
    • +
• Use an asterisk (*) at the beginning of an indexed word to skip the subdictionary. +It is still required that the sample words are known.
    • +
• the thesaurus dictionary looks for the longest match
    +

+TZ uses a subdictionary (which should be defined in the tsearch2 configuration) +to normalize the thesaurus text. Only one subdictionary can be defined. +Notice that the subdictionary produces an error if it cannot recognize a word. +In that case, you should remove the definition line with this word or teach the subdictionary to know it.

    +

Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e., +only their position is important. +To break possible ties the thesaurus applies the last definition. For example, consider +thesaurus (with a simple subdictionary) rules with the pattern 'swsw' +('s' designates a stop-word and 'w' a known word):

    +
    +a one the two : swsw
    +the one a two : swsw2
    +
    +

Words 'a' and 'the' are stop-words defined in the configuration of the subdictionary. +The thesaurus considers the texts 'the one the two' and 'that one then two' as equal and will use the definition +'swsw2'.

    +

As a normal dictionary, it should be assigned to specific lexeme types. +Since TZ has the capability to recognize phrases, it must remember its state and interact with the parser. +TZ uses these assignments to check if it should handle the next word or stop accumulation. +The compiler of a TZ should take care of proper configuration to avoid confusion. +For example, if TZ is assigned to handle only the lword lexeme, then a TZ definition like +'one 1:11' will not work, since the lexeme type digit isn't assigned to the TZ.

    + +

    Configuration

    + +
    tsearch2

tsearch2 comes with a thesaurus template, which can be used to define a new dictionary:

    +
    INSERT INTO pg_ts_dict
    +               (SELECT 'tz_simple', dict_init,
    +                        'DictFile="/path/to/tz_simple.txt",'
    +                        'Dictionary="en_stem"',
    +                       dict_lexize
    +                FROM pg_ts_dict
    +                WHERE dict_name = 'thesaurus_template');
    +
    +
    +

    Here:

    +
      +
    • tz_simple - is the dictionary name
    • +
    • DictFile="/path/to/tz_simple.txt" - is the location of thesaurus file
    • +
• Dictionary="en_stem" defines the dictionary (Snowball English stemmer) to use for thesaurus normalization. Notice that the en_stem dictionary has its own configuration (stop-words, for example).
    • +
    +

    Now, it's possible to use tz_simple in pg_ts_cfgmap, for example:

    +
    +update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and 
    +tok_alias in ('lhword', 'lword', 'lpart_hword');
    +
    +

    Examples

    +

    tz_simple:

    +
    +one : 1
    +two : 2
    +one two : 12
    +the one : 1
    +one 1 : 11
    +
    +

To see how the thesaurus works, one can use the to_tsvector, to_tsquery or plainto_tsquery functions:

    =# select plainto_tsquery('default_russian',' one day is oneday');
    +    plainto_tsquery
    +------------------------
    + '1' & 'day' & 'oneday'
    +
    +=# select plainto_tsquery('default_russian','one two day is oneday');
    +     plainto_tsquery
    +-------------------------
    + '12' & 'day' & 'oneday'
    +
    +=# select plainto_tsquery('default_russian','the one');
    +NOTICE:  Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
    + plainto_tsquery
    +-----------------
    + '1'
    +
    + +Additional information about thesaurus dictionary is available from +Wiki page. -- 2.11.0