Add note about bigm_similarity function.

author MasaoFujii <masao.fujii@gmail.com>

Thu, 21 Nov 2013 13:01:54 +0000 (22:01 +0900)

committer MasaoFujii <masao.fujii@gmail.com>

Thu, 21 Nov 2013 13:01:54 +0000 (22:01 +0900)
diff --git a/html/pg_bigm-1-1.html b/html/pg_bigm-1-1.html

index 22118cc..dd83af2 100644 (file)
--- a/html/pg_bigm-1-1.html
+++ b/html/pg_bigm-1-1.html
@@ -261,6 +261,9 @@ $ su
   pg_trgm
  (2 rows)
  </pre>
+<p>
+For how the similarity is calculated and the caveats that apply, see the <a href="#bigm_similarity">bigm_similarity</a> function, which computes it.
+</p>
  
  <h2 id="functions">Functions</h2>
  <h3 id="likequery">likequery</h3>
@@ -280,7 +283,6 @@ $ su
  
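  <p>pg_bigm performs full text search through substring matching with the LIKE operator. The search string therefore has to be converted as described above before it is passed to the LIKE operator. This conversion normally has to be implemented in the client application, but using likequery saves you that effort.</p>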
  
-<p>Example</p>
  <pre>
  =# SELECT likequery('pg_bigmは検索性能を200%向上させました。');
                    likequery
@@ -310,8 +312,6 @@ $ su
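  <p>A 2-gram is a two-character string extracted by adding a blank character to the beginning and the end of a string and then moving through it one character at a time. For example, the 2-grams of the string "ABC" are the four strings "(blank)A", "AB", "BC", and "C(blank)".</p>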
  
  <pre>
-Example
-
  =#  SELECT show_bigm('PostgreSQLの全文検索');
                              show_bigm
  -----------------------------------------------------------------
@@ -334,8 +334,6 @@ $ su
  </p>
  
  <pre>
-Example
-
  =# SELECT bigm_similarity('PostgreSQLの全文検索', 'postgresの検索');
   bigm_similarity 
  -----------------
@@ -343,6 +341,53 @@ $ su
  (1 row)
  </pre>
  
+<p>
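+Note that the 2-grams used to calculate the similarity are built after a blank character has been added to the beginning and the end of the string.
+For example, although the string "ABC" contains the string "B", their similarity is 0 because they share no 2-grams, as listed below.
+On the other hand, the string "A" shares a 2-gram with "ABC", as listed below, so their similarity is greater than 0.
+This is basically the same behavior as pg_trgm's similarity function.
+
+<ul>
+<li>The 2-grams of the string "ABC" are "(blank)A", "AB", "BC", and "C(blank)".</li>
+<li>The 2-grams of the string "A" are "(blank)A" and "A(blank)".</li>
+<li>The 2-grams of the string "B" are "(blank)B" and "B(blank)".</li>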
+</ul>
+</p>
+
+<pre>
+=# SELECT bigm_similarity('ABC', 'A');
+ bigm_similarity 
+-----------------
+            0.25
+(1 row)
+
+=# SELECT bigm_similarity('ABC', 'B');
+ bigm_similarity 
+-----------------
+               0
+(1 row)
+</pre>
+
+<p>
+Note that bigm_similarity distinguishes between uppercase and lowercase letters.
+pg_trgm's similarity function, on the other hand, does not.
+For example, the similarity of "ABC" and "abc" is 1 with pg_trgm's similarity function but 0 with bigm_similarity.
+</p>
+
+<pre>
+=# SELECT similarity('ABC', 'abc');
+ similarity 
+------------
+          1
+(1 row)
+
+=# SELECT bigm_similarity('ABC', 'abc');
+ bigm_similarity 
+-----------------
+               0
+(1 row)
+</pre>
+
  <h3 id="pg_gin_pending_stats">pg_gin_pending_stats</h3>
  <p>pg_gin_pending_stats is a function that returns the number of pages and tuples in the pending list of the GIN index given as its first argument.</p>
  
@@ -354,7 +399,6 @@ $ su
  
  <p>For details of the GIN index pending list, see <a href="http://www.postgresql.jp/document/current/html/gin-implementation.html#GIN-FAST-UPDATE">GIN Fast Update Technique</a>.</p>
  
-<p>Example</p>
  <pre>
  =# SELECT * FROM pg_gin_pending_stats('pg_tools_idx');
   pages | tuples
@@ -367,8 +411,6 @@ $ su
  <h3 id="last_update">pg_bigm.last_update</h3>
  <p>pg_bigm.last_update is a parameter that reports the last updated date of the pg_bigm module. This parameter is read-only. Its value cannot be changed in postgresql.conf or with the SET command.</p>
  <pre>
-Example
-
  =# SHOW pg_bigm.last_update;
   pg_bigm.last_update
  ---------------------
diff --git a/html/pg_bigm_en-1-1.html b/html/pg_bigm_en-1-1.html

index 93beebe..1baeb69 100644 (file)
--- a/html/pg_bigm_en-1-1.html
+++ b/html/pg_bigm_en-1-1.html
@@ -260,6 +260,9 @@ $ su
   pg_trgm
  (2 rows)
  </pre>
+<p>
+See the <a href="#bigm_similarity">bigm_similarity</a> function for details of how the similarity is calculated.
+</p>
  
  <h2 id="functions">Functions</h2>
  <h3 id="likequery">likequery</h3>
@@ -279,7 +282,6 @@ $ su
  
  <p>In pg_bigm, full text search is performed by using LIKE pattern matching. Therefore, the search keyword needs to be converted into a pattern string that the LIKE operator can handle properly. Usually the client application is responsible for this conversion, but you can save the effort of implementing such conversion logic in the application by using the likequery function.</p>
  
-<p>Example</p>
  <pre>
  =# SELECT likequery('pg_bigm has improved the full text search performance by 200%');
                               likequery                             
@@ -308,7 +310,6 @@ $ su
  
  <p>A 2-gram that show_bigm returns is a group of two consecutive characters taken from a string to which a blank character has been added at the beginning and the end. For example, the 2-grams of the string "ABC" are "(blank)A", "AB", "BC", and "C(blank)".</p>
  
-<p>Example</p>
  <pre>
  =# SELECT show_bigm('full text search');
                              show_bigm                             
@@ -330,7 +331,6 @@ $ su
  This function measures the similarity of two strings by counting the number of 2-grams they share. The range of the similarity is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical).
  </p>
  
-<p>Example</p>
  <pre>
  =# SELECT bigm_similarity('full text search', 'text similarity search');
   bigm_similarity 
@@ -339,6 +339,52 @@ This function measures the similarity of two strings by counting the number of 2
  (1 row)
  </pre>
  
+<p>
+Note that each argument is treated as having one blank character prefixed and suffixed when the set of 2-grams used to calculate the similarity is determined.
+For example, although the string "ABC" contains the string "B", their similarity is 0 because they share no 2-grams, as listed below.
+On the other hand, the strings "ABC" and "A" share one 2-gram, "(blank)A", as listed below, so their similarity is greater than 0.
+This is basically the same behavior as pg_trgm's similarity function.
+
+<ul>
+<li>The 2-grams of the string "ABC" are "(blank)A" "AB" "BC" "C(blank)".</li>
+<li>The 2-grams of the string "A" are "(blank)A" "A(blank)".</li>
+<li>The 2-grams of the string "B" are "(blank)B" "B(blank)".</li>
+</ul>
+</p>
+
+<pre>
+=# SELECT bigm_similarity('ABC', 'A');
+ bigm_similarity 
+-----------------
+            0.25
+(1 row)
+
+=# SELECT bigm_similarity('ABC', 'B');
+ bigm_similarity 
+-----------------
+               0
+(1 row)
+</pre>
+
+<p>
+Note that bigm_similarity is case-sensitive, whereas pg_trgm's similarity function is not.
+For example, the similarity of the strings "ABC" and "abc" is 1 in pg_trgm's similarity function but 0 in bigm_similarity.
+</p>
+
+<pre>
+=# SELECT similarity('ABC', 'abc');
+ similarity 
+------------
+          1
+(1 row)
+
+=# SELECT bigm_similarity('ABC', 'abc');
+ bigm_similarity 
+-----------------
+               0
+(1 row)
+</pre>
+
  <h3 id="pg_gin_pending_stats">pg_gin_pending_stats</h3>
  <p>pg_gin_pending_stats is a function that returns the number of pages and tuples in the pending list of a GIN index.</p>
  
@@ -350,7 +396,6 @@ This function measures the similarity of two strings by counting the number of 2
  
  <p>Please see <a href="http://www.postgresql.org/docs/current/static/gin-implementation.html#GIN-FAST-UPDATE">GIN Fast Update Technique</a> for details of the pending list.</p>
  
-<p>Example</p>
  <pre>
  =# SELECT * FROM pg_gin_pending_stats('pg_tools_idx');
   pages | tuples
@@ -363,7 +408,6 @@ This function measures the similarity of two strings by counting the number of 2
  <h3 id="last_update">pg_bigm.last_update</h3>
  <p>pg_bigm.last_update is a parameter that reports the last updated date of the pg_bigm module. This parameter is read-only. You cannot change the value of this parameter at all.</p>
  
-Example
  <pre>
  =# SHOW pg_bigm.last_update;
   pg_bigm.last_update
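
The padding and case-sensitivity notes that this patch adds to the bigm_similarity section can be sketched in a few lines of Python. This is an illustrative sketch only, not pg_bigm's implementation. The helper names bigrams and bigram_similarity are hypothetical, 2-grams are treated as a set, and the score (shared 2-grams divided by the size of the larger set) is an assumption chosen because it reproduces the 0.25 shown in the patch for 'ABC' and 'A'. The patch itself does not state the formula.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of 2-grams of text, padded with one blank on each side."""
    padded = f" {text} "
    # Slide a two-character window over the padded string.
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Assumed score: shared 2-grams divided by the size of the larger set.

    No case folding is applied, so 'A' and 'a' count as different characters.
    """
    grams_a, grams_b = bigrams(a), bigrams(b)
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


if __name__ == "__main__":
    print(sorted(bigrams("ABC")))           # [' A', 'AB', 'BC', 'C ']
    print(bigram_similarity("ABC", "A"))    # 0.25: only ' A' is shared
    print(bigram_similarity("ABC", "B"))    # 0.0: ' B' and 'B ' match nothing
    print(bigram_similarity("ABC", "abc"))  # 0.0: comparison is case-sensitive
```

Lowercasing both inputs before the comparison gives 1.0 for "ABC" and "abc" under these assumptions, which mirrors the case-insensitive result the patch shows for pg_trgm's similarity function.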