Tricks using Full Text Search with Chinese content

SQL Full Text Search offers a fuzz logic search. To get desired search result with Chinese content, the following steps have to be followed (among others):

  • Set language for columns (and you can’t mix different language settings for a table): If the default setting (English) if used (this is a English OS/SQL), SQL server wont’ be able to correctly decide the word boundaries. And the word boundaries for English language will be used – white space and some form of punctuation. So if you search on 程序员, you will only get back entries with space before 程序员, i.e.  ” 程序员”, “.程序员” etc.
  • Use CONTAINS, not FREETEXT. Freetext is the easiest, since you don’t have to replace the spaces with AND/OR keyword. That means you can just pass the keyword over to SQL. However, if you do FREETEXT search on 程序员, you will get back everything that has any word of 程序员. It’s like you search on CONTAINS ( table, *, ‘程 or 序 or 员’.
  • Convert the content to either simplified or traditional font: the search keyword on simplified Chinese font won’t get back traditional font entries.  One workaround is that you have separate DB columns to store a converted version (either S or T).  Build full text search catalog on that column. Obviously you have to convert the search keyword to the font of your choice.
  • Stripe the formatting characters: this isn’t actually limited to Chinese search but applicable for all. If the content contains HTML format characters, the search content might be separated by html tags such as <font>, <b> etc.  Striping all HTML tags and store the text-only version to a separate search column is a good idea.

P.S. simple Full Text Search syntax reference:

  • SELECT * FROM job where FREETEXT(title, ‘美容師”)
  • SELECT jobID, title, cc.rank
    FROM job j INNER JOIN FREETEXTTABLE( job, *, ‘美容師 數名’) as cc
    ON cc.[KEY]=j.JobID ORDER BY cc.RANK
  • SELECT jobID, title, cc.rank
    FROM job j INNER JOIN CONTAINSTABLE( job, (title, description), ‘美容師 AND 數名’) as cc
    ON cc.[KEY]=j.JobID
Advertisements
This entry was posted in ASP.NET, SQL and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s