Tricks using Full Text Search with Chinese content

February 10, 2010

SQL Full Text Search offers a fuzz logic search. To get desired search result with Chinese content, the following steps have to be followed (among others):

  • Set language for columns (and you can’t mix different language settings for a table): If the default setting (English) if used (this is a English OS/SQL), SQL server wont’ be able to correctly decide the word boundaries. And the word boundaries for English language will be used – white space and some form of punctuation. So if you search on 程序员, you will only get back entries with space before 程序员, i.e.  ” 程序员”, “.程序员” etc.
  • Use CONTAINS, not FREETEXT. Freetext is the easiest, since you don’t have to replace the spaces with AND/OR keyword. That means you can just pass the keyword over to SQL. However, if you do FREETEXT search on 程序员, you will get back everything that has any word of 程序员. It’s like you search on CONTAINS ( table, *, ‘程 or 序 or 员’.
  • Convert the content to either simplified or traditional font: the search keyword on simplified Chinese font won’t get back traditional font entries.  One workaround is that you have separate DB columns to store a converted version (either S or T).  Build full text search catalog on that column. Obviously you have to convert the search keyword to the font of your choice.
  • Stripe the formatting characters: this isn’t actually limited to Chinese search but applicable for all. If the content contains HTML format characters, the search content might be separated by html tags such as <font>, <b> etc.  Striping all HTML tags and store the text-only version to a separate search column is a good idea.

P.S. simple Full Text Search syntax reference:

  • SELECT * FROM job where FREETEXT(title, ‘美容師”)
  • SELECT jobID, title, cc.rank
    FROM job j INNER JOIN FREETEXTTABLE( job, *, ‘美容師 數名’) as cc
    ON cc.[KEY]=j.JobID ORDER BY cc.RANK
  • SELECT jobID, title, cc.rank
    FROM job j INNER JOIN CONTAINSTABLE( job, (title, description), ‘美容師 AND 數名’) as cc
    ON cc.[KEY]=j.JobID

Redirecting from osssearchresults.aspx

February 11, 2008

In SharePoint MOSS 2007, when you turn on the custom scope in site collection’s “Search Settings”, most of search results will be displayed at /SearchCenter/Pages/results.aspx. The exceptions are at the contextual search (This site, This List: Documents etc) and it always displays the search result in OSSearchResults.aspx page, which you can not customize through web parts (as you can do with /SearchCenter/Pages/restuls.aspx). This trick seems works very well (original post here):

Open OSSearchResult.aspx in layout folder, and add this block:

<script language = “javascript”>
function getURLParam(strParamName){
var strReturn = “”;
var strHref = window.location.href;
if ( strHref.indexOf(“?”) > -1 ){
var strQueryString = strHref.substr(strHref.indexOf(“?”)).toLowerCase();
var aQueryString = strQueryString.split(“&”);
for ( var iParam = 0; iParam < aQueryString.length; iParam++ ){
if (
aQueryString[iParam].indexOf(strParamName.toLowerCase() + “=”) > -1 ){
var aParam = aQueryString[iParam].split(“=”);
strReturn = aParam[1];
break;
}
}
}
return unescape(strReturn);
}
var urlstring = ‘http://your site url/custom-search.aspx?k=’ + getURLParam(‘k’) + ‘&cs=’ + getURLParam(‘cs’) + ‘&u=’ + getURLParam(‘u’)

location.replace(urlstring);
</script>

I added right after the stylesheet block and I guess you can remove some un-used code block since this page will not be used at all.


Load testing on indexing BDC data

December 12, 2007

The BDC source here are in a MS SQL database. In most of the test the SharePoint indexing server, SharePoint SQL and the BDC data source SQL are in same server (a very powerful one).

Initially the speed was very slow (7.5K per hour) and seems it didn”t downgrade as the crawled records accumulated all the way to 1+ million. Later it turned out to be that the source DB was not properly indexed. So the source DB was the bottleneck (CPU was constantly 90%+).

After the index was added to source DB, the speed became 160K per hour. But as the crawled records went up, the speed was slowed down to 40K per hour (with 2 million records).

bdc-load-test.gif

The space it uses seems to be about 5 times of the original SQL database.

Small note: it will take about 18 minutes to get ID list of the every 1 million records.


Findings on SharePoint Search – BDC – 2

November 8, 2007

In search result page, when you click on the “View Profile” page, it does call the database to retrieve the data. So Authentication mode is important here. If you specify “PassThrough”, make sure the “Default content access account” (seem in Search Settings page) has access to that DB. Or, change to “RevertToSelf” and the identify of the application pool will be used – please note this is the SSP application pool, not the central admin application pool. However, for unknown reason, The “Default content access account” still need access to the database, otherwise the crawling will fail even the “View Profile” URL can retrieve the data from database. Also, don’t forget to give “Default content access account” (the account used to crawl) appropriate rights to the BDC application and entity.

Please note that there might be a delay of a few minutes after the permission is changed on BDC application and entity. If the crawling still has “Access denied b BDC” error, wait for a few minutes.

You can delete and re-import the BDC application definition at the anytime – no need to re-crawl as long as the BDC name is not changed.

Incremental Crawl – it’s said that if you add a LastModifiedDate column to the record, SharePoint indexing service will use that as a time stamp and incremental crawling is enabled. However in my various tests, it took virtually same time in full and incremental crawling. Either this doesn’t work, or in my test case, most of the time was spent on retrieving the data instead of indexing them.

Some good articles to read:


Findings on SharePoint Search – BDC – 1

November 7, 2007

Steps to add BDC to Sharepoint search:

  1. Define BDC XML file (you can try this GUI editor but it doesn’t help that much. You’d better still have to read the XML file through and understand it.)
  2. Import BDC Application file.
  3. Create a new Content Type and choose the BDC as the source
  4. Start Full Crawl
  5. Create a Scope to use this Content Type
  6. Configure (if needed) the web site to use this scope (by clicking on the Scope display group name)

After a content source is deleted, the index items of the content source will NOT be deleted immediately. Instead, the SharePoint “Gather” will delete them one by one at the background, and a warning message will be generated for each deleted items! It seems no full crawl can be done before this process is finished (at least no BDC crawling). It will take more than a day to remove 1.5 M records – aboutn 600 records a minute. Slooooooow! If one just want to erase all indexed content, one can click on “Reset all crawled content” link on search setting page. That is very fast.

It seems no way to purge the humongous crawl log. MS site says it will deleted after 5 days – to be confirmed.

In Search page (search web part), if choose to show Scope drop down, the search will be limited to the selected scope. However if scope drop down is not showed, it seems always search on a hard-coded “All Sites” scope! (Yes I tried to delete that scope and I got error message saying ‘Scope does not exist’). Even worse, search web service (search.asmx) can not even specify scopes.

Update: got solution for UI search page (original post):

“You need to add a new search results page to your search center (or edit an existing results page). Set the Scope property in the “Search core results” web part (in the “Miscellaneous” section) on the search results page to your dedicated scope.”

However this doesn’t exactly solve the problem. When the URL contains s=[scope name], it still shows the result from other scope (this is reasonable). And it doesn’t help on web service call.

Update: solution for web service search.asmx (link here):

SELECT URL, Title, Description FROM SCOPE() WHERE “scope”=’All Sites’ AND FREETEXT(‘gallery hinges’) AND SITE = “http://supportdesk” AND NOT CONTAINS(‘brass’)


Follow

Get every new post delivered to your Inbox.