Add support for PDF indexing – and more about iFilters

December 21, 2007

Out of the box, SharePoint only indexes (crawls) DOC, XLS, PPT, TXT and HTM files. Here are the steps to install Acrobat PDF IFilter 6.0 on a SharePoint 2007 server so that PDF file content is indexed by Search and the correct icon is shown in the web UI.

First download Adobe PDF IFilter 6.0.

  1. Run the Adobe PDF IFilter 6.0 Setup program. Note: if SQL Server is on a different server, you need to install the iFilter on the SQL Server, not on the IIS server. (I also installed it on the SSP server just in case.)
  2. The following steps are on the web server.
  3. Copy the ICPDF.gif file to “\12\Template\Images”. (I found this icpdf.gif through a web search.)
  4. Edit the file \12\TEMPLATE\XML\DOCICON.XML to add an entry for the .pdf extension.
    1. <Mapping Key="pdf" Value="icpdf.gif"/>
  5. Do an iisreset or recycle the appropriate application pool.
  6. Add the .pdf file type to the index list:
    1. Go to SSP->Search Settings and, next to File Type, add a new file type: pdf
  7. Perform a Full Crawl
  8. My experience indicates that a server reboot is necessary.
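
The DOCICON.XML edit in step 4 can also be scripted. The following is only a minimal sketch, not an official tool: it assumes the file has a root element containing a `<ByExtension>` element with `<Mapping>` children, and it runs against a cut-down sample string rather than the real file. The function name `add_icon_mapping` is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def add_icon_mapping(docicon_xml: str, extension: str, icon_file: str) -> str:
    """Return the XML with a Mapping for `extension` under <ByExtension>."""
    root = ET.fromstring(docicon_xml)
    by_extension = root.find("ByExtension")
    if by_extension is None:
        # Assumption about the layout did not hold; do not guess further.
        raise ValueError("no <ByExtension> element found")
    # Update an existing entry instead of adding a duplicate.
    for mapping in by_extension.findall("Mapping"):
        if mapping.get("Key", "").lower() == extension.lower():
            mapping.set("Value", icon_file)
            break
    else:
        ET.SubElement(by_extension, "Mapping",
                      {"Key": extension, "Value": icon_file})
    return ET.tostring(root, encoding="unicode")


# Cut-down stand-in for DOCICON.XML, for illustration only.
SAMPLE = """<DocIcons>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif"/>
  </ByExtension>
</DocIcons>"""

updated = add_icon_mapping(SAMPLE, "pdf", "icpdf.gif")
print(updated)
```

To use it on a real server you would read and write the actual file instead of `SAMPLE`, and keep a backup copy first. The iisreset from step 5 is still needed afterwards.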

An RTF filter is available for SharePoint 2003, but the download link was removed from the Microsoft site. Hopefully Microsoft will release an updated version for MOSS 2007.

Some other useful links:

  • The Microsoft SharePoint team recently released more filters to support more file types (Office 2007 formats, Visio, ZIP, etc.): http://blogs.msdn.com/sharepoint/archive/2007/12/20/ms-filter-pack-released.aspx
  • An iFilter shop where you can buy more filters: http://www.ifiltershop.com/products.html

Reference:

  • http://support.microsoft.com/kb/555209
  • http://weblogs.asp.net/bugmandan/archive/2007/03/09/sharepoint-2007-acrobat-pdf-ifilter-6-0.aspx

Updated 7/17/2008:

Thanks Francois for this tip:

It is important to note that the regular iFilter doesn’t support the 64-bit version of SharePoint, so a special iFilter needs to be installed. The following link sheds more light on the 64-bit iFilter from Foxit: http://naveedullah.wordpress.com/2007/05/12/64-bit-pdf-ifilter-for-moss-now-available/


Cannot rename file if you don’t have delete permission

December 21, 2007

In SharePoint, you can define different “Permission Levels” (Site Actions->Manage All Site Settings->Advanced Permissions->Settings->Permission Levels). One of the permissions is Delete. I recently found out that in a document library, SharePoint requires the “delete” permission to rename a file, which makes sense (but our users will scream about this). I didn’t find confirmation in the help/documentation, but this blog confirms it.


10 Things To Optimize your SharePoint Server Indexing

December 21, 2007

Good post from Joel Oleson’s blog – I found it (and the links inside) very useful.

Quote below:

1) Put your Search db and transaction logs on separate disks, both on the fastest, most optimized disks with fast spindles optimized for writing (dedicated disks are essential)
2) Optimize your temp db: grow it, give it space, you can even split it into multiple dbs and ensure it is on the most optimized disks (dedicated disks are essential). Don’t forget to optimize the transaction logs of the temp db either!
3) Optimize the network between the servers you are indexing and the index server (GB NIC speed is preferred within a farm)
4) Consider topology changes to optimize network throughput and eliminate double hops (Index server crawling a separate front end (shared by user traffic) to pull changes). Adding the WFE role to your Index server and adding applicable host files is a great way to optimize the indexing and optimize your traffic at the same time.
5) Increase the RAM on your x64 SQL Servers (8 GB is really a good place to start, 16 GB or more is looking better and better.)
6) Defragment your databases, and applicable drives (if fragmented) and run relevant dbcc consistency checks – Refer to KB on SharePoint Safe DBCC commands
7) Increase the # of crawl threads (you’ll have to watch this, it is the easiest way to speed up your crawls, but watch the box it is “attacking” it can be heavy handed.)
8) Reduce the maximum index file size (optional)
9) Remove any unused, single threaded and poor performing ifilters
10) Reduce the number of full indexes, run incremental crawls on a schedule where they can complete, and remove non-essentials such as every-5-minute incremental jobs; these will simply cause unnecessary churn.

Bonus: Install the public update or the service pack when it comes out (includes a few SharePoint Indexing related fixes).

More on disk optimization on a post I did a while back… Also there’s a great paper that just got posted on storage and performance optimization. It is a MUST READ. Performance recommendations for storage planning and monitoring.

Getting crawled and you don’t want to? Here’s a recent KB on how to configure the robots.txt in your SharePoint deployment. There is some more info in this post from the field. It is very easy to have 50% of traffic as the crawl account. Optimizing the indexing by reducing even authentication traffic is a big deal. Use accounts that are in the same domain and where the DCs are fast and local if using NTLM. Kerberos might end up being slightly faster, but does add complexity.


Automatically check in files after uploading

December 20, 2007

In a typical MOSS 2007 document library, you can easily upload multiple files or folders with “Upload Multiple Files” or by dragging and dropping in Explorer View. However, sometimes the files are kept in “Checked out” mode and are not available to others until they are checked in. Unfortunately SharePoint doesn’t have a “bulk check-in” function. So how do you have the files checked in automatically?

The key is the setting “Require documents to be checked out before they can be edited”, which has to be “No”. Change it through “Document Library Settings”->“Versioning Settings”. The option is at the bottom.

Another thing that can prevent files from being checked in automatically is required columns. Make sure you turn off all mandatory columns, or give them a default value.

Other settings in “Versioning Settings” (such as Content Approval, version numbers etc) do not matter.
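
The two conditions above can be written down as a small checklist. This is only a model of the rules described in this post, not SharePoint API code: the `Column` and `LibrarySettings` classes and the `checkin_blockers` function are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    # Hypothetical stand-in for a library column.
    name: str
    required: bool = False
    default: Optional[str] = None


@dataclass
class LibrarySettings:
    # Hypothetical stand-in for the relevant library settings.
    require_checkout: bool
    columns: List[Column] = field(default_factory=list)


def checkin_blockers(settings: LibrarySettings) -> List[str]:
    """List the reasons uploaded files would stay checked out."""
    reasons = []
    if settings.require_checkout:
        reasons.append("'Require documents to be checked out' is set to Yes")
    for column in settings.columns:
        if column.required and column.default is None:
            reasons.append(f"required column '{column.name}' has no default value")
    return reasons


library = LibrarySettings(
    require_checkout=False,
    columns=[Column("Title"), Column("Department", required=True)],
)
for reason in checkin_blockers(library):
    print("Blocker:", reason)
```

An empty list means neither of the two conditions from this post applies, so uploads should end up checked in.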


Error event ID 6398 and 6482 – about security rights of OSearch service

December 18, 2007

I have a huge number of the following errors in the “Application” folder of the event log:

Event ID 6398
The Execute method of job definition Microsoft.Office.Server.Search.Administration.IndexingScheduleJobDefinition (ID b94e8106-b5f9-4c2d-ad98-2871bcc4c669) threw an exception. More information is included below.
Retrieving the COM class factory for component with CLSID {3D42CCB1-4665-4620-92A3-478F47389230} failed due to the following error: 80070005.

And this:

Event ID 6482
Application Server Administration job failed for service instance Microsoft.Office.Server.Search.Administration.SearchServiceInstance (7d8b475a-6dda-47e8-8ab7-dbd171926b39).
Reason: Retrieving the COM class factory for component with CLSID {3D42CCB1-4665-4620-92A3-478F47389230} failed due to the following error: 8007000e.

In the “System” folder there are actually 2 events that reveal more information about this. It turned out that I needed to grant Local Activation rights to the account used by “Office SharePoint Server Search”.

Open Component Services->DCOM Config->OSearch->Properties->Security. I added Network Service (which may not be needed) and the account that runs the “Office SharePoint Server Search” service, and gave them “Local Activation” rights (in the “Launch and Activation Permissions” group). Those error messages do not appear anymore.

This isn’t limited to this service or to SharePoint. You can search the registry for the CLSID, and it might turn out to be a different DCOM object. For example, the other day I got the same error with CLSID {61738644-F196-11D0-9953-00C04FD919C1}, and it turned out to be the IIS WAMREG admin Service.
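
The hex numbers at the end of the event messages (80070005 and 8007000e) are HRESULT values, and splitting them into their fields shows what kind of failure it is. The sketch below decodes them; `decode_hresult` and the small name table are written for this illustration and cover only the two codes seen in the events above.

```python
def decode_hresult(hex_code: str) -> dict:
    """Split a 32-bit HRESULT into severity, facility and code fields."""
    value = int(hex_code, 16)
    return {
        "failure": bool(value >> 31),       # top bit set means failure
        "facility": (value >> 16) & 0x7FF,  # 11-bit facility; 7 = Win32
        "code": value & 0xFFFF,             # low 16 bits: the error code
    }


# Only the two Win32 error codes that appear in the events above.
WIN32_NAMES = {5: "ERROR_ACCESS_DENIED", 14: "ERROR_OUTOFMEMORY"}

for hr in ("80070005", "8007000e"):
    fields = decode_hresult(hr)
    name = WIN32_NAMES.get(fields["code"], "unknown")
    print(hr, fields, name)
```

The first code is an access-denied error, which fits with the fix being a permissions change on the DCOM object.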


Target Audiences – doesn’t work with domain groups

December 17, 2007

I have a web part whose access I want to restrict (kind of) by specifying Target Audiences. I created a SharePoint security group (e.g. SP-Group1) that contains domain users AND a domain group (say AD-group1). The web part works for those who were added directly to SP-Group1 but not for those who are only in AD-group1. So nested groups don’t work in “Target Audiences”? (I’m pretty sure nested domain groups in SharePoint groups work in permission/login configurations.)

Then I tried to specify the domain group AD-group1 directly in the Target Audiences field, but it didn’t work. The field (and the “Select Audiences” dialog) cannot find the domain group at all. It looks like it doesn’t make a trip to the domain controller at all!

I also tried to create a new audience in the SSP. When I tried to add a rule saying this audience should contain all members of the domain group AD-group1, that interface couldn’t find AD-group1 either. Someone said you need to import the profiles from Active Directory first. This isn’t really nice.

I will try that later, since our domain has more than 10,000 users and I don’t want to import everyone to my site. For now I will just add everyone to the SharePoint group directly.


Load testing on indexing BDC data

December 12, 2007

The BDC source here is an MS SQL database. In most of the tests, the SharePoint indexing server, the SharePoint SQL Server and the BDC data source SQL Server were on the same server (a very powerful one).

Initially the speed was very slow (7.5K per hour), and it didn’t seem to degrade as the crawled records accumulated all the way to 1+ million. Later it turned out that the source DB was not properly indexed, so the source DB was the bottleneck (CPU was constantly at 90%+).

After the index was added to the source DB, the speed became 160K per hour. But as the number of crawled records went up, the speed slowed down to 40K per hour (at 2 million records).

[Figure: bdc-load-test.gif]

The space it uses seems to be about 5 times the size of the original SQL database.

Small note: it takes about 18 minutes to get the ID list for every 1 million records.
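
To put the rates above in perspective, here is a rough calculation of how long one million records would take at each measured speed. It is only a back-of-the-envelope sketch: it assumes “K per hour” means thousands of records per hour and treats each rate as constant, which the real crawl was not.

```python
def crawl_hours(records: int, records_per_hour: float) -> float:
    """Hours needed to crawl `records` at a constant rate."""
    return records / records_per_hour


# Rates taken from the measurements in this post.
RATES = {
    "source DB not indexed": 7_500,
    "source DB indexed, early in the crawl": 160_000,
    "source DB indexed, at 2 million records": 40_000,
}

for label, rate in RATES.items():
    hours = crawl_hours(1_000_000, rate)
    print(f"{label}: {hours:.2f} hours per million records")
```

Since the speed dropped as the crawl grew, a real estimate for a large source would have to use a lower rate for the later part of the crawl.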


Business Data Catalog Links

December 6, 2007

One authentication mode:

