21 02 2013
SharePoint 2010 Server Search Only Crawling Top Level Site
This one cost me three evenings and turned out to be a silly little thing – I hope it can help someone else.
One web application has a single site collection. Only pages in the top level site were being indexed by the crawler – and these happened to be the pages linked from the homepage. Another web application, with a site collection and a full site hierarchy, could be crawled completely. When I added a static link to a sub site, that sub site’s homepage (and the pages it linked to) was crawled OK, but no other sub sites were. No SharePoint list items or documents in document libraries were being indexed in any site, not even the top level site.
This was a UAT environment; our live environment, which appeared to be configured identically, did not have this problem – the entire site collection on live could be crawled without issue.
Here’s what I checked:
- Web App URL was set in the Content Source
- Content Source was set as SharePoint Sites
- Content Source was set to crawl entire web application
- Crawler Account had Full Read on User Policy for the web app
- Same behaviour when setting the Crawler Account to a site collection admin account
- Same behaviour when crawling any AAM URL for the web app
- Was able to log in as any set crawler account and browse the full site
- There were no errors in the Crawl Log, only the 6 successes
- There were no errors reported in ULS, even when bumped to Verbose
- When using Fiddler as a reverse proxy, no errors were reported as the crawler happily crawled the 6 pages it could see
- Deleting and re-creating the Search Service Application from scratch had no effect
- Removing all Scopes and Crawl Rules had no effect
What was the problem?
A missing ‘MicrosoftSharePointTeamServices’ header in the web application configuration in IIS, as I eventually found from a hint in this MSDN forum post (they are useful sometimes!):
It turns out the SharePoint crawler requires this header to confirm that the crawled site really is a SharePoint site, so that it can use the standard SharePoint APIs to discover the site’s content. If it can’t see the header, the crawler treats the site as a plain static HTML web site and simply follows links to discover content. That is why only the handful of pages linked from the homepage appeared in the crawl log.
I was able to prove this by comparing the response headers returned by the UAT and live environments – when I noticed the MicrosoftSharePointTeamServices response header was missing from UAT, I knew where to start looking.
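A minimal sketch of the check I did by hand – fetch a page from each environment and report whether the MicrosoftSharePointTeamServices response header is present. The URLs in the usage example are placeholders for your own UAT and live web applications:

```python
# Check whether a site returns the MicrosoftSharePointTeamServices
# response header, which the SharePoint crawler uses to recognise a
# SharePoint site (versus a plain static HTML site).
import urllib.request


def sp_header(url):
    """Return the MicrosoftSharePointTeamServices header value, or None."""
    with urllib.request.urlopen(url) as resp:
        return resp.headers.get("MicrosoftSharePointTeamServices")


# Usage (placeholder URLs - substitute your own environments):
# print("UAT :", sp_header("http://uat.example.com/"))
# print("LIVE:", sp_header("http://live.example.com/"))
```

If the value comes back as `None` for one environment but not the other, you have the same mismatch I did.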
If you come across this issue, beware that there are several places in an ASP.NET web site where the HTTP response headers can be controlled or altered – in a custom control (ASCX or a server control class), in an HTTP module or, as was the case here, in the application configuration (done in IIS, but it can also be done in the web.config). Also ensure, when you manually recreate this header, that you set the version number to the correct version of your SharePoint farm. Subsequent updates to your environment should keep this value up to date. (Actually, I’m unsure whether the PSConfig wizard will update this header if it has been removed from a web app…)
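For reference, this is roughly where the header lives when it is set via the application configuration – a sketch of the relevant web.config section (the version value below is only an example of a SharePoint 2010 build number; use the one that matches your farm):

```xml
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <!-- Example build number only - match this to your farm's version -->
        <add name="MicrosoftSharePointTeamServices" value="14.0.0.4762" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>
```

The same header can also be added through the IIS Manager UI under HTTP Response Headers, which writes the equivalent configuration for you.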
Why was this removed?
Security, I guess – someone else did it, and they have since left the company.
But do beware: if you choose to follow this practice, ensure you keep a web application – even one that is only internally accessible – that still sends the header, so that search works properly!