ResearchBuzz
Search Engine News and More Since 1998



September 30, 2004

How Much Page Does Yahoo Index?

One of the lesser-discussed facets of Web searching is the spidering limits of search engines. Even a full-text search engine may not index the entirety of a given page if the page is too large. In Google's case the limit is 101K for HTML pages: its spider indexes only the first 101K of an HTML Web page. (Search Google for aardvark apple zither zephyr filetype:html and look at the file sizes of the results.) For PDF files I don't know the limit. I can't see it directly, but if you look at http://tinyurl.com/4px8n you'll see that about two-thirds of the pages listed in the TOC are available in Google's HTML version. 300K limit? 500K?

I knew that Yahoo had a larger index limit, but I didn't know how large. I learned earlier this week that Yahoo's limit is the first 150K of an HTML Web page, while its PDF indexing limit is 500K.
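To make the idea of an indexing limit concrete, here is a minimal Python sketch of what "only the first N K of a page is indexed" would mean for a search term. The limits are the figures quoted above; the function names, the assumption that 1K means 1024 bytes, and the assumption that an engine cuts a page off at an exact byte boundary are mine, for illustration only, and not a description of how either spider actually works.

```python
# Limits as quoted in the post. Treating 1K as 1024 bytes is an assumption.
LIMITS_KB = {"google_html": 101, "yahoo_html": 150, "yahoo_pdf": 500}


def indexed_portion(page: bytes, limit_kb: int) -> bytes:
    """Return the part of the page a spider with this limit would see."""
    return page[: limit_kb * 1024]


def term_is_reachable(page: bytes, term: str, limit_kb: int) -> bool:
    """True if the term occurs entirely inside the first limit_kb of the page."""
    needle = term.lower().encode("utf-8")
    return needle in indexed_portion(page, limit_kb).lower()


# A hypothetical page: one word at the top, one word after about 120K of filler.
page = b"aardvark " + b"x" * (120 * 1024) + b" zephyr"

print(term_is_reachable(page, "aardvark", LIMITS_KB["google_html"]))  # near the top
print(term_is_reachable(page, "zephyr", LIMITS_KB["google_html"]))    # past 101K
print(term_is_reachable(page, "zephyr", LIMITS_KB["yahoo_html"]))     # inside 150K
```

Under these assumptions, a word that first appears about 120K into a page would be invisible to an engine with a 101K limit and visible to one with a 150K limit.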

... this is what I'm told, anyway. However, I'm finding something interesting. Search Yahoo for aardvark apple zither zephyr originurlextension:html and you'll find that file sizes are listed with the search results, and the file sizes listed are well over 150K -- I see page sizes of over 800K listed here! (originurlextension: is Yahoo's gawdawful syntax for filetype:; I'm told they'll be fixing it soon. Propburgers to Greg Notess of http://www.searchengineshowdown.com for educating me about it.) At least one of the pages listed, at 173K, appears from its cache to be fully indexed (the headers, footers, and copyright disclaimers are all in place -- it doesn't look "cut off"), and the cache, copied and pasted into a text editor, weighs in at well over 200K.
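The copy-and-paste check can also be done without a text editor. The sketch below measures how many K a chunk of pasted text occupies and compares that with a reported limit. The function names are hypothetical, and the choice of UTF-8 and of 1024 bytes per K are assumptions; a different encoding would give a somewhat different size.

```python
def size_in_k(text: str, encoding: str = "utf-8") -> float:
    """Size of the text in K once encoded (assumes 1K = 1024 bytes)."""
    return len(text.encode(encoding)) / 1024


def exceeds_limit(text: str, limit_kb: int) -> bool:
    """True if the text alone is already larger than the reported limit."""
    return size_in_k(text) > limit_kb


# Stand-in for text pasted from a search engine's cached copy of a page.
pasted = "word " * 45000

print(f"{size_in_k(pasted):.1f}K")
print(exceeds_limit(pasted, 150))
```

If the pasted text by itself is larger than the reported limit and still ends with the page's footer, that suggests the engine kept more of the page than the limit implies.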

The bottom line is that Yahoo indexes far more of an HTML page than Google does; if you're running searches that tend to focus on large pages (like word-list searches that might point you to dictionaries), try Yahoo first.

Posted in the following categories: Search Engines-Yahoo
