January 31, 2005

Has Google Dropped Their 101K Cache Limit?

First Google got rid of their 10-word query limit (it's now 32 words; see www.researchbuzz.org/what_happened_to_googles_tenword_query_limit.shtml for details), and now it appears to me that they're indexing more of an HTML file. Google, is this a constant series of improvements and upgrades, or do you feel the hot breath of Yahoo / MSN on the back of your neck?

(Whichever it is, I wish they'd be a little more overt about discussing it. I noticed the change this morning. I wasn't anywhere where I could update ResearchBuzz, and I was worried about forgetting to check it later, so I wrote CACHE in large Sharpie marker on my hand and walked around with it all day. Because I am a nerd.)

Heretofore Google would only index the first 101K of a Web page, no matter how long it was. This was bad if you were searching for pages that tended to be really large (some resource roundup pages, LOTS of PDF documents) because you couldn't be sure that you were searching the entire document. Now it appears that they're indexing more pages in their entirety.

To find large documents, search for words beginning with the first four letters and the last four letters of the alphabet. Like this:

apple bassoon consequently deer wind xylophone yellow zip

(No proper nouns. And it works best if one or two of the words are a bit uncommon.) This works because of the number of lexicon/dictionary pages out there. When you run this search you'll get 33 results, and you'll see that the very first one has a 148K cache, as opposed to the usual 101K.

I went looking for the largest cache I could find and discovered that Google has indexed a page which apparently contains the entire 1911 Roget's Thesaurus, and has apparently cached the entire thing too: 1324k.

As I mentioned in September ( www.researchbuzz.org/how_much_page_does_yahoo_index.shtml ), Yahoo's indexing limit is ostensibly 150K for HTML pages and 500K for PDF pages.
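For anyone wondering why the first-four/last-four-letters search is a decent test, here is a rough Python sketch of the idea. It is a toy model, not a description of how Google's indexer actually works: the word list, the padding size, the function names, and the reading of "101K" as 101 * 1024 bytes are all assumptions made for illustration. It builds a fake alphabetical "dictionary page," indexes it once in full and once cut off at 101K, and checks whether the eight-word query matches.

```python
# Toy model only. Nothing here reflects Google's real indexing pipeline.
# Assumption: "101K" is treated as 101 * 1024 bytes.
LIMIT_101K = 101 * 1024

# Hypothetical dictionary page: one entry per letter, already in A-Z order.
WORDS = [
    "apple", "bassoon", "consequently", "deer", "eagle", "forest", "garden",
    "harbor", "island", "jungle", "kettle", "lantern", "meadow", "needle",
    "orchard", "pepper", "quarry", "river", "saddle", "timber", "umbrella",
    "valley", "wind", "xylophone", "yellow", "zip",
]

QUERY = "apple bassoon consequently deer wind xylophone yellow zip"


def build_page(words, filler_bytes_per_entry=6000):
    """Lay the words out alphabetically, padding each one like a long definition."""
    entries = [f"{word} {'.' * filler_bytes_per_entry}" for word in sorted(words)]
    return "\n".join(entries)


def indexed_terms(page, limit_bytes=None):
    """Return the words an indexer would see, optionally stopping at limit_bytes."""
    data = page.encode("utf-8")
    if limit_bytes is not None:
        data = data[:limit_bytes]  # everything past the cutoff is never indexed
    return set(data.decode("utf-8", errors="ignore").split())


def matches(query, terms):
    """A page matches only if every query word was indexed."""
    return all(word in terms for word in query.split())


page = build_page(WORDS)
full = indexed_terms(page)
truncated = indexed_terms(page, LIMIT_101K)

print("page size in KB:", len(page.encode("utf-8")) // 1024)
print("matches with full index:", matches(QUERY, full))
print("matches with 101K index:", matches(QUERY, truncated))
```

In this toy setup the full index matches the query and the truncated one does not, because the w-through-z words sit past the cutoff. That is the logic behind the search: a long alphabetical page can only match all eight words if the engine indexed well beyond its early entries.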
Despite that stated 150K limit, when I run the apple-to-zip search on Yahoo I get 11 results, at least one of which has a 200K+ cache.

With the ability to search for more words, the ability to search a larger percentage of a Web page (it looks like in many cases 100% of a Web page!) will come in really handy.

Update: I am getting feedback from readers that they can't replicate my search results, and are not seeing caches of over 101K. This is very strange. I can't replicate my results either. From 30-odd results, I am now getting only two, and they are both related to this page. This is doubly strange because some of those pages did not have over 101K caches. Did I hit a test server? I'm fairly sure I wasn't hallucinating.

Let me go see if I can dig up a cache -- okay, here we go. Try this direct link to a Google cache, which worked a few minutes ago when I tested it. I'll give you the huge URL and a simple link. This is the cache I found yesterday of Roget's Thesaurus from 1911.

a simple link to a cached Google page

Now, cutting and pasting that cache into a text editor shows that it's over 1.3 MB in size. You will no longer get this page in your search results if you search for apple bassoon consequently etc. However, if you search for

roget inurl:weblib.wl.dvgu.ru

you will get that page as your third result with a stated cache size of 101K. BUT WHEN YOU LOOK AT THE CACHE, I mean, when you cut it and paste it into a text editor, it's exactly the same size as the cache from the first link I gave you. Which means as far as I can tell -- and please, if I'm getting this wrong somebody help me understand -- Google's search result pages are not giving correct cache sizes.

Posted to Search Engines-Google