
5. Results and Performance (the data in this section dates from 1996-1997 and is considerably out of date)


   
Query: bill clinton

http://www.whitehouse.gov/
100.00% (no date) (0K)
http://www.whitehouse.gov/
    Office of the President
        99.67% (Dec 23 1996) (2K)
        http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
    Welcome To The White House
        99.98% (Nov 09 1997) (5K)
        http://www.whitehouse.gov/WH/Welcome.html
    Send Electronic Mail to the President
        99.86% (Jul 14 1997) (5K)
        http://www.whitehouse.gov/WH/Mail/html/Mail_President.html

mailto:president@whitehouse.gov
99.98%
    mailto:President@whitehouse.gov
        99.27%

The "Unofficial" Bill Clinton
94.06% (Nov 11 1997) (14K)
http://zpub.com/un/un-bc.html
    Bill Clinton Meets The Shrinks
        86.27% (Jun 29 1997) (63K)
        http://zpub.com/un/un-bc9.html

President Bill Clinton - The Dark Side
97.27% (Nov 10 1997) (15K)
http://www.realchange.org/clinton.htm

$3 Bill Clinton
94.73% (no date) (4K)
http://www.gatewy.net/~tjohnson/clinton1.html

Figure 4. Sample Results from Google
The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server, which helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain, which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine that this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text.
All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu.
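The clustering by server visible in Figure 4 can be illustrated with a small sketch. This is not Google's implementation, which the text does not describe; it simply groups the URLs from Figure 4 by host name using Python's standard library. The helper names `server_of` and `cluster_by_server` are hypothetical, and treating the mail domain as the "server" for `mailto:` results is an assumption made for illustration.

```python
from collections import OrderedDict
from urllib.parse import urlsplit


def server_of(url: str) -> str:
    """Return a lowercase host name used as the clustering key."""
    parts = urlsplit(url)
    if parts.scheme == "mailto":
        # mailto: URLs have no network location, so fall back to the
        # domain after the "@" (an assumption made for this sketch).
        return parts.path.rpartition("@")[2].lower()
    return (parts.hostname or "").lower()


def cluster_by_server(urls):
    """Group URLs by host, preserving first-seen order of hosts and URLs."""
    clusters = OrderedDict()
    for url in urls:
        clusters.setdefault(server_of(url), []).append(url)
    return clusters


# Result URLs copied from Figure 4, in the order they appear.
FIGURE_4_URLS = [
    "http://www.whitehouse.gov/",
    "http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html",
    "http://www.whitehouse.gov/WH/Welcome.html",
    "http://www.whitehouse.gov/WH/Mail/html/Mail_President.html",
    "mailto:president@whitehouse.gov",
    "mailto:President@whitehouse.gov",
    "http://zpub.com/un/un-bc.html",
    "http://zpub.com/un/un-bc9.html",
    "http://www.realchange.org/clinton.htm",
    "http://www.gatewy.net/~tjohnson/clinton1.html",
]

clusters = cluster_by_server(FIGURE_4_URLS)
for host, members in clusters.items():
    print(f"{host}: {len(members)} result(s)")
```

Grouping on the exact host name keeps the two `mailto:` results in their own cluster, separate from `www.whitehouse.gov`, which matches the layout of Figure 4.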

5.1 Storage Requirements
Aside from search quality, Google is designed to scale cost-effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression, the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the Document Index, a high-quality web search engine may fit onto a 7 GB drive of a new PC.
   
Storage Statistics
Total Size of Fetched Pages                  147.8 GB
Compressed Repository                         53.5 GB
Short Inverted Index                           4.1 GB
Full Inverted Index                           37.2 GB
Lexicon                                        293 MB
Temporary Anchor Data (not in total)           6.6 GB
Document Index Incl. Variable Width Data       9.7 GB
Links Database                                 3.9 GB
Total Without Repository                      55.2 GB
Total With Repository                        108.7 GB

Web Page Statistics
Number of Web Pages Fetched                24 million
Number of URLs Seen                      76.5 million
Number of Email Addresses                 1.7 million
Number of 404's                           1.6 million

Table 1. Statistics
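
As a quick check of the arithmetic behind Table 1, the sketch below re-adds the components and computes the compression ratio quoted in the text. The variable names are ours, and converting the 293 MB lexicon to 0.293 GB is an assumption (a 1024-based conversion gives the same totals after rounding). The temporary anchor data is left out, as the table itself specifies.

```python
# Sizes in GB, copied from Table 1.
index_components_gb = {
    "short_inverted_index": 4.1,
    "full_inverted_index": 37.2,
    "lexicon": 0.293,  # 293 MB, assuming 1 GB = 1000 MB
    "document_index": 9.7,
    "links_database": 3.9,
}
fetched_pages_gb = 147.8
compressed_repository_gb = 53.5

# "Total Without Repository": every index structure except the page store.
total_without_repository = sum(index_components_gb.values())

# "Total With Repository": add the compressed copy of the fetched pages.
total_with_repository = total_without_repository + compressed_repository_gb

# Fraction of the raw fetched data that the compressed repository occupies.
compression_ratio = compressed_repository_gb / fetched_pages_gb

print(f"Total without repository: {total_without_repository:.1f} GB")
print(f"Total with repository:    {total_with_repository:.1f} GB")
print(f"Repository / fetched:     {compression_ratio:.1%}")
```

The ratio comes out at roughly 36%, consistent with the "just over one third" stated in the text.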
   
5.2 System Performance
It is important for a search engine to crawl and index efficiently, so that information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, and any number of other problems stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers, largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
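
The throughput figures above follow from simple arithmetic, sketched here for verification. The function and variable names are illustrative. The overall rate treats "roughly 9 days" as exactly 9 days, so it is only an approximation.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def pages_per_second(pages: float, hours: float) -> float:
    """Average download rate over a period given in hours."""
    return pages / (hours * SECONDS_PER_HOUR)


# Steady state: the last 11 million pages took 63 hours.
steady_rate = pages_per_second(11_000_000, 63)
steady_pages_per_day = steady_rate * SECONDS_PER_HOUR * HOURS_PER_DAY

# Whole crawl: 26 million pages (including errors) in roughly 9 days.
overall_rate = pages_per_second(26_000_000, 9 * HOURS_PER_DAY)

# Indexer rate as stated in the text.
indexer_rate = 54.0

print(f"Steady-state crawl: {steady_rate:.1f} pages/s, "
      f"{steady_pages_per_day / 1e6:.2f} million pages/day")
print(f"Overall crawl:      {overall_rate:.1f} pages/s")
print(f"Indexer headroom:   {indexer_rate / steady_rate:.2f}x the steady crawl rate")
```

The steady-state rate reproduces the 48.5 pages per second quoted in the text, and the indexer's 54 pages per second is about 11% faster, which fits the remark that the indexer ran just faster than the crawlers.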
5.3 Search Performance
Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in 1 to 10 seconds. This time is dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not yet have common optimizations such as query caching or subindices on common terms. We intend to speed up Google considerably through distribution and hardware, software, and algorithmic improvements. Our target is to be able to handle several hundred queries per second. Table 2 has some sample query times from the current version of Google. Each query is repeated to show the speedup resulting from cached IO.
                Initial Query                   Same Query Repeated (IO mostly cached)
Query           CPU Time (s)    Total Time (s)  CPU Time (s)    Total Time (s)
al gore         0.09            2.13            0.06            0.06
vice president  1.77            3.84            1.66            1.80
hard disks      0.25            4.86            0.20            0.24
search engines  1.31            9.63            1.16            1.16

Table 2. Search Times
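
To make the effect of cached IO in Table 2 concrete, the sketch below computes, for each query, the share of the initial query time not spent on the CPU and the speedup when the query is repeated. The type and function names are illustrative. Attributing all non-CPU time to IO is a simplifying assumption, although the text does say disk IO over NFS dominates.

```python
from typing import NamedTuple


class QueryTiming(NamedTuple):
    query: str
    cold_cpu: float    # CPU time (s), initial query
    cold_total: float  # total time (s), initial query
    warm_cpu: float    # CPU time (s), repeated query
    warm_total: float  # total time (s), repeated query


# Values copied from Table 2.
TABLE_2 = [
    QueryTiming("al gore", 0.09, 2.13, 0.06, 0.06),
    QueryTiming("vice president", 1.77, 3.84, 1.66, 1.80),
    QueryTiming("hard disks", 0.25, 4.86, 0.20, 0.24),
    QueryTiming("search engines", 1.31, 9.63, 1.16, 1.16),
]


def non_cpu_fraction(t: QueryTiming) -> float:
    """Share of the initial query's wall time not accounted for by CPU."""
    return (t.cold_total - t.cold_cpu) / t.cold_total


def cache_speedup(t: QueryTiming) -> float:
    """How many times faster the repeated query is than the initial one."""
    return t.cold_total / t.warm_total


for t in TABLE_2:
    print(f"{t.query:15s} non-CPU share {non_cpu_fraction(t):5.1%}   "
          f"speedup {cache_speedup(t):5.1f}x")
```

The speedups range from roughly 2x for "vice president", where CPU time is a large part of the total, to about 35x for "al gore", where almost all of the initial time is spent outside the CPU.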
   
