5. Results and Performance (this section was not translated because it
refers to data from 1996-1997, which is quite outdated)
Query: bill clinton
http://www.whitehouse.gov/
100.00% (no date) (0K)
http://www.whitehouse.gov/
    Office of the President
    99.67% (Dec 23 1996) (2K)
    http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
    Welcome To The White House
    99.98% (Nov 09 1997) (5K)
    http://www.whitehouse.gov/WH/Welcome.html
    Send Electronic Mail to the President
    99.86% (Jul 14 1997) (5K)
    http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
mailto:president@whitehouse.gov
99.98%
    mailto:President@whitehouse.gov
    99.27%
The "Unofficial" Bill Clinton
94.06% (Nov 11 1997) (14K)
http://zpub.com/un/un-bc.html
    Bill Clinton Meets The Shrinks
    86.27% (Jun 29 1997) (63K)
    http://zpub.com/un/un-bc9.html
President Bill Clinton - The Dark Side
97.27% (Nov 10 1997) (15K)
http://www.realchange.org/clinton.htm
$3 Bill Clinton
94.73% (no date) (4K)
http://www.gatewy.net/~tjohnson/clinton1.html
Figure 4. Sample Results from Google
The most important measure of a search engine is the quality of its
search results. While a complete user evaluation is beyond the scope of
this paper, our own experience with Google has shown it to produce
better results than the major commercial search engines for most
searches. As an example which illustrates the use of PageRank, anchor
text, and proximity, Figure 4 shows Google's results for a search on
"bill clinton". These results demonstrate some of Google's features.
The results are clustered by server. This helps considerably when
sifting through result sets. A number of results are from the
whitehouse.gov domain, which is what one may reasonably expect from such
a search. Currently, most major commercial search engines do not return
any results from whitehouse.gov, much less the right ones. Notice that
there is no title for the first result. This is because it was not
crawled. Instead, Google relied on anchor text to determine this was a
good answer to the query. Similarly, the fifth result is an email
address which, of course, is not crawlable. It is also a result of
anchor text.
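The clustering by server described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Google's implementation: the helper name cluster_by_host and the choice to group on the URL hostname are hypothetical, and the sample URLs are taken from Figure 4.

```python
from urllib.parse import urlsplit


def cluster_by_host(urls):
    """Group ranked URLs by host, keeping rank order inside each group.

    Hypothetical helper: hosts appear in the order of their best-ranked result.
    """
    clusters = {}  # dicts keep insertion order in Python 3.7+
    for url in urls:
        parts = urlsplit(url)
        # mailto: links have no network location, so fall back to the
        # domain part of the address.
        host = parts.hostname or parts.path.rpartition("@")[2].lower()
        clusters.setdefault(host, []).append(url)
    return clusters


ranked = [
    "http://www.whitehouse.gov/",
    "http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html",
    "http://zpub.com/un/un-bc.html",
    "http://www.whitehouse.gov/WH/Welcome.html",
    "http://zpub.com/un/un-bc9.html",
]

for host, group in cluster_by_host(ranked).items():
    print(host)
    for url in group:
        print("    " + url)
```

Grouping in a single pass keeps the best-ranked page of each server first, which matches the indented layout of Figure 4.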
All of the results are reasonably high quality pages and, at last
check, none were broken links. This is largely because they all have
high PageRank. The PageRanks are the percentages in red along with bar
graphs. Finally, there are no results about a Bill other than Clinton
or about a Clinton other than Bill. This is because we place heavy
importance on the proximity of word occurrences. Of course a true test
of the quality of a search engine would involve an extensive user study
or results analysis which we do not have room for here. Instead, we
invite the reader to try Google for themselves at
http://google.stanford.edu.
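To illustrate why weighting proximity filters out pages about a different Bill or a different Clinton, the sketch below measures the smallest gap between two query words in a piece of text. The function min_distance and the whitespace tokenization are assumptions made for illustration; this section does not specify how Google computes proximity.

```python
def min_distance(text, word_a, word_b):
    """Return the smallest gap, in word positions, between word_a and word_b.

    Returns None when either word is missing. Tokenization is a plain
    lowercase split on whitespace, which is an assumption of this sketch.
    """
    tokens = text.lower().split()
    best = None
    last_a = last_b = None
    for i, tok in enumerate(tokens):
        if tok == word_a:
            last_a = i
        if tok == word_b:
            last_b = i
        # Once both words have been seen, the latest occurrences give
        # the closest pair that ends at position i.
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best


near = "president bill clinton spoke today"
far = "bill gates met hillary clinton"
print(min_distance(near, "bill", "clinton"))
print(min_distance(far, "bill", "clinton"))
```

A ranking function that rewards small gaps would prefer the first text, where the two words are adjacent, over the second, where they refer to different people.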
5.1 Storage Requirements
Aside from search quality, Google is designed to scale cost effectively
to the size of the Web as it grows. One aspect of this is to use
storage efficiently. Table 1 has a breakdown of some statistics and
storage requirements of Google. Due to compression the total size of
the repository is about 53 GB, just over one third of the total data it
stores. At current disk prices this makes the repository a relatively
cheap source of useful data. More importantly, the total of all the
data used by the search engine requires a comparable amount of storage,
about 55 GB. Furthermore, most queries can be answered using just the
short inverted index. With better encoding and compression of the
Document Index, a high quality web search engine may fit onto a 7GB
drive of a new PC.
Storage Statistics
Total Size of Fetched Pages                   147.8 GB
Compressed Repository                          53.5 GB
Short Inverted Index                            4.1 GB
Full Inverted Index                            37.2 GB
Lexicon                                         293 MB
Temporary Anchor Data (not in total)            6.6 GB
Document Index Incl. Variable Width Data        9.7 GB
Links Database                                  3.9 GB
Total Without Repository                       55.2 GB
Total With Repository                         108.7 GB
Web Page Statistics
Number of Web Pages Fetched                 24 million
Number of URLs Seen                       76.5 million
Number of Email Addresses                  1.7 million
Number of 404's                            1.6 million
Table 1. Statistics
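As a quick check of the arithmetic behind Table 1, the sketch below recomputes the two totals and the compression ratio from the individual rows. The variable names are illustrative, and treating 293 MB as 0.293 GB is an assumption about the units; the temporary anchor data is left out, as the table states.

```python
# Sizes in GB, copied from Table 1 (lexicon: 293 MB, taken here as 0.293 GB).
fetched_pages = 147.8
repository = 53.5
index_parts = {
    "short inverted index": 4.1,
    "full inverted index": 37.2,
    "lexicon": 0.293,
    "document index": 9.7,
    "links database": 3.9,
}

total_without_repository = sum(index_parts.values())
total_with_repository = total_without_repository + repository
compression_ratio = repository / fetched_pages

print(f"total without repository: {total_without_repository:.1f} GB")
print(f"total with repository:    {total_with_repository:.1f} GB")
print(f"repository / fetched:     {compression_ratio:.2f}")
```

The sums round to the 55.2 GB and 108.7 GB given in the table, and the ratio of about 0.36 agrees with "just over one third".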
5.2 System Performance
It is important for a search engine to crawl and index efficiently.
This way information can be kept up to date and major changes to the
system can be tested relatively quickly. For Google, the major
operations are Crawling, Indexing, and Sorting. It is difficult to
measure how long crawling took overall because disks filled up, name
servers crashed, or any number of other problems stopped the
system. In total it took roughly 9 days to download the 26 million
pages (including errors). However, once the system was running
smoothly, it ran much faster, downloading the last 11 million pages in
just 63 hours, averaging just over 4 million pages per day or 48.5
pages per second. We ran the indexer and the crawler simultaneously.
The indexer ran just faster than the crawlers. This is largely because
we spent just enough time optimizing the indexer so that it would not
be a bottleneck. These optimizations included bulk updates to the
document index and placement of critical data structures on the local
disk. The indexer runs at roughly 54 pages per second. The sorters can
be run completely in parallel; using four machines, the whole process
of sorting takes about 24 hours.
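The crawl rates quoted above follow directly from the page counts and durations. The short calculation below is a worked check using only the figures in the text; the variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Steady-state crawl: the last 11 million pages in 63 hours.
steady_pages = 11_000_000
steady_hours = 63
pages_per_second = steady_pages / (steady_hours * SECONDS_PER_HOUR)
pages_per_day = pages_per_second * SECONDS_PER_DAY

# Whole crawl, including the interruptions: 26 million pages in roughly 9 days.
overall_pages_per_second = 26_000_000 / (9 * SECONDS_PER_DAY)

indexer_pages_per_second = 54  # figure quoted in the text

print(f"steady crawl:  {pages_per_second:.1f} pages/s, {pages_per_day / 1e6:.2f} million pages/day")
print(f"overall crawl: {overall_pages_per_second:.1f} pages/s")
print(f"indexer headroom over steady crawl: {indexer_pages_per_second / pages_per_second:.2f}x")
```

The steady-state rate comes out at 48.5 pages per second, a little under the indexer's 54 pages per second, which is consistent with the indexer not being the bottleneck.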
5.3 Search Performance
Improving the performance of search was not the major focus of our
research up to this point. The current version of Google answers most
queries in between 1 and 10 seconds. This time is mostly dominated by
disk IO over NFS (since disks are spread over a number of machines).
Furthermore, Google does not have any optimizations such as query
caching, subindices on common terms, or other common optimizations. We
intend to speed up Google considerably through distribution and
hardware, software, and algorithmic improvements. Our target is to be
able to handle several hundred queries per second. Table 2 has some
sample query times from the current version of Google. They are
repeated to show the speedups resulting from cached IO.
                  Initial Query                 Same Query Repeated (IO mostly cached)
Query             CPU Time(s)   Total Time(s)   CPU Time(s)   Total Time(s)
al gore           0.09          2.13            0.06          0.06
vice president    1.77          3.84            1.66          1.80
hard disks        0.25          4.86            0.20          0.24
search engines    1.31          9.63            1.16          1.16
Table 2. Search Times
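The effect of cached IO in Table 2 can be made explicit by dividing each initial total time by the repeated total time. This is a small worked calculation over the table's own numbers; the variable names are illustrative.

```python
# (query, initial total time, repeated total time) in seconds, from Table 2.
timings = [
    ("al gore", 2.13, 0.06),
    ("vice president", 3.84, 1.80),
    ("hard disks", 4.86, 0.24),
    ("search engines", 9.63, 1.16),
]

# Speedup of the repeated (IO mostly cached) query over the initial one.
speedups = {query: initial / repeated for query, initial, repeated in timings}

for query, factor in speedups.items():
    print(f"{query:15s} {factor:5.1f}x")
```

The gain is smallest for "vice president", whose initial run already spends 1.77 of its 3.84 seconds on CPU, so caching IO removes a smaller share of the total time.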