The New York Times writes about Google
A New York Times journalist was allowed to sit in on a meeting of the engineers responsible for search results quality control at Google. The journalist's article is here. Matt Cutts, one of the Google engineers responsible for results quality, commented on the story here, saying “in my opinion it does a good job of describing search quality at Google.”
Matt's post is permanent, but the NYTimes article will probably become subscriber-only soon. Some excerpts from the article follow:
Online stores, he notes, find that a quarter to a half of their visitors, and most of their new customers, come from search engines. And media sites are discovering that many people are ignoring their home pages — where ad rates are typically highest — and using Google to jump to the specific pages they want.
“Google has become the lifeblood of the Internet,” Mr. Battelle says. “You have to be in it.”
…
Some complaints involve simple flaws that need to be fixed right away. Recently, a search for “French Revolution” returned too many sites about the recent French presidential election campaign — in which candidates opined on various policy revolutions — rather than the ouster of King Louis XVI. A search-engine tweak gave more weight to pages with phrases like “French Revolution” rather than pages that simply had both words.
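A note of my own, not from the article: the difference between "has both words" and "has the phrase" is easy to see in a toy scorer. The sketch below is only an illustration under my own assumptions. The function name `phrase_aware_score`, the whitespace tokenization and the `phrase_bonus` value are all hypothetical, and the article says nothing about how Google actually implemented the tweak.

```python
def phrase_aware_score(query: str, text: str, phrase_bonus: float = 2.0) -> float:
    """Toy scorer: 1.0 if every query word occurs in the text,
    plus phrase_bonus if the words also occur as an exact, adjacent phrase."""
    q_words = query.lower().split()
    t_words = text.lower().split()
    if not q_words or not all(w in t_words for w in q_words):
        return 0.0  # at least one query word is missing
    n = len(q_words)
    # Slide a window of len(query) words over the text, looking for the phrase.
    has_phrase = any(
        t_words[i:i + n] == q_words for i in range(len(t_words) - n + 1)
    )
    return 1.0 + (phrase_bonus if has_phrase else 0.0)


# Hypothetical page snippets, invented for this example.
history_page = "the french revolution ended the reign of louis xvi"
election_page = "french candidates promised a revolution in tax policy"

history_score = phrase_aware_score("French Revolution", history_page)
election_score = phrase_aware_score("French Revolution", election_page)
```

Both pages contain both words, but only the history page gets the phrase bonus, so it ranks above the election page.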
At other times, complaints highlight more complex problems. In 2005, Bill Brougher, a Google product manager, complained that typing the phrase “teak patio Palo Alto” didn’t return a local store called the Teak Patio.
So Mr. Singhal fired up one of Google’s prized and closely guarded internal programs, called Debug, which shows how its computers evaluate each query and each Web page. He discovered that Theteakpatio.com did not show up because Google’s formulas were not giving enough importance to links from other sites about Palo Alto.
It was also a clue to a bigger problem. Finding local businesses is important to users, but Google often has to rely on only a handful of sites for clues about which businesses are best. Within two months of Mr. Brougher’s complaint, Mr. Singhal’s group had written a new mathematical formula to handle queries for hometown shops.
…THE QDF solution revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information. The model also examines Google’s own stream of billions of search queries, which Mr. Singhal believes is an even better monitor of global enthusiasm about a particular subject.
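Again my own illustration rather than anything the article describes: one simple way to decide that a topic is "hot" is to compare recent activity against a longer-term baseline. The function `is_hot`, the daily counts and the threshold of 3x are assumptions made up for this sketch. The article does not say how the QDF model measures activity.

```python
def is_hot(recent_counts, baseline_counts, ratio_threshold=3.0):
    """Toy 'hot topic' test: a topic is hot when its average recent activity
    (news mentions, blog posts or queries per day) is at least ratio_threshold
    times its average baseline activity."""
    if not recent_counts or not baseline_counts:
        raise ValueError("both count lists must be non-empty")
    recent_avg = sum(recent_counts) / len(recent_counts)
    baseline_avg = sum(baseline_counts) / len(baseline_counts)
    # Floor the baseline at 1.0 so a near-silent topic is not 'hot' after one mention.
    return recent_avg >= ratio_threshold * max(baseline_avg, 1.0)


# Hypothetical daily mention counts for two topics.
spiking_topic = is_hot(recent_counts=[90, 120, 150], baseline_counts=[10, 12, 8, 11])
steady_topic = is_hot(recent_counts=[10, 11], baseline_counts=[10, 12, 8, 11])
```

A ranking system could then give fresher pages more weight only for queries whose topic passes this kind of test.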
…
As Google compiles its index, it calculates a number it calls PageRank for each page it finds. This was the key invention of Google’s founders, Mr. Page and Sergey Brin. PageRank tallies how many times other sites link to a given page. Sites that are more popular, especially with sites that have high PageRanks themselves, are considered likely to be of higher quality.
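The article's description of PageRank is simplified. For readers who want to see the idea in code, here is a minimal sketch of the iterative algorithm as it is usually presented publicly, with a damping factor of 0.85. The graph, the function name and the fixed iteration count are my assumptions for illustration. This is not Google's production code.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to a rank; the ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the 'random jump' share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this tiny graph, c ends up with the highest rank because two pages link to it, and a outranks b because its single inbound link comes from the highly ranked c. That second effect is the "especially with sites that have high PageRanks themselves" part of the quote.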
Mr. Singhal has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years.
“The data we have is pushing the state of the art,” Mr. Singhal says. “We see all the links going to a page, how the content is changing on the page over time.”
…
These signals and classifiers calculate several key measures of a page’s relevance, including one it calls “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query. A page about President Bush’s speech about Darfur last week at the White House, for example, would rank high in topicality for “Darfur,” less so for “George Bush” and even less for “White House.” Google combines all these measures into a final relevancy score.
The sites with the 10 highest scores win the coveted spots on the first search page, unless a final check shows that there is not enough “diversity” in the results. “If you have a lot of different perspectives on one page, often that is more helpful than if the page is dominated by one perspective,” Mr. Cutts says. “If someone types a product, for example, maybe you want a blog review of it, a manufacturer’s page, a place to buy it or a comparison shopping site.”
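To make the last two excerpts concrete, here is a rough sketch of my own: combine several measures into one score, then pick the top results while capping how many may come from the same kind of site. Everything here is hypothetical, including the weighted sum, the signal names, the `kind` labels and the cap of two per kind. The article does not explain how Google combines its measures or how the diversity check works.

```python
def final_score(signals, weights):
    """Combine named signal values into one score using a weighted sum.
    Signals without a weight contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in signals.items())


def top_k_diverse(results, k=10, max_per_kind=2):
    """Pick up to k URLs by descending score, skipping a result when
    max_per_kind results of the same kind have already been chosen.

    results is a list of (url, score, kind) tuples.
    """
    chosen = []
    per_kind = {}
    for url, score, kind in sorted(results, key=lambda r: r[1], reverse=True):
        if per_kind.get(kind, 0) >= max_per_kind:
            continue  # this kind of page is already well represented
        chosen.append(url)
        per_kind[kind] = per_kind.get(kind, 0) + 1
        if len(chosen) == k:
            break
    return chosen


# Hypothetical candidates for a product query.
candidates = [
    ("r1", 0.9, "review"),
    ("r2", 0.8, "review"),
    ("r3", 0.7, "review"),
    ("m1", 0.6, "manufacturer"),
    ("s1", 0.5, "store"),
]
first_page = top_k_diverse(candidates, k=3, max_per_kind=2)
combined = final_score({"topicality": 0.8, "pagerank": 0.5},
                       {"topicality": 2.0, "pagerank": 1.0})
```

With a pure top-3 cut, the page would be three reviews. With the cap, the third review gives way to the manufacturer's page, which is the kind of mix Mr. Cutts describes.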
…
Yahoo is now developing special search formulas for specific areas of knowledge, like health. Microsoft has bet on using a mathematical technique to rank pages known as neural networks that try to mimic the way human brains learn information.
Google’s use of signals and classifiers, by contrast, is more rooted in current academic literature, in part because its leaders come from academia and research labs. Still, Google has been able to refine and advance those ideas by using computer and programming resources that no university can afford.
“People still think that Google is the gold standard of search,” Mr. Battelle says. “Their secret sauce is how these guys are doing it all in aggregate. There are 1,000 little tunings they do.”
Plenty of interesting information, confirmed by Google itself. Worth reading and rereading a few times.