The halo effect has graduated from inflating stock prices to making companies godlike. Thus, they can do anything – mere mortals can just speculate. The truth, however, is frequently mundane.
The source for this headline comes directly from Google:
So it’s confirmed, then.
Actually, that’s what another expert says:
Mr. Buler believes this is a great accomplishment, and one that remains largely unknown.
He’s right on one count.
While our surfacing approach has generated considerable traffic, there remains a large number of forms that continue to present a significant challenge to automatic analysis. For […] and onsubmit tags that enable the execution of arbitrary […]. Further, many forms involve inter-related inputs and accessing the sites involve correctly (and automatically) identifying their underlying dependencies. Addressing these and other such challenges efficiently on the scale of millions is part of our continuing effort to make the contents of the Deep Web more accessible to search engine users.
It would seem they solved this problem! (This is a big accomplishment). When did they solve it? Recently?
Well, sort of. In a 2009 paper called “Harnessing the Deep Web: Past, Present, and Future,” they say this:
We note that the canonical example of correlated inputs, namely, a pair of inputs that specify the make and model of cars (where the make restricts the possible models) is typically […] such correlations easily.
So let’s back up.
What is Google doing? They’re accessing structured data hidden behind form submissions. Now, we say the information is “hidden” behind form submissions because you have to submit the form to get the data. One approach – the “dumb” approach – is to generate all possible result URLs and then crawl all of them.
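To make that concrete, here is a minimal sketch of the “dumb” approach in Python. The form fields, values, and URL are all invented for illustration; the point is only that every result URL is one element of the Cartesian product of the input values.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical form inputs and their possible values.  Real search forms
# (cars.com is Google's example) have far more values per input.
form_inputs = {
    "make":  ["honda", "toyota", "ford"],
    "model": ["civic", "corolla", "f150"],
    "year":  ["2007", "2008", "2009"],
    "price": ["5000", "10000", "20000"],
}

def dumb_surface(action_url, inputs):
    """Yield one result URL for every combination of input values."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))

urls = list(dumb_surface("http://example.com/search", form_inputs))
print(len(urls))  # 3 * 3 * 3 * 3 = 81, and it grows multiplicatively per input
```

With five inputs of a few dozen values each, that multiplication alone takes you into the hundreds of millions of URLs.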
But. Those clever folks at Google noticed this might be a problem:
For example, the search form on cars.com has 5 inputs and a Cartesian product will yield over 200 million URLs, even though cars.com has only 650,000 cars on sale.
The challenge, then, is to generate far fewer URLs. So they developed an algorithm with this property:
We have found that the number of URLs our algorithms generate is proportional to the size of the underlying database, rather than the number of possible queries.
How do they do this? Well, one big challenge (as noted above) is that the valid values for one input can depend on the values of another. Google has taken to constructing databases of “interrelated data” (like manufacturer and car model) so they can automatically detect what data the form wants and limit their indexing accordingly.
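Here is a toy sketch of that idea, again in Python, with a hand-written make-to-model mapping standing in for the “interrelated data” Google says it builds automatically. Once the dependency is known, you only generate the combinations that can actually occur, so the URL count tracks the underlying records rather than the full cross product.

```python
from urllib.parse import urlencode

# Assumed "interrelated data": which models are valid for each make.
# Google describes building this kind of mapping automatically; here it
# is hard-coded purely for illustration.
valid_models = {
    "honda":  ["civic", "accord"],
    "toyota": ["corolla", "camry"],
    "ford":   ["f150"],
}

def constrained_surface(action_url, valid_models, years):
    """Yield URLs only for make/model pairs that actually occur together."""
    for make, models in valid_models.items():
        for model in models:                 # only this make's models
            for year in years:
                query = {"make": make, "model": model, "year": year}
                yield action_url + "?" + urlencode(query)

urls = list(constrained_surface("http://example.com/search",
                                valid_models, ["2008", "2009"]))
print(len(urls))   # 5 valid make/model pairs * 2 years = 10 URLs, versus
                   # 3 makes * 5 models * 2 years = 30 for the full cross product
```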
Well, the clever researchers at Google knew they needed to determine which fields in a form were interrelated. They also figured they only needed to determine this once: once they knew which fields were related, they could generate the URLs automatically with their generation algorithms.
As you can imagine, if you only need to do it once (for each form), then it becomes practical to emulate. Emulate one form, and you get 650,000 URLs to index, backed by solid data. It’s cheap – so cheap, it’s almost worth getting a human to do it. (Except no Googler would think of that!)
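Put as a purely illustrative sketch (the analyse_form helper and cache shape are mine, not Google’s): the expensive dependency analysis happens once per form and gets cached, and everything after that is a cheap lookup followed by URL generation.

```python
# One-time form analysis, cached per form.  analyse_form is a stand-in for
# the expensive step (discovering which inputs are correlated and which
# values are valid); everything after the first call is a cheap lookup.
form_dependency_cache = {}

def analyse_form(form_url):
    """Placeholder for the expensive, once-per-form dependency analysis."""
    return {"make": {"honda": ["civic", "accord"], "ford": ["f150"]}}

def get_dependencies(form_url):
    """Analyse a form at most once; reuse the result on every later crawl."""
    if form_url not in form_dependency_cache:
        form_dependency_cache[form_url] = analyse_form(form_url)
    return form_dependency_cache[form_url]

deps = get_dependencies("http://example.com/search")  # expensive the first time
deps = get_dependencies("http://example.com/search")  # cached thereafter
```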
It is galling to see a reporter say that something is “unclear” when it would be hard to make it much clearer. In 2008, Jayant Madhavan wrote a post on the Google Webmaster Central blog about crawling through forms to get to the Deep Web – this stuff isn’t restricted to academic papers easily accessible through Google Scholar and surfaced in regular Google results. No, it’s even in the blogosphere.
I think I’ve gone a bit too far, so I’ll stop now.