New Search Technology Advances: The Mining of the Invisible Web
Yahoo, Google, and other major search engine companies have estimated that there are between 8 and 20 billion documents on the Web. It isn’t so. The truth is that the Web is home to more than 500 billion Web-based documents that have thus far been unreachable by today’s crop of search engines. A vast amount of data and documents is stored in databases that the current versions of Web crawlers can’t access. General-purpose search engines were not designed to search for information located in every database or data cache, or for data created on the fly. As a result, to find more obscure information, you have to know exactly where to look for it.
For its first test, Glenbrook built a search engine that extracts job listings from databases on different Web sites, including hundreds of corporate sites, job billboards, and professional associations, as well as HotJobs. To do something similar with Google or Yahoo, one would have to visit each company’s site, go to its job listings page, and extract the information one company at a time. The current crop of search engines simply isn’t capable of extracting information from what is commonly called the deep Web or invisible Web.
What Glenbrook’s next-generation Web crawlers do is analyze the forms on a Web page and use artificial intelligence to walk through complicated Web forms, sometimes answering questions in order to locate the desired information. Edward and Julie Komissarchik, the father-and-daughter brains behind this technology, use the analogy of sending out an advance scout to case a joint, find the location of the safe, and ultimately crack the combination to get at the riches hidden within. The scout goes through various forms and tries various combinations of options to determine what results will occur.
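To make the scout analogy concrete, here is a minimal sketch of the general idea of probing a form by trying its option combinations. It is not Glenbrook’s implementation, which is not described in any detail here and reportedly uses artificial intelligence rather than brute-force enumeration. The class FormScout, the function candidate_queries, the sample form, and the example.com address are all hypothetical names chosen for illustration. The sketch only builds the candidate query addresses and makes no network requests.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormScout(HTMLParser):
    """Records a form action and every <select> field with its option values."""

    def __init__(self):
        super().__init__()
        self.action = ""      # first non-empty form action found on the page
        self.fields = {}      # field name -> list of option values
        self._current = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.action:
            self.action = attrs.get("action") or ""
        elif tag == "select" and "name" in attrs:
            self._current = attrs["name"]
            self.fields[self._current] = []
        elif tag == "option" and self._current and attrs.get("value"):
            # Options with an empty or missing value (e.g. "Any") are skipped.
            self.fields[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def candidate_queries(page_url, html):
    """Return one GET address per combination of the form's option values."""
    scout = FormScout()
    scout.feed(html)
    names = list(scout.fields)
    target = urljoin(page_url, scout.action)
    return [
        target + "?" + urlencode(dict(zip(names, combo)))
        for combo in product(*(scout.fields[n] for n in names))
    ]


# Hypothetical job-search form with two drop-down fields.
SAMPLE = """
<form action="/jobs/search">
  <select name="dept">
    <option value="eng">Engineering</option>
    <option value="sales">Sales</option>
  </select>
  <select name="city">
    <option value="sf">San Francisco</option>
    <option value="ny">New York</option>
  </select>
</form>
"""

for url in candidate_queries("http://example.com/careers", SAMPLE):
    print(url)
```

A real deep Web crawler would then fetch each candidate address and inspect the response to learn which combinations return useful listings. Because exhaustive enumeration grows multiplicatively with each added field, a production system would need some way to prune the combinations it tries.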
This is not as sinister as it may appear. Most of this undetected information never actually exists until the moment someone completes a form on a Web site and asks for it. The Glenbrook Web crawlers are just accessing information that companies want people to see but that isn’t accessible unless someone purposely goes to the company’s Web site and completes or traverses a form to get it. Other examples that work in a similar fashion include online travel sites, library catalogs, medical information databases, and a myriad of similar databases.
Certainly Yahoo and other search engine companies are already attempting to access portions of this hidden information. Yahoo has partnered with National Public Radio, the Library of Congress, and others in order to index the content found in their databases. Meanwhile Google is indexing records from WorldCat, the comprehensive bibliographic database maintained by OCLC, to surface information formerly found only in libraries. Other invisible Web search engines, such as BrightPlanet’s completeplanet, kartoo, search-22, and a few dozen others, are also busy mining the gems of information hidden in databases.
The Bottom Line
Hurwitz and Associates believes that the wealth of information locked within the invisible Web holds future opportunities for many industries and companies. Overcoming the obstacles to getting timely, relevant information to individuals, businesses, and industries could move us to the next generation of the Web. These new Web crawlers will be able to put required information within the reach of anyone accessing the Web. The opportunity to collect and categorize information in new ways is intoxicating to archivists, librarians, and the greatest scientific minds of our generation.
More importantly, the untapped value of these next-generation search engines lies in their ability to begin to understand what you are searching for rather than just responding to the characters typed on a keyboard. Once we achieve this degree of sophistication, we will be able to access hidden information as easily as we find the latest sports score or the time the next movie starts.