New Search Technology Advances

August 24, 2005


New Search Technology Advances: The Mining of the Invisible Web 

by Rikki Kirzner, Partner

Yahoo, Google, and other major search engine companies have estimated that there are between 8 and 20 billion documents on the Web.  It isn’t so. The truth is that the Web is home to more than 500 billion Web-based documents that have thus far been unreachable by today’s crop of search engines.  A vast amount of data and documents is stored in databases that can’t be accessed by the current generation of Web crawlers. General-purpose search engines were not designed to search for information located in every database, data cache, or piece of data created on the fly.  As a result, to find more obscure information, you have to know exactly where to look for it.

As an example, suppose you were trying to find the age at which a Great Dane puppy develops its adult second and third premolars.  Using the popular search engines to search for puppy teeth will give you information on biting, teething, jaw alignment, Great Dane breeders, and human tooth development.  You would need to consult a veterinary book on breed-specific dentistry to find the information you are searching for.  Although the information exists, it is not accessible through a search of the general Internet.
Making the Inaccessible Available
A San Mateo, CA start-up called Glenbrook Networks is the latest company claiming to have developed a new way to extract data from this formerly inaccessible portion of the Web.  Its search engine extracts information found in multiple databases and then makes it searchable from a single site. 

For its first test, Glenbrook built a search engine that extracts job listings from the databases behind different Web sites, including hundreds of corporate sites, job boards, and professional associations, as well as HotJobs.  To do something similar with Google or Yahoo, it would be necessary to visit each individual company’s site, go to its job listings page, and extract the information one company at a time. The current crop of search engines simply isn’t capable of extracting information from what is commonly called the deep Web or invisible Web. 
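In spirit, merging listings harvested from many sites into one searchable index might look like the following sketch. The source names, listings, and helper functions here are purely illustrative assumptions, not Glenbrook's actual code:

```python
# Hypothetical stand-ins for job listings harvested from several sites;
# a real crawler would have extracted these from each site's database.
harvested = {
    "corporate-site": ["Java Developer, Boston", "QA Analyst, Austin"],
    "job-board": ["Java Architect, Chicago", "DBA, Denver"],
    "association": ["Java Developer, Boston"],  # duplicate listing
}

def build_index(sources):
    """Merge listings from every source into one de-duplicated list."""
    merged = []
    for listings in sources.values():
        for job in listings:
            if job not in merged:
                merged.append(job)
    return merged

def search(index, term):
    """A single-site search over the combined index."""
    return [job for job in index if term.lower() in job.lower()]

index = build_index(harvested)
print(search(index, "java"))
```

The point of the sketch is the aggregation step: once the harvested data sits in one index, a single query replaces the visit-each-site routine described above.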

Glenbrook’s next-generation Web crawlers analyze the forms on a Web page and use artificial intelligence to walk through complicated Web forms, sometimes answering questions in order to locate the desired information.  Edward and Julie Komissarchik, the father-and-daughter brains behind this technology, use the analogy of sending out an advance scout to case a joint, find the location of the safe, and ultimately crack the combination to get at the riches hidden within.  The scout goes through the various forms and tries different combinations of options to determine what results they produce. 

The scout then passes this information back to the master safecracker, who uses it to devise a way to open the forms and release the information locked within them.  Once the forms are opened, a team of information harvesters gathers up as much information as it can find, not only from that specific form but from all similar forms that exist in that database or set of databases at the site.  
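The scout-and-harvester idea can be illustrated with a toy sketch, assuming a greatly simplified form model. The field names, option lists, and `submit_form` stand-in below are hypothetical; a real crawler would be submitting live HTTP requests to pages whose results exist only after the form is filled in:

```python
from itertools import product

# Hypothetical, simplified model of a search form: each field name maps
# to the list of options the "scout" discovered on the page.
form_fields = {
    "category": ["engineering", "marketing"],
    "location": ["CA", "NY"],
}

def submit_form(selection):
    """Stand-in for submitting the form; mimics a site's database
    that returns listings only for a completed form."""
    listings = {
        ("engineering", "CA"): ["Sr. Software Engineer, San Mateo"],
        ("engineering", "NY"): ["Systems Engineer, Albany"],
        ("marketing", "CA"): ["Product Marketing Manager, LA"],
        ("marketing", "NY"): [],
    }
    return listings.get((selection["category"], selection["location"]), [])

def harvest(fields):
    """Try every combination of form options and collect the results."""
    names = list(fields)
    found = []
    for combo in product(*(fields[n] for n in names)):
        selection = dict(zip(names, combo))
        found.extend(submit_form(selection))
    return found

print(harvest(form_fields))
```

Enumerating every combination of options is the brute-force version of what the article describes; the "artificial intelligence" part lies in deciding which combinations and free-text answers are worth trying.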

This is not as sinister as it may sound.  Most of this undetected information doesn’t actually exist until the moment someone completes a form on a Web site and asks for it.  The Glenbrook Web crawlers are simply accessing information that companies want people to see, but that isn’t reachable unless someone purposely goes to the company’s Web site and fills out or traverses a form to get it.  Besides job listings, online travel sites, library catalogs, medical information databases, and a myriad of similar databases work in the same fashion.

Certainly Yahoo and other search engine companies are already attempting to access portions of this hidden information.  Yahoo has partnered with National Public Radio, the Library of Congress, and others in order to index the content found in their databases.  Meanwhile, Google has begun indexing WorldCat, OCLC’s comprehensive bibliographic database, to surface information formerly found only in libraries.  Invisible-Web search engines such as BrightPlanet’s CompletePlanet, KartOO, Search-22, and a few dozen others are also busy mining the gems of information hidden in databases.   

The Bottom Line 

Hurwitz and Associates believes that the wealth of information locked within the invisible Web offers many industries and companies the opportunity to move us to the next generation of the Web by overcoming the obstacles to getting timely, relevant information to individuals, businesses, and industries.  These new Web crawlers will put required information within the reach of anyone accessing the Web.  The opportunity to collect and categorize information in new ways is intoxicating to archivists, librarians, and the greatest scientific minds of our generation.

More importantly, the untapped value of these next-generation search engines lies in their ability to begin to understand what you are searching for rather than merely responding to the characters typed on a keyboard. Once we achieve this degree of sophistication, we will be able to access hidden information as easily as finding the latest sports score or the time the next movie starts. 

This ability will enable us to create new ways of structuring information and linking information resources.  The ability to concatenate and modularize information in new ways, to work more efficiently with content-bearing metadata, and possibly even to resolve semantic differences among various information resources will certainly advance and benefit every industry.


