Recent adjustments to Google’s PageRank algorithm have caused a stir in the blogosphere, as some blogs rose or dropped significantly in rank. However, Google also seems to be having trouble indexing the blogosphere in general.
Google Alerts is one way I keep track of where my blog is mentioned and by whom. I receive a daily “Comprehensive” e-mail that includes mentions in Google News, Google Blogs, Google Web, Google Video and Google Groups. Google Alerts also lets you filter your alerts so that you are only notified when you are mentioned on a blog indexed by Google Blogs. However, Google seems to be confused about what a blog is.
In theory, Google’s alert filtering distinguishes between mentions on the Web and mentions in the blogosphere. In practice, as I recently noticed, Google mixes the two. Going back through my e-mail, I found that Google has been doing this for a long time; I simply had not noticed before that it splits up blogs in its alerts.
Google lists my Blog Herald post on ‘Rethinking the Blog as Database’ under Google Blog Alerts, while it lists my Blog Herald author page under Google Web Alerts. Google apparently makes a clear distinction between (dynamic) blog posts and (static) blog pages. Michael Stevenson, who ironically noted that the internet is finally subsumed by blogs, commented that “It seems this might be because posts are syndicated (RSS) which is not the case for pages.” In other words, Google splits a single blog in two by treating different kinds of content within it differently.
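To make Michael’s guess concrete, here is a minimal Python sketch of what such a rule could look like: a URL counts as “blog” if it appears as an item in the blog’s RSS feed, and as “web” otherwise. This is only an illustration of his hypothesis, not a description of how Google actually classifies anything. The feed, the example.com URLs and the function names are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed: the post is syndicated, the author page is not.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>https://example.com/</link>
    <item>
      <title>Rethinking the Blog as Database</title>
      <link>https://example.com/2007/rethinking-the-blog-as-database/</link>
    </item>
  </channel>
</rss>"""


def syndicated_links(feed_xml):
    """Return the set of item links found in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    links = set()
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.add(link.strip())
    return links


def classify(url, feed_xml):
    """Assumed rule: syndicated URLs are 'blog', everything else is 'web'."""
    return "blog" if url in syndicated_links(feed_xml) else "web"


print(classify("https://example.com/2007/rethinking-the-blog-as-database/", SAMPLE_FEED))
print(classify("https://example.com/author/example-author/", SAMPLE_FEED))
```

Under this assumed rule, a static page such as an author page can never end up in the “blog” bucket, simply because it never shows up in the feed. That would match the split I see in my alerts.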
The Web as database does not itself impose a structure on the Web. Search engines such as Google and Technorati impose a structure on the Web for us. Google explicitly states that its mission is “to organize the world’s information and make it universally accessible and useful.” To do so, it maintains a giant database that it queries with particular algorithms, which yield particular results. A small tweak to an algorithm, as happened with PageRank recently, can cause a big change.
Not everything that is part of the Web is part of the (Google) Web database. Different search engines index different (amounts of) websites and blogs. The deep Web is the part of the Web that is not indexed by search engines. A recent study reports that the deep Web is 500 times larger than the surface Web. The authors also found that the major search engines were able to index one third of the deep Web, which still leaves two thirds of it unindexed (He et al. 2007). This means that we usually only see the tip of the iceberg and that Google is a long way from realizing its dream.
Most Web databases remain invisible, providing no link-based access, and are thus not indexable by current crawling techniques; and even when crawlable, Web databases are rather dynamic, and thus crawling cannot keep up with their updates. (He, Patel, Zhang, and Chang. “Accessing the deep web.” Commun. ACM 50, no. 5 (2007): 94-101.)
The current Web is so dynamic that Google and Technorati both seem to be facing this problem. As the blogosphere keeps growing, indexers face ever more frequent updates and their crawlers have a hard time keeping up. In ‘Has Technorati Stop Indexing Blogs?’ Darren Rowse and his Problogger readers noticed that Technorati recently experienced severe indexing problems of this kind.
This makes me wonder whether Google and Technorati can handle the maturing of the blogosphere. What do you think?