Friday, June 30, 2006

Interview with Google Sitemap Team

Source: http://www.smart-it-consulting.com/article.htm?node=166&page=135

This is another compilation about Google Sitemaps. I hope you guys pick up something useful from it:

Google's Sitemaps Team, interviewed in January 2006, provides great insights and information about the Sitemaps program, crawling and indexing in general, handling of vanished pages (404 vs. 410) and the URL removal tool, and valuable assistance on many frequently asked questions. Matt Cutts chimed in and stated '...It's definitely a good idea to join Sitemaps so that you can be on the ground floor and watch as Sitemaps improves'. This interview is a must-read for Webmasters.


Google Sitemaps was launched in June 2005 to enhance Google's Web crawling process in cooperation with Webmasters and site owners. Collaborative crawling brings Webmasters on board to some degree, and both sides have learned a lot from each other over the last few months. Google's Sitemaps Team does listen to Joe Webmaster's needs, questions, and suggestions. They have implemented a lot of very useful features based on suggestions in the Google Sitemaps Group, an open forum where members of the Sitemaps team communicate with their users, handing out technical advice even on weekends. The nickname Google Employee, used by the Sitemaps team, regularly makes it onto the list of this month's top posters.


The Sitemaps community, producing an average of 1,500 posts monthly, suffered from repetitive topics diluting the archives. When the idea of a Google Sitemaps Knowledge Base was born in the group, I discussed it with the Sitemaps team. Vanessa Fox, who blogs for the Sitemaps team from Kirkland, Washington, suggested doing "an e-mail interview to answer some of the more frequently asked questions", so here we are.


Vanessa, thank you for taking the time to support the group's knowledge base project. Before we discuss geeky topics which usually are dull as dust, would you mind introducing the Sitemaps team? I understand that you're an international team, with members working in Kirkland, Mountain View, and Zürich on different components of the Google Sitemaps program. Can you tell us who is who on your team?

Vanessa: You're right. Our team is located in offices around the globe, which means someone on the team is working on Sitemaps nearly around the clock. A few team members were able to take some time to answer your questions, including Shiva Shivakumar, engineering director who started the Google Sitemaps project (and whose interview with Danny Sullivan you may have seen when we initially launched), Grace and Patrik from our Zurich office, Michael and Andrey from our Kirkland office, and Shal from our Mountain View office. We also got Matt Cutts to chime in.

My amateurish attempt to summarize the Google Sitemaps program is "Aimed crawling makes Google's search results fresh, and Webmasters happy". How would you outline your service, its goals, intentions, and benefits?

Our goal is two-way communication between Google and webmasters. Google Sitemaps is a free tool designed so webmasters can let us know about all the pages on their sites and so we can provide them with detailed reports on how we see their sites (such as their top Google search queries and URLs we had trouble crawling).

We can reach many pages that our discovery crawl cannot find, and Sitemaps convey some very important metadata about the sites and pages which we could not infer otherwise, like a page's priority and refresh cycle. In particular, the refresh cycle should allow us to download pages only when they change and thus reduce needless downloads, saving bandwidth.
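As an illustration, a bare-bones Sitemap carrying this kind of metadata might be generated like the sketch below. This is not Google's code; the URLs, dates, and values are placeholders, and the 0.84 namespace is the schema Google documented for the protocol at the time of this interview.

```python
# Hypothetical example of generating a Sitemap with the metadata mentioned
# above (last modification date, refresh cycle, priority). Placeholder data.
import xml.etree.ElementTree as ET

NS = "http://www.google.com/schemas/sitemap/0.84"

pages = [
    # (loc, lastmod, changefreq, priority) -- example values only
    ("http://www.example.com/", "2006-01-15", "daily", "1.0"),
    ("http://www.example.com/faq.htm", "2005-12-20", "weekly", "0.8"),
]

urlset = ET.Element("urlset", {"xmlns": NS})
for loc, lastmod, changefreq, priority in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    ET.SubElement(url, "changefreq").text = changefreq
    ET.SubElement(url, "priority").text = priority

# Write sitemap.xml with an XML declaration, ready for submission.
ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8", xml_declaration=True)
```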

You've announced collaborative crawling as "an experiment called Google Sitemaps that will either fail miserably, or succeed beyond our wildest dreams, in making the web better for webmasters and users alike". Eight months later I think it's safe to say that your experiment has grown into a great success. How much has your project contributed to the growth, freshness, and improved quality of Google's search index?

We've had a huge response from webmasters, who have submitted a great deal of high-quality pages. Many pages would never have been found through our usual crawl process, such as URLs with content behind forms, locked up in databases, or behind content management systems.

We also have many clients supporting Sitemaps natively (many are listed at http://code.google.com/sm_thirdparty.html), in addition to our initial open-source Python client. We are working with more such clients to support Sitemaps natively, and with larger websites to generate these automatically as well.

Also, Michael points out "As some of you may have noticed, Sitemaps has also served as an...uhm...impromptu stress test of different parts of the Google infrastructure, and we're working hard to fix those parts."

A major source of misunderstandings and irritations is the common lack of knowledge on how large IT systems -- especially search engines -- work. This leads to unrealistic expectations and rants like "Google is broke because it had fetched my sitemap, the download status shows an OK, but none of my new pages appear in search results for their deserved keywords".


Matt's recent article How does Google collect and rank results? sheds some light on the three independent processes of crawling, indexing, and ranking in response to user queries, and I've published some speculations in my Sitemaps FAQ. Can you provide us with a spot-on description of the process, starting with a Sitemap download, its validation, and the passing of its URLs to the crawling engine, which sends out the Googlebots to fetch the files and hand them over to the indexer? I'm sure such an anatomical insight would help Sitemaps users to think in realistic timetables.

Your description is pretty close. Sitemaps are downloaded periodically and then scanned to extract links and metadata. The valid URLs are passed along to the rest of our crawling pipeline -- the pipeline takes input from 'discovery crawl' and from Sitemaps. The pipeline then sends out the Googlebots to fetch the URLs, downloads the pages and submits them to be considered for our different indices.
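To make that sequence concrete, here is a rough sketch of the first two steps: download a Sitemap, extract the URLs and metadata, and hand the valid URLs to a fetch queue. It is an illustration only, not Google's actual pipeline; scan_sitemap and the deque-based queue are made-up stand-ins.

```python
# Illustrative only -- not Google's code. Downloads a Sitemap, extracts
# URLs plus optional metadata, and appends valid entries to a fetch queue
# that stands in for the real crawling pipeline.
import urllib.request
import xml.etree.ElementTree as ET
from collections import deque

NS = "{http://www.google.com/schemas/sitemap/0.84}"

def scan_sitemap(sitemap_url, fetch_queue):
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    for entry in tree.getroot().findall(NS + "url"):
        loc = entry.findtext(NS + "loc")
        lastmod = entry.findtext(NS + "lastmod")    # optional metadata
        priority = entry.findtext(NS + "priority")  # optional metadata
        if loc and loc.startswith(("http://", "https://")):  # crude validation
            fetch_queue.append((loc.strip(), lastmod, priority))

queue = deque()
scan_sitemap("http://www.example.com/sitemap.xml", queue)
# Downstream, fetchers would pop URLs from the queue, download the pages,
# and submit them to be considered for the different indices.
```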

Obviously you can't reveal details about the scores applied to Web sites (besides PageRank) which control the priorities and frequency of your crawling and indexing (Sitemap-based as well as regular), and -- due to your efforts to ensure a high quality of search results, and other factors as well -- you cannot guarantee crawling and indexing of all URLs submitted via Sitemaps. However, for a quite popular and reputable site which meets your quality guidelines, what would you expect as the best or average throughput, or time to index, for both new URLs and updated content?

You're right. Groups users report being indexed in a matter of days or weeks. Crawling is a funny business when you are crawling several billion pages -- we need to crawl lots of new pages and refresh a large subset of previously crawled pages periodically as well, with finite capacity. So we're always working on decreasing the time it takes to index new information and to process updated information, while focusing on the quality of end-user search results. Currently, Sitemaps feeds the URLs and metadata into the existing crawling pipeline just like our discovered URLs.

Matt adds "it's useful to remember that our crawling strategies change and improve over time. As Sitemaps gains more and more functionality, I wouldn't be surprised to see this data become more important. It's definitely a good idea to join Sitemaps so that you can be on the 'ground floor' and watch as Sitemaps improves."

My experiments have shown that often a regular crawl spots and fetches fresh content before you can process the updated Sitemap, and that some URLs harvested from Sitemaps are crawled even when the page in question is unlinked and its URL cannot be found elsewhere. Also, many archived pages last crawled in the stone age suddenly get revisited. These findings lead to two questions.


First, to what degree can a Google Sitemap help to direct Googlebot to updates and new URLs on well-structured sites, when the regular crawling is that sophisticated? Second, how much of the formerly 'hidden Web' -- especially awkwardly linked content -- did you discover with the help of Sitemaps, and what do you do with unlinked orphan pages?

Sitemaps offer search engines more precise information than can be found through discovery crawling. All sites can potentially benefit from Sitemaps in this way, particularly as the metadata is used in more and more ways.

Grace says: "As for the 'hidden Web', lots of high quality pages have indeed been hidden. In many high quality sites that have submitted Sitemaps, we now see 10-20 times as many pages to consider for crawling."

Some Webmasters fear that a Google Sitemap submission might harm their positioning, and in the discussion group we can often read postings asserting that Google has removed complete Web sites from the search index shortly after a Sitemap submission. My standard answer is "don't blame the Sitemap when a site gets tanked", and looking at most of the posted URLs, the reasons causing invisibility on the SERPs become obvious at first glance: the usual quality issues. However, in a very few of the reported cases it seems possible that a Sitemaps submission could result in a complete wipe-out or a move to the supplemental index, followed by heavy crawling and fresh indexing in pretty good shape after a while.


Machine-readable mass submissions would allow a few holistic quality checks before the URLs are passed to the crawling engine. Do you handle URLs harvested from XML Sitemaps differently from URLs found on the Web or submitted via the Add URL page? Do mass submissions of complete sites speed up the process of algorithmic quality judgements and spam filtering?

Sitemap URLs are currently handled in the same way as discovery URLs in terms of penalties. If a site is penalized for violating the webmaster guidelines, that penalty would apply whether Googlebot followed links from a Sitemap, as part of the regular discovery crawl, or from the Add URL page.

Many users expect Google Sitemaps to work like Infoseek submissions in the last century, that is, instant indexing and ranking of each and every URL submission. Although unverified instant listings are technically possible, they would dilute the quality of any search index, because savvy spammers could flood the index with loads of crap in no time.


Experienced Webmasters and SEOs do read and understand your guidelines, play by the rules, and get their content indexed. Blogs and other easy-to-use content management systems (CMS) brought millions of publishers to the Web who usually can't be bothered with all the technical stuff involved.


Those technically challenged publishers deserve search engine traffic, but most probably they would never visit a search engine's Webmasters section. Natural publishing and networking leads to indexing eventually, but there are lots of pitfalls, for example the lack of search-engine-friendly CMS software and the many Web servers that are misconfigured by default.


What's your best advice for the novice publisher not willing -- or not able -- to wear a Webmaster hat? What are your plans to reach those who don't get your message yet, and how do you think you can help foster realistic expectations?

Our FAQ pages are a good starting place for all webmasters. For those who are using hosting or CMS systems they don't have a lot of experience with, Sitemaps can help alert them to issues they may not know about, such as problems Googlebot has had crawling their pages.

Michael adds that "our webmaster guidelines are intended to be readable and usable by non-experts. Create lots of good and unique content, don't try to be sneaky or underhanded, and be a bit patient. The Web is a very, very big place, but there's still a lot of room for new contributions. If you want to put up a new website, it may be helpful to think about how your website will be an improvement over whatever is already out there. If you can't think of any reasons why your site is special or different, then it's likely that search engines won't either, and that may be frustrating."

Many Web sites use eCommerce systems and CMS software producing cluttered non-standard HTML code, badly structured, SE-unfriendly navigation, and huge amounts of duplicated textual content available from various URLs. In conjunction with architectural improvements like SE-friendly cloaking to enhance such a site's crawlability, a Google Sitemap will help to increase the number of crawled and indexed pages. In some cases this may be a double-edged sword, because on formerly rarely indexed sites a full crawl may reveal unintended content duplication, which leads to suppressed search results caused by your newer filters.


What is your advice for (large) dynamic sites suffering from session IDs and UI-specific query string parameters, case issues in URLs, navigation structures which create multiple URLs pointing to the same content, excessively repeated text snippets, or thin product pages without unique content except for the SKU and a picture linked to the shopping cart? Will you, or do you, use the (crawling) priority attribute to determine whether you index a URL from the Sitemap or a variant -- with similar or near-duplicated content -- found by a regular crawl?

Everything that applies to regular discovery crawling of sites applies to pages listed in Sitemaps. Our webmaster guidelines provide many tips about these issues. Take a look at your site in a text-only browser. What content is visible?

As for dynamic pages that cause duplicate content listings in the Sitemap, make sure that a version of each page exists that doesn't include things like a session ID in the URL and then list that version of the page in your Sitemap.

Make sure that the Sitemap doesn't include multiple versions of the same page that differ only in session ID, for instance.

If your site uses a content management system or database, you probably want to review your generated Sitemap before submitting to make sure that each page of your site is only listed once. After all, you only want each page listed once in the search results, not listed multiple times with different variations of the URL.
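A small sketch of that kind of clean-up, assuming the session token travels in the query string; the parameter names below are just common examples, not an exhaustive list.

```python
# Hypothetical clean-up pass over generated URLs before they go into a
# Sitemap: strip session-style query parameters and keep each page once.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sessionid", "phpsessid", "jsessionid", "sid"}  # examples only

def canonicalize(url):
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query)
            if key.lower() not in SESSION_PARAMS]
    # Rebuild the URL without session parameters or fragments.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))

def unique_urls(urls):
    seen = set()
    for url in urls:
        canonical = canonicalize(url)
        if canonical not in seen:
            seen.add(canonical)
            yield canonical

raw = [
    "http://www.example.com/product.php?sku=42&sessionid=abc123",
    "http://www.example.com/product.php?sessionid=xyz789&sku=42",
]
print(list(unique_urls(raw)))  # one canonical URL instead of two variants
```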

Google is the first major search engine informing Webmasters -- not only Sitemaps users -- about the crawlability of their Web sites. Besides problem reports, your statistics even show top search queries, the most clicked search terms, PageRank distribution, overviews of content types, encoding, and more. You roll out new reports quite frequently; what is the major source of your inspiration?

We look at what users are asking for in the discussion groups and we work very closely with the crawling and indexing teams within Google.

Andrey adds, "if we find a particular statistic that is useful to webmasters and doesn't expose confidential details of the Google algorithms, we queue that up for release."

On the error pages you show a list of invalid URLs, for example 404 responses and other errors. If the source is "Web", that is, the URL was found on the site or anywhere on the Web during the regular crawling process and not harvested from a Sitemap, it's hard to locate the page(s) carrying the dead link. A Web search does not always lead to the source, because you don't index every page you've crawled. Thus in many cases it's impossible to ask the other Webmaster for a correction.


Since Google knows every link out there, or at least should know the location where an invalid URL was found in the first place, can you report the sources of dead links on the error page? Also, do you have plans to show more errors?

We have a long list of things we'd love to add, but we can only work so fast. :) Meanwhile, we do read every post in the group to look for common requests and suggestions, so please keep them coming!

HTTP/1.1 introduced the 410 Gone response code, which is supported by the newer Mozilla-compatible Googlebot, but not by the older crawler, which still does HTTP/1.0 requests. The 404 Not Found response indicates that the requested resource may reappear, so the right thing to do is to respond with a 410 if a resource has been removed permanently.

Would you consider it safe with Google to make use of the 410 error code, or do you prefer a 404 response and want Webmasters to manually remove outdated pages with the URL console, which scares the hell out of most Webmasters, who fear getting their complete site suspended for 180 days?

Webmasters can use either a 404 or 410. When Googlebot receives either response when trying to crawl a page, that page doesn't get included in the refresh of the index. So, over time, as the Googlebot recrawls your site, pages that no longer exist should fall out of our index naturally.
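As a server-side illustration of that distinction (not anything Google prescribes), a site could answer 410 for paths it knows were removed for good and fall back to 404 for everything else. The toy handler below is only a sketch; the GONE set is hypothetical, and a real site would configure this in its web server or CMS.

```python
# Toy handler illustrating 410 (Gone) vs. 404 (Not Found). The GONE set
# is a hypothetical list of permanently removed paths.
from http.server import BaseHTTPRequestHandler, HTTPServer

GONE = {"/old-press-release.htm", "/retired-product.htm"}  # hypothetical paths

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in GONE:
            self.send_error(410, "Gone")       # removed permanently
        else:
            self.send_error(404, "Not Found")  # unknown; may reappear

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StatusHandler).serve_forever()
```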

Webmasters shouldn't fear our automated URL removal tool. We realize that the description says if you use the tool, your site will be removed for 180 days and we are working to get this text updated. What actually happens is that we check to make sure the page you want removed doesn't exist or that you've blocked it using a robots.txt file or META tags (depending on which removal option you choose). If everything checks out OK, we remove the page from our index. If later on, you decide you do want the page indexed and you add it back to your site or unblock it, we won't add it back to our index for at least 180 days. That's what the warning is all about. You need to be sure you really don't want this page indexed.

But I would suggest that webmasters not bother with this tool for pages that no longer exist. They should really only use it for pages that they meant to block from the index with a robots.txt file or META tags originally. For instance, if a hypothetical webmaster has a page on a site that lists sensitive customer information and that hypothetical webmaster is also somewhat forgetful and doesn't remember to add that page to the robots.txt file, the sensitive page will likely get indexed. That hypothetical forgetful webmaster would then probably want to add the page to the site's robots.txt file and then use the URL removal tool to remove that page from the index.

As you mentioned earlier, webmasters might also be concerned about pages listed in the Site Errors tab (under HTTP errors). These are pages that Googlebot tried to crawl but couldn't. These pages are not necessarily listed in the index. In fact, if the page is listed in the Errors tab, it's quite possible that the page isn't in the index or will not be included in the next refresh (because we couldn't crawl it). Even if you were to remove this using the URL removal tool, you still might see it show up in the Errors tab if other sites continue to link to that page since Googlebot tries to crawl every link it finds.

We list these pages just to let you know we followed a link to your site and couldn't crawl the page and in some cases that's informational data only. For these pages, check to see if they are pages you thought existed. In this case, maybe you named the page on your site incorrectly or you listed it in your Sitemap or linked to it incorrectly. If you know these pages don't exist on your site, make sure you don't list them in your Sitemap or link to them on your site. (Remember that if we tried to crawl a page from your Sitemap, we indicate that next to the URL.) If external sites are linking to these non-existent pages, you may not be able to do anything about the links, but don't worry that the index will be cluttered up with these non-existent pages. If we can't crawl the page, we won't add it to the index.

Although the data appearing in the current reports aren't that sensitive (for now), you have a simple and effective procedure in place to make sure that only the site owner can view the statistics. To get access to the stats, one must upload a file with a unique name to the Web server's root level, and you check for its existence. To ensure this lookup can't be fooled by a redirect, you also request a file which should not exist. The verification is considered successful when the verification URL responds with a 200 OK code and the server responds to the probe request with a 404 Not Found error. To enable verification of Yahoo stores and sites on other hosts with case restrictions, you've recently changed the verification file names to all lower case.


Besides a couple of quite exotic configurations, this procedure does not work well with large sites like AOL or eBay, which generate a useful page on the fly even if the requested URI does not exist, and it keeps out all sites on sub-domains where the Webmaster or publisher can't access the root level, for example free hosts, hosting services like ATT, and your very own Blogger Web logs.

Can you think of an alternative verification procedure to satisfy the smallest as well as the largest Web sites out there?

We are actively working on improving the verification process. We know that there are some sites that have had issues with our older verification process, and we updated it to help (allowing lowercase verification filenames), but we know there are still users out there who have problems. We'd love to hear suggestions from users as to what would or wouldn't work for them!

In some cases verification requests seem to get stuck in the queue, and every once in a while a verified site falls back into pending status. You've posted that's an inconvenience you're working on; can you estimate when delayed verifications will be yesterday's news?

We've been making improvements in this area, but we know that some webmasters are still having trouble and it's very frustrating. We are working on a complete resolution as quickly as we can. We have sped up the time from "pending" to "verified" and you shouldn't see issues with verified sites falling back into pending.

If your site goes from pending to not verified, then likely we weren't able to successfully process the verification request. We are working on adding error details so you can see exactly why the request wasn't successful, but until we have that ready, if your verification status goes from "pending" to "not verified", check on the following, as these are the most common problems we encounter:

* that your verification file exists in the correct location and is named correctly
* that your webserver is up and responding to requests when we attempt the verification
* that your robots.txt file doesn't block our access to the file
* that your server doesn't return a status of 200 in the header of 404 pages

Once you've checked these things and made any needed changes, click Verify again.
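As an illustration (not Google's code), the points above can be checked from your own machine with a quick script along these lines; the site and verification file name are placeholders for the ones Sitemaps assigned to your account.

```python
# Quick self-check of the points above. Replace SITE and the verification
# file name (placeholders here) with your own values.
import urllib.error
import urllib.request
import urllib.robotparser
import uuid

SITE = "http://www.example.com"
VERIFICATION_FILE = "/google1234567890abcdef.html"  # placeholder name

def status(url):
    """Return the HTTP status code for a GET request."""
    try:
        return urllib.request.urlopen(url).getcode()
    except urllib.error.HTTPError as err:
        return err.code

# 1. The verification file exists at the root level and returns 200.
print("verification file:", status(SITE + VERIFICATION_FILE))

# 2. robots.txt does not block access to the verification file.
robots = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
robots.read()
print("robots.txt allows:", robots.can_fetch("Googlebot", SITE + VERIFICATION_FILE))

# 3. A request for a page that cannot exist returns 404, not 200.
print("missing page status:", status(SITE + "/" + uuid.uuid4().hex + ".html"))
```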

The Google Sitemaps Protocol is sort of a "robots inclusion protocol" respecting the gatekeeper robots.txt, which is standardized in the "robots exclusion protocol". Some Sitemaps users are pretty confused by the interplay of these two standards with regard to Web crawler support. Some of their suggestions, like restricting crawling to the URIs in the Sitemap, make no sense, but adding a way to remove URIs from a search engine's index, for example, is a sound idea.


Others have suggested adding attributes like title, abstract, and parent or level. Those attributes would allow Webmasters to integrate their XML Sitemaps (formatted with XSLT stylesheets) into the user interface. I'm sure you've gathered many more suggestions; do you have plans to change your protocol?

Because the protocol is an open standard, we don't want to make too many changes. We don't want to disrupt those who have adopted it. However, over time, we may augment the protocol in cases where we find that to be useful. The protocol is designed to be extensible, so if people find extended uses for it, they should go for it.

You've launched the Google Sitemaps Protocol under the terms of the Attribution-ShareAlike Creative Commons License, and your Sitemaps generator as open-source software as well. The majority of the third-party Sitemaps tools are free, too.

Eight months after the launch of Google Sitemaps, MSN Search silently accepts submissions of XML Sitemaps but does not officially support the Sitemaps Protocol. Shortly after your announcement, Yahoo started to accept mass submissions in plain text format, and added support for RSS and Atom feeds a few months later. In the meantime, each and every content management system produces XML Sitemaps, there are tons of great Sitemap generators out there, and many sites have integrated Google Sitemaps individually.


You've created a huge user base, are you aware of other search engines planning to implement the Sitemaps Protocol?

We hope that all search engines adopt the standard so that webmasters have one easy way to tell search engines comprehensive information about their sites. We are actively helping more content management systems and websites to support Sitemaps. We believe the open protocol will help the web become cleaner with regard to crawlers, and are looking forward to widespread adoption.

Is there anything else you'd like to tell your users, and the Webmasters not yet using your service as well?

We greatly appreciate the enthusiastic participation of webmasters around the world. The input continues to help us learn what webmasters would most like and how we can deliver that to them.

Shiva says that "eight months back we released Sitemaps as an experiment, wondering if webmasters and client developers would just ignore us or if they'd see the value in a more open dialogue between webmasters and search engines. We're not wondering that anymore. With the current rate of adoption, we are now looking at how to work better with partners and supporting their needs so that more and more webmasters with all kinds of hosting environments can take advantage of what Sitemaps can offer."

And Michael wants to remind everyone that "We really do read every post on the Sitemaps newsgroup and we really want people to keep posting there."

Thanks, Sebastian, for giving us this opportunity to talk to webmasters. That's what the Sitemaps project is all about.
