by Bob Treuber and David Norris
Most companies struggle when implementing Web metrics, yet many will tell you it was the smartest thing they ever did. If you think that the companies that “get it” don’t have any problems with metrics tools, it might surprise you to hear that a company doing a great job with analytics has had its share of difficulties along the way.
House of Antique Hardware is a pure-play “etailer.” Our customers are homeowners, trade professionals, and government agencies. With only a handful of exceptions, our customers are in the US and Canada. Our products are authentic reproductions of antique hardware. We offer around 9,000 SKUs across several product categories.
There are about a dozen significant competitors. Only a few cities have a local “brick & mortar” store for these products; most people source products from a Web site or a print catalog.
The company was started in 1998 by Roy Prange in his home. The company has moved to larger facilities three times since its start. Currently, we operate from a 10,000 sq. ft. warehouse and office suite in Northeast Portland, Oregon.
The company has had significant year-over-year growth since inception. The site began life as a hand-coded HTML store with a simple checkout, no search or order history, and a paper-based order-processing system. By focusing on data-driven metrics rather than our opinions about what works best, we have been able to incrementally improve the Web site, measuring progress as we go. Our metrics focus primarily on ROI.
The Web site is product-centric. Layout, navigation and architecture are designed to enable a shopper who is looking for a specific product. The site is 100% ecommerce. We measure page views, time on site, conversions, and average sales order.
We started testing new ideas for our site with Google Website Optimizer (GWO) because it’s free. We began by testing various photos for the header of product listing pages: we would pick 4 or 5 photos and run a simple GWO test on them. Most of the tests were inconclusive, with the performance of the variations falling within the margin of error. Sometimes a test would run for months with GWO still saying it needed more time, and all the variations performing nearly the same. Occasionally we would get a clear winner or loser, but we may have chosen photos that were too similar, or maybe people just don’t react that differently to one nice photo vs. another.
Our biggest frustrations with GWO are its inconsistent and unexplained results, which the following case study illustrates.
The case study we’d like to share is a series of tests we ran on our “add to cart” button. We ran 3 tests, refining our options each time. The first test consisted of 12 combinations, 2 of which were identical copies of the control. One control copy tracked the original properly, while the other was off by 33%. We found that the 3 green buttons and the oval-shaped options performed worst in the first round, and our original add-to-cart button either did the best or close to the worst, depending on which control copy you rely on.
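A quick way to sanity-check whether two supposedly identical variations really differ is a two-proportion z-test. This is a minimal sketch with hypothetical visitor and conversion counts (the article does not report the actual numbers); it illustrates why a 33% relative gap between two control copies is hard to attribute to chance once a reasonable number of visitors have been split between them.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for whether two variations share the same
    true conversion rate, given observed conversions and visitors."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical numbers: two identical control copies after 1,000
# visitors each, one converting at 15% and the other at 20%
# (a 33% relative gap, like the one described above).
z, p = two_proportion_z(150, 1000, 200, 1000)
```

With these assumed counts the p-value comes out well under 1%, meaning a gap that large between truly identical variations would be a rare accident; seeing it routinely suggests a tool or tracking problem rather than bad luck.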
We then stopped the first test and began round 2. We again found that oval shapes under-performed, but beyond that the results were too close to call. We had expected that as we refined our ideas and got closer to testing a set of good options, the results would converge to the point where the variations were simply too similar to show any delta in performance.
Finally, we ran a 3rd round of tests. The conversion rate at first was around 20%, and we disabled 2 under-performing options, both with the words “buy now” in them (the winners used “add to cart”). We then let the test run a while longer and watched the conversion rate on all the active options shrink to 15% on average. The variations still had roughly the same relative scores, but the mysterious drop in conversion rate (from 20% to 15% is a 25% relative decline) is concerning.
At the end of the day, we have seen too many weird and unexplained results from GWO to have much confidence in it. We would like to assume that the numbers it gives us are reliable, but every time we’ve tried to confirm them we’ve been foiled. There is no conceivable reason for 2 identical variations not to perform very closely after enough visitors have seen them. In the vast majority of our tests the control copies never matched: the test and control were so far apart that the results were completely invalidated. I still have never heard a plausible explanation, unless Google is exceptionally bad at splitting traffic up randomly, which is hard to believe.
Posted July 30, 2009
Do you have a story to tell where you were able to “do it wrong quickly”? E-mail your entry to me and maybe it will appear on this site, too. For more tips on how to super-charge your Internet marketing, order Do It Wrong Quickly and Search Engine Marketing, Inc. today.