I’m currently leading a team at Bing called “Whole Page Organization”. We are responsible for a range of features that are displayed in the web search results page. One of the key common threads between the features that we build is that they are centered around understanding heterogeneous data coming from various backend services, and they are intended to add a level of cohesiveness and richness to the experience for our users. We are a very data driven team. Whenever we build a new feature, or make non-trivial changes to an existing feature, we strive to measure and understand the change as deeply as we can. Sometimes, however, we need to make a decision that conflicts with what the data tells us.

Whenever we test a new feature, we measure it using two primary approaches: offline metrics and online metrics. Offline metrics are human judged, usually by trained editorial staff. These metrics range from very targeted tasks, such as identifying specific classes of defects, to very general tasks, where judges might be asked to subjectively rate the whole page. Offline metrics are primarily used to gain insight into the quality of specific algorithms or data sets.

Online metrics are built on user interaction data. We collect detailed information about how our users interact with every part of the page, and compute aggregate statistics on page click rates, user sessions, dwell times, and so on. Online metrics are generally used to assist in understanding the overall success of a feature.
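To make this concrete, here is a minimal sketch of the kind of aggregation this involves. The log schema, field names, and numbers below are purely illustrative, not our actual pipeline:

```python
# Illustrative online-metric aggregation over raw interaction records
# (hypothetical schema): click rate and mean dwell time per experiment flight.
import pandas as pd

# Each row represents one impression of the search results page.
logs = pd.DataFrame({
    "flight":     ["control", "control", "treatment", "treatment"],
    "clicked":    [1, 0, 1, 1],            # did the user click anything on the page?
    "dwell_secs": [34.0, 0.0, 51.0, 12.0], # time spent on the result they clicked
})

online_metrics = logs.groupby("flight").agg(
    impressions=("clicked", "size"),
    click_rate=("clicked", "mean"),
    mean_dwell=("dwell_secs", "mean"),
)
print(online_metrics)
```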

Both of these classes of metrics are indispensable. They provide rich data to assist in making decisions, and often provide unambiguous outcomes. In some cases, however, the data can be inconclusive. Each of the metrics in both of these classes gives insight into very specific aspects of a feature. Sometimes the “big picture” is unclear due to conflicting metrics or weak signals. Some metrics, such as “sessions per unique user” (a measure of how many times our users visit the site), aim to provide an overarching view, but these usually have very low resolution and are subject to a great deal of noise. They are rarely sensitive enough to provide a clear and statistically significant signal.
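To illustrate why a metric like sessions per unique user so rarely reaches significance, here is a toy check using simulated per-user session counts. The numbers and the size of the shift are made up, but the shift is the kind of small improvement we would realistically hope to see:

```python
# Toy two-sample test on sessions per unique user (simulated data).
# Even with many users, a realistically small shift rarely reaches significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control   = rng.poisson(lam=3.00, size=10_000)  # sessions per user, control
treatment = rng.poisson(lam=3.02, size=10_000)  # a plausible, tiny improvement

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # typically nowhere near p < 0.05
```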

A recent feature that my team has built is called “people aggregation” (see figure below). The concept is that when a user is searching for a person on the web, we will group together results that we know are related to a specific individual. Many names, such as “Danny Sullivan”, are ambiguous—there are multiple individuals with that name that users are likely to be looking for. By grouping the results about each person, the user can more easily distinguish between results about the person they’re searching for, and results that are unrelated. In the current implementation, the original placement of the results is left untouched, so the grouped results are duplicated on the page, albeit with a different presentation. After extensive measurement and experimentation, we shipped the feature, since we saw a slight improvement in user engagement on the page when our feature was shown.
[Figure: the “people aggregation” feature grouping web results by individual]
Once we had shipped, we saw a potential problem. In some cases, such as “Lady Gaga”, a name query is not ambiguous, and all of the results on the page relate to a single person. When we group the results in these cases, we’re generally not adding any value, since we’re just duplicating the first few results and not making it substantially easier to find the right documents.

When we saw this, we went straight to the data. We hypothesized that users would not be engaging much with the feature when the query is unambiguous, and the gains that we saw in our experiments were coming from ambiguous queries. We divided the data into two sets: one where we were grouping results only within the top 5 ranked documents (unambiguous queries), and one where the grouped documents were originally more spread out on the page (ambiguous queries). What we saw was surprising—the difference in engagement was marginal.
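The slicing itself was simple. A sketch of the idea, with a hypothetical schema and made-up values, looks something like this:

```python
# Sketch of the ambiguity split (hypothetical schema): an impression counts as
# "unambiguous" when every grouped document already ranked in the top 5.
import pandas as pd

impressions = pd.DataFrame({
    "max_grouped_rank": [3, 8, 4, 12, 5, 9],  # lowest original rank pulled into a group
    "feature_clicked":  [1, 1, 0, 1, 0, 1],   # did the user click inside the grouped block?
})

impressions["query_type"] = impressions["max_grouped_rank"].map(
    lambda rank: "unambiguous" if rank <= 5 else "ambiguous"
)

engagement = impressions.groupby("query_type")["feature_clicked"].mean()
print(engagement)  # in our real data, the two rates were surprisingly close
```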

So we were faced with the decision of what to do about this situation. For these unambiguous queries, the feature added weight to the page (which has an impact on load time), and intuitively it did not add substantial value; however, the data did not clearly support our intuition. There were several courses of action we considered. First, we could leave the feature as it was. Second, we could disable the feature for these unambiguous queries. Third, we could turn off the feature entirely. Fourth, we could adjust the behavior of the feature so that it would intuitively add value by removing the original results from the page. The last option would reduce the page weight even more than turning the feature off, and make the load time faster, without losing good results from the page.

Turning off the feature entirely seemed like an overreaction. We had data to show that people aggregation had some small positive impact, and there were cases where we felt that it was useful. We eliminated this option first.

Redesigning the feature had some promise, but what we were considering was a risky change. We would have to carefully analyze the impact. We had occasionally done similar things in the past, and we knew that it would be difficult to predict what impact it would have. This was a course of action that would take some time (possibly months). In the meantime, we had a feature in production that we were not happy with. We needed to fix the problem sooner than that, so we shelved this option as an area for future experimentation.

That left us with the first two options: leave it alone, or disable it for unambiguous queries. We had no real data to guide the decision. At this point we went back to think about the design goals that we had for this feature, and to consider how the current behavior fit with our future plans.

One of the key design goals that we had was to organize the results on the page around entities (that is, individual people) so that users could more easily identify results that were related to the person they were searching for. Since there was only one person represented on the page in these problematic cases, we were not making it easier for users to find the right documents. It was clear that these cases did not align with our design goals.

Our future plans for the feature included several ideas related to variations in the presentation of the groups. These concepts did not lend themselves well to cases where the results page is dominated by a single entity.

After considering the initial design goals and future plans, we made the decision to disable the people aggregation feature for queries where the results that were being grouped were already in the top few documents ranked on the page. Sometimes the right thing for the user, and the product as a whole, is not what the data tells us.

 



