October 12th - 10:00

Deep Learning and Similarity Search

Bobby Jaros - Yahoo Labs.

Deep learning has received tremendous attention recently thanks to its impressive results in computer vision, speech, medicine, robotics, and beyond. Although many of the highly visible results have been in a classification setting, a prime motivation for deep learning has been to learn rich feature vectors that are useful across a wide array of tasks. One goal of such features (for example, in perception-oriented tasks) might be that items deemed similar by humans would have mathematically similar feature vectors. As deep learning continues to advance, we can expect continued improvement in our ability to identify more and more interesting and subtle notions of similarity. In the other direction, similarity search can also empower deep learning, as recent work invokes similarity search as a core module of deep learning systems.

Bobby Jaros leads the deep learning group of Yahoo Labs, with a focus on natural language understanding, search, and computer vision. Prior to Yahoo he was the cofounder of LookFlow, which used deep learning to create similarity maps for image search and discovery. He started LookFlow while a PhD student at Stanford, where he also earned his BSEE (Terman Scholar), MSEE (Mayfield Fellow), and MBA.

October 13th - 9:00

Large-scale similarity joins with guarantees

Rasmus Pagh - IT University of Copenhagen, Denmark.

The ability to handle noisy or imprecise data is becoming increasingly important in computing. In the information retrieval community the notion of similarity join has been studied extensively, yet existing solutions have offered weak performance guarantees. Either they are based on deterministic filtering techniques that often, but not always, succeed in reducing computational costs, or they are based on randomized techniques that have improved guarantees on computational cost but come with a probability of not returning the correct result.
The aim of this talk is to give an overview of randomized techniques for high-dimensional similarity search, and to discuss recent advances towards making these techniques more widely applicable by eliminating the probability of error and improving the locality of data access.
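
As a rough illustration of the kind of randomized technique the abstract refers to, the sketch below estimates the Jaccard similarity of two sets with MinHash signatures. It is not taken from the talk: MinHash is only one well-known example of such a technique, and the names `make_hashes`, `signature`, and `estimate_jaccard`, the modulus `PRIME`, and the use of simple affine hash functions (which only approximate true min-wise independence) are all assumptions made for illustration.

```python
import random

# Assumption: items are non-negative integers smaller than PRIME.
PRIME = 2_147_483_647  # 2**31 - 1, a Mersenne prime


def make_hashes(k, seed=0):
    """Draw k affine hash functions x -> (a*x + b) mod PRIME, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(items, hashes):
    """MinHash signature: the minimum hash value of the set under each function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; a randomized estimate."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


hashes = make_hashes(64)
left = signature({1, 2, 3, 4}, hashes)
right = signature({3, 4, 5, 6}, hashes)
# The true Jaccard similarity is 2/6; the estimate varies with the hash draw.
print(estimate_jaccard(left, right))
```

Because the estimate depends on the random choice of hash functions, a similarity join that filters on it can miss a qualifying pair, which is the kind of error probability the abstract describes.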

Rasmus Pagh graduated from Aarhus University in 2002, and is now a full professor at the IT University of Copenhagen. His work is centered around efficient algorithms for big data, with an emphasis on randomized techniques. His publications span theoretical computer science, databases, information retrieval, knowledge discovery, and parallel computing. His best-known work is the cuckoo hashing algorithm (2001), which has led to new developments in several fields. In 2014 he received the best paper award at the WWW Conference for a paper with Pham and Mitzenmacher on similarity estimation, and started a 5-year research project funded by the European Research Council on scalable similarity search.

October 14th - 9:00

Directions for Similarity Search in Television Recommender Systems

Billy Wallace - Founding Developer, ThinkAnalytics.

Recommender systems require similarity search in order to find a movie or TV show that is similar to another. There are interesting constraints, however, that differentiate this application from a pure similarity search. Just finding similar content does not give good recommendations, as we are trying to fulfil a business use-case such as up-selling paid-for content or exposing users to content on channels they don't normally watch. Instead, we use similarity almost as a Bloom filter, where we populate a "candidate set" using similarity search and then use a second pass to select good recommendations based on the requirements of the use-case. It is common that we can't find enough recommendations to fulfil a request from the candidate set unless we supply some hints to the indexes being used to execute the similarity search, for example: prefer new content, prefer popular content, or require that candidates be in the user's "package". Measuring the success of such recommender systems is difficult. There are no standard test sets available, and it is difficult to convince broadcasters that they should share data that they may not own outright, or which may present privacy issues if shared. We will discuss an approach that we are starting to look at. Although the scale of the catalogues indexed is modest, with numbers of items in the hundreds of thousands rather than millions, there are scalability concerns due to the number of requests - millions of customers requiring thousands of requests per second with sub-second response times - and also the fact that the catalogue changes frequently - usually in its entirety several times per day. It is hoped that sharing insights from current commercial work in this area will suggest new research directions or applications of existing research.
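
The two-pass idea described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ThinkAnalytics' implementation: the item fields (`vec`, `channel`, `is_new`), the function names `cosine` and `recommend`, the use of cosine similarity, and the toy catalogue are all hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(seed, catalogue, package, k_candidates=3, n_results=2):
    # Pass 1: similarity search populates the candidate set.
    candidates = sorted(
        (item for item in catalogue if item["id"] != seed["id"]),
        key=lambda item: cosine(seed["vec"], item["vec"]),
        reverse=True,
    )[:k_candidates]
    # Pass 2: apply the use-case rules. Here: candidates must be in the
    # user's package, and new content is preferred. Python's sort is stable,
    # so ties keep their similarity order from pass 1.
    eligible = [c for c in candidates if c["channel"] in package]
    eligible.sort(key=lambda c: c["is_new"], reverse=True)
    return [c["id"] for c in eligible[:n_results]]


# Hypothetical toy catalogue with two-dimensional feature vectors.
catalogue = [
    {"id": "a", "vec": [1.0, 0.0], "channel": "free", "is_new": False},
    {"id": "b", "vec": [0.9, 0.1], "channel": "free", "is_new": False},
    {"id": "c", "vec": [0.8, 0.2], "channel": "premium", "is_new": True},
    {"id": "d", "vec": [0.7, 0.3], "channel": "free", "is_new": True},
    {"id": "e", "vec": [0.0, 1.0], "channel": "free", "is_new": True},
]

print(recommend(catalogue[0], catalogue, package={"free"}))
```

With a small `k_candidates`, the second pass can easily leave fewer than `n_results` items, which mirrors the abstract's point that the similarity index often needs hints (such as the package constraint) pushed into the first pass.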

Billy Wallace is one of the founding developers at ThinkAnalytics. ThinkAnalytics produces the most widely deployed recommendations engine in the television industry, and was recently awarded an Emmy for its technology. Billy started his career as a researcher in the Intelligent Knowledge Based Systems Group at the University of Strathclyde, where he worked on various near-market projects concerning optimisation, machine learning, hypertext and document retrieval. After working in consulting and technical training, he helped to start ThinkAnalytics, which initially marketed a data mining platform that was successfully used by several tier-1 telecoms companies as well as AT&T Labs, Bayer Pharmaceuticals, Johns Hopkins APL, the Mayo Clinic in Florida and others. He worked on the project that took the data mining platform and leveraged it to build the recommendations engine that has become so successful. Billy set up ThinkAnalytics' Silicon Valley presence, and after a short spell at Google, rejoined ThinkAnalytics in their Los Angeles office before finally relocating back to headquarters in Glasgow to found and run the information science team there.