New Jersey Institute of Technology
Title: From Intrinsic Dimensionality to Chaos and Control: Towards a Unified Theoretical View
Researchers have long considered the analysis of similarity applications in terms of the intrinsic dimensionality (ID) of the data. Although traditionally ID has been viewed as a characterization of the complexity of discrete datasets, more recently a local model of intrinsic dimensionality (LID) has been extended to the case of smooth growth functions in general, and distance distributions in particular, from its first principles in terms of similarity, features, and probability. Since then, LID has found applications — practical as well as theoretical — in such areas as similarity search, data mining, and deep learning. LID has also been shown to be equivalent under transformation to the well-established statistical framework of extreme value theory (EVT). In this presentation, we will survey some of the wider connections between ID and other forms of complexity analysis, including EVT, power-law distributions, chaos theory, and control theory, and show how LID can serve as a unifying framework for the understanding of these theories. Finally, we will reinterpret recent empirical findings in the area of deep learning in light of these connections.
Michael Houle obtained his PhD degree in 1989 from McGill University in Canada, in the area of computational geometry. Since then, he developed research interests in algorithmics, data structures, and relational visualization, first at Kyushu University and the University of Tokyo in Japan, and from 1992 at the University of Newcastle and the University of Sydney in Australia. From 2001 to 2004, while at IBM Japan's Tokyo Research Laboratory, he first began working on approximate similarity search and shared-neighbor clustering methods for data mining applications. From 2004, at the National Institute of Informatics, Tokyo, his research interests expanded to include dimensionality and scalability in the context of fundamental AI / machine learning / data mining tasks such as search, clustering, classification, and outlier detection. In 2021, he relocated to Vancouver, BC, Canada. Currently he is with the New Jersey Institute of Technology in Newark, NJ, USA, and divides his time between Newark and Vancouver.
Title: The Rise of HNSW: Understanding Key Factors Driving the Adoption of Search Libraries in Machine Learning
As representation learning and large language models continue to evolve, the need for efficient similarity search techniques has grown exponentially in the last few years. HNSW has emerged as a leading algorithm for nearest neighbor search, finding applications in a diverse range of products such as Weaviate, Qdrant, Vespa, Milvus, Zilliz, Faiss, Elasticsearch, Redis and others. In this talk, we will explore the core principles and development of HNSW, as well as the key design decisions and factors that have contributed to its widespread adoption beyond its high performance. Through these insights, we aim to guide developers in creating innovative libraries and solutions to address the ever-increasing demand for efficient search libraries and machine learning tools in general.
Universidad Nacional de Educación a Distancia (UNED), Spain
Title: Towards a Universal Similarity Function: the Information Contrast Model and its Application as Evaluation Metric in Artificial Intelligence Tasks
Computing similarity involves at least two aspects: how to represent items, and how to compare item representations (similarity functions). Item representation is a task-dependent problem, but what about similarity functions? Is it possible to study the design of optimal similarity functions from a universal, application-free perspective? In the talk, we start by proposing a set of formal constraints on the space of permissible similarity functions for Information Access problems and comparing it with related axiomatic formulations of similarity in other fields (cognitive science and algebra). Then, we propose a new parameterized similarity function, ICM, which satisfies all constraints for a given range of values of its parameters. We discuss the usefulness of ICM in two very different application domains. The first is computing textual similarity under different application scenarios and representation paradigms, the task for which ICM was originally designed. The second lies outside that original scope: we show how ICM can be used as an evaluation measure in Artificial Intelligence that computes the similarity between system outputs and gold standards, and how it may bring formal and empirical advantages in this area.
Julio Gonzalo is director of the UNED Research Center in Natural Language Processing (NLP) and Information Retrieval (IR) and Deputy Vice-Rector of Research at UNED. Throughout his career he has worked on topics such as online reputation monitoring, Information Access technologies for Social Media, interactive cross-language search, toxicity and misinformation in Social Media, computational creativity and semantic similarity. He has also worked extensively on the design and assessment of evaluation metrics for a wide range of Artificial Intelligence problems, which led to a Google Faculty Research Award (together with Enrique Amigó and Stefano Mizzaro) for his work in this area. He has recently been co-chair of ACM SIGIR 2022 and co-chair of IberLEF, the annual evaluation campaign for NLP systems in Spanish and other Iberian languages (2019-2022).