SEBD 2024 Keynote Talks

Enhancing Data Precision with Large Language Models: Analyzing Failures and Innovating Database Curation

Georg Gottlob - University of Calabria, Italy

Abstract: The advent of Large Language Models (LLMs) such as ChatGPT represents a significant milestone in the AI revolution. This talk commences with an exploration of text-based generative AI tools, highlighting exemplary performances in producing elegantly crafted texts. However, LLMs often fail, particularly when tasked with generating precise data absent from established databases like Wikipedia. This phenomenon is critically examined through a “psychoanalysis” of LLMs that identifies fundamental causes for such failures and hallucinations. In response to these challenges, the second part of the talk introduces the Chat2Data method and system, an innovative framework designed to harness the capabilities of LLMs for the automatic generation, enrichment, and verification of databases and data sets. Chat2Data automatically generates sophisticated workflows that incorporate problem decomposition, strategic LLM querying, and meticulous analysis of responses. To refine reliability and accuracy, the system integrates supplementary technologies such as Retrieval-Augmented Generation (RAG), rule-based knowledge processors, and data-graph analysis. This comprehensive approach not only mitigates the pitfalls identified but also significantly advances the utility of LLMs in complex data environments.

About the Speaker: Georg Gottlob is a Professor of Computer Science at the University of Calabria. Until recently, he was a Royal Society Research Professor at the Computer Science Department of the University of Oxford, a Fellow of St John’s College, Oxford, and an Adjunct Professor at TU Wien. His interests include knowledge representation, database theory, query processing, web data extraction, and (hyper)graph decomposition techniques. Gottlob has received the Wittgenstein Award from the Austrian National Science Fund and the Ada Lovelace Medal (UK). He is a Fellow of the Royal Society, and a member of the Austrian Academy of Sciences, the German National Academy of Sciences, and the Academia Europaea. He was a founder of Lixto, a web data extraction firm acquired in 2013 by McKinsey & Company. In 2015 he co-founded Wrapidity, a spin-out of Oxford University based on fully automated web data extraction technology developed in the context of an ERC Advanced Grant. Wrapidity was acquired by Meltwater, an internationally operating media intelligence company. Gottlob then co-founded the Oxford spin-out DeepReason.AI, which provided knowledge graph and rule-based reasoning software to customers in various industries. DeepReason.AI was also acquired by Meltwater.


Detecting and Fixing Unfairness in Data-Driven Decision Making

H. V. Jagadish - University of Michigan, U.S.A.

Abstract: Algorithmic decision-makers can be unfairly biased for many reasons. Most of these have to do with the data. We consider challenges in the data used as well as in the results produced. We will look at multiple examples of how to detect unfairness and how to mitigate it when found.

About the Speaker: H. V. Jagadish is Edgar F. Codd Distinguished University Professor and Bernard A Galler Collegiate Professor of Electrical Engineering and Computer Science at the University of Michigan in Ann Arbor, and Director of the Michigan Institute for Data Science. Prior to 1999, he was Head of the Database Research Department at AT&T Labs, Florham Park, NJ. Professor Jagadish is well known for his broad-ranging research on information management, and has over 200 major papers and 38 patents, with an H-index of 101. He has been a Fellow of the ACM since 2003 and of the AAAS since 2018. He currently chairs the board of the Academic Data Science Alliance and previously served on the board of the Computing Research Association (2009-2018). He has been an Associate Editor for the ACM Transactions on Database Systems (1992-1995), Program Chair of the ACM SIGMOD annual conference (1996), Program Chair of the ISMB conference (2005), a trustee of the VLDB (Very Large DataBase) foundation (2004-2009), Founding Editor-in-Chief of the Proceedings of the VLDB Endowment (2008-2014), and Program Chair of the VLDB Conference (2014). Since 2016, he has been Editor of the Springer (previously Morgan & Claypool) Synthesis Lecture Series on Data Management. Among his many awards, he won the David E Liddle Research Excellence Award (at the University of Michigan) in 2008, the ACM SIGMOD Contributions Award in 2013, and the Distinguished Faculty Achievement Award (at the University of Michigan) in 2019. His popular MOOC on Data Science Ethics is available on both edX and Coursera.

H V Jagadish website


Scalable Vector Analytics: A Story of Twists and Turns

Themis Palpanas - Université Paris Cité, France

Abstract: Similarity search in high-dimensional data spaces was a relevant and challenging data management problem in the early 1970s, when the first solutions to this problem were proposed. Today, fifty years later, we can safely say that the exact same problem is more relevant (from Time Series Management Systems to Vector Databases) and challenging than ever. Very large amounts of high-dimensional data are now omnipresent (ranging from traditional multidimensional data to time series and deep embeddings), and the performance requirements (i.e., response time and accuracy) of a variety of applications that need to process and analyze these data have become very stringent and demanding. In these past fifty years, high-dimensional similarity search has been studied in its many flavors: similarity search algorithms for exact and approximate, one-off and progressive query answering; approximate algorithms with and without (deterministic or probabilistic) quality guarantees; solutions for on-disk and in-memory data, static and streaming data; and approaches based on multidimensional space-partitioning and metric trees, random projections and locality-sensitive hashing (LSH), product quantization (PQ) and inverted files, k-nearest neighbor graphs and optimized linear scans. Surprisingly, the work on data-series (or time-series) similarity search has recently been shown to achieve state-of-the-art performance for several variations of the problem, on both time-series and general high-dimensional vector data. In this talk, we will touch upon the different aspects of this interesting story, present some of the state-of-the-art solutions, and discuss open research directions.

About the Speaker: Themis Palpanas is an elected Senior Member of the French University Institute (IUF), a distinction that recognizes excellence across all academic disciplines, and Distinguished Professor of computer science at the Université Paris Cité (France), where he is director of the Data Intelligence Institute of Paris (diiP), and director of the data management group, diNo. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the University of California at Riverside, the University of Trento, and the IBM T.J. Watson Research Center, and has visited Microsoft Research and the IBM Almaden Research Center. His interests include problems related to data science (big data analytics and machine learning applications). He is the author of 14 patents. He is the recipient of 3 Best Paper awards, and the IBM Shared University Research (SUR) Award. His service includes the VLDB Endowment Board of Trustees (2018-2023), Editor-in-Chief for PVLDB Journal (2024-2025) and BDR Journal (2016-2021), PC Chair for IEEE BigData 2023 and ICDE 2023 Industry and Applications Track, General Chair for VLDB 2013, Associate Editor for the TKDE Journal (2014-2020), and Research PC Vice Chair for ICDE 2020.

Themis Palpanas website
