Welcome to the homepage of the DIG seminar, the regular seminar of the DIG team of LTCI at Télécom Paris. The seminar was formerly called "DBWeb seminar" and "IC2 seminar".

The seminar features talks by members of the team and guests from other research groups, as well as discussions on topics of relevance to the team. Talks are held at Télécom Paris, 19 place Marguerite Perey, 91120 Palaiseau (directions). The talks indicated as "online" take place at the URL advertised in the email announcements.

Attendance is open to the public, but please register in advance by emailing me at a3nm.seminar@a3nm.net if you are planning to attend.

You can subscribe to the mailing list of seminar announcements here (the Mailman interface is in French, but all emails are in English), or by sending an email with subject "subscribe" to dig-seminar-subscribe@listes.telecom-paris.fr and replying to the confirmation email. You can also subscribe to the seminar sessions as an iCalendar feed (e.g., with ICSx5 on Android) using the following URL: https://a3nm.net/work/seminar/calendar.ics

## 24 November 2022, 11:45, 4D19

Cyril Chhun
PhD Midterm Presentation - Automatic Story Generation and Evaluation

The main topic of my thesis was originally to study the representation of stories as logical chains of events, and it progressively leaned towards the tasks of automatic story generation (ASG) and evaluation. After briefly working on a graph-based model for ASG, I spent most of the first half of my PhD building HANNA, an annotated dataset for ASG evaluation, and performing a meta-evaluation of the current state of ASG evaluation. Among other things, our analysis highlighted the weaknesses of current measures and allowed us to formulate practical recommendations for ASG evaluation. Since we showed stronger automatic measures are needed for ASG, I am currently working on the design of a more robust measure of story quality. This thesis started in May 2021, with a scheduled defense in spring 2024. My supervisors are Fabian Suchanek and Chloé Clavel.

Bio: Cyril is a PhD student in the team.

## 28 October 2022, 11:45, 4D19

Léo Laugier
Analysis and Control of Online Interactions through Neural Natural Language Processing

Natural Language Processing is motivated by applications where computers should gain a semantic and syntactic understanding of human language. Recently, the field has been impacted by a paradigm shift. Deep learning architectures coupled with self-supervised training have become the core of state-of-the-art models used in Natural Language Understanding and Natural Language Generation. Sometimes considered as foundation models, these systems pave the way for novel use cases. Driven by an academic-industrial partnership between the Institut Polytechnique de Paris and Google AI Research, the present research has focused on investigating how pretrained neural Natural Language Processing models could be leveraged to improve online interactions.

This thesis first explored how self-supervised style transfer could be applied to the toxic-to-civil rephrasing of offensive comments found in online conversations. In the context of toxic content moderation online, we proposed to fine-tune a pretrained text-to-text model (T5) with a denoising and cyclic auto-encoder loss.

Then, a subsequent work investigated the human labeling and automatic detection of toxic spans in online conversations. We released a new labeled dataset to train and evaluate systems, which led to a shared task at the 15th International Workshop on Semantic Evaluation.

Finally, we developed a recommender system based on online reviews of items, contributing to the line of work on explaining the user tastes that the predicted recommendations take into account. The method uses textual semantic similarity models to represent a user's preferences as a graph of textual snippets, where the edges are defined by semantic similarity.

## 20 October 2022, 11:45, 4D19

Martín Muñoz, Pontificia Universidad Católica de Chile
Constant-delay enumeration for SLP-compressed documents

We study the problem of enumerating answers to a query over a compressed document. For our queries, we use a model called Annotated Automata, which is equivalent to Monadic Second-Order logic (with open second-order variables). For the compression, we use straight-line programs (SLPs), which are context-free grammars that produce only one string: the document to be compressed. We will show an algorithm whose preprocessing is linear in the size of the compression scheme and which enumerates the answers with constant delay. This is an improvement over Schmid and Schweikardt's result that, in the context of Regular Spanners, enumerates with a delay which is logarithmic in the size of the uncompressed document.
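
To make the compression model concrete, here is a minimal Python sketch of a straight-line program. It only illustrates what an SLP is (a grammar in which each nonterminal has exactly one rule, so it derives exactly one string) and how lengths and random access can be computed without decompressing; it is not the enumeration algorithm of the talk. The grammar `RULES` and the function names are hypothetical, chosen for illustration.

```python
# Hypothetical SLP: every nonterminal has exactly one rule, which is either
# a single terminal character or a pair of nonterminals.
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # derives "ab"
    "D": ("C", "C"),  # derives "abab"
    "S": ("D", "C"),  # derives "ababab"
}


def expand(symbol, rules):
    """Fully decompress the string derived from `symbol` (can be exponential)."""
    rule = rules[symbol]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(left, rules) + expand(right, rules)


def lengths(rules):
    """Length of each nonterminal's expansion, in time linear in the grammar."""
    memo = {}

    def length_of(symbol):
        if symbol not in memo:
            rule = rules[symbol]
            if isinstance(rule, str):
                memo[symbol] = 1
            else:
                memo[symbol] = length_of(rule[0]) + length_of(rule[1])
        return memo[symbol]

    for symbol in rules:
        length_of(symbol)
    return memo


def char_at(symbol, index, rules, length):
    """Return the character at `index` by descending the grammar, no decompression."""
    rule = rules[symbol]
    while not isinstance(rule, str):
        left, right = rule
        if index < length[left]:
            symbol = left
        else:
            index -= length[left]
            symbol = right
        rule = rules[symbol]
    return rule


LENGTH = lengths(RULES)
```

For instance, `char_at("S", 3, RULES, LENGTH)` walks down three rules instead of building the whole document, which is the kind of saving that algorithms over SLP-compressed documents rely on.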

## 11 October 2022, 11:45, 4D19

Tiphaine Viard, Télécom Paris
Looking at artificial intelligence as a socio-technical system

Artificial intelligence, and in particular machine learning, has been developing at an increasingly fast rate over the last decade; this is also true of its potential applicative settings, with so-called "algorithms" being used in a wide array of contexts for modelling and prediction.

We follow the view that these systems should be studied jointly, from both technical and social perspectives. Indeed, the technical issues these systems face shape their social implications, and vice versa. One may think, for instance, of how most natural language processing datasets are in English, or of how the technical limitations of recommender systems shape what users see.

The fields of Explainable AI (XAI) and AI Ethics have formed as a response to these questions. In this seminar, we will discuss two directions of work. The first one focuses on the current limitations and goals of interactive explanations. The second one focuses on the social processes in AI ethics that form at a mesosocial scale, describing the landscape of current AI ethics; we use as a corpus the plethora of AI ethics manifestos that have been published in the last few years, effectively landscaping current AI ethics.

The speaker is an associate professor in the Digital Technology, Organization and Society team at Télécom Paris. Her analytical standpoint is rooted in the interplay of machine learning, graph analysis, and neostructural sociology.

## 27 September 2022, 11:45, 4D19

Hoang Anh Ngo
Online clustering: algorithms, evaluation, metrics, applications and benchmarking using River (slides)

Online clustering algorithms play a critical role in data science, with advantages in time, memory usage and complexity, while maintaining high performance compared to traditional clustering methods. This tutorial serves, first, as a survey of online machine learning and, in particular, of data stream clustering methods. This introduction will then be put into context with River, a Python library for online machine learning that resulted from the merger of Creme and scikit-multiflow. We propose applications and settings for benchmarking, using real-world problems and datasets.

## 15 September 2022, 11:45, 4D19

Mehwish Alam (FIZ Karlsruhe - Leibniz Institute for Information Infrastructure)
Knowledge Graph Completion using Embeddings (slides)

Abstract: Knowledge Graphs (KGs) constitute a large network of real-world entities and relationships between these entities. KGs have recently gained attention in many tasks such as recommender systems, question answering, etc. Due to automated generation and open-world assumption, these KGs are never complete. Recent years have witnessed many studies on link prediction using KG embeddings which is one of the mainstream tasks in KG completion. To do so, most of the existing methods learn the latent representation of the entities and relations whereas only a few of them consider contextual information as well as the textual descriptions of the entities. This talk will particularly focus on an attentive encoder-decoder based link prediction approach, MADLINK, leveraging the contextual information of the entities as well as their descriptions. A newly created set of benchmark datasets for KG completion will also be introduced which is extracted from Wikidata and Wikipedia, named LiterallyWikidata. It has been prepared with the main focus on providing benchmark datasets for multimodal KG Embedding models. This talk will also give an overview of the methods for entity type prediction which is a subtask of KG completion and will discuss some future directions.

## 6 September 2022, 11:45, 4D19

Judith Jeyafreeda (University of Manchester)
MASK: A framework for de-identification of Medical Records

Abstract: Medical information is now increasingly stored digitally. Medical records of patients contain valuable information that can be used to improve our understanding of disease patterns, for example how frequent a particular disease is, and which treatments work best for which patients. However, a medical letter also includes sensitive information, that is, any information that can identify a patient or a doctor. This includes the name of the patient or doctor, the profession of the patient, age, place, etc. This information needs to be masked before researchers can have access to any medical letter. In this work, we propose a framework called MASK that identifies and masks such sensitive information. The framework offers a selection of NER algorithms, including CRF, BiLSTM and BERT, to identify entities, and two different approaches to mask these identified entities.

Bio: I am Judith Jeyafreeda. I am a Postdoc at the University of Manchester. I work within the “Assembling the data JigSaw” project which aims at analysing medical records to allow clinicians to interpret/predict the courses of medications and diagnosis easily. I use NER algorithms to help deidentify sensitive information on clinical records and identify diagnosis in medical texts. I finished my PhD on “Task oriented Web Page Segmentation” at the University of Caen, France. I used several clustering methods using task oriented criteria for this purpose. My research interests include Named Entity Recognition methods, Machine Learning for text analysis and Social Media text analysis for sentiment/hate analysis.

## 12 July 2022, 11:45, 4D19

Yael Amsterdamer
Consent Management for Databases

Abstract: Data sharing is commonplace on the cloud, in social networks and other platforms. To use the data fairly, following regulations such as GDPR, one often needs the consent of the data owners. The standard solution is to require this consent in advance, before data is provided to the system. However, specific uses of data are hard to anticipate, and therefore platforms often ask for coarse-grained consent that is either too broad or too restrictive. The problem is exacerbated when data of multiple owners is jointly used in a query, which makes it harder to identify whose consent is needed to use the query results. We propose an alternative, fine-grained model for consent management which involves identifying relevant data owners and frugally probing them for their consent for specific data uses. We study different aspects of the problem, including optimal and approximate probing strategies for the results of different classes of queries. Our solutions leverage techniques from different areas including provenance and interactive Boolean evaluation.
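
As a rough illustration of what probing for consent can look like, the Python sketch below treats the provenance of a query result as a Boolean formula in disjunctive normal form over data owners: the result may be used if, for at least one derivation, every owner involved consents. The greedy rule used here (ask the owner that occurs in the most unresolved derivations) is an assumption made for this example, not the strategy studied in the talk, and the names `probe_consent` and `ask` are hypothetical.

```python
from collections import Counter


def probe_consent(dnf, ask):
    """Decide whether a query result may be used, probing owners one at a time.

    dnf: list of sets of owner names; each set is one derivation of the result,
         usable only if every owner in it consents.
    ask: callable taking an owner name and returning True (consent) or False.
    Returns (decision, list of owners probed, in order).
    """
    clauses = [set(clause) for clause in dnf]
    probes = []
    while True:
        # Some derivation has all of its owners' consent: the result is usable.
        if any(len(clause) == 0 for clause in clauses):
            return True, probes
        # Every derivation was refuted by a denial: the result is not usable.
        if not clauses:
            return False, probes
        # Greedy heuristic (an assumption of this sketch): probe the owner
        # occurring in the most unresolved derivations, ties broken by name.
        counts = Counter(owner for clause in clauses for owner in clause)
        owner = min(counts, key=lambda o: (-counts[o], o))
        probes.append(owner)
        if ask(owner):
            for clause in clauses:
                clause.discard(owner)
        else:
            clauses = [clause for clause in clauses if owner not in clause]


# Example: the result can be derived from (alice, bob) or from (alice, carol).
provenance = [{"alice", "bob"}, {"alice", "carol"}]
answers = {"alice": True, "bob": False, "carol": True}
decision, asked = probe_consent(provenance, answers.__getitem__)
```

A single denial from `alice` would settle the question after one probe, which shows why the order of probes matters when the goal is to disturb owners as little as possible.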

Short bio: Yael Amsterdamer is a Senior Lecturer at the Department of Computer Science, Bar-Ilan University (Ramat-Gan, Israel), and the head of the Data Management Lab. She received her Ph.D. in Computer Science from Tel-Aviv University, and has been a visiting scholar at the University of Pennsylvania, Philadelphia, PA and jointly at Télécom Paris and INRIA (Paris, France). Her research is in the field of data management, spanning topics such as crowd-powered data management, provenance and interactive summarization. She has won competitive grants such as those of the Israeli Science Foundation and Ministry of Science, and regularly serves on program committees of top conferences.

## 22 June 2022, 11:45, 4D19

Étienne Houzé
XAI for the smart home (PhD rehearsal)

Abstract: Smart homes are Cyber-Physical Systems where various components cooperate to fulfill high-level goals such as user comfort or safety. These autonomic systems can adapt at runtime without requiring human intervention. This adaptation is hard to understand for the occupant, which can hinder the adoption of smart home systems.

Since the mid-2010s, explainable AI has been a topic of interest, aiming to open the black box of complex AI models. The difficulty of explaining autonomic systems does not come from the intrinsic complexity of their components, but rather from their self-adaptation capability, which leads to changes of configuration, logic or goals at runtime. In addition, the diversity of smart home devices makes the task harder.

To tackle this challenge, we propose to add an explanatory system to the existing smart home autonomic system, whose task is to observe the various controllers and devices to generate explanations. We define six goals for such a system. 1) To generate contrastive explanations in unexpected or unwanted situations. 2) To generate a shallow reasoning, whose different elements are causally closely related to each other. 3) To be transparent, i.e. to expose its entire reasoning and which components are involved. 4) To be self-aware, integrating its reflective knowledge into the explanation. 5) To be generic and able to adapt to diverse components and system architectures. 6) To preserve privacy and favor locality of reasoning.

Our proposed solution is an explanatory system in which a central component, named the "Spotlight", implements an algorithm named D-CAS. This algorithm identifies three elements in an explanatory process: conflict detection via observation interpretation, conflict propagation via abductive inference, and simulation of possible consequences. All three steps are performed locally, by Local Explanatory Components which are sequentially interrogated by the Spotlight. Each Local Component is paired with an autonomic device or controller and acts as an expert in the related knowledge domain. This organization enables the addition of new components, integrating their knowledge into the general system without need for reconfiguration. We illustrate this architecture and algorithm in a proof-of-concept demonstrator that generates explanations in typical use cases. We design Local Explanatory Components to be generic platforms that can be specialized by the addition of modules with predefined interfaces. This modularity enables the integration of various techniques for abduction, interpretation and simulation. Our system aims to handle unusual situations in which data may be scarce, making abduction methods based on past occurrences inoperable. We propose a novel approach: to estimate the memorability of events and use the most memorable ones as relevant hypotheses for a surprising phenomenon.

Our high-level approach to explainability aims to be generic and paves the way towards systems integrating more advanced modules, guaranteeing smart home explainability. The overall method can also be used for other Cyber-Physical Systems.

Bio: Étienne Houzé is a PhD student at DIG. He started his PhD in 2019, working with Jean-Louis Dessalles and Ada Diaconescu on Explainable AI, in collaboration with EDF. His work focuses on finding innovative tools to explain abnormal situations that occur in smart homes. His PhD defense will be held on the 24th of June, 2022, at Télécom.

## 26 April 2022, 11:45, 4D19

Tian Tian
Tailoring a Controlled Language Out of a Corpus of Maintenance Reports (slides)

Abstract: A controlled natural language is a language that is intermediate between natural language (such as English, French, etc.) and formal language. Compared to natural languages, controlled natural languages are often more precise and unambiguous and can be parsed or translated more easily. At the same time, controlled natural languages are more intelligible to humans than formal languages.

In the project "Learn AI", we deal with two kinds of synchronous data: parallel time series and textual data. We try to merge these two parts, extract a knowledge base and then predict anomalies. This presentation focuses only on the textual part: the corpus consists of a span of three years of daily maintenance reports of a thermal power station, written by maintenance technicians. The specialized language used in these reports has a limited vocabulary, often using short expressions rather than complete sentences and it contains lots of abbreviations, spelling errors, etc.

In this presentation, we introduce a method for tailoring a controlled natural language out of this specialized language, as well as for training the user to ensure a smooth transition between the specialized and the controlled natural language. Our method is based on the selection of maximal coverage syntax rules. The number of rules chosen induces the level of naturalness or formality of the generated controlled natural language. We also propose a training tool that displays segmentation into left-to-right maximal parsed sentences and allows utterance modification by the user until a complete parsing is achieved. We have applied our method to the French corpus of maintenance reports of boilers in a thermal power station and provide coverage and segmentation results.

Bio: Dr. Tian Tian has been a postdoctoral researcher in the Computer Science Department at IMT Atlantique in Brest, France since January 2021. Before that, she worked as ATER (Attaché temporaire d'enseignement et de recherche, equivalent to an assistant teacher) at laboratory STIH (Sens, Texte, Informatique, Histoire), Sorbonne Université in Paris, France. She obtained her PhD degree in 2019 from laboratory Lattice (Langues, Textes, Traitements informatiques, Cognition) of university Sorbonne Nouvelle in Paris, France. Her interests cover all parts of NLP (Natural Language Processing) along with Machine Learning methods. She has published around ten papers as the first author or corresponding author. The paper titled "Tailoring a Controlled Language Out of a Corpus of Maintenance Reports" received the best paper award of the conference CNL (Controlled Natural Language) 2021. She currently works on the "Learn AI (Artificial Intelligence)" project on knowledge extraction by textual analysis of industrial device maintenance reports and merging with time-series data.

## 12 April 2022, 11:45, 4D19

Antonio Penta

Abstract: Knowledge remains a key challenge for organizations, and its creation and management are often an afterthought in highly accelerated product/service development cycles. Despite great efforts in both academic and industrial research on this topic, extracting meaningful knowledge from data remains a challenge. In this talk, I will first introduce some of my recent research projects related to this domain, and then describe how we approached the problem of extracting actionable information from voice data to improve customer service. I will also illustrate some of the research results that are the core of this work.

Bio: Antonio Penta is Senior Principal at Accenture The Dock, which is Accenture's flagship R&D and global innovation center. Currently, he leads several research projects related to decision support systems based on heterogeneous data. His R&D has impacted many real-world systems used around the world today. Prior to The Dock, he was a Senior Research Scientist at Raytheon Technologies working as PI in several EU-funded projects on advanced multimedia systems for security and safety applications. He also had several research fellowships in Italian and UK universities. He holds a PhD degree in Computer Science and Engineering from the University of Naples "Federico II" in Italy. His research interests include multimedia information extraction, multimedia knowledge management, and reasoning.

## 5 April 2022, 11:45, 4D19

Jieying Chen
Knowledge Extraction from Description Logic Terminologies and its Applications in Industry (slides)

Abstract: An increasing number of large ontologies are being developed and made available, e.g., in repositories such as the NCBO Bioportal. Ensuring access to the knowledge contained in ontologies that is most relevant to users has been identified as an important challenge. In this work, we tackle this challenge by proposing different approaches to extracting knowledge from Description Logic ontologies: computing the intersection and union of all justifications (the intersection and union of minimal sub-ontologies that preserve the given logical consequence); extracting minimal ontology modules (i.e., sub-ontologies that are minimal w.r.t. set inclusion while still preserving all entailments over a given vocabulary); computing best ontology excerpts (a certain, small number of axioms that best capture the knowledge about the vocabulary while allowing for a degree of semantic loss); and determining projection modules (sub-ontologies of a target ontology that entail the subsumptions, instance queries or conjunctive queries that follow from a reference ontology). For each of these approaches, we are interested in extracting not only one but all instances of the module notion. For computing minimal modules and best excerpts, we introduce the notion of subsumption justification as a generalisation of the notion of a justification (a minimal set of axioms needed to preserve a given logical consequence) to capture the subsumption knowledge over the vocabulary. Similarly, for computing projection modules, we introduce the notion of projection justifications that preserve the answers to one of three query types as given by a reference ontology. A prototype implementation of the algorithms was evaluated on large ontologies. Finally, we introduce the applications of the proposed approaches, semantic technologies and machine learning in industry, i.e., in the digital health, manufacturing and energy sectors.

Bio: Dr. Jieying Chen is a senior lecturer at the University of Oslo (Norway), where she is the coordinator of the course IN3060/IN4060 Semantic Technologies in 2021 and 2022. She is also a researcher at SIRIUS, University of Oslo (Norway). SIRIUS is a research innovation centre that bridges academia and industry in energy, oil and gas and other areas. From March 2019 to August 2019, she was a research associate at The University of Manchester (UK) and worked on the EPSRC IAA Project "Comparison and Abstraction of SNOMED CT Ontologies" in collaboration with SNOMED International, a non-profit organization that owns and develops one of the most widely used medical terminologies, SNOMED CT. Before that, she was a PhD student in the LaHDAK group at LRI, Université Paris-Saclay/Université Paris-Sud (France). In June 2018, she was selected to attend the French American Doctoral Exchange Seminar (FADEx) 2018 organised by Mission pour la Science et la Technologie aux États-Unis. From October 2013 to March 2014, she was an exchange student in the international masters program of computational logic at TU Dresden (Germany) funded by the European program Erasmus Mundus. She has been involved in both theoretical and industrial projects in semantic technologies and machine learning.

## 29 March 2022, 11:45, 4D19

Rita Ribeiro
Imbalanced regression and extreme value prediction (slides)

Abstract: Research in imbalanced domain learning has almost exclusively focused on solving classification tasks for accurate prediction of cases labelled with a rare class. Approaches for addressing such problems in regression tasks are still scarce due to two main factors. First, standard regression tasks assume each domain value as equally important. Second, standard evaluation metrics focus on assessing the performance of models on the most common values of data distributions. In this paper, we present an approach to tackle imbalanced regression tasks where the objective is to predict extreme (rare) values. We propose an approach to formalise such tasks and to optimise/evaluate predictive models, overcoming the factors mentioned and issues in related work. We present an automatic and non-parametric method to obtain relevance functions, building on the concept of relevance as the mapping of target values into non-uniform domain preferences. Then, we propose SERA, a new evaluation metric capable of assessing the effectiveness and of optimising models towards the prediction of extreme values while penalising severe model bias. An experimental study demonstrates how SERA provides valid and useful insights into the performance of models in imbalanced regression tasks.
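
To give an idea of how a relevance-weighted metric of this kind can be computed, here is a small Python sketch. It assumes that SERA is the area, over relevance thresholds t in [0, 1], of the sum of squared errors restricted to the cases whose relevance is at least t; this reading, the trapezoidal approximation, and the function name `sera` are assumptions of the sketch rather than something stated in the abstract. The relevance values are taken as given, since the automatic method for deriving them is not reproduced here.

```python
def sera(y_true, y_pred, relevance, steps=1000):
    """Approximate a squared-error-relevance area with the trapezoidal rule.

    y_true, y_pred: sequences of target and predicted values.
    relevance: sequence of values in [0, 1], one per case (assumed given).
    steps: number of intervals used to discretise the threshold t.
    """
    if not (len(y_true) == len(y_pred) == len(relevance)):
        raise ValueError("all inputs must have the same length")
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]

    def ser(threshold):
        # Sum of squared errors over cases at least as relevant as the threshold.
        return sum(e for e, r in zip(squared_errors, relevance) if r >= threshold)

    values = [ser(k / steps) for k in range(steps + 1)]
    step = 1.0 / steps
    # Trapezoidal rule on a uniform grid over [0, 1].
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))
```

Under this reading, an error on a case with relevance 1 is counted at every threshold, while an error on a case with relevance 0 is counted only at t = 0 and so contributes almost nothing, which is how the metric emphasises extreme values without ignoring the rest of the distribution.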

Bio: Rita P. Ribeiro is an Assistant Professor at the Department of Computer Science of the Faculty of Sciences of the University of Porto (FCUP) and a Researcher at the Artificial Intelligence and Decision Support Lab (LIAAD) of the Institute for Systems and Computer Engineering, Technology and Science (INESCTEC). She holds a PhD in Computer Science from the University of Porto. Her main research topics are imbalanced domain learning, outlier detection, evaluation issues in learning tasks and problems related to social good. She has been involved in several research projects concerning environmental problems, fraud detection, and predictive maintenance applications. She is a member of the program committee of several conferences, serves as a reviewer for several journals, and has been involved in the organization of some scientific events. Currently, she is also the director of the Masters in Data Science at FCUP.

## 25 March 2022, 11:45, 4D19

Marianna Girlando
Conditional logics: from models to automated reasoning (slides)

Abstract: Conditional logics are defined by adding to classical propositional logic a binary modal operator expressing fine-grained notions of conditionality, such as counterfactual and non-monotonic inferences. In this talk, I will introduce the family of conditional logics and their semantics, defined in terms of neighbourhood models. Then, I will present different kinds of proof systems for the logics, and show how they can be implemented to design automated reasoning tools for conditional logics. This talk is based on research conducted during my Ph.D. thesis: https://hal.archives-ouvertes.fr/tel-02077109v1

## 21 March 2022, 11:45, 4D19

Modeling Implicit Learning: Extracting Implicit Rules from Sequences using LSTM (slides)

Abstract: Humans acquire different kinds of knowledge employing different types of memory systems. Implicit knowledge is a non-expressible knowledge of which the individual is not aware and that is acquired through implicit learning. The main characteristics of implicit learning are: a) encoded rules cannot be categorized explicitly, b) it impacts the subsequent reasoning process when new rules are encoded, c) there is no notion of positive or negative example learned through the implicit learning ability in the case of humans, d) the knowledge, i.e., the rules, is hidden in the temporal expression of behavior and more specifically in sequences of behaviourally significant events. In this presentation, I will present a methodology for extracting structured knowledge from data corresponding to sequences of behavior. The hypothesis is that this structured knowledge reflects the expression of skills acquired by implicit learning. With a connectionist approach, I will focus on RNNs with Long Short-Term Memory (LSTM) and describe results on three different kinds of datasets: (i) sequences generated using the Reber grammar and its variations, which are used in cognitive psychology experiments to study the implicit learning ability in humans, (ii) sequences of electrical components from an industrial partner and (iii) sequences of tokens from Java code.
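
For readers unfamiliar with the first kind of dataset, the Python sketch below generates sequences from the Reber grammar, written as a small finite-state machine. The transition table is the version commonly used in the LSTM literature; whether the talk uses exactly this variant is an assumption, and the names `REBER`, `generate` and `accepts` are chosen for this example.

```python
import random

END = 7  # accepting state, reached after the final "E"

# state -> list of (emitted symbol, next state)
REBER = {
    0: [("B", 1)],
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
    6: [("E", END)],
    END: [],
}


def generate(rng):
    """Random walk through the automaton, choosing uniformly among transitions."""
    state, symbols = 0, []
    while state != END:
        symbol, state = rng.choice(REBER[state])
        symbols.append(symbol)
    return "".join(symbols)


def accepts(string):
    """Check whether a string is grammatical (the automaton is deterministic)."""
    state = 0
    for symbol in string:
        moves = dict(REBER[state])
        if symbol not in moves:
            return False
        state = moves[symbol]
    return state == END


rng = random.Random(0)
samples = [generate(rng) for _ in range(5)]
```

A sequence model trained to predict the next symbol of such strings has to learn the hidden states implicitly, since the same symbol (for example "T" or "X") allows different continuations depending on where in the grammar it was emitted.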

Bio: Postdoctoral researcher in the DECIDE (Decision AID and Information Discovery) team of the Lab-STICC UMR 6285 CNRS and IMT Atlantique (Brest, France), Ikram is currently working on knowledge extraction from heterogeneous time series representing complex systems, using and designing eXplainable artificial intelligence (XAI) models. Ikram did her Ph.D. in computer science within the Mnemosyne team (an Inria project-team in the Neurosciences and Digital Medicine theme) and the Neurodegenerative Diseases Institute, on the design and implementation of interpretability algorithms inspired by human cognition. The core of her research work focuses on the extraction of the mental representation of experts from data streams rich in implicit information.

## 8 March 2022, 11:45, 0D20

Victor David (CNAM)
Dealing with Similarity in Argumentation + Temporal Parametric Semantics from Knowledge Graph and Ontology (slides)

Presentation: In this seminar, I will start by introducing what argumentation systems are and why the notion of similarity between arguments is important. Then I will briefly cover the part of my thesis on evaluating the similarity between arguments, in order to spend more time on how to integrate these measures in the evaluation of arguments, and to make the link with my current research work on the evaluation of temporal Markov logic network interpretations (built from knowledge bases and ontologies).

Abstract: Argumentation systems are approaches of reasoning based on the justification of claims by arguments. Thus, arguments represent the reasons for accepting these statements. Since the late 1980s, argumentation systems have attracted a lot of interest in the artificial intelligence community, especially as a unifying approach to non-monotonic reasoning. They were then used to solve different problems such as reasoning in the presence of inconsistent information, decision making, negotiation, persuasion, etc. Argumentation also has several practical applications, notably in the legal and medical fields. Whatever the problem to be solved, an argumentation process usually follows three main steps:

1. Generate an argumentation framework,
2. Evaluate the strength of the arguments, and
3. Conclude a final result.

In my thesis, we discussed the notion of similarity between arguments. We looked at two aspects: how to measure it and how to take it into account in the evaluation of the strengths of the arguments. Concerning the first aspect, we were interested in logical arguments, more precisely arguments built from propositional knowledge bases. We started by proposing a set of axioms that a similarity measure between logical arguments must satisfy. Then, we proposed different measures and studied their properties.

The second part of the thesis consisted in defining the theoretical foundations that describe the principles and processes involved in the definition of an evaluation method of arguments considering similarity. Such a method computes the strength of an argument based on the strengths of its attackers, the similarities between them, and an initial weight of the argument. Formally, an evaluation method is defined by three functions, one of which, called "adjustment function", is in charge of readjusting the strengths of the attackers according to their similarity. We have proposed properties that the three functions must satisfy, then we have defined a large family of methods and studied their properties. Finally, we defined different adjustment functions, showing that different strategies can be followed to circumvent the redundancy that may exist between the attackers of an argument.

In the context of my post-doctorate (multidisciplinary between computer science and history), I am working in collaboration with researchers specialised in the field of databases. The objective of this project is to develop an inference tool to answer historians' queries from databases. The problem is that the information is imprecise, uncertain, temporal and inconsistent. My first contribution, which is currently under submission to IJCAI, consists in improving the following reasoning process: starting from an uncertain and temporal knowledge graph (e.g. in RDF format) and an ontology with uncertain and temporal rules, these are translated into a logical formalism (i.e. a temporal Markov logic network) in order to infer and determine the most probable, extended and consistent knowledge set.

## 1 March 2022, 11:45, 4D19

Maria Boritchev (Mathematical Institute of the Polish Academy of Sciences)
Dialogue Modeling in a Dynamic Framework (slides)

Abstract: Formal studies of discourse raise numerous questions on the nature and the definition of the way consecutive sentences coherently combine with one another. Language is intrinsically dynamic: in its semantics in context (e.g. use of references) and in the interaction (e.g. connections between dialogue acts). The shift from discourse to dialogue brings forward even more specific issues, among which are those related to the articulation of questions and answers. In order to address these issues, we start by focusing on questions from a semantic point of view. There are numerous existing formalisms and frameworks for formal semantics of declarative sentences and discourse; dialogue is broadly studied from a linguistic and Natural Language Processing point of view. The goal of the work presented in this talk is to bring classical formal semantics theories to use in a setting oriented towards real-life dialogue. We produce models of dialogue and in particular of the articulation of questions and answers by mingling Neo-Davidsonian Event Semantics (NDES, as presented in Champollion, 2017) with Inquisitive Semantics (IS, Ciardelli et al., 2017) in a compositional and dynamic way through the use of Continuation Style Dynamic Semantics (CSDS, de Groote, 2006, extended in Lebedeva, 2012). Our model is rooted in a syntax-semantics interface implementation called Abstract Categorial Grammars (ACG, de Groote, 2001).

Bio: My name is Maria Boritchev, and I am currently a postdoc at the Mathematical Institute of the Polish Academy of Sciences in Warsaw, Poland, where I work on hybrid models of natural reasoning. I recently defended my thesis, entitled Dialogue Modeling in a Dynamic Framework, conducted under the supervision of Maxime Amblard and Philippe de Groote at LORIA, Inria Nancy Grand-Est. My thesis work involved the development of several formal semantic models of dialogue and the creation of a transcribed corpus of spontaneous spoken French. I also worked on the development and application of several cross-lingual annotation schemes of dialogue, in English, French, Italian, Spanish and Mandarin Chinese. Nowadays, I focus on the application of neural-network-based methods to the selection of premises for automatic provers, in an attempt to test the capacity of neural networks to model human reasoning in limited natural-language settings.

## 22 February 2022, 11:45, 4D19

Léo Laugier (Télécom Paris)
Semantic Encoding of Review Sentences for Memory-Based Recommenders (slides)

Abstract: We explore a novel use of review text to represent user preferences for rating prediction. The approach leverages textual semantic similarity models to represent a user's preferences as a graph of textual snippets, where the edges are defined by semantic similarity. This textual, memory-based approach to rating prediction offers the promise of improved explanations for recommendations. The method is evaluated quantitatively, highlighting that leveraging text in this way can outperform both memory-based and model-based collaborative filtering baselines.

Bio: Léo Laugier is a PhD student in the team.

## 28 October 2021, 11:45, 4D19

Cedric Kulbach (Karlsruher Institut für Technologie)
Online Automated Machine Learning (slides)

Abstract: The ever-growing demand for machine learning has led to the development of automated machine learning (AutoML) systems that can be used off the shelf by non-experts. Further, the demand for ML applications with high predictive performance exceeds the number of ML experts and makes the development of AutoML systems necessary. Automated Machine Learning tackles the problem of finding machine learning models with high predictive performance. Existing approaches assume that all data is available at the beginning of the training process. They configure and optimize a pipeline of preprocessing, feature engineering, and model selection by choosing suitable hyperparameters in each pipeline step.

By training ML models incrementally, the flood of data can be processed sequentially within data streams. However, if one assumes this streaming scenario, where an AutoML instance executes on evolving data streams, the question of the best model and its configuration remains open. In this talk, we address the problem of adapting ML pipelines and their configurations to evolving data streams.

Bio: I am a PhD student at the Karlsruhe Institute of Technology in Germany, currently in the third year of my PhD. In my research and project work so far, I have been working on the topic of Automated Machine Learning (AutoML). In addition to the possibility of automating ML pipelines, these systems should also lead to personalized suggestions. One of my research areas is therefore the personalization of AutoML. Building on this idea, my focus during this exchange lies in the adaptation of AutoML to data streams. Besides research, I enjoy cycling and meeting people, and I have discovered rowing as my passion.

## 14 October 2021, 11:45, 4D19

Stefano Zacchiroli
Software Heritage: Analyzing the Global Graph of Public Software Development (slides)

Abstract: The Software Heritage project has assembled the largest existing archive of publicly available software source code and associated development history, for more than 10 billion unique source code files and 2 billion unique commits, coming from more than 160 million software development projects.

In this talk we will review the project background and current status with a focus on its graph-based data model and its research applications. The archive is a Merkle DAG whose nodes stand for source code development artifacts such as source files, code trees, commits, releases, and version control system (VCS) snapshots. The graph is typed, fully deduplicated, and global, making it possible to keep track of all the different places (e.g., different VCS repositories) from which a given artifact has been distributed. The graph is big, with about 200 billion edges and 20 billion nodes, and it is growing exponentially, doubling every 2 years. The graph network topology and growth dynamics are being studied, but are still largely unknown at this stage.
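As a rough illustration of why a Merkle DAG is deduplicated by construction, here is a minimal Python sketch. It is not the Software Heritage data model or identifier scheme: the `MerkleStore` class, the `node_id` helper, and the use of SHA-256 over a kind prefix are all assumptions made for the example. The point is only that a node's identifier is derived from its content and from the identifiers of its children, so identical artifacts found in different places collapse to one node.

```python
import hashlib


def node_id(kind: str, payload: bytes) -> str:
    """Hypothetical content-derived identifier: hash of the node kind and payload."""
    return hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()


class MerkleStore:
    """Toy content-addressed store; not the actual Software Heritage schema."""

    def __init__(self):
        self.nodes = {}  # identifier -> (kind, payload)

    def add_file(self, content: bytes) -> str:
        nid = node_id("file", content)
        self.nodes[nid] = ("file", content)  # re-adding the same content is a no-op
        return nid

    def add_dir(self, entries: dict) -> str:
        # The payload lists (name, child id) pairs in sorted order,
        # so two directories with equal contents get the same identifier.
        payload = "\n".join(
            f"{name} {child}" for name, child in sorted(entries.items())
        ).encode()
        nid = node_id("dir", payload)
        self.nodes[nid] = ("dir", payload)
        return nid


store = MerkleStore()

# Two "repositories" holding byte-identical trees.
repo_a = store.add_dir({
    "README": store.add_file(b"hello\n"),
    "main.c": store.add_file(b"int main(){}\n"),
})
repo_b = store.add_dir({
    "README": store.add_file(b"hello\n"),
    "main.c": store.add_file(b"int main(){}\n"),
})
nodes_after_two_repos = len(store.nodes)

# A fork that changes one file shares the unchanged README node.
repo_c = store.add_dir({
    "README": store.add_file(b"hello\n"),
    "main.c": store.add_file(b"int main(){return 1;}\n"),
})
```

In this toy store, the two identical trees occupy three nodes in total (two files and one directory), and the fork adds only two more (the changed file and the new directory).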

We will discuss the state-of-the-art of operating, analyzing, and querying the Software Heritage graph, as well as early results in applying graph compression techniques to it to make it more easily exploitable. We will conclude with an in-depth discussion of open questions, challenges, and actionable research directions.

Bio: Stefano Zacchiroli is a full professor of computer science at Télécom Paris, Polytechnic Institute of Paris. His current research interests span digital commons, open source software engineering, computer security, and the software supply chain. He is co-founder and CTO of Software Heritage, the largest public archive of software source code. He has been a Debian developer since 2001, and served as Debian project leader from 2010 to 2013. He is a former board director of the Open Source Initiative (OSI) and recipient of the 2015 O’Reilly Open Source Award.

## 23 September 2021, 12:30, 4D19

Cedric Kulbach (Karlsruher Institut für Technologie)
Online Automated Machine Learning

Note: as the speaker was ultimately unavailable, the seminar became an informal round-table discussion and the talk was rescheduled to October 28.

## 9 September 2021, 13:00, 4D19

Nils Holzenberger

Abstract: NLP research has produced increasingly powerful models for language understanding, which hold the promise of automating the processing of large corpora and yielding new insights by connecting seemingly unrelated documents. This ability is particularly relevant for tax law: some companies manage to pay less taxes than expected, by leveraging interactions between tax regulations unforeseen by the lawmaker. For an NLP system to discover these loopholes automatically, we need models able to understand the logic behind legal rules. First, I will describe the task of statutory reasoning, which asks whether a given tax law statute applies to a given case. We designed and constructed the SARA dataset as a test bed for a computational model's understanding of tax law. We built baselines using state-of-the-art, language-model based machine reading, and found that statutory reasoning is a serious challenge. We further decompose statutory reasoning as a set of language understanding problems, which connect to existing NLP tasks, and obtain improvements over prior baselines. Our decomposition into subtasks facilitates finer-grained model diagnostics and clearer incremental progress. Beyond the legal domain, progress on statutory reasoning has the potential to inform the design of NLP models able to utilize prescriptive rules stated in natural language.

Bio: Nils Holzenberger is a fifth-year PhD student at Johns Hopkins University, advised by Prof. Benjamin Van Durme. Affiliated with the Center for Language and Speech Processing, his research interests center on NLP for the legal domain, and currently focus on models able to reason with prescriptive rules specified in natural language.

## 29 July 2021, 11:45, 4D19

Shrestha Ghosh
Count Information in Knowledge Bases and Text (slides)

Abstract: In this talk I'll introduce count information derived from relations shared between an entity and a set of entities, how it is different from numerical information and how it could be useful for KB recall assessment and downstream tasks like question answering. I will talk about how to identify count predicates in a KB (DBpedia and Wikidata) and align related count predicates. We will also go over extracting count information from text in the context of question answering.

Bio: Shrestha Ghosh is a third-year PhD student in the Databases and Information Systems group at the Max Planck Institute for Informatics, Saarbruecken, where she works on Knowledge Base recall and NLP. She completed her master's in Computer Science at Saarland University in 2019. She briefly worked as a Systems Engineer in India after receiving her bachelor's degree in 2016.

## 15 July 2021, 12:00, 4D19

Using contrast to generate relevant descriptions

Abstract: We focus on the generation of insightful descriptions of a given entity, considering both the context, if any, and the recipient's level of expertise in the field in question. We propose an algorithm that considers the complexity of the description as defined by Kolmogorov and that seeks out descriptions based on the most relevant characteristics of the entity. This selection is made through the concept of contrast (comparison of an object to a prototype to identify the characteristics that make it stand out) as opposed to a criterion that would aim at ruling out as many objects as possible that could be characterized by our proposed description. We present the output of our algorithm when describing a movie or a track.

Étienne Li
The mystery of aspectual relations: How close to successful implementation (slides)

Abstract: In linguistics, aspect relates to the different ways a situation extends over time (Is it ongoing? Is it contained in another situation? ...). In a given language, aspectual processing is an almost trivial task for any native speaker. However, this process is still out of reach for artificial intelligence today. Most existing models of natural language processing using neural networks lack parsimony (in a "minimum description length" sense). In our approach, we give priority to this parsimony principle. We present a parsimonious model of aspectual processing. Syntagms from a sentence are converted into fixed-size structures with a limited number of attributes. Those structures are then processed by a limited number of unary (two of them) and binary (four of them) operations, responsible for merging the structures. We test our model on corpora in three different languages: French, English, and Chinese. To our knowledge, it is the first study on the aspectual processing of those three languages with a single parsimonious model.

Bio: Hanady and Étienne are DIG interns from École Polytechnique (3A).

## 8 July 2021, 11:00, online

Rodrigo Mello
On the Complexity of Labeled Datasets

Abstract: Statistical Learning Theory (SLT) provides the foundation to ensure that a supervised algorithm generalizes the mapping f: X -> Y, given that f is selected from its search space bias F. SLT depends on the Shattering coefficient function N(F,n) to upper bound the empirical risk minimization principle, from which one can estimate the necessary training sample size to ensure probabilistic learning convergence and, most importantly, characterize the capacity of F, including its underfitting and overfitting abilities while addressing specific target problems. However, the analytical solution of the Shattering coefficient has remained an open problem since the first studies by Vapnik and Chervonenkis in 1962. We address it in this talk by employing equivalence relations from Topology, the data separability results by Har-Peled and Jones, and combinatorics. Our approach computes the Shattering coefficient for both binary and multi-class problems, leading to the following additional contributions: (i) the estimation of the required number of hyperplanes in the best and worst-case classification scenarios; (ii) the estimation of the training sample sizes required to ensure supervised learning; and (iii) the comparison of dataset embeddings, once they (re)organize samples into some new space configuration. All results introduced and discussed throughout this talk are supported by the R package shattering.

Bio: Rodrigo Fernandes de Mello is Chief Data Scientist at Itaú Unibanco SA, supporting more than 260 data scientists in the largest bank in Latin America, with 56 million customers, 34 million credit cards, more than 4 thousand transactions per second, and 70 petabytes of data currently available for Data Science and Machine Learning strategies. His interests include Statistical Learning, Time Series Analysis, and several other application domains.

## 24 June 2021, 11:00, online

Andrian Putina
Random Histogram Forest for Unsupervised Anomaly Detection (slides)

Abstract: Anomaly detection consists of identifying instances whose features significantly deviate from the rest of the input data. It is one of the most widely studied problems in unsupervised machine learning, boasting applications in network intrusion detection, healthcare and many others. Several methods have been developed in recent years; however, to the best of our knowledge, a satisfactory solution is still missing. In particular, most of the methods presented recently exploit sampling techniques to detect anomalies, but this can be counterproductive in large datasets in which the anomalies are few. Moreover, in high-dimensional datasets most of the features may prove to be irrelevant and consequently damage the performance of the algorithms. We present Random Histogram Forest, an effective approach for unsupervised anomaly detection. Our approach is probabilistic, which has been proved to be effective in identifying anomalies. Moreover, it employs the fourth central moment (aka kurtosis), so as to guide the detection towards the most relevant dimensions. We conduct an extensive experimental evaluation on 38 publicly available datasets including all benchmarks for anomaly detection and 64 private datasets, as well as the most successful algorithms for unsupervised anomaly detection, to the best of our knowledge. We evaluate all the approaches in terms of the average precision of the area under the precision-recall curve (AP). Our evaluation shows that our approach significantly outperforms all other approaches in terms of AP while boasting linear running time.

Bio: Andrian Putina is a PhD candidate at Telecom Paris working on Anomaly Detection and Telemetry in computer networks. He received his MSc degree in Information and Communication Technologies for Smart Societies from Politecnico di Torino, Italy in 2017.

## 10 June 2021, 11:00, online

Thomas Bonald (Télécom Paris)
A Streaming Algorithm for Logistic Regression (slides)

We focus on the problem of logistic regression in a streaming setting where data samples must be processed as soon as they arrive. We propose a novel algorithm based on Newton’s method and the Sherman-Morrison formula. This algorithm, which is the exact analogue of that known for linear regression, can be interpreted as the maximum a posteriori (MAP) estimation of the regressor in a Bayesian setting with Gaussian prior and Bernoulli observations. We show the efficiency of our algorithm on both synthetic and real data, including user interests in online flash sales proposed by a retailer company.
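To make the idea concrete, here is a minimal pure-Python sketch of a Newton-style streaming update that maintains the inverse Hessian with the Sherman-Morrison formula, in the spirit of recursive least squares for linear regression. It is an illustration written under our own assumptions, not the algorithm from the talk: the class name `StreamingLogisticRegression`, the `prior_precision` parameter, and the exact update order are hypothetical choices.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class StreamingLogisticRegression:
    """Sketch: one Newton-like step per sample, O(d^2) time and memory."""

    def __init__(self, dim: int, prior_precision: float = 1.0):
        self.w = [0.0] * dim
        # P approximates the inverse Hessian; it starts as the covariance
        # of an isotropic Gaussian prior (assumption of this sketch).
        self.P = [
            [1.0 / prior_precision if i == j else 0.0 for j in range(dim)]
            for i in range(dim)
        ]

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, y):
        d = len(x)
        p = self.predict_proba(x)
        a = p * (1.0 - p)  # curvature of the log-loss for this sample
        Px = [sum(self.P[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + a * sum(xi * pxi for xi, pxi in zip(x, Px))
        # Sherman-Morrison: (H + a x x^T)^{-1} = P - a (Px)(Px)^T / (1 + a x^T P x)
        for i in range(d):
            for j in range(d):
                self.P[i][j] -= a * Px[i] * Px[j] / denom
        # Gradient step preconditioned by the updated inverse Hessian;
        # note that P_new x equals Px / denom.
        for i in range(d):
            self.w[i] += Px[i] / denom * (y - p)


# Toy stream: label is 1 when the feature t is positive, 0 otherwise.
# The first coordinate of each sample is a constant bias term.
model = StreamingLogisticRegression(dim=2)
for k in range(200):
    t = 0.5 + 0.1 * (k % 20)
    model.update([1.0, t], 1)
    model.update([1.0, -t], 0)
```

Each sample is seen once and then discarded; only the weight vector and the d-by-d matrix are kept, which is what makes this kind of update suitable for streams.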

Joint work with Till Wolhfarth (Veepee).

## 3 June 2021, 11:00, online

Tiphaine Viard (Télécom Paris)
What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization (slides)

Tiphaine will present the work by Belth, Zheng, Vreeken, and Koutra, whose abstract follows.

Knowledge graphs (KGs) store highly heterogeneous information about the world in the structure of a graph, and are useful for tasks such as question answering and reasoning. However, they often contain errors and are missing information. Vibrant research in KG refinement has worked to resolve these issues, tailoring techniques to either detect specific types of errors or complete a KG. In this work, we introduce a unified solution to KG characterization by formulating the problem as unsupervised KG summarization with a set of inductive, soft rules, which describe what is normal in a KG, and thus can be used to identify what is abnormal, whether it be strange or missing. Unlike first-order logic rules, our rules are labeled, rooted graphs, i.e., patterns that describe the expected neighborhood around a (seen or unseen) node, based on its type, and information in the KG. Stepping away from the traditional support/confidence-based rule mining techniques, we propose KGist, Knowledge Graph Inductive SummarizaTion, which learns a summary of inductive rules that best compress the KG according to the Minimum Description Length principle, a formulation that we are the first to use in the context of KG rule mining. We apply our rules to three large KGs (NELL, DBpedia, and Yago), and tasks such as compression, various types of error detection, and identification of incomplete information. We show that KGist outperforms task-specific, supervised and unsupervised baselines in error detection and incompleteness identification (identifying the location of up to 93% of missing entities, over 10% more than baselines), while also being efficient for large knowledge graphs.

## 20 May 2021, 11:00, online

Confident Interpretations of Black Box Classifiers (slides)

Deep Learning models provide state of the art classification results, but are not human-interpretable. We propose a novel method to interpret the classification results of a black box model a posteriori. We emulate the complex classifier by surrogate decision trees. Each tree mimics the behavior of the complex classifier by overestimating one of the classes. This yields a global, interpretable approximation of the black box classifier. Our method provides interpretations that are at the same time general (applying to many data points), confident (generalizing well to other data points), faithful to the original model (making the same predictions), and simple (easy to understand). Our experiments show that our method beats competing methods in these desiderata, and our user study shows that users prefer this type of interpretations over others.

François Amat (Télécom Paris)
Application talk rehearsal (slides)

## 29 April 2021, 11:00, online

Pierre-Henri Paris (Télécom Paris)
Vagueness, non-named entities, and knowledge representation (slides)

Representing and reasoning with knowledge is a task that scientists have been working on for a very long time. One objective of the NoRDF project is to obtain a formalism allowing representing complex knowledge while keeping reasoning decidable. This presentation is divided into three parts. First, we will address the problem of vagueness in noun phrases, such as "a rich person." Indeed, no current system can handle vagueness from end to end.

Then, we will talk about non-named entities which, unlike named entities, are largely absent from modern knowledge bases. Finally, we will present the different dimensions that we wish to address in the NoRDF project, both from a modeling and a reasoning point of view. We will introduce the different formalisms we could use to obtain a formalism corresponding to our specifications.

Bio: Pierre-Henri Paris is a post-doctoral researcher at Télécom Paris, Institut Polytechnique de Paris. During his PhD thesis at CNAM/Sorbonne Université, he worked on identity in knowledge graphs. He is currently part of the NoRDF project, led by professors Fabian M. Suchanek and Chloé Clavel. The objective is to model and extract complex information from natural language texts.

## 15 April 2021, 11:00, online

Louis Jachiet (Télécom Paris)
Ranked Enumeration of MSO Logic on Words (slides)

While the use of "MSO logic" in the title might be frightening, we will see that "MSO logic on words" is essentially a formalization and a generalization of the capture mechanism of regular expressions that you can see in your favorite language.

Tools dealing with MSO logic over words (=regular expressions) generally extract information in an order that depends on the algorithm used and the text order. For instance, the first match returned is the first one appearing in the text, and so on. In some cases, this order is not the most useful to the user, who might prefer to receive the matches in a different order (e.g. if they are interested in the top-k results). Enumerating results efficiently in an order given by the user is the problem we solve in this paper.
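The difference between text order and a user-chosen order can be seen with Python's standard `re` module. The sketch below is only a naive baseline: it materializes every match and then picks the best ones, which is exactly the cost that an efficient ranked enumeration algorithm aims to avoid. The pattern, the input text, and the ranking criterion (length of the captured group) are arbitrary choices made for the illustration.

```python
import heapq
import re

text = "ab aaab b aab"
pattern = re.compile(r"(a+)b")  # capture the run of a's preceding a b

# Default behaviour: matches are produced in the order they appear in the text.
in_text_order = [m.group(1) for m in pattern.finditer(text)]

# Naive ranked enumeration: look at all matches, keep the k best
# according to a user-supplied score (here, the length of the capture).
top2 = heapq.nlargest(2, pattern.finditer(text), key=lambda m: len(m.group(1)))
top2_captures = [m.group(1) for m in top2]

print(in_text_order)   # ['a', 'aaa', 'aa']
print(top2_captures)   # ['aaa', 'aa']
```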

The presentation will try to be as pedagogical as possible, focusing on the result (explaining what the problem is and why we want to solve it) and on high-level constructions rather than on the technical details.

Bio: Louis Jachiet is an associate professor in the team.

## 11 March 2021, 11:00, online

Antoine Amarilli (Télécom Paris)
Uniform Reliability of Self-Join-Free Conjunctive Queries (slides)

Abstract: This is a presentation of my upcoming conference talk at the ICDT 2021 conference, presenting joint work with Benny Kimelfeld (Technion).

The reliability of a Boolean Conjunctive Query (CQ) over a tuple-independent probabilistic database is the probability that the CQ is satisfied when the tuples of the database are sampled one by one, independently, with their associated probability. For queries without self-joins (repeated relation symbols), the data complexity of this problem is fully characterized in a known dichotomy: reliability can be computed in polynomial time for hierarchical queries, and is #P-hard for non-hierarchical queries. Hierarchical queries also characterize the tractability of queries for other tasks: having read-once lineage formulas, supporting insertion/deletion updates to the database in constant time, and having a tractable computation of tuples' Shapley and Banzhaf values. In this work, we investigate a fundamental counting problem for CQs without self-joins: how many sets of facts from the input database satisfy the query? This is a simpler, uniform variant of the query reliability problem, where the probability of every tuple is required to be 1/2. Of course, for hierarchical queries, uniform reliability is in polynomial time, like the reliability problem. However, it is an open question whether being hierarchical is necessary for the uniform reliability problem to be in polynomial time. In fact, the complexity of the problem has been unknown even for the simplest non-hierarchical CQs without self-joins. We solve this open question by showing that uniform reliability is #P-complete for every non-hierarchical CQ without self-joins. Hence, we establish that being hierarchical also characterizes the tractability of unweighted counting of the satisfying tuple subsets. We also consider the generalization to query reliability where all tuples of the same relation have the same probability, and give preliminary results on the complexity of this problem.
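For readers unfamiliar with the counting problem, the following brute-force Python sketch spells it out on a toy instance. It uses the query Q: R(x), S(x, y), T(y), a standard example of a non-hierarchical CQ without self-joins; the database contents and the function names are made up for the illustration. The enumeration is exponential in the number of facts, which is consistent with the hardness result: the sketch defines the problem, it does not solve it efficiently.

```python
from itertools import combinations

# Toy database: each fact is a tuple (relation name, arguments...).
facts = [("R", "a"), ("S", "a", "b"), ("S", "a", "c"), ("T", "b"), ("T", "c")]


def satisfies(world) -> bool:
    """Does the set of facts satisfy Q: exists x, y with R(x), S(x, y), T(y)?"""
    w = set(world)
    return any(
        ("R", f[1]) in w and ("T", f[2]) in w
        for f in w
        if f[0] == "S"
    )


def count_satisfying_subsets(all_facts) -> int:
    """Count the subsets of the database that satisfy the query (brute force)."""
    return sum(
        satisfies(subset)
        for size in range(len(all_facts) + 1)
        for subset in combinations(all_facts, size)
    )


count = count_satisfying_subsets(facts)
# Uniform reliability: every fact is kept independently with probability 1/2.
uniform_reliability = count / 2 ** len(facts)

print(count, uniform_reliability)  # 7 0.21875
```

On this instance, a satisfying subset must contain R(a) together with at least one of the pairs {S(a, b), T(b)} or {S(a, c), T(c)}, which gives 7 of the 32 subsets.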

## 18 February 2021, 11:00, online

Mathilde Hutin (LISN-CNRS) and Yaru Wu (LISN/CNRS & LPP/CNRS):
OTELO: OnTologies to Enrich Linguistic analyses of Oral speech (slides for Yaru, slides for Mathilde)

Abstract: Human languages are intrinsically polysemic and, therefore, ambiguous. Although humans usually solve potential ambiguities rather quickly by efficiently taking advantage of the communication context, machines do not. The OTELO project proposes a multi-level analysis of spoken language from large oral corpora that were segmented and annotated automatically to help (i) understand what characteristics of the acoustic signal may be used by humans to disambiguate connected speech so easily and (ii) find out whether these characteristics can be used to improve our technologies and, if so, which ones. In this presentation, we will show the current state of the project. Since the oral data has to be both segmented into phones and words, and enriched with knowledge about the grammatical status of words and their syntactic and semantic relationships in context, we will first describe the development of such enriched databases for French and how we plan on expanding future analyses to Germanic languages and other Romance languages using similar methods. After this, preliminary results showing how phonetic detail can help disambiguate homophones will be presented.

Bio of the speakers:

Yaru Wu (LISN/CNRS & LPP/CNRS) is a phonetician specialized in automatic language and speech processing. She is interested in variation in continuous speech in different languages and the contributions of large corpora and automatic speech processing to phonetic studies. She completed her PhD thesis on variation in continuous French using large corpora and tools from automatic speech recognition at the Laboratory of phonetics and phonology (LPP) - Université Sorbonne Nouvelle in 2018. After two years as a Temporary Research and Teaching Attachée (ATER) in the Language Sciences department at the Paris Nanterre University, she joined the LISN laboratory (former LIMSI) in October 2020 as a post-doctoral fellow. In the framework of the OTELO project at the LISN laboratory, she works on the extraction of linguistic information using machine learning techniques and on the contribution of linguistic knowledge to automatic speech recognition systems.

Mathilde Hutin (LISN-CNRS) is a linguist specialized in phonology. Her research focuses on the phonetics-phonology interface, and more precisely on what phonetic detail, in the acoustic signal, is relevant to language processing. After defending her thesis on the role of fine-grained phonetic detail during second language processing at the Université Paris 8 (SFL lab), she started working as a post-doc fellow at LISN (former LIMSI) in October 2019. She conducted her work in the framework of a Digicosme postdoctoral project during which she investigated fine-grained phonetic variation in large automatically-aligned corpora and its implications for linguistic theory, both from a synchronic and a micro-diachronic perspective. As a member of the OTELO project, she focuses on the phonetics-semantics, phonetics-syntax and phonetics-pragmatics interfaces by investigating how fine-grained variation can help disambiguate identical words with different meanings, natures or functions.

## 11 February 2021, 11:00, online

Jonathan Lajus, Télécom Paris
Fast, Exact, and Exhaustive Rule Mining in large Knowledge Bases

Abstract: The Semantic Web has quickly become a constellation of large and interconnected entity-centric Knowledge Bases. These KBs contain domain-specific knowledge that can be used for multiple applications such as question answering or automatic reasoning. But in order to take full advantage of this data, it is essential to understand the schema and the patterns of the KB. A simple and expressive manner to describe the dependencies in a KB is to use rules. Thus it is crucial to be able to perform rule mining at scale.

In this thesis, we introduce novel approaches and optimizations designed to speed up the process of rule mining on large Knowledge Bases. We present two algorithms that implement these optimizations: the AMIE 3 algorithm (the successor of the exact rule mining algorithm AMIE+) and the Pathfinder algorithm, a novel algorithm specialized in mining path rules. These two algorithms are exhaustive with regard to the parameters provided by the user: they compute the quality measures of each rule exactly, and they scale efficiently to large KBs and longer rules.

Résumé: Au fil des ans, le Web Sémantique s'est agrandi pour regrouper une constellation d'énormes Bases de Connaissances interconnectées. Ces bases répertorient nos connaissances du monde sous la forme de faits structurés et sont utilisées pour la réponse automatique de questions ainsi que pour le raisonnement automatique. Mais pour tirer pleinement avantage de ce vivier d'informations, il est essentiel de comprendre le schéma et les interdépendances intrinsèques à ces données. En particulier, les dépendances fonctionnelles entre les différentes relations peuvent être représentées sous la forme de règles simples. Il est donc crucial de pouvoir extraire ces règles efficacement à partir de nos données.

Dans cette thèse, on introduit de nouvelles approches et optimisations pour accélérer l'extraction de règles dans de larges Bases de Connaissances. On présente deux nouveaux algorithmes implémentant ces optimisations: AMIE 3 (le successeur de l'algorithme exact AMIE+) et Pathfinder, un nouvel algorithme spécialisé dans l'extraction de règles chaînées. Ces deux algorithmes sont exhaustifs, ils calculent la qualité des règles de manière exacte et passent à l'échelle de manière efficace sur un plus grand volume de données et sur des règles plus complexes.
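As an indication of what "quality measures" of a rule can mean, here is a small Python sketch computing the support and the standard confidence of a rule of the form body(x, y) => head(x, y) on a toy set of triples. The facts, the relation names, and the function `support_and_confidence` are invented for the example; the sketch does not reproduce AMIE 3 or Pathfinder, which rely on further measures, pruning strategies, and optimizations not shown here.

```python
# Toy knowledge base of (subject, relation, object) triples.
kb = {
    ("alice", "livesIn", "paris"), ("alice", "worksIn", "paris"),
    ("bob", "livesIn", "lyon"), ("bob", "worksIn", "paris"),
    ("carol", "livesIn", "nice"), ("carol", "worksIn", "nice"),
    ("dave", "worksIn", "lyon"),
}


def support_and_confidence(triples, body_rel: str, head_rel: str):
    """Measures for the rule body_rel(x, y) => head_rel(x, y).

    support: number of (x, y) pairs for which both body and head hold.
    confidence: support divided by the number of pairs satisfying the body.
    """
    body = {(s, o) for s, r, o in triples if r == body_rel}
    head = {(s, o) for s, r, o in triples if r == head_rel}
    support = len(body & head)
    confidence = support / len(body) if body else 0.0
    return support, confidence


# Rule: worksIn(x, y) => livesIn(x, y)
support, confidence = support_and_confidence(kb, "worksIn", "livesIn")
print(support, confidence)  # 2 0.5
```

Here the rule holds for alice and carol out of the four people with a known workplace, hence a support of 2 and a confidence of 0.5. Computing such measures exactly for every candidate rule is what becomes expensive on large KBs.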

## 4 February 2021, 11:00, online

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data (slides)

Abstract: Named entity recognition (NER) plays a significant role in many applications such as information extraction, information retrieval, question answering, and even machine translation. Most of the work on NER using deep learning was done for non-Arabic languages like English and French, and only a few studies focused on Arabic. This paper proposes a semi-supervised learning approach to train a BERT-based NER model using labeled and semi-labeled datasets. We compared our approach against various baselines and state-of-the-art Arabic NER tools on three datasets: AQMAR, NEWS, and TWEETS. We report a significant improvement in F-measure for the AQMAR and the NEWS datasets, which are written in Modern Standard Arabic (MSA), and competitive results for the TWEETS dataset, which contains tweets that are mostly in the Egyptian dialect and contain many mistakes or misspellings.

Bio: Chadi Helwe is a first-year Ph.D. student at Telecom Paris - Institut Polytechnique de Paris under the supervision of Prof. Fabian Suchanek and Prof. Chloe Clavel. He received his B.Sc degree from Notre Dame University in Lebanon and his M.Sc degree from the American University of Beirut. His main research interests are in Machine Learning, Natural Language Processing, and Biomedical Imaging.

Jean-Louis Dessalles, Télécom Paris:
Context-dependent relevant descriptions (slides)

We want to find relevant descriptions for entities, e.g. Musashisakai. You may locate it with coordinates 35.702, 139.544, or by saying that it is a train station 20 km west of Tokyo (leaving an indeterminacy), or by saying that it is the main train station closest to TUFS (Tokyo University of Foreign Studies). Can we order such descriptions by relevance to a specific user? Maybe, using algorithmic information. But help and advice from the DIG team are desperately needed to get it automated…

## 28 January 2021, 11:00, online

Lihu Chen, Télécom Paris:
A Lightweight Neural Model for Biomedical Entity Linking (slides)

Abstract: Biomedical entity linking aims to map biomedical mentions, such as diseases and drugs, to standard entities in a given knowledge base. The specific challenge in this context is that the same biomedical entity can have a wide range of names, including synonyms, morphological variations, and names with different word orderings. Recently, BERT-based methods have advanced the state-of-the-art by allowing for rich representations of word sequences. However, they often have hundreds of millions of parameters and require heavy computing resources, which limits their applications in resource-limited scenarios. Here, we propose a lightweight neural method for biomedical entity linking, which needs just a fraction of the parameters of a BERT model and far fewer computing resources. Our method uses a simple alignment layer with attention mechanisms to capture the variations between mention and entity names. Yet, we show that our model is competitive with previous work on standard evaluation benchmarks.

Bio: Lihu Chen is currently a second-year PhD candidate at Telecom Paris, supervised by Fabian Suchanek (Telecom) and Gael Varoquaux (INRIA). His research topics are information extraction and knowledge bases.

## 21 January 2021, 11:00, online

Matthieu Jonckheere, Télécom Paris:
Flexible EM Clustering (Beyond the i.i.d paradigm) (slides)

Abstract: Though very popular, the EM algorithm for clustering mixtures of Gaussians is well known to suffer when applied to non-Gaussian distribution shapes, outliers, and high-dimensional data. We designed a new robust clustering algorithm that can efficiently deal with noise and outliers in diverse data sets. As an EM-like algorithm, it is based on estimates of both cluster centers and covariances, but in addition it uses an unknown scale parameter (nuisance parameter) per data point. This allows the algorithm to accommodate heavier-tailed distributions and outliers without significantly losing efficiency in various classical scenarios. We analyze the proposed algorithm in the context of elliptical distributions, showing in particular important insensitivity properties to the underlying data distributions. Then, we show that the proposed algorithm outperforms other classical unsupervised methods from the literature, such as k-means, EM for Gaussian mixture models and its recent modifications, or spectral clustering, when applied to classical data sets such as MNIST, NORB, and 20newsgroups.

This is joint work with Frédéric Pascal and Violeta Roizman.

## 9 December 2020, 11:00, online

Thomas Bonald, Louis Jachiet, Tiphaine Viard
Short research presentation (slides by Louis, slides by Tiphaine)

Abstract: Thomas, Louis, and Tiphaine will each give a high-level presentation of the kind of research that they are doing.

Bio: Thomas, Louis, and Tiphaine are permanent researchers in the team.

## 26 November 2020, 11:00, online

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data (slides)

Abstract: The success of the large neural language models on many NLP tasks is exciting. However, we find that these successes sometimes lead to hype in which these models are being described as “understanding” language or capturing “meaning”. In this position paper, we argue that a system trained only on form has a priori no way to learn meaning. In keeping with the ACL 2020 theme of “Taking Stock of Where We’ve Been and Where We’re Going”, we argue that a clear understanding of the distinction between form and meaning will help guide the field towards better science around natural language understanding.

Bio: Chadi Helwe is a first-year Ph.D. student at Telecom Paris - Institut Polytechnique de Paris under the supervision of Prof. Fabian Suchanek and Prof. Chloe Clavel. He received his B.Sc. degree from Notre Dame University in Lebanon and his M.Sc. degree from the American University of Beirut. His main research interests are in Machine Learning, Natural Language Processing, and Biomedical Imaging.

## 29 October 2020, 12:30, online

Antoine Amarilli, Télécom Paris:
A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs (slides)

Abstract: I will present the task of query evaluation on probabilistic graphs. In this problem, you want to answer a query (to find interesting patterns in data), but you are uncertain about the underlying data, and so you must give a probabilistic answer. More precisely, the input is data represented as a graph where every edge is annotated with a probability between 0 and 1, denoting the probability that the edge exists, independently from all other edges. This represents a probability distribution on all the subgraphs of the graph. We are interested in computing the total probability of the subgraphs that fall into a fixed query class of graphs (e.g., graphs that contain a triangle). The goal is to compute this probability more efficiently than by summing over the exponential number of possible subgraphs.

In this talk I will present our results about the computational complexity of calculating this probability, for classes of graphs that are closed under homomorphism. This is a class generalizing many query languages in database theory (conjunctive queries, Datalog, regular path queries): formally, all graphs in which we can homomorphically embed a graph of the class also belong to the class. We show that, for any such graph class, the problem is computationally intractable (#P-hard), except the classes corresponding to so-called safe unions of conjunctive queries, which had already been shown to be tractable by Dalvi and Suciu. The key result is to show intractability for so-called unbounded classes. This is joint work with İsmail İlkan Ceylan (University of Oxford) and was presented at ICDT'20 where it received the best paper award.
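
As an illustration of the problem statement above, here is a minimal sketch of the naive baseline: it sums the probabilities of all subgraphs that satisfy the query, using "contains a triangle" as the query class. The function names and the example graph are hypothetical and do not come from the talk; the running time is exponential in the number of edges, which is exactly the cost that the tractable cases avoid.

```python
from itertools import combinations, product


def has_triangle(edges):
    """Return True if the undirected edge list contains a triangle."""
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({v for e in edges for v in e})
    return any(
        frozenset((a, b)) in edge_set
        and frozenset((b, c)) in edge_set
        and frozenset((a, c)) in edge_set
        for a, b, c in combinations(nodes, 3)
    )


def query_probability(prob_edges, query):
    """Sum the probabilities of all subgraphs (possible worlds) satisfying `query`.

    `prob_edges` maps each edge (u, v) to its independent existence probability.
    """
    edges = list(prob_edges)
    total = 0.0
    # Enumerate every subset of edges: 2 ** len(edges) possible worlds.
    for keep in product([False, True], repeat=len(edges)):
        weight = 1.0
        world = []
        for edge, kept in zip(edges, keep):
            p = prob_edges[edge]
            weight *= p if kept else 1.0 - p
            if kept:
                world.append(edge)
        if query(world):
            total += weight
    return total


# Hypothetical example graph: a triangle a-b-c plus a pendant edge c-d.
example = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5, ("c", "d"): 0.9}
triangle_probability = query_probability(example, has_triangle)
```

In this example the triangle exists exactly when its three edges are all present, so the result is 0.5 × 0.5 × 0.5 regardless of the fourth edge.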

Bio: Antoine Amarilli is an associate professor in the DIG team, working on database theory and other mysterious theoretical topics.

## 15 October 2020, 12:30, 0D20

Léo Laugier, Télécom Paris
Civil Rephrases Of Toxic Texts With Self-Supervised Transformers (slides)

Abstract: Platforms that support online commentary, from social networks to news sites, are increasingly leveraging machine learning to assist their moderation efforts. But this process does not typically provide feedback to the author that would help them contribute according to the community guidelines. This is prohibitively time-consuming for human moderators to do, and computational approaches are still nascent. This work focuses on models that can help suggest rephrasings of toxic comments in a more civil manner. Inspired by recent progress in unpaired sequence-to-sequence tasks, a self-supervised learning model is introduced, called CAE-T5. CAE-T5 employs a pre-trained text-to-text transformer, which is fine-tuned with a denoising and cyclic auto-encoder loss. Experimenting with the largest toxicity detection dataset to date (Civil Comments), our model generates sentences that are more fluent and better at preserving the initial content than earlier text style transfer systems, which we compare against using several scoring systems and human evaluation.

Bio: Léo Laugier is a Ph.D. candidate in the INFRES department at Télécom Paris and Institut Polytechnique de Paris, working on Conversation AI research. He is supervised by Prof. Thomas Bonald (LTCI) and Dr. Lucas Dixon (Google Brain). His thesis focus is on developing systems to detect and mitigate subtle forms of toxicity in online conversations, and more generally on conditional Natural Language Generation. Prior to joining IP Paris, he worked on Graph Convolutional Neural Networks at the Institute for Infocomm Research (I²R) of the Agency for Science, Technology and Research (A*STAR).

## 1 October 2020, 12:30, 0D20

Pierre-Henri Paris, Télécom Paris
Contextual Propagation of Properties for Knowledge Graphs: A Sentence Embedding Based Approach (slides)

Abstract: With the ever-increasing number of RDF-based knowledge graphs, the number of interconnections between these graphs using the owl:sameAs property has exploded. Moreover, as several works indicate, the identity as defined by the semantics of owl:sameAs could be too rigid, and this property is therefore often misused. Indeed, identity must be seen as context-dependent. These facts lead to poor quality data when using the owl:sameAs inference capabilities. Therefore, contextual identity could be a possible path to better quality knowledge. Unlike classical identity, with contextual identity, only certain properties can be propagated between contextually identical entities. Continuing this work on contextual identity, we propose an approach, based on sentence embedding, to find semi-automatically a set of properties, for a given identity context, that can be propagated between contextually identical entities. Quantitative experiments against a gold standard show that our approach achieved promising results. Besides, the use cases provided demonstrate that identifying the properties that can be propagated helps users achieve the desired results that meet their needs when querying a knowledge graph, i.e., more complete and accurate answers.

Bio: Pierre-Henri Paris is a post-doctoral researcher at Télécom Paris, Institut Polytechnique de Paris. During his PhD thesis at CNAM/Sorbonne Université, he worked on contextual identity in knowledge graphs, and in particular on the conditions under which two entities can share a certain amount of information. He is currently part of the NoRDF project, led by professors Fabian M. Suchanek and Chloé Clavel. The objective is to model and extract complex information from natural language texts.

## 12 March 2020, 13:15, 4A113

Mikaël Monet, Instituto Milenio Fundamentos de los datos
Logical Expressiveness of Graph Neural Networks (slides)

Graph Neural Networks (GNNs) are a family of machine learning architectures that has recently become popular for applications dealing with structured data, such as molecule classification and knowledge graph completion. Recent work on the expressive power of GNNs has established a close connection between their ability to classify nodes in a graph and the Weisfeiler-Lehman (WL) test for checking graph isomorphism. In turn, a seminal result by Cai et al. establishes that the WL test is tightly connected to the two-variable fragment of first-order logic extended with counting capabilities (FOC2). However, these results put together do not seem to characterize the relationship between GNNs and FOC2. This motivates the following question: which FOC2 node properties are expressible by GNNs? We start by considering GNNs that update the feature vector of a node by combining it with the aggregation of the vectors of its neighbors; we call these aggregate-combine GNNs (AC-GNNs). On the negative side, we present a simple FOC2 node property that cannot be captured by any AC-GNN. On the positive side, we identify a natural fragment of FOC2 whose expressiveness is subsumed by that of AC-GNNs. This fragment corresponds to graded modal logic, or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community. Next we improve the AC-GNN architecture by allowing global readouts, where in each layer we can compute a feature vector for the whole graph and combine it with local aggregations; we call these aggregate-combine-readout GNNs (ACR-GNNs). In this setting, we prove that each FOC2 formula is captured by an ACR-GNN classifier. Besides their own value, these results put together indicate that readouts strictly increase the discriminative power of GNNs. (Ongoing work with Pablo Barceló, Egor Kostylev, Jorge Pérez, Juan Reutter and Juan Pablo Silva)
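
To make the two architectures in the abstract concrete, here is a minimal sketch of a single layer. Without `readout_w` it behaves as an aggregate-combine layer, and with it as an aggregate-combine-readout layer. Summation as the aggregation and readout function, the ReLU activation, and all function and parameter names are assumptions made for illustration; the paper's exact definitions may differ.

```python
def relu(x):
    return [max(0.0, v) for v in x]


def matvec(matrix, x):
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def gnn_layer(features, neighbors, combine_w, aggregate_w, readout_w=None, bias=None):
    """Compute x_v' = relu(C x_v + A * sum of neighbor vectors [+ R * sum of all vectors] + b).

    `features` maps node -> feature vector, `neighbors` maps node -> list of nodes.
    """
    dim_in = len(next(iter(features.values())))
    dim_out = len(combine_w)
    bias = bias or [0.0] * dim_out
    zero = [0.0] * dim_in
    # Global readout: one vector summarising the whole graph.
    global_sum = add(zero, *features.values())
    updated = {}
    for node, x in features.items():
        # Local aggregation over the node's neighbors.
        local_sum = add(zero, *(features[u] for u in neighbors[node]))
        terms = [matvec(combine_w, x), matvec(aggregate_w, local_sum), bias]
        if readout_w is not None:
            terms.append(matvec(readout_w, global_sum))
        updated[node] = relu(add(*terms))
    return updated


# Hypothetical example: the path graph a - b - c with constant input features.
features = {"a": [1.0], "b": [1.0], "c": [1.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
degrees = gnn_layer(features, neighbors, combine_w=[[0.0]], aggregate_w=[[1.0]])
with_readout = gnn_layer(
    features, neighbors, combine_w=[[0.0]], aggregate_w=[[1.0]], readout_w=[[1.0]]
)
```

With these weights the plain layer computes each node's degree, while the readout variant additionally adds the number of nodes in the graph, a global quantity that purely local aggregation over neighbors does not see in a single step.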

## 21 November 2019, 12:00, 3.A26

Louis Jachiet, Télécom Paris
Reasoning about Disclosure in Data Integration in the Presence of Source Constraints (slides)

This talk will be about the recent paper "Reasoning about Disclosure in Data Integration in the Presence of Source Constraints" presented at IJCAI 19. The talk will mix material from the paper and a general introduction to the tools used in the paper (such as the "Chase"). Here is the abstract of this paper:

Data integration systems allow users to access data sitting in multiple sources by means of queries over a global schema, related to the sources via mappings. Data sources often contain sensitive information, and thus an analysis is needed to verify that a schema satisfies a privacy policy, given as a set of queries whose answers should not be accessible to users. Such an analysis should take into account not only knowledge that an attacker may have about the mappings, but also what they may know about the semantics of the sources.

In this paper, we show that source constraints can have a dramatic impact on disclosure analysis. We study the problem of determining whether a given data integration system discloses a source query to an attacker in the presence of constraints, providing both lower and upper bounds on source-aware disclosure analysis.

## 17 October 2019, 12:00, C47

Julien Romero, Télécom Paris
Commonsense Properties from Query Logs and Question Answering Forums (slides)

Abstract: Commonsense knowledge about object properties, human behavior, and general concepts is crucial for robust AI applications. However, the automatic acquisition of this knowledge is challenging because of sparseness and bias in online sources. This talk presents Quasimodo, a methodology and tool suite for distilling commonsense properties from non-standard web sources. We devise novel ways of tapping into search-engine query logs and QA forums and combining the resulting candidate assertions with statistical cues from encyclopedias, books and image tags in a corroboration step. Unlike prior work on commonsense knowledge bases, Quasimodo focuses on salient properties that are typically associated with certain objects or concepts. Extensive evaluations, including extrinsic use-case studies, show that Quasimodo provides better coverage than state-of-the-art baselines with comparable quality.

Bio: Julien Romero is a PhD student in the group, whose thesis is supervised by Fabian Suchanek.

The seminar will be followed by a presentation and discussion on the future AIDA center by Guillaume Desvaux (head of the future center).

## 2 October 2019, 12:00, C46

Nesime Tatbul, Intel Labs and MIT CSAIL
Practical Tools for Time Series Anomaly Detection

Abstract: From autonomous driving to industrial IoT, the age of billions of intelligent devices generating time-varying data is here. There is a growing need to ingest and analyze high volumes of time series data at scale. In our Metronome Project, we have been broadly exploring novel data management, machine learning, and interactive visualization techniques for supporting the practical development and deployment of predictive time series analytics applications. This talk will focus on our efforts in time series anomaly detection, including: (i) a customizable scoring model for evaluating accuracy, which extends the classical precision/recall model to range-based data; (ii) a zero-positive learning paradigm, which enables training anomaly detectors in the absence of labeled datasets; and (iii) Metro-Viz, a visual tool for interactively analyzing time series anomalies.

Bio: Nesime Tatbul is a senior research scientist at Intel’s Parallel Computing Lab (PCL) and a visiting scientist at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL). Previously, she served on the computer science faculty of ETH Zurich after receiving a Ph.D. degree from Brown University. Her research interests are in large-scale data management systems and modern data-intensive applications. She is most known for her contributions to stream processing, which include the Aurora/Borealis systems (now TIBCO StreamBase) and the S-Store system (the first streaming OLTP system). Nesime is the recipient of an IBM Faculty Award (2008), two ACM SIGMOD Best Demonstration Awards (2005 and 2019), and ACM DEBS Grand Challenge and Best Poster Awards (2011). She has served on the organization and program committees for various conferences including SIGMOD, VLDB, ICDE, and DEBS, and on the editorial boards of the SIGMOD Record and the VLDB Journal.

## 11 September 2019, 12:00, C47

Martín Muñoz, Pontificia Universidad Católica de Chile
Descriptive Complexity for Counting Complexity Classes (slides)

Abstract: The goal of Descriptive Complexity is to measure the complexity of computational problems by characterizing them in terms of logics. However, the study of Descriptive Complexity has mainly focused on decision problems, and not as much insight has been given into how to logically capture counting problems.

This paper builds on the idea of Weighted Logics to obtain a framework called Quantitative Second Order Logics (QSO). Our main contributions are showing how this framework can be used to logically capture many of the well-studied counting complexity classes (like FP and #P); using QSO to find classes below #P with good closure and approximation properties; and showing how to use quantitative recursion over QSO to capture lower classes like #L.

Bio: I am a PhD student at the Pontificia Universidad Católica de Chile and a member of the Millennium Institute for Foundational Research on Data. I received an M.Sc. degree and my professional degree in Computer Engineering in 2017, both from the Pontificia Universidad Católica de Chile. My doctoral research has focused on document spanners and enumeration complexity. I am also a member of the progcomp group in Chile that works to promote competitive programming to students in higher education.

## 4 July 2019, 11:00, B310

Gianmarco de Francisci Morales, ISI Foundation:
Controversy on Social Media: Collective Attention, Echo Chambers, and Price of Bipartisanship (slides)

Abstract: How do we discuss controversial topics on social media? Answering this question is not only interesting from a societal point of view, but also has concrete implications for policy makers, news agencies, and internet companies. In this talk, we first take a look at how collective attention, which is typically related to external events that increase the visibility of the topic, changes the debate. Our analysis shows that, in long-lived controversial debates on Twitter, increased collective attention is associated with increased network polarization. Then, we show how content and network interact in the formation of echo chambers. As expected, Twitter users are mostly exposed to political opinions that agree with their own. In addition, users who try to bridge the echo chambers by sharing content with diverse leaning have to pay a “price of bipartisanship” in terms of their network centrality and content appreciation.

Bio: Gianmarco De Francisci Morales is a Senior Researcher at ISI Foundation in Turin. Previously he worked as a Scientist at Qatar Computing Research Institute in Doha, as a Visiting Scientist at Aalto University in Helsinki, as a Research Scientist at Yahoo Labs in Barcelona, and as a Research Associate at ISTI-CNR in Pisa. He received his Ph.D. in Computer Science and Engineering from the IMT Institute for Advanced Studies of Lucca in 2012. His research focuses on scalable data mining, with an emphasis on Web mining and data-intensive scalable computing systems. He is an active member of the open source community of the Apache Software Foundation, working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is one of the lead developers of Apache SAMOA, an open-source platform for mining big data streams. He commonly serves on the PC of several major conferences in the area of data mining, including WSDM, KDD, CIKM, and WWW. He co-organizes the workshop series on Social News on the Web (SNOW), co-located with the WWW conference. He has won best paper awards at WSDM and WebSci.

## 20 June 2019, 12:00, C47

Cyril Chhun, Télécom Paris:
Clustering by contrast (slides)

Abstract: The objective is to design a clustering method that learns both prototypes and contrasts with prototypes, to obtain descriptions such as "large spade" where both "large" and "spade" can be learned in one shot. Here, the word "spade" is attached to the closest prototype to the instance, and "large" is attached to the prototype of the contrast between the instance and its prototype.

Bio: Cyril is finishing his internship at DIG/Télécom-Paris, together with his 3rd year of study at École Polytechnique.

Julien Panis-Lie, Télécom Paris
Extreme social signals (slides)

Altruistic forms of behaviour abound: writing open source programs, providing answers in technical forums, contributing to Wikipedia, volunteer work, charity, heroism,... Basic game theory predicts that such altruism should not exist, especially when it is anonymous. We explore the hypothesis that these forms of altruism may serve a signalling purpose. We test the model on the most extreme case: suicide for the group.

Bio: Julien is currently interning at DIG/Télécom-Paris. After Polytechnique, he studied economics and sociology, before following the CogMaster, which he is completing with this internship.

## 6 June 2019, 14:00, C49

LTCI Data Science Seminar session. Speakers: Pavlo Mozharovskyi and Silviu Maniu

See the LTCI Data Science Seminar Webpage for details.

## 9 May 2019, 14:00, C47

Kuldeep S. Meel, National University of Singapore:
Beyond NP Revolution (slides)

The paradigmatic NP-complete problem of Boolean satisfiability (SAT) solving is a central problem in Computer Science. While the mention of SAT can be traced to the early 19th century, efforts to develop practically successful SAT solvers go back to the 1950s. The past 20 years have witnessed an "NP revolution" with the development of conflict-driven clause-learning (CDCL) SAT solvers. Such solvers combine a classical backtracking search with a rich set of effective heuristics. While 20 years ago SAT solvers were able to solve instances with at most a few hundred variables, modern SAT solvers solve instances with up to millions of variables in a reasonable time.

The "NP revolution" opens up opportunities to design practical algorithms with rigorous guarantees for problems in complexity classes beyond NP by replacing an NP oracle with a SAT solver. In this talk, we will discuss how we use the NP revolution to design practical algorithms for two fundamental problems in artificial intelligence and formal methods: constrained counting and sampling.
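
For readers unfamiliar with the classical backtracking search that CDCL solvers build on, here is a minimal DPLL-style sketch with unit propagation. It is not a CDCL solver: it has no clause learning, no branching heuristics, and no restarts. Clauses are lists of non-zero integers in the DIMACS convention (a negative number is a negated variable), and the function names are chosen for illustration only.

```python
def assign(clauses, literal):
    """Set `literal` to true: drop satisfied clauses and remove its negation elsewhere."""
    simplified = []
    for clause in clauses:
        if literal in clause:
            continue  # clause is satisfied
        simplified.append([lit for lit in clause if lit != -literal])
    return simplified


def dpll(clauses, assignment=()):
    """Return a tuple of true literals satisfying `clauses`, or None if unsatisfiable."""
    if any(len(clause) == 0 for clause in clauses):
        return None  # an empty clause is a conflict
    if not clauses:
        return assignment  # every clause has been satisfied
    # Unit propagation: a one-literal clause forces that literal.
    unit = next((clause[0] for clause in clauses if len(clause) == 1), None)
    if unit is not None:
        return dpll(assign(clauses, unit), assignment + (unit,))
    # Branch on the first literal of the first clause, then on its negation.
    literal = clauses[0][0]
    for choice in (literal, -literal):
        model = dpll(assign(clauses, choice), assignment + (choice,))
        if model is not None:
            return model
    return None


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
formula = [[1, 2], [-1, 3], [-2, -3]]
model = dpll(formula)
unsat = dpll([[1], [-1]])
```

The returned model may be partial: variables that never needed a value are simply absent from it.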

Bio: Kuldeep Meel is an Assistant Professor of Computer Science in the School of Computing at the National University of Singapore, where he holds the Sung Kah Kay Assistant Professorship. He received his Ph.D. (2017) and M.S. (2014) degrees in Computer Science from Rice University. He holds a B.Tech. (with Honors) degree (2012) in Computer Science and Engineering from the Indian Institute of Technology, Bombay. His research interests lie at the intersection of Artificial Intelligence and Formal Methods. Meel has co-presented tutorials at top-tier AI conferences: IJCAI 2018, AAAI 2017, and UAI 2016. His work received the 2018 Ralph Budd Award for Best PhD Thesis in Engineering, the 2014 Outstanding Masters Thesis Award from the Vienna Center of Logic and Algorithms, and the Best Student Paper Award at CP 2015. He received the IBM Ph.D. Fellowship and the 2016-17 Lodieska Stockbridge Vaughn Fellowship for his work on constrained sampling and counting.

## 14 March 2019, 14:00, C49

LTCI Data Science Seminar session. Speakers: Thomas Bonald and Alessandro Rudi.

See the LTCI Data Science Seminar Webpage for details.

## 21 February 2019, 12:00, C47

Louis Jachiet, Inria Lille:
On the optimization of recursive queries over graphs (slides)

Abstract: Since its introduction, the relational model has seen various attempts to extend it with recursion and it is now possible to use recursion in several SQL or Datalog database systems. The optimization of such recursive queries remains, however, a challenge.

In this talk, we will introduce μ-RA, a variation of the Relational Algebra that allows for the expression of relational queries with recursion. μ-RA can express unions of conjunctive regular path queries over graphs (similar to the Property Paths of SPARQL) as well as certain non-regular properties.

We will present its syntax, semantics and the rewriting rules we specifically devised to tackle the optimization of recursive queries. We will also present our implementation and a benchmark comparing our prototype against state-of-the-art systems.
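
As a rough illustration of what a recursive relational query computes, the sketch below evaluates the transitive closure of an edge relation as a least fixpoint, using semi-naive iteration so that each round only joins the newly derived tuples. This is plain Python, not μ-RA syntax, and it does not reflect the rewriting rules or the implementation presented in the talk; the function name is hypothetical.

```python
def transitive_closure(edges):
    """Least fixpoint of X = E ∪ {(a, c) | (a, b) in X and (b, c) in E}."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    closure = set(edges)
    delta = set(edges)  # tuples derived in the previous round
    while delta:
        # Join only the new tuples with the base relation (semi-naive evaluation).
        derived = {(a, c) for a, b in delta for c in successors.get(b, ())}
        delta = derived - closure
        closure |= delta
    return closure


# Hypothetical example: a path 1 -> 2 -> 3 -> 4.
reachable = transitive_closure({(1, 2), (2, 3), (3, 4)})
```

The loop stops as soon as a round derives nothing new, which is guaranteed to happen because the closure can only grow within a finite set of node pairs.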

Bio: Louis Jachiet is a postdoctoral researcher in the Spirals team at Inria Lille, where he studies data security in databases. He previously was a teaching assistant at the École Normale Supérieure in Paris and did his PhD in the Tyrex team at Inria Grenoble on the topic of the optimization of SPARQL queries for distributed systems.

## 24 January 2019, 12:00, C47

Knowledge graph embedding for mining cultural heritage data (slides)

Abstract: In this talk we present a method for mining cultural heritage data using knowledge graph embedding models, along with the preliminary results of our ongoing work. This work is supported by the Data & Musée project, whose main goal is to define a model to integrate data produced by different cultural institutions in order to recommend useful original content to users (visitors or people belonging to the institution). First, we create a global context graph for the target domain. The graph is built as the union of individual contextual graphs of all entities representing the input data from two main cultural institutions: Paris Musées (PM) and Centre des Monuments Nationaux (CMN). Second, we propose to mine the resulting knowledge graph using a neural-network-based graph embedding model with biased graph walks. The model calculates an embedding vector for each entity in the graph, which could be used for different machine learning tasks to reach the objectives of this research.

Bio: Nada Mimouni is a postdoc at Télécom ParisTech in the IDS department.

## 10 January 2019, 12:00, C46

Explainable Artificial Intelligence (slides)

Abstract: In recent years, machine learning and artificial intelligence systems have reached, and sometimes even exceeded, human performance in tasks such as image recognition, speech understanding, or strategic decision making. The main problem with many of these models is their lack of transparency and interpretability: There is no information about how exactly they reached their predictions. This is a major issue in sensitive fields such as healthcare, policing, and finance. To address these issues, explainable artificial intelligence (XAI) has become an important topic of interest in the research community.

Through our research, we want to address this problem with insights from another field that has recently celebrated great advances: that of large knowledge bases (KBs). By contributing the link to the real world, KBs can give a semantic dimension to machine learning (ML) algorithms. While semantic background knowledge has long been used in ML, we believe that the recent explosion of the size of KBs warrants a revisit of this approach. KBs are now much larger, much broader in terms of thematic coverage, and much cleaner at scale. We imagine that a symbiosis between these new KBs and ML could take several forms: semantics can be injected a posteriori into a learned model; semantics can be taken into account as background knowledge during the learning process, or the learning process can feed directly from the semantic data. We aim to systematically explore all of these possibilities and investigate how they can serve to make AI and ML models more interpretable, more explainable, and ultimately more human-intelligible.

Bio: I studied Electrical Engineering and Computer Science at the School of Electrical Engineering, University of Belgrade, Serbia. This year I obtained an M.Sc. in Computer Science at Télécom ParisTech. I am starting my Ph.D. studies with Professors Albert Bifet and Fabian Suchanek. The research topic of my Ph.D. is Explainable AI.

## 13 December 2018, 12:00, C47

David Carral, TU Dresden
Reasoning with Description Logics Ontologies and Knowledge Graphs (slides)

Abstract: Ontology-based access to knowledge graphs (KGs) has recently gained a lot of attention. One of the research challenges when accessing these large data structures is to enable "the capability of combining diverse reasoning methods and knowledge representations while guaranteeing the required scalability, according to the reasoning task at hand." [1]

In our work, we address this challenge with a focus on reasoning with KGs extended with Description Logics (DL) ontologies. In principle, one could make use of existing DL reasoners to solve these reasoning tasks. However, DL reasoners---which are designed to deal with complex terminological axioms---do not scale well in the presence of large amounts of assertional information. In contrast, existing rule engines such as VLog or RDFox can efficiently reason with data-intensive knowledge bases. To take advantage of these powerful implementations, we propose several data-independent mappings from DL TBoxes into rule sets that preserve the outcomes of conjunctive query (CQ) answering. Our experiments indicate that reasoning with rule engines over the resulting CQ-preserving rewritings can be significantly more efficient than using state-of-the-art DL reasoners over the original DL ontologies.

[1] This quote is taken from the description of a recent Dagstuhl seminar on Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web.

Bio: Since October 2016, I have been a postdoctoral scholar in the Knowledge-Based Systems group led by Prof. Markus Krötzsch at Technische Universität Dresden. I completed my doctoral degree at Wright State University under the supervision of Prof. Pascal Hitzler. For a couple of months at the beginning of my Ph.D., I was an exchange student at the University of Oxford, working under the supervision of Prof. Bernardo Cuenca Grau.

Broadly speaking, I am interested in the study of logical languages such as Description Logics and existential rules, the implementation of reasoning algorithms for these languages, and the use and application of semantic web technologies in different domains.

## 11 December 2018, 14:00, C48

LTCI Data Science Seminar session. Speaker: Shai Ben-David.

See the LTCI Data Science Seminar Webpage for details.

## 29 November 2018, 14:00, C48

LTCI Data Science Seminar session. Speakers: Rodrigo Mello and Olivier Sigaud.

See the LTCI Data Science Seminar Webpage for details.

## 15 November 2018, 12:00, B551

Arnaud Soulet, Université de Tours
Representativeness of Knowledge Bases with the Generalized Benford’s Law (slides)

Abstract: Knowledge bases (KBs) such as DBpedia, Wikidata, and YAGO contain a huge number of entities and facts. Several recent works induce rules or calculate statistics on these KBs. Most of these methods are based on the assumption that the data is a representative sample of the studied universe. Unfortunately, KBs are biased because they are built from crowdsourcing and opportunistic agglomeration of available databases. This work aims at approximating the representativeness of a relation within a knowledge base. For this, we use the Generalized Benford's law, which indicates the distribution that the facts of a relation are expected to follow. We then compute the minimum number of facts that have to be added in order to make the KB representative of the real world. Experiments show that our unsupervised method applies to a large number of relations. For numerical relations where ground truths exist, the estimated representativeness proves to be a reliable indicator.
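
As background, the sketch below computes the first-digit distribution given by a commonly used form of the generalized Benford's law, for values whose density is proportional to x^(-α). The exact formulation used in the talk is not given in the abstract, so the formula, the function name, and the parameter names here are assumptions; the computation of the minimum number of missing facts is not shown.

```python
import math


def generalized_benford(alpha, base=10):
    """First-digit probabilities for data with density proportional to x ** (-alpha).

    For alpha == 1 this is the classical Benford's law, P(d) = log_base(1 + 1/d).
    Otherwise P(d) = ((d + 1) ** (1 - alpha) - d ** (1 - alpha)) / (base ** (1 - alpha) - 1).
    """
    digits = range(1, base)
    if math.isclose(alpha, 1.0):
        return {d: math.log(1 + 1 / d, base) for d in digits}
    exponent = 1.0 - alpha
    norm = base ** exponent - 1.0
    return {d: ((d + 1) ** exponent - d ** exponent) / norm for d in digits}


classical = generalized_benford(1.0)  # leading digit 1 is the most frequent
uniform = generalized_benford(0.0)    # alpha = 0 gives equally likely digits
steeper = generalized_benford(2.0)    # larger alpha favours small digits more
```

Comparing the observed first-digit frequencies of a numerical relation against such an expected distribution is one way to quantify how far the relation deviates from it.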

Bio: Arnaud Soulet is an associate professor at the University of Tours. His research interests include databases, data mining and knowledge bases.

## 8 November 2018, 12:00, C47¶

Borja Balle, Amazon Research
Privacy-Aware Machine Learning Systems (slides)

Abstract: Privacy-aware machine learning systems allow us to train models on sensitive data without the need to have plain-text access to the data. For example, such systems could enable hospitals in different countries to learn models on their combined datasets without the need to entrust the data held by each hospital to a centralized computing node. In this talk I will describe how several privacy-enhancing technologies like differential privacy and secure multi-party computation come together in this line of work. In particular, I will highlight our current progress in this space and the remaining challenges to obtain scalable and trusted large-scale deployments.

Bio: Borja Balle is currently a Machine Learning Scientist at Amazon Research in Cambridge (UK). Before joining Amazon, Borja was a lecturer at Lancaster University (2015-2017), a postdoctoral fellow at McGill University (2013-2015), and a graduate student at Universitat Politecnica de Catalunya where he obtained his PhD in 2013. His main research interest is in privacy-preserving machine learning, including the use of differential privacy and multi-party computation in distributed learning systems, and the mathematical foundations of privacy-aware data science.

## 25 October 2018, 12:00, C47¶

Fabian M. Suchanek, Télécom ParisTech
An introduction to deep learning (slides)

Abstract: In this talk, I will present the basics of deep learning. The goal of the presentation is twofold: 1) share what I learnt about deep learning with those who would like to know what it is and 2) receive feedback from those who already know more about it than I do. I have slides, but the presentation will follow the interaction with the audience.

Biography: Fabian Suchanek is a professor in the group.

## 4 October 2018, 13:00, C47¶

Quentin Lobbé, Télécom ParisTech
Where the dead blogs are: a disaggregated exploration of Web archives to reveal extinct online collectives (slides)

Abstract: The Web is an unsteady environment. As Web sites emerge and expand every day, whole communities may fade away over time, leaving too few or incomplete traces on the living Web. Worldwide volumes of Web archives preserve the history of the Web and reduce the loss of this digital heritage. Web archives remain essential to the comprehension of the lifecycles of extinct online collectives. In my talk, I will introduce a framework to follow the internal dynamics of vanished Web communities, based on the exploration of corpora of Web archives. To achieve this goal, I propose the definition of a new unit of analysis called Web fragment: a semantic and syntactic subset of a given Web page, designed to increase historical accuracy. This contribution has practical value for those who conduct large-scale archive exploration (in terms of time range and volume) or are interested in computational approaches to Web history and social science. By applying this framework to the Moroccan archives of the e-Diasporas Atlas, we will first witness the collapse of an established community of Moroccan migrant blogs. We will show its progressive mutation towards rising social platforms, between 2008 and 2018. Then, we will study the sudden creation of an ephemeral collective of forum members gathered by the wave of the Arab Spring in early 2011. We will finally yield new insights into historical Web studies by suggesting the concept of pivot moment of the Web.

Biography: Quentin Lobbé is a PhD student in the group; this is a rehearsal talk for the BDA conference.

## 6 September 2018, 12:00, C47¶

Rodrigo Mello, University of Sao Paulo
The Statistical Learning Theory in Practical Problems (slides, code)

Abstract: In this 30-minute talk, Prof. Rodrigo Mello will introduce his main research interests in an informal way: the Statistical Learning Theory, Data Streams/Time Series modeling using Statistics and Dynamical Systems, and how theoretical aspects can support the design of Deep Learning architectures. Several applications will also be mentioned during this talk.

Biography: Rodrigo Mello is currently an Associate Professor at the Institute of Mathematics and Computer Sciences, Department of Computer Science, University of São Paulo, São Carlos, Brazil. Prof. Mello is currently on a sabbatical year as an invited professor at Télécom ParisTech, after an invitation by Prof. Albert Bifet. He completed his PhD degree at the University of São Paulo, São Carlos in 2003 and has another one-year experience as Invited Professor at St. Francis Xavier University, Antigonish, NS, Canada. His research interests are mostly related to theoretical aspects of Machine Learning, Data Streams/Time Series modeling and prediction, and Deep Learning.

## 12 July 2018, 12:00, C48¶

Towards a solution to the “sameAs problem” (slides)

Abstract: In the absence of a central naming authority on the Semantic Web, it is common for different datasets to refer to the same thing by different IRIs. Whenever multiple names are used to denote the same thing, owl:sameAs statements are needed in order to link the data and foster reuse. However, studies that date back as far as 2009 have observed that the Semantic Web identity predicate is sometimes used incorrectly, leaving multiple incorrect owl:sameAs statements on the Web. This problem is known as the “sameAs problem”. In this talk, we show how network metrics, such as the community structure of the owl:sameAs graph, can be used for detecting such possibly erroneous statements. One benefit of the approach presented here is that it can be applied to the network of owl:sameAs links itself, and does not rely on any additional knowledge. In order to illustrate its ability to scale, the approach is evaluated on the largest collection of identity links to date, containing over 558 million owl:sameAs links scraped from the LOD Cloud.

Biography: Joe is a PhD student at the University of Paris-Saclay, and member of the LINK (AgroParisTech-INRA, Paris) and LAHDAK teams (LRI, Orsay). His current research comprises knowledge representation using Semantic Web languages, as well as studying the use of identity in the Semantic Web.

## 5 July 2018, 12:00, C47¶

Thomas Rebele, Télécom ParisTech, DIG team
Extending the YAGO knowledge base (slides)

Abstract: A knowledge base is a set of facts about the world. YAGO was one of the first large-scale knowledge bases that were constructed automatically. This presentation shows our work on extending the YAGO knowledge base along two axes: extraction and preprocessing.

The first part of the talk presents methods that increase the number of facts about people in YAGO. We have developed algorithms and heuristics for extracting more facts about birth and death dates, about gender, and about the place of residence. We also show how to use these data for studies in Digital Humanities.

The second part discusses two algorithms for repairing a regular expression automatically so that it matches a given set of words. Experiments on various datasets show the effectiveness and generality of these algorithms. Both algorithms improve the recall of the initial regular expression while achieving a similar or better precision.

The third part presents a system for translating database queries into Bash scripts. This approach allows preprocessing large tabular datasets and knowledge bases by executing Datalog and SPARQL queries, without installing any software beyond a Unix-like operating system. Experiments show that the performance of our system is comparable with state-of-the-art systems.

Biography: Thomas Rebele is a PhD student in our group.

## 14 June 2018, 12:00, C47¶

Lucie-Aimée Kaffee, University of Southampton
Multilinguality of Wikidata (slides)

Abstract: The web in general shows a lack of support for non-English languages. One way of overcoming this lack of information is using multilingual linked data. Wikidata supports over 400 languages in theory. In practice, however, not all languages are equally supported. As a first step, we want to explore the language distribution of a collaboratively edited knowledge base such as Wikidata and label coverage of the web of data in general. Labels are the access point for humans to the web of data, and a lack thereof means limited reusability. Wikipedia is an ideal candidate for reuse of the multilingual data: the project has instances in over 280 languages, but the number of articles differs drastically. For many readers it could be a first starting point to get information. However, with a lack of information the project is unlikely to attract new community members that could create new articles. We investigate the possibility of neural natural language generation for underserved Wikipedia communities, using Wikidata’s facts, and evaluate this approach with the help of the Arabic and Esperanto Wikipedia communities. This approach can only be as good as the amount of multilingual data we have at our disposal. Therefore, we discuss future ways of improving the coverage of under-resourced languages’ information in Wikidata.

Biography: Lucie is a PhD student at the School of Electronics and Computer Science, University of Southampton, as part of the Web and Internet Science (WAIS) research group. Additionally, she is part of the Marie Skłodowska-Curie ITN Aqua. Generally, she is working on how to support underserved languages on the web by means of linked data. Therefore, her research interests include linked data, multilinguality, Wikidata, underserved languages on the web and most recently natural language generation and relation extraction. Before getting involved with research, she worked as a software developer at Wikimedia Deutschland in the Wikidata team. There she was already involved in the previously mentioned topics, developing the ArticlePlaceholder extension, which includes Wikidata’s structured knowledge on Wikipedias of small languages, a project she continued research on. She is still involved in Open Source projects, mainly Wikimedia related, where she is currently part of the Code of Conduct Committee for technical spaces.

## 23 May 2018, 12:05, C47¶

Viktor Losing, University of Bielefeld, HONDA Research Institute Europe
Memory Models for Incremental Learning Architectures (slides)

Abstract: There are more and more products available with automated functions for human assistance or autonomous services in home or outdoor environments. A common problem is the inadequate match between user expectations, which are highly individual, and the assistant system function, which is typically rather standardized. Incremental learning methods offer a way to adapt the parameters and behavior of an assistant system according to user needs and preferences. In this talk, I will illustrate the benefits of personalization and incremental learning using the task of driver maneuver prediction at intersections. The study is based on a collection of commuting drivers who recorded their daily routes with a standard smartphone and GPS receiver. The personalized prediction based on at least one experience of a certain intersection already improves the prediction performance over an average prediction model trained.

A closely related topic is incremental learning in non-stationary data streams which is highly challenging, since the possibly occurring types of drift are fundamentally different and undermine classical assumptions such as data independence or stationary distributions. Here, I will introduce the Self Adjusting Memory (SAM) model for the k Nearest Neighbor (kNN) algorithm. The basic idea is to construct dedicated models for the current and former concepts and apply them according to the demands of the given situation. In an extensive evaluation, SAM-kNN achieves highly competitive results throughout all experiments, underlining its robustness and capability to handle heterogeneous concept drift.

Biography: Viktor Losing received his M.Sc. in Intelligent Systems at the University of Bielefeld in 2014. Since 2015, he has been a PhD student at the CoR-Lab of the University of Bielefeld in cooperation with the HONDA Research Institute Europe. His research interests comprise incremental and online learning, learning under concept drift as well as corresponding real-world applications.

## 28 March 2018, 12:05, C47¶

Romain Giot, IUT Bordeaux and LaBRI
Biometric performance evaluation with novel visualization (slides)

Abstract: Biometric authentication verifies the identity of individuals based on what they are. However, biometric authentication systems are error prone and can reject genuine individuals or accept impostors. Researchers on biometric authentication quantify the quality of their algorithms by benchmarking them on several databases. However, although the standard evaluation metrics state the performance of a system, they are not able to explain the reasons for these errors.

After presenting the existing evaluation procedures of biometric authentication systems as well as visualisation properties, this talk presents a novel visual evaluation of the results of a biometric authentication system, which helps to find which individuals or samples are sources of errors and could help to fix the algorithms. Two variants are proposed: one where the individuals of the database are modelled as a directed graph and another one where the biometric database of scores is modelled as a partitioned power-graph where nodes represent biometric samples and power-nodes represent individuals. A novel recursive edge bundling method is also applied to reduce clutter. This proposal has been successfully applied to several biometric databases and has proved useful.

Biography: I am an associate professor at the IUT de Bordeaux and the LaBRI and head of the team “Back to Bench and Beyond” of the group “Bench to Knowledge and Beyond”. I have research experience in biometric authentication (as a PhD student at the University of Caen, where I worked on template update and multibiometrics for keystroke dynamics), anomaly detection (as a postdoctoral researcher at Orange Labs, where I worked on fraud detection in mobile payment), and large graph visualisation (since becoming an associate professor at Bordeaux).

## 5 March 2018, 12:05, C46¶

Fadi Badra, Paris 13 University
Analogical Transfer: a Form of Similarity-Based Inference? (slides)

Abstract: Making an analogical transfer consists in assuming that if two situations are alike in some ways, they may be alike in others. Such a cognitive process is the inspiration for different machine learning approaches like analogical classification, the k-nearest neighbors algorithm, or case-based reasoning. This talk explores the role of similarity in the transfer phase of analogy, by taking a qualitative reasoning viewpoint. We first show that there exists an intimate link between the qualitative measurement of similarity and computational analogy. Essential notions of formal models of analogy, such as analogical equalities/inequalities, or analogical dissimilarity, and the related inferences (mapping and transfer) can be formulated as operations on ordinal similarity relations. In the light of these observations, we will defend the idea that analogical transfer is a form of similarity-based inference.

Biography: Fadi Badra is an assistant professor at Paris 13 University, and is a member of the Medical Informatics and Knowledge Engineering Research Group (LIMICS) in Paris, France. He completed his PhD in the Orpailleur Research Group at the LORIA Lab in Nancy, France. His current research interests are in the area of computational analogy and case-based reasoning, with a particular focus on its adaptation phase.

## 22 November 2017, 12:00, C47¶

Vwani Roychowdhury, UCLA
The Unreasonable Effectiveness of Data: A Scalable framework for "Understanding" Social Forums and Online Discussions (no slides provided)

Abstract: As humans we interpret and react to the world around us in terms of narratives. At a basic level, a narrative is comprised of principal actors and entities, their interactions, and finally the decisions they make to reinforce and protect their interests. The primary question we address in this talk is whether a computer can automatically distill and create such narrative maps from millions of posts and discussions that happen in the online world. How much and which parts of the underlying narratives can be extracted via unsupervised statistical methods, and how much "humanness" needs to be coded into a computer? We provide a framework that uses statistical techniques to generate automated summaries, and show that when augmented with a small-size dictionary that encodes "humanness," the framework can generate effective narratives from a number of domains. We will present several sets of empirical results where millions of posts are processed to generate story graphs and plots of the underlying discussions.

Biography: Vwani Roychowdhury is a Professor of Electrical and Computer Engineering at University of California, Los Angeles (UCLA). He specializes in interdisciplinary work that deals with the modeling and design of information and computing systems, ranging across physical, biological and engineered systems. He has done pioneering work in Quantum Computing, Nanoelectronics, Peer-to-Peer (P2P), social and complex networks, machine learning, text mining, artificial neural networks, computer vision, and Internet-Scale data processing. He has published more than 200 peer-reviewed journal and conference papers, and co-authored several books. He has also cofounded several Silicon Valley startups, including www.netseer.com and www.stieleeye.com.

## 18 October 2017, 12:00, C47¶

Yun Sing Koh, University of Auckland
Using Volatility in Concept Drift Detection and Capturing Recurrent Concept Drift in Data Streams (slides)

Abstract: Much of scientific research involves the generation and testing of hypotheses that can facilitate the development of accurate models for a system. In machine learning, the automated building of accurate models is desired. However, traditional machine learning often assumes that the underlying models are static and unchanging over time. In reality, there are many applications that analyse data streams where the underlying model or system changes over time. This may be caused by changes in the conditions of the system, or a fundamental change in how the system behaves. In this talk, I will present a change detector called SEED, and how we capture stream volatility. We coin the term stream volatility to describe the rate of changes in a stream. A stream has a high volatility if changes are detected frequently and has a low volatility if changes are detected infrequently. I will also present a drift prediction algorithm to predict the location of future drift points based on historical drift trends, which we model as transitions between stream volatility patterns. Our method uses a probabilistic network to learn drift trends and is independent of the drift detection technique. I will then present a meta-learner, Concept Profiling Framework (CPF), that uses a concept drift detector and a collection of classification models to perform effective classification on data streams with recurrent concept drifts, through relating models by similarity of their classifying behaviour.

Biography: Yun Sing Koh is a Senior Lecturer at the Department of Computer Science, The University of Auckland, New Zealand. She completed her PhD at the Department of Computer Science, University of Otago, New Zealand in 2007. Her current research interest is in the area of data mining and machine learning, specifically data stream mining and pattern mining.

## 12 September 2017, 12:00, C47¶

Bob Durrant, University of Waikato
Random Projections for Dimensionality Reduction (slides)

## 12 July 2017, 12:00, C47¶

Amin Mantrach, Criteo Research
Deep Character-Level Click-Through Rate Prediction for Sponsored Search (slides)

## 31 May 2017, 12:00, C48¶

Quentin Lobbé, Télécom ParisTech
An exploration of web archives beyond the pages: Introducing web fragments (slides)
Mikaël Monet, Télécom ParisTech
Probabilistic query evaluation: towards tractable combined complexity (slides)

## 26 April 2017, 12:00, C47¶

Themis Palpanas, LIPADE, Paris Descartes University
Riding the Big IoT Data Wave: Complex Analytics for IoT Data Series (slides)

## 8 March 2017, 12:00, C47¶

Thomas Bonald, Télécom ParisTech
Community detection in graphs (slides)

## 27 February 2017, 12:00, C46¶

Laurent Decreusefond, Télécom ParisTech
Stochastic geometry, random hypergraphs, random walks (slides)

## 26 January 2017, 12:00, C47¶

Nofar Carmeli, Technion
Efficiently Enumerating Tree Decompositions (slides)

## 11 January 2017, 12:00, C47¶

Simon Razniewski, Free University of Bozen-Bolzano
Query-driven Data Completeness Assessment (slides)

## 14 December 2016, 12:00, C47¶

Fabian M. Suchanek, Télécom ParisTech
A hitchhiker’s guide to Ontology (slides)

## 23 November 2016, 12:00, C47¶

Ngurah Agus Sanjaya ER, Télécom ParisTech
Set of T-uples Expansion by Example (slides)
Qing Liu, National University of Singapore
Top-k Queries over Uncertain Scores (slides)

## 26 October 2016, 12:00, C46¶

Maria Koutraki, Université Paris-Saclay
Approaches towards unified models for integrating Web knowledge bases. (slides)

## From November 2013 to September 2016¶

During this time, the DBWeb seminar was held as part of the IC2 group seminar. These seminars used to be listed on the IC2 seminar Web page at https://www.infres.telecom-paristech.fr/wp/ic2/seminar/, but this link no longer works, so they are probably lost to time.

## 10 September 2013, 14:00, C49¶

Antoine Amarilli
Taxonomy-Based Crowd Mining (slides)
Jean-Louis Dessalles
Relevance (slides)

## 14 January 2013, 10:00, B549¶

Vincent Lepage, Cinequant
Cinequant, datamining pour le monde réel
Jean Marc Vanel, Déductions SARL
EulerGUI, un outil libre pour le Web Sémantique et l'inférence

## 4 December 2012, 10:00, C017¶

Jean-Louis Dessalles
Why spend (so much) time on the social Web? A model of investment in communication
François Rousseau
Short talk and brainstorming on graph based text representation and mining

## 20 November 2012, 10:00, C017¶

Mohamed-Amine Baazizi
Static analysis for optimizing the update of large temporal XML documents
Christos Giatsidis
S-cores and degeneracy based graph clustering

## 6 November 2012, 10:00, C49¶

Jonathan Michaux, Télécom ParisTech
Interaction safety in Web service orchestrations (slides)
Georges Gouriten
Brainstorming on knowledge-based content suggestions on the social Web

## 16 October 2012, 10:00, C49¶

Clémence Magnien, Université Pierre et Marie Curie
Measuring, studying, and modelling the dynamics of Internet topology
Imen Ben Dhia
Evaluating reachability queries over large social graphs (slides)

## 2 October 2012, 10:00, C017¶

Idrissa Sarr, Université Cheikh Anta Diop
Dealing with the disappearance of nodes in social networks (slides)
Damien Munch
“Eating cake during a scientific talk:” Can we reverse-engineer natural language aspectual processing? (slides)

## 18 September 2012, 10:00, C017¶

Silviu Maniu
Context-Aware Top-k Processing using Views
Asma Souihli
Optimizing Approximations of DNF Query Lineage in Probabilistic XML (slides)

## 4 September 2012, 10:00, C017¶

Antoine Amarilli
Advances in holistic ontology alignment (slides)
Yannis Papakonstantinou, University of California, San Diego
Declarative, optimizable data-driven specifications of web and mobile applications