Welcome to the homepage of the DIG seminar, which is the regular seminar of the DIG team of LTCI at Télécom Paris. The seminar features talks by members of the team and guests from other research groups, as well as discussions on topics of relevance to the team. Talks are held at Télécom Paris, 19 place Marguerite Perey, 91120 Palaiseau (directions).

Attendance is open to the public, but please register in advance by emailing me at a3nm.seminar<REMOVETHIS>@a3nm.net if you are planning to attend.

You can subscribe to the mailing-list of seminar announcements here (the Mailman interface is in French, but all emails are in English), or by sending an email with subject "subscribe" to dig-seminar-subscribe@listes.telecom-paris.fr and replying to the confirmation email. You can also subscribe to the seminar sessions as an iCalendar feed (e.g., with ICSdroid on Android) using the following URL: https://a3nm.net/work/seminar/calendar.ics

The seminar was formerly called "DBWeb seminar" and "IC2 seminar". You may also be interested in the LTCI Data Science Seminar, which is co-organized by DIG and S2A.

12 March 2020, 13:15, 4A113

Mikaël Monet, Instituto Milenio Fundamentos de los datos
Logical Expressiveness of Graph Neural Networks

Graph Neural Networks (GNNs) are a family of machine learning architectures that has recently become popular for applications dealing with structured data, such as molecule classification and knowledge graph completion. Recent work on the expressive power of GNNs has established a close connection between their ability to classify nodes in a graph and the Weisfeiler-Lehman (WL) test for checking graph isomorphism. In turn, a seminal result by Cai et al. establishes that the WL test is tightly connected to the two-variable fragment of first-order logic extended with counting capabilities (FOC2). However, these results put together do not seem to characterize the relationship between GNNs and FOC2. This motivates the following question: which FOC2 node properties are expressible by GNNs? We start by considering GNNs that update the feature vector of a node by combining it with the aggregation of the vectors of its neighbors; we call these aggregate-combine GNNs (AC-GNNs). On the negative side, we present a simple FOC2 node property that cannot be captured by any AC-GNN. On the positive side, we identify a natural fragment of FOC2 whose expressiveness is subsumed by that of AC-GNNs. This fragment corresponds to graded modal logic, or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community. Next we improve the AC-GNN architecture by allowing global readouts, where in each layer we can compute a feature vector for the whole graph and combine it with local aggregations; we call these aggregate-combine-readout GNNs (ACR-GNNs). In this setting, we prove that each FOC2 formula is captured by an ACR-GNN classifier. Besides their own value, these results put together indicate that readouts strictly increase the discriminative power of GNNs. (Ongoing work with Pablo Barceló, Egor Kostylev, Jorge Pérez, Juan Reutter and Juan Pablo Silva)
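To make the two architectures in the abstract concrete, here is a minimal sketch of a single layer with sum aggregation and a truncated-ReLU combine. It is an illustration under stated assumptions, not the construction used in the paper: the function name `gnn_layer`, the scalar features, the fixed weights, and the toy graph are all hypothetical. It shows that local aggregation cannot see a node in another connected component, whereas a global readout can.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]   # undirected graph as adjacency lists
Features = Dict[str, float]    # one scalar feature per node (1.0 means "red")


def gnn_layer(graph: Graph, features: Features,
              own_w: float = 0.0, agg_w: float = 1.0,
              readout_w: float = 0.0) -> Features:
    """Apply one layer with sum aggregation and a truncated-ReLU combine.

    With readout_w == 0.0 this is an aggregate-combine (AC) layer.
    A non-zero readout_w also feeds in the sum over all nodes of the
    graph, which gives an aggregate-combine-readout (ACR) layer.
    The weights are hand-picked here (an assumption), not learned.
    """
    readout = sum(features.values())  # global information
    updated = {}
    for node, neighbours in graph.items():
        aggregated = sum(features[n] for n in neighbours)  # local information
        value = (own_w * features[node]
                 + agg_w * aggregated
                 + readout_w * readout)
        updated[node] = min(1.0, max(0.0, value))  # clip to [0, 1]
    return updated


# Hypothetical graph: one edge a-b and an isolated node c; only a is red.
graph = {"a": ["b"], "b": ["a"], "c": []}
red = {"a": 1.0, "b": 0.0, "c": 0.0}

# AC layer: "the node has a red neighbour" (a modal-style, local property).
has_red_neighbour = gnn_layer(graph, red)

# ACR layer: "some node of the graph is red" (needs the global readout).
red_exists = gnn_layer(graph, red, agg_w=0.0, readout_w=1.0)
```

In this toy graph, node c has no neighbours, so no number of purely local layers can tell it that a red node exists elsewhere; the readout term is what carries that information.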

21 November 2019, 12:00, 3.A26

Louis Jachiet, Télécom Paris
Reasoning about Disclosure in Data Integration in the Presence of Source Constraints (slides)

This talk will be about the recent paper "Reasoning about Disclosure in Data Integration in the Presence of Source Constraints" presented at IJCAI 19. The talk will mix material from the paper and a general introduction to the tools used in the paper (such as the "Chase"). Here is the abstract of this paper:

Data integration systems allow users to access data sitting in multiple sources by means of queries over a global schema, related to the sources via mappings. Data sources often contain sensitive information, and thus an analysis is needed to verify that a schema satisfies a privacy policy, given as a set of queries whose answers should not be accessible to users. Such an analysis should take into account not only knowledge that an attacker may have about the mappings, but also what they may know about the semantics of the sources.

In this paper, we show that source constraints can have a dramatic impact on disclosure analysis. We study the problem of determining whether a given data integration system discloses a source query to an attacker in the presence of constraints, providing both lower and upper bounds on source-aware disclosure analysis.

17 October 2019, 12:00, C47

Julien Romero, Télécom Paris
Commonsense Properties from Query Logs and Question Answering Forums (slides)

Abstract: Commonsense knowledge about object properties, human behavior, and general concepts is crucial for robust AI applications. However, the automatic acquisition of this knowledge is challenging because of sparseness and bias in online sources. This talk presents Quasimodo, a methodology and tool suite for distilling commonsense properties from non-standard web sources. We devise novel ways of tapping into search-engine query logs and QA forums and combining the resulting candidate assertions with statistical cues from encyclopedias, books and image tags in a corroboration step. Unlike prior work on commonsense knowledge bases, Quasimodo focuses on salient properties that are typically associated with certain objects or concepts. Extensive evaluations, including extrinsic use-case studies, show that Quasimodo provides better coverage than state-of-the-art baselines with comparable quality.

Bio: Julien Romero is a PhD student in the group, whose thesis is supervised by Fabian Suchanek.

The seminar will be followed by a presentation and discussion on the future AIDA center by Guillaume Desvaux (head of the future center).

2 October 2019, 12:00, C46

Nesime Tatbul, Intel Labs and MIT CSAIL
Practical Tools for Time Series Anomaly Detection

Abstract: From autonomous driving to industrial IoT, the age of billions of intelligent devices generating time-varying data is here. There is a growing need to ingest and analyze high volumes of time series data at scale. In our Metronome Project, we have been broadly exploring novel data management, machine learning, and interactive visualization techniques for supporting the practical development and deployment of predictive time series analytics applications. This talk will focus on our efforts in time series anomaly detection, including: (i) a customizable scoring model for evaluating accuracy, which extends the classical precision/recall model to range-based data; (ii) a zero-positive learning paradigm, which enables training anomaly detectors in the absence of labeled datasets; and (iii) Metro-Viz, a visual tool for interactively analyzing time series anomalies.

Bio: Nesime Tatbul is a senior research scientist at Intel’s Parallel Computing Lab (PCL) and a visiting scientist at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL). Previously, she served on the computer science faculty of ETH Zurich after receiving a Ph.D. degree from Brown University. Her research interests are in large-scale data management systems and modern data-intensive applications. She is best known for her contributions to stream processing, which include the Aurora/Borealis systems (now TIBCO StreamBase) and the S-Store system (the first streaming OLTP system). Nesime is the recipient of an IBM Faculty Award (2008), two ACM SIGMOD Best Demonstration Awards (2005 and 2019), and ACM DEBS Grand Challenge and Best Poster Awards (2011). She has served on the organization and program committees for various conferences including SIGMOD, VLDB, ICDE, and DEBS, and on the editorial boards of the SIGMOD Record and the VLDB Journal.

11 September 2019, 12:00, C47

Martín Muñoz, Pontificia Universidad Católica de Chile
Descriptive Complexity for Counting Complexity Classes (slides)

Abstract: The goal of Descriptive Complexity is to measure the complexity of computational problems by characterizing them in terms of logics. However, the study of Descriptive Complexity has mainly focused on decision problems, and much less insight has been given into how to logically capture counting problems.

This paper builds on the idea of Weighted Logics to obtain a framework called Quantitative Second Order Logics (QSO). Our main contributions are showing how this framework can be used to logically capture many of the well-studied counting complexity classes (like FP and #P); using QSO to find classes below #P, with good closure and approximation properties; and showing how to use quantitative recursion over QSO to capture lower classes like #L.

Bio: I am a PhD student at the Pontificia Universidad Católica de Chile and a member of the Millennium Institute for Foundational Research on Data. I received an M.Sc. degree and my professional degree in Computer Engineering in 2017, both from the Pontificia Universidad Católica de Chile. My doctoral research has focused on document spanners and enumeration complexity. I am also a member of the progcomp group in Chile that works to promote competitive programming to students in higher education.

4 July 2019, 11:00, B310

Gianmarco De Francisci Morales, ISI Foundation
Controversy on Social Media: Collective Attention, Echo Chambers, and Price of Bipartisanship (slides)

Abstract: How do we discuss controversial topics on social media? Answering this question is not only interesting from a societal point of view, but also has concrete implications for policy makers, news agencies, and internet companies. In this talk, we first take a look at how collective attention, which is typically related to external events that increase the visibility of the topic, changes the debate. Our analysis shows that, in long-lived controversial debates on Twitter, increased collective attention is associated with increased network polarization. Then, we show how content and network interact in the formation of echo chambers. As expected, Twitter users are mostly exposed to political opinions that agree with their own. In addition, users who try to bridge the echo chambers by sharing content with diverse leaning have to pay a “price of bipartisanship” in terms of their network centrality and content appreciation.

Bio: Gianmarco De Francisci Morales is a Senior Researcher at ISI Foundation in Turin. Previously he worked as a Scientist at Qatar Computing Research Institute in Doha, as a Visiting Scientist at Aalto University in Helsinki, as a Research Scientist at Yahoo Labs in Barcelona, and as a Research Associate at ISTI-CNR in Pisa. He received his Ph.D. in Computer Science and Engineering from the IMT Institute for Advanced Studies of Lucca in 2012. His research focuses on scalable data mining, with an emphasis on Web mining and data-intensive scalable computing systems. He is an active member of the open source community of the Apache Software Foundation, working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is one of the lead developers of Apache SAMOA, an open-source platform for mining big data streams. He commonly serves on the PC of several major conferences in the area of data mining, including WSDM, KDD, CIKM, and WWW. He co-organizes the workshop series on Social News on the Web (SNOW), co-located with the WWW conference. He has won best paper awards at WSDM and WebSci.

20 June 2019, 12:00, C47

Cyril Chhun, Télécom Paris
Clustering by contrast (slides)

Abstract: The objective is to design a clustering method that learns both prototypes and contrasts with prototypes, to obtain descriptions such as "large spade" where both "large" and "spade" can be learned in one shot. Here, the word "spade" is attached to the closest prototype to the instance, and "large" is attached to the prototype of the contrast between the instance and its prototype.

Bio: Cyril is finishing his internship at DIG/Télécom-Paris, together with his 3rd year of study at École Polytechnique.

Julien Panis-Lie, Télécom Paris
Extreme social signals (slides)

Altruistic forms of behaviour abound: writing open source programs, providing answers in technical forums, contributing to Wikipedia, volunteer work, charity, heroism, etc. Basic game theory predicts that such altruism should not exist, especially when it is anonymous. We explore the hypothesis that these forms of altruism may serve a signalling purpose. We test the model on the most extreme case: suicide for the group.

Bio: Julien is currently interning at DIG/Télécom-Paris. After Polytechnique, he studied economics and sociology, before following the CogMaster, which he is completing with this internship.

6 June 2019, 14:00, C49

LTCI Data Science Seminar session. Speakers: Pavlo Mozharovskyi and Silviu Maniu

See the LTCI Data Science Seminar Webpage for details.

9 May 2019, 14:00, C47

Kuldeep S. Meel, National University of Singapore
Beyond NP Revolution (slides)

The paradigmatic NP-complete problem of Boolean satisfiability (SAT) solving is a central problem in Computer Science. While the mention of SAT can be traced to the early 19th century, efforts to develop practically successful SAT solvers go back to the 1950s. The past 20 years have witnessed an "NP revolution" with the development of conflict-driven clause-learning (CDCL) SAT solvers. Such solvers combine a classical backtracking search with a rich set of effective heuristics. While 20 years ago SAT solvers were able to solve instances with at most a few hundred variables, modern SAT solvers solve instances with up to millions of variables in a reasonable time.

The "NP revolution" opens up opportunities to design practical algorithms with rigorous guarantees for problems in complexity classes beyond NP by replacing an NP oracle with a SAT solver. In this talk, we will discuss how we use the NP revolution to design practical algorithms for two fundamental problems in artificial intelligence and formal methods: constrained counting and sampling.
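As a rough illustration of the notions above, the following sketch shows a plain backtracking search with unit propagation and a brute-force model counter. It is deliberately much simpler than the CDCL solvers and the counting algorithms discussed in the talk: there is no clause learning and no approximation guarantee, and all names (`simplify`, `solve`, `count_models`) and the example formula are hypothetical. Literals follow the DIMACS convention, where `3` stands for x3 and `-3` for its negation.

```python
from typing import Dict, List, Optional

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def simplify(clauses: List[Clause], literal: int) -> List[Clause]:
    """Assume `literal` is true: drop satisfied clauses, shrink the others."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue  # clause already satisfied
        result.append([lit for lit in clause if lit != -literal])
    return result


def solve(clauses: List[Clause],
          assignment: Optional[Dict[int, bool]] = None) -> Optional[Dict[int, bool]]:
    """Backtracking search; returns a (possibly partial) model or None."""
    assignment = dict(assignment or {})
    if any(len(clause) == 0 for clause in clauses):
        return None  # an empty clause cannot be satisfied
    if not clauses:
        return assignment  # every clause is satisfied
    # Heuristic: prefer a literal forced by a unit clause, if there is one.
    unit = next((clause[0] for clause in clauses if len(clause) == 1), None)
    literal = unit if unit is not None else clauses[0][0]
    choices = [literal] if unit is not None else [literal, -literal]
    for choice in choices:
        extended = {**assignment, abs(choice): choice > 0}
        model = solve(simplify(clauses, choice), extended)
        if model is not None:
            return model
    return None


def count_models(clauses: List[Clause], variables: List[int]) -> int:
    """Exact model counting by exhaustive branching (exponential time).

    Assumes every variable used in `clauses` is listed in `variables`.
    """
    if any(len(clause) == 0 for clause in clauses):
        return 0
    if not clauses:
        return 2 ** len(variables)  # remaining variables are unconstrained
    var, rest = variables[0], variables[1:]
    return (count_models(simplify(clauses, var), rest)
            + count_models(simplify(clauses, -var), rest))


# Hypothetical formula: (x1 OR x2) AND (NOT x1 OR x3)
formula = [[1, 2], [-1, 3]]
model = solve(formula)
n_models = count_models(formula, [1, 2, 3])
```

Constrained counting asks for `n_models` without enumerating every branch; the exhaustive counter above only serves to define the quantity on a small instance.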


Kuldeep Meel is an Assistant Professor of Computer Science in the School of Computing at the National University of Singapore, where he holds the Sung Kah Kay Assistant Professorship. He received his Ph.D. (2017) and M.S. (2014) degrees in Computer Science from Rice University. He holds a B.Tech. (with Honors) degree (2012) in Computer Science and Engineering from the Indian Institute of Technology, Bombay. His research interests lie at the intersection of Artificial Intelligence and Formal Methods. Meel has co-presented tutorials at top-tier AI conferences: IJCAI 2018, AAAI 2017, and UAI 2016. His work received the 2018 Ralph Budd Award for Best PhD Thesis in Engineering, the 2014 Outstanding Masters Thesis Award from the Vienna Center of Logic and Algorithms, and the Best Student Paper Award at CP 2015. He received the IBM Ph.D. Fellowship and the 2016-17 Lodieska Stockbridge Vaughn Fellowship for his work on constrained sampling and counting.

14 March 2019, 14:00, C49

LTCI Data Science Seminar session. Speakers: Thomas Bonald and Alessandro Rudi.

See the LTCI Data Science Seminar Webpage for details.

21 February 2019, 12:00, C47

Louis Jachiet, Inria Lille
On the optimization of recursive queries over graphs (slides)

Abstract: Since its introduction, the relational model has seen various attempts to extend it with recursion and it is now possible to use recursion in several SQL or Datalog database systems. The optimization of such recursive queries remains, however, a challenge.

In this talk, we will introduce μ-RA, a variation of the Relational Algebra that allows for the expression of relational queries with recursion. μ-RA can express unions of conjunctive regular path queries over graphs (similar to the Property Paths of SPARQL) as well as certain non-regular properties.

We will present its syntax, semantics and the rewriting rules we specifically devised to tackle the optimization of recursive queries. We will also present our implementation and a benchmark comparing our prototype with state-of-the-art systems.
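To give a flavour of what recursion over a relation means, the sketch below evaluates graph reachability (the path query "one or more edges") as a least fixpoint, using semi-naive iteration. This is a generic textbook technique shown under assumptions: it does not reproduce the μ-RA syntax or the rewriting rules of the talk, and the function name and example edges are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Least fixpoint of X = edges UNION (X JOIN edges), semi-naive style.

    Each round joins only the pairs discovered in the previous round
    (`delta`) with the base relation, instead of rejoining everything.
    """
    closure = set(edges)
    delta = set(edges)
    while delta:
        # Join: extend each newly found path (a, b) with an edge (b, d).
        joined = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        delta = joined - closure  # keep only pairs not seen before
        closure |= delta
    return closure


# Hypothetical graph: a -> b -> c -> d
edges = {("a", "b"), ("b", "c"), ("c", "d")}
reachable = transitive_closure(edges)
```

An optimizer for recursive queries can, for instance, decide whether a filter such as "start from node a" is applied inside the loop or only on the final result; the loop above applies no such filter.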

Bio: Louis Jachiet is a postdoctoral researcher in the Spirals team at Inria Lille, where he studies data security in databases. He previously was a teaching assistant at the École Normale Supérieure in Paris and did his PhD in the Tyrex team at Inria Grenoble on the topic of the optimization of SPARQL queries for distributed systems.

24 January 2019, 12:00, C47

Nada Mimouni, Télécom ParisTech
Knowledge graph embedding for mining cultural heritage data (slides)

Abstract: In this talk we present a method for mining cultural heritage data using knowledge graph embedding models and the preliminary results of our ongoing work. This work is supported by the Data & Musée project, whose main goal is to define a model to integrate data produced by different cultural institutions in order to recommend useful original content for users (visitors or people belonging to the institution). First, we create a global context graph for the target domain. The graph is built as the union of individual contextual graphs of all entities representing the input data from two main cultural institutions: Paris Musées (PM) and Centre des Monuments Nationaux (CMN). Second, we propose to mine the resulting knowledge graph using a neural-network-based graph embedding model with biased graph walks. The model calculates an embedding vector for each entity in the graph that could be used for different machine learning tasks to reach the objectives of this research.

Bio: Nada Mimouni is a postdoc at Télécom ParisTech in the IDS department.

10 January 2019, 12:00, C46

Nedeljko Radulović, Télécom ParisTech
Explainable Artificial Intelligence (slides)

Abstract: In recent years, machine learning and artificial intelligence systems have been reaching, and sometimes even exceeding, human performance in tasks such as image recognition, speech understanding, or strategic decision making. The main problem with many of these models is their lack of transparency and interpretability: there is no information about how exactly they reached their predictions. This is a major issue in sensitive fields such as healthcare, policing, and finance. To address these issues, explainable artificial intelligence (XAI) has become an important topic of interest in the research community.

Through our research, we want to address this problem with insights from another field that has recently celebrated great advances: that of large knowledge bases (KBs). By contributing the link to the real world, KBs can give a semantic dimension to machine learning (ML) algorithms. While semantic background knowledge has long been used in ML, we believe that the recent explosion of the size of KBs warrants a revisit of this approach. KBs are now much larger, much broader in terms of thematic coverage, and much cleaner at scale. We imagine that a symbiosis between these new KBs and ML could take several forms: semantics can be injected a posteriori into a learned model; semantics can be taken into account as background knowledge during the learning process, or the learning process can feed directly from the semantic data. We aim to systematically explore all of these possibilities and investigate how they can serve to make AI and ML models more interpretable, more explainable, and ultimately more human-intelligible.

Bio: I studied Electrical Engineering and Computer Science at the School of Electrical Engineering, University of Belgrade, Serbia. This year I obtained an M.Sc. in Computer Science at Télécom ParisTech. I am starting Ph.D. studies with Professors Albert Bifet and Fabian Suchanek. The research topic of my Ph.D. is Explainable AI.

13 December 2018, 12:00, C47

David Carral, TU Dresden
Reasoning with Description Logics Ontologies and Knowledge Graphs (slides)

Abstract: Ontology-based access to knowledge graphs (KGs) has recently gained a lot of attention. One of the research challenges when accessing these large data structures is to enable "the capability of combining diverse reasoning methods and knowledge representations while guaranteeing the required scalability, according to the reasoning task at hand." [1]

In our work, we address this challenge with a focus on reasoning with KGs extended with Description Logics (DL) ontologies. In principle, one could make use of existing DL reasoners to solve these reasoning tasks. However, DL reasoners, which are designed to deal with complex terminological axioms, do not scale well in the presence of large amounts of assertional information. In contrast, existing rule engines such as VLog or RDFox can efficiently reason with data-intensive knowledge bases. To take advantage of these powerful implementations, we propose several data-independent mappings from DL TBoxes into rule sets that preserve the outcomes of conjunctive query (CQ) answering. Our experiments indicate that reasoning with rule engines over the resulting CQ-preserving rewritings can be significantly more efficient than using state-of-the-art DL reasoners over the original DL ontologies.

[1] This quote is taken from the description of a recent Dagstuhl seminar on Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web.

Bio: Since October 2016, I have been a postdoctoral scholar at the Knowledge-Based Systems group led by Prof. Markus Krötzsch at Technische Universität Dresden. I completed my doctoral degree at Wright State University under the supervision of Prof. Pascal Hitzler. For a couple of months at the beginning of my Ph.D., I was an exchange student at the University of Oxford, working under the supervision of Prof. Bernardo Cuenca Grau.

Broadly speaking, I am interested in the study of logical languages such as Description Logics and existential rules, the implementation of reasoning algorithms for these languages, and the use and application of semantic web technologies in different domains.

11 December 2018, 14:00, C48

LTCI Data Science Seminar session. Speaker: Shai Ben-David.

See the LTCI Data Science Seminar Webpage for details.

29 November 2018, 14:00, C48

LTCI Data Science Seminar session. Speakers: Rodrigo Mello and Olivier Sigaud.

See the LTCI Data Science Seminar Webpage for details.

15 November 2018, 12:00, B551

Arnaud Soulet, Université de Tours
Representativeness of Knowledge Bases with the Generalized Benford’s Law (slides)

Abstract: Knowledge bases (KBs) such as DBpedia, Wikidata, and YAGO contain a huge number of entities and facts. Several recent works induce rules or calculate statistics on these KBs. Most of these methods are based on the assumption that the data is a representative sample of the studied universe. Unfortunately, KBs are biased because they are built from crowdsourcing and opportunistic agglomeration of available databases. This work aims at approximating the representativeness of a relation within a knowledge base. For this, we use the Generalized Benford's law, which indicates the distribution that the facts of a relation are expected to follow. We then compute the minimum number of facts that have to be added in order to make the KB representative of the real world. Experiments show that our unsupervised method applies to a large number of relations. For numerical relations where ground truths exist, the estimated representativeness proves to be a reliable indicator.
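The idea can be sketched on first digits as follows. The code assumes the usual form of the generalized Benford's law, where the first digit d has probability proportional to (d+1)^(1-α) - d^(1-α), which reduces to log10(1 + 1/d) when α = 1. The estimate of missing facts is a deliberately naive one introduced here for illustration (the smallest total for which no digit is over-represented); it is not necessarily the computation used in the talk, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def generalized_benford(alpha: float = 1.0) -> Dict[int, float]:
    """Expected first-digit distribution for exponent alpha."""
    if abs(alpha - 1.0) < 1e-12:
        # Classical Benford's law
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    e = 1.0 - alpha
    norm = 10 ** e - 1
    return {d: ((d + 1) ** e - d ** e) / norm for d in range(1, 10)}


def first_digit(value: float) -> int:
    """Leading non-zero digit of a non-zero number."""
    value = abs(value)
    if value == 0:
        raise ValueError("zero has no leading digit")
    while value >= 10:
        value /= 10
    while value < 1:
        value *= 10
    return int(value)


def missing_facts(values: Iterable[float], alpha: float = 1.0) -> int:
    """Naive estimate of how many facts are missing (an assumption).

    Finds the smallest total N such that N * expected[d] covers the
    observed count of every digit d, and returns N minus the number
    of observed facts.
    """
    counts = Counter(first_digit(v) for v in values)
    expected = generalized_benford(alpha)
    observed_total = sum(counts.values())
    target_total = max(counts[d] / expected[d] for d in range(1, 10))
    return max(0, math.ceil(target_total) - observed_total)


# Hypothetical relation whose ten values all start with the digit 9.
skewed_values = [9, 90, 95, 900, 912, 9000, 9.5, 0.9, 97, 99]
estimate = missing_facts(skewed_values)
```

A sample in which every value starts with 9 is far from the expected distribution, so the estimate is large; a sample that already follows the law yields an estimate close to zero.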

Bio: Arnaud Soulet is an associate professor at University of Tours. His research interests include databases, data mining and knowledge bases.

8 November 2018, 12:00, C47

Borja Balle, Amazon Research
Privacy-Aware Machine Learning Systems (slides)

Abstract: Privacy-aware machine learning systems allow us to train models on sensitive data without the need to have plain-text access to the data. For example, such systems could enable hospitals in different countries to learn models on their combined datasets without the need to entrust the data held by each hospital to a centralized computing node. In this talk I will describe how several privacy-enhancing technologies like differential privacy and secure multi-party computation come together in this line of work. In particular, I will highlight our current progress in this space and the remaining challenges to obtain scalable and trusted large-scale deployments.

Bio: Borja Balle is currently a Machine Learning Scientist at Amazon Research in Cambridge (UK). Before joining Amazon, Borja was a lecturer at Lancaster University (2015-2017), a postdoctoral fellow at McGill University (2013-2015), and a graduate student at Universitat Politècnica de Catalunya where he obtained his PhD in 2013. His main research interest is in privacy-preserving machine learning, including the use of differential privacy and multi-party computation in distributed learning systems, and the mathematical foundations of privacy-aware data science.

25 October 2018, 12:00, C47

Fabian M. Suchanek, Télécom ParisTech
An introduction to deep learning (slides)

Abstract: In this talk, I will present the basics of deep learning. The goal of the presentation is twofold: 1) share what I learnt about deep learning with those who would like to know what it is and 2) receive feedback from those who already know more than myself about it. I have slides, but the presentation will follow the interaction with the audience.

Biography: Fabian Suchanek is a professor in the group.

4 October 2018, 13:00, C47

Quentin Lobbé, Télécom ParisTech
Where the dead blogs are: a disaggregated exploration of Web archives to reveal extinct online collectives (slides)

Abstract: The Web is an unsteady environment. As Web sites emerge and expand every day, whole communities may fade away over time by leaving too few or incomplete traces on the living Web. Worldwide volumes of Web archives preserve the history of the Web and reduce the loss of this digital heritage. Web archives remain essential to the comprehension of the lifecycles of extinct online collectives. In my talk, I will introduce a framework to follow the internal dynamics of vanished Web communities, based on the exploration of corpora of Web archives. To achieve this goal, I propose the definition of a new unit of analysis called Web fragment: a semantic and syntactic subset of a given Web page, designed to increase historical accuracy. This contribution has practical value for those who conduct large-scale archive exploration (in terms of time range and volume) or are interested in computational approaches to Web history and social science. By applying this framework to the Moroccan archives of the e-Diasporas Atlas, we will first witness the collapse of an established community of Moroccan migrant blogs. We will show its progressive mutation towards rising social platforms, between 2008 and 2018. Then, we will study the sudden creation of an ephemeral collective of forum members gathered by the wave of the Arab Spring in early 2011. We will finally yield new insights into historical Web studies by suggesting the concept of pivot moment of the Web.

Biography: Quentin Lobbé is a PhD student in the group; this is a rehearsal talk for the BDA conference.

6 September 2018, 12:00, C47

Rodrigo Mello, University of São Paulo
The Statistical Learning Theory in Practical Problems (slides, code)

Abstract: In this 30-minute talk, Prof. Rodrigo Mello will introduce his main research interests in an informal way: Statistical Learning Theory, Data Streams/Time Series modeling using Statistics and Dynamical Systems, and how theoretical aspects can support the design of Deep Learning architectures. Several applications will also be mentioned during this talk.

Biography: Rodrigo Mello is currently an Associate Professor at the Institute of Mathematics and Computer Sciences, Department of Computer Science, University of São Paulo, São Carlos, Brazil. Prof. Mello is currently on a sabbatical year as an invited professor at Télécom ParisTech, at the invitation of Prof. Albert Bifet. He completed his PhD degree at the University of São Paulo, São Carlos in 2003 and also spent one year as an Invited Professor at St. Francis Xavier University, Antigonish, NS, Canada. His research interests are mostly associated with theoretical aspects of Machine Learning, Data Streams/Time Series modeling and prediction, and Deep Learning.

12 July 2018, 12:00, C48

Joe Raad, Université Paris-Saclay
Towards a solution to the “sameAs problem” (slides)

Abstract: In the absence of a central naming authority on the Semantic Web, it is common for different datasets to refer to the same thing by different IRIs. Whenever multiple names are used to denote the same thing, owl:sameAs statements are needed in order to link the data and foster reuse. However, studies that date back as far as 2009 have observed that the Semantic Web identity predicate is sometimes used incorrectly, leaving multiple incorrect owl:sameAs statements in the Web. This problem is known as the “sameAs problem”. In this talk, we show how network metrics, such as the community structure of the owl:sameAs graph, can be used for detecting such possibly erroneous statements. One benefit of the approach presented here is that it can be applied to the network of owl:sameAs links itself, and does not rely on any additional knowledge. In order to illustrate its ability to scale, the approach is evaluated on the largest collection of identity links to date, containing over 558 million owl:sameAs links scraped from the LOD Cloud.

Biography: Joe is a PhD student at the University of Paris-Saclay, and member of the LINK (AgroParisTech-INRA, Paris) and LAHDAK teams (LRI, Orsay). His current research comprises knowledge representation using Semantic Web languages, as well as studying the use of identity in the Semantic Web.

5 July 2018, 12:00, C47

Thomas Rebele, Télécom ParisTech, DIG team
Extending the YAGO knowledge base (slides)

Abstract: A knowledge base is a set of facts about the world. YAGO was one of the first large-scale knowledge bases that were constructed automatically. This presentation shows our work on extending the YAGO knowledge base along two axes: extraction and preprocessing.

The first part of the talk presents methods that increase the number of facts about people in YAGO. We have developed algorithms and heuristics for extracting more facts about birth and death dates, about gender, and about the place of residence. We also show how to use these data for studies in Digital Humanities.

The second part discusses two algorithms for repairing a regular expression automatically so that it matches a given set of words. Experiments on various datasets show the effectiveness and generality of these algorithms. Both algorithms improve the recall of the initial regular expression while achieving a similar or better precision.

The third part presents a system for translating database queries into Bash scripts. This approach allows preprocessing large tabular datasets and knowledge bases by executing Datalog and SPARQL queries, without installing any software beyond a Unix-like operating system. Experiments show that the performance of our system is comparable with state-of-the-art systems.

Biography: Thomas Rebele is a PhD student in our group.

14 June 2018, 12:00, C47

Lucie-Aimée Kaffee, University of Southampton
Multilinguality of Wikidata (slides)

Abstract: The web in general shows a lack of support for non-English languages. One way of overcoming this lack of information is using multilingual linked data. Wikidata supports over 400 languages in theory. In practice, however, not all languages are equally supported. As a first step, we want to explore the language distribution of a collaboratively edited knowledge base such as Wikidata and label coverage of the web of data in general. Labels are the access point for humans to the web of data, and a lack thereof means limited reusability. Wikipedia is an ideal candidate for reuse of the multilingual data: the project has instances in over 280 languages, but the number of articles differs drastically. For many readers it could be a first starting point to get information. However, with a lack of information the project is unlikely to attract new community members that could create new articles. We investigate the possibility of neural natural language generation for underserved Wikipedia communities, using Wikidata’s facts, and evaluate this approach with the help of the Arabic and Esperanto Wikipedia communities. This approach can only be as good as the amount of multilingual data we have at our disposal. Therefore, we discuss future ways of improving the coverage of under-resourced languages’ information in Wikidata.

Biography: Lucie is a PhD student at the School of Electronics and Computer Science, University of Southampton, as part of the Web and Internet Science (WAIS) research group. Additionally, she is part of the Marie Skłodowska-Curie ITN Aqua. Generally, she is working on how to support underserved languages on the web by means of linked data. Therefore, her research interests include linked data, multilinguality, Wikidata, underserved languages on the web and most recently natural language generation and relation extraction. Before getting involved with research, she worked as a software developer at Wikimedia Deutschland in the Wikidata team. There she was already involved in the previously mentioned topics, developing the ArticlePlaceholder extension, which includes Wikidata’s structured knowledge on Wikipedias of small languages, a project she continued research on. She is still involved in Open Source projects, mainly Wikimedia-related, where she is currently part of the Code of Conduct Committee for technical spaces.

23 May 2018, 12:05, C47

Viktor Losing, University of Bielefeld, HONDA Research Institute Europe
Memory Models for Incremental Learning Architectures (slides)

Abstract: There are more and more products available with automated functions for human assistance or autonomous services in home or outdoor environments. A common problem is the inadequate match between user expectations, which are highly individual, and the assistant system function, which is typically rather standardized. Incremental learning methods offer a way to adapt the parameters and behavior of an assistant system according to user needs and preferences. In this talk, I will illustrate the benefits of personalization and incremental learning using the task of driver maneuver prediction at intersections. The study is based on a collection of commuting drivers who recorded their daily routes with a standard smartphone and GPS receiver. The personalized prediction based on at least one experience of a certain intersection already improves the prediction performance over an average prediction model trained.

A closely related topic is incremental learning in non-stationary data streams, which is highly challenging, since the possibly occurring types of drift are fundamentally different and undermine classical assumptions such as data independence or stationary distributions. Here, I will introduce the Self Adjusting Memory (SAM) model for the k Nearest Neighbor (kNN) algorithm. The basic idea is to construct dedicated models for the current and former concepts and apply them according to the demands of the given situation. In an extensive evaluation, SAM-kNN achieves highly competitive results throughout all experiments, underlining its robustness and capability to handle heterogeneous concept drift.

Biography: Viktor Losing received his M.Sc. in Intelligent Systems at the University of Bielefeld in 2014. Since 2015 he has been a PhD student at the CoR-Lab of the University of Bielefeld in cooperation with the HONDA Research Institute Europe. His research interests comprise incremental and online learning, learning under concept drift as well as corresponding real-world applications.

28 March 2018, 12:05, C47

Romain Giot, IUT Bordeaux and LaBRI
Biometric performance evaluation with novel visualization (slides)

Abstract: Biometric authentication verifies the identity of individuals based on what they are. However, biometric authentication systems are error prone and can reject genuine individuals or accept impostors. Researchers on biometric authentication quantify the quality of their algorithms by benchmarking them on several databases. However, although the standard evaluation metrics state the performance of a system, they are not able to explain the reasons for these errors.

After presenting the existing evaluation procedures of biometric authentication systems as well as visualisation properties, this talk presents a novel visual evaluation of the results of a biometric authentication system which helps to find which individuals or samples are sources of errors and could help to fix the algorithms. Two variants are proposed: one where the individuals of the database are modelled as a directed graph and another one where the biometric database of scores is modelled as a partitioned power-graph where nodes represent biometric samples and power-nodes represent individuals. A novel recursive edge bundling method is also applied to reduce clutter. This proposal has been successfully applied to several biometric databases and has proved its value.

Biography: I am an associate professor at the IUT de Bordeaux and the LaBRI and head of the team “Back to Bench and Beyond” of the group “Bench to Knowledge and Beyond”. I have research experience in biometric authentication (as a PhD student at the University of Caen, where I worked on template update and multibiometrics for keystroke dynamics), anomaly detection (as a postdoctoral researcher at Orange Labs, where I worked on fraud detection in mobile payment), and large graph visualisation (since becoming an associate professor at Bordeaux).

5 March 2018, 12:05, C46

Fadi Badra, LIMICS
Analogical Transfer: a Form of Similarity-Based Inference? (slides)

Abstract: Making an analogical transfer consists in assuming that if two situations are alike in some ways, they may be alike in others. Such a cognitive process is the inspiration for different machine learning approaches like analogical classification, the k-nearest neighbors algorithm, or case-based reasoning. This talk explores the role of similarity in the transfer phase of analogy, by taking a qualitative reasoning viewpoint. We first show that there exists an intimate link between the qualitative measurement of similarity and computational analogy. Essential notions of formal models of analogy, such as analogical equalities/inequalities, or analogical dissimilarity, and the related inferences (mapping and transfer) can be formulated as operations on ordinal similarity relations. In the light of these observations, we will defend the idea that analogical transfer is a form of similarity-based inference.

Biography: Fadi Badra is an assistant professor at Paris 13 University, and is a member of the Medical Informatics and Knowledge Engineering Research Group (LIMICS) in Paris, France. He completed his PhD in the Orpailleur Research Group at the LORIA Lab in Nancy, France. His current research interests are in the area of computational analogy and case-based reasoning, with a particular focus on its adaptation phase.

22 November 2017, 12:00, C47

Vwani Roychowdhury, UCLA
The Unreasonable Effectiveness of Data: A Scalable framework for "Understanding" Social Forums and Online Discussions (no slides provided)

Abstract: As humans we interpret and react to the world around us in terms of narratives. At a basic level, a narrative is comprised of principal actors and entities, their interactions, and finally the decisions they make to reinforce and protect their interests. The primary question we address in this talk is whether a computer can automatically distill and create such narrative maps from millions of posts and discussions that happen in the online world. How much and which parts of the underlying narratives can be extracted via unsupervised statistical methods, and how much "humanness" needs to be coded into a computer? We provide a framework that uses statistical techniques to generate automated summaries, and show that when augmented with a small-size dictionary that encodes "humanness," the framework can generate effective narratives from a number of domains. We will present several sets of empirical results where millions of posts are processed to generate story graphs and plots of the underlying discussions.

Biography: Vwani Roychowdhury is a Professor of Electrical and Computer Engineering at University of California, Los Angeles (UCLA). He specializes in interdisciplinary work that deals with the modeling and design of information and computing systems, spanning physical, biological and engineered systems. He has done pioneering work in Quantum Computing, Nanoelectronics, Peer-to-Peer (P2P), social and complex networks, machine learning, text mining, artificial neural networks, computer vision, and Internet-Scale data processing. He has published more than 200 peer-reviewed journal and conference papers, and co-authored several books. He has also cofounded several Silicon Valley startups, including www.netseer.com and www.stieleeye.com.

18 October 2017, 12:00, C47

Yun Sing Koh, University of Auckland
Using Volatility in Concept Drift Detection and Capturing Recurrent Concept Drift in Data Streams (slides)

Abstract: Much of scientific research involves the generation and testing of hypotheses that can facilitate the development of accurate models for a system. In machine learning the automated building of accurate models is desired. However, traditional machine learning often assumes that the underlying models are static and unchanging over time. In reality there are many applications that analyse data streams where the underlying model or system changes over time. This may be caused by changes in the conditions of the system, or a fundamental change in how the system behaves. In this talk, I will present a change detector called SEED, and how we capture stream volatility. We coin the term stream volatility to describe the rate of changes in a stream. A stream has a high volatility if changes are detected frequently and has a low volatility if changes are detected infrequently. I will also present a drift prediction algorithm to predict the location of future drift points based on historical drift trends, which we model as transitions between stream volatility patterns. Our method uses a probabilistic network to learn drift trends and is independent of the drift detection technique. I will then present a meta-learner, the Concept Profiling Framework (CPF), that uses a concept drift detector and a collection of classification models to perform effective classification on data streams with recurrent concept drifts, through relating models by similarity of their classifying behaviour.

Biography: Yun Sing Koh is a Senior Lecturer at the Department of Computer Science, The University of Auckland, New Zealand. She completed her PhD at the Department of Computer Science, University of Otago, New Zealand in 2007. Her current research interest is in the area of data mining and machine learning, specifically data stream mining and pattern mining.

12 September 2017, 12:00, C47

Bob Durrant, University of Waikato
Random Projections for Dimensionality Reduction (slides)

12 July 2017, 12:00, C47

Amin Mantrach, Criteo Research
Deep Character-Level Click-Through Rate Prediction for Sponsored Search (slides)

31 May 2017, 12:00, C48

Quentin Lobbé, Télécom ParisTech
An exploration of web archives beyond the pages: Introducing web fragments (slides)
Mikaël Monet, Télécom ParisTech
Probabilistic query evaluation: towards tractable combined complexity (slides)

26 April 2017, 12:00, C47

Themis Palpanas, LIPADE, Paris Descartes University
Riding the Big IoT Data Wave: Complex Analytics for IoT Data Series (slides)

8 March 2017, 12:00, C47

Thomas Bonald, Télécom ParisTech
Community detection in graphs (slides)

27 February 2017, 12:00, C46

Laurent Decreusefond, Télécom ParisTech
Stochastic geometry, random hypergraphs, random walks (slides)

26 January 2017, 12:00, C47

Nofar Carmeli, Technion
Efficiently Enumerating Tree Decompositions (slides)

11 January 2017, 12:00, C47

Simon Razniewski, Free University of Bozen-Bolzano
Query-driven Data Completeness Assessment (slides)

14 December 2016, 12:00, C47

Fabian M. Suchanek, Télécom ParisTech
A hitchhiker’s guide to Ontology (slides)

23 November 2016, 12:00, C47

Ngurah Agus Sanjaya ER, Télécom ParisTech
Set of T-uples Expansion by Example (slides)
Qing Liu, National University of Singapore
Top-k Queries over Uncertain Scores (slides)

26 October 2016, 12:00, C46

Maria Koutraki, Université Paris-Saclay
Approaches towards unified models for integrating Web knowledge bases. (slides)

From November 2013 to September 2016

During this time, the DBWeb seminar was held as part of the IC2 group seminar. These seminars used to be listed on the IC2 seminar Web page at https://www.infres.telecom-paristech.fr/wp/ic2/seminar/, but this link no longer works, so they are probably lost to time.

10 September 2013, 14:00, C49

Antoine Amarilli
Taxonomy-Based Crowd Mining (slides)
Jean-Louis Dessalles
Relevance (slides)

14 January 2013, 10:00, B549

Vincent Lepage, Cinequant
Cinequant, data mining for the real world
Jean Marc Vanel, Déductions SARL
EulerGUI, a free software tool for the Semantic Web and inference

4 December 2012, 10:00, C017

Jean-Louis Dessalles
Why spend (so much) time on the social Web? A model of investment in communication
François Rousseau
Short talk and brainstorming on graph based text representation and mining

20 November 2012, 10:00, C017

Mohamed-Amine Baazizi
Static analysis for optimizing the update of large temporal XML documents
Christos Giatsidis
S-cores and degeneracy based graph clustering

6 November 2012, 10:00, C49

Jonathan Michaux, Télécom ParisTech
Interaction safety in Web service orchestrations (slides)
Georges Gouriten
Brainstorming on knowledge-based content suggestions on the social Web

16 October 2012, 10:00, C49

Clémence Magnien, Université Pierre et Marie Curie
Measuring, studying, and modelling the dynamics of Internet topology
Imen Ben Dhia
Evaluating reachability queries over large social graphs (slides)

2 October 2012, 10:00, C017

Idrissa Sarr, Université Cheikh Anta Diop
Dealing with the disappearance of nodes in social networks (slides)
Damien Munch
“Eating cake during a scientific talk:” Can we reverse-engineer natural language aspectual processing? (slides)

18 September 2012, 10:00, C017

Silviu Maniu
Context-Aware Top-k Processing using Views
Asma Souihli
Optimizing Approximations of DNF Query Lineage in Probabilistic XML (slides)

4 September 2012, 10:00, C017

Antoine Amarilli
Advances in holistic ontology alignment (slides)
Yannis Papakonstantinou, University of California, San Diego
Declarative, optimizable data-driven specifications of web and mobile applications