## IEEE Transactions on Knowledge and Data Engineering - new TOC Alert for Publication #69

- Iterative Graph Self-Distillation (August 14, 2023 at 2:02 pm)
Recently, there has been increasing interest in the challenge of how to discriminatively vectorize graphs. To address this, we propose a method called Iterative Graph Self-Distillation (IGSD), which learns graph-level representations in an unsupervised manner through instance discrimination using a self-supervised contrastive learning approach. IGSD involves a teacher-student distillation process that uses graph diffusion augmentations and constructs the teacher model using an exponential moving average of the student model. The intuition behind IGSD is to predict the teacher network representation of the graph pairs under different augmented views. As a natural extension, we also apply IGSD to semi-supervised scenarios by jointly regularizing the network with both supervised and self-supervised contrastive losses. Finally, we show that fine-tuning the IGSD-trained models with self-training can further improve graph representation learning. Empirically, we achieve significant and consistent performance gains on various graph datasets in both unsupervised and semi-supervised settings, which validates the superiority of IGSD.
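The exponential-moving-average construction of the teacher mentioned in this abstract can be illustrated with a minimal sketch. This is not the IGSD implementation: the function name `ema_update`, the flat parameter lists, and the `decay` value are assumptions chosen only for illustration.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], decay: float = 0.99) -> List[float]:
    """Return new teacher parameters as an exponential moving average.

    Hypothetical helper: parameters are plain floats here, whereas a real
    model would hold tensors. `decay` is an assumed hyperparameter.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of parameters")
    # Each teacher weight moves a small step toward the matching student weight.
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy usage: one update step with a deliberately small decay.
teacher = [0.0, 1.0]
student = [1.0, 3.0]
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1 the teacher changes slowly, so it acts as a smoothed copy of the student's recent history.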

- XMQAs: Constructing Complex-Modified Question-Answering Dataset for Robust Question Understanding (August 10, 2023 at 2:01 pm)
Question understanding is important to the success of a Knowledge-based Question Answering (KBQA) system. However, existing studies do not pay enough attention to this issue, given that the questions in existing KBQA datasets are usually expressed in a simple and straightforward way. This is not in line with actual linguistic conventions, which often use many modifiers. To facilitate the study of evaluating and enhancing the question understanding ability of KBQA systems, this paper proposes to construct a complex-modified question-answering (XMQAs) dataset based on existing KBQA datasets. With the help of knowledge bases and dictionaries, three kinds of modifiers are defined and applied to the original simple-expressed questions. These modifiers make the expression of these questions complex without changing their semantics. Based on XMQAs, we then propose a novel question understanding algorithm built upon existing KBQA models, which greatly improves the robustness of their question understanding abilities. We conduct extensive experiments on XMQAs and two widely acknowledged KBQA datasets. The empirical results demonstrate that our proposed algorithm can improve the performance of KBQA models not only on complex-modified questions but also on simple-expressed questions.

- Fast and Slow Thinking: A Two-Step Schema-Aware Approach for Instance Completion in Knowledge Graphs (August 10, 2023 at 2:01 pm)
Modern Knowledge Graphs (KGs) often suffer from an incompleteness issue (i.e., missing facts). By representing a fact as a triplet $(h,r,t)$ linking two entities $h$ and $t$ via a relation $r$, existing KG completion approaches mostly consider a link prediction task to solve this problem, i.e., given two elements of a triplet, predicting the missing one, such as $(h,r,?)$. However, this task implicitly makes a strong yet impractical assumption about the two given elements in a triplet, which have to be correlated, resulting otherwise in meaningless predictions, such as (Marie Curie, headquarters location, ?). Against this background, this paper studies an instance completion task suggesting $r$-$t$ pairs for a given $h$, i.e., $(h,?,?)$. Inspired by the human psychological principle “fast-and-slow thinking”, we propose a two-step schema-aware approach RETA++ to efficiently solve our instance completion problem. It consists of two components: a fast RETA-Filter efficiently filtering candidate $r$-$t$ pairs schematically matching the given $h$, and a deliberate RETA-Grader leveraging a KG embedding model scoring each candidate $r$-$t$ pair considering the plausibility of both the input triplet and its corresponding schema. RETA++ systematically integrates them by training RETA-Grader on the reduced solution space output by RETA-Filter via a customized negative sampling process, so as to fully benefit from the efficiency of RETA-Filter in solution space reduction and the deliberation of RETA-Grader in scoring candidate triplets. We evaluate our approach against a sizable collection of state-of-the-art techniques on three real-world KG datasets. Results show that RETA-Filter can efficiently reduce the solution space for the instance completion task, outperforming best baseline techniques by 10.61%–84.75% on the reduced solution space size, while also being 1.7×–29.6× faster than these techniques. Moreover, RETA-Grader trained on the reduced solution space also significantly outperforms the best state-of-the-art techniques on the instance completion task by 31.90%–105.02%.
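The filter-then-grade idea described in this abstract can be sketched as a generic two-step pipeline. This is a toy illustration, not RETA++: the schema table, the entity types, the candidate pairs, and the lookup-based scoring function are all made-up stand-ins, and a real grader would be a trained KG embedding model.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (relation, tail)


def filter_candidates(head_types: Set[str], schema: Dict[str, Set[str]],
                      candidates: List[Pair]) -> List[Pair]:
    """Fast step: keep pairs whose relation accepts at least one type of the head."""
    return [(r, t) for r, t in candidates if schema.get(r, set()) & head_types]


def grade_candidates(head: str, pairs: List[Pair],
                     score: Callable[[str, str, str], float]) -> List[Pair]:
    """Slow step: rank the surviving pairs by a plausibility score, best first."""
    return sorted(pairs, key=lambda p: score(head, p[0], p[1]), reverse=True)


# Hypothetical schema: which head-entity types each relation admits.
schema = {
    "headquarters location": {"organization"},
    "field of work": {"person"},
    "award received": {"person", "organization"},
}
candidates = [
    ("headquarters location", "Paris"),
    ("field of work", "physics"),
    ("award received", "Nobel Prize in Physics"),
]
# Made-up plausibility scores standing in for a learned model.
toy_scores = {
    ("Marie Curie", "field of work", "physics"): 0.8,
    ("Marie Curie", "award received", "Nobel Prize in Physics"): 0.9,
}


def toy_score(h: str, r: str, t: str) -> float:
    return toy_scores.get((h, r, t), 0.0)


filtered = filter_candidates({"person"}, schema, candidates)
ranked = grade_candidates("Marie Curie", filtered, toy_score)
```

In this toy run the filter discards the "headquarters location" pair because that relation does not admit a person as head, so the more expensive scoring step only sees the two remaining candidates.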

- Efficient Classification by Removing Bayesian Confusing Samples (August 9, 2023 at 2:01 pm)
Improving the generalization performance of classifiers from a data pre-processing perspective has recently received considerable attention in the machine learning community. Although many methods have been proposed in the past decades, most of them lack theoretical foundations and cannot guarantee better generalization performance of classifiers on processed datasets. To overcome this flaw, in this paper, we propose a method, supported by Bayesian decision theory and percolation theory, to improve generalization performance by removing Bayesian confusing samples (abbr. BCS). Specifically, for a training set, we define the samples that are misclassified by the Bayesian optimal classifier as BCS and prove that a classifier trained on the training set after removing BCS can obtain better generalization performance. To find BCS, we show, based on percolation theory, that they can be identified according to the size of the global homogeneous cluster, a set of samples with the same labels. Based on this analysis, we propose a method to construct global homogeneous clusters and remove BCS from the training set. Extensive experiments show that the proposed method is effective for a number of classical and state-of-the-art classifiers.

- Topic Modeling on Document Networks With Dirichlet Optimal Transport Barycenter (August 9, 2023 at 2:01 pm)
Text documents are often interconnected in a network structure, e.g., academic papers via citations and Web pages via hyperlinks. On the one hand, though Graph Neural Networks (GNNs) have shown promising ability to derive effective embeddings for such networked documents, they do not assume a latent topic structure and result in uninterpretable embeddings. On the other hand, topic models can infer semantically interpretable topic distributions for documents by associating each topic with a group of understandable keywords. However, most topic models mainly focus on plain text within documents and fail to leverage network structure across documents. Network connectivity reveals topic similarity between linked documents, and modeling it could uncover meaningful semantics. Motivated by the above two challenges, in this paper, we propose a GNN-based neural topic model that both captures network connectivity and derives semantically interpretable topic distributions for networked documents. For network modeling, we build the model based on the theory of Optimal Transport Barycenter, which captures network structure by allowing the topic distribution of a document to generate the content of its linked neighbors. For semantic interpretability, we extend optimal transport by incorporating semantically related words in the embedding space. Since the Dirichlet prior in Latent Dirichlet Allocation successfully improves topic quality, we also analyze Dirichlet as an optimal transport prior distribution to improve topic interpretability. We design rejection sampling to simulate the Dirichlet distribution. Extensive experiments on document classification, clustering, link prediction, and topic analysis verify the effectiveness of our model.
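The abstract does not say how its rejection sampler is designed, so the following is only a textbook-style sketch of rejection sampling for a Dirichlet distribution, not the paper's procedure. It assumes every concentration parameter is greater than 1, uses a uniform proposal on the simplex, and accepts a proposal with probability equal to the ratio of the unnormalized density at the proposal to its value at the mode. All function names are hypothetical.

```python
import math
import random
from typing import List


def sample_uniform_simplex(k: int, rng: random.Random) -> List[float]:
    """Normalized exponential draws are uniform on the simplex (Dirichlet(1, ..., 1))."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def log_kernel(x: List[float], alpha: List[float]) -> float:
    """Log of the unnormalized Dirichlet density prod_i x_i^(alpha_i - 1)."""
    return sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def rejection_sample_dirichlet(alpha: List[float], rng: random.Random,
                               max_tries: int = 100_000) -> List[float]:
    """Draw one Dirichlet(alpha) sample by rejection; assumes all alpha_i > 1."""
    if any(a <= 1.0 for a in alpha):
        raise ValueError("this sketch needs every alpha_i > 1 so the density is bounded")
    k = len(alpha)
    # The mode of the density gives the envelope height for the uniform proposal.
    mode = [(a - 1.0) / (sum(alpha) - k) for a in alpha]
    log_peak = log_kernel(mode, alpha)
    for _ in range(max_tries):
        x = sample_uniform_simplex(k, rng)
        if min(x) <= 0.0:
            continue  # guard against log(0) on a degenerate draw
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        if math.log(u) <= log_kernel(x, alpha) - log_peak:
            return x
    raise RuntimeError("no sample accepted within max_tries")


rng = random.Random(0)
topic_mix = rejection_sample_dirichlet([2.0, 3.0, 4.0], rng)
```

The acceptance rate of this simple envelope drops quickly as the dimension or the concentration parameters grow, so it is mainly useful for showing the accept/reject mechanics on small examples.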