The comment sections of online newspapers are an important space for political discussion and the exchange of opinions. These discussion forums have to be moderated because they are misused by spammers, haters, trolls, and propagandists. Moderation is very expensive, and many online news providers have therefore discontinued their comment sections. With more and more political campaigning, or even agitation, being distributed over the internet, serious and safe platforms for discussing political topics are increasingly important.
In this project, we therefore analyze comments, users, and articles to understand the dynamics, the information flow, and the interactions in the comment sections. We work on detecting inappropriate comments, predicting popular news topics, identifying fake news and recommending information.
Today's business communication is almost unimaginable without emails. They document discussions and decisions or summarise face-to-face meetings in the form of unstructured text or attachments and thus hold a significant amount of information about a business. In very exceptional cases, for example when investigating a known case of fraud, specialists examine inboxes and attached files of involved personnel to determine the extent of the situation. However, the sheer quantity of documents is unmanageable without some guidance by an exploration tool, as journalists working with the Panama Papers leak experienced.
In this project, we develop and evaluate information extraction and linking methods and combine their results in an exploration tool. This work touches the fields of text mining, text summarisation, document classification, topic modelling, named entity extraction, entity linking, and relationship extraction, as well as social network and graph analysis. We work together with our industry partner from the financial sector to put our prototypes in the hands of auditors for real-world feedback.
Art archives are a rich source of information for multiple reasons: proving the provenance of certain art pieces, facilitating research on art history, and understanding a particular artist with regard to the context of his or her work. These archives typically comprise various kinds of heterogeneous documents: auction catalogs, personal correspondence, books, exhibition catalogs, bills, certificates, studies, theses, etc. Many of these archives are not easily accessible because they are not yet digitized. Even those that are available in digitized form are hard to explore with general text mining tools.
In this project, we aim to facilitate access to a large collection of art-related documents. To this end, we need to adapt standard NLP tools to the unique challenges of the art domain. The ultimate goal is to generate a knowledge graph that can be easily explored by art historians. The knowledge graph would also serve as a backbone for semantic search functionality and for new ways to represent art entities, e.g. as embeddings in a high-dimensional space. Modern deep learning methods will be developed to manage and visualize large collections of art historical and scholarly documents.
Topic models automatically learn probabilistic representations for documents and their underlying semantic topics. In this project, we extend state-of-the-art topic models for new applications and compare and combine them with other document representations.
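To make the idea of a probabilistic document representation concrete, the following is a minimal sketch of a textbook topic model (latent Dirichlet allocation fitted with collapsed Gibbs sampling). It is not the project's own model; the function name `fit_lda`, the hyperparameter values, and the toy documents are assumptions chosen purely for illustration.

```python
import random


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper).

    docs is a list of token lists. Returns one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables used by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed topic proportions: the probabilistic document representation.
    return [
        [
            (doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
            for k in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]


# Toy corpus, invented for this example.
toy_docs = [
    "election vote party campaign vote".split(),
    "painting museum artist exhibition painting".split(),
    "party election artist museum".split(),
]
for proportions in fit_lda(toy_docs):
    print([round(p, 2) for p in proportions])
```

Each printed row is a probability distribution over the topics, so documents from different sources can be compared in the same low-dimensional topic space.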
Combining several text collections into a joint, large dataset can reveal connections between apparently unrelated documents. However, standard text mining approaches cannot deal with different document styles and collection-specific language use. In this project, we jointly model documents despite linguistic differences for various tasks, such as clustering, classification, recommendation, or retrieval. For example, our models make it possible to measure document similarity on a semantic level across patents and scientific papers, or across newspaper articles and tweets.
Deep neural networks can be used to create representations for words, sentences, and documents, as well as for entities, relations, and many more. They provide a dense vector to represent high-dimensional, sparse data in a compact way. Such embedding models have been shown to improve the results of many text mining tasks. Further, combining these representations can reveal new insights. We investigate how these models can be used for text mining and develop new models for specific text mining tasks, such as splitting of e-mail threads, embedding book plots, or generating texts.
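The limitation of purely lexical approaches can be illustrated with a small sketch. It computes the cosine similarity of raw word counts, a common baseline and not the project's method. The function name `bow_cosine` and the two example texts are invented for illustration.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity of raw word counts (hypothetical baseline helper)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words that occur in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented examples: a patent-style and a paper-style sentence on a related topic.
patent_text = "apparatus for wireless transmission of electrical energy"
paper_text = "we charge batteries remotely using resonant inductive coupling"

print(bow_cosine(patent_text, paper_text))
print(bow_cosine(patent_text, patent_text))
```

The two texts share no words, so the word-count baseline scores them as entirely dissimilar even though they address a related subject. This is the kind of gap that modelling similarity on a semantic level is meant to close.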