Frameworks, Algorithms, and Systems for Efficient Discovery of Data-Backed Fact

Posted on:2018-11-13

Degree:Ph.D

Type:Thesis

University:The University of Texas at Arlington

Candidate:Zhang, Gensheng

Full Text:PDF

GTID:2446390005953917

Subject:Computer Science

Abstract/Summary:

This thesis studies the problem of finding facts from semi-structured and structured data. The amount of data in our world is exploding, and the proliferation of data is making them increasingly inaccessible. It is now more challenging than ever how to efficiently identify useful information where a vast amount of data is available.;This thesis first studies the problem of finding facts in semi-structured data, specifically, in knowledge graphs. We built Maverick, a general, extensible framework that discovers exceptional facts about entities in knowledge graphs. We model an exceptional fact about an entity of interest as a context-subspace pair, in which a subspace is a set of attributes and a context is defined by a graph query pattern of which the entity is a match. The entity is exceptional among the entities in the context, with regard to the subspace. The search spaces of both patterns and subspaces are exponentially large. Maverick conducts beam search on the patterns which uses a match-based pattern construction method to evade the evaluation of invalid patterns. It applies two heuristics to select promising patterns to form the beam in each iteration. Maverick traverses and prunes the subspaces organized as a set enumeration tree by exploiting the upper bound properties of exceptionality scoring functions. Maverick demonstrated substantial performance improvement of the proposed framework over the baselines as well as its effectiveness in discovering exceptional facts.;This thesis further investigates the problem of finding facts in structured data. In particular, our objective is to find top-k prominent streaks in multi-dimensional multi-sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (small) values, e.g., consecutive games of outstanding performance in sports. To efficiently discover prominent streaks, we exploited the properties of local prominent streaks (LPS), which is a superset of prominent streaks. We showed that LPS-based algorithms exhibited orders of magnitude performance improvement against the baseline method.;This thesis also presents a fact-finding system FactWatcher, which helps journalists identify data-backed, attention-seizing facts which serve as leads to news stories. FactWatcher discovers three types of facts, including situational facts, one-of-the-few facts, and prominent streaks, through a unified suite of data model, algorithm framework, and fact ranking measure. Furthermore, FactWatcher provides multiple features in striving for an end-to-end system, including fact ranking, fact-to-statement translation and keyword-based fact search.

Keywords/Search Tags:

Data, Fact, Prominent streaks, Framework

Related items

1	UN Fact-Finding Inquiry Commissions for Assassinations of Prominent Individuals
2	Research On Data Protection Rules Under �C & M Framework�
3	The Types And Rules Of Data Rights In China Under The C&M Framework
4	Research On General Data Editing And Processing Framework Technology Driven By Data Simulation
5	Research On The Configuration Of Personal Data Protection Rules In China Under The C?M Framework
6	Research On Data Cleaning Framework And Application For Open Government Data
7	The Institutional Framework Of Data Portability And China's Path
8	The Prominent Issue Of Minority Rights Diaspora Appearing In Urban Conservation And Countermeasures
9	Research On Criminal Law Protection Of Data Security Under The Framework Of Classification And Classification
10	Cyberspace Governance Under The Influence Of Big Data Framework Construction And Path Optimization Research