Places, networks, and crowds: Scalable data management and analysis for emerging online applications

Posted on: 2016-10-22
Degree: Ph.D.
Type: Dissertation
University: Polytechnic Institute of New York University
Candidate: Christoforaki, Maria
GTID: 1478390017984068
Subject: Computer Science
Abstract/Summary:
The amount of information that is currently generated, gathered, and stored has reached unprecedented levels. Various sources such as websites, local business catalogs, social networks, and Question and Answer (Q/A) sites contain vast amounts of data that can be very valuable for both web users and companies. The large size of these datasets is an obstacle to using them effectively, since answering queries over the data and extracting its useful parts become more demanding. This dissertation focuses on emerging online applications that need to manage and analyze such large datasets. It comprises three parts, each of which studies a different problem and a different type of data.

The first part studies the problem of efficient spatio-textual query processing. Location-based search services, such as Google Maps, allow users to issue text queries constrained to a specific geographic location. To process these queries efficiently, previous work focused on optimizing the spatial aspect. We provide a solution that gives higher priority to the textual aspect while using only a coarse-grained spatial structure. Our experiments show that this solution outperforms existing approaches by up to two orders of magnitude.

The second part focuses on efficient pairwise distance estimation in large graphs. Point-to-point distance estimation is a fundamental and well-studied problem with numerous applications such as Social Search, but previous algorithms become intractable as the size of the graph grows. We take a fresh look at this setting and approach it as a learning problem, using structural properties of the graph as features in the learning process. Our experiments verify that this approach leads to lower prediction errors than state-of-the-art solutions.

Finally, the third part proposes a system that uses content available on Q/A sites, such as Stack Overflow, to efficiently generate and evaluate test questions that assess the technical skills of job candidates. After extracting relevant threads from the Q/A sites, our system combines crowdsourcing and Item Response Theory to re-purpose this content into tests. Our experiments show that the quality of these tests is comparable to, or higher than, that of tests used in practice. At the same time, we achieve a cost per test question that is lower than that of licensing questions from existing test banks.
Keywords/Search Tags: Data