Places, networks, and crowds: Scalable data management and analysis for emerging online applications

Posted on:2016-10-22

Degree:Ph.D

Type:Dissertation

University:Polytechnic Institute of New York University

Candidate:Christoforaki, Maria

GTID:1478390017984068

Subject:Computer Science

Abstract/Summary:

The amount of information that is currently generated, gathered, and stored has reached unprecedented levels. Various sources such as websites, local business catalogs, social networks, and Question and Answer (Q/A) sites contain vast amounts of data that can potentially be very valuable for both web users and companies. The large size of these datasets poses an obstacle to their effective utilization, since answering queries over the data and extracting its useful parts become more demanding. This dissertation focuses on emerging online applications that need to manage and analyze large datasets like these; it comprises three parts, each of which studies a different problem and a different type of data.

The first part studies the problem of efficient spatio-textual query processing. Location-based search services, such as Google Maps, allow users to issue text queries constrained to a specific geographic location. In order to efficiently process these queries, previous work focused on optimizations regarding the spatial aspect. We provide a solution that gives higher priority to the textual aspect while using only a coarse-grained spatial structure. Our experiments show that this solution outperforms existing approaches by up to two orders of magnitude.

The second part focuses on efficient pairwise distance estimation in large graphs. Point-to-point distance estimation is a fundamental and well-studied problem with numerous applications such as social search, but previous algorithms become intractable as the size of the graph grows. We take a fresh look at this setting and approach it as a learning problem, using structural properties of the graph as features in the learning process. Our experiments verify that this approach leads to lower prediction errors than the state-of-the-art solutions.

Finally, the third part proposes a system that utilizes content available in Q/A sites, such as Stack Overflow, in order to efficiently generate and evaluate test questions that assess the technical skills of job candidates. Upon extracting relevant threads from the Q/A sites, our system combines crowdsourcing and Item Response Theory so as to re-purpose this content to generate tests. Our experiments show that the quality of these tests is comparable to, or higher than, that of tests that are used in practice. At the same time, we achieve a per-question cost that is lower than that of licensing questions from existing test banks.
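
As a rough illustration of the Item Response Theory component mentioned in the third part, the sketch below uses the standard two-parameter logistic (2PL) model. The abstract does not say which IRT model the dissertation uses, so the choice of 2PL, the grid-search estimator, and every name here (`p_correct`, `item_information`, `estimate_ability`) are assumptions made for illustration, not the author's implementation.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability that a candidate with
    ability theta answers an item with discrimination a and difficulty b
    correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta. Items with high
    information near the target ability discriminate best between candidates."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate found by a simple grid search
    (hypothetical helper; a real system would likely use a proper optimizer).

    responses: list of 0/1 answers, one per item
    items:     list of (a, b) parameter pairs, one per item
    """
    if grid is None:
        # Candidate abilities from -4.00 to 4.00 in steps of 0.01.
        grid = [x / 100.0 for x in range(-400, 401)]

    def log_likelihood(theta):
        total = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p) if r else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)

# Three illustrative items of equal discrimination and increasing difficulty.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
strong = estimate_ability([1, 1, 0], items)  # two of three correct
weak = estimate_ability([1, 0, 0], items)    # one of three correct
```

In this toy setup a candidate who answers more items correctly receives a higher ability estimate, and `item_information` gives one plausible way to rank candidate questions by how well they separate test takers at a given skill level.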

Keywords/Search Tags:

Data

Related items

1	Seismic Achievement Data ETL Platform Architecture Design And Software System Implementation
2	The Research And Application Of Data Preprocessing In XML Data Warehouse
3	Design And Implementation Of Data Mining Support Subsystem Based On Big Data Of Power
4	The Data Integration, Analysis And Utilization For Hospital Information Based On The Data Warehouse
5	Research On Related Issues Of Unstructured Data
6	Design And Implementation Of Environmental Monitoring Data Management System
7	Study On Data Dependency_Based Data Quality Processing Techniques In Data Integration
8	Research On The Problems And Countermeasures Of Domestic Data Journalism Practice
9	Application Of Artificial Intelligence On Data Cleaning
10	Design And Implementation Of The Bayonet Data Integration Platform