In general, I am interested in solving exciting problems related to large and complex data. My PhD research (Dissertation: Geometric and Statistical Summaries for Big Data Visualization) focused on interactive visual exploration of large-scale data through reduction, analysis, and querying. A common theme across many of my projects and papers is the concept of a data summary: a compact representation of the most important features extracted from the data. A useful data summary must be significantly smaller than the raw data yet feature-rich, and it should lend itself naturally to a visual representation. I try to identify a property of the data that quantifies importance (or saliency) and can therefore drive importance-based sampling, filtering, compression, or other forms of data reduction.
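
To make the idea of importance-driven sampling concrete, here is a minimal illustrative sketch (not taken from any of the papers above) that keeps a small subset of a 2D scalar field by sampling points with probability proportional to a simple saliency proxy, local gradient magnitude. The function name and the choice of saliency measure are assumptions for illustration only.

```python
import numpy as np

def importance_sample(field, n_samples, seed=None):
    """Return (rows, cols) of n_samples points drawn from a 2D scalar
    field, weighted by local gradient magnitude (a simple saliency proxy).
    Illustrative sketch; the papers above use domain-specific measures."""
    rng = np.random.default_rng(seed)
    gy, gx = np.gradient(field)
    saliency = np.hypot(gx, gy).ravel()
    # Avoid a degenerate all-zero distribution on constant fields.
    saliency = saliency + 1e-12
    prob = saliency / saliency.sum()
    idx = rng.choice(field.size, size=n_samples, replace=False, p=prob)
    return np.unravel_index(idx, field.shape)

# Usage: sample 100 salient points from a synthetic Gaussian bump,
# where samples concentrate on the bump's steep flanks.
x = np.linspace(-3, 3, 128)
field = np.exp(-(x[None, :]**2 + x[:, None]**2))
rows, cols = importance_sample(field, 100, seed=0)
```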

In the past, my research investigated data summaries for two major categories of spatial data, scalar and vector fields, exploiting their underlying statistical and geometric properties. More recently, I have been exploring how to create compact graph-based summaries of crowdsourced question-answer networks and use them to visually recommend better answers to users.

At Intel, I developed tools and algorithms for analyzing and visualizing large-scale data in the computational imaging and computational lithography areas.

At @WalmartLabs, I design and develop machine-learning-driven solutions for large-scale classification and prediction problems in the e-commerce domain, such as product categorization and content (text, image) enrichment.

Graph-based Data Summary:

Recent research has been dedicated to improving the problem-solving potential of crowdsourcing activities. However, little has been done to help a user quickly extract valuable knowledge from crowdsourced solutions to a problem without spending a lot of time examining all the content in detail. Online knowledge-sharing forums (Y! Answers, Quora, and StackOverflow) and review aggregation platforms (Amazon, Yelp, and IMDB) are all instances of crowdsourcing sites that users visit to find solutions to problems. Our system, CrowdMGR, performs visual analytics to help users manage and interpret crowdsourced data and find relevant nuggets of information. Given a user query, CrowdMGR returns the solution to the problem, referred to as the SolutionGraph, as an interactive canvas of linked visualizations. The SolutionGraph allows a user to systematically explore, visualize, and extract the knowledge in the crowdsourced data. It not only summarizes content directly linked to the user's query, but also enables her to explore related topics within the temporal and topical scope of the query and discover answers to questions she did not even ask.

Related publication: SIGKDD Workshop IDEA '14.

Statistical Data Summary:

We investigated statistical distributions (commonly represented by histograms) as data summaries that can substitute for the raw data when answering statistical queries of various kinds. The main goal was to support range distribution queries, which return the distribution of the data within an axis-aligned query region. When answering a large number of queries on big data, however, frequent access to the raw data is often impractical, if possible at all. Hence, we proposed parallel algorithms to pre-compute local distributions at different levels of detail, along with a partitioning scheme that obtains the statistics of any arbitrary region by re-using the finite set of pre-computed distributions. Pre-computed distributions often exceed the raw data in size, causing storage and query overhead; moreover, maintaining interactive queries is difficult because the workload and the query response time scale with both the data size and the query size. We solved this by adapting an integral histogram based data structure to answer range distribution queries for any arbitrary region in near-constant time, regardless of data and query size. To reduce the large storage overhead of the pre-computed distributions, we proposed a similarity-driven indexing scheme that exploits the hidden self-similarity among distributions collected from different regions.
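
The integral histogram idea underlying the near-constant query time can be sketched as follows (an illustrative 2D toy, not the paper's implementation; function names are assumptions). Per-bin counts are accumulated cumulatively over both axes, after which the histogram of any axis-aligned region follows from four table lookups via inclusion-exclusion, independent of region size.

```python
import numpy as np

def build_integral_histogram(data, nbins):
    """data: 2D array of bin indices in [0, nbins).
    Returns an (H+1, W+1, nbins) cumulative per-bin count table."""
    h, w = data.shape
    onehot = np.zeros((h, w, nbins), dtype=np.int64)
    onehot[np.arange(h)[:, None], np.arange(w)[None, :], data] = 1
    table = np.zeros((h + 1, w + 1, nbins), dtype=np.int64)
    # 2D prefix sums per bin.
    table[1:, 1:] = onehot.cumsum(axis=0).cumsum(axis=1)
    return table

def range_histogram(table, r0, c0, r1, c1):
    """Histogram of rows [r0, r1) x cols [c0, c1) via inclusion-exclusion,
    in O(nbins) time regardless of the region's size."""
    return (table[r1, c1] - table[r0, c1]
            - table[r1, c0] + table[r0, c0])

# Usage: the four-lookup answer matches a brute-force count.
rng = np.random.default_rng(0)
bins = rng.integers(0, 8, size=(64, 64))
T = build_integral_histogram(bins, 8)
hist = range_histogram(T, 10, 5, 40, 30)
```

The storage cost of the full table, proportional to grid size times bin count, is exactly the overhead that motivates the similarity-driven indexing described above.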

Related publications: SciDAC '11, LDAV '12, PacificVis '14.

Geometric Data Summary:

This work presents a novel method for identifying, extracting, and visualizing the streamlines (or streamline segments) that represent interesting flow features in a large vector field. We adapt a fractal dimension based metric called the box counting ratio, which captures the geometric complexity of a streamline. Unlike other measures, the box counting ratio can detect features of different sizes, enabling multi-scale feature detection. Our feature exploration framework allows feature selection from an interactive 2D space, and it facilitates clutter reduction, clustering of similar features, and exploration of unknown connections between remote regions in various scientific data sets.
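
A rough sketch of box counting on a 2D polyline illustrates the underlying idea (the exact formulation in the paper may differ; names here are illustrative). The number of grid cells N(s) touched by a curve at box size s scales roughly as s^(-D) for a curve of dimension D, so comparing counts at two scales estimates geometric complexity.

```python
import numpy as np

def count_boxes(points, box_size):
    """Number of distinct grid cells of side box_size touched by points."""
    cells = np.floor(points / box_size).astype(np.int64)
    return len({tuple(c) for c in cells})

def box_counting_ratio(points, s1, s2):
    """Estimate dimension from box counts at two scales, s1 > s2:
    log(N(s2)/N(s1)) / log(s1/s2)."""
    n1 = count_boxes(points, s1)
    n2 = count_boxes(points, s2)
    return np.log(n2 / n1) / np.log(s1 / s2)

# Usage: a straight segment (geometrically simple) should score near 1,
# while a space-filling, tangled streamline would score closer to 2.
t = np.linspace(0.0, 1.0, 5000)
line = np.stack([t, t], axis=1)       # densely sampled diagonal segment
d = box_counting_ratio(line, 0.1, 0.01)
```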

Related publications: SciDAC '11, TVCG '14.

Self-adaptive Visualization to Minimize Occlusion:

The objective of this work was to reduce visual clutter by showing overview information in 2D and details in 3D on demand. However, 3D induces occlusion. We proposed a novel technique that automatically adapts the geometry and orientation of the rendering primitives to display details in 3D without causing significant occlusion. The technique was applied to hierarchical data of high depth using 3D treemaps, and was also adapted to visualize additional information (such as the temporal variation of water elevation associated with spatial locations in geospatial data).

Related publications: PacificVis '09, VDA '12.

Short-term Projects:

  • Visual Analysis of Ultrasound Video: A technique to highlight the periodic behavior and the abnormalities in time-varying ultrasound signals. Related publication: PacificVis '10.
  • Blue Marble Mapper: A collaborative project with OSU hydrologists to build an online visualization system for water elevation data from around the Earth.
  • Volume Rendering on Cell Processor: We examined the portability and performance of parallel ray casting on a single Cell processor (a PS3).
  • Visualization of adaptive mesh refinement data.
  • Rendering of large-scale particle data.