
Ha11a - Efficient Text Analytics with Scientific Workflows

Last modified Apr 4, 2012

Abstract

Scientists use text analytics to organize and retrieve knowledge from large amounts of raw data. They have to choose among many different algorithms and consider possible combinations of those algorithms. Workflows are a good way to manage this complexity and have been applied successfully to this problem. Existing approaches, however, do not support scientists in the laborious task of exploring the design space of workflows and configuring computational experiments.

This thesis describes a framework to capture text analytic processes along with an efficient way to explore different algorithms. It takes advantage of the unique capabilities of Wings to reason about semantic constraints. As a result, the framework can rule out invalid workflows and help scientists explore the design space of valid solutions for computational experiments. Metadata about datasets and computation steps is used to elaborate complex experiments automatically. The same metadata is used to store provenance information about the applied algorithms and their parameters, helping researchers to comprehend scientific results. The algorithms in the workflows are not tied to a specific technology, and heterogeneous implementations are possible because every component in the workflow is described by its interfaces and the data types it requires. The execution environment can be switched to run computations on local resources or on a high-performance cyberinfrastructure. For scientists using the framework, the selected execution environment is completely transparent. All these benefits are attainable for novice researchers and students, since predefined workflows can be executed easily even without expert knowledge in machine learning. The expertise needed to perform text analytics is incorporated in the design of the workflows, which offer numerous solutions for many applications.
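The idea of ruling out invalid workflows by checking component interfaces can be illustrated with a minimal sketch. It is not the Wings API and not the thesis implementation: the component names, the data type names, and the rule that a component's output type must equal the next component's input type are all assumptions made for illustration only.

```python
from itertools import product
from typing import NamedTuple


class Component(NamedTuple):
    """A workflow step described only by its interface (hypothetical)."""
    name: str
    input_type: str
    output_type: str


# Hypothetical catalog: one list of alternative components per workflow stage.
STAGES = [
    [Component("PorterStemmer", "RawText", "TokenList"),
     Component("NoStemming", "RawText", "TokenList")],
    [Component("TfIdf", "TokenList", "SparseVector"),
     Component("TopicModel", "TokenList", "DenseVector")],
    [Component("NaiveBayes", "SparseVector", "Labels"),
     Component("KMeans", "DenseVector", "Labels")],
]


def is_valid(pipeline):
    """Assumed constraint: each output type must match the next input type."""
    return all(a.output_type == b.input_type
               for a, b in zip(pipeline, pipeline[1:]))


def valid_workflows(stages, source_type="RawText"):
    """Enumerate every combination of components and keep the valid ones."""
    return [
        pipeline
        for pipeline in product(*stages)
        if pipeline[0].input_type == source_type and is_valid(pipeline)
    ]


workflows = valid_workflows(STAGES)
for wf in workflows:
    print(" -> ".join(c.name for c in wf))
```

In this toy catalog, half of the eight possible combinations are discarded because a feature representation does not match the input expected by the learning step. A semantic reasoner can express much richer constraints than plain type equality; the sketch only shows why interface descriptions make the pruning possible.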

Files and Subpages

Name Type Size Last Modification Last Editor
document-2.pdf 5,68 MB 16.05.2012