Project Overview
CreateData4AI (CD4AI) is a project at the sebis chair (I19) whose final public-facing deliverable is a web application. The application is designed to transform unstructured text data into valuable, annotated datasets, and it lets users categorize data efficiently through user-guided processes. However, the platform is currently limited by slow processing, missing collaboration capabilities, and a heavy reliance on manual user input, all of which affect the overall quality of the resulting datasets. This project aims to tackle these challenges through focused research and development efforts.
Objectives and Research Questions:
How can enhancements in scalability and collaboration address the challenges of efficiency and teamwork in CD4AI?
Objective 1: Enhance Processing Efficiency
Research Question:
How can tasks requiring high computational resources be optimized to enhance the scalability and efficiency of processing large datasets?
Current State:
Tasks within the application exhibit poor efficiency, leading to prolonged execution times even for small datasets.
Goal:
To improve task efficiency, ensuring fast execution and the ability to scale operations for handling extensive datasets.
Exploration Areas:
Code Optimization: Identifying and removing inefficiencies in the existing task code.
Parallel Execution: Leveraging multithreading and multiprocessing to execute tasks in parallel.
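As an illustration of the parallel execution idea, the following is a minimal sketch and not the CD4AI implementation. It assumes a task can be applied to independent batches of documents. The names `chunked`, `count_tokens`, and `run_in_parallel` are hypothetical, and `count_tokens` is only a stand-in for a real, expensive task.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_tokens(batch: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for an expensive per-document task."""
    return [len(doc.split()) for doc in batch]


def run_in_parallel(
    docs: Sequence[str],
    task: Callable[[Sequence[str]], List[int]] = count_tokens,
    chunk_size: int = 1000,
    max_workers: int = 4,
    executor_cls=ThreadPoolExecutor,
) -> List[int]:
    """Split `docs` into batches and run `task` on the batches concurrently."""
    results: List[int] = []
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the input batches.
        for batch_result in pool.map(task, chunked(docs, chunk_size)):
            results.extend(batch_result)
    return results


if __name__ == "__main__":
    sample = ["Senior data engineer", "Nurse", "Junior backend developer wanted"]
    print(run_in_parallel(sample, chunk_size=2, max_workers=2))
```

Threads are used by default because they suit I/O-bound steps. For CPU-bound tasks, `concurrent.futures.ProcessPoolExecutor` can be passed as `executor_cls`, provided the task function is defined at module level so that it can be pickled.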
Objective 2: Enable Effective User Collaboration
Research Question:
How can collaborative features and automation tools enhance the efficiency and speed of the data selection process for domain experts?
Current State:
The application currently lacks features supporting user collaboration and automated selections.
Goal:
To introduce functionalities that enable both real-time and asynchronous collaboration among users.
Exploration Areas:
User Management: Establishing systems for organizations, users, roles, and permissions.
Collaboration Features: Implementing features that let users work together to start data pipeline tasks and select the most relevant results for their classes.
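One possible shape for the user management model is sketched below. It is an assumption for illustration only: the role names, the permission names, and the `Organization` class are hypothetical and do not describe an existing CD4AI schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet


class Permission(Enum):
    # Hypothetical permissions derived from the collaboration goals above.
    VIEW_RESULTS = "view_results"
    SELECT_RESULTS = "select_results"
    START_PIPELINE = "start_pipeline"
    MANAGE_USERS = "manage_users"


class Role(Enum):
    # Hypothetical roles; a real deployment may need different ones.
    VIEWER = "viewer"
    ANNOTATOR = "annotator"
    ADMIN = "admin"


# Static mapping from each role to the permissions it grants.
ROLE_PERMISSIONS: Dict[Role, FrozenSet[Permission]] = {
    Role.VIEWER: frozenset({Permission.VIEW_RESULTS}),
    Role.ANNOTATOR: frozenset({
        Permission.VIEW_RESULTS,
        Permission.SELECT_RESULTS,
        Permission.START_PIPELINE,
    }),
    Role.ADMIN: frozenset(Permission),
}


@dataclass
class Organization:
    """An organization with members, each holding exactly one role."""
    name: str
    members: Dict[str, Role] = field(default_factory=dict)

    def add_member(self, user_id: str, role: Role) -> None:
        self.members[user_id] = role

    def is_allowed(self, user_id: str, permission: Permission) -> bool:
        """Return True if the user is a member whose role grants `permission`."""
        role = self.members.get(user_id)
        return role is not None and permission in ROLE_PERMISSIONS[role]
```

With this layout, a check such as `org.is_allowed("alice", Permission.START_PIPELINE)` decides whether a user may start a data pipeline task, and users who are not members of the organization are always denied.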
Dataset:
The dataset consists of textual job descriptions from multiple industries. The goal is to annotate these job descriptions with user-defined classes (industries). The dataset contains around 100k documents.
Name | Type | Size | Last Modification | Last Editor |
---|---|---|---|---|
240603 Majd Alkayyal CD4AI-AP.pptx | | 3,03 MB | 07.10.2024 | |
240923 Majd Alkayyal CD4AI-AP.pptx | | 2,67 MB | 07.10.2024 | |