
Application Project Majd Al Kayyal

Last modified Apr 5

Project Overview

CreateData4AI (CD4AI) is a project at the sebis chair (I19) whose final public-facing deliverable is a web application. The application is designed to transform unstructured text data into valuable, annotated datasets, and it lets users categorize data efficiently through user-guided processes. However, the platform is currently slow, offers no collaboration capabilities, and relies heavily on manual user input, all of which affect the overall quality of the resulting datasets. This project aims to tackle these challenges through focused research and development efforts.

 

Objectives and Research Questions:

How can enhancements in scalability and collaboration address the challenges of efficiency and teamwork in CD4AI?

 

Objective 1: Enhance Processing Efficiency

Research Question: 

How can tasks typically requiring GPU resources be optimized to enhance the scalability and efficiency of processing large datasets on any system?

 

Current State: 

Tasks within the application exhibit poor efficiency, leading to prolonged execution times even for small datasets.

 

Goal: 

To improve task efficiency, ensuring fast execution and the ability to scale operations for handling extensive datasets.

 

Exploration Areas: 

  • Code Optimization: Identifying and removing inefficiencies in the existing code.

 

  • Parallel Execution: Leveraging multithreading and multiprocessing to run tasks in parallel, as a basis for distributed execution.

 

  • Efficient LLM Utilization: Investigating various Large Language Models (LLMs) to find an optimal balance between embedding quality and computational speed.

 

  • Dimensionality Reduction Techniques: Comparing different dimensionality reduction techniques to reduce the number of computations a task needs.
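
As an illustration of the parallel execution idea, the following is a minimal sketch, not the CD4AI implementation. It splits the rows into chunks and runs a per-chunk task on a worker pool from Python's standard `concurrent.futures` module. The names `chunked`, `normalize_chunk`, and `process_in_parallel` are hypothetical, and the task itself (trimming and lowercasing text) is only a placeholder for a real pipeline step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(rows: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def normalize_chunk(chunk: Sequence[str]) -> List[str]:
    """Hypothetical placeholder task: trim and lowercase each text row."""
    return [row.strip().lower() for row in chunk]


def process_in_parallel(
    rows: Sequence[str],
    task: Callable[[Sequence[str]], List[str]] = normalize_chunk,
    chunk_size: int = 1000,
    max_workers: int = 4,
    executor_cls=ThreadPoolExecutor,
) -> List[str]:
    """Run task on every chunk using a worker pool and merge the results."""
    results: List[str] = []
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the chunks were submitted,
        # so the merged output keeps the original row order.
        for partial in pool.map(task, chunked(rows, chunk_size)):
            results.extend(partial)
    return results


if __name__ == "__main__":
    sample = ["  Handel mit Waren ", "IT-Beratung", " Betrieb einer Gaststätte"]
    print(process_in_parallel(sample, chunk_size=2))
```

For CPU-bound tasks in CPython, threads are limited by the global interpreter lock, so passing `executor_cls=ProcessPoolExecutor` (also from `concurrent.futures`) is the likelier choice; in that case the task must be a picklable, module-level function.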



Objective 2: Enable Effective User Collaboration

Research Question

What mechanisms can be implemented to facilitate seamless collaboration among users working on the same projects?

 

Current State

The application currently lacks features supporting user collaboration.

 

Goal

To introduce functionalities that enable both real-time and asynchronous collaboration among users.

 

Exploration Areas

  • User Management: Establishing systems for organizations, users, roles, and permissions.

 

  • Collaboration Features: Implementing real-time collaboration features that let users work together to start data pipeline tasks and select the most relevant results for their classes.
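
One possible shape for the user management model is sketched below, under stated assumptions. The role names (`VIEWER`, `ANNOTATOR`, `ADMIN`), the permission names, and the `Organization` class are all hypothetical and chosen only to mirror the actions mentioned above (starting pipeline tasks, selecting results); the actual roles and permissions are still to be defined in the project.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, FrozenSet


class Permission(Enum):
    """Hypothetical actions a user may perform inside a project."""
    VIEW_RESULTS = auto()
    START_PIPELINE_TASK = auto()
    SELECT_RESULTS = auto()
    MANAGE_USERS = auto()


class Role(Enum):
    """Hypothetical roles; the real set is an open design question."""
    VIEWER = "viewer"
    ANNOTATOR = "annotator"
    ADMIN = "admin"


# Assumed mapping from each role to the permissions it grants.
ROLE_PERMISSIONS: Dict[Role, FrozenSet[Permission]] = {
    Role.VIEWER: frozenset({Permission.VIEW_RESULTS}),
    Role.ANNOTATOR: frozenset({
        Permission.VIEW_RESULTS,
        Permission.START_PIPELINE_TASK,
        Permission.SELECT_RESULTS,
    }),
    Role.ADMIN: frozenset(Permission),  # every permission
}


@dataclass
class Organization:
    """Groups users and records the role each one holds."""
    name: str
    members: Dict[str, Role] = field(default_factory=dict)

    def add_member(self, username: str, role: Role) -> None:
        self.members[username] = role

    def can(self, username: str, permission: Permission) -> bool:
        """Return True if the user is a member whose role grants the permission."""
        role = self.members.get(username)
        if role is None:
            return False
        return permission in ROLE_PERMISSIONS[role]


if __name__ == "__main__":
    org = Organization("sebis")
    org.add_member("alice", Role.ANNOTATOR)
    print(org.can("alice", Permission.START_PIPELINE_TASK))
    print(org.can("alice", Permission.MANAGE_USERS))
```

Keeping the role-to-permission mapping in a single table means a new role or permission can be added without touching the check in `can`.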

 

Dataset:

The dataset is a text corpus called German Business Registry - Business Purposes. The goal is to annotate the business activities it describes with user-defined classes. The dataset contains around 2.4M rows.
