Enterprise AI at SAP

Last modified by Anum Afzal Jun 5

SAP and TUM collaborate in strategic areas where applied research can have a positive impact on business and people. The portfolio of joint activities covers a broad set of areas, as SAP's solutions are used in a wide variety of contexts. "Enterprise AI" is a crucial part of that portfolio, under which several research projects are being carried out. There are currently two ongoing projects and one completed project.

Projects

AutoRAG - Leveraging Bayesian Optimization for Accelerating RAG Pipeline Optimization (2024 - Present)

Overview

Retrieval-Augmented Generation (RAG) pipelines are widely used to enhance the performance of large language models by combining retrieval mechanisms with generative modeling. However, optimizing these pipelines is complicated and time-consuming, as it involves carefully selecting and tuning components such as retrievers, filters, and generators. Bayesian Optimization (BO) is a probabilistic method designed to efficiently identify optimal solutions for complex optimization problems. This research will investigate the application of BO within RAG pipelines to automate and accelerate the optimization process. Specifically, we aim to explore how BO can be effectively utilized for both discrete module selection and continuous hyperparameter tuning within a unified framework. The research will also evaluate how robust Bayesian Optimization is under different conditions, including variations in data quality and domain characteristics. The main goal of this research is to find the optimal configurations for RAG pipelines while reducing the optimization time.
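To make the idea of a unified search over discrete modules and continuous hyperparameters more concrete, here is a minimal, standard-library-only sketch of a Bayesian Optimization loop (Gaussian process surrogate with an expected-improvement acquisition). Everything in it is an illustrative assumption rather than the project's implementation: the module names in `RETRIEVERS`, the `top_k` and `threshold` hyperparameters, the one-hot encoding of the module choice, and in particular `evaluate_pipeline`, which is a synthetic stand-in for a real (and expensive) RAG evaluation run.

```python
import math
import random

RETRIEVERS = ["bm25", "dense", "hybrid"]  # hypothetical module choices


def sample_config(rng):
    """Draw one configuration from the mixed discrete/continuous space."""
    return {
        "retriever": rng.choice(RETRIEVERS),  # discrete module selection
        "top_k": rng.randint(1, 20),          # integer hyperparameter
        "threshold": rng.random(),            # continuous hyperparameter
    }


def encode(cfg):
    """One-hot encode the module choice and scale numeric values to [0, 1]."""
    one_hot = [1.0 if cfg["retriever"] == r else 0.0 for r in RETRIEVERS]
    return one_hot + [(cfg["top_k"] - 1) / 19.0, cfg["threshold"]]


def evaluate_pipeline(cfg):
    """Synthetic stand-in for an expensive RAG evaluation (not real results)."""
    base = {"bm25": 0.55, "dense": 0.65, "hybrid": 0.70}[cfg["retriever"]]
    return base - 0.002 * (cfg["top_k"] - 8) ** 2 - 0.3 * (cfg["threshold"] - 0.4) ** 2


def rbf(a, b, length=0.7):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * length ** 2))


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def solve_upper(L, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def expected_improvement(mu, sigma, best):
    """Expected improvement over `best` for a maximization problem."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def bayes_opt(n_init=5, n_iter=15, n_candidates=100, seed=0, noise=1e-4):
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(n_init)]
    scores = [evaluate_pipeline(c) for c in configs]
    for _ in range(n_iter):
        # Standardize scores so the unit-variance GP prior is reasonable.
        mean = sum(scores) / len(scores)
        std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
        y = [(s - mean) / std for s in scores]
        X = [encode(c) for c in configs]
        K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
             for i, a in enumerate(X)]
        L = cholesky(K)
        alpha = solve_upper(L, solve_lower(L, y))  # alpha = K^{-1} y
        best_y = max(y)

        def acquisition(cfg):
            x = encode(cfg)
            k_star = [rbf(a, x) for a in X]
            mu = sum(k * a for k, a in zip(k_star, alpha))
            v = solve_lower(L, k_star)
            sigma = math.sqrt(max(1.0 - sum(t * t for t in v), 1e-12))
            return expected_improvement(mu, sigma, best_y)

        # Evaluate the random candidate with the highest expected improvement.
        nxt = max((sample_config(rng) for _ in range(n_candidates)), key=acquisition)
        configs.append(nxt)
        scores.append(evaluate_pipeline(nxt))
    best = max(range(len(scores)), key=scores.__getitem__)
    return configs[best], scores[best]


if __name__ == "__main__":
    cfg, score = bayes_opt()
    print(cfg, round(score, 3))
```

One-hot encoding is only one of several ways to handle categorical choices in a surrogate model; it is used here because it keeps the example short. In practice a dedicated BO library would likely replace the hand-written Gaussian process.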

Research Objectives

To what extent can Bayesian Optimization be used to choose between different RAG pipeline modules (e.g., retrievers, filters) and tune their hyperparameters?
How do different resource constraints (e.g., evaluation budgets, time, compute) affect the performance and stability of BO in RAG tuning?
How sensitive is Bayesian Optimization to the size and quality of the evaluation dataset during RAG pipeline tuning?
How consistent are the RAG pipeline configurations discovered by Bayesian Optimization across datasets from different domains or with different query characteristics?

Benchmarking of Voice LLMs for Joule (2024 - Present)

Overview

Joule is SAP’s business digital assistant, available across platforms such as smart watches, tablets, and mobile devices, and it currently supports text interaction. But how can a voice assistant understand business-specific vocabulary and meaning? Given the large choice of models, it is essential to evaluate and benchmark speech-to-text (STT), text-to-speech (TTS), and voice cloning models from different vendors on both performance and cost. Existing benchmarks against which to compare the models are lacking. The models should be tested on their ability to comprehend domain-specific data, which is not available due to privacy concerns: customers have different business processes and data, and hence different vocabulary and acronyms, while SAP does not train models on customer data and cannot directly access much of it. The goal is to extend Joule with voice support; doing so with state-of-the-art models would bring great value to SAP’s customers.
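As an illustration of how STT output could be scored in such a benchmark, the sketch below computes word error rate (WER), a widely used transcription metric. The choice of WER is an assumption made for this example; the project's research objectives and metrics are still to be defined, and the sample phrases in the usage section are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: a business term misrecognized by a hypothetical STT model.
    print(word_error_rate("create purchase order", "create purchase border"))
```

A plain WER treats every word equally; for business-specific vocabulary, a benchmark might additionally track errors on a list of domain terms and acronyms.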

Research Objectives

TBD

ChatNLG (2022 - 2024)

Overview:

During the initial phases, this project explored semi-supervised learning frameworks with applications to text generation using deep learning models. It addressed several research questions, such as the development of an approach for automated labeling of data generated via chatbot interactions, the integration of user feedback for an enhanced learning experience for the chatbot, and the improvement of chatbot responses where only a limited dataset is available for training. In the initial cycle of the project, the focus was on an HR chatbot use case in which a semi-supervised learning framework was used to generate responses to user utterances. A human-in-the-loop was also embedded into the semi-supervised learning framework to correct the generated responses before they were appended to the training data.

During the second year of the project, we worked with the domain experts of SAP SE to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We inserted a human-in-the-loop in various parts of the development cycle, such as dataset collection, prompt optimization, and evaluation of generated output. By enhancing the LLM-driven chatbot's response quality and exploring alternative retrieval methods, we created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries. Our experiments and evaluation indicate that GPT-4 outperforms the other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
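One way to quantify how closely an automatic metric tracks human evaluation is a rank correlation between metric scores and human ratings for the same set of responses. The sketch below computes Spearman's rank correlation with the standard library only. It is an assumption about how such alignment could be measured, not a description of the analysis performed in the project, and the numbers in the usage section are invented purely for illustration.

```python
import math


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(metric_scores) != len(human_ratings) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(metric_scores), ranks(human_ratings))


if __name__ == "__main__":
    # Invented scores for five chatbot responses (illustration only).
    metric_scores = [0.91, 0.40, 0.75, 0.62, 0.88]
    human_ratings = [5, 2, 4, 4, 5]
    print(round(spearman(metric_scores, human_ratings), 3))
```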

Research Objectives

Would a semi-supervised learning (SSL) framework combined with a human-in-the-loop generate high-quality labeled data for model training?
Can direct inference yield adequate results without the need for fine-tuning, and what prompt-tuning techniques can be used to improve the quality of the responses?
What methods can be used to optimize retrieval when using LLM embeddings and vector search, compared with the current Dense Passage Retrieval (DPR) model?
What are the emerging state-of-the-art metrics in the evaluation of generative conversational agents, and how do they compare to traditional metrics?
Are reference-free evaluation metrics, especially those leveraging advanced language models, a more reliable indicator of a generative model's performance compared to traditional reference-based metrics?

Key Publications:

Anum Afzal, Alexander Kowsik, Rajna Fani, and Florian Matthes. 2024. Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human-in-the-Loop. In Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024), pages 4–16, Mexico City, Mexico. Association for Computational Linguistics.
Anum Afzal, Tao Xiang, and Florian Matthes. 2024. A Semi-Automatic Light-Weight Approach Towards Data Generation for a Domain-Specific FAQ Chatbot Using Human-in-the-Loop. In Proceedings of the 15th International Conference on Agents and Artificial Intelligence (ICAART 2024), Rome, Italy. SCITEPRESS - Science and Technology Publications.

Research Partner:

This project is part of the SAP @ TUM Collaboration Lab and hence fosters a close research partnership with SAP Intelligent Enterprise Solutions and Artificial Intelligence Center of Excellence.
