Evaluation of LLM Systems, and more
In this post you will read about LLM evaluation, how to keep LLMs from getting distracted by irrelevant information in their context, and brain waves and cognitive performance.
In my Substack, deep random thoughts, I share a randomly selected set of my writing and updates every week. My posts relate to LLMs (and AI in general), product dev and UX, health (founders’ flavor), startup-related topics, and of course the events we run!
Asks and Announcements
One-Day Workshop on LLM Evaluation - add to your calendar
Last week at Aggregate Intellect!
This past week was mostly spent following up with leads from a few trade shows I attended, and the rest went into pushing forward some of the deals we are working on. The conversations are reaching very interesting points where we need to discuss, both internally and externally, the details and terms of the deals we want to get into. One interesting nuance we are trying to work out is how to maximize optionality for these early clients and for ourselves at the same time. This is all exciting stuff, and I’m glad that we’re finally at a phase where we need to sort through details like these.
Last week’s unsung hero
This week I had a few conversations with folks in the luxury travel space, which is one of the target markets we are looking at. These came through intros made by Jennifer Martin.
Thank you, Jennifer, for being a super connector.
LLM Stuff
In our recent LLM workshop, Matt Fornito emphasized the importance of effectively communicating the value of data science to executives and stakeholders. He introduced a maturity framework, identified key personas within an organization, and discussed challenges and strategies for adoption, collaboration across roles, and implementation. He also stressed the importance of creating a culture of data-driven transformation within organizations.
Topics:
---------
⃝ Importance of effectively communicating the value of data science
* Address concerns and educate executives and stakeholders about the benefits and opportunities of data science
* Use a maturity framework to assess an organization's level of data-driven decision-making
* Build relationships with key personas within an organization
⃝ Challenges and strategies for adoption
* Executives may have reservations about privacy and security implications of using large language models (LLMs) and proprietary data
* Provide education and training opportunities to foster trust and understanding
* Collaborate with CDOs and CTOs to overcome hardware constraints and cost considerations, and understand what's needed to transform data pipelines
* Data engineers play a crucial role in ensuring reliability, explainability, privacy, and security of Gen AI models
⃝ Implementation
* Identify business cases where Gen AI can have a significant impact
* Collaborate with various departments to prioritize use cases and develop a roadmap
* Generate small wins to build trust and stakeholder buy-in
* Productionize models, assess their value, and scale solutions successfully
⃝ Creating a data-driven transformation culture
* Provide training and workshops to educate employees about generative AI
* Align AI initiatives with business and organizational goals
* Assess talent acquisition and consider bringing in new talent or external consultants
* Encourage a culture of experimentation and adaptability
* Demonstrate the financial impact of AI initiatives through ROI discussions
1. How can Language Models (LLMs) enhance customer service through natural conversations?
2. What challenges arise when using LLMs for dialogue systems, particularly in terms of evaluation?
3. How can the simulation of conversations between production dialogue systems and users improve the assessment of changes in production pipelines?
Join us for a workshop session with Bénédicte Pierrejean, a Senior ML Scientist at Ada, specializing in Applied Machine Learning.
LLMs empower us to move beyond structured dialogue flows, fostering natural conversations. However, the application of LLMs in dialogue systems poses challenges, especially in the realm of evaluation. Bénédicte will unravel the complexities and solutions associated with ensuring safe, accurate, and relevant content production.
Béné holds a PhD in Natural Language Processing, the very heart of the technology driving these advancements. With a passion for improving customer experiences through Machine Learning, she brings a wealth of knowledge and practical insights to the table. As a key member of the Applied Machine Learning team at Ada, Bénédicte actively contributes to shaping the future of dialogue systems.
The session will delve into the Automatic Evaluation of Dialogue Systems. By exploring how LLMs are employed in the evaluation process, Bénédicte will guide us through the simulation of conversations between production dialogue systems and users. This innovative approach allows us to replicate realistic testing conditions, providing a swift and comprehensive assessment of any changes introduced to our production pipelines.
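If you want to poke at this pattern before the session, here is a minimal, hypothetical sketch of the idea (not Ada's actual tooling): one LLM plays the user, the production bot answers, and a second LLM call judges the transcript. The persona, model names, and `production_bot` stub are all assumptions for illustration.

```python
# Sketch: evaluate a dialogue system by simulating the user with an LLM
# and scoring the resulting transcript with an LLM judge.
from openai import OpenAI

client = OpenAI()

def production_bot(transcript: list[dict]) -> str:
    """Placeholder for the production dialogue system under test."""
    return "I'm sorry to hear that. Could you share your order number?"

def simulated_user(transcript: list[dict], persona: str) -> str:
    """An LLM plays the customer; roles are flipped so the bot's lines
    arrive as 'user' turns from the simulator's point of view."""
    messages = [{"role": "system", "content": persona}]
    for turn in transcript:
        flipped = "assistant" if turn["role"] == "user" else "user"
        messages.append({"role": flipped, "content": turn["content"]})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return reply.choices[0].message.content

persona = ("You are a customer whose order arrived damaged. "
           "Write one short message per turn.")
transcript = [{"role": "user", "content": "Hi, my order arrived damaged."}]
for _ in range(3):  # a few simulated turns
    transcript.append({"role": "assistant", "content": production_bot(transcript)})
    transcript.append({"role": "user", "content": simulated_user(transcript, persona)})

# LLM-as-judge: rate the bot's replies for safety, accuracy, and relevance.
rubric = "Rate the assistant's replies 1-5 for accuracy and relevance. Transcript:\n"
verdict = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": rubric + str(transcript)}],
)
print(verdict.choices[0].message.content)
```

Swapping in different personas or judge rubrics lets you probe specific failure modes of a pipeline change before it reaches real users.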
1. How do Multi-Modal RAG systems contribute to enhancing information retrieval?
2. What are the traditional and modern techniques for evaluating these LLM systems?
3. How can the LlamaIndex library be practically utilized to build and evaluate a Multi-Modal RAG engine?
Join us for an insightful workshop session with Val Andrei Fajardo, a seasoned Software/Machine Learning Engineer at LlamaIndex, as he delves into the evaluation of Multi-Modal RAG Systems.
Here's a sneak peek into what you can expect:
Multi-Modal RAG systems play a pivotal role in information retrieval, combining powerful retrieval systems (for example, over text and image data) with the generative power of LLMs to provide a richer user experience. Evaluating these systems is crucial for ensuring their effectiveness in diverse applications. Andrei, with nearly a decade of experience and a PhD in Statistics, brings a wealth of expertise to guide us through the intricate process of evaluation.
Andrei, a Founding Software/Machine Learning Engineer at LlamaIndex, stands at the intersection of industry experience and academic prowess with a PhD from the University of Waterloo. His vast knowledge makes him the perfect guide for understanding the complexities of Multi-Modal RAG system evaluation.
In this interactive tutorial, we'll explore the practical aspects of using the LlamaIndex library. Learn how to seamlessly integrate text documents and images into a vector store, construct a Multi-Modal RAG engine, and evaluate its performance against benchmarks.
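As a taste of what the hands-on portion covers, here is a minimal sketch in the spirit of LlamaIndex's multi-modal examples. Import paths and class names have shifted across LlamaIndex releases (this follows the v0.9.x-era examples), and the data folder, model name, and query are placeholder assumptions.

```python
# Build and query a multi-modal RAG engine over a folder of mixed text and images.
from llama_index import SimpleDirectoryReader
from llama_index.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Load text files and images from a local folder (placeholder path).
documents = SimpleDirectoryReader("./data").load_data()

# Text and image nodes are embedded into separate vector stores under the hood.
index = MultiModalVectorStoreIndex.from_documents(documents)

# Retrieve from both modalities, then let a vision-capable LLM synthesize an answer.
query_engine = index.as_query_engine(
    multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview"),
)
print(query_engine.query("What does the architecture diagram show?"))
```

For the evaluation side, retrieval quality can be scored against a labeled query set with metrics like hit rate and MRR (LlamaIndex ships a `RetrieverEvaluator` for this), which is the kind of benchmark comparison the tutorial walks through.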
1. How does the peer review process contribute to the advances in machine learning, and what challenges does it currently face?
2. What structural issues in the field of machine learning might be hindering the peer review process?
3. How can a transparent, community-driven approach, particularly through machine learning competitions, address these challenges and contribute to the progress of the field?
Machine learning has propelled remarkable advancements, but its foundation lies in a rigorous peer review process. Megan Risdal, a lead Product Manager at Kaggle, explores the challenges facing this process and how the rapid growth of the field may be impacting its effectiveness. By addressing structural issues and advocating for a community-driven approach, the talk sheds light on the vital role that machine learning communities play in improving the quality, trustworthiness, and rigor of results.
Meg, with a background in linguistics and Master's degrees from UCLA and North Carolina State University, brings a unique perspective to the discussion. As a lead Product Manager at Kaggle, she combines academic knowledge with practical experience, making her well-equipped to delve into the structural challenges of machine learning and propose innovative solutions.
Some resources from our Slack channel (join):
PDFTriage: Question Answering over Long, Structured Documents
Large Language Models Can Be Easily Distracted by Irrelevant Context
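The second paper above (Shi et al., 2023) reports that sprinkling irrelevant facts into a problem statement sharply degrades LLM accuracy, and that a simple instructed prompt telling the model it may ignore irrelevant information recovers much of the loss. A toy sketch of that mitigation (instruction wording paraphrased from the paper; works with any chat LLM):

```python
# Mitigation from "LLMs Can Be Easily Distracted by Irrelevant Context":
# tell the model up front that parts of the question may be irrelevant.
IGNORE_HINT = "Feel free to ignore irrelevant information given in the question."

question = (
    "Lucy has 3 apples. Her neighbor's dog is 4 years old. "
    "She buys 5 more apples. How many apples does Lucy have?"
)

# The distractor (the dog's age) is exactly the kind of detail that trips models up.
prompt = f"{IGNORE_HINT}\n\nQ: {question}\nA: Let's think step by step."
# Send `prompt` to your chat LLM of choice; the paper also found self-consistency
# decoding (sampling several answers and taking a majority vote) helps.
```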
Podcasts
Here are some good podcasts I listened to this week: