Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability of a computer system to mimic human intelligence through math and logic, while ML builds on AI by developing methods that "learn" from experience rather than requiring explicit instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Enterprise AI: The Emerging Landscape of Knowledge Engineering.

Conversational AI refers to the technology enabling machines to engage in natural language conversations with humans. This encompasses a suite of techniques, including natural language processing (NLP), natural language understanding (NLU), natural language generation (NLG), and dialogue management. In recent years, conversational AI has evolved remarkably, transitioning from simplistic rule- or FAQ-based systems to advanced virtual assistants capable of human-like dialogue. This evolution has been closely intertwined with breakthroughs in generative AI (GenAI) and the development of large language models (LLMs), exemplified by OpenAI's GPT series and Google's BERT. While significant strides have been made, challenges such as privacy, bias, and user experience persist, even as the technology promises ever more sophisticated interactions between humans and machines. In this article, we explore the intertwined journey of conversational AI and the emergence of GenAI and LLMs, examining their evolution, impact, and implications for the future of human-computer interaction.

Convergence and Divergence of Conversational AI

Conversational AI has entered a new era with the integration of LLMs and GenAI. Traditional conversational AI focused on rule-based interactions; the fusion of LLMs and GenAI departs from that model, enabling systems to generate more diverse, intelligent, and contextually aware interactions as they grow to comprehend user input and respond with greater depth and richness. At the same time, these technologies are converging to elevate conversational experiences to unprecedented heights. This divergence and convergence opens avenues for more nuanced dialogue and personalized interactions, challenging conventional approaches and paving the way for more sophisticated AI-human engagements.

Table 1. GenAI vs. conversational AI

| Aspect | GenAI | Conversational AI |
|---|---|---|
| Objective | Generates new, coherent, and contextually relevant content (e.g., text, images) | Facilitates natural language interactions/conversation between humans and machines |
| Techniques | Uses generative models: GANs, VAEs, autoregressive models | Employs NLP, NLU, NLG, and dialogue management techniques |
| Applications | Text generation, image synthesis, creative content generation | Virtual assistants, chatbots, customer service automation, etc. |
| Data requirements | Large amounts of diverse training data | Substantial datasets for language understanding and generation |
| Evaluation metrics | Quality, diversity, coherence, realism (perplexity, BLEU score, FID score) | Accuracy of responses, relevance, fluency, user satisfaction |
| Ethical considerations | Concerns around deepfakes, misleading content, copyright infringement | Privacy, bias, fairness, user trust, responsible deployment |

Shared Foundations

The fusion of conversational AI and GenAI marks a significant leap in AI capabilities, enabling more intelligent and contextually aware conversations. By integrating GenAI techniques like LLMs into conversational AI systems, AI can deeply comprehend user inputs, discern intents, and produce relevant responses. This convergence ensures more natural, personalized interactions that adapt dynamically to user needs and preferences. Overall, conversational AI's merger with GenAI empowers AI systems to engage with human-like intelligence, revolutionizing technological interactions.
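As a rough illustration of this shared foundation, the sketch below shows one way a system can carry dialogue context into each LLM call so that replies stay consistent with earlier turns. This is a minimal sketch, not a production design; `llm` stands in for any text-in/text-out model callable and is not part of any specific library.

Python

# A minimal sketch: carrying dialogue history into every LLM call so the
# reply stays consistent with earlier turns. `llm` is any callable that
# takes a prompt string and returns a response string.
def reply(llm, history, user_message):
    history.append(("user", user_message))
    # Flatten the full conversation into a single prompt for the model.
    prompt = "\n".join(f"{role}: {text}" for role, text in history) + "\nassistant:"
    answer = llm(prompt)
    history.append(("assistant", answer))
    return answer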
Adaptive Learning

Both conversational AI and GenAI systems utilize adaptive learning, which continuously refines their capabilities. Through iterative analysis of user interactions and feedback, these systems improve response accuracy and content generation. This iterative learning process enables them to evolve over time, delivering more sophisticated and tailored experiences to users.

Intelligent Conversations

LLMs and GenAI, integrated with conversational AI systems, generate diverse responses that adapt to user preferences, conversational context, and evolving language nuances with emotional intelligence. This integration allows for dynamic interactions, where AI responses are finely tuned to empathetically address user needs, fostering more engaging and personalized conversations.

LLMs and GenAI in Conversational AI

Embedding LLMs and GenAI involves a series of technical steps to build robust and effective systems for NLU and NLG. The process begins with the collection of large datasets containing diverse conversational data, which serve as the foundation for training LLMs and GenAI models. These datasets are preprocessed to clean the data and prepare it for input into the models, which includes tokenizing the text and encoding it into numerical representations (see the tokenization sketch at the end of this section). In this context, prompts, commands, and sentiments play crucial roles in facilitating effective human-machine interactions:

Prompts
- Initiate conversations, guiding user interaction with the AI
- Establish interaction context, indicating user information or action needs
- Guide conversation direction, triggering AI to respond suitably to user queries

Commands
- Prompt AI to perform tasks in response to user requests
- Guide AI to perform tasks like setting reminders or providing information
- Trigger AI to generate responses or perform user-requested actions, guiding conversation flow

Sentiments
- Indicate user mood, preferences, or satisfaction
- Shape AI responses, adjusting tone or content based on user emotion
- Provide feedback for AI adaptation, enhancing the user experience

Conversational AI Implementation and Deployment With LLMs and GenAI

Next, the models are trained using advanced deep learning techniques, such as transformers for LLMs and generative adversarial networks (GANs) or variational autoencoders (VAEs) for GenAI. During training, the models learn the intricacies of human language by optimizing parameters to minimize loss functions and improve performance on specific conversational tasks. Once trained, the models undergo fine-tuning to specialize them for particular applications or domains. This involves further training on smaller, domain-specific datasets to enhance performance and adapt the models to the target use case. The fine-tuned LLMs and GenAI models are then integrated into the conversational AI system architecture, typically through the development of APIs or interfaces that enable interaction with the models. Upon deployment in production environments, the conversational AI system with integrated LLMs and GenAI models is monitored for performance and user feedback. Continuous evaluation allows for iterative improvements to the models and system architecture, ensuring that the conversational AI system remains effective and responsive to user needs over time. Overall, the implementation of conversational AI with LLMs and GenAI represents a complex yet essential process in the development of advanced conversational systems capable of engaging with users in natural and meaningful ways.
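To make the preprocessing step described above concrete, here is a minimal tokenization sketch. It assumes the Hugging Face transformers library with an off-the-shelf BERT tokenizer (an arbitrary choice for illustration); any tokenizer would serve the same purpose.

Python

from transformers import AutoTokenizer

# Turn raw text into the numerical representation a model actually consumes.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Where is my order?", return_tensors="pt")
print(encoded["input_ids"])                       # tensor of token IDs
print(tokenizer.tokenize("Where is my order?"))   # ['where', 'is', 'my', 'order', '?']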
Figure 1. Conversational AI multi-modal architecture with embedded LLM

Contextual Continuity, Diversity, Dynamism, and Personalization

Conversational AI uses LLMs and GenAI to ensure contextual continuity, diversity, dynamism, and personalization, thus enhancing user engagement and satisfaction. LLMs analyze previous interactions to generate consistent responses, preserving conversational context and user preferences. This integration bridges the gap between human and machine interactions, making conversations more coherent and engaging. Furthermore, LLMs and GenAI empower conversational AI systems to generate diverse, contextually relevant responses, catering to user preferences and dynamically adapting to evolving conversational contexts. Real-time learning mechanisms enable continual improvement in response accuracy and effectiveness, while adaptive learning ensures personalized interactions tailored to individual user needs. Ultimately, this integration drives business value by increasing customer satisfaction, loyalty, and engagement, leading to enhanced sales and revenue.

Conversational AI for the Metaverse

In the rapidly evolving landscape of virtual reality, the metaverse emerges as a digital domain characterized by its immersive and interconnected nature. It encompasses virtual environments where users can interact, socialize, and engage in various activities, blurring the boundaries between the physical and digital worlds. Conversational AI plays a pivotal role in shaping the user experience within the metaverse. By leveraging AI and NLP technologies, conversational AI enhances interaction and communication in virtual environments.

Virtual Assistance and Immersive Language Experience Foundations

In the metaverse, conversational AI-powered virtual assistants act as essential guides, providing personalized assistance and facilitating seamless interactions. Integrated with GenAI, conversational AI enables intelligent and contextually aware conversations, enhancing immersion and engagement. It leverages pre-trained LLMs to understand and generate human-like responses in real time. These models are fine-tuned to specific conversational contexts within the metaverse, enabling them to comprehend user queries deeply and respond with contextually relevant information, thus enriching the entire metaverse experience. Overall, conversational AI plays a vital role in facilitating communication, enhancing user engagement, and shaping immersive virtual environments.

Ethical Implications and Challenges in Conversational AI

Conversational AI brings forth a host of ethical dilemmas, ranging from the risk of generating misleading content to ensuring fairness, compliance, and transparency. In this section, we explore the multifaceted ethical challenges inherent in conversational AI and strategies for ethical AI development.

Table 2. Challenges, implications, and mitigations for conversational AI
| Challenge | Risk | Detection and Mitigation |
|---|---|---|
| AI-generated misleading content | Harms trust and credibility; causes confusion and misunderstanding; undermines communication and decision-making; violates ethical and legal standards | Use NLP algorithms to spot inconsistencies; employ human oversight for content credibility; disclose AI limitations transparently; establish clear ethical guidelines |
| Bias | Perpetuates discrimination and inequalities; leads to unfair treatment and biased decisions; reinforces stereotypes and prejudices; poses ethical, legal, and economic risks | Use bias detection algorithms to spot discriminatory patterns; regularly audit AI systems for fairness; apply debiasing algorithms to mitigate unfairness; educate developers on bias awareness and mitigation |
| Regulations and compliance | Non-compliance risks legal penalties and reputation damage; inadequate measures lead to breaches and operational disruptions; violations spark lawsuits, audits, and regulatory investigations | Enhance data security, policy compliance, and staff training; ensure clear documentation, legal partnerships, and internal reviews; maintain transparent communication with regulators |
| Overfitting and generalization | Overfitting memorizes data while neglecting patterns, hampering adaptation to new situations and causing incorrect assumptions; overgeneralization yields oversimplified, unreliable models that may miss some data and make inaccurate predictions | Regularly validate models on diverse datasets; apply regularization techniques to prevent overfitting; utilize cross-validation to assess model generalization; fine-tune model hyperparameters judiciously |
| Transparency and accountability | Transparency deficits erode user trust in AI; inadequate accountability risks legal and ethical problems; opaque processes raise concerns about decision-making and may breach regulations; privacy concerns deter users from engaging with opaque AI | Use explainable AI for transparent decisions; offer comprehensive model documentation; follow industry standards for transparency; conduct regular audits for accountability and compliance |

Conclusion

The future trajectory of conversational AI promises a synergistic evolution, propelled by advancements in generative AI and LLMs. Innovative interfaces, including voice-enabled devices and augmented reality platforms, are reshaping human-AI interactions. By leveraging transformer-based architectures and massive training datasets, LLMs enable conversational AI systems to comprehend user queries more effectively and generate contextually relevant responses in real time. LLMs also imbue these interactions with emotional intelligence and empathy, providing personalized experiences tailored to individual users. These advancements are driving increased adoption across industries such as healthcare, finance, and retail. This crossover with GenAI and LLMs has elevated conversational experiences to unprecedented heights, offering users richer, more personalized interactions. While the future of conversational AI holds immense promise, it also presents significant challenges and ethical considerations. Safeguarding privacy, mitigating bias, ensuring transparency, and fostering trust are paramount in navigating this evolving landscape. Moreover, enterprises must address challenges related to data security, regulatory compliance, and the responsible deployment of AI technologies.
By prioritizing ethical considerations and proactively addressing enterprise challenges, we can ensure that conversational AI continues to deliver value while upholding ethical standards and societal well-being. This is an excerpt from DZone's 2024 Trend Report, Enterprise AI: The Emerging Landscape of Knowledge Engineering.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Enterprise AI: The Emerging Landscape of Knowledge Engineering.

Over the last few years, artificial intelligence (AI), especially generative AI, has permeated all sectors, growing smarter and faster — and becoming ubiquitous. Generative AI can confer competitive advantage, and the large language models (LLMs) that underpin it, along with their ever-expanding use cases, have evolved faster than any technology in history. But AI also has the potential for misuse, raising fundamental questions about its human and ethical impact: How can it be regulated to balance fostering innovation with safeguarding society against the risks of poor design or misuse? What is good use? How should AI be designed and deployed? These questions demand immediate answers as AI's influence continues to spread. This article offers food for thought and some advice for the way forward.

Can You Regulate When the Horse Has Already Bolted?

AI is not a problem for tomorrow — it's already here; the horse has well and truly bolted. OpenAI's ChatGPT set a record for the fastest-growing user base, gaining 100 million monthly active users within the first two months. According to OpenAI, "Millions of developers and more than 92% of Fortune 500 are building on our products today." Many startups and enterprises use the tech for good, such as helping people who stutter speak more clearly, detecting landmines, and designing personalized medicine. However, AI can also be used in ways that cause harm, such as misidentifying suspects, defaming journalists, breaching artistic copyright, and developing deepfakes that can steal millions. Furthermore, the datasets within the LLMs that power AI can be gender or racially biased or contain illegal images. Therefore, AI regulations must examine existing problems and anticipate future ones, which will evolve as LLMs enable new use cases across various industries, many of which we never thought possible. The latter is no easy task. Today's AI has created entirely new business opportunities and economic advantages that will make enterprises resistant to change. But change is possible, as GDPR regulations in Europe demonstrate, especially since compliance failure results in fines proportional to a business' earnings, depending on factors such as intent, damage mitigation, and cooperation with authorities.

Attempts to Regulate AI

Regarding governance, Europe has passed the Artificial Intelligence Act, which aims to protect fundamental rights, democracy, the rule of law, and environmental sustainability from high-risk AI, regulating systems according to their potential risks and level of impact. In North America, the Defense Production Act "will require tech companies to let the government know when they train an AI model using a significant amount of computing power." The President's Executive Order (EO) charges multiple agencies — including NIST — with producing guidelines and taking other actions to advance the safe, secure, and trustworthy development and use of AI. The UK hosted the first global AI safety summit last autumn and is building an AI-governance framework that embraces the transformative benefits of AI while being able to address emerging risks. India has yet to implement specific AI regulations. We don't know what these regulations will mean in practice, as they have yet to be tested in law. However, there are multiple active litigations against generative AI companies.
There's currently civil action against Microsoft, GitHub, and OpenAI, claiming that, "By training their AI systems on public GitHub repositories … [they] violated the legal rights of a vast number of creators who posted code under certain open-source licenses on GitHub." Writer Sarah Silverman has a similar claim against Meta and OpenAI for alleged copyright infringement. This is very different from having legislation requiring responsible AI from the design phase, with financial and legal penalties for companies that create AI that breaches regulations. Until the regulations are tested with enough heft to disincentivize the creation and use of unethical AI, such as deepfakes and racial bias, I predict a lot of David vs. Goliath cases where the onus is on the individuals harmed, going up against tech behemoths and spending years in court.

AI Challenges in Practice

Generative AI can be used to work better and faster than competitors, but it can breach regulations like GDPR, share company secrets, and break customer confidentiality. Most people fail to understand that ChatGPT retains user input data to train itself further. Thus, confidential data and competitive secrets are no longer private and are up for grabs by OpenAI's algorithm. Multiple studies show employees uploading sensitive data, including personally identifiable information (PII), to OpenAI's ChatGPT platform. The amount of sensitive data uploaded to ChatGPT by employees increased by 60% between just March and April 2023. Salesforce surveyed over 14,000 global workers across 14 countries and found that 28% of workers use generative AI at work, over half of them without formal employer approval. In 2023, engineers at Samsung's semiconductor arm used ChatGPT to input confidential data such as source code for a new program and internal meeting notes. In response, Samsung is developing its own AI models for internal use and restricting employee use to prompts with a 1024-byte limit.

The Lack of Recourse

There's also the issue of how AI is used as part of an enterprise's service offerings, ostensibly to increase efficiency and reduce manual tasks. For example, decision-making AI in the enterprise can choose one potential employee over another in recruitment or predict a tenant's ability to pay rent in housing software. Companies can't simply blame bad outcomes on AI. There must be a human overseer to address any potential or identified issues generated by the use of AI. Companies also must create effective channels for users to report concerns and provide feedback about decisions made by AI, such as chatbots. Clear policies and training are also necessary to hold employees accountable for responsible AI use and establish consequences for unethical behavior.

AI Innovation vs. Regulation

Governments are constantly trying to balance the regulation of AI against tech advancement. And the more you delve into it, the more the need for a human overseer emerges.

Unplanned Obsolescence

There's plenty of talk about AI making tasks easier and reducing pain points at work, but what happens to telemarketers, data clerks, copywriters, etc., who find their roles obsolete because AI can do them faster and better? I don't believe programs like universal basic income will provide adequate financial security for those whose jobs are replaced by AI. Nor do all displaced people want to transition to physical roles like senior care or childcare.
We need a focus on upskilling and reskilling workers to ensure they have the necessary skills to continue meaningful employment of their choosing.

Sovereignty and Competition

There is a pervasive challenge in the dominance of the large companies responsible for most AI tools, especially where smaller companies and governments build products on top of their models and open-source AI. What if open-source models become proprietary, or prices rise so that startups can no longer afford to create commercial products, preventing large-scale innovation? This is hugely problematic, as it means that smaller companies cannot compete equitably in the economic market. There's also the issue of sovereignty. Most LLMs originate from the US, meaning the data generated is more likely to be embedded with North American perspectives. This geographical skew creates a real risk that North American perspectives, biases, and cultural nuances will heavily influence users' understanding of the world. This increases the chance of algorithmic bias, cultural insensitivity, and ultimately, inaccuracies for users seeking information or completing tasks outside the dominant data landscape. International companies in particular have the opportunity to ensure that LLMs have diverse data representation with global perspectives. Open-source collaboration is an effective way to foster this and already has the necessary frameworks. Creating custom LLMs is no easy task on an infrastructural level — it's expensive, especially when you factor in the cost of talent, hardware, infrastructure, and compute power. GPUs power AI workloads and training, but they've been in short supply since the COVID-19 pandemic, with GPUs earmarked for 2024 already sold out. Some countries are buying up GPUs; the UK is planning to spend $126.3 million to purchase AI chips. This will leave fewer resources for less prosperous nations. Intentionally fostering innovation between developed and developing nations is crucial to facilitate knowledge-sharing, more equitable resource allocation, and joint development efforts. It also requires targeted funding and support for local infrastructure.

What Does Accountability Really Mean?

Company accountability for unethical AI — whether by design, deployment, or unintentional misuse — is complex, especially as we have yet to see the net result of AI regulations in practice. Accountability involves detecting and measuring the impact of unethical AI and determining the appropriate penalties. Existing regulations in industries such as financial services and healthcare are likely to help establish parameters, but each industry needs to predict and respond to its unique challenges. For example, the World Health Organization suggests liability rules so that users harmed by an LLM in healthcare are adequately compensated or have other forms of redress, reducing the burden of proof and ensuring fair compensation. We're only just getting started, and companies that commit to ethical AI in their earliest use cases will be able to adapt more easily and quickly to whatever regulations come over the following months and years.

The Way Forward

Ethical AI in practice involves intentionality, an ongoing commitment to design auditing, and an environment willing to look at the risks associated with AI. Companies that embed this commitment throughout their organization will succeed.

Active Ethical AI in the Workplace

The last few years have seen companies like X and Google reduce their responsible AI teams.
A dedicated team or role can assist with proactive risk management, building a transparent culture, and employee training. However, an AI ethicist or a responsible AI team only works if they have a place in the company hierarchy where they can drive and influence bottom-line business decisions with business managers, developers, and the C-suite. Otherwise, the role is simply public relations spin. There's also the temptation to assume that hiring a dedicated person or team makes ethics someone else's problem. Assigning ethics to a single individual or team could create a false sense of security and neglect broader responsibility across the organization, especially if it comes at the expense of embedding responsible AI from the earliest design phase and seeing it as a valuable asset to a company's brand.

Evolving Policies and Practices

Creating an AI policy is useful, but it needs to be embedded in your company's practices rather than simply being something that gets shared to keep investors happy. Ultimately, companies that want to practice responsible, ethical AI need to have this commitment embedded into their DNA, much like a security-first approach. This means active, working AI policies that are adaptable, align with innovation, and spread responsibility throughout the workplace. For example, companies like Microsoft highlight key factors in what ethical AI should look like, encompassing:

- Fairness
- Reliability and safety
- Privacy and security
- Inclusiveness
- Transparency
- Accountability

Choosing Ethical Tools

Companies can also weatherproof themselves by committing to using tools and services focused on ethical AI. Some examples include:

- Researchers from the Center for Research on Foundation Models have developed the Foundation Model Transparency Index to assess the transparency of foundation model developers.
- Fairly Trained offers certifications for generative AI companies that get consent for the training data they use.
- daios helps developers fine-tune LLMs with ethical values that users control, creating a feedback loop between users, data, and companies.
- Last year, Aligned AI worked on making AI more "human" and robust against misgeneralization. It was the first to surpass a key benchmark called CoinRun by teaching an AI to "think" in human-like concepts.

Conclusion

AI is complex, and ultimately, this article poses as many questions as answers. When tech capabilities, use cases, and repercussions are ever-evolving, continual discussion and an actionable commitment to ethics are vital. A company that commits to ethical AI in its early iterations weatherproofs itself against incoming regulations and possible penalties for AI misuse. But most importantly, committing to ethical AI protects a company's identity and competitive advantage.
Resources:

- "Artists take new shot at Stability, Midjourney in updated copyright lawsuit" by Blake Brittain, 2023
- GitHub Copilot litigation by Matthew Butterick, 2022
- "ChatGPT sets record for fastest-growing user base - analyst note" by Krystal Hu, 2023
- "UK to spend £100m in global race to produce AI chips" by Anna Isaac, 2023
- "Nvidia's Best AI Chips Sold Out Until 2024, Says Leading Cloud GPU Provider" by Tae Kim, 2023
- "Samsung workers made a major error by using ChatGPT" by Lewis Maddison, 2023
- "Biden Administration to implement new AI regulations on tech companies" by Duncan Riley, 2024
- "More than Half of Generative AI Adopters Use Unapproved Tools at Work," Salesforce, 2023
- "Nvidia's AI Chip Supplies Will Be Insufficient This Year, Foxconn's Chairman Says" by Shi Yi, 2024
- "Sarah Silverman Sues OpenAI and Meta Over Copyright Infringement" by Zachary Small, 2023
- "Hackers Steal $25 Million by Deepfaking Finance Boss" by Victor Tangermann, 2024
- "Identifying and Eliminating CSAM in Generative ML Training Data and Models" by David Thiel, 2023
- "Implicit Bias in Large Language Models: Experimental Proof and Implications for Education" by Melissa Warr, Nicole Jakubczyk Oster, and Roger Isaac, 2023

This is an excerpt from DZone's 2024 Trend Report, Enterprise AI: The Emerging Landscape of Knowledge Engineering.
It's hard to believe it has been almost six years since I wrote my last article on Artificial Intelligence (AI), "Practical Artificial Intelligence." In that article, I gave an overview of the state of AI and Machine Learning (ML) and some popular usage and tools at the time. Since then, things have gotten crazy in the AI world: everyone is talking about tools like ChatGPT, but most people really don't understand all the terminology and tools or what they are best suited for. In this article (part 1 of 2), I will attempt to give you an updated look at this ecosystem and try to explain things in "Joel" terms. So let's start off by defining all the key terms and look at some examples of each.

Key Terms

Artificial Intelligence (AI)

Artificial intelligence (AI) is human intelligence exhibited by machines. Examples of AI include facial recognition, help desk chatbots, smart thermostats, and speech-to-text. As developers, we have been writing AI into our code since the beginning. An early example is playing a game of tic-tac-toe, where the code simulates another human's intelligence to play against you (see the sketch at the end of this section). Note, there is nothing fancy or magic about this code; in fact, it's just a series of IF statements; and, yes, this does qualify as AI!

Machine Learning (ML)

Machine learning (ML) is a subset of AI where we use algorithms to parse data, learn from it, and then decide to do something. All machine learning counts as AI, but not all AI counts as machine learning. For example, symbolic logic – rules engines, expert systems, and knowledge graphs – could all be described as AI, and none of them are machine learning. If you go to Amazon to shop for some shoes, maybe it will suggest a shirt to match. Amazon also uses tools like image classification to decide what an image is to prevent false advertising. Lastly, the big email providers such as Google and Yahoo use spam classification to decide whether an email is spam. How well they do this impacts us every day! As developers, when we put machine learning into our code, we historically implemented it with algorithms as models. Have you ever built a decision tree in your code? That was you implementing a machine learning model, probably without even realizing it.

Large Language Models (LLMs)

A Large Language Model (LLM) is a type of artificial intelligence (AI) algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate, and predict new content. The term "generative AI" is closely connected with LLMs: an LLM is a type of generative AI that has been specifically architected to help generate text-based content. Current examples of LLMs are GPT-4 (behind OpenAI's ChatGPT), Llama (Meta), and LaMDA (Google). Examples of historical, developer-built precursors include:

- Coding up a chatbot to understand basic commands and process them (think of the early text-based "Adventure" game)
- Word2vec, a technique where you obtain vector representations of words to determine semantic meaning

Generative AI

Generative AI is artificial intelligence capable of generating text, images, or other media using predictive models. Probably the most well-known generative AI today is Dall-E, which many people are using to generate images; ChatGPT, likewise, is being used to generate articles and other content. Most of us never coded generative AI in the early days, but there was plenty of work going on even back in 1970; some of the early examples included speech synthesis.
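As referenced in the Artificial Intelligence (AI) section above, here is a minimal sketch of that kind of rule-based tic-tac-toe logic. It is an illustration of the idea (win, block, then prefer good squares), not the original article's listing.

Python

# A rule-based tic-tac-toe move chooser: just a series of IF statements.
# `board` is a list of 9 cells containing "X", "O", or " ".
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def choose_move(board, me="O", opponent="X"):
    # 1. If a winning move exists, take it.
    for a, b, c in WINS:
        line = [board[a], board[b], board[c]]
        if line.count(me) == 2 and line.count(" ") == 1:
            return (a, b, c)[line.index(" ")]
    # 2. Otherwise, block the opponent's winning move.
    for a, b, c in WINS:
        line = [board[a], board[b], board[c]]
        if line.count(opponent) == 2 and line.count(" ") == 1:
            return (a, b, c)[line.index(" ")]
    # 3. Otherwise, prefer the center, then corners, then edges.
    for cell in [4, 0, 2, 6, 8, 1, 3, 5, 7]:
        if board[cell] == " ":
            return cell
    return None  # board is full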
GPT: Generative Pre-trained Transformer

ChatGPT (Chat Generative Pre-trained Transformer), the most popular example, is a large language model-based chatbot developed by OpenAI (with multiple competitors, such as Bard and Claude).

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a powerful technique that combines the strengths of pre-trained language models with the benefits of information retrieval systems. The primary purpose of RAG is to enhance the capabilities of large language models (LLMs), particularly in tasks that require a deep understanding and generation of contextually relevant responses. Since most developers didn't code LLMs back in the day, most also didn't implement RAG. RAG means taking information not already contained in a standard model and augmenting the model with it to make it more powerful for a specific usage. An example is taking my company handbook and augmenting an existing GPT/LLM with it by using a technique called embeddings. Once I have done this, I can ask the GPT questions about my company policies, and it should be able to make deductions about them, even though the original LLM knows nothing about my company.

Business Use

Now that we have the basic vocabulary down and have some good examples of how all this stuff works, let's talk about 2024 and how we should leverage all the great tools folks have built to do awesome AI and ML! My company uses software to solve business problems for our customers. This is what we have always done, so I like to approach AI and ML not from a technology angle, but from the question, "What business problems can we solve?" Traditionally, we have used AI and ML to solve the following problems:

- Taking documents and turning them into something searchable and usable: We have used Natural Language Processing (NLP) to understand the meaning and sentiment of text, to make it searchable, and to automate document processing to turn images into text and classify and segment them. Most of this work revolved around taking legacy PDF documents and making better use of them.
- Predicting sales results: Taking data our customers have and building models to produce sales predictions for them.
- Building faster, less error-prone software: For the last few years, we have been leveraging code-assist tools like GitHub Copilot and Amazon CodeWhisperer to help developers write code faster at a higher quality.

Today, we can solve these same problems with higher accuracy and less work using the latest and greatest tools (we will talk about a few of these shortly). We can also solve other typical business problems more easily with new tools. Some problems that are easier to solve now are:

- Automating simple customer service interactions: Building a custom chatbot to solve your customers' most common problems is now fairly easy to do.
- Automating tasks that were historically done manually, like combining files, filtering files, summarizing files, etc., by using ChatGPT-type tools to assist you.
- Incorporating features like facial recognition and speech-to-text/text-to-speech into our systems.

Tools and Frameworks

Until recently, if you wanted to build a machine learning model, the only option you had was to write some code.
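For instance, hand-coding even a simple model meant something like the following: a minimal sketch using scikit-learn (the first option in the list below) to train a decision tree on the classic iris dataset.

Python

# "Writing some code" to build a model: a small decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")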
Most developers choose from one of a few options:

- Scikit-learn: Easy to learn; where most people start
- TensorFlow: From Google; more powerful, but complex (some folks use Keras on top)
- PyTorch: From Meta; more powerful and easier to use than TensorFlow

Now the major cloud providers each have their own set of tools targeted at each area:

- Microsoft: Azure OpenAI (GPT), Vision/Speech/Translation, ML Studio, Automated ML, pre-trained models
- AWS: SageMaker (traditional); Bedrock (multiple model options); SageMaker Canvas and Autopilot for Automated ML; Amazon Q (GPT)
- Google: Vertex AI (Gemini model), Bard (GPT), TensorFlow

Recently, higher-level tools have also come onto the scene that allow you to build models quickly without writing any code:

- Nyckel: Lets you upload a CSV and go
- SimpleML for Google Sheets: See "Business Owners: Take Control of Your Data"
- AutoML (AWS, Azure, GCP) from the cloud providers above
- Using ChatGPT to do your ML for you (the feature used to be called Advanced Data Analysis)

Summary

In this article (Part 1), we summarized all the latest terminology around artificial intelligence and machine learning, along with many examples. We also showed typical business problems that can be solved with the latest tools. Lastly, we talked about the popular tools in this space today. Stay tuned for Part 2 of this article, where we will walk you through a real-life example using one of these tools to solve a business problem!
The relentless advancement of artificial intelligence (AI) technology is reshaping our world, with Large Language Models (LLMs) spearheading this transformation. The emergence of the LLM-4 architecture signifies a pivotal moment in AI development, heralding new capabilities in language processing that challenge the boundaries between human and machine intelligence. This article provides a comprehensive exploration of LLM-4 architectures, detailing their innovations, applications, and broader implications for society and technology.

Unveiling LLM-4 Architectures

LLM-4 architectures represent the cutting edge in the evolution of large language models, building upon their predecessors' foundations to achieve new levels of performance and versatility. These models excel in interpreting and generating human language, driven by enhancements in their design and training methodologies. The core innovation of LLM-4 models lies in their advanced neural networks, particularly transformer-based structures, which allow for efficient and effective processing of large data sequences. Unlike traditional models that process data sequentially, transformers handle data in parallel, significantly enhancing learning speed and comprehension. To illustrate, consider the Python implementation of a transformer encoder layer below. This code reflects the mechanisms that enable such models to learn and adapt with remarkable proficiency (note that an activation has been restored in the feed-forward block, which the original listing omitted):

Python

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super(TransformerEncoderLayer, self).__init__()
        # Multi-head self-attention over the input sequence
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Position-wise feed-forward network
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        # Layer normalization and dropout for the two residual connections
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src):
        # Self-attention block with residual connection and normalization
        src2 = self.self_attn(src, src, src)[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        # Feed-forward block (nonlinearity between the two linear layers)
        src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src

This encoder layer serves as a fundamental building block for the transformer architecture, facilitating the deep learning processes that underpin the intelligence of LLM-4 models.

Broadening Horizons: Applications of LLM-4

The versatility of LLM-4 architectures opens a plethora of applications across various sectors. In natural language processing, these models enhance translation, summarization, and content generation, bridging communication gaps and fostering global collaboration. Beyond these traditional uses, LLM-4 models are instrumental in creating interactive AI agents capable of nuanced conversation, making strides in customer service, therapy, education, and entertainment. Moreover, LLM-4 architectures extend their utility to the realm of coding, offering predictive text generation and debugging assistance, thus revolutionizing software development practices. Their ability to process and generate complex language structures also finds applications in legal analysis, financial forecasting, and research, where they can synthesize vast amounts of information into coherent, actionable insights.

Navigating the Future: Implications of LLM-4

The ascent of LLM-4 architectures raises critical considerations regarding their impact on society.
As these models blur the line between human and machine-generated content, they prompt discussions on authenticity, intellectual property, and the ethics of AI. Furthermore, their potential to automate complex tasks necessitates a reevaluation of workforce dynamics, emphasizing the need for policies that address job displacement and skill evolution. The development of LLM-4 architectures also underscores the importance of robust AI governance. Ensuring transparency, accountability, and fairness in these models is paramount to harnessing their benefits while mitigating associated risks. As we chart the course for future AI advancements, the lessons learned from LLM-4 development will be instrumental in guiding responsible innovation.

Conclusion

The emergence of LLM-4 architectures marks a watershed moment in AI development, signifying profound advancements in machine intelligence. These models not only enhance our technological capabilities but also challenge us to contemplate their broader implications. As we delve deeper into the potential of LLM-4 architectures, it is imperative to foster an ecosystem that promotes ethical use, ongoing learning, and societal well-being, ensuring that AI continues to serve as a force for positive transformation.
Artificial intelligence (AI) holds vast potential for societal and industrial transformation. However, ensuring AI systems are safe, fair, inclusive, and trustworthy depends on the quality and integrity of the data upon which they are built. Biased datasets can produce AI models that perpetuate harmful stereotypes, discriminate against specific groups, and yield inaccurate or unreliable results. This article explores the complexities of data bias, outlines practical mitigation strategies, and delves into the importance of building inclusive datasets for the training and testing of AI models [1].

Understanding the Complexities of Data Bias

Data plays a key role in the development of AI models, and data bias can infiltrate AI systems in various ways. Here's a breakdown of the primary types of data bias, along with real-world examples [1,2]:

| Bias Type | Description | Real-World Examples |
|---|---|---|
| Selection bias | Exclusion or under/over-representation of certain groups | A facial recognition system with poor performance on darker-skinned individuals due to limited diverse representation in the training data. A survey-based model primarily reflecting urban populations, making it unsuitable for nationwide resource allocation. |
| Information bias | Errors, inaccuracies, missing data, or inconsistencies | Outdated census data leading to inaccurate neighborhood predictions. Incomplete patient history affecting diagnoses made by medical AI. |
| Labeling bias | Subjective interpretations and unconscious biases in how data is labeled | Historical bias encoded in image labeling, leading to harmful misclassifications. Subjective evaluation criteria in a credit risk model, unintentionally disadvantaging certain socioeconomic groups. |
| Societal bias | Reflects existing inequalities, discriminatory trends, and stereotypes in data | Word embeddings encoding gender biases from historical text data. AI loan approval systems inadvertently perpetuating past discriminatory lending practices. |

Consequences of Data Bias

Biased AI models can have far-reaching implications:

- Discrimination: AI systems may discriminate based on protected attributes such as race, gender, age, or sexual orientation.
- Perpetuation of stereotypes: Biased models can reinforce and amplify harmful societal stereotypes, further entrenching them within decision-making systems.
- Inaccurate or unreliable results: AI models built on biased data may produce significantly poorer or unfair results for specific groups or contexts, diminishing their utility, value, and trustworthiness.
- Erosion of trust: The discovery of bias in AI models can damage public trust, delaying beneficial technology adoption.

Strategies for Combating Bias

Building equitable AI requires a multi-pronged approach involving tools, planning, transparency, and human oversight:

- Bias mitigation tools: Frameworks like IBM AI Fairness 360 offer algorithms and metrics to identify and reduce bias throughout the AI development lifecycle.
- Fairness thresholds: Techniques such as statistical parity or equal opportunity establish quantitative fairness goals (see the sketch after this list).
- Data augmentation: Oversampling techniques and synthetic data generation can help address the underrepresentation of specific groups, improving model performance.
- Data Management Plans (DMPs): A comprehensive DMP ensures data integrity and outlines collection, storage, security, and sharing protocols.
- Datasheets: Detailed documentation of dataset characteristics, limitations, and intended uses promotes transparency and aids in informed decision-making [3].
- Human-in-the-loop: AI models should be complemented by human oversight and validation to ensure safe, ethical outcomes and maintain accountability.
- Advanced techniques: For complex scenarios, explore re-weighting, re-sampling, adversarial learning, counterfactual analysis, and causal modeling for bias reduction.
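As a concrete example of a fairness threshold, the sketch below computes the statistical parity difference between two groups: the gap in their rates of favorable predictions. The toy data and group labels are hypothetical; in practice, this metric would typically come from a library such as IBM AI Fairness 360.

Python

import numpy as np

def statistical_parity_difference(y_pred, group):
    """P(favorable outcome | group A) - P(favorable outcome | group B)."""
    rate_a = y_pred[group == "A"].mean()
    rate_b = y_pred[group == "B"].mean()
    return rate_a - rate_b

# Toy predictions (1 = favorable) for eight individuals in two groups.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(statistical_parity_difference(y_pred, group))  # 0.75 - 0.25 = 0.5

A value near 0 suggests parity; a chosen threshold (e.g., an absolute difference below 0.1) can then serve as a quantitative fairness goal gating deployment.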
Guidance on Data Management Plans (DMPs)

While a data management plan may sound like a simple document, a well-developed DMP can make a huge difference in reducing bias and supporting safe AI development:

- Ethical considerations: DMPs should explicitly address privacy, informed consent, potential bias sources, and the potential for disproportionate impact.
- Data provenance: Document origin, transformations, and ownership to ensure auditability over time.
- Version control: Maintain clear versioning systems for datasets to enable reproducibility and track changes.

Evolving Datasheets for Transparency

Knowing how, and on what data, an AI model was trained makes it easier to evaluate and address claims about it. Datasheets play a major role here, as they help provide the following:

- Motivational transparency: Articulate the dataset's creation purpose, intended uses, and known limitations [3].
- Detailed composition: Provide statistical breakdowns of data features, correlations, and potential anomalies [3].
- Comprehensive collection process: Describe sampling methods, equipment, sources of error, and biases introduced at this stage.
- Preprocessing: Document cleaning, transformation steps, and anonymization techniques.
- Uses and limitations: Explicitly outline suitable applications and scenarios where ethical concerns or bias limitations are present [3].

AI Fairness Is a Journey

Achieving safe AI is an ongoing endeavor. Regular audits, external feedback mechanisms, and a commitment to continual improvement in response to evolving societal norms are vital for building trustworthy and equitable AI systems.

References

1. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
2. Rajkomar, A., Hardt, M., Howell, M. D., Corrado, G., & Chin, M. H. (2018). Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine, 169(12), 866-872.
3. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220-229.
In this blog, you will learn how to implement Retrieval Augmented Generation (RAG) using Weaviate, LangChain4j, and LocalAI. This implementation allows you to ask questions about your documents using natural language. Enjoy!

1. Introduction

In the previous post, Weaviate was used as a vector database in order to perform a semantic search. The source documents are two Wikipedia documents: the discography and the list of songs recorded by Bruce Springsteen. The interesting part of these documents is that they contain facts and are mainly in a table format. Parts of these documents were converted to Markdown in order to have a better representation, and the Markdown files were embedded in collections in Weaviate. The result was amazing: every question asked resulted in the correct answer; that is, the correct segment was returned. You still needed to extract the answer yourself, but this was quite easy. However, can this step be solved by providing the Weaviate search results to an LLM (Large Language Model) via the right prompt? Will the LLM be able to extract the correct answers to the questions? The setup is visualized in the graph below:

- The documents are embedded and stored in Weaviate;
- The question is embedded and a semantic search is performed using Weaviate;
- Weaviate returns the semantic search results;
- The result is added to a prompt and fed to LocalAI, which runs an LLM using LangChain4j;
- The LLM returns the answer to the question.

Weaviate also supports RAG, so why bother using LocalAI and LangChain4j? Unfortunately, Weaviate does not support integration with LocalAI, and only cloud LLMs can be used. If your documents contain sensitive information or information you do not want to send to a cloud-based LLM, you need to run a local LLM, and this can be done using LocalAI and LangChain4j. If you want to run the examples in this blog, you need to read the previous blog. The sources used in this blog can be found on GitHub.

2. Prerequisites

The prerequisites for this blog are:

- Basic knowledge of embeddings and vector stores;
- Basic Java knowledge; Java 21 is used;
- Basic knowledge of Docker;
- Basic knowledge of LangChain4j;
- You need Weaviate, and the documents need to be embedded; see the previous blog on how to do so;
- You need LocalAI if you want to run the examples; see a previous blog on how you can make use of LocalAI. Version 2.2.0 is used for this blog.

If you want to learn more about RAG, read this blog.

3. Create the Setup

Before getting started, there is some setup to do.

3.1 Setup LocalAI

LocalAI must be running and configured. How to do so is explained in the blog Running LLM's Locally: A Step-by-Step Guide.

3.2 Setup Weaviate

Weaviate must be started. The only difference from the Weaviate blog is that you will run it on port 8081 instead of port 8080, because LocalAI is already running on port 8080. Start the compose file from the root of the repository (the original listing omitted the `up` subcommand):

Shell

$ docker compose -f docker/compose-embed-8081.yaml up -d

Run the class EmbedMarkdown in order to embed the documents (change the port to 8081!). Three collections are created:

- CompilationAlbum: the list of all compilation albums of Bruce Springsteen;
- Song: the list of all songs by Bruce Springsteen;
- StudioAlbum: the list of all studio albums of Bruce Springsteen.

4. Implement RAG

4.1 Semantic Search

The first part of the implementation is based on the semantic search implementation of the class SearchCollectionNearText.
It is assumed here that you know which collection (argument className) to search. In the previous post, you noticed that, strictly speaking, you do not need to know which collection to search. However, at this moment, it makes the implementation a bit easier, and the result remains identical. The code takes the question and, with the help of NearTextArgument, embeds it. The GraphQL API of Weaviate is used to perform the search.

Java

private static void askQuestion(String className, Field[] fields, String question, String extraInstruction) {
    Config config = new Config("http", "localhost:8081");
    WeaviateClient client = new WeaviateClient(config);

    Field additional = Field.builder()
            .name("_additional")
            .fields(Field.builder().name("certainty").build(), // only supported if distance==cosine
                    Field.builder().name("distance").build()   // always supported
            ).build();
    Field[] allFields = Arrays.copyOf(fields, fields.length + 1);
    allFields[fields.length] = additional;

    // Embed the question
    NearTextArgument nearText = NearTextArgument.builder()
            .concepts(new String[]{question})
            .build();

    Result<GraphQLResponse> result = client.graphQL().get()
            .withClassName(className)
            .withFields(allFields)
            .withNearText(nearText)
            .withLimit(1)
            .run();

    if (result.hasErrors()) {
        System.out.println(result.getError());
        return;
    }
    ...

4.2 Create Prompt

The result of the semantic search needs to be fed to the LLM, including the question itself. A prompt is created which instructs the LLM to answer the question using the result of the semantic search. The option to add extra instructions is also implemented; later on, you will see what to do with that.

Java

private static String createPrompt(String question, String inputData, String extraInstruction) {
    return "Answer the following question: " + question + "\n" +
            extraInstruction + "\n" +
            "Use the following data to answer the question: " + inputData;
}

4.3 Use LLM

The last thing to do is to feed the prompt to the LLM and print the question and answer to the console.

Java

private static void askQuestion(String className, Field[] fields, String question, String extraInstruction) {
    ...
    ChatLanguageModel model = LocalAiChatModel.builder()
            .baseUrl("http://localhost:8080")
            .modelName("lunademo")
            .temperature(0.0)
            .build();

    String answer = model.generate(createPrompt(question, result.getResult().getData().toString(), extraInstruction));
    System.out.println(question);
    System.out.println(answer);
}

4.4 Questions

The questions to be asked are the same as in the previous posts. They will invoke the code above.

Java

public static void main(String[] args) {
    askQuestion(Song.NAME, Song.getFields(), "on which album was \"adam raised a cain\" originally released?", "");
    askQuestion(StudioAlbum.NAME, StudioAlbum.getFields(), "what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?", "");
    askQuestion(CompilationAlbum.NAME, CompilationAlbum.getFields(), "what is the highest chart position of the album \"tracks\" in canada?", "");
    askQuestion(Song.NAME, Song.getFields(), "in which year was \"Highway Patrolman\" released?", "");
    askQuestion(Song.NAME, Song.getFields(), "who produced \"all or nothin' at all?\"", "");
}

The complete source code can be viewed here.

5. Results

Run the code and the result is the following:

On which album was “Adam Raised a Cain” originally released?
The album “Darkness on the Edge of Town” was originally released in 1978, and the song “Adam Raised a Cain” was included on that album.
What is the highest chart position of “Greetings from Asbury Park, N.J.” in the US?
The highest chart position of “Greetings from Asbury Park, N.J.” in the US is 60.

What is the highest chart position of the album “Tracks” in Canada?
Based on the provided data, the highest chart position of the album “Tracks” in Canada is -. This is because the data does not include any Canadian chart positions for this album.

In which year was “Highway Patrolman” released?
The song “Highway Patrolman” was released in 1982.

Who produced “all or nothin’ at all?”
The song “All or Nothin’ at All” was produced by Bruce Springsteen, Roy Bittan, Jon Landau, and Chuck Plotkin.

All answers to the questions are correct. The most important job was done in the previous post: embedding the documents in the correct way resulted in finding the correct segments. An LLM is able to extract the answer to a question when it is fed the correct data.

6. Caveats

During the implementation, I ran into some strange behavior which is quite important to know about when you start implementing your own use case.

6.1 Format of Weaviate Results

The Weaviate response contains a GraphQLResponse object, something like the following:

JSON

GraphQLResponse(
    data={
        Get={
            Songs=[
                {
                    _additional={certainty=0.7534831166267395, distance=0.49303377},
                    originalRelease=Darkness on the Edge of Town,
                    producers=Jon Landau Bruce Springsteen Steven Van Zandt (assistant),
                    song="Adam Raised a Cain",
                    writers=Bruce Springsteen,
                    year=1978
                }
            ]
        }
    },
    errors=null)

In the code, only the data part is added to the prompt:

Java

String answer = model.generate(createPrompt(question, result.getResult().getData().toString(), extraInstruction));

What happens when you add the response as-is to the prompt?

Java

String answer = model.generate(createPrompt(question, result.getResult().toString(), extraInstruction));

Running the code returns the following wrong answer for question 3 and some unnecessary additional information for question 4. The other questions are answered correctly.

What is the highest chart position of the album “Tracks” in Canada?
Based on the provided data, the highest chart position of the album “Tracks” in Canada is 50.

In which year was “Highway Patrolman” released?
Based on the provided GraphQLResponse, “Highway Patrolman” was released in 1982.who produced “all or nothin’ at all?”

6.2 Format of Prompt

The code contains functionality to add extra instructions to the prompt. As you have probably noticed, this functionality is not used. Let's see what happens when you remove this from the prompt. The createPrompt method becomes the following (I did not remove everything, so that only a minor code change is needed):

Java

private static String createPrompt(String question, String inputData, String extraInstruction) {
    return "Answer the following question: " + question + "\n" +
            "Use the following data to answer the question: " + inputData;
}

Running the code adds some extra information to the answer to question 3, which is not entirely correct. It is correct that the album has chart positions for the United States, United Kingdom, Germany, and Sweden. It is not correct that the album reached the top 10 in the UK and US charts. All other questions are answered correctly.

What is the highest chart position of the album “Tracks” in Canada?
Based on the provided data, the highest chart position of the album “Tracks” in Canada is not specified.
The data only includes chart positions for other countries such as the United States, United Kingdom, Germany, and Sweden. However, the album did reach the top 10 in the UK and US charts.

Using an LLM remains a bit brittle: you cannot always trust the answer it gives. Changing the prompt accordingly does seem to minimize the hallucinations of an LLM. It is therefore important that you collect feedback from your users in order to identify when an LLM seems to hallucinate. This way, you will be able to improve the responses to your users. An interesting blog that addresses this kind of issue has been written by Fiddler.

7. Conclusion
In this blog, you learned how to implement RAG using Weaviate, LangChain4j, and LocalAI. The results are quite amazing. Embedding documents the right way, filtering the results, and feeding them to an LLM is a very powerful combination that can be used in many use cases.
Generative AI (GenAI) and large language models (LLMs) offer transformative potential across various industries. However, their deployment in production environments faces challenges due to their computational intensity, dynamic behavior, and the potential for inaccurate or undesirable outputs. Existing monitoring tools often fall short of providing the real-time insights crucial for managing such applications. Building on existing work, this article presents a framework for monitoring GenAI applications in production that addresses both infrastructure and quality aspects. On the infrastructure side, one needs to proactively track performance metrics such as cost, latency, and scalability. This enables informed resource management and proactive scaling decisions. To ensure quality and ethical use, the framework recommends real-time monitoring for hallucinations, factuality, bias, coherence, and sensitive content generation. The integrated approach empowers developers with immediate alerts and remediation suggestions, enabling swift intervention and mitigation of potential issues. By combining performance- and content-oriented monitoring, this framework fosters the stable, reliable, and ethical deployment of generative AI within production environments.

Introduction
The capabilities of GenAI, driven by the power of LLMs, are rapidly transforming the way we interact with technology. From generating remarkably human-like text to creating stunning visuals, GenAI applications are finding their way into diverse production environments. Industries are harnessing this potential for use cases such as content creation, customer service chatbots, personalized marketing, and even code generation. However, the path from promising technology to operationalizing these models remains a big challenge [1]. Ensuring the optimal performance of GenAI applications demands careful management of the infrastructure costs associated with model inference and proactive scaling measures to handle fluctuations in demand. Maintaining user experience requires close attention to response latency. Simultaneously, the quality of the output generated by LLMs is of utmost importance. Developers must grapple with the potential for factual errors, the presence of harmful biases, and the possibility of the models generating toxic or sensitive content. These challenges necessitate a tailored approach to monitoring that goes beyond traditional tools. Real-time insight into both infrastructure health and output quality is essential for the reliable and ethical use of GenAI applications in production. This article addresses this critical need by proposing solutions specifically for real-time monitoring of GenAI applications in production.

Current Limitations
The monitoring and governance of AI systems have garnered significant attention in recent years. Existing literature on AI model monitoring often focuses on supervised learning models [2]. These approaches address performance tracking, drift detection, and debugging in classification or regression tasks. Research in explainable AI (XAI) has also yielded insights into interpreting model decisions, particularly for black-box models [3]. This field seeks to unravel the inner workings of these complex systems or provide post-hoc justifications for outputs [4]. Moreover, studies on bias detection explore techniques for identifying and mitigating discriminatory patterns that may arise from training data or model design [5].
While these fields provide a solid foundation, they do not fully address the unique challenges of monitoring and evaluating generative AI applications based on LLMs. Here, the focus shifts away from traditional classification or regression metrics and towards open-ended generation. Evaluating LLMs often involves specialized techniques like human judgment or comparison against reference datasets [6]. Furthermore, standard monitoring and XAI solutions may not be optimized for tracking issues prevalent in GenAI, such as hallucinations, real-time bias detection, or sensitivity to token usage and cost. There has been some recent work toward solving this challenge [7], [8]. This article builds upon prior work in these related fields while proposing a framework designed specifically for the real-time monitoring needs of production GenAI applications. It emphasizes the integration of infrastructure and quality monitoring, enabling the timely detection of a broad range of potential issues unique to LLM-based applications.

This article concentrates on monitoring generative AI applications utilizing model-as-a-service (MLaaS) offerings such as Google Cloud's Gemini, OpenAI's GPTs, and Claude on Amazon Bedrock. While the core monitoring principles remain applicable, self-hosted LLMs necessitate additional considerations, including model optimization, accelerator (e.g., GPU) management, infrastructure management, and scaling. These factors are outside the scope of this discussion. Also, this article focuses on text-to-text models, but the principles can be extended to other modalities as well. The subsequent sections focus on various metrics, techniques, and an architecture for capturing those metrics to gain visibility into an LLM's behavior in production.

Application Monitoring
Monitoring the performance and resource utilization of generative AI applications is vital for ensuring their optimal functioning and cost-effectiveness in production environments. This section delves into the key components of application monitoring for GenAI, specifically focusing on cost, latency, and scalability considerations.

Cost Monitoring and Optimization
The cost associated with deploying GenAI applications can be significant, especially when leveraging MLaaS offerings. Therefore, granular cost monitoring and optimization are crucial. Below are some of the key metrics to focus on.

Granular Cost Tracking
MLaaS providers typically charge based on factors such as the number of API calls, tokens consumed, model complexity, and data storage. Tracking costs at this level of detail allows for a precise understanding of cost drivers. For MLaaS LLMs, input and output character/token counts can be the key driver of cost. Most models have tokenizer APIs to count the characters/tokens for any given text. These APIs can help understand usage for monitoring and optimizing inference costs. Below is an example of generating a billable character count for Google Cloud’s Gemini model.
Python
import vertexai
from vertexai.generative_models import GenerativeModel

def generate_count(project_id: str, location: str) -> int:
    # Initialize Vertex AI
    vertexai.init(project=project_id, location=location)
    # Load the model
    model = GenerativeModel("gemini-1.0-pro")
    # Count tokens for the prompt
    count = model.count_tokens("how many billable characters are here?")
    # Return the response's total billable characters
    return count.total_billable_characters

generate_count('your-project-id', 'us-central1')

Usage Pattern Analysis and Token Efficiency
Analyzing token usage patterns plays a pivotal role in optimizing the operating costs and user experience of GenAI applications. Cloud providers often impose token-per-second quotas, and consistently exceeding these limits can degrade performance. While quota increases may be possible, there are often hard limits, and creative resource management may be required for usage beyond these thresholds. A thorough analysis of token usage over time helps identify avenues for cost optimization. Consider the following strategies:

Prompt optimization: Rewriting prompts to reduce their size reduces token consumption and should be a primary focus of optimization efforts.
Model tuning: A model fine-tuned on a well-curated dataset can potentially deliver similar or even superior performance with smaller prompts. While some providers charge similar fees for base and tuned models, premium pricing for tuned models also exists, so one needs to be cognizant of this before making a decision. In certain cases, model tuning can significantly reduce token usage and associated costs.
Retrieval-augmented generation: Incorporating information retrieval techniques can help reduce input token size by strategically limiting the data fed into the model, potentially reducing costs.
Smaller model utilization: When a smaller model is used in tandem with high-quality data, not only can it achieve performance comparable to a larger model, but it also offers a compelling cost-saving strategy.

The token count analysis code example provided earlier in the article can be instrumental in understanding and optimizing token usage. It's worth noting that pricing models for tuned models vary across MLaaS providers, highlighting the importance of careful pricing analysis during the selection process.

Latency Monitoring
In the context of GenAI applications, latency refers to the total time elapsed between a user submitting a request and receiving a response from the model. Ensuring minimal latency is crucial for maintaining a positive user experience, as delays can significantly degrade perceived responsiveness and overall satisfaction. This section delves into the essential components of robust latency monitoring for GenAI applications.

Real-Time Latency Measurement
Real-time tracking of end-to-end latency is fundamental. This entails measuring the following components:

Network latency: Time taken for data to travel between the user's device and the cloud-based MLaaS service.
Model inference time: The actual time required for the LLM to process the input and generate a response.
Pre/post-processing overhead: Any additional time consumed for data preparation before model execution and for formatting responses for delivery.
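To make this concrete, below is a minimal sketch of capturing these components around a single request. The model_call argument is a hypothetical placeholder for your MLaaS client; real deployments would report these metrics to a monitoring system rather than return them, and separating pure network latency from inference time requires server-side timings from the provider.

Python
import time

def timed_generate(model_call, prompt: str):
    # model_call is a hypothetical placeholder for an MLaaS client function
    t0 = time.perf_counter()
    cleaned = prompt.strip()        # pre-processing
    t1 = time.perf_counter()
    response = model_call(cleaned)  # network latency + model inference time
    t2 = time.perf_counter()
    text = str(response).strip()    # post-processing/formatting
    t3 = time.perf_counter()
    metrics = {
        "preprocess_s": t1 - t0,
        "inference_s": t2 - t1,     # includes network time as seen from the client
        "postprocess_s": t3 - t2,
        "total_s": t3 - t0,
    }
    return text, metrics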
Impact on User Experience
Understanding the correlation between latency and user behavior is essential for optimizing the application. Key user satisfaction metrics to analyze include:

Bounce rate: The percentage of users who leave a website or application after viewing a single interaction.
Session duration: The length of time a user spends actively engaged with the application.
Conversion rates: (When applicable) The proportion of users who complete a desired action, such as a purchase or sign-up.

Identifying Bottlenecks
Pinpointing the primary sources of latency is crucial for targeted fixes. Potential bottleneck areas warranting investigation include:

Network performance: Insufficient bandwidth, slow DNS resolution, or network congestion can significantly increase network latency.
Model architecture: Large, complex models may have longer inference times. Often, using smaller models with higher-quality data and better prompts can yield the necessary results.
Inefficient input/output processing: Unoptimized data handling, encoding, or formatting can add overhead to the overall process.
MLaaS platform factors: Service-side performance fluctuations on the MLaaS platform can impact latency.

Proactive latency monitoring is vital for maintaining the responsiveness and user satisfaction of GenAI applications in production environments. By understanding the components of latency, analyzing its impact on user experience, and strategically identifying bottlenecks, developers can make informed decisions to optimize their applications.

Scalability Monitoring
Production-level deployment of GenAI applications necessitates the ability to handle fluctuations in demand gracefully. Regular load and stress testing are essential for evaluating a system's scalability and resilience under realistic and extreme traffic scenarios. These tests should simulate diverse usage patterns: gradual load increases, peak load simulations, and sustained load. Proactive scalability monitoring is critical, particularly when leveraging MLaaS platforms with hard quota limits for LLMs. This section outlines key metrics and strategies for effective scalability monitoring within these constraints.

Autoscaling Configuration
Leveraging the autoscaling capabilities provided by MLaaS platforms is crucial for dynamic resource management. Key considerations include:

Metrics: Identify the primary metrics that will trigger scaling events (e.g., response time, API requests per second, error rates). Set appropriate thresholds based on performance goals.
Scaling policies: Define how quickly resources should be added or removed in response to changes in demand. Consider factors like the time it takes to spin up additional model instances.
Cooldown periods: Implement cooldown periods after scaling events to prevent "thrashing" (rapid scaling up and down), which can lead to instability and increased costs.

Monitoring Scaling Metrics
During scaling events, meticulously monitor these essential metrics:

Response time: Ensure that response times remain within acceptable ranges, even when scaling, as latency directly impacts user experience.
Throughput: Track the system's overall throughput (e.g., requests per minute) to gauge its capacity to handle incoming requests.
Error rates: Monitor for any increases in error rates due to insufficient resources or bottlenecks that can arise during scaling processes.
Resource utilization: Observe CPU, memory, and GPU utilization to identify potential resource constraints.

MLaaS platforms' hard quota limits pose unique challenges for scaling GenAI applications. Strategies to address this include:

Caching: Employ strategic caching of model outputs for frequently requested prompts to reduce the number of model calls (see the sketch after this list).
Batching: Consolidate multiple requests and process them in batches to optimize resource usage.
Load balancing: Distribute traffic across multiple model instances behind a load balancer to maximize utilization within available quotas.
Hybrid deployment: Consider a hybrid approach where less demanding requests are served by MLaaS models, and those exceeding quotas are handled by a self-hosted deployment (assuming the necessary expertise).
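As a minimal sketch of the caching strategy above: call_model is a hypothetical placeholder for your MLaaS client, and a production system would likely use a shared store such as Redis with TTLs instead of an in-process dictionary. Note that caching only makes sense when identical prompts should yield identical outputs (e.g., at temperature 0).

Python
import hashlib

_cache: dict = {}

def cached_generate(call_model, prompt: str) -> str:
    # Key the cache on a hash of the prompt; only spend quota on a cache miss.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]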
Proactive application monitoring, encompassing cost, latency, and scalability aspects, underpins the successful deployment and cost-effective operation of GenAI applications in production. By implementing the strategies outlined above, developers and organizations can gain crucial insights, optimize resource usage, and ensure the responsiveness of their applications for enhanced user experiences.

Content Monitoring
Ensuring the quality and ethical integrity of GenAI applications in production requires a robust content monitoring strategy. This section addresses the detection of hallucinations, accuracy issues, harmful biases, lack of coherence, and the generation of sensitive content.

Hallucination Detection
Mitigating the tendency of LLMs to generate plausible but incorrect information is paramount for their ethical and reliable deployment in production settings. This section delves into grounding techniques and strategies for leveraging multiple LLMs to enhance the detection of hallucinations.

Human-In-The-Loop
To address the inherent issue of hallucinations in LLM-based applications, the human-in-the-loop (HITL) approach offers two key implementation strategies:

End-user feedback: Incorporating direct feedback mechanisms, such as thumbs-up/down ratings and options for detailed textual feedback, provides valuable insights into the LLM's output. This data allows for continuous model refinement and pinpoints areas where hallucinations may be prevalent. End-user feedback creates a collaborative loop that can significantly enhance the LLM's accuracy and trustworthiness over time.
Human review sampling: Randomly sampling a portion of LLM-generated outputs and subjecting them to rigorous human review establishes a quality control mechanism. Human experts can identify subtle hallucinations, biases, or factual inconsistencies that automated systems might miss. This process is essential for maintaining a high standard of output, particularly in applications where accuracy is paramount.

Implementing these HITL strategies fosters a symbiotic relationship between humans and LLMs. It leverages human expertise to guide and correct the LLM, leading to progressively more reliable and factually sound outputs. This approach is particularly crucial in domains where accuracy and the absence of misleading information are of utmost importance.
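A minimal sketch of what capturing this feedback and review sampling might look like follows; the record schema and the 5% sampling rate are illustrative assumptions, not a prescribed design.

Python
import random
from dataclasses import dataclass

REVIEW_SAMPLE_RATE = 0.05  # assumed rate: send ~5% of outputs to human review

@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    thumbs_up: bool | None = None   # set later by the end user
    flagged_for_review: bool = False

def record_interaction(prompt: str, response: str) -> FeedbackRecord:
    record = FeedbackRecord(prompt=prompt, response=response)
    if random.random() < REVIEW_SAMPLE_RATE:
        record.flagged_for_review = True  # queue for human review sampling
    return record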
Grounding in First-Party and Trusted Data
Anchoring the output of GenAI applications in reliable data sources offers a powerful method for hallucination detection. This approach is essential, especially when dealing with domain-specific content or scenarios where verifiable facts are required. Techniques include:

Prompt engineering with factual constraints: Carefully construct prompts that incorporate domain-specific knowledge, reference external data, or explicitly require the model to adhere to a known factual context. For example, a prompt for summarizing a factual document could include instructions like, "Restrict the summary to information explicitly mentioned in the document."
Retrieval-augmented generation: Augment LLMs using trusted datasets that prioritize factual accuracy and adherence to provided information. This can help reduce the model's overall tendency to fabricate information.
Incorporating external grounding sources: Utilize APIs or services designed to access and process first-party data, trusted knowledge bases, or real-world information. This allows the system to cross-verify the model's output and flag potential discrepancies. For instance, a financial news summarization task could be coupled with an API that provides up-to-date stock market data for accuracy validation.
LLM-based output evaluation: The unique capabilities of LLMs can be harnessed to evaluate the factual consistency of the generated text. Strategies include:
Self-consistency check: This can be achieved through multi-step generation, where a task is broken into smaller steps and later outputs are checked for contradictions against prior ones. For instance, asking the model to first outline key points of a document and then generate a full summary allows for verification that the summary aligns with those key points. Alternatively, rephrasing the original prompt in different formats and comparing the resulting outputs can reveal inconsistencies indicative of fabricated information.
Cross-model comparison: Feed the output of one LLM as a prompt into a different LLM with potentially complementary strengths. Analyze any inconsistencies or contradictions between the subsequent outputs, which may reveal hallucinations.
Metrics for tracking hallucinations: Accurately measuring and quantifying hallucinations generated by LLMs remains an active area of research. While established metrics from fields such as information retrieval and classification offer a foundation, the unique nature of hallucination detection necessitates the adaptation of existing metrics and the development of novel ones. A multi-faceted suite of metrics is proposed here, including standard metrics creatively adapted for this context as well as novel metrics specifically designed to capture the nuances of hallucinated text. Importantly, I encourage practitioners to tailor these metrics to the specific sensitivities of their business domains. Domain-specific knowledge is essential in crafting a metric set that aligns with the unique requirements of each GenAI deployment.

Considerations and Future Directions

Specificity vs. Open-Endedness
Grounding techniques can be highly effective in tasks requiring factual precision. However, in more creative domains where novelty is expected, strict grounding might hinder the model's ability to generate original ideas.

Data Quality
The reliability of any grounding strategy depends on the quality and trustworthiness of the external data sources used. Verification against curated first-party data or reputable knowledge bases is essential.

Computational Overhead
Fact-checking, data retrieval, and multi-model evaluation can introduce additional latency and costs that need careful consideration in production environments.

Evolving Evaluation Techniques
Research into the use of LLMs for semantic analysis and consistency checking is ongoing. More sophisticated techniques for hallucination detection leveraging LLMs are likely to emerge, further bolstering their utility in this task.
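As a simple illustration of the LLM-based evaluation strategies above, here is a minimal sketch of a cross-model consistency check. The generate_model and evaluator_model arguments are hypothetical placeholders for two different LLM clients, and the yes/no parsing is deliberately naive compared to the structured outputs a production system would use.

Python
def consistency_check(generate_model, evaluator_model, source_text: str) -> bool:
    # Generate with one model, then ask a second model to verify support.
    summary = generate_model(f"Summarize the following text:\n{source_text}")
    verdict = evaluator_model(
        "Does the SUMMARY contain any claim not supported by the SOURCE? "
        f"Answer YES or NO.\nSOURCE:\n{source_text}\nSUMMARY:\n{summary}"
    )
    # True means the evaluator detected no unsupported claims
    return verdict.strip().upper().startswith("NO")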
Grounding and cross-model evaluation provide powerful tools to combat hallucinations in GenAI outputs. Used strategically, these techniques bolster the factual accuracy and trustworthiness of these applications, promoting their robust deployment in real-world scenarios.

Bias Monitoring
The issue of bias in LLMs is a complex and pressing concern, as these models have the potential to perpetuate or amplify harmful stereotypes and discriminatory patterns present in their training data. Proactive bias monitoring is crucial for ensuring the ethical and inclusive deployment of GenAI in production. This section explores data-driven, actionable strategies for bias detection and mitigation.

Fairness Evaluation Toolkits
Specialized libraries and toolkits offer a valuable starting point for bias assessment in LLM outputs. While not all are explicitly designed for LLM evaluation, many can be adapted and repurposed for this context. Consider the following tools:

Aequitas: Provides a suite of metrics and visualizations for assessing group fairness and bias across different demographics. This tool can be used to analyze model outputs for disparities based on sensitive attributes like gender, race, etc.
FairTest: Enables the identification and investigation of potential biases in model outputs. It can analyze the presence of discriminatory language or differential treatment of protected groups.

Real-Time Analysis
In production environments, real-time bias monitoring is essential. Strategies include:

Keyword and phrase tracking: Monitor outputs for specific words, phrases, or language patterns historically associated with harmful biases or stereotypes. Tailor these lists to sensitive domains and potential risks related to your application.
Dynamic prompting for bias discovery: Systematically test the model with carefully constructed inputs designed to surface potential biases. For example, modify prompts to vary gender, ethnicity, or other attributes while keeping the task consistent, and observe whether the model's output exhibits prejudice.

Mitigation Strategies
When bias is detected, timely intervention is critical. Consider the following actions:

Alerting: Implement an alerting system to flag potentially biased outputs for human review and intervention. Calibrate the sensitivity of these alerts based on the severity of the bias and its potential impact.
Filtering or modification: In sensitive applications, consider automated filtering of highly biased outputs or modification to neutralize harmful language. These measures must be balanced against the potential for restricting valid and unbiased expressions.
Human-in-the-loop: Integrate human moderators for nuanced bias assessment and for determining appropriate mitigation steps. This can include re-prompting the model, providing feedback for fine-tuning, or escalating critical issues.

Important Considerations

Evolving standards: Bias detection is context-dependent, and definitions of harmful speech evolve over time. Monitoring systems must remain adaptable.
Intersectionality: Biases can intersect across multiple axes (e.g., race, gender, sexual orientation). Monitoring strategies need to account for this complexity.

Bias monitoring in GenAI applications is a multifaceted and ongoing endeavor. By combining specialized toolkits, real-time analysis, and thoughtful mitigation strategies, developers can work towards more inclusive and equitable GenAI systems.
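To illustrate dynamic prompting for bias discovery, here is a minimal sketch. The call_model argument is a hypothetical placeholder for your LLM client, and the template and name list are illustrative choices, not a vetted bias benchmark.

Python
TEMPLATE = "Write a one-sentence performance review for {name}, a software engineer."
NAMES = ["Aisha", "John", "Mei", "Carlos"]  # varies implied demographics; task held constant

def probe_bias(call_model) -> dict:
    # Collect outputs for counterfactual prompts that differ only in the name;
    # systematic differences in tone or content warrant human review.
    return {name: call_model(TEMPLATE.format(name=name)) for name in NAMES}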
Coherence and Logic Assessment
Ensuring the internal consistency and logical flow of GenAI output is crucial for maintaining user trust and avoiding nonsensical results. This section offers techniques for unsupervised coherence and logic assessment, applicable to a variety of LLM-based tasks at scale.

Semantic Consistency Checks

Semantic Similarity Analysis
Calculate the semantic similarity between different segments of the generated text (e.g., sentences, paragraphs). Low similarity scores can indicate a lack of thematic cohesion or abrupt changes in topic.

Implementation
Leverage pre-trained sentence embedding models (e.g., Sentence Transformers) to compute similarity scores between text chunks.

Python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-distilroberta-base-v2')

generated_text = "The company's stock price surged after the earnings report. Cats are excellent pets."
sentences = [s.strip() for s in generated_text.split(".") if s.strip()]
embeddings = model.encode(sentences)

# Cosine similarity between consecutive sentence embeddings
similarity_score = util.cos_sim(embeddings[0], embeddings[1])
print(similarity_score)  # A low score indicates potential incoherence

Topic Modeling
Apply topic modeling techniques (e.g., LDA, NMF) to extract latent topics from the generated text. An inconsistent topic distribution across the output may suggest the lack of a central theme or focus.

Implementation
Utilize libraries like Gensim or scikit-learn for topic modeling.

Logical Reasoning Evaluation

Entailment and Contradiction Detection
Assess whether consecutive sentences within the generated text exhibit logical entailment (one sentence implies the other) or contradiction. This can reveal inconsistencies in reasoning.

Implementation
Employ entailment models (e.g., BERT-based models fine-tuned on natural language inference datasets like SNLI or MultiNLI). These techniques can be packaged into user-friendly functions or modules, shielding users without deep ML expertise from the underlying complexities.

Sensitive Content Detection
With GenAI's ability to produce remarkably human-like text, it's essential to be proactive about detecting potentially sensitive content within its outputs. This is necessary to avoid unintended harm, promote responsible use, and maintain trust in the technology. The following section explores modern techniques specifically designed for sensitive content detection within the context of large language models. These scalable approaches will empower users to safeguard the ethical implementation of GenAI across diverse applications.

Perspective API integration: Google's Perspective API offers a pre-trained model for identifying toxic comments. It can be integrated into LLM applications to analyze generated text and provide a score for the likelihood of containing toxic content. The Perspective API can be accessed through a REST API. Here's an example using Python:

Python
from googleapiclient import discovery

API_KEY = "YOUR_API_KEY"  # Perspective API key

def analyze_text(text):
    client = discovery.build(
        "commentanalyzer",
        "v1alpha1",
        developerKey=API_KEY,
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False,
    )
    analyze_request = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=analyze_request).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

text = "This is a hateful comment."
toxicity_score = analyze_text(text)
print(f"Toxicity score: {toxicity_score}")

The API returns a score between 0 and 1, indicating the likelihood of toxicity. Thresholds can be set to flag or filter content exceeding a certain score.
LLM-based safety filter: Major MLaaS providers like Google offer first-party safety filters integrated into their LLM offerings. These filters use internal LLM models trained specifically to detect and mitigate sensitive content. When using Google's Gemini API, the safety filters are automatically applied, so you can access different creative text formats with safety guardrails in place. They also provide a second level of safety filters that users can leverage to apply additional filtering based on a set of metrics. For example, Google Cloud’s safety filters are mentioned here.

Human-in-the-loop evaluation: Integrating human reviewers into the evaluation process can significantly improve the accuracy of sensitive content detection. Human judgment can help identify nuances and contextual factors that may be missed by automated systems. A platform like Amazon Mechanical Turk can be used to gather human judgments on the flagged content.

Evaluator LLM: This involves using a separate LLM (an “Evaluator LLM”) specifically to assess the output of the generative LLM for sensitive content. This Evaluator LLM can be trained on a curated dataset labeled for sensitive content. Training an Evaluator LLM requires expertise in deep learning; open-source libraries like Hugging Face Transformers provide tools and pre-trained models to facilitate this process. An alternative is to use general-purpose LLMs such as Gemini or GPT with appropriate prompts to discover sensitive content.

The language used to express sensitive content constantly evolves, requiring continuous updates to the detection models. By combining these scalable techniques and carefully addressing the associated challenges, we can build robust systems for detecting and mitigating sensitive content in LLM outputs, ensuring responsible and ethical deployment of this powerful technology.

Conclusion
Ensuring the reliable, ethical, and cost-effective deployment of generative AI applications in production environments requires a multifaceted approach to monitoring. This article presented a framework specifically designed for real-time monitoring of GenAI, addressing both infrastructure and quality considerations. On the infrastructure side, proactive tracking of cost, latency, and scalability is essential. Tools for analyzing token usage, optimizing prompts, and leveraging autoscaling capabilities play a crucial role in managing operational expenses and maintaining a positive user experience. Content monitoring is equally important for guaranteeing the quality and ethical integrity of GenAI applications. This includes techniques for detecting hallucinations, such as grounding in reliable data sources and incorporating human-in-the-loop verification mechanisms. Strategies for bias mitigation, coherence assessment, and sensitive content detection are vital for promoting inclusivity and preventing harmful outputs.

By integrating the monitoring techniques outlined in this article, developers can gain deeper insights into the performance, behavior, and potential risks associated with their GenAI applications. This proactive approach empowers them to take informed corrective actions, optimize resource utilization, and ultimately deliver reliable, trustworthy, and ethical AI-powered experiences to users. While we have focused on MLaaS offerings, the principles discussed can be adapted to self-hosted LLM deployments. The field of GenAI monitoring is rapidly evolving.
Researchers and practitioners should remain vigilant regarding new developments in hallucination detection, bias mitigation, and evaluation techniques. Additionally, it's crucial to recognize the ongoing debate around the balance between accuracy restrictions and creativity in generative models.

References
[1] M. Korolov, "For IT leaders, operationalized gen AI is still a moving target," CIO, Feb. 28, 2024.
[2] O. Simeone, "A Very Brief Introduction to Machine Learning With Applications to Communication Systems," IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 4, pp. 648-664, Dec. 2018, doi: 10.1109/TCCN.2018.2881441.
[3] F. Doshi-Velez and B. Kim, "Towards A Rigorous Science of Interpretable Machine Learning," arXiv, 2017. [Online].
[4] A. B. Arrieta et al., "Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI," Information Fusion, vol. 58, pp. 82-115, 2020.
[5] P. Saleiro et al., "Aequitas: A Bias and Fairness Audit Toolkit," arXiv, 2018. [Online].
[6] E. Bender and A. Koller, "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[7] S. Mousavi et al., "Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination," OpenReview.
[8] X. Amatriain, "Measuring And Mitigating Hallucinations In Large Language Models: A Multifaceted Approach," Mar. 2024. [Online].
“The Mixtral-8x7B Large Language Model (LLM) is a pre-trained generative Sparse Mixture of Experts.”

When I saw this come out, it seemed pretty interesting and accessible, so I gave it a try. With the proper prompting, it seems good. I am not sure if it’s better than Google Gemma, Meta LLAMA2, or OLLAMA Mistral for my use cases. Today I will show you how to utilize the new Mixtral LLM with Apache NiFi. This will require only a few steps to run Mixtral against your text inputs. This model can be run via the lightweight serverless REST API or the transformers library. You can also use this GitHub repository. The context can have up to 32k tokens, and you can enter prompts in English, Italian, German, Spanish, and French. You have a lot of options on how to utilize this model, but I will show you how to build a real-time LLM pipeline utilizing Apache NiFi.

One key thing to decide is what kind of input you are going to have (chat, code generation, Q&A, document analysis, summary, etc.). Once you have decided, you will need to do some prompt engineering and tweak your prompt. In the following section, I include a few guides to help you improve your prompt-building skills, and I will give you some basic prompt engineering in my walk-through tutorial.

Guides To Build Your Prompts Optimally
Mixtral: Prompt Engineering Guide
Getting Started with Mixtral 8X7B

The construction of the prompt is very critical to make this work well, so we are building this with NiFi.

Overview of the Flow

Step 1: Build and Format Your Prompt
In building our application, the following is the basic prompt template that we are going to use.

Prompt Template
{
  "inputs": "<s>[INST]Write a detailed complete response that appropriately answers the request.[/INST]
  [INST]Use this information to enhance your answer: ${context:trim():replaceAll('"',''):replaceAll('\n', '')}[/INST]
  User: ${inputs:trim():replaceAll('"',''):replaceAll('\n', '')}</s>"
}

You will enter this prompt in a ReplaceText processor in the Replacement Value field.

Step 2: Build Our Call to the HuggingFace REST API To Classify Against the Model
Add an InvokeHTTP processor to your flow, setting the HTTP URL to the Mixtral API URL.

Step 3: Query To Convert and Clean Your Results
We use the QueryRecord processor to clean and convert the HuggingFace results, grabbing the generated_text field.

Step 4: Add Metadata Fields
We use the UpdateRecord processor to add metadata fields, the JSON readers and writers, and the Literal Value Replacement Value Strategy. The fields we are adding come from attributes.

Overview of Send to Kafka and Slack

Step 5: Add Metadata to Stream
We use the UpdateAttribute processor to set the correct Content Type ("application/json") and set the model type to Mixtral.

Step 6: Publish This Cleaned Record to a Kafka Topic
We send it to our local Kafka broker (could be Docker or another) and to our flank-mixtral8x7B topic. If this doesn't exist, NiFi and Kafka will automagically create one for you.

Step 7: Retry the Send
If something goes wrong, we will try to resend three times, then fail.

Overview of Pushing Data to Slack

Step 8: Send the Same Data to Slack for User Reply
The first step is to split into a single record to send one at a time. We use the SplitRecord processor for this. As before, reuse the JSON Tree Reader and JSON Record Set Writer. As usual, choose "1" as the Records Per Split.

Step 9: Make the Generated Text Available for Messaging
We utilize EvaluateJsonPath to extract the generated text from Mixtral (on HuggingFace).
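As a hypothetical illustration of this step (the exact path depends on how your records were cleaned in Step 3), the EvaluateJsonPath processor would carry a dynamic property such as generated_text set to a JSONPath expression like $.generated_text, which places the extracted text into a flow file attribute for use in the Slack message template below.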
Step 10: Send the Reply to Slack
We use the PublishSlack processor, which is new in Apache NiFi 2.0. This one requires your channel name or channel ID. We choose the Publish Strategy of Use 'Message Text' Property. For Message Text, use the Slack Response Template below.

For the final reply to the user, we will need a Slack response template formatted for how we wish to communicate. Below is an example that has the basics.

Slack Response Template
===============================================================================================================
HuggingFace ${modelinformation} Results on ${date}:

Question: ${inputs}

Answer: ${generated_text}

=========================================== Data for nerds ====

HF URL: ${invokehttp.request.url}
TXID: ${invokehttp.tx.id}

== Slack Message Meta Data ==
ID: ${messageid}
Name: ${messagerealname} [${messageusername}]
Time Zone: ${messageusertz}

== HF ${modelinformation} Meta Data ==
Compute Characters/Time/Type: ${x-compute-characters} / ${x-compute-time} / ${x-compute-type}
Generated/Prompt Tokens/Time per Token: ${x-generated-tokens} / ${x-prompt-tokens} : ${x-time-per-token}
Inference Time: ${x-inference-time} // Queue Time: ${x-queue-time}
Request ID/SHA: ${x-request-id} / ${x-sha}
Validation/Total Time: ${x-validation-time} / ${x-total-time}
===============================================================================================================

When this is run, it will look like the image below in Slack. You have now sent a prompt to Hugging Face, had it run against Mixtral, sent the results to Kafka, and responded to the user via Slack. We have now completed a full Mixtral application with zero code.

Conclusion
You have now built a full round trip utilizing Apache NiFi, HuggingFace, and Slack to build a chatbot utilizing the new Mixtral model.

Summary of Learnings
Learned how to build a decent prompt for HuggingFace Mixtral
Learned how to clean up streaming data
Built a HuggingFace REST call that can be reused
Processed HuggingFace model call results
Sent your first Kafka message
Formatted and built Slack calls
Built a full DataFlow for GenAI

If you need additional tutorials on utilizing the new Apache NiFi 2.0, check out: Apache NiFi 2.0.0-M2 Out!

For additional information on building Slack bots:
Building a Real-Time Slackbot With Generative AI
Building an LLM Bot for Meetups and Conference Interactivity

Also, thanks for following my tutorial. I am working on additional Apache NiFi 2 and generative AI tutorials that will be coming to DZone. Finally, if you are in Princeton, Philadelphia, or New York City, please come out to my meetups for in-person, hands-on work with these technologies.

Resources
Mixtral of Experts
Mixture of Experts Explained
mistralai/Mixtral-8x7B-v0.1
Mixtral Overview
Invoke the Mixtral 8x7B model on Amazon Bedrock for text generation
Running Mixtral 8x7b on M1 16GB
Mixtral-8x7B: Understanding and Running the Sparse Mixture of Experts by Mistral AI
Retro-Engineering a Database Schema: Mistral Models vs. GPT4, LLama2, and Bard (Episode 3)
Comparison of Models: Quality, Performance & Price Analysis
A Beginner’s Guide to Fine-Tuning Mixtral Instruct Model
Much has changed since I wrote the article An Introduction to BentoML: A Unified AI Application Framework, both in the general AI landscape and in BentoML. Generative AI, large language models, diffusion models, ChatGPT, Sora, and Gemma: these are probably the most mentioned terms over the past several months in AI, and the pace of change is overwhelming. Amid these brilliant AI breakthroughs, the quest for AI deployment tools that are not only powerful but also user-friendly and cost-effective remains unchanged. For BentoML, this comes with a major update, 1.2, which moves towards that very same goal. In this blog post, let’s revisit BentoML and use a simple example to see how we can leverage some of the new tools and functionalities provided by BentoML to build an AI application in production.

The example application I will build is capable of image captioning: generating a textual description for an image using AI. BLIP (Bootstrapping Language-Image Pre-training) is a method that improves these AI models by initially training on large image-text datasets to understand their relationship, and then further refining this understanding with specific tasks like captioning. The BLIP model I will use in the sections below is Salesforce/blip-image-captioning-large. You can use any other BLIP model for this example, as the code implementation logic is the same.

A Quick Intro
Before we delve deeper, let's highlight what BentoML brings to the table, especially with its 1.2 update. At its core, BentoML is an open-source platform designed to streamline the serving and deployment of AI applications. Here's a simplified workflow with BentoML 1.2:

Model wrapping: Use BentoML Service SDKs to wrap your machine learning model so that you can expose it as an inference endpoint.
Model serving: Run the model on your own machine, leveraging your own resources (like GPUs) for model inference through the endpoint.
Easy deployment: Deploy your model to the serverless platform BentoCloud.

For the last step, previously we needed to manually build a Bento (the unified distribution unit in BentoML, which contains source code, Python packages, and model references and configuration), then push and deploy it to BentoCloud. With BentoML 1.2, “Build, Push, and Deploy” are now consolidated into a single command: bentoml deploy. I will talk more about the details and BentoCloud in the example below.

Note: If you want to deploy the model in your own infrastructure, you can still do that by manually building a Bento and then containerizing it as an OCI-compliant image.

Now, let’s get started to see how this works in practice!

Setting up the Environment
Create a virtual environment using venv. This is recommended as it helps avoid potential package conflicts.

python -m venv bentoml-new
source bentoml-new/bin/activate

Install all the dependencies.

pip install "bentoml>=1.2.2" pillow torch transformers

Building a BentoML Service
First, import the necessary packages and use a constant to store the model ID.

from __future__ import annotations

import typing as t

import bentoml
from PIL.Image import Image

MODEL_ID = "Salesforce/blip-image-captioning-large"

Next, let's create a BentoML Service. For versions prior to BentoML 1.2, we used abstractions called “Runners” for model inference. In 1.2, BentoML moves away from this Runner concept, integrating the functionalities of API servers and Runners into a single entity called “Services.” Services are the key building blocks for defining model-serving logic in BentoML.
Starting from 1.2, we use the @bentoml.service decorator to mark a Python class as a BentoML Service in a file called service.py. For this BLIP example, we can create a Service called BlipImageCaptioning like this:

@bentoml.service
class BlipImageCaptioning:

During initialization, what we usually do is load the model (and other components, if necessary) and move it to a GPU for better computation efficiency. If you are not sure what function or package to use, just copy and paste the initialization code from the BLIP model’s Hugging Face repo. Here is an example:

@bentoml.service
class BlipImageCaptioning:

    def __init__(self) -> None:
        import torch
        from transformers import BlipProcessor, BlipForConditionalGeneration

        # Load the model with torch and set it to use either GPU or CPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(self.device)
        self.processor = BlipProcessor.from_pretrained(MODEL_ID)
        print("Model blip loaded", "device:", self.device)

The next step is to create an endpoint function for user interaction through @bentoml.api. When applied to a Python function, it transforms that function into an API endpoint that can handle web requests. This BLIP model can take an image and, optionally, some starting text for captioning, so I defined it this way:

@bentoml.service
class BlipImageCaptioning:
    ...

    @bentoml.api
    async def generate(self, img: Image, txt: t.Optional[str] = None) -> str:
        if txt:
            inputs = self.processor(img, txt, return_tensors="pt").to(self.device)
        else:
            inputs = self.processor(img, return_tensors="pt").to(self.device)

        # Generate a caption for the given image by processing the inputs through the model,
        # setting a limit on the maximum and minimum number of new tokens (words) that can be added to the caption
        out = self.model.generate(**inputs, max_new_tokens=100, min_new_tokens=20)

        # Decode the generated output into a readable caption, skipping any special tokens that are not meant for display
        return self.processor.decode(out[0], skip_special_tokens=True)

The generate method within the class is an asynchronous function exposed as an API endpoint. It receives an image and an optional txt parameter, processes them with the BLIP model, and returns a generated caption. Note that the main inference code also comes from the BLIP model’s Hugging Face repo; BentoML here only helps you manage the input and output logic.

That’s all the code! The complete version:

from __future__ import annotations

import typing as t

import bentoml
from PIL.Image import Image

MODEL_ID = "Salesforce/blip-image-captioning-large"

@bentoml.service
class BlipImageCaptioning:

    def __init__(self) -> None:
        import torch
        from transformers import BlipProcessor, BlipForConditionalGeneration

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(self.device)
        self.processor = BlipProcessor.from_pretrained(MODEL_ID)
        print("Model blip loaded", "device:", self.device)

    @bentoml.api
    async def generate(self, img: Image, txt: t.Optional[str] = None) -> str:
        if txt:
            inputs = self.processor(img, txt, return_tensors="pt").to(self.device)
        else:
            inputs = self.processor(img, return_tensors="pt").to(self.device)

        out = self.model.generate(**inputs, max_new_tokens=100, min_new_tokens=20)
        return self.processor.decode(out[0], skip_special_tokens=True)

To serve this model locally, run:

bentoml serve service:BlipImageCaptioning

The HTTP server is accessible at http://localhost:3000.
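Besides the Swagger UI described next, you can also call the endpoint programmatically. Here is a minimal client sketch, assuming the Service above is running locally and that an example.jpg file exists (BentoML 1.2 ships Python HTTP clients that map Service endpoints to method calls):

Python
import bentoml
from PIL import Image

# Call the generate endpoint of the locally served Service;
# the image path is a hypothetical example.
client = bentoml.SyncHTTPClient("http://localhost:3000")
caption = client.generate(img=Image.open("example.jpg"), txt="a unicorn in")
print(caption)
client.close()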
You can interact with it using the Swagger UI. I uploaded the image below (I created this image with Stable Diffusion, and it was also deployed using BentoML) and used the prompt text “a unicorn in a forest” for inference. The image caption output by the model was: a unicorn in a forest with a rainbow in the background and flowers in the foreground and a pond in the foreground with a rainbow.

Local serving works properly, but there are different things we always need to consider for deploying AI applications in production, such as infrastructure (especially GPUs), scaling, observability, and cost-efficiency. This is where BentoCloud comes in.

Deploying to BentoCloud
Explaining BentoCloud may require an independent blog post. Here's an overview of what it offers and how you can leverage it for your machine learning deployment:

Autoscaling for ML workloads: BentoCloud dynamically scales deployment replicas based on incoming traffic, scaling down to zero during periods of inactivity to optimize costs.
Built-in observability: Access real-time insights into your traffic, monitor resource utilization, track operational events, and review audit logs directly through the BentoCloud console.
Optimized infrastructure: With BentoCloud, the focus shifts entirely to code development, as the platform manages all underlying infrastructure, ensuring an optimized environment for your AI applications.

To prepare your BentoML Service for BentoCloud deployment, begin by specifying the resources field in your Service code. This tells BentoCloud how to allocate the proper instance type for your Service. For details, see Configurations.

@bentoml.service(
    resources={
        "memory": "4Gi"
    }
)
class BlipImageCaptioning:

Next, create a bentofile.yaml file to define the build options, which is used for building a Bento. Again, when using BentoCloud, you don’t need to build a Bento manually, since BentoML does this automatically for you.

service: "service:BlipImageCaptioning"
labels:
  owner: bentoml-team
  project: gallery
include:
  - "*.py"
python:
  packages:
    - torch
    - transformers
    - pillow

Deploy your Service to BentoCloud using the bentoml deploy command, and use the -n flag to assign a custom name to your Deployment. Don’t forget to log in beforehand.

bentoml deploy . -n blip-service

Deployment involves a series of automated processes where BentoML builds a Bento, and then pushes and deploys it to BentoCloud. You can see the status displayed in your terminal.

All set! Once deployed, you can find the Deployment on the BentoCloud console, which provides a comprehensive interface offering an enhanced user experience for interacting with your Service.

Conclusion
BentoML 1.2 significantly simplifies AI deployment, enabling developers to easily bring AI models into production. Its integration with BentoCloud offers scalable, efficient solutions. In future blog posts, I will demonstrate how to build more production-ready AI applications for different scenarios. Happy coding!
Documentation abounds for any topic or question you might have, but when you try to apply something to your own uses, it suddenly becomes hard to find what you need. This problem doesn't only exist for you. In this blog post, we will look at how LangChain implements RAG so that you can apply the same principles to any application with LangChain and an LLM.

What Is RAG?
This term is used a lot in today's technical landscape, but what does it actually mean? Here are a few definitions from various sources:

"Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response." — Amazon Web Services

"Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources." — NVIDIA

"Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM's internal representation of information." — IBM Research

In this blog post, we'll be focusing on how to write the retrieval query that supplements or grounds the LLM's answer. We will use Python with LangChain, a framework used to write generative AI applications that interact with LLMs.

The Data Set
First, let's take a quick look at our data set. We'll be working with the SEC (Securities and Exchange Commission) filings from the EDGAR (Electronic Data Gathering, Analysis, and Retrieval system) database. The SEC filings are a treasure trove of information, containing financial statements, disclosures, and other important information about publicly traded companies.

The data contains companies that have filed financial forms (10-K, 13, etc.) with the SEC. Different managers own stock in these companies, and the companies are part of different industries. In the financial forms themselves, various people are mentioned in the text, and we have broken the text down into smaller chunks for the vector search queries to handle. We have taken each chunk of text in a form and created a vector embedding that is also stored on the CHUNK node. When we run a vector search query, we will compare the vector of the query to the vector of the CHUNK nodes to find the most similar text. Let's see how to construct our query!

Retrieval Query Examples
I used a few sources to help me understand how to write a retrieval query in LangChain. The first was a blog post by Tomaz Bratanic, who wrote about how to work with the Neo4j vector index in LangChain using Wikipedia article data. The second was a query from the GenAI Stack, a collection of demo applications built with Docker that utilizes the StackOverflow data set containing technical questions and answers. Both queries are included below.
Python
# Tomaz's blog post retrieval query
retrieval_query = """
    OPTIONAL MATCH (node)<-[:EDITED_BY]-(p)
    WITH node, score, collect(p) AS editors
    RETURN node.info AS text,
           score,
           node {.*, vector: Null, info: Null, editors: editors} AS metadata
"""

# GenAI Stack retrieval query
retrieval_query = """
    WITH node AS question, score AS similarity
    CALL {
        WITH question
        MATCH (question)<-[:ANSWERS]-(answer)
        WITH answer
        ORDER BY answer.is_accepted DESC, answer.score DESC
        WITH collect(answer)[..2] AS answers
        RETURN reduce(str='', answer IN answers | str +
            '\n### Answer (Accepted: ' + answer.is_accepted +
            ' Score: ' + answer.score + '): ' + answer.body + '\n') AS answerTexts
    }
    RETURN '##Question: ' + question.title + '\n' + question.body + '\n'
        + answerTexts AS text,
        similarity AS score,
        {source: question.link} AS metadata
    ORDER BY similarity ASC // so that best answers are the last
"""

Now, notice that these queries do not look complete. We wouldn't start a Cypher query with an OPTIONAL MATCH or WITH clause. This is because the retrieval query is added to the end of the vector search query. Tomaz's post shows us the implementation of the vector search query.

Python
read_query = (
    "CALL db.index.vector.queryNodes($index, $k, $embedding) "
    "YIELD node, score "
) + retrieval_query

So LangChain first calls the db.index.vector.queryNodes() procedure (more info in the documentation) to find the most similar nodes and passes (YIELD) the similar node and the similarity score, and then it adds the retrieval query to the end of the vector search query to pull additional context. This is very helpful to know, especially as we construct the retrieval query and when we start testing results!

The second thing to note is that both queries return the same three variables: text, score, and metadata. This is what LangChain expects, so you will get errors if those are not returned. The text variable contains the related text, the score is the similarity score for the chunk against the search text, and the metadata can contain any additional information that we want for context.

Constructing the Retrieval Query
Let's build our retrieval query! We know the similarity search query will return the node and score variables, so we can pass those into our retrieval query to pull connected data of those similar nodes. We also have to return the text, score, and metadata variables.

Python
retrieval_query = """
    WITH node AS doc, score AS similarity
    // some more query here
    RETURN <something> AS text,
           similarity AS score,
           {<something>: <something>} AS metadata
"""

Ok, there's our skeleton. Now what do we want in the middle? We know our data model will pull CHUNK nodes in the similarity search (those will be the node AS doc values in our WITH clause above). Chunks of text don't give a lot of context, so we want to pull in the Form, Person, Company, Manager, and Industry nodes that are connected to the CHUNK nodes. We also include a sequence of text chunks on the NEXT relationship, so we can pull the next and previous chunks of text around a similar one. We also will pull all the chunks with their similarity scores, and we want to narrow that down a bit...maybe just the top 5 most similar chunks.
Python
retrieval_query = """
    WITH node AS doc, score AS similarity
    ORDER BY similarity DESC LIMIT 5
    CALL { WITH doc
        OPTIONAL MATCH (prevDoc:Chunk)-[:NEXT]->(doc)
        OPTIONAL MATCH (doc)-[:NEXT]->(nextDoc:Chunk)
        RETURN prevDoc, doc AS result, nextDoc
    }
    // some more query here
    RETURN coalesce(prevDoc.text,'') + coalesce(result.text,'') + coalesce(nextDoc.text,'') AS text,
           similarity AS score,
           {<something>: <something>} AS metadata
"""

Now we keep the 5 most similar chunks, then pull the previous and next chunks of text in the CALL {} subquery. We also change the RETURN to concatenate the text of the previous, current, and next chunks all into the text variable. The coalesce() function is used to handle null values, so if there is no previous or next chunk, it will just return an empty string. Let's add a bit more context to pull in the other related entities in the graph.

Python
retrieval_query = """
    WITH node AS doc, score AS similarity
    ORDER BY similarity DESC LIMIT 5
    CALL { WITH doc
        OPTIONAL MATCH (prevDoc:Chunk)-[:NEXT]->(doc)
        OPTIONAL MATCH (doc)-[:NEXT]->(nextDoc:Chunk)
        RETURN prevDoc, doc AS result, nextDoc
    }
    WITH result, prevDoc, nextDoc, similarity
    CALL { WITH result
        OPTIONAL MATCH (result)-[:PART_OF]->(:Form)<-[:FILED]-(company:Company),
                       (company)<-[:OWNS_STOCK_IN]-(manager:Manager)
        WITH result, company.name AS companyName,
             apoc.text.join(collect(manager.managerName), ';') AS managers
        WHERE companyName IS NOT NULL OR managers > ""
        WITH result, companyName, managers
        ORDER BY result.score DESC
        RETURN result AS document, result.score AS popularity, companyName, managers
    }
    RETURN coalesce(prevDoc.text,'') + coalesce(document.text,'') + coalesce(nextDoc.text,'') AS text,
           similarity AS score,
           {documentId: coalesce(document.chunkId,''),
            company: coalesce(companyName,''),
            managers: coalesce(managers,''),
            source: document.source} AS metadata
"""

The second CALL {} subquery pulls in any related Form, Company, and Manager nodes (if they exist, OPTIONAL MATCH). We collect the managers into a list and ensure the company name and manager list are not null or empty. We then order the results by a score (this doesn't currently provide value, but it could track how many times the doc has been retrieved). Since only the text, score, and metadata properties get returned, we will need to map the extra values (documentId, company, and managers) in the metadata dictionary field. This means updating the final RETURN statement to include those.

Wrapping Up!
In this post, we looked at what RAG is and how retrieval queries work in LangChain. We also looked at a few examples of Cypher retrieval queries for Neo4j and constructed our own. We used the SEC filings data set for our query and saw how to pull extra context and return it mapped to the three properties LangChain expects. If you are building or interested in more generative AI content, check out the resources linked below. Happy coding!

Resources
Demo application: Demo project that uses this retrieval query
GitHub repository: Code for demo app that includes retrieval query
Documentation: LangChain for Neo4j vector store
Free online courses: GraphAcademy: LLMs + Neo4j
Tuhin Chattopadhyay, CEO at Tuhin AI Advisory and Professor of Practice, JAGSoM
Yifei Wang, Senior Machine Learning Engineer, Meta
Austin Gil, Developer Advocate, Akamai
Tim Spann, Principal Developer Advocate, Cloudera