Wednesday, March 27, 2024

Embeddings and Vector Databases - creating a long-term memory




What Are Vector Databases? - The intelligent memory of GenAI


While traditional databases store data in rows and columns, a vector database stores data as mathematical vectors. Each piece of data is represented as a point in a high-dimensional space with hundreds or thousands of dimensions, which allows sophisticated relationships between data points to be captured.


Searching and analyzing vector databases relies on vector mathematics and similarity calculations. By comparing vector positions, highly relevant results can be returned, even if there are no exact keyword matches.
Vector databases index and store vector embeddings for fast retrieval and similarity search at interactive speeds, while also offering capabilities such as CRUD (create, read, update, and delete) operations, horizontal scaling, and serverless deployment.

Why Are Vector Databases Important for AI?


Vector databases are ideal for managing and extracting insights from the enormous datasets required to train and serve modern AI models.

In the midst of the GenAI revolution, efficient data processing is crucial not only for generation but also for semantic search. Both rely on vector embeddings: a data representation that carries the semantic information an AI needs to gain understanding and maintain a long-term memory it can draw upon when executing complex tasks.

Embeddings/Tokens

LLMs generate embeddings with many attributes or features, each representing a dimension essential to understanding patterns and relationships in the data. These high-dimensional representations are challenging to manage.

That is why we need a specialized database to handle this data type. Vector databases like Pinecone meet this need by offering optimized storage and querying capabilities for embeddings. They combine the capabilities of a traditional database (absent in standalone vector indexes) with the specialization of dealing with vector embeddings (which traditional scalar-based databases lack).

Embeddings are arrays of numbers that represent data: words and images transformed into numerical vectors that capture their essence. For example, the words "puppy" and "dog" will have similar embeddings, with vectors close to each other. These embeddings are stored in the vector DB.
Puppy = [0.3, 0.5, 0.9, 0.8, 0.4, ...]
Dog = [0.1, 0.51, 0.6, 0.2, 0.8, ...]
The exact numbers depend on the ML algorithm and model.

If you can convert a piece of text, a sentence, or an image into a vector, you can compare vectors to detect and rank semantic similarity, typically using a measure such as cosine similarity, as in the sketch below.
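
Here is a minimal sketch of that comparison in Python using NumPy; the vectors are made-up toy values, not real model output.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

puppy = np.array([0.3, 0.5, 0.9, 0.8, 0.4])
dog = np.array([0.1, 0.51, 0.6, 0.2, 0.8])
cat = np.array([-0.7, 0.2, 0.1, -0.4, 0.3])

print(cosine_similarity(puppy, dog))  # relatively high: related meanings
print(cosine_similarity(puppy, cat))  # lower: less related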

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

Embedding models: GloVe, Word2Vec, and OpenAI's text-embedding models.
Vector DBs: Pinecone, Milvus, pgvector, Weaviate.

Here is how to create an embedding for the text "food" via an OpenAI model (substitute your own API key; never hard-code a real key):

curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "food",
    "model": "text-embedding-ada-002",
    "encoding_format": "float"
  }'
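
The same call can be made from Python. This is a small sketch assuming the official openai Python package (v1+) and an OPENAI_API_KEY environment variable:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(model="text-embedding-ada-002", input="food")
vector = resp.data[0].embedding
print(len(vector))  # text-embedding-ada-002 returns 1536-dimensional vectors

The resulting vector is what gets upserted into a vector DB such as Pinecone or Milvus, keyed by a document ID plus any metadata you want to filter on later.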


More details (credits):

1. https://platform.openai.com/docs/api-reference/embeddings

2. A good video course that explains the theory as well as setting up a vector DB: https://www.youtube.com/watch?v=ySus5ZS0b94

Right-size the vector DB:

Setting up vector stores introduces new challenges. For example, correctly partitioning large datasets that cannot fit entirely in RAM in vector stores like Milvus is not easy.
- Under-partitioning can result in some queries consuming too much RAM and bringing the service down.
- RAG responsiveness depends significantly on reducing the number of probes required to find relevant documents, so avoid over-partitioning as well (see the sketch below).
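
Here is a rough pymilvus sketch of partition-aware search; the collection name, field name, vector dimension, partition name, and nprobe value are all assumptions for illustration, not a recommendation.

from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
docs = Collection("docs")  # assumes a collection with a vector field "embedding"

# Partition by tenant (or by date) so a query only loads the slices it needs into RAM.
if not docs.has_partition("tenant_a"):
    docs.create_partition("tenant_a")
docs.load(partition_names=["tenant_a"])

# nprobe controls how many index cells are scanned: higher means better recall
# but more RAM and latency. Tune it rather than over- or under-partitioning.
results = docs.search(
    data=[[0.1] * 768],  # toy query vector; the dimension is an assumption
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=5,
    partition_names=["tenant_a"],  # restrict the search to one partition
)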

The Road Ahead

As GenAI moves into mainstream applications, vector databases' role will only grow. Their ability to organize and structure knowledge in a format tailored for AI aligns with the needs of next-gen generative models. 


Combining vector databases and transformers allows GenAI to understand language meaning rather than just keywords. This next-generation AI capability, powered by vector math, is what delivers natural, intelligent conversations.







Friday, March 15, 2024

Data for AI - Storage, ETL, Prepare, Clean and update the data


Taking your good data to AI



The most commonly used phrases

  • Garbage in, Garbage Out 
  • Bad input produces bad output 
  • Output can be only as good as input. 

Soon: Ethically Sourced, Organically Raised, Grass Fed Data at a Higher Price.

If we properly source and manage the data, LLMs will be trained on correct data, leading to fewer hallucinations. Unlearning, i.e., making an LLM forget specific segments of its training data, will be one of the significant facets of GenAI in the future.

Teaching kids the wrong things is worse than not teaching them at all.

https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist


Why do we need to be careful about source data?

1. Incorrect Information: This could lead to the AI providing answers that are disruptive. We need to be especially careful when the AI prescribes steps for a problem where a wrong answer could lead to severe complications.
2. PII and Secure Data: Inadvertently sharing one client's secure, private data with another client. Data classification and desensitization using GenAI to preprocess the data before it is used by AI is becoming a significant business proposition; there are quite a few startups in this space.
3. Feeding data driven by an agenda: IMHO, we all know about the Gemini fiasco that produced results that were not truthful because the truth hurts or the truth is not politically correct.
4. Proprietary/Copyrighted Data: How do we monetize and attribute proprietary research data to the correct author and content creator to prevent plagiarism and reward the inventor? This would be another area for new startups.
5. Using publicly available data has its downsides as well.
"Generative AI copyright battles have been brewing for over a year, and many stakeholders, from authors, photographers and artists to lawyers, politicians, regulators and enterprise companies, want to know what data trained Sora and other models — and examine whether they really were publicly available, properly licensed, etc."

The legal side is a big part of this, but let us review the technical side.

Here are some thoughts on data - types of data, storing, accessing, cleaning, preparing and updating the data

1) Structured Data: Structured data fits neatly into data tables and includes discrete data types such as numbers, short text, and dates.
2) Unstructured Data: Unstructured data, such as audio and video files and large text documents, doesn't fit neatly into a data table because of its size or nature.
3) How to store the data - Fast-storage vendors like VAST and Pure are seeing their stocks rise as demand for low-latency storage increases.
4) Sourcing the data without latency - primary data accessed by business applications can't be used for observability via AI insights/analytics because it would impact the performance of the production business applications. Backup data can't be used for analytics either, as it is generally a few days old and the answers would be stale. Databricks and Snowflake are pioneers in warehouse/data lake/lakehouse technologies, with ETL pipelines using Apache Spark to manage both structured and unstructured data and run CPU-intensive queries on it. This helps replicate the data almost immediately for LLM training and analytics purposes (see the sketch after this list).
5) Preparing the data for AI - 
     a) Improve the data quality, 
     b) integrate multiple data sources - Data integration can help you access and analyze more data, enrich your data with additional attributes, and reduce data silos and inconsistencies. ETL with data sync can help. Databricks is helpful for this.
     c) Data labelling: To label your data, you can use tools and techniques such as data annotation, classification, segmentation, and verification.
     d) Data augmentation can help with data scarcity, reduce bias, and improve data generalization and robustness.
     e) Data Governance: Data governance involves defining and implementing policies, processes, roles, and metrics to manage your data throughout its lifecycle. It can help you ensure that your data quality, integration, labelling, augmentation, and privacy are aligned with your AI objectives, standards, and best practices. You can use frameworks and platforms such as data strategy, stewardship, catalogue, and lineage to establish your data governance. 

6) Desensitizing the data for AI: To protect your data privacy, you can use tools and techniques such as data encryption, anonymization, consent, and audit.
7) Data management with proper Authentication/Authorization (IAM): Store and isolate the data based on the users. Multitenancy and reduced cross-pollination of data without excessive cost; having one LLM per client would be an expensive proposition.
Secure-minded design to protect the data: a tiered structure of LLMs (general, domain-specific, and private) to protect the data, or RAG/grounding with hashed metadata embeddings in the vector DB.
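
Here is the sketch referenced in item 4: a minimal PySpark example of deduplicating, filtering, and crudely masking PII in a replicated dataset before it feeds an LLM/RAG pipeline. The paths, column names, and regex are illustrative assumptions, not a production-ready data prep job.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-data-for-ai").getOrCreate()

# Read from a replicated copy, never from the primary production store.
raw = spark.read.parquet("s3://my-bucket/replicated/tickets/")

cleaned = (
    raw.dropDuplicates(["ticket_id"])          # remove duplicate records
       .filter(F.col("body").isNotNull())      # drop empty documents
       # crude email masking as a stand-in for a real PII classification/desensitization pass
       .withColumn("body", F.regexp_replace(F.col("body"),
                   r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>"))
)

cleaned.write.mode("overwrite").parquet("s3://my-bucket/ai-ready/tickets/")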



Wednesday, March 13, 2024

Product improvements using GenAI - Serviceability and Usability

Serviceability is the last thing on the mind of most product managers and developers. 90% of product users have little time to play around with various configurations to make the product work, as this is one of many products they manage. Nobody has time to read the docs.

Product managers like to say their product is as intuitive as Apple's, but they need to remember to provide guardrails and alerts in a way that makes the product self-serviceable or self-healing.

In this blog, we will review how GenAI can improve a product's usability and how product managers can make their products suitable for LLMs to learn from quickly.

- Make the products and documents GenAI ready

- Products that utilize GenAI principles to self-heal and become more usable/serviceable (LLMs for proactive monitoring and self-healing)


GenAI Ready Product:

  1.  Logs:

Logs generated by the product should have a clear structure, making it easy for LLMs to train on them, with PII data that is easily identifiable.

Error: TimeStamp: Message in Clear English 

Info: Timestamp: Message in Clear English

All the processes in your product should follow a similar pattern.
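
One possible shape for such a log line is structured JSON with the same schema across every process. This is a small Python sketch; the field names and the PII flag are illustrative, not a standard.

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Every process emits the same fields so an LLM can train on them easily.
        return json.dumps({
            "level": record.levelname,             # ERROR / INFO / WARN ...
            "timestamp": int(record.created),      # epoch seconds
            "component": record.name,
            "message": record.getMessage(),        # clear English, no cryptic codes
            "pii": getattr(record, "pii", False),  # flag PII so it is easy to find and mask
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("backup-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Backup job 42 completed in 311 seconds")
log.error("Snapshot failed: volume vol-7 is out of space", extra={"pii": False})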

  2.  GuardRails LLM

Guide the customer to an optimal solution rather than allowing them to shoot themselves in the foot. These guardrails can be driven by product-specific LLMs running within the product: a smaller-footprint LLM that acts as a well-trained product user.

For example, do not allow the customer to install new software if the system is already facing a storage or memory crunch.

  3.  Customer pattern learner LLM

This LLM can sit in the product or run in the SaaS to understand the customer usage/use case and provide solutions to the customer.  This LLM can alert the customer if any anomaly is spotted. 

Customers running an older code version with a bug that affects their specific use case can be alerted to upgrade (a version recommender).

  4. Utilizing LLMs for insights/analytics and file-walker algorithms (backup vendors and others that browse files to identify patterns can use GenAI tech such as LLMs/vector DBs).

Convert all the ML-based analytics to LLM-based analytics.

  5.  Prompt Engineering: Simplify the UI experience for the customer.  Current UX/UI can be used for advanced users.

Example: The prompt can be - "Identify current bottlenecks and suggest a solution"  

The response might be: "A CPU bottleneck was identified because there were too many zombie processes." A follow-up prompt can then ask the product to identify those processes and kill them.

Prompt Example for a backup software: Show me the current job that protects VMWare-SQL-Server-5  or Protect SQL-Server6 (provides the step or configures it automatically)

6. If you are shipping hardware with your product, LPU/GPU-ready hardware may be the future; alternatively, you can ship the call-home data to GPU clusters in Amazon to run insights and analytics.

7. Better product APIs to interact with LLMs: LLMs should be able to connect to the data source, log in to the product, and automatically change the product's configuration as per the prompt. This will help with AI-powered automation. There is a new development in this area called AI APIs, which take things a step further by using machine learning and natural language processing to understand requests, generate relevant responses, and complete tasks.
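
As a hedged sketch of what such an interaction could look like, here is an OpenAI-style function-calling example. The protect_vm tool, its parameters, and the backup use case are hypothetical illustrations, not a real product API.

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "protect_vm",
        "description": "Create a backup protection job for a virtual machine",
        "parameters": {
            "type": "object",
            "properties": {
                "vm_name": {"type": "string"},
                "policy": {"type": "string", "enum": ["gold", "silver", "bronze"]},
            },
            "required": ["vm_name"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4-turbo",  # any model that supports tool calling
    messages=[{"role": "user", "content": "Protect SQL-Server6 with the gold policy"}],
    tools=tools,
)

# The model returns a structured tool call; the product executes it through its own
# authenticated API rather than the LLM touching the configuration directly.
print(resp.choices[0].message.tool_calls)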

Documents suitable for GenAI:

If the product vendor generates the product documents, they should be structured so AI can parse them. Reinforced/supervised training of the AI is good for verifying that it can produce clear, concise, and correct answers for a specific vendor software version, without hallucinations or contradictions, and that it can translate correctly into multiple languages.

LLMs are good at reading documents and summarizing them.  A quick test run of various prompts with LLMs trained on the new docs for every version/white paper will improve confidence.


LLMs for Proactive monitoring and self-healing:

Model 1: The product sends the call-home data to the SaaS-based vendor monitoring system. If the logs/alerts follow the AI-ready requirements above, this data will spend less time in the data cleaning/prep phase.

Model 2: An on-prem Master that collects data from multiple points (IoTs, Clusters, nodes, Servers) and looks for anomalies. 
  • Pros: Secure, Quick identification/Trained for local data—useful for Cameras to monitor a break-in; 
  • Cons: Requires a local admin and local upgrades; limited processing power
LLMs can analyze this data to:
  • Identify the anomaly and the corresponding fingerprint, and provide a solution or apply it automatically if one exists.
  • Walk through the logs/alerts to identify new issues and alert the respective teams (Engineering/Field Notice/CSMs).  Create a draft doc on the field notice.
  • LLMs can identify the blast radius of this fingerprint.
  • LLMs can be live monitors of your product, and LLMs can be enabled to fix the issue or create docs or scripts to resolve the issue.
  • LLMs can scan these logs much faster than the current log parsers/file walkers.
Examples:
  LLMs monitor the logs for FATAL failures and review whether this is a known issue or an unknown issue, triggering an appropriate action.
  LLMs monitoring video input can start tracking the person who broke a window, based on the glass-shattering sound.
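
A toy sketch of the first example, assuming an OpenAI-compatible client; the log path, model choice, known-issue list, and prompt are all illustrative assumptions.

from openai import OpenAI

client = OpenAI()
KNOWN_ISSUES = ["out of space on metadata volume", "NTP drift beyond 5 seconds"]

# Collect FATAL lines from the product log (path is hypothetical).
with open("/var/log/product/node1.log") as f:
    fatals = [line.strip() for line in f if "FATAL" in line]

for line in fatals:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Known issues: {KNOWN_ISSUES}\nLog line: {line}\n"
                       "Is this a known issue? Answer KNOWN or UNKNOWN and give one sentence why.",
        }],
    )
    # A real monitor would trigger a field notice, a ticket, or a self-healing script here.
    print(line, "->", resp.choices[0].message.content)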





Monday, March 11, 2024

Coming attractions in this blog space

Here is what you can expect in this space: I plan to write at least one blog monthly, if not more frequently. I appreciate your support in providing feedback and sharing this blog.

Let us learn together and make the world better with GenAI.


Index:

  • Review of Startups in this space


Mindful Software: Building Agentic Automations using GenAI

Currently, software development and automation are painful. The software or automation team has to complete almost 95% of the process, takin...