Taking your good data to AI
The most commonly used phrases:
- Garbage in, Garbage Out
- Bad input produces bad output
- Output can only be as good as input.
Soon: Ethically Sourced, Organically Raised, Grass Fed Data at a Higher Price.
If we properly source and manage the data, LLMs will be trained on correct data, resulting in fewer hallucinations. Unremembering, or unlearning, specific segments of an LLM's training will be one of the significant facets of GenAI in the future.
Teaching kids the wrong things is worse than not teaching them at all.
Remember Microsoft's Tay chatbot, which Twitter users taught to post racist content within a day: https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist
A few ways the data we feed to AI can go wrong:
1. Incorrect Information: Wrong data leads the AI to give answers that can be disruptive. We need to be especially careful when the AI prescribes steps for a problem where a mistake could lead to severe complications.
2. PII and Secure Data: Inadvertently sharing one client's secure, private data with another client. Data classification and desensitization, using GenAI to preprocess the data before it is fed to AI, is becoming a significant business proposition, and there are quite a few startups in this space (a minimal redaction sketch follows this list).
3. Feeding Data Driven by an Agenda: IMHO, we all know about the Gemini fiasco, where the model produced results that were not truthful because the truth hurts or is not politically correct.
4. Proprietary/Copyrighted Data: How do we monetize proprietary research data and attribute it to the correct author and content creator, to prevent plagiarism and reward the inventor? This would be another area for new startups.
5. Using Publicly Available Data has its downsides as well:
"Generative AI copyright battles have been brewing for over a year, and many stakeholders, from authors, photographers and artists to lawyers, politicians, regulators and enterprise companies, want to know what data trained Sora and other models — and examine whether they really were publicly available, properly licensed, etc."
The legal side is a big part of this, but let us review the technical side.
Here are some thoughts on data: the types of data, and how to store, access, clean, prepare, and update it.
1) Structured Data: Structured data fits neatly into data tables and includes discrete data types such as numbers, short text, and dates.
2) Unstructured Data: Unstructured data, such as audio and video files and large text documents, doesn't fit neatly into a data table because of its size or nature.
3) How to store the data: demand for low-latency storage is rising, and fast-storage vendors like VAST and Pure Storage are seeing their stocks rise with it.
4) Sourcing the data without latency: primary data accessed by the business applications can't be used for AI insights/analytics, because that would impact the performance of the production business applications. Backup data can't be used for analytics either, as it is generally a few days old and the answers would be aged. Databricks and Snowflake are pioneers in warehouse/data lake/lakehouse technologies, with ETL pipelines using Apache Spark to manage both structured and unstructured data and the ability to run CPU-intensive queries on that data. This helps replicate the data almost immediately for LLM training/analytics purposes (see the PySpark sketch after this list).
5) Preparing the data for AI:
a) Improve the data quality.
b) Integrate multiple data sources: data integration can help you access and analyze more data, enrich your data with additional attributes, and reduce data silos and inconsistencies. ETL with data sync can help, and Databricks is helpful here (a toy pandas sketch covering 5a and 5b follows this list).
c) Data labelling: To label your data, you can use tools and techniques such as data annotation, classification, segmentation, and verification.
d) Data augmentation can help with data scarcity, reduce bias, and improve data generalization and robustness (a toy example follows this list).
e) Data Governance: Data governance involves defining and implementing policies, processes, roles, and metrics to manage your data throughout its lifecycle. It can help you ensure that your data quality, integration, labelling, augmentation, and privacy are aligned with your AI objectives, standards, and best practices. You can use frameworks and platforms such as data strategy, stewardship, catalogue, and lineage to establish your data governance.
6) Desensitizing the data for AI: To protect your data privacy, you can use tools and techniques such as data encryption, anonymization, consent, and audit.
7) Data management with proper authentication/authorization (IAM): store and isolate the data based on the users. The goal is multitenancy that prevents cross-pollination of data without driving up cost, since having one LLM for each client would be an expensive proposition (a minimal isolation sketch appears below).
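On point 4, here is a minimal PySpark sketch of an ETL step that runs over replicated data, so production databases stay untouched. The table names, paths, and schema (`event_id`, `event_ts`) are assumptions for illustration, not a real deployment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("replica-etl").getOrCreate()

# Read from a replica/landing zone, never from the production database.
raw = spark.read.parquet("/mnt/replica/raw_events")

# Basic cleanup before the data is used for analytics or LLM training.
clean = (
    raw.dropDuplicates(["event_id"])
       .filter(F.col("event_ts").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Publish an analytics-ready lakehouse table for downstream queries.
clean.write.mode("overwrite").partitionBy("event_date").saveAsTable("analytics.events")
```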
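On points 5a and 5b, quality and integration often start as simple, auditable steps: deduplicate, normalize, then join sources to enrich records. A toy pandas sketch; the column names are assumptions.

```python
import pandas as pd

# Two illustrative sources with overlapping customers.
crm = pd.DataFrame({"customer_id": [1, 2, 2],
                    "email": ["A@X.COM", "b@x.com", "b@x.com"]})
billing = pd.DataFrame({"customer_id": [1, 2], "plan": ["pro", "free"]})

# Quality: deduplicate and normalize before anything reaches training data.
crm = crm.drop_duplicates("customer_id")
crm["email"] = crm["email"].str.lower()

# Integration: join sources so each record carries enriched attributes.
merged = crm.merge(billing, on="customer_id", how="left")
print(merged)
```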
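On point 5d, one of the simplest forms of text augmentation is generating noisy variants of existing sentences (word dropout and swaps) so a model generalizes beyond exact phrasings. This is a toy sketch of the idea only; real augmentation pipelines are more principled.

```python
import random

def augment(sentence: str, n_variants: int = 3, p_drop: float = 0.1) -> list[str]:
    """Make noisy variants by randomly dropping and swapping words."""
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        kept = [w for w in words if random.random() > p_drop] or words[:]
        if len(kept) > 1:                      # swap two random positions
            i, j = random.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
        variants.append(" ".join(kept))
    return variants

print(augment("the quick brown fox jumps over the lazy dog"))
```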
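On point 7, the core idea is that every read is scoped to the caller's tenant before it touches shared storage. A minimal sketch, with hypothetical tenant IDs and an in-memory dict standing in for a real IAM-backed store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    name: str
    tenant_id: str

# Shared storage, keyed by tenant so one client's data never leaks to another.
_DOCS: dict[str, list[str]] = {
    "tenant-a": ["A's quarterly report"],
    "tenant-b": ["B's incident log"],
}

def fetch_docs(user: User, tenant_id: str) -> list[str]:
    """Authorization check: users may only read their own tenant's data."""
    if user.tenant_id != tenant_id:
        raise PermissionError(f"{user.name} may not access {tenant_id}")
    return _DOCS.get(tenant_id, [])

alice = User("alice", "tenant-a")
print(fetch_docs(alice, "tenant-a"))   # OK
# fetch_docs(alice, "tenant-b")        # raises PermissionError
```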
Secure-minded design to protect the data:
A tiered structure of LLMs (general, domain-specific, and private) to protect the data, or RAG/grounding with hashed metadata embeddings in a vector DB.
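To make the RAG/grounding idea concrete: documents are embedded once and stored with hashed tenant metadata, and retrieval filters on that hash so each client only ever grounds against its own data. A toy sketch with a stand-in embedding and an in-memory index; a real system would use an actual embedding model and vector database.

```python
import hashlib
import math

def tenant_tag(tenant_id: str) -> str:
    """Hash tenant metadata so raw client identifiers never sit in the index."""
    return hashlib.sha256(tenant_id.encode()).hexdigest()

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch) / 1000.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# In-memory "vector DB": (embedding, hashed tenant tag, document text).
index = [
    (embed("Q3 revenue grew 12%"), tenant_tag("client-a"), "Q3 revenue grew 12%"),
    (embed("outage root cause: dns"), tenant_tag("client-b"), "outage root cause: dns"),
]

def retrieve(query: str, tenant_id: str) -> str:
    """Grounding step: search only vectors tagged with the caller's tenant hash."""
    tag = tenant_tag(tenant_id)
    candidates = [(vec, doc) for vec, t, doc in index if t == tag]
    q = embed(query)
    return max(candidates, key=lambda c: sum(a * b for a, b in zip(c[0], q)))[1]

print(retrieve("revenue growth", "client-a"))  # only client-a docs are searched
```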