
PABLO AI

Automotive Parts Intelligence Platform

🤖

Pablo AI Assistant

Automotive Parts Expert • Always Online

🤖

👋 Hi! I'm Pablo, your AI automotive parts expert. I'm trained on 577+ million part fitment records!

I can help you find the perfect parts for your vehicle, verify fitment, and answer any automotive questions. What are you working on today?

Pablo AI • Just now

I need an oil filler cap for my 1986 Hyundai Excel with the 1.5L engine

You • Just now
👤
🤖

🎯 Great! I found 4 compatible oil filler caps for your 1986 Hyundai Excel 1.5L:

🔧
Beck/Arnley 016-0050 (Best Match)

Engine Oil Filler Cap • OE Exact Grade

$8.99 • Fits 326 applications
🔧
Gates 31085 (Premium)

Engine Oil Filler Cap • Twist Lock

$12.49 • Fits 546 applications

💡 Pro Tip: The Beck/Arnley 016-0050 has the exact OE specifications (1.41" neck diameter, twist-lock type) for your Excel's 1.5L SOHC engine.

Pablo AI • Just now • Searched 577M records in 0.3s

📊 Real-time Metrics

Response Accuracy 99.2%
Avg Response Time 0.3s
Fitment Precision 99.7%

🧠 Knowledge Base

577M
Fitment Records
2.1M
Part Numbers
45K
Vehicle Models
850+
Brands
๐Ÿ—๏ธ

Pablo's Brain: How It Works

Understanding the LLM Architecture in Simple Terms

๐Ÿฝ๏ธ Think of Pablo Like a Super-Smart Restaurant Waiter

Imagine a restaurant waiter who has memorized every dish from every restaurant in the world, knows every ingredient, and can instantly tell you what dishes match your dietary needs. That's Pablo โ€” but for auto parts.

When you ask "I need an oil cap for my 1986 Hyundai Excel," Pablo doesn't just randomly guess. It uses a sophisticated system with three main parts working together, just like how a great restaurant has a host, waiter, and chef working as a team.

🧩 The Three Parts of Pablo's Brain

🧠

Part 1: The Brain (Llama 3.1 70B Model)

This is the core "thinking" part of Pablo. It's a Large Language Model (LLM) created by Meta, called Llama 3.1, with 70 billion parameters. Think of parameters like brain cells: more parameters mean more capacity to understand and generate complex responses.

What Does "70 Billion Parameters" Mean?

Imagine a massive spreadsheet with 70 billion numbers. Each number has been carefully adjusted during training so that when you feed text in, the model produces intelligent responses. These numbers encode everything the model has learned about language, logic, and knowledge.

70B
Parameters
128K
Context Window
~140GB
Model Size

🎯 Why Llama 3.1 70B? It's the sweet spot between being smart enough to understand complex automotive questions and being small enough to run on our hardware. Larger models (like 405B) are smarter but require massive data centers. Smaller models (8B) are faster but less accurate.
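The "~140GB" model size above is simple arithmetic: parameter count times bytes per parameter. A quick sketch in Python (rough figures; real deployments add KV-cache and runtime overhead on top):

# Memory footprint = parameter count × bytes per parameter.
PARAMS = 70e9

for precision, bytes_per_param in [("FP32", 4), ("BF16", 2), ("INT4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{precision}: ~{gib:.0f} GB")

# BF16: ~130 GB → the "~140GB" above once overhead is included.
# INT4: ~33 GB  → matches the ~35GB quantized model used for deployment.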

📚

Part 2: The Library (RAG Vector Database)

Even with 70 billion parameters, the brain can't memorize ALL 577 million parts. So we give it a "reference library" it can search instantly. This technique is called RAG (Retrieval-Augmented Generation).

How RAG Works: A Simple Analogy

Imagine you're taking an open-book exam. You're smart and know the concepts, but you're allowed to look up specific facts in your textbook. RAG works the same way:

1

User asks: "Oil cap for 1986 Hyundai Excel 1.5L"

2

System searches 577M records and finds the 10 most relevant parts

3

These 10 parts are given to the brain along with the question

4

The brain crafts a helpful response using this specific information

Vector Database

We use Pinecone, a specialized database that can search through 577 million records in under 50 milliseconds by comparing mathematical "fingerprints" of text.

Embeddings

Each part record is converted into a 1024-number "fingerprint" using the E5-large-v2 model. Similar parts have similar fingerprints, making search fast and accurate.
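A minimal sketch of that retrieval step (the index name "parts-fitment" and the metadata fields are illustrative assumptions, not the production schema):

# Embed the question, then ask Pinecone for the nearest "fingerprints".
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

embedder = SentenceTransformer("intfloat/e5-large-v2")   # 1024-dim vectors
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("parts-fitment")

# E5 models expect a "query: " prefix on search text.
vec = embedder.encode("query: oil filler cap 1986 Hyundai Excel 1.5L")

results = index.query(
    vector=vec.tolist(),
    top_k=10,                                  # the 10 best matches
    include_metadata=True,
    filter={"year": 1986, "make": "HYUNDAI"},  # optional metadata pre-filter
)
for match in results.matches:
    print(round(match.score, 3), match.metadata)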

🎓

Part 3: The Specialized Training (QLoRA Fine-tuning)

The base Llama model is smart but doesn't know anything about auto parts. Fine-tuning is how we teach it automotive expertise by showing it 150,000 example conversations about parts, fitment, and vehicle specifications.

The "Sticky Notes" Technique (QLoRA):

Retraining all 70 billion parameters would take weeks and cost hundreds of thousands of dollars. Instead, we use a clever technique called QLoRA:

Imagine the base model as a massive filing cabinet with 70 billion folders. Instead of rewriting all folders, we attach small "sticky notes" (about 400 million of them) to key folders. When the AI looks something up, it reads both the main folder AND the sticky note.

This means we only train 0.5% of the model, making it possible to train on 4 consumer GPUs instead of a massive data center!

LoRA Rank
64
Adapter size
Training Data
150K
Q&A pairs
Training Time
~72 hrs
On 4× RTX 4090

🔄 How All Three Parts Work Together

When a customer asks: "I need an oil filler cap for my 1986 Hyundai Excel 1.5L"

💬 Customer Query (natural language) → 🔍 Extract Info (year, make, model) → 📚 Search Library (find 10 best parts) → 🧠 Brain Processes (Llama 3.1 + QLoRA) → ✅ Smart Response (parts + advice)

🎯 The Result:

In about 0.3 seconds, Pablo combines its specialized automotive training with a search across 577 million parts and generates a helpful, accurate response: the right oil filler cap (Beck/Arnley 016-0050) with detailed specifications, alternative options, and pro tips. It's like having an expert parts specialist available 24/7.

โš™๏ธ Technical Specifications Summary

🧠

Base Model

Model: Llama 3.1 70B
Parameters: 70 Billion
Context Window: 128K tokens
Precision: BF16 / INT4
🎯

Fine-tuning

Method: QLoRA
LoRA Rank: 64
LoRA Alpha: 128
Training Data: 150K Q&A
📊

RAG System

Embedding: E5-large-v2
Vector DB: Pinecone
Total Vectors: 577M
Top-K Retrieval: 10
📚

How We Train Pablo AI

A Simple Guide to Understanding LLM Training

🎓 Think of It Like Training a New Employee

Imagine you hired a brilliant new employee who has read millions of books and websites: they're incredibly smart and can write, code, and answer questions about almost anything. But they know nothing about auto parts.

That's exactly what the base Llama 3.1 model is like. It's incredibly intelligent, but it doesn't know that a 016-0050 oil filler cap fits a 1986 Hyundai Excel.

Training is how we teach this brilliant employee everything about automotive parts: which parts fit which vehicles, what the part numbers mean, and how to help customers find the right parts.

🔄 The Complete Training Journey

1

📥 Gathering the Knowledge (Raw Data Collection)

First, we collect ALL the automotive knowledge that exists. This comes from ACES files, the industry-standard format that every auto parts manufacturer uses to share their catalog data.

Real ACES Data Example:

When Beck/Arnley says "our part 016-0050 fits the 1986 Hyundai Excel", it looks like this in their catalog file:

<App id="5223">
  <BaseVehicle id="139" />   ← This number means "1986 Hyundai Excel"
  <EngineBase id="1062" />   ← This means "1.5L 4-cylinder SOHC engine"
  <PartType id="5269" />     ← This means "Oil Filler Cap"
  <Part>016-0050</Part>      ← The actual part number
</App>

We have 577 million of these records from 850+ brands!

577M
Fitment Records
2.1M
Unique Parts
45K
Vehicle Types
~500GB
Raw Data
2

🧹 Making the Data Human-Readable (Data Processing)

Those cryptic codes like BaseVehicle id="139" mean nothing to a human or AI. We need to translate them into actual vehicle and part information by joining with reference databases.

Before → After Translation:
❌ Raw (Unreadable):
BaseVehicle: 139
EngineBase: 1062
PartType: 5269
Part: 016-0050
✅ Processed (Human-Readable):
Make: HYUNDAI
Model: EXCEL  
Year: 1986
Engine: 1.5L L4 SOHC
Product: Engine Oil Filler Cap
Part Number: 016-0050
Brand: Beck/Arnley

This processed data goes into our euro table, a massive database with 50+ columns of vehicle and part attributes for every fitment record.
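A sketch of the translation itself, with tiny in-memory dictionaries standing in for the real VCdb/PCdb reference tables (in production this is a SQL join):

# Translate raw ACES IDs into human-readable fields.
BASE_VEHICLES = {"139": {"Year": 1986, "Make": "HYUNDAI", "Model": "EXCEL"}}
ENGINE_BASES = {"1062": "1.5L L4 SOHC"}
PART_TYPES = {"5269": "Engine Oil Filler Cap"}

def translate(raw, brand):
    vehicle = BASE_VEHICLES[raw["base_vehicle_id"]]
    return {
        **vehicle,
        "Engine": ENGINE_BASES[raw["engine_base_id"]],
        "Product": PART_TYPES[raw["part_type_id"]],
        "Part Number": raw["part_number"],
        "Brand": brand,
    }

raw = {"base_vehicle_id": "139", "engine_base_id": "1062",
       "part_type_id": "5269", "part_number": "016-0050"}
print(translate(raw, brand="Beck/Arnley"))
# {'Year': 1986, 'Make': 'HYUNDAI', 'Model': 'EXCEL', 'Engine': '1.5L L4 SOHC', ...}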

3

โœ๏ธ Creating Lesson Plans (Q&A Dataset Generation)

Now here's the key insight: you can't just show an AI raw data and expect it to learn. You need to create example conversations โ€” showing the AI "when a customer asks THIS, you should respond like THIS."

Training Example (Q&A Pair):
👤 USER (Input):

"I need an oil filler cap for my 1986 Hyundai Excel with a 1.5 liter engine"

🤖 PABLO (Expected Output):

"For your 1986 Hyundai Excel 1.5L SOHC, I recommend the Beck/Arnley 016-0050 Engine Oil Filler Cap. This is an OE-exact replacement with a twist-lock design, 1.41" neck diameter. It fits all Excel trims (Base, GL, GLS) for the USA and Canadian markets. The Gates 31085 is also compatible if you prefer an alternative brand."

We generate 150,000+ of these training examples programmatically from our 577M records!

💡 Analogy: This is like creating a training manual with 150,000 example customer conversations. "When a customer says X, the perfect response is Y."
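A simplified sketch of that programmatic generation (the real generator varies the wording across many templates so the model learns the pattern rather than one phrasing):

# Turn one processed fitment record into an instruction-tuning Q&A pair.
def make_qa_pair(rec):
    question = (
        f"I need a {rec['Product'].lower()} for my {rec['Year']} "
        f"{rec['Make'].title()} {rec['Model'].title()} with a {rec['Engine']} engine"
    )
    answer = (
        f"For your {rec['Year']} {rec['Make'].title()} {rec['Model'].title()} "
        f"{rec['Engine']}, I recommend the {rec['Brand']} "
        f"{rec['Part Number']} {rec['Product']}."
    )
    return {"instruction": question, "output": answer}

rec = {"Year": 1986, "Make": "HYUNDAI", "Model": "EXCEL",
       "Engine": "1.5L L4 SOHC", "Product": "Engine Oil Filler Cap",
       "Part Number": "016-0050", "Brand": "Beck/Arnley"}
print(make_qa_pair(rec))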

4

🔢 Converting Text to Numbers (Tokenization)

Computers don't understand words; they only understand numbers. We need to convert every word into tokens (numbers) that the AI can process.

Tokenization Example:
Original text:

"oil filler cap for Hyundai Excel"

After tokenization:

[8435, 1254, 6623, 369, 19478, 35621]

The Llama tokenizer has a vocabulary of 128,000 tokens, covering every word and subword it might encounter.
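A sketch of the same round trip with the Hugging Face tokenizer (Llama 3.1 weights are gated, so this needs an approved account; the IDs you get will differ from the illustrative numbers above):

# Encode text to token IDs and decode back.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

ids = tok.encode("oil filler cap for Hyundai Excel", add_special_tokens=False)
print(ids)              # a short list of integers, one per token
print(tok.decode(ids))  # "oil filler cap for Hyundai Excel"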

5

🧠 The Actual Training (Fine-tuning with QLoRA)

This is where the magic happens. We load the base Llama 3.1 70B model and teach it using our 150,000 Q&A examples. But here's the clever part: we don't modify all 70 billion parameters (that would require massive computing power). Instead, we use a technique called QLoRA.

💡 The QLoRA Technique Explained Simply:

Imagine the base model as a massive filing cabinet with 70 billion folders (parameters). Rewriting all those folders would take forever.

Instead, QLoRA adds a small "sticky note system": about 400 million small notes attached to key folders. When the AI looks something up, it checks the main folder AND reads the sticky note. The sticky notes contain all our automotive knowledge!

This means we only need to train 0.5% of the model (the sticky notes), making it possible to train on consumer GPUs instead of requiring a massive data center.

🔧 Training Configuration:
Base Model
Llama 3.1 70B
Loaded in 4-bit (~35GB)
LoRA Adapters
Rank 64, Alpha 128
~400M trainable params
Training Data
150,000 Q&A pairs
3 epochs (passes)
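In code, that configuration looks roughly like this (the target_modules choice is a common convention, not necessarily the production setting):

# QLoRA setup sketch: 4-bit base model + rank-64 LoRA adapters via PEFT.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # base weights in 4-bit (~35GB)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",                     # shard across the available GPUs
)
lora = LoraConfig(
    r=64,                                  # LoRA rank
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # roughly 0.5% of all parameters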
6

โš™๏ธ What Happens Inside During Training

During training, the computer repeats this cycle hundreds of thousands of times:

A
Show Example

Feed one Q&A pair: "User asks about Excel oil cap → Expected response about 016-0050"

B
Model Attempts

The AI generates its own response based on current knowledge

C
Calculate Error

Compare the AI's response to the expected response: how different are they?

D
Adjust "Sticky Notes"

Slightly update the LoRA adapter weights to reduce the error

E
Repeat!

Do this for all 150,000 examples, 3 times (epochs) = ~450,000 update cycles

🎯 Result: After 72 hours, the LoRA adapters have been tuned so precisely that the AI now understands automotive parts as well as your best employee, but it can answer instantly and never forgets!
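At its core, that A-to-E cycle is a few lines of PyTorch. A bare-loop sketch, assuming the model from the QLoRA setup above and a tokenized train_dataset (a real run would use a trainer with the gradient accumulation of 8 configured earlier):

# The A-E training cycle as a minimal PyTorch loop.
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # adapters only
    lr=2e-4,
)

for epoch in range(3):                       # 3 epochs over 150K pairs
    for batch in DataLoader(train_dataset, batch_size=1):
        out = model(**batch)                 # B: model attempts a response
        loss = out.loss                      # C: error vs. expected tokens
        loss.backward()                      # D: adjust the "sticky notes"
        optimizer.step()                     #    (only LoRA weights change)
        optimizer.zero_grad()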

7

📊 Adding the Reference Library (RAG Vector Database)

Even after training, the AI can't memorize ALL 577 million parts. So we give it a "reference library" it can search instantly. This is called RAG (Retrieval-Augmented Generation).

How RAG Works:
1๏ธโƒฃ

Convert each fitment record into a numerical "fingerprint" (embedding vector)

2๏ธโƒฃ

Store all 577M fingerprints in a vector database (Pinecone)

3๏ธโƒฃ

When user asks a question, find the 10 most relevant records instantly

4๏ธโƒฃ

Feed those records to the AI along with the question

This way, Pablo can access any of the 577 million records in under 50 milliseconds, without needing to memorize them all!
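The last step, packing the retrieved records in front of the model, can be sketched like this (the metadata field names are illustrative):

# Build a grounded prompt so the model answers from retrieved facts,
# not from memory.
def build_prompt(question, matches):
    context = "\n".join(
        f"- {m.metadata['Brand']} {m.metadata['Part Number']}: "
        f"{m.metadata['Product']} for {m.metadata['Year']} "
        f"{m.metadata['Make']} {m.metadata['Model']} ({m.metadata['Engine']})"
        for m in matches
    )
    return (
        "You are Pablo, an automotive parts expert. Answer using ONLY the "
        f"fitment records below.\n\nFitment records:\n{context}\n\n"
        f"Customer question: {question}"
    )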

🎯 The Complete Training Pipeline

📥 Raw ACES (500GB XML) → 🧹 Process (577M records) → ✍️ Generate Q&A (150K pairs) → 🔢 Tokenize (numbers) → 🧠 Train ~72 hrs (4× RTX 4090) → 📊 RAG Index (577M vectors) → 🤖 Pablo AI (ready!)
🖥️

Local GPU Training Setup

Train on Your Own Hardware → Deploy to Cloud

๐Ÿ† RECOMMENDED

Best Performance Setup

GPU
4× RTX 4090
96GB Total VRAM
CPU
Threadripper PRO
5975WX 32-Core
RAM
256GB DDR4
ECC 3200MHz
Storage
4TB NVMe
PCIe Gen5
~72 hours
Training Time
~1,800W
Peak Power
~$15,000
GPU Cost
💰 BUDGET

Cost-Effective Alternative

2× RTX 4090
48GB VRAM
Ryzen 9 7950X
16-Core
128GB DDR5
Non-ECC
~144 hours
Training Time

📊 GPU Memory Usage

GPU 0: 22.1GB / 24GB
GPU 1: 21.8GB / 24GB
GPU 2: 22.3GB / 24GB
GPU 3: 21.9GB / 24GB

โš™๏ธ Training Config

MethodQLoRA (4-bit)
LoRA Rank64
Batch Size1 ร— 8 grad accum
Learning Rate2e-4
Epochs3

๐Ÿ› ๏ธ Software Stack

Training

PyTorch2.2.0+cu121
Transformers4.40.0
PEFT0.10.0
bitsandbytes0.43.0

CUDA

CUDA12.1
cuDNN8.9.7
Flash Attention2.5.0
Driver545.xx

Inference

vLLM0.4.0
AutoAWQ0.2.0
FastAPI0.110.0
LangChain0.1.10
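On the inference side, those pieces connect simply: vLLM serves the quantized model behind its OpenAI-compatible HTTP API, and FastAPI fronts it. A minimal sketch (the endpoint shape, local URL, and served model name are illustrative, not the production API):

# Minimal chat endpoint: FastAPI in front of a vLLM OpenAI-compatible
# server assumed to be listening on localhost:8000.
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat(req: ChatRequest):
    resp = llm.chat.completions.create(
        model="pablo-llama-3.1-70b-awq",   # served model name (assumption)
        messages=[{"role": "user", "content": req.message}],
    )
    return {"reply": resp.choices[0].message.content}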
โ˜๏ธ

AWS Inference (Cost-Optimized)

Train Locally → Deploy to AWS for Inference Only

💡

Cost Optimization Strategy

By training locally on your own hardware, you eliminate expensive GPU training costs on AWS. Cloud is used purely for scalable inference.

Training
LOCAL
One-time HW cost
Inference
AWS
Pay-per-use
Savings
~85%
vs full cloud training

๐Ÿ—๏ธ Minimum Viable Infrastructure

🤖
LLM Inference
1× g5.12xlarge
4× A10G (96GB)
~$5.67/hr
⚡
API Workers
2× t3.large
Auto-scaling
~$0.17/hr
🗃️
Database
RDS db.t3.large
PostgreSQL
~$0.15/hr
📊
Vector DB
Pinecone s1
Starter Pod
~$70/mo

💰 Monthly Cost (On-Demand)

g5.12xlarge (730 hrs): $4,139
API Workers: $248
RDS PostgreSQL: $109
Pinecone: $70
S3 + Transfer: $50
Total: ~$4,616/mo
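Each line item is hourly-rate arithmetic over a 730-hour month, easy to sanity-check:

# On-demand monthly cost check (730 hours/month).
HOURS = 730
print(f"g5.12xlarge: ${5.67 * HOURS:,.0f}")      # ~$4,139
print(f"API workers: ${2 * 0.17 * HOURS:,.0f}")  # ~$248
print(f"RDS:         ${0.15 * HOURS:,.0f}")      # ~$110 (listed as $109)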

💡 Cost Reduction Options

Spot Instances (~60-70% off)
GPU: ~$1,500/mo
Reserved Instances (~40% off)
1-year commitment: ~$2,500/mo
Auto-Scaling
Scale down during low traffic
⚡ Optimized Total: ~$2,000/mo
📅

Comprehensive Development Plan

10-Week Sprint Plan with Detailed Deliverables

10
Weeks Total
5
Major Phases
20+
Deliverables
MVP
Final Goal

📊 Project Timeline Overview

Phase 1: W1-2 • Phase 2: W3-4 • Phase 3: W5-6 • Phase 4: W7-8 • Phase 5: W9-10
📅 Weeks 1-2

Phase 1: Research & Planning

Foundation Setup & Architecture Design

14 Days
~80 Hours

📋 Week 1: Data Assessment

✓ ACES/VCdb Data Audit
Inventory all 500GB+ of raw XML files, validate data quality, identify missing brands
✓ Database Schema Analysis
Review existing euro table structure, plan optimizations for 577M records
✓ Sample Q&A Generation Test
Create 100 sample training pairs to validate approach

🏗️ Week 2: Architecture Design

✓ LLM Model Selection
Benchmark Llama 3.1 70B vs 8B, finalize QLoRA approach
✓ RAG Architecture Design
Design vector DB schema, embedding strategy, retrieval pipeline
✓ Hardware Procurement
Order 4× RTX 4090 GPUs, verify workstation specs
📦 Phase 1 Deliverables:
✓ Data Quality Report
✓ Architecture Document
✓ Hardware Setup Complete
📅 Weeks 3-4

Phase 2: Data Pipeline & Vector Database

Process 577M Records & Build RAG Infrastructure

14 Days
~100 Hours

🔄 Week 3: Data Processing

✓ ACES XML Parser Development
Build Python parser to extract all App records from XML files
✓ VCdb/PCdb Join Pipeline
Translate IDs to human-readable values (Make, Model, Year, etc.)
✓ Euro Table Population
Load all 577M processed records into PostgreSQL

📊 Week 4: Vector Database

✓ Embedding Generation Pipeline
Convert 577M records to E5-large-v2 embeddings (batch processing)
✓ Pinecone Index Setup
Create index, configure metadata filters, upload vectors
✓ RAG Retrieval Testing
Validate search accuracy with 1000 test queries
📦 Phase 2 Deliverables:
✓ 577M Records in PostgreSQL
✓ 577M Vectors in Pinecone
✓ RAG Pipeline Working
📅 Weeks 5-6

Phase 3: Model Fine-tuning (LOCAL)

Train Pablo AI on 4× RTX 4090 GPUs

14 Days
~72 hrs training

✍️ Week 5: Dataset Preparation

✓ Q&A Dataset Generation
Generate 150,000 training pairs from euro table data
✓ Dataset Formatting (JSONL)
Format for Llama instruction tuning, add system prompts
✓ Training Environment Setup
Install PyTorch, PEFT, bitsandbytes, configure multi-GPU

🧠 Week 6: Training & Export

✓ QLoRA Fine-tuning (~72 hours)
Train on 4× RTX 4090, 3 epochs, monitor loss curves
✓ Model Evaluation
Test on 500 held-out examples, measure accuracy metrics
✓ AWQ Quantization
Convert to 4-bit for efficient deployment (~35GB)
📦 Phase 3 Deliverables:
✓ 150K Training Dataset
✓ Fine-tuned LoRA Adapters
✓ AWQ Quantized Model
📅 Weeks 7-8

Phase 4: API Development & Integration

Build Production-Ready Backend Services

14 Days
~100 Hours

⚡ Week 7: Inference Server

✓ vLLM Server Setup
Deploy model with tensor parallelism, optimize throughput
✓ RAG Pipeline Integration
Connect Pinecone retrieval with LLM generation
✓ Response Validation
Build fitment verification against the ACES database

🔌 Week 8: API & UI

✓ FastAPI Endpoints
REST API for chat, part lookup, fitment verification
✓ WebSocket Streaming
Real-time response streaming for the chat interface
✓ Chat UI Development
React frontend with conversation history, product cards
📦 Phase 4 Deliverables:
✓ vLLM Inference Server
✓ REST/WebSocket API
✓ Chat UI Prototype
📅 Weeks 9-10

Phase 5: AWS Deployment & Launch

Production Deployment & Go-Live

14 Days
~80 Hours

☁️ Week 9: Infrastructure

✓ AWS Infrastructure Setup
Deploy g5.12xlarge, RDS, ElastiCache via Terraform
✓ Model Upload to S3
Transfer the 35GB quantized model, configure EBS volumes
✓ Auto-Scaling Configuration
Set up CloudWatch alarms, scaling policies

🚀 Week 10: Launch

✓ Load Testing
Stress test with 100 concurrent users, optimize latency
✓ Monitoring & Logging
Set up CloudWatch dashboards, error tracking
✓ 🚀 Production Launch!
Go live with the Pablo AI MVP
📦 Phase 5 Deliverables:
✓ AWS Infrastructure Live
✓ Monitoring Dashboard
✓ 🎉 Pablo AI MVP Live!

โš ๏ธ Key Risks & Mitigation

Risk: GPU Hardware Delays
Mitigation: Order hardware in Week 1, have cloud backup (RunPod)
Risk: Training Quality Issues
Mitigation: Weekly eval checkpoints, prepared to extend Phase 3
Risk: Vector DB Costs
Mitigation: Start with Pinecone s1, scale if needed; pgvector as fallback
Risk: Data Quality Gaps
Mitigation: Phase 1 audit identifies gaps, prioritize top 80% brands

🎯 Success Metrics (MVP)

>95%
Fitment Accuracy
<1s
Response Time
100
Concurrent Users
99.5%
Uptime Target