Local AI Processing: Privacy Without Sacrifice
Bringing powerful AI capabilities to your device. How we use WebAssembly and local models to deliver intelligent features without data collection.
Cloud AI is powerful but comes with a cost: your data leaves your device and goes to someone else's servers. At GrepLabs, we've built AI features that run entirely locally. Here's how we make it work without sacrificing capability.
The Cloud AI Problem
When you use cloud AI services, you typically:
- Send your data to remote servers
- Trust the provider with your information
- Accept latency from network round-trips
- Pay per API call
For privacy-conscious users and enterprises, this is unacceptable for sensitive data.
Our Approach: Local AI
We run AI models directly on your device:
- **No data transmission**: Your information never leaves your device
- **No API costs**: Use as much as you want
- **No network latency**: Responses don't wait on a round-trip to a server
- **Works offline**: No internet required
Technical Implementation
WebAssembly for Cross-Platform Execution
WebAssembly (Wasm) lets us run high-performance code in browsers and native apps:
// Loading a Wasm ML model
async function loadModel() {
  // The module imports its linear memory, so keep a reference to it here
  const memory = new WebAssembly.Memory({ initial: 256 });
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch('/models/embedding.wasm'),
    { env: { memory } }
  );
  const exports = instance.exports as any;

  return {
    embed: (text: string): Float32Array => {
      const bytes = new TextEncoder().encode(text);

      // Copy input into Wasm memory
      const inputPtr = exports.allocate(bytes.length);
      new Uint8Array(memory.buffer).set(bytes, inputPtr);

      // Run inference
      const outputPtr = exports.embed(inputPtr, bytes.length);

      // Read output embeddings (384 = embedding dimension);
      // copy them out before Wasm memory is reused or grows
      return new Float32Array(memory.buffer, outputPtr, 384).slice();
    }
  };
}
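For illustration, calling the loader looks something like this (the allocate and embed export names are whatever the Wasm module was compiled to expose, so treat them as assumptions):

// Hypothetical usage of the loader above
const model = await loadModel();
const vector = model.embed('quarterly revenue forecast');
console.log(vector.length); // 384 dimensions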
ONNX Runtime for Model Execution
ONNX (Open Neural Network Exchange) provides a standard format that runs anywhere:
import * as ort from 'onnxruntime-web';

class LocalEmbedding {
  private session: ort.InferenceSession | null = null;

  async initialize() {
    // Load model from a local file
    this.session = await ort.InferenceSession.create(
      '/models/all-MiniLM-L6-v2.onnx',
      {
        executionProviders: ['wasm'], // or 'webgl' for GPU
        graphOptimizationLevel: 'all'
      }
    );
  }

  async embed(text: string): Promise<number[]> {
    // Tokenize input (see the tokenizer sketch after this class)
    const tokens = this.tokenize(text);

    // int64 inputs must be backed by BigInt64Array in onnxruntime-web
    const inputTensor = new ort.Tensor(
      'int64',
      BigInt64Array.from(tokens.map(BigInt)),
      [1, tokens.length]
    );
    const attentionMask = new ort.Tensor(
      'int64',
      new BigInt64Array(tokens.length).fill(1n),
      [1, tokens.length]
    );

    // Run inference
    const results = await this.session!.run({
      input_ids: inputTensor,
      attention_mask: attentionMask
    });

    // Return embeddings (the output name depends on how the model was exported)
    return Array.from(results.embeddings.data as Float32Array);
  }
}
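The tokenize step is model-specific and not shown above. In practice the vocabulary comes from the model's own tokenizer files (for BERT-style models like all-MiniLM-L6-v2, a WordPiece vocabulary). The following is only a sketch of greedy WordPiece tokenization, assuming a vocab map from token strings to ids with [CLS]/[SEP]/[UNK] special tokens; it skips the punctuation splitting and length truncation a production tokenizer needs.

// Minimal greedy WordPiece-style tokenizer (illustrative sketch)
function tokenizeWordPiece(text: string, vocab: Map<string, number>): number[] {
  const UNK = vocab.get('[UNK]')!;
  const ids: number[] = [vocab.get('[CLS]')!];

  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    let start = 0;
    const pieces: number[] = [];
    while (start < word.length) {
      // Greedily match the longest vocabulary piece at this position
      let end = word.length;
      let id: number | undefined;
      while (end > start) {
        const piece = (start > 0 ? '##' : '') + word.slice(start, end);
        const candidate = vocab.get(piece);
        if (candidate !== undefined) { id = candidate; break; }
        end--;
      }
      if (id === undefined) {
        // No piece matched: the whole word becomes a single [UNK] token
        pieces.length = 0;
        pieces.push(UNK);
        break;
      }
      pieces.push(id);
      start = end;
    }
    ids.push(...pieces);
  }

  ids.push(vocab.get('[SEP]')!);
  return ids;
}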
Quantized Models for Efficiency
Full-precision models are too large for local execution. We use quantization:
# Quantization process
from onnxruntime.quantization import QuantType, quantize_dynamic

# Apply dynamic int8 quantization; quantize_dynamic reads and writes the
# ONNX files directly, shrinking float32 weights to 8-bit integers
quantize_dynamic(
    'model.onnx',
    'model_quantized.onnx',
    weight_type=QuantType.QInt8,
    per_channel=True
)

Use Cases in Our Products
Hippo: Semantic File Search
// Indexing files with local embeddings
async function indexFile(file: File, model: LocalEmbedding) {
  // Extract text content
  const content = await extractText(file);

  // Split into chunks for embedding
  const chunks = splitIntoChunks(content, 512);

  // Generate embeddings locally
  const embeddings = await Promise.all(
    chunks.map(chunk => model.embed(chunk))
  );

  // Store in local vector database
  await vectorDb.upsert({
    id: file.path,
    embeddings,
    metadata: { name: file.name, type: file.type }
  });
}

// Searching with natural language
async function search(query: string, model: LocalEmbedding) {
  // Embed query locally
  const queryEmbedding = await model.embed(query);

  // Find similar documents
  const results = await vectorDb.search(queryEmbedding, { topK: 10 });
  return results;
}
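Under the hood, similarity search comes down to comparing embedding vectors. This is a minimal sketch of the cosine-similarity ranking our vector database performs, assuming embeddings are plain number arrays; it illustrates the math, not the actual vectorDb implementation.

// Cosine similarity between two embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-K ranking over indexed chunks
function rankChunks(
  query: number[],
  chunks: { id: string; embedding: number[] }[],
  topK = 10
) {
  return chunks
    .map(c => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}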
Chai.im: Local Message Summaries
// Summarize conversation without cloud
async function summarizeThread(messages: Message[]) {
  const model = await loadSummarizationModel();

  // Combine messages into context
  const context = messages
    .map(m => `${m.sender}: ${m.text}`)
    .join('\n');

  // Generate summary locally
  const summary = await model.summarize(context, {
    maxLength: 150,
    minLength: 50
  });
  return summary;
}

Shields AI: ML Threat Detection
// Threat detection runs locally
async function scoreDomain(domain: string) {
  const model = await loadThreatModel();

  // Extract domain features
  const features = extractDomainFeatures(domain);

  // Run inference locally
  const [score] = await model.predict(features);
  return score; // 0.0 - 1.0 threat probability
}
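The feature extraction step turns a raw domain string into a fixed-length numeric vector the model can score. Our exact feature set isn't shown here; the sketch below is a hypothetical extractDomainFeatures built from a few common lexical signals (length, subdomain depth, digit and hyphen ratios, character entropy).

// Hypothetical lexical features for a domain name
function extractDomainFeatures(domain: string): number[] {
  const name = domain.toLowerCase();
  const labels = name.split('.');

  // Shannon entropy of the character distribution (high for DGA-style domains)
  const counts = new Map<string, number>();
  for (const ch of name) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const c of counts.values()) {
    const p = c / name.length;
    entropy -= p * Math.log2(p);
  }

  const digits = (name.match(/[0-9]/g) ?? []).length;
  const hyphens = (name.match(/-/g) ?? []).length;

  return [
    name.length,            // overall length
    labels.length,          // subdomain depth
    digits / name.length,   // digit ratio
    hyphens / name.length,  // hyphen ratio
    entropy                 // character entropy
  ];
}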
Performance Optimizations
1. Model Caching
const modelCache = new Map<string, any>();

async function getModel(name: string) {
  if (modelCache.has(name)) {
    return modelCache.get(name);
  }
  const model = await loadModel(name);
  modelCache.set(name, model);
  return model;
}

2. Batch Processing
// Process multiple inputs efficiently
async function embedBatch(texts: string[], model: LocalEmbedding) {
  // Batch size depends on device memory
  const batchSize = 32;
  const results: number[][] = [];

  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const embeddings = await model.embedBatch(batch);
    results.push(...embeddings);
  }
  return results;
}

3. Web Worker Offloading
// Run inference in a background thread
const worker = new Worker('/workers/ml-worker.js');

function inferenceAsync(input: any): Promise<any> {
  return new Promise((resolve) => {
    const id = crypto.randomUUID();

    // Use addEventListener so concurrent requests don't clobber
    // each other's handlers; detach once our response arrives
    const onResult = (e: MessageEvent) => {
      if (e.data.id === id) {
        worker.removeEventListener('message', onResult);
        resolve(e.data.result);
      }
    };
    worker.addEventListener('message', onResult);
    worker.postMessage({ id, input });
  });
}
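The worker side of this protocol isn't shown above. A minimal sketch of what /workers/ml-worker.js might contain, where runInference stands in for whatever model call the worker wraps:

// ml-worker.js (sketch): receive { id, input }, answer with the matching id
self.onmessage = async (e) => {
  const { id, input } = e.data;
  const result = await runInference(input); // hypothetical model call
  // Echo the id back so inferenceAsync can match the response to its request
  self.postMessage({ id, result });
};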
4. GPU Acceleration (WebGL/WebGPU)
// Use GPU when available
const session = await ort.InferenceSession.create(modelPath, {
  executionProviders: [
    'webgpu', // Best performance
    'webgl',  // Good fallback
    'wasm'    // Universal fallback
  ]
});

Benchmarks
We benchmark three on-device workloads: embedding generation, summarization, and threat classification.
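For reference, here is a simple sketch of how per-inference latency can be measured in the browser, assuming the LocalEmbedding class above. The warm-up call matters because the first inference pays one-time compilation and allocation costs.

// Measure average embedding latency over N runs
async function benchmarkEmbedding(model: LocalEmbedding, runs = 50) {
  await model.embed('warm-up text'); // exclude one-time startup cost

  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    await model.embed(`sample query number ${i}`);
  }
  const elapsed = performance.now() - start;
  return elapsed / runs; // average milliseconds per embedding
}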
Limitations and Trade-offs
What Works Well Locally
- Text embeddings (semantic search)
- Classification (spam, threats)
- Short summarization
- Entity extraction
- Sentiment analysis
What Still Needs Cloud (For Now)
- Large language model chat (GPT-4 scale)
- Image generation
- Long document analysis
- Complex reasoning
Trade-offs
Running models locally means shipping model files to the device, spending local CPU, GPU, memory, and battery, and accepting smaller models than the largest cloud offerings. In exchange we get privacy, offline operation, and zero per-request cost.
Future Directions
More Capable Local Models
- Quantized LLMs (Llama 2 7B runs on phones)
- Smaller, specialized models
- Better WebGPU support
On-Device Training
- Personalized models without data leaving device
- Federated learning approaches
- Continuous improvement from usage
Hardware Acceleration
- Apple Neural Engine support
- Android NPU utilization
- WebNN standard adoption
Conclusion
Local AI processing lets us deliver intelligent features without compromising privacy. By using WebAssembly, ONNX Runtime, and quantized models, we run powerful AI directly on your device—no cloud required.
The technology continues to improve. Every year, we can run more capable models locally. The future is local-first AI.