Local AI Processing: Privacy Without Sacrifice
Bringing powerful AI capabilities to your device. How we use WebAssembly and local models to deliver intelligent features without data collection.
Cloud AI is powerful but comes with a cost: your data leaves your device and goes to someone else's servers. At GrepLabs, we've built AI features that run entirely locally. Here's how we make it work without sacrificing capability.
The Cloud AI Problem
When you use cloud AI services, you typically:
- Send your data to remote servers
- Trust the provider with your information
- Accept latency from network round-trips
- Pay per API call
For privacy-conscious users and enterprises, this is unacceptable for sensitive data.
Our Approach: Local AI
We run AI models directly on your device:
- **No data transmission**: Your information never leaves your device
- **No API costs**: Use as much as you want
- **No network latency**: Responses don't wait on a round-trip to a server
- **Works offline**: No internet required
Technical Implementation
WebAssembly for Cross-Platform Execution
WebAssembly (Wasm) lets us run high-performance code in browsers and native apps:
// Loading a Wasm ML model
async function loadModel() {
  // The module imports its linear memory, so keep a reference to it here
  const memory = new WebAssembly.Memory({ initial: 256 });
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch('/models/embedding.wasm'),
    { env: { memory } }
  );
  const exports = instance.exports as any;

  return {
    embed: (text: string): Float32Array => {
      const bytes = new TextEncoder().encode(text);

      // Copy input into Wasm memory
      const inputPtr = exports.allocate(bytes.length);
      new Uint8Array(memory.buffer).set(bytes, inputPtr);

      // Run inference
      const outputPtr = exports.embed(inputPtr, bytes.length);

      // Read output embeddings (384 = embedding dimension);
      // copy them out before Wasm memory is reused or grows
      return new Float32Array(memory.buffer, outputPtr, 384).slice();
    }
  };
}
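For illustration, calling the loader looks something like this (the allocate and embed export names are whatever the Wasm module was compiled to expose, so treat them as assumptions):

// Hypothetical usage of the loader above
const model = await loadModel();
const vector = model.embed('quarterly revenue forecast');
console.log(vector.length); // 384 dimensions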
ONNX Runtime for Model Execution
ONNX (Open Neural Network Exchange) provides a standard format that runs anywhere:
import * as ort from 'onnxruntime-web';

class LocalEmbedding {
  private session: ort.InferenceSession | null = null;

  async initialize() {
    // Load model from a local file
    this.session = await ort.InferenceSession.create(
      '/models/all-MiniLM-L6-v2.onnx',
      {
        executionProviders: ['wasm'], // or 'webgl' for GPU
        graphOptimizationLevel: 'all'
      }
    );
  }

  async embed(text: string): Promise<number[]> {
    // Tokenize input (see the tokenizer sketch after this class)
    const tokens = this.tokenize(text);

    // int64 inputs must be backed by BigInt64Array in onnxruntime-web
    const inputTensor = new ort.Tensor(
      'int64',
      BigInt64Array.from(tokens.map(BigInt)),
      [1, tokens.length]
    );
    const attentionMask = new ort.Tensor(
      'int64',
      new BigInt64Array(tokens.length).fill(1n),
      [1, tokens.length]
    );

    // Run inference
    const results = await this.session!.run({
      input_ids: inputTensor,
      attention_mask: attentionMask
    });

    // Return embeddings (the output name depends on how the model was exported)
    return Array.from(results.embeddings.data as Float32Array);
  }
}
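The tokenize step is model-specific and not shown above. In practice the vocabulary comes from the model's own tokenizer files (for BERT-style models like all-MiniLM-L6-v2, a WordPiece vocabulary). The following is only a sketch of greedy WordPiece tokenization, assuming a vocab map from token strings to ids with [CLS]/[SEP]/[UNK] special tokens; it skips the punctuation splitting and length truncation a production tokenizer needs.

// Minimal greedy WordPiece-style tokenizer (illustrative sketch)
function tokenizeWordPiece(text: string, vocab: Map<string, number>): number[] {
  const UNK = vocab.get('[UNK]')!;
  const ids: number[] = [vocab.get('[CLS]')!];

  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    let start = 0;
    const pieces: number[] = [];
    while (start < word.length) {
      // Greedily match the longest vocabulary piece at this position
      let end = word.length;
      let id: number | undefined;
      while (end > start) {
        const piece = (start > 0 ? '##' : '') + word.slice(start, end);
        const candidate = vocab.get(piece);
        if (candidate !== undefined) { id = candidate; break; }
        end--;
      }
      if (id === undefined) {
        // No piece matched: the whole word becomes a single [UNK] token
        pieces.length = 0;
        pieces.push(UNK);
        break;
      }
      pieces.push(id);
      start = end;
    }
    ids.push(...pieces);
  }

  ids.push(vocab.get('[SEP]')!);
  return ids;
}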
Quantized Models for Efficiency
Full-precision models are too large for local execution. We use quantization:
# Quantization process
from onnxruntime.quantization import QuantType, quantize_dynamic

# Apply dynamic int8 quantization; quantize_dynamic reads and writes the
# ONNX files directly, shrinking float32 weights to 8-bit integers
quantize_dynamic(
    'model.onnx',
    'model_quantized.onnx',
    weight_type=QuantType.QInt8,
    per_channel=True
)

Use Cases in Our Products
Hippo: Semantic File Search
// Indexing files with local embeddings
async function indexFile(file: File, model: LocalEmbedding) {
  // Extract text content
  const content = await extractText(file);

  // Split into chunks for embedding
  const chunks = splitIntoChunks(content, 512);

  // Generate embeddings locally
  const embeddings = await Promise.all(
    chunks.map(chunk => model.embed(chunk))
  );

  // Store in local vector database
  await vectorDb.upsert({
    id: file.path,
    embeddings,
    metadata: { name: file.name, type: file.type }
  });
}

// Searching with natural language
async function search(query: string, model: LocalEmbedding) {
  // Embed query locally
  const queryEmbedding = await model.embed(query);

  // Find similar documents
  const results = await vectorDb.search(queryEmbedding, { topK: 10 });
  return results;
}
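Under the hood, similarity search comes down to comparing embedding vectors. This is a minimal sketch of the cosine-similarity ranking our vector database performs, assuming embeddings are plain number arrays; it illustrates the math, not the actual vectorDb implementation.

// Cosine similarity between two embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-K ranking over indexed chunks
function rankChunks(
  query: number[],
  chunks: { id: string; embedding: number[] }[],
  topK = 10
) {
  return chunks
    .map(c => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}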
Chai.im: Local Message Summaries
// Summarize conversation without cloud
async function summarizeThread(messages: Message[]) {
  const model = await loadSummarizationModel();

  // Combine messages into context
  const context = messages
    .map(m => `${m.sender}: ${m.text}`)
    .join('\n');

  // Generate summary locally
  const summary = await model.summarize(context, {
    maxLength: 150,
    minLength: 50
  });
  return summary;
}

Shields AI: ML Threat Detection
// Threat detection runs locally
async function scoreDomain(domain: string) {
  const model = await loadThreatModel();

  // Extract domain features
  const features = extractDomainFeatures(domain);

  // Run inference locally
  const [score] = await model.predict(features);
  return score; // 0.0 - 1.0 threat probability
}
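The feature extraction step turns a raw domain string into a fixed-length numeric vector the model can score. Our exact feature set isn't shown here; the sketch below is a hypothetical extractDomainFeatures built from a few common lexical signals (length, subdomain depth, digit and hyphen ratios, character entropy).

// Hypothetical lexical features for a domain name
function extractDomainFeatures(domain: string): number[] {
  const name = domain.toLowerCase();
  const labels = name.split('.');

  // Shannon entropy of the character distribution (high for DGA-style domains)
  const counts = new Map<string, number>();
  for (const ch of name) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const c of counts.values()) {
    const p = c / name.length;
    entropy -= p * Math.log2(p);
  }

  const digits = (name.match(/[0-9]/g) ?? []).length;
  const hyphens = (name.match(/-/g) ?? []).length;

  return [
    name.length,            // overall length
    labels.length,          // subdomain depth
    digits / name.length,   // digit ratio
    hyphens / name.length,  // hyphen ratio
    entropy                 // character entropy
  ];
}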
Performance Optimizations
1. Model Caching
const modelCache = new Map<string, any>();

async function getModel(name: string) {
  if (modelCache.has(name)) {
    return modelCache.get(name);
  }
  const model = await loadModel(name);
  modelCache.set(name, model);
  return model;
}

2. Batch Processing
// Process multiple inputs efficiently
async function embedBatch(texts: string[], model: LocalEmbedding) {
  // Batch size depends on device memory
  const batchSize = 32;
  const results: number[][] = [];

  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const embeddings = await model.embedBatch(batch);
    results.push(...embeddings);
  }
  return results;
}

3. Web Worker Offloading
// Run inference in a background thread
const worker = new Worker('/workers/ml-worker.js');

function inferenceAsync(input: any): Promise<any> {
  return new Promise((resolve) => {
    const id = crypto.randomUUID();

    // Use addEventListener so concurrent requests don't clobber
    // each other's handlers; detach once our response arrives
    const onResult = (e: MessageEvent) => {
      if (e.data.id === id) {
        worker.removeEventListener('message', onResult);
        resolve(e.data.result);
      }
    };
    worker.addEventListener('message', onResult);
    worker.postMessage({ id, input });
  });
}
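The worker side of this protocol isn't shown above. A minimal sketch of what /workers/ml-worker.js might contain, where runInference stands in for whatever model call the worker wraps:

// ml-worker.js (sketch): receive { id, input }, answer with the matching id
self.onmessage = async (e) => {
  const { id, input } = e.data;
  const result = await runInference(input); // hypothetical model call
  // Echo the id back so inferenceAsync can match the response to its request
  self.postMessage({ id, result });
};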
4. GPU Acceleration (WebGL/WebGPU)
// Use GPU when available
const session = await ort.InferenceSession.create(modelPath, {
  executionProviders: [
    'webgpu', // Best performance
    'webgl',  // Good fallback
    'wasm'    // Universal fallback
  ]
});

Benchmarks
We benchmark three on-device workloads: embedding generation, summarization, and threat classification.
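For reference, here is a simple sketch of how per-inference latency can be measured in the browser, assuming the LocalEmbedding class above. The warm-up call matters because the first inference pays one-time compilation and allocation costs.

// Measure average embedding latency over N runs
async function benchmarkEmbedding(model: LocalEmbedding, runs = 50) {
  await model.embed('warm-up text'); // exclude one-time startup cost

  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    await model.embed(`sample query number ${i}`);
  }
  const elapsed = performance.now() - start;
  return elapsed / runs; // average milliseconds per embedding
}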
Limitations and Trade-offs
What Works Well Locally
- Text embeddings (semantic search)
- Classification (spam, threats)
- Short summarization
- Entity extraction
- Sentiment analysis
What Still Needs Cloud (For Now)
- Large language model chat (GPT-4 scale)
- Image generation
- Long document analysis
- Complex reasoning
Trade-offs
Running models locally means shipping model files to the device, spending local CPU, GPU, memory, and battery, and accepting smaller models than the largest cloud offerings. In exchange we get privacy, offline operation, and zero per-request cost.
Future Directions
More Capable Local Models
- Quantized LLMs (Llama 2 7B runs on phones)
- Smaller, specialized models
- Better WebGPU support
On-Device Training
- Personalized models without data leaving device
- Federated learning approaches
- Continuous improvement from usage
Hardware Acceleration
- Apple Neural Engine support
- Android NPU utilization
- WebNN standard adoption
Conclusion
Local AI processing lets us deliver intelligent features without compromising privacy. By using WebAssembly, ONNX Runtime, and quantized models, we run powerful AI directly on your device—no cloud required.
The technology continues to improve. Every year, we can run more capable models locally. The future is local-first AI.