Technology

Hippo: Building Semantic Search for 100K+ Files

How we built a file organizer that understands meaning, not just keywords. Vector embeddings, local processing, and instant results.

GrepLabs Team
August 5, 2025
13 min read

Traditional file search is broken. You type keywords and hope for the best. Hippo is different—it understands what you're looking for, even if you don't use the exact words. Here's how we built semantic search that works across 100,000+ files with sub-100ms latency.

The Problem with Keyword Search

You have a file called "Q3 Financial Report.pdf" but search for "third quarter revenue"—no results. You look for "meeting notes from Sarah" but the file is named "2024-03-15-standup.md"—nothing.

Keyword search fails because:

  • It matches strings, not meaning
  • You must remember exact file names
  • Related documents don't surface
  • Synonyms don't work

Semantic Search: Understanding Meaning

Semantic search transforms both your query and documents into numerical representations (embeddings) that capture meaning. Similar concepts have similar embeddings, regardless of exact words used.

"Q3 revenue"          →  [0.23, -0.15, 0.87, ...]
"third quarter sales" →  [0.25, -0.14, 0.89, ...]  ← Similar!
"vacation photos"     →  [0.91, 0.32, -0.45, ...]  ← Different

Hippo's Architecture

┌─────────────────────────────────────────────────────────────┐
│                       Hippo Desktop                          │
├─────────────────────────────────────────────────────────────┤
│  ┌───────────┐    ┌───────────┐    ┌───────────┐           │
│  │   File    │───→│   Text    │───→│  Vector   │           │
│  │  Watcher  │    │  Extract  │    │ Embedding │           │
│  └───────────┘    └───────────┘    └───────────┘           │
│                                           │                  │
│                                           ▼                  │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐           │
│  │  Search   │←───│   HNSW    │←───│  SQLite   │           │
│  │    UI     │    │   Index   │    │  + VSS    │           │
│  └───────────┘    └───────────┘    └───────────┘           │
└─────────────────────────────────────────────────────────────┘
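
Each component is covered in detail below. To give a feel for how they fit together, here is a simplified sketch of indexing a single file; the helper functions are defined in the sections that follow, and next_id is a hypothetical ID allocator.

// Illustrative end-to-end flow for one file.
fn index_file(
    path: &Path,
    model: &EmbeddingModel,
    conn: &Connection,
    index: &mut VectorIndex,
) -> Result<()> {
    let text = extract_text(path)?;              // text extraction
    let chunks = chunk_document(&text, 512, 50); // chunking

    for (i, chunk) in chunks.iter().enumerate() {
        let chunk_path = format!("{}#{}", path.display(), i);
        let embedding = model.embed(chunk);                     // local embedding
        insert_document(conn, &chunk_path, chunk, &embedding);  // SQLite storage
        index.add(next_id(), chunk_path, embedding);            // HNSW index
    }
    Ok(())
}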

Component 1: File Watcher

We use native OS APIs to detect file changes in real-time:

use std::path::Path;
use std::sync::mpsc::Sender;
use std::time::Duration;

use notify::{watcher, DebouncedEvent, RecommendedWatcher, RecursiveMode, Watcher};

fn watch_directory(path: &Path, tx: Sender<DebouncedEvent>) -> RecommendedWatcher {
    // Debounce events so a burst of saves triggers a single re-index
    let mut watcher = watcher(tx, Duration::from_secs(1)).unwrap();

    watcher.watch(path, RecursiveMode::Recursive).unwrap();

    // The returned watcher must stay alive for events to keep flowing.
    // Events drive the indexing pipeline:
    // - DebouncedEvent::Create → index new file
    // - DebouncedEvent::Write  → re-index
    // - DebouncedEvent::Remove → remove from index
    watcher
}
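
On the receiving end, a dispatch loop drains the channel and drives the pipeline. A minimal sketch, assuming an Indexer type with index_file and remove_file methods (the names are illustrative):

use std::sync::mpsc::Receiver;

use notify::DebouncedEvent;

// Drain watcher events and drive the indexing pipeline.
fn run_event_loop(rx: Receiver<DebouncedEvent>, indexer: &mut Indexer) {
    for event in rx {
        match event {
            DebouncedEvent::Create(path) | DebouncedEvent::Write(path) => {
                if let Err(e) = indexer.index_file(&path) {
                    eprintln!("failed to index {}: {e}", path.display());
                }
            }
            DebouncedEvent::Remove(path) => indexer.remove_file(&path),
            _ => {} // renames, rescans, and errors are handled elsewhere
        }
    }
}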

Component 2: Text Extraction

Different file types need different extractors:

fn extract_text(path: &Path) -> Result<String> {
    let extension = path.extension().unwrap_or_default();

    match extension.to_str() {
        Some("pdf") => extract_pdf(path),
        Some("docx") => extract_docx(path),
        Some("md") | Some("txt") => read_text(path),
        Some("html") => extract_html(path),
        Some("xlsx") => extract_excel(path),
        Some("pptx") => extract_powerpoint(path),
        Some("jpg") | Some("png") => ocr_image(path),
        _ => Err(UnsupportedFormat)
    }
}

// PDF extraction with poppler
fn extract_pdf(path: &Path) -> Result<String> {
    let doc = poppler::Document::from_file(path)?;
    let mut text = String::new();

    for i in 0..doc.n_pages() {
        let page = doc.page(i)?;
        text.push_str(&page.text()?);
        text.push('\n');
    }

    Ok(text)
}
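
The simpler formats need far less machinery. A sketch of the plain-text and HTML paths, assuming the scraper crate for HTML parsing (the exact crate choice is an implementation detail):

use std::fs;
use std::path::Path;

// Markdown and plain text: read the file as-is.
fn read_text(path: &Path) -> Result<String> {
    Ok(fs::read_to_string(path)?)
}

// HTML: parse the document and keep only the visible text nodes.
fn extract_html(path: &Path) -> Result<String> {
    let html = fs::read_to_string(path)?;
    let document = scraper::Html::parse_document(&html);
    let text: Vec<&str> = document.root_element().text().collect();
    Ok(text.join(" "))
}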

Component 3: Vector Embeddings

We use a local embedding model (no cloud required):

use ndarray::Array;
use ort::{Environment, Session};
use tokenizers::Tokenizer;

struct EmbeddingModel {
    session: Session,
    tokenizer: Tokenizer,
}

impl EmbeddingModel {
    fn new() -> Self {
        let environment = Environment::builder()
            .with_name("hippo")
            .build()
            .unwrap();

        let session = Session::builder(&environment)
            .with_model_from_file("models/all-MiniLM-L6-v2.onnx")
            .unwrap();

        Self {
            session,
            tokenizer: Tokenizer::from_file("models/tokenizer.json").unwrap(),
        }
    }

    fn embed(&self, text: &str) -> Vec<f32> {
        // Tokenize
        let encoding = self.tokenizer.encode(text, true).unwrap();

        // Prepare input tensors
        let input_ids: Vec<i64> = encoding.get_ids()
            .iter()
            .map(|&x| x as i64)
            .collect();

        // Run inference
        let outputs = self.session.run(vec![
            Array::from_shape_vec((1, input_ids.len()), input_ids).unwrap(),
        ]).unwrap();

        // Mean pooling
        mean_pool(&outputs[0])
    }
}
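
mean_pool collapses the per-token vectors into a single document-level vector. A minimal sketch over a (tokens × dim) ndarray; extracting that array from the ort output is simplified away here, and production code would also mask out padding tokens using the attention mask:

use ndarray::{Array2, Axis};

// Average the token embeddings along the sequence axis, then L2-normalize
// so cosine similarity reduces to a dot product.
fn mean_pool(token_embeddings: &Array2<f32>) -> Vec<f32> {
    let mean = token_embeddings
        .mean_axis(Axis(0))
        .expect("at least one token");

    let norm = mean.dot(&mean).sqrt().max(1e-12);
    mean.iter().map(|x| x / norm).collect()
}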

Component 4: Vector Storage with SQLite

We store embeddings in SQLite with the vector search extension:

use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

use rusqlite::{params, Connection};

fn create_schema(conn: &Connection) {
    conn.execute_batch("
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            path TEXT UNIQUE NOT NULL,
            name TEXT NOT NULL,
            content TEXT,
            embedding BLOB NOT NULL,
            created_at INTEGER,
            modified_at INTEGER,
            file_type TEXT,
            file_size INTEGER
        );

        CREATE INDEX IF NOT EXISTS idx_path ON documents(path);
        CREATE INDEX IF NOT EXISTS idx_type ON documents(file_type);
    ").unwrap();
}

fn insert_document(
    conn: &Connection,
    path: &str,
    content: &str,
    embedding: &[f32]
) {
    let embedding_bytes: Vec<u8> = embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    conn.execute(
        "INSERT OR REPLACE INTO documents
         (path, name, content, embedding, modified_at)
         VALUES (?1, ?2, ?3, ?4, ?5)",
        params![
            path,
            Path::new(path).file_name().unwrap().to_str(),
            content,
            embedding_bytes,
            SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
        ],
    ).unwrap();
}
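
Reading an embedding back is the mirror image: reinterpret the little-endian BLOB as f32s. A minimal sketch:

fn embedding_from_bytes(bytes: &[u8]) -> Vec<f32> {
    // Each f32 was written as 4 little-endian bytes in insert_document
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn load_embedding(conn: &Connection, path: &str) -> rusqlite::Result<Vec<f32>> {
    let bytes: Vec<u8> = conn.query_row(
        "SELECT embedding FROM documents WHERE path = ?1",
        [path],
        |row| row.get(0),
    )?;
    Ok(embedding_from_bytes(&bytes))
}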

Component 5: HNSW Index for Fast Search

For 100K+ documents, we need efficient approximate nearest neighbor search:

use std::collections::HashMap;

use hnsw::Hnsw;

// Our own result type: a matched file path plus a similarity score.
struct SearchResult {
    path: String,
    score: f32,
}

struct VectorIndex {
    hnsw: Hnsw<f32, DistCosine>,
    id_to_path: HashMap<usize, String>,
}

impl VectorIndex {
    fn new() -> Self {
        Self {
            hnsw: Hnsw::new(16, 200, 12, 200, DistCosine),
            id_to_path: HashMap::new(),
        }
    }

    fn add(&mut self, id: usize, path: String, embedding: Vec<f32>) {
        self.hnsw.insert(embedding, id);
        self.id_to_path.insert(id, path);
    }

    fn search(&self, query: &[f32], k: usize) -> Vec<SearchResult> {
        let results = self.hnsw.search(query, k, 200);

        results
            .into_iter()
            .map(|r| SearchResult {
                path: self.id_to_path[&r.id].clone(),
                score: 1.0 - r.distance,  // Convert distance to similarity
            })
            .collect()
    }
}
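
At startup, one way to rebuild the in-memory index from SQLite, reusing embedding_from_bytes from above:

use rusqlite::Connection;

// Rebuild the HNSW index from the documents table.
fn build_index(conn: &Connection) -> rusqlite::Result<VectorIndex> {
    let mut index = VectorIndex::new();

    let mut stmt = conn.prepare("SELECT id, path, embedding FROM documents")?;
    let rows = stmt.query_map([], |row| {
        Ok((
            row.get::<_, i64>(0)? as usize,
            row.get::<_, String>(1)?,
            row.get::<_, Vec<u8>>(2)?,
        ))
    })?;

    for row in rows {
        let (id, path, blob) = row?;
        index.add(id, path, embedding_from_bytes(&blob));
    }

    Ok(index)
}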

Chunking Strategy

Long documents need to be split into chunks for effective embedding:

fn chunk_document(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut chunks = Vec::new();

    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));

        if end == words.len() {
            break; // the tail would only repeat words already covered
        }
        start += chunk_size - overlap;
    }

    chunks
}

// For a 10-page PDF:
// - Chunk size: 512 tokens (~400 words)
// - Overlap: 50 tokens
// - Results in ~25-30 chunks per document
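
A quick usage example (report.txt is just a stand-in for any extracted text). Neighbouring chunks share a 50-word overlap, so a sentence that straddles a boundary still appears whole in at least one chunk:

fn main() {
    let text = std::fs::read_to_string("report.txt").unwrap();

    let chunks = chunk_document(&text, 512, 50);
    for (i, chunk) in chunks.iter().enumerate() {
        println!("chunk {i}: {} words", chunk.split_whitespace().count());
    }
}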

Search Algorithm

struct SearchEngine {
    model: EmbeddingModel,
    index: VectorIndex,
    db: Connection,
}

impl SearchEngine {
    fn search(&self, query: &str, limit: usize) -> Vec<SearchResult> {
        // 1. Embed the query
        let query_embedding = self.model.embed(query);

        // 2. Find similar chunks via HNSW
        let candidates = self.index.search(&query_embedding, limit * 3);

        // 3. Deduplicate by document
        let unique_docs = deduplicate_by_path(&candidates);

        // 4. Re-rank with full document context
        let ranked = self.rerank(&query, &unique_docs);

        // 5. Return top results
        ranked.into_iter().take(limit).collect()
    }

    fn rerank(&self, query: &str, docs: &[Document]) -> Vec<SearchResult> {
        // Simple keyword boost when the query appears verbatim in the document
        docs.iter().map(|doc| {
            let semantic_score = doc.similarity;
            let keyword_boost = if doc.content.contains(query) { 0.1 } else { 0.0 };

            SearchResult {
                path: doc.path.clone(),
                score: semantic_score + keyword_boost,
            }
        }).collect()
    }
}
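
deduplicate_by_path collapses multiple chunk hits from the same file into a single candidate. A simplified sketch that keeps the best-scoring chunk per document; load_document (hypothetical) pulls the full row back out of SQLite:

use std::collections::HashMap;

// Collapse chunk-level hits ("report.pdf#0", "report.pdf#3", ...) into one
// candidate per file, keeping the best chunk score.
fn deduplicate_by_path(candidates: &[SearchResult]) -> Vec<Document> {
    let mut best: HashMap<String, f32> = HashMap::new();

    for hit in candidates {
        // Strip the "#chunk" suffix added when chunks were inserted
        let doc_path = hit.path.split('#').next().unwrap().to_string();
        let entry = best.entry(doc_path).or_insert(f32::MIN);
        *entry = entry.max(hit.score);
    }

    best.into_iter()
        .map(|(path, similarity)| load_document(&path, similarity))
        .collect()
}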

Performance Optimizations

1. Incremental Indexing

Only index changed files:

fn should_index(path: &Path, db: &Connection) -> bool {
    let modified = fs::metadata(path)
        .unwrap()
        .modified()
        .unwrap();

    let stored_time: Option<u64> = db.query_row(
        "SELECT modified_at FROM documents WHERE path = ?",
        [path.to_str()],
        |row| row.get(0)
    ).ok();

    match stored_time {
        Some(t) => modified > UNIX_EPOCH + Duration::from_secs(t),
        None => true,  // New file
    }
}

2. Batch Processing

use std::path::PathBuf;

use rayon::prelude::*;

fn index_batch(
    files: Vec<PathBuf>,
    model: &EmbeddingModel,
    conn: &Connection,
) -> Result<()> {
    // Extract, chunk, and embed in parallel across CPU cores
    let results: Vec<_> = files
        .par_iter()
        .filter_map(|f| {
            let text = extract_text(f).ok()?;
            let chunks = chunk_document(&text, 512, 50);
            let embeddings: Vec<Vec<f32>> = chunks
                .iter()
                .map(|c| model.embed(c))
                .collect();
            Some((f, chunks, embeddings))
        })
        .collect();

    // Batch insert into the database, one row per chunk
    let mut stmt = conn.prepare(
        "INSERT INTO documents (path, content, embedding) VALUES (?1, ?2, ?3)"
    )?;

    for (path, chunks, embeddings) in results {
        for (i, (chunk, emb)) in chunks.iter().zip(&embeddings).enumerate() {
            let emb_bytes: Vec<u8> = emb.iter().flat_map(|f| f.to_le_bytes()).collect();
            stmt.execute(params![
                format!("{}#{}", path.display(), i),  // chunk-level path key
                chunk,
                emb_bytes
            ])?;
        }
    }

    Ok(())
}

3. Memory-Mapped Index

// Use memory-mapped files for large indexes
use memmap2::Mmap;

struct MappedIndex {
    mmap: Mmap,
    header: IndexHeader,
}

impl MappedIndex {
    fn load(path: &Path) -> Self {
        let file = File::open(path).unwrap();
        let mmap = unsafe { Mmap::map(&file).unwrap() };
        let header = IndexHeader::from_bytes(&mmap[..64]);

        Self { mmap, header }
    }

    fn get_vector(&self, id: usize) -> &[f32] {
        let offset = 64 + id * self.header.dim * 4;
        let bytes = &self.mmap[offset..offset + self.header.dim * 4];

        unsafe {
            std::slice::from_raw_parts(
                bytes.as_ptr() as *const f32,
                self.header.dim
            )
        }
    }
}
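
IndexHeader is a small custom preamble occupying the first 64 bytes of the index file. One possible layout, sketched here with the embedding dimension and vector count stored as little-endian u32s and the rest reserved:

struct IndexHeader {
    dim: usize,   // embedding dimension (384 for all-MiniLM-L6-v2)
    count: usize, // number of stored vectors
}

impl IndexHeader {
    fn from_bytes(bytes: &[u8]) -> Self {
        let dim = u32::from_le_bytes(bytes[0..4].try_into().unwrap()) as usize;
        let count = u32::from_le_bytes(bytes[4..8].try_into().unwrap()) as usize;
        Self { dim, count }
    }
}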

Real-World Performance

The three numbers we track are indexing speed, search latency, and memory usage. Even with 100,000+ files indexed, queries return in under 100ms, entirely on-device.

The User Experience

All this complexity is hidden behind a simple interface:

┌─────────────────────────────────────────────────────────────┐
│  🔍 Search your files...                                    │
│  ┌─────────────────────────────────────────────────────────┐│
│  │ meeting notes about the product launch                  ││
│  └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  📄 Product Launch Planning.docx                    98%     │
│     "...discussed timeline for product launch..."           │
│                                                             │
│  📧 Re: Launch Date Confirmation.eml               94%     │
│     "...confirming the launch date for..."                  │
│                                                             │
│  📝 standup-2024-03-20.md                          89%     │
│     "...sprint planning for launch preparation..."          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Users don't need to know about embeddings, HNSW, or vector databases. They just type what they're looking for and get results instantly.

Conclusion

Building semantic search that's fast, private, and accurate required solving challenges at every layer:

  • Efficient file watching and text extraction
  • Local embedding generation with quantized models
  • High-performance vector indexing with HNSW
  • Smart chunking and re-ranking strategies

The result: search that understands meaning, runs entirely on your device, and returns results in under 100ms—even with 100,000+ files.


Ready to organize your files? Try Hippo free for up to 10K files.

Tags
Search · Vector Database · AI · Local-First · Rust