Local AI Embeddings

WDG uses local embedding models for semantic code search. After initial setup, indexing runs automatically whenever you commit code; no manual intervention is required.

How It Works

Traditional (Keyword) Search:

  • Searches for exact text matches
  • Misses similar concepts with different wording
  • Can't understand code context

Semantic (AI) Search:

  • Understands meaning and context
  • Finds similar code patterns
  • Recognizes related concepts
  • Works across languages (PHP, JS, CSS)

How Local Embeddings Work

%%{init: {'theme':'neutral'}}%%
graph LR
    Code[Your Code] --> Model[Local AI Model]
    Model --> Vectors[384-dim Vectors]
    Vectors --> Qdrant[(Qdrant DB)]

    Query[Search Query] --> Model2[Same Model]
    Model2 --> QVector[Query Vector]
    QVector --> Qdrant
    Qdrant --> Results[Similar Code]

  1. Code Indexing: Your code is processed by a local AI model
  2. Vector Generation: Each code chunk becomes a fixed-length vector (384 dimensions with the default model)
  3. Storage: Vectors are stored in the Qdrant database
  4. Search: Queries are converted to vectors by the same model and compared against the stored vectors
  5. Results: Most similar code chunks are returned
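
Steps 4 and 5 come down to comparing vectors. The following is a minimal sketch, assuming cosine similarity as the metric (the metric WDG configures in Qdrant is not stated here) and using toy 3-dimensional vectors in place of real embeddings. The function names and vector values are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), name)
              for name, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, round(score, 3)) for score, name in scored[:k]]


# Toy vectors standing in for real embeddings (assumed values).
stored_chunks = {
    "getUserById": [0.9, 0.1, 0.0],
    "validateEmailAddress": [0.0, 0.2, 0.9],
    "fetch_user_by_identifier": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of "get user by ID"

results = top_k(query, stored_chunks)
print(results)
```

In practice the vector database performs this ranking with an index rather than a full scan, but the idea is the same: the closest vectors win.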

Available Models

Configure in .env:

bash
EMBEDDING_MODEL=all-MiniLM-L6-v2

Model Comparison

| Model | Speed | Quality | Size | Dimensions | Use Case |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | ⚡⚡⚡ | ★★★☆☆ | 80MB | 384 | Default - Fast indexing |
| all-mpnet-base-v2 | ⚡⚡☆ | ★★★★☆ | 420MB | 768 | Better accuracy |
| all-MiniLM-L12-v2 | ⚡⚡⚡ | ★★★☆☆ | 120MB | 384 | Balanced |
| all-distilroberta-v1 | ⚡⚡☆ | ★★★★☆ | 290MB | 768 | High quality |
| multi-qa-MiniLM-L6-cos-v1 | ⚡⚡⚡ | ★★★★☆ | 80MB | 384 | Q&A optimized |
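
The Dimensions column also affects how much space the index needs. As a rough sketch, assuming each vector value is stored as a 32-bit float (4 bytes) and ignoring metadata and index overhead, the raw vector size for the roughly 15,000 vectors of the Wikit framework can be estimated like this. The helper name is hypothetical.

```python
def estimate_vector_storage_mb(num_vectors, dimensions, bytes_per_value=4):
    """Approximate raw size of the vectors alone, in megabytes (10**6 bytes)."""
    return num_vectors * dimensions * bytes_per_value / 1_000_000


# Compare the two vector sizes that appear in the table.
for dims in (384, 768):
    print(dims, estimate_vector_storage_mb(15_000, dims))
```

Doubling the dimensions doubles the raw vector storage, which is one reason the 384-dimensional models are the lighter choice.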

Switching Models

⚠️ WARNING

Changing models requires re-indexing all code!

bash
# 1. Update .env
EMBEDDING_MODEL=all-mpnet-base-v2

# 2. Restart indexer service to load new model
docker compose restart indexer

# 3. Re-index Wikit framework
wdg index

# 4. Re-index your projects
wdg index my-project
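
Re-indexing is needed because vectors from different models are not comparable, and some models also produce a different vector size. The sketch below illustrates that reasoning using the dimensions from the model comparison table. The function is hypothetical and not part of the wdg CLI.

```python
# Output dimensions taken from the model comparison table.
MODEL_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "all-MiniLM-L12-v2": 384,
    "all-distilroberta-v1": 768,
    "multi-qa-MiniLM-L6-cos-v1": 384,
}


def switch_impact(old_model, new_model):
    """Describe why switching from old_model to new_model needs a re-index."""
    if old_model == new_model:
        return "no change"
    old_dims = MODEL_DIMENSIONS[old_model]
    new_dims = MODEL_DIMENSIONS[new_model]
    if old_dims != new_dims:
        return f"re-index: vector size changes from {old_dims} to {new_dims}"
    # Same size, but each model maps text into its own vector space.
    return "re-index: same vector size, but vectors are not comparable across models"


print(switch_impact("all-MiniLM-L6-v2", "all-mpnet-base-v2"))
print(switch_impact("all-MiniLM-L6-v2", "all-MiniLM-L12-v2"))
```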

What Gets Indexed

The indexer splits your code into language-aware chunks:

PHP Files

  • Functions with full body
  • Classes with methods
  • WordPress hooks with context
  • DocBlock comments

Example chunk:

php
// Indexed as one unit:
function get_user_by_email($email) {
    global $wpdb;
    return $wpdb->get_row(
        $wpdb->prepare(
            "SELECT * FROM users WHERE email = %s",
            $email
        )
    );
}
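
To illustrate what "indexed as one unit" means, here is a minimal sketch of function-level chunking by brace matching. It is an assumption about the general technique, not the indexer's actual parser: it counts braces naively (including any inside strings or comments) and does not handle body-less declarations such as interface methods.

```python
import re

FUNCTION_START = re.compile(r"function\s+(\w+)\s*\(")


def extract_php_functions(source):
    """Return (name, code) pairs, one per named top-level function."""
    chunks = []
    pos = 0
    while True:
        match = FUNCTION_START.search(source, pos)
        if match is None:
            break
        open_brace = source.find("{", match.end())
        if open_brace == -1:
            break
        depth, end = 0, None
        # Walk forward until the opening brace is balanced again.
        for i in range(open_brace, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            break  # unbalanced braces, stop rather than guess
        chunks.append((match.group(1), source[match.start():end]))
        pos = end
    return chunks


php_source = '''<?php
function get_user_by_email($email) {
    global $wpdb;
    return $wpdb->get_row(
        $wpdb->prepare("SELECT * FROM users WHERE email = %s", $email)
    );
}

function is_valid_email($email) {
    if (strpos($email, "@") === false) {
        return false;
    }
    return true;
}
'''

chunks = extract_php_functions(php_source)
print([name for name, _ in chunks])
```

Each chunk keeps the full function body, so the embedding reflects what the function does and not just its signature.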

JavaScript Files

  • Functions (regular, arrow, async)
  • React components
  • Event handlers
  • Module exports

CSS/SCSS Files

  • Chunked by selectors
  • Media queries preserved
  • Variables and mixins

Wikit Blocks

  • block.json configurations
  • registerBlockType calls
  • Block metadata

Markdown/Documentation Files

  • Sections split by headers (H1-H3)
  • Code examples preserved with language tags
  • Technical documentation indexed semantically
  • README and wiki files
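
The header-based splitting in this list can be sketched as follows. This is an illustrative approach, assuming that lines inside code fences are never treated as headers and that H4 and deeper headers stay inside their parent section. The function name is hypothetical.

```python
import re

FENCE = "`" * 3  # code fence marker, built this way to keep the example readable
HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def split_markdown(text):
    """Split text into (title, body) sections at H1-H3 headers outside fences."""
    sections = []
    title, lines = None, []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER.match(line)
        if match:
            if title is not None or lines:
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if title is not None or lines:
        sections.append((title, "\n".join(lines).strip()))
    return sections


doc = "\n".join([
    "# Guide",
    "Intro text.",
    "## User Authentication",
    "The authentication system uses JWT tokens...",
    FENCE + "php",
    "# this comment is not a header",
    "function authenticate_user($credentials) {}",
    FENCE,
    "#### Too deep to split on",
    "More text.",
])

sections = split_markdown(doc)
print([title for title, _ in sections])
```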

Example chunk:

markdown
## User Authentication

The authentication system uses JWT tokens...

```php
function authenticate_user($credentials) {
    // Indexed as code example
}
```

Other Files

  • JSON configuration files
  • YAML files
  • CSS chunked by selectors (50 lines per chunk)

Performance Optimization

First-Time Setup

bash
# Initial model download (one-time)
Downloading model: ~80MB
Time: 1-2 minutes

# Indexing Wikit framework
Files: ~5000
Time: 5-7 minutes
Vectors created: ~15,000

Automatic Indexing via Git Hooks

bash
# Model already cached in Docker
Loading time: <1 second

# Git post-commit hook triggers indexing
# Only changed files are indexed
# Happens automatically on: git commit, git merge
Time: seconds per file
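
A sketch of how a post-commit hook could find the files to index is shown below. It assumes the hook asks git for the files touched by the latest commit and then filters by extension. The extension list and function names are assumptions based on the file types listed under What Gets Indexed, not the actual hook shipped with WDG.

```python
import subprocess

# Assumed set of indexable extensions (hypothetical).
INDEXABLE_EXTENSIONS = (".php", ".js", ".css", ".scss", ".json", ".yaml", ".yml", ".md")


def changed_files_in_last_commit(repo_dir="."):
    """Ask git which files the most recent commit touched."""
    output = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return output.splitlines()


def select_indexable(paths, extensions=INDEXABLE_EXTENSIONS):
    """Keep only paths whose extension the indexer is assumed to handle."""
    return [p for p in paths if p.lower().endswith(extensions)]


# Example with a hard-coded list, so it runs outside a git repository.
print(select_indexable(["src/user.php", "assets/logo.png", "README.md"]))
```

Note that plain `git diff-tree` reports nothing for merge commits and for the first commit in a repository unless extra flags are passed, so a real hook would need to handle those cases.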

Memory Usage

| Operation | RAM Usage | CPU Usage |
|---|---|---|
| Idle | 50MB | 0% |
| Model Loading | 200-500MB | 20% |
| Indexing | 300-800MB | 40-60% |
| Searching | 100-200MB | 10% |

How Indexing Works

The indexing process happens automatically through git hooks:

  1. Code Parsing: When you commit code, the indexer extracts semantic components

    • PHP: Functions, classes, WordPress hooks
    • JavaScript: Functions, React components, event handlers
    • Other files: Chunked by logical sections
  2. Embedding Generation: Each code chunk is converted to a vector (384 dimensions with the default model) using the local Sentence Transformer model

  3. Storage in Qdrant: Vectors are stored in the vector database with metadata:

    • File path and line number
    • Component type (function, class, hook)
    • Language and repository information
    • Project association
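
The metadata listed above could be modeled as a small record stored next to each vector. This is a sketch only: the field names and sample values are hypothetical and do not reflect the actual schema WDG uses in Qdrant.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkMetadata:
    """Metadata stored alongside one vector (hypothetical field names)."""
    file_path: str
    line_number: int
    component_type: str  # e.g. "function", "class", or "hook"
    language: str
    repository: str
    project: str


meta = ChunkMetadata(
    file_path="inc/users.php",
    line_number=42,
    component_type="function",
    language="php",
    repository="my-project",
    project="my-project",
)

# A plain dict is a convenient shape for a vector database payload.
payload = asdict(meta)
print(payload)
```

Storing the file path and line number with each vector is what lets a search result point back to the exact location in the code.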

Search Examples

Finding Similar Functions

When you search for "get user by ID", the system finds:

  • getUserById()
  • fetch_user_by_identifier()
  • loadUserFromDatabase($id)
  • wp_get_user($user_id)

None of these contains the exact phrase "get user by ID", yet all of them match.

Search: "validate email"

Finds matches across languages:

  • PHP: is_valid_email($email)
  • JS: validateEmailAddress(email)
  • Regex: /^[^@]+@[^@]+\.[^@]+$/

Pattern Recognition

Search: "database query with prepare statement"

Finds prepared-statement patterns such as:

php
$wpdb->prepare("SELECT * FROM...", $var)
$stmt = $pdo->prepare(...)
mysqli_prepare($conn, ...)

Privacy & Security

Local Processing

%%{init: {'theme':'neutral'}}%%
graph TB
    subgraph "Your Machine"
        Code[Your Code]
        Model[AI Model]
        Vectors[Vectors]
        DB[(Qdrant)]

        Code --> Model
        Model --> Vectors
        Vectors --> DB
    end

All processing occurs locally. The following stay on your machine:

  • Source code
  • Embeddings/vectors
  • Search queries
  • Results
  • Model weights

After the one-time model download, the system does not require external API connections for embedding generation or vector search operations.

Advanced Configuration

Custom Model Path

bash
# Use custom model location
export SENTENCE_TRANSFORMERS_HOME=/path/to/models

Batch Processing

python
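from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # default model
code_chunks = ["function a() {}", "function b() {}"]  # placeholder chunks

# Encode multiple chunks in one call
embeddings = model.encode(
    code_chunks,
    batch_size=32,
    show_progress_bar=True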
)
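
Conceptually, batch_size controls how many chunks are passed through the model at a time. The sketch below shows only that grouping step with a hypothetical helper; it does not reproduce the library's internals.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = [f"chunk {i}" for i in range(70)]
sizes = [len(batch) for batch in batched(chunks)]
print(sizes)  # [32, 32, 6]
```

Larger batches generally mean fewer model calls at the cost of more memory per call, which is worth keeping in mind alongside the memory figures for indexing.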

GPU Acceleration (Advanced)

GPU acceleration requires Docker GPU passthrough configuration. This is an advanced setup not covered in standard installation.

If you have an NVIDIA GPU and want faster indexing, you'll need to:

  1. Install the NVIDIA Container Toolkit (the successor to nvidia-docker2)
  2. Modify the indexer service in docker-compose.yml to enable GPU access
  3. Rebuild the indexer container with CUDA-enabled PyTorch

Troubleshooting

Model Download Issues

If the indexer fails to download the model on first run:

bash
# Check indexer logs
docker compose logs indexer

# Restart indexer to retry download
docker compose restart indexer

# Manually trigger download in container
docker exec wdg-indexer python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

Indexing Performance

If indexing is slow:

bash
# Use faster model (update .env)
EMBEDDING_MODEL=all-MiniLM-L6-v2

# Restart indexer service
docker compose restart indexer

# Check resource usage
docker stats wdg-indexer

Memory Issues

If indexer runs out of memory:

bash
# Increase Docker memory limit in compose.yml
# Under indexer service, adjust:
deploy:
  resources:
    limits:
      memory: 3G  # Increase from 2G

# Or use a smaller model
EMBEDDING_MODEL=all-MiniLM-L6-v2

Best Practices

1. Choose the Right Model

  • Speed priority: all-MiniLM-L6-v2
  • Quality priority: all-mpnet-base-v2
  • Multilingual: paraphrase-multilingual-MiniLM-L12-v2

2. Leverage Automatic Indexing

bash
# Indexing happens automatically via git hooks
# Just commit your changes:
git commit -m "Add new feature"

# The post-commit hook will:
# - Detect changed files
# - Index them automatically
# - Update the vector database

# Manual indexing only needed for:
# - Initial project setup: wdg index my-site
# - After pulling Wikit updates: wdg index

3. Collection Management

bash
# Separate collections per project
wdg index project1  # Creates: project_project1
wdg index project2  # Creates: project_project2

# Clean old collections
wdg collections delete project_old_site

The Future

Coming Soon

  • Fine-tuned models for WordPress/PHP
  • Code completion using local LLMs
  • Semantic diff for code review
  • Multi-model support (different models per project)

Research & Development

  • Training custom models on Wikit patterns
  • Multi-modal embeddings (code + comments + docs)
  • Cross-project code similarity detection

Released under the MIT License.