Code Indexing

Automatic code indexing turns your codebase into a searchable vector index, so AI assistants can understand and work with your WordPress projects through semantic search.

Overview

Code indexing in WDG is:

  • Automatic: Triggered by git commits and pulls
  • Incremental: Only indexes changed files
  • Semantic: Understands code meaning, not just keywords
  • Project-Scoped: Each project has isolated collections
  • Fast: Local embeddings require no external API calls

How It Works

Indexing Pipeline

%%{init: {'theme':'neutral'}}%%
sequenceDiagram
    participant Git
    participant Hook
    participant Indexer
    participant Parser
    participant Embedder
    participant Qdrant

    Git->>Hook: git commit
    Hook->>Indexer: Trigger with changed files
    Indexer->>Parser: Parse code files
    Parser->>Parser: Extract components
    loop For each component
        Parser->>Embedder: Generate embedding
        Embedder->>Embedder: all-MiniLM-L6-v2
        Embedder->>Qdrant: Store vector + metadata
    end
    Qdrant->>Git: Indexing complete

What Gets Indexed

PHP Files

Functions:

php
// Indexed as complete unit with context
function get_user_posts($user_id, $post_type = 'post') {
    global $wpdb;
    return $wpdb->get_results(
        $wpdb->prepare(
            "SELECT * FROM {$wpdb->posts}
             WHERE post_author = %d
             AND post_type = %s
             AND post_status = 'publish'",
            $user_id,
            $post_type
        )
    );
}

Classes:

php
// Indexed with all methods and properties
class CustomPostType {
    private $post_type;

    public function __construct($type) {
        $this->post_type = $type;
        $this->register();
    }

    public function register() {
        // Method implementation
    }
}

WordPress Hooks:

php
// Indexed with full context
add_action('init', function() {
    register_post_type('custom_type', [
        'public' => true,
        'supports' => ['title', 'editor']
    ]);
});

JavaScript Files

Functions:

javascript
// Regular functions
function validateEmail(email) {
    return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// Arrow functions
const fetchUserData = async (userId) => {
    const response = await fetch(`/api/users/${userId}`);
    return response.json();
};

React Components:

jsx
// Indexed with props and hooks
function UserProfile({ userId, showEmail = false }) {
    const [user, setUser] = useState(null);

    useEffect(() => {
        fetchUserData(userId).then(setUser);
    }, [userId]);

    return (
        <div className="user-profile">
            <h2>{user?.name}</h2>
            {showEmail && <p>{user?.email}</p>}
        </div>
    );
}

CSS/SCSS Files

scss
// Chunked by logical sections
.user-profile {
    display: flex;
    padding: 2rem;

    &__header {
        font-size: 1.5rem;
        color: var(--primary-color);
    }

    @media (max-width: 768px) {
        flex-direction: column;
    }
}

Block Configurations

json
{
  "name": "wdg/custom-block",
  "title": "Custom Block",
  "category": "wdg-blocks",
  "attributes": {
    "content": {
      "type": "string",
      "default": ""
    }
  }
}

Indexing Strategies

Git hooks automatically trigger indexing:

bash
# After committing changes
git add .
git commit -m "Add new feature"
# → Automatically indexes changed files

# After pulling updates
git pull origin main
# → Automatically indexes merged changes

Manual Indexing

bash
# Index Wikit framework
wdg index

# Index specific project
wdg index my-site

# Force re-index everything
wdg index my-site --force

# Update repositories and index
wdg index --update

Selective Indexing

bash
# Index only specific file types
wdg index my-site --types php,js

# Index specific directory
wdg index my-site --path wp-content/themes/custom-theme

# Exclude directories
wdg index my-site --exclude node_modules,vendor

Vector Collections

Collection Structure

Each project gets its own collection in Qdrant:

Collections:
├── wdg_framework          # Wikit core framework
├── project_my_site        # Project: my-site
├── project_client_site    # Project: client-site
└── project_demo           # Project: demo
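
The naming pattern in the listing above can be sketched in a few lines of Python. This is a minimal sketch that assumes collection names are the project slug with non-alphanumeric characters replaced by underscores and a `project_` prefix, which is inferred from the examples (`my-site` becomes `project_my_site`). The function name `collection_name` is hypothetical and not part of WDG.

```python
import re


def collection_name(project: str) -> str:
    """Map a project slug to a collection name (assumed convention)."""
    # Collapse every run of characters outside a-z and 0-9 into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", project.lower()).strip("_")
    if not slug:
        raise ValueError("project name has no usable characters")
    return f"project_{slug}"


print(collection_name("my-site"))
print(collection_name("client-site"))
```

Deriving the name from the slug in one place keeps indexing and search pointed at the same collection.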

Vector Metadata

Each indexed code chunk includes rich metadata:

json
{
  "id": "abc123",
  "vector": [0.1, 0.2, 0.3, ...],  // 384 dimensions
  "metadata": {
    "project": "my-site",
    "file_path": "wp-content/themes/custom/functions.php",
    "file_type": "php",
    "component_type": "function",
    "component_name": "get_user_posts",
    "line_start": 45,
    "line_end": 58,
    "language": "php",
    "content": "function get_user_posts($user_id...",
    "docblock": "Get all posts for a user...",
    "indexed_at": "2024-10-14T10:30:00Z",
    "commit_hash": "abc123def456",
    "branch": "main"
  }
}
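
A record like the one above could be assembled as in the following sketch. It is not the WDG implementation: `WDG_NAMESPACE`, `chunk_id`, and `build_point` are hypothetical names, and the stable-ID scheme is an assumption. The idea is to derive the ID from the chunk's identity so that re-indexing the same function overwrites its old vector instead of adding a duplicate. Qdrant accepts unsigned integers and UUIDs as point IDs, so the sketch uses a UUID.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical namespace for this sketch; any fixed UUID works if it never changes.
WDG_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "wdg-code-index")


def chunk_id(project, file_path, component_name):
    """Derive a stable UUID string from the chunk's identity."""
    return str(uuid.uuid5(WDG_NAMESPACE, f"{project}:{file_path}:{component_name}"))


def build_point(vector, *, project, file_path, component_type,
                component_name, line_start, line_end, content):
    """Bundle a vector with metadata fields named as in the example record."""
    return {
        "id": chunk_id(project, file_path, component_name),
        "vector": list(vector),
        "metadata": {
            "project": project,
            "file_path": file_path,
            "component_type": component_type,
            "component_name": component_name,
            "line_start": line_start,
            "line_end": line_end,
            "content": content,
            "indexed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }


point = build_point(
    [0.0] * 384,  # placeholder for a real 384-dimensional embedding
    project="my-site",
    file_path="wp-content/themes/custom/functions.php",
    component_type="function",
    component_name="get_user_posts",
    line_start=45,
    line_end=58,
    content="function get_user_posts($user_id...",
)
print(point["id"])
```

Two components with the same name in one file (for example, methods of different classes) would collide under this scheme, so a real key would likely need the class name as well.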

Indexing Performance

Initial Indexing

bash
# Wikit Framework (~5,000 files)
Time: 5-7 minutes
Vectors created: ~15,000
Disk space: ~50MB

# Typical project (~500 files)
Time: 30-60 seconds
Vectors created: ~1,500
Disk space: ~5MB

Incremental Updates

bash
# Single file change
Time: <1 second
Vectors updated: 1-10
Overhead: Minimal

# Pull with 20 changed files
Time: 5-10 seconds
Vectors updated: 50-200
Overhead: Negligible

Performance Optimization

Batch Processing:

python
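# Indexer embeds code chunks in batches to keep memory use predictable
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
code_chunks = ["function get_user() { }", "function save_user() { }"]

batch_size = 32
embeddings = model.encode(
    code_chunks,
    batch_size=batch_size,
    show_progress_bar=True
)
# embeddings is an array of shape (len(code_chunks), 384)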

Caching:

python
# Only re-index if file changed
if file_hash != cached_hash:
    index_file(file)
else:
    skip_file(file)
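
The hash check above can be made concrete with the standard library. The following is a minimal sketch, assuming the cache is a JSON file that maps file paths to SHA-256 digests. The function names and the cache format are assumptions for illustration, not WDG's actual storage.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def files_to_index(paths, cache_file: Path):
    """Return the paths whose content changed since the last run.

    The cache file is rewritten with the current digests on every call.
    """
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    changed = []
    for path in paths:
        current = file_hash(path)
        if cache.get(str(path)) != current:
            changed.append(path)
            cache[str(path)] = current
    cache_file.write_text(json.dumps(cache, indent=2))
    return changed
```

Hashing content rather than comparing modification times means a file that is touched but not edited is skipped. This sketch records the new digest before indexing runs, so a production version would probably update the cache only after the file has been indexed successfully.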

Search Capabilities

Find code by meaning, not just keywords:

bash
# Search query: "validate user email address"
# Finds:
- is_valid_email($email)
- validateEmailAddress(email)
- checkUserEmailFormat()
- /^[^\s@]+@[^\s@]+\.[^\s@]+$/

bash
# Search: "fetch data from API"
# Finds across languages:
PHP:  wp_remote_get($url)
JS:   fetch(url).then(r => r.json())
JS:   axios.get(url)
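
Matching by meaning comes down to comparing vectors. The sketch below ranks indexed chunks against a query, assuming cosine similarity is the distance metric (the document does not state which metric the collections use). The three-dimensional vectors are toy values standing in for real 384-dimensional embeddings, and `rank` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vector, indexed, top_k=3):
    """Return the names of the top_k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), name)
              for name, vec in indexed.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy vectors: the two email validators point in nearly the same direction.
indexed = {
    "is_valid_email": [0.9, 0.1, 0.0],
    "validateEmailAddress": [0.8, 0.2, 0.1],
    "calculate_total": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]  # stands in for "validate user email address"
print(rank(query, indexed, top_k=2))
```

Because the comparison works on vectors rather than tokens, a PHP function and a JavaScript function with different names can both rank highly for the same query.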

Pattern Recognition

bash
# Search: "custom post type registration"
# Finds all register_post_type() calls with context:
- Portfolio custom post type
- Testimonials CPT
- Events post type
- Product catalog

Code Chunking Strategy

PHP Chunking

php
// Chunk 1: Function with full body
function calculate_total($items) {
    $total = 0;
    foreach ($items as $item) {
        $total += $item->price;
    }
    return $total;
}

// Chunk 2: Separate function
function apply_discount($total, $discount) {
    return $total * (1 - $discount);
}
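
One way to produce function-level chunks like these is to find each named function and follow its braces to the matching close. The sketch below does that in Python. It is an illustration under simplifying assumptions, not the WDG parser: it ignores braces inside strings and comments, skips anonymous functions, and would misread body-less declarations such as abstract methods. A real indexer would more likely use a proper PHP parser. The name `split_php_functions` is hypothetical.

```python
import re

# Matches "function name(" and captures the name; closures have no name and are skipped.
FUNCTION_RE = re.compile(r"\bfunction\s+([A-Za-z_]\w*)\s*\(")


def split_php_functions(source: str):
    """Return one chunk per named function, with 1-based line numbers."""
    chunks = []
    pos = 0
    while True:
        match = FUNCTION_RE.search(source, pos)
        if match is None:
            break
        open_brace = source.find("{", match.end())
        if open_brace == -1:
            break
        depth = 0
        end = None
        for index in range(open_brace, len(source)):
            if source[index] == "{":
                depth += 1
            elif source[index] == "}":
                depth -= 1
                if depth == 0:
                    end = index + 1  # one past the closing brace
                    break
        if end is None:
            break  # unbalanced braces: stop rather than guess
        chunks.append({
            "component_type": "function",
            "component_name": match.group(1),
            "line_start": source.count("\n", 0, match.start()) + 1,
            "line_end": source.count("\n", 0, end) + 1,
            "content": source[match.start():end],
        })
        pos = end
    return chunks
```

Each returned dictionary carries the same `component_name`, `line_start`, and `line_end` fields that appear in the vector metadata, so a chunk can be traced back to its place in the file.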

JavaScript Chunking

javascript
// Chunk 1: Component definition
function ProductCard({ product }) {
    return (
        <div className="product-card">
            <h3>{product.name}</h3>
            <p>{product.price}</p>
        </div>
    );
}

// Chunk 2: Helper function
const formatPrice = (price) => {
    return `$${price.toFixed(2)}`;
};

CSS Chunking

scss
// Chunk 1: Component styles
.product-card {
    display: flex;
    padding: 1rem;

    h3 {
        font-size: 1.2rem;
    }
}

// Chunk 2: Media queries
@media (max-width: 768px) {
    .product-card {
        flex-direction: column;
    }
}

Managing Collections

List Collections

bash
wdg collections list

Output:

Vector Database Collections:

wdg_framework
  Vectors: 15,234
  Size: 48.2 MB
  Last updated: 2024-10-14 09:15

project_my_site
  Vectors: 1,450
  Size: 4.7 MB
  Last updated: 2024-10-14 10:30

project_client_site
  Vectors: 3,892
  Size: 12.1 MB
  Last updated: 2024-10-13 16:45

Delete Collection

bash
# Delete project collection
wdg collections delete project_old_site

# Re-create by re-indexing
wdg index old-site

Git Hook Integration

Post-Commit Hook

bash
#!/bin/bash
# .git/hooks/post-commit

# Get changed files in this commit
# (--root also covers the first commit, where HEAD^ does not exist)
CHANGED_FILES=$(git diff-tree --no-commit-id --name-only -r --root HEAD)

# Filter for indexable files
INDEXABLE=$(echo "$CHANGED_FILES" | grep -E '\.(php|js|jsx|scss|css|json)$')

if [ -n "$INDEXABLE" ]; then
    echo "Indexing changed files..."
    wdg index "$(basename "$(pwd)")" --files "$INDEXABLE"
fi
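
The filtering step of the hook can also be expressed in Python, which makes it easy to test in isolation. This sketch applies the same extension pattern as the `grep -E` call above to the text that `git diff --name-only` prints. `indexable_files` is a hypothetical helper, not part of WDG.

```python
import re

# Same extension list as the grep pattern in the post-commit hook.
INDEXABLE_RE = re.compile(r"\.(php|js|jsx|scss|css|json)$")


def indexable_files(diff_output: str) -> list[str]:
    """Keep only the changed paths whose extension the indexer handles."""
    return [line for line in diff_output.splitlines() if INDEXABLE_RE.search(line)]


changed = "inc/auth.php\nREADME.md\nsrc/App.jsx\nblock.json\n"
print(indexable_files(changed))
```

Returning a list keeps each path as a separate item, which avoids word-splitting problems with file names that contain spaces.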

Post-Merge Hook

bash
#!/bin/bash
# .git/hooks/post-merge

# Get merged files
MERGED_FILES=$(git diff --name-only ORIG_HEAD HEAD)

# Index merged changes
if [ -n "$MERGED_FILES" ]; then
    echo "Indexing merged changes..."
    wdg index "$(basename "$(pwd)")"
fi

Installing Hooks

bash
# Hooks are automatically installed when:
# 1. Creating new project with --init-wikit
wdg create my-site --init-wikit

# 2. Adding repository to project
wdg my-site repo add https://github.com/client/repo

# 3. Manually install
cd projects/my-site/repositories/my-site
cp /path/to/wdg/hooks/* .git/hooks/
chmod +x .git/hooks/*

Indexing Best Practices

1. Commit Frequently

bash
# Each commit triggers incremental indexing
git commit -m "Add user authentication"  # Indexes auth code
git commit -m "Add email validation"     # Indexes validation

2. Use Descriptive Commits

bash
# Good: AI can understand context
git commit -m "Add custom post type for portfolio items"

# Bad: Less context for AI
git commit -m "Update code"

3. Structure Code Well

php
// Good: Clear function separation
function get_user() { }
function validate_user() { }
function save_user() { }

// Bad: Monolithic function (harder to search)
function handle_user() {
    // 200 lines of mixed logic
}

4. Include DocBlocks

php
/**
 * Calculate discounted price for user
 *
 * @param float $price Original price
 * @param int $user_id User ID for discount lookup
 * @return float Discounted price
 */
function calculate_discount($price, $user_id) {
    // Implementation
}

5. Regular Maintenance

bash
# Weekly: Update and re-index framework
wdg update

# Monthly: Clean up old collections
wdg collections list
wdg collections delete project_old_*

# Quarterly: Full re-index
wdg index --all --force

Troubleshooting

Indexing Not Triggering

bash
# Check if hooks are installed
ls -la .git/hooks/post-commit

# Verify hook is executable
chmod +x .git/hooks/post-commit

# Test hook manually
.git/hooks/post-commit

Slow Indexing

bash
# Check system resources
docker stats wdg-indexer

# Confirm the faster default model is configured
# Edit .env: EMBEDDING_MODEL=all-MiniLM-L6-v2

# Restart indexer
docker-compose restart indexer

Missing Results

bash
# Verify collection exists
wdg collections list

# Check vector count
curl http://localhost:6333/collections/project_my_site

# Re-index if needed
wdg index my-site --force
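
The vector-count check can be scripted. The sketch below reads the collection info that Qdrant returns from `GET /collections/<name>` and decides whether a re-index looks necessary. It assumes the response carries the point count in `result.points_count`. The helper names and the `minimum` threshold are illustrative assumptions.

```python
import json
from urllib.request import urlopen


def fetch_collection_info(name, base_url="http://localhost:6333"):
    """GET /collections/<name> from Qdrant and return the decoded JSON body."""
    with urlopen(f"{base_url}/collections/{name}", timeout=5) as response:
        return json.load(response)


def points_count(info):
    """Return the number of stored points, or 0 if the field is missing."""
    return int((info.get("result") or {}).get("points_count") or 0)


def needs_reindex(info, minimum=1):
    """True when the collection holds fewer points than expected."""
    return points_count(info) < minimum
```

With Qdrant running locally, `needs_reindex(fetch_collection_info("project_my_site"))` returning `True` suggests running `wdg index my-site --force`.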

Out of Disk Space

bash
# Check collection sizes
wdg collections list

# Delete old collections
wdg collections delete project_old_*

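# Prune unused Docker data, including unused volumes
# (keep the Qdrant container running so its volume is not treated as unused)
docker system prune --volumes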

Advanced Configuration

Custom Embedding Model

bash
# Edit .env
EMBEDDING_MODEL=all-mpnet-base-v2  # Higher quality, slower (768-dimensional vectors)
# or
EMBEDDING_MODEL=all-MiniLM-L6-v2  # Faster, default (384-dimensional vectors)

# Restart indexer service
docker-compose restart indexer

# Re-index with the new model; vectors from different models are not comparable
wdg index --all --force

Indexing Filters

Create .wdg/indexing.json in project:

json
{
  "include": [
    "wp-content/themes/**/*.php",
    "wp-content/plugins/**/*.{php,js}"
  ],
  "exclude": [
    "**/node_modules/**",
    "**/vendor/**",
    "**/*.min.js",
    "**/dist/**"
  ],
  "chunk_size": 512,
  "overlap": 50
}
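
To show how such a filter file could be applied, here is a minimal sketch that converts the glob patterns to regular expressions and tests a path against them. The exact glob semantics WDG uses are not documented here, so the handling of `**`, `*`, and `{a,b}` below is an assumption, as is the rule that an exclude match wins over an include match. `glob_to_regex` and `should_index` are hypothetical names.

```python
import json
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob with **, * and {a,b} into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "{" and "}" in pattern[i:]:
            close = pattern.index("}", i)
            options = pattern[i + 1:close].split(",")
            out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = close + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def should_index(path: str, config: dict) -> bool:
    """Excludes take precedence; otherwise the path must match an include."""
    if any(glob_to_regex(p).fullmatch(path) for p in config.get("exclude", [])):
        return False
    return any(glob_to_regex(p).fullmatch(path) for p in config.get("include", []))


config = json.loads("""
{
  "include": ["wp-content/themes/**/*.php", "wp-content/plugins/**/*.{php,js}"],
  "exclude": ["**/node_modules/**", "**/vendor/**", "**/*.min.js", "**/dist/**"]
}
""")
print(should_index("wp-content/themes/custom/functions.php", config))
print(should_index("wp-content/plugins/shop/assets/app.min.js", config))
```

Checking excludes first means a minified or vendored file is skipped even when it sits under an included directory.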

Custom Metadata

Add custom metadata to vectors:

python
# indexer/custom_metadata.py
def extract_metadata(file_path, content):
    metadata = {
        "project": get_project_name(),
        "file_path": file_path,
        "author": get_git_author(file_path),
        "last_modified": get_file_mtime(file_path),
        "custom_tags": extract_custom_tags(content)
    }
    return metadata

Integration with AI Assistants

Once indexed, code is immediately searchable by AI assistants:

bash
# AI can now answer:
"Where do we register custom post types?"
"Show me how we handle user authentication"
"Find similar implementations of email validation"
"What Wikit blocks are used in this project?"

Monitoring Indexing

View Indexing Logs

bash
# Real-time logs
wdg logs indexer --follow

# Last 100 lines
wdg logs indexer --tail 100

Indexing Status

bash
# Overall status
wdg status

# Project-specific status
wdg status my-site

Released under the MIT License.