Indexer Internals

Deep dive into the indexing architecture, code parsing, embedding generation, and vector storage.

Architecture

```mermaid
%%{init: {'theme':'neutral'}}%%
sequenceDiagram
    participant Git
    participant Hook as post-commit hook
    participant Service as indexer_service (FastAPI :8666)
    participant Indexer as WDGCodeIndexer
    participant Model as embedding model (in-process)
    participant Qdrant

    Git->>Hook: git commit
    Hook->>Service: POST /index/incremental {files, project}
    Service->>Indexer: index_file() per path
    Indexer->>Indexer: chunk_code() → components
    Indexer->>Model: encode(batch)
    Model->>Indexer: vectors
    Indexer->>Qdrant: upsert (batches of 50)
    Service->>Hook: IndexResponse (chunks, errors, duration)
```

Modules

The indexer is four Python files in indexer/. There is no separate watcher, parser, embedder, or storage module: extraction, embedding, and storage all live inside WDGCodeIndexer.

`indexer/indexer.py` — `WDGCodeIndexer`

The core engine. One class does extraction, embedding, and storage.

```python
# indexer/indexer.py
class WDGCodeIndexer:
    def __init__(self, collection_name=None, project_name=None):
        # collection resolution:
        #   collection_name given      → use it verbatim
        #   project_name given         → "project_" + name.replace("-", "_")
        #   neither                    → "wdg_framework"
        self.batch_size = 50           # hardcoded; points are upserted in batches of 50
        self.ensure_collection()       # creates the collection if missing

    def chunk_code(self, file_path, content): ...   # language-aware extraction
    def index_file(self, file_path, project=None): ...
    def index_directory(self, directory, project=None): ...  # parallel, thread pool
    def index_wikit_repos(self): ...   # → wdg_framework
    def index_platform(self): ...      # → platform
```
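The collection-resolution precedence in the constructor comments can be written out as a small standalone function. This is only a sketch of the documented rule, not the class's actual code: `resolve_collection` and `DEFAULT_COLLECTION` are hypothetical names, and it assumes "given" means a non-empty string.

```python
from typing import Optional

# Fallback collection named in the constructor comments.
DEFAULT_COLLECTION = "wdg_framework"


def resolve_collection(collection_name: Optional[str] = None,
                       project_name: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the documented precedence."""
    if collection_name:
        # An explicit collection name wins and is used verbatim.
        return collection_name
    if project_name:
        # Project names are prefixed and hyphens become underscores.
        return "project_" + project_name.replace("-", "_")
    return DEFAULT_COLLECTION
```

For example, `resolve_collection(project_name="acme-site")` yields `project_acme_site`, while passing both arguments returns the explicit collection name unchanged.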

Module-level configuration is read from the environment at import time:

```python
import os

QDRANT_HOST     = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT     = int(os.getenv("QDRANT_PORT", 6333))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
```

The model is loaded once at import; EMBEDDING_DIM is read back from the loaded model (384 for all-MiniLM-L6-v2). Collections are created with VectorParams(size=EMBEDDING_DIM, distance=COSINE).

Language-aware extraction. chunk_code() dispatches on file extension:

- PHP: functions (with called-function dependency tracking), classes (extends/implements), add_action/add_filter registrations, do_action/apply_filters invocations, and ACF field usages (get_field, the_field, have_rows, …).
- JavaScript/TypeScript: functions (declarations, arrow functions, function expressions), React components, Wikit registerBlockType calls, and block.json parsing.
- Python: functions and classes via the ast module (decorators, call dependencies, base classes), with a regex fallback for files that fail to parse.
- Bash: functions in either syntax, including the preceding ## doc block.
- JSON: block.json (Wikit block schema) and acf-json/*.json (ACF field groups); other JSON is chunked in 50-line windows.
- Markdown: split into sections by #–### headers, with fenced code blocks extracted as code_example chunks.
- CSS/SCSS/Sass and every other supported extension fall back to 50-line chunks.
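The 50-line fallback in the list above is the simplest of these strategies, so it makes a useful illustration. The sketch below assumes non-overlapping windows; `chunk_by_lines` and the dictionary keys are hypothetical and may not match the fields the real chunk_code() emits.

```python
from typing import Dict, List


def chunk_by_lines(content: str, window: int = 50) -> List[Dict]:
    """Split text into consecutive, non-overlapping line windows."""
    lines = content.splitlines()
    chunks = []
    for start in range(0, len(lines), window):
        block = lines[start:start + window]
        chunks.append({
            # Hypothetical naming scheme based on the 1-based line range.
            "name": f"lines_{start + 1}_{start + len(block)}",
            "line_number": start + 1,
            "content": "\n".join(block),
        })
    return chunks
```

A 120-line stylesheet would produce three chunks starting at lines 1, 51, and 101, with the last one holding the remaining 20 lines.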

`indexer/indexer_service.py` — FastAPI HTTP service

Long-running service (container port 8666, host 6666) that keeps the embedding model resident in memory so per-commit indexing is fast. This is the integration surface the git hooks call.

| Method & Path | Purpose |
| --- | --- |
| `GET /health` | Qdrant connectivity + model-loaded status |
| `GET /` | Service info and endpoint list |
| `POST /index/incremental` | Index a specific list of files (`{files, project, repository}`) |
| `POST /index/full` | Full index of a project/repository/path, or `index_type: wikit \| platform` |
| `GET /status` | Per-collection vector counts, model name, Qdrant host/port |
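As an illustration of how a client such as the post-commit hook might call `POST /index/incremental`, the sketch below builds the request with the standard library without sending it. The field names come from the endpoint description; `build_incremental_request`, the choice to omit unset fields, and the default base URL (the host port 6666) are assumptions, and the real hook may be implemented quite differently.

```python
import json
import urllib.request
from typing import List, Optional


def build_incremental_request(
    files: List[str],
    project: Optional[str] = None,
    repository: Optional[str] = None,
    base_url: str = "http://localhost:6666",  # assumed: host-mapped port
) -> urllib.request.Request:
    """Build (but do not send) a POST /index/incremental request."""
    body = {"files": files, "project": project, "repository": repository}
    # Assumption: optional fields that are unset can simply be left out.
    body = {key: value for key, value in body.items() if value is not None}
    return urllib.request.Request(
        url=f"{base_url}/index/incremental",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(req, timeout=30)` would send it to a running service.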

The service resolves the target collection from the request: project="platform" maps to the platform collection (not project_platform); any other project maps to project_{name}; a repository maps to repo_{name}.
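The mapping can be summarised in a few lines. This is a sketch under stated assumptions rather than the service's code: `resolve_service_collection` is a hypothetical name, project is assumed to take precedence when both fields are present, and no hyphen normalisation is applied because the rule above does not mention one at this layer.

```python
from typing import Optional


def resolve_service_collection(project: Optional[str] = None,
                               repository: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper mirroring the request-to-collection mapping."""
    if project == "platform":
        # Special case: the platform project uses the bare collection name.
        return "platform"
    if project:
        return f"project_{project}"
    if repository:
        return f"repo_{repository}"
    # Neither field supplied: nothing to resolve in this sketch.
    return None
```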

`indexer/collection_manager.py` — collection validation

Backs wdg collections list | validate | clean | delete. It compares project_* collections in Qdrant against the directories under projects/ and reports:

- Orphaned collections: a project_* collection with no matching project directory (candidate for wdg collections clean).
- Unindexed projects: a project directory with no collection yet (run wdg index <project>).
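Both reports reduce to a set comparison between collection names and project directory names. The sketch below shows the idea on plain lists; `compare_collections` is a hypothetical function, and it assumes directory names map to collection names through the same hyphen-to-underscore rule the WDGCodeIndexer constructor uses.

```python
from typing import Dict, Iterable, List

PREFIX = "project_"


def compare_collections(collections: Iterable[str],
                        project_dirs: Iterable[str]) -> Dict[str, List[str]]:
    """Report orphaned collections and unindexed project directories."""
    # Map each expected collection name back to its directory name.
    expected = {PREFIX + name.replace("-", "_"): name for name in project_dirs}
    # Only project_* collections take part in the comparison.
    existing = {name for name in collections if name.startswith(PREFIX)}
    return {
        "orphaned": sorted(existing - set(expected)),
        "unindexed": sorted(
            directory for collection, directory in expected.items()
            if collection not in existing
        ),
    }
```

Collections such as wdg_framework or platform are ignored here because they do not carry the project_ prefix.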

`indexer/cleanup_utils.py` — project teardown

Centralized cleanup used during wdg delete. cleanup_all_project_resources(project_name) removes a project's Qdrant collection, MySQL database (wp_{name}), nginx config, and SSL certificates, returning a per-operation result dict.

Indexing Pipeline

Traversal and exclusions

index_directory() walks the tree with os.walk, pruning excluded directories in place so it never descends into them. It then filters by supported extension and an additional path-substring exclusion list.

Pruned directories (never descended): node_modules, vendor, dist, build, .git, __pycache__, .cache, .pytest_cache, .mypy_cache, venv, .venv, env, and worktrees (covers both git worktrees and Claude Code agent worktrees, which are transient copies of already-indexed source).

Path substrings excluded from results: /plugins/ (third-party plugins; mu-plugins are kept), the bundled twenty* default themes, and .vitepress/cache / .vitepress/dist.

Indexable extensions

php
js, jsx, ts, tsx
css, scss, sass
json
md
yml, yaml
py
sh
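The traversal rules and extension list above can be sketched with os.walk, whose top-down mode lets the caller edit the directory list in place to stop descent. `iter_indexable_files` and the constant names are hypothetical, and the twenty* theme exclusion is left out because its exact matching pattern is not given here.

```python
import os
from typing import Iterator

PRUNED_DIRS = {
    "node_modules", "vendor", "dist", "build", ".git", "__pycache__",
    ".cache", ".pytest_cache", ".mypy_cache", "venv", ".venv", "env",
    "worktrees",
}
# Substring filters; "/plugins/" does not match ".../mu-plugins/...".
EXCLUDED_SUBSTRINGS = ("/plugins/", ".vitepress/cache", ".vitepress/dist")
EXTENSIONS = {
    ".php", ".js", ".jsx", ".ts", ".tsx", ".css", ".scss", ".sass",
    ".json", ".md", ".yml", ".yaml", ".py", ".sh",
}


def iter_indexable_files(root: str) -> Iterator[str]:
    """Yield indexable file paths under root, pruning excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Slice assignment mutates the list os.walk uses, so pruned
        # directories are never descended into.
        dirnames[:] = [d for d in dirnames if d not in PRUNED_DIRS]
        for filename in filenames:
            if os.path.splitext(filename)[1].lower() not in EXTENSIONS:
                continue
            path = os.path.join(dirpath, filename)
            normalized = path.replace(os.sep, "/")
            if any(part in normalized for part in EXCLUDED_SUBSTRINGS):
                continue
            yield path
```

Pruning in place matters mostly for cost: filtering node_modules after the walk would still read every directory beneath it.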

Parallel processing

index_directory() submits files to a ThreadPoolExecutor. Worker count is min(max(1, (cpu_count - 1) // 2), 3): roughly half of the spare cores, never fewer than one and never more than three. Each worker chunks a file and batch-embeds its chunks; the resulting points are accumulated under a lock and flushed to Qdrant in batches of batch_size (50). Threading (not multiprocessing) is used so that the single resident embedding model is shared.
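The worker-count formula and the accumulate-then-flush pattern can be sketched as follows. This is an illustration, not the indexer's implementation: `worker_count`, `index_in_parallel`, `process_file`, and `flush` are hypothetical names, and flushing while the lock is held is a simplification that serialises writes.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional

BATCH_SIZE = 50


def worker_count(cpu_count: Optional[int] = None) -> int:
    """min(max(1, (cpu_count - 1) // 2), 3), as described in the text."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return min(max(1, (cpus - 1) // 2), 3)


def index_in_parallel(paths: Iterable[str],
                      process_file: Callable[[str], List],
                      flush: Callable[[List], None],
                      batch_size: int = BATCH_SIZE) -> int:
    """Process files on a thread pool and flush points in fixed batches."""
    pending: List = []
    lock = threading.Lock()
    total = 0

    def work(path: str) -> None:
        nonlocal total
        points = process_file(path)  # stands in for chunking + embedding
        with lock:
            pending.extend(points)
            total += len(points)
            # Emit every full batch that has accumulated so far.
            while len(pending) >= batch_size:
                batch = pending[:batch_size]
                del pending[:batch_size]
                flush(batch)

    with ThreadPoolExecutor(max_workers=worker_count()) as pool:
        # Consuming the iterator re-raises any worker exception.
        list(pool.map(work, paths))

    if pending:
        flush(list(pending))  # final partial batch
    return total
```

With the formula as written, a 4-core machine gets one worker, a 5-core machine two, and anything with 7 or more cores gets three.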

Point construction

Each chunk becomes a Qdrant point with:

id = md5("{file_path}:{line_number}:{name}")
embedding text = "[{language}] {component_type}: {name}\n{content}"
a flat payload (see Database Architecture → Vector Schema)
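The id and embedding-text formats above translate directly into two small helpers. The sketch assumes the md5 digest is used in its hexadecimal string form, which the text does not specify; `point_id` and `embedding_text` are hypothetical names.

```python
import hashlib


def point_id(file_path: str, line_number: int, name: str) -> str:
    """md5 of "{file_path}:{line_number}:{name}" (hex form is an assumption)."""
    key = f"{file_path}:{line_number}:{name}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def embedding_text(language: str, component_type: str,
                   name: str, content: str) -> str:
    """Prefix the chunk with its language, component type, and name."""
    return f"[{language}] {component_type}: {name}\n{content}"
```

Because the id is derived only from path, line number, and name, re-indexing an unchanged component produces the same id, so an upsert overwrites the existing point instead of adding a duplicate.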

Configuration

Environment Variables

The indexer reads these (and only these) from the environment:

```bash
EMBEDDING_MODEL=all-MiniLM-L6-v2   # default if unset; override to change model
QDRANT_HOST=qdrant                 # service name on wdg-network (localhost outside Docker)
QDRANT_PORT=6333
PROJECT_ROOT=/workspace            # base path inside the container
INDEXER_PORT=8666                  # port the FastAPI service binds to
```
There is no INDEXING_BATCH_SIZE, INDEXING_WORKERS, EMBEDDING_CACHE_DIR, VECTOR_DB_HOST, or VECTOR_DB_PORT. Batch size (50) and worker count are computed internally and are not environment-configurable. The downloaded model is cached in the transformer-cache volume.

Monitoring

```bash
# Service health and per-collection counts
curl http://localhost:6666/health
curl http://localhost:6666/status

# Container logs
wdg logs indexer        # or: docker logs wdg-indexer
```
