What it does

  • Reads every markdown file in the hub and parses YAML frontmatter against the hub schema.
  • Chunks each body by heading.
  • Embeds chunks via Ollama (default nomic-embed-text), OpenAI (text-embedding-3-small), or a deterministic local-hash fallback for CI.
  • Upserts chunks into Qdrant with payload indices on kh_id, scope, scope_prefixes, sensitivity, source_type, team, and tags.
  • Maintains a Postgres manifest so unchanged files are skipped on subsequent runs.
  • Sweeps orphans: IDs still present in the manifest but no longer in the hub are deleted from both Qdrant and the manifest.
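The heading-based chunking step can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name, the markdown-heading split rule, and the sliding-window handling of oversized sections are all assumptions consistent with the defaults above (chunk size 1200, overlap 120).

```python
import re

def chunk_by_heading(body: str, chunk_size: int = 1200, overlap: int = 120):
    """Split a markdown body at headings, then window oversized sections.

    Illustrative sketch only; the real ingester may split differently.
    """
    # Zero-width split before each markdown heading, keeping the
    # heading attached to its own section text.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", body)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= chunk_size:
            chunks.append(section)
        else:
            # Oversized section: slide a window with the configured overlap.
            step = max(1, chunk_size - overlap)
            for start in range(0, len(section), step):
                chunks.append(section[start:start + chunk_size])
    return chunks
```

Each chunk stays self-describing because it carries its heading, which keeps the embedded text retrievable on its own.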

Embedding providers

  • Ollama (default): runs against a local or remote Ollama instance.
  • OpenAI: hosted embedding API.
  • Hash: deterministic local fallback for CI smoke tests; never used in production.
If the configured provider can’t be reached, the run aborts. There is no silent fallback in production.
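Provider selection might look like the sketch below. The function names and the hash-embedding construction are assumptions for illustration; the one behavior taken from this document is that the hash provider is CI-only and that an unreachable real provider aborts the run rather than silently falling back.

```python
import hashlib
import math
import os
from typing import Optional

def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Deterministic local-hash embedding: CI smoke tests only, never production."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Repeat the 32 digest bytes to fill the vector, then L2-normalize
    # so cosine similarity behaves sensibly.
    raw = [digest[i % len(digest)] / 255.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

def get_embedder(provider: Optional[str] = None):
    """Resolve the embedding callable from EMBEDDING_PROVIDER."""
    provider = provider or os.environ.get("EMBEDDING_PROVIDER", "ollama")
    if provider == "hash":
        return hash_embed
    if provider in ("ollama", "openai"):
        # A real client would be constructed here; if its endpoint is
        # unreachable the run must abort -- no silent fallback to hash.
        raise NotImplementedError(f"wire up the {provider} client")
    raise ValueError(f"unknown EMBEDDING_PROVIDER: {provider}")
```

The deterministic hash provider makes CI runs reproducible: the same text always yields the same vector, with no network dependency.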

CLI

python ingest/ingest_hub.py
  [--hub-dir knowledge-hub]
  [--collection cerebrum_knowledge_hub]
  [--chunk-size 1200] [--chunk-overlap 120]
  [--no-manifest]
  [--dry-run]
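The flags above map naturally onto an `argparse` parser. This is a sketch of how such a parser could be wired, not the script's actual implementation; only the flag names and defaults come from the CLI synopsis.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser sketch matching the ingest_hub.py CLI synopsis."""
    p = argparse.ArgumentParser(prog="ingest_hub.py")
    p.add_argument("--hub-dir", default="knowledge-hub")
    p.add_argument("--collection", default="cerebrum_knowledge_hub")
    p.add_argument("--chunk-size", type=int, default=1200)
    p.add_argument("--chunk-overlap", type=int, default=120)
    # Both flags are simple booleans: skip the manifest write, or skip
    # all writes entirely.
    p.add_argument("--no-manifest", action="store_true")
    p.add_argument("--dry-run", action="store_true")
    return p
```

With no arguments the defaults apply, so a bare `python ingest/ingest_hub.py` ingests `knowledge-hub` into `cerebrum_knowledge_hub` with 1200/120 chunking.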

Configuration

  • QDRANT_URL: Qdrant endpoint.
  • QDRANT_API_KEY: Qdrant auth (optional).
  • QDRANT_COLLECTION: collection name; default cerebrum_knowledge_hub.
  • EMBEDDING_PROVIDER: one of ollama, openai, hash.
  • OLLAMA_BASE_URL, OLLAMA_EMBED_MODEL: Ollama provider config.
  • OPENAI_API_KEY, OPENAI_EMBED_MODEL: OpenAI provider config.
  • POSTGRES_DSN: manifest persistence (optional).
The manifest write is skipped under --no-manifest (useful for first runs, debugging, or when Postgres is unavailable) and under --dry-run.
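Reading this configuration could be sketched as below. The variable names come from the table above; the `IngestConfig` type, the `load_config` helper, and every default except the collection name are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class IngestConfig:
    qdrant_url: str
    qdrant_api_key: Optional[str]
    collection: str
    embedding_provider: str
    postgres_dsn: Optional[str]  # None disables the manifest, like --no-manifest

def load_config(env=os.environ) -> IngestConfig:
    """Build config from environment variables, with sketch defaults."""
    return IngestConfig(
        qdrant_url=env.get("QDRANT_URL", "http://localhost:6333"),
        qdrant_api_key=env.get("QDRANT_API_KEY"),
        collection=env.get("QDRANT_COLLECTION", "cerebrum_knowledge_hub"),
        embedding_provider=env.get("EMBEDDING_PROVIDER", "ollama"),
        postgres_dsn=env.get("POSTGRES_DSN"),
    )
```

Leaving POSTGRES_DSN unset naturally disables manifest persistence, which mirrors the --no-manifest behavior described above.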