What it does

  • Reads every markdown file in the hub and parses YAML frontmatter against the hub schema.
  • Chunks each body by heading.
  • Embeds chunks via Ollama (default nomic-embed-text), OpenAI (text-embedding-3-small), or a deterministic local-hash fallback for CI.
  • Upserts chunks into Qdrant with payload indices on kh_id, scope, scope_prefixes, sensitivity, source_type, team, and tags.
  • Maintains a Postgres manifest so unchanged files are skipped on subsequent runs.
  • Sweeps orphans: IDs still present in the manifest but no longer in the hub are deleted from both Qdrant and the manifest.
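The heading-based chunking step can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name, the markdown-heading split rule, and the sliding-window handling of oversized sections are all assumptions consistent with the defaults above (chunk size 1200, overlap 120).

```python
import re

def chunk_by_heading(body: str, chunk_size: int = 1200, overlap: int = 120):
    """Split a markdown body at headings, then window oversized sections.

    Illustrative sketch only; the real ingester may split differently.
    """
    # Zero-width split before each markdown heading, keeping the
    # heading attached to its own section text.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", body)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= chunk_size:
            chunks.append(section)
        else:
            # Oversized section: slide a window with the configured overlap.
            step = max(1, chunk_size - overlap)
            for start in range(0, len(section), step):
                chunks.append(section[start:start + chunk_size])
    return chunks
```

Each chunk stays self-describing because it carries its heading, which keeps the embedded text retrievable on its own.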

Embedding providers

  • Ollama (default): runs against a local or remote Ollama instance.
  • OpenAI: hosted embedding API.
  • Hash: deterministic local fallback for CI smoke tests; never used in production.
If the configured provider can’t be reached, the run aborts. There is no silent fallback in production.
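Provider selection might look like the sketch below. The function names and the hash-embedding construction are assumptions for illustration; the one behavior taken from this document is that the hash provider is CI-only and that an unreachable real provider aborts the run rather than silently falling back.

```python
import hashlib
import math
import os
from typing import Optional

def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Deterministic local-hash embedding: CI smoke tests only, never production."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Repeat the 32 digest bytes to fill the vector, then L2-normalize
    # so cosine similarity behaves sensibly.
    raw = [digest[i % len(digest)] / 255.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

def get_embedder(provider: Optional[str] = None):
    """Resolve the embedding callable from EMBEDDING_PROVIDER."""
    provider = provider or os.environ.get("EMBEDDING_PROVIDER", "ollama")
    if provider == "hash":
        return hash_embed
    if provider in ("ollama", "openai"):
        # A real client would be constructed here; if its endpoint is
        # unreachable the run must abort -- no silent fallback to hash.
        raise NotImplementedError(f"wire up the {provider} client")
    raise ValueError(f"unknown EMBEDDING_PROVIDER: {provider}")
```

The deterministic hash provider makes CI runs reproducible: the same text always yields the same vector, with no network dependency.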

CLI

python ingest/ingest_hub.py
  [--hub-dir knowledge-hub]
  [--collection cerebrum_knowledge_hub]
  [--chunk-size 1200] [--chunk-overlap 120]
  [--no-manifest]
  [--dry-run]
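The flags above map naturally onto an `argparse` parser. This is a sketch of how such a parser could be wired, not the script's actual implementation; only the flag names and defaults come from the CLI synopsis.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser sketch matching the ingest_hub.py CLI synopsis."""
    p = argparse.ArgumentParser(prog="ingest_hub.py")
    p.add_argument("--hub-dir", default="knowledge-hub")
    p.add_argument("--collection", default="cerebrum_knowledge_hub")
    p.add_argument("--chunk-size", type=int, default=1200)
    p.add_argument("--chunk-overlap", type=int, default=120)
    # Both flags are simple booleans: skip the manifest write, or skip
    # all writes entirely.
    p.add_argument("--no-manifest", action="store_true")
    p.add_argument("--dry-run", action="store_true")
    return p
```

With no arguments the defaults apply, so a bare `python ingest/ingest_hub.py` ingests `knowledge-hub` into `cerebrum_knowledge_hub` with 1200/120 chunking.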

Configuration

  • QDRANT_URL: Qdrant endpoint.
  • QDRANT_API_KEY: Qdrant auth (optional).
  • QDRANT_COLLECTION: collection name; default cerebrum_knowledge_hub.
  • EMBEDDING_PROVIDER: one of ollama, openai, hash.
  • OLLAMA_BASE_URL, OLLAMA_EMBED_MODEL: Ollama provider config.
  • OPENAI_API_KEY, OPENAI_EMBED_MODEL: OpenAI provider config.
  • POSTGRES_DSN: manifest persistence (optional).
The manifest write is skipped under --no-manifest (useful for first runs, debugging, or when Postgres is unavailable) and under --dry-run.
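Reading this configuration could be sketched as below. The variable names come from the table above; the `IngestConfig` type, the `load_config` helper, and every default except the collection name are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class IngestConfig:
    qdrant_url: str
    qdrant_api_key: Optional[str]
    collection: str
    embedding_provider: str
    postgres_dsn: Optional[str]  # None disables the manifest, like --no-manifest

def load_config(env=os.environ) -> IngestConfig:
    """Build config from environment variables, with sketch defaults."""
    return IngestConfig(
        qdrant_url=env.get("QDRANT_URL", "http://localhost:6333"),
        qdrant_api_key=env.get("QDRANT_API_KEY"),
        collection=env.get("QDRANT_COLLECTION", "cerebrum_knowledge_hub"),
        embedding_provider=env.get("EMBEDDING_PROVIDER", "ollama"),
        postgres_dsn=env.get("POSTGRES_DSN"),
    )
```

Leaving POSTGRES_DSN unset naturally disables manifest persistence, which mirrors the --no-manifest behavior described above.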