VectorOps Know - Intro

TLDR

VectorOps Know is an extensible code-intelligence helper library. It scans your repository, builds a language-aware graph of files / packages / symbols and exposes high-level tooling for search, summarisation, ranking and graph analysis to LLMs.

Intro

It just so happened that I settled (for now) on Aider as my AI coding tool. Overall, it works fine, gives full control over the coding process, and I’ve come up with a coding loop that mostly satisfies me. I do all the thinking; the LLM fills in the gaps and writes the code.

Aider has a lot of good ideas baked in:

  • Follows the “explicit is better than implicit” mantra
  • Full control over the LLM context size, allowing a developer to decide which files are relevant to the problem at hand.
  • Repomap, which provides additional context to LLMs when the developer doesn’t add all possible files.
  • Ready to go with a default configuration, but without hiding complexity.
  • Doesn’t offload a lot of decision making to the LLM, so it’s very predictable.

I particularly like Aider’s repomap feature. Aider parses project files with tree-sitter, runs queries over AST trees to extract symbol definitions and references, and then applies a recommendation algorithm to find relevant files for symbols or file names mentioned in the source prompt. File contents are then trimmed and added to the LLM context.

But as with any tool, Aider is not perfect. Sometimes I would prefer the tool to find and offer a list of files to change on its own. Sometimes I would want it to validate proposed changes against another LLM-enabled linting step. Aider’s fixed workflow feels limiting for simpler things.

That being said, I’ve wanted to do a hobby project in the AI space for a while, mainly to keep up with the latest developments. Then I thought: why not build something that is useful and that I can use myself?

The result is VectorOps, and VectorOps Know is the first project under its umbrella. And yes, it’s mostly coded by Aider.

VectorOps Know

In a nutshell, Know is a source code analysis tool built on top of a collection of parsers. The tool extracts various metadata from project files (imports, exports, top-level symbols, docstrings, comments, etc.) and stores it in a database.

The database is then used by internal tools that provide Python, OpenAI function-calling, and MCP interfaces to expose the collected data to an LLM or an AI tool.

Right now there are three “main” tools:

  1. The search tool takes a free-form query and runs semantic search over parsed symbols.
  2. The repomap tool takes a list of symbol names and/or file names and returns relevant files that are worth adding to the LLM context.
  3. The file summary tool returns file summaries for the requested paths.

The tools use built-in code summary generation capabilities to return symbol signatures whenever possible, and optionally include preceding comments and/or documentation strings.

For example, the following dummy Python snippet:

class Foo(Bar):
    A = 10

    # Pass me 10
    def __init__(self, a: int):
        "Expect to receive 10"
        self.a = a
        self.b = a * 10
        self.c = a * 20
        

will be summarized as:

class Foo(Bar):
    A = 10

    def __init__(self, a: int):
        "Expect to receive 10"
        ...

Summaries are lossy by design and should only be used for project discovery tasks.

Repository Parsing

File parsing is very straightforward: a recursive, multi-threaded directory walk that honors .gitignore entries. If a file has a known extension, it’s parsed into individual nodes and symbols, which are then persisted in the database.
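
Conceptually, the walk boils down to something like the sketch below. This is a simplified stand-in rather than Know’s actual scanner; it assumes the pathspec package for .gitignore matching and a hypothetical parse_file callable.

import pathlib
from concurrent.futures import ThreadPoolExecutor

import pathspec  # assumption: gitignore-style matching via the pathspec package

# Illustrative subset of supported extensions.
KNOWN_SUFFIXES = {".py", ".go", ".ts", ".tsx", ".js", ".jsx", ".md", ".txt"}


def load_gitignore(root: pathlib.Path) -> pathspec.PathSpec:
    # Compile .gitignore patterns; an empty spec matches nothing.
    lines: list[str] = []
    gitignore = root / ".gitignore"
    if gitignore.exists():
        lines = gitignore.read_text().splitlines()
    return pathspec.PathSpec.from_lines("gitwildmatch", lines)


def scan_repo(root: pathlib.Path, parse_file) -> None:
    # Walk the repo and parse every non-ignored file with a known extension.
    spec = load_gitignore(root)
    candidates = [
        p for p in root.rglob("*")
        if p.is_file()
        and p.suffix in KNOWN_SUFFIXES
        and not spec.match_file(str(p.relative_to(root)))
    ]
    # Parse files in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(parse_file, candidates))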

I decided to go with rich metadata extraction and this required writing a custom parser for each supported file type. At the time of writing, Know supports:

  • Python
  • Go
  • TypeScript / TSX
  • JavaScript / JSX
  • Markdown
  • Text files

Source code symbols are mostly parsed at the top level. For example, methods of a class are parsed as separate symbols, but method implementations are not dissected into separate syntax nodes. This is just enough to generate file and symbol summaries while staying at the interface level.
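
Know’s parsers are built on tree-sitter, but the “interface level only” idea can be illustrated with Python’s standard ast module (a rough stand-in, not Know’s parser):

import ast

source = '''
class Foo(Bar):
    A = 10

    def __init__(self, a: int):
        "Expect to receive 10"
        self.a = a
'''

module = ast.parse(source)

# Walk only the top level; class bodies yield their methods,
# but method bodies are never descended into.
for node in module.body:
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        print("function", node.name, repr(ast.get_docstring(node)))
    elif isinstance(node, ast.ClassDef):
        print("class", node.name)
        for item in node.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                print("  method", item.name, repr(ast.get_docstring(item)))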

Markdown files are split at the section level and, like text files, are chunked using a simplistic recursive algorithm (paragraph -> sentence -> phrase -> word) to fit within the embedding model’s context window.
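
A minimal sketch of that kind of recursive splitter (illustrative only; Know’s actual chunker, separators and size limits differ):

SEPARATORS = ["\n\n", ". ", ", ", " "]  # paragraph -> sentence -> phrase -> word


def chunk(text: str, max_chars: int, level: int = 0) -> list[str]:
    # Recursively split on coarser-to-finer separators until every chunk fits.
    if len(text) <= max_chars or level >= len(SEPARATORS):
        return [text]
    chunks: list[str] = []
    for part in text.split(SEPARATORS[level]):
        chunks.extend(chunk(part, max_chars, level + 1))
    return chunks  # a real chunker would also re-merge small neighbouring pieces


print(chunk("First paragraph.\n\nSecond, much longer paragraph. With two sentences.", 30))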

You can find the data model here.

Storage Backends

All data is stored locally. Know has a pluggable storage backend subsystem, and right now there is only one implementation: an in-memory or on-disk DuckDB database.

DuckDB’s fts and vss extensions are used to provide the full-text and vector search capabilities.
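
For reference, enabling the two extensions looks roughly like this (a hedged sketch following the public DuckDB extension docs, not Know’s actual schema; the table and columns are made up):

import duckdb

con = duckdb.connect()  # in-memory database; on-disk HNSW persistence is still experimental
con.execute("INSTALL fts; LOAD fts;")
con.execute("INSTALL vss; LOAD vss;")

# Hypothetical table with a fixed-length 1024-dim embedding column.
con.execute("CREATE TABLE nodes (id VARCHAR, body VARCHAR, vec FLOAT[1024])")

# BM25 full-text index over the body column (must be rebuilt after changes).
con.execute("PRAGMA create_fts_index('nodes', 'id', 'body')")

# In-memory HNSW index used for cosine-similarity search.
con.execute("CREATE INDEX nodes_hnsw ON nodes USING HNSW (vec) WITH (metric = 'cosine')")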

DuckDB is very capable, but has certain limitations:

  • Any project change requires rebuilding the FTS index.
  • The HNSW search extension keeps its index in memory and requires fixed-length float32 vectors. I chose 1024 as the vector length, so each embedding requires around 4KB of memory.

The HNSW index for a parsed Django project is under 300MB at the moment, partly because not all file types are supported yet. This will grow as Know starts to support more file types.

There are plans to add SQLite (with the fts5 and sqlite-vec extensions) and PostgreSQL (with pgvector) backends too.

Search uses the BM25 algorithm to do full-text search over symbols or text chunks. There are a few customizations that were built on top of the default search functionality:

  • Custom tokenizer that splits camelCase and PascalCase into separate words (see the sketch after this list)
  • Per-language stop words
  • Symbol signature and file path rank boosting
  • Symbol type boosting (methods and functions are preferred over import blocks)
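
The camelCase/PascalCase splitting mentioned above boils down to a small regular expression; a minimal sketch, not Know’s actual tokenizer:

import re

# "DuckDBNodeRepo" -> ["Duck", "DB", "Node", "Repo"]; digit runs become their own tokens.
_CAMEL_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def tokenize(identifier: str) -> list[str]:
    return [tok.lower() for tok in _CAMEL_RE.findall(identifier)]


print(tokenize("DuckDBNodeRepo.search"))  # ['duck', 'db', 'node', 'repo', 'search']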

If vector embeddings are enabled, the tool adds a cosine similarity search and combines the BM25 results using Reciprocal Rank Fusion (RRF). The RRF ratios are configurable, with a 50/50 weight split by default.
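
Reciprocal Rank Fusion itself is only a few lines; a sketch with the commonly used k = 60 constant (the weights correspond to the configurable 50/50 split):

def rrf(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    # Fuse several ranked result lists into one; higher fused score ranks first.
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ids = ["node_a", "node_b", "node_c"]
cosine_ids = ["node_b", "node_d", "node_a"]
print(rrf([bm25_ids, cosine_ids], weights=[0.5, 0.5]))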

Embedding calculators are also pluggable; so far there is one implementation that uses sentence-transformers. Code symbols longer than the embedding context window are split into overlapping chunks, and the resulting embedding vectors are averaged.
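
A hedged sketch of that chunk-and-average step with sentence-transformers (the model name, chunk size and overlap below are arbitrary, not Know’s defaults):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary small model for the example


def embed_symbol(code: str, chunk_chars: int = 2000, overlap: int = 200) -> np.ndarray:
    # Embed long code by averaging the embeddings of overlapping character chunks.
    step = chunk_chars - overlap
    chunks = [code[i:i + chunk_chars] for i in range(0, max(len(code), 1), step)]
    vectors = model.encode(chunks, normalize_embeddings=True)
    return vectors.mean(axis=0)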

I had good results locally on my MacBook M4 Max with the MPS backend and the few embedding models I’ve tried. External embedding providers will also be added later; I plan to add LiteLLM’s embedding APIs, which would allow pretty much any other model to be used.

Know comes with some CLI tools which I’m using to test things. Here’s the output from the search tool run against the project searching for “duckdb node search” without embeddings enabled:

DEBUG:know:2025-08-05 22:25:22 [debug    ] scan_repo finished.            duration=8.240s files_added=87 files_deleted=0 files_updated=0
DEBUG:know:2025-08-05 22:25:22 [debug    ] File processing summary:      
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .go        | Files:    3 | Total:   8.375s | Avg:  2791.72 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .html      | Files:    7 | Total:   0.258s | Avg:    36.86 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .js        | Files:    1 | Total:   0.309s | Avg:   309.15 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .lock      | Files:    1 | Total:   0.456s | Avg:   455.55 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .md        | Files:    2 | Total:   0.468s | Avg:   233.78 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .mod       | Files:    1 | Total:   0.137s | Avg:   136.61 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .org       | Files:    1 | Total:   0.227s | Avg:   226.85 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .py        | Files:   64 | Total:  12.472s | Avg:   194.87 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .sql       | Files:    2 | Total:   0.019s | Avg:     9.68 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .toml      | Files:    1 | Total:   0.271s | Avg:   270.68 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .tsx       | Files:    1 | Total:   0.312s | Avg:   312.43 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: .txt       | Files:    1 | Total:   0.333s | Avg:   333.33 ms/file
DEBUG:know:2025-08-05 22:25:22 [debug    ]   - Suffix: no_suffix  | Files:    2 | Total:   0.657s | Avg:   328.72 ms/file
Interactive symbol search.  Type '/exit' or Ctrl-D to quit.
> duckdb node search
--------------------------------------------------------------------------------
search   (method) (cfbeb42e-34f2-44ef-9031-431cd7d51f6d)
FQN:  know.data.AbstractNodeRepository.search
File: know/data.py

class AbstractNodeRepository(ABC):
    ...
    @abstractmethod
    def search(self, query: NodeSearchQuery) -> List[Node]:
        ...
--------------------------------------------------------------------------------
data_repo   (function) (346adaf7-53fc-4a20-bacf-9b21ff75b1e5)
FQN:  tests.test_symbol_search.data_repo
File: tests/test_symbol_search.py

@pytest.fixture(params=["duckdb"])
def data_repo(request, tmp_path):
    ...
--------------------------------------------------------------------------------
search   (method) (82057f1a-d9e8-4b85-8ae9-997607ef8404)
FQN:  know.stores.duckdb.DuckDBNodeRepo.search
File: know/stores/duckdb.py

class DuckDBNodeRepo(_DuckDBBaseRepo[Node], AbstractNodeRepository):
    ...
    def search(self, query: NodeSearchQuery) -> list[Node]:
        ...
--------------------------------------------------------------------------------
test_symbol_search   (function) (31691667-87fc-480a-aa5b-e4744b5e3e09)
FQN:  tests.test_repositories.test_symbol_search
File: tests/test_repositories.py

def test_symbol_search(data_repo):
    ...
--------------------------------------------------------------------------------
symbol   (method) (95401e8b-3c06-40c7-9c7c-a7b8fcbf4a49)
FQN:  know.stores.duckdb.DuckDBDataRepository.symbol
File: know/stores/duckdb.py

class DuckDBDataRepository(AbstractDataRepository):
    ...
    @property
    def node(self) -> AbstractNodeRepository:
        ...
--------------------------------------------------------------------------------
5 result(s).
>

There are multiple reasons why BM25 found the first two symbols over the concrete DuckDBNodeRepo.search implementation, but all results are relevant to the query.

The Go parser needs some performance profiling too.

Repomap

Repomap is implemented using the NetworkX library. Whenever the project is updated, files, named nodes (symbols), and references are added to an in-memory graph. The user’s query is parsed, known file and symbol names are boosted, and then the PageRank algorithm finds relevant file recommendations. For example, the repomap CLI tool returns this list for the duckdb.py javascript.py query:

know/stores/duckdb.py                                         0.166630
know/lang/javascript.py                                       0.121807
know/data.py                                                  0.070037
know/models.py                                                0.054503
know/parsers.py                                               0.044437
know/lang/typescript.py                                       0.022041
know/project.py                                               0.010818
know/helpers.py                                               0.004803
know/data_helpers.py                                          0.003884
tools/explorer.py                                             0.003610

I excluded file summaries from the output; they will also be included by default.

The tool correctly placed the mentioned duckdb.py and javascript.py files at the top and then surfaced these additional relevant files:

  • data.py contains abstract data access classes that the DuckDB storage backend implements.
  • models.py contains all data models (files, parsed nodes, etc.).
  • parsers.py contains base parser classes, which javascript.py depends on.
  • typescript.py is a closely related language.
  • project.py defines the public project API.
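
To make the ranking step concrete, here is a minimal sketch of personalized PageRank over a reference graph with NetworkX; the toy graph and the boosting scheme are illustrative, not Know’s actual implementation:

import networkx as nx

# Toy graph: files are nodes, import/reference relationships are edges.
G = nx.DiGraph()
G.add_edges_from([
    ("know/stores/duckdb.py", "know/data.py"),
    ("know/stores/duckdb.py", "know/models.py"),
    ("know/lang/javascript.py", "know/parsers.py"),
    ("know/lang/typescript.py", "know/parsers.py"),
])

# Boost the files mentioned in the query via the personalization vector.
mentioned = {"know/stores/duckdb.py": 1.0, "know/lang/javascript.py": 1.0}
ranks = nx.pagerank(G, personalization=mentioned)

for path, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path:45s} {score:.6f}")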

Using APIs

There is a sample chatbot implementation that uses LiteLLM as the API layer. It can be used as an example of how to use Know programmatically: registering the tools as functions with an OpenAI-compatible API or gateway.
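
The gist of that wiring, as a hedged sketch (the schema below is hand-written for illustration; Know’s real tools generate their own OpenAI function schemas):

import litellm

# Illustrative function-calling schema for the search tool.
tools = [{
    "type": "function",
    "function": {
        "name": "vectorops_search",
        "description": "Semantic search over parsed code symbols.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is the DuckDB node search implemented?"}],
    tools=tools,
)

# If the model decides to call the tool, run it and feed the result back.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)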

Know also exposes an MCP implementation. It requires installing FastMCP, which is offered as an optional uv dependency. Once installed, all tools are available via the MCP protocol.
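
For orientation, exposing a tool over MCP with FastMCP looks roughly like this (a generic FastMCP example, not Know’s actual MCP server):

from fastmcp import FastMCP

mcp = FastMCP("vectorops-know-demo")


@mcp.tool()
def vectorops_search(query: str) -> str:
    # Semantic search over parsed code symbols (stubbed out for the example).
    return f"results for: {query}"


if __name__ == "__main__":
    mcp.run()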

Settings

Many aspects of the project are configurable via the Settings class. All options are documented, and the CLI tools use pydantic-settings to populate them.
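
As a generic illustration of that pattern (the field names and environment prefix here are made up, not Know’s actual options):

from pydantic_settings import BaseSettings, SettingsConfigDict


class KnowSettings(BaseSettings):
    # Hypothetical options, populated from the environment / CLI by pydantic-settings.
    model_config = SettingsConfigDict(env_prefix="KNOW_")

    embedding_model: str = "all-MiniLM-L6-v2"
    vector_length: int = 1024
    rrf_bm25_weight: float = 0.5


settings = KnowSettings()  # e.g. KNOW_VECTOR_LENGTH=2048 would override the default
print(settings.model_dump())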

Summary

Overall, I believe the tool will be generally useful. At least, I had a lot of fun playing with MCP-enabled coding tools. Anecdotally, project discovery no longer forces tools to read full files to generate summaries or find relevant parts of the code, thus directly saving on token usage.

Here is the Chat CLI summarizing the project with the OpenAI o3 model, using this prompt: summarize the project files and provide detailed description of the project functionality

Assistant: Project overview  
===============  

The repository contains a complete "code-intelligence" stack that can scan one or more source-code 
repositories, extract a rich graph of symbols, imports, and references, store that graph in DuckDB,
add vector embeddings, and expose the data through a family of "vectorops_*" tools (file listing, 
file summary, node search, repo-map, etc.).

Below is the high-level architecture, followed by a per-package outline.

----------------------------------------------------------------------------------------------------
1. Data model (know/models.py)
----------------------------------------------------------------------------------------------------
• Pydantic models capture every entity:
  – Project, Repo, Package, File  
  – Node (symbol) with kind, visibility, modifiers, source ranges, docstring, signature…  
  – ImportEdge (package-level import) and NodeRef (call/type references inside code)

These classes are pure data – no persistence logic – and are shared by every layer.

----------------------------------------------------------------------------------------------------
2. Persistence layer (know/stores/duckdb.py + know/stores/sql.py)
----------------------------------------------------------------------------------------------------
• DuckDB is used as the default embedded database.  
• Each repository "XXXRepo" (RepoRepo, FileRepo, NodeRepo, etc.) inherits from a generic
  _DuckDBBaseRepo that serializes Pydantic objects into SQL rows, including automatic
  compression of bulky fields.
• Full-text search indexes and custom UDFs are added for fast BM25 scoring.  
• Vector search is implemented with a `array_cosine_similarity` UDF.  
• All repositories implement the abstract interfaces defined in know/data.py so that a
  different backend could be swapped in.

----------------------------------------------------------------------------------------------------
3. Project manager (know/project.py)
----------------------------------------------------------------------------------------------------
`ProjectManager` wires everything together for one logical project:

• Holds the DuckDBDataRepository instance, an optional EmbeddingWorker, and a registry
  of “components” (pluggable background services).  
• Provides helper methods such as `add_repo_path`, `refresh`, `compute_embedding`,
  and mapping between on-disk paths and the “.virtual-path/<repo>” scheme used by the
  tools.  
• Components currently registered:
  – RepoMap (graph-based ranking service)  
  – any future extension that subclasses ProjectComponent.

----------------------------------------------------------------------------------------------------
4. Scanning & parsing (know/scanner.py + know/parsers/*.py)
----------------------------------------------------------------------------------------------------
• The scanner walks every file (with gitignore & custom ignore support),
  computes its hash, decides whether it changed, then delegates to the
  CodeParserRegistry to find a parser.  
• Parsers are built on tree-sitter for: Python, Go, JavaScript, TypeScript/TSX,
  Markdown, plain text, plus a recursive text-chunker for `.txt`.  
• Each parser returns a ParsedFile that contains:
  – ParsedPackage information (physical + virtual path)  
  – a tree of ParsedNodes (symbols) with children relationships  
  – ParsedImportEdges and ParsedNodeRefs (calls / type refs).  
• The scanner stores / updates the data in DuckDB and schedules embedding work for
  new or changed symbols.

----------------------------------------------------------------------------------------------------
5. Embeddings (know/embeddings/*)
----------------------------------------------------------------------------------------------------
`EmbeddingWorker` is a threaded job-queue around any `EmbeddingCalculator`
  (currently a local Sentence-Transformers wrapper).  
• Supports async/sync calls, batching, back-pressure and an LRU SQL cache
  (DuckDB or SQLite).  
• Embeddings are stored directly inside the `Node` row (`embedding_code_vec`).

----------------------------------------------------------------------------------------------------
6. Tools layer (know/tools/*.py)
----------------------------------------------------------------------------------------------------
All tools derive from BaseTool and are automatically registered in ToolRegistry.
Each tool doubles as:

  • A callable Python API
  • An OpenAI “function call” schema
  • A CLI (`tools/*.py`) or web/MCP endpoint

Important tools:

vectorops_list_files: glob pattern -> list of files
vectorops_summarize_files: generate structured summaries (signature list, imports, etc.)
vectorops_search: hybrid (BM25 + cosine) symbol search
vectorops_repomap: random-walk ranking of files starting from seed symbols/paths
There are also optional front-ends:
  - chatcli.py: TUI chatbot that injects RepoMap + Search tools
  - explorer.py: Flask web UI for browsing, searching, and reading summaries
  - mcpserver.py: FastMCP server exposing the same tools over HTTP/WS for LangChain-style agents

----------------------------------------------------------------------------------------------------
7. Settings (know/settings.py)
----------------------------------------------------------------------------------------------------
A large Pydantic `ProjectSettings` hierarchy drives everything (embedding, search
weights, tokenizer choice, chunking limits, etc.).  All CLI entry-points inherit
from it and add their own flags.

----------------------------------------------------------------------------------------------------
8. Utility helpers
----------------------------------------------------------------------------------------------------
• know/helpers.py      – hash helpers, id generation, visibility inference  
• know/tokenizers.py: code/word tokenizers and search preprocessors
• know/file_summary.py: converts a file + its symbols/imports into a Markdown summary
• know/data_helpers.py: functions to post-process search results (hierarchy, parent inclusion, RRF fusion, etc.)

----------------------------------------------------------------------------------------------------
Typical flow
----------------------------------------------------------------------------------------------------
1. `init_project(settings)` creates a DuckDBDataRepository and ProjectManager, starts the
   EmbeddingWorker, registers parsers/helpers and components.  
2. `pm.refresh_all()` walks all configured repositories; modified files are parsed and
   upserted into DuckDB; embeddings are scheduled.  
3. Client (CLI / web / OpenAI tool) invokes e.g. `vectorops_search` → NodeRepo.search
   builds a hybrid SQL query → results returned as NodeSearchResult objects.  
4. For context-building tasks, `vectorops_repomap` performs a personalised random walk on
   the symbol->file graph and returns the most relevant file summaries.

----------------------------------------------------------------------------------------------------
Why the project exists / intended use-cases
----------------------------------------------------------------------------------------------------
• Provide fast, local semantic search across large multi-language codebases.  
• Feed LLM agents with high-quality, size-bounded context (RepoMap + summaries).  
• Enable IDE or chat-based assistants without sending source code to remote servers.  
• Act as a foundation layer that others can extend (new parsers, UIs, backends).

The codebase is intentionally modular: swap DuckDB for Postgres, plug in a vector
database, or add new programming languages by writing new parser subclasses.

----------------------------------------------------------------------------------------------------
File inventory (abbreviated)
----------------------------------------------------------------------------------------------------
know/models.py              – Immutable data structures (≈ AST & graph)  
know/stores/duckdb.py       – DuckDB repositories & SQL helpers  
know/project.py             – ProjectManager / component framework  
know/scanner.py             – File-system scanner & parser orchestrator  
know/parsers/               – Folder with per-language tree-sitter parsers  
know/embeddings/            – Embedding worker, cache and calculators  
know/tools/                 – ListFiles, SummarizeFiles, NodeSearch, RepoMap, etc.  
know/file_summary.py        – Markdown summary generator  
tools/*.py                  – CLI wrappers (chatcli, searchcli, filesummarycli, …)  
tests/parsers/*             – Sample source files used in unit tests

----------------------------------------------------------------------------------------------------
In short
----------------------------------------------------------------------------------------------------
The project is a full-stack, offline code-understanding engine. It scans code, constructs a symbol 
and import graph, augments it with embeddings, saves everything in DuckDB, and exposes ergonomic, 
LLM-ready APIs for file listing, summarization, semantic search, and relevance-ranked context gathering.

[usage] prompt=29500  completion=1963 total=31463  -  est. cost $0.074704  session $0.077172

o3 called the tools a few times and used a total of 31.4K input tokens (versus roughly 116K for the full project, so the context shrank to about a quarter) for a total cost of $0.077. Not too shabby.
