// project 05

Arxiv Sanity Preserver

Overview

A Rust-first rebuild of the original arxiv-sanity: a single binary that runs a full arXiv ingestion pipeline and serves a modern web UI for paper search, similarity browsing, and SVM-based personalized recommendations.

Open Source, Rust, Axum, SQLite, TF-IDF, HNSW, Docker

Arxiv Sanity Preserver is a self-hosted research tool for navigating the arXiv corpus without drowning in it. It is a ground-up rewrite of Andrej Karpathy's original Python/Flask project, rebuilt in Rust as a single CLI binary that bundles both the data pipeline and the web server.

The pipeline runs in sequential stages: fetch paper metadata from the arXiv Atom API, download PDFs, extract text with pdftotext, generate thumbnails via ImageMagick, build TF-IDF vectors for the full corpus, and index them with an HNSW (Hierarchical Navigable Small World) graph for fast approximate nearest-neighbor search. Each stage is a standalone subcommand, and a run-all command chains them in order. Pipeline output lands in a local .pipeline/ directory, and the only database is a bundled SQLite file that holds user accounts and saved libraries, so there is no external database server to run.
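To make the TF-IDF stage concrete, here is a minimal sketch of how sparse TF-IDF vectors and cosine similarity could be computed using only the standard library. It is not the project's actual code: the function names (tokenize, tfidf, cosine), the whitespace-and-punctuation tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase the text and split it on non-alphanumeric characters.
/// (Hypothetical tokenizer; a real pipeline would likely also drop stop words.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Build one sparse, L2-normalized TF-IDF vector per document.
/// Assumed weighting: raw term count times ln((1 + N) / (1 + df)) + 1.
fn tfidf(docs: &[&str]) -> Vec<HashMap<String, f64>> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = docs.len() as f64;

    // Document frequency: in how many documents each term appears.
    let mut df: HashMap<String, usize> = HashMap::new();
    for toks in &tokenized {
        let unique: HashSet<&String> = toks.iter().collect();
        for t in unique {
            *df.entry(t.clone()).or_insert(0) += 1;
        }
    }

    tokenized
        .iter()
        .map(|toks| {
            // Term counts for this document.
            let mut vec: HashMap<String, f64> = HashMap::new();
            for t in toks {
                *vec.entry(t.clone()).or_insert(0.0) += 1.0;
            }
            // Scale each count by the term's inverse document frequency.
            for (term, w) in vec.iter_mut() {
                let idf = ((1.0 + n) / (1.0 + df[term] as f64)).ln() + 1.0;
                *w *= idf;
            }
            // Normalize to unit length so a dot product is a cosine similarity.
            let norm = vec.values().map(|w| w * w).sum::<f64>().sqrt();
            if norm > 0.0 {
                for w in vec.values_mut() {
                    *w /= norm;
                }
            }
            vec
        })
        .collect()
}

/// Cosine similarity of two unit-length sparse vectors (a plain dot product).
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    a.iter()
        .filter_map(|(t, wa)| b.get(t).map(|wb| wa * wb))
        .sum()
}
```

Comparing every pair of papers this way is exact but scales quadratically with corpus size. The HNSW index exists to avoid that cost, trading a small amount of accuracy for much faster neighbor lookups.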

The web UI, served by Axum, exposes search, topic browsing, a personal library, and a recommendation page. Recommendations are generated by training a lightweight SVM on each user's saved papers, producing a ranked list of papers the model predicts the user would find relevant. Citation counts and publication metadata are pulled from the OpenAlex API and folded into an impact score surfaced in the UI.
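The recommendation step can be illustrated with a small linear SVM trained by subgradient descent on the hinge loss. This is a sketch under stated assumptions, not the project's implementation: saved papers are labeled +1 and all others -1, papers are represented as dense feature vectors for brevity, and the names train_linear_svm and rank, the learning rate, and the regularization strength are all hypothetical.

```rust
/// Dot product of two equal-length dense vectors.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Train a linear SVM (hinge loss + L2 penalty) with per-sample subgradient steps.
/// `ys` holds +1.0 for saved papers and -1.0 for everything else (assumed labeling).
/// `xs` must be non-empty. Returns the weight vector and the bias.
fn train_linear_svm(xs: &[Vec<f64>], ys: &[f64], lambda: f64, epochs: usize) -> (Vec<f64>, f64) {
    let dim = xs[0].len();
    let mut w = vec![0.0; dim];
    let mut b = 0.0;
    let lr = 0.1; // fixed learning rate, chosen only for illustration

    for _ in 0..epochs {
        for (x, &y) in xs.iter().zip(ys) {
            let margin = y * (dot(&w, x) + b);
            let violated = margin < 1.0;
            for j in 0..dim {
                // The L2 penalty always shrinks w; the hinge term only
                // contributes when the sample is inside the margin.
                let hinge = if violated { y * x[j] } else { 0.0 };
                w[j] -= lr * (lambda * w[j] - hinge);
            }
            if violated {
                b += lr * y;
            }
        }
    }
    (w, b)
}

/// Return candidate indices ordered from highest to lowest SVM score.
fn rank(w: &[f64], b: f64, candidates: &[Vec<f64>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f64)> = candidates
        .iter()
        .enumerate()
        .map(|(i, x)| (i, dot(w, x) + b))
        .collect();
    scored.sort_by(|a, c| c.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().map(|(i, _)| i).collect()
}
```

In a setup like the one described, the feature vectors would presumably be the TF-IDF vectors from the pipeline, and the model would be retrained whenever the user's library changes.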

This version diverges from the original in a few deliberate ways. The Python backend and multi-user hosting model have been replaced: the app runs in single-user mode by default, meaning each person self-hosts their own instance rather than sharing a server. This removes the operational overhead of managing accounts across users and makes the deployment footprint trivial -- a single Docker container or a cargo build --release is enough. The old CLI management scripts are gone; everything is one binary with subcommands. The UI has been modernized from the original's dated look while keeping the same core browsing model.
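The one-binary-with-subcommands layout might be organized roughly as follows. Only run-all is named in the description above. Every other subcommand name here, along with the Stage enum and the parse_command function, is a hypothetical placeholder rather than the binary's real interface.

```rust
/// One variant per pipeline stage, listed in the order the stages run.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stage {
    Fetch,       // paper metadata from the arXiv Atom API
    Download,    // PDFs
    ExtractText, // pdftotext
    Thumbnails,  // ImageMagick
    Tfidf,       // TF-IDF vectors
    Index,       // HNSW graph
}

/// The full chain executed by `run-all`.
const RUN_ALL: [Stage; 6] = [
    Stage::Fetch,
    Stage::Download,
    Stage::ExtractText,
    Stage::Thumbnails,
    Stage::Tfidf,
    Stage::Index,
];

/// Map a subcommand name to the stages it should run.
/// All names except "run-all" are assumed for illustration.
fn parse_command(arg: &str) -> Option<Vec<Stage>> {
    match arg {
        "fetch" => Some(vec![Stage::Fetch]),
        "download" => Some(vec![Stage::Download]),
        "extract-text" => Some(vec![Stage::ExtractText]),
        "thumbnails" => Some(vec![Stage::Thumbnails]),
        "tfidf" => Some(vec![Stage::Tfidf]),
        "index" => Some(vec![Stage::Index]),
        "run-all" => Some(RUN_ALL.to_vec()),
        _ => None,
    }
}
```

Keeping the stage order in a single constant means run-all and any scheduled re-crawl share one definition of the pipeline.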

Hermes (my homelab) is the intended deployment target: the container runs persistently, the pipeline re-crawls on a schedule, and the SQLite library persists between restarts. The result is a private, fast, self-contained arXiv reader tuned to a personal research focus.