🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Stars: 27,265
Forks: 1,509
Language: Python
License: MIT
ArchiveBox

ArchiveBox is a self-hosted web archiving solution that preserves content from websites in multiple formats. It captures and stores HTML, PDFs, media, screenshots, and metadata, ensuring your important web content remains accessible long-term without relying on third-party services.

Key Features

  • Multi-Format Archiving: Saves content as HTML, PNG, PDF, WARC, JSON, and SQLite for decades-long readability and accessibility
  • Multiple Input Sources: Imports from browser bookmarks, browser history, Pocket, Pinboard, RSS feeds, and a browser extension
  • Standard Tools Integration: Uses Chrome, wget, and yt-dlp to capture websites, videos, and media content
  • Multiple Access Methods: Web interface, CLI, REST API, Python API, and browser extension for flexible interaction
  • Media Extraction: Automatically detects and extracts embedded content including videos, images, metadata, and comments
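
As a rough illustration of the CLI access method listed above, the following sketch drives the `archivebox add` command from Python. It assumes the `archivebox` executable is installed and that the script runs inside a data directory already set up with `archivebox init`. The helper names `build_add_command` and `archive` are hypothetical, and the `--depth` flag should be checked against `archivebox add --help` for the installed version.

```python
import shutil
import subprocess
from typing import Iterable, List


def build_add_command(urls: Iterable[str], depth: int = 0) -> List[str]:
    """Build an argv list for `archivebox add` (hypothetical helper)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        # Skip blanks and duplicates while preserving input order.
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    if not cleaned:
        raise ValueError("no URLs to archive")
    # Assumption: `--depth=N` is accepted by the installed ArchiveBox version.
    return ["archivebox", "add", f"--depth={depth}", *cleaned]


def archive(urls: Iterable[str], depth: int = 0) -> int:
    """Run the command in the current directory and return its exit code."""
    if shutil.which("archivebox") is None:
        raise RuntimeError("archivebox executable not found on PATH")
    result = subprocess.run(build_add_command(urls, depth), check=False)
    return result.returncode
```

Building the argument list separately from running it keeps the URL cleanup easy to inspect without touching an actual archive; calling `archive(["https://example.com"])` would then hand the cleaned list to the CLI.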

Use Cases

  • Personal Knowledge Management: Archive research papers, articles, and resources for offline access and long-term reference
  • Evidence Preservation: Save copies of web content for legal documentation, compliance, and evidence retention
  • Media Backup: Back up photos from social media and videos from streaming platforms before they disappear
  • Research & Academic: Preserve web-based sources and citations for scholarly work and institutional archives

Who Is It For

ArchiveBox is ideal for individuals, researchers, organizations, and institutions who want to maintain control over their archived web content while ensuring long-term preservation. It's perfect for anyone concerned about digital preservation, data ownership, or maintaining offline access to critical web resources.
