So I'm a researcher that almost always uses pdfs... Does HTML have the reproduci...

telegtron · on March 21, 2024

The potential to freeze an HTML page in time with minimal changes at render time is already there. [0] Such an ability can even be baked directly into the rendered HTML page so the viewer would be able to download a copy of the page as it is seen at a given time. Other archiving facilities, such as archive.org, take static snapshots of accessible pages if allowed by the publisher of the page and requested by anyone who wants to make that snapshot.

My point is that it is possible to achieve in principle and in practice, albeit that might not be practiced as often as one would like to see.

-------

[0] See SingleFile by gildas at https://addons.mozilla.org/en-US/firefox/addon/single-file/: “Save an entire web page—including images and styling—as a single HTML file.”
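To illustrate the general idea of freezing a page into one file, here is a minimal sketch that inlines locally stored images as base64 data URIs. It is not how SingleFile itself is implemented; the function name `inline_images` and the regex-based matching are assumptions made for illustration, and a real tool would use a proper HTML parser and also handle stylesheets, fonts, and scripts.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Hypothetical helper: matches <img ... src="..."> with a double-quoted src.
# A regex is fragile for real-world HTML; it keeps the sketch short.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> references with base64 data URIs."""

    def replace(match: re.Match) -> str:
        src = match.group(2)
        # Leave remote and already-inlined images untouched.
        if src.startswith(("http://", "https://", "data:", "//")):
            return match.group(0)
        path = base_dir / src
        # If the file is missing, keep the original reference.
        if not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

Calling `inline_images(page_html, Path("saved_page"))` returns HTML whose local images travel inside the file itself, which is the property that makes a single-file snapshot self-contained.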

hju22_-3 · on March 22, 2024

I like SingleFile, but it's not perfect. It usually works just fine, but will occasionally drop the ball depending on the type of JavaScript on the page.

For example, I once backed up a page using it, and while it got all the content, it did not grab the JavaScript necessary for the images to display correctly.

cxr · on March 22, 2024

> Does HTML have the reproducibility that PDF promises? My feeling is that if I store a PDF, it'll look the same in a decade.

Feelings and promises are each one thing. Reality is another. PDF doesn't even look "the same" today. I have serious questions about how often folks who think that PDF is reliably consistent from system to system step outside their bubble and just how diverse their setups are that they're testing on.

> is HTML the same way?

Well, the status quo for copy-and-paste in HTML isn't dogshit, it's comparatively trivial to find and use tools that can thoroughly and exhaustively search your collection (or even write your own), and HTML is a dead simple plain-text format that, if worst comes to worst, you can read with your eyes (unlike needing to run a bunch of inscrutable code from a PostScript subset through an interpreter before you can do anything with it). So, no, I wouldn't call them the same.
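As a rough sketch of the "write your own" option, the following uses only Python's standard library to pull visible text out of saved HTML files and search a directory of them. The names `TextExtractor`, `html_to_text`, and `search_collection` are hypothetical, and the matching is a plain case-insensitive substring test rather than anything resembling a real search index.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def search_collection(root: Path, term: str) -> list[Path]:
    """Return the .html files under root whose visible text contains term."""
    needle = term.casefold()
    hits = []
    for path in root.rglob("*.html"):
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        if needle in text.casefold():
            hits.append(path)
    return sorted(hits)
```

Something like `search_collection(Path("papers"), "reproducibility")` would then list every saved page that mentions the term, with no external dependencies.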

nickpsecurity · on March 21, 2024

Machines and humans can both easily use HTML/XML. Extracting information from PDFs is so much harder that there are deep learning products dedicated to doing it. They still make mistakes, too.

I’d much rather have something akin to the CHM files where everything I need is in one file, easy to analyze, and has good readers.

cygnion · on March 21, 2024

I explored tools to export/interchange PDF to HTML in the KnowledgeGarden app, but the results were not optimal, suffering from non-standard layout and poor typesetting of equations. Publishers of scholarly articles generate web pages of papers, but they're not replicas of PDF files.

Re. self-contained HTML (and slightly off-topic), look at TiddlyWiki [1], which contains data/code/layout all in one interactive, local or hosted HTML file. Extensibility, plugins, and a community of contributors are some key highlights, among others.

[1] https://www.tiddlywiki.com

theGnuMe · on March 21, 2024

I'd like to see PDFs move to Computational Notebooks. One can dream.

gerroo · on March 21, 2024

That'd be so nice. Imagine executing the code for an AI paper and seeing the beautiful visualizations as you read it.