This is a public dataset of Google Docs that was used to test my Google Docs HTML parser.
This was a stab at extracting metadata and content from a snapshot of a document's HTML rather than
visiting the document in a browser. As Google Docs is an interactive, multiplayer application, a document's properties
are held within minified script tags.
It turns out you can extract a lot of interesting information, like:
- Title
- Creation date
- Number of revisions
- Snippets of the document's content
- Links that are embedded in the document
- Image URLs that are embedded in the document (although these are always hosted on Google's servers and are unique for every image).
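Since the document's properties sit inside minified script tags, extraction largely comes down to pattern-matching over the saved HTML. Below is a minimal Go sketch of the link-extraction idea — the regex and the sample payload are illustrative assumptions, not the actual parser, and real payloads may need JavaScript string unescaping first.

```go
package main

import (
	"fmt"
	"regexp"
)

// extractLinks pulls URLs out of the HTML of a saved Google Docs page,
// including those buried in minified <script> payloads. This is a sketch
// of the general approach; real payloads may escape URLs (e.g. "\/"),
// which would need unescaping before matching.
func extractLinks(html string) []string {
	// Match runs of URL-ish characters, stopping at quotes, backslashes,
	// or whitespace, which is where minified JS string literals break.
	re := regexp.MustCompile(`https?://[^"\\\s]+`)
	return re.FindAllString(html, -1)
}

func main() {
	sample := `<script>DOCS_modelChunk = [{"s":"see https://example.com/page for details"}]</script>`
	fmt.Println(extractLinks(sample))
}
```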
You can read my blog post on this effort.
How do you collect this data?
This dataset leverages commoncrawl
and urlscan as the primary sources of public Google Docs.
Both commoncrawl and urlscan archive a page's HTML on their servers.
The HTML archived by these third parties is parsed to extract each document's metadata and content.
No data is scraped directly from Google servers, and all data held here is already publicly
available on the internet in an unparsed format.
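To give a flavour of the collection step: commoncrawl exposes a public CDX index API that returns one JSON object per capture, which tells you which WARC file (and byte range) holds the archived HTML. The sketch below shows how one might build such a query and parse a result line — the crawl ID, URL pattern, and sample record are illustrative, and this is not the project's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// cdxRecord holds the subset of fields from a Common Crawl CDX index
// line needed to locate a capture inside a WARC file.
type cdxRecord struct {
	URL      string `json:"url"`
	Filename string `json:"filename"`
	Offset   string `json:"offset"`
	Length   string `json:"length"`
}

// indexQuery builds a CDX API URL for one crawl. The crawl ID here is
// an example; pick a current one from index.commoncrawl.org.
func indexQuery(crawlID, pattern string) string {
	q := url.Values{}
	q.Set("url", pattern)
	q.Set("output", "json")
	return "https://index.commoncrawl.org/" + crawlID + "-index?" + q.Encode()
}

// parseCDXLine decodes one JSON line of the index response.
func parseCDXLine(line []byte) (cdxRecord, error) {
	var r cdxRecord
	err := json.Unmarshal(line, &r)
	return r, err
}

func main() {
	fmt.Println(indexQuery("CC-MAIN-2024-10", "docs.google.com/document/*"))
	// A hypothetical index line, for illustration only.
	rec, _ := parseCDXLine([]byte(`{"url":"https://docs.google.com/document/d/abc","filename":"example.warc.gz","offset":"123","length":"456"}`))
	fmt.Println(rec.URL, rec.Offset)
}
```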
How does this work under the hood?
The raw HTML of Google Docs pages is retrieved from commoncrawl and urlscan, parsed, and then ingested into a SQLite database.
It's all written in Go. Check out the source code.
At the moment, all collection requires a manual trigger.
The web server and database are hosted on a computer in my home in the UK, so availability and performance are not guaranteed.
Is there an API?
A basic one, yes. Both the query page and document view can return JSON.
The query page is newline-delimited JSON (NDJSON).
Simply add a format=json parameter to your requests. For example:
https://dochunt.kmsec.uk/?q=%E7%A2%BA%E8%AA%8D%E3%81%99%E3%82%8B&format=json
Adding ?format=json to a document endpoint returns a
single JSON object that discloses all instances of this document in the
corpus. For example:
https://dochunt.kmsec.uk/d/1_G0BCG2pd-6JvmGVeflBM5xVovAPzHniwKygSAttG48?format=json
Part of my motivation here was to test whether the parser works against a large corpus of raw HTML documents. It works 98% of the time (a number based entirely on vibes and completely made up).
However, when parsing document content, there is no
guarantee that it is complete or in the correct order.
This is due to my preference for "good enough" rather than perfect parsing (see also: "skill issue").
On a document's page, you may see several revision numbers.
Throughout the lifetime of a document, it may undergo several revisions by editors.
A larger revision number means more edits to a document.
Observers (commoncrawl, urlscan, your eyes) may see different revisions of the same document when visiting it over
time, so we can use the revision as a chronological marker, which helps inform us of the document's lifecycle.
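Ordering captures this way can be sketched in a few lines of Go. The struct and field names below are illustrative, not the project's actual types:

```go
package main

import (
	"fmt"
	"sort"
)

// observation pairs a source's capture of a document with the revision
// number parsed from that capture. Field names are illustrative.
type observation struct {
	Source   string // e.g. "commoncrawl", "urlscan"
	Revision int
}

// byRevision orders observations oldest-first: a higher revision number
// means the document had received more edits when it was captured.
func byRevision(obs []observation) []observation {
	sorted := append([]observation(nil), obs...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Revision < sorted[j].Revision
	})
	return sorted
}

func main() {
	obs := []observation{{"urlscan", 57}, {"commoncrawl", 12}}
	fmt.Println(byRevision(obs))
}
```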
I'm not a developer, but as a threat intelligence analyst, I love turning raw data into actionable content.
On parental leave I had the unique combination of limited, sporadic focus time, extra mental capacity, and a mind free to pursue something completely
meritless. This was something I could pick up and work on when I needed a bit of cognitive stimulation.
I had also only recently discovered commoncrawl (which truly is a modern treasure trove), and I wanted to get familiar with it.
95% of the code is written by me, and any professional Go developer would see that from examining (aka roasting) my code.
I use AI to get over implementation humps or to scratchpad, but this was a hobby project and I truly enjoy the struggle of creation.
Can I request a takedown?
Yes. This is a hobby project of mine, but I will conduct manual takedowns in as timely a manner as I can.
Be mindful that this data is merely a refinement of what can be found on third party websites (namely commoncrawl, urlscan, or Google Docs if the document is still publicly accessible).
Check out the contact details on the footer of my website or create an issue on GitHub.
Please provide the document ID and come prepared with a justification for takedown -- for example the presence of sensitive personally identifiable information (PII), or illegal, offensive, or harmful content.