This is a public dataset of Google Docs that was used to test my Google Docs HTML parser.
This was a stab at extracting metadata and content from a snapshot of a document's HTML rather than
visiting the document in a browser. As Google Docs is an interactive, multiplayer application, a document's properties
are held within minified script tags.
It turns out you can extract a lot of interesting information, like:
- Title
- Creation date
- Number of revisions
- Snippets of the document's content
- Links that are embedded in the document
- Image URLs that are embedded in the document (although these are always hosted on Google's servers and are unique for every image).
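Since the document's properties sit inside minified script tags, extraction largely comes down to pattern-matching over the saved HTML. Below is a minimal Go sketch of the link-extraction idea — the regex and the sample payload are illustrative assumptions, not the actual parser, and real payloads may need JavaScript string unescaping first.

```go
package main

import (
	"fmt"
	"regexp"
)

// extractLinks pulls URLs out of the HTML of a saved Google Docs page,
// including those buried in minified <script> payloads. This is a sketch
// of the general approach; real payloads may escape URLs (e.g. "\/"),
// which would need unescaping before matching.
func extractLinks(html string) []string {
	// Match runs of URL-ish characters, stopping at quotes, backslashes,
	// or whitespace, which is where minified JS string literals break.
	re := regexp.MustCompile(`https?://[^"\\\s]+`)
	return re.FindAllString(html, -1)
}

func main() {
	sample := `<script>DOCS_modelChunk = [{"s":"see https://example.com/page for details"}]</script>`
	fmt.Println(extractLinks(sample))
}
```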
You can read my blog post on this effort.
How do you collect this data?
This dataset leverages commoncrawl
and urlscan as the primary sources of public Google Docs.
Both commoncrawl and urlscan archive a page's HTML on their servers.
The HTML archived by these third parties is parsed to extract each document's metadata and content.
No data is scraped directly from Google servers, and all data held here is already publicly
available on the internet in an unparsed format.
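To give a flavour of the collection step: commoncrawl exposes a public CDX index API that returns one JSON object per capture, which tells you which WARC file (and byte range) holds the archived HTML. The sketch below shows how one might build such a query and parse a result line — the crawl ID, URL pattern, and sample record are illustrative, and this is not the project's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// cdxRecord holds the subset of fields from a Common Crawl CDX index
// line needed to locate a capture inside a WARC file.
type cdxRecord struct {
	URL      string `json:"url"`
	Filename string `json:"filename"`
	Offset   string `json:"offset"`
	Length   string `json:"length"`
}

// indexQuery builds a CDX API URL for one crawl. The crawl ID here is
// an example; pick a current one from index.commoncrawl.org.
func indexQuery(crawlID, pattern string) string {
	q := url.Values{}
	q.Set("url", pattern)
	q.Set("output", "json")
	return "https://index.commoncrawl.org/" + crawlID + "-index?" + q.Encode()
}

// parseCDXLine decodes one JSON line of the index response.
func parseCDXLine(line []byte) (cdxRecord, error) {
	var r cdxRecord
	err := json.Unmarshal(line, &r)
	return r, err
}

func main() {
	fmt.Println(indexQuery("CC-MAIN-2024-10", "docs.google.com/document/*"))
	// A hypothetical index line, for illustration only.
	rec, _ := parseCDXLine([]byte(`{"url":"https://docs.google.com/document/d/abc","filename":"example.warc.gz","offset":"123","length":"456"}`))
	fmt.Println(rec.URL, rec.Offset)
}
```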
How does this work under the hood?
The raw HTML of Google Docs pages is retrieved from commoncrawl and urlscan, parsed, and then ingested into a SQLite database.
It's all written in Go. Check out the source code.
At the moment, all collection requires a manual trigger.
The web server and database are hosted on a computer in my home in the UK, so availability and performance are not guaranteed.
Is there an API?
A basic one, yes. Both the query page and document view can return JSON.
The query page is newline-delimited JSON (NDJSON).
Simply add a format=json parameter to your requests. For example:
https://dochunt.kmsec.uk/?q=%E7%A2%BA%E8%AA%8D%E3%81%99%E3%82%8B&format=json
Adding ?format=json to a document endpoint returns a
single JSON object that discloses all instances of this document in the
corpus. For example:
https://dochunt.kmsec.uk/d/1_G0BCG2pd-6JvmGVeflBM5xVovAPzHniwKygSAttG48?format=json
Part of my motivation here was to test whether the parser works against a large corpus of raw HTML documents. It works 98% of the time (a number based entirely on vibes and completely made up).
However, when parsing document content, there is no
guarantee that it is complete or in the correct order.
This is due to my preference for "good enough" rather than perfect parsing (see also: "skill issue").
On a document's page, you may see several revision numbers.
Throughout the lifetime of a document, it may undergo several revisions by editors.
A larger revision number means more edits to a document.
Observers (commoncrawl, urlscan, your eyes) may see different revisions of the same document when visiting it over
time, so we can use the revision as a chronological marker, which helps inform us of the document's lifecycle.
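Ordering captures this way can be sketched in a few lines of Go. The struct and field names below are illustrative, not the project's actual types:

```go
package main

import (
	"fmt"
	"sort"
)

// observation pairs a source's capture of a document with the revision
// number parsed from that capture. Field names are illustrative.
type observation struct {
	Source   string // e.g. "commoncrawl", "urlscan"
	Revision int
}

// byRevision orders observations oldest-first: a higher revision number
// means the document had received more edits when it was captured.
func byRevision(obs []observation) []observation {
	sorted := append([]observation(nil), obs...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Revision < sorted[j].Revision
	})
	return sorted
}

func main() {
	obs := []observation{{"urlscan", 57}, {"commoncrawl", 12}}
	fmt.Println(byRevision(obs))
}
```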
I'm not a developer, but as a threat intelligence analyst, I love turning raw data into actionable content.
On parental leave I had the unique combination of limited, sporadic focus time, extra mental capacity, and a mind free to pursue something completely
meritless. This was something I could pick up and work on when I needed a bit of cognitive stimulation.
I had also only recently discovered commoncrawl (which truly is a modern treasure trove), and I wanted to get familiar with it.
95% of the code is written by me, and any professional Go developer would see that from examining (aka roasting) my code.
I use AI to get over implementation humps or to scratchpad, but this was a hobby project and I truly enjoy the struggle of creation.
Can I request a takedown?
Yes. This is a hobby project of mine, but I will conduct manual takedowns in as timely a manner as I can.
Be mindful that this data is merely a refinement of what can be found on third party websites (namely commoncrawl, urlscan, or Google Docs if the document is still publicly accessible).
Check out the contact details on the footer of my website or create an issue on GitHub.
Please provide the document ID and come prepared with a justification for takedown -- for example the presence of sensitive personally identifiable information (PII), or illegal, offensive, or harmful content.