All Official Open License Content in Markdown format on GitHub

Salve Sodales,

Since the great folks at Atlas gave us the Open License (thank you!), I’ve been waiting for a good repackaging of the Corpus - or parts of it - so we can use it more efficiently. It has yet to materialize (and I'm not sure what’s stalling at redcap.org?). Anyway, dis faventibus, facienti:

I’ve been working quite a bit on the problem of accurately and efficiently extracting all Open License PDFs to human-friendly Markdown, including the gnarly old 3e stuff. This included an untold number of tool tests, multiple re-OCRs, various AI-powered dead ends, etc. While not yet perfect, most of the files are now good enough to enjoy - and who knows, someone might want to take it upon themselves to edit one or more of the books for remaining OCR errors, weird headings, broken tables, indexes, etc. and commit it back to the collection, and thereby the community?

EDIT: As for “Why Markdown?”, I realize it may not be obvious to everyone here, but there are several good reasons why Markdown may be the best “forever” format for the open-license text content:

Markdown's greatest practical strength for open-license content is that it is a universal source format. It is plain text, human-readable even without a renderer - and it readily converts to any format (HTML, PDF, EPUB, DOCX, Google Docs, MediaWiki, etc.), is natively version-controlled with Git, and is readable by both humans and machines without ANY proprietary software. It carries no vendor lock-in and is durable across decades.
This write-once, publish-anywhere model means the canonical open-license text can stay clean and unencumbered, while feeding any desired derivatives in the easiest possible manner (compare this with copy/pasting from PDF (everyone’s favorite…?), trying to convert .docx files, or manually reformatting plain-text headings and tables).
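As a concrete (purely hypothetical) illustration of the write-once, publish-anywhere idea, a short Python wrapper around pandoc can fan one Markdown source out to several derivative formats. The filenames and format list here are just examples, and pandoc must be installed:

```python
import subprocess

# Hypothetical fan-out: one Markdown source, several derivative formats.
# The format list and filenames are examples, not the repo's actual tooling.
FORMATS = {"html": "html", "epub": "epub", "docx": "docx", "mediawiki": "wiki"}

def pandoc_cmd(src: str, fmt: str, ext: str) -> list[str]:
    """Build the pandoc command line for one target format."""
    out = src.rsplit(".", 1)[0] + "." + ext
    return ["pandoc", "-f", "markdown", "-t", fmt, src, "-o", out]

def export_all(src: str) -> None:
    """Convert src to every format in FORMATS (pandoc does the heavy lifting)."""
    for fmt, ext in FORMATS.items():
        subprocess.run(pandoc_cmd(src, fmt, ext), check=True)
```

Usage would be something like `export_all("book.md")`; the point is that the Markdown file is the single canonical source and everything else is generated.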

You can open a Markdown file directly in Google Docs, Word, or (modern) Notepad - or use VS Code, Typora, Notepad++ (with a plugin), etc. - or copy/paste it as-is to websites (like GitHub), blogs (WordPress, Medium), notes apps (Notion, Obsidian), chats (Slack, Discord), and forums (this one!), as they all support Markdown natively.


Linkback to announcement thread:


This is fantastic! Thank you!

Although I can't be the only person whose initial reaction was "why only Corpus?"…


Which is probably very useful for … something?

I feel like the dog shown a card trick.


That's amazing, thanks!

Redcap.org is SLOWLY uploading the material, as seen in the SRD and the progress log.

The main holdup is that I couldn't find tools I could rely on to produce the markup reliably and well - I'm glad you were able to invest the time, effort, and skill to make it happen.

Note also the similar project:

It's not as extensive, and it uses a different markup. At a glance, the markup in your version appears generally better, but it lacks good handling of insets, and the table markup may be harder to paste into MediaWiki format.

@YR7
Yeah, I was aware of this and had a look at the files earlier. Text-only was not quite enough for my taste (and there was plenty of other garbage left over from extraction, as well as split files), so I just started from scratch to get to a point where much less human polish would be needed.

You're absolutely right that the inserts, tables (and headings) are all over the place - this happens with any extraction (text, Markdown, or otherwise), as it is the nature of the legacy PDF format (really unfriendly to anything else, and often out of order with itself). The Markdown extraction here (at least for the newer files) is likely in a better state than most others. Some manual edits will be needed. Inserts tend to end up wherever and will need to be moved manually. Inserts with a dark or black background are sometimes lost (at least if the PDF needed to be OCRed, as is the case with Lions of the North), requiring a manual fix.
Just the nature of the beast, I guess - the PDF format is definitely of the Infernal realm.

Fixing the tables and aligning the inserts in the Markdown should quickly fix that problem. Tables are dirt simple in Markdown, but they do require that manual touch. Also, the OCR (where required) should be much better, as it's done with the latest generation.
Naturally, I have yet to check everything in every file - but I can fairly easily try an alternate extraction if one is in particularly bad shape (for the 3e supplements I'm more or less at the end of the road, unless I take my physical books and re-scan from scratch).


If you or anyone wants, I can also put up a pure text extraction version of the corpus.

Well, I mean, MediaWiki markup will help with uploading to redcap.org, but I suppose a converter can quite easily be written to change the text between the two markup formats. Probably one already exists.

I find it cuts some text in weird ways. See Hedge Magic Chapter Two (Elementalism) - see the "Elemental Fire" minor supernatural virtue, how it is separated into a few parts, part of it within the next virtue. See the end of page 17 ("can be reduced to an elemental philoso-"), which should continue with "pher...", found much later; the following middle column is then also found elsewhere.

The line "this chapter. Selecting this Virtue gives the" disappeared completely.

Yeah, PDF format tends to do that with text flow - especially in more complex layouts. It's been a bit of a forever problem with text extraction. Part of the Infernal nature I assume.

Anyway, it seems Hedge Magic is a completely borked file overall, with tons of garbage (I didn't check it manually until now). The file size should have tipped me off. I'll see if I can find a way around it with another extraction.
Any other file that's dastardly?

EDIT: I uploaded two new versions with redone auto OCR and one with complete forced OCR. Both seem better. Feel free to have a look @YR7 and let me know if sorted.


Pandoc allegedly has a mature Markdown → MediaWiki writer that correctly handles headings, lists, tables, code blocks, links, images, and many extensions. A simple one-off command for a single file looks like:
pandoc -f markdown -t mediawiki input.md -o output.wiki

I'd be happy to add a workflow to provide all files in mediawiki output for everything - especially as cleaned up markdown files start happening - if it'd help you out too.


I decided to do a quick test of the MediaWiki conversion, but I'm not sure what to view it in. I made two varieties of conversion of two supps: one of the not-yet-done Lions of the North (from the wip file, which has multiple remaining issues), and one of the raw Guardians of the Forest (which looked reasonably clean). Let me know whether these are serviceable, and which conversion method is better, if either.

I uploaded the two GotF versions to two temporary pages of Project Redcap:

The only difference I see is that version 2 (gfm) correctly converted the link to www.atlas-games.com on the opening page; version 1 (straight conversion) did not. Both versions converted the links at the end of the document identically.

Other than that, I think both versions are essentially identical; there is no real difference.

Further automation could surely be done to remove the "File:...jpeg" links; even I can do that. Perhaps also institute a standard formatting for stat blocks, including a line break before the bolded headings (e.g., the Characteristics of "Sordus, a Basilisk"). There is also the issue of uniting broken-off lines, e.g. the "Sight of the True Form" spell of Philipus Niger, which perhaps can be automated with AI, but I'm afraid of hallucinations. This is even more difficult when an inset gets in the way, as with the lines before and after "Story Seeds: The Courts of the Seasons".
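The narrower half of that - re-joining words hyphenated across old line breaks - can at least be attempted without AI. This is just a sketch, not anything from the repo's actual tooling: it only merges when the continuation starts with a lowercase letter, so sentence starts and proper nouns are left alone.

```python
import re

# Join "philoso-\npher" -> "philosopher". Only merges when the fragment
# after the break starts with a lowercase letter, so line-broken proper
# nouns are untouched. Caveat: it WILL also close up genuinely hyphenated
# compounds that happened to break at end of line, so diff-review the result.
HYPHEN_BREAK = re.compile(r"(\w+)-\n([a-z]\w*)")

def dehyphenate(text: str) -> str:
    """Merge words split by a hyphen followed by a line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)
```

It won't help with the harder case where an inset interrupts the text flow entirely; that still needs a human.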

Another issue may be the division into header types - I'm not sure this can be easily automated. Right now the text's title levels are... weird.

Overall - the text is very good as-is. It needs some manual attention to improve formatting and text-flow, but quite little.

One thing I'm not sure about: it converts the entire book, whereas our old, manual procedure divided things up into separate chapters. I'm not sure which is actually better to have on Project Redcap.

That sounds about right. gfm should handle links better, and it seems there were no other drawbacks.

I've cleaned all the files of image links, spans, and a lot of junk. The only HTML we need to keep is `<br>`, because Markdown has no native way to do multi-line table cells. I've checked, and it now only remains in tables - the standard Markdown renderers handle it fine.
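For anyone wondering what that looks like, here is a made-up example of a multi-line table cell; GitHub-style renderers treat the `<br>` as a line break inside the cell:

```markdown
| Entry           | Description                                         |
| --------------- | --------------------------------------------------- |
| Example Virtue  | First line of the entry.<br>Second line, same cell. |
```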

I've also fixed all the 3e and 4e supplements manually, with the back-cover text at the beginning and some other minor fixes.

Unfortunately, the page-break line breaks showing up are a PDF artifact that I found difficult to automate away, so manual fixing is likely needed. The same goes for text boxes, and for certain text sections ending up in the wrong order (PDF shenanigans).

I also managed to miss Mythic Iceland (I only had the extension PDF with a similar name). That's sorted now, so we should be complete.

Anyway, should be a great jump in usable quality now, so check out the "wip" folder files.


I see the wip files there - how do I convert them to MediaWiki format?

In the wip-mediawiki folder, you'll find everything from the wip folder converted (via pandoc in gfm mode) to MediaWiki format. This is for convenience; I can't take submissions for fixes on these.
I wrote a simple Python script for it, but you can replicate it via the command line:
pandoc -f gfm+footnotes+raw_html+tex_math_dollars+autolink_bare_uris -t mediawiki [infile] -o [outfile]
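For anyone who wants to replicate the batch step, a minimal sketch of such a script might look like this (the folder names are taken from the repo layout described above, and pandoc must be on the PATH; this is not the author's actual script):

```python
import pathlib
import subprocess

# Folder names follow the repo layout described above (wip / wip-mediawiki).
SRC = pathlib.Path("wip")
DST = pathlib.Path("wip-mediawiki")
PANDOC_FROM = "gfm+footnotes+raw_html+tex_math_dollars+autolink_bare_uris"

def convert_cmd(md: pathlib.Path) -> list[str]:
    """Build the pandoc command line for one Markdown file."""
    out = DST / md.with_suffix(".wiki").name
    return ["pandoc", "-f", PANDOC_FROM, "-t", "mediawiki", str(md), "-o", str(out)]

def main() -> None:
    DST.mkdir(exist_ok=True)
    for md in sorted(SRC.glob("*.md")):
        subprocess.run(convert_cmd(md), check=True)

if __name__ == "__main__":
    main()
```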

The reality is that, unlike Markdown (which has dozens of options and is easy even in plain text editors), MediaWiki markup has basically no standalone offline renderers (syntax highlighters don't count for reading) because:

  1. Complex parser functions and templates require the full MediaWiki engine (not needed for this)
  2. Most tools assume you're editing within an existing wiki installation

May I suggest we get the Markdown files into their full desired shape (text and tables in the right places, headings the right size, OCR typos gone, etc.) BEFORE converting it all to MediaWiki - Markdown is the simpler, good-enough, and far more convenient format to work in. That basically saves the work from having to be done in two places, and it keeps future flexibility.

I can recommend trying the VS Code or Notepad++ plugins, but easy mode is Typora, which is a great editor/viewer whose nag-screen trial you can extend indefinitely. (Note that there's also a simple config hack to make it accept .md files larger than 2 MB, done in a second with a text editor. This was needed to edit/view the Definitive Edition, which is just above that limit.)

You also asked about chapter division vs. whole book. Since most of the books are under 1 MB, we're in the modern age, and you can link to a relevant section just as well on "one page", please put each book on a single page. No one will notice the load difference anymore, and it enables quick Ctrl+F searching through the whole book. Best of all, you can just scroll - no need to click around between pages.


Alright. I started working on GotF - how do I upload a new version to the wip folder?

The best way to do this:

Forking and uploading a new file takes three clicks via GitHub's web interface - no Git commands needed.

Steps for Contributors

  1. Fork the repo

  2. Upload file to your fork

    • In your new fork, navigate to target folder (or stay at root)

    • Click Add file > Upload files

    • Drag/drop file(s) or click "choose your files"

    • Add commit message (e.g., "Fixed GotF")

    • Click Commit changes

  3. Create pull request

    • Yellow banner appears: "This branch is 1 commit ahead"

    • Click Contribute > Open pull request

    • Add description, click Create pull request

    • Original repo owner reviews and merges

This lets me review every edit very easily as colored diffs and simply accept fixes to files. It's the best way - and anyone with a GitHub account can do it. Note: you can even edit files directly on GitHub once you’ve forked.

Anyway, if this seems overwhelming (it took me a few minutes the first time, I admit), feel free to just dump the file here, or create an “issue” in the repo and attach the file to it.

Done. I hope. Let me know if not / anything-else.