Texts from open licensed sourcebooks

I've written some code to help extract character (creatures, animals, magi) statistics from sourcebooks released under an open license. At some point I realized the same code can be used to get the text out of those sourcebooks too, including headings and sidebars. It's not perfect, and there will be improvements, as some headings still go unrecognized. Finer details of the text, like indentation, bold, italics, lists, tables and so on, are not in scope because they are, as far as I know, far too difficult to parse from PDF.

I have shared them on GitHub:

Code for extraction
Texts of 41 sourcebooks

They are meant to help and speed up development of things. Let the Ars Magica community thrive.

Very cool. Can you consider doing a variant for a wiki format? That would make it very easy to incorporate the files into the Ars Magica Open Content Conversion Tracker in Project Redcap.

Although I suppose a wiki script to just automatically insert the whole book can also be produced, given the files. I'm not sure at all how to do that.

You don't seem to have access to all of the files released as open content (in particular, the older editions). I think I do, but I am not sure how to run the program on the ones you don't have.

I was actually planning to do so. Changing format to wiki is easy to do. I suppose it would be preferable to have text by chapters instead of a whole book in one file?

I have the 5th Edition files from the Definitive Edition BackerKit add-on "Ars Magica 5th Edition Digital Bundle: COMPLETE". As this project is tied to the Foundry VTT Ars Magica 5e compendium, I didn't purchase all of the open licensed sourcebooks. But once extraction works properly, I can guide you through running the code on your set of files.

Yes, I believe that would work better.

Split into chapters and formatted for MediaWiki. There are some problems with chapters, and sidebar detection may be wrong for some sourcebooks. But those can be fixed later if needed.
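For anyone curious what "split into chapters and formatted for MediaWiki" can look like, here is a minimal sketch. It is not the code from the repository. The heading pattern `CHAPTER_RE` and the function names `split_chapters` and `to_mediawiki` are made up for illustration, and it assumes chapter headings arrive as plain lines such as "Chapter One: Childhood".

```python
import re

# Hypothetical heading pattern; real sourcebooks would need their own rules.
CHAPTER_RE = re.compile(r"^Chapter \w+: (?P<title>.+)$")


def split_chapters(lines):
    """Group extracted text lines into (title, body_lines) pairs."""
    chapters = []
    current = None
    for line in lines:
        match = CHAPTER_RE.match(line)
        if match:
            # Register the chapter as soon as its heading is seen, so the
            # last chapter cannot be lost at the end of the loop.
            current = (match.group("title"), [])
            chapters.append(current)
        elif current is not None:
            current[1].append(line)
        # Lines before the first chapter heading are ignored in this sketch.
    return chapters


def to_mediawiki(title, body):
    """Render one chapter with a level-two MediaWiki heading."""
    return "== {} ==\n{}".format(title, "\n".join(body))
```

Each chapter could then be written to its own file by calling `to_mediawiki(title, body)` for every pair that `split_chapters` returns.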

Thanks a lot. It seems to not extract the last chapter - at least in Apprentices and Against the Dark.

Some books also seem to be missing. Like the core rulebooks.

Getting the last chapters may be a simple fix. I'll have a look. Some books are indeed missing, for two reasons. First, I have concentrated on 5th Edition sourcebooks. Second, I haven't bought PDFs of the core rulebook or of the 3rd and 4th edition sourcebooks.

Another thing - if the program can identify paragraph-breaks and insert a line-break ("enter") after each paragraph, that will really help.

The formatting of the sidebars is also a bit off, but I don't know how to do it correctly, if it even can be made so in wiki format.

I cannot promise line breaks at the end of each paragraph. Text is quite messy in PDFs, where the layout is partially done with the coordinates of lines of text. It might be that the only way to recognize the start of a paragraph is to check from the coordinates that the line is indented. But I'll keep that in mind.
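The indentation idea can be sketched roughly like this. This is only an illustration under stated assumptions: each line is given as a hypothetical `(x0, text)` pair, where `x0` is the left x-coordinate reported by whatever PDF library is in use, the lines are already in reading order, and they all belong to a single column. The function name `group_paragraphs` and the `tolerance` value are made up.

```python
from collections import Counter


def group_paragraphs(lines, tolerance=2.0):
    """Join (x0, text) lines into paragraphs, starting a new one at each indent."""
    if not lines:
        return []
    # Assume the most common (rounded) left coordinate is the column margin.
    margin = Counter(round(x0) for x0, _ in lines).most_common(1)[0][0]
    paragraphs = []
    for x0, text in lines:
        indented = x0 > margin + tolerance
        if indented or not paragraphs:
            # An indented line (or the very first line) opens a new paragraph.
            paragraphs.append(text)
        else:
            # Otherwise the line continues the current paragraph.
            paragraphs[-1] += " " + text
    return paragraphs
```

Joining the result with blank lines would give the paragraph breaks asked for above. Multi-column pages, hyphenated line ends, and paragraphs that start without an indent (for example right after a heading) would all need extra handling.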

Unfortunately, getting the last chapters to parse was not a simple fix. I don't really know what the problem is, so I'll need to take a closer look at those sourcebooks.

Sidebar formatting can be changed. What's in the files now was my best guess, and it doesn't seem to correspond with MediaWiki formatting. So how would you like to have it? From a quick look at Redcap, it looks like sidebar contents are included in the text.

There was a stupid mistake in the code, and it dropped the last chapter from each source. Well, I may have introduced another bug while I was trying things. I also found an anomaly with chapters in Mythic Locations.

Anyway, there are updated files in the redcap folder. Probably not the final ones, but getting there.

Well, the core rulebook open game content extraction seems to be the best formatted. It does things like this:

{| class="wikitable"
|+Option: title
| text text text
|}

or

{| class="wikitable"
|-
|

text

|}

I see. Yes, that's doable and I'll try it. It will not work in every case, as sidebar headings sometimes come after the text. I could rearrange them, but if there is more than one heading in a sidebar, then we are out of luck.
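A rough sketch of that rearranging rule, not taken from the actual code: it assumes each sidebar is a list of hypothetical `(kind, text)` pairs where `kind` is either `"heading"` or `"text"`, and the function name `heading_first` is made up.

```python
def heading_first(sidebar):
    """Move a sidebar's single heading to the front; leave ambiguous cases alone."""
    headings = [item for item in sidebar if item[0] == "heading"]
    if len(headings) != 1:
        # Zero or several headings: the right order cannot be guessed,
        # so return the items unchanged.
        return list(sidebar)
    rest = [item for item in sidebar if item[0] != "heading"]
    return headings + rest
```

With exactly one heading the result can feed a `|+` caption line in a wikitable. Everything else stays as extracted and is left for manual editing.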

Not being perfect is part of the territory. If it just works some of the time, that'll help.

As a programmer, it's hard to write code that doesn't work perfectly. And yet I don't have the time or energy to make this code work perfectly in the messy world of PDFs.

Anyway, I had a look at MediaWiki tables, and in a way they are usable, but most sidebars are just text that should somehow be separated from the main text. So I put a narrow brownish border around sidebars. You can think of it as a visual reminder for later editing. Or maybe it's usable as is. Naturally, the color and border width can be changed. I also added a bit of padding inside the borders; otherwise it looks pretty cramped.

{|style="border-style: solid; border-width: 3px; border-color: chocolate"
| style="padding-left: 7px; padding-right: 7px" |
text and headings here
|}
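As a minimal sketch of how such a wrapper could be generated (the function name `wrap_sidebar` and its parameters are made up for illustration, not taken from the repository), the markup above can be produced from a sidebar's text like this:

```python
def wrap_sidebar(text, color="chocolate", border_px=3, padding_px=7):
    """Wrap sidebar text in a bordered, padded one-cell MediaWiki table."""
    # %-formatting is used because the table markup itself contains braces.
    return (
        '{|style="border-style: solid; border-width: %dpx; border-color: %s"\n'
        '| style="padding-left: %dpx; padding-right: %dpx" |\n'
        "%s\n"
        "|}"
    ) % (border_px, color, padding_px, padding_px, text)
```

Keeping the color, border width, and padding as parameters makes them easy to adjust later without touching the extracted text.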

This is what it looks like in the Redcap preview (regular text at the start, then a sidebar with a level-three heading and text):

[Image: Redcap preview of a bordered sidebar]

Looks great.

The text files for Redcap were updated on GitHub, and it may be the final update for a while. The Black monks sourcebook went sour with the latest handling of chapters (it has a chapter text on every page), so I dropped it from the texts.

Next I'll update the full text files, and then I'll return to work on getting character statistics for the Foundry VTT Ars Magica compendium.

The full text files are on GitHub too.

Thanks a lot.