Open Arabic Periodical Editions (OpenArabicPE) (Till Grallert)

A framework for open, collaborative and scholarly digital editions of early Arabic periodicals


OpenArabicPE establishes a framework for open, collaborative, and fully referenceable scholarly digital editions of early Arabic periodicals. Its guiding principles can be summarised as accessibility, sustainability, and credibility. It is developed against the backdrop of two editions of Arabic periodicals from the early twentieth century: Muḥammad Kurd ʿAlī’s Majallat al-Muqtabas (published in Cairo and later Damascus between 1906 and 1917/18) and ʿAbd al-Qādir al-Iskandarānī’s al-Ḥaqāʾiq (also published in Damascus, 1910–12). OpenArabicPE shows that one can produce scholarly editions by re-purposing well-established open software and by bridging the gap between popular but non-academic online libraries maintained by volunteers on the one hand and academic scanning efforts and editorial expertise on the other. Such editions offer solutions for most of the problems pertinent to the preservation of the early periodical press in the region: active destruction by war and cuts in funding for cultural heritage institutions; a focus on digital imagery due to the absence of reliable OCR technologies for Arabic fonts (recent advances in OCR technology based on neural networks and deep learning are to be published by the Open Islamicate Text Initiative (OpenITI) project in 2018, but they still require transcriptions as training data); the absence of reliable bibliographic metadata on the issue and article level; anonymous transcriptions of unknown quality; and slow and unreliable internet connections and old hardware.

In more concrete terms, we start with digital texts available from grey online libraries, such as al-Maktaba al-Shāmila. These texts are of unknown provenance, editorial principles, and quality, and they lack most, if not all, of the information linking the digital representation to a printed original, namely bibliographic metadata and page breaks, which makes them almost impossible to employ for scholarly research. Most of these immediate shortcomings can be remedied by access to facsimiles, by closely linking the digital text to digital imagery of the printed page, and by transcribing missing bibliographic information from the facsimiles. HathiTrust, the British Library’s “Endangered Archives Programme” (EAP), MenaDoc, and the Institut du Monde Arabe provide such facsimiles, but these too suffer from incomplete and often faulty bibliographic metadata.

Fig. 1 TEI XML file of al-Muqtabas 7(3), July 1908, CC BY-SA 4.0

Fig. 2 Web-display of al-Muqtabas 7(3), July 1908, CC BY-SA 4.0


To achieve our aims we transform the text into an open, standardised file format (XML) following the guidelines of the Text Encoding Initiative (TEI), which are the quasi-standard of textual editing and are required by funding bodies and repositories for long-term archiving. We add light structural mark-up for articles, sections, authors, and bibliographic metadata, and link each page to a number of facsimiles; in this process we also make first corrections to the transcription. Since almost no editor or reader will want to work directly with bi-directional XML files that combine Arabic script for the content with Latin script for the mark-up and the documentation of editorial decisions (fig. 1), and since one actually needs to see the facsimiles, we provide a basic web-display (fig. 2). This web-display runs in most internet browsers and can be downloaded, distributed, and run locally without any internet connection, an absolute necessity for societies outside the global North. Because facsimiles are linked to the digital text, readers can validate the quality of the transcription against the original.
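
The kind of light structural mark-up described above can be illustrated with a minimal sketch in Python. It is not the project's actual schema or tooling: the function name build_issue, the input layout, the "facs_N" identifier pattern, and the example.org URLs are assumptions made for illustration. Only the TEI element and attribute names (teiHeader, facsimile, surface, graphic, div, pb, head, byline, facs, xml:id) come from the TEI guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace


def tei(tag: str) -> str:
    """Return a tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_issue(title, facsimiles, articles):
    """Build a minimal TEI tree for one issue (hypothetical helper).

    facsimiles: dict mapping a page number to a list of image URLs
    articles:   list of dicts with keys "page", "title", "author", "text"
    """
    root = ET.Element(tei("TEI"))

    # Minimal header with the three mandatory parts of <fileDesc>.
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    ET.SubElement(ET.SubElement(file_desc, tei("titleStmt")), tei("title")).text = title
    ET.SubElement(ET.SubElement(file_desc, tei("publicationStmt")), tei("p")).text = (
        "Illustrative example only."
    )
    ET.SubElement(ET.SubElement(file_desc, tei("sourceDesc")), tei("p")).text = (
        "Transcribed from a printed original."
    )

    # One <surface> per printed page; each may hold several scans of that page.
    facsimile = ET.SubElement(root, tei("facsimile"))
    for page, urls in facsimiles.items():
        surface = ET.SubElement(facsimile, tei("surface"), {XML_ID: f"facs_{page}"})
        for url in urls:
            ET.SubElement(surface, tei("graphic"), {"url": url})

    # One <div> per article; <pb> marks the page break and points at the scans.
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for article in articles:
        page = article["page"]
        div = ET.SubElement(body, tei("div"), {"type": "article"})
        ET.SubElement(div, tei("pb"), {"n": str(page), "facs": f"#facs_{page}"})
        ET.SubElement(div, tei("head")).text = article["title"]
        ET.SubElement(div, tei("byline")).text = article["author"]
        ET.SubElement(div, tei("p")).text = article["text"]
    return root


# Usage with placeholder data (the URLs and texts are invented):
issue = build_issue(
    "Example periodical, issue 1",
    {1: ["https://example.org/scan-a/1.jpg", "https://example.org/scan-b/1.jpg"]},
    [{"page": 1, "title": "Example article", "author": "Example author", "text": "..."}],
)
xml_text = ET.tostring(issue, encoding="unicode")
```

Because each page break carries a pointer to a surface that groups several scans, a display layer can show any available facsimile next to the transcription, which is what allows readers to check the text against the original.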

Finally, we provide structured bibliographic metadata for every article, which can easily be integrated into larger bibliographic information systems or individual scholars’ reference management software. To improve access to our editions, this data is also publicly accessible through a constantly updated Zotero group.
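
As a rough illustration of what article-level metadata for reference managers might look like, the following Python sketch serialises one record as a BibTeX entry. BibTeX is used here only as a familiar example; the text above does not specify an exchange format. The class ArticleRecord, its fields, and all values in the usage example are hypothetical placeholders, not data from the editions.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """Article-level bibliographic record (hypothetical structure)."""

    key: str
    author: str
    title: str
    journal: str
    volume: str
    issue: str
    year: str
    pages: str

    def to_bibtex(self) -> str:
        """Serialise the record as a BibTeX @article entry, skipping empty fields."""
        fields = {
            "author": self.author,
            "title": self.title,
            "journal": self.journal,
            "volume": self.volume,
            "number": self.issue,
            "year": self.year,
            "pages": self.pages,
        }
        body = ",\n".join(
            f"  {name} = {{{value}}}" for name, value in fields.items() if value
        )
        return f"@article{{{self.key},\n{body}\n}}"


# Usage with placeholder values (not taken from the actual editions):
record = ArticleRecord(
    key="example_article_1",
    author="Example Author",
    title="Example article title",
    journal="Example periodical",
    volume="1",
    issue="2",
    year="1900",
    pages="1--5",
)
bibtex_entry = record.to_bibtex()
```

Keeping one record per article, rather than per issue, is what makes individual contributions citable and searchable in a reference manager.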

All code and the editions are hosted on the code-sharing platform GitHub under MIT and Creative Commons CC BY-SA 4.0 licenses for reading, contribution, and re-use. Improvements to the transcription and mark-up can be crowd-sourced with clear attribution of authorship and version control using git and GitHub’s core functionality. All code is archived on an EU-funded platform (CERN’s Zenodo) that also provides stable identifiers (DOIs) for every release.

In 2017, we entered into a collaboration with the Institute for Computer Science at Universität Leipzig in order to implement Canonical Text Services (CTS) for al-Muqtabas, which provide persistent and application-independent URNs for every textual element. A resolver was implemented and al-Muqtabas was integrated into the European CLARIN research infrastructure. This means that every part of al-Muqtabas is discoverable through CLARIN’s Virtual Language Observatory. We also began analysing bibliographic (meta)data in order to establish the intellectual networks of authors and texts published and referenced in both journals.
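
To show how a CTS URN addresses a textual element, here is a small Python sketch that splits a URN into its components, following the general pattern urn:cts:namespace:textgroup.work.version:passage. The function parse_cts_urn and the example URN are hypothetical and do not reproduce the identifiers actually minted for al-Muqtabas. Passage ranges and sub-references are deliberately not interpreted.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Components of a CTS URN; optional parts are None when absent."""

    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    exemplar: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its parts (hypothetical helper, minimal validation)."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")

    namespace, work_component = parts[2], parts[3]
    passage = parts[4] if len(parts) == 5 and parts[4] else None

    # The work component is a dot-separated hierarchy of up to four levels.
    hierarchy = work_component.split(".")
    if not namespace or len(hierarchy) > 4 or not all(hierarchy):
        raise ValueError(f"malformed work component in: {urn!r}")
    padded = hierarchy + [None] * (4 - len(hierarchy))

    return CtsUrn(namespace, padded[0], padded[1], padded[2], padded[3], passage)


# Usage with an invented URN:
example = parse_cts_urn("urn:cts:exampleNs:group.work.edition:1.2")
```

Since the identifier encodes the text hierarchy itself rather than a file location, a resolver can map it to whatever copy of the edition is currently available.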

Author: Dr. Till Grallert 
Grallert@orient-institut.org