Docly PDF: AI-Powered Restoration of Chinese Ancient Texts, Digital Preservation of Traditional Culture

What is the biggest fear when organizing ancient texts? It is not being unable to read traditional vertical text. It is scans that are blurry and messy, or PDFs made up entirely of images, so that nothing can be copied or annotated directly. Worse still is paging through a hundred-page manuscript to extract the key points, which is like searching for a needle in a haystack.

Some friends working in philology recently tested Docly PDF to see whether it can help. Its core idea is to treat PDFs as live documents rather than static images, which sets it apart from many ordinary readers.

Can vertical traditional characters in scans be recognized?

In our tests, Docly's AI text extraction handled conventional Republic-era vertical lead-type prints and photocopies well. It does not split classical particles like "之乎者也" (zhi, hu, zhe, ye) haphazardly, and paragraph breaks are mostly correct.

However, where there are handwritten annotations or severe insect damage, the AI hesitates and occasionally skips characters or misidentifies radicals. This is not a problem specific to this tool: all OCR hits the same bottleneck with handwritten ancient texts. The most obvious issue I ran into involves similar-looking characters such as "己已巳" (jǐ, yǐ, sì). The AI defaults to the more common character instead of inferring the right one from context, as an expert in ancient texts would.

Conclusion: For clear scans with intact character strokes, the results are very good. For manuscripts with water stains, blurriness, or dense handwriting, manual proofreading is still needed.
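To make that manual proofreading more targeted, one option is to flag every look-alike character in the recognized text so a human can check it against the scan. The following is a minimal sketch, not a Docly feature: the function name `flag_confusables` and the list of character groups are assumptions made for illustration, and the groups should be extended to suit your own material.

```python
# Hypothetical proofreading aid; not part of Docly PDF.
# Each set holds characters that are easily mistaken for one another.
CONFUSABLE_GROUPS = [
    set("己已巳"),
    set("戊戌戍"),
    set("未末"),
]


def flag_confusables(text, groups=CONFUSABLE_GROUPS):
    """Return (index, character, alternatives) for each look-alike character."""
    lookup = {}
    for group in groups:
        for ch in group:
            # Alternatives are the other members of the same group.
            lookup[ch] = "".join(sorted(group - {ch}))
    return [(i, ch, lookup[ch]) for i, ch in enumerate(text) if ch in lookup]


if __name__ == "__main__":
    sample = "知己知彼"
    for index, ch, alternatives in flag_confusables(sample):
        print(f"position {index}: {ch} (check against: {alternatives})")
```

The sketch does not try to decide which character is correct. It only narrows down where a proofreader should look, which matches the division of labor described above.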

No need to flip through manuscripts page by page; generate abstracts directly

Most ancient text PDFs run to more than a hundred pages. Suppose you only want to confirm whether Volume 15 of a certain "County Gazetteer" mentions river management. With ordinary software, you have to look through the table of contents first and then locate the passage.

Docly's summary function is very practical for ancient texts. It does not require you to pre-annotate anything; it simply ingests the entire PDF and outputs a summary of a few hundred words. I tested it on a photocopy of the "Complete Treatise on Agriculture" (Nongzheng Quanshu). The AI did not mention Xu Guangqi's biography (which I didn't need), but it accurately listed where the three core sections, "Water Conservancy," "Famine Relief," and "Agriculture," are located.

However, the AI is better at extracting "factual information" (which year, which person, which crop), while its summarization of "conceptual content" (discussions in prefaces, reflections in postscripts) is relatively bland. If your research focus is on the intellectual context behind the text, the summary can only serve as a navigation tool, not a substitute for reading.

Turn image-based PDFs into editable notes

Many people don't realize that even after cropping, the headers and footers of ancient text scans often still carry modern library stamps and call numbers. When you copy the full text, a line like "Collection of Nanjing Library" suddenly appears in the middle of it, which is annoying.

Docly can extract text by region. I isolated the main text area in the scan and had the AI recognize only the titles and main text, ignoring headers like "Volume 3" and footers like "Page 25." Processing an 80-page annotated edition of "Guanzi" was more than three times faster than manual deletion.

Limitations: If the original book's border lines are slanted, or the center column text and commentary are crowded together, the AI's region recognition may make mistakes and need manual adjustment. For books with clean layouts, this is rarely a concern.
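When region recognition lets some page furniture through, the leftover lines can also be filtered out of the extracted text afterwards. The sketch below is a rough illustration under stated assumptions: the function `strip_page_furniture` and the three patterns are hypothetical, modeled only on the examples mentioned above, and real scans would need patterns written for the actual stamps and running heads they contain.

```python
import re

# Illustrative patterns only, modeled on the examples in the article.
# They match whole lines, so body text that merely mentions a volume is kept.
NOISE_PATTERNS = [
    re.compile(r"^Collection of .+ Library$"),
    re.compile(r"^Volume \d+$"),
    re.compile(r"^Page \d+$"),
]


def strip_page_furniture(lines, patterns=NOISE_PATTERNS):
    """Drop lines that consist solely of a stamp, running head, or page number."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if any(pattern.match(stripped) for pattern in patterns):
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    extracted = [
        "Volume 3",
        "Main text of the chapter.",
        "Collection of Nanjing Library",
        "Page 25",
    ]
    print(strip_page_furniture(extracted))
```

Matching whole lines is a deliberately conservative choice: it is better to leave a stray stamp in place than to delete a sentence of the main text by accident.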

Who is it suitable for, and who is it not?

First, who it is suitable for: local cultural history enthusiasts, current graduate students, library staff working on digital organization. You have a large number of scanned PDFs and want to quickly know what each book is about, or extract paragraphs containing keywords from a hundred local gazetteers. Docly can save you a lot of time flipping through paper.

Who it is less suitable for: experts in textual criticism. The AI has no concept of edition studies, so it won't consult other editions when a character is obscured by a black square. It also does not generate a comparison table between the current book and variant editions. That is the job of specialized ancient text software.

Additionally, if your PDF is purely image-based, or is a dual-layer PDF whose underlying text layer is of poor quality, Docly's AI extraction will be less effective. It's best to first use its built-in scan enhancement feature to increase contrast, and then run text recognition.

A practical suggestion

Don't expect AI to handle the whole job of ancient text restoration and digitization in one pass. The most suitable role for Docly PDF at present is that of a rapid pre-screening and note generation tool for ancient texts. First, let it run summaries and keywords on all PDFs and mark the high-value documents, then do manual in-depth reading and proofreading on the key sections. This combination is much more efficient than simply adding more human labor, and more reliable than blindly trusting AI.
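The pre-screening step can be pictured as a simple ranking of documents by keyword hits. This is only a sketch of the idea, not how Docly works internally: it assumes the text of each PDF has already been extracted into a plain string, and the function name `rank_by_keywords`, the file names, and the keywords are all made up for illustration.

```python
def rank_by_keywords(documents, keywords):
    """Rank documents by total keyword hits, dropping those with none.

    documents: mapping of file name -> already extracted text (assumed input).
    keywords:  list of terms to look for.
    """
    scored = []
    for name, text in documents.items():
        hits = sum(text.count(keyword) for keyword in keywords)
        if hits:
            scored.append((name, hits))
    # Most hits first; ties broken alphabetically by file name.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical file names and contents.
    docs = {
        "gazetteer_a.pdf": "治河 治河 水利",
        "gazetteer_b.pdf": "農桑",
        "gazetteer_c.pdf": "水利",
    }
    for name, hits in rank_by_keywords(docs, ["治河", "水利"]):
        print(name, hits)
```

A raw hit count is a crude signal, so the ranking should only decide which documents get read first, not which ones get skipped for good.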

If you happen to have a batch of scans of rare books from the Republic of China or late Qing Dynasty, you can first test the limits of Docly's recognition with a sample. Finding the boundary between what it excels at and where it fails is more useful than reading ten reviews.
