A Practical Technique for Character Clipping in the Digital Library from the Meiji Era

福尾 真実, 高田 雅美, 城 和貴

In this study, we develop a practical technique for reliably clipping characters from the image data held by the Digital Library from the Meiji Era. Through this library, the National Diet Library of Japan makes books in its collection publicly available on the Web. Because the books are published as image data, they cannot be searched by their content, so converting them to text is an urgent need. To that end, a multi-font kanji character recognition method specialized for early-modern Japanese printed books has been proposed. However, in books whose text is annotated with ruby (rubi), characters cannot be clipped well and therefore cannot be recognized. In this paper, we develop a technique that removes the ruby from the body text of such books.
