Off-line Chinese Handwriting Recognition: From Isolated Characters to Realistic Text

Chinese abstract:

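Off-line handwritten Chinese character recognition has attracted many researchers because of its large application potential and its particular difficulty. Research over the past thirty years has concentrated mainly on the recognition of neatly handprinted Chinese characters. The results cover shape normalization, feature extraction, classifier design, and linguistic postprocessing, so the conditions for moving into the era of handwritten text are largely in place. This thesis aims to establish a basic framework for off-line Chinese handwritten text recognition, ranging from the underlying data to the evaluation system, and from improved methods to entirely new research strategies. First, it builds the HIT-MW database, the underlying data needed to support research on Chinese handwritten text; in the course of understanding the problem, it also defines criteria for evaluating character segmentation and recognition algorithms. It then studies handwritten text recognition along two different paths, the segmentation-based strategy and the segmentation-free strategy. Finally, after confirming that the two strategies are clearly complementary, it proposes combination systems based on both.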

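This thesis analyzes the future trends of handwritten Chinese character recognition and lays out the logical structure of the research. It first gives a systematic summary of the history of character recognition research, organized around the upgrading of the recognition target. From this history, together with the current state of handwriting database construction and recognition strategies for Chinese characters, it concludes that Chinese handwritten text recognition will be the future focus of research. This marks the start of a new era, “the era of handwritten text”. Because the new era builds on the era of isolated handwritten characters, the thesis then reviews the major results in isolated handwritten Chinese character recognition in shape normalization, feature extraction, classifier design, and linguistic postprocessing.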

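This thesis builds the HIT-MW database from a new perspective. HIT-MW is the first text-level Chinese handwriting database internationally, and its successful collection signals the beginning of the era of handwritten text. Its transcribed texts come from the People's Daily corpus and cover 99.33% of the character usage in a corpus of 8 million characters. The writers were carefully selected, yielding statistics that largely agree with the real distribution. Thanks to a systematic sampling strategy and careful process control, HIT-MW contains not only skewed, overlapping, and touching text lines but also real handwriting phenomena such as copying errors and altered characters. Substantial supporting evidence indicates that these data can be regarded as a representative subset of all Chinese handwritten text, so recognition results obtained on them are statistically meaningful. The database has so far been adopted by more than ten research institutions.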

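This thesis not only defines evaluation criteria for text research but also studies methods from the segmentation side. It first establishes basic evaluation criteria for text segmentation and recognition. To evaluate recognition quality on text, it defines the recognition correct rate and the recognition accuracy rate; together they characterize how well a system balances deletion, insertion, and substitution errors. To evaluate different character segmentation methods, it defines the segmentation correct rate, the segmentation precision rate, and the segmentation bias rate. Used together, these three criteria reveal how well a segmentation method handles different character types (digits, punctuation, and Chinese characters) and whether it leans toward over-segmentation or under-segmentation. It then studies the recognition of realistic text with the segmentation-based strategy and offers two important recommendations. First, when a new algorithm is designed, its claimed advantage may not hold if the supporting evidence comes from only one shape normalization configuration; the ideal approach is to compare the new and old systems each under its own best shape normalization configuration. Second, the MQDF classifier should be improved by incorporating prior probability information; further analysis shows that priors estimated from a large-scale corpus are more stable than priors estimated directly from the training set.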
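The abstract names the recognition correct rate and the recognition accuracy rate without giving formulas. The sketch below assumes the convention common in speech recognition: with N reference characters and S substitution, D deletion, and I insertion errors from a minimum edit distance alignment, the correct rate is (N - S - D) / N and the accuracy rate is (N - S - D - I) / N. The function names are illustrative, and the thesis may define the rates differently.

```python
def count_errors(reference, hypothesis):
    """Align two strings by minimum edit distance and return the numbers of
    substitutions, deletions, and insertions along one optimal alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, s, d, k)             # match, no error
            else:
                diagonal = (t + 1, s + 1, d, k)     # substitution
            t, s, d, k = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = cost[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            # Tuples compare by total error count first.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins


def recognition_rates(reference, hypothesis):
    """Return (correct rate, accuracy rate) under the assumed definitions."""
    subs, dels, ins = count_errors(reference, hypothesis)
    n = len(reference)
    correct_rate = (n - subs - dels) / n
    accuracy_rate = (n - subs - dels - ins) / n
    return correct_rate, accuracy_rate
```

Under these assumed definitions the accuracy rate also penalizes insertions, so it is never higher than the correct rate and can become negative for a system that inserts many spurious characters. Comparing the two rates is one way to see how a system trades deletions against insertions.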

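This thesis proposes a segmentation-free method for recognizing realistic Chinese handwritten text. The method trains directly on handwritten text lines, with no need to label character positions, and requires no character segmentation stage at recognition time. Comparative experiments between a segmentation-based system and a segmentation-free system that use the same type of features confirm the feasibility and large potential of the segmentation-free strategy. Within this framework, an enhanced four plane feature (en-FPF) is proposed to address the weaknesses of the four plane feature. Unlike earlier directional planes, the directional planes of en-FPF contain all the important information needed to reconstruct the original image. Experiments show that en-FPF gives better recognition performance on digits, punctuation, and Chinese characters, and that it is currently the single feature with the highest recognition rates in the segmentation-free framework. When en-FPF is fused with a simple grid feature and combined with principal component analysis and data sharing, the recognition correct rate on Chinese characters still exceeds 50% even with sparse training data.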

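After verifying that the two recognition strategies are complementary, this thesis designs two-strategy combination systems with a serial structure and a parallel structure. It first defines the character matching rate, which reflects how complementary two systems are at a given recognition correct rate. With this criterion, it finds that the two strategies complement each other well even with the same training data and the same type of features. It then designs two two-strategy combination systems, which extends the content and scope of multiple classifier research. The serial system inserts the segmentation-free recognizer into the character segmentation stage of the segmentation-based system, so that during recognition the segmentation-free system runs first and the segmentation-based system runs afterwards. The parallel system runs the segmentation-based and segmentation-free systems in parallel beforehand, and then uses a measure from the segmentation-based system to decide whether to output its result directly or to output the segmentation-free result instead. Experimental results confirm that the two-strategy combination systems are markedly effective.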

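Keywords: handwritten text recognition; Chinese character recognition; evaluation system; segmentation-free strategy; segmentation-based strategy; multiple classifier combination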

English abstract:

Owing to its large application potential and its intellectually appealing challenges, off-line recognition of handwritten Chinese characters has been studied intensively by many researchers. Over the last three decades, great effort has gone into reliably identifying handprinted Chinese characters. Considerable advances have been achieved as a result, covering shape normalization, feature extraction, classifier design, and linguistic postprocessing. These state-of-the-art results pave the way for the era of handwritten text. This thesis aims to establish a fundamental framework for the off-line recognition of Chinese handwritten text. Its contributions range from gathering essential data to defining evaluation criteria, and from enhancing traditional methods to putting forward a novel strategy. As the first step, the HIT-MW database is presented to support the off-line recognition of Chinese handwritten text. To enable sound assessment, a series of evaluation criteria is then defined for character segmentation and text recognition. Subsequently, the recognition problem is approached with two distinct strategies, the segmentation-based strategy and the segmentation-free one. Finally, two-strategy combination systems are proposed, given the clear complementarity between the segmentation-based and segmentation-free strategies.

This thesis attempts to infer future trends and to set out the logical structure of the research. The history of off-line character recognition is first summarized systematically, focusing on the upgrade of the recognition unit. Reflecting further on state-of-the-art techniques for Chinese character recognition, in both database collection and recognition methods, the thesis concludes that Chinese handwritten text recognition will be the next trend. A new era is emerging, which can be termed “the era of handwritten text”. Since the new era originates from “the era of isolated character”, the recognition techniques for isolated handwritten Chinese characters are surveyed, and the main achievements are reviewed under the headings of shape normalization, feature extraction, classifier design, and linguistic postprocessing.

This thesis establishes the HIT-MW database from a novel perspective. It is the first text-level database of Chinese handwriting in the field, and its successful collection marks the start of the new era of handwritten text. The underlying texts are sampled from the People's Daily corpus, and as a result they reach a high character coverage of 99.33% on a large corpus of 8,000,000 characters. The writers are carefully chosen, and their distribution matches the real statistics well. Owing to a systematic sampling mechanism and a strict quality assurance process, the database includes skewed, overlapping, and touching text lines, and it also captures realistic phenomena such as miswriting and erasure. Sufficient evidence supports the view that the HIT-MW database can represent the whole population of Chinese handwritten text, and that recognition results on it are statistically meaningful. Currently, the database is used by more than ten research institutions.
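A coverage figure such as the 99.33% above can be computed in more than one way, and the abstract does not say which was used. The sketch below assumes one plausible reading: the fraction of character occurrences in the reference corpus whose character appears at least once in the sampled texts. The function name and the toy strings are illustrative only.

```python
from collections import Counter


def character_coverage(sample_text, corpus_text):
    """Fraction of corpus character occurrences whose character also
    occurs somewhere in the sampled text (assumed definition)."""
    sample_chars = set(sample_text)
    corpus_counts = Counter(corpus_text)
    total = sum(corpus_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for ch, count in corpus_counts.items()
                  if ch in sample_chars)
    return covered / total


# Toy usage: the sample contains "a" and "b" but not "c".
coverage = character_coverage("ab", "aabbc")
```

Weighting by occurrence means that frequent characters dominate the figure, so a sample can reach high coverage while still missing many rare characters.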

This thesis first presents basic evaluation criteria for text segmentation and recognition. To capture how well a system balances deletion, insertion, and substitution errors, the recognition correct rate and the recognition accuracy rate are defined. To compare different character segmentation methods, the segmentation correct rate, the segmentation precision rate, and the segmentation bias rate are provided. Using the three segmentation rates together reveals a method's segmentation ability on digits, punctuation, and Chinese characters, as well as its bias toward under-segmentation or over-segmentation. In addition, the transcription of realistic handwritten text with the segmentation-based strategy is studied, and two important suggestions are given. First, the advantage of a new method may be in doubt if the evidence is collected from only a single shape normalization setup; instead, the new and old systems should be compared each under its own best shape normalization setup. Second, the performance of classifiers based on the modified quadratic discriminant function (MQDF) improves clearly after incorporating the a priori probability of each character class, and estimating that prior from a large corpus rather than from the training data yields more robust results.
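The abstract does not state how the class prior enters the MQDF decision. One common way, sketched here as an assumption, treats the quadratic discriminant distance as twice a Gaussian negative log-likelihood, so that the prior contributes a term of -2 log P(class). The priors are estimated from character counts in a corpus with additive smoothing; the function names, the smoothing constant `alpha`, and the toy numbers are all hypothetical.

```python
import math
from collections import Counter


def log_priors(corpus_text, classes, alpha=1.0):
    """Estimate log P(class) from corpus character counts with additive
    smoothing, so that unseen classes keep a nonzero probability."""
    counts = Counter(ch for ch in corpus_text if ch in classes)
    total = sum(counts.values()) + alpha * len(classes)
    return {c: math.log((counts[c] + alpha) / total) for c in classes}


def classify_with_prior(distances, priors):
    """Pick the class that minimizes distance - 2 * log prior.

    distances maps each class to a quadratic discriminant distance,
    where a smaller value means a better match."""
    return min(distances, key=lambda c: distances[c] - 2.0 * priors[c])


# Toy usage: "b" is slightly closer, but "a" is far more frequent.
classes = {"a", "b"}
priors = log_priors("aaab", classes)
distances = {"a": 10.0, "b": 9.5}
decision = classify_with_prior(distances, priors)
```

In this toy case the distance alone favors "b", while the prior-adjusted score favors "a". Estimating the counts from a large corpus instead of the training set changes only the `corpus_text` argument.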

This thesis proposes a segmentation-free strategy for transcribing realistic handwritten Chinese text. Training uses handwritten text lines directly, so character positions do not need to be labeled, and recognition requires no character segmentation stage. Comparisons with a segmentation-based system that uses the same type of features show the feasibility and potential of the segmentation-free strategy. An enhanced four plane feature (en-FPF) is also proposed within the segmentation-free recognition framework. Unlike previous directional planes, the planes of en-FPF retain the important information needed to reconstruct the original image. Experimental results show that en-FPF yields better recognition performance on digits, punctuation, and Chinese characters, and that it gives the highest recognition rates among single features in the segmentation-free framework. Once en-FPF is fused with a simple cellular feature and processed with principal component analysis and data sharing techniques, the recognition correct rate on Chinese characters exceeds 50%, even under data sparseness.

This thesis combines the segmentation-based strategy and the segmentation-free one in a serial structure and a parallel structure, given their potential complementarity. To explore the complementarity between two systems, the character matching rate (CMR) is defined first. With the help of CMR, the complementarity of the two strategies is verified, even when they use the same training data and the same type of features. Two combined systems are then constructed, one with a serial structure and one with a parallel structure, which extends the content and scope of multiple classifier combination. In the serial system, the segmentation-free system is used to estimate the initial character boundaries; after a boundary refinement process, the segmentation-based system is launched. In the parallel system, the segmentation-free system runs alongside the segmentation-based system, and the recognition confidence of the segmentation-based system determines whose result is delivered. Experimental results demonstrate the effectiveness of the combinations.
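The decision rule of the parallel structure can be sketched in a few lines. The version below assumes that the segmentation-based system reports a confidence in the range 0 to 1 for each text line and that a fixed threshold decides which result to deliver; the threshold value, the confidence scale, and the function name are hypothetical and are not taken from the thesis.

```python
def parallel_combine(seg_based_result, seg_based_confidence,
                     seg_free_result, threshold=0.5):
    """Deliver the segmentation-based result when its confidence is high
    enough; otherwise fall back to the segmentation-free result."""
    if seg_based_confidence >= threshold:
        return seg_based_result
    return seg_free_result


# Toy usage on one text line, with both systems already run.
output = parallel_combine("result A", 0.3, "result B")
```

Both recognizers run before the rule is applied, so the combination costs the time of both systems, and the only tunable part is how the confidence is measured and thresholded.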
