In general, the acoustic environment of a parcel sorting machine is noisy and is dominated by noise from moving machine parts. This noise tends to be non-stationary and lowers ASR accuracy. In this study, we also examine the possibility of using a physiological microphone (PMIC) for speech input. Unlike a traditional close-talking microphone (CTM), the PMIC is a contact microphone that captures the speech signal through skin vibrations [3, 4]. This study compares the performance of the PMIC to that of the CTM and demonstrates the noise robustness that can be achieved with the PMIC.

We evaluate and report results for the proposed multimodal solution on realistic image and speech datasets. The postal address label images were collected using a postal sorter machine under operational conditions and contain artifacts that are encountered in real-world conditions. The speech data was collected in the presence of background acoustic noise that is typical of the postal sorting environment. Our experiments reveal that the proposed multimodal solution outperforms the OCR-only baseline, reducing the number of errors by 50%. The proposed system is currently operational and is part of the Siemens Mobility postal automation solutions [8].
2.1. Baseline OCR System

The baseline parcel sorting system uses OCR only. The postal address images are first binarized; in this step they are converted to a deskewed, grey-scaled representation. This process uses the hysteresis thresholding algorithm [5]. After binarization, layout analysis is performed to identify individual text line segments inside the postal image. This step employs the Resolution by Adaptive Subdivision of Transformation Space (RAST) page segmentation algorithm to obtain line segments and classify them as text or non-text [6]. Given that the system is only concerned with identifying the zip code (optionally with city and state), the last part of the postal label is extracted with the help of the line segmentation. Finally, the extracted text line segment is fed to the OCR line recognizer to generate a raw OCR text output.

We compare the raw OCR output with a master address database that consists of all U.S. city-state-zip code combinations. Here, the distance between the raw OCR output and the database entries is computed using the normalized Levenshtein edit distance (NLED) [7], defined by

$$\mathrm{NLED}(i, j) = \frac{\mathrm{LED}(i, j)}{|i| + |j|},$$

where $i$ and $j$ are strings, LED is the Levenshtein edit distance, and $|\cdot|$ is the string length. The database entry with the smallest NLED is then treated as the system output. The purpose of using the normalized version of LED is to remove the variation caused by differences in string length. When running OCR line recognition, the city name text is likely to be recognized less reliably than the zip code digits. NLED is used to overcome this inherent weakness by normalizing the edit distance over the combined length of both strings. This ensures that the edit distance is not biased towards the shorter string.

2.2. Proposed Multimodal System

The proposed multimodal system combines OCR and ASR systems. As shown in Fig. 1, the system captures audio and image input at the start of the operation.
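The database lookup of Section 2.1 can be illustrated with a minimal Python sketch. It assumes that NLED divides the Levenshtein distance by the combined length of the two strings, as the text describes; the function names (`levenshtein`, `nled`, `best_match`) and the three-entry toy database are illustrative assumptions and are not taken from the deployed system.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance with unit insert/delete/substitute costs."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nled(a: str, b: str) -> float:
    """Edit distance normalized by the combined length of both strings."""
    total = len(a) + len(b)
    return levenshtein(a, b) / total if total else 0.0


def best_match(raw_ocr: str, database: List[str]) -> Tuple[str, float]:
    """Return the database entry closest to the raw OCR text, with its NLED.

    Raises ValueError if the database is empty.
    """
    return min(((entry, nled(raw_ocr, entry)) for entry in database),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy database; the real master database holds all U.S. combinations.
    toy_database = ["AUSTIN TX 78701", "DALLAS TX 75201", "DULUTH MN 55802"]
    print(best_match("DALLAS TX 75Z01", toy_database))
```

A linear scan is used only to keep the sketch short; a full city-state-zip database would likely call for an indexed or pruned search.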
The audio is then decoded by ASR (Step 1a) to generate a 1-best hypothesis, which consists of the state and the 5-digit zip code. In parallel, the address image is decoded by the OCR baseline system described earlier (Step 1b). Here, the NLED between the OCR system output and the raw OCR output, $\mathrm{NLED}_T$ (subscript $T$ in Fig. 1), is compared to a first threshold, threshold 1. If $\mathrm{NLED}_T$ is below threshold 1, then the OCR system output is treated as the final output. This step reflects our confidence in the OCR output. Since the raw OCR output is not driven by grammar or vocabulary constraints, it is very unlikely that a raw OCR output containing the city-state-zip string matches a valid address perfectly (or near perfectly) by chance, so a small NLED indicates a reliable recognition. Our intuition is verified in our experimental studies, where we have found that the NLED is a very good predictor of OCR output confidence.

If $\mathrm{NLED}_T$ exceeds threshold 1, then the ASR output is also taken into consideration for generating the final output. Specifically, the ASR output is used to narrow down the list of possible address options from the master address database (Step 2). This is done by using the SCF (sectional center facility) part of the zip code (i.e., the first 3 digits of the zip code) to extract from the master address database a pruned list of addresses with a matching SCF. OCR decoding is then performed again on the image, using this pruned list to constrain the decoding path. As a result, a new raw OCR output is obtained, which is again compared to the master address database to yield a second NLED measure, $\mathrm{NLED}_O$. $\mathrm{NLED}_O$ is compared to a second threshold, threshold 2. If $\mathrm{NLED}_O$ is below threshold 2, then the ASR-constrained OCR output is treated as the final system output. If $\mathrm{NLED}_O$ exceeds threshold 2, then the OCR output is discarded and the ASR output is treated as the final system output.

In this manner, the proposed system uses NLED to determine whether the OCR output is high or low confidence. High-confidence outputs are trusted as the final output, and low-confidence outputs trigger further processing.
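The two-threshold decision logic described above can be summarized in a short Python sketch. The callables `ocr_decode` and `match` are hypothetical stand-ins for the OCR line recognizer and the NLED database lookup; the assumption that each database entry ends with its 5-digit zip code, the handling of an empty pruned list, and the treatment of an NLED exactly equal to a threshold are choices made for this illustration and are not specified in the text. The default thresholds are the values reported for this study.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not the deployed system's API):
#   ocr_decode(image, constraints) -> raw OCR text; constraints is None for
#       unconstrained decoding, or a list of candidate addresses.
#   match(raw_text, database) -> (closest database entry, its NLED).
OcrDecode = Callable[[object, Optional[List[str]]], str]
Match = Callable[[str, List[str]], Tuple[str, float]]


def prune_by_scf(database: List[str], asr_zip: str) -> List[str]:
    """Keep entries whose zip code shares the SCF (first 3 digits).

    Assumes every entry ends with its 5-digit zip code.
    """
    scf = asr_zip[:3]
    return [entry for entry in database if entry.split()[-1][:3] == scf]


def fuse(image: object, asr_hypothesis: str, database: List[str],
         ocr_decode: OcrDecode, match: Match,
         threshold_1: float = 0.2, threshold_2: float = 0.3) -> Tuple[str, str]:
    """Return (final output, source), where source names the winning path."""
    # Step 1b: unconstrained OCR, then NLED lookup in the master database.
    entry, nled_t = match(ocr_decode(image, None), database)
    if nled_t < threshold_1:  # equality is treated as "not below" here
        return entry, "ocr"

    # Step 2: use the SCF of the ASR zip code to prune the candidate list.
    asr_zip = asr_hypothesis.split()[-1]  # hypothesis is "<state> <zip>"
    pruned = prune_by_scf(database, asr_zip)
    if pruned:  # assumption: an empty pruned list falls through to ASR
        entry, nled_o = match(ocr_decode(image, pruned), database)
        if nled_o < threshold_2:
            return entry, "asr_constrained_ocr"

    # Low confidence on both OCR passes: trust the ASR hypothesis.
    return asr_hypothesis, "asr"
```

Passing the recognizer and the matcher in as callables keeps the cascade testable with stubs and independent of any particular OCR engine.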
In this study, thresholds 1 and 2 were set experimentally to 0.2 and 0.3, respectively.

2.3. Automatic Speech Recognition System

Here, we briefly describe the speech recognition system used in this study. The acoustic models for the ASR system were trained using context-dependent HMMs (hidden Markov models) and standard MFCC (Mel frequency cepstral coefficient) features. In particular, the acoustic model consisted of 3-state, 32-mixture HMMs with 1000 tied states. Using this HMM topology, whole-word models were trained for digits and state names. The acoustic models were trained separately for the CTM and PMIC microphones. The ASR system also used a constrained state and zip code grammar for decoding.

3. EXPERIMENTAL SETUP

3.1. Postal Address Image Corpus

As part of this study, 400 realistic images of postal address labels were obtained directly from the postal sorter machine. Here, the parcel sorter photographed the address labels of the parcels that were placed on the system. Realistic parcels of a wide variety of shapes and sizes were used for data collection. As a result, the collected dataset contains a wide variety of distortions (see Fig. 2). The prominent sources of variability in the database include different types of fonts, handwritten address labels, distortion due to plastic coverings and background, curved surfaces, and smudged text.

3.2. Postal Address Speech Corpus

For this study, speech data was collected from 40 subjects (20 male and 20 female). All subjects were native speakers of American English. Data from 16 male and 16 female speakers (approximately 10,000 read-outs of address labels) was collected under normal recording conditions for training the acoustic models. Data from the remaining 4 male and 4 female subjects (approximately 2,500 read-outs) was collected in normal as well as noisy recording conditions and used for evaluating the system.
The ambient noise conditions were created by playing a noise recording through two loudspeakers (located in front of and behind the subject). The noise recording was obtained from a postal sorting facility and consisted of a shop-floor environment (with machines, conveyors, people shouting and talking, parcels and packages falling, etc.). The average SNRs (signal-to-noise ratios) for the normal and noisy recording conditions were measured to be 20 dB and 10 dB, respectively. All speech data was collected simultaneously using the close-talking (CTM) and physiological (PMIC) microphones. In this study, the PMIC was positioned on the speaker's throat.
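For reference, the SNR figures quoted above for the two recording conditions follow the usual decibel definition, 10 log10 of the ratio of speech power to noise power. The sketch below shows that definition in Python; it assumes separate speech-only and noise-only sample segments are available, which is an assumption of this illustration, since the measurement procedure is not described here.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power of a sampled signal (mean of squared amplitudes)."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB from a speech-only and a noise-only segment (assumed given)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))
```

Under this definition, a 10 dB drop between the normal and noisy conditions corresponds to a tenfold increase in noise power relative to the speech.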