Welcome to the OvniConv project.

Presentation

This project aims to develop OpenDocumentFormat tools that help convert files written in legacy Vietnamese encodings, such as TCVN 5712:1993, VNI and VPS, to TCVN 6909:2001 (Unicode).

Second try: more proof of concept!

Experimentation

First try: proof of concept!

  • open an old TCVN encoded MS-Office .DOC file using OOo:
ooffice test-tcvn.doc
  • save the file in .ODT format as test.odt, then quit OOo
  • use unzip to extract the .ODT content.xml file
unzip test.odt content.xml
  • recode content.xml from UTF-8 to WINDOWS-1252, which recovers the original legacy bytes
iconv -f UTF-8 -t WINDOWS-1252 < content.xml > content-tcvn.xml
  • recode content-tcvn.xml from TCVN-5712 to UTF-8, writing the result back to content.xml
iconv -f TCVN-5712 -t UTF-8 < content-tcvn.xml > content.xml
  • use zip to put back content.xml in the .ODT file
zip test.odt content.xml
  • open the .ODT file using OOo
ooffice test.odt
  • The text is now all Unicode encoded! (but the fonts are still declared as .vn* ones)
  • Note that some special characters (such as double quotes) are still wrongly replaced with accented Vietnamese characters. This happens because the conversion is a global raw string conversion, which also converts strings set in fonts other than .vn*. The final tool will have to convert only the strings associated with a .vn* font.
  • Test file used: test-tcvn.doc

About the HyphenationIssue

 
projects/ovniconv.txt · Last modified: 2008/04/22 10:26 by ict4ngo
 