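How to read and write Unicode files with STL I/O streams
I have heard that this requires std::locale and related facilities,
but I don't know how to use them. Does anyone have an approach or an example?
It has to be solved with the STL; I don't want a pure C solution.
[Solution]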
http://blog.csdn.net/jixingzhong/archive/2006/12/28/1465419.aspx
[Solution]
How do you make setlocale use Unicode? These language strings are all I could find:
Primary language | Sublanguage | Language string
Chinese | Chinese | "chinese"
Chinese | Chinese (simplified) | "chinese-simplified" or "chs"
Chinese | Chinese (traditional) | "chinese-traditional" or "cht"
Czech | Czech | "csy" or "czech"
Danish | Danish | "dan" or "danish"
Dutch | Dutch (default) | "dutch" or "nld"
Dutch | Dutch (Belgium) | "belgian", "dutch-belgian", or "nlb"
English | English (default) | "english"
English | English (Australia) | "australian", "ena", or "english-aus"
English | English (Canada) | "canadian", "enc", or "english-can"
English | English (New Zealand) | "english-nz" or "enz"
English | English (United Kingdom) | "eng", "english-uk", or "uk"
English | English (United States) | "american", "american english", "american-english", "english-american", "english-us", "english-usa", "enu", "us", or "usa"
Finnish | Finnish | "fin" or "finnish"
French | French (default) | "fra" or "french"
French | French (Belgium) | "frb" or "french-belgian"
French | French (Canada) | "frc" or "french-canadian"
French | French (Switzerland) | "french-swiss" or "frs"
German | German (default) | "deu" or "german"
German | German (Austria) | "dea" or "german-austrian"
German | German (Switzerland) | "des", "german-swiss", or "swiss"
Greek | Greek | "ell" or "greek"
Hungarian | Hungarian | "hun" or "hungarian"
Icelandic | Icelandic | "icelandic" or "isl"
Italian | Italian (default) | "ita" or "italian"
Italian | Italian (Switzerland) | "italian-swiss" or "its"
Japanese | Japanese | "japanese" or "jpn"
Korean | Korean | "kor" or "korean"
Norwegian | Norwegian (default) | "norwegian"
Norwegian | Norwegian (Bokmal) | "nor" or "norwegian-bokmal"
Norwegian | Norwegian (Nynorsk) | "non" or "norwegian-nynorsk"
Polish | Polish | "plk" or "polish"
Portuguese | Portuguese (default) | "portuguese" or "ptg"
Portuguese | Portuguese (Brazil) | "portuguese-brazilian" or "ptb"
Russian | Russian (default) | "rus" or "russian"
Slovak | Slovak | "sky" or "slovak"
Spanish | Spanish (default) | "esp" or "spanish"
Spanish | Spanish (Mexico) | "esm" or "spanish-mexican"
Spanish | Spanish (Modern) | "esn" or "spanish-modern"
Swedish | Swedish | "sve" or "swedish"
Turkish | Turkish | "trk" or "turkish"
[Solution]
I finally learned something. It wasn't easy!
#include <iostream>
#include <string>
#include <stdio.h>
#include <stdlib.h>  // system()
#include <wchar.h>   // getwc, putwc, fwprintf, WEOF
#include <locale>
using namespace std;
// Goal: convert the lowercase letters of a Unicode file to uppercase and write the result to another file.
// _wfopen_s, the "ccs=UNICODE" mode flag and _putws are Microsoft CRT extensions.
int main()
{
    FILE *f1;
    FILE *f2;
    // Important: without this call the text is not processed correctly and comes out garbled.
    // setlocale(LC_ALL, "chs") selects Simplified Chinese directly.
    setlocale(LC_ALL, "");
    //locale loc("English");
    locale loc("Chinese"); // or
    wint_t wc;
    wchar_t wcstr[2] = L"好";
    wstring sf1 = L"C:\\Test.ini"; // This file must be created beforehand.
    wstring sf2 = L"C:\\Test.txt";
    errno_t fileOpen;
    // Open as a text file for reading, with the encoding specified as Unicode.
    // _wfopen triggers warning C4996 (deprecated), so _wfopen_s is used instead.
    fileOpen = _wfopen_s(&f1, sf1.c_str(), L"rt+,ccs=UNICODE");
    if (fileOpen != 0)
    {
        wprintf(L"_wfopen_s failed!\n");
        return 1;
    }
    // Open the output file for writing.
    fileOpen = _wfopen_s(&f2, sf2.c_str(), L"wt+,ccs=UNICODE");
    if (fileOpen != 0)
    {
        wprintf(L"_wfopen_s failed!\n");
        fclose(f1);
        return 1;
    }
    fpos_t pos;  // file position
    long ipos = 0;
    for (;;)
    {
        // Read one character from f1 and write it to f2.
        fgetpos(f1, &pos); // Get the current position; reposition with fsetpos(f1, &pos) or fseek().
        ipos = ftell(f1);
        wc = getwc(f1);
        // Test the result of getwc() instead of calling feof() before the read;
        // otherwise WEOF is processed as a bogus final character.
        if (wc == WEOF)
            break;
        wchar_t ch = (wchar_t)wc;
        if (islower(ch, loc)) {
            ch = toupper(ch, loc);
        }
        fwprintf(stdout, L"%lc", (wint_t)ch); // echo to the console
        putwc(ch, f2);                        // write to the output file
    }
    wcout << endl;
    wchar_t ws[100] = L"房间dkugdfu团体\n";
    wstring ws1 = L"sdjfg好了\n";
    _putws(ws);  // write a string to stdout
    _putws(ws1.c_str());
    // Append the two strings to file 2.
    fseek(f2, 0L, SEEK_END);
    fwprintf(f2, L"%ls", ws);
    fseek(f2, 0L, SEEK_END);
    fwprintf(f2, L"%ls", ws1.c_str());
    fclose(f1);
    fclose(f2);
    system("pause");
    return 0;
}
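The islower/toupper calls above take a std::locale, and the same facility works on a std::wstring without any C file handles. The following is only a minimal sketch using the standard library: locale_or_classic and to_upper_copy are hypothetical helper names, and because names such as "Chinese" or "chs" are accepted only by the Microsoft runtime, the first helper falls back to the classic "C" locale when a name is rejected.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: build a named locale, or fall back to the classic "C"
// locale when the runtime does not recognise the name.
std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: return a copy of text in which every character that
// the locale classifies as lowercase has been converted to uppercase.
std::wstring to_upper_copy(std::wstring text, const std::locale& loc)
{
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        if (std::islower(text[i], loc)) {
            text[i] = std::toupper(text[i], loc);
        }
    }
    return text;
}
```

A call such as to_upper_copy(line, locale_or_classic("chs")) would then behave the same on every platform for ASCII letters, while locale-specific letters depend on whether the named locale is available.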
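[Solution]
According to my analysis, the wostream family of I/O classes does not read or write Unicode, at least not by default,
whereas wstring itself holds Unicode.
What does wcout << wstring actually do? It converts the Unicode text to the multibyte native encoding and then displays it.
A wistream reads the native encoding and stores it as Unicode. wcin, for example, converts console input (native encoding) to Unicode and stores it in a wstring. In other words, both input and output use the native encoding; only the data held inside the program is Unicode.
cout and cin simply skip this conversion and store the native encoding directly in a string.
Because this automatic conversion is frustrating and hard to control, I decided to read and write the character stream with ordinary narrow I/O. As long as the character stream itself is already Unicode, what gets read and written is Unicode too, isn't it?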
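// Example
#include <fstream>
using namespace std;
// Save a Unicode string. size is the size of the buffer in bytes, including the terminating null.
// Assumes a 2-byte little-endian wchar_t (UTF-16, as on Windows).
void saveW( ostream& out, wchar_t const* str, int size )
{
    char const* pos = (char const*)str;
    //wcout << str << " : " << size << endl;
    char const* const utf16head = "\xFF\xFE"; // byte order mark that text editors use to identify the encoding
    out.write( utf16head, 2 );
    out.write( pos, size - (int)sizeof(wchar_t) ); // skip the terminating null
}
// Calling code
wchar_t str[] = L"abc你好吗?";
ofstream out( "writeuft16_m.txt", ios::binary ); // binary mode, so that 0x0A bytes are not expanded to CR/LF
saveW( out, str, sizeof( str ) );
out.close();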
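The cast above only produces UTF-16 where wchar_t is 2 bytes and little-endian. A variant that does not depend on the size of wchar_t could serialize each code unit by hand. This is only a sketch, assuming C++11 (std::u16string); save_utf16le and load_utf16le are hypothetical names, not part of any library.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Write a UTF-16LE byte order mark, then each code unit with its low byte first.
void save_utf16le(std::ostream& out, const std::u16string& text)
{
    out.write("\xFF\xFE", 2);
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned unit = text[i];
        const char bytes[2] = { static_cast<char>(unit & 0xFF),
                                static_cast<char>((unit >> 8) & 0xFF) };
        out.write(bytes, 2);
    }
}

// Read the stream to its end and decode it as UTF-16LE; a leading BOM is skipped.
std::u16string load_utf16le(std::istream& in)
{
    std::ostringstream buffer;
    buffer << in.rdbuf();
    const std::string raw = buffer.str();
    if (raw.size() % 2 != 0) {
        throw std::runtime_error("odd byte count in UTF-16 data");
    }
    std::u16string text;
    for (std::size_t i = 0; i < raw.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(raw[i]);
        const unsigned hi = static_cast<unsigned char>(raw[i + 1]);
        const char16_t unit = static_cast<char16_t>(lo | (hi << 8));
        if (i == 0 && unit == 0xFEFF) {
            continue; // skip the byte order mark
        }
        text.push_back(unit);
    }
    return text;
}
```

Both functions work on any narrow stream; with a file, the stream should still be opened with ios::binary for the reason given above.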
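[Solution]
Wide file I/O
Wide versions of the stream classes exist, and it is easy to define t-style macros such as tofstream to manage them.
You would use them like this:
tofstream testFile( "test.txt" ) ;
testFile << _T("ABC") ;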
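The definitions of the t-style names are not reproduced in this post. One plausible shape is sketched here, assuming the usual _UNICODE build switch from Microsoft's <tchar.h>; it uses typedefs instead of macros, and T_TEXT is a hypothetical stand-in for _T() so that the sketch also compiles where <tchar.h> is unavailable.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Assumption: _UNICODE selects the wide build, as with Microsoft's <tchar.h>.
// T_TEXT is a hypothetical replacement for the _T() macro used in the text.
#ifdef _UNICODE
typedef wchar_t            tchar ;
typedef std::wstring       tstring ;
typedef std::wostream      tostream ;
typedef std::wofstream     tofstream ;
typedef std::wstringstream tstringstream ;
#define T_TEXT(s) L##s
#else
typedef char               tchar ;
typedef std::string        tstring ;
typedef std::ostream       tostream ;
typedef std::ofstream      tofstream ;
typedef std::stringstream  tstringstream ;
#define T_TEXT(s) s
#endif
```

With definitions along these lines, the same source line compiles against either the narrow or the wide stream classes depending on the build.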
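Now, you would expect this code to produce a 3-byte file when compiled for single-byte characters and a 6-byte file when compiled for double-byte characters. You would be wrong: both builds produce a 3-byte file.
What is going on?
The root cause is a rule in standard C++: when a wide stream writes to a file, it must convert the double-byte characters to single-byte ones. In the example above, the wide string L"ABC" (6 bytes long) is converted to a narrow string (3 bytes) before it is written to the file. Worse, how the conversion is done is implementation-dependent.
I could not find a definitive explanation of why things ended up this way. My guess is that a file is defined as a stream of (single-byte) characters, and that allowing 2-byte characters to be written at once would make them impossible to extract. Right or wrong, this causes serious problems. For example, you cannot write binary data to a wofstream, because the class tries to narrow it before output.
This was an obvious problem for me, because I have a large number of functions written like this: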
void outputStuff( tostream& os )
{
    // output stuff to the stream
    os << ....
}
If you pass a tstringstream object there is no problem (it streams out wide characters), but if you pass a tofstream you get strange results (because everything is narrowed).
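The difference can be observed directly by sending the same output to a wide string stream and to a wide file stream and comparing the byte counts. This is a sketch under assumptions: the stream type is fixed to std::wostream, the helper names are hypothetical, and the 3-byte result is what I would expect under the default "C" locale.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for the outputStuff() above, fixed to the wide stream type.
void outputStuff(std::wostream& os)
{
    os << L"ABC";
}

// Bytes held by a wostringstream after outputStuff(): the characters stay wide.
std::size_t bytes_in_string_stream()
{
    std::wostringstream ss;
    outputStuff(ss);
    return ss.str().size() * sizeof(wchar_t);
}

// Bytes in the file a wofstream produces after outputStuff().
// path is a scratch file chosen by the caller.
std::size_t bytes_in_file(const char* path)
{
    {
        std::wofstream file(path, std::ios::out | std::ios::binary | std::ios::trunc);
        outputStuff(file);
    }   // the file is flushed and closed here
    std::ifstream in(path, std::ios::in | std::ios::binary);
    const std::string raw((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    return raw.size();
}
```

The string stream reports three wide characters, while the file holds one byte per character, which is the narrowing described above.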
-----------------------
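Wide file I/O: the solution
Single-stepping through the STL in the debugger shows that wofstream calls a std::codecvt object to narrow the output data before writing it to the file. A std::codecvt object is responsible for converting strings from one character set to another. The C++ standard requires two of them to be provided: (1) one that converts chars to chars (that is, laboriously does nothing), and (2) one that converts wchar_ts to chars. The latter is the one causing all this grief.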
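The wchar_t-to-char facet can also be called directly, which makes the narrowing step visible without a file. This is a minimal sketch; narrow_with_codecvt is a hypothetical helper name, and the classic "C" locale is assumed as the default.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Narrow a wide string with the locale's codecvt<wchar_t, char, mbstate_t>
// facet, the same kind of facet a wofstream consults before writing.
std::string narrow_with_codecvt(const std::wstring& wide,
                                const std::locale& loc = std::locale::classic())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    if (wide.empty()) {
        return std::string();
    }
    const Facet& cvt = std::use_facet<Facet>(loc);
    std::mbstate_t state = std::mbstate_t();
    // Reserve the worst-case number of external bytes.
    std::string narrow(wide.size() * static_cast<std::size_t>(cvt.max_length()), '\0');
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r =
        cvt.out(state, wide.data(), wide.data() + wide.size(), from_next,
                &narrow[0], &narrow[0] + narrow.size(), to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("wide-to-narrow conversion failed");
    }
    narrow.resize(static_cast<std::size_t>(to_next - &narrow[0]));
    return narrow;
}
```

For characters the locale cannot represent, the facet reports an error and the sketch throws, which matches the observation that the outcome of the conversion depends on the implementation and the locale.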
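The solution: write a new class derived from codecvt that converts wchar_ts to wchar_ts (that is, does nothing) and attach it to the wofstream object. When the wofstream tries to convert the data it is writing, it calls our new codecvt object, which does nothing, and the output data is written unchanged.
Browsing Google Groups turned up some code by P. J. Plauger (the author of the STL implementation shipped with MSVC), but it still had problems compiling with Stlport 4.5.3. This is the version I finally settled on: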
#include <locale>
// nb: MSVC6+Stlport can't handle "std::"
// appearing in the NullCodecvtBase typedef.
using std::codecvt ;
typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;
class NullCodecvt
: public NullCodecvtBase
{
public:
typedef wchar_t _E ;
typedef char _To ;
typedef mbstate_t _St ;
explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }
protected:
virtual result do_in( _St& _State ,
const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
_E* _F2 , _E* _L2 , _E*& _Mid2
) const
{
return noconv ;
}
virtual result do_out( _St& _State ,
const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
_To* _F2 , _To* _L2 , _To*& _Mid2
) const
{
return noconv ;
}
virtual result do_unshift( _St& _State ,
_To* _F2 , _To* _L2 , _To*& _Mid2 ) const
{
return noconv ;
}
virtual int do_length( _St& _State , const _To* _F1 ,
const _To* _L1 , size_t _N2 ) const _THROW0()
{
return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
}
virtual bool do_always_noconv() const _THROW0()
{
return true ;
}
virtual int do_max_length() const _THROW0()
{
return 2 ;
}
virtual int do_encoding() const _THROW0()
{
return 2 ;
}
} ;
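As you can see, these functions are all empty shells that do nothing except return noconv.
All that remains is to instantiate the class and attach it to the wofstream object. With MSVC, you are supposed to use the (non-standard) _ADDFAC() macro to imbue a locale into the object. It would not work with my new NullCodecvt class, however, so I bypassed the macro and wrote a new one to replace it: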
#define IMBUE_NULL_CODECVT( outputFile ) \
{ \
NullCodecvt* pNullCodecvt = new NullCodecvt ; \
locale loc = locale::classic() ; \
loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
(outputFile).imbue( loc ) ; \
}
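The example code given above, which did not work properly, can now be written like this:
tofstream testFile ;
IMBUE_NULL_CODECVT( testFile ) ;
testFile.open( "test.txt" , ios::out | ios::binary ) ;
testFile << _T("ABC") ;
It is important that the file stream object be imbued with the new codecvt object before the file is opened. The file must also be opened in binary mode. Otherwise, every time the file sees a wide character whose high or low byte is 10 (0x0A), it performs CR/LF translation, which is not what you want. If you really do want a CR/LF sequence, insert "\r\n" explicitly instead of using std::endl.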
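The _Addfac call in the macro is specific to the MSVC library, and whether wide data passes through untouched on noconv depends on how a given library's file buffer treats that result. A portable alternative, which is a different technique from the null facet above and is shown only for illustration, is a facet that converts explicitly in do_out and is installed with the standard two-argument std::locale constructor. This is a sketch assuming C++11. Utf16LeCodecvt, write_utf16le_file and read_bytes are hypothetical names, only code units up to 0xFFFF are handled, and reading (do_in) is not implemented.

```cpp
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical facet: writes each wchar_t as two bytes, low byte first.
class Utf16LeCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit Utf16LeCodecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}
protected:
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t* from_end, const wchar_t*& from_next,
                  char* to, char* to_end, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_end - to_next < 2) return partial;   // output buffer is full
            const unsigned long unit = static_cast<unsigned long>(*from_next);
            if (unit > 0xFFFF) return error;            // outside the range this sketch handles
            *to_next++ = static_cast<char>(unit & 0xFF);
            *to_next++ = static_cast<char>((unit >> 8) & 0xFF);
            ++from_next;
        }
        return ok;
    }
    result do_unshift(std::mbstate_t&, char* to, char*, char*& to_next) const override
    {
        to_next = to;       // stateless encoding: nothing to flush
        return noconv;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 2; }
    int do_encoding() const noexcept override { return 2; }
};

// Imbue before opening, and open in binary mode, as the text above requires.
void write_utf16le_file(const char* path, const std::wstring& text)
{
    std::wofstream file;
    // The locale takes ownership of the facet because refs is 0.
    file.imbue(std::locale(std::locale::classic(), new Utf16LeCodecvt));
    file.open(path, std::ios::out | std::ios::binary | std::ios::trunc);
    file << text;
}

// Read a file back as raw bytes, for checking what was written.
std::string read_bytes(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

Because the facet derives from std::codecvt<wchar_t, char, mbstate_t>, it inherits that facet's id, so the new locale uses it in place of the default narrowing facet without any library-specific macro.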
====================================
Reference: [Translated article] How to upgrade an STL-based application to support Unicode
http://dozb.bokee.com/1655050.html