UTF8Filer v1.4
UTF8Filer is an easy-to-use class that reads and writes UTF-8 (mixed byte)
files, whether raw text, delimited (CSV, tab, etc.) or fixed width data files.
It includes functions to convert to and from Unicode as well as some useful
mixed byte text handling functions.
ASP (pre .NET) does not support mixed byte files or text streams - only pure
Single Byte and Unicode, so we need to use this class to help us out.
Full Microsoft Excel delimited-file rules are also adhered to:
The VBScript function Split does not handle the CSV (Comma Separated Values) format correctly. There is more to CSV files than simply being comma separated: fields may be quoted, a quoted field may contain the delimiter or a line break, and a doubled quote inside a quoted field stands for a literal quote. This class contains a function which works just like Split, except that it applies these extra (standard) rules.
If you are only using single byte (English) files, then use my TextFiler class. It is smaller and better on memory with large files.
For example, take this line from a csv file (generated from MS Excel or any other program). See how the different functions interpret the line differently:
Original lines from the CSV file:
LNG,"Language Code",123,"Text,
and Text","and ""this""","first line second line of same field"
Split:
LNG | "Language Code" | 123 | "Text | and Text" | "and ""this""" | "first line
TextFiler's SplitDelimiter:
LNG | Language Code | 123 | Text, and Text | and "this" | first line second line of same field
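The rules at work in this comparison can be sketched in a few lines of VBScript. This is only an illustration of the quoting rules, not the code the class itself uses; the function name SplitCsvSketch is made up for the example.

```vbscript
' Hypothetical helper, NOT part of UTF8Filer or TextFiler.
' Splits one record on delim while honouring double-quoted fields.
Function SplitCsvSketch(ByVal record, ByVal delim)
    Dim fields(), count, i, ch, current, inQuotes
    ReDim fields(0)
    count = 0
    current = ""
    inQuotes = False
    i = 1
    Do While i <= Len(record)
        ch = Mid(record, i, 1)
        If inQuotes Then
            If ch = """" Then
                If Mid(record, i + 1, 1) = """" Then
                    current = current & """"    ' doubled quote = literal quote
                    i = i + 1
                Else
                    inQuotes = False            ' closing quote
                End If
            Else
                current = current & ch          ' delimiters and line breaks are kept
            End If
        ElseIf ch = """" Then
            inQuotes = True                     ' opening quote, not part of the data
        ElseIf ch = delim Then
            ReDim Preserve fields(count)        ' field finished, store it
            fields(count) = current
            count = count + 1
            current = ""
        Else
            current = current & ch
        End If
        i = i + 1
    Loop
    ReDim Preserve fields(count)                ' store the last field
    fields(count) = current
    SplitCsvSketch = fields
End Function

Dim parts
parts = SplitCsvSketch("LNG,""Text, and Text"",123", ",")
' parts holds three fields: LNG / Text, and Text / 123
```

A plain Split on the same record would return four pieces, because it cuts at the comma inside the quoted field and leaves the quote characters in place.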
See UTF8FilerDemo.asp for a demo of reading/writing text, or UTF8FilerDataDemo.asp for structured data.
Usage
A simplistic version of your ASP could look something like this:
<!-- #include file="UTF8Filer.asp" -->
<%
' Initialise the class
Dim MyUTF8File
Set MyUTF8File = New UTF8Filer
' Initialise the charset we are going to use
Session.CodePage = 932
MyUTF8File.UnicodeCharset = "shift_jis"
' Open the UTF-8 file (LoadFile returns False on failure)
If MyUTF8File.LoadFile("demo.htm") Then
    ' Convert it to Unicode so ASP functions can handle it
    MyUTF8File.cTextBuffer2Unicode
    ' Do something with the file
    Response.Write MyUTF8File.TextBuffer
Else
    Response.Write MyUTF8File.ErrorText
End If
' Clean up
Set MyUTF8File = Nothing
%>
Properties
ErrorText
String. Error description, set when a method returns False.
VirtualFileName
String. Contains the virtual path and
file name.
AbsoluteFileName
String. Contains the physical path
and file name.
Big5Space
Pseudo-constant. The Big5 (Traditional Chinese) space character; can be used
when setting FieldPadding.
CharNumber
Long. Character index (place marker)
from the ReadLine method.
Can be used to determine how far through the file we are.
Delimiter
Character. Only applicable to
delimited files. , = Comma (default), vbTab = Tab, etc
Setting this property will instruct the class to run in Delimited mode.
Set FieldWidths to swap to Fixed Width mode.
FieldWidths
String. Only applicable to Fixed Width files. Widths are in characters (not
bytes) and are comma separated, e.g. "10,5,20,8".
When reading this property, it returns the widths converted to Integers in an
array.
Setting this property will instruct the class to run in Fixed Width mode. Set
Delimiter to swap to Delimited mode.
FieldPadding
String. Only applicable to Fixed Width files. One entry per field, comma
separated; each entry is L or R (left/right) followed by a single byte or
Unicode (non-UTF8) padding character, e.g. "R ,,L0,R-,R" & chrw(&H3000).
Default is "R " or right space. The other common one is "L0" or left zeros.
When reading this property, it returns the padding converted to L/R and char in an
array.
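As a rough sketch of how FieldWidths and FieldPadding might be set together before writing a fixed width line (the field values are made up, and it is an assumption that an array is passed to WriteLine in this mode):

```vbscript
' Assumes this runs inside an ASP page that already includes UTF8Filer.asp.
Dim FixedFile
Set FixedFile = New UTF8Filer

' Four columns, widths counted in characters; this selects Fixed Width mode
FixedFile.FieldWidths = "10,5,20,8"

' One padding entry per column, in the notation described above:
' "R " = right space (the default), "L0" = left zeros
FixedFile.FieldPadding = "R ,L0,R ,R "

' Hypothetical row of data, one array element per column
If Not FixedFile.WriteLine(Array("LNG", "123", "Language Code", "ok")) Then
    Response.Write Server.HTMLEncode(FixedFile.ErrorText)
End If

Set FixedFile = Nothing
```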
Fields
Array. Array of fields read by
ReadLine method.
LineNumber
Long. Line index (pseudo place
marker) for the ReadLine and WriteLine methods.
You can see how many lines have been read / written.
LineDelimiter
String. vbCRLF = carriage return &
line feed (default), vbLF = Line feed, etc
TextBuffer
String. Contains the file opened by
the LoadFile method, or new text you add.
If you do not load a file, but place data into this buffer from another source,
then it is assumed that you are placing Unicode data into it.
TextBufferType
Integer. Type of the text in the
TextBuffer. 1 = Single Byte, 2 = Unicode/Double Byte, 3 = Mixed Byte
UnicodeCharset
String. Name of the character set the
data / file is in.
Note: Session.CodePage must be set to the equivalent value.
These are some common names: Windows-1252, X-ANSI, big5, gb2312, shift_jis, EUC-KR,
UTF-8, UTF-7, ASCII, etc
Methods
LoadFile
Returns: True if the file
opened successfully
Parameters: Absolute or Virtual path and file name. Must not be relative
(start with "../")
Syntax: LoadFile(FileName)
Example: if not LoadFile("myfile.htm") then 'do error handling
Loads (reads) the entire file into the TextBuffer string. The string at
this point will hold the unconverted UTF-8 characters.
cTextBuffer2Unicode must then be run for most other ASP functions to handle it
without corrupting the contents.
SaveFile
Returns: True if the file
saved successfully
Parameters: Absolute or Virtual path and file name. Must not be relative
(start with "../")
Syntax: SaveFile(FileName)
Example: if not SaveFile("myfile.htm") then 'do error handling
Saves the TextBuffer string to a UTF-8 file. The TextBuffer can be
in either Unicode or UTF-8 at this point.
The class converts to UTF-8 and saves the file in one step, so there is no
point running the cTextBuffer2UTF8 method before you save the file.
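A load, edit and save round trip might look like the sketch below. The file name and the replaced strings are placeholders, and charset setup (Session.CodePage and UnicodeCharset) is left out and would follow the Usage example.

```vbscript
' Assumes this runs inside an ASP page that already includes UTF8Filer.asp.
Dim Page
Set Page = New UTF8Filer

If Page.LoadFile("demo.htm") Then
    ' Convert to Unicode so that Replace does not corrupt multi-byte characters
    Page.cTextBuffer2Unicode

    ' Hypothetical edit of the buffer contents
    Page.TextBuffer = Replace(Page.TextBuffer, "old text", "new text")

    ' No conversion back to UTF-8 here: SaveFile is documented to do it
    If Not Page.SaveFile("demo.htm") Then
        Response.Write Server.HTMLEncode(Page.ErrorText)
    End If
Else
    Response.Write Server.HTMLEncode(Page.ErrorText)
End If

Set Page = Nothing
```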
cTextBuffer2UTF8
Returns: nothing
Parameters: none
Syntax: cTextBuffer2UTF8
Example: .cTextBuffer2UTF8
Converts the TextBuffer from Unicode to UTF-8. If it already is UTF-8
then it does nothing.
cUnicode2UTF8
Returns: nothing
Parameters: Unicode string
Syntax: cUnicode2UTF8(MyString)
Example: .cUnicode2UTF8(MyString)
Converts the string from Unicode to UTF-8.
cTextBuffer2Unicode
Returns: nothing
Parameters: none
Syntax: cTextBuffer2Unicode
Example: .cTextBuffer2Unicode
Converts the TextBuffer from UTF-8 to Unicode. If it already is Unicode
then it does nothing.
cUTF82Unicode
Returns: nothing
Parameters: UTF-8 string
Syntax: cUTF82Unicode(MyString)
Example: .cUTF82Unicode(MyString)
Converts the string from UTF-8 to Unicode.
EOF
Returns: True if ReadLine
(CharNumber) is at the End of the File (TextBuffer)
Parameters: none
Syntax: EOF
Example: while not .EOF 'do .ReadLine etc
ReadLine
Returns: If neither
Delimiter nor FieldWidths has been set, returns the next line of data
from TextBuffer; otherwise returns an array
of the fields in the next line of data from TextBuffer.
Also updates the Fields array if Delimiter or FieldWidths
has been set.
Parameters: none
Syntax: ReadLine
Example: x = .ReadLine
Reads 1 line (up to the next Line Feed) from the TextBuffer, and returns
the data or text depending on configuration options set
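A typical read loop over a delimited file could be sketched as follows. The file name data.csv is a placeholder, charset setup is omitted (see the Usage example), and it is an assumption that Delimiter may be set after the file has been loaded.

```vbscript
' Assumes this runs inside an ASP page that already includes UTF8Filer.asp.
Dim DataFile, Row, i
Set DataFile = New UTF8Filer

If Not DataFile.LoadFile("data.csv") Then
    Response.Write Server.HTMLEncode(DataFile.ErrorText)
Else
    DataFile.cTextBuffer2Unicode      ' convert before handling the text in ASP
    DataFile.Delimiter = ","          ' selects Delimited mode

    Do While Not DataFile.EOF
        Row = DataFile.ReadLine       ' an array of fields in Delimited mode
        For i = 0 To UBound(Row)
            Response.Write Server.HTMLEncode(Row(i)) & " | "
        Next
        Response.Write "<br>"
    Loop
End If

Set DataFile = Nothing
```

Copying the result of ReadLine into a local variable before indexing it avoids depending on how the Fields property is declared inside the class.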
WriteLine
Returns: True if the line was
written successfully
Parameters: line or array of data to write
Syntax: WriteLine(myLine)
Example: if not WriteLine(myLine) then 'do error handling
Writes a line to the end of the TextBuffer using the mode previously
configured. If Delimiter or FieldWidths has been set, the
fields/columns of data are written in that format; otherwise, the pure text is
written. The line is then terminated with the line delimiter specified
in the LineDelimiter property.
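Building a small tab delimited file from scratch might look like this sketch. The rows and the output file name are made up for illustration.

```vbscript
' Assumes this runs inside an ASP page that already includes UTF8Filer.asp.
Dim OutFile, Ok
Set OutFile = New UTF8Filer

OutFile.Delimiter = vbTab             ' selects Delimited mode, tab separated
OutFile.LineDelimiter = vbCrLf        ' the documented default, set explicitly

' Each WriteLine appends one row to the TextBuffer
Ok = OutFile.WriteLine(Array("LNG", "Language Code", "123"))
If Ok Then Ok = OutFile.WriteLine(Array("ENG", "English", "1"))

' SaveFile converts the buffer to UTF-8 and writes it out
If Ok Then Ok = OutFile.SaveFile("out.txt")

If Not Ok Then Response.Write Server.HTMLEncode(OutFile.ErrorText)
Set OutFile = Nothing
```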
SplitDelimiter
Returns: populates Fields
property
Parameters: String of field data
Syntax: SplitDelimiter(LineString)
Example: SplitDelimiter("1243,abcd,4321,dcba")
Converts a string to an array. This is not normally used directly, but is
exposed for you to use if you have the need.
SplitFixed
Returns: populates Fields
property
Parameters: String of field data. Reads FieldWidths property
Syntax: SplitFixed(LineString)
Example: SplitFixed("1243abcd 4321 dcba")
Converts a string to an array. This is not normally used directly, but is
exposed for you to use if you have the need.
Below are some bonus functions.
The first 2 are used internally by the class. I wrote and used the others a
while ago, but they are not really needed with the way this class works.
I am adding them in just in case you find the need for them:
CountChar
Returns: Number of times
SearchChar appears in SourceString
Parameters: String, Character
Syntax: CountChar(SourceString,SearchChar)
Counts the number of times SearchChar occurs in SourceString.
InstrMB
Returns: Position of
SearchChar in SourceString
Parameters: Binary String, Character
Syntax: InstrMB(SourceString,SearchChar)
Search for char in Mixed Byte string. Instr in binary mode doesn't seem to work
with a binary string array.
LeftMB
Same input/output as LeftB
Returns the leftmost given number of UTF-8 (mixed byte) chars in a Unicode stream.
LenMB
Same input/output as LenB
Count UTF-8 (mixed byte) chars in a Unicode stream.
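For readers curious how counting mixed byte characters can work, here is a minimal sketch. It is not the class's LenMB; the function name is made up, and it assumes the input is a binary (byte) string holding raw UTF-8.

```vbscript
' Hypothetical helper, NOT part of UTF8Filer.
' Counts UTF-8 characters in a binary string by skipping continuation bytes.
Function CountUtf8CharsSketch(ByVal rawBytes)
    Dim i, b, n
    n = 0
    For i = 1 To LenB(rawBytes)
        b = AscB(MidB(rawBytes, i, 1))
        ' Continuation bytes look like 10xxxxxx (&H80 to &HBF);
        ' every other byte starts a new character
        If (b And &HC0) <> &H80 Then n = n + 1
    Next
    CountUtf8CharsSketch = n
End Function
```

The same test on the top two bits is what lets a LeftMB-style function cut a string only at character boundaries.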
Important Notes
If you improve this code, please send me a copy!
Thanks!
Special thanks to Lewis Moten and Cakkie (see Planet Source Code) for their
techniques on UTF-8 conversion.
Hunter Beanland
hunter @ beanland.net.au
http://www.beanland.net.au/programming/
Version History
1.4 Slight optimisations
1.3 Added Padding support
1.2 Fixed 3 bugs in the Delimited Unicode mode of ReadLine and WriteLine.
1.1 Added functionality from my TextFiler class to handle Delimited and
Fixed Width data files
1.0 First version.