VBspeed / String / ParseCSV
VBspeed © 2000-10, updated: 02-Jul-2002
ParseCSV


The Definition

Function ParseCSV
CSV means Comma-Separated Values. It's a way of recording database fields in text format. In a CSV-file each record is represented by a line where the values (fields) are separated by commas. Values containing the separator are put between quotes. Quotes within quoted values are replaced by double-quotes.

The ParseCSV function splits a line of CSV values (or fields) and returns the values in a string array. Sounds like Split. However, there's more to the CSV-format and the CSV-parsing standards:

  1. Leading and trailing spaces are trimmed.
  2. The values can be in quotes.
  3. Quoted values are unquoted.
There's no way to handle this with Split. Below you find a table that demonstrates what a CSV-parsing function has to accomplish.
Declaration:
Public Function ParseCSV(Expression As String, asValues() As String) As Long
Arguments:
ExpressionRequired. String expression containing comma-separated values. If expression is a zero-length string, ParseCSV returns a single-element array containing a zero-length string.
asValues()Required. One-dimensional string array that will hold the returned substrings. Does not have to be bound before calling ParseCSV, and is guaranteed to hold at least one element (zero-based) on return.
[return]The function itself returns the number of elements in the returned array.

CSV-parsing by PHP
The following table is generated by a small PHP script. It shows how the native PHP-function fgetcsv() parses a CSV file (PHP 8.1.28-he.0). The behavior you see below corresponds 100% to the CSV-parsing of Excel 97.
# CSV Value 1 Value 2 Value 3 Value 4
1 a,b,c a b c
2 "a",b,c a b c
3 'a',b,c 'a' b c
4   a  ,  b  ,  c     a     b     c  
5 aa,bb;cc aa bb;cc
6
7 a a
8 ,b, b
9 ,,c c
10 ,,
11 "",b b
12 " ",b   b
13 "a,b" a,b
14 "a,b",c a,b c
15 " a , b ", c   a , b   c 
16 a b,c a b c
17 a"b,c a"b c
18 "a""b",c a"b c
19 a""b,c a""b c
20 a,b",c a b" c
21 a,b"",c a b"" c
22 a,"B: ""Hi, I'm B""",c a B: "Hi, I'm B" c
 
However (not shown in the table), there are some more or less criminal cases of not-so-well-formedness, where PHP and Excel differ. Here's one of them. How would you decide?
  ' ??? skip the b, or keep it, or keep the quote AND the b, or what ???
  ' PHP 4.0.6:  "a"b,c -> a,c
  ' Excel 97:   "a"b,c -> ab,c
  ' ParseCSV01: "a"b,c -> a"b,c

IsGoodParseCSV?
If you want to have a go at ParseCSV yourself, use this function (VB5/6-compatible) to verify the correctness of your code.


The Charts
Calls
 lRet = ParseCSV(Expression, asValues())
Call 1 Expression = [5 values list à 20 chars each]
Call 2 Expression = [5 values list à 20 chars each, values quoted]
Call 3 Expression = [50 values list à 2 chars each]
Call 4 Expression = [50 values list à 2 chars each, values quoted]
 VB5
CodeAuthorDopingNotes
ParseCSV01 Donald  
ParseCSV02MikeRegExp 
Call 1
11.0031µs
   
Call 2
11.0033µs
   
Call 3
11.00121µs
   
Call 4
11.00135µs
   
 VB6
CodeAuthorDopingNotes
ParseCSV01 Donald  
ParseCSV02 MikeRegExp 
Call 1
11.007.994µs
25.1541.176µs
Call 2
11.0033µs
25.88196µs
Call 3
11.00129µs
210.891,410µs
Call 4
11.00136µs
211.051,503µs
Notes & Conclusions
ParseCSV01 is a crudely done starter. I'm sure you can do better ...
ParseCSV02 uses the RegExp object (Library VBScript_RegExp_55, Microsoft VBScript Regular Expressions 5.5) which is available only for VB6. It allows for an impressively short code. Alas, the speed is less impressive.
Mail your code! How to read all those numbers


The Code
ParseCSV01
submitted 02-Jun-2002 by Donald Lessau  
revised 02-Jul-2002
Doping: none
Public Function ParseCSV01(ByRef Expression As String, asValues() As String) As Long
' by Donald, donald@xbeat.net, 20020603, rev 20020701
  
  Const lAscSpace     As Long = 32   ' Asc(" ")
  Const lAscQuote     As Long = 34   ' Asc("""")
  Const lAscSeparator As Long = 44   ' Asc(","), comma
  
  Const lValueNone    As Long = 0 ' states of the parser
  Const lValuePlain   As Long = 1
  Const lValueQuoted  As Long = 2
  
  ' BUFFERREDIM is ideally exactly the number of values in Expression (minus 1)
  ' so: if you know what to expect, fine-tune here
  Const BUFFERREDIM   As Long = 64
  Dim ubValues        As Long
  Dim cntValues       As Long
  
  Dim abExpression() As Byte
  Dim lCharCode As Long
  Dim posStart As Long
  Dim posEnd As Long
  Dim cntTrim As Long
  Dim lState As Long
  Dim i As Long
  
  If LenB(Expression) > 0 Then
    
    abExpression = Expression         ' to byte array
    ubValues = -1 + BUFFERREDIM
    ReDim Preserve asValues(ubValues) ' init array (Preserve is faster)
    
    For i = 0 To UBound(abExpression) Step 2
      
      ' 1. unicode char has 16 bits, but 32 bit Longs process faster
      ' 2. add lower and upper byte: ignoring the upper byte can lead to misinterpretations
      lCharCode = abExpression(i) Or (&H100 * abExpression(i + 1))
      
      Select Case lCharCode
      
      Case lAscSpace
        If lState = lValuePlain Then
          ' at non-quoted value: trim 2 unicode bytes for each space
          cntTrim = cntTrim + 2
        End If
      
      Case lAscSeparator
        If lState = lValueNone Then
          ' ends zero-length value
          If cntValues > ubValues Then
            ubValues = ubValues + BUFFERREDIM
            ReDim Preserve asValues(ubValues)
          End If
          asValues(cntValues) = ""
          cntValues = cntValues + 1
          posStart = i + 2
        ElseIf lState = lValuePlain Then
          ' ends non-quoted value
          lState = lValueNone
          posEnd = i - cntTrim
          If cntValues > ubValues Then
            ubValues = ubValues + BUFFERREDIM
            ReDim Preserve asValues(ubValues)
          End If
          asValues(cntValues) = MidB$(Expression, posStart + 1, posEnd - posStart)
          cntValues = cntValues + 1
          posStart = i + 2
          cntTrim = 0
        End If
      
      Case lAscQuote
        If lState = lValueNone Then
          ' starts quoted value
          lState = lValueQuoted
          ' trims the opening quote
          posStart = i + 2
        ElseIf lState = lValueQuoted Then
          ' ends quoted value, or is a quote within
          lState = lValuePlain
          ' trims the closing quote
          cntTrim = 2
        End If
      
      Case Else
        If lState = lValueNone Then
          ' starts non-quoted value
          lState = lValuePlain
          posStart = i
        End If
        ' reset trimming
        cntTrim = 0
      
      End Select
    
    Next
    
    ' remainder
    posEnd = i - cntTrim
    If cntValues <> ubValues Then
      ReDim Preserve asValues(cntValues)
    End If
    asValues(cntValues) = MidB$(Expression, posStart + 1, posEnd - posStart)
    ParseCSV01 = cntValues + 1
  
  Else
    ' (Expression = "")
    ' return single-element array containing a zero-length string
    ReDim asValues(0)
    ParseCSV01 = 1
  
  End If

End Function
Author's comments:
Donald's comments:

top | charts


ParseCSV02
submitted 05-Jun-2002 by Mike Peterson  
Doping: RegExp (Library VBScript_RegExp_55, Microsoft VBScript Regular Expressions 5.5)
Option Explicit
'*******************************************************
' cParser Class:  Used to parse out items from a CSV File.
' by Mike Peterson, mpeterson1200@yahoo.com, 20020605
' Uses (aarrgh!) Microsoft VBScript Regular Expressions 5.5 Object.
'*******************************************************
Private regEx As VBScript_RegExp_55.RegExp
Private mc As VBScript_RegExp_55.MatchCollection
Private m As VBScript_RegExp_55.Match


Private Sub Class_Initialize() Set regEx = New VBScript_RegExp_55.RegExp regEx.Global = True regEx.IgnoreCase = True regEx.Pattern = "(\s*""[^""]*""\s*,)|(\s*[^,]*\s*,)" 'The magic... End Sub
Friend Function ParseCSV02(sExpr$, arVals$()) As Long Dim lCnt& Set mc = regEx.Execute(sExpr & ",") ReDim arVals(mc.Count - 1) As String For Each m In mc arVals(lCnt) = Trim$(Left$(m.Value, m.Length - 1)) If Left$(arVals(lCnt), 1) = """" And Right$(arVals(lCnt), 1) = """" Then arVals(lCnt) = Mid$(arVals(lCnt), 2, Len(arVals(lCnt)) - 2) End If lCnt = lCnt + 1 Next m ParseCSV02 = mc.Count End Function
Author's comments :
Donald's comments :

top | charts




VBspeed © 2000-10 by Donald Lessau