Sean Holmesby

.NET and Sitecore Developer

By

Parsing a Delimited CSV File in Coveo

I was trying to find a way to take a single CSV file containing addresses, and parse each address into it’s own document.
The file had an address on every line, and was delimited with a ‘|’.

11||PARK|ST||01001|1113|AGAWAM|MA|N
21||STATION|ST||02108|1120|BOSTON|MA|N
27|A|GEORGE|ST||01944|1129|MANCHESTER|MA|N

After posting on the forums about getting this to work with the File Connector, I was told that the Database Connector was better suited for this.
https://answers.coveo.com/questions/4542/how-to-index-a-file-from-a-file-connector-source

To set this up there were a couple of moving pieces, so I thought I’d detail them in this blog post.

We will use the Database Connector, and the Microsoft Text ODBC Driver. The ODBC driver allows us to connect to the file, and perform queries on it.
To know what queries to run, and how to split the content up, we require two files to be created.

Create The Schema.ini file

This file is used to tell the database connector how to parse your file. The schema.ini file should be saved in the same location as your CSV file.

[MyAddressesFile.txt]
Format = Delimited(|)
ColNameHeader = False
MaxScanRows=0
CharacterSet = ANSI
Col1=StreetNumber Integer Width 20
Col2=StreetNumber2 Text Width 20
Col3=StreetName Text Width 60
Col4=StreetType Text Width 6
Col5=StreetName2 Text Width 60
Col6=ZipCode1 Text Width 6
Col7=ZipCode2 Text Width 6
Col8=City Text Width 60
Col9=State Text Width 10
Col10=MultipleDwellingUnit Text Width 6

The values for ColNameHeader tells the connector that the first row does not contain the Column names. We will get these further down in the Schema file.
MaxScanRows=0 tells the connector not to search the rest of the document for the column names.
The Format option tells the connector how the data is delimited. In my case, it was pipe separated.
As mentioned above, we then specify the names of each of the columns, and the approximate sizes of the content in each.

More information on the Schema.ini file can be found here.

Create your Configuration file

The configuration file is used to tell the connector what queries to run to extract the data. Typically, this is a little more complex, but for a CSV file it can be kept very simple.
Here is my config file.

<?xml version="1.0" encoding="utf-8" ?> 
<ODBC>
  <Mapping type="Addresses">
    <Accessor type="query">
      SELECT StreetNumber,
             StreetNumber2,
             StreetName,
             StreetType,
             StreetName2,
             ZipCode1,
             ZipCode2,
             City,
             State,
             MultipleDwellingUnit             
      FROM   [MafExtract1507.txt]
    </Accessor>
    <Fields>
      <Uri> http://www.somewebsite.com/default.aspx?Id= %[StreetNumber] %[StreetNumber2] %[StreetName] %[StreetType] %[StreetName2] %[ZipCode1] %[ZipCode2] %[City] %[State]</Uri>
      <ClickableUri>http://www.coveo.com</ClickableUri>
      <ContentType>text/html</ContentType>
      <Title>%[StreetNumber] %[StreetName] %[StreetType] %[City] %[State]</Title>
      <Body>%[StreetNumber] %[StreetName] %[StreetType] %[ZipCode1] %[ZipCode2] %[City] %[State]</Body>
    </Fields>
    <AllowedUsers>
      <AllowedUser type="Windows" allowed="true">
        <Name>everyone</Name>
        <Server></Server>
      </AllowedUser>
    </AllowedUsers>
  </Mapping>
</ODBC>

We can see that the name of the Mapping is ‘Addresses’. We will use this later when setting up the ‘Items to crawl’ field in our index source.
Note the query run here matches the fields with the column names from our file.
Also note that the URI field contains all the information about the address. This is because it needs to be unique for each address in the file.

An example of this configuration file can be found here.

Setup the Custom Fields for the Index

Now we want to create our custom fields for the index. Navigate to Configuration -> Fields, and create a new set of fields. Add custom fields that match the names of the columns in your data. (This also matches the query we specified in the above configuration file, and the column names we set up in the Schema.ini file.

02 - Custom Fields
Here are some instructions on how to create custom fields.
https://developers.coveo.com/display/public/SC201510/Displaying+External+Content+in+a+Search+Interface#DisplayingExternalContentinaSearchInterface-Step2-IndexingtheAuthorNamefortheexternalsource

Setup the Index Source using the Database Connector

Now that we’ve setup our schema and config files, we need to setup our index source.
Create a new index source, and use the following settings.

Source Type: Database
Addresses: This is where we put in our ‘connection string’ for the Odbc Text Driver.
Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=d:\PathToFolderThatContainsCSV\;Extensions=asc,csv,tab,txt;

Items to crawl: Matches the name of the Mappings node in the configuration file. <Mapping type=”Addresses”> so ‘Addresses’
Configuration file path: Link to the config file specified above.
Driver Type: Odbc
Use 32 bits driver: True (the Microsoft Text Driver is a 32 bit driver).

Now that this is saved, assign the custom field scheme we setup above to the source (Index -> Select your Index Source -> Fields -> Select custom field scheme -> Apply Changes).
Now rebuild the index.

You should see each address being indexed in the logs, and you can see that the Indexed Documents show all of the fields.

03 - Indexed Fields

A big thanks to the Coveo support team for helping me through setting this up!

Leave a Reply

Your email address will not be published. Required fields are marked *