ISSUE: SQL export to Excel using SSIS
We want to export numeric data directly to Excel but the rows come out as “Number Stored as Text”
Why is this and how do we get around it so that the numeric data is exported as numberic in Excel?
The first thing to understand is that Excel is not a database!
In Excel each cell is autonomous. That is, it is an entity in its own right and not part of a row or column other than by inclusion. This means each cell can have a different type within the same row or columm. By contrast all rows in a column within a database have the same type.
The connection provider (JET or ODBC) exposes the spread sheet workbook to look like it is a database with each sheet looking like a table, therefore we have rows of data organised in to columns which contain meaningful names (denoted by the text in the first row of the sheet) but this does not mean that the it supports all the properties of a database.
There are two major limitations affecting us right now.
• You cannot define the data type in Excel
• You cannot delete rows from the sheet
i.e. Columns therefore do not have a defined schema (As each cell in the column can be of a different type).
When you read data from Excel using a provider such as Jet or ODBC then the type is inferred from the first 8 rows of data in each column. The column header (name) is assumed to be in the first row so the (column) type will be inferred from rows 2 to 9 (unless you specify otherwise in the connection string).
Sadly when you write data to Excel, MS saw fit to use the same system which doesn’t really make sense. It would be nice if they had looked at the data type of the data you want to put in to the sheet but they didn’t so tough!
Because the type is inferred from the Sheet you cannot specify the data type anywhere in SSIS as this would be pointless.
- You can change the type of the data you are exporting at the source but it will still be exported in the format dictated by the already existing Excel sheet.
- You can change the column type in the advanced properties of the Excel destination control in SSIS but this will just result in an error in the connection until you re-bind the sheet and the value will be put back to what the Excel provider thinks it should be.
What are the rules for inferring type from the column?
- If there are less than 8 rows then it uses the data available
- If there are more than 8 rows it uses the top 8 rows (not including the header)
- If the top 8 rows have different types then the majority rule
- If the top 8 rows have different types but the same number of each e.g. 4 dates and 4 integers then the type follows type precedence which in Excel is numeric first (so integer wins in this case)
- Type is not affected by Excel cell formatting settings.
- If there is no data in the rows then no type can be inferred so when writing the data, the provider will always write as text as this is the safer option. You can use the VBA VAL() function to convert the text to a number.
If reading data from a column with mixed types then type is inferred as above and cells with alternate types will be returned as NULL
The best way to export data to Excel is use a template file (blank Excel sheet that you copy to a new location and then fill) and define a single hidden row in the Excel Spread sheet just under the column headings which contains a value of the correct data type.
When we connect SSIS to the sheet it will infer its type from this hidden rows and therefore populate additional rows with the correct type, however, as we cannot delete rows from the spread sheet via SSIS we cannot remove this hidden data prior to sending to the client so this hidden data must be innocuous. i.e. It must not be seen as a result or have any effect on the results.
The following bugs should also be noted when using the ODBC driver. This is not the case for the ADO connection that uses the JET driver. The following was taken from a Microsoft knowledgebase article….
Column headings: By default, it is assumed that the first row of your Excel data source contains columns headings, which can be used as field names. If this is not the case, you must turn this setting off, or your first row of data “disappears” to be used as field names. This is done by adding the optional FirstRowHasNames= setting to the connection string. The default, which does not need to be specified, is FirstRowHasNames=1, where 1 = True. If you do not have column headings, you need to specify FirstRowHasNames=0, where 0 = False; the driver names your fields F1, F2, and so forth. This option is not available in the DSN configuration dialog box.
However, due to a bug in the ODBC driver, specifying the FirstRowHasNames setting currently has no effect. In other words, the Excel ODBC driver (MDAC 2.1 and later) always treats the first row in the specified data source as field names. For additional informationon the Column Heading bug, click the article number below to view the article in the Microsoft Knowledge Base:
288343 BUG: Excel ODBC Driver Disregards the FirstRowHasNames or Header Setting
Rows to Scan: Excel does not provide ADO with detailed schema information about the data it contains, as a relational database would. Therefore, the driver must scan through at least a few rows of the existing data in order to make an educated guess at the data type of each column. The default for “Rows to Scan” is eight (8) rows. You can specify an integer value from one (1) to sixteen (16) rows, or you can specify zero (0) to scan all existing rows. This is done by adding the optional MaxScanRows= setting to the connection string, or by changing the Rows to Scan setting in the DSN configuration dialog box.
However, due to a bug in the ODBC driver, specifying the Rows to Scan (MaxScanRows) setting currently has no effect. In other words, the Excel ODBC driver (MDAC 2.1 and later) always scans the first 8 rows in the specified data source in order to determine each column’s datatype.