Malformed CSV data #48

@joefutrelle

This is related to issue #46. When I try to read the tps4 neg data with pandas.read_csv, I get this parser error:

>>> df = pd.read_csv('LCdata/neg/Tps4_neg_withAddedDetails.2015.02.05.csv')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 420, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 225, in _read
    return parser.read()
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 626, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1070, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas/parser.c:7110)
  File "parser.pyx", line 749, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7334)
  File "parser.pyx", line 802, in pandas.parser.TextReader._read_rows (pandas/parser.c:7943)
  File "parser.pyx", line 789, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7817)
  File "parser.pyx", line 1697, in pandas.parser.raise_parser_error (pandas/parser.c:19569)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 50 fields in line 2510, saw 51

That line of the CSV file looks like this:

2509,770.8654788,770.8654788,770.8654788,85.3971,85.3971,85.3971,1,1,7192.340903,2403.965241,22070.52672,187320.4289,142664.2457,214115.0429,10416.10935,21067.41343,69994.9968,1105750.95,72005.57328,30897.23792,21990.54678,34453.52575,0,64901.8607,6047.34812,0,0,0,19950.28018,171732.1797,0,204246.1387,11901.23666,1126.787846,0,176015.1765,6520.571956,0,0,2235.663193,3300.804818,0,15898.08514,243419.2922,,[M-2H+Na]- 749.885,24,0,organo-iodine compound, PubChem CID 11535056

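It looks like the annotated column is supposed to contain the single value organo-iodine compound, PubChem CID 11535056, but that value has a comma in it and is not quoted, so the parser splits it into two fields and sees 51 fields instead of 50. The correct CSV formatting for this line is: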

2509,770.8654788,770.8654788,770.8654788,85.3971,85.3971,85.3971,1,1,7192.340903,2403.965241,22070.52672,187320.4289,142664.2457,214115.0429,10416.10935,21067.41343,69994.9968,1105750.95,72005.57328,30897.23792,21990.54678,34453.52575,0,64901.8607,6047.34812,0,0,0,19950.28018,171732.1797,0,204246.1387,11901.23666,1126.787846,0,176015.1765,6520.571956,0,0,2235.663193,3300.804818,0,15898.08514,243419.2922,,[M-2H+Na]- 749.885,24,0,"organo-iodine compound, PubChem CID 11535056"
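To illustrate the effect of the quotes, here is a minimal sketch using Python's standard csv module rather than pandas. The two rows are shortened, hypothetical stand-ins for the real 50-column line (only the first field and the last few are kept), and `count_fields` is a helper name made up for this example.

```python
import csv
import io

# Shortened, hypothetical stand-ins for the real 50-column row.
unquoted = '2509,24,0,organo-iodine compound, PubChem CID 11535056'
quoted = '2509,24,0,"organo-iodine compound, PubChem CID 11535056"'


def count_fields(line):
    """Return the number of fields csv.reader sees in a single CSV line."""
    return len(next(csv.reader(io.StringIO(line))))


# The unquoted row is split at the comma inside the annotation,
# so it has one field more than the quoted row.
print(count_fields(unquoted))
print(count_fields(quoted))
```

With the quotes in place, the annotation is read as one field and the row has the same number of fields as the rest of the file.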

So I think this is a "data bug" rather than a problem in the code. There may be more instances of this problem in the data, which I can discover using pd.read_csv.
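Since pd.read_csv stops at the first bad line, one way to list all of them in a single pass is to count fields per row with the standard csv module. This is only a sketch, assuming Python 3 and assuming the first row is a header with the correct number of fields; `find_bad_rows` is a hypothetical helper, not part of pandas or this repository.

```python
import csv


def find_bad_rows(path, expected=None):
    """Yield (line_number, field_count) for rows whose field count
    differs from the expected count.

    If expected is None, the first row (assumed to be the header)
    defines the expected number of fields.
    """
    with open(path, newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            if expected is None:
                # Take the field count from the header row.
                expected = len(row)
                continue
            # Skip completely empty lines; flag everything else that
            # does not match the expected width.
            if row and len(row) != expected:
                yield reader.line_num, len(row)
```

Calling `list(find_bad_rows('LCdata/neg/Tps4_neg_withAddedDetails.2015.02.05.csv'))` should then report every line with an unquoted comma (or any other field-count mismatch), not just the first one.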
