change csv to deal in strings now and convert appended values into strings#625
change csv to deal in strings now and convert appended values into strings#625wfsyre wants to merge 2 commits intoUbuntu-22.04from
Conversation
22ea269 to
1854c6c
Compare
include/scrimmage/common/CSV.h
Outdated
| std::string operator()(size_t value) const { return std::to_string(value); } | ||
| std::string operator()(long value) const { return std::to_string(value); } | ||
| std::string operator()(const std::string& value) const { return value; } | ||
| std::string operator()(bool value) const { return value ? "True" : "False"; } |
There was a problem hiding this comment.
I'd opt for "true" and "false" (not capitalized). There's not a huge reason for this, but there's also not a huge reason for capitalizing, so my "break the tie" reasoning" is that if people are ever typing it, that's one less thing to mess up. If it's read case-insensitive (which I would hope it is), then no reason to throw in a capital and make people wonder. Python is the only thing I know of that has case-sensitive "True" and "False". No reason to bring that into a CSV.
include/scrimmage/common/CSV.h
Outdated
| public: | ||
| typedef std::list<std::string> Headers; | ||
| typedef std::list<std::pair<std::string, double>> Pairs; | ||
| typedef std::variant<bool, size_t, unsigned int, long, int, double, std::string> |
There was a problem hiding this comment.
Given possible ripple effects, I'm not sure if what I'm about to say is feasible. Actually, if all you had was double before, then all these options should be new, so it wouldn't affect existing code any differently, so I'll go with that assumption. Try to erase unsized types like unsigned int, long, int, etc. from usage. Prefer sized types. What's an int? It has a minimum size, but you don't really know because those are implementation-defined. Same for long. I'm guessing for int you're expecting a 32-bit value, so use int32_t. I'm guessing long is expected to be a 64-bit value, so use int64_t.
And if these are for data files, you can probably get away without a size_t. Depends on what you're expecting to represent. A size_t isn't "how many things", but is specifically a value used to represent memory sizes, so it's tied to the size of a pointer. That's why they're used as offsets in arrays -- not because they're counting items, but because they're being used as memory offsets. So while you use it in code, I'm not sure if you'd want that kind of semantic exposed as an option in a data file.
|
|
||
| double at(int row, const std::string& header); | ||
| template <class T1> | ||
| T1 at(int row, const std::string& header) { |
There was a problem hiding this comment.
see -- now this is where I'd use a size_t, because it's an index. But this ties into existing underlying definitions, so I wouldn't change that now. (Actually, taking a closer look... the table uses an int as a key into a map... wut? Why is it not a vector? And can you have negative column/row indexes? Weird... still -- not something for this update.)
src/common/CSV.cpp
Outdated
| bool CSV::append(const Pairs& pairs, bool write, bool keep_in_memory) { | ||
|
|
||
| for (std::pair<std::string, double> pair : pairs) { | ||
| for (auto pair : pairs) { |
There was a problem hiding this comment.
The existing code was making a copy of the pair every time, and the current code still does that. You want to iterate over the pairs by reference, so auto&, and if you're not changing anything, which you're not since pairs is const, then const auto&.
src/common/CSV.cpp
Outdated
| } else { | ||
| values[kv.first] = std::to_string(kv.second); | ||
| } | ||
| values[kv.first] = kv.second; |
There was a problem hiding this comment.
This will work, but check the fuller context. Originally, the logic was: 1) build a vector of converted strings, 2) build an aggregate string out of those. You already have the strings, though, so your values vector here is really just making a bunch of copies. You could get rid of the copies by making it a vector of string ptrs (or references, but it'd have to be of reference_wrappers because you can't make a vector of raw references, but ptrs would do fine in this case). But you can get rid of values altogether; it was never really needed in the first place since you can append to the final string as you iterate. No real need for an intermediate list. And continually appending to a string like this isn't the most efficient since it's having to re-allocate a bigger and bigger string every time. Make result a std::ostringstream of appending is more efficient to the stream, and then return result.str() so it's converted only once at the end.
| boost::split(tokens, line, boost::is_any_of(",")); | ||
| for (unsigned int i = 0; i < tokens.size(); i++) { | ||
| table_[row_num][i] = std::stod(tokens[i]); | ||
| table_[row_num][i] = tokens[i]; |
There was a problem hiding this comment.
This is going to be the most annoying thing, and is going to tie into CSV::row_to_string()... what happens if somebody writes a string that contains a comma? Look at the specification section at https://en.wikipedia.org/wiki/Comma-separated_values, and it tells you how to handle that situation. Any time you decide to allow strings in data files, that opens up "fun" edge cases that you have to think about. What about newlines? Carriage returns? Maybe CSV::append() needs to have more failure cases.
Note that when you handle things like escaped double quotes, those need to be handled via linearly scanning through the string; you can't do global substring substitutions because that may not work the way you expect it to for some cases.
|
Closed by accident. Reopened as a new MR here #630 |
Right now, the CSV writer class can only handle writing doubles to the csv table object. These get converted into strings when writing to the actual CSV. To simplify things, the CSV writer should just maintain a map of strings, which allows us to write any data type we want to the CSV as long as we convert it to a string first. It introduces some extra pre-processing we have to do to recover data from a CSV, but I think the schema of
get<datatype>()is already well established in scrimmage so its not a large burden. This also allows people the added control of determining how their data should be turned into a string instead of leaving it to the CSV writer to determine. I have left the scientific vs fixed schema in there but technically all that should be done before the CSV writer even sees the value. The Visitor stuff is a bit of a hacky way to leave some backward compatibility for the old Append method that only took a double.