"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
If the DMS output isn’t quoting fields that contain commas, that’s technically invalid CSV.
A small normalization step before COPY (or ensuring the writer emits RFC-compliant CSV in the first place) would make the pipeline robust without renaming countries or changing delimiters.
That way, if/when the DMS output is fixed upstream, nothing downstream needs to change.
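A minimal sketch of that normalization step, assuming the rows have already been parsed upstream (parsing the malformed output itself is the hard part): the stdlib `csv` writer re-serializes them RFC 4180-style, quoting any field that contains a comma, quote, or newline.

```python
import csv
import io

def normalize_rows(rows):
    """Re-serialize parsed rows as RFC 4180-style CSV.

    csv.QUOTE_MINIMAL (the default) quotes only the fields that
    need it: those containing the delimiter, a quote, or a newline.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerows(rows)
    return buf.getvalue()

# A field with an embedded comma comes out properly quoted:
print(normalize_rows([["CD", "Congo, Democratic Republic of the"]]))
# CD,"Congo, Democratic Republic of the"
```

The output is then safe to feed to COPY in CSV mode, and nothing needs to know the country names were ever involved.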
This is why SQL is "broken": it's powerful and simple, and people will always do the wrong thing.
Was teaching a class on SQL, and half the class time was spent reminding them that the examples with string concatenation were bad and that they should use prepared statements (JDBC).
Come practice time, half the class did string concatenation anyway.
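The contrast the class kept getting wrong, sketched here with Python's sqlite3 placeholders standing in for JDBC's PreparedStatement (the idea is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (uid INTEGER, first_name TEXT)")

name = "Robert'); DROP TABLE user;--"
conn.execute("INSERT INTO user VALUES (1, ?)", (name,))

# Wrong: string concatenation -- the input becomes part of the SQL text.
#   query = "SELECT uid FROM user WHERE first_name = '" + name + "'"

# Right: a placeholder -- the driver passes the value out of band,
# so it can never change the shape of the statement.
rows = conn.execute(
    "SELECT uid FROM user WHERE first_name = ?", (name,)
).fetchall()
print(rows)  # -> [(1,)]  the hostile string is just data
```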
This is why I love Linq and modern parameterized query strings in JS: they make the right thing easier than the wrong thing.
@Dao
public interface UserDao {
    @Query("SELECT * FROM user")
    List<User> getAll();

    @Query("SELECT * FROM user WHERE uid IN (:userIds)")
    List<User> loadAllByIds(int[] userIds);

    @Query("SELECT * FROM user WHERE first_name LIKE :first AND " +
           "last_name LIKE :last LIMIT 1")
    User findByName(String first, String last);

    @Insert
    void insertAll(User... users);

    @Delete
    void delete(User user);
}

The Linq code is native C# that can be strongly typed for IDs, etc., but you can "think" in SQL terms by writing Where, Select, OrderBy and so on (I'll admit the C# world hasn't really gotten there in terms of promoting strongly typed DB IDs yet, but the support is there).
create.select(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME, count())
      .from(AUTHOR)
      .join(BOOK).on(AUTHOR.ID.equal(BOOK.AUTHOR_ID))
      .where(BOOK.LANGUAGE.eq("DE"))
      .and(BOOK.PUBLISHED.gt(date("2008-01-01")))
      .groupBy(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME)
      .having(count().gt(5))
      .orderBy(AUTHOR.LAST_NAME.asc().nullsFirst())
      .limit(2)
      .offset(1)

I guess it's also why all our fancy (as in tsv++?) file types (like GTF and BED) are tab- (or space-) based. Those fields often have commas in cells for nested lists etc.
I wish sep="\t" were the default and one would have to do pd.read_tsv/to_tsv(sep=",") for CSV. It would have saved me hours and hours of work, and I'd cross the 79-character limit much less often ;)
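For what it's worth, the stdlib csv module makes the delimiter explicit either way (pandas' `read_csv(sep="\t")` is the one-liner equivalent); a small sketch of a TSV round trip with a comma-laden field:

```python
import csv
import io

rows = [["gene", "coords"], ["BRCA1", "17:43,044,295-43,125,364"]]

# Write TSV: the embedded commas need no quoting at all.
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)
tsv = buf.getvalue()

# Read it back; the data survives unchanged.
back = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
print(back == rows)  # -> True
```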
It looked horrible, the textures just wouldn't load no matter what I tried. Finally, on a forum, some other user, presumably also from Europe, noted that you have to use decimal point as a decimal separator (my locale uses a comma). And that solved the problem.
The whole business of software engineering exists in the gap between "it works today on this input" and "it will also work tomorrow and the day after and after we've scaled 10x and rewrote the serialization abstraction and..."
See also: "Glorp 5.7 Turbo one-shot this for me and it works!"
The JSON in CSVs does piss off the JetBrains table viewer sometimes, though: it will randomly decide to stop parsing a 50k-line CSV halfway through the JSON of line 300-ish, even though that JSON is no different from other JSON it parsed just fine.
But Python reads and writes them fine, as does whatever SQL engine I'm touching, as do other tools.
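A quick sanity check of what those tools presumably do, showing that a JSON cell (with its quotes and commas) survives a stdlib CSV round trip:

```python
import csv
import io
import json

payload = {"tags": ["a", "b"], "note": 'he said "hi"'}

# Write one row with a JSON-valued second column.
buf = io.StringIO()
csv.writer(buf).writerow([1, json.dumps(payload)])

# Read it back; the embedded quotes and commas are all intact.
(row,) = csv.reader(io.StringIO(buf.getvalue()))
print(json.loads(row[1]) == payload)  # -> True
```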
> >>> json.dumps({ "X" : 1 << 66 })
> '{"X": 73786976294838206464}'
What's the parsing result in JavaScript? What's the parsing result in Java?

number, 73786976294838206464

Because it's easy to understand. Non-technical people understand it. There is tremendous value in that, and the fact that it's underspecified with ugly edge cases doesn't change that.
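On the 1 << 66 question above: Python round-trips the integer exactly; the hazard is any consumer that coerces to an IEEE double (as JavaScript's JSON.parse does), since a 53-bit mantissa can't hold every 67-bit integer. 2**66 itself happens to survive because it's a power of two; 2**66 + 1 does not:

```python
import json

n = (1 << 66) + 1

# Python round-trips arbitrary-precision ints exactly.
assert json.loads(json.dumps({"X": n}))["X"] == n

# A double-based parser silently rounds to the nearest representable
# value: near 2**66 the doubles are 2**14 apart, so the +1 is lost.
assert float(n) == float(1 << 66)
assert int(float(n)) != n
```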
An interchange format needs to include information showing that you have all the data - e.g. a hash or the number of rows - the way JSON/XML/s-expressions have closing symbols to match their opening ones.
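A minimal sketch of the idea, using an invented convention of a final `#rows=N sha256=…` trailer line (nothing standard; any agreed trailer works):

```python
import hashlib

def write_with_trailer(rows):
    """Serialize rows and append a completeness trailer."""
    body = "".join(r + "\n" for r in rows)
    digest = hashlib.sha256(body.encode()).hexdigest()
    return body + f"#rows={len(rows)} sha256={digest}\n"

def verify(text):
    """True iff the trailer matches the body: the transfer is complete."""
    *rows, trailer = text.splitlines()
    body = "".join(r + "\n" for r in rows)
    digest = hashlib.sha256(body.encode()).hexdigest()
    return trailer == f"#rows={len(rows)} sha256={digest}"

blob = write_with_trailer(["a,1", "b,2"])
assert verify(blob)                              # intact transfer
assert not verify(blob.replace("a,1", "a,9"))    # corruption detected
```

A truncated file either loses the trailer entirely or keeps a trailer that no longer matches; both fail the check, which is exactly what bare CSV can't tell you.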
The "dialect-dependent" part is usually about escaping double quotes, newlines, and line continuations.
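The RFC 4180 convention is to double embedded quotes; some dialects use backslash escapes instead, which is exactly where parsers disagree. The stdlib default follows the doubling rule:

```python
import csv
import io

# One row with an embedded quote and an embedded newline.
buf = io.StringIO()
csv.writer(buf).writerow(['say "hi"', "line1\nline2"])

# Both fields get quoted; the inner quote is doubled, not backslashed.
print(repr(buf.getvalue()))
# '"say ""hi""","line1\nline2"\r\n'
```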
Not a portable format, but it's not too bad (for this use) either, considering the country list is mostly static.
Sure, the most common collation scheme for country names is to sort while ignoring certain prefixes like "The Republic of", "The", "People's Democratic...", etc., but this is purely a presentation-layer issue (how to order a list of countries for a user) that should be independent of your underlying data.
Sure, "hacking" the name of the country like this, so that traditional alphabetical ordering matches a particular ordering meant to aid human navigation, has a long history (encyclopedia indexes, library indexes, record stores, etc.), but that was the age of paper and physical filing systems.
Store the country name correctly and then provide a custom sort or multiple custom sorts where such functionality belongs - in the presentation layer.
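A presentation-layer sort of this kind is a one-liner over clean data; a sketch with a hypothetical prefix list (a real app would use locale-aware collation, e.g. ICU):

```python
# Hypothetical, incomplete prefix list -- for illustration only.
PREFIXES = ("The Republic of ", "The ")

def sort_key(name):
    """Sort key that ignores leading honorific prefixes."""
    for p in PREFIXES:
        if name.startswith(p):
            return name[len(p):]
    return name

countries = ["The Republic of Moldova", "Germany", "The Gambia"]
print(sorted(countries, key=sort_key))
# ['The Gambia', 'Germany', 'The Republic of Moldova']
```

The stored names stay correct; only the ordering logic lives in the UI, and you can swap in a different key per audience without touching the data.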
Personally I've never seen any sort where "The Republic of Moldova" would be sorted at "M".
This sometimes causes problems for the UK, which can be sorted as U, G, B, or even E (presumably for "England", making it especially annoying for people in the other countries of the UK).
Ah, but what _is_ the boundary, asks Transnistria?
It's not the serialisation that is correct, it must be the data.
Let's rename a country because we're not capable of handling commas in data.