For many years (decades?) now, I've been using "index" for 0-based and "number" for 1-based, as in "column index" for a C/Python style [ix] vs. "column number" for a shell/awk/etc. style $1 $2. Not sure this is the best terminology, but it is nice to have something consistent. E.g., "offset" for 0-based indices suggests how far you are "off" from the start, and even its initial letter "o" can be read as the zero of the range. So, "offset" might be better than "index" for 0-based.
If it helps anyone explain the SkiFire point any better, I like to analogize it to an I-bar cursor vs. a block cursor for text entry. An I-bar is unambiguously "between characters" while a block cursor is not. So, there are questions that arise for block cursors that basically never arise for I-bar cursors. When just looking at an integer like 2 or 3, there is no cursor at all. So, we must instead rely on names/conventions/assumptions with their attendant issues.
To be clear, I liked the SkiFire explanation, but having multiple ways to describe/think about a problem is usually helpful.
In most modern languages, the ordinal numbers start at 2. In most old languages, and also in English, the ordinal numbers start at 3.
The reason for this is that ordinal numbers were created only recently, a few thousand years ago.
Before that time, there were special words only for certain positions of a sequence, i.e. for the first and for the last element and sometimes also for a few elements adjacent to those.
In English, "first", "second" and "last", are not ordinal numbers, but they are used for the same purpose as ordinal numbers, though more accurately is to say that the ordinal numbers are used for the same purpose with these words, as the ordinal numbers were added later.
The ancient Indo-European languages had a special word for the other element of a pair, i.e. the one that is not the first element. This word was used for what is now named "second". In late Latin, the original word that meant "the other of a pair" was replaced with a word meaning "the following", which eventually passed into English through French in the form of "second".
In much the same way as physicists co-opted common words (e.g. "work" and "energy") to mean very specific things in technical contexts, both linguists and mathematicians gave "ordinal" a specific meaning in their respective domains. These meanings are similar but different, and your nitpick mistakenly asserts that one of them has priority over the other.
"Ordinal" in linguistics is a word for a class of words. The words being classified may be old, but the use of "ordinal" to denote them is a comparatively modern coinage, roughly contemporary with the mathematicians usage. Both come from non-technical language describing putting things in an "orderly" row (c.f. cognates such as "public order", "court order", etc.) which did not carry the load you are trying to place on them.
It was originally proposed as lengthof, but the results of the public poll and the ambiguity convinced the committee to choose countof, instead.
`countof` removes the verb possibility - but that means that a preference for `countof` over `lengthof` isn't necessarily a preference for `count` over `length`.
Countof is strange, because one doesn’t talk about the “count of something” in English, other than uses like “on the count of three” (or the “count of Monte Cristo” ;)).
Just lining things up neatly helps spot bugs.
It’s the one thing I don’t like about strict formatters: I can no longer use spaces to line things up.
That said, I'm not sure how 1-based indexing will solve off-by-1 errors. They naturally come from the fencepost problem, i.e. the fact that sometimes we use indexes to indicate elements and sometimes to indicate the boundaries between them. Mixing the two in our reasoning ultimately results in off-by-1 issues.
With half-open intervals, the count of elements is the difference between the interval bounds, adjacent intervals share 1 bound and merging 2 adjacent intervals preserves the extreme bounds.
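A minimal Rust sketch of those three properties (the array contents here are arbitrary placeholders):

fn main() {
    let data = [10, 20, 30, 40, 50];

    // Half-open interval [1, 4): the element count is the difference
    // between the bounds, 4 - 1 = 3.
    let left = &data[1..4];
    assert_eq!(left.len(), 4 - 1);

    // The adjacent interval [4, 5) shares the bound 4 with [1, 4),
    // and merging the two preserves the extreme bounds: [1, 5).
    let right = &data[4..5];
    let merged = &data[1..5];
    assert_eq!(left.len() + right.len(), merged.len());
}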
Any programming problem is simplified when 0-based indexing and half-open intervals are always used, without exception.
The fact that most programmers have been taught when young to use 1-based ordinal numbers and closed intervals is a mental handicap, but normally it is easy to get rid of it, just as it is easy to get rid of the mental handicap of having learned to use decimal numbers when there is no reason to ever use them instead of binary numbers.
My opinion is that 1-based indexing really exacerbates off-by-one errors, besides requiring a more complex, more bug-prone implementation in compilers. With 1-based addressing, the compiler must create and use, transparently to the programmer, pointers that do not point to the intended object but to an invalid location just before it, and which must never be dereferenced. This is why 1-based addressing was easier in languages without pointers, like the original FORTRAN, but it would have been more difficult in languages that allow pointers, like C, the difficulty being in avoiding exposing the internal representation of pointers to the programmer.
Off-by-one errors are caused by mixing conventions for expressing indices and ranges.
If you always use a consistent convention, e.g. 0-based indexing together with half-open intervals, where the count of elements equals the difference between the interval bounds, there is no chance of ever making off-by-one errors.
To use the terminology from the article, with 0-based indexing, offset = index * node_size. If it were 1-based, you would have offset = (index - 1) * node_size + 1.
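A small illustration of the two formulas, assuming a hypothetical fixed node size and, for the 1-based case, that byte offsets are also counted from 1 (which is what the +1 above expresses):

const NODE_SIZE: usize = 64; // hypothetical node size in bytes

// 0-based indices with 0-based byte offsets: node 0 starts at byte 0.
fn offset_zero_based(index: usize) -> usize {
    index * NODE_SIZE
}

// 1-based indices with 1-based byte offsets: node 1 starts at byte 1,
// so the index must be re-biased before the multiplication.
fn offset_one_based(index: usize) -> usize {
    (index - 1) * NODE_SIZE + 1
}

fn main() {
    assert_eq!(offset_zero_based(0), 0);
    assert_eq!(offset_zero_based(3), 3 * NODE_SIZE);
    assert_eq!(offset_one_based(1), 1);
    assert_eq!(offset_one_based(4), 3 * NODE_SIZE + 1);
}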
And it became a convention even for high-level languages, because no matter what you prefer, inconsistency is even worse. An interesting case is Perl, which, in classic Perl fashion, lets you choose by setting the $[ variable. Most people, even Perl programmers, consider it a terrible feature, and 0-based indexing is used by default.
We can't choose to switch to 1-based indexing - either we use 0-based everywhere, or a mixture of 0-based and 1-based. Given the prevalence of off-by-one errors, I think the most important thing is to be consistent.
"Is there any reason to not just switch to 0-based indexing if we could? Seems like 1-based indexing really exacerbates off-by-one errors without much benefit"
The problem is that humans make off-by-one errors and not that we're using the wrong indexing system.
Programming languages are filled with tiny design choices that don’t completely prevent mistakes (that would be impossible) but do make them less likely.
There are better programming languages, where you do not need to do what you say.
Some languages, like Ada, have special array attributes for accessing the first and the last elements.
Other languages, like Icon, allow the use of both non-negative indices and of negative indices, where non-negative indices access the array from its first element towards its last element, while negative indices access the array from its last element towards its first element.
I consider that your solution, i.e. using array[length] instead of array[length-1], is much worse. While it scores a point for simplifying this particular expression, it loses points by making other expressions more complex.
There are a lot of better programming languages than the few that due to historical accidents happen to be popular today.
It is sad that the designers of most of the languages that attempt today to replace C and C++ have not done due diligence by studying the history of programming languages before designing a new one. Had they done that, they could have avoided repeating the mistakes of the languages with which they want to compete.
You should prefer a language, like Rust, in which [T]::last is Option<&T> -- that is, we can ask for a reference to the last item, but there might not be one and so we're encouraged to do something about that.
IMNSHO, the pit of success you're looking for is best dug with such features, not by fiddling with the index scheme.
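A quick sketch of what that looks like in practice:

fn main() {
    let full = vec![1, 2, 3];
    let empty: Vec<i32> = Vec::new();

    // `last` returns Option<&T>, so the "no last element" case must be
    // handled explicitly instead of panicking on an out-of-bounds index.
    assert_eq!(full.last(), Some(&3));
    assert_eq!(empty.last(), None);

    if let Some(x) = full.last() {
        println!("last element: {x}");
    }
    // By contrast, `empty[empty.len() - 1]` would underflow and panic.
}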
struct FileNode {
    parent: NodeIndex<FileNode>,
    content_header_offset: ByteOffset,
    file_size: ByteCount,
}
Where `parent` can then only be used to index a container of `FileNode` values via the `std::ops::Index` trait. Strong typing of primitives also helps prevent bugs like mixing up parameter ordering, etc.
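A rough sketch of how such a typed index might be wired up; the NodeIndex and FileArena types here are hypothetical illustrations, not taken from any particular crate:

use std::marker::PhantomData;
use std::ops::Index;

// Hypothetical strongly typed index: a NodeIndex<FileNode> can only be
// used to address containers of FileNode values.
struct NodeIndex<T> {
    raw: usize,
    _marker: PhantomData<T>,
}

struct FileNode {
    name: String,
}

// Hypothetical arena of FileNodes, indexable only by NodeIndex<FileNode>.
struct FileArena {
    nodes: Vec<FileNode>,
}

impl Index<NodeIndex<FileNode>> for FileArena {
    type Output = FileNode;

    fn index(&self, idx: NodeIndex<FileNode>) -> &FileNode {
        &self.nodes[idx.raw]
    }
}

fn main() {
    let arena = FileArena {
        nodes: vec![FileNode { name: "root".into() }],
    };
    let root = NodeIndex::<FileNode> { raw: 0, _marker: PhantomData };
    println!("{}", arena[root].name);
    // arena[0usize] would not compile: a plain usize is not a NodeIndex<FileNode>.
}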
There's Systems Hungarian as used in the Windows header files or Apps Hungarian as used in the Apps division at Microsoft. For Apps Hungarian, see the following URL for a reference - https://idleloop.com/hungarian/
For Apps Hungarian, the variable name incorporates the type as well as the intent of the variable - in the Apps Hungarian link above, these are called qualifiers.
So the grandparent example, rewritten in C, would be something like:
struct FileNode {
    struct FileNode *pfnParent;
    DWORD ibHdrContent;
    DWORD cb;
};
For Apps Hungarian, one would know that the ibHdrContent and cb fields share the same base type 'b': ib represents an index/offset in bytes (HdrContent is just descriptive), while cb is a count of bytes. The pfnParent field is a pointer to an fn-type with name Parent. One wouldn't mix an ib with a pfn, since the base types don't match (b != fn), but you could mix ibHdrContent and cb, since the base types match and, presumably in this small struct, they refer to the index/offset and count for the FileNode. You'd have only one cb for the FileNode but possibly one or more ibXXXX-related fields if you needed to keep track of that many indices/offsets.
I once had a coworker like that, whose identifiers often stretched into the 30-50 character range. You really don’t want that.
And the detractors certainly have momentum in certain segments on their side.
Historically, of course, it was languages like Fortran and COBOL and even Smalltalk, but even today we have MATLAB, R, Lua, Mathematica, and Julia.
Big-endian won in network byte order, but lost the CPUs. One-based indexing won in mathematical computing so far, and lost mainstream languages so far, but the Julia folks are trying to change that.
Offset is ordinarily just a difference of two indices. In a container I don't recall seeing it implicitly refer to byte offset.
Everything else looks disgustingly verbose once you get used to them.
The big-endian naming convention (source_index, target_index instead of index_source, index_target) is also interesting. It means related variables sort together lexicographically, which helps with grep and IDE autocomplete. Small thing but it adds up when you're reading unfamiliar code.
One thing I'd add: this convention is especially valuable during code review. When every variable that represents a byte quantity ends in _size and every item count ends in _count, a reviewer can spot dimensional mismatches almost mechanically without having to load the full algorithm into their head.
At that point I'd rather make them separate data types, and have the compiler spot mismatches actually-mechanically o.o
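Something along those lines, with hypothetical newtype wrappers:

// Hypothetical newtype wrappers: byte sizes and element counts become
// distinct types, so mixing them up is a compile-time error rather than
// something a reviewer has to spot.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteSize(usize);

#[derive(Debug, Clone, Copy, PartialEq)]
struct ElemCount(usize);

fn alloc_buffer(bytes: ByteSize) -> Vec<u8> {
    vec![0; bytes.0]
}

fn main() {
    let count = ElemCount(16);
    let elem_size = ByteSize(8);

    let total = ByteSize(count.0 * elem_size.0);
    assert_eq!(alloc_buffer(total).len(), 128);

    // alloc_buffer(count); // does not compile: expected ByteSize, found ElemCount
}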
I would call it “English naming” [0], it’s just more readable to start with, in an anglophone environment.
[0] as opposed to “naming, English”, I suppose ;)
For that matter, many languages, especially "object-oriented" ones, treat heterogeneous containers as the default. They might not even offer native containers that can store everything inline in a single contiguous allocation, except perhaps for strings. In which case, "number of bytes" is itself ambiguous; are you including the indirected objects or not?
"Count" is also overloaded — it commonly means, and I normally only understand it to mean, the number of elements in a collection meeting some condition. Hence the `.count` method of Python sequences, as well as the jargon "population count" referring to the number of set bits in an integer. Today, Python's integers have both a `.bit_count` and a `.bit_length`, and it's obvious what both of them do; calling either `.bit_size` would be confusing in my mental framework, and a contradiction in terms in the OP's.
I would disagree that even C's `strlen` refers to byte size. C comes from a pre-Unicode world; the type is called `char` because that was naively considered sufficient at the time to represent a text character. (Unicode is still in that sense naive, but it at least allows for systems that are acutely aware of the distinction between "characters" and graphemes.) But notice: C's "strings" aren't proper objects; they're null-terminated sequences, i.e. their length is signaled in-band. So that metadata is also just part of the data, in a single allocation with no indirection; the "size" of a string could only reasonably be interpreted to include that null terminator. Yet the result of `strlen` excludes it! Further, if `strlen` is used on a string that was placed within some allocated buffer, it knows nothing about that buffer.
(Similarly, Rust `str::len` is properly named by this scheme. It gives the number of valid 1-byte-sized elements in a collection, not the byte size of the buffer they're stored within. It's still ambiguous in a sense, but that's because of the convention of using UTF-8 to create an abstraction of "character" elements of non-uniform size. This kind of ambiguity is properly resolved either with iterators, like the `Chars` iterator in Rust, or with views.)
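A small illustration of that distinction:

fn main() {
    let s = "héllo";

    // `len` counts the 1-byte-sized elements of the UTF-8 buffer...
    assert_eq!(s.len(), 6);

    // ...while the Chars iterator resolves the non-uniform "character"
    // abstraction, yielding 5 scalar values.
    assert_eq!(s.chars().count(), 5);
}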
Also consider: C has a `sizeof` operator, influencing Python's `.__sizeof__()` methods. That's because the concept of "size" equally makes sense for non-sequences; neither "count" nor "length" does. So of course "length" cannot mean what the author calls "size".