For many years (decades?) now, I've been using "index" for 0-based and "number" for 1-based, as in "column index" for a C/Python style [ix] vs. "column number" for a shell/awk/etc. style $1 $2. Not sure this is the best terminology, but it is nice to have something consistent. E.g., "offset" for 0-based indices suggests how far you are "off" from the start, and even its initial letter "o" can be read as the zero of the range. So, "offset" might be better than "index" for 0-based.
If it helps anyone explain the SkiFire point any better, I like to analogize it to an I-bar cursor vs. a block cursor for text entry. An I-bar is unambiguously "between characters" while a block cursor is not. So, there are questions that arise for block cursors that basically never arise for I-bar cursors. When just looking at an integer like 2 or 3, there is no cursor at all. So, we must instead rely on names/conventions/assumptions with their attendant issues.
To be clear, I liked the SkiFire explanation, but having multiple ways to describe/think about a problem is usually helpful.
In most modern languages, the ordinal numbers start at 2. In most old languages, and also in English, the ordinal numbers start at 3.
The reason for this is that ordinal numbers were created only recently, a few thousand years ago.
Before that time, there were special words only for certain positions of a sequence, i.e. for the first and for the last element and sometimes also for a few elements adjacent to those.
In English, "first", "second" and "last", are not ordinal numbers, but they are used for the same purpose as ordinal numbers, though more accurately is to say that the ordinal numbers are used for the same purpose with these words, as the ordinal numbers were added later.
The ancient Indo-European languages had a special word for the other element of a pair, i.e. the one that is not the first element. This word was used for what is now named "second". In late Latin, the original word that meant "the other of a pair" was replaced with a word meaning "the following", which eventually passed into English through French in the form of "second".
In much the same way as physicists co-opted common words (e.g. "work" and "energy") to mean very specific things in technical contexts, both linguists and mathematicians gave "ordinal" a specific meaning in their respective domains. These meanings are similar but different, and your nitpick mistakenly asserts that one of them has priority over the other.
"Ordinal" in linguistics is a word for a class of words. The words being classified may be old, but the use of "ordinal" to denote them is a comparatively modern coinage, roughly contemporary with the mathematicians usage. Both come from non-technical language describing putting things in an "orderly" row (c.f. cognates such as "public order", "court order", etc.) which did not carry the load you are trying to place on them.
It was originally proposed as lengthof, but the results of the public poll and the ambiguity convinced the committee to choose countof, instead.
`countof` removes the verb possibility - but that means that a preference for `countof` over `lengthof` isn't necessarily a preference for `count` over `length`.
Countof is strange, because one doesn’t talk about the “count of something” in English, other than uses like “on the count of three” (or the “count of Monte Cristo” ;)).
Just lining things up neatly helps spot bugs.
It’s the one thing I don’t like about strict formatters: I can no longer use spaces to line things up.
That said, I'm not sure how 1-based indexing will solve off-by-1 errors. They naturally come from the fencepost problem, i.e. the fact that sometimes we use indexes to indicate elements and sometimes to indicate the boundaries between them. Mixing the two in our reasoning ultimately results in off-by-1 issues.
With half-open intervals, the count of elements is the difference between the interval bounds, adjacent intervals share 1 bound and merging 2 adjacent intervals preserves the extreme bounds.
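A minimal Rust sketch of those three properties (the array contents here are arbitrary placeholders):

fn main() {
    let data = [10, 20, 30, 40, 50];

    // Half-open interval [1, 4): the element count is the difference
    // between the bounds, 4 - 1 = 3.
    let left = &data[1..4];
    assert_eq!(left.len(), 4 - 1);

    // The adjacent interval [4, 5) shares the bound 4 with [1, 4),
    // and merging the two preserves the extreme bounds: [1, 5).
    let right = &data[4..5];
    let merged = &data[1..5];
    assert_eq!(left.len() + right.len(), merged.len());
}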
Any programming problem is simplified when 0-based indexing and half-open intervals are always used, without exception.
The fact that most programmers have been taught when young to use 1-based ordinal numbers and closed intervals is a mental handicap, but normally it is easy to get rid of it, just as it is easy to get rid of the mental handicap of having learned to use decimal numbers when there is no reason to ever use them instead of binary numbers.
My opinion is that 1-based indexing really exacerbates off-by-one errors, besides requiring a more complex, more bug-prone implementation in compilers. With 1-based addressing, the compiler must create and use, transparently to the programmer, pointers that do not point to the intended object but to an invalid location just before it, and which must never be dereferenced. This is why 1-based addressing was easier in languages without pointers, like the original FORTRAN, but it would have been more difficult in languages that allow pointers, like C, the difficulty being in avoiding exposing the internal representation of pointers to the programmer.
Off-by-one errors are caused by mixing conventions for expressing indices and ranges.
If you always use a consistent convention, e.g. 0-based indexing together with half-open intervals, where the count of elements equals the difference between the interval bounds, there is no chance of ever making off-by-one errors.
To use the terminology from the article, with 0-based indexing, offset = index * node_size. If it were 1-based, you would have offset = (index - 1) * node_size + 1.
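A small illustration of the two formulas, assuming a hypothetical fixed node size and, for the 1-based case, that byte offsets are also counted from 1 (which is what the +1 above expresses):

const NODE_SIZE: usize = 64; // hypothetical node size in bytes

// 0-based indices with 0-based byte offsets: node 0 starts at byte 0.
fn offset_zero_based(index: usize) -> usize {
    index * NODE_SIZE
}

// 1-based indices with 1-based byte offsets: node 1 starts at byte 1,
// so the index must be re-biased before the multiplication.
fn offset_one_based(index: usize) -> usize {
    (index - 1) * NODE_SIZE + 1
}

fn main() {
    assert_eq!(offset_zero_based(0), 0);
    assert_eq!(offset_zero_based(3), 3 * NODE_SIZE);
    assert_eq!(offset_one_based(1), 1);
    assert_eq!(offset_one_based(4), 3 * NODE_SIZE + 1);
}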
And it became a convention even for high-level languages, because no matter what you prefer, inconsistency is even worse. An interesting case is Perl, which, in classic Perl fashion, lets you choose by setting the $[ variable. Most people, even Perl programmers, consider it a terrible feature, and 0-based indexing is used by default.
We can't choose to switch to 1-based indexing - either we use 0-based everywhere, or a mixture of 0-based and 1-based. Given the prevalence of off-by-one errors, I think the most important thing is to be consistent.
"Is there any reason to not just switch to 0-based indexing if we could? Seems like 1-based indexing really exacerbates off-by-one errors without much benefit"
The problem is that humans make off-by-one errors and not that we're using the wrong indexing system.
Programming languages are filled with tiny design choices that don’t completely prevent mistakes (that would be impossible) but do make them less likely.
There are better programming languages, where you do not need to do what you say.
Some languages, like Ada, have special array attributes for accessing the first and the last elements.
Other languages, like Icon, allow the use of both non-negative indices and of negative indices, where non-negative indices access the array from its first element towards its last element, while negative indices access the array from its last element towards its first element.
I consider that your solution, i.e. using array[length] instead of array[length-1], is much worse. While it scores a point for simplifying this particular expression, it loses points by making other expressions more complex.
There are a lot of better programming languages than the few that due to historical accidents happen to be popular today.
It is sad that the designers of most of the languages that attempt today to replace C and C++ have not done due diligence by studying the history of programming languages before designing a new one. Had they done that, they could have avoided repeating the mistakes of the languages with which they want to compete.
You should prefer a language, like Rust, in which [T]::last is Option<&T> -- that is, we can ask for a reference to the last item, but there might not be one and so we're encouraged to do something about that.
IMNSHO, the pit of success you're looking for is best dug with such features, not by fiddling with the index scheme.
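A quick sketch of what that looks like in practice:

fn main() {
    let full = vec![1, 2, 3];
    let empty: Vec<i32> = Vec::new();

    // `last` returns Option<&T>, so the "no last element" case must be
    // handled explicitly instead of panicking on an out-of-bounds index.
    assert_eq!(full.last(), Some(&3));
    assert_eq!(empty.last(), None);

    if let Some(x) = full.last() {
        println!("last element: {x}");
    }
    // By contrast, `empty[empty.len() - 1]` would underflow and panic.
}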
struct FileNode {
    parent: NodeIndex<FileNode>,
    content_header_offset: ByteOffset,
    file_size: ByteCount,
}
Where `parent` can then only be used to index a container of `FileNode` values via the `std::ops::Index` trait. Strong typing of primitives also helps prevent bugs like mixing up parameter ordering, etc.
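A rough sketch of how such a typed index might be wired up; the NodeIndex and FileArena types here are hypothetical illustrations, not taken from any particular crate:

use std::marker::PhantomData;
use std::ops::Index;

// Hypothetical strongly typed index: a NodeIndex<FileNode> can only be
// used to address containers of FileNode values.
struct NodeIndex<T> {
    raw: usize,
    _marker: PhantomData<T>,
}

struct FileNode {
    name: String,
}

// Hypothetical arena of FileNodes, indexable only by NodeIndex<FileNode>.
struct FileArena {
    nodes: Vec<FileNode>,
}

impl Index<NodeIndex<FileNode>> for FileArena {
    type Output = FileNode;

    fn index(&self, idx: NodeIndex<FileNode>) -> &FileNode {
        &self.nodes[idx.raw]
    }
}

fn main() {
    let arena = FileArena {
        nodes: vec![FileNode { name: "root".into() }],
    };
    let root = NodeIndex::<FileNode> { raw: 0, _marker: PhantomData };
    println!("{}", arena[root].name);
    // arena[0usize] would not compile: a plain usize is not a NodeIndex<FileNode>.
}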
There's Systems Hungarian as used in the Windows header files or Apps Hungarian as used in the Apps division at Microsoft. For Apps Hungarian, see the following URL for a reference - https://idleloop.com/hungarian/
For Apps Hungarian, the variable name incorporates the type as well as the intent of the variable - in the Apps Hungarian link above, these are called qualifiers.
So the grandparent example, rewritten in C, would be something like:
struct FileNode {
    struct FileNode *pfnParent;
    DWORD ibHdrContent;
    DWORD cb;
};
For Apps Hungarian, one would know that the ibHdrContent and cb fields share the same base type 'b': ib represents an index/offset in bytes (HdrContent is just descriptive), while cb is a count of bytes. The pfnParent field is a pointer to an fn-type with name Parent. One wouldn't mix an ib with a pfn, since the base types don't match (b != fn), but you could mix ibHdrContent and cb, since the base types match and, presumably in this small struct, they refer to the index/offset and count for the FileNode. You'd have only one cb for the FileNode but possibly one or more ibXXXX-related fields if you needed to keep track of that many indices/offsets.
I once had a coworker like that, whose identifiers often stretched into the 30-50 character range. You really don’t want that.
And the detractors certainly have momentum in certain segments on their side.
Historically, of course, it was languages like Fortran and COBOL and even Smalltalk, but even today we have MATLAB, R, Lua, Mathematica, and Julia.
Big-endian won in network byte order, but lost the CPUs. One-based indexing won in mathematical computing so far, and lost mainstream languages so far, but the Julia folks are trying to change that.
Offset is ordinarily just a difference of two indices. In a container I don't recall seeing it implicitly refer to byte offset.
Everything else looks disgustingly verbose once you get used to them.
The big-endian naming convention (source_index, target_index instead of index_source, index_target) is also interesting. It means related variables sort together lexicographically, which helps with grep and IDE autocomplete. Small thing but it adds up when you're reading unfamiliar code.
One thing I'd add: this convention is especially valuable during code review. When every variable that represents a byte quantity ends in _size and every item count ends in _count, a reviewer can spot dimensional mismatches almost mechanically without having to load the full algorithm into their head.
At that point I'd rather make them separate data types, and have the compiler spot mismatches actually-mechanically o.o
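Something along those lines, with hypothetical newtype wrappers:

// Hypothetical newtype wrappers: byte sizes and element counts become
// distinct types, so mixing them up is a compile-time error rather than
// something a reviewer has to spot.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteSize(usize);

#[derive(Debug, Clone, Copy, PartialEq)]
struct ElemCount(usize);

fn alloc_buffer(bytes: ByteSize) -> Vec<u8> {
    vec![0; bytes.0]
}

fn main() {
    let count = ElemCount(16);
    let elem_size = ByteSize(8);

    let total = ByteSize(count.0 * elem_size.0);
    assert_eq!(alloc_buffer(total).len(), 128);

    // alloc_buffer(count); // does not compile: expected ByteSize, found ElemCount
}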
I would call it “English naming” [0], it’s just more readable to start with, in an anglophone environment.
[0] as opposed to “naming, English”, I suppose ;)
For that matter, many languages, especially "object-oriented" ones, treat heterogeneous containers as the default. They might not even offer native containers that can store everything inline in a single contiguous allocation, except perhaps for strings. In which case, "number of bytes" is itself ambiguous; are you including the indirected objects or not?
"Count" is also overloaded — it commonly means, and I normally only understand it to mean, the number of elements in a collection meeting some condition. Hence the `.count` method of Python sequences, as well as the jargon "population count" referring to the number of set bits in an integer. Today, Python's integers have both a `.bit_count` and a `.bit_length`, and it's obvious what both of them do; calling either `.bit_size` would be confusing in my mental framework, and a contradiction in terms in the OP's.
I would disagree that even C's `strlen` refers to byte size. C comes from a pre-Unicode world; the type is called `char` because that was naively considered sufficient at the time to represent a text character. (Unicode is still in that sense naive, but it at least allows for systems that are acutely aware of the distinction between "characters" and graphemes.) But notice: C's "strings" aren't proper objects; they're null-terminated sequences, i.e. their length is signaled in-band. So that metadata is also just part of the data, in a single allocation with no indirection; the "size" of a string could only reasonably be interpreted to include that null terminator. Yet the result of `strlen` excludes it! Further, if `strlen` is used on a string that was placed within some allocated buffer, it knows nothing about that buffer.
(Similarly, Rust `str::len` is properly named by this scheme. It gives the number of valid 1-byte-sized elements in a collection, not the byte size of the buffer they're stored within. It's still ambiguous in a sense, but that's because of the convention of using UTF-8 to create an abstraction of "character" elements of non-uniform size. This kind of ambiguity is properly resolved either with iterators, like the `Chars` iterator in Rust, or with views.)
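A small illustration of that distinction:

fn main() {
    let s = "héllo";

    // `len` counts the 1-byte-sized elements of the UTF-8 buffer...
    assert_eq!(s.len(), 6);

    // ...while the Chars iterator resolves the non-uniform "character"
    // abstraction, yielding 5 scalar values.
    assert_eq!(s.chars().count(), 5);
}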
Also consider: C has a `sizeof` operator, influencing Python's `.__sizeof__()` methods. That's because the concept of "size" equally makes sense for non-sequences; neither "count" nor "length" does. So of course "length" cannot mean what the author calls "size".