Why does modern Perl avoid UTF-8 by default?

I wonder why most modern solutions built using Perl don’t enable UTF-8 by default.

I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with big ambitions) should be made UTF-8 proof from scratch. Still I don’t see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but with no UTF-8 handling.

Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?


My comment reply to @tchrist got too long, so I’m adding it here.

It seems that I did not make myself clear. Let me try to add some things.

tchrist and I see the situation pretty similarly, but our conclusions are at completely opposite ends. I agree, the situation with Unicode is complicated, but this is exactly why we (Perl users and coders) need some layer (or pragma) that makes UTF-8 handling as easy as it ought to be nowadays.

tchrist pointed to many aspects to cover; I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is no single way “to enable UTF-8”. I don’t have enough knowledge to argue with that. So, I stick to live examples.

I played around with Rakudo and UTF-8 was just there as I needed. I didn’t have any problems; it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.

Shouldn’t that be a goal in modern Perl 5 too? To stress it once more: I am not suggesting UTF-8 as the default character set for core Perl; I am suggesting the possibility to trigger it with a snap for those who develop new projects.

Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because “enabling UTF-8” was so obscure. I could not find how or where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I see there was a bounty here for dealing with the same problem in Mason 2: How to make Mason2 UTF-8 clean?. So, it is a pretty new framework, yet using it with UTF-8 requires deep knowledge of its internals. It is like a big red sign: STOP, don’t use me!

I really like Perl. But dealing with Unicode is painful. I still find myself running into walls. In a way, tchrist is right and answers my question: new projects don’t adopt UTF-8 because it is too complicated in Perl 5.

7 Answers


Popular Answers:

  1. Simplest ℞: 7 Discrete Recommendations

    1. Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode @ARGV as UTF‑8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both these are global effects, not lexical ones.
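    For example, in a Bourne-style shell (the script name here is hypothetical):

      export PERL_UNICODE=AS
      perl -CSA yourscript.pl    # one-shot equivalent via the -C switch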

    2. At the top of your source file (program, module, library, doohickey), prominently assert that you are running perl version 5.12 or better via:

      use v5.12;   # minimal for unicode string feature
      use v5.14;   # optimal for unicode string feature
    3. Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.

      use warnings;
      use warnings qw( FATAL utf8 );
    4. Declare that this source unit is encoded as UTF‑8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:

      use utf8; 
    5. Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTF‑8 unless you tell it otherwise. That way you do not affect other modules’ or other programs’ code.

      use open qw( :encoding(UTF-8) :std ); 
    6. Enable named characters via \N{CHARNAME}.

      use charnames qw( :full :short ); 
    7. If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say:

      binmode(DATA, ":encoding(UTF-8)"); 

    There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal of “making everything just work with UTF‑8”, albeit for a somewhat weakened sense of those terms.

    One other pragma, although it is not Unicode related, is:

     use autodie; 

    It is strongly recommended.


    ⸗ ⸗


    My own boilerplate these days tends to look like this:

    use 5.014;

    use utf8;
    use strict;
    use autodie;
    use warnings;
    use warnings    qw< FATAL utf8 >;
    use open        qw< :std :utf8 >;
    use charnames   qw< :full >;
    use feature     qw< unicode_strings >;

    use File::Basename      qw< basename >;
    use Carp                qw< carp croak confess cluck >;
    use Encode              qw< encode decode >;
    use Unicode::Normalize  qw< NFD NFC >;

    END { close STDOUT }

    if (grep /\P{ASCII}/ => @ARGV) {
        @ARGV = map { decode("UTF-8", $_) } @ARGV;
    }

    $0 = basename($0);  # shorter messages
    $| = 1;

    binmode(DATA, ":utf8");

    # give a full stack dump on any untrapped exceptions
    local $SIG{__DIE__} = sub {
        confess "Uncaught exception: @_" unless $^S;
    };

    # now promote run-time warnings into stack-dumped
    # exceptions *unless* we're in a try block, in
    # which case just cluck the stack dump instead
    local $SIG{__WARN__} = sub {
        if ($^S) { cluck   "Trapped warning: @_" }
        else     { confess "Deadly warning: @_"  }
    };

    while (<>) {
        chomp;
        $_ = NFD($_);
        ...
    } continue {
        say NFC($_);
    }

    __END__


    Saying that “Perl should [somehow!] enable Unicode by default” doesn’t even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; it’s also how those characters all interact in many, many ways.

    Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to “upgrade” to your spiffy new Brave New World modernity.

    It is way way way more complicated than people pretend. I’ve thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you’ll break either your own code or somebody else’s. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.

    Perl goes out of its way to make Unicode easy, far more than anything else I’ve ever used. If you think this is bad, try something else for a while. Then come back to Perl: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make Perl better at these things.



    At a minimum, here are some things that would appear to be required for Perl to “enable Unicode by default”, as you put it:

    1. All source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPT=-Mutf8.

    2. The DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").

    3. Program arguments to scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPT=-CA.

    4. The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.

    5. Any other handles opened by Perl should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D, or with i and o for particular ones of these; export PERL5OPT=-CD would work. That makes -CSAD for all of them.

    6. Cover both bases plus all the streams you open with export PERL5OPT=-Mopen=:utf8,:std. See uniquote.

    7. You don’t want to miss UTF-8 encoding errors. Try export PERL5OPT=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.

    8. Code points between 128 and 255 should be understood by Perl to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPT=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPT=-Mv5.12 or better will also get that.

    9. Named Unicode characters are not enabled by default, so add export PERL5OPT=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.

    10. You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPT=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There’s no I/O layer for these yet that I’m aware of, but see nfc, nfd, nfkd, and nfkc. (A sketch combining this with the next few items appears after this list.)

    11. String comparisons in Perl using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPT=-MUnicode::Collate. You can cache the key for binary comparisons.

    12. Perl built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and the Unicode::LineBreak module for the latter. See uwc and unifmt.

    13. If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function, because Perl’s built-in atoi(3) isn’t currently clever enough.

    14. You are going to have filesystem issues on some filesystems. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.

    15. All your code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.

    16. Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.

    17. Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!

    18. If you are looking for variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn’t thinking about the punctuation variables or package variables.

    19. If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.

    20. If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!

    21. If you don’t know when and whether to call Unicode::Stringprep, then you had better learn.

    22. Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module: Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the Perl built-ins. (See the sketch after this list.)

    23. Sometimes that’s still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, “ae” and “æ” are eq if you don’t use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It’s tough, I tell you. You can play with ucsort to test some of these things out.

    24. Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string “niño”. Its NFD form — which you had darned well better have remembered to put it in — becomes “nin\x{303}o”. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won’t be able to do something like (?=[aeiou])\X either, because even in NFD a code point like ‘ø’ does not decompose! However, it will test equal to an ‘o’ using the UCA comparison I just showed you. You can’t rely on NFD; you have to rely on UCA.
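    To make items 10 through 13 and 22 concrete, here is a minimal sketch of how those modules fit together. The sample strings and variable names are mine, not from the answer; treat it as an illustration under those assumptions rather than production code:

      use utf8;
      use v5.14;
      use Unicode::Normalize qw< NFD NFC >;
      use Unicode::Collate;
      use Unicode::GCString;
      use Unicode::UCD qw< num >;

      # Item 10: normalize at the boundaries: NFD on the way in, NFC on the way out.
      my $incoming = NFD("niño");
      my $outgoing = NFC($incoming);

      # Item 11: sort with the UCA instead of Perl's built-in cmp/sort.
      my @words  = qw( résumé resume rôle role );
      my @sorted = Unicode::Collate->new->sort(@words);

      # ...or cache sort keys when you must compare repeatedly:
      my $collator = Unicode::Collate->new;
      my %key      = map { $_ => $collator->getSortKey($_) } @words;
      @sorted      = sort { $key{$a} cmp $key{$b} } @words;

      # Item 12: measure print columns, not characters or bytes.
      my $columns = Unicode::GCString->new("ＡＢ")->columns;   # 4 columns, 2 characters

      # Item 13: digits from other scripts via Unicode::UCD::num.
      my $n = num("٤٢");   # ARABIC-INDIC digits: 42

      # Item 22: same base letters, ignoring case and diacritics.
      my $level1 = Unicode::Collate->new(level => 1);
      say "equal at level 1" if $level1->eq("niño", "NINO");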



    And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their code will be broken.

    1. Code that assumes it can open a text file without specifying the encoding is broken.

    2. Code that assumes the default encoding is some sort of native platform encoding is broken.

    3. Code that assumes that web pages in Japanese or Chinese take up less space in UTF‑16 than in UTF‑8 is wrong.

    4. Code that assumes Perl uses UTF‑8 internally is wrong.

    5. Code that assumes that encoding errors will always raise an exception is wrong.

    6. Code that assumes Perl code points are limited to 0x10_FFFF is wrong.

    7. Code that assumes you can set $/ to something that will work with any valid line separator is wrong.

    8. Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those. (See the demonstration after this list.)

    9. Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.

    10. Code that assumes changing the case doesn’t change the length of the string is broken.

    11. Code that assumes there are only two cases is broken. There’s also titlecase.

    12. Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.

    13. Code that assumes that case is never locale-dependent is broken.

    14. Code that assumes Unicode gives a fig about POSIX locales is broken.

    15. Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.

    16. Code that assumes that diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.

    17. Code that assumes \p{GC=Dash_Punctuation} covers as much as \p{Dash} is broken.

    18. Code that assumes dashes, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.

    19. Code that assumes every code point takes up no more than one print column is broken.

    20. Code that assumes that all \p{Mark} characters take up zero print columns is broken.

    21. Code that assumes that characters which look alike are alike is broken.

    22. Code that assumes that characters which do not look alike are not alike is broken.

    23. Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.

    24. Code that assumes \X can never start with a \p{Mark} character is wrong.

    25. Code that assumes that \X can never hold two non-\p{Mark} characters is wrong.

    26. Code that assumes that it cannot use "\x{FFFF}" is wrong.

    27. Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn’t: it encodes to a single code point.

    28. Code that transcodes from UTF-16 or UTF-32 with leading BOMs into UTF-8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.

    29. Code that assumes CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.

    30. Code that assumes characters like > always point to the right and < always point to the left is wrong, because in fact they do not.

    31. Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.

    32. Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong. Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot. (The rest will be duct taped.)

    33. Code that assumes that all \p{Math} code points are visible characters is wrong.

    34. Code that assumes \w contains only letters, digits, and underscores is wrong.

    35. Code that assumes that ^ and ~ are punctuation marks is wrong.

    36. Code that assumes that ü has an umlaut is wrong.

    37. Code that believes things like ₨ contain any letters in them is wrong.

    38. Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.

    39. Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.

    40. Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.

    41. Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.

    42. Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again. Period. I’m not even positive they should even be allowed to see again, since it obviously hasn’t done them much good so far.

    43. Code that believes there’s some way to pretend textfile encodings don’t exist is broken and dangerous. Might as well poke the other eye out, too.

    44. Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.

    45. Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.

    46. Code that believes you can use printf widths to pad and justify Unicode data is broken and wrong.

    47. Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!

    48. Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.

    49. Code that treats code points from one plane one whit differently than those from any other plane is ipso facto broken and wrong. Go back to school.

    50. Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. You’d be surprised.

    51. Code that uses \PM\pM* to find grapheme clusters instead of using \X is broken and wrong.

    52. People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs telegraph at 40 characters per line and hand-delivered by a courier. STOP.
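    A couple of these assumptions are easy to watch break in a few lines. This sketch (my examples, not the answer’s) demonstrates item 8 on casefolding and item 23 on grapheme clusters:

      use utf8;
      use v5.14;

      # Item 8: casefolding does not round-trip.
      say uc("σ");   # Σ
      say uc("ς");   # Σ as well: two lowercase sigmas, one uppercase
      say lc(uc("ς")) eq "ς" ? "round-trips" : "does not round-trip";

      # Item 23: one \X matches a whole grapheme cluster,
      # however many code points it spans.
      my $grapheme = "o\x{303}\x{304}";        # 'o' + combining tilde + macron
      my @clusters = $grapheme =~ /(\X)/g;
      say scalar @clusters;                    # 1 cluster, 3 code points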



    I don’t know how much more “default Unicode in Perl” you can get than what I’ve written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.

    As you see, there are far too many Unicode things that you really do have to worry about for there to ever exist any such thing as “default to Unicode”.

    What you’re going to discover, just as we did back in 5.8, is that it is simply impossible to impose all these things on code that hasn’t been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.

    And even once you do, there are still critical issues that require a great deal of thought to get right. There is no switch you can flip. Nothing but brain, and I mean real brain, will suffice here. There’s a heck of a lot of stuff you have to learn. Modulo the retreat to the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21ˢᵗ century, and you cannot wish Unicode away by willful ignorance.

    You have to learn it. Period. It will never be so easy that “everything just works,” because that will guarantee that a lot of things don’t work — which invalidates the assumption that there can ever be a way to “make it all work.”

    You may be able to get a few reasonable defaults for a very few and very limited operations, but not without thinking about things a whole lot more than I think you have.

    As just one example, canonical ordering is going to cause some real headaches. "\x{F5}" ‘õ’, "o\x{303}" ‘õ’, "o\x{303}\x{304}" ‘ȭ’, and "o\x{304}\x{303}" ‘ō̃’ should all match ‘õ’, but how in the world are you going to do that? This is harder than it looks, but it’s something you need to account for.
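    One way to attack that, sketched under the assumption that the strength-1 UCA equality described above is acceptable for the job:

      use utf8;
      use v5.14;
      use Unicode::Collate;

      my $uca = Unicode::Collate->new(level => 1);
      for my $candidate ("\x{F5}", "o\x{303}", "o\x{303}\x{304}", "o\x{304}\x{303}") {
          say $uca->eq($candidate, "\x{F5}") ? "matches õ" : "no match";
      }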

    If there’s one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: “ ̲ᴛ̲ʜ̲ᴇ̲ʀ̲ᴇ̲ ̲ɪ̲s̲ ̲ɴ̲ᴏ̲ ̲U̲ɴ̲ɪ̲ᴄ̲ᴏ̲ᴅ̲ᴇ̲ ̲ᴍ̲ᴀ̲ɢ̲ɪ̲ᴄ̲ ̲ʙ̲ᴜ̲ʟ̲ʟ̲ᴇ̲ᴛ̲ ̲ ”

    You cannot just change some defaults and get smooth sailing. It’s true that I run with PERL_UNICODE set to "SA", but that’s all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.


    ¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁

  2. We’re all in agreement that it is a difficult problem for many reasons, but that’s precisely the reason to try to make it easier on everybody.

    There is a recent module on CPAN, utf8::all, that attempts to “turn on Unicode. All of it”.
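    If it does what it says on the tin, enabling it is a single line:

      use utf8::all;   # one pragma intended to switch everything over to UTF-8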

    As has been pointed out, you can’t magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make solving common problems easier. That’s the reason that we’re programmers.

    If utf8::all doesn’t do something you think it should, let’s improve it to make it better. Or let’s make additional tools that together can suit people’s varying needs as well as possible.


  3. While reading this thread, I often get the impression that people are using “UTF-8” as a synonym for “Unicode”. Please make a distinction between Unicode’s “code points”, which are an enlarged relative of the ASCII codes, and Unicode’s various “encodings”. And there are a few of them, of which UTF-8, UTF-16, and UTF-32 are the current ones, and a few more are obsolete.

    Please: UTF-8 (as well as every other encoding) exists and has meaning on input or output only. Internally, since Perl 5.8.1, all strings are kept as Unicode code points. True, you have to enable some features, as admirably covered previously.
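    In other words: decode octets into code points at the input boundary, and encode code points back into octets at the output boundary. A minimal sketch with the core Encode module (the sample data is mine):

      use v5.14;
      use Encode qw< decode encode >;

      # Pretend these octets arrived from a file, socket, or CGI parameter:
      my $octets = "ni\xC3\xB1o";               # the UTF-8 bytes of "niño"
      my $string = decode("UTF-8", $octets);    # octets -> code points
      say length $string;                       # 4 code points, not 5 bytes
      my $back   = encode("UTF-8", $string);    # code points -> octets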

  4. There’s a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I’ve found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that’s not 7-bit ASCII due to transcoding issues).

  5. You should enable the unicode_strings feature; it is enabled by default under use v5.14;

    You should not really use Unicode identifiers, especially for foreign code loaded via utf8, as they are insecure in perl5; only cperl got that right. See e.g. http://perl11.org/blog/unicode-identifiers.html

    Regarding UTF-8 for your filehandles/streams: you need to decide for yourself the encoding of your external data. A library cannot know that, and since not even libc supports UTF-8, proper UTF-8 data is rare. There’s also WTF-8, the Windows aberration of UTF-8, floating around.
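    For example, a sketch of that “decide for yourself” approach, declaring the encoding explicitly on each handle (the file names are hypothetical):

      use autodie;   # open failures now die on their own

      open my $in,  "<:encoding(UTF-8)", "input.txt";    # you decided: input is UTF-8
      open my $out, ">:encoding(UTF-8)", "output.txt";   # and so is the output
      print {$out} $_ while <$in>;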

    BTW: Moose is not really “Modern Perl”; they just hijacked the name. Moose is perfect Larry Wall-style postmodern Perl mixed with Bjarne Stroustrup-style anything-goes, with an eclectic aberration of proper perl6 syntax, e.g. using strings for variable names, horrible fields syntax, and a very immature, naive implementation that is 10x slower than a proper one. cperl and perl6 are the true modern perls, where form follows function and the implementation is reduced and optimized.


Tags: perl, unicode
