Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make tinyformat aware of locales. #44

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

lgritz
Copy link
Contributor

@lgritz lgritz commented Oct 11, 2017

Without all this, you will get situations where format("%g",123.45)
might return "123,45" if it happens to be running on a machine with
locale "fr_FR", for example. If that is written to an output file that
is later read on a computer using a locale with '.' as the decimal mark,
it could mis-parse as "123.0", leading to hilarious and tragic results.

It seems like the safe thing is to force the string-returning varieties
of tinyformat to use classic locale ('.' decimal), with an optional
means for the odd users to purposely specify a different (or native)
locale. It also seems that the best policy for the stream-appending
varieties of format() is to use the local already associated with those
streams (C++ lets each stream have its own locale).

  1. format() to an existing stream (including cout) is now documented to
    honor the stream's existing locale (as it always did).

  2. format() that returns a string now will always format the string using
    the classic "C" locale, in particular meaning that floating point
    numbers will use '.' as the decimal mark, regardless of the "native"
    locale set by the OS or process that may have an unexpected choice of
    decimal point.

  3. Add a new variety of format() that returns a string, but which takes
    an explicit locale, for cases where you don't want the "C" locale
    (for example, when generating strings that will appear in a UI that
    you want localized properly). You can make this variety use the
    current locale by passing std::locale() as the first argument (the
    default constructor of std::locale just grabs the global process
    locale).

tinyformat.h Outdated
template<typename... Args>
std::string format(const char* fmt, const Args&... args)
{
std::ostringstream oss;
oss.imbue (std::locale::classic()); // force "C" locale with '.' decimal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little uneasy about interfering here with the default behaviour of using the global locale. But it's certainly more convenient than having to pass the desired locale as a parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative design is for the unadorned format() to use the native locale (as before) and have users who want the C locale use the new variety and explicitly pass std::locale::classic(). But that seems error-prone to me, since most users would probably prefer to always have '.' and (like me) be shocked that this wasn't already the case.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tough one as there's two quite valid but competing use cases.

  • For machine readable output, consistency is super important. Using the classic() locale is the right thing for this. It's also generally more consistent between environments which is great.
  • When generating content for users it's arguably be more correct - and consistent with the other format functions - to use the current global locale for format().

One way out of the conundrum might just be to add a new cformat function ("format in C locale"), and leave format as it is for consistency with the other flavours of format. Obviously a better name than cformat would be good...

@lgritz
Copy link
Contributor Author

lgritz commented Oct 12, 2017

I updated it slightly with a couple spots where I realized I needed to transfer locale settings from one stream to another.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 12, 2017

By the way, it's perfectly acceptable for you to reject this and take the stance that tinyformat doesn't want this complication or any possible performance penalty, and so is just always going to use the global native locale, so be it, and that apps who are concerned about that ought to just set the locale globally. But that makes it very difficult for people writing libraries (who want to use tinyformat internal to the library) who don't think that the library internals have any business either assuming or changing anything about the global state of any apps that use them.

Copy link
Owner

@c42f c42f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Larry, thanks for this. I'm definitely open to making use with locales possible/sensible. I'm not sure on the best design yet, but this looks pretty close. The versions which take an input stream and use the existing stream locale seems clearly correct. The version takes an explicit locale is also obviously right.

Unfortunately I've never had cause to use a locale in production code, so if any of the watchers of this repo have experience and would like to chime in that would be useful.

tinyformat.h Outdated
template<typename... Args>
std::string format(const char* fmt, const Args&... args)
{
std::ostringstream oss;
oss.imbue (std::locale::classic()); // force "C" locale with '.' decimal
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tough one as there's two quite valid but competing use cases.

  • For machine readable output, consistency is super important. Using the classic() locale is the right thing for this. It's also generally more consistent between environments which is great.
  • When generating content for users it's arguably be more correct - and consistent with the other format functions - to use the current global locale for format().

One way out of the conundrum might just be to add a new cformat function ("format in C locale"), and leave format as it is for consistency with the other flavours of format. Obviously a better name than cformat would be good...

@lgritz
Copy link
Contributor Author

lgritz commented Oct 12, 2017

I haven't used locale in production code before today, either! Meaning, I always naively assumed tinyformat would print the same for everybody as it did for me, "123.45". And hence the problem -- a Swedish OSL user had his system locale set to "sv_SE.UTF-8" and it was introducing lots of subtle problems related to numeric literals in his OSL source code. (His immediate bug was related to parsing, but the tinyformat side would have messed up writing the .oso file and many other things.)

I don't have strong feelings about the naming conventions. If you want format() to always obey the global locale, and have a separate cformat() (format_c? something else?) to be hard-coded for classic, that's ok by me. I don't directly expose tinyformat, it's all wrapped by my functions to it won't directly affect my users. This patch is the way I did it in the tinyformat that's embedded in oiio, but whatever you decide for the baseline, I will re-sync with you and adhere to your convention for the tinyformat calls themselves.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 12, 2017

BTW, I agree that for the string case, there are potentially two use cases: force C locale, or use the native/global one. I really struggled with this issue myself, but in the end decided (for my wrapper functions) that most people are ignorant about locales and expect the "C" convention and will be surprised to get anything different if they inadvertently use the default calls. The only people who want locale-dependent formatting are the ones intentionally using internationalization, and they can be trusted with the burden of using extra arguments and specially-named functions. That was my rationale, anyway.

@c42f
Copy link
Owner

c42f commented Oct 12, 2017

most people are ignorant about locales and expect the "C" convention and will be surprised to get anything different if they inadvertently use the default calls

I do buy this reasoning, it seems completely sensible. The thing that makes me pause is that naive use of the stream versions of format should also behave the same as the string version. Eg, it's pretty common to open a file stream and use tfm::format to output directly to it.

I suppose we could change all versions of format to use the C locale, and just provide a format(locale, stream, ...) overload for people who want to do otherwise. (Or an lformat() to use the already imbued stream locale, which might be more efficient. I'm not sure what efficiency concerns there are here.)

@lgritz
Copy link
Contributor Author

lgritz commented Oct 12, 2017

I'd be for the "all existing versions use C locale, new funcs for explicit", I think that's consistent and probably in line with what 99% of users think is really happening (though they are wrong).

But then I'm not 100% sure how to handle the logistics of the "C" versions that write to existing streams. Does each of those functions, then, need to store the stream's original locale, change it to C, output, then restore? For that matter, is ostream::imbue() thread safe? And will it cause even more chaos if a stream that is everywhere else "localized" suddenly has individual writes (that used tinyformat) adhering to different conventions?

It all hurts my head. I'm fine with us all sleeping on this for a few days to let it percolate, there is not really any urgency to decide.

The good news is that MOST technical users will never see a change one way or the other, since they already have their global locale set to the "C" equivalent -- so none of these choices is likely to change things for anybody who isn't already living on one of those edge cases that's currently broken in some way.

@c42f
Copy link
Owner

c42f commented Oct 12, 2017

Yes - that's all I was thinking - all stream based format versions would need to save and restore the original locale. It's certainly pretty ugly internally, though maybe not so different from how I deal with format flags which have the same kind of state change and restore performed on the stream.

Which thread safety issues are you particularly thinking about? The locale itself would be const (I assume), and the use of the stream/streambuf would already be non-threadsafe as per usual. So at a guess it seems fine to me. Could easily be wrong though!

I'm certainly happy to let this percolate for a few days before a final decision.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 12, 2017

I didn't have any thread safety issues that I knew about for sure... just musing about unknowns, but I guess you're right, we're already accepting that tinyformat is changing (and hopefully restoring) persistent state about the stream.

I'm coming around to seeing the wisdom of declaring that all the "default" functions should be consistent with each other and use C locale -- whether making a string or outputting to a stream -- and have an additional version that takes an explicit locale if you want anything else (including passing stream.getloc() to cause the output to conform to the stream's previous locale). That has the nice property of ensuring that

format (out, "%g", floatvar);

and

out << format("%g", floatvar);

are 100% identical, regardless of any prior state of the stream. If you want something else, we'll have a

format (out, loc, "%g", floatvar);

???

@nigels-com
Copy link
Contributor

As a contrarian sort of argument, I think that tinyformat should continue defaulting to the global or stream-specified locale. That's the normal and well established behaviour, rightly or wrongly.

So I see the problem more in terms of a convenient way to use "C" locale for string formatting specifically. Which brings me back to the cformat previously mentioned.

I'd be inclined to explicitly set the global locale for all CLI tools via std::locale::global. This applies to printf, std::ostream as well as tfm::format.

That leaves a more complicated reality for GUI applications that likely depend on not having the global locale forced to "C" classic. In light of that -- mass-migrating code from cformat seems justifiable, and at least more explicit about the contract.

@nigels-com
Copy link
Contributor

Unfortunately I've never had cause to use a locale in production code, so if any of the watchers of this repo have experience and would like to chime in that would be useful.

So let's say you had a library that was concerned with generating valid GLSL source code, being linked into a pointy-clicky internationalised GUI application that happened to have some widgets using GLSL for their shaders. It's not desirable for the GLSL formatting of numbers to be locale-dependent. But the UI portions of the application are required to be localised into Japanese, or whatever.

In this situation the solution was to actually eliminate string formatting (printf and std::iostreams) from the GLSL translating library, along with the locale dependency. Removing the format string was considered a security enhancement - at the time it was also being used as part of the web browser Java stack. No more printf, no more varargs, no more worries. (pre C++11)

@lgritz
Copy link
Contributor Author

lgritz commented Oct 14, 2017

So what's the verdict, @c42f ? I think our choices are:

  1. Abandon this PR, you don't want to deal with it. PRO: easy for you. CON: leaves the problems.

  2. Accept the PR as-is --- string formats "C" locale, stream output honors the stream's locale, both have versions that take an explicit locale parameter. PRO: work is done. CON: still admittedly a little weird.

  3. Default varieties of format() always output C locale (regardless of global or stream settings), but also have versions that take an explicit locale parameter. PRO: probably what most users want, hard to screw up for persistent data. CON: Inconsistent with printf and stream I/O.

  4. format() to stream always honors stream locale, format() to string always honors global locale, and a new cformat() is added that always uses C locale. In this case, maybe none need to take an explicit locale parameter? (Would you ever want a locale that is neither the native global one, nor "C"?) PRO: consistent with printf/sprintf. CON: most users probably want C locale always, so why make them jump through hoops, make the minority who are formatting for localized UI be the ones to do extra work and use special functions?

Chris Kulla @fpsunflower has been on vacation all week, but I'd also like to get his opinion before we make any final decisions. He's usually good at spotting any sloppiness in the decisions I'm making.

@nigels-com
Copy link
Contributor

@lgritz Out of curiosity, are you primarily in a CLI context, or a GUI context?

@lgritz
Copy link
Contributor Author

lgritz commented Oct 15, 2017

Both!

My use of tinyformat is in two software packages, and in both cases their heart is a library that's used from inside other applications (used by wide range of commercial, proprietary, and other open source projects), but both packages also contain CLI tools that are extensively used in their own right.

Setting "C" locale is trivial for the CLI tools, but does nothing for the library internals.

Interestingly, one of my projects where I use tinyformat is a programming language, so in that case, it's really crucial that the parsing and generating of output (text representation of floating point numbers, in particular) be canonical rather than locale-dependent. But the language JITs from the library, inside other apps that have extensive UI that may be localized, so I can't count on the global locale being "C".

@c42f
Copy link
Owner

c42f commented Oct 17, 2017

Right thanks for the ideas, I think the design tradeoffs are becoming clearer. To me it seems that:

  1. Some programs require formatting in both the C locale and some region specific locale. There should be a clean way to distinguish between these.
  2. Using the default locale is fraught with danger for any program which produces machine readable output. However, it's easy to forget about this, leading to bugs.
  3. Bugs in producing machine readable output are arguably more severe/unrecoverable than internationalization bugs.
  4. The only time you're actually thinking about locales is when you're doing internationalization. In this case, you probably want to set the global locale sensibly for the program as a whole, rather than passing locale objects around. Also, since you're actually thinking about it, you'll hopefully have the presence of mind to use the correct localization-aware formatting function.

With (1) in mind, I think it would be most useful to have two sets of functions

  • One set which uses the stream state / std::locale::global()
  • Another which always overrides the locale with std::local::classic()

The difficult decision is then which of these should be called format(). (3) and (4) argue that format should use the classic locale, with another function using the locale from the stream state. Some naming ideas - lformat() formatl() format_loc() formatloc() for "localized format" or "locale format" or "format with localization"?

@lgritz
Copy link
Contributor Author

lgritz commented Oct 17, 2017

I'm currently of the opinion that the unadorned format() should always use "C" locale. It seems nearly foolproof, your points 2-3 are very compelling.

I agree that the main use cases are "always C" and "use the global" and both of those should be simple, without having to pass an explicit locale. But it's so easy to also have the version that takes an explicit locale, I'd be inclined to keep that flexibility available in case somebody needs it. It does not compromise the design or simplicity of either of the other two varieties and is truly just a few lines of code.

Of your choices, I like format_loc() best, but I don't feel very strongly about it (I wrap your functions in other names anyway). It's a little clunky, but perhaps in a good way -- we want people to be very deliberate about when they use it.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 17, 2017

@c42f
Copy link
Owner

c42f commented Oct 17, 2017

Wow, that world map of decimal separator use is a bit of a horror show. I didn't realize the countries were split close to half and half on this decision. I suppose the cost of certain internationalization bugs might be quite severe - for an obvious example, the difference between $100,000 and $100.000 - at least in Australia the , is still commonly used as a thousands separator. Clearly everyone should just give up and adopt the Arabic decimal separator ;-)

But it's so easy to also have the version that takes an explicit locale ...

True, though I'm kind of more inclined to add this later if someone actually needs it. Personally I think I'd be much more likely to use the global locale if I wanted locale specific formatting.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 17, 2017

At this time, I have no objection to just having the two common options (C and global), and we can add the other if we want it later.

@c42f
Copy link
Owner

c42f commented Oct 19, 2017

Uh oh. I've got quite bad news about performance. Before we go further I hacked up the speed tests - just on master - so that we save and restore the locale in printf. Before the changes:

$ make speed_test
g++ -Wall -Werror -std=c++11 -DTINYFORMAT_USE_VARIADIC_TEMPLATES -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test
running speed tests...
printf timings:
real 1.23
user 1.22
sys 0.00
iostreams timings:
real 1.75
user 1.75
sys 0.00
tinyformat timings:
real 2.10
user 2.10
sys 0.00
boost timings:
real 8.77
user 8.76
sys 0.00

Applying the following patch

$ git diff
diff --git a/Makefile b/Makefile
index 6cef5f5..09417ea 100644
--- a/Makefile
+++ b/Makefile
@@ -38,7 +38,7 @@ tinyformat.html: README.rst
        rst2html.py README.rst > tinyformat.html

 tinyformat_speed_test: tinyformat.h tinyformat_speed_test.cpp Makefile
-       $(CXX) $(CXXFLAGS) -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test
+       $(CXX) $(CXXFLAGS) $(CXX11FLAGS) -DTINYFORMAT_USE_VARIADIC_TEMPLATES -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test

 bloat_test:
        @for opt in '' '-O3 -DNDEBUG' ; do \
diff --git a/tinyformat.h b/tinyformat.h
index 85a22c1..b5e1c40 100644
--- a/tinyformat.h
+++ b/tinyformat.h
@@ -975,7 +975,9 @@ std::string format(const char* fmt, const Args&... args)
 template<typename... Args>
 void printf(const char* fmt, const Args&... args)
 {
+    std::locale loc = std::cout.imbue(std::locale::classic());
     format(std::cout, fmt, args...);
+    std::cout.imbue(loc);
 }

 template<typename... Args>

gives the following

$ make speed_test
g++ -Wall -Werror -std=c++11 -DTINYFORMAT_USE_VARIADIC_TEMPLATES -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test
running speed tests...
printf timings:
real 1.22
user 1.22
sys 0.00
iostreams timings:
real 1.76
user 1.76
sys 0.00
tinyformat timings:
real 3.79
user 3.36
sys 0.42
boost timings:
real 8.53
user 8.53
sys 0.00

It's a bit of a performance disaster.

I was puzzled by the system time. Running strace reveals that the stream is flushed every time imbue is called. I think this is the real culprit, as inserting a simple std::cout::flush into printf results in similar performance degradation in the speed test.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 19, 2017

Dammit, there's no winning on this.

What about the string versions? Is there any performance difference?

So, potential strategies to mitigate perf issues:

  1. Default behavior is always to follow the global/stream locale, with a format_c() that forces C locale and/or format_loc() that takes an explicit locale.

  2. Default C locale for strings, stream locale for stream output, and overrides for both.

@c42f
Copy link
Owner

c42f commented Oct 19, 2017

Yeah, it's pretty awful. Taking this into consideration, the best option seems like it might be a new family of cformat() / format_c() functions which force C locale for both string and streams. Or maybe just a single new function cformat() for the string version as the stream versions already provide locale control (albeit not perhaps what people naively expect).

For the string version, the cost might be somewhat mitigated by the other stuff it's doing, namely allocating string buffers, etc, and flushing doesn't involve a syscall to writev()... actually so this may be ok. Needs testing (later).

@c42f
Copy link
Owner

c42f commented Oct 19, 2017

It's all surprisingly messy, and does make me wonder whether locale support in the standard library was a bit of an afterthought.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 19, 2017

If using C locale makes the string formatting slow, then I think the only choice is to leave the original format() to use global (for strings) and stream locales, and add format_c/format_loc when people have a specific preference and are willing to pay the cost. (My choice 1 a couple messages ago.)

If the string formatting is the same speed either way (i.e., if the actual stream flush on a real file or console stream is the expensive part), then my choice 2 from earlier is still a possibility, if you want the string format() to do what most users expect and have the special call for locale-intentioned user.

I am cursing the authors of the standard library for this. It's so obvious that locales should have been opt-in on each call, with the default calls being locale-independent.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 25, 2017

You want to make a final call on this, Chris? I don't have an especially strong preference, but I'd like to get it settled one way or the other so I can figure out what to do in my downstream projects and not have to change them a second time.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 25, 2017

I'm happy to revise my submission to match whatever decision you make.

@lgritz
Copy link
Contributor Author

lgritz commented Oct 27, 2017

On the OIIO side, we've made the decision not to expose the location-explicit API calls, but instead to simply declare that since OIIO is all about file I/O at its core, it must be consistent across all users and platforms, therefore all of its numeric input and output are "C" locale.

I think that this is probably not the right decision for tinyformat. My current recommendation is that you make the default (format()) be fixed to C locale, and add a second format_loc() that outputs using the native locale, and I think that will cover you. (Alternately, format could use native, and format_c could use C locale.) Let me know how you want to go, for any of these plans, and I'll revise this PR accordingly.

@lgritz
Copy link
Contributor Author

lgritz commented Nov 7, 2017

@c42f Poke... let me know how you'd like me to revise this.

@c42f
Copy link
Owner

c42f commented Nov 8, 2017

Sorry Larry, I've been dithering on making a decision on this because there's no obvious nice choice. But it's still worth making a choice. I'd like to do a little more benchmarking, but my current feeling is that the std library forces us into leaving the family of format functions as they are for performance and consistency. This means introducing a utility version format_c which always uses the classic C locale. format_c should have a string version, and could also have versions which format to a stream for convenience (though one should arguably set the locale on the stream in those cases instead).

As a side note - it's always been my intention to reproduce printf as closely as possible so that using tinyformat is just a function rename. So letting a stream based setting control the locale is at least consistent there.

Coincidentally I've just had a nice PR in #45 which implements posix argument reordering, again for localization related reasons. @jrohel - since you're apparently in the process of dealing with localization, I'd be interested if you can share your experience as it relates to this PR, if you have some.

@nigels-com
Copy link
Contributor

Agreed. No performance implications and opt-in for forcing the locale. Sounds good to me.

The default functions all are dependent on either the global native
locale, or the locale associated with the stream they are outputting
to. This can lead to situations where format("%g",123.45) might return
"123,45" if it happens to be running on a machine with locale "fr_FR",
for example. If that is written to an output file that is later read on
a computer using a locale with '.' as the decimal mark, it could
mis-parse as "123.0", leading to hilarious and tragic results.

This patch adds format_c, printf_c, printfln_c, which force the classic
"C" locale ('.' decimals, among other things). This incurs a performance
penalty -- in particular, when being used on I/O streams it will force a
flush when the locale for the stream is saved and restored -- but it is
the safe thing to do for persistent, portable output.
@lgritz
Copy link
Contributor Author

lgritz commented Nov 8, 2017

I've updated the PR with a much less ambitious version. All existing functions are unchanged, but I added a format_c, printf_c, printfln_c that force C locale, even if a little more expensive.

@jrohel
Copy link
Contributor

jrohel commented Feb 8, 2018

@c42f OK, my experiences:
I use tinyformat in a localized application. That means translated texts and localized format of numbers, etc. My experience with tinyformat is good. I only use this my extension #45. Nothing more. Please, do not change default behaviour.

  • If I set global C++ locale (std::locale::global) then I assume tfm::format(const char* fmt, const Args&... args) will use it.

  • If I set std::cout locale (std::cout.imbue) then I assume tfm::printf and tfm::printfln will use it.

  • If I want to produce machine readable text output then I set appropriate format parameters (imbue, ...) on the output stream. In another cases I need more languages at the same time (eg. text translator). Than I set output stream with specific parameters for each language. Machine readable text is only one of the languages.
    For example in binary world we have similar problems. Instead of lacale we must be aware eg. of endianness.

If you want you can add another function "format_c()" or more general function "format_loc(locale, ...)" - first parameter will be locale. "format_c()" then will be only a special case of "format_loc()".
But once again. Please, do not change default behaviour. I think that it behaves logicaly now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants