Hello, I’ve been looking in various places for other OpenBSD users for a while now, without success, so I’m hoping someone here will read this.

I’m wondering what your output is from file(1) on a file you know has text encoded as UTF-8.

On my system (7.3-stable) the output is “Non-ISO extended-ASCII text”, and I’m trying to figure out if this is how it should be, or if I did something wrong setting up the system.

So, if you have a computer with OpenBSD and a minute to spare, could you try running file(1) on a UTF-8 file and see if it identifies it as UTF-8 or “Non-ISO extended-ASCII text”?

Thanks in advance

    • tycho@lemmy.sdf.org
      1 year ago

      I looked through the source of file(1), and the part that determines the type of a text file seems to be in text.c: https://cvsweb.openbsd.org/cgi-bin/cvsweb/~checkout~/src/usr.bin/file/text.c?rev=1.3&content-type=text/plain

      And especially this part:

      static int
      text_try_test(const void *base, size_t size, int (*f)(u_char))
      {
      	const u_char	*data = base;
      	size_t		 offset;
      
      	for (offset = 0; offset < size; offset++) {
      		if (!f(data[offset]))
      			return (0);
      	}
      	return (1);
      }
      
      const char *
      text_get_type(const void *base, size_t size)
      {
      	if (text_try_test(base, size, text_is_ascii))
      		return ("ASCII");
      	if (text_try_test(base, size, text_is_latin1))
      		return ("ISO-8859");
      	if (text_try_test(base, size, text_is_extended))
      		return ("Non-ISO extended-ASCII");
      	return (NULL);
      }
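
To make the fall-through concrete, here is a rough, self-contained sketch of the same cascade. The three predicates are not shown in the excerpt above, so the `sketch_*` versions below are my own assumptions (in particular, that the Latin-1 test rejects the bytes 0x80-0x9F), not a copy of text.c. Under that assumption, a UTF-8 file comes out as "ISO-8859" or "Non-ISO extended-ASCII" depending on which continuation bytes it happens to contain.

```c
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: illustrative stand-ins for the predicates in text.c. */
static int
sketch_is_ascii(unsigned char c)
{
	/* Accepted control characters: BEL, BS, TAB, LF, FF, CR, ESC. */
	static const char ctl[] = "\a\b\t\n\f\r\033";

	if (c == '\0')
		return (0);
	if (strchr(ctl, c) != NULL)
		return (1);
	return (c >= 32 && c < 127);
}

static int
sketch_is_latin1(unsigned char c)
{
	/* ASSUMPTION: 0x80-0x9F (C1 controls) do not count as ISO-8859. */
	return (c >= 0xA0 || sketch_is_ascii(c));
}

static int
sketch_is_extended(unsigned char c)
{
	/* Any byte with the high bit set is accepted here. */
	return (c >= 0x80 || sketch_is_ascii(c));
}

/* Same shape as text_try_test: every byte must pass the predicate. */
static int
sketch_try_test(const void *base, size_t size, int (*f)(unsigned char))
{
	const unsigned char *data = base;
	size_t offset;

	for (offset = 0; offset < size; offset++) {
		if (!f(data[offset]))
			return (0);
	}
	return (1);
}

/* Same shape as text_get_type: the first predicate that fits wins. */
static const char *
sketch_get_type(const void *base, size_t size)
{
	if (sketch_try_test(base, size, sketch_is_ascii))
		return ("ASCII");
	if (sketch_try_test(base, size, sketch_is_latin1))
		return ("ISO-8859");
	if (sketch_try_test(base, size, sketch_is_extended))
		return ("Non-ISO extended-ASCII");
	return (NULL);
}
```

With these assumed predicates, the UTF-8 bytes for a right single quote (0xE2 0x80 0x99) fail the Latin-1 test because of 0x80 and 0x99, so the buffer falls through to the last label.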
      

      So file(1) cannot currently identify a file as UTF-8. Another file, /etc/magic, can help identify a text file as UTF-7 or UTF-8-EBCDIC because those need a BOM, but as you said, UTF-8 does not need a BOM. So it looks like we are stuck here :)
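
For what it is worth, recognising UTF-8 without a BOM is possible in principle, but not with a per-byte predicate of the form `int (*f)(u_char)`, since a UTF-8 check has to look at several bytes together. The following is only a sketch of what such a check could look like. `sketch_is_utf8` is a hypothetical helper of mine, not something that exists in file(1), and it only checks that the encoding is well formed (it does not filter control characters the way an ASCII text test might).

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of file(1): returns 1 if the buffer is
 * well-formed UTF-8 and 0 otherwise.
 */
static int
sketch_is_utf8(const void *base, size_t size)
{
	const unsigned char *p = base;
	size_t i = 0, need, k;
	unsigned long cp, min;

	while (i < size) {
		unsigned char c = p[i++];

		if (c < 0x80)
			continue;		/* plain ASCII byte */
		else if ((c & 0xE0) == 0xC0) {
			need = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			need = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			need = 3; cp = c & 0x07; min = 0x10000;
		} else
			return (0);	/* stray continuation or invalid lead */

		if (size - i < need)
			return (0);	/* sequence truncated at end of buffer */

		for (k = 0; k < need; k++) {
			c = p[i++];
			if ((c & 0xC0) != 0x80)
				return (0);	/* not a continuation byte */
			cp = (cp << 6) | (c & 0x3F);
		}

		/* Reject overlong forms, surrogates and out-of-range values. */
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return (0);
	}
	return (1);
}
```

A buffer of pure ASCII also passes this check, so a classifier built on it would need to run the ASCII test first and treat UTF-8 as the next candidate.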

      • pmk@lemmy.sdf.org (OP)
        1 year ago

        Thank you. At least I now know that this is the expected output for UTF-8 files, which is good to know. Thank you again.

        • tycho@lemmy.sdf.org
          1 year ago

          Yes, it looks like UTF-8 is a first-class citizen, but really it is ASCII that is 100% supported. From the FAQ:

          The OpenBSD base system fully supports the ASCII character set and encoding, and partially supports the UTF-8 encoding of the Unicode character set.