Encoding of Git-p4 Messages and Authors

First, the Perforce server is not Unicode-enabled, so setting P4CHARSET has no effect. Next, I checked the output of basic commands, which was indeed ANSI (according to Notepad++) or ISO-8859-1 (according to file -bi on the redirected output). The locale command reports LANG=en_US.UTF-8. My conclusion is that all p4 client output is in ISO-8859-1, while git-p4 assumes UTF-8. I tried setting the Python encoding to ascii, but the messages and authors were still not migrated correctly.


Question:

Today I got to migrate some quite old Perforce repositories to Git. While this is interesting, one thing caught my attention: special characters in the commit messages, and even in the author names, are not encoded correctly.

So I tried to investigate where the problem comes from.

  • First, the Perforce server is not Unicode-enabled, so setting P4CHARSET does not help. The message

    Unicode clients require a unicode enabled server.

    still applies.
  • Next, I checked the output of basic commands such as

    p4 users

    which was indeed ANSI (according to Notepad++), or ISO-8859-1 according to

    file -bi

    on the redirected output.
  • The

    locale

    command reports LANG=en_US.UTF-8.

I believe that all output from the P4 client is in ISO-8859-1, whereas git-p4 assumes it to be in UTF-8.

I made an attempt to rewrite the commit messages using

git filter-branch --msg-filter 'iconv -f iso-8859-1 -t utf-8' -- --all

However, this does not fix the problem, not least because a message filter is not meant to touch the author names.

Does anyone know how to force the output to UTF-8 before git-p4 receives it?


Update:

I tried to override the output of the default p4 command by placing a simple shell script at the front of PATH:

/usr/bin/p4 $@ | iconv -f iso-8859-1 -t utf-8

However, this corrupts the marshalled Python objects that git-p4 evidently relies on:

  File "/usr/local/bin/git-p4", line 2467, in getBranchMapping
    for info in p4CmdList(command):
  File "/usr/local/bin/git-p4", line 480, in p4CmdList
    entry = marshal.load(p4.stdout)
ValueError: bad marshal data
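
A minimal sketch of why the pipe breaks things, assuming the traceback means what it appears to mean: git-p4 reads binary marshalled records from p4, and those records store string lengths in bytes. Re-encoding the whole stream turns each ISO-8859-1 byte above 0x7F into two UTF-8 bytes, so the recorded lengths no longer match. The helper name survives_transcoding and the sample records are made up for illustration, and the sketch is written for Python 3 rather than the Python 2 that git-p4 ran on here.

```python
import marshal


def survives_transcoding(record):
    """Return True if record still loads unchanged after a Latin-1 -> UTF-8 pass."""
    # Format version 2 is chosen only to keep the dump simple (an assumption
    # for illustration, not a statement about what p4 emits).
    raw = marshal.dumps(record, 2)
    # This mimics what `iconv -f iso-8859-1 -t utf-8` does to the byte stream.
    transcoded = raw.decode("iso-8859-1").encode("utf-8")
    try:
        return marshal.loads(transcoded) == record
    except (EOFError, ValueError, TypeError):
        # The length fields no longer match the payload.
        return False


# Pure ASCII passes through iconv untouched.
print(survives_transcoding({b"desc": b"plain ascii"}))
# A single byte above 0x7F is enough to break the record.
print(survives_transcoding({b"desc": b"Ge\xc4ndert"}))
```

The conversion therefore has to happen on the individual field values after unmarshalling, not on the raw stream.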


Update2:

Following the suggestions in "changing default encoding of python?", I tried to set the Python encoding to ascii:

export PYTHONIOENCODING="ascii"
python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

Output:

('ascii', 'ascii')

However, there are still some messages and authors that have not been migrated accurately.


Update 3:

I also tried to patch the

def commit(self, details, files, branch, parent = "")

function in git-p4.py, without success. Changing

self.gitStream.write(details["desc"])

to one of these

self.gitStream.write(details["desc"].encode('utf8', 'replace'))
self.gitStream.write(unicode(details["desc"],'utf8'))

just raised:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 29: ordinal not in range(128)

Since I am not a Python developer, I have no idea what to try next.


Solution:

I suspect that

details["desc"]

is a byte string (called 'str' in Python 2).

So you need to

decode

it to Unicode before you

encode

it. Use

print type(details["desc"])

to find out the type.

details["desc"].decode("iso-8859-1").encode("UTF-8")

might help to convert from iso-8859-1 to UTF-8.
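
As a rough sketch of that decode-then-encode step, the conversion could be wrapped in a small helper. The name to_utf8 and the sample description are hypothetical and not part of git-p4. The code is written for Python 3, where the byte string type is called bytes; in Python 2 the same two calls apply to a plain str. It also shows why the earlier attempts failed: treating the bytes as ASCII, which Python 2 does implicitly when .encode() is called on a byte string, cannot handle the byte 0xc4.

```python
def to_utf8(raw, source_encoding="iso-8859-1"):
    """Re-encode a byte string from source_encoding to UTF-8."""
    if isinstance(raw, bytes):
        # Decode first to get text, then encode that text as UTF-8.
        return raw.decode(source_encoding).encode("utf-8")
    # Already text: only the encode step is needed.
    return raw.encode("utf-8")


# Hypothetical description containing byte 0xc4 ("Ä" in ISO-8859-1).
desc = b"Ge\xc4ndert"

try:
    desc.decode("ascii")
except UnicodeDecodeError as err:
    print("ascii decode fails:", err.reason)

print(to_utf8(desc))
```

If this works for the descriptions, the same conversion would presumably have to be applied wherever git-p4 writes the author name as well, since that comes from the same p4 output.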
