How Can I Convert A String To IDNA? Encoding With 'idna' Codec Failed
Solution 1:
IDNA is an algorithm for encoding domain names (hostnames). What you provide as an example is a URL, so it contains characters that cannot appear in a domain name and therefore cannot be encoded, which is why you get the error.
You need to separate the domain (host) name from the rest, apply IDNA only to it (which is pointless in your example, since your hostname is already pure ASCII), and then reconstruct your URL.
The specific error you quote arises because IDNA deals with names as the DNS defines them, so it works at the label level. A label is whatever sits between dots, so the first step is to split the string on dots. Your string is then handled as follows:
- outlook-stg
- d-a-tf
- de/mapi/emsmdb/?MailboxId=cf27be4f-8605-40e4-94ab-d8cea3cc03bc@test
- com
A label in the DNS cannot be longer than 63 bytes. Your third string is 68 bytes long, hence the exact error you get, and that is before even considering that it contains disallowed characters (such as @) that can never occur in a domain name, even with IDNA encoding.
If I artificially shrink that third label so that it fits, I get a different error, as expected from the explanation above:
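To make the label splitting concrete, here is a minimal sketch using only the standard library. The helper name `oversized_labels` is made up for illustration, and it measures the label as typed, which matches the encoded length only for pure-ASCII input such as yours.

```python
MAX_LABEL_BYTES = 63  # DNS limit on the length of a single label


def oversized_labels(name: str) -> list:
    """Return the dot-separated labels of `name` that exceed 63 bytes.

    Hypothetical helper for illustration: it looks at the raw label,
    not at the IDNA-encoded form, so it is only meaningful for ASCII input.
    """
    return [
        label
        for label in name.split(".")
        if len(label.encode("utf-8")) > MAX_LABEL_BYTES
    ]


url = ("outlook-stg.d-a-tf.de/mapi/emsmdb/"
       "?MailboxId=cf27be4f-8605-40e4-94ab-d8cea3cc03bc@test.com")

# The whole URL is treated as a "name", so everything between the second
# and third dot ends up in a single, far too long, label.
for label in oversized_labels(url):
    print(repr(label))
```

Only the third piece is reported, which is the one the encoder complains about.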
>>> print(idna.encode('outlook-stg.d-a-tf.de/mapi/emsmdb/?MId=cf27be4f-8605-40e4-94ab-d8cea3cc03bc@test.com'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/idna/core.py", line 358, in encode
    s = alabel(label)
  File "/usr/local/lib/python3.7/site-packages/idna/core.py", line 270, in alabel
    ulabel(label)
  File "/usr/local/lib/python3.7/site-packages/idna/core.py", line 304, in ulabel
    check_label(label)
  File "/usr/local/lib/python3.7/site-packages/idna/core.py", line 261, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+002F at position 3 of 'de/mapi/emsmdb/?mid=cf27be4f-8605-40e4-94ab-d8cea3cc03bc@test' not allowed
(U+002F is of course /, another character that is disallowed in a domain name and hence rejected during IDNA encoding.)
Note that there are also rules for encoding non-ASCII characters in the other parts of the URL, such as the path, which is why the governing standard is now IRI: RFC 3987. It says, albeit in a convoluted way, exactly the above:
Replace the ireg-name part of the IRI by the part converted using the ToASCII operation specified in section 4.1 of [RFC3490] on each dot-separated label, and by using U+002E (FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set to TRUE, and with the flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE otherwise.
So, depending on your needs, you should:
- Parse your string as a URI/IRI (with a proper library; do not expect to do it correctly with a regex of your own)
- Once you have the hostname part, apply IDNA to it as needed (the URI/IRI parsing library may in fact already do this for you, so double-check)
- Reconstruct the full URI/IRI after that, if you want it.
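Those three steps could be sketched roughly as follows, using only the standard library. Several things here are assumptions rather than part of the answer above: the function name `idna_encode_url` is invented, the URL is assumed to carry a scheme (without `https://` or similar, `urlsplit` will not find a hostname), and the built-in `idna` codec is used, which implements the older IDNA 2003 rules; the third-party `idna` package could be swapped in via `idna.encode(host)`. IPv6 literals and percent-encoding of the path are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_url(url: str) -> str:
    """Apply IDNA to the hostname of `url` only and rebuild the URL.

    Illustrative sketch: assumes the URL has a scheme and a regular
    (non-IPv6) hostname.
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError("no hostname found; does the URL include a scheme?")

    # Step 2: IDNA-encode the hostname, label by label (built-in codec).
    host = parts.hostname.encode("idna").decode("ascii")

    # Rebuild the network location, keeping any user info and port.
    netloc = host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"

    # Step 3: reassemble; path, query and fragment are left untouched.
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


original = ("https://outlook-stg.d-a-tf.de/mapi/emsmdb/"
            "?MailboxId=cf27be4f-8605-40e4-94ab-d8cea3cc03bc@test.com")
print(idna_encode_url(original))  # hostname is already ASCII, so unchanged

print(idna_encode_url("https://b\u00fccher.example/mapi/?id=abc@test.com"))
```

Because the `/`, `?` and `@` characters now live in the path and query rather than in the hostname, the encoder never sees them and no label is too long.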