
Codes in other languages

Hiragana Writing

The hiragana writing system is one of three writing systems used in the Japanese language. Two of the three writing systems (hiragana and katakana) are syllabaries (one symbol per syllable); one (kanji) is ideographic (one symbol per word). Since languages have many more words than syllables, syllabaries are a lot easier to learn than ideographic writing systems. Depending on how you count, hiragana has as many as 107 characters, though many of these are composable from simple rules. Kanji, on the other hand, has thousands of characters.

The term romaji is also used. This denotes a conventional transliteration of Japanese sounds into Roman characters. It is not an official Japanese writing system, but it shows up fairly often, usually for the convenience of foreigners.

The basic characters

The sounds  

The romaji representations are pretty good approximations but this website has some nice recordings to help you hear some differences.

Pay close attention to the "r" series (syllables whose romaji representation begins with "r"). Is that sound at the beginning really an "r"?

What English vowel is "e" closest to? The one in let or the one in late?

Is the t in the T-series really an English t, as in the word top?

Pitch accent  

In some sense the idea of a syllable-based writing system is very natural to Japanese.

Japanese has pitch accent. This means that accented syllables of words are signalled by pitch (frequency) rather than stress (a combination of amplitude and duration). This means the timing of Japanese syllables is constant, which gives it a prosody very unlike that of English.

It also means Japanese has the option of multi-syllable words with no accent (no high tone); English does not seem to have unaccented multi-syllable words.

Accent on first syllable                        Accent elsewhere, or accentless

    I-ma     now                                i-MA     (Western style) living room
    SA-ke    salmon                             sake     alcohol
    NI-ho-n  (counter for long thin objects)    ni-HO-n  Japan
    HA-shi   chopsticks                         ha-SHI   bridge
    KA-ki    oyster                             ka-KI    persimmon

In principle this should reduce the constraints on combining Japanese syllables. In English there are syllable types that are very likely to be unstressed (any syllable with a schwa in it). Japanese does not have this. Syllables are more equal, at least as far as timing and accent go.

General Uesugi's Cipher

A simple polysyllabic substitution cipher is described here.

Actually a message written in General Uesugi's cipher gives one more important clue that a substitution cipher might be in use.

Normal Japanese texts mix all 3 writing systems.

Here is an example. All 3 writing systems are used in the very first line.

In fact Japanese quite commonly switches writing systems in the middle of a word. Normally hiragana is used for function words, such as auxiliary verbs, as well as endings, such as verb suffixes and case markers.

In the first line, right after the characters 社が出資, we see:

している  si te i ru
which is the progressive form of the very common verb do, written in hiragana.

So when we see a message written entirely in hiragana, either it was written by or for a child, or it was written for some stylistic "cuteness" effect, or something funny is going on.

Cryptographic Properties

Let us investigate the question of whether this cipher should be harder or easier to break than an English substitution cipher.

We COULD make the following assumption (consistent with Kerckhoffs's Principle): the cipher is disyllabic; that is, we know that 2 hiragana characters of ciphertext are always substituted for 1 hiragana character of plaintext.

However we could also just GUESS that and do a simple statistical test to see if the guess is right. We do the index of coincidence test (Friedman's test), but for digraphs. That is, we check the probability that two pairs of characters chosen from the text are identical.

The reasoning is as follows:

If the ciphertext is constructed with a disyllabic cipher, hiragana pairs will come up way more often than expected. For example, in the example we just looked at:
    び ==> のた
    ほ ==> べそ
    ゆ ==> てさ
Thus the sequence of characters のた will show up with the same frequency as び, which will be much higher than we would expect for an arbitrary pair of hiragana. If we do this for all pairs the result will be a much higher index of coincidence than expected if this were an ordinary Japanese text translated into hiragana.
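
Here is a minimal sketch of this digraph version of the index of coincidence test in Python. The function name, the choice to split the text into non-overlapping pairs, and the use of a plain string for the ciphertext are assumptions made for illustration, not part of the cipher description above.

from collections import Counter

def digraph_index_of_coincidence(ciphertext):
    # Split the ciphertext into consecutive, non-overlapping pairs.
    # If two ciphertext characters stand for one plaintext syllable,
    # these pairs are exactly the cipher units.
    pairs = [ciphertext[i:i + 2] for i in range(0, len(ciphertext) - 1, 2)]
    n = len(pairs)
    if n < 2:
        return 0.0
    counts = Counter(pairs)
    # Probability that two pairs drawn at random from the text are identical.
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

If this value is much higher than the index of coincidence for ordinary hiragana text, the guess that two ciphertext characters stand for one plaintext character looks good.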

The other clue of course is that there are fewer distinct hiragana characters in the message than exist in Japanese, since using two characters of ciphertext for each character of plaintext means not all characters get used in the ciphertext alphabet.

This illustrates one of the motivations for Kerckhoffs's Principle. Keeping the KIND of code you are using a secret may be much more difficult than it seems at first. Statistical analysis can reveal a great deal. So choose a code that is secure even when the algorithm being used is known.

Some questions:

  1. There are more hiragana than there are English letters. Does this make the cipher harder?
  2. What about the frequencies of Japanese syllables? If they are all equally timed, perhaps they are evenly distributed as well?

Answers:

  1. There are more hiragana than there are English letters. Does this make the cipher harder? No, not once we know to substitute 2 characters for 1. The next step is the statistical analysis; the code will only be harder if the statistical properties of hiragana really ARE different from those of an alphabet.
  2. What about the frequencies of Japanese syllables? If they are all equally timed, perhaps they are evenly distributed as well? An even distribution would be the hardest to decode, just as our polygraphic cipher was hard.
Frequencies for Hiragana

Here is a table of Japanese hiragana frequencies. Of course what we are interested in is not the frequencies of hiragana in Japanese text as it is usually written (which would be the frequencies of the syllables of grammatical formatives and function words), but the frequencies of the hiragana characters for text that is ALL translated into hiragana (which should roughly approximate the frequencies of Japanese syllables in speech).

The table was constructed using the following steps.

  1. The above news articles were translated into hiragana. This is here. This requires a dictionary and morphological analyzer. The morphological analyzer used was juman, which contains the necessary kanji dictionary.
  2. Statistics were compiled counting individual hiragana characters.
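
A sketch of step 2 is given below, assuming the hiragana version of the articles sits in a plain text file. The filename and function name are made up for illustration, and for simplicity only characters in the Unicode hiragana block are counted (the actual table also includes a few katakana and Latin characters).

from collections import Counter

def hiragana_frequencies(path):
    # Read the hiragana version of the articles.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Keep only characters in the Unicode hiragana block (U+3040 to U+309F).
    chars = [c for c in text if "\u3040" <= c <= "\u309f"]
    counts = Counter(chars)
    total = sum(counts.values())
    # Relative frequency of each hiragana character.
    return {c: counts[c] / total for c in counts}

# Hypothetical filename:
# freqs = hiragana_frequencies("articles_in_hiragana.txt")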
Interpreting the numbers

The numbers generally show that hiragana frequencies are "bumpy". That is, there is quite a variation in frequency from most frequent to least frequent.

But how does this compare to the case of English letters?

We need a measure of how "bumpy" a probability distribution is. We hereby introduce the concept of entropy.

Entropy  

Entropy is a measure of the average amount of surprise in a probability distribution.

If the probability distribution characterizes a set of signals in some channel (such as English letters in English texts), it measures how easy it is to predict the signals. The higher the entropy the harder it is to predict.

High entropy means high average surprise, which means low predictability.

Probability is itself a measure of surprise. The lower the probability the greater the surprise:

(1)  Surp(x) = 1/Prob(x)
We also want a measure of surprise that has the following property: the measure of the amount of surprise of two independent events is just the sum of their surprise values:
(2)  Surp(x;y) = Surp(x) + Surp(y)
Equation (1) does not have this neat property:
Prob(x;y) = Prob(x) * Prob(y)
Surp(x;y) = 1/(Prob(x) * Prob(y))
It turns out this does not in general equal:
1/Prob(x) + 1/Prob(y)
For example:
1/4 * 1/3 = 1/12
1/(1/12) = 12
But:
1/(1/4) = 4
1/(1/3) = 3
4 + 3 = 7

What we need to get this to work is some function f such that:

f(x * y) = f(x) + f(y)
It turns out the log function does this!

Surp(x) =  - log Prob(x)
The log of a number between 0 and 1 is always negative, so we throw in the minus sign to get surprising events to have bigger surprise values. For example:
Suppose:
  Prob(A) = 1/8
  log Prob(A) = - log 8 = -3
And suppose 
  Prob(B) =  1/4 
  log Prob(B) = - log 4 = -2
Now A is the more surprising (less probable event), so we throw in the minus signs to assign A the bigger surprise value:
Surp(A) = - log prob(A) = - -3 = 3
Surp(B) = - log prob(B) = - -2 = 2
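
To check that -log behaves the way we want, here is a quick verification in Python (assuming log base 2, which is what the worked example above uses; the function name is made up):

import math

def surp(p):
    # Surprise (information) of an event with probability p, in bits.
    return -math.log2(p)

# Additivity for independent events: Surp(x;y) = Surp(x) + Surp(y).
assert math.isclose(surp(1/4 * 1/3), surp(1/4) + surp(1/3))

# The worked example: A (probability 1/8) is more surprising than B (1/4).
print(surp(1/8), surp(1/4))   # 3.0 2.0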

The last thing you need to know is that computer scientists and information technologists, and generally people who worry about signals, channels, channel capacity, and noise, like to call surprise "information". So rather than saying the measure of the surprise of A is 3, they say the measure of the information of A is 3.

Finally we're interested in the AVERAGE amount of surprise for all the events in the probability distribution (for all possible signals), so we add the information measure for each signal weighted by its probability. This really is the general definition of what an average is. For all the signals 1,n:

H(p) = - Sum_{i=1...n} p(i) * log p(i)
For some reason H is the letter used for entropy.
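
Here is a small Python sketch of this definition, together with the per-signal normalization used in the tables below. The function names are my own, and log base 2 is assumed.

import math

def entropy(probs):
    # H(p) = - sum over i of p(i) * log2 p(i), in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_per_signal(probs):
    # Divide by log2 of the number of signals, giving a value between 0 and 1.
    return entropy(probs) / math.log2(len(probs))

Applied to the English letter probabilities in the table below, entropy gives about 4.19 bits and entropy_per_signal about 0.89.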
Example: Entropy of English letters
Entropy for ../models/let_nr.txt
Let    Prob     Information    Prob * Info
---------------------------------------------
e    0.123519	3.01719511707	0.372680923665
t    0.091202	3.45479072768	0.315083823946
a    0.080872	3.6282158994	0.293421076216
o    0.075482	3.72772354013	0.281376028256
i    0.073973	3.75685740382	0.277906012733
n    0.070675	3.82265621172	0.270166227763
s    0.064620	3.95187543917	0.255370190879
r    0.063176	3.9844795943	0.251723482849
h    0.051893	4.26831624746	0.221495735029
l    0.042018	4.57284869646	0.192141956528
d    0.037956	4.71952822809	0.179134413425
c    0.032046	4.96371189971	0.159067111538
u    0.027356	5.1919988953	0.14203232178
m    0.024467	5.35301897232	0.130972315196
f    0.022234	5.49108867051	0.1220888655
p    0.021212	5.55897553618	0.117916989074
g    0.020374	5.61712693929	0.114443344261
w    0.018895	5.72585167118	0.108189967327
y    0.018290	5.77280111469	0.105584532388
b    0.016459	5.92497950522	0.0975192376765
v    0.010586	6.56169863069	0.0694621417045
k    0.007437	7.07106351252	0.0525874993426
x    0.001921	9.02392676566	0.0173349633168
j    0.001648	9.24506804214	0.0152358721334
q    0.001034	9.91754809901	0.0102547447344
z    0.000656	10.5740165647	0.00693655486645
Sample Space:       26
Entropy:            4.18706288699
Entropy per signal: 0.890781105188
(4.1870628869941813, 26, 0.89078110518776332)
To get the entropy per signal, I just divided by log 26, because there are 26 letters. This tells me on average how surprising each letter is.
Example: Entropy of Hiragana Characters
Entropy for ../japanese_models/hir_freq_nr.txt
Let    Prob     Information    Prob * Info
---------------------------------------------
う    0.073907	3.75814517605	0.277753235527
ん    0.069157	3.85398090291	0.266529757302
い    0.066484	3.91084900552	0.260008885283
し    0.040915	4.61122633785	0.188668325613
き    0.031462	4.99024580579	0.157003113542
に    0.030106	5.05380515104	0.152149857877
か    0.029626	5.07699233802	0.150410975006
ょ    0.029277	5.09408846084	0.149139627868
の    0.027069	5.20721459818	0.140954091958
ち    0.026263	5.25082446631	0.137902402959
く    0.025535	5.29138013072	0.135115391638
と    0.025279	5.30591679602	0.134128270687
は    0.021296	5.55327371341	0.118262517001
た    0.020157	5.63257525367	0.113535819388
ゅ    0.017647	5.82443324441	0.102783773464
こ    0.017515	5.83526520163	0.102204670007
て    0.017004	5.87798202568	0.0999492063647
さ    0.016562	5.91597928893	0.0979804489832
つ    0.015702	5.99290785955	0.0941006392107
な    0.015462	6.01512924674	0.0930059284131
せ    0.014067	6.15154150488	0.0865337343492
じ    0.013827	6.17636801627	0.0854006405609
が    0.013742	6.18526420162	0.0849979006587
る    0.013649	6.19506093445	0.0845563866942
ろ    0.012913	6.27503197749	0.0810294879253
り    0.012262	6.34966188002	0.0778595539727
け    0.011061	6.49837436751	0.071878518879
を    0.011014	6.50451767617	0.0716407576854
ど    0.010402	6.58699524673	0.0685179245565
っ    0.010286	6.60317413097	0.0679202491112
よ    0.010232	6.61076802044	0.0676413783852
ぜ    0.010139	6.62394082207	0.0671601359949
で    0.009697	6.68824580072	0.0648559195296
お    0.008829	6.82353424166	0.0602449838196
ら    0.008434	6.88956726305	0.0581066102965
ご    0.007954	6.97410372252	0.0554720210089
す    0.007768	7.0082410839	0.0544400167398
あ    0.007744	7.01270533205	0.0543063900914
ー    0.007620	7.03599328694	0.0536142688465
も    0.007613	7.03731920646	0.0535751111188
だ    0.007597	7.04035446342	0.0534855728586
め    0.006962	7.1662824706	0.0498916585603
ま    0.006853	7.18904859766	0.0492665500398
れ    0.006768	7.2070547162	0.0487773463193
え    0.006567	7.25054982942	0.0476143607298
ぎ    0.005908	7.40311445856	0.0437376002212
ひ    0.005862	7.41439131663	0.0434631618981
ほ    0.005288	7.56306210782	0.0399934724261
み    0.005273	7.56716028794	0.0399016361983
そ    0.005249	7.57374168711	0.0397545701157
ね    0.004676	7.74050935479	0.036194621743
げ    0.004413	7.82402453735	0.0345274202833
ゃ    0.004389	7.83189201446	0.0343741740515
わ    0.004366	7.83947215433	0.0342271354258
ふ    0.004227	7.88615017229	0.0333347567783
ぶ    0.004219	7.8888831971	0.0332831982086
や    0.003630	8.10581473644	0.0294241074933
ン    0.003607	8.11498486154	0.0292707503956
ば    0.002747	8.50792737425	0.0233712764971
ぱ    0.002708	8.52855654575	0.0230953311259
ル    0.002406	8.69914764215	0.020930149227
ス    0.002274	8.78055203043	0.0199669753172
べ    0.002135	8.87154821482	0.0189407554386
む    0.002057	8.9252424908	0.0183592238036
ざ    0.002057	8.9252424908	0.0183592238036
ト    0.002049	8.93086430031	0.0182993409513
び    0.002003	8.96362186351	0.0179541345926
イ    0.001879	9.0558192179	0.0170158843104
へ    0.001747	9.16090467641	0.0160041004697
ゆ    0.001716	9.18673473182	0.0157644367998
ぽ    0.001709	9.19263188765	0.015710207896
n    0.001701	9.19940114366	0.0156481813454
o    0.001693	9.20620231143	0.0155861005132
v    0.001693	9.20620231143	0.0155861005132
ぞ    0.001654	9.23982505015	0.015282670633
ぼ    0.001468	9.41193231648	0.0138167166406
ラ    0.001383	9.49798312817	0.0131357106663
フ    0.001352	9.53068913304	0.0128854917079
ク    0.001329	9.55544318005	0.0126991839863
ド    0.001220	9.67890313687	0.011808261827
ず    0.001120	9.80228555238	0.0109785598187
リ    0.001019	9.93863023316	0.0101274642076
ぐ    0.000996	9.97156663726	0.00993168037071
レ    0.000941	10.0535176566	0.00946036011486
づ    0.000879	10.1518492142	0.00892347545926
ッ    0.000732	10.415868731	0.00762441591112
ぴ    0.000717	10.4457392606	0.00748959504987
Sample Space:       87
Entropy:            5.49661603106
Entropy per signal: 0.85312187428
(5.4966160310600634, 87, 0.85312187428022668)
To get the entropy per signal, I just divided by log 87, because there were 87 letters. This tells me on average how surprising each letter is.

Now compare the per signal entropy measure for English letters and Japanese hiragana:

English letters:   0.89078110518776332
Japanese hiragana: 0.85312187428 
The English entropy per signal is actually a little higher. This means the average surprise on seeing a new English letter is greater than the average surprise on seeing a new Japanese hiragana character.
Evaluating the Code

We considered the hypothesis that the hiragana substitution code might be a better code because hiragana characters would be more evenly distributed than English characters.

This turned out to be wrong. In fact the hiragana distribution was quite bumpy.

We tried to get precise about the idea of the bumpiness of a probability distribution by introducing the notion of entropy:

H(p) = - Sum_{i=1...n} p(i) * log p(i)
This measures the average amount of surprise (or informativeness) of the n signals.

We adjusted the entropy for the size of the signal space to get something called per signal entropy. This measure turned out to be slightly lower for Japanese:

English letters:   0.89078110518776332
Japanese hiragana: 0.85312187428 
How good a measure of "toughness of code" is this?

Well we argued that the toughest code is one in which every character has equal probability. The per character entropy for such a code is always 1. For example, if a signal system has 8 characters all equally probable, then the entropy is:

H(p) = 8 * (1/8 * - log (1/8))
     = 8 * (1/8 * (- - 3))
     = 8 * (1/8 * 3)
     = 8 * 3/8
     = 3
To get the per signal entropy we divide by log 8 = 3:
Hps(p) = H(p)/log 8 = 3/3 = 1
So the per signal entropy is a measure of how close a system is to the hardest case. It will always be a number between 0 and 1. And by this measure hiragana makes an easier substitution cipher than English characters.
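
The 8-character example above can be checked directly in Python (log base 2 assumed):

import math

# Eight equally probable characters: H = 8 * (1/8 * 3) = 3 bits,
# and the per signal entropy is 3 / log2(8) = 1, the hardest possible case.
h = -sum((1/8) * math.log2(1/8) for _ in range(8))
print(h, h / math.log2(8))   # 3.0 1.0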