Which countries have the most beautiful women? (My deeply flawed analysis)
So, I happened upon the Wikipedia page about the Miss World pageant and noticed that it had a list of winners by country. For example, India has won Miss World 5 times. But, of course, India has a very large population so you'd expect it to be able to churn out a few beauties. So, to get a better idea here is a population adjusted list of countries that have won Miss World:
| Country | Wins | Pop. | Wins/Pop. | Normalized |
|---|
| Bermuda | 1 | 66163 | 0.0000151141876879827 | 100.00% |
| Iceland | 3 | 316252 | 0.00000948610601672084 | 62.76% |
| Grenada | 1 | 110000 | 0.00000909090909090909 | 60.15% |
| Guam | 1 | 173456 | 0.00000576515081634536 | 38.14% |
| Jamaica | 3 | 2651000 | 0.000001131648434553 | 7.49% |
| Trinidad and Tobago | 1 | 1305000 | 0.000000766283524904215 | 5.07% |
| Sweden | 3 | 9182927 | 0.000000326693221017656 | 2.16% |
| Puerto Rico | 1 | 3994259 | 0.000000250359328225836 | 1.66% |
| Austria | 2 | 8316487 | 0.000000240486157195941 | 1.59% |
| Ireland | 1 | 4339000 | 0.000000230467849734962 | 1.52% |
| Finland | 1 | 5308208 | 0.000000188387493481793 | 1.25% |
| Venezuela | 5 | 28199822 | 0.000000177306083705067 | 1.17% |
| Israel | 1 | 7282000 | 0.000000137324910738808 | 0.91% |
| Netherlands | 2 | 16408557 | 0.000000121887622415548 | 0.81% |
| Dominican Republic | 1 | 9760000 | 0.000000102459016393443 | 0.68% |
| Czech Republic | 1 | 10381130 | 0.0000000963286270377117 | 0.64% |
| Australia | 2 | 21290000 | 0.0000000939408172851104 | 0.62% |
| Greece | 1 | 11216708 | 0.0000000891527175353054 | 0.59% |
| Peru | 2 | 28674757 | 0.0000000697477575834383 | 0.46% |
| UK | 4 | 60487300 | 0.0000000661295842267716 | 0.44% |
| Argentina | 2 | 40301927 | 0.0000000496254186555397 | 0.33% |
| South Africa | 2 | 43700000 | 0.000000045766590389016 | 0.30% |
| Poland | 1 | 38518241 | 0.0000000259617255107781 | 0.17% |
| France | 1 | 64473140 | 0.0000000155103350015216 | 0.10% |
| Turkey | 1 | 70586256 | 0.0000000141670639111387 | 0.09% |
| Egypt | 1 | 80335036 | 0.0000000124478689472424 | 0.08% |
| Germany | 1 | 82210000 | 0.0000000121639703199124 | 0.08% |
| Russia | 1 | 142008838 | 0.00000000704181524251329 | 0.05% |
| Nigeria | 1 | 148000000 | 0.00000000675675675675676 | 0.04% |
| US | 2 | 304072000 | 0.00000000657738956562919 | 0.04% |
| Brazil | 1 | 186757608 | 0.00000000535453420457174 | 0.04% |
| India | 5 | 1132446000 | 0.00000000441522156464856 | 0.03% |
| China | 1 | 1321851888 | 0.000000000756514409124179 | 0.01% |
So, far and away, the top three are Bermuda, Iceland and Grenada. Given that Bermuda is the winner, and a tax-haven, and has a sub-tropical climate... Hamilton here I come! Labels: pseudo-randomness
Interesting real-world Apache Problem
I'm working with a large client who has a number of web servers behind a load balancer. This morning one Apache 1.3 had failed to come up on one of them. The client sends a SIGUSR1 to each Apache once an hour to force a graceful reload. This particular machine had operated correctly restarting Apache once per hour for 54 hours (since a recent reboot of the machine) and then died. A quick look in the Apache error.log file showed the following: module "mod_jk.c" could not be loaded, because the dynamic module limit was reached. Please increase DYNAMIC_MODULE_LIMIT and recompile. Naturally I went looking for a problem with mod_jk which was the wrong place to look. Scrolling through the log file I noticed that every time Apache restarted we'd get the error: Cannot remove module mod_include.c: not found in module listThis was where the real problem lay. A quick httpd -l showed that mod_include was compiled into the client's Apache and looking in the httpd.conf revealed that mod_include was also being loaded with LoadModule: LoadModule includes_module modules/mod_include.soWhen a module is both statically linked into Apache and dynamically loaded you run into a nasty problem: Apache doesn't complain when you start, but it will fail to unload the double loaded module on exit. So for every SIGUSR1 a single slot of the DYNAMIC_MODULE_LIMIT was used up. The default DYNAMIC_MODULE_LIMIT is 64 and with 10 real dynamic modules and a boot once per hour it took 54 hours to consume every slot in the module limit. Removing the errorneous LoadModule fixed the problem. Labels: pseudo-randomness
Digg 3 Million
A quick update on my previous estimate of Digg users shows that, as predicted, Digg passed the 3 million user mark during March, 2008.  Comparing this estimated data and data from the Digg API shows that around 20% of the 3m accounts are not active. I speculated before that these accounts had been banned for spamming or other activities (that's around 600,000 bad accounts). Growth appears to be the same as before adding around 150,000 accounts per month. Labels: pseudo-randomness
It's 3am and there's a crisis somewhere in the world
If you follow US politics then you'll know that Hillary Clinton has a couple of ads that start with a 3am phone call to the Whitehouse. The first ad was intended as a slam against Barack Obama implying that he didn't have the experience to deal with such a crisis. The second is going up against John McCain claiming he doesn't want to do anything about the housing finance problem in the US. You know if it's 3am and there's a crisis in the world there's only one place and only one man to call. CTU and Jack Bauer. First of all, he's already up. 3am is nothing to Jack. Hillary and McCain both look like they could use the sleep and Obama looks like he gets his beauty sleep every night. So, Jack's ready to go before any of them. As much as you might think Obama is David Palmer (safe pair of hands), he's more like a Wayne Palmer (a slick little fighter) and you know what that means: gets blown up within five minutes of being president and then pops a brain vein and the evil VP has to take over. The only good thing to say about Obama is that he would call Jack. Now, McCain might look like a Bauer type with his military background and heroic time spent as a PoW. But here's the difference. McCain was 5 years as a PoW, no one came to get him out so he can't be that valuable. Also, if Jack had been held hostage in North Vietnam for 5 years there wouldn't be a North Vietnam now. Because you can bet the life of the next random CTU cast member the Jack would have (a) escaped and (b) annihilated everyone involved. So, that leaves Hillary. She can't tell sniper fire from a little girl with flowers. In fact she reminds me more and more of the evil Vice Presidents that pop up in every 24 trying to take power from the real president. Hell, she's probably even got backing from Phillip Bauer. Only one word of warning: make sure it's a man that calls Jack. If it's a woman he's bound to have been involved with her, she'll turn out to be a double agent or her father will be evil, and Jack'll be distracted. If it has to be a woman, make sure it's Chloe. Labels: pseudo-randomness
Names: Boys vs. Girls
Using data from the 1990 US Census I was amazed to discover that 90% of the US male population has one of 1,219 first names, but 90% of the female population has one of 4,275. There are 3.5x as many female first names as male first names. The top 10 male first names are: James, John, Robert, Michael, William, David, Richard, Charles, Joseph and Thomas (which account for 23.2% of the male population; 50% of the population have one of only 60 names). The top 10 female first names are: Mary, Patricia, Linda, Barbara, Elizabeth, Maria, Susan, Margaret and Dorothy (which account for 10.7% of the female population; 50% of the population have on of 139 names). You can also see that all the variety in names happens between the 80% and 90%. For males 80% of the population is covered by 27% of the names; for females 80% is covered by 19% of the names).  The large numbers of female names appears to be because there are lots of variants of female names compared to male names. A quick run through calculating the Levenshtein distance between names and selecting the 10 closest for each gives an average distance of: Male: 2.62 Female: 2.01 So female names are more 'similar' than male names, hence the variety created by all these variants. The other thing we can extract from this data is the prevalence of names beginning with certain letters and weight adjust based on the occurrence of each name.   Things are much more polarized when you look at trailing letters (for example, the trailing letter A is an almost sure sign that it's a woman; the opposite is true of D):   So combining the two it's possible to give a 'maleness' score (the blue part) to each final letter:  Labels: pseudo-randomness
Sleeping with the enemy
I loaded the dish washer:  My SO loaded the dish washer:  Who needs help? Labels: pseudo-randomness
Any sufficiently simple explanation is indistinguishable from magic
Well, that's true if you are a fool. Take for example the mystical belief that the number 11 or 11:11 is somehow significant. Uri Geller goes on about this on his web site. To quote from Geller's web site (and you'll find other similar thinking on many 11:11 web sites): String theory is said to be the theory of everything. It is a way of describing every force and matter regardless of how large or small or weak or strong it is. There are a few eleven's that have been found in string theory.
I find this to be interesting since this theory is supposed to explain the universe! The first eleven that was noticed is that string theory has to have 11 parallel universes (discussed in the beginning of the "11.11" article) and without including these universes, the theory does not work.
The second is that Brian Greene has 11 letters in his name. For those of you who do not know, he is a physicist as well as the author of The Elegant Universe, which is a book explaining string theory. (His book was later made into a mini series that he hosted.) Another interesting find is that Isaac Newton (who's ideas kicked off string theory many years later) has 11 letters in his name as well as John Schwarz. Schwarz was one of the two men who worked out the anomalies in the theory. Plus, 1 person + 1 person = 2 people = equality.
Also, the two one's next to each other is 11. The two men had to find the same number (496) on both sides of the equation in order for the anomalies to be worked out, so the equation had to have equality! There were two matching sides to the equation as well because they ultimately got 496 on both sides. So, the 1 + 1 = 2 = equality applies for the equation as well.
I added a little bold type there because it amused me; pity that Mr Geller didn't look up the definition of equation before writing that line. But key to this whole belief is that the number 11 keeps turning up at random. When I first read about this I looked up at the clock and it was 11:43. Whoa! Spooky! But then I remembered Benford's Law. Benford's Law is essentially that in lots of real-life data the leading digit is 1 with a probability of about 30% (instead of the 10% you'd expect if the first digit was random from 0 through 9) and hence numbers beginning with 1 occur more often than numbers starting with any other digit. A simple illustration is my clock experience. What's the probability that if you look at a clock at random that the first digit is a 1? Well it's more likely than any other number. For a clock showing 12 hour time it cycles through: 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. A simple count will show you that the number 1 is the first digit for 8 out of the 24 hours and that all the other digits occur 2 times in 24 hours. So what's the probability that if I glance at a clock at random I'll see a 1 at the beginning? 8/24 or 1/3 of the time... which is Benford's Law. Now, Benford's Law isn't restricted to time. It occurs all over the place (Wikipedia lists: electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants) and so if you walk through life looking at random numbers you'll see numbers starting with a 1 more often than any other number. In 1988 a mathematician named Ted Hill showed why this is the case for many real-world systems. But, what about 11? I hear you ask. Well if the first digit is more likely to be 1 than any other than it's clear that you are more likely to see numbers in the range 10 through 19 more than other two digit numbers, but a more interesting offshoot of Benford's Law is explained here. Essentially as you walk through the digits of a number you are more likely to see a 1 than another digit, but that effect diminishes the longer the number gets. The probability that the the second digit is a 1 is about 11% (instead of the expected 10%) and given that the probability that the first digit is a 1 is 30% you are bound to come across 11 more frequently than you'd expect (if numbers were random). So, it's no surprise that we see lots of 11s, and hence there's a simple explanation for all those 11s. Either that or I've been missing the call of the 11:11 Spirit Guardians all these years: These 11:11 Wake-Up Calls on your digital clocks, mobile phones, VCR’s and microwaves are the "trademark" prompts of a group of just 1,111 fun-loving Spirit Guardians, or Angels. Once they have your attention, they will use other digits, like 12:34, or 2:22 to remind you of their presence. Invisible to our eyes, they are very real.
Labels: mathematics, pseudo-randomness
Would they hide me?
This is pretty much exclusively a technical blog, but I was very struck by something legendary investor Warren Buffett said when answering the question "How do you define happiness and what about your life makes you most happy?": I know a woman in her 80’s, a Polish Jew woman forced into a concentration camp with her family but not all of them came out. She says, “I am slow to make friends because when I look at people, I have one question in mind; would they hide me?” If you get to be my age, or younger for that matter, and have a lot of people that would hide you, then you can feel pretty good about how you’ve lived your life.
I know people on the Forbes 400 list whose children would not hide them. “He’s in the attic, he’s in the attic.” Some of them keep compensating by joining board seats or getting honorary degrees, but it doesn’t change the fact that no one will give a damn when they are gone. The most powerful force in the world is unconditional love. To hoard it is a terrible mistake in life. The more you try to give it away, the more you get it back.
Labels: pseudo-randomness
Tonight, I'm going to write myself an Aston Martin
This is the story of my attempt to 'cheat' in an on-line spot-the-ball competition to win an Aston Martin. It's also the story of my failure, but you get free source code that implements automatic detection of image alteration using copy/paste or tools like the Clone Tool in Photoshop. First, take a look at this photo:

Notice anything strange? In fact this image has been tampered with to cover up a truck. The truck is completely hidden by foliage. Here's the original:

Wouldn't it be nice to be able to detect that automatically? It is possible. Here's an image automatically generated by my code showing what was moved. All of the red was moved to the blue (or the other way around).

I was motivated to work on this program by greed (or at least my never-ending love of having a little flutter on things). Best of the Best runs spot-the-ball competitions in airports to win very expensive cars. But they also run the same competition online. That meant I could get my hands on the actual image used... could I process it to discover where the ball had been removed? (In reality, this isn't the right way to win because the actual ball position is not governed by where it actually was, but where a judge thinks it was). Would it be cheating if I could? Apparently not, the competition rules say I should use my skill and judgment in determining the ball position. Surely, skill covers my programming ability. So, I went looking for tampering algorithms and eventually came across Detection of Copy-Move Forgery in Digital Images written by Jessica Fridrich at SUNY Binghamton. The paper describes an algorithm for detecting just the sort of changes I thought I was looking for. Unfortunately, I know nothing about image processing. Fortunately, the paper is written in a very clear style and a bit of Internet research enabled me to track down the knowledge I didn't have. (Also, thanks to Jessica for sending me the original images she used to test my implementation). In brief the algorithm does the following: - Slide a 16x16 block across the entire image from left hand corner to bottom right hand corner. For each 16x16 block perform a discrete cosine transform (DCT) on it and then quantize the 16x16 block using an expanded version of the standard JPEG quantization matrix.
- Each quantized DCT transformed block is stored in a matrix with one row per (x,y) position in the original image (the (x,y) being the upper left hand corner of the 16x16 block being examined).
- The resulting matrix is lexicographically sorted and then rows that match in the matrix are identified. For each pair of matching rows (x1,y1) and (x2,y2) the shift vector (x1-x2,y1-y2) (normalized by swapping if necessary so that the first value is +ve) is computed and for each shift vector a count is kept of the number of times it is seen.
- Finally the shift vectors with a count > some threshold are examined, the corresponding pair of positions in the image are found and the 16x16 blocks they represent are highlighted.
Here's another picture showing a golfing image that's been touched up to remove something from the grass: To get access to image data I used the FreeImage library and wrote a small C program that implements Jessica's algorithm. You can download the source here; it's released to you under the GNU GPL. The program has two key parameters that affect how the image is processed: the quality factor and the threshold. The quality factor is a number used to 'blur' the image (actually it changes the quantization): the higher the factor the more blurring and hence more 16x16 blocks are likely to seem the same to the algorithm. Increasing the quality factor will tend to increase the false matches. The threshold is simply the number of blocks that have to appear to have been copied together. This prevents us from seeing a single 16x16 block as evidence of copying. Increasing the threshold means ever larger groups of blocks have to be identified together before they are identified as copying. Back at Best of the Best I grabbed the image for Supercar Competition (SC-272), cut out a section that I thought the ball had to be in (just to speed up processing) and ran the algorithm. After some parameter tweaking the algorithm came up only with what look like false matches to me (along the bar where it's all one color):

And, of course, that's not where the judge thought the ball was. So, I guess I won't be driving home in the V8 Vantage, but what geek needs that when they've got a cool piece of software that detects copy/move forgery in images? Which leaves me with one question: how are spot-the-ball images generated? Is this an algorithm problem, a problem because they use JPG (which is already transformed) for their images, or are these images generated in some other way? Labels: pseudo-randomness
Interface to SQLite database in 23 lines of Arc
One thing that the first release of Arc was missing was access to any sort of database, but that's easily remedied. Here are 23 lines of Arc code that provide access to a SQLite database: (= db! 'nil)
(def db+ (name (o host "localhost") (o port 49153)) (let (i o) (connect-socket host port) (db> o name) (if (db< i) (list i o))))
(def sql ((i o) q) (db> o q) (if (db< i) (readall i 200)))
(def db- (db) (map close db))
(def db> (o s) (write s o) (writec #\return o) (writec #\newline o) (flush-socket o))
(def db< (i) (= db! (read i)) (iso db! 200))
The three functions you need to care about are db+ (get a connection to a named SQLite database), db- (close a connection to a database) and sql (execute a SQL query and return a list (or lists) of rows. There's also db! which contains the status of the last command (200 for OK, or 500 followed by a string explaining the error). Here's a little Arc session creating a database, putting some data in it and then querying it. The database called test didn't exist at the start of this session: arc> (= db (db+ "test")) (#<input-port> #<output-port>) arc> (sql db "create table foo (id integer primary key, text varchar(255))") nil arc> (sql db "select * from foo") nil arc> (sql db "insert into foo (text) values ('first');") nil arc> (sql db "select * from foo") (("1" "first")) arc> (sql db "insert into foo (text) values ('something else')") nil arc> (sql db "select * from foo") (("1" "first") ("2" "something else")) arc> (db- db) nil
To make this work I had to write a TCP server that wraps SQLite (it's just a small C program that you can get here). The C program listens on a port for connections from your Arc program and handles queries. I did have to make a small patch to Arc itself (since arc0 doesn't contain any outgoing socket code). My patch adds the ability to make a TCP connection to a remote machine and to flush an output port (add this to your ac.scm): (xdef 'connect-socket (lambda (host port) (let-values ([(in out) (tcp-connect host port)]) (list in out)))) (xdef 'flush-socket (lambda (s) (flush-output s)))
(Apologies if I have abused Scheme there, I'm a Scheme n00b) All this code is released under the same license as Arc itself. Labels: arc, pseudo-randomness
My first Arc project: a simple Wiki
The only way to learn a programming language is to write something in it. So, I decided it was time to dig into Arc and my first project is a very simple Wiki. Here's the source ( wiki.arc): ; A wiki written in Arc (arc0) ; ; Copyright (c) 2008 John Graham-Cumming ; ; (load "wiki.arc") ; (wsv) ; ; Then go to http://localhost:8080/show
(load "web.arc") (load "util.arc")
(= pagedir* "wiki/")
(def histfiles (page) (sort > (map [coerce _ 'int] (rem [is "current" _] (dir (pagepath page))))))
(def nexthist (page) (let h (histfiles page) (if h (++ (car h)) 0)))
(def pagepath (page) (string pagedir* (page 0) "/" (page 0) (page 1) "/" page ))
(def pagefile (page (o file)) (string (pagepath page) "/" (or file "current")))
(def slurp (page (o file)) (if (let p (pagefile page file) (if (file-exists p) (readfile p)))))
(def upperlen (word) (len (keep upper word)))
(def is-wikilink (word) (if (alphas word) (if (~is (word 0) (downcase (word 0))) (>= (upperlen word) 2))))
(mac url-show (page) `(string "show?p=" ,page))
(mac url-edit (page) `(string "edit?p=" ,page))
(mac link-show (page text) `(link ,text (url-show ,page)))
(mac link-edit (page text) `(link ,text (url-edit ,page)))
(def wikify (word) (if (is-wikilink word) (if (file-exists (pagefile word)) (link-show word word) (pr word)(link-edit word "?")) (pr word)) (ws))
(mac spew-raw (page) `(spew ,page [pr _ " "]))
(mac spew-wiki (page (o file)) `(spew ,page [wikify (string _)] ,file))
(def spew (page f (o file)) (let p (pagepath page) (if (dir-exists p) (map f (flat (map tokens (slurp page file)))) (pr "This page does not yet exist."))))
(def squash (file body) (writefile1 body file))
(def save-page (req) (w/$ p (w/$ t (squash (pagefile p) t) (squash (string (pagepath p) "/" (nexthist p)) t)) (url-show p)))
(mac mtime (f) `(datetime (file-mtime ,f)))
(def last-modified (page) (let f (pagefile page) (if (file-exists f) (pr "Last modified: " (mtime f)))))
(mac show-page (page) `(whitepage (tag h1 (link-show ,page ,page)) (spew-wiki ,page) (br 2) (hr) (last-modified ,page) (br) (link-edit ,page "[edit]") (ws) (link "[history]" (string "history?p=" ,page))))
(mac edit-page (page) `(whitepage (tag h1 (pr (string "Editing " ,page))) (arform save-page (textarea "t" 25 80 (spew-raw ,page)) (hidden "p" ,page) (br) (submit "Save")) (link-show ,page "[cancel]") (br 2)))
(def revision (page rev) (tag li (pr "Revision: " ) (link rev (string "revision?p=" page "&r=" rev)) (pr " modified " (mtime (string (pagepath page) "/" rev)))))
(mac history-page (page) `(whitepage (tag h1 (pr (string "History of " ,page))) (tag ul (map [revision ,page _] (histfiles ,page))) (hr) (link-show ,page (string "Back to " ,page))))
(mac revision-page (page rev) `(whitepage (tag h1 (pr "Showing revision " ,rev " of " ,page)) (spew-wiki ,page ,rev) (br 2) (hr) (last-modified ,page) (br) (link-show ,page (string "Back to " ,page))))
(defop show req (w/$ p (if p (show-page ($ "p")) (show-page "HomePage"))))
(defop edit req (w/$ p (ensure-dir (pagepath p)) (edit-page p)))
(defop history req (history-page ($ "p")))
(defop revision req (revision-page ($ "p") ($ "r")))
(def wsv () (ensure-dir pagedir*) (asv))
It loads two helpers. The first contains common utilities that aren't really Wiki-related ( util.arc): (def alpha (c) (or (<= #\a c #\z) (<= #\A c #\Z)))
(def alphas (str) (is (keep alpha str) str))
(def upper (c) (is (upcase c) c))
(def datetime ((o time (seconds))) (let val (tostring (system (string "date -u -r " time " \"+%Y-%m-%d %H:%M\""))) (subseq val 0 (- (len val) 1))))
And the second contains enhancement to Arc's web/HTML handling ( web.arc): (mac hidden (name val) `(gentag input type 'hidden name ,name value ,val))
(mac hr () `(gentag hr))
(mac ws () `(pr " "))
(mac $ (r) `(arg req ,r))
(mac w/$ (r . body) `(with (,r ($ (string ',r))) ,@body))
In web.arc there are a couple of bits of syntax to make accessing form/URL arguments easier: ($ "p") (which gets the value of the argument p) and (w/$ p ...) which sets a variable called p to the value of the argument p and then evaluates the rest of the expression. All this is released under the same license as Arc. (Since I have never programmed in Arc before, and it's been almost 20 years since I stopped coding in LISP or ML, I'd appreciate constructive comments). Labels: arc, pseudo-randomness
The Arc Challenge Explained
When I first looked at the Arc Challenge code my reaction, like that of many people, was WTH? Here's the code: (defop said req (aform [w/link (pr "you said: " (arg _ "foo")) (pr "click here")] (input "foo") (submit)))
Within the context of the Arc web/app server this creates a page called /said which has a form on it: <form method=post action="x"> <input type=hidden name="fnid" value="JtCw8ju328"> <input type=text name="foo" value="" size=10> <input type=submit value="submit"> </form>
That form accepts a single parameter called foo and redirects to /x. When clicking submit the user is taken to a page with a single link on it: <a href="x?fnid=bHJpJ5G1DH">click here</a>
Following that link brings up a page showing what you typed in the first; here's the output when I typed hello in the form: you said: hello
So, how does that work? Firstly, the defop defines an 'operation' (which is just a page within the web server). In this case the page is called said and hence is bound to /said. There's a single argument, called req, which will contain the HTTP request when said is called by the server. When said is called it uses aform to create an HTML form. To see this more clearly I've removed the clever part (and replaced it with X): (aform X (input "foo") (submit))
So aform creates a form with an simple HTML input with the name foo and a submit button. The clever bit is what happens when the form is submitted. By default the form submits to the page /x. This is hard-coded in the source of the Arc server. It makes use of a neat feature of the Arc server: fnids. When the form was generated a hidden field was inserted with a unique 'function id' (the fnid). This fnid is used by the /x URL to lookup a function to call when /x is activated. (Note this example uses URLs/hidden form fields for the fnid, there's no reason why it couldn't be stored in a cookie). The function called is actually the first argument to aform which has been stored away to be called when necessary. Here's the function definition: [w/link (pr "you said: " (arg _ "foo")) (pr "click here")]
[ ... _ ... ] is special Arc syntax for a function with a single argument called _. So the first argument to aform is a function definition, and that function is assigned a unique fnid and that fnid can be used to lookup that function and call it. The single argument consists of the HTTP request used to activate the function. The w/link macro creates a page consisting of the words click here linked to another page. The link is, once again, done using a function and fnid. The function called when the link is clicked is: (pr "you said: " (arg _ "foo"))
w/link's first argument is an expression that will be evaluated within the context of a function (which is entirely hidden inside the server) and used to output the page. It retrieves the foo argument from the HTTP request at the time of the initial POST. What's neat here is the mapping between functions and fnids so that pages are just functions and the lookup of the right page to go to is handled automatically. Labels: arc, pseudo-randomness
USD/GBP exchange rate getting out of hand according to Sky News
The Digg Heat Map
After my post about the number of registered Digg users got picked up by bloggers (Cliff Notes: 2.7m registered users, about 19% have been banned) I took at look at the information available through the Digg API. The API has and end point to get user information (in chunks of up to 100 users) and that user information includes the date and time of registration to the nearest second (along with the user's name and icon and number of profile views). So I wrote up a little script and used it to pull down information on 100,000 Digg users so that I could look at the pattern in registration times. I was specifically looking to see if there was evidence that Digg's audience is based in a certain geography, or that the geography had changed over time. To make sure that I had a good spread of data I plotted the chart of signups over time for comparison with the chart I generated for my previous blog post. Comparing the two it looks like I have good coverage of the life of Digg:  I then plotted the registration times from the Digg API on a simple heat map: the x-axis is the hour during the day that the user registered (US West Coast time) and the y-axis is the month and year. Here's the map:  The brighter the red, the more people signed up during that hour. Note that brightness is calculated on a per-row basis so that although the number of users has increased each row is considered on the same scale (0-255). In 2004 Digg was really only just starting and the early developers were clearly not morning people :-) It's very obvious looking at this that Digg's audience has not changed geographically greatly since 2005. There's a strong band of signups between 0600 and 1500 US West Coast time. That indicates to me that Digg's audience is overwhelmingly US based. If you consider the 4 US time zones, and imagine people are working between 0900 and 1800 (most web surfing is done at work) then it's easy to draw a little chart that shows when you'd expect the peak to be. The following chart shows the number of US time zones that are working relative to US West Coast time:  That corresponds very nicely to the hot zone of Digg registrations. It's also obvious the US registrations vastly outweigh UK and European. The UK is 8 hours ahead of the US West Coast, and most of Europe is 9. So at midnight in California it's 0800 in the UK and 0900 in Europe. Looking at the blackness of night in the heat chart indicates that there aren't that many European Digg users. So, my summary is that I think Digg remains mostly a US phenomenon with most of the users signing up while at work. Thank you Digg for providing such a nice API. And if anyone else has suggestions for data I can poke at, please email me. Labels: pseudo-randomness
How many users does Digg have?
According to my calculations: around 2.7m registered users. I obtained this number by finding random Digg users and extracting their user id. The user id is in a hidden HTML form input field on each Digg user's page. The Digg user page also gives their date of registration. Using this I was able to plot every month from December 2004 (when Kevin Rose registered) up to this month. Here's a picture:  Of course, the user id might be being generated in some other way, but I suspect that it's an auto_increment field in their database. To validate my data I checked the Digg blog and in March 2007 Kevin Rose wrote that Digg had 1 million users. This ties directly with my data, and hence I think I'm correct. If this data is correct then looking at the last 6 months Digg is growing at a rate of around 110,000 users per month. Now that doesn't say how many are active, but it's interesting data nonetheless. Observations: 1. The inflection point at June 2006 corresponds to the launch of Digg v3 which marks the point that Digg stopped being just about technology and added categories like World and Business and Entertainment. 2. Kevin Rose was on the cover of Business Week in August 2006. 3. Another calibration point is this post listing the number of Digg users as around 700,000 in December 2006. That also matches my data. 4. There's a big jump of users in December 2006 when Digg added multimedia features like videos. 5. The May 2007 jump corresponds to the AACS key "controversy". 6. Yet more confirmation that these figures are correct. TechCrunch reports that Digg had 200,000 users in March 2006. 7. "Kurt" pointed out that the Digg API gives a user count of 2,211,964. That's interesting because it gives us a way to estimate the number of spammer/abuser accounts banned by Digg: about 500k or around 19% or registered users. Labels: pseudo-randomness
Is Digg really worth $300m?
There's a rumor going around that Digg is trying to sell itself to somebody (anybody?) for $300m. That seems like a lot of money to me so I went searching for comparative statistics. Prior to taking their series B financing, Digg was rumoured to be looking for (and failing to find) a buyer at $150m. Back in 2005 News Corp. bought Intermix for $580m. Intermix spent $69m of that money to acquire the remaining 47% of MySpace that it didn't already own. The company had 27 million unique users per month at the time of acquisition. So they paid roughly $22 per user. Now take a look at Digg today with about 17.8m unique visitors. Using that same metric Digg would be valued at $392m. But now look at the attention span of those people on the site. Visitors to MySpace.com look at an an average of 37 pages per visit whereas Digg users look at 5. In terms of time Digg users spend 0.04% of their time on Digg, whereas MySpace engages users for 7.3% of their time. So > 7 times the pages viewed, plus 180x more time spent on MySpace than Digg. Now, the average stay at Digg is 2m25s and MySpace is 24m26s. So people spend 12 times longer on MySpace than on Digg. So MySpace is way more engaging that Digg, which isn't surprising given that Digg is place you go to go someplace else. You're unlikely to spend long there. So how do you discount the $392m based on engagement..? By % of time spent there Digg is worth $2m, by length of stay it's $32m, by pages viewed it's $56m. Now, Digg has taken about $11m in VC money so assuming participating preferred etc. the VCs aren't going to be happy below that $32m number. My guess is that Digg is worth somewhere between $50m and $100m. Of course, I don't know what their revenues are and if they are making a lot on advertising then those numbers could change. The rumour has always been that Digg users don't click on ads so their revenue might not be spectacular. Another telling statistic on Digg is its Alexa rank which dropping fast. And if they are out there looking to be sold the price will be depressed. $300m: too high IMHO. Labels: pseudo-randomness
The Seven Point Scale
For some time now I've noticed the scoring things on a scale of 1 to 7 seems to be a good way of evaluating some analogue or continuous phenomenon. I was first introduced to 7 point scales by John Ousterhout (who's best known as the Tcl instigator). John likes to use a 7 point scale to evaluate interviewees as follows:
| 1 | Worst candidate imaginable. I will quit if you hire them. |
|---|
| 2 | Very negative. Will argue strongly against hiring. |
|---|
| 3 | Negative. |
|---|
| 4 | Totally ambivalent about this candidate. |
|---|
| 5 | Positive. |
|---|
| 6 | Enthusiastic. Will argue strongly to hire. |
|---|
| 7 | Best candidate imaginable. I will quit if you don't hire them. |
|---|
And we used to look for candidates with 5 and above votes from all interviewers and at least one 6. We did once have someone vote 7 on a candidate, but it's very rare to see 1 or 7. Turns out that seven point scales are not that uncommon. Kinsey used one in defining types of sexuality:
| 1 | Exclusively heterosexual. |
|---|
| 2 | Predominantly heterosexual, only incidentally homosexual. |
|---|
| 3 | Predominantly heterosexual, but more than incidentally homosexual. |
|---|
| 4 | Equally heterosexual and homosexual. |
|---|
| 5 | Predominantly homosexual, but more than incidentally heterosexual |
|---|
| 6 | Predominantly homosexual, only incidentally heterosexual |
|---|
| 7 | Exclusively homosexual. |
|---|
And there's actually research into the accuracy of 7 points scales. See for example this report that indicates that 7 point scales give as much information as 10 point scales when rating happiness. And here's another paper recommending seven point scales for measurement. And then there's the Likert scale which has been in use since the 1930s which has 5 and 7 point variants. Seven point scales are neat because they have a clear middle point and between the middle and end points there are just two choices. That gives them to capture variations in opinions without presenting too many choices (leading to vacillation) or too few (meaning that too much data is lost). I'm using them lots of different places: most recently in the votes on books I've recently read. Labels: pseudo-randomness
Transitive decay in social networks
When I was doing research in computer security we often used to say "trust isn't transitive". What we meant was that if Alice trusts Bob and Bob trusts Charlie then we can't assume that Alice trusts Charlie. Another way to think of this is to look at your own friends and friends of friends; do you trust the friends of friends? It's likely that you do not (if you did then it's likely that they would actually already be a friend and not a FOAF). Clearly, trust is not a constant value across all friends, so each of your N friends will have a trust value ("how much you trust them"), which I'll call Ti, assigned to them. A friend you'd trust with your life has Ti = 1 (perhaps they're a candidate for a BFF), and a friend you don't trust at all has Ti = 0. (I'll ignore the question of why you even have friends with Ti = 0, but in the context of computer social networks you probably do have some). In social situations we are only exposed to this FOAF trust problem occasionally, but with 'social networking' a current web buzzword we see social networks, or social graphs, and can traverse them. Many web sites are trying to use this graph traversal to build services (e.g. LinkedIn allows you to send requests across the network, or ask questions; Facebook is hoping that graph traversal will be the new application distribution method). But any graph traversal suffers from the FOAF trust problem. In a social network online this gets expressed by statements like "Just because Alice likes the Werewolf application and shares it with Bob and Charlie is friends with Bob, that doesn't mean that Charlie wants to be a Werewolf", or "A message crossing between more than one hop won't get passed all the time". I dare say that LinkedIn, Facebook and others could actually characterize the rate at which the FOAF attenuates messages (be they actual messages, application installations, or any other meme) passing through the network. I'm going to posit that the amount of trust a user would place in a FOAF (and a FOAFOAF, a FOAFOAFOAF, ... ) decays rapidly with the number of FOAF hops traversed. Intuitively, if Alice trusts Bob with Talice,bob and Bob trusts Charlie with Tbob,charlie then how much does Alice trust Charlie? Less than she trusts Bob because Charlie is not her friend, and she can only evaluate Charlie based on Bob's recommendation. The more she trusts Bob the more she should trust Charlie. So some sort of estimate Talice,charlie is created from Talice,bob and Tbob,charlie taking into account these trust estimates. A simple combination would be Talice,charlie = Talice,bob * Tbob,charlie (this assumes quite the opposite of the original declaration above: here trust is transitive to a certain degree). The problem with this is that it treats all trust relationships as having equal weight, no matter how far they are from the original person (in this case, Alice). Imagine the case where Alice trusts Bob with Talice,bob = 1, Bob trusts Charlie Tbob,charlie = 1. This formula gives Talice,charlie = 1 which would probably not reflect most people's intuitive grasp of trust. If in addition, Charlie trusts Dave with Tcharlie,dave = 1 then we get Talice,dave = 1 which seems even more unlikely. What's needed is a way to decay trust the further apart people are. One way to to this is for each person to have their own damping factor the encodes how much they trust another person's trust. So Alice might trust other people's recommendations with factor Dalice (in the range [0,1]). The formula would be updated to have Talice,charlie = Dalice * Talice,bob * Tbob,charlie Talice,dave = Dalice * Talice,charlie = Dalice * Talice,bob * Dbob * Tbob,charlie * Tcharlie,dave But that's still essentially linear. I think trust looks more like an inverse square law so that distance is explicitly encoded. With that Talice,charlie = Dalice * Talice,bob * Tbob,charlie / 1^2 Talice,dave = Dalice * Talice,charlie / 2^2 = Dalice * Talice,bob * Dbob * Tbob,charlie * Tcharlie,dave / 4 This seems to fit better intuition because trust of distant people drops away very rapidly. Now, since this is only a hypothesis it would be interesting to measure the reach of messages passing inside a social network to look at the actual 'pass it on' rates to see if they match intuition. Anyone out there got lots of social network data I could analyze? Perhaps there's a Facebook application developer who's tracked enough invite/install data that this could be verified. Labels: mathematics, pseudo-randomness
Double-checking Dawkins
I've been reading through Richard Dawkins' books and am currently half way through The Blind Watchmaker (2006 paperback edition) and on page 119 he writes: In my computer's ROM, location numbers 64489, 64490 and 64491, taken together, contain a particular pattern of contents---1s and 0s which---when interpreted as instructions, result in the computer's little loudspeaker uttering a blip sound. This bit pattern is 10101101 00110000 11000000.
Of course, this piqued my curiosity. Did Dawkins just make that up, or is this really the contents of a specific bit of memory on a specific computer? The book was first published in 1986, so I just had to figure out what it was. Starting with the instructions and converting to hex we have AD 30 C0. Now, considering the main processors around at the time there are three possible interpretations of these three bytes: Z-80/8080 XOR L ; JR NC C0 6502: LDA C030 6809: JSR 30C0 The first didn't look at all plausible, but both the other two do. The 6809 looks like it could be calling a sub-routine at location 30C0 and the 6502 would work if C030 were actually memory-mapped I/O and a read was necessary to cause the blip. I couldn't find any machine with a meaningful interpretation of the 30C0 location on a 6809 machine, but 6502 turned out to be a different story. On an Apple ][ memory in the range C000-C0FF is mapped to various bits of I/O: The memory space on the Apple II between $C000 and $CFFF was assigned to handle input and output. From $C000 to $C0FF the space was reserved for various soft-switches used to control the display, and various built-in I/O devices, such as the keyboard, paddles, annunciators, and the cassette port.
Poking around further I discovered that a read from C030 (known as SPKR) will blip the speaker on an Apple ][. So Dawkins is telling the truth and he's using an Apple ][. So returning to the memory locations what program is he talking about? The three memory locations are FBE9, FBEA and FBEB. That turns out to be inside the Monitor ROM on an Apple ][ (the stuff that Woz wrote) and happily a complete ROM listing was provided with the Apple ][. So looking in a scanned Apple ][ manual we find that those locations are inside the BELL2 routine. Specifically on page 163 of the manual we find the following: FBE4: A9 0C 597 BELL2 LDA #$0C TOGGLE SPEAKER AT FBE6: 20 A8 FC 598 JSR WAIT 1 KHZ FOR .1 SEC FBE9: AD 30 C0 599 LDA SPKR FBEC: 88 600 DEY FBED: D0 F5 601 BNE BELL2 FBEF: 60 602 RTS2B RTS
The line in bold is exactly the code that Dawkins is referrring to. So, while writing this famous text on evolution, Dawkins had time to poke around inside the Apple ][ ROM and write about it in his book. Respect. Labels: pseudo-randomness
Windows just likes to mess with me
I don't use Windows very much, but when I do my trusty Windows 2000 SP4 VM comes in handy (I guess now that Vista is out I should upgrade to XP). This morning I installed a big chunk of Windows updates that needed dealing with and Windows said:  Just why is the Restart Now button greyed? :-) Labels: pseudo-randomness
Steve Gibson's PPP... new version 3 in Java and C
If you've been following along you'll know that I've implemented Steve Gibson's PPP system in Java and C. The Java code is in the form of a CLDC/MIDP project for cell phones and the C code is a command-line tool. Steve's made a major change to the algorithm which he's calling PPPv3. My updated code is here: The C source code is available in ppp3-c.zip. The Java source code is available in ppp3-java.zip. The compiled Java is available in ppp3.jar. Read my original blog posts for details.  Labels: pseudo-randomness, security
Cryptographically and Constantly Changing Port Opening (or C3PO)
In another forum I was just talking about a little technique that I came up with for securing a server that I want on the Internet, but to be hard for hackers to get into. I've done all the right things with firewalling and shutting down services so that only SSH is available. But that still leaves port 22 sitting there open for someone to bang on. So what I wanted was something like port knocking (for an introduction to that you can read my DDJ article Practical Secure Port Knocking). To avoid doing the classic port knocking where you have to knock the right way to open port 22 I came up with a different scheme which I call Cryptographically and Constantly Changing Port Opening or C3PO. The server and any SSH client who wish to connect to it share a common secret. In my case that's just a long passphrase that we both know. Both bits of software hash this secret with the current UTC time in minutes (using SHA256) to get 256 bits of random data. This data changes every minute. The 256 bits gives me 16 words of random data. Those 16 words can be interpreted as 16 different port numbers. Once a minute the server reconfigures iptables to open those 16 ports forwarding one of them (which corresponds to word[0] in the hash) to SSH and the other 15 to a blacklist service. At any one time 16 ports are open (i.e. respond to a SYN) with only one being SSH and the other 15 being a trap to be sprung by an attacker. The 16 ports change once a minute. Since both sides can compute the hash the client is able to compute where the SSH server is residing at that moment and contact it. Once contact is established the connection remains open for the duration of the session. New sessions, of course, will need to recompute the hash once a minute. The blacklist service serves to tarpit an attacker. Any connection attempt to one of the other 15 sockets causes the IP address of the attacker to be blacklisted (again an iptables change) which means that hitting any of the 15 ports causes the attacker to shut off their access to the SSH server for the next 15 minutes. A casual NMAP of my machine gets your IP address blacklisted and shows up a random selection of open ports. A real user connects to the SSH server first time because they know where it resides. Of course, this doesn't replace making sure that the SSH server is up to date, and that passwords are generated carefully, but it seriously frustrates a potential attacker. If this is of interest to others I'd be happy to release my code (which is currently in Perl) as a GPL2 project. Labels: mathematics, pseudo-randomness, security
The effect of trying to save electricity
Over the last couple of years I've tried to cut down electricity use in two simple ways: 1. Gradually replace incandescent light bulbs with low power fluorescents. I only do this when an incandescent bulb dies, currently I'd estimate that half the bulbs in the house are converted. 2. Switching off appliances that are not in use. This has been achieved by grouping appliances on power strips with physical on/off switches. When not in use I can kill off the TV and related appliances, the Internet and all networking, computers and peripherals. I've not got a couple of years of data from electricity bills. For the first 10 months of 2006 I used 2,821 kWh of electricity, for the same period on 2007 (during which I've implemented 1 and 2 above) I've used 2,345. That's 83% of the amount used in 2006. And that's despite an increase in electricity use during the winter of 2007. Here's a chart which shows the trend.  Labels: pseudo-randomness
PPPv2 in Java and C
Recently, I released C and Java versions of Steve Gibson's PPP system for password generation. Steve updated the algorithm to PPPv2 which uses a different hash (SHA256 instead of SHA384) and a slightly different plain text generation algorithm (see the PPP pages for details). I've now updated my code to PPPv2 and am releasing it here. The C source code is available in ppp-c.zip. The Java source code is available in ppp-java.zip. The compiled Java is available in ppp.jar. Read my original blog posts for details. All source code is released under the BSD License. Labels: mathematics, pseudo-randomness
A Java client implementation of Steve Gibson's PPP
I recently produced an open source implementation of Steve Gibson's Perfect Paper Passwords system in C. It occurred to me that a better implementation would be a Java client for my mobile phone (thus eliminating the need for printing and carrying the paper passwords). Here's my PPP client implementation running on my Motorola RAZR. It's written in Java using CLDC 1.0 and MIDP 2.0.   You can download and install the JAR file. The current version is 1.0.0. Labels: mathematics, pseudo-randomness
An open source implementation of Steve Gibson's PPP algorithm
Steve Gibson has come up with a simple two-factor password scheme that relies on printed cards of passcodes generated using a combination of SHA-384 and Rijndael. The idea is that a system could prompt the user for one of the passcodes in addition to their normal password. Steve calls this his Perfect Paper Passwords system and has given a detailed description of the algorithm. As usual he's released code written in assembly language as a DLL for Windows. He hasn't released his source code (he never does), so I thought it would be interesting to write my own implementation of his algorithm. Here's the C code: #include <sys/time.h> #include <string.h> #include <stdio.h> #include <stdlib.h>
#include "rijndael.h" #include "sha2.h"
#pragma pack(1)
typedef unsigned char Byte;
typedef union __Passcode { unsigned long as_long; struct { Byte byte[4]; } bytes; } Passcode;
typedef struct __PasscodeString { char character[5]; } PasscodeString;
typedef unsigned long long SixtyFour;
typedef union __OneTwoEigh
|