CZ Talk:Statistics

From Citizendium
Revision as of 04:32, 2 August 2007 by imported>Aleksander Stos (→‎Word count)

This is fantastic--I think I can speak for everyone when I say that this is very much appreciated, Alex. --Larry Sanger 09:59, 18 May 2007 (CDT)

Definitely Cool! Keep them coming. --Matt Innis (Talk) 10:32, 18 May 2007 (CDT)

Thanks. Needless to say, everybody is invited to edit this page, add their own work, request further improvements, comment on the page, suggest new ideas, etc. --Aleksander Stos 14:36, 18 May 2007 (CDT)
Terrific page, very informative. Maybe you could put in a slightly clearer explanation of #6. Is this the number of users that log in each day, averaged over a month? BTW it would be easier to edit if you could break the page up into sections. (Maybe I'll try and you can revert if you don't like it.) I think the main data that stands out as missing would be the number of visitors to the site, which could be presented in different ways. David Hoffman 16:09, 18 May 2007 (CDT)
#6 is the number of users who actually made an edit in a given month (no average, just a count). I have now added a line of explanation, but I'm not sure about it -- please do correct it if you find a good formulation! BTW, the closest thing to "the number of users that log in each day, averaged over a month" is the "daily use" section, but it concerns actual edits instead of logins (no login info is publicly available). Thanks for your remarks! --Aleksander Stos 16:47, 18 May 2007 (CDT)
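The definition of #6 given above (distinct users with at least one edit in a month, no averaging) can be sketched as follows. This is only an illustration: the function name and the (username, timestamp) input layout are assumptions, not the format of the actual edit-history dump.

```python
from collections import defaultdict
from datetime import datetime


def active_editors_per_month(edits):
    """Count distinct users with at least one edit in each month.

    `edits` is an iterable of (username, timestamp) pairs. This layout is
    hypothetical and chosen only for the illustration.
    """
    users_by_month = defaultdict(set)
    for user, timestamp in edits:
        # A set keeps each user once per month, however many edits they made.
        users_by_month[timestamp.strftime("%Y-%m")].add(user)
    return {month: len(users) for month, users in sorted(users_by_month.items())}


# Made-up sample: Alice edits twice in April but is counted once.
sample = [
    ("Alice", datetime(2007, 4, 3)),
    ("Alice", datetime(2007, 4, 20)),
    ("Bob", datetime(2007, 4, 21)),
    ("Alice", datetime(2007, 5, 1)),
]
print(active_editors_per_month(sample))
```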
It is great seeing some of these stats. It gives me a good feel on how alive CZ is and how it is growing. Robert Winmill 16:12, 18 May 2007 (CDT)

Source file

To produce these stats, on May 18 I dumped the histories of edits of all pages. This was transformed into an xml-like file in a format similar to "stub-meta-history" dump files released by Wikipedia. I'm willing to share the data with the interested CZ members, so if you want to make your own stats, just let me know on my talk page. --Aleksander Stos 17:43, 18 May 2007 (CDT)

Further development

I've just put some fresh data (I plan to update graphs too). Please do copy edit. Perhaps reorganisation of headers/text would be needed as well. I'm willing to feed similar info in future and the present structure is not well suited for this. --Aleksander Stos 02:23, 10 June 2007 (CDT)

I love this, Alexander. Do you think you could graph articles by workgroup? Editors by workgroup? If it's not too time consuming, I'd like to see the progressions over time. Nancy Sculerati 02:23, 10 June 2007 (CDT)

Good idea! Give me a few days.--Aleksander Stos 14:32, 10 June 2007 (CDT)
OK, here it goes. I hope that's what you requested. But do not hesitate to point out if something should be improved. --Aleksander Stos 08:48, 11 June 2007 (CDT)

This is amazing, again--and getting better. I'd like to note that May was actually one of our first months where nothing was happening that would "skew" the statistics. November, launch. January and February, self-registration. March (late), launch. April, aftereffects of launch. Moreover, May is one of the busiest months of the school year. --Larry Sanger 08:56, 11 June 2007 (CDT)

I can only second. Alexander, do you have any way to estimate readers? After all, that is really our ultimate goal, to provide for the reader, rather than to exist for the user. Is that possible to know? Nancy Sculerati 09:12, 11 June 2007 (CDT)
You're right. A similar question was raised before, see here. Unfortunately, I have no data. Access logs to CZ servers are not publicly available. I do not even know what they look like or how big they are (if I had such a file I could try my luck). Nevertheless, some _relative_ (comparative) stats exist. They are produced by specialized enterprises like Alexa and rely on data provided by an "army" of users of AlexaBar or something similar. According to them, the Citizendium is number 4 in the world of "open content encyclopaedias" (look at this) and for a few days after launch we were more popular than Britannica ;-). However, it looks like we have to work hard to stay high in the ranking. BTW, you can easily recognize the "big vandalism era" and the launch on the graphs. --Aleksander Stos 12:19, 11 June 2007 (CDT)

Forums

Is there any way data can be compiled from them? That's where nearly all meta-discussion occurs. --Stephen Ewen (Talk) 21:46, 24 July 2007 (CDT)

The forum has its own statistics page, see here. It could be linked from here (done). --Aleksander Stos 03:04, 25 July 2007 (CDT)

Word count

Any easy way to make a total CZ (main space) word count? --Larry Sanger 00:06, 28 July 2007 (CDT)

Yes, at least approximately -- I put it on my todo list. BTW, it would be great if we had some database dumps/backups, something like this --Aleksander Stos 07:54, 28 July 2007 (CDT)

Well, I counted that. It seems safe to assume that CZ has over 4,200K words in the mainspace. I worked on the "raw" wikitext (i.e. what we see while editing). The templates were cut out (and with them tables, boxes etc.), as well as obvious technical parts (categories, images, www links). I did count refs and headings, however. Just a choice. Comments/questions regarding methodology welcome! If what has been done appears to be reasonable, we can put it on the page. --Aleksander Stos 08:01, 30 July 2007 (CDT)
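For readers who want to experiment with the methodology described above, here is a minimal sketch in Python. It is not the script actually used: the regular expressions, the function names, and the treatment of internal links (keeping only the visible label) are assumptions made for the illustration.

```python
import re

# All patterns below are illustrative assumptions, not the original script.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")  # innermost {{...}} only
CATEGORY_OR_IMAGE = re.compile(r"\[\[(?:Category|Image):[^\[\]]*\]\]", re.IGNORECASE)
BRACKETED_URL = re.compile(r"\[https?://[^\]]*\]")
BARE_URL = re.compile(r"https?://\S+")
INTERNAL_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]")
REF_TAG = re.compile(r"</?ref[^>]*>")


def strip_templates(text):
    """Remove templates, repeating so that nested ones vanish from the inside out."""
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub(" ", text)
    return text


def count_words(wikitext):
    """Count words in raw wikitext, keeping headings and reference text."""
    text = strip_templates(wikitext)
    text = CATEGORY_OR_IMAGE.sub(" ", text)
    text = BRACKETED_URL.sub(" ", text)
    text = BARE_URL.sub(" ", text)
    text = INTERNAL_LINK.sub(r"\1", text)  # keep the label, drop the target
    text = REF_TAG.sub(" ", text)          # drop the tags, keep their content
    return len(re.findall(r"\w+", text))


sample = (
    "== History ==\n"
    "{{Infobox|name={{PAGENAME}}}}\n"
    "The [[cat|domestic cat]] is small.<ref>Smith 2001</ref>\n"
    "[http://example.org Example site]\n"
    "[[Category:Animals]]\n"
)
print(count_words(sample))  # 8: History, The, domestic, cat, is, small, Smith, 2001
```

A known limitation of this sketch: an image link whose caption itself contains an internal link is not removed cleanly, so real wikitext would need a more careful parser.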

Excellent--thanks! Very interesting, too, so what's the average word count per article? I assume you're counting more than just the words contained in CZ Live articles? --Larry Sanger 08:31, 30 July 2007 (CDT)

About 1250 words/article (3351 pages as listed on Special:Allpages -- subpages and disambigs included, redirs, obviously, skipped). But, just like in the case of salaries in a population, the average does not tell us that much. The distribution of article length follows a power law: there are many short articles and relatively few extremely long ones. In such a situation the median appears to be more meaningful. Our median is 552, which means that half of our articles are longer than that and half are shorter. BTW, we have about 2300 articles longer than 250 words and 2600 articles longer than 150 words (the rest being really short stubs, disambigs and a couple of almost empty experimental subpages). --Aleksander Stos 12:10, 30 July 2007 (CDT)
Still working on it; results to be announced soon (i.e. put on the page). The devil is in the details. I'm fine-tuning the regular expressions in the script -- and thus the definition of a 'word' in the jungle of wikisyntax (e.g. does the number 45,67 count as one word or two? what about U.S.? do indefinite articles count?). This does not change much. The basic question is whether at the present stage the subpages are to be included or not. They are either almost empty or very long drafts - "copies" of their 'main article'. If they are excluded, then we have about 4000K words (a bit less) with the median 568 (a bit better). This seems to be more accurate. Just thinking aloud. Aleksander Stos 10:03, 31 July 2007 (CDT)
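The gap between mean and median in a skewed distribution can be seen on a toy example. The article lengths below are made up for the illustration and are not CZ data; the function name is likewise an assumption.

```python
from statistics import mean, median


def summarize(lengths, thresholds=(150, 250)):
    """Return the mean, the median, and how many articles exceed each threshold."""
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "longer_than": {t: sum(1 for n in lengths if n > t) for t in thresholds},
    }


# Made-up word counts: many short articles and one very long one.
lengths = [40, 120, 300, 550, 560, 900, 1500, 12000]
print(summarize(lengths))
```

Here the single 12000-word article pulls the mean up to 1996.25, while the median stays at 555, much closer to what a typical article in the sample looks like.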
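To make the "what is a word" question concrete, here is a small sketch comparing two possible definitions. Both patterns are assumptions written for this illustration, not the regular expressions from the actual script.

```python
import re

# Definition A (assumed): every run of letters or digits is a word.
NAIVE = re.compile(r"\w+")

# Definition B (assumed): keep numbers with internal separators (45,67)
# and dotted abbreviations (U.S.) together as single words.
COMPOUND = re.compile(r"\d+(?:[.,]\d+)*|(?:[A-Za-z]\.){2,}|\w+")


def count(pattern, text):
    """Count the non-overlapping matches of `pattern` in `text`."""
    return len(pattern.findall(text))


sentence = "The U.S. spent 45,67 a year"
print(count(NAIVE, sentence))     # 8: The, U, S, spent, 45, 67, a, year
print(count(COMPOUND, sentence))  # 6: The, U.S., spent, 45,67, a, year
```

On ordinary prose the two definitions disagree only on such special tokens, which is consistent with the remark that the choice does not change the totals much.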

Two things: indefinite articles (words!) always count, I thought. Also, anything of the form X/Draft should not be counted if X is other than a redirection page. All other subpages of the main namespace should be counted. --Larry Sanger 10:09, 31 July 2007 (CDT)
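The inclusion rule proposed above can be written down as a small filter. This is a sketch under assumptions: the function name is hypothetical, page titles are plain strings, and the set of redirect titles is supplied by the caller. Redirect pages themselves are skipped, as mentioned earlier in the thread.

```python
def pages_to_count(titles, redirects):
    """Select the main-namespace pages that enter the word count.

    Rule sketched here: skip redirects, skip X/Draft unless X is a
    redirect, and keep every other page and subpage.
    """
    counted = []
    for title in titles:
        if title in redirects:
            continue  # a redirect carries no article text of its own
        parent, separator, leaf = title.rpartition("/")
        if separator and leaf == "Draft" and parent not in redirects:
            continue  # draft copy of a real article: would be counted twice
        counted.append(title)
    return counted


# Made-up titles: "Life" is a redirect, so "Life/Draft" is kept.
titles = ["Biology", "Biology/Draft", "Biology/Gallery", "Life", "Life/Draft"]
print(pages_to_count(titles, redirects={"Life"}))
```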

Thanks for the hint. Well, for the global word count I think it's OK to do as you suggest and, clearly, the drafts should simply be excluded anyway. But when it comes to the question "how long is our average article", adding all the subpages systematically biases the result. I mean that at present there are many empty placeholders. Furthermore, some *standard* subpages, such as galleries and tables, will always be (almost) "empty" for the word count procedure as it stands. Also, the links pages, if counted separately, bring a systematic bias -- the comments on the links will on average always be shorter than the associated 'main' article. I feel that giving the links the same "weight" as the 'main' article results in an inaccurate average (median). Either we simply skip the links or we *concatenate* them to their main page (i.e. we count clusters, not individual subpages). The latter seems to be the right approach, i.e. this IMHO would give the best answer to the "how-long-are-articles" question. In this case, the answer is about 4100K words (total) and the median length 562. If we count each subpage separately, the resulting median would be 517 -- the difference is not negligible and, frankly, I think 562 corresponds better to what we could label as our "average" article seen on the screen. Furthermore, the 'cluster' method would allow meaningful comparisons to other wikis (that put everything on the same page). Aleksander Stos 05:32, 2 August 2007 (CDT)
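The difference between counting each subpage separately and counting clusters can be illustrated with a short sketch. The titles and word counts are made up, the function name is an assumption, and drafts are taken to have been excluded beforehand.

```python
from collections import defaultdict
from statistics import median


def cluster_lengths(page_lengths):
    """Add the word count of every subpage to that of its main page."""
    totals = defaultdict(int)
    for title, words in page_lengths.items():
        root = title.split("/", 1)[0]  # "Biology/Links" belongs to "Biology"
        totals[root] += words
    return dict(totals)


# Made-up word counts per page.
pages = {
    "Biology": 900, "Biology/Links": 60, "Biology/Gallery": 0,
    "Chemistry": 500, "Chemistry/Links": 40,
    "Physics": 300,
}
clusters = cluster_lengths(pages)

print(median(pages.values()))     # every subpage weighted like an article
print(median(clusters.values()))  # one value per cluster
print(sum(pages.values()) == sum(clusters.values()))  # the total is unchanged
```

In this toy sample the per-page median is 180 while the per-cluster median is 540, and the total word count is the same either way. One caveat of the sketch: a main-namespace title that merely contains a slash would be treated as a subpage.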