Silk Road forums

Discussion => Security => Topic started by: n1ll0 on January 10, 2013, 09:05 am

Title: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: n1ll0 on January 10, 2013, 09:05 am
I came across an interesting article today that makes me reconsider making long posts. What do some of you security experts out there think of this concept of identifying and de-anonymizing anonymous forum users using statistical analysis of writing?

**Clearnet**
http://www.scmagazine.com.au/News/328135,linguistics-identifies-anonymous-users.aspx
**Clearnet**

Granted, the techniques require very large writing samples, and would depend on the anonymous user also posting prolifically in a public capacity. Still, it makes you wonder whether there is enough random material out there to identify you.
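
To make the idea concrete, here is a minimal sketch of one common stylometric approach: compare how often two texts use a handful of function words. This is not the method from the linked article. The word list, the function names, and the choice of cosine similarity are all assumptions made for illustration, and real systems use far more features and far larger samples.

```python
import math
import re
from collections import Counter

# Small, hypothetical list of function words (an assumption for this sketch).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "but", "however", "also"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity(text_a, text_b):
    """Similarity of two texts based only on function-word usage."""
    return cosine_similarity(profile(text_a), profile(text_b))
```

A score near 1.0 only means the two samples use these few words in similar proportions. With short samples that says very little, which matches the point above about needing very large writing samples.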

Title: Re: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: happy_kitten on January 10, 2013, 12:07 pm
I actually stumbled over this yesterday. I was researching the capture of Sabu, leader of the hacker group LulzSec. That obviously led me to The Jester, the patriotic hacker who has long been opposing LulzSec and Anonymous. He seems to be quite an expert when it comes to de-anonymizing people. It seems like Sabu was at least partly identified by his style of writing.

Here is The Jester's take on it [CLEARNET PDF]: reapersec.files.wordpress.com/ 2012/ 03/ if-i-am-wronge280a6-i_ll-say-i_m-wrong-here_s-my-apology-c2ab-copy.pdf

My account on this forum cannot be connected to my account on the marketplace in any way. That now seems like a good idea.

Title: Re: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: impkin on January 10, 2013, 12:20 pm
Surely there must be a way to develop code that a) throws a thesaurus at your text, and b) tweaks sentence syntax so that your original meaning still comes through without too much distortion? Maybe as simple as running the text through multiple language translations and then back to the original language?

Title: Re: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: spazzmatrazz on January 10, 2013, 01:10 pm
Quote from: impkin on January 10, 2013, 12:20 pm
Surely there must be a way to develop code that a) throws a thesaurus at your text, and b) tweaks sentence syntax so that your original meaning still comes through without too much distortion? Maybe as simple as running the text through multiple language translations and then back to the original language?

Tools like this are already common. They are known as content spinners and are popular with spammers, because they defeat hash-based detection of duplicate messages and also weaken Bayesian spam filters.

They usually aren't very good and can result in your message being misunderstood, so you need to proofread the output.

It may still be worth it though.
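
A minimal sketch of why spinning defeats hash-based duplicate detection: swapping a single word changes the digest completely, even though almost every word stays the same. The example messages, the function names, and the use of SHA-256 and Jaccard word overlap are assumptions for illustration, not a description of how any particular filter or spinner works.

```python
import hashlib
import re

def digest(message):
    """SHA-256 of the exact message bytes, as an exact-duplicate filter might compute."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()

def jaccard(a, b):
    """Share of distinct words that the two messages have in common."""
    words_a = set(re.findall(r"[a-z']+", a.lower()))
    words_b = set(re.findall(r"[a-z']+", b.lower()))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical messages: the second has one word replaced by a synonym.
original = "buy cheap watches at our great store today"
spun = "buy cheap watches at our fine store today"

same_hash = digest(original) == digest(spun)  # False: the digests differ
overlap = jaccard(original, spun)             # 7 shared words out of 9 distinct
```

The same property cuts both ways: a comparison based on word overlap instead of exact hashes would still see the two messages as near-duplicates.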

Examples (clearnet):

http://www.wpcontentspinner.com/

http://www.contentboss.com/

Title: Re: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: SelfSovereignty on January 10, 2013, 01:34 pm
I think it's entirely possible to identify SR forum users this way. If you have a database large enough (IBM's Watson "knowledge" would probably do it), and enough computing power tied together (again, Watson would probably do it)... I have no doubt you could identify half the people here. Frankly, half of my "protection" is that I'm not worth finding. I'm not even a vendor, so why would they give a fuck about finding me? If I was a big fish though, I'd definitely post a whole, WHOLE lot less...

Title: Re: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: n1ll0 on January 10, 2013, 07:22 pm
Quote from: SelfSovereignty on January 10, 2013, 01:34 pm
I think it's entirely possible to identify SR forum users this way. If you have a database large enough (IBM's Watson "knowledge" would probably do it), and enough computing power tied together (again, Watson would probably do it)... I have no doubt you could identify half the people here. Frankly, half of my "protection" is that I'm not worth finding. I'm not even a vendor, so why would they give a fuck about finding me? If I was a big fish though, I'd definitely post a whole, WHOLE lot less...

Exactly my thinking.. haha I have not even purchased anything yet so I am not overly concerned with being identified myself.. but I would think it would be a serious concern for some of the more prolific posters like kfmkewm, pine, DPR, and some others..  This is assuming of course they have made some contributions that are attributed to them IRL.

Title: Re: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: Empathy101 on January 10, 2013, 10:03 pm
http://3suaolltfj2xjksb.onion/hiddenwiki/index.php/Anonymous_Writing_Style

Typographical Style

    When numbering a list, format it like this list has been formatted in this document. Include the dot at the end of every item. If this list is not read on a wiki, the format is “1. Item”.
    If the document's author is to be identified, write their name under the heading without the word “by”, like in this document.
    One space separates sentences. Two spaces is an older, minority typing convention. See how strange it is.
    Write sentences as shortly as possible.
    One clear line separates paragraphs with no indenting.
    Use short paragraphs to group related ideas together.
    Write references to date in long format from specific to general, e.g. “1 January 2000”.

Dialectical Style

    Avoid using BIG WORDS where possible, even when a few big words might actually shorten an otherwise long sentence that is made up of many small words. More people know fewer words. Use the Basic English vocabulary as much as possible. (https://secure.wikimedia.org/wikipedia/en/wiki/Basic_English)
    Never use expressions or sayings that are used in any specific geographical region only. “Our proverbs lie too close to home”. Also, “one man's proverb is another man's confusion”.
    DO NOT use contractions, like “don't”.
    Never use non-English words.
    Should rhetorical questions ever be necessary? DO NOT USE RHETORICAL QUESTIONS. Rather use commands or statements.
    Do not begin a sentence with “and”, "or" or “but”. HOWEVER, rather use “also” or “furthermore”, "alternatively", and “however”, respectively. FURTHERMORE, never begin a sentence with an abbreviation, e.g. “FOR EXAMPLE, rather begin a sentence with the words 'For example' than with the capitalized abbreviation 'E.g.'.” Alternatively, a different word order could be considered.
    When referring to an example, never use “an example”, but rather use something better, FOR EXAMPLE “e.g.” (“example given”).
    IN STEAD OF using “in place of”, use “in stead of” instead. (Notice the spaces in “in stead of” as opposed to “instead”. Also notice, specifically, the respective contexts within which each is used. “Instead” is never followed by “of”.)
    AS LONG AS the phrases “as long as” and “such that” are used SUCH THAT “so long as” and “so that” are replaced by them, then a greater degree of anonymity might be achieved and maintained.
    EveryONE will notice when someONE uses “anybody” in stead of “anyone”. No-ONE will be able to identify anyONE by their writing style when everyONE uses “someone” in stead of “somebody”, etc.
    “THEIR 'theirs' and 'theres' are not THEIRS to mix up THERE. THEY'RE all of ours, for unambiguous interpretation.” (“Their” indicates possession. “ThERE”, similar to “hERE”, is an answer to “whERE?”. “They're” is a contraction of “they are”, which should not be used in any case, according to item 1.2.3.)
    IF making a comparison between things, THEN it is better to use “than” THAN “then”.
    The minority-used "different than" and the majority-used "different from" differ FROM each other, and are therefore different in that the former reduces the anonymity offered by the latter.
    Remember NOT TO SPLIT "to infinitives" by inserting another word. Rather use the applicable word before or after the infinitive, e.g. “It is possible for a writer STILL to be identified ...” in stead of “... to still be identified ...”.
    There is not ANY MORE time left to use “any more” as a reference to time, ANYMORE. Just like “anymore” has never been used as a reference to quantity, ever.
    "FURTHER, to extend this logical argument, 'farther' refers to an increase in physical distance."
    As far as possible, write without first, second and third person references. Where it cannot be avoided, use the first and third person plurals only: “we”, “us”, “our”, “ours” and “ourselves”; and “they”, “them”, “their”, “theirs” and “themselves”, respectively. Never use the second person singular or plural “you”, “your”, “yours” and “yourself” or “yourselves”. When a second person singular reference cannot be avoided, use “one”, “one's” and “oneself” in stead of “you” etc. When referring to a general singular person, it is quite appropriate to refer to THEM in the third person plural in stead of referring to him/her by means of such duplicating male/female slash forms.
    Use words, such as “probably”, “possibly”, “maybe”, “perhaps”, “could”, “should” etc., that refer to possibilities only, as little as possible. State verifiable facts, referring to independent source material. Do not speculate or theorize.
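
Some of the rules in this list are mechanical enough to automate. The sketch below covers only two of them: expanding contractions, and replacing the "-body" pronouns with their "-one" forms. The lookup tables and the function name are hypothetical and deliberately tiny. It only matches straight apostrophes, and it will not catch ambiguous contractions such as "it's".

```python
import re

# Hypothetical, deliberately small tables; a real tool would need far more entries.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "isn't": "is not",
    "won't": "will not",
    "they're": "they are",
}
PREFERRED = {
    "anybody": "anyone",
    "somebody": "someone",
    "everybody": "everyone",
    "nobody": "no-one",
}

def normalize(text):
    """Replace listed contractions and '-body' pronouns, keeping a leading capital."""
    table = {**CONTRACTIONS, **PREFERRED}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in table) + r")\b",
        re.IGNORECASE,
    )

    def swap(match):
        word = match.group(0)
        replacement = table[word.lower()]
        if word[0].isupper():
            replacement = replacement[0].upper() + replacement[1:]
        return replacement

    return pattern.sub(swap, text)
```

For example, `normalize("Don't tell anybody, they're here.")` gives `"Do not tell anyone, they are here."`. The output still needs proofreading, since word swaps alone do not change sentence structure.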

Use American English spelling, punctuation and grammar. More people use it.

Spelling

    TO write “too” without TWO “o”s, is not enough. It is TOO little. (Also compare the first proverb in item 1.2.2.)
    “State their favORite colOR, odOR and flavOR.”
    Critically analyZE other people's non-American “-SE” spelling of words such as "criticiZE" in order to help them increase their anonymity, as well as ours, by conforming to the American "-ZE" spelling.
    PRACTICE not to confuse the noun ("practiSe") with the verb.

Punctuation

    Only use commas when creating a list in a SENTENCE, OR at the end of a quote. (Also use commas before “BUT”, AND to separate clauses and PHRASES, ESPECIALLY if a SINGLE, LONGER sentence is required in order to reduce the repetitiveness of many shorter SENTENCES, LIKE in this case. Commas are also used when addressing a person by NAME, ANONYMOUS, E.G. “Anonymous, texts have been seen that were very difficult to read and interpret unambiguously without these further comma rules.” Consider the significant change in meaning of this quoted sentence that the inclusion or exclusion of a comma can make. Do not use a comma before “and”, "or" OR “etc.” at the end of a list.)
    No dots are included in abbreviations. This includes titles such as Mr, Dr, Mrs etc. An exception to this is at the end of a sentence. Furthermore, “etc.” gets a dot even in the middle of a sentence, as any American English spell checker will indicate. (Note the letter order: it is not “ect.”, “ec tetera”, but rather “etc.”, “et cetera”.)
    Quoting is done with double quotation marks, e.g. “LIKE SO”. Notice that the period came after the end quote. When quoting in a sentence “LIKE SO”, place the comma after the end quote. “...Unless when quoting DIRECT SPEECH,” said the editor, “as is done in this QUOTE.” Multiple, nested quotes should use alternating double and single quotation marks in order to keep track of the level of quotation, e.g. HE said, “SHE said, 'IT said, “BLAH, blah, blah.”'”
    Never use exclamation marks or smilies! ;)
    Rather use a comma (,) in stead of parentheses ( "(" and ")" ), except when parentheses might actually help to clarify layout and/or meaning, e.g. linking to another site. Place the URL in parentheses like this: "(http://kpvz7ki2v5agwt35.onion/)" (without the quotation marks).
    Avoid using DASHES, RATHER use commas. ALTERNATIVELY, start a new sentence. (Proper spelling and grammar require some HYPHEN-WORDS to be hyphenated.)
    Use date and time short format as follows: yyyy-mm-dd hh:mm:ss.dcm; consistently from general to specific, for ease of interpretation. Note the hyphens, the colons and the decimal point, for ease of reading. Use “BCE” and “CE” (“Before Common Era” and “Common Era”) after a date, in stead of “BC” and “AD”.
    Use a decimal point (“1.234”) in stead of a comma (“1,234”) to indicate decimal fractions.
    Use a comma to separate thousands for ease of reading (“1,234,567.890”: “one million, two hundred and thirty-four thousand, five hundred and sixty-seven, point eight nine zero”).
    “There are SEVENTY-TWO words for the numbers from "TWENTY-ONE" to "NINETY-NINE", that are not multiples of *10*, which must be written with hyphens.” (Use number words to count things, but number symbols to refer to specific numbers.) (Also see item 1.4.6.)
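
The date and number formats in this list can be produced with Python's standard formatting, as in the following sketch. The function names are made up for illustration, and reading ".dcm" as a three-digit decimal fraction of a second is an assumption. The last function simply checks the count of seventy-two hyphenated number words.

```python
from datetime import datetime

def short_timestamp(moment):
    """Format as yyyy-mm-dd hh:mm:ss.mmm, from general to specific."""
    # %f yields six digits of microseconds; keep only the first three.
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]

def grouped_number(value, decimals=3):
    """Comma as thousands separator, point as decimal separator."""
    return f"{value:,.{decimals}f}"

def hyphenated_number_count():
    """Count the numbers from 21 to 99 that are not multiples of 10."""
    return sum(1 for n in range(21, 100) if n % 10 != 0)
```

For example, `short_timestamp(datetime(2013, 1, 10, 9, 5, 0, 123000))` returns `"2013-01-10 09:05:00.123"`, `grouped_number(1234567.89)` returns `"1,234,567.890"`, and `hyphenated_number_count()` returns `72`.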

Grammar

    A “U”-WORD or acronym does not get an “an” before it when it is pronounced with the “y” consonant sound (e.g. “a UCLA-student”) in stead of being pronounced with the “oo” vowel sound (e.g. “an ulema”).
    An acronym beginning with F, H, L, M, N, R, S or X, gets an "an" before it, because the pronunciations of the names of all these consonant letters begin with a vowel sound: "ef", "aytch", "el", "em", "en", "ar", "es" and "ex".
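
The two grammar rules above reduce to a lookup on the first letter, as long as the acronym is read out letter by letter. That is an assumption of this sketch: acronyms pronounced as words, and dialects that say "haitch" for H, need separate handling. The names below are hypothetical.

```python
# Letters whose spoken names begin with a vowel sound ("ay", "ee", "ef", "aytch", ...).
# U is left out because its name is pronounced "you", which begins with a consonant sound.
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")

def article_for_initialism(acronym):
    """Return 'a' or 'an' for an acronym that is spelled out letter by letter."""
    if not acronym or not acronym[0].isalpha():
        raise ValueError("expected an acronym starting with a letter")
    return "an" if acronym[0].upper() in VOWEL_SOUND_LETTERS else "a"
```

For example, `article_for_initialism("FBI")` returns `"an"`, while `article_for_initialism("UCLA")` returns `"a"`.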

Conclusion

This list contains only some of the most common spelling, punctuation, grammar and style errors on the Web.

It seems ironic that the established rules for proper spelling, punctuation, grammar and respectable style, which are often accused of "restricting freedom of expression", now just exactly facilitate greater anonymity and freedom of speech. Lack of knowledge reveals identity by the same mistakes being made repeatedly in ignorance. If a set of rules could be made and applied to consistently make the same mistakes without reason, then surely it must also be possible to apply the universally accepted rules, with all their exceptions, as consistently as the mistakes have been made. Simply change the set of rules for consistently making the same mistakes to the set of rules for consistently keeping the universal rules.

Title: Re: De-Anonymization Through Statistical Analysis of Writing Samples
Post by: wasta on January 10, 2013, 11:45 pm
There is software and hardware that takes care of this, totally automated.
Completely automatic.

For the equipment, see the spycables at WikiLeaks from Dec 2011.

Yes, they can!

Most people will make the same mistake over and over again.
Software will filter those words out of internet traffic by mass analysis.
Once someone uses a name on the darknet that has been used on the clearnet too, he/she is a sitting duck for L.E.

Not a lot of action, but expect to be on their inventory list.

L.E. is after the sellers.
Buyers don't have priority.