kmfkewm - I really do get the desire to see which of your friends have seen links to the payload, and the desire to reduce unneeded duplication of metadata to share that.
It is more than just a desire to reduce bandwidth, although that comes into play as well. The primary reason users need to be able to tell which of their friends have seen a payload is so they know whom to respond to when they reply to the message in the payload. Picture it with E-mail and a mail archive:
Alice sends a tagged encrypted message to a mail archive. This is the payload. Now she wants Bob and Carol to see the message and be able to respond to it. So she sends Bob an E-mail with the tag of the message and a key to decrypt it, and she tells Bob that Carol can also see the message. Alice sends the same E-mail to Carol, but letting her know that Bob can see the message. The E-mails Alice sends to Bob and Carol can be seen as the metadata packets. Now, if Bob knows who Carol is, then when he makes a response to the message he knows to tell Carol about it. If Alice never told Bob that Carol knew about the message, Bob could only make a response and tell Alice about it. But in that case there is no group communication taking place; rather, it is as if Alice holds one conversation with Bob and independently holds another with Carol about the same topic. So for group communication it is required that Bob knows Carol is part of the group. Now, there are cryptographic tricks we can use to make this more secure: for example, we don't want Bob to learn anything about Carol if he doesn't already know who she is. Additionally, we don't want Bob to even know how many people Alice pointed to the message, unless he already knows all of the people she pointed to it.
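To make that concrete, here is a rough sketch of what such a metadata packet could contain; the field names and layout are my own illustration, not a finished format:

# Hypothetical metadata packet layout (illustration only, not a specified format).
from dataclasses import dataclass
from typing import List

@dataclass
class MetadataPacket:
    payload_tag: bytes             # tag used to locate the encrypted payload in the archive
    payload_key: bytes             # key that decrypts the payload
    other_recipients: List[bytes]  # pointers to the other group members; in a real system
                                   # these would be hidden from anyone who doesn't already
                                   # know those contacts

# Alice sends one packet per recipient: Bob's copy names Carol, Carol's copy names Bob.
bob_packet = MetadataPacket(b"tag-123", b"key-abc", other_recipients=[b"carol"])
carol_packet = MetadataPacket(b"tag-123", b"key-abc", other_recipients=[b"bob"])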
So the most important reason for clients to know who has seen a message is so that they know whom to respond to when they reply to it. Saving bandwidth by not resending a ton of metadata packets is just an added advantage.
I'm kinda stuck at this question:
How does the overhead of performing searches relate to the overhead of duplicated metadata objects?
I am not sure I understand this question. If there are duplicated metadata objects (each one a bit different, even if they point to the same message), that will increase the number of searches that need to be performed as well. If Alice points Bob to a message he already knows about, Bob still needs to search for that metadata object, because he doesn't know what it points to until he downloads it, and downloading it requires him to search for it. He is capable of searching for it because he has a shared secret search string with Alice, but until he actually downloads it he has no idea what it is he is downloading.
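As a sketch of that mechanism (this is my own assumption about how the shared secret search string could be derived, not something specified in the thread or the paper):

# Bob and Alice derive one-time search strings from a pairwise shared secret, so Bob can
# search for "the next object Alice addressed to me" without knowing what it points to.
import hmac, hashlib

def search_tag(shared_secret: bytes, counter: int) -> bytes:
    """Deterministic per-object search string known only to the two parties."""
    return hmac.new(shared_secret, b"search|" + str(counter).encode(), hashlib.sha256).digest()

# Alice labels her next metadata object with search_tag(secret_ab, n); Bob queries for the
# same value. Until he downloads the object, he has no idea which payload it describes.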
Here is the abstract from one of the papers that looks like a suitable candidate (however, I still need to read the one-way indexing paper; it looks like it might actually be better, in that it could provide censorship resistance as well):
www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA456185
A system for private stream searching, introduced by Ostrovsky and Skeith [18], allows a client to provide an untrusted server with an encrypted search query. The server uses the query on a stream of documents and returns the matching documents to the client while learning nothing about the nature of the query. We present a new scheme for conducting private keyword search on streaming data which requires O(m) server to client communication complexity to return the content of the matching documents, where m is the size of the documents. The required storage on the server conducting the search is also O(m). Our technique requires some metadata to be returned in addition to the documents; for this we present a scheme with O(m log(t/m)) communication and storage complexity. In many streaming applications, the number of matching documents is expected to be a fixed fraction of the stream length; in this case the new scheme has the optimal O(m) overall communication and storage complexity with near optimal constant factors. The previous best scheme for private stream searching was shown to have O(m log m) communication and storage complexity. In applications where m ≪ t, we may revert to an alternative method of returning the necessary metadata which has O(m log m) communication and storage complexity; in this case constant factor improvements over the previous scheme are achieved. Our solution employs a novel construction in which the user reconstructs the matching files by solving a system of linear equations. This allows the matching documents to be stored in a compact buffer rather than relying on redundancies to avoid collisions in the storage buffer as in previous work. We also present a unique encrypted Bloom filter construction which is used to encode the set of matching documents. In this paper we describe our scheme, prove it secure, analyze its asymptotic performance, and describe several extensions.
The Internet currently has several different types of sources of information. These include conventional websites, time sensitive web pages such as news articles and blog posts, real time public discussions through channels such as IRC, newsgroup posts, online auctions, and web based forums or classified ads. One common link between all of these sources is that searching mechanisms are vital for a user to be able to distill the information relevant to him.
Most search mechanisms involve a client sending a set of search criteria to a server and the server performing the search over some large data set. However, for some applications a client would like to hide his search criteria, i.e., which type of data he is interested in. A client might want to protect the privacy of his search queries for a variety of reasons ranging from personal privacy to protection of commercial interests.
A naive method for allowing private searches is to download the entire resource to the client machine and perform the search locally. This is typically infeasible due to the large size of the data to be searched, the limited bandwidth between the client and a remote entity, or to the unwillingness of a remote entity to disclose the entire resource to the client.
In many scenarios the documents to be searched are being continually generated and are already being processed as a stream by remote servers. In this case it would be advantageous to allow clients to establish persistent searches with the servers where they could be efficiently processed. Content matching the searches could then be returned to the clients as it arises. For example, Google News Alerts system [1] emails users whenever web news articles crawled by Google match their registered search keywords. In this paper we develop an efficient cryptographic system which allows services of this type while provably maintaining the secrecy of the search criteria.
Private Stream Searching. Recently, Ostrovsky and Skeith defined the problem of “private filtering”, which models the situations described above. They gave a scheme based on the homomorphism of the Paillier cryptosystem [19, 9] providing this capability [18]. First, a public dictionary of keywords D is fixed. To construct a query for the disjunction of some keywords K ⊆ D, the user produces an array of ciphertexts, one for each w ∈ D. If w ∈ K, a one is encrypted; otherwise a zero is encrypted. A server processing a document in its stream may then compute the product of the query array entries corresponding to the keywords found in the document. This will result in the encryption of some value c, which, by the homomorphism, is non-zero if and only if the document matches the query. The server may then in turn compute E(c)^f = E(c·f), where f is the content of the document, obtaining either an encryption of (a multiple of) the document or an encryption of zero.
Ostrovsky and Skeith propose the server keep a large array of ciphertexts as a buffer to accumulate matching documents; each E(c·f) value is multiplied into a number of random locations in the buffer. If the document matches the query then c is non-zero and copies of that document will be placed into these random locations; otherwise, c = 0 and this step will add an encryption of 0 to each location, having no effect on the corresponding plaintexts. A fundamental property of their solution is that if two different matching documents are ever added to the same buffer location then we will have a collision and both copies will be lost. If all copies of a particular matching document are lost due to collisions then that document is lost, and when the buffer is returned to the client, he will not be able to recover it.
To avoid the loss of data in this approach one must make the buffer sufficiently large so that this event does not happen. This requires that the buffer be much larger than the expected number of required documents. In particular, Ostrovsky and Skeith show that a given probability of successfully obtaining all matching documents may be obtained with a buffer of size O(m log m), where m is the number of matching documents. While effective, this scheme results in inefficiency due to the fact that a significant portion of the buffer returned to the user consists of empty locations and document collisions.
Note that we would need to use the extension in section 5.2 of the paper, which eliminates the globally public dictionary, since the keywords searched for are random, secret, and long.
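To make the quoted construction a bit more concrete, here is a toy sketch of the matching step it describes, using the additive homomorphism of the Paillier cryptosystem. The tiny hard-coded primes and the integer "document contents" are purely illustrative; a real deployment would need large keys and, as noted, the section 5.2 extension rather than a public dictionary:

# Toy Paillier-based private stream matching (NOT secure parameters, illustration only).
import math, random

p, q = 104729, 104723                                # toy primes; real keys are far larger
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def enc(m):
    r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

D = ["alpha", "bravo", "charlie", "delta"]           # public dictionary of keywords
K = {"bravo"}                                        # the client's secret keyword set
query = {w: enc(1 if w in K else 0) for w in D}      # E(1) for wanted words, E(0) otherwise

def process(doc_words, doc_content):
    """Server side: runs over the stream without learning anything about the query."""
    c = 1                                            # trivial encryption of 0
    for w in doc_words:
        if w in query:
            c = (c * query[w]) % n2                  # homomorphic sum of the 0/1 plaintexts
    return pow(c, doc_content, n2)                   # E(c)^f = E(c*f)

print(dec(process(["bravo", "delta"], 42)))          # matching document: nonzero multiple of 42
print(dec(process(["alpha"], 99)))                   # non-matching document: decrypts to 0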
At some point, every viewer of every thread will be performing an "is it read?" search one time for each contact they have that they want an answer for. You can cache results on the client, so it's just a "What about the folks that hadn't seen it last time?" set of searches every time you reopen the client.
Sure
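On the client side that caching is straightforward; a minimal sketch (names are hypothetical) could look like:

# Remember which contacts are already confirmed to have seen a thread, and only issue
# "is it read?" searches for the rest the next time the client opens.
seen_cache = {}   # thread_id -> set of contacts confirmed to have seen it

def contacts_to_query(thread_id, contacts):
    already_seen = seen_cache.get(thread_id, set())
    return [c for c in contacts if c not in already_seen]

def record_results(thread_id, newly_seen):
    seen_cache.setdefault(thread_id, set()).update(newly_seen)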
You're trading an increase in CPU time (to perform additional searches) to effect a decrease in storage (duplicated metadata objects).
Sort of, but the CPU time required for a client to decrypt a search is actually quite large. Decryption of search results can take many minutes and a lot of processing power on the client's end. So not searching for things you already have is probably a significant reduction in required client CPU time, as well as certainly a decrease in storage in the PSS (in this case) database.
The DoS/flood exposure of the PIR/etc. method seems to be twofold: CPU, in terms of search floods, and storage. Storage is what actually concerns me the most, without the ability to share the load of storage, a la Freenet. Proof of work gives you a rate limiting mechanism that can be ratcheted up to an appropriate level, of course. But PoW works well to limit storage... not so much with searches (CPU). "Please fill out a CAPTCHA to see if Alice has read this... please fill one out for Bob..." Obviously, you could substitute a non-human-interactive PoW (some variant of hashcash, etc.).
Yes, storage concerns me the most as well. Pretty much we need to have several PSS servers sharing the load to gain the benefit of decentralization (i.e., taking down a single server has no effect on the overall system). But it seems very wasteful to have, say, two servers with 4TB of storage capacity that need to be exact mirrors of each other. Adding more servers doesn't really increase storage capacity in this case, it only increases redundancy. But at the same time, having different messages go to different servers will very likely hurt the anonymity of the system and make it weak to traffic analysis. Just because PIR/PSS/OWI/whatever makes it so the clients can download messages without the server knowing which message they downloaded does not make it perfectly immune to traffic analysis. It sets a strong foundation upon which we need to build up our traffic analysis resistance. I think that we should certainly have proof of work for sending messages, to at least discourage spam and flooding of the PIR-like servers. It is trivial to implement hash-based proof of work; I could make such a system in half an hour, tops. I don't much care for the idea of CAPTCHAs; they are easy to circumvent and make things much less nice for the legitimate users.
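For reference, a minimal hashcash-style proof of work is about this much code (a sketch under my own assumptions about the message format, not a finished protocol): the sender must find a nonce such that the hash of the message plus the nonce has a required number of leading zero bits, and the server checks it with a single cheap hash.

import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(message: bytes, difficulty: int = 20) -> int:
    """Sender: find a nonce so SHA-256(message || nonce) has `difficulty` leading zero bits."""
    for nonce in count():
        digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(message: bytes, nonce: int, difficulty: int = 20) -> bool:
    """Server: one cheap hash to check the stamp before accepting the message."""
    digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

Raising the difficulty parameter ratchets up the cost of flooding the servers, as you said.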
I just don't have a feel for how CPU-intensive each individual search is. If it's trivial, I think I like your method, but remain concerned about the Layer 7 reverse-mapping of content viewers. I'm afraid it exposes too much data about who has seen what, which would be a shame since PIR is starting with the best available method to keep that from happening. But I'll freely admit that I haven't thought this out all the way yet.
For clients, searching is very CPU-intensive. It will likely take several minutes (depending on the CPU, of course) for the client to decrypt their obtained search results. I don't think searching is actually very CPU-intensive for the servers, but let me read the entire document through again before I make that claim. The PIR for Pynchon Gate is extremely CPU-intensive for the servers but not very much so for the clients. Most PIR systems I have read about are very CPU-intensive for the servers, and there tends to be a direct trade-off between CPU and bandwidth. For example, the easiest PIR is that of BitMessage, everybody-gets-everything, where no significant processing needs to be done by the server, but it needs to send enormous amounts of data to every client. Pynchon Gate PIR is the opposite of this, in that the server needs to do enormous amounts of computation but only needs to send small amounts of data to every client. But in the PSS papers and other PIR-like systems I have read about, it seems that most of the load is put on the clients, which is actually fantastic.
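As an illustration of why that Pynchon Gate style trade-off looks the way it does, here is a toy two-server XOR PIR in the same general spirit (my simplification, not the actual Pynchon Gate protocol): each server has to touch the whole database to answer, but its reply is only one record in size, and no single server learns which record was fetched.

import secrets
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(database, selection_bits):
    """Each server XORs together every record selected by the client's bit vector."""
    picked = [rec for rec, bit in zip(database, selection_bits) if bit]
    return reduce(xor_bytes, picked, bytes(len(database[0])))

def client_query(database_size, wanted_index):
    """Two random-looking bit vectors that differ only at the wanted record."""
    mask_a = [secrets.randbelow(2) for _ in range(database_size)]
    mask_b = list(mask_a)
    mask_b[wanted_index] ^= 1
    return mask_a, mask_b

db = [b"record-A", b"record-B", b"record-C", b"record-D"]   # both servers hold the same data
qa, qb = client_query(len(db), 2)
record = xor_bytes(server_answer(db, qa), server_answer(db, qb))
print(record)                                               # b"record-C"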
Writing or definitively auditing code that people will trust their lives to just isn't something I'm personally comfortable doing, so I'm not much use for auditing your code or implementing algorithms from whitepapers. I have a lot of skillsets, but that one is too weak to trust anybody else's life with. Although the more I think about the direction the Internet/surveillance/etc. is heading, the more I think that maybe it ought to be my focus moving forward.
Do you think you are a better programmer / know more about security than the BitMessage people? Do you know not to directly encrypt payloads with RSA? Are you pretty good at math? Do you know C programming? The thing is, yes, it is best if only top experts program stuff like this. But to help people we only need to be better than the "competition". Right now the competition for systems like this is not that good. The only systems similar to this are Syndie, BitMessage, and Frost (and I suppose Dissent as well, but I don't think anyone is using that yet). Syndie doesn't include a network but only an interface that can be used with any network:storage pair. BitMessage is horrible and obviously not made by people who know what they are doing. Frost I know little about, but it essentially relies on the anonymity of Freenet, which we can improve upon. Dissent I don't know enough about to comment on, but quite likely it is the best of the bunch, considering it is coming from academics instead of hobbyists. Also we have solutions like Tor plus PHP forums, and I2P plus the same, but that is so far below the aspirations of this that it can hardly even be compared. We need to make this because the solution of Tor plus a PHP forum, or I2P plus the same, is simply not anywhere near good enough. Even if Tor is programmed perfectly and the PHP forum has no flaws, by its very design it is just not going to be resistant enough against strong attackers.
Pretty much what I am trying to say is that even if you are not the top security expert in the world, you can still help contribute to this and still look over code that is written. People's lives will not depend on you; they will depend on everybody who contributes to the project. The more people who contribute to writing code and to auditing code, the better it is going to be. At first I tried doing this all by myself, and two years later I realized I had bitten off more than I could chew, and other people became involved. Things improved! Not all of the people involved are experts; I consider myself only a hobbyist as well, but I still implemented a cryptographic packet format, and the experts I asked to audit my code said it looked correct. That was the first cryptographic system I implemented, and it was hard work, but by sticking to the whitepaper and researching the shit out of things, I was able to produce a system that true experts in the field said was implemented correctly.