«	August '21	»
S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

An Apple Affray

Monday, 16 August 2021

My previous blog entry went unexpectedly viral. I've never had something sit at the #1 position of Hacker News for hours, or be on their front page for an entire day. In addition, I saw heavy traffic from Reddit, Slack, and lots of other social media platforms.

The commotion about Apple's plan to scan client-side content for child sexual abuse media (CSAM) has not dulled. Instead, there are many more people criticizing Apple's decision. (A Google search for "Apple" and "CSAM" turns up hundreds of negative reviews.) The "screeching voices of the minority" seems to be the majority opinion: Apple's intent may be good, but their solution is a bad idea. In this blog entry, I'm going to cover a little of what's happened over the last week with regards to Apple's CSAM scanning efforts and make some guesses about Apple's real goals. (I question their intent.)

Disclaimer: I am not an attorney and this is not legal advice. This is my non-attorney understanding of these laws. Also, this blog entry contains supposition and opinion based on (1) my prior knowledge and (2) observed omissions in Apple's disclosures. My speculations are not based on anything Apple has formally announced.

Replies from Apple

There has been a lot of criticisms about Apple's proposed CSAM scanner. While nobody disagrees that CSAM is a problem, I have yet to see anyone (other than Apple) agree that Apple's client-side scanning solution is a good thing. Apple is even having trouble justifying this to Apple's own employees. Over the last week, Apple has been on the defensive, releasing new documents every few days that try to explain what they are doing in an effort to give it a positive spin. (As Gizmodo put it, "Apple will keep clarifying this CSAM mess until moral improves.")

Apple's FAQ

I released my previous blog entry a week ago Sunday. By that evening, my blog was swamped by Applebot. According to Apple's documentation, Applebot obeys robots.txt crawling restrictions. My site has a robots.txt that says "don't crawl my blog". Applebot repeatedly retrieved my robots.txt, and yet, it was definitely crawling my blog. (It seems that Apple doesn't like following their own rules.) My blog detected the crawling and auto-blocked Applebot.

A few hours later, Apple released an FAQ. (Is the timing coincidental? I'll let you decide.) Personally, I don't doubt anything that Apple said in this FAQ. Rather, I think it is important to focus on what they didn't say. For example:

Apple's original announcement mentioned three areas for new filtering:
1. iMessage filtering for child accounts
2. CSAM scanning on Apple iOS and MacOS systems, and
3. Changes to Siri and Search to detect "unsafe situations" and will "intervene when users try to search for CSAM-related topics."
All of the updates from Apple have focused on (1) and (2). Apple has quietly stopped talking about Apple's search-related filtering. Personally, I don't think it is going away; I think they are hoping nobody will notice it. (Will users be able to use Siri or Apple's search to find information about this CSAM discussion? Or will Apple block that from search results?)

The FAQ didn't address their "1 in 1 trillion" claimed accuracy rate.

Nothing in the FAQ addressed the detection method or limitations.

Apple's Interview

In an interview with the Wall Street Journal (WSJ) and writeup at 9to5mac, Apple's representative stated that the dual announcement (iMessage and CSAM scanning) has caused confusion. (Again, Apple overlooked the changes to the search results.) However, there are lots of technical reviews from experts in the field who were not confused. We (I'm including myself in this group) focused on the CSAM scanner implementation. Nothing in the interview with Apple's SVP Craig Federighi mitigated these concerns.

The WSJ interview with Apple's SVP Craig Federighi (available on YouTube) included some key things that people (who are not familiar with NCMEC) may have overlooked. Here's the interview on YouTube (it's worth watching):

The big things I noticed (with timestamps into the interview):

(1:40) Federighi begins by stating that Apple is not looking for CP on your iPhone. Instead, he says they are "looking for illegal images of child pornography" in the iCloud but doing it by scanning your iPhone. (I don't see the difference; this is doublespeak and bad spin control.)

(2:00) Federighi states that all of the other big providers (Facebook, Google, etc.) scan for content in the cloud. First, "the cloud" is another term for someone else's computer. Google can only search Google's servers and Facebook can only search Facebook's servers. Federighi never explains why Apple does not search Apple's servers.

(2:20) Federighi mentioned that Apple wants to scan on the client devices because it is "much much more private". He gives no information about how only scanning files that were uploaded to the iCloud (server-side) is less private than scanning files on a personal device. If anything, scanning on the server would be more private since the scanner cannot accidentally scan any other personal files.

(2:40) Federighi says that the client-side scanner creates some flags that identify what was found. He provided no details about what is in these flags. In other documentation released by Apple, Apple believes that the contents cannot be determined from the client's side. I find this hard to believe since everything needed to create this token is found on the client system. A determined reverse-engineer should be able to figure out what is going on.

(3:00) According to Federighi, Apple only looks at the content if their iCloud receives something like 30 flagged files from a user's account. At my FotoForensics service, I report people when they upload a single CSAM image. In 9 years, the most CP that a single person has uploaded to my service at one time was 8 pictures. (I had stepped out of the office for the day, and there was no admin to stop him.) I have to wonder: how many people does Apple expect to report to NCMEC when they have a high threshold like 30 pictures? To me, this looks like justification for not reporting to NCMEC. ("Nobody crossed the threshold, so we didn't report anything.")

(3:15 - 3:40) Federighi states that Apple only looks at the flagged files (via a "safety voucher"). People who produce, collect, and/or distribute CSAM typically have old and new files. The "only match known signature" approach used by PhotoDNA (and apparently by NeuralHash, if we believe the SVP) will only identify old files; it won't find new content. Many service providers will detect CSAM, block the account, and review other account content in case there is additional CSAM material that was not known by the scanner. (I do this, and I know other companies that do this.) Apple appears to be planning to do the bare minimum: only report the detection of known CSAM content and don't check if there are other associated CSAM files. While this follows the letter of the law, I think it fails to follow the intent of the law. Moreover, it does nothing to help children who are currently being victimized and appear in new CSAM content.

(4:20) Federighi mentions that Apple created NeuralHash for scanning Apple devices. Ah, this reminded me of the NDA related to PhotoDNA that I mentioned in my last blog entry. The NDA forbids you from distributing PhotoDNA. Forensic tool vendors, including Cellebrite and WetStone, have licenses for including PhotoDNA with their products. However, they cannot distribute PhotoDNA outside of their tools. I suspect that Apple couldn't get a license to distribute PhotoDNA to every iPhone, so they built their own solution (NeuralHash).

(7:25) Federighi states that a copy of the hash file is placed on every Apple device and it is not updated per-country or per-device. However, it can always be updated globally with new signatures. Moreover, at (2:50), he says that only part of the analysis is done on the device. It sounds like the server-side can do country-specific or user-specific scanning based on the signatures created on the client-side.

Getting Technical (Trillion, Million, what's the difference?)

Last Friday (three days ago), Apple released a technical threat model. This PDF does a better job explaining how the CSAM scanning is expected to work. However, it still leave some serious issues.

For example, Apple's tech document identifies the false-positive rate for a single picture as 3 in 100 million. That is a far cry from the original "1 in 1 trillion" claimed accuracy. (It's off by a factor of 30,000.) While 3 in 100 million may still sound like a really low error rate, Apple's iCloud should be receiving billions of pictures per year. They should have hundreds of false matches, but it shouldn't overwhelm a single part-time employee. (To put things into perspective, the busiest day at FotoForensics had 10,572 pictures. It took me less than 3 hours to review them. That includes the added steps of categorizing some of the content for research projects. A few hundred per year is no problem!)

Apple phones support "burst mode" -- multiple pictures at a time, and they all look similar. Professional photographers may have hundreds of similar pictures from a single photo shoot. There are other ways to take similar photos. For example, I'm sure that we've all experience those family gatherings where someone says "Billy blinked! Everyone hold their pose while I take it again!" If one picture in a series has a false-positive, then the entire series will likely have multiple false-positives. Basing the reporting threshold on the number of flagged photos does not reduce the likelihood of a mistake because pictures are often not independent.

The technical paper reinforces the statement that Apple will only look at flagged pictures after they have been uploaded to iCloud. This means that Apple can always scan pictures on the iCloud servers and does not need to scan on the client's side. (There has to be another reason that they want to scan on the client...)

I will mention: some technical criticisms have speculated that Apple may be moving to a fully-encrypted iCloud service. However, if it's fully-encrypted, then Apple's staff cannot review suspected content. I don't think this is the reason for client-side scanning.

The Simple Solution

There is a simple solution here. Rather than scanning pictures on the client's device, Apple could scan pictures on the server-side as they are uploaded to the iCloud. This is the approach used by Facebook, Twitter, Google, and even my own FotoForensics service. If you view the trade-offs from the legal perspective, then doing server-side scanning is a no-brainer.

When scanning on the server:

You have no expectation of seeing CSAM, so uploads in general are not a concern. And since the scanner only flags strongly-suspected CSAM, it isn't a privacy issue. (This complies with 18 USC 2252 and related laws.)

The terms of service can be written to include scanning for CSAM in uploaded materials. (Apple already has this in sections V(B)(j) and V(E).) This removes any privacy violations.

If the scanner flags any suspected CSAM, you can have internal staff review the content for confirmation. There is no transfer of content outside of the company, so this action complies with the law.

If it is CSAM, then it can be reported directly to NCMEC. Again, this complies with the law.

The initial disclosure from Apple made it sound like Apple would be scanning devices and uploading suspected CSAM to Apple for confirmation. In this scenario, the program has no expectation of encountering CSAM. However, when it sees something suspected as CSAM, it sends it to the company and not NCMEC. This violates 18 USC § 2258 since it can only be sent to NCMEC. Moreover, Apple then sends it to their staff for review. This is the intentional collection and distribution for the purpose of viewing prior to reporting to NCMEC. (That's forbidden by law.)

This new technical document says that the pictures are flagged, but only evaluated after they are uploaded to the iCloud. With a false-positive rate of 3 in 100 million, this is effecting knowingly transferring content that is believed to be CSAM. Apple cannot claim that they didn't know a priori. In this case, Apple is aiding in the distribution of suspected CSAM to Apple for the purpose of review by Apple's staff. (Violates federal law.)

On top of this, with a 3 in 100 million chance of incorrectly flagging a file as CSAM, Apple has a strong reason to believe (beyond a reasonable doubt) that a single matched file is likely CSAM. And yet, Apple won't review or report it unless the account has multiple flagged files. Multiple flagged files don't mean they are certain; it only decreases the already-slim chance of making a false automated classification. 18 USC § 2258A(e) says that a service provider (like Apple) must report it when they see it. In this case, Apple's program sees it, but Apple refuses to report it unless some arbitrary even-lower threshold is matched.

PhotoDNA is not perfect. But if PhotoDNA matches even one file, then a service provider is supposed to investigate and report. (Some providers just report without investigating.) At 3 in 100 million, NeuralHash seems more accurate than PhotoDNA. And yet, Apple won't investigate an individual file that matches.

I don't expect Apple's CSAM solution to result in more reports to NCMEC. I expect Apple to submit fewer reports because they are justifying why, as a service provider, they don't want to investigate and report.

Ulterior Motives

The more I think about it, the more I suspect that there are multiple ulterior motives at play here.

In Apple's FAQ (August 9th), Apple begins by pointing out and doubling-down on their commitment to fighting CSAM. To me, it was reminiscent of politicians caught in a cheating scandals. The apology tours always start with statements about their commitments to family values.

When 4chan first started, it didn't moderate any content. As a result, it had lots of child pornography (CP). As relayed to me by law enforcement officers, the site's admin (a guy called "moot") received a visit from law enforcement. They gave him a choice: remove and report CP, or go to jail for facilitating CSAM distribution. That's when 4chan began to moderate content.

Similarly, Facebook used to have a serious child porn problem. In 2010, they had a strong crackdown. They had a serious cleanup again in 2018. Today, Facebook is responsible for 94% of the CSAM reports that NCMEC receives. My own FotoForensics service hasn't reported CSAM from Facebook in years. (Whatever Facebook is doing, it's working.)

Craigslist had a huge CP and child human trafficking problem. The excuse was that the site admins did not closely monitor the site. After a friendly visit, Craigslist began to crack down on the problem. In 2018, they completely disabled their "personals" section, stopping the problem cold. (Prior to 2018, I was reporting CP that had filenames consistent with Craigslist. I haven't seen any CP from Craigslist since they shutdown their personals section.)

In my last blog entry, I mentioned how I'm seeing an increase in CSAM that appears to be from Apple devices and services. I think Apple has a serious CSAM problem, and I think someone had a talk with them. To me, this solution -- scanning every Apple device for possible CSAM content -- appears to be Apple's knee-jerk reaction to a serious (and undisclosed) legal threat.

Quick! Make Lemonade!

Being a big company, I also suspect that Apple took all of their sharp minds and tried to turn this into a way to make the best of this situation. Since they didn't know the final outcome, I think they came up with options to benefit them no matter what.

Reading between the lines, I suspect that they came up with this solution because of a combination of options:

Accounting: Cutting costs. Scanning on the cloud is expensive. They can save the costs associated with the computation overhead if they push the scanning from the cloud to user devices. (Remember: reviewers at Apple will be able to manually confirm matches. That means that they can access content on iCloud. This is not about encryption; this is about resource distribution and associated costs.)

Development: Future functionality. Apple likely wants the option to do client-side scans for other content later. Sure, the FAQ and WSJ interview says that they would never do that. However, Apple has already gone against some of their privacy promises. For example, China demanded access to the iCloud, while Russia demanded country-specific software. Apple caved to both demands.

In the United States, child porn is classified as prohibited content. However, CP on a computer is legal in Russia; CSAM scanning in Russia won't find illegal content. Russia, China, India, etc. may have their own hashes for prohibited content -- and that content doesn't have to be porn. Apple may be forced to use this infrastructure to scan for other types of content. (Remember: they can't be forced to do it if the technology does not exist.) By providing this client-side technology, Apple is effectively planning for this future option.

PR: Discourage sex offenders. Apple's security and encryption seems to be making iOS the go-to choice for child abusers. If this solution works, then Apple won't be treated as the "product of choice" for pedophiles. Then again, even if Apple doesn't deploy their client-side scanning solution, Apple has recently created enough noise in the media to potentially drive the offenders away from Apple products. (I'm not kidding; just being vocal can help mitigate the problem.)

HR: Reduce reporting requirements. Apple's arbitrarily high bar for evaluation and reporting strongly suggests that Apple wants a reason to not report to NCMEC. If Apple is allowed to do this, then they are likely to rarely ever report to NCMEC. Their solution is effectively justification for not reporting. Among other things, this means that Apple wouldn't need a large department devoted to reviewing content and reporting it. It also reduces the overhead in mental fatigue, emotional issues, and high turnover associated with this kind of work.

Legal: Further reduce reporting requirements. If there are legal challenges that prevent Apple from doing client-side scanning, then Apple can claim that they wanted to report, but can't due to legal challenges. Again, it looks like Apple doesn't want to obey the law and report CSAM to NCMEC, so they are manufacturing reasons to justify why they can't do it.

Business: Market dominance. On one hand, Apple doesn't seem to want to report to NCMEC. On the other hand, Apple seems to want a stronger partnership with NCMEC. (That leaked memo from NCMEC seems to confirm this as a mutual desire.) Although NCMEC does not formally endorse any vendor, NCMEC only provides Microsoft's PhotoDNA as a solution to services that want to scan for CSAM. If Apple's NeuralHash is accepted, then it will displace Microsoft's PhotoDNA as the go-to tool for scanning. Every service currently wants PhotoDNA, because it's the only solution provided by NCMEC. Next: Every service will want Apple's NeuralHash. Imagine if all of Apple's competitors (Google, Microsoft, Twitter, Facebook, etc.) had to license NeuralHash from Apple... (As the meme goes, "Step 3: Profit!")

In my opinion, it looks like Apple doesn't want to report to NCMEC. But if they are forced to do it, then they can mitigate how much they need to report, reduce their own costs associated with scanning and reporting, and try to own the market. While Apple claims to be doing this to combat CSAM, I suspect that pronouncement is just spin.

Read more about Forensics, Image Analysis, Network, Privacy, Security | Comments (11) | Direct Link

One Bad Apple

Sunday, 8 August 2021

My in-box has been flooded over the last few days about Apple's CSAM announcement. Everyone seems to want my opinion since I've been deep into photo analysis technologies and the reporting of child exploitation materials. In this blog entry, I'm going to go over what Apple announced, existing technologies, and the impact to end users. Moreover, I'm going to call out some of Apple's questionable claims.

Disclaimer: I'm not an attorney and this is not legal advice. This blog entry includes my non-attorney understanding of these laws.

The Announcement

In an announcement titled "Expanded Protections for Children", Apple explains their focus on preventing child exploitation.

The article starts with Apple pointing out that the spread of Child Sexual Abuse Material (CSAM) is a problem. I agree, it is a problem. At my FotoForensics service, I typically submit a few CSAM reports (or "CP" -- photo of child pornography) per day to the National Center for Missing and Exploited Children (NCMEC). (It's actually written into Federal law: 18 U.S.C. § 2258A. Only NMCEC can receive CP reports, and 18 USC § 2258A(e) makes it a felony for a service provider to fail to report CP.) I don't permit porn or nudity on my site because sites that permit that kind of content attract CP. By banning users and blocking content, I currently keep porn to about 2-3% of the uploaded content, and CP at less than 0.06%.

According to NCMEC, I submitted 608 reports to NCMEC in 2019, and 523 reports in 2020. In those same years, Apple submitted 205 and 265 reports (respectively). It isn't that Apple doesn't receive more picture than my service, or that they don't have more CP than I receive. Rather, it's that they don't seem to notice and therefore, don't report.

Apple's devices rename pictures in a way that is very distinct. (Filename ballistics spots it really well.) Based on the number of reports that I've submitted to NCMEC, where the image appears to have touched Apple's devices or services, I think that Apple has a very large CP/CSAM problem.

[Revised; thanks CW!] Apple's iCloud service encrypts all data, but Apple has the decryption keys and can use them if there is a warrant. However, nothing in the iCloud terms of service grants Apple access to your pictures for use in research projects, such as developing a CSAM scanner. (Apple can deploy new beta features, but Apple cannot arbitrarily use your data.) In effect, they don't have access to your content for testing their CSAM system.

If Apple wants to crack down on CSAM, then they have to do it on your Apple device. This is what Apple announced: Beginning with iOS 15, Apple will be deploying a CSAM scanner that will run on your device. If it encounters any CSAM content, it will send the file to Apple for confirmation and then they will report it to NCMEC. (Apple wrote in their announcement that their staff "manually reviews each report to confirm there is a match". They cannot manually review it unless they have a copy.)

While I understand the reason for Apple's proposed CSAM solution, there are some serious problems with their implementation.

Problem #1: Detection

There are different ways to detect CP: cryptographic, algorithmic/perceptual, AI/perceptual, and AI/interpretation. Even though there are lots of papers about how good these solutions are, none of these methods are foolproof.

The cryptographic hash solution

The cryptographic solution uses a checksum, like MD5 or SHA1, that matches a known image. If a new file has the exact same cryptographic checksum as a known file, then it is very likely byte-per-byte identical. If the known checksum is for known CP, then a match identifies CP without a human needing to review the match. (Anything that reduces the amount of these disturbing pictures that a human sees is a good thing.)

In 2014 and 2015, NCMEC stated that they would give MD5 hashes of known CP to service providers for detecting known-bad files. I repeatedly begged NCMEC for a hash set so I could try to automate detection. Eventually (about a year later) they provided me with about 20,000 MD5 hashes that match known CP. In addition, I had about 3 million SHA1 and MD5 hashes from other law enforcement sources. This might sound like a lot, but it really isn't. A single bit change to a file will prevent a CP file from matching a known hash. If a picture is simple re-encoded, it will likely have a different checksum -- even if the content is visually the same.

In the six years that I've been using these hashes at FotoForensics, I've only matched 5 of these 3 million MD5 hashes. (They really are not that useful.) In addition, one of them was definitely a false-positive. (The false-positive was a fully clothed man holding a monkey -- I think it's a rhesus macaque. No children, no nudity.) Based just on the 5 matches, I am able to theorize that 20% of the cryptographic hashes were likely incorrectly classified as CP. (If I ever give a talk at Defcon, I will make sure to include this picture in the media -- just so CP scanners will incorrectly flag the Defcon DVD as a source for CP. [Sorry, Jeff!])

The perceptual hash solution

Perceptual hashes look for similar picture attributes. If two pictures have similar blobs in similar areas, then the pictures are similar. I have a few blog entries that detail how these algorithms work.

NCMEC uses a perceptual hash algorithm provided by Microsoft called PhotoDNA. NMCEC claims that they share this technology with service providers. However, the acquisition process is complicated:

Make a request to NCMEC for PhotoDNA.
If NCMEC approves the initial request, then they send you an NDA.
You fill out the NDA and return it to NCMEC.
NCMEC reviews it again, signs, and revert the fully-executed NDA to you.
NCMEC reviews your use model and process.
After the review is completed, you get the code and hashes.

Because of FotoForensics, I have a legitimate use for this code. I want to detect CP during the upload process, immediately block the user, and automatically report them to NCMEC. However, after multiple requests (spanning years), I never got past the NDA step. Twice I was sent the NDA and signed it, but NCMEC never counter-signed it and stopped responding to my status requests. (It's not like I'm a little nobody. If you sort NCMEC's list of reporting providers by the number of submissions in 2020, then I come in at #40 out of 168. For 2019, I'm #31 out of 148.)

Since NCMEC was treating PhotoDNA as a trade secret, I decided to reverse engineer the algorithm using some papers published by Microsoft. (No single paper says how it works, but I cobbled together how it works from a bunch of their marketing blurbs and high-level slides.) I know that I have implemented it correctly because other providers who have the code were able to use my hashes to correctly match pictures.

Perhaps there is a reason that they don't want really technical people looking at PhotoDNA. Microsoft says that the "PhotoDNA hash is not reversible". That's not true. PhotoDNA hashes can be projected into a 26x26 grayscale image that is only a little blurry. 26x26 is larger than most desktop icons; it's enough detail to recognize people and objects. Reversing a PhotoDNA hash is no more complicated than solving a 26x26 Sudoku puzzle; a task well-suited for computers.

I have a whitepaper about PhotoDNA that I have privately circulated to NCMEC, ICMEC (NCMEC's international counterpart), a few ICACs, a few tech vendors, and Microsoft. The few who provided feedback were very concerned about PhotoDNA's limitations that the paper calls out. I have not made my whitepaper public because it describes how to reverse the algorithm (including pseudocode). If someone were to release code that reverses NCMEC hashes into pictures, then everyone in possession of NCMEC's PhotoDNA hashes would be in possession of child pornography.

The AI perceptual hash solution

With perceptual hashes, the algorithm identifies known image attributes. The AI solution is similar, but rather than knowing the attributes a priori, an AI system is used to "learn" the attributes. For example, many years ago there was a Chinese researcher who was using AI to identify poses. (There are some poses that are common in porn, but uncommon in non-porn.) These poses became the attributes. (I never did hear whether his system worked.)

The problem with AI is that you don't know what attributes it finds important. Back in college, some of my friends were trying to teach an AI system to identify male or female from face photos. The main thing it learned? Men have facial hair and women have long hair. It determined that a woman with a fuzzy lip must be "male" and a guy with long hair is female.

Apple says that their CSAM solution uses an AI perceptual hash called a NeuralHash. They include a technical paper and some technical reviews that claim that the software works as advertised. However, I have some serious concerns here:

The reviewers include cryptography experts (I have no concerns about the cryptography) and a little bit of image analysis. However, none of the reviewers have backgrounds in privacy. Also, although they made statements about the legality, they are not legal experts (and they missed some glaring legal issues; see my next section).

Apple's technical whitepaper is overly technical -- and yet doesn't give enough information for someone to confirm the implementation. (I cover this type of paper in my blog entry, "Oh Baby, Talk Technical To Me" under "Over-Talk".) In effect, it is a proof by cumbersome notation. This plays to a common fallacy: if it looks really technical, then it must be really good. Similarly, one of Apple's reviewers wrote an entire paper full of mathematical symbols and complex variables. (But the paper looks impressive. Remember kids: a mathematical proof is not the same as a code review.)

Apple claims that there is a "one in one trillion chance per year of incorrectly flagging a given account". I'm calling bullshit on this.

Facebook is one of the biggest social media services. Back in 2013, they were receiving 350 million pictures per day. However, Facebook hasn't released any more recent numbers, so I can only try to estimate. In 2020, FotoForensics received 931,466 pictures and submitted 523 reports to NCMEC; that's 0.056%. During the same year, Facebook submitted 20,307,216 reports to NCMEC. If we assume that Facebook is reporting at the same rate as me, then that means Facebook received about 36 billion pictures in 2020. At that rate, it would take them about 30 years to receive 1 trillion pictures.

According to all of the reports I've seen, Facebook has more accessible photos than Apple. Remember: Apple says that they do not have access to users' photos on iCloud, so I do not believe that they have access to 1 trillion pictures for testing. So where else could they get 1 trillion pictures?

Randomly generated: Testing against randomly generated pictures is not realistic compared to photos by people.

Videos: Testing against frames from videos means lots of bias from visual similarity.

Web crawling: Scraping the web would work, but my web logs rarely show Apple's bots doing scrapes. If they are doing this, then they are not harvesting at a fast enough rate to account for a trillion pictures.

Partnership: They could have some kind of partnership that provides the pictures. However, I haven't seen any such announcements. And the cost for such a large license would probably show up in their annual shareholder's report. (But I haven't seen any disclosure like this.)

NCMEC: In NCMEC's 2020 summary report, they state that they received 65.4 million files in 2020. NCMEC was founded in 1984. If we assume that they received the same number of files every year (a gross over-estimate), then that means they have around 2.5 billion files. I do not think that NCMEC has 1 trillion examples to share with Apple.

Perhaps Apple is basing their "1 in 1 trillion" estimate on the number of bits in their hash?

With cryptographic hashes (MD5, SHA1, etc.), we can use the number of bits to identify the likelihood of a collision. If the odds are "1 in 1 trillion", then it means the algorithm has about 40 bits for the hash. However, counting the bit size for a hash does not work with perceptual hashes.

With perceptual hashes, the real question is how often do those specific attributes appear in a photo. This isn't the same as looking at the number of bits in the hash. (Two different pictures of cars will have different perceptual hashes. Two different pictures of similar dogs taken at similar angles will have similar hashes. And two different pictures of white walls will be almost identical.)

With AI-driven perceptual hashes, including algorithms like Apple's NeuralHash, you don't even know the attributes so you cannot directly test the likelihood. The only real solution is to test by passing through a large number of visually different images. But as I mentioned, I don't think Apple has access to 1 trillion pictures.

What is the real error rate? We don't know. Apple doesn't seem to know. And since they don't know, they appear to have just thrown out a really big number. As far as I can tell, Apple's claim of "1 in 1 trillion" is a baseless estimate. In this regard, Apple has provided misleading support for their algorithm and misleading accuracy rates.

The AI interpretation solution

An AI-driven interpretation solution tries to use AI to learn contextual elements. Person, dog, adult, child, clothing, etc. While AI systems have come a long way with identification, the technology is nowhere near good enough to identify pictures of CSAM. There are also the extreme resource requirements. If a contextual interpretative CSAM scanner ran on your iPhone, then the battery life would dramatically drop. I suspect that a charged battery would only last a few hours.

Luckily, Apple isn't doing this type of solution. Apple is focusing on the AI-driven perceptual hash solution.

Problem #2: Legal

Since Apple's initial CSAM announcement, I've seen lots of articles that focus on Apple scanning your files or accessing content on your encrypted device. Personally, this doesn't bother me. You have anti-virus (AV) tools that scan your device when your drive is unlocked, and you have file index systems that inventory all of your content. When you search for a file on your device, it accesses the pre-computed file index. (See Apple's Spotlight and Microsoft's Cortana.)

You could argue that you, as the user, have a choice about which AV to use, while Apple isn't giving you a choice. However, Microsoft ships with Defender. (Good luck trying to disable it; it turns on after each update.) Similarly, my Android ships with McAfee. (I can't figure out how to turn it off!)

The thing that I find bothersome about Apple's solution is what they do after they find suspicious content. With indexing services, the index stays on the device. With AV systems, potential malware is isolated -- but stays on the device. But with CSAM? Apple says:

Only when the threshold is exceeded does the cryptographic technology allow Apple to interpret the contents of the safety vouchers associated with the matching CSAM images. Apple then manually reviews each report to confirm there is a match, disables the user’s account, and sends a report to NCMEC.

In order to manually review the match, they must have access to the content. This means that the content must be transferred to Apple. Moreover, as one of Apple's tech reviewers wrote, "Users get no direct feedback from the system and therefore cannot directly learn if any of their photos match the CSAM database." This leads to two big problems: illegal searches and illegal collection of child exploitation material.

Illegal Searches

As noted, Apple says that they will scan your Apple device for CSAM material. If they find something that they think matches, then they will send it to Apple. The problem is that you don't know which pictures will be sent to Apple. You could have corporate confidential information and Apple may quietly take a copy of it. You could be working with the legal authority to investigate a child exploitation case, and Apple will quietly take a copy of the evidence.

To reiterate: scanning your device is not a privacy risk, but copying files from your device without any notice is definitely a privacy issue.

Think of it this way: Your landlord owns your property, but in the United States, he cannot enter any time he wants. In order to enter, the landlord must have permission, give prior notice, or have cause. Any other reason is trespassing. Moreover, if the landlord takes anything, then it's theft. Apple's license agreement says that they own the operating system, but that doesn't give them permission to search whenever they want or to take content.

Illegal Data Collection

The laws related to CSAM are very explicit. 18 U.S. Code § 2252 states that knowingly transferring CSAM material is a felony. (The only exception, in 2258A, is when it is reported to NCMEC.) In this case, Apple has a very strong reason to believe they are transferring CSAM material, and they are sending it to Apple -- not NCMEC.

It does not matter that Apple will then check it and forward it to NCMEC. 18 U.S.C. § 2258A is specific: the data can only be sent to NCMEC. (With 2258A, it is illegal for a service provider to turn over CP photos to the police or the FBI; you can only send it to NCMEC. Then NCMEC will contact the police or FBI.) What Apple has detailed is the intentional distribution (to Apple), collection (at Apple), and access (viewing at Apple) of material that they strongly have reason to believe is CSAM. As it was explained to me by my attorney, that is a felony.

At FotoForensics, we have a simple process:

People choose to upload pictures. We don't harvest pictures from your device.

When my admins review the uploaded content, we do not expect to see CP or CSAM. We are not "knowingly" seeing it since it makes up less than 0.06% of the uploads. Moreover, our review catalogs lots of types of pictures for various research projects. CP is not one of the research projects. We do not intentionally look for CP.

When we see CP/CSAM, we immediately report it to NCMEC, and only to NCMEC.

We follow the law. What Apple is proposing does not follow the law.

The Backlash

In the hours and days since Apple made its announcement, there has been a lot of media coverage and feedback from the tech community -- and much of it is negative. A few examples:

BBC: "Apple criticised for system that detects child abuse"

Ars Technica: "Apple explains how iPhones will scan photos for child-sexual-abuse images"

EFF: "Apple's Plan to 'Think Different' About Encryption Opens a Backdoor to Your Private Life"

The Verge: "WhatsApp lead and other tech experts fire back at Apple's Child Safety plan"

This was followed by a memo leak, allegedly from NCMEC to Apple:

I understand the problems related to CSAM, CP, and child exploitation. I've spoken at conferences on this topic. I am a mandatory reporter; I've submitted more reports to NCMEC than Apple, Digital Ocean, Ebay, Grindr, and the Internet Archive. (It isn't that my service receives more of it; it's that we're more vigilant at detecting and reporting it.) I'm no fan of CP. While I would welcome a better solution, I believe that Apple's solution is too invasive and violates both the letter and the intent of the law. If Apple and NCMEC view me as one of the "screeching voices of the minority", then they are not listening.

Update 2021-08-09: In response to widespread criticism, Apple quickly released an FAQ. This FAQ contradicts their original announcement, contradicts itself, contains doublespeak, and omits important details. For example:

The FAQ says that they don't access Messages, but also says that they filter Messages and blur images. (How can they know what to filter without accessing the content?)

The FAQ says that they won't scan all photos for CSAM; only the photos for iCloud. However, Apple does not mention that the default configuration uses iCloud for all photo backups.

The FAQ say that there will be no falsely identified reports to NCMEC because Apple will have people conduct manual reviews. As if people never make mistakes.

This is far from the complete list of issues with their FAQ. It does not resolve any of the concerns raised in this blog entry.

Read more about AI, Forensics, FotoForensics, Image Analysis, Mass Media, Privacy | Comments (50) | Direct Link

Fake Covid IDs

Saturday, 31 July 2021

For months at FotoForensics, we've been seeing a steady increase in covid vaccination cards and vaccination tests. The cards are almost all fake. The tests are about 50/50 real/altered. Here are a few examples (click to view at FotoForensics):

Keep in mind, I've seen more than just a few samples...

It's hard to make a good fake image. In the countries where these pictures are coming from, it's easier to get a covid vaccination shot. These people literally put more effort into faking these photos than it takes to get vaccinated.

Besides vaccination tests and cards, we've also seen photos of people in masks (good!), people getting vaccinated, people screaming at people who were vaccinated or wearing masks, boxes of vaccines, stacks of corpses, etc. While there are pictures from both the pro- and anti- groups, there appears to be more pictures related to anti-vaxxers. I don't know if it is the anti-vaxxers testing their forgeries, or someone testing a picture from an anti-vaxxer, but sometimes you can see the developer going through iterations as they alter their images.

Tracking an anti-vax campaign

A picture received last week really caught my attention. It wasn't like other anti-vax campaigns:

Warning: In this blog entry, I'm including a lot of URLs without hyperlinks. This is because I suspect that they are scams. Access them at your own risk.

A bunch of things stood out about this card that scream "fake".

Text, font, and photo: This is computer generated. This is not a scan or photo of an ID card.

Metadata: Nothing informative. Not direct from a camera.

ELA: Very consistent for a computer-generated image. The photo appears to be a low-quality crop from a JPEG that was then scaled smaller. (ELA shows the JPEG grid.)

Information: The card claims to be from a controlled test group. However, the few times I have seen real documents about test subjects, there has been a name for the group conducting the test. This card doesn't list a hospital, pharmaceutical company, or related research group. It also doesn't list any contact information (address, phone, etc.). These omissions scream "illegitimate" and "hoax".

The card does contain one interesting component: a barcode. (I'm always interested in barcodes!) This QR code only contains a URL: [https://members.vaxcontrolgroup.com/verify/5DJVJMPTYIGDHRYOMIR5TP] With a domain name, we can start digging.

Vax Control Group LLC

If you visit the URL, you'll see a web page that claims this person is a "Vaccine Control Group Participant" and that "it is illegal to discriminate against someone based upon their personal medical choices." Let me start by saying:

Participation in any kind of medical research (such as being in a control group) requires lots of paperwork and authorizations before recruiting test subjects. A legitimate medical study doesn't start by asking for random participants.

In the scientific community, there is a common testing method called an "A/B Test". One group gets a treatment (A) and the other is a control group that doesn't get the treatment (B). Then you see if there is a distinct difference between the two groups. With the covid vaccine, we have a very solid A/B test: the (A) group are vaccinated and the (B) group are unvaccinated. Currently, hospitals are full and the patients are nearly 100% unvaccinated.

As I write this blog post, the CDC reports that 99% of deaths from covid are from unvaccinated people. That's a pretty solid scientific result: If you're vaccinated (the "A" group), then you probably won't die. If you're in the "B" group, then you can end up in the hospital and die. It's a no-brainer here: get vaccinated.

Tracking a web site

I took a deeper dive into their web site: [https://www.vaxcontrolgroup.com/].

According to this domain's registration record (WHOIS), it was registered about two months ago.

Domain Name: vaxcontrolgroup.com
Registry Domain ID: 2608252149_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.ionos.com
Registrar URL: http://ionos.com
Updated Date: 2021-05-31T22:59:14.000Z
Creation Date: 2021-04-28T09:54:26.000Z
Registrar Registration Expiration Date: 2022-04-28T09:54:26.000Z
Registrar: 1&1 IONOS SE

A brand new domain means it isn't established. When you're looking for "things that looks suspicious", this is a red flag. A new site giving questionable advice should be suspect.

The domain was registered through Ionos. I've seen Ionos before; they are often associated with fraud and scam activities.

Evaluating web content

At the bottom of the vaxcontrolgroup's web page are some links:

Privacy Policy: The text was definitely not written by anyone with legal experience. There is no mention of what information they collect, who has access, or how long it is retained. It does mention that they don't use Google Analytics or Facebook, but doesn't mention any other third-party services. These days, there's typically a blurb about GDPR -- especially if they do anything in Europe -- but that's not mentioned here. They also don't mention any cookie policy.

At minimum, they are definitely using cdn.jsdelivr.net, something on CloudFlare, PayPal for donations, and third-party cookies from Twitter. Their own screenshot for registration shows that they use cookies:

Terms and Conditions: Again, very minimal. They forbid crawlers and scrapers, but permit accessing anything on the site. They forbid making a copy or re-using content, but that is pretty meaningless. I'm using copies and re-using their content under Copyright Fair Use: criticism, comment, news reporting, teaching, and research.

Contact: If you click on their link to contact them, they want your name, email, and message. Or you can chat via Telegram.

Support: The contact page links to their support page. It wants similar information to the contact page and provides no additional insight into their organization.

About: This page makes a vague claim about doing research. However, it doesn't mention any medical or research associations. It also says that they will record your "Medical data (including: new and ongoing medical conditions, pregnancy outcomes, fertility etc...), Blood group and Discrimination (participants will be able to record companies that discriminate against them)." One would think that recording medical information would be mentioned in their privacy policy, but one would be wrong.

More telling about this company are the things that are missing:

Where are they located? That is not mentioned anywhere. Their "About" page does mention a "£6" subscription, and their donate page for registration gives amounts in "GBP". This suggests that they are in the UK. However, there is no city, state, or street address listed.

Who runs this? There is no mention of who created this organization, who is doing the "research", or who is involved in it.

At first glance, this looks like a scam site: they are not doing real research and you don't know who's behind it.

The Company

The page footer mentions one interesting thing: it lists the company name as "Control Group Cooperative Ltd, Reg No:13477806". This is easy enough to verify since the UK makes this information public: https://find-and-update.company-information.service.gov.uk/company/13477806.

The company filings show that they were registered on 25 Jun 2021 as "VAXCONTROLGROUP COOPERATIVE LTD" but changed their name to "CONTROL GROUP COOPERATIVE LTD" less than 3 weeks later.

The filings also list their board members. I was looking to see if anyone had any medical research experience. There's a data analyst, author, "company director", and a nurse. However, there's a twitter thread where someone tracked down this same group. He listed her as a "former nurse". There is nobody associated with any actual medical research organization.

The Address

Their registered office address is "117 Dartford Road, Dartford, England, DA1 3EN". Google Street View finds the address easily enough. However, their company name isn't visible. This brick building houses "Europcar" (a car rental company) on the lower floor, and "ADS Accountants" -- mentioned on a big yellow sign on the upper floor. (Next door, at 119 Dartford Road, is the ADS Business Centre, also in yellow. This address also shows up in some of the registration information.)

The accountant's banner on the upper floor gives their web site: [www.adsaccountants.com]. If you visit their web site, they list one of their services as "Company Formations". However, the associated hyperlink for company formations goes to a web site that doesn't exist [www.nationwidesecretarial.com].

I was able to find a UK company that has the same name, "ADS Accountants" (company registration 04172889), but they use a different URL [https://ads-accountants.co.uk/] and have a much slicker web page design. I decided to compare their web sites.

Website	www.adsaccountants.com	ads-accountants.co.uk
Name	ADS Accountants	ADS Accountants (Able Data Services Ltd)
Registration	N/A: I was unable to find their registration record.	04172889: First registered 05 Mar 2001. Updated annually.
Locations	Web site claims: Kent, Essex, London	Web site claims: Kent and Essex
People	Does not list anyone who works there. No staff, no identified accountants, no management, etc.	Does not list anyone who works there. No staff, no identified accountants, no management, etc.
Address	117 Dartford Road, Dartford, Kent, DA1 3EN (seen on billboard at the address)	117 Dartford Road, Dartford, Kent, DA1 3EN (seen on web page)
Nationwide Secretarial	Web site has hyperlinks to [www.nationwidesecretarial.com], but that host does not exist.	Business registration lists their secretary as Nationwide Secretarial Services Ltd at the same Kent address.
Web site last updated	2009 (via HTTP last-modified headers)	2020 and 2021 (via their site map [https://ads-accountants.co.uk/page-sitemap.xml])
Built-by	Bottom of the page says "Powered by XL-2.NET". Hyperlinks goes to an Indonesian online gambling site (NSFW and looks more scam than real)	Bottom of the page says "Built by BULL MEDIA". Hyperlink goes to a site that says they build web sites.
Copyright	Bottom of the page says "2004-2007" (not maintained)	Bottom of the page says "© Able Data Services", but no date

Other accounting offices that I've seen list the management, highest ranking partners, and/or significant accomplishments (customers, cases, testimonials, etc.) that can be used for references. With both of these sites, I see none of that. I don't know who works there and I don't know what they have worked on.

Both of these accounting services also mention "Nationwide Secretarial". However, this is also really odd. The host [www.nationwidesecretarial.com] does not exist. However, if you leave off the "www." (nationwidesecretarial.com) then you go to what appears to be a web-crawler trap. These are usually used to associate lots of random words and phrases with a web site in order to draw more traffic from search engines. In this case, there's no company information. Instead, there are hundreds of links with random hosts and they go to pages with hundreds of links that link to template-based articles.

So we have a company devoted to fake research that is located at the address of very suspicious accounting and secretarial offices. They appear to have two web sites. One is unmaintained, has broken links, and was created by an Indonesian online gambling service. The other has the same name, claims to be at the same street address, but uses a different URL than the one on the billboard outside the building. Both reference a secretarial service that hosts a web crawler trap and none identify anyone who works there.

Unlike the United States, the UK doesn't have a Better Business Bureau. I looked on Yelp and didn't find them listed. There is a site called "192.com" that is kind of like a phonebook. It doesn't find "Vax Control Group" or "Control Group Cooperative". It also doesn't find "ADS Accountants" at the specified street address. However, it does list "A D S Accountants" next door, at 119 Dartford Road, Dartford, Kent. They list the web site as [http://www.adsoutsidebars.com/], which goes to "ADS Outside Bars" (a pub) with a web site created by the same Indonesian gambling service. As far as I can tell, no customers have mentioned ever using ADS Accountants on any public forums. I'm a little surprised that an accounting firm could be around for over 20 years and nobody has ever talked about using their services.

I'm not saying that these accounting firms or the secretarial office are fraudulent. (I've seen plenty of real companies that have badly designed web sites.) I'm saying there are enough red flags that you should quickly run away from them. Despite all of these issues, these highly suspicious companies are where this fake covid research site is registered. I have to wonder: what attracted the fake covid research service to these questionable companies? Could this be a front for something more nefarious?

Feeding the Fraud

Most fraud schemes have a purpose. However, I'm trying to figure out the purpose here.

Nothing on this fake research site is going to change anyone's mind about vaccination. It doesn't provide any statistics, facts, or even arguments for their position.

I can see how they generate revenue. According to this anti-vax service, people who register as an "Associate" member (£4 for an individual, £14 for a family, plus a £6 quarterly fee) will "receive printed plastic cards for you and your dependants" [sic]. (~~While British spelling does different from American English, I believe they still spell "dependents" with "ents" and not "ants".~~ Update: Some brits have written in, pointing out that British English has two spellings for different parts of speech.) Printing and mailing the card certainly costs less than the Associate fee, making it very profitable. There's nothing legal or binding about this card; it's as legitimate as a fake drivers license. At best, it's for entertainment purposes only (but they don't say that).

Beyond the membership fee, they also ask for donations. Perhaps they are just interested in taking money from gullable anti-vaxxers?

One of the Control Group Cooperative Ltd directors has a Twitter account that he uses to continually tweet anti-vax propoganda. By providing his own "research" service, he can use it to give himself more creditibilty. (It looks like a credible reference if nobody looks closely.)

One possilble purpose could be to help drive a wedge into the debate. Whether to vaccinate or not is a very hot topic. The problem is that few of these debates cite any facts. Instead, news outlets are often trumpeting the rare corner cases and not the overall impact. Yes, vaccinated people can get the delta variant, but they don't mention that it's far less than 1%. To put this into context:

If you're unvaccinated, know that covid is as contagious as chicken pox. Every infected person that an unvaccinated person encounters means a 50% chance of infection. The first infected person you encounter gives you a 50% chance of getting infected. The 2nd person becomes 75%. The 3rd person becomes 88%, then 94%, etc. By the time you've been around 10 infected people, it's pretty much a given that you're also infected.

If you're infected, the odds of you needing hospitalization is around 5%-10% (see the risk calculator).

And if you're hospitalized, then there's a good chance that you'll die due to covid. Basically, it's around 1 in 10, but the odds get worse with age, pre-existing conditions, and hospital capacity. (If the hospitals are full and you can't be admitted, then you're probably going to die.)

Think of it this way:

If you go to a party with 100 unvaccinated people and 1 person who didn't know they were infected, then 50 people will be infected, 5 will require hospitalization, and 1 of those will die.

If you go to a party with 100 vaccinated people and 1 person who didn't know they were infected, then 50 people might get infected, but most likely none will require hospitalization or die.

That said, there is one reason why someone might want to sign up for this fake covid research service. Remember: they mail you a plastic card. It doesn't have any legal basis and it is not enforceable, but it's a physical card. There's a restaurant in Huntington Beach, California that only wants unvaccinated diners. If you're really hungry and stupid, then you can take your faux-research plastic card and eat there, along with a room full of other unvaccinated people. (You can definitely do this once. Maybe twice if you're lucky.)

Read more about Forensics, FotoForensics, Image Analysis, Mass Media | Comments (5) | Direct Link

Time Tested

Saturday, 10 July 2021

One of my friends recently had an interesting observation. With all of these advancements in computer software, why can't anyone correctly estimate wait times?

In her example, she was trying to scan in a 4x6 inch (10x15 cm) painting. The scanner's driver gave a message saying that it was using a high resolution and could take up to 6 minutes to complete. It finished after 30 seconds. She does this type of scan every few days; the warning message and actual run-times never change. It never learns that it takes less than 6 minutes.

The status bar starts at 6 minutes. 10 seconds later, it update to 50% and says 3 minutes remain. A few seconds after that, it shows over 90% complete with less than 1 minute remaining. Then it completes. The total actual time is about 30 seconds.

I've seen similar bad time estimates with system updates. My Mac said that a recent update might take 3 minutes. Then it rebooted and the message said it would take about 20 minutes. Then, after nearly 30 minutes, it rebooted again and said it would take 10 minutes -- only to complete after 3 minutes.

Updates with Windows are even more entertaining. I've seen it race from 0% to 98% in seconds, only to sit at 98% for a half hour. (The only way I knew it wasn't hung was that I could hear the hard drive grinding away.)

I guess the real question is: How do they come up with these estimates for completion?

Time Flies

Some applications are better than others when it comes to estimating completion times. The method used by 'wget', Firefox, and Chrome for estimating download times is very simple and pretty accurate. It uses a "moving average".

When a web server returns a file, it usually includes the file's size. The downloading program retrieves the download in chunks. It just needs to keep track of how much has been downloaded and the elapsed time since the download began. These two values provide the average download bitrate (bits per second). Then the bitrate can be applied to the number of bytes left to download for estimating the total time left.

At the beginning of the download, the time estimate is usually inaccurate. As more bytes are downloaded, the bitrate averages out to a pretty accurate estimate. Unless there is a sudden drop in network connectivity, it will probably be off by only a few seconds by the time it hits the 50% mark.

I use the same type of estimate at FotoForensics for computing the expected monthly volume of uploads. The first day of the month is very inaccurate. But after the first week, it's probably going to be in the right range. By the time it hits the last week of the month, it's usually within a few hundred pictures of the final number. (Sure, it can be off by a lot if there is a sudden flashmob, like when members of One Direction release anything. But that doesn't happen every month.)

This summer, I've been playing with an AI deep-learning system called 'darknet'. Darknet uses an even simpler approach: a "fixed average". It performs a quick benchmark at the beginning to see how fast it can run. Then it assumes that the rate holds for the next 5,552 hours. (At least, that's how long it thinks my current training run will take.)

Time Travel

The moving average (and static average) approach works well when there is only one active component. (E.g., only looking at the download rate or at the computations-per-second.) However, software updates require two steps. First comes the files that need to be retrieved or unpacked. Second is the installation and configuration. Some files require almost no additional effort to install and configure; the files just need to be copied into the right places. However, other files change system functionality. They may need to be integrated into a repository or have existing data migrated to a new format. This installation and configuration can vary per system and may take a long time.

If you're only counting the number of files processed or the average time per file size, then you're not counting the additional installation and configuration times. This is likely why Windows updates will race through from 0% to 98%, and then appear to hang; it unpacked the files quickly, but didn't take into account the installation and configuration times that must happen after the new files are moved into place. Macs also have this problem for estimating installation times. It may only take seconds to unpack each file, but it could take minutes to migrate internal database structures or to scan for additional components during the update process.

(What about Linux? Ubuntu's 'apt upgrade' command only tells you how many packages it is processing, with no time estimate. The same goes for SuSE, Debian, and other Linux flavors. Estimating the time to completion is hard, so they don't even try.)

6 Minutes and Counting

These different time estimate systems explain why software downloads have pretty good accuracy, but software updates can appear to be very wrong. And that scanner driver that claims "6 minutes" for a task that always takes 30 seconds? Most likely the developer just took a wild-ass guess (WAG) and intentionally over-estimated.

I often see this kind of over-estimate with online vendors. For example, Etsy is an online marketplace for handmade items. Sellers are allowed to specify the processing time per item. Etsy uses that information to approximate a delivery date for the buyer. An Etsy seller may list the item as processing in 5 days, but it may actually ship after 2 days. Part of this could be padding; sellers cannot give a range based on complexity so they often list the worst-case. However, some of this over-estimating could be intentional: customers give better reviews when they think an item arrived early. If you say it will take 5 days and it arrives in 3, then customers are usually happy that it arrived early.

Ironically, Etsy seems to look for instances of sellers shipping any time other than the estimated shipping date. As one Etsy seller lamented, Etsy doesn't take into account the time needed for some customizations. If sellers repeatedly ship too early, then they may receive a warning from Etsy about needing to adjust the shipping times. If sellers ship too late, then they get a warning about missing shipping dates. And if sellers receive too many warnings, their account could be suspended. (This is what happens when management never took statistics in school and only focuses on the mean, while ignoring the standard deviation.)

Usually having something arrive a little early is good news. But it can be bad if it arrives too early. This is particularly true when trying to coordinate multiple shipments that need to arrive at the same time. For example, this year I ordered yet-another server from Dell:

During the checkout process, Dell's automated shopping cart said that the server was supposed to ship after about 3 weeks. I used this estimate to order all of the other components needed for the server (extra hard drives, proprietary caddies for the drives, and additional network gear). It was all timed to arrive within the same week. I ordered all of the dependent components first, then I ordered the server.

After making the order, Dell surprised me by shipping it the next day. As a result, it arrived almost a month before everything else that was needed to make it work was able to arrive. I had an unusable server sitting on my floor for 3.5 weeks.

Everything else arrived within +/- 1 day of their estimates. Then I tried to install the server. That's when I discovered that it didn't have a hardware RAID. Unfortunately, between waiting for components and trying to debug why the operating system was seeing 4 hard drives and not 2 drives, this took it past the 30-day return period.

I checked with their technical support team. They verified the problem and identified the parts I needed. I ordered the hardware RAID card and RAID cable. (The cable connects the RAID card to the hard drives.) The RAID card arrived on time, but the RAID cable hadn't been shipped. Unfortunately, Dell notified me that they were further delaying the cable's delivery by more than a month. This means that I cannot even test the RAID card until after it is beyond the 30-day return period. And during this time, I am burning through the year-long parts warranty without being able to test the hardware. (No, I'm not happy with Dell right now.) The only saving grace was from Dell's tech support team, who came to my rescue with the necessary cable.

When the RAID cable's shipment slipped out by a month, Dell had me click on a link to confirm that I still wanted it. (It's a custom cable and only they sell it; I have no other option.) Ironically, when the server's shipping time was reduced by 3 weeks, there was no confirmation requirement.

Shipping a little early is good. Shipping weeks early -- without giving notice of a change in shipping -- can be really bad. This is especially true if it's part of a chain of dependent requirements.

6 Minutes or 30 Seconds?

I can understand using a WAG when the overall prediction includes too many complex variables. However, sometimes a WAG is used when the actual timing could be easily approximated.

With the scanner, the developers probably did an estimate based on the slowest hardware they support (e.g., a USB-1 port), without checking the type of USB connection that is in use by the scanner. With a slow USB-1 connection, that volume of scanner data could take minutes. However, USB-2 and USB-3 are significantly faster. The driver appears to never check the type of USB connection used by the scanner. (From a programmer's viewpoint, this is really easy to check.) As a result, the wild-ass guess (WAG) for the scanning time initially assumes "6 minutes".

Sadly, the developer also appears to have put in an arbitrary threshold. If the WAG is above the threshold, then the user sees a warning message that must be click before each scan. Smart software would keep track of the scan rate and update the WAG to something more realistic. Or they could record the user's choice to stop receiving the annoying warning box. But in this case, the initial estimation system never updates. The user is always prompted with a useless warning message that must be acknowledged before the scanner will work. (This fails usability.) If the annoying popup wasn't there, then the user probably wouldn't complain that the initial "6 minutes" estimate was wrong.

This goes back to the original question: how do these application come up with their time estimates?

Some programs do a one-time benchmark and assume the rate holds for the entire duration.

Some programs track the percent of work completed based on a single metric. This works great if there is only one activity (e.g., downloading a file), but can fail if there are other time-consuming tasks (e.g., files unpacked is followed by installation and configuration).

And some just take a wild-ass guess and don't care if they are wrong, or what may be impacted by being wrong.

As a user, you have no idea. Unless you have reason to believe otherwise, just assume that the estimate is grossly inaccurate -- either too high or too low.

Read more about Programming | Comments (2) | Direct Link

Up to your knees in alterations

Tuesday, 29 June 2021

I often receive requests to perform analysis on images. While I usually don't have time for pro bono work, I occasionally get a request that is really interesting and where I can't say 'no'. Earlier this month, I was contacted by a fact-checking group regarding a photo that was suspected of being altered. I didn't know it at the time, but this photo had been going viral all over Asia.

The picture shows Taiwan's President Tsai Ing-wen kneeling before three US senators. The senators (sitting from left to right: Chris Coons, Dan Sullivan, and Tammy Duckworth) had visited Taiwan on 6-June-2021, promising to deliver 750,000 additional COVID vaccines. (Click on the picture to view the analysis page at FotoForensics.)

Knee-Jerk Reaction

Initially I was asked by Annie Lab (a fact-checking group out of Hong Kong University's school of Journalism) whether their interpretation of the error level analysis (ELA) results was correct. Unfortunately, the picture they sent me was quickly detected as a picture-of-a-picture (PoP). Specifically, I could readily determine that it was a screenshot.

The main things I noticed:

There's a thin white line across the top of the picture and a thicker brown line down the left side. Those are edges from a screen capture.

The metadata's EXIF explicitly says this is a "Screenshot".

The metadata includes an ICC Profile generated on a Mac. This isn't just any screenshot; this is a Mac screenshot.

The file format is a PNG. However, ELA shows faint JPEG grids that do not align with JPEG's normal 8x8 or 16x16 grid size; the PNG's grids are larger than 16x16. This means the picture was originally a JPEG and then scaled larger to be rendered on the screen. The lossless PNG format from the screenshot retained all of these artifacts.

What this means: the picture initially used JPEG encoding (or it could be MPEG, like a frame from a video -- both JPEG pictures and MPEG videos use variations of the same general grid-based encoding approach). The picture was scaled larger (more than doubling the size). Then a screen capture application was used to grab a portion of this screen. Besides scaling the picture, applications and manual screenshot selections often crop out edges from the picture. (This picture was definitely scaled and likely cropped.)

Unfortunately, while ELA could detect some of the image handling, it was not informative for detecting any alterations to the base picture's content. Identifying artifacts from alterations in a picture is like trying to pull fingerprints off of a drinking glass:

If only one person touched the glass, then a forensic examiner can probably recover a fingerprint.

If two people touched the glass, then the first person's fingerprints may still be there, or it may be partially distorted by the second person's fingerprints. The first prints could even be completely obscured.

The more people who handle the glass, the less likely it is that an examiner will be able to recover the first person's fingerprints. An analyst will still be able to recover "someone's" fingerprints, but it probably won't be the first person's prints.

This screenshot of a scaled and cropped image is like running the glass through a dishwasher twice, and then having someone with really dirty hands touch every portion of the glass. I could tell everything about the picture from the time it was captured as a screenshot, but I could not detect any alterations made to the people or the room before it was scaled, cropped, and captured.

Knee Pads

The representative from Annie Lab responded by sending me links to the source picture on Twitter. (Note: The Twitter account @DiDi11409116 looks to me like a well-funded propaganda account that generates and spreads disinformation. It certainly does not look like an authoritative news source.)

Twitter resaves all pictures at a low quality and may scale the image. (I grabbed the "orig" version, which omits the scaling.) ELA will only catch extreme edits from Twitter pictures. Fortunately, this picture is definitely extreme:

ELA clearly highlights the legs of the tripod as being digitally altered (added or selective editing). However, there are also some faint attributes that indicate additional alterations:

There is a horizontal black line along the bottom 20% of the picture. It passes just below the mens's feet, below the peak of the tripod's legs, and above the woman's knees. This looks like a blend line, where the bottom 20% was processed differently.

The woman on the far right (US Senator Tammy Duckworth, in blue dress) is sitting in a wheelchair. The bottom of her wheelchair, below the blend line, appears to have been digitally altered.

When describing what to look for with ELA, I always point out that similar textures should have similar ELA coloring. The red carpet above the blend line is darker under ELA than the red carpet below the blend line. This is an inconsistency. The carpet below the blend line has newer edits than the carpet above the line.

In addition to the ELA findings, there are other observable differences:

In my initial review of the screenshot, I had mentioned scaling. The Twitter image is 640x488, while the screenshot is 1632x1230. When viewing a picture on Twitter, Twitter's web page may scale the image larger to fit the screen. Scaling a picture larger by more than 200% can result in a blur that is seen across the entire image.

My initial review of the screenshot also mentioned likely cropping. It's subtle, but the Twitter image has a few rows of extra pixels along the bottom and right edges of the image. These were cropped out by the screenshot.

The Bee's Knees

Annie Lab sent me one other link: A picture of the same event at the Associated Press:

This is the same picture, but without the bottom 20% of the image. Everything below the Twitter version's blend line is missing from this picture.

The metadata shows that the picture was post-processed by an Adobe product. (This is expected; the image contains the AP's watermarks and metadata annotations.) However, it also says that the image came directly from the Taiwan Presidential Office.

Knee Deep

I couldn't find this exact picture on the President of Taiwan's official web page. However, other pictures on their photo album page contain metadata identifying Flickr. It wasn't hard to find the President of Taiwan's Flickr stream. I found the source picture from 2021-06-06.

Flickr has a couple of download options, but one of them includes "original". The "original" is the source picture as it was uploaded to Flickr. I grabbed a copy for analysis:

The ELA result is not what I expect from a camera-original photo; there's too much coloring on the floor and furniture. There's also some rainbowing (separation of the chrominance red and blue), which is common with post-processing on an Adobe product.

In this case, the metadata is awesome. It identifies the camera as a Canon EOS 5D Mark III. It says the file was originally named "436A6884.CR2". The CR2 format is Canon's Camera Raw. Raw files must be converted to JPEG before they can be used on the web. Someone used Adobe Photoshop Lightroom Classic 10.2 on a Mac to convert the CR2 to JPEG. They also performed a lot of color alterations; adjusting color curves, tones, and a little selective sharpening. (This explains all of the coloring seen in the ELA result.)

While this is still not a camera-original picture, it is still very close to original. In addition, it pre-dates the altered Twitter version by more than a day.

Now we know: The President of Taiwan met with 3 US Senators on 6-June-2021. A photo was taken, showing her standing at a podium, but not showing her feet. Someone then took this picture and digitally extended the bottom, adding bent knees that give the impression of her kneeling. In order to extend the floor, they also had to add in the bottom of the wheelchair. Finally, they added in the feet of the tripod.

In the Twitter picture, there is a black blob that appears to be on the floor in front of the tripod. In this Flicker version, it's clear that the blob is the tripod adjustment knob.

(This isn't all that you can tell about the Twitter version. With other tools and careful observation, you may also notice inconsistent shadows on the new carpet, inconsistent blurring and focus, replication in the additional carpet, and a change to the carpet's overall pattern that coincides with the detected blend line.)

Knee Slapper

My tools can identify image handling and alterations. However, the folks at Annie Lab took source tracking to the next level. In their report, they found the source image that was used for the bent knees. (That is some incredible forensics!) They also spotted other edits that were too small for ELA to detect at this resolution, like changes to the nameplates. Someone went through a great deal of effort to make these alterations before posting them to Twitter.

In the original photo, President Tsai Ing-wen is photographed standing before the sitting US Senators. She clearly has their attention and respect. In the propaganda photo, someone digitally added bent legs, portraying President Tsai Ing-wen as subservient or weak before the US Senators. However, by exposing the kind of effort that went into developing the forgery, it gives the impression that the meeting was far more impactful that originally suggested. (If it wasn't a powerful message, then the propagandist would not have made the effort to make the President look weaker.)

I'd like to thank Annie Lab for sharing this picture with me and for incorporating some of my findings in their report.

Read more about Forensics, FotoForensics, Image Analysis, Mass Media | Comments (2) | Direct Link

(Page 1 of 184, totaling 920 entries) next page »

Jul	AUG	Sep
	20
2020	2021	2022

About

Popular Posts

Tools

Links

Calendar

Archives

Feeds

Categories