Planet Topic Maps

October 25, 2016

Patrick Durusau

Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Testing my “dirty” script against the Podesta Emails (1.7 GB, some 17,296 files), I got the following message:

Traceback (most recent call last):
  File "", line 20, in <module>
    date = dateutil.parser.parse(msg['date'])
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.


Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

Key sentence: os.listdir(path) Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.

Arbitrary order. :)

Interesting but not quite an actionable answer!

Take a look:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))
sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)


So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file; the next file in sorted order after the last entry recorded is where the script failed.
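To see why sorted order makes the failing file identifiable, here is a toy sketch (invented filenames, no email parsing): with a deterministic order, the last name printed before the traceback is the culprit.

```python
import glob
import os
import tempfile

# Toy stand-in for the email directory (assumption: invented filenames).
workdir = tempfile.mkdtemp()
for n in (3, 1, 2):
    with open(os.path.join(workdir, "%d.eml" % n), "w") as f:
        f.write("Date: bogus\n")

# sorted() gives a deterministic order, so the last filename printed
# before a crash identifies the file the parser choked on.
basenames = [os.path.basename(p)
             for p in sorted(glob.glob(os.path.join(workdir, "*.eml")))]
print(basenames)
```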

More on the files that failed in a separate post.

by Patrick Durusau at October 25, 2016 09:26 PM

Monetizing Twitter Trolls

Alex Hern‘s coverage of Twitter’s fail-to-sell story, Did trolls cost Twitter $3.5bn and its sale?, is a typical short-on-facts story about abuse on Twitter.

When I say short on facts, I don’t deny any of the anecdotal accounts of abuse on Twitter and other social media.

Here’s the data problem with abuse at Twitter:

As of May 2016, Twitter had 310 million monthly active users across more than 1.3 billion accounts created.

Number of Twitter users who are abusive (trolls): unknown

Number of Twitter users who are victims: unknown

Number of abusive tweets, daily/weekly/monthly: unknown

Type/frequency of abusive tweets, language, images, disclosure: unknown

Costs to effectively control trolls: unknown

Trolls and abuse should be opposed both at Twitter and elsewhere, but without supporting data, creating corporate priorities and revenues to effectively block (not end, block) abuse isn’t possible.

Since troll hunting at present is a drain on the bottom line with no return for Twitter, what if Twitter were to monetize its trolls?

That is, create a mechanism whereby trolls become the drivers of a revenue stream for Twitter.

One such approach would be to throw off all the filtering that Twitter does as part of its basic service. With Twitter basic service, you will see posts from everyone from committed jihadists to the Federal Reserve. No blocked accounts, no deleted accounts, etc.

Twitter removes material under direct court order only. Put the burden and expense of going to court, for every tweet, on individuals and governments alike. No exceptions.

Next, Twitter creates the Twitter+ account, where for an annual fee, users can access advanced filtering that includes blocking people, language, image analysis of images posted to them, etc.

Price point experiments should set the fees for Twitter+ accounts. Filtering decisions will then be based on real revenue numbers, not flights of fancy by the Guardian or Salesforce.

BTW, the open Twitter I suggest creates more eyes for ads, which should also improve the bottom line at Twitter.

An “open” Twitter will attract more trolls and drive more users to Twitter+ accounts.

Twitter trolls generate the revenue to fight them.

I rather like that.


by Patrick Durusau at October 25, 2016 03:38 PM

Clinton/Podesta Emails, Dirty Data, Dirty Script For Testing

Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download, Podesta Emails Zipped, the bulk downloads don’t appear to have attracted a lot of attention. Some 276 views as of today.

Many of us deeply appreciate Michael’s efforts and would like to see the press and others taking fuller advantage of this remarkable resource.

To encourage you in that direction, what follows is a very dirty script for testing the DKIM signatures in the emails and extracting data from the emails for writing to a “|” delimited file.


import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')

output.write("id|verified|date|from|to|subject|message-id\n")

for name in glob.glob('*.eml'):
    filename = name
    f = open(filename, 'r')
    data = f.read()
    f.close()
    msg = email.message_from_string(data)

    verified = dkim.verify(data)

    date = dateutil.parser.parse(msg['date'])

    msg_from = msg['from']
    msg_from1 = " ".join(msg_from.split())
    msg_to = str(msg['to'])
    msg_to1 = " ".join(msg_to.split())
    msg_subject = str(msg['subject'])
    msg_subject1 = " ".join(msg_subject.split())
    msg_message_id = msg['message-id']

    output.write(filename + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from1 + '|' + msg_to1 + '|' + msg_subject1 +
                 '|' + str(msg_message_id) + "\n")

output.close()


Download podesta-test.tar.gz, unpack that to a directory, save the script to the same directory, and run it.


Import that into Gnumeric and with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.
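If a spreadsheet isn’t handy, the pipe-delimited output also reads back cleanly with Python’s csv module. A sketch, using an invented two-line sample in place of verify.txt:

```python
import csv
import io

# Invented sample standing in for the verify.txt the script writes.
sample = io.StringIO(
    "id|verified|date|from|to|subject|message-id\n"
    "1.eml|True|2016-10-01 09:00:00|a@example.com|b@example.com|Hi|<1@x>\n"
)
rows = list(csv.DictReader(sample, delimiter="|"))
# Rows whose DKIM check failed are easy to pull out for a second look.
failed = [r["id"] for r in rows if r["verified"] != "True"]
print(len(rows), failed)
```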

Verifying cryptographic signatures takes a moment, even on this sample of 754 files, so don’t be impatient.

This script leaves much to be desired and as you can see, the results aren’t perfect by any means.

Comments and/or suggestions welcome!

This is just the first step in extracting information from this data set that could be used with similar data sets.

For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?

How are you going to model those relationships?
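One answer, sketched below with a hypothetical helper: normalize each address (strip the display name, lowercase) so repeated participants collapse into a single node, then derive a stable ID from the normalized form.

```python
import hashlib

def node_id(raw):
    # Hypothetical helper: strip any display name ("John <j@x>" -> "j@x"),
    # lowercase, then hash for a short, stable graph-node ID.
    addr = raw.split("<")[-1].rstrip(">").strip().lower()
    return hashlib.sha1(addr.encode("utf-8")).hexdigest()[:12]

# The same person under two spellings maps to one node.
a = node_id("John Podesta <John@Example.com>")
b = node_id("john@example.com")
print(a == b)
```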

Bonus question: Is this output clean enough to run the script on the full data set, which is growing daily?

by Patrick Durusau at October 25, 2016 02:05 AM

October 23, 2016

Patrick Durusau

Data Science for Political and Social Phenomena [Special Interest Search Interface]

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

That is an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

by Patrick Durusau at October 23, 2016 08:53 PM

Boosting (in Machine Learning) as a Metaphor for Diverse Teams [A Quibble]

Boosting (in Machine Learning) as a Metaphor for Diverse Teams by Renee Teate.

Renee’s summary:

tl;dr: Boosting ensemble algorithms in Machine Learning use an approach that is similar to assembling a diverse team with a variety of strengths and experiences. If machines make better decisions by combining a bunch of “less qualified opinions” vs “asking one expert”, then maybe people would, too.

Very much worth your while to read at length but to setup my quibble:

What a Random Forest does is build up a whole bunch of “dumb” decision trees by only analyzing a subset of the data at a time. A limited set of features (columns) from a portion of the overall records (rows) is used to generate each decision tree, and the “depth” of the tree (and/or size of the “leaves”, the number of examples that fall into each final bin) is limited as well. So the trees in the model are “trained” with only a portion of the available data and therefore don’t individually generate very accurate classifications.

However, it turns out that when you combine the results of a bunch of these “dumb” trees (also known as “weak learners”), the combined result is usually even better than the most finely-tuned single full decision tree. (So you can see how the algorithm got its name – a whole bunch of small trees, somewhat randomly generated, but used in combination is a random forest!)
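The ensemble effect Renee describes can be illustrated without any ML library (a simulation, not her code): if each “weak learner” is right 70% of the time independently, a majority vote of 51 of them is right almost always.

```python
import random

random.seed(42)  # deterministic run

def weak_vote(truth, accuracy=0.7):
    # A "weak learner": correct with probability `accuracy`.
    return truth if random.random() < accuracy else 1 - truth

def ensemble_vote(truth, n=51):
    # Majority vote over n independent weak learners.
    votes = sum(weak_vote(truth) for _ in range(n))
    return 1 if votes > n / 2 else 0

trials = 1000
accuracy = sum(ensemble_vote(1) == 1 for _ in range(trials)) / trials
print(accuracy)  # far above the 0.7 of any single learner
```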

All true but “weak learners” in machine learning are easily reconfigured, combined with different groups of other “weak learners,” or even discarded.

None of which is true for people who are hired to be part of a diverse team.

I don’t mean to discount Renee’s metaphor because I think it has much to recommend it, but diverse “weak learners” make poor decisions too.

Don’t take my word for it, watch the 2016 congressional election results.

Be sure to follow Renee on @BecomingDataSci. I’m interested to see how she develops this metaphor and where it leads.


by Patrick Durusau at October 23, 2016 07:50 PM

Twitter Logic: 1 call on Github v. 885,222 calls on Twitter

Chris Albon’s collection of 885,222 tweets (ids only) for the third presidential debate of 2016 proves bad design decisions aren’t only made inside the Capital Beltway.

Under Twitter’s terms of service, Chris could not post his tweet collection, only the tweet ids.

The terms of service reference the Developer Policy and under that policy you will find:

F. Be a Good Partner to Twitter

1. Follow the guidelines for using Tweets in broadcast if you display Tweets offline.

2. If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.

a. You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.

b. Any Content provided to third parties via non-automated file download remains subject to this Policy.
…(emphasis added)

Just to be clear, I find Twitter extremely useful for staying current on CS research topics and think developers should be “…good partners to Twitter.”

However, Chris is prohibited from posting a data set of 885,222 tweets on GitHub, where users could download it with no impact on Twitter; instead, every user who wants to explore that data set must submit 885,222 requests to Twitter’s servers.

Having one hit on Github for 885,222 tweets versus 885,222 on Twitter servers sounds like being a “good partner” to me.

Multiply that by all the researchers who are building Twitter data sets and the drain on Twitter resources grows without any benefit to Twitter.

It’s true that someday Twitter might be able to monetize references to its data collections, but server and bandwidth expenses are present line items in their budget.

Enabling the distribution of full tweet datasets is one step towards improving their bottom line.

PS: Please share this with anyone you know at Twitter. Thanks!

by Patrick Durusau at October 23, 2016 06:24 PM

Political Noise Data (Tweets From 3rd 2016 Presidential Debate)

Chris Albon has collected data on 885,222 debate tweets from the third Presidential Debate of 2016.

As you can see from the transcript, it wasn’t a “debate” in any meaningful sense of the term.

The quality of tweets about that debate is equally questionable.

However, the people behind those tweets vote, buy products, click on ads, etc., so despite my title description as “political noise data,” it is important political noise data.

To conform to Twitter terms of service, Chris provides the relevant tweet ids and a script to enable construction of your own data set.
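Rebuilding the data set from ids means batching: Twitter’s statuses/lookup endpoint accepted at most 100 ids per request at the time (an assumption about the era’s API limits), so 885,222 ids works out to roughly 8,853 calls. A sketch of the chunking, with the actual API call left out:

```python
def chunks(ids, size=100):
    # statuses/lookup took up to 100 ids per request, so batch accordingly.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

ids = list(range(885222))  # stand-in for the real tweet ids
batches = list(chunks(ids))
print(len(batches), len(batches[0]), len(batches[-1]))
```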

BTW, Chris includes his Twitter mining scripts.


by Patrick Durusau at October 23, 2016 05:49 PM

Validating Wikileaks Emails [Just The Facts]

A factual basis for reporting on alleged “doctored” or “falsified” emails from Wikileaks has emerged.

Now to see if the organizations and individuals responsible for repeating those allegations, some 260,000 times, will put their doubts to the test.

You know where my money is riding.

If you want to verify the Podesta emails or other email leaks from Wikileaks, consult the following resources.

Yes, we can validate the Wikileaks emails by Robert Graham.

From the post:

Recently, WikiLeaks has released emails from Democrats. Many have repeatedly claimed that some of these emails are fake or have been modified, that there’s no way to validate each and every one of them as being true. Actually, there is, using a mechanism called DKIM.

DKIM is a system designed to stop spam. It works by verifying the sender of the email. Moreover, as a side effect, it verifies that the email has not been altered.

Hillary’s team uses “”, which has DKIM enabled. Thus, we can verify whether some of these emails are true.

Recently, in response to a leaked email suggesting Donna Brazile gave Hillary’s team early access to debate questions, she defended herself by suggesting the email had been “doctored” or “falsified”. That’s not true. We can use DKIM to verify it.

Bob walks you through validating a raw email from Wikileaks with the DKIM verifier plugin for Thunderbird, and demonstrates that the same process can detect “doctored” or “falsified” emails.

Bob concludes:

I was just listening to ABC News about this story. It repeated Democrat talking points that the WikiLeaks emails weren’t validated. That’s a lie. This email in particular has been validated. I just did it, and shown you how you can validate it, too.

Btw, if you can forge an email that validates correctly as I’ve shown, I’ll give you 1-bitcoin. It’s the easiest way of solving arguments whether this really validates the email — if somebody tells you this blogpost is invalid, then tell them they can earn about $600 (current value of BTC) proving it. Otherwise, no.

BTW, Bob also points to:

Here’s Cryptographic Proof That Donna Brazile Is Wrong, WikiLeaks Emails Are Real by Luke Rosiak, which includes this Python code to verify the emails:



Verifying Wikileaks DKIM-Signatures by teknotus, offers this manual approach for testing the signatures:


But those are all one-off methods and there are thousands of emails.

The post by teknotus goes on:

Preliminary results

I only got signature validation on some of the emails I tested initially but this doesn’t necessarily invalidate them as invisible changes to make them display correctly on different machines done automatically by browsers could be enough to break the signatures. Not all messages are signed. Etc. Many of the messages that failed were stuff like advertising where nobody would have incentive to break the signatures, so I think I can safely assume my test isn’t perfect. I decided at this point to try to validate as many messages as I could so that people researching these emails have any reference point to start from. Rather than download messages from wikileaks one at a time I found someone had already done that for the Podesta emails, and uploaded zip files to

Emails 1-4160
Emails 4161-5360
Emails 5361-7241
Emails 7242-9077
Emails 9078-11107

It only took me about 5 minutes to download all of them. Writing a script to test all of them was pretty straightforward. The program dkimverify just calls a python function to test a message. The tricky part is providing context, and making the results easy to search.

Automated testing of thousands of messages

It’s up on Github

Its main output is a spreadsheet with test results, and some metadata from the message being tested. Results Spreadsheet 1.5 Megs

It has some significant bugs at the moment. For example Unicode isn’t properly converted, and spreadsheet programs think the Unicode bits are formulas. I also had to trap a bunch of exceptions to keep the program from crashing.
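A batch harness along those lines is short. In this sketch the `verify` function is a stand-in (the real script calls dkim.verify on the raw message bytes), and results are written pipe-delimited for easy searching:

```python
import csv
import email
import io

def verify(raw):
    # Stand-in check: the real test is dkim.verify(raw) from the dkim package.
    return b"DKIM-Signature:" in raw

def batch(messages, out):
    # Test each message and record the result with a little metadata,
    # so researchers have a reference point to start from.
    writer = csv.writer(out, delimiter="|")
    writer.writerow(["id", "verified", "from", "subject"])
    for name, raw in messages:
        msg = email.message_from_bytes(raw)
        writer.writerow([name, verify(raw), msg["from"], msg["subject"]])

raw = b"DKIM-Signature: v=1\nFrom: a@example.com\nSubject: Hi\n\nBody\n"
out = io.StringIO()
batch([("1.eml", raw)], out)
print(out.getvalue().splitlines()[1])
```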

Warning: I have difficulty opening the verify.xlsx file. In Calc, Excel and in a CSV converter. Teknotus reports it opens in LibreOffice Calc, which just failed to install on an older Ubuntu distribution. Sing out if you can successfully open the file.

Journalists: Are you going to validate Podesta emails that you cite? Or that others claim are false/modified?

by Patrick Durusau at October 23, 2016 01:27 AM

October 22, 2016

Patrick Durusau

Python and Machine Learning in Astronomy (Rejuvenate Your Emotional Health)

Python and Machine Learning in Astronomy (Episode #81) (Jake VanderPlas)

From the webpage:

The advances in Astronomy over the past century are both evidence of and confirmation of the highest heights of human ingenuity. We have learned by studying the frequency of light that the universe is expanding. By observing the orbit of Mercury that Einstein’s theory of general relativity is correct.

It probably won’t surprise you to learn that Python and data science play a central role in modern day Astronomy. This week you’ll meet Jake VanderPlas, an astrophysicist and data scientist from University of Washington. Join Jake and me while we discuss the state of Python in Astronomy.

Links from the show:

Jake on Twitter: @jakevdp

Jake on the web:

Python Data Science Handbook:

Python Data Science Handbook on GitHub:

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data:

PyData Talk:

eScience Institute: @UWeScience

Large Synoptic Survey Telescope:

AstroML: Machine Learning and Data Mining for Astronomy:

Astropy project:

altair package:

If your social media feeds have been getting you down, rejoice! This interview with Jake VanderPlas covers Python, machine learning and astronomy.

Nary a mention of current social dysfunction around the globe!

Replace an hour of TV this weekend with this podcast. (Or more hours with others.)

Not only will you have more knowledge, you will be in much better emotional shape to face the coming week!

by Patrick Durusau at October 22, 2016 03:11 PM

Validating Wikileaks/Podesta Emails

A quick heads up that Robert Graham is working on:


While we wait for that post to appear at Errata Security, you should also take a look at DomainKeys Identified Mail (DKIM).

From the homepage:

DomainKeys Identified Mail (DKIM) lets an organization take responsibility for a message that is in transit. The organization is a handler of the message, either as its originator or as an intermediary. Their reputation is the basis for evaluating whether to trust the message for further handling, such as delivery. Technically DKIM provides a method for validating a domain name identity that is associated with a message through cryptographic authentication.

In particular, review RFC 5585 DomainKeys Identified Mail (DKIM) Service Overview. T. Hansen, D. Crocker, P. Hallam-Baker. July 2009. (Format: TXT=54110 bytes) (Status: INFORMATIONAL) (DOI: 10.17487/RFC5585), which notes:

2.3. Establishing Message Validity

Though man-in-the-middle attacks are historically rare in email, it is nevertheless theoretically possible for a message to be modified during transit. An interesting side effect of the cryptographic method used by DKIM is that it is possible to be certain that a signed message (or, if l= is used, the signed portion of a message) has not been modified between the time of signing and the time of verifying. If it has been changed in any way, then the message will not be verified successfully with DKIM.

In a later tweet, Bob notes the “DKIM verifier” add-on for Thunderbird.

Any suggestions on scripting DKIM verification for the Podesta emails?

That level of validation may be unnecessary since after more than a week of “…may be altered…,” not one example of a modified email has surfaced.

Some media outlets will keep repeating the “…may be altered…” chant, along with attribution of the DNC hack to Russia.

Noise, but it is a way to select candidates for elimination from your news feeds.

by Patrick Durusau at October 22, 2016 12:49 AM

October 21, 2016

Patrick Durusau

Guide to Making Search Relevance Investments, free ebook

Guide to Making Search Relevance Investments, free ebook

Doug Turnbull writes:

How well does search support your business? Are your investments in smarter, more relevant search, paying off? These are business-level questions, not technical ones!

After writing Relevant Search we find ourselves helping clients evaluate their search and discovery investments. Many invest far too little, or struggle to find the areas to make search smarter, unsure of the ROI. Others invest tremendously in supposedly smarter solutions, but have a hard time justifying the expense or understanding the impact of change.

That’s why we’re happy to announce OpenSource Connections’ official search relevance methodology!

The free ebook? Guide to Relevance Investments.

I know, I know, the title is an interest killer.

Think Search ROI. Not something you hear about often but it sounds attractive.

Runs 16 pages and is a blessed relief from the “data has value (unspecified)” mantras.

Search and investment in search is a business decision and this guide nudges you in that direction.

What you do next is up to you.


by Patrick Durusau at October 21, 2016 03:36 AM

October 20, 2016

Patrick Durusau

Every Congressional Research Service Report – 8,000+ and growing!

From the homepage:

We’re publishing reports by Congress’s think tank, the Congressional Research Service, which provides valuable insight and non-partisan analysis of issues of public debate. These reports are already available to the well-connected — we’re making them available to everyone for free.

From the about page:

Congressional Research Service reports are the best way for anyone to quickly get up to speed on major political issues without having to worry about spin — from the same source Congress uses.

CRS is Congress’ think tank, and its reports are relied upon by academics, businesses, judges, policy advocates, students, librarians, journalists, and policymakers for accurate and timely analysis of important policy issues. The reports are not classified and do not contain individualized advice to any specific member of Congress. (More: What is a CRS report?)

Until today, CRS reports were generally available only to the well-connected.

Now, in partnership with a Republican and Democratic member of Congress, we are making these reports available to everyone for free online.

A coalition of public interest groups, journalists, academics, students, some Members of Congress, and former CRS employees have been advocating for greater access to CRS reports for over twenty years. Two bills in Congress to make these reports widely available already have 10 sponsors (S. 2639 and H.R. 4702, 114th Congress) and we urge Congress to finish the job.

This website shows Congress one vision of how it could be done.

What does the site include? It includes 8,255 CRS reports. The number changes regularly.

It’s every CRS report that’s available on Congress’s internal website.

We redact the phone number, email address, and names of virtually all the analysts from the reports. We add disclaimer language regarding copyright and the role CRS reports are intended to play. That’s it.

If you’re looking for older reports, our good friends at may have them.

We also show how much a report has changed over time (whenever CRS publishes an update), provide RSS feeds, and we hope to add more features in the future. Help us make that possible.

To receive an email alert for all new reports and new reports in a particular topic area, use the RSS icon next to the topic area titles and a third-party service, like IFTTT, to monitor the RSS feed for new additions.

This is major joyful news for policy wonks and researchers everywhere.

A must bookmark and contribute to support site!

My joy was alloyed by the notice:

We redact the phone number, email address, and names of virtually all the analysts from the reports. We add disclaimer language regarding copyright and the role CRS reports are intended to play. That’s it.

The privileged, who get the CRS reports anyway, have that information?

What is the value in withholding it from the public?

Support the project, but let’s put the public on an even footing with the privileged, shall we?

by Patrick Durusau at October 20, 2016 01:20 AM

The Podesta Emails [In Bulk]

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:


and, advanced:


As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!


by Patrick Durusau at October 20, 2016 12:53 AM

October 19, 2016

Patrick Durusau

#Truth2016 – The year when truth “interfered” with a democratic election.

Unless you have been in solitary confinement or a medically induced coma for the last several weeks, you are aware that Wikileaks has been accused of “interfering” with the 2016 US presidential election.

The crux of that complaint is the release by Wikileaks of a series of emails collectively known as the Podesta Emails, which are centered on the antics of Hillary Clinton and her crew as she runs for the presidency.

The untrustworthy who made these accusations include the Department of Homeland Security and the Office of the Director of National Intelligence. Their no-facts-revealed statement, Joint Statement from the Department Of Homeland Security and Office of the Director of National Intelligence on Election Security, makes the claim of interference but does not substantiate it.

The cry of “interference” has been taken up by an uncritical media and echoed by President Barack Obama.

There’s just one problem.

We know who was sent the emails in question and despite fanciful casting of doubt on their accuracy, out of hundreds of participants, not one, nary one, has stepped forward with an original email to prove these are false.

Simple enough to ask some third-party expert to retrieve the emails in question from a server and then to compare to the Wikileaks releases.

But I have heard of no moves in that direction.

Have you?

The crux of the current line by the US government is that truthful documents may influence the coming presidential election. In a direction they don’t like.

Think about that for a moment: Truthful documents (in the sense of accuracy) interfering with a democratic election.

That makes me wonder what definition of “democratic” that Clinton, Obama and the media must share?

Not anything I would recognize as a democracy. You?

by Patrick Durusau at October 19, 2016 08:27 PM

S20-211a Hebrew Bible Technology Buffet – November 20, 2016 (save that date!)

S20-211a Hebrew Bible Technology Buffet

From the webpage:

On Sunday, November 20th 2016, from 1:00 PM to 3:30 PM, GERT will host a session with the theme “Hebrew Bible Technology Buffet” at the SBL Annual Meeting in room 305 of the Convention Center. Barry Bandstra of Hope College will preside.

The session has four presentations:

Presentations will be followed by a discussion session.

You will need to register for the Annual Meeting to attend the session.

Assuming they are checking “badges” to make sure attendees have registered. Registration is very important to those who “foster” biblical scholarship by comping travel and rooms for their close friends.

PS: The website reports non-member registration is $490.00. I would like to think that is a mis-print but I suspect it’s not.

That’s one way to isolate yourself from an interested public. By way of contrast, snail-mail Biblical Greek courses in the 1890s had tens of thousands of subscribers. When academics complain of being marginalized, use this as an example of self-marginalization.

by Patrick Durusau at October 19, 2016 12:10 AM

October 18, 2016

Patrick Durusau

Threatening the President: A Signal/Noise Problem

Even if you can’t remember why the pointy end of a pencil is important, you too can create national news.

This bit of noise reminded me of an incident when I was in high school, where a similar type bragged in a local bar about assassinating then-President Nixon. He was arrested and sentenced to several years in prison.

At the time I puzzled briefly over the waste of time and effort in such a prosecution and then promptly forgot it.

Until this incident with the overly “clever” Trump supporter.

To get us off on the same foot:

18 U.S. Code § 871 – Threats against President and successors to the Presidency

(a) Whoever knowingly and willfully deposits for conveyance in the mail or for a delivery from any post office or by any letter carrier any letter, paper, writing, print, missive, or document containing any threat to take the life of, to kidnap, or to inflict bodily harm upon the President of the United States, the President-elect, the Vice President or other officer next in the order of succession to the office of President of the United States, or the Vice President-elect, or knowingly and willfully otherwise makes any such threat against the President, President-elect, Vice President or other officer next in the order of succession to the office of President, or Vice President-elect, shall be fined under this title or imprisoned not more than five years, or both.

(b) The terms “President-elect” and “Vice President-elect” as used in this section shall mean such persons as are the apparent successful candidates for the offices of President and Vice President, respectively, as ascertained from the results of the general elections held to determine the electors of President and Vice President in accordance with title 3, United States Code, sections 1 and 2. The phrase “other officer next in the order of succession to the office of President” as used in this section shall mean the person next in the order of succession to act as President in accordance with title 3, United States Code, sections 19 and 20.

Commonplace threatening letters, calls, etc., aren’t documented for the public but President Barack Obama has a Wikipedia page devoted to the more significant ones: Assassination threats against Barack Obama.

Just as no one knows you are a dog on the internet, no one can tell by looking at a threat online if you are still learning how to use a pencil or are a more serious opponent.

Leaving to one side that a truly serious opponent lets actions, not announcements, reveal their presence or goal.

The treatment of even idle bar threats as serious is an attempt to improve the signal-to-noise ratio:

In analog and digital communications, signal-to-noise ratio, often written S/N or SNR, is a measure of signal strength relative to background noise. The ratio is usually measured in decibels (dB) using a signal-to-noise ratio formula. If the incoming signal strength in microvolts is Vs, and the noise level, also in microvolts, is Vn, then the signal-to-noise ratio, S/N, in decibels is given by the formula: S/N = 20 log10(Vs/Vn)

If Vs = Vn, then S/N = 0. In this situation, the signal borders on unreadable, because the noise level severely competes with it. In digital communications, this will probably cause a reduction in data speed because of frequent errors that require the source (transmitting) computer or terminal to resend some packets of data.
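The formula is easy to check numerically. A quick sketch in Python (the voltage values are invented examples):

```python
import math

def snr_db(v_signal, v_noise):
    """Signal-to-noise ratio in decibels for voltage measurements."""
    return 20 * math.log10(v_signal / v_noise)

# A signal ten times the noise gives an S/N of 20 dB.
print(snr_db(10.0, 1.0))   # 20.0

# If Vs = Vn, S/N = 0 dB: the signal borders on unreadable.
print(snr_db(5.0, 5.0))    # 0.0
```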

I’m guessing the reasoning is the more threats that go unspoken, the less chaff the Secret Service has to winnow in order to uncover viable threats.

One assumes they discard physical mail with return addresses of prisons, mental hospitals, etc., or at most request notice of the release of such people from state custody.

Beyond that, they don’t appear to be too picky about credible threats, noting that in one case an unspecified “death ray” was going to be used against President Obama.

The EuroNews description of that case must be shared:

Two American men have been arrested and charged with building a remote-controlled X-ray machine intended for killing Muslims and other perceived enemies of the U.S.

Following a 15-month investigation launched in April 2012, Glenn Scott Crawford and Eric J. Feight are accused of developing the device, which the FBI has described as “mobile, remotely operated, radiation emitting and capable of killing human targets silently and from a distance with lethal doses of radiation”.

Sure, right. I will post a copy of the 67-page complaint, which uses terminology rather loosely, to say the least, in a day or so. Suffice it to say that the defendants never acquired a source for the radiation the device needed.

On the order of having a complete nuclear bomb but no nuclear material to arm it. You would be in more danger from the conventional explosive degrading than from the bomb as a nuclear weapon.

Those charged with defending public officials want to deter the making of threats, so as to improve the signal/noise ratio.

The goal of those attacking public officials is a signal/noise ratio of exactly 0.0.

Viewing threats from an information science perspective suggests various strategies for either side. (Another dividend of studying information science.)

*They did find a good picture of Nixon for the White House page. Doesn’t look as much like a weasel as he did in real life. Gimp/Photoshop you think?

by Patrick Durusau at October 18, 2016 11:27 PM

How To Read: “War Goes Viral” (with caution, propaganda ahead)


War Goes Viral – How social media is being weaponized across the world by Emerson T. Brooking and P. W. Singer.

One of the highlights of the post reads:

Perhaps the greatest danger in this dynamic is that, although information that goes viral holds unquestionable power, it bears no special claim to truth or accuracy. Homophily all but ensures that. A multi-university study of five years of Facebook activity, titled “The Spreading of Misinformation Online,” was recently published in Proceedings of the National Academy of Sciences. Its authors found that the likelihood of someone believing and sharing a story was determined by its coherence with their prior beliefs and the number of their friends who had already shared it—not any inherent quality of the story itself. Stories didn’t start new conversations so much as echo preexisting beliefs.

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

Ooooh, “…’truth’ becomes a matter of emotional resonance.”

That is always true but, giving the authors their due, “War Goes Viral” is a masterful piece of propaganda to the contrary.

Calling something “propaganda,” or “media bias” is easy and commonplace.

Let’s do the hard part and illustrate why that is the case with “War Goes Viral.”

The tag line:

How social media is being weaponized across the world

preps us to think:

Someone or some group is weaponizing social media.

So before even starting the article proper, we are prepared to be on the look out for the “bad guys.”

The authors are happy to oblige with #AllEyesOnISIS, first paragraph, second sentence. “The self-styled Islamic State…” appears in the second paragraph and ISIS in the third paragraph. Not much doubt who the “bad guys” are at this point in the article.

Listing only each change of actor (“bad guys” in red), the article from start to finish names:

  • Islamic State
  • Russia
  • Venezuela
  • China
  • U.S. Army training to combat “bad guys”
  • Israel – neutral
  • Islamic State (Hussain)

The authors leave you with little doubt who they see as the “bad guys,” a one-sided view of propaganda and social media in particular.

For example, there is:

No mention of Voice of America (VOA), perhaps one of the longest running, continuous disinformation campaigns in history.

No mention of Pentagon admits funding online propaganda war against Isis.

No mention of any number of similar projects and programs which weren’t constructed with an eye on “truth and accuracy” by the United States.

The treatment here is as one-sided as the “weaponized” social media of which the authors complain.

Not that the authors are lacking in skill. They piggyback their own slant onto The Spreading of Misinformation Online:

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

How much of that is supported by The Spreading of Misinformation Online?

  • First sentence
  • Second sentence
  • Both sentences

The answer is:

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.”

The remainder of that paragraph was invented out of whole cloth by the authors and positioned, with “truth” in quotes, to piggyback on the legitimate academic work just quoted.

As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

Is popular cant among media and academic types but no more than that.

Skilled reporting can put information in a broad context and weave a coherent narrative, but disparaging social media authors doesn’t make that any more likely.

“War Goes Viral” being a case in point.

by Patrick Durusau at October 18, 2016 12:14 AM

October 17, 2016

Patrick Durusau

XML Prague 2017 is coming

XML Prague 2017 is coming by Jirka Kosek.

From the post:

I’m happy to announce that call for papers for XML Prague 2017 is finally open. We are looking forward for your interesting submissions related to XML. We have switched from CMT to EasyChair for managing submission process – we hope that new system will have less quirks for users then previous one.

We are sorry for slightly delayed start than in past years. But we have to setup new non-profit organization for running the conference and sometimes we felt like characters from Kafka’s Der Process during this process.

We are now working hard on redesigning and opening of registration. Process should be more smooth then in the past.

But these are just implementation details. XML Prague will be again three day gathering of XML geeks, users, vendors, … which we all are used to enjoy each year. I’m looking forward to meet you in Prague in February.

Conference: February 9-11, 2017.

Important Dates:

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

You can see videos of last year’s presentations (to gauge the competition): Watch videos from XML Prague 2016 on Youtube channel.

December the 15th will be here sooner than you think!

Think of it as a welcome distraction from the barnyard posturing that is U.S. election politics this year!

by Patrick Durusau at October 17, 2016 02:18 AM

Why I Distrust US Intelligence Experts, Let Me Count the Ways

Some US Intelligence failures, oldest to most recent:

  1. Pearl Harbor
  2. The Bay of Pigs Invasion
  3. Cuban Missile Crisis
  4. Vietnam
  5. Tet Offensive
  6. Yom Kippur War
  7. Iranian Revolution
  8. Soviet Invasion of Afghanistan
  9. Collapse of the Soviet Union
  10. Indian Nuclear Test
  11. 9/11 Attacks
  12. Iraq War (WMDs)
  13. Invasion of Afghanistan (US)
  14. Israeli moles in US intelligence, various dates

Those are just a few of the failures of US intelligence, some of which cost hundreds of thousands if not millions of lives.

Yet, you can read today: Trump’s refusal to accept intelligence briefing on Russia stuns experts.

There are only three reasons I can think of to accept findings by the US intelligence community:

  1. You are on their payroll and for that to continue, well, you know.
  2. As a member of the media, future tips/leaks depends upon your acceptance of current leaks. Anyone who mocks intelligence service lies is cut off from future lies.
  3. As a politician, the intelligence findings discredit facts unfavorable to you.

For completeness’ sake, I should mention that intelligence “experts” could be telling the truth but, given their track record, that is an edge case.

Before repeating the mindless cant of “the Russians are interfering with the US election,” stop to ask your sources, “…based on what?” Opinions of all the members of the US intelligence community = one opinion. Ask for facts. No facts offered, report that instead of the common “opinion.”

by Patrick Durusau at October 17, 2016 01:42 AM

October 15, 2016

Patrick Durusau

Why Journalists Should Not Rely On Wikileaks Indexing – Podesta Emails

Clinton on Fracking, or, Another Reason to Avoid Wikileaks Indexing


The quote in the tweet is false.

Politico supplies the correct quotation in its post:

“Bernie Sanders is getting lots of support from the most radical environmentalists because he’s out there every day bashing the Keystone pipeline. And, you know, I’m not into it for that,” Clinton told the unions, according to the transcript. “My view is, I want to defend natural gas. … I want to defend fracking under the right circumstances.”

I’m guessing that “…under the right circumstances.” must have pushed Wikileaks too close to the 140-character barrier.

Ditto for the Wikileaks misquote of: “Get a life.”

Which, as reported in the tweet, appears to refer to unbridled fracking.

Not so in the Politico post:

“I’m already at odds with the most organized and wildest” of the environmental movement, Clinton told building trades unions in September 2015, according to a transcript of the remarks apparently circulated by her aides. “They come to my rallies and they yell at me and, you know, all the rest of it. They say, ‘Will you promise never to take any fossil fuels out of the earth ever again?’ No. I won’t promise that. Get a life, you know.”

Doesn’t read quite the same way does it?

I suppose once you start lying it’s really hard to stop. Clinton is a good example of that and Wikileaks should not follow her example.

It’s hard to spot these lies because Wikileaks isn’t indexing the attachments.

You can search all day for “defend fracking,” “get a life” (by Clinton) and you will come up empty (at least as of today).

So that you don’t have to search for: 20150909 Transcript | Building Trades Union (Keystone XL) at Wikileaks – Podesta Emails, I have produced a PDF version of that attachment, Building-Trades-Union-Clinton-Sept-09-2015.pdf (my naming), for your viewing pleasure.
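Since Wikileaks doesn’t index the attachments, anyone working from the raw .eml files can pull them out locally and search at will. A minimal sketch using Python’s standard email module (the output directory name is my own choice, not anything Wikileaks uses):

```python
import email
import glob
import os

OUT_DIR = "attachments"  # hypothetical output directory
os.makedirs(OUT_DIR, exist_ok=True)

def extract_attachments(eml_path):
    """Save every named attachment in one .eml file; return the saved paths."""
    saved = []
    with open(eml_path, 'rb') as f:
        msg = email.message_from_binary_file(f)
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue  # skip body parts that carry no attachment name
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Prefix with the source email so each attachment stays traceable.
        out = os.path.join(OUT_DIR,
                           os.path.basename(eml_path) + '-' + os.path.basename(filename))
        with open(out, 'wb') as g:
            g.write(payload)
        saved.append(out)
    return saved

for name in sorted(glob.glob('*.eml')):
    extract_attachments(name)
```

Once extracted, ordinary command-line tools (grep, pdftotext, etc.) can search what Wikileaks won’t.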

by Patrick Durusau at October 15, 2016 08:58 PM

Green’s Dictionary of Slang [New Commercializing Information Model?]

Green’s Dictionary of Slang

From the about page:

Green’s Dictionary of Slang is the largest historical dictionary of English slang. Written by Jonathon Green over 17 years from 1993, it reached the printed page in 2010 in a three-volume set containing nearly 100,000 entries supported by over 400,000 citations from c. ad 1000 to the present day. The main focus of the dictionary is the coverage of over 500 years of slang from c. 1500 onwards.

The printed version of the dictionary received the Dartmouth Medal for outstanding works of reference from the American Library Association in 2012; fellow recipients include the Dictionary of American Regional English, the Oxford Dictionary of National Biography, and the New Grove Dictionary of Music and Musicians. It has been hailed by the American New York Times as ‘the pièce de résistance of English slang studies’ and by the British Sunday Times as ‘a stupendous achievement, in range, meticulous scholarship, and not least entertainment value’.

On this website the dictionary is now available in updated online form for the first time, complete with advanced search tools enabling search by definition and history, and an expanded bibliography of slang sources from the early modern period to the present day. Since the print edition, nearly 60,000 quotations have been added, supporting 5,000 new senses in 2,500 new entries and sub-entries, of which around half are new slang terms from the last five years.

Green’s Dictionary of Slang has an interesting commercial model.

You can search for any word, freely, but “more search features” requires a subscription:

By subscribing to Green’s Dictionary of Slang Online, you gain access to advanced search tools (including the ability to search for words by meaning, history, and usage), full historical citations in each entry, and a bibliography of over 9,000 slang sources.

Current rate for individuals is £ 49 (or about $59.96).

In addition to being a fascinating collection of information, is the free/commercial split here of interest?

An alternative to:

The Teaser Model

Contrast the Oxford Music Online:

Grove Music Online is the eighth edition of Grove’s Dictionary of Music and Musicians, and contains articles commissioned specifically for the site as well as articles from New Grove 2001, Grove Opera, and Grove Jazz. The recently published second editions of The Grove Dictionary of American Music and The Grove Dictionary of Musical Instruments are still being put online, and new articles are added to GMO with each site update.

Oh, Oxford Music Online isn’t all pay-per-view.

It offers the following thirteen (13) articles for free viewing:

Sotiria Bellou, Greek singer of rebetiko song, famous for the special quality and register of her voice

Cell [Mobile] Phone Orchestra, ensemble of performers using programmable mobile (cellular) phones

Crete, largest and most populous of the Greek islands

Lyuba Encheva, Bulgarian pianist and teacher

Gaaw, generic term for drums, and specifically the frame drum, of the Tlingit and Haida peoples of Alaska

Johanna Kinkel, German composer, writer, pianist, music teacher, and conductor

Lady’s Glove Controller, modified glove that can control sound, mechanical devices, and lights

Outsider music, a loosely related set of recordings that do not fit well within any pre-existing generic framework

Peter (Joshua) Sculthorpe, Australian composer, seen by the Australian musical public as the most nationally representative.

Slovenia, country in southern Central Europe

Sound art, a term encompassing a variety of art forms that utilize sound, or comment on auditory cultures

Alice (Bigelow) Tully, American singer and music philanthropist

Wars in Iraq and Afghanistan, soldiers’ relationship with music is largely shaped by contemporary audio technology

Hmmm, 160,000 slang terms for free from Green’s Dictionary of Slang versus 13 free articles from Oxford Music Online.

Show of hands for the teaser model of Oxford Music Online?

The Consumer As Product

You are aware that casual web browsing and alleged “free” sites are not just supported by ads, but by the information they collect on you?

Consider this rather boastful touting of information collection capabilities:

To collect online data, we use our native tracking tags as experience has shown that other methods require a great deal of time, effort and cost on both ends and almost never yield satisfactory coverage or results since they depend on data provided by third parties or compiled by humans (!!), without being able to verify the quality of the information. We have a simple universal server-side tag that works with most tag managers. Collecting offline marketing data is a bit trickier. For TV and radio, we will work with your offline advertising agency to collect post-log reports on a weekly basis, transmitted to a secure FTP. Typical parameters include flight and cost, date/time stamp, network, program, creative length, time of spot, GRP, etc.

Convertro is also able to collect other type of offline data, such as in-store sales, phone orders or catalog feeds. Our most popular proprietary solution involves placing a view pixel within a confirmation email. This makes it possible for our customers to tie these users to prior online activity without sharing private user information with us. For some customers, we are able to match almost 100% of offline sales. Other customers that have different conversion data can feed them into our system and match it to online activity by partnering with LiveRamp. These matches usually have a success rate between 30%-50%. Phone orders are tracked by utilizing a smart combination of our in-house approach, the inputting of special codes, or by third party vendors such as Mongoose and ResponseTap.

You don’t have to be on the web, you can be tracked “in-store,” on the phone, etc.

Convertro doesn’t explicitly mention “supercookies,” for which Verizon just paid a $1.35 million fine. From the post:

“Supercookies,” known officially as unique identifier headers [UIDH], are short-term serial numbers used by corporations to track customer data for advertising purposes. According to Jacob Hoffman-Andrews, a technologist with the Electronic Frontier Foundation, these cookies can be read by any web server one visits and used to build individual profiles of internet habits. These cookies are hard to detect, and even harder to get rid of.

If any of that sounds objectionable to you, remember that to be valuable, user habits must be tracked.

That is if you find the idea of being a product acceptable.
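The “view pixel” Convertro describes is simple machinery: a 1×1 image whose URL carries an identifier, so the server logs who fetched it and when. A minimal sketch of the idea (the handler, port, and “uid” parameter are invented for illustration, not Convertro’s actual implementation):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# A minimal 1x1 transparent GIF, the classic tracking-pixel payload.
PIXEL = (b'GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff'
         b'!\xf9\x04\x01\x00\x00\x00\x00'
         b',\x00\x00\x00\x00\x01\x00\x01\x00\x00'
         b'\x02\x02D\x01\x00;')

class PixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        # The "uid" parameter ties this fetch to a known customer record.
        print('email opened by', query.get('uid', ['unknown'])[0])
        self.send_response(200)
        self.send_header('Content-Type', 'image/gif')
        self.end_headers()
        self.wfile.write(PIXEL)

# An email body would embed something like:
#   <img src="http://tracker.example/open.gif?uid=12345">
# To serve: HTTPServer(('localhost', 8080), PixelHandler).serve_forever()
```

Opening the email fetches the image; the fetch is the data point.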

The Green’s Dictionary of Slang offers an economic model that enables free access to casual users, kids writing book reports, journalists, etc., while at the same time creating a value-add that power users will pay for.

Other examples of value-add models with free access to the core information?

What would that look like for the Podesta emails?

by Patrick Durusau at October 15, 2016 01:50 AM

October 14, 2016

Patrick Durusau

Becoming a Data Scientist:

Becoming a Data Scientist: Advice From My Podcast Guests

Out-gassing from political candidates has kept pushing this summary by Renée Teate back in my queue. Well, fixing that today!

Renée has created more data science resources than I can easily mention, so in addition to this guide, I will mention only two:

Data Science Renee @BecomingDataSci, a Twitter account that will soon break into the rarefied air of > 10,000 followers. Not yet, but you may be the one that puts her over the top!

Looking for women to speak at data science conferences? Renée maintains Women in Data Science, which today has 815 members.

Sorry, three, her blog: Becoming a Data Scientist.

That should keep you busy/distracted until the political noise subsides. ;-)

by Patrick Durusau at October 14, 2016 12:51 AM

October 13, 2016

Patrick Durusau

Obama on Fixing Government with Technology (sigh)

Obama on Fixing Government with Technology by Caitlin Fairchild.

Like any true technology cultist, President Obama mentions technology and inefficiency, but never the people who make up government as the source of government “problems.” Nor does he appear to realize that technology cannot fix the people who make up government.

Those out-dated information systems he alludes to were built and are maintained under contract with vendors. Systems that are used by users who are accustomed to those systems and will resist changing to others. Still other systems rely upon those systems being as they are in terms of work flow. And so on. At its very core, the problem of government isn’t technology.

It’s the twin requirement that it be composed of and supplied by people, all of whom have a vested interest in and comfort level with the technology they use and, don’t forget, government has to operate 24/7, 365 days a year.

There is no time to take down part of the government to develop new technology, train users in its use and at the same time, run all the current systems which are, to some degree, meeting current requirements.

As an antidote to the technology cultism that infects President Obama and his administration, consider reading Geek Heresy, the description of which reads:

In 2004, Kentaro Toyama, an award-winning computer scientist, moved to India to start a new research group for Microsoft. Its mission: to explore novel technological solutions to the world’s persistent social problems. Together with his team, he invented electronic devices for under-resourced urban schools and developed digital platforms for remote agrarian communities. But after a decade of designing technologies for humanitarian causes, Toyama concluded that no technology, however dazzling, could cause social change on its own.

Technologists and policy-makers love to boast about modern innovation, and in their excitement, they exuberantly tout technology’s boon to society. But what have our gadgets actually accomplished? Over the last four decades, America saw an explosion of new technologies – from the Internet to the iPhone, from Google to Facebook – but in that same period, the rate of poverty stagnated at a stubborn 13%, only to rise in the recent recession. So, a golden age of innovation in the world’s most advanced country did nothing for our most prominent social ill.

Toyama’s warning resounds: Don’t believe the hype! Technology is never the main driver of social progress. Geek Heresy inoculates us against the glib rhetoric of tech utopians by revealing that technology is only an amplifier of human conditions. By telling the moving stories of extraordinary people like Patrick Awuah, a Microsoft millionaire who left his lucrative engineering job to open Ghana’s first liberal arts university, and Tara Sreenivasa, a graduate of a remarkable South Indian school that takes children from dollar-a-day families into the high-tech offices of Goldman Sachs and Mercedes-Benz, Toyama shows that even in a world steeped in technology, social challenges are best met with deeply social solutions.

Government is a social problem and to reach for a technology fix first, is a guarantee of yet another government failure.

by Patrick Durusau at October 13, 2016 09:27 PM

IBM’s Program Of Security Via Obscurity (Censorship)

Before today, my response to the question: “Does IBM promote security through obscurity?” would have been no!

Today? Full Disclosure @SecLists posted this tweet:


A working version of the URL:

I don’t suppose better software engineering practices and/or rapid repair of IBM’s software occurred to anyone?

by Patrick Durusau at October 13, 2016 08:15 PM

George Carlin’s Seven Dirty Words in Podesta Emails – Discovered 981 Unindexed Documents

While taking a break from serious crunching of the Podesta emails I discovered 981 unindexed documents at Wikileaks!

Try searching for Carlin’s seven dirty words at The Podesta Emails:

  • shit – 44
  • piss – 19
  • fuck – 13
  • cunt – 0
  • cocksucker – 0
  • motherfucker – 0 (?)
  • tits – 0

I have a ? after “motherfucker” because working with the raw files I show one (1) hit for “motherfucker” and one (1) hit for “motherfucking.” Separate emails.

For “motherfucker,” American Sniper–the movie, responded to by Chris Hedges – To: Podesta@Law.Georgetown.Edu

For “motherfucking,” H4A News Clips 5.31.15 – From/To:

“Motherfucker” and “motherfucking” occur in text attachments to emails, which Wikileaks does not search.

If you do a blank search for file attachments, Wikileaks reports there are 2427 file attachments.

Searching the Podesta emails at Wikileaks excludes the contents of 2427 files from your search results.

How significant is that?

Hmmm, 302 pdf, 501 docx, 167 doc, 12 xls, 9 xlsx – 981 documents excluded from your searches at Wikileaks.

That’s for 9,011 emails, as of this morning, local time.

How comfortable are you with not searching those 981 documents? (Or additional documents that may follow?)
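If you have the attachments extracted locally, tallying them by extension and searching their text is straightforward. A sketch, assuming the files sit in a local directory (the directory name and helper function are mine, not Wikileaks’):

```python
import collections
import glob
import os

ATTACH_DIR = "attachments"  # hypothetical directory of extracted attachments

# Tally attachments by extension, as in the counts above.
counts = collections.Counter(
    os.path.splitext(path)[1].lower().lstrip('.')
    for path in glob.glob(os.path.join(ATTACH_DIR, '*'))
)
for ext, n in counts.most_common():
    print(ext or '(none)', n)

def files_containing(term, extensions=('txt',)):
    """Plain-text attachments whose contents mention a term (case-insensitive)."""
    hits = []
    for ext in extensions:
        for path in glob.glob(os.path.join(ATTACH_DIR, '*.' + ext)):
            with open(path, 'rb') as f:
                if term.encode() in f.read().lower():
                    hits.append(path)
    return sorted(hits)
```

Binary formats (pdf, doc, docx) would first need a text-extraction pass, but the principle is the same: search the bytes Wikileaks doesn’t index.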

by Patrick Durusau at October 13, 2016 03:42 PM

October 12, 2016

Patrick Durusau

How-To Spot An Armchair Jihadist

To efficiently use law enforcement resources against threats to civil order, the police must recognize the difference between an actual jihadist and an armchair jihadist.

An armchair jihadist is one that talks a good game, dreams of raining fire and death on infidels, etc., but in truth, is the Walter Mitty of terrorism.

Unfortunately, law enforcement disproportionately captures armchair jihadists, for example, the arrest of Samata Ullah, who was charged in part with possession of:

…a book about guided missiles and a PDF version of a book about advanced missile guidance and control for a purpose connected with the commission, preparation or instigation of terrorism”

Admitting the romanticism of building one’s own arsenal, how successful do you think an individual or even a large group of individuals would be at building and testing a guided missile?

Here’s a broad outline of the major steps to building a laser guided missile:

The Manufacturing Process

Constructing the body and attaching the fins

1 The steel or aluminum body is die cast in halves. Die casting involves pouring molten metal into a steel die of the desired shape and letting the metal harden. As it cools, the metal assumes the same shape as the die. At this time, an optional chromium coating can be applied to the interior surfaces of the halves that correspond to a completed missile’s cavity. The halves are then welded together, and nozzles are added at the tail end of the body after it has been welded.

2 Moveable fins are now added at predetermined points along the missile body. The fins can be attached to mechanical joints that are then welded to the outside of the body, or they can be inserted into recesses purposely milled into the body.

Casting the propellant

3 The propellant must be carefully applied to the missile cavity in order to ensure a uniform coating, as any irregularities will result in an unreliable burning rate, which in turn detracts from the performance of the missile. The best means of achieving a uniform coating is to apply the propellant by using centrifugal force. This application, called casting, is done in an industrial centrifuge that is well-shielded and situated in an isolated location as a precaution against fire or explosion.

Assembling the guidance system

4 The principal laser components—the photo detecting sensor and optical filters—are assembled in a series of operations that are separate from the rest of the missile’s construction. Circuits that support the laser system are then soldered onto pre-printed boards; extra attention is given to optical materials at this time to protect them from excessive heat, as this can alter the wavelength of light that the missile will be able to detect. The assembled laser subsystem is now set aside pending final assembly. The circuit boards for the electronics suite are also assembled independently from the rest of the missile. If called for by the design, microchips are added to the boards at this time.

5 The guidance system (laser components plus the electronics suite) can now be integrated by linking the requisite circuit boards and inserting the entire assembly into the missile body through an access panel. The missile’s control surfaces are then linked with the guidance system by a series of relay wires, also entered into the missile body via access panels. The photo detecting sensor and its housing, however, are added at this point only for beam riding missiles, in which case the housing is carefully bolted to the exterior diameter of the missile near its rear, facing backward to interpret the laser signals from the parent aircraft.

Final assembly

6 Insertion of the warhead constitutes the final assembly phase of guided missile construction. Great care must be exercised during this process, as mistakes can lead to catastrophic accidents. Simple fastening techniques such as bolting or riveting serve to attach the warhead without risking safety hazards. For guidance systems that home in on reflected laser light, the photo detecting sensor (in its housing) is bolted into place at the tip of the warhead. On completion of this final phase of assembly, the manufacturer has successfully constructed one of the most complicated, sophisticated, and potentially dangerous pieces of hardware in use today.

Quality Control

Each important component is subjected to rigorous quality control tests prior to assembly. First, the propellant must pass a test in which examiners ignite a sample of the propellant under conditions simulating the flight of a missile. The next test is a wind tunnel exercise involving a model of the missile body. This test evaluates the air flow around the missile during its flight. Additionally, a few missiles set aside for test purposes are fired to test flight characteristics. Further work involves putting the electronics suite through a series of tests to determine the speed and accuracy with which commands get passed along to the missile’s control surfaces. Then the laser components are tested for reliability, and a test beam is fired to allow examiners to record the photo detecting sensor’s ability to “read” the proper wavelength. Finally, a set number of completed guided missiles are test fired from aircraft or helicopters on ranges studded with practice targets.

Did Samata Ullah have the expertise and/or access to the expertise or manufacturing capability for any of those steps?

Moreover, could Samata Ullah have tested and developed a guided missile without someone noticing?

Possession of first principle reading materials, such as chemistry, rocket, missile, etc., manuals or guides is a clear sign an alleged jihadist is an armchair jihadist.

Another sign of an armchair jihadist, along with the possession of such reading materials, is their failure to obtain explosives, weapons, etc., in an effective way.

The United States, via the CIA and the US military, routinely distributes explosives and weapons around the world to various factions.

A serious jihadist need only travel to well known locations and get in line for explosives, RPGs (rocket-propelled grenades), mortars, etc.

Does the weapon in this photo look homemade?


Of course not! Anyone with a passport and a little imagination can possess a wide variety of harmful devices.

But then, they are not an armchair jihadist.

DIY missile/explosive reading clubs of jihadists are not threats to the public. Manufacturing explosives and missiles is difficult and dangerous work, best left to professionals. Such readers are more dangerous to each other than to the general public.

When allocating law enforcement resources, remember that the only thing easier to acquire than weapons is possibly marijuana. Anyone planning on building weapons can be ignored as an armchair jihadist.

In the United States and the United Kingdom, law enforcement resources would be better spent in the pursuit of wealthy and governmental pedophiles.

PS: I started to edit the steps for building a guided missile for length but the description highlights the absurdity of the charges in question. Melting steel or aluminum and pouring it into a metal die? Please, that’s not a backyard activity. Neither is pouring molten rocket fuel using a centrifuge.

by Patrick Durusau at October 12, 2016 08:27 PM

British and Irish Legal Information Institute

British and Irish Legal Information Institute

From the webpage:

Welcome to BAILII, where you can find British and Irish case law & legislation, European Union case law, Law Commission reports, and other law-related British and Irish material. BAILII thanks The Scottish Council of Law Reporting for their assistance in establishing the Historic Scottish Law Reports project. BAILII also thanks Sentral for provision of servers. For more information, see About BAILII.

I ran across this wonderful legal resource while researching a legal issue in another post.

Obviously a great resource for legal research and scholars but also I suspect a great source of leisure reading, well, if you like that sort of thing.

The site also offered this handy list of world law resources:

When I said “leisure reading,” I was only partially joking. What we accept now as “the law,” wasn’t always so.

The history of how rights and obligations have evolved over centuries of human interaction are recorded in legislation and case law.

It is a history with all the mis-steps, failures, betrayals and intrigue that are commonplace in any human enterprise.


by Patrick Durusau at October 12, 2016 12:57 AM

October 11, 2016

Patrick Durusau

Parsing Foreign Law From News Reports (Warning For Journalists)

Cory Doctorow‘s headline: Scotland Yard charge: teaching people to use crypto is an act of terrorism red-lined my anti-government biases.

I tend towards “unsound” reactions when free speech is being infringed upon.

But my alarm, and perhaps yours as well, was needlessly provoked in this case.

Cory writes:

In other words, according to Scotland Yard, serving a site over HTTPS (as this one is) and teaching people to use crypto (as this site has done) and possessing a secure OS (as I do) are acts of terrorism or potential acts of terrorism. In some of the charges, the police have explicitly connected these charges with planning an act of terrorism, but in at least one of the charges (operating a site served over HTTPS and teaching people about crypto) the charge lacks this addendum — the mere act is considered worthy of terrorism charges.

The concern over:

but in at least one of the charges (operating a site served over HTTPS and teaching people about crypto) the charge lacks this addendum — the mere act is considered worthy of terrorism charges.

is misplaced.

Cory points to the original report here: Man arrested on Cardiff street to face six terror charges by Vikram Dodd.

Cory’s alarm is not repeated by Dodd:

Ullah has been charged with directing terrorism, providing training in encryption programs knowing the purpose was for terrorism, and using his blog site to provide such training. His activities are alleged to have “the intention of assisting another or others to commit acts of terrorism”.

Beyond that (I haven’t seen the charging document), be aware that under English Criminal Procedure, the “charge” on which Cory places so much weight is defined as:


Pay particular attention to 7.3(1)(a)(i) (page 65):

…describes the offense in ordinary language, and…

A “charge” isn’t a technical specification of an offense under English criminal procedure. Which means you attach legal significance to charging language at your own peril. And to the detriment of your readers.

PS: I have contacted the Westminster Magistrates’ Court and requested a copy of the charging document. If and when that arrives, I will update this post with it.

by Patrick Durusau at October 11, 2016 11:48 PM

Bias in Data Collection: A UK Example

Kelly Fiveash‘s story, UK’s chief troll hunter targets doxxing, virtual mobbing, and nasty images starts off:

Trolls who hurl abuse at others online using techniques such as doxxing, baiting, and virtual mobbing could face jail, the UK’s top prosecutor has warned.

New guidelines have been released by the Crown Prosecution Service to help cops in England and Wales determine whether charges—under part 2, section 44 of the 2007 Serious Crime Act—should be brought against people who use social media to encourage others to harass folk online.

It even includes “encouraging” statistics:

According to the most recent publicly available figures—which cite data between May 2013 and December 2014—1,850 people were found guilty in England and Wales of offences under section 127 of the Communications Act 2003. But the numbers reveal a steady climb in charges against trolls. In 2007, there were a total of 498 defendants found guilty under section 127 in England and Wales, compared with 693 in 2008, 873 in 2009, 1,186 in 2010 and 1,286 in 2011.

But the “most recent publicly available figures” don’t ring true, do they?

Imagine that, 1850 trolls out of a total population of England and Wales of 57 million. (England 53.9 million, Wales 3.1 million, mid-2013)


Let’s look at the referenced government data, 25015 Table.xls.

For the months of May 2013 to December 2014, there are only monthly totals of convictions.

What data is not being collected?

Among other things:

  1. Offenses reported to law enforcement
  2. Offenses investigated by law enforcement (not the same as #1)
  3. Conduct in question
  4. Relationship, if any, between the alleged offender/victim
  5. Race, economic status, location, social connections of alleged offender/victim
  6. Law enforcement and/or prosecutors involved
  7. Disposition of cases without charges being brought
  8. Disposition of cases after charges brought but before trial
  9. Charges dismissed by courts and acquittals
  10. Judges who try and/or dismiss charges
  11. Penalties imposed upon guilty plea and/or conviction
  12. Appeals and results on appeal, judges, etc.

All that information exists for every reported case of “trolls,” and is recorded at some point in the criminal justice process or could be discerned from those records.

Can you guess who isn’t collecting that information?

The TheyWorkForYou site, at Communications Act 2003, reports Jeremy Wright, The Parliamentary Under-Secretary of State for Justice, as saying:

The Ministry of Justice Court Proceedings Database holds information on defendants proceeded against, found guilty and sentenced for criminal offences in England and Wales. This database holds information on offences provided by the statutes under which proceedings are brought but not the specific circumstances of each case. It is not possible to separately identify, in all cases brought under section 127 of the Communications Act 2003, whether a defendant sent or caused to send information to an individual or a small group of individuals or made the information widely available to the public. This detailed information may be held by the courts on individual case files which due to their size and complexity are not reported to Justice Analytical Services. As such this information can be obtained only at disproportionate cost.
… (emphasis added)

I was unaware that courts in England and Wales were still recording their proceedings on vellum. Gathering that data together manually would be so expensive. (NOT!)

How difficult is it for any policy organization, whether seeking greater protection from trolls or opposing classes of prosecution on discrimination and free speech grounds, to gather the same data?

Here is a map of the Crown Prosecution Service districts:


Counting the sub-offices in each area, I get forty-three separate offices.

But that’s only cases that are considered for prosecution and that’s unlikely to be the same number as reported to the police.

Checking for police districts in England, I get thirty-nine.


Plus, another four areas for Wales:


The Wikipedia article List of law enforcement agencies in the United Kingdom, Crown dependencies and British Overseas Territories has links for all these police areas, which in the interest of space, I did not repeat here.

I wasn’t able to quickly find a map of English criminal courts, although you can locate them by postcode at: Find the right court or tribunal. My suspicion is that Crown Prosecution Service areas correspond to criminal courts. But verify that for yourself.

In order to collect the information already in the possession of the government, you would have to search records in 43 police districts, 43 Crown Prosecution Service offices, plus as many as 43 criminal courts in which defendants may be prosecuted. All over England and Wales. With unhelpful clerks all along the way.

All while the government offers the classic excuse:

As such this information can be obtained only at disproportionate cost.

Disproportionate because:

Abuse of discretion, lax enforcement, favoritism, discrimination by police officers, Crown prosecutors, judges could be demonstrated as statistical facts?

Governments are old hands at not collecting evidence they prefer to not see thrown back in their faces.

For example: FBI director calls lack of data on police shootings ‘ridiculous,’ ‘embarrassing’.

Non-collection of data is a source of bias.

What bias is behind the failure to collect troll data in the UK?

by Patrick Durusau at October 11, 2016 01:56 AM

October 10, 2016

Patrick Durusau

When 24 GB Of Physical RAM Pegs At 98% And Stays There

Don’t panic! It has a happy ending but I’m too tired to write it up for posting today.

Tune in tomorrow for lessons learned on FOIA answers that don’t set the information free.

by Patrick Durusau at October 10, 2016 01:38 AM

October 09, 2016

Patrick Durusau

Chasing File Names – Check My Work

I encountered a stream of tweets of which the following are typical:


Hmmm, is cf.7z a different set of files from ebd-cf.7z?

You could “eye-ball” the directory listings but that is tedious and error-prone.

Building on what we saw in Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files), let’s combine cf-7z-file-Sorted-Uniq.txt and ebd-cf-file-Sorted-Uniq.txt, and sort the result into cf-7z-and-ebd-cf-files-Sorted.txt.
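That combine-and-sort step, followed by the duplicate count, can be sketched with tiny stand-in lists (the file names and contents below are illustrative; the real inputs are the two files linked above):

```shell
# Stand-in file lists; substitute the real cf-7z and ebd-cf lists
printf 'a.docx\nb.xlsx\nc.pdf\n' > cf-list.txt
printf 'a.docx\nb.xlsx\nc.pdf\n' > ebd-list.txt

# Combine and sort so duplicate names become adjacent for uniq
sort cf-list.txt ebd-list.txt > combined-sorted.txt

# Count names present in both lists
uniq -d combined-sorted.txt | wc -l
```

Note that uniq only detects adjacent duplicate lines, which is why the sort must come first.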


uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l

(“-d” for duplicate lines) on the resulting file, piping it into wc -l, will give you the result of 2177 duplicates. (The total length of the file is 4354 lines.)


uniq -u cf-7z-and-ebd-cf-files-Sorted.txt

(“-u” for unique lines), will give you no return (no unique lines).

With experience, you will be able to check very large file archives for duplicates. In this particular case, despite circulating under different names, it appears these two archives contain the same files.

BTW, do you think a similar technique could be applied to spreadsheets?

by Patrick Durusau at October 09, 2016 02:04 AM

October 08, 2016

Patrick Durusau

DNC/DCCC/CF Excel Files, As Of October 7, 2016

A continuation of my post Avoiding Viruses in DNC/DCCC/CF Excel Files.

Where Avoiding Viruses… focused on avoiding the hazards and dangers of Excel-born viruses, this post focuses on preparing the DNC/DCCC/CF Excel files from Guccifer 2.0, as of October 7, 2016, for further analysis.

As I mentioned before, you could search through all 517 files to date, separately, using Excel. That thought doesn’t bring me any joy. You?

Instead, I’m proposing that we prepare the files to be concatenated together, resulting in one fairly large file, which we can then search and manipulate as one entity.

As a data cleanliness task, I prefer to prefix every line in every csv export, with the name of its original file. That will enable us to extract lines that mention the same person over several files and still have a bread crumb trail back to the original files.

Munging all the files together without such a step would leave us grepping across the collection and/or using some other search mechanism. Why not plan on avoiding that hassle?

Given the number of files requiring prefixing, I suggest the following:

for f in *.csv*; do
  sed -i "s/^/$f,/" "$f"
done

This shell loop uses sed with the -i switch, which makes sed change each file in place (think overwriting in position). The s/^/$f,/ substitution anchors at ^, the start of each line, and inserts $f, the filename, followed by a comma separator; the final "$f" names the file currently being processed.

There are any number of ways to accomplish this task. Your community may use a different approach.
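For comparison, here is a hypothetical Python equivalent of the sed loop, demonstrated on a throwaway file (the demo-*.csv pattern and file name are illustrative; against the real exports you would use the *.csv* pattern from the loop above):

```python
import glob

# Create a throwaway demo file standing in for a real exported CSV
with open("demo-export.csv", "w") as out:
    out.write("Smith,100\nJones,250\n")

# Prefix every line of every matching file with its filename
for name in sorted(glob.glob("demo-*.csv")):
    with open(name) as src:
        lines = src.readlines()
    with open(name, "w") as dst:
        dst.writelines(f"{name},{line}" for line in lines)

print(open("demo-export.csv").read())
```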

The result of my efforts is: guccifer2.0-all-spreadsheets-07October2016.gz, which weighs in at 61 MB compressed and 231 MB uncompressed.

I did check and despite having variable row lengths, it does load in my oldish version of gnumeric. All 1030828 lines.

That’s not all surprising for gnumeric, considering I’m running 24 GB of physical RAM. Your performance may vary. (It did hesitate loading it.)

There is much left to be done, such as deciding what padding is needed to even out all the rows. (I have ideas, suggestions?)
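One illustrative possibility (a sketch of an idea, not a settled plan): pad every row with empty fields out to the length of the longest row. The file names and rows below are stand-ins.

```python
import csv

# Demo rows of uneven length; the real input would be the consolidated CSV
rows = [["a.csv", "Smith", "100"], ["b.csv", "Jones"], ["c.csv"]]

# The longest row sets the target width
width = max(len(r) for r in rows)

# Pad each short row with empty fields
padded = [r + [""] * (width - len(r)) for r in rows]

with open("padded-demo.csv", "w", newline="") as out:
    csv.writer(out).writerows(padded)
```

For a 231 MB file you would want to stream rows rather than hold them all in memory, but the padding logic is the same.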

Tools to manipulate the CSV. I have a couple of stand-bys and a new one that I located while writing this post.

And, of course, once the CSV is cleaned up, what other means can we use to explore the data?

My focus will be on free and high-performance tools (amazing how often those are found together, Larry Ellison) that can be easily used for exploring vast seas of spreadsheet data.

Next post on these Excel files, Monday, October 10, 2016.

I am downloading the cf.7z Guccifer 2.0 drop as I write this update.

Watch for updates on the comprehensive file list and Excel files next Monday. October 8, 2016, 01:04 UTC.

by Patrick Durusau at October 08, 2016 01:05 AM

The “Fact Free” U.S. Intelligence Community (USIC)

The Joint Statement from the Department of Homeland Security and Office of the Director of National Intelligence on Election Security is a reminder of why the U.S. Intelligence Community (USIC) fails so very often.

The first paragraph:

The U.S. Intelligence Community (USIC) is confident that the Russian Government directed the recent compromises of e-mails from US persons and institutions, including from US political organizations. The recent disclosures of alleged hacked e-mails on sites like and WikiLeaks and by the Guccifer 2.0 online persona are consistent with the methods and motivations of Russian-directed efforts. These thefts and disclosures are intended to interfere with the US election process. Such activity is not new to Moscow—the Russians have used similar tactics and techniques across Europe and Eurasia, for example, to influence public opinion there. We believe, based on the scope and sensitivity of these efforts, that only Russia’s senior-most officials could have authorized these activities.

Do you see any facts in that first paragraph?

I see the conclusion “…are consistent with the methods and motivations of Russian-directed efforts,” but no facts to back that statement up.

Moreover, the second paragraph leaps for the “smoking gun” with:

Some states have also recently seen scanning and probing of their election-related systems, which in most cases originated from servers operated by a Russian company….

You would hope the U.S. Intelligence Community (USIC) would have heard of VPNs (virtual private networks).

No facts, just allegations that favor one party in the fast approaching U.S. presidential election.

Yes, intelligence agencies are interfering with the U.S. election, but its not Russian intelligence agencies.

by Patrick Durusau at October 08, 2016 12:49 AM

October 07, 2016

Patrick Durusau

Avoiding Viruses in DNC/DCCC/CF Excel Files

I hope you haven’t opened any of the DNC/DCCC/CF Excel files outside of a VM. 517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)


Files from trusted sources can contain viruses. Files from unknown or rogue sources even more so. However tempting (and easy) it is to open allegedly purloined files on your desktop, minimally security-conscious users will resist the temptation.

Warning: I did NOT scan the Excel files for viruses. The best way to avoid Excel viruses is to NOT open Excel files.

I used ssconvert, one of the utilities included with gnumeric, to bulk convert the Excel files to CSV format. (Comma-Separated Values is documented in RFC 4180.)

Tip: If you are looking for a high performance spreadsheet application, take a look at gnumeric.

Ssconvert relies on file extensions (although other options are available) so I started with:

ssconvert -S donors.xlsx donors.csv

The -S option takes care of workbooks with multiple worksheets. You need a later version of ssconvert (mine is 1.12.9-1, from 2013; the current version of gnumeric and ssconvert is 1.12.31, from August 2016) to convert the .xlsx files without warnings.

I’m upgrading to Ubuntu 16.04 soon so it wasn’t worth the trouble trying to stuff a later version of gnumeric/ssconvert onto my present Ubuntu 14.04.

Despite the errors, the conversion appears to have worked properly:


to its csv output:


I don’t see any problems.

I’m checking a sampling of the other conversions as well.

BTW, do notice the confirmation of reports from some commentators that they contacted donors who confirmed donating, but could not recall the amounts.

Could be true. If you pay protection money often enough, I’m sure it’s hard to recall a specific payment.

Sorry, I got distracted.

So, only 516 files to go.

I don’t recommend you do:

ssconvert -S filename.xlsx filename.csv

516 times. That will be tedious and error prone.

At least for Linux, I recommend:

for f in *.xls*; do
   ssconvert -S "$f" "$f.csv"
done

The *.xls* glob captures both .xls and .xlsx files; the loop invokes ssconvert -S on each file and saves the output under the original name plus the extension .csv.

The wc -l command reports 1030828 lines in the consolidated csv file for these spreadsheets.

That’s a lot of lines!
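The consolidation and count can be sketched like this (stand-in files below; the real inputs are the 517 converted spreadsheets):

```shell
# Stand-ins for the converted per-spreadsheet CSV files
printf 'a.xlsx.csv,Smith,100\n' > sheet1.csv
printf 'b.xlsx.csv,Jones,250\nb.xlsx.csv,Brown,75\n' > sheet2.csv

# Concatenate into one file and count the lines
cat sheet1.csv sheet2.csv > all-spreadsheets.csv
wc -l < all-spreadsheets.csv
```

Because each line was already prefixed with its source filename, nothing is lost by concatenating.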

I have some suggestions on processing that file, see: DNC/DCCC/CF Excel Files, As Of October 7, 2016.

by Patrick Durusau at October 07, 2016 09:36 PM

517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)

As of today, the data dumps by Guccifer2.0 have contained 517 Excel files.

The vehemence of posts dismissing these dumps makes me wonder two things:

  1. How many of the Excel files have these commentators reviewed?
  2. What might be found in them that worries them so?

I don’t know the answer to #1 and I won’t speculate on their diligence in examining these files. You can reach your own conclusions in that regard.

Nor can I give you an answer to #2, but I may be able to help you explore these spreadsheets.

The old-fashioned way, opening each file separately at one Excel file per minute (assuming normal Office performance ;-) ), would take longer than an eight-hour day to open them all.

You still must understand and compare the spreadsheets.

To make 517 Excel files more than a number, here’s a list of all the Guccifer2.0 released Excel files as of today: guccifer2.0-excel-files-sorted.txt.

(I do have an unfair advantage in that I am willing to share the files I generate, enabling you to check my statements for yourself. A personal preference for fact-based pleading as opposed to conclusory hand waving.)

If you think of each line in the spreadsheets as a record, this sounds like a record linkage problem. Except they have no uniform number of fields, headers, etc.

With record linkage, we would munge all the records into a single record format and then and only then, match up records to see which ones have data about the same subjects.

Thinking about that, the number 517 looms large because all the formats must be reconciled to one master format, before we start getting useful comparisons.

I think we can do better than that.

First step, let’s consider how to create a master record set that keeps all the data as it exists now in the spreadsheets, but as a single file.

See you tomorrow!

by Patrick Durusau at October 07, 2016 01:36 AM

October 06, 2016

Patrick Durusau

Unmasking Tor users with DNS

Unmasking Tor users with DNS by Mark Stockley.

From the post:

Researchers at the KTH Royal Institute of Technology, Stockholm, and Princeton University in the USA have unveiled a new way to attack Tor and deanonymise its users.

The attack, dubbed DefecTor by the researchers in their recently published paper The Effect of DNS on Tor’s Anonymity, uses the DNS lookups that accompany our browsing, emailing and chatting to create a new spin on Tor’s most well-established weakness: correlation attacks.

If you want the lay-person’s explanation of the DNS issue with Tor, see Mark’s post. If you want the technical details, read The Effect of DNS on Tor’s Anonymity.

The immediate take away for the average user is this:

Donate, volunteer, support the Tor project.

Your privacy or lack thereof is up to you.

by Patrick Durusau at October 06, 2016 06:21 PM

Arabic/Russian Language Internet

No matter the result of the 2016 US presidential election, mis-information on areas where Arabic and/or Russian are spoken will increase.

If you are creating topic maps and/or want to do useful reporting on such areas consider:

How to get started investigating the Arabic-language internet by Tom Trewinnard, or,

How to get started investigating the Russian-language internet by Aric Toler.

Any hack can quote releases from official sources and leave their readers uninformed.

A journalist takes monotone “facts” from an “official” release and weaves a story of compelling interest to their readers.

Any other guides to language/country specific advice for journalists?

by Patrick Durusau at October 06, 2016 06:01 PM

XQuery Snippets on Gist

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service

Not a bad way to get in a daily dose of XQuery!

You can also try Stack Overflow:

XQuery (3,000)

xquery-sql (293)

xquery-3.0 (70)

xquery-update (55)


by Patrick Durusau at October 06, 2016 05:36 PM

Terrorist HoneyPots?

I was reading Checking my honeypot day by Mark Hofman when it occurred to me that discovering CIA/NSA/FBI cybertools may not be as hard as I previously thought.

Imagine creating a <insert-current-popular-terrorist-group-name> website, replete with content ripped off from other terrorist websites, including those sponsored by the U.S. government.

Sharpen your skills at creating fake Twitter followers, AI-generated tweets, etc.

Instead of getting a Booz Allen staffer to betray their employer, you can sit back and collect exploits as they are used.

With just a little imagination, you can create honeypots on and off the Dark Web to attract particular intelligence or law enforcement agencies, security software companies, political hackers and others.

If the FBI can run a porn site, you can use a honeypot to collect offensive cyberweapons.

by Patrick Durusau at October 06, 2016 03:50 PM

Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files)

However amusing the headline ‘Guccifer 2.0’ Is Bullshitting Us About His Alleged Clinton Foundation Hack may be, Lorenzo Franceschi-Bicchierai offers no factual evidence to support his claim,

… the hacker’s latest alleged feat appears to be a complete lie.

Or should I say that:

  • Clinton Foundation denies it has been hacked
  • The Hill whines about who is a donor where
  • The Daily Caller says, “nothing to see here, move along, move along”

hardly qualifies as anything I would rely on.

Checking the file names is one rough check for duplication.

First, you need a set of the file names for all the releases on Guccifer 2.0’s blog:

Relying on file names alone is iffy as the same “content” can be in files with different names, or different content in files with the same name. But this is a rough cut against thousands of documents, so file names it is.

So you can check my work, I saved a copy of the files listed at the blog in date order: guccifer2.0-File-List-By-Blog-Date.txt.

For combining files for use with uniq, you will need a sorted, uniq version of that file: guccifer2.0-File-List-Blog-Sorted-Uniq-lc-final.txt.

Next, there was a major dump of files under the file name 7dc58-ngp-van.7z, approximately 820 MB of files. (Not listed on the blog but from Guccifer 2.0.)

You can use your favorite tool set or grab a copy of: 7dc58-ngp-van-Sorted-Uniq-lc-final.txt.

You need to combine those file names with those from the blog to get a starting set of names for comparison against the alleged Clinton Foundation hack.

Combining those two file name lists together, sorting them and creating a unique list of file names results in: guccifer2.0-30Sept2016-Sorted-Unique.txt.

Follow the same process for ebd-cf.7z, the file that dropped on the 3rd of October 2016. Or grab: ebd-cf-file-Sorted-Uniq-lc-final.txt.

Next, combine guccifer2.0-30Sept2016-Sorted-Unique.txt (the files we knew about before the 3rd of October) with ebd-cf-file-Sorted-Uniq.txt, and sort those file names, resulting in: guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt.

The final step is to apply uniq -d to guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt, which should give you the duplicate files, comparing the files in ebd-cf.7z to those known before September 30, 2016.
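With stand-in lists (the names below are illustrative; the real inputs are the sorted file lists named above), the last two steps look like:

```shell
# Stand-in lists; substitute the real sorted, unique file lists
printf 'a.xlsx\nb.docx\n' > known-files.txt
printf 'b.docx\nc.pdf\n' > ebd-cf-files.txt

sort known-files.txt ebd-cf-files.txt > combined.txt

uniq -d combined.txt   # names in both sets: the duplicates
uniq -u combined.txt   # names appearing in only one set
```

Here b.docx would be reported as a duplicate, while a.xlsx and c.pdf are unique to their respective sets.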

The results?

11-26-08 nfc members raised.xlsx

Seven files out of 2085 doesn’t sound like a high degree of duplication.

At least not to me.


PS: On the allegations about the Russians, you could ask the Communists in the State Department or try the Army General Staff. ;-) Some of McCarthy’s records are opening up if you need leads.

PPS: Use the final sorted, unique file list to check future releases by Guccifer 2.0. It might help you avoid bullshitting the public.

by Patrick Durusau at October 06, 2016 01:25 AM

October 05, 2016

Patrick Durusau

#Guccifer 2.0 Drop – Oct. 4, 2016 – File List

While you wait for your copy of the October 4, 2016 drop by #Guccifer 2.0 to download, you may want to peruse the file list for that drop: ebd-cf-file-list.gz.

A good starting place for comments on this drop is: Guccifer 2.0 posts DCCC docs, says they’re from Clinton Foundation – Files appear to be from Democratic Congressional Campaign Committee and DNC hacks. by Sean Gallagher.

The paragraph in Sean’s post that I find the most interesting is:

However, a review by Ars found that the files are clearly not from the Clinton Foundation. While some of the individual files contain real data, much of it came from other breaches Guccifer 2.0 has claimed credit for at the Democratic National Committee and the Democratic Congressional Campaign Committee—hacks that researchers and officials have tied to “threat groups” connected to the Russian Government. Other data could have been aggregated from public information, while some appears to be fabricated as propaganda.

To verify Sean’s claim of duplication, compare the file names in this dump against those from prior dumps.

Sean is not specific about which files/data are alleged to be “fabricated as propaganda.”

I continue to be amused by allegations of Russian Government involvement. When funding is being sought, Russians (substitute other nationalities) possess super-human hacking capabilities. Yet in cases like this one, which regurgitates old data, Russian Government involvement is still presumed.

The inconsistency between Russian Government super-hackers and Russian Government copy-n-paste data leaks, doesn’t seem to be getting much play in the media.

Perhaps you can help on that score.


by Patrick Durusau at October 05, 2016 02:37 AM

An introduction to data cleaning with R

An introduction to data cleaning with R by Edwin de Jonge and Mark van der Loo.


Data cleaning, or data preparation is an essential part of statistical analysis. In fact, in practice it is often more time-consuming than the statistical analysis itself. These lecture notes describe a range of techniques, implemented in the R statistical environment, that allow the reader to build data cleaning scripts for data suffering from a wide range of errors and inconsistencies, in textual format. These notes cover technical as well as subject-matter related aspects of data cleaning. Technical aspects include data reading, type conversion and string matching and manipulation. Subject-matter related aspects include topics like data checking, error localization and an introduction to imputation methods in R. References to relevant literature and R packages are provided throughout.

These lecture notes are based on a tutorial given by the authors at the useR!2013 conference in Albacete, Spain.

Pure gold!

Plus this tip (among others):

Tip. To become an R master, you must practice every day.

The more data you clean, the better you will become!


by Patrick Durusau at October 05, 2016 12:33 AM

October 04, 2016

Patrick Durusau

Deep-Fried Data […money laundering for bias…]

Deep-Fried Data by Maciej Ceglowski. (paper) (video of same presentation) Part of Collections as Data event at the Library of Congress.

If the “…money laundering for bias…” quote doesn’t capture your attention, try:

I find it helpful to think of algorithms as a dim-witted but extremely industrious graduate student, whom you don’t fully trust. You want a concordance made? An index? You want them to go through ten million photos and find every picture of a horse? Perfect.

You want them to draw conclusions on gender based on word use patterns? Or infer social relationships from census data? Now you need some adult supervision in the room.

Besides these issues of bias, there’s also an opportunity cost in committing to computational tools. What irks me about the love affair with algorithms is that they remove a lot of the potential for surprise and serendipity that you get by working with people.

If you go searching for patterns in the data, you’ll find patterns in the data. Whoop-de-doo. But anything fresh and distinctive in your digital collections will not make it through the deep frier.

We’ve seen entire fields disappear down the numerical rabbit hole before. Economics came first, sociology and political science are still trying to get out, bioinformatics is down there somewhere and hasn’t been heard from in a while.

A great read and equally enjoyable presentation.


by Patrick Durusau at October 04, 2016 11:45 PM

Moral Machine [Research Design Failure]

Moral Machine

From the webpage:

Welcome to the Moral Machine! A platform for gathering a human perspective on moral decisions made by machine intelligence, such as self-driving cars.

We show you moral dilemmas, where a driverless car must choose the lesser of two evils, such as killing two passengers or five pedestrians. As an outside observer, you judge which outcome you think is more acceptable. You can then see how your responses compare with those of other people.

If you’re feeling creative, you can also design your own scenarios, for you and others to browse, share, and discuss.

The first time I recall hearing this type of discussion was over thirty years ago when a friend, taking an ethics class, related the following problem:

You are driving a troop transport with twenty soldiers in the back and are about to enter a one-lane bridge. You see a baby sitting in the middle of the bridge. Do you swerve, going down an embankment and killing all on board, or do you go straight?

A lively college classroom discussion erupted and continued for the entire class. Various theories and justifications were offered, etc. When the class bell rang, the professor announced that the child had perished 59 minutes and 59 seconds earlier.

As you may guess, not a single person in the class called out “Swerve” when the question was posed.

The exercise was to illustrate that many “moral” decisions are made at the limits of human reaction time. Typically, between 150 and 300 milliseconds. (Speedy Science: How Fast Can You React? is a great activity from Scientific American to test your reaction time.)
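
In that spirit, a crude terminal reaction-time tester is easy to sketch in Python. This is only an illustration, not the Scientific American activity; the timing hooks are parameters purely so the function can be exercised without a keyboard:

```python
import random
import time

def reaction_time_ms(prompt=input, clock=time.perf_counter,
                     delay=lambda: time.sleep(random.uniform(1.0, 3.0))):
    """Wait a random interval, show GO!, and time the Enter keypress."""
    prompt("Press Enter to start, then wait for it...")
    delay()  # random pause so the GO! cannot be anticipated
    start = clock()
    prompt("GO! (press Enter)")
    return (clock() - start) * 1000.0
```

Run interactively, most people will land in that 150-300 millisecond band.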

The examples in MIT’s Moral Machine perpetuate the myth that moral decisions are the result of reflection and consideration of multiple factors.

Considered moral decisions do exist. Dietrich Bonhoeffer deciding to participate in a conspiracy to assassinate Adolf Hitler. Lyndon Johnson supporting civil rights in the South. But those are not the subject of the “Moral Machine.”

Nor is the “Moral Machine” even a useful simulation of what a driven and/or driverless car would confront. Visibility isn’t an issue as it often is, there are no distractions, no smart phones ringing, no conflicting input from passengers, etc.

In short, the “Moral Machine” creates a fictional choice, about which to solicit your “moral” advice, under conditions you will never experience.

Separating pedestrians from vehicles (once suggested by Buckminster Fuller, I think) is a far more useful exercise than college-level discussion questions.

by Patrick Durusau at October 04, 2016 08:45 PM

Resource: Malware analysis – …

Resource: Malware analysis – learning How To Reverse Malware: A collection of guides and tools by Claus Cramon Houmann.

This resource will provide you theory around learning malware analysis and reverse engineering malware. We keep the links up to date as the infosec community creates new and interesting tools and tips.

Some technical reading to enjoy instead of political debates!


by Patrick Durusau at October 04, 2016 07:04 PM

“Just the texts, Ma’am, just the texts” – Colin Powell Emails Sans Attachments

As I reported in Bulk Access to the Colin Powell Emails – Update, I was looking for a host for the complete Colin Powell emails at 2.5 GB, but I failed on that score.

I can’t say if that result is lack of interest in making the full emails easily available or if I didn’t ask the right people. Please circulate my request when you have time.

In the meantime, I have been jumping from one “easy” solution to another, most of which involved parsing the .eml files.

But my requirement is to separate the attachment from the emails, quickly and easily, not to parse the .eml files in preparation for further processing.

How does a 22 character, command line sed expression sound?

Do you know of an “easier” solution?

sed -i '/base64/,$d' *

Reasoning: the first attachment (in the event of multiple attachments) will include the string "base64", so I pass a range expression that starts at the first matching line and runs to the end of the file ("$"), delete that range ("d"), and write the files in place ("-i").
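
For comparison, here is a rough Python equivalent of the sed one-liner. This is a sketch, assuming (as the sed command does) that the first attachment begins at the first line containing "base64":

```python
import glob

def strip_from_base64(text):
    """Drop everything from the first line containing 'base64' to the end,
    mirroring sed '/base64/,$d'."""
    kept = []
    for line in text.splitlines(keepends=True):
        if "base64" in line:
            break
        kept.append(line)
    return "".join(kept)

if __name__ == "__main__":
    # Rewrite every .eml file in place, like sed -i.
    for name in sorted(glob.glob("*.eml")):
        with open(name) as f:
            body = f.read()
        with open(name, "w") as f:
            f.write(strip_from_base64(body))
```

Like the sed version, this also truncates at a body line that merely mentions "base64", so it trades precision for brevity.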

There are far more sophisticated solutions to this problem but as crude as this may be, I have reduced the 2.5 GB archive file that includes all the emails and their attachments down to 63 megabytes.

Attachments are important too but my first steps were to make these and similar files more accessible.

Obtaining > 29K files through the drinking straw at DCLeaks, or waiting until I find a host for a consolidated 2.5 GB file, doesn’t make these files more accessible.

A 63 MB download of the Colin Powell Emails With No Attachments may.

Please feel free to mirror these files.

PS: One oddity I noticed in testing the download: with Chrome, the file size inflates to 294MB; with Mozilla, it is 65MB. Both unpack properly. Suggestions?

PPS: More sophisticated processing of the raw emails and other post-processing to follow.

by Patrick Durusau at October 04, 2016 12:55 AM

October 03, 2016

Patrick Durusau

Security Community “Reasoning” About Botnets (and malware)

In case you missed it: Source Code for IoT Botnet ‘Mirai’ Released by Brian Krebs offers this “reasoning” about a recent release of botnet software:

The source code that powers the “Internet of Things” (IoT) botnet responsible for launching the historically large distributed denial-of-service (DDoS) attack against KrebsOnSecurity last month has been publicly released, virtually guaranteeing that the Internet will soon be flooded with attacks from many new botnets powered by insecure routers, IP cameras, digital video recorders and other easily hackable devices.

The leak of the source code was announced Friday on the English-language hacking community Hackforums. The malware, dubbed “Mirai,” spreads to vulnerable devices by continuously scanning the Internet for IoT systems protected by factory default or hard-coded usernames and passwords.

As a recent victim of a DDoS attack, Krebs may understandably be angry about the release of Mirai. But only to a degree.

Non-victims of such DDoS attacks have been quick to take up the “sky is falling” refrain.

Consider Hacker releases code for huge IoT botnet, or, Hacker Releases Code That Powered Record-Breaking Botnet Attack, or, Brace yourselves—source code powering potent IoT DDoSes just went public: Release could allow smaller and more disciplined Mirai botnet to go mainstream, as samples.

Mirai is now available to “anyone,” but where the reasoning of Krebs and others breaks down is that there is no evidence “everyone” wants to run a botnet.

Even if the botnet was as easy (sic) to use as Outlook.

For example, gun ownership in the United States now stands at 36% of the adult population, yet that roughly one-third of the population will not commit murder this coming week.

As of 2010, there were roughly 210 million licensed drivers in the United States. Yet, this coming week, it is highly unlikely that any of them will commandeer a truck and run down pedestrians with it.

The point is that the vast majority of users, even if they were competent to read and use the Mirai code, aren’t criminals. Nor does possession of the Mirai code make them criminals.

It could be they are just curious. Or interested in how it was coded. Or, by some off chance, they could even have good intentions and want to study it to fight botnets.

Attempting to prevent the spread of information hasn’t resulted in any apparent benefit, at least to the cyber community at large.

Perhaps it’s time to treat the cyber community as adults, some of whom will make good decisions and some less so.

by Patrick Durusau at October 03, 2016 01:41 AM

Value-Add Of Mapping The Food Industry

Did you know that ten (10) companies control all of the major food/drink brands in the world?


(From These 10 companies control everything you buy, where you can find a larger version of this image.)

You could, with enough searching, have put together all ten of these mini-maps, but then that effort would have to be repeated by everyone seeking the same information.

But, instead of duplicating an initial investment to identify players and their relationships, you can focus on identifying their IP addresses, process control machinery, employees, and other useful data.

What are your value-add of mapping examples?

by Patrick Durusau at October 03, 2016 12:39 AM

October 02, 2016

Patrick Durusau

Nuremberg Trial Verdicts [70th Anniversary]

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and of (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bohlen und Halbach, who was seriously ill; and Martin Bormann, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove to be a more difficult indexing, mining, modeling, and searching challenge than Twitter feeds.

Imagine instead of “text” similarity, these volumes were mined for “deed” similarity. Similarity to deeds being performed now. By present day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

by Patrick Durusau at October 02, 2016 01:46 AM

Data Science Toolbox

Data Science Toolbox

From the webpage:

Start doing data science in minutes

As a data scientist, you don’t want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.

A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

  • Python, with the following packages: IPython Notebook, NumPy, SciPy, matplotlib, pandas, scikit-learn, and SymPy.
  • R, with the following packages: ggplot2, plyr, dplyr, lubridate, zoo, forecast, and sqldf.
  • dst, a command-line tool for installing additional bundles on the Data Science Toolbox (see next section).

Let us know if you want to see something added to the Data Science Toolbox.

Great resource for doing or teaching data science!

And an example of using a VM to distribute software in a learning environment.

by Patrick Durusau at October 02, 2016 01:31 AM

October 01, 2016

Patrick Durusau

Type-driven Development … [Further Reading]

The Further Reading slide from Edwin Brady’s presentation Type-driven Development of Communicating Systems in Idris (Lambda World, 2016) was tweeted as an image, eliminating the advantages of hyperlinks.

I have reproduced that slide with the links as follows:

Further Reading

On total functional programming

On interactive programming with dependent types

On types for communicating systems:

On Wadler’s paper, you may enjoy the video of his presentation, Propositions as Sessions or his slides (2016), Propositions as Sessions, Philip Wadler, University of Edinburgh, Betty Summer School, Limassol, Monday 27 June 2016.

by Patrick Durusau at October 01, 2016 08:49 PM

Government Contractor Persistence

Persistence of data is a hot topic in computer science but did you know government contractors exhibit persistence as well?

Remember the 22,000,000+ record leak from the US Office of Personnel Management?

Leaks don’t happen on their own, and it turns out that Keypoint Government Solutions was the weak link in the chain that resulted in that loss.

Cory Doctorow reports in Company suspected of blame in Office of Personnel Management breach will help run new clearance agency:

It’s still not clear how OPM got hacked, but signs point to a failure at one of its contractors, Keypoint Government Solutions, who appear to have lost control of their logins/passwords for sensitive OPM services.

In the wake of the hacks, the job of giving out security clearances has been given to a new government agency, the National Background Investigations Bureau.

NBIB is about to get started, and they’ve announced that they’re contracting out significant operations to Keypoint. Neither Keypoint nor the NBIB would comment on this arrangement.

The loss of 22,000,000 records? Well, that could happen to anybody.


Initiatives, sprints, proclamations, collaborations with industry, academia, etc., are unlikely to change the practice of cybersecurity in the U.S. government.

Changing cybersecurity practices in government requires:

  • Elimination of contractor persistence. One failure is enough.
  • Immediate and permanent separation of management and staff who fail to implement and follow standard security practices.
  • Separated staff and management barred, permanently, from employment with any government contractor.
  • Staff of prior failed contractors barred from employment at present contractors. (An incentive for contractor staff to report shortfalls in current contracts.)
  • Multi-year funded contracts that include funding for independent red team testing of security.

A policy of no consequences for security failures defeats all known security policies.

by Patrick Durusau at October 01, 2016 05:59 PM

Version 2 of the Hubble Source Catalog [Model For Open Access – Attn: Security Researchers]

Version 2 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Version 2 includes:

  • Four additional years of ACS source lists (i.e., through June 9, 2015). All ACS source lists go deeper than in version 1. See current HLA holdings for details.
  • One additional year of WFC3 source lists (i.e., through June 9, 2015).
  • Cross-matching between HSC sources and spectroscopic COS, FOS, and GHRS observations.
  • Availability of magauto values through the MAST Discovery Portal. The maximum number of sources displayed has increased from 10,000 to 50,000.

The HSC v2 contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists from HLA version DR9.1 (data release 9.1). The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. The relative astrometry is supported by using Pan-STARRS, SDSS, and 2MASS as the astrometric backbone for initial corrections. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012).


There are currently three ways to access the HSC as described below. We are working towards having these interfaces consolidated into one primary interface, the MAST Discovery Portal.

  • The MAST Discovery Portal provides a one-stop web access to a wide variety of astronomical data. To access the Hubble Source Catalog v2 through this interface, select Hubble Source Catalog v2 in the Select Collection dropdown, enter your search target, click search and you are on your way. Please try Use Case Using the Discovery Portal to Query the HSC
  • The HSC CasJobs interface permits you to run large and complex queries, phrased in the Structured Query Language (SQL).
  • HSC Home Page

    – The HSC Summary Search Form displays a single row entry for each object, as defined by a set of detections that have been cross-matched and hence are believed to be a single object. Averaged values for magnitudes and other relevant parameters are provided.

    – The HSC Detailed Search Form displays an entry for each separate detection (or nondetection if nothing is found at that position) using all the relevant Hubble observations for a given object (i.e., different filters, detectors, separate visits).

Amazing, isn’t it?

The astronomy community long ago vanquished data hoarding and constructed tools to avoid moving very large data sets across the network.

All while enabling more and not less access and research using the data.

Contrast that to the sorry state of security research, where example code is condemned, if not actually prohibited by law.

Yet, if you believe current news reports (always an iffy proposition), cybercrime is growing by leaps and bounds. (PwC Study: Biggest Increase in Cyberattacks in Over 10 Years)

How successful is the “data hoarding” strategy of the security research community?

by Patrick Durusau at October 01, 2016 01:44 AM

Going My Way? – Explore 1.2 billion taxi rides

Explore 1.2 billion taxi rides by Hannah Judge.

From the post:

Last year the New York City Taxi and Limousine Commission released a massive dataset of pickup and dropoff locations, times, payment types, and other attributes for 1.2 billion trips between 2009 and 2015. The dataset is a model for municipal open data, a tool for transportation planners, and a benchmark for database and visualization platforms looking to test their mettle.

MapD, a GPU-powered database that uses Mapbox for its visualization layer, made it possible to quickly and easily interact with the data. Mapbox enables MapD to display the entire results set on an interactive map. That map powers MapD’s dynamic dashboard, updating the data as you zoom and pan across New York.

Very impressive demonstration of the capabilities of MapD!

Imagine how you can visualize data from your hundreds of users geo-spotting security forces with their smartphones.

Or visualizing data from security forces tracking your citizens.

Technology cuts both ways.

The question is whether the sharper technology sword is going to be in your hands or those of your opponents?

by Patrick Durusau at October 01, 2016 01:20 AM

Introducing the Open Images Dataset

Introducing the Open Images Dataset by Ivan Krasin and Tom Duerig.

From the post:

In the last few years, advances in machine learning have enabled Computer Vision to progress rapidly, allowing for systems that can automatically caption images to apps that can create natural language replies in response to shared photos. Much of this progress can be attributed to publicly available image datasets, such as ImageNet and COCO for supervised learning, and YFCC100M for unsupervised learning.

Today, we introduce Open Images, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. We tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch and the images are listed as having a Creative Commons Attribution license*.

The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API. For the validation set, we had human raters verify these automated labels to find and remove false positives. On average, each image has about 8 labels assigned. Here are some examples:

Impressive data set, if you want to recognize a muffin, gherkin, pebble, etc., see the full list at dict.csv.
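
The label dictionary is a two-column CSV mapping machine label IDs to display names. A minimal loader sketch (the IDs and rows below are illustrative stand-ins, not actual entries from dict.csv):

```python
import csv
import io

# Illustrative rows in the dict.csv style (label-id,display-name);
# these IDs are made up -- consult the real dict.csv for actual values.
sample = """/m/0aaa1,muffin
/m/0bbb2,gherkin
/m/0ccc3,pebble
"""

def load_label_dict(fileobj):
    """Return a {label_id: display_name} mapping from a two-column CSV."""
    return {row[0]: row[1] for row in csv.reader(fileobj) if row}

labels = load_label_dict(io.StringIO(sample))
print(labels["/m/0aaa1"])  # muffin
```

Point the same loader at the downloaded dict.csv to resolve the labels attached to each image URL.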

Hopefully the techniques you develop with these images will lead to more focused image recognition. ;-)

I lightly searched the list and no “non-safe” terms jumped out at me. Suitable for family image training.

by Patrick Durusau at October 01, 2016 01:10 AM

ggplot2 2.2.0 coming soon! [Testers Needed!]

ggplot2 2.2.0 coming soon! by Hadley Wickham.

From the post:

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available. Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# install.packages("devtools")
devtools::install_github("hadley/ggplot2")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:

install.packages("ggplot2")

ggplot2 2.2.0 will be a relatively major release including:

The majority of this work was carried out by Thomas Pedersen, whom I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages: ggraph, ggforce, and tweenr.

Just in case you are casual about time, tomorrow is October 1st, which on most calendars means that “early November” isn’t far off.

Here’s an easy opportunity to test ggplot2 2.2.0 and related visualization packages before the official release.


by Patrick Durusau at October 01, 2016 12:29 AM

September 30, 2016

Patrick Durusau

ORWL – Downside of a Physically Secure Computer

Meet ORWL. The first open source, physically secure computer


If someone has physical access to your computer with secure documents present, it’s game over! ORWL is designed to solve this as the first open source physically secure computer. ORWL (pronounced or-well) is the combination of the physical security from the banking industry (used in ATMs and Point of Sale terminals) and a modern Intel-based personal computer. We’ve designed a stylish glass case which contains the latest processor from Intel – exactly the same processor as you would find in the latest ultrabooks and we added WiFi and Bluetooth wireless connectivity for your accessories. It also has two USB Type C connectors for any accessories you prefer to connect via cables. We then use the built-in Intel 515 HD Video which can output up to 4K video with audio.

The physical security enhancements we’ve added start with a second authentication factor (wireless keyfob) which is processed before the main processor is even powered up. This ensures we are able to check the system’s software for authenticity and security before we start to run it. We then monitor how far your keyfob is from your PC – when you leave the room, your PC will be locked automatically, requiring the keyfob to unlock it again. We’ve also ensured that all information on the system drive is encrypted via the hardware on which it runs. The encryption key for this information is managed by the secure microcontroller which also handles the pre-boot authentication and other security features of the system. And finally, we protect everything with a high security enclosure (inside the glass) that prevents working around our security by physically accessing hardware components.

Any attempt to get physical access to the internals of your PC will delete the cryptographic key, rendering all your data permanently inaccessible!

The ORWL is a good illustration that good security policies can lead to unforeseen difficulties.

Or as the blog post brags:

Any attempt to get physical access to the internals of your PC will delete the cryptographic key, rendering all your data permanently inaccessible!

All I need do to deprive you of your data (think ransomware) is to physically tamper with your ORWL.

Of interest to journalists who need the ability to deprive others of data on very short notice.

Perhaps a fragile version for journalists and a more abuse-resistant version for the average user.


by Patrick Durusau at September 30, 2016 06:57 PM

Multiple Backdoors found in D-Link DWR-932 B LTE Router [There is an upside.]

Multiple Backdoors found in D-Link DWR-932 B LTE Router by Swati Khandelwal.

From the post:

If you own a D-Link wireless router, especially DWR-932 B LTE router, you should get rid of it, rather than wait for a firmware upgrade that never lands soon.

D-Link DWR-932B LTE router is allegedly vulnerable to over 20 issues, including backdoor accounts, default credentials, leaky credentials, firmware upgrade vulnerabilities and insecure UPnP (Universal Plug-and-Play) configuration.

If successfully exploited, these vulnerabilities could allow attackers to remotely hijack and control your router, as well as network, leaving all connected devices vulnerable to man-in-the-middle and DNS poisoning attacks.

Moreover, your hacked router can be easily abused by cybercriminals to launch massive Distributed Denial of Service (DDoS) attacks, as the Internet has recently witnessed record-breaking 1 Tbps DDoS attack that was launched using more than 150,000 hacked Internet-connected smart devices.

Security researcher Pierre Kim has discovered multiple vulnerabilities in the D-Link DWR-932B router that’s available in several countries to provide the Internet with an LTE network.

The current list price of this cyber-horror is £95.97. Wow!

Once word spreads about its swiss-cheese like security characteristics, one hopes its used price will fall rapidly.

Swati’s post makes a great starting checklist for grading penetration of the router for exam purposes.


PS: I’m willing to pay $10.00 plus shipping for one. (Contact me for details.)

by Patrick Durusau at September 30, 2016 02:58 AM

The Simpsons by the Data [South Park as well]

The Simpsons by the Data by Todd Schneider.

From the post:

The Simpsons needs no introduction. At 27 seasons and counting, it’s the longest-running scripted series in the history of American primetime television.

The show’s longevity, and the fact that it’s animated, provides a vast and relatively unchanging universe of characters to study. It’s easier for an animated show to scale to hundreds of recurring characters; without live-action actors to grow old or move on to other projects, the denizens of Springfield remain mostly unchanged from year to year.

As a fan of the show, I present a few short analyses about Springfield, from the show’s dialogue to its TV ratings. All code used for this post is available on GitHub.

Alert! You must run Flash in order to access Simpsons World, the source of Todd’s data.

Advice: Treat Flash as malware and run in a VM.

Todd covers the number of words spoken per character, gender imbalance, focus on characters, viewership, and episode summaries (tf-idf).
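
The tf-idf weighting used for the episode summaries scores a word highly when it is frequent in one document but rare across the collection. A minimal sketch in plain Python (an illustration, not Todd's actual code; the sample "summaries" are made up):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf scores for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({word: (count / len(doc)) * math.log(n / df[word])
                       for word, count in tf.items()})
    return scores

docs = [["homer", "donut", "donut"],
        ["bart", "skateboard"],
        ["homer", "bart"]]
scores = tf_idf(docs)
print(sorted(scores[0], key=scores[0].get, reverse=True)[0])  # donut
```

“donut” outranks “homer” in the first summary because “homer” also appears elsewhere, which is exactly why tf-idf surfaces episode-distinctive vocabulary.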

Other analysis awaits your imagination and interest.

BTW, if you want comedy data a bit closer to the edge, try Text Mining South Park by Kaylin Walker. Kaylin uses R for her analysis as well.

Other TV programs with R-powered analysis?

by Patrick Durusau at September 30, 2016 02:33 AM

Graph Computing with Apache TinkerPop

From the description:

Apache TinkerPop serves as an Apache governed, vendor-agnostic, open source initiative providing a standard interface and query language for both OLTP- and OLAP-based graph systems. This presentation will outline the means by which vendors implement TinkerPop and then, in turn, how the Gremlin graph traversal language is able to process the vendor’s underlying graph structure. The material will be presented from the perspective of the DSEGraph team’s use of Apache TinkerPop in enabling graph computing features for DataStax Enterprise customers.


Marko is brutally honest.

He warns the early part of his presentation is stream of consciousness and that is the truth!


That takes you to time mark 11:37 and the description of Gremlin as a language begins.

Marko slows, momentarily, but rapidly picks up speed.

Watch the video, then grab the slides and mark what has captured your interest. Use the slides as your basis for exploring Gremlin and Apache TinkerPop documentation.
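
To get a feel for Gremlin’s chained-step style before diving into the documentation, here is a toy imitation in plain Python. This is not Gremlin or TinkerPop code, just an illustration of the traversal idea over a small made-up graph (the vertex names echo TinkerPop’s classic toy graph):

```python
class Traversal:
    """A toy, eagerly-evaluated imitation of Gremlin's chained steps."""
    def __init__(self, graph, items):
        self.graph, self.items = graph, list(items)

    def out(self, label):
        """Follow outgoing edges with the given label."""
        nxt = [v for item in self.items
               for (lbl, v) in self.graph.get(item, [])
               if lbl == label]
        return Traversal(self.graph, nxt)

    def values(self):
        return self.items

# A tiny made-up graph: vertex -> [(edge_label, neighbor), ...]
graph = {
    "marko": [("knows", "vadas"), ("knows", "josh"), ("created", "lop")],
    "josh":  [("created", "ripple"), ("created", "lop")],
}

# Roughly analogous to a Gremlin traversal like:
#   g.V("marko").out("knows").out("created")
result = Traversal(graph, ["marko"]).out("knows").out("created").values()
print(result)  # ['ripple', 'lop']
```

Real Gremlin traversals are lazily evaluated and compiled to vendor-specific execution plans, which is the part Marko’s talk spends its time on.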


by Patrick Durusau at September 30, 2016 02:09 AM