The Emergent System After 9/11: Data Isn’t a Problem, Using It Is
In the wake of the 9/11 attacks, the U.S. intelligence community and the Department of Defense poured billions into intelligence collection. Data was collected from around the world, in a variety of forms, to prevent new terrorist attacks against the U.S. homeland. Every conceivable detail that could help prevent an attack or hunt down those responsible for plotting one was collected. Simply put, the United States does not suffer from a lack of data. The emerging capability gap between Beijing and Washington lies in processing that data: identifying the details and patterns relevant to America’s national security needs.
Historically, the traditional intersection of data collection, analysis, and national defense was a cadre of people in the intelligence community and the Department of Defense known as analysts. A bottom-up evolution that started after 9/11 has revolutionized how analysis is done and to what end. As data supplies grew and new demands for analysis emerged, the cadre began to cleave. The traditional cadre remained focused on strategic needs: warning policymakers and informing them of the plans and intentions of America’s adversaries. The new demands were more detailed and tactical, and the focus was on enabling operations, not informing the President. Who, specifically, should the U.S. focus its collection against? Which member of a terrorist group should the U.S. military target, where does he live, and what time does he drive to meet his buddies? A new, distinct cadre of professionals rose to meet this demand – they became known as targeters.
The targeter is a detective who pieces together the life of a subject or network in excruciating detail: their schedule, their family, their social contacts, their interests, their possessions, their behavior, and so on. The targeter does all of this to understand the subject so well that they can assess their subject’s importance in their organization and predict their behavior and motivation. They also make reasoned and supported arguments as to where to place additional intelligence collection resources against their target to better understand them and their network, or what actions the USG or our allies should take against the target to diminish their ability to do harm.
The day-to-day responsibilities of a targeter include combing through intelligence collection, be it reporting from a spy in the ranks of al-Qa’ida, a drug cartel, or a foreign government (HUMINT); collection of enemy communications (SIGINT); images of a suspicious location or object (IMINT); review of social media, publications, news reports, etc. (OSINT); or materials captured by U.S. military or partner country forces during raids against a specific target, location, or network member (DOCEX). Using all of the information available, the targeter looks for specific details that will help assess their subject or network and predict behaviors.
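Purely as an illustration of what this kind of all-source review implies for tooling (and not a description of any real USG schema), the disciplines above suggest a common data model: every report, whatever its source, tagged and searchable by the same selectors. A minimal Python sketch, with all names hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class Discipline(Enum):
    HUMINT = "humint"   # reporting from human sources
    SIGINT = "sigint"   # intercepted communications
    IMINT = "imint"     # imagery
    OSINT = "osint"     # open-source material
    DOCEX = "docex"     # documents and media captured during operations

@dataclass
class Report:
    report_id: str
    discipline: Discipline
    collected_at: datetime
    body: str
    selectors: set[str] = field(default_factory=set)  # names, phone numbers, email addresses, etc.

def relevant_reports(holdings: list[Report], selector: str) -> list[Report]:
    """Return every report, regardless of discipline, that carries a given selector."""
    return [r for r in holdings if selector in r.selectors]
```

The point of the sketch is only that the value is cross-discipline: the same phone number should surface HUMINT, SIGINT, and DOCEX hits alike.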
As more and more of the cadre cleaved into this targeter role, agencies began to formalize their roles and responsibilities. Data piled up and more targeters were needed. As this emergent system was being formalized into the bureaucracy, it quickly became overwhelmed by the volumes of data. Too few tools existed to exploit the datasets. Antiquated security orthodoxy surrounding how data is stored and accessed disrupted the targeter’s ability to find links. The bottom-up innovation stalled. Even within the most sophisticated and well-supported targeting environments in the U.S. Government, the problem has persisted and is growing worse. Without attention and resolution, these issues may make the system obsolete.
The Threat of the Status Quo
Two practical issues loom over the future of targeting and effective, focused U.S. national security actions: data overload and targeter enablement.
The New Stovepipes
With the 9/11 Commission Report, intelligence “stovepipes” became part of the American lexicon, shorthand for bureaucratic turf wars and politics. Information that could have increased the probability of detecting and preventing the attack wasn’t shared between agencies. Today, volumes of information are shared between agencies; exponentially more is collected and shared per month than in the months before 9/11. Ten years ago, a targeter pursuing a high value target (HVT) – say the leader of a terrorist group – couldn’t find, let alone analyze, all of the information of potential value to the manhunt. Too much poorly organized data means the targeter cannot possibly conduct a thorough analysis at the speed the mission demands. Details are missed, opportunities lost, patterns misidentified, mistakes made. The disorganization and walling off of data for security purposes means new stovepipes have appeared, not between agencies, but between datasets, often within the same agency. As data volume grows, these challenges grow with it.
Authors have been writing about data overload in the national security space for years now. Unfortunately, progress in managing the issue or offering workable solutions has been modest at best. Data of a variety of types and formats, structured and unstructured, flows into USG repositories every hour, 24/7/365, and every year the volume grows exponentially. There should be little doubt that in the very near future the USG will collect against foreign 5G, IoT, advanced satellite internet, and adversary databases at terabyte, petabyte, exabyte, or larger scale. The ingestion, processing, parsing, and sensemaking challenges of these data loads will be like nothing anyone has faced before.
Let’s illustrate the issue with a notional comparison.
In 2008, the U.S. military raided an al-Qa’ida safehouse in Iraq and recovered a laptop with a 1GB hard drive. The data on the hard drive was passed to a targeter for analysis. It contained a variety of documents, photos, and video. It took several hours and the help of a linguist, but the targeter was able to identify several leads and items of interest that would advance the fight against al-Qa’ida.
In 2017, the Afghan Government raided an al-Qa’ida media house and recovered over 40TB of data. The data on the hard drives was passed to a targeter for analysis. It contained a variety of documents, photos, and video. Let’s be nice to our targeter and say only a quarter of the 40TB is video – that’s still as much as 5,000 hours, or 208 days of around-the-clock video review, and she still hasn’t touched the documents, audio, or photos. Obviously, this workload is impossible given the pace of her mission, so she’s not going to do that. She and her team look only for a handful of specific documents and largely discard the rest.
Let’s say the National Security Agency in 2025 collected 1.4 petabytes of leaked Chinese Government emails and attachments. Our targeter and all of her teammates could easily spend the rest of their careers reviewing the data using current methods and tools.
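The arithmetic behind these scenarios is simple and sobering. A back-of-envelope sketch, assuming roughly 2GB per hour of compressed video and a one-quarter video share (assumptions of mine that reproduce the figures above):

```python
# Rough review-time estimates; the bitrate and video-share assumptions are illustrative only.
TB = 1_000       # gigabytes per terabyte
PB = 1_000_000   # gigabytes per petabyte
GB_PER_HOUR_OF_VIDEO = 2.0

def video_review_days(total_gb: float, video_fraction: float = 0.25) -> float:
    """Days of around-the-clock viewing needed for just the video portion of a capture."""
    video_hours = (total_gb * video_fraction) / GB_PER_HOUR_OF_VIDEO
    return video_hours / 24

print(video_review_days(1))          # 2008 laptop, 1GB: minutes of video
print(video_review_days(40 * TB))    # 2017 media house, 40TB: ~208 days
print(video_review_days(1.4 * PB))   # notional 2025 haul, 1.4PB: ~7,300 days, roughly 20 years
```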
In real life, the raid on Usama Bin Ladin’s compound produced over 250GB of material. It took an interagency task force in 2011 many months to manually comb through the data and identify material of interest. These examples shed light on only a subset of data overload. Keep in mind, this DOCEX is only one source our targeter has to review to get a full picture of her target and network. She’s also looking through all of the potentially relevant collected HUMINT, SIGINT, IMINT, OSINT, etc. that could be related to her target. That’s many more datasets, often stovepipes within stovepipes, with the same outmoded tools and methods.
This leads us to our second problem: targeter enablement.
The Collapsing Emergent System
Much of our targeter’s workday is spent on information extraction and organization, the vast majority of which is, well, robot work. She’ll be repeating manual tasks for most of the day. She knows what she needs to investigate today to continue building her target or network profile. Today it’s a name and a phone number. She has a time-consuming, tedious, and potentially error-prone effort ahead of her, a “swivel chair process”: tracking down the name and phone number across multiple stovepiped databases using a variety of outmoded software tools. She’ll map what she finds in a network analysis tool, in an electronic document, or <wince> a pen-and-paper notebook. Now…finally…she will begin to use her brain. She’ll look for patterns, she’ll analyze the data temporally, she’ll find new associations and correlations, and she’ll challenge her assumptions and come to new conclusions. Too bad she spent 80% of her time doing robot work.
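To make the robot work concrete: the swivel chair process is, at bottom, running the same selectors through every stovepiped system by hand and merging the hits into one picture. A minimal sketch of what that fan-out looks like once scripted; every database name and search stub below is hypothetical, not a real interface:

```python
# Hypothetical fan-out of a name and a phone number across stovepiped holdings.
# Each search function is a stub standing in for a separate system the targeter
# would otherwise query by hand through its own interface.
from typing import Callable

Record = dict  # one hit, e.g. {"selector": ..., "detail": ..., "source": ...}

def search_humint(selector: str) -> list[Record]: ...
def search_sigint(selector: str) -> list[Record]: ...
def search_docex(selector: str) -> list[Record]: ...

SOURCES: dict[str, Callable[[str], list[Record]]] = {
    "humint": search_humint,
    "sigint": search_sigint,
    "docex": search_docex,
}

def fan_out(selectors: list[str]) -> list[Record]:
    """Run every selector against every source and pool the hits for one review."""
    hits: list[Record] = []
    for selector in selectors:
        for source_name, search in SOURCES.items():
            for record in search(selector) or []:   # stubs return nothing here
                record["source"] = source_name
                hits.append(record)
    return hits

# Usage: fan_out(["Abu Example", "+1 555 010 0000"]) replaces hours of copy-and-paste.
```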
This is the problem as it stands today. The targeter is overwhelmed with too much unstructured and stovepiped information and does not have access to the tools required to clean, sift, sort, and process massive amounts of data. And remember, the system she operates in is about to receive exponentially more data. Absent change, a handful of things are almost certain to happen:
- More raw data will be collected than is actually relevant, increasing the stress on infrastructure that must store all of that data for future analysis.
- Infrastructure (technical and process-related) will continue to fail to make raw data available to technologists and targeters for processing at a mission-relevant pace.
- Targeters and analysts will continue to perform manual tasks that consume the majority of their time, leaving little time for actual analysis and delivery of insights.
- The timeline from data to information, to insights, to decision making will stretch further and further as data volumes grow.
- Insights that depend on correlations between millions of raw data points will be missed entirely, leading to the wrong targets being identified, targets or patterns being missed, or targets being prioritized on an inaccurate assessment of their importance.

This may seem banal or weedy, but it should be very concerning. This system – how the United States processes the information it collects to identify and prevent threats – will not work in the very near future. The data stovepipes of the 2020s can produce a surprise or catastrophe just as the institutional stovepipes of the 1990s did; it won’t be a black swan. As the U.S. competes with Beijing, its national defense will require more speed, not less, against more data than ever before. It will require evaluating data and making connections and correlations faster than a human can. It will require the effective processing of this mass of data to identify precision solutions that reduce the scope of intervention needed to achieve our goals while minimizing harm. Our current and future national defense needs our targeter to be motivated, enabled, and effective.
Innovating the System
To overcome the exponential growth in data and subsequent stovepiping, the IC doesn’t need to hire armies of 20-somethings to do around-the-clock analysis in warehouses all over northern Virginia. It needs to modernize its security approach to connect these datasets, and apply a vast suite of machine learning models and other analytics to help targeters start innovating. Now. Technological innovations are also likely to lead to more engaged, productive, and energized targeters who spend their time applying their creativity and problem-solving skills, and spend less time doing robot work. We can’t afford to lose any more trained and experienced targeters to this rapidly fatiguing system.
The current system, as discussed, is one of unvalidated data collection and mass storage, manual loading, mostly manual review, and robotic swivel chair processes for analysis.
The system of the future breaks down data stovepipes and eliminates the manual and swivel chair robot processes of the past. The system of the future automates data triage, so users can readily identify datasets of interest for deep manual research. It automates data processing, cleaning, correlations and target profiling – clustering information around a potential identity. It helps targeters identify patterns and suggests areas for future research.
How do current and emerging analytic and ML techniques bring us to the system of the future and better enable our targeter? Here are five ideas to start with:
Automated Data Triage: As data is fed into the system, a variety of analytics and ML pipelines are applied. A typical exploratory data analysis (EDA) report is produced (data size, file types, temporal analysis, etc.). Additionally, analytics ingest, clean, and standardize the data. ML and other approaches identify languages, set aside likely irrelevant information, summarize topics and themes, and identify named entities, phone numbers, email addresses, etc. This first step aids in validating data need, enables an improved search capability, and sets a new foundation for additional analytics and ML approaches. There are seemingly countless applications across the U.S. national security space.

Automated Correlation: Output from numerous data streams is brought into an abstraction layer and prepped for next-generation analytics. Automated correlation is applied across a variety of variables: potential name matches, facial recognition and biometric clustering, phone number and email matches, temporal associations, and locations. A minimal sketch of this kind of selector-based extraction and clustering follows this list.

Target Profiling: Network, Spatial, and Temporal Analytics: As the information is clustered, our targeter now sees associations pulled together by the computer. The robot, leveraging its computational speed along with machine learning for rapid comparison and correlation, has replaced the swivel chair process. Our targeter is now investigating associations, validating the profile, and refining the target’s pattern-of-life. She is coming to conclusions about the target faster and more effectively and is bringing more value to the mission. She’s also providing feedback to the system, helping to refine its results.

AI-Driven Trend and Pattern Analysis: Unsupervised ML approaches can help identify new patterns and trends that may not fit into the current framing of the problem. These insights can challenge groupthink, identify new threats early, and surface leads that our targeters may not even know to look for.

Learning User Behavior: Our new system shouldn’t just enable our targeter, it should learn from her. ML running behind the scenes, watching how she works, can drive incremental improvements to the system. What does she click on? Did she validate or refute a machine correlation? Why didn’t she explore a dataset that may have had value to her investigation and analysis? The system should learn and adapt to her behavior to better support her, highlighting data that could have value to her work. It should also help train new hires.

Let’s be clear, we’re far from the Laplace’s demon of HBO’s “Westworld” or FX’s “Devs”: there is no super machine that will replace the talented and dedicated people who make up the targeting cadre. Targeters will remain critical to evaluating and validating these results, doing deep research, and applying their human creativity and problem solving. The national security space hires brilliant and highly educated personnel to tackle these problems; let’s challenge and inspire them, not relegate them to the swivel chair processes of the past.
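To ground the first two ideas, here is a deliberately simple, stdlib-only Python sketch of selector-based triage and correlation: pull phone numbers and email addresses out of raw text, then cluster documents that share a selector into a candidate profile. It is a stand-in for the far richer ML pipelines described above, and the patterns, sample data, and names are all illustrative:

```python
import re
from collections import defaultdict

# Illustrative selector patterns; a real pipeline would use far more robust extraction.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_selectors(text: str) -> set[str]:
    """Triage step: pull out email addresses and normalized phone-number-like strings."""
    phones = {re.sub(r"[\s().-]", "", m) for m in PHONE_RE.findall(text)}
    return set(EMAIL_RE.findall(text)) | phones

def cluster_by_selector(docs: dict[str, str]) -> list[set[str]]:
    """Correlation step: group documents that share at least one selector (union-find)."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    first_doc_for: dict[str, str] = {}  # selector -> first document that carried it
    for doc_id, text in docs.items():
        find(doc_id)
        for sel in extract_selectors(text):
            if sel in first_doc_for:
                union(doc_id, first_doc_for[sel])
            else:
                first_doc_for[sel] = doc_id

    clusters: defaultdict[str, set[str]] = defaultdict(set)
    for doc_id in docs:
        clusters[find(doc_id)].add(doc_id)
    return list(clusters.values())

docs = {
    "docex-001": "Call me at +1 555 010 0000 or write to abu.example@mail.test",
    "sigint-042": "Subscriber +15550100000 contacted a known facilitator.",
    "osint-117": "Forum post registered to abu.example@mail.test",
}
print(cluster_by_selector(docs))  # all three documents cluster around one candidate identity
```

In this toy run, the shared phone number and email address pull three reports from three different sources into a single candidate profile; the targeter starts from that cluster instead of building it by hand.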
We need a new system to handle the data avalanche and support the next generation. Advanced computing, analytics, and applied machine learning will be critical to efficient data collection, successful data exploitation, and automated triage, correlation, and pattern identification. It’s time for a new chapter in how we ingest, process, and evaluate intelligence information. Let’s move forward.
PERSPECTIVE — As the U.S. competes with Beijing and addresses a host of national security needs, U.S. defense will require more speed, not less, against more data than ever before. The current system cannot support the future. Without robots, we’re going to fail.
News articles in recent years detailing the rise of China’s technology sector have highlighted the country’s increased focus on advanced computing, artificial intelligence, and communication technologies. The country’s five-year plans have increasingly focused on meeting and exceeding Western standards while building reliable internal supply chains and research and development for artificial intelligence (AI). Key drivers of this advancement are Beijing’s defense and intelligence goals.
Beijing’s deployment of surveillance across its cities, online platforms, and financial systems has been well documented. There should be little doubt that many of these implementations are being mined for direct or analogous uses in the intelligence and defense spaces. Beijing has been vacuuming up domestic data, mining the commercial deployments of its technology abroad, and collecting vast amounts of information on Americans, especially those in the national security space.
The goal behind this collection? The development, training, and retraining of machine learning models to enhance Beijing’s intelligence collection efforts, disrupt U.S. collection, and identify weak points in U.S. defenses. Recent reports clearly reflect the scale and focus of this effort, including the physical relocation of national security personnel and resources to Chinese datacenters to mine massive collections and disrupt U.S. intelligence collection. Far and away, the Chinese exceed all other U.S. adversaries in this effort.
As the new administration begins to shape its policies and goals, we’re seeing the typical media focus on political appointees, priority lists, and overall philosophical approaches. What we need is an intense focus on the intersection of data collection and artificial intelligence if the U.S. is to remain competitive and counter this rising threat.