IT Security and Defense-in-depth
I’ve been asked by several individuals to weigh in on the current crisis in Licking County, Ohio, where nearly the entirety of their county government is unable to function as a result of a ransomware attack.
This post represents only my views, and not those of any of my employers or associates.
While I have every reason to believe that the subject matter in this article will be beneficial for all environments, I do hereby disclaim any risk or damages (which may be severe!) resulting from anyone attempting to use this advice without the prerequisite knowledge and ability to do so in a manner that is appropriate for each environment. Beware: My bias is very heavily weighted toward defense of government operations. Attack vectors can vary significantly between industries / business activities – there is (despite this infuriating me to no end) no “one size fits all” solution to security posture. You must consult IT security specialists to determine what is appropriate for your type of organization.
Consider this post to be draft-quality! I welcome and encourage feedback from everyone knowledgeable about the topic.
At time of writing, I work for Fairfield County, Ohio; Licking is immediately northeast of Fairfield. In the interest of full disclosure, I currently know no more about this incident than has been shared via the media (I received a “heads up” advisory from the state during the first day of the ongoing incident, but everything it contained has since been made public). I have no working relationship with Licking County’s IT staff, and have no in-depth inside knowledge of their policies, practices, or technical environment. I can, however, make reasonably sound assumptions based on my experience as a Windows systems administrator and from using some software packages that Licking County also uses.
Licking County has said that they were attacked by ransomware, and that they shut down all of their computers shortly after they discovered the attack to prevent it from spreading. They’ve also said that the IT department is working this weekend (rather than watching the Super Bowl), that they have to reformat all of their end-user workstations, and that normal service may not be restored for weeks. To the best of my knowledge, that’s the end of technical information that has been released.
Reading between the lines, I think that all of these statements are sound:
- They have one or more Active Directory forests. If they have multiple, then the attack may have taken advantage of privileged passwords common between the domains.
- Their phone systems are powered by Windows. The media reports that phones weren’t working for the majority of county departments; my knowledge of intergovernmental agreements leads me to believe that the departments who did continue to have phone service probably did so because it wasn’t provided by the county.
- The attack gained very high execution privileges. Because the attack caused widespread server outages, and they are reformatting all of the end-user workstations, I feel reasonably confident in saying that the attack was executed with privileges similar to those of a domain administrator (though perhaps not “Domain Admins” itself; there are other groups in ADDS that can be used to gain domain administrative privileges).
- They do not use a reasonable end-user device deployment solution. If they did, re-deploying all of their end-user workstations would take a matter of hours (or potentially days, I suppose, depending on infrastructure), not days to weeks.
- Some or all of their backups were air-gapped/offline – either separated by air from computer systems or stored on a system that could not be controlled or updated by any Windows system involved in the attack.
I’ll be the first to admit that any or all of the above statements could be incorrect. If they’re running Windows 2000 or 2003 servers, then pretty much none of the above is relevant, as those platforms are known to be vulnerable and must not ever be used for anything ever again, full stop. There are also many, many possibilities that are even worse – for example, they could have given end-users administrative rights to servers, or even domain administrative privileges. (No, that’s really not that far-fetched. When I came to Fairfield County, I was horrified to find that nearly all of our end-users had effective domain admin privileges and could do nearly anything to any server they could reach by IP address. I rectified that immediately, and none of those domains have existed for years.)
Properly securing any network is neither easy nor inexpensive, but “it’s the cost of doing business”. If you want to use computers in your organization, you must take measures to secure both the computers and the data they hold. To say that IT security is a multifaceted approach is an extraordinary understatement; potential security vulnerabilities are a fact of life, and they can happen nearly anywhere (from a bug in an IP switch that has been walled over and nobody remembers, to personnel simply not knowing what to do and making matters worse).
IT security begins and ends with people; literally every user has an important part.
IT security is not a skill one has from birth; it has to be taught and re-taught recurrently, with revised procedures as technology progresses over time. For example, all account holders at every level must receive proactive IT security training to prepare them for:
- Social engineering
- Malicious actors including peers, supervisors, and officials
Both periodic and randomized testing are excellent ideas. Everyone needs to be trained to immediately contact qualified IT personnel whenever they encounter anything that “just doesn’t seem right”, let alone anything that is known to be harmful.
Management (at large)
Organization management must accept that they are not administrative from a technical perspective. I cannot possibly recount the number of managers, elected officials, department heads, board members, and other high-ranking personnel who see the term “administrator” somewhere on the computer and immediately stop all work and inhibit anyone else from accomplishing work until they have personally been granted “administrative rights”. They can’t explain why they need it, and they don’t understand what it does, but by God they must have it and nobody anywhere is going to tell them that they can’t have it. This opens an extraordinarily large security vulnerability, and I henceforth refuse to work in any organization that permits management (without sound technical justification) to hold technical administrative privileges.
Management must also understand the importance of the security policies that their IT group has developed, and must be willing to enforce them as strongly as necessary – including termination of employees who knowingly violate the policy. As an example, some middle management ignore password policy and either mandate that their subordinates provide their passwords to them whenever changed, or mandate that their subordinates use a specific formula to set their password(s). Upper management must immediately discipline and/or terminate any middle/lower manager having been found doing this. Failure at this point is one of the single largest vulnerabilities there are – particularly when an employee who appears to be happy is actually disgruntled. There is no valid reason for one user to be made aware of another user’s Active Directory account credentials. There are no exceptions to this rule. Ever.
Management (of IT/Operations)
This section in particular is very heavily biased as a result of my own experiences.
IT management must take the lead when it comes to developing and enforcing IT security policy, must demonstrate deep knowledge and enforcement of the policies, and must generally conduct themselves in a manner beyond reproach. I have MANY, MANY times been asked by end-users why they’re being required to follow specific IT security policies when IT staff visibly violate them routinely, and generally take a laissez-faire attitude toward enforcing policy. If IT management isn’t consistently and uniformly enforcing policy, the entire social dynamic falls apart and end-users eventually stop trying to comply.
Projects accepted for work by IT management must have proper project management. I here use the term “project” to mean “any set of work which requires analysis, planning, or budget”; contrast this with “task”, meaning “a set of work which can be completed by helpdesk-level staff, following a checklist, using provided supplies if necessary”. A really quick test that is often (though not always) reasonable is that if the work will take one or more people a length of time greater than one day, and it does not involve a recurring/routine hardware or software deployment, then it should be considered a project and must have a project manager engaged to see it from beginning to end.
All IT managers must also be capable of determining very quickly and accurately which of their staff are capable of completing any given necessary task or project; to do so requires that the managers first have a general knowledge of the scope of work needing to be completed. Assigning work that has a high impact to the organization as a whole to less skilled individuals who tend not to think analytically is a very clear threat to the security of the IT environment as a whole. Many individuals will attempt to “rise to the occasion”, and will complete a task or project “successfully”; later analysis would reveal that the work was completed unsatisfactorily at best or detrimentally at worst. Monetary losses are of least concern here; protecting the environment is far more important.
Change management must also be enforced by IT management. Any change to the IT environment must be documented in some predetermined manner; all changes except those which result from (or may cause or prevent) loss of life or property must be submitted ahead of time and be approved through a predetermined process. “Change”, here, includes but is not limited to physical facility access control and logical/data access control (including Active Directory security and distribution groups). IT management need not necessarily be involved with the actual change management process, but must enforce its use.
Work In Progress (WIP) must be analyzed by IT management, and procedures must be implemented to control the quantity of work permitted to be in progress at any given time. Although multi-tasking has become fashionable in the last couple of decades, humans have repeatedly been shown to be awful at it; they are more likely to be accurate and efficient when given work in a serial manner and allowed to work uninterrupted for lengthy periods of time. WIP volume tends to be inversely proportional to accuracy and efficiency because people make more mistakes and remember far less when “juggling” multiple workloads.
Part of the WIP-control process is necessarily for IT management to refuse to accept projects from upper management or other high-ranking personnel when the projects would either increase WIP beyond predefined reasonable thresholds or would in any way negatively affect the IT environment overall. Put another way, IT management must “think globally” in that their entire organization’s operation as a whole depends on them, and they must “act locally” when they accept or refuse projects from specific individuals or departments. (Herein, “refuse” may also mean “defer”, depending on the locally implemented WIP practices.)
Continuing with the theme of IT worker accuracy and efficiency having a very significant impact on the security posture of the organization: an on-call policy must be developed and enforced. Many organizations have grown organically; though it may have once been appropriate for someone having a technological crisis outside of IT’s regular business hours to simply call and wake whichever IT employee they most know or trust, that model falls apart in all IT organizations with more than one on-call individual.
Using Fairfield County as an example, it was at one time common for certain high risk environments (such as the 911/PSAP office) to call one IT staff member with a given problem, while a different high risk environment (such as the highest jail officer on duty) would call a different IT staff member with the same problem, because they were unaware that the other department(s) were having the same problem, and there was no mechanism in place to ensure that a single individual received all of the on-call incidents raised on any given day.
On-call scheduling should be worked out amongst the individuals who are actually in the on-call rotation, with only input from IT management. (I’ve also often seen the inverse situation, where IT management has mandated an on-call schedule that is detrimental to many of the on-call workers, who would have been able to acceptably determine who is on-call when, providing adequate coverage, amongst themselves. Also problematic is having lesser-skilled individuals in the on-call rotation, when after-hours incidents are often complex situations requiring analysis.) An automated incident handling system must be used for this; simply providing a single cell phone to be handed off, or publishing a schedule of whom to call when, could lead to the on-call individual being overwhelmed with calls given certain failure modes. An automated incident handling system could allow the on-call individual to acknowledge the additional complaint and continue on with less interruption. Some systems go so far as to permit the on-call individual to request help from their next-in-line for duty in case of overload or skill deficiency (or nonresponse).
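As a rough illustration of what such an automated incident handling system does, the Python sketch below sends every report to whoever is currently on call, merges repeat reports of the same fault into one incident, and lets the incident be handed to the next person in line. The class names, the fault key, and the people in the rotation are all hypothetical; a real product would add paging, acknowledgement timers, and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """One underlying fault, with every report about it attached."""
    fault_key: str
    assignee: str
    reporters: list = field(default_factory=list)


class OnCallRouter:
    """Routes every report to the current on-call person and merges duplicates."""

    def __init__(self, rotation):
        if not rotation:
            raise ValueError("rotation must name at least one person")
        self.rotation = list(rotation)  # index 0 is on call, index 1 is next in line
        self.open_incidents = {}        # fault_key -> Incident

    def report(self, fault_key, reporter):
        """Record a report; return the incident and whether a new page is needed."""
        incident = self.open_incidents.get(fault_key)
        paged = incident is None
        if incident is None:
            incident = Incident(fault_key, assignee=self.rotation[0])
            self.open_incidents[fault_key] = incident
        incident.reporters.append(reporter)
        return incident, paged

    def escalate(self, fault_key):
        """Hand the incident to the next person in the rotation
        (overload, skill deficiency, or nonresponse)."""
        incident = self.open_incidents[fault_key]
        position = self.rotation.index(incident.assignee)
        incident.assignee = self.rotation[(position + 1) % len(self.rotation)]
        return incident.assignee


# Hypothetical rotation and departments, echoing the scenario above.
router = OnCallRouter(["alice", "bob"])
first, paged_first = router.report("phones-down", "911/PSAP")
second, paged_second = router.report("phones-down", "jail")
```

In this sketch the second department's call lands on the same incident and the same person, without generating a second page, which is the behavior the two-callers scenario above was missing.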
Finally, IT management must insist that their entire environment be documented to an extent such that the entire IT/Operations team could be killed in any type of disaster, and highly skilled IT workers who have never before seen the environment could recover it and (relatively) quickly put it back into service.
IT/Operations personnel must be extremely knowledgeable in everything they do. Taking actions “above and beyond” technical competence is a recipe for potential disaster – seemingly innocent actions can have far-reaching and potentially devastating consequences. Even if they aren’t catastrophic or far-reaching, correcting mistakes, misconfigurations, or altogether inappropriate actions generally takes knowledgeable personnel far longer than the original task would have taken.
Additionally, operating policy and procedure must be adhered to in every case, even if exceedingly burdensome. If the policy or procedure is convoluted or unnecessary, then it should be amended and the revised document followed instead. Ignoring or circumventing policy simply to do someone a favor, or because a high-ranking individual demanded it, or for any other non-heroic reason, is not acceptable. Personnel who repeatedly demonstrate contempt for technical policy must be reprimanded and eventually terminated.
That said, all IT/Operations personnel must be empowered to and confident in refusing requests from others, including direct orders from their chain of command or from other high-ranking individuals, when compliance would likely lead to direct or indirect harm to the environment. Personnel who are not equipped to confront upper management and/or refuse to comply with unreasonable requests independent of support from their supervisors must not be permitted privileged access.
Beginning at this point, my target readers morph from general audiences to those who have experience in systems and network administration.
IP Transport and Network Administration
- An organization-wide plan must be developed to determine what types of traffic must be separated (into VLANs and then subnets, ensuring that only small groups of devices are impacted by malfunctioning or compromised equipment in the same broadcast domain). (I very, very, very strongly recommend that you never have any workstations in the same VLAN as servers.) Once that has been determined, a consistent plan must be formed to determine the resources that each of those types of traffic are permitted to reach (and/or the inverse, depending on resource direction).
- Firewalls (or other IP devices with firewall capabilities) must be used to control traffic flow by following the aforementioned resource access plan; breaking or omitting transport routes is not acceptable. The ACLs on those firewalls must be implemented consistently across the organization, according to the resource access plan, with any variation being specifically documented and justified.
- Wireless network presentation to end-users must be consistent organization-wide. Wired access may also be provided for guests. Neither of these types of traffic may be permitted to directly grant access to any data resources. Direct access to output devices, such as projectors, other presentation equipment, and printers that do not store data, may be permitted. If a user needs to access a network resource wirelessly, a VPN connection must be employed. While not security-related, consider the licensing (and therefore legal and financial) obligation posed by allowing individuals to access any resource (including deskphones).
- End-users who require access to internal resources while external to the network may be issued a VPN account. End-user accounts may not be shared for multiple individuals; every end-user must have their own authentication credentials.
- End-user VPN tunnels should have security assessment and mitigation performed during the connection process, to ensure that a machine with obvious vulnerabilities is unable to connect to a secure resource.
- DHCP must be implemented on all VLANs and subnets whose purpose is to connect devices performing any function other than strictly IP-transport. For example, all VLANs dedicated to end user devices, wireless networks, surveillance cameras, out-of-band management systems, facility access control systems, and building automation systems, must be DHCP-enabled.
- NAT must not be exclusively relied upon for transport security. NAT is generally more difficult to configure than firewall ACLs, makes troubleshooting more difficult, and was not designed to replace firewall ACLs.
- Privileged users permitted to administer IP transport equipment (switches, routers, firewalls, etc) must hold separate user accounts for administration of IP transport equipment. Shared privileged accounts are never permitted.
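The resource access plan described in the list above can be thought of as a default-deny table keyed by traffic type. The Python sketch below is one way to write such a plan down and check a flow against it before translating it into firewall ACLs. The zone names, subnets, and ports are made-up examples, not a recommendation for any particular environment.

```python
import ipaddress

# Hypothetical zones and subnets; real values come from the organization's own plan.
ZONES = {
    "workstations": ipaddress.ip_network("10.10.0.0/16"),
    "servers": ipaddress.ip_network("10.20.0.0/16"),
    "printers": ipaddress.ip_network("10.30.0.0/16"),
    "guest": ipaddress.ip_network("10.90.0.0/16"),
}

# The access plan: (source zone, destination zone) -> permitted TCP ports.
# Any pair or port that is not listed is denied.
ACCESS_PLAN = {
    ("workstations", "servers"): {443, 445},
    ("workstations", "printers"): {9100},
    ("guest", "printers"): {9100},
}


def zone_of(address):
    """Return the name of the zone containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in ZONES.items():
        if ip in network:
            return name
    return None


def is_permitted(src, dst, port):
    """Default-deny check of a single flow against the access plan."""
    src_zone, dst_zone = zone_of(src), zone_of(dst)
    if src_zone is None or dst_zone is None:
        return False  # unknown networks are never implicitly trusted
    return port in ACCESS_PLAN.get((src_zone, dst_zone), set())
```

Keeping the plan in one reviewable table makes it easier to spot ACLs that deviate from it, and any deviation can then be documented and justified as the list requires.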
Windows, Active Directory, and Systems Administration
- Wherever Windows servers or clients are used, they must be joined to an Active Directory domain, which must be used to authenticate and administer them.
- Each Active Directory domain must have no fewer than two domain controllers on no fewer than two physical servers (and, if your organization has multiple sites, in no fewer than two sites). Loss of your final domain controller should be considered to be functionally equivalent to complete loss of your entire IT environment, fully inclusive.
- Server operating system instances need to have limited purpose, rather than be general purpose. To put it another way, there needs to be no more than one purpose for the existence of each virtual machine. For example, a single virtual machine may have an application suite composed of ten applications on it, but it has one major purpose – running that application suite. A single virtual machine may not have three unrelated applications installed on it, as that significantly increases risk of compromise, data integrity loss, and data leaking; it can also impede application upgrades and migration, and can impede troubleshooting and fault recovery.
- Users must not be directly granted access to any resources; users may only gain access to resources by virtue of group membership.
- Comprehensive systems monitoring must be utilized to provide IT/Operations insight into systems performance, faults, configuration mistakes, etc. The monitoring software should be able to automatically open an incident in the automated on-call software in event it detects a major fault/alarm.
- New websites must use HTTPS+HSTS only; existing websites must begin service using HTTPS as soon as possible, with HSTS implementation as soon as possible thereafter. By “HTTPS”, I mean strong/modern security, using certificates and protocols that are generally accepted as secure. TLSv1.1, TLSv1.2, and later may be used for public-facing services; TLSv1.2 or later should be used for internal-facing services; SSLv3 and earlier are prohibited and must not be negotiated.
- Self-signed certificates must not be considered an acceptable solution for transport encryption of any type. An enterprise certificate authority must be operated to provide encryption for internal-only resources (though external users may trust the CA root for chain completion). Resources that will be externally exposed must use certificates obtained from a generally accepted certification authority.
- End-users may not create or use any unmanaged account that is local to any domain-joined Windows server or client at any time.
- Accounts local to domain-joined computers, such as “BUILTIN\Administrator”, pose a significant risk to security by potentially permitting lateral movement: if the same local account password is used on many machines, compromising one machine exposes them all. A tool (such as Local Administrator Password Solution (LAPS) by Microsoft) must be used to ensure that each machine has a unique password for any such account(s).
- Domain Name System (DNS) must be used and preferred everywhere, excepting only rare situations where it is technologically impossible to use. Accessing resources via a mechanism other than DNS, such as by referring to it by an IP address, has the potential to invoke unwanted behavior (obviously or not) and creates the potential for additional security vulnerability vectors.
- Multi-factor authentication must be used for all privileged user accounts, and potentially for high-value end-user accounts.
- While this document is being written with primary concern to a Windows environment, the same principles apply identically to non-Windows devices, such as Linux and BSD servers, and Mac workstations.
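To make the HTTPS+HSTS item in the list above more concrete, here is a minimal Python sketch using the standard library's ssl module. It builds a server-side TLS context that refuses anything older than TLS 1.2 (the internal-facing rule) and constructs the HSTS response header. The function names and the one-year max-age are illustrative assumptions; a real service would also load its certificate chain, and most web servers expose equivalent settings in their own configuration.

```python
import ssl


def make_internal_server_context():
    """TLS context for an internal-facing service: TLS 1.2 or later only."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Refuse to negotiate SSLv3, TLS 1.0, or TLS 1.1.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # A real deployment would call context.load_cert_chain(...) here with a
    # certificate issued by the enterprise CA (paths omitted in this sketch).
    return context


def hsts_header(max_age_seconds=31536000, include_subdomains=True):
    """Build the Strict-Transport-Security header as a (name, value) pair."""
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)


context = make_internal_server_context()
header = hsts_header()
```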
Privileged Access Methodology
Privileged access accounts are a necessary burden for IT workers. There are various degrees of privilege; in most organizations, “Domain Admins” or functional equivalent is the end-all be-all of security in each Active Directory domain. If someone who is interactively authenticated as a domain admin on any domain-joined machine executes something, the resulting process has unfettered access to absolutely everything in the entire domain. If the application is malicious, or if the application has a vulnerability that is being exploited by someone or something else maliciously, then the resulting disaster could potentially be similar to what Licking County is currently facing. Contrast that type of account with an unprivileged end-user account: if they run an application, the resulting process doesn’t have access to the servers and technologically-critical resources that the domain admin’s process would give it. The result is that harm could come to the user’s data, and data that the user has access to, but the chances are very significantly diminished that any harm would come to the organization overall.
Microsoft strongly recommends (and I fully endorse) a mitigation method called a Tiered Security Model. In this model, resources and privileged user accounts are grouped into tiers according to their technological importance. This is a defense-in-depth model, significantly reducing the ability for applications (much less malicious ones) to be executed with very high privileges. The result is that most processes are executed with the least privileges possible, reducing the scope of impact should anything be compromised via any vector.
Microsoft provides an example model of 3 tiers, 0-2:
- Tier 0 is the most privileged tier. This tier contains domain controllers and anything that can be used to gain access to domain controllers (such as agents running on domain controllers, hypervisors, and out-of-band management solutions for hardware that domain controllers run on).
- Tier 1 holds applications, centralized data, and the vast majority of servers. This is the tier that malicious actors wish to gain entry to.
- Tier 2 is the least privileged tier, containing end-user devices, printers, copiers, projectors, phones, and other very low privilege devices. This is the entry point for the vast majority of attacks.
Computers that are used to connect to a specific tier for administrative access must also be contained by that tier. Higher-tier (lower index number) user accounts must be prevented from authenticating to lower-tier (higher index number) resources, with technical enforcement to ensure compliance. Putting that all together: administrative access to servers, applications, filesystems, and in particular domain controllers, must not ever be permitted from a lower tier (higher index number). That is to say, special Privileged Access/Administrative Workstations (PAWs) must be used to gain access to systems administration. This does not necessarily mean that separate hardware must be employed for privileged users at each tier, but there are risks to running operating systems at multiple tiers on the same physical computer.
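The tier rule for privileged accounts boils down to a simple comparison, which the Python sketch below spells out and then applies to a handful of logon records. This only illustrates the logic; the account names, machine names, and record layout are hypothetical, and real enforcement would be done with technical controls in the domain rather than a script.

```python
VALID_TIERS = (0, 1, 2)


def may_log_on(account_tier, resource_tier):
    """A privileged account may only be used on resources in its own tier."""
    for tier in (account_tier, resource_tier):
        if tier not in VALID_TIERS:
            raise ValueError(f"unknown tier: {tier}")
    return account_tier == resource_tier


def find_violations(logons):
    """Return the logon records that cross a tier boundary.

    Each record is (account, account_tier, resource, resource_tier).
    """
    return [entry for entry in logons if not may_log_on(entry[1], entry[3])]


# Hypothetical logon records for illustration.
logons = [
    ("t0-admin", 0, "dc01", 0),    # tier 0 account on a domain controller: fine
    ("t0-admin", 0, "ws-017", 2),  # tier 0 credential typed into a workstation
    ("t1-admin", 1, "app03", 1),   # tier 1 account on an application server: fine
    ("t2-admin", 2, "app03", 1),   # tier 2 account reaching up into tier 1
]
violations = find_violations(logons)
```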
The solution that I recommend is a mixture of multiple Microsoft recommendations. For users who hold accounts at tier 0 or tier 1, I recommend dedicated lightweight and extremely portable devices such as Surface Pro tablets (because they’ll need to have it with them or at home in case of emergency). If the user holds accounts at both tiers 0 and 1, then a virtual machine may be operated inside Hyper-V on the same machine. This is not in violation of privilege escalation methodology because lower tier machines can be operated on higher tier hardware; lower tier user accounts may not interactively access a higher tier resource, but that is circumvented in this model since the user will be interactively accessing the outer operating system environment with their tier 0 account. If the user also holds a tier 2 account, yet another virtual machine could be operated on the machine for tier 2 use. No unprivileged access is permitted using these devices.
For users holding a tier 2 account but not tier 0 or tier 1, they may either use a dedicated machine exclusively for tier 2, or tier 2 can be the “outer” operating system on a workstation of their choice, and a separate VM can be installed in Hyper-V on that machine for unprivileged use.
As an alternative to the above, dedicated RDS servers could be created for each tier; the highest tier that each person holds would be the operating system on their computer of choice. They would use remote desktop to connect to the appropriate RDS server (lower tier or unprivileged).
Note that these recommendations can vary by organization; my recommendations are most appropriate for my current employer, and similar organizations. The principles apply uniformly across organizations, regardless of the implementation details. For example, non-Windows platforms could be used as hypervisors for tiered VMs on PAWs, so long as the hypervisor platform is considered to be protected as equivalent to the highest hosted tier.
Note that the tier list above places hypervisors in tier 0. All hardware must be considered to be at the highest tier of the workload it’s supporting. In this case, if you’re running 50 virtual machines on a hypervisor, and one of those virtual machines is a domain controller (tier 0), but all of the other virtual machines are merely application servers (tier 1), then the hypervisor must be managed at tier 0. I find this approach to be fundamentally problematic because of separation of privileges conflicts between domain controller administration and hypervisor administration. I very strongly recommend not running tier 0 virtual machines co-resident with tier 1 virtual machines. The logical conclusion is that domain controllers and their supporting tier 0 accessories need to have hardware separate from tier 1 servers.
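The hypervisor rule above can be expressed in a couple of lines. In this Python sketch (the function names are hypothetical), a host inherits the most privileged tier of anything it runs, and a second check flags hosts where tier 0 guests share hardware with guests from other tiers.

```python
def host_tier(guest_tiers):
    """A host must be managed at the most privileged (lowest-numbered) tier it hosts."""
    if not guest_tiers:
        raise ValueError("host has no guests to derive a tier from")
    return min(guest_tiers)


def mixes_tier0(guest_tiers):
    """True when tier 0 guests are co-resident with guests from any other tier."""
    tiers = set(guest_tiers)
    return 0 in tiers and len(tiers) > 1


# The example from the text: one domain controller and 49 application servers.
guests = [0] + [1] * 49
```

For that example, host_tier(guests) is 0, so the whole hypervisor has to be managed at tier 0, and mixes_tier0(guests) is True, which is the co-residency the text recommends avoiding.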
Tier 0 users must be restricted to be very few; tier 0 administration is not required on a frequent basis for the vast majority of organizations. Because they hold “the keys to the kingdom”, so to speak, there must be no fewer than two (in case something happens to one or the other). There must be extraordinary and explicit technical justification for holding tier 0 privileges (any tier, really, but tier 0 in particular).
Regarding “Run As” and UAC elevation prompts
User Account Control (UAC) must be enabled on all Windows clients without exception. UAC must be enabled by default on all Windows servers, but may be disabled with specific documented technical justification. UAC prevents processes from gaining execution privileges higher than the least privilege necessary. This effectively means that having UAC enabled could be the difference between a single computer being compromised and the entire organization’s IT environment being compromised.
Shortly after UAC was introduced in Windows Vista, many IT and security consultants wrote articles extolling the virtues of alternate-credential prompts, so that the interactive session doesn’t run with elevated privileges and to reduce the chance of something capturing the credentials for the privileged account. UAC is extremely beneficial for the first reason, which is why I require that it be enabled on all workstations under my control, and be enabled by default for all servers under my control (with limited exceptions). It does not, however, necessarily protect against key loggers or certain vulnerabilities in Windows. UAC and Run As serve specific purposes, but do not ensure absolute protection against credential or token theft. In short, no tier X credential may ever be used on a resource within tier X+Y, where Y is nonzero. For example, no tier 0 (domain admin) or tier 1 (server/app admin) credential may ever be entered on a tier 2 resource (non-PAW workstation).
Lifecycle and Procurement
IT security cannot be technically enforced (or even implemented with any semblance of integrity) unless the hardware and software in the environment are genuine (have not been tampered with by anyone, malicious or not) and are within manufacturer’s tolerances for production use. A single server running an outdated operating system reduces the overall security posture of an organization, because technical enforcement of security policies incompatible with that operating system cannot begin until that operating system is removed from service. Therefore, all operating systems on all devices at all tiers must be supportable by their manufacturer and maintainer, and must be kept up-to-date following the manufacturer’s recommended patching procedures, in a timely manner. When the manufacturer discontinues security-related support for an operating system, the operating system must be removed from production urgently. Similarly, if the manufacturer discloses that vulnerabilities exist in a particular operating system, but they refuse to provide an acceptable correction to that defect, the operating system must no longer be considered acceptable for production use.
The firmware that powers the hardware is just as important as the operating system. All firmware, on everything, must be kept up-to-date for security mitigation. If the manufacturer doesn’t support it, it cannot be in production. Along those same lines, if a security-related vulnerability has been disclosed about a particular firmware package and the manufacturer fails to release a mechanism for correction, the hardware must no longer be considered acceptable for production use. Additionally, if the hardware is no longer supportable by its manufacturer, or is no longer appropriate for the workload it has been tasked with, it must not be in production.
Software must also be kept up-to-date (all software, whether it’s something running on an end-user workstation for only one person, or an Enterprise Resource Planning suite on a litany of servers, or an http service engine). Vulnerabilities in software pose extraordinary risk to the environment as a whole, and it is very much outside of the scope of this document to enumerate the various ways that software vulnerabilities can be exploited. Suffice it to say that software must be supportable by its manufacturer or maintainer, be appropriate for its workload, and must be taken out of production if it is no longer considered secure.
Workstations and other end-user devices necessarily require the least protection in the tiered security model, while they’re simultaneously the most open for attack, and are the entry point for most intrusions. This means that the technology stack (hardware, firmware, operating system(s), software, etc) must be kept up-to-date at all levels, as quickly as possible after security patches/corrections are made available by each relevant manufacturer. It is completely infeasible to expect that this be maintained by IT personnel via manual processes; in all but the smallest of organizations (fewer than a few dozen computers, perhaps), there is far too much risk of updates not being applied correctly, or at all. The labor cost would also surely overwhelm any IT salary budget.
To keep all of the above up-to-date without bankrupting the organization with payroll costs (to say nothing of the inaccuracy and inefficiency that is inevitable with manual maintenance), the organization must employ tools that permit them to update large quantities of their environment with minimal employee involvement. For example, if Adobe Reader is installed on hundreds of workstations, and has a critical vulnerability for which a patch has been released, the patch needs to be installed on all of those hundreds of workstations as quickly as possible. Acceptable automation permits privileged personnel to create software deployment packages and publish them. The automation software identifies the impacted devices, applies the corrections or replacement software, and provides compliance reports for IT to review.
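The "identify impacted devices, then report compliance" step described above might look roughly like the following. This is a sketch under stated assumptions, not any vendor's implementation: the `Device` record, the function names, the inventory, and the version numbers are all hypothetical, and versions are assumed to be comparable tuples.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    # software name -> installed version tuple (assumed inventory format)
    installed: dict = field(default_factory=dict)


def find_impacted(devices, software, patched_version):
    """Return devices that have `software` installed below `patched_version`."""
    return [
        d for d in devices
        if software in d.installed and d.installed[software] < patched_version
    ]


def compliance_report(devices, software, patched_version):
    """Summarize how many devices carrying the software are already patched."""
    with_software = [d for d in devices if software in d.installed]
    impacted = find_impacted(devices, software, patched_version)
    return {
        "installed_on": len(with_software),
        "compliant": len(with_software) - len(impacted),
        "needs_patch": sorted(d.name for d in impacted),
    }


# Illustrative inventory; names and versions are made up.
inventory = [
    Device("WS-001", {"Adobe Reader": (11, 0, 10)}),
    Device("WS-002", {"Adobe Reader": (11, 0, 19)}),
    Device("WS-003", {}),
]
print(compliance_report(inventory, "Adobe Reader", (11, 0, 19)))
```

A real automation system would gather the inventory from its own agents and push the package itself; the point here is only that "who still needs the patch" should be a query, not a manual walk through the building.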
Maintaining the firmware, operating systems, and software in the above-mentioned automation system becomes more difficult as the number of unique stack combinations present in the environment grows. That is to say, it is more difficult to ensure maintenance policies are continuously kept up-to-date when there are twenty different combinations of hardware, firmware, and operating system in the environment, as opposed to two or three different combinations. Procurement policies must be developed, including lists of hardware and software that are explicitly approved or disapproved for acquisition. Replacement schedules must be developed, budgeted for, and enforced. Orders should be few, with a large quantity in each order, rather than many orders with small quantities, because hardware and firmware are generally consistent within a specific order (or purchase bundle, which I consider an order in this specific context), whereas hardware and/or firmware tend to vary between orders (regardless of whether that is the intent of the acquiring organization).
Following this to its logical next-step, one must conclude that no workstation may be individually unique (software-wise, within the scope of what the automation system manages, and hardware-wise, at all). Workstations must be treated as an off-the-shelf commodity; if an end-user needs a workstation, one is taken off of the shelf, deployed by the automation system, and the end-user can begin using it a short while later. If a computer breaks down, an identical computer is removed from storage, deployed automatically, and the end-user can resume work shortly thereafter. In event of a major disaster such as Licking County’s, where all of the workstations needed to be reformatted, the automation software could trigger a remote deployment of all of the desired workstations, and they’d be ready for use as soon as the deployment process completed (likely hours later, rather than weeks or even months as is likely to be the case in Licking County’s situation).
This additionally requires that all acquired software and licensing be tracked in a central repository, with which the automation system is integrated, so that it is aware of which software should ultimately belong on each workstation. No workstation may have software installed on it in a one-off case outside of the automation system, unless there are very clear technical justifications and procedures for how they specifically will be updated, maintained, and reinstalled, in event of security issue or necessary redeployment.
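To make the idea concrete, here is a minimal sketch of comparing what the central repository says a workstation should have against what it actually reports. Everything here is an assumption for illustration: the function name `plan_changes`, the use of plain software names as identifiers, and the notion of a "documented exceptions" set for the justified one-off cases mentioned above.

```python
def plan_changes(assigned, installed, documented_exceptions=frozenset()):
    """Compare repository assignments with a workstation's reported software.

    assigned: software the central repository assigns to this workstation
    installed: software the workstation actually reports
    documented_exceptions: one-off installs with a written justification
    """
    assigned, installed = set(assigned), set(installed)
    return {
        # Assigned but missing: the automation system should deploy these.
        "install": sorted(assigned - installed),
        # Present, unassigned, and undocumented: flag for review or removal.
        "unexpected": sorted(installed - assigned - set(documented_exceptions)),
    }


# Illustrative data only.
print(plan_changes(
    assigned={"Office suite", "PDF reader"},
    installed={"PDF reader", "Browser toolbar", "Plotter driver"},
    documented_exceptions={"Plotter driver"},
))
```

Anything that lands in "unexpected" is, by definition, software that would be lost or left unpatched in a redeployment.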
User Accounts and Related Security
Having accurate records of which individuals took what actions is the basis upon which IT auditing is built; security posture and compliance cannot be sanely evaluated without all user accounts being justified both legally/politically and technically. Therefore, no person may be issued any user account, privileged or not, without the appropriate IT personnel first obtaining evidence of the relevant person’s identity.
Personnel turnover is high; in most industries that’s a generally accepted fact of modern life. There are lots of “moving parts” when on-boarding and off-boarding individuals; many of those parts involve security decisions that are justifiable at the time but may not be justifiable in the future (short or long term). As a result, there must be frequent auditing of all human-associated accounts (perhaps weekly auditing of all changes, and quarterly auditing of all accounts) to ensure that their status is appropriate (disabled vs enabled), and that their group memberships (or other security-impacting attributes) are still appropriate.
Whenever possible, automated measures should be employed causing HR personnel records to be continuously validated against IT records. Exceptions where HR indicates that an individual is no longer affiliated with the organization should be treated as a security breach; disabling accounts and immediately removing access to resources should be automated so that IT personnel need not manually intervene in order for someone’s termination to be successful from a data security perspective. Automating portions of on-boarding, such as automating the creation of user accounts, mailboxes, etc., should also be employed whenever possible; this improves both accuracy and efficiency from an identity-management perspective within IT organizations.
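A minimal sketch of that HR-to-IT reconciliation follows. It assumes HR can export a list of currently affiliated employee IDs and that IT can export its accounts as a simple mapping; the function name and the record layout are hypothetical, and a real implementation would go on to disable the accounts in the directory rather than just list them.

```python
def accounts_to_disable(hr_active_ids, it_accounts):
    """Return enabled accounts whose owner HR no longer lists as affiliated.

    hr_active_ids: iterable of employee IDs HR considers current
    it_accounts: mapping of username -> {"employee_id": str, "enabled": bool}
    """
    active = set(hr_active_ids)
    return sorted(
        username
        for username, account in it_accounts.items()
        if account["enabled"] and account["employee_id"] not in active
    )


# Illustrative data only.
hr_active = ["E100", "E101"]
directory = {
    "asmith": {"employee_id": "E100", "enabled": True},
    "bjones": {"employee_id": "E102", "enabled": True},   # no longer in HR records
    "cdoe": {"employee_id": "E103", "enabled": False},    # already disabled
}
print(accounts_to_disable(hr_active, directory))  # ['bjones']
```

Run on a schedule, every name this returns is an exception that should be handled as the security event described above.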
Every account issued must be secured as well as is reasonably possible for the account holder’s role in the organization. General baselines for this:
- Every user account must have a password.
- The password must meet technically enforced complexity requirements.
- Passwords must expire, unless complete NIST password complexity and management recommendations are followed.
- Personnel must be reprimanded (and terminated for repeated offense) for sharing account passwords.
- Multi-factor authentication must be employed for all privileged accounts.
- Multi-factor authentication should be considered for high-value unprivileged accounts (such as those belonging to CEO, CFO, payroll/financial staff, and anyone else who has the ability to alter sensitive operational data).
Password fatigue happens when users have too many passwords to remember. The result is that some users write down their passwords – leading to the situation where others (including malicious actors such as disgruntled employees) could find them. Users should not be permitted to write down passwords “for safekeeping”. Single Sign On (SSO) solutions reduce password fatigue by securely allowing the user to use a single password (generally their Active Directory user account password) to authenticate to numerous resources.
Disaster Recovery and/or High Availability
The organization must develop a very detailed plan for continuity of operations, including technical measures for disaster recovery. For organizations having multiple sites, I very strongly recommend having at least a second datacenter that operates in either an active-passive or active-active state with your primary datacenter. Disaster recovery and/or high availability planning must include documentation as to what resources the organization wants to be highly available or available in event of a disaster, and what specific steps are being taken proactively (and any steps that would need to be taken during or after a disaster) in order to ensure that those goals are met.
A limited test of the disaster recovery system must be conducted periodically (at least once per year) to ensure that there are no unexpected deficiencies in the recovery documentation. If there are, the deficiencies must be corrected, and testing conducted again, until it “cleanly” succeeds. If high availability is desired between the data centers, planned load interchanges should be conducted frequently (perhaps a few times per year) to ensure that the plan in place is working as designed.
If a disaster recovery or high availability test fails and it is determined that additional resources are needed in order to cause the test to succeed, the failure must be considered a continuity of operations emergency and those resources must be allocated urgently.
As an aside, because I’ve multiple times witnessed this being missed: ensure that you’ve planned for unified communications (phones, voicemail, faxes, email, etc) to remain available in event of disaster or other complete loss of your primary datacenter. VOIP solutions such as SIP trunking from telephone carriers can make this much less difficult than it was just a few years ago. If your organization depends heavily on the ability to fax (such as healthcare, attorneys, government, etc), then ensure that you’ve implemented a solution that permits faxing to continue in event that the primary fax destination site suffers a complete loss.
Data Security and Logical Intrusion Prevention
Data security is critical to the success of any organization, is almost universally overlooked, hated, and/or ignored by employees, and is most likely to break down and cause business emergencies during already-ongoing disasters.
There is no possible way I could touch on even the highest-level concepts surrounding data security in this article; readers are encouraged to contract with data/cyber security consultants to evaluate their overall security posture, and to obtain specific recommendations.
There are a few points that I nonetheless consider significant enough to mention here:
- Endpoint security/defense is absolutely mandatory, and must be centrally monitored, with support of notification for outbreaks (or faults, or other conditions). “Endpoint defense” is a set of technologies including what many people know as “anti-virus software”.
- Data should be encrypted whenever possible, both on-the-wire and at-rest, to prevent data loss or unauthorized tampering.
- All mobile devices (smartphones, laptops, tablets, etc) must encrypt all data at-rest, and keys must not be made available without successful authentication from an authorized user.
- All end-user devices, mobile or not, must be remotely manageable and destructible from tools under the control of tier 1 privileged personnel. This provides privileged personnel the ability to remotely troubleshoot problems with devices, and also to remotely wipe or destroy devices that have fallen outside of mandatory compliance policies.
- Removable media must be carefully managed. Consider prohibiting it altogether, both by policy and technical enforcement, if your environment does not specifically warrant users having removable media access.
- E-mail vulnerability assessment and mitigation is mandatory for all organizations, but has become so complex that I consider it futile to attempt to manage at an organization level (for the vast majority of organizations). Instead, outsource this work to a company who is well-known for their excellence at this type of work. Malware/vulnerability/spam attacks can change nature within minutes; very large providers are in a much better position to be able to detect this than are most organizations. Organizations that attempt to handle this problem in-house tend to have a poor end-user experience (with too many or too few messages being mitigated or blocked), and IT personnel often have to spend lengthy amounts of time reviewing the vulnerability assessment software’s activity.
There has recently been a surge in products that make it possible to quickly and easily provide data restoration in event of tampering or corruption (such as a ransomware attack), when the storage or product itself has not also been tampered with or destroyed. Windows file servers support “previous versions” or shadow copies, which are a very fast and easy solution to enable, but it’s not very easy to use those tools to recover from a widespread attack. NetApp and other vendors manufacture network attached storage solutions which frequently take snapshots of data in-place; disasters not directly harming the NetApp itself can be quickly and easily recovered from by restoring the last-known-good snapshot. This is a gross oversimplification; additional research is necessary in order to provide implementation recommendations for each organization. I strongly recommend that everyone evaluate some solution similar to this, however, as the benefit would be extraordinary in event that functionality is unexpectedly required.
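Choosing the "last-known-good" snapshot is conceptually simple: it is the newest snapshot taken before tampering began. The sketch below illustrates only that selection logic; it does not use any storage vendor's API, the function name is hypothetical, and it assumes you already know (or can estimate) when the first tampering occurred.

```python
from datetime import datetime


def last_known_good(snapshot_times, first_tamper_time):
    """Return the newest snapshot taken strictly before tampering began.

    Returns None when every available snapshot postdates the tampering,
    which means recovery must come from some other source.
    """
    candidates = [t for t in snapshot_times if t < first_tamper_time]
    return max(candidates, default=None)


# Illustrative hourly snapshots; dates are arbitrary.
snapshots = [datetime(2020, 1, 1, hour) for hour in (8, 9, 10, 11)]
tamper_started = datetime(2020, 1, 1, 10, 20)
print(last_known_good(snapshots, tamper_started))  # 2020-01-01 10:00:00
```

The `None` case is the one that matters for planning: if snapshot retention is shorter than the time it takes to notice an attack, there is no good snapshot left to restore.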
Data security must also be considered at a physical/hardware level. For example, copiers and printers often write documents to storage prior to printing them; a malicious actor gaining access to that storage could use that information nefariously. Persistent storage devices must be removed from servers, computers, printers, and any other devices that leave the control of the organization; such devices must be destroyed if not repurposed within the organization. If they are repurposed within the organization, care must be taken to ensure that they are not moved to an area of higher risk of compromise. For example, one wouldn’t want to move a copier with its existing storage from the Human Resources Director’s office to an area where the public could access it.
Some large companies, including Microsoft and NetApp, assert that backups are not necessary in certain cases.
For example, Microsoft’s Exchange 2016 Preferred Architecture calls for lag copy members within database availability groups. This means that deletions and corruptions within other database availability group members can be remediated by playing down the lag copy and using it as a source for recovery. When retention policies are properly configured, particularly with archiving enabled, this is a sound recommendation and it’s unlikely that permanent large-scale data loss would occur.
NetApp, as another example, often says that backups are unnecessary when using their storage solutions (in a particular redundant and geographically diverse configuration) to take snapshots on a frequent basis, because those snapshots can be applied or restored from, reversing whatever kinds of data loss or corruption may have occurred. Given that the implementation details strictly follow NetApp’s published recommendations, this is also generally sound advice, and it’s unlikely that permanent large-scale data loss would occur.
I, however, take exception to those recommendations being uniformly applied to customers and environments where IT security is not the highest priority, or the knowledge level of the least skilled of their privileged staff is insufficient to fully appreciate the ramifications of literally every administrative action they take. For example, a malicious actor holding tier 0 privileges would have the ability to completely destroy the Exchange lag copies as well as the NetApp snapshots, which would effectively leave the environment with no backup copies of their data from which to restore (potentially resulting in permanent data loss of all of those databases or systems).
Most people reading this article can understand how this might happen if a malicious actor is involved, but then say that your privileged personnel are extremely skilled and trustworthy, and you’re willing to place the immediate and long-term future of your organization in their hands. That tier 0 personnel effectively hold “the keys to the kingdom” is always the case and cannot be mitigated; it’s expected that every organization explicitly and implicitly trusts their tier 0 personnel to always make the best decision for the business. That makes the malicious actor argument hard to swallow. Let me drop this bombshell on you: what if an account with tier 0 privileges became compromised, through no fault of its owner? The above malicious actor argument suddenly becomes more believable. Protecting tier 0 (and almost as importantly, tier 1) resources must be your personnel’s highest priority, with no exceptions. An organization’s failure to enable privileged personnel to make those decisions can ultimately lead to the organization’s demise, through no fault of the privileged personnel.
Facility Access Control / Intrusion Prevention and Detection
If a malicious (or perhaps even negligent) actor were to gain direct physical access to select computing resources – storage devices in particular – any number of attack vectors could be viable:
- The attacker could read or copy data from storage devices such as hard drives or backup tapes. The organization would experience a data breach.
- The attacker could introduce hardware into servers that could watch for and copy encryption keys, so as to make data theft possible or easier at the attacker’s later convenience.
- The attacker could introduce malware into the environment, causing actions known or unknown to the organization for an extended period of time.
- The attacker could interrupt disk access, knowingly or unknowingly, rendering the content of the disks useless for both the attacker and the organization.
- The attacker could find a way to introduce administrative credentials into some system, potentially granting him privileges to access or control anything at or below that resource protection tier.
In order to prevent a malicious actor from gaining access to physical storage devices directly, physical security measures must be employed, such as cages, cabinets, reinforced doors, and hardened walls. Multi-factor authentication (such as an access badge and a PIN) must be employed to gain entry to areas such as the datacenter and any other area containing a tier 0 resource. Activity logs must be retained for a predetermined period of time (at least a few years), indicating which individuals accessed each area at the given times. Access card piggybacking must be strictly prohibited.
Intrusion monitoring must be employed in areas containing tier 0 resources and should be employed in areas containing tier 1 resources.
Be cautious with facility access control
When implementing facility access control and intrusion prevention systems, you must be careful to ensure that you consider the life safety of all humans who could be at any location at any time. Your defense generally cannot also be offense; that is to say, you must always ensure that emergency egress is possible. This means not only must you not attempt to automatically trap any would-be malicious actors in place, but you must also plan for what happens in event of catastrophic failure of the facility access control system – or any part thereof.
For example, one common mechanism for controlling door access is to install an electronic door strike in the wall and replace or re-key the locking mechanism in the door handle. An access control device (such as a card reader) would permit ingress, allowing the door strike to operate long enough for the individual to pass through the door, and would then re-secure the strike. Egress could normally be controlled by a card reader, but if the card reader (or any other required component) fails, humans must be able to exit in case of emergency. In this scenario, the door handle could be allowed to permit egress in event of emergency.
Another type of mechanism common for controlling door access is called a mag lock – it’s effectively a large and extremely powerful electrically controlled magnet, mounted on the door frame (or in the wall, or something immediately adjacent to the door). A metal plate then is mounted to the door. When ingress is approved, the control system removes power from the magnet, allowing the door to move freely. Generally the individual is given a few seconds to pass through the door, and then the magnet re-energizes; once it comes in contact with the metal plate again, it will hold the door shut until de-energized. The door handle wouldn’t work for emergency egress, however, because the magnet is much stronger than any person(s) attempting to push or pull on the door. Generally a button needs to be mounted within a few feet of the door, clearly marked along the lines of “Press to Exit” or “Emergency Exit”. Pressing that button in an emergency would cause power to be forcibly removed from the magnet, regardless of what the facility access control system asserts should be its status. If the button were not present, or the button additionally malfunctioned in an emergency, it’s possible that people could be trapped in the room. Even if the possibility exists that they may be an attacker, it’s unwise to directly (or indirectly via negligence) cause someone to perish in a fire, structural collapse, or other such emergency.
User experience also plays an understated role in the effective security posture of the organization; inconsistent experiences tend to lead to user attempts at circumventing policy for various reasons (such as making work faster or easier). While security must be of utmost priority, user experience does not need to be materially burdensome to end-users; training and consistent experiences lead to compliance. For example, if all of the end-users in any given department are required to follow the same policies, and are given identical tools to do so, and no exceptions are granted for political (rather than technical) reasons, end-users are able to help one another with questions about how tasks should be completed.
This experience includes things that have relatively little importance or security value to IT personnel – such as user desktops being visually consistent, wireless networks being accessed identically between buildings or campuses, network performance being similar regardless of location, how one launches any given application working identically regardless of where the user is, etc. I strongly recommend that an organizational emphasis be placed upon user accuracy, efficiency, and consistent user experiences – it will lessen one-off headaches and significantly increase security compliance.
Public/Guest Resource Access
Guests (used herein to mean individuals who are not employed by or otherwise formally affiliated with the organization) often require some sort of limited access to computing resources. In local government, that usually comes in the form of meeting attendees wanting to connect their devices to “public Wi-Fi” and kiosk-like computers that are restricted to specific use cases. As mentioned above in “IP Transport”, no guest may ever be permitted direct access to any internal resources or be connected to a network where anyone else has been permitted direct access to any internal resources.
It is ill-advised to use standard workstations in locations where guests are granted access, without hardening their physical security vulnerabilities. That is to say, measures should be employed to prevent guests from:
- attaching removable media of any type to any computer
- accessing USB ports on computers or peripherals
- operating the power button
- viewing or manipulating BIOS/UEFI settings
Additionally, logical security hardening must be employed. For example, guests must not be permitted to:
- browse any filesystems
- control any settings, including trivial user preferences
- use task manager
- use the “run” command
- access the command prompt
- execute any applications other than those which have specifically been whitelisted for guest use
- gain access to any internal resources other than those explicitly made available to them
Group policy or a functional equivalent must be used to enforce these settings. In general, it is recommended that everyone follow the principle that the guest must not be allowed to do anything at all beyond acting as a typical end-user in the explicitly defined list of applications. Also consider employing mandatory profiles; this causes the user profile to be “reset” to the intended configuration during every logon, as a defense-in-depth measure in case a malicious actor has managed to change any settings pertaining to the user’s operating environment.
You must also ensure that all devices provided for guest use are appropriately updated - firmware, operating system, software, etc - in a manner substantially similar to all other end-user devices. Being dedicated for guest use does not absolve equipment from the ongoing necessity of being kept up-to-date.
Direct access may be provided between guest devices and specific types of equipment – output devices such as projectors, other presentation equipment, and printers that have been hardened to prevent data loss in case of breach. The same equipment may also be accessed by internal resources. In this configuration, the equipment must reside on a VLAN and subnet that is associated with neither internal resources nor guest resources; this is sometimes referred to as a demilitarized zone (DMZ). Firewalls must be placed at all points of interconnect between the DMZ and any other network.
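The zone relationships described above can be sketched as a default-deny policy check using Python's standard `ipaddress` module. This is an illustration of the policy, not firewall configuration: the subnets, zone names, and function names are all assumptions chosen for the example.

```python
import ipaddress

# Hypothetical subnets; substitute your own addressing plan.
ZONES = {
    "internal": ipaddress.ip_network("10.10.0.0/16"),
    "guest": ipaddress.ip_network("172.16.50.0/24"),
    "dmz": ipaddress.ip_network("192.168.200.0/24"),
}

# Only these (source zone, destination zone) pairs may cross a firewall.
# Guests never reach internal resources; both sides may reach DMZ equipment.
ALLOWED_FLOWS = {("internal", "dmz"), ("guest", "dmz")}


def zone_of(address):
    """Return the name of the zone containing `address`, or None if unknown."""
    ip = ipaddress.ip_address(address)
    for name, network in ZONES.items():
        if ip in network:
            return name
    return None


def is_allowed(source, destination):
    """Default-deny: permit traffic only for explicitly listed zone pairs."""
    return (zone_of(source), zone_of(destination)) in ALLOWED_FLOWS


print(is_allowed("172.16.50.20", "192.168.200.5"))  # True: guest to projector
print(is_allowed("172.16.50.20", "10.10.4.7"))      # False: guest to internal
```

Note that the DMZ is never listed as a source, so a compromised projector or printer cannot initiate connections into either network; traffic from addresses outside every known zone is denied as well.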
While not security-related, consider the licensing (and therefore legal and financial) obligation posed by allowing individuals to access any resource (including desk phones).
IT security is of utmost importance in a production environment; indeed this document is primarily designed to spur security-centered thought and application in production. There’s a significant reason that I’m calling out production here: there also needs to be a non-production environment, generally referred to as “lab” or “test”. The lab environment is where IT staff experiment with technology, learn new things, evaluate new technology, test it for production readiness, etc. This environment must be completely separate from production – hardware and all. An extremely tightly restricted (by firewall) interconnect could be made between production and lab solely for the benefit of automatically building lab environments based on production attributes, but traffic must be generally prohibited between them (including DNS, DHCP, and internet-bound traffic).
It’s simply not possible to “set it and forget it” when it comes to IT security. An organization’s operational success or demise is strongly associated with the organization’s emphasis on IT security. To that end, compliance and vulnerability assessments (including penetration testing) must be performed by contracted companies (having no other/internal relationship to the organization), both from external and internal angles of attack, in addition to the IT/Operations team’s own vulnerability assessments.
The most infuriating part of the Licking County crisis, to me, is that any competent systems administrator should be aware of the nature of the modern “beast” – that all systems everywhere should be considered “under attack” at all times – and should be employing aggressive countermeasures to prevent disaster. If immediate management fails to provide adequate resources such that the IT/Operations team is unable to suitably protect the environment, upper management should be involved. Yes, even if that means stepping on toes, infuriating those around you, or ignoring boundaries in organizational structure and reporting hierarchy that stand in your way. This is government; there are quite literally lives on the line. Failure to protect your environment (as a systems administrator) or failure to provide adequate resources (as upper management) should be prosecuted as criminally negligent.
Last fall, the FBI sent out a bulletin to branches of US state and local government that pertain to elections, warning them to carefully watch for evidence of hacking. They provided additional information regarding what, specifically, local governments should be watching for. In my opinion, they did a disservice to local government by restricting the scope of their warning in the manner they did; if a malicious actor specifically wanted to interfere in any state or local government operation, it would have been trivially easy for them to obtain a copy of that bulletin and change their methods ever so slightly to avoid detection. All forms of government, at all levels, must be prepared to defend themselves against Advanced Persistent Threats (APTs) proactively. Discovery after the fact means that the damage has already been done, and it would definitely be difficult and may very well be impossible for the government to determine what has been compromised, and/or to what extent.
Licking County is, unfortunately, not the only local government in Ohio that has recently experienced a ransomware attack. Some agencies have eventually paid the ransom in order to recover their data, presumably because their backup solutions were inadequate. There is, however, no guarantee that data access would be restored (or, if so, that it wouldn’t have been tampered with) even if the ransom is paid. Fairfield County has had ransomware problems in the past, but the scope of the attack each time was limited to a specific workstation and list of file shares, making backup restoration fast and (relatively) easy.
More disturbing than ransomware? Think of the viruses that could already be present, quietly reading files here and there, leaking valuable information outside of the network – or worse yet, quietly changing data. Would you know? In my experience: most organizations, government or otherwise, would not.