AMTSO Publishes Latest False Positive Testing Guidelines
Posted by: cheesejust  Posted: 2010-05-31 16:28:00
AMTSO (Anti-Malware Testing Standards Organization), founded in May 2008, is an international non-profit organization focused on the global need for improvement in the objectivity, quality and relevance of anti-malware testing methodologies. Its members include academics, reviewers, media, testing bodies and security vendors; AV-Comparatives, AV-TEST.org, ICSA Labs, Virus Bulletin and West Coast Labs are all members.
Because false positives have a significant real-world impact on users and undermine confidence in anti-virus products, and because products with strong proactive defenses are especially prone to them, false positive testing is receiving growing attention in the industry; AMTSO made it a key topic at its May meeting. The guidelines cover the definition of a false positive, how to grade its criticality and urgency, prevalence, the environment in which the false positive occurs, whether the affected files can be recovered, vendor response time, and how to perform FP testing in dynamic, static and whole-product tests.
Guidelines for False Positive Testing
Though still often overlooked, false positive testing is no less important than true positive testing. Any false positive may hurt the user’s confidence in the product. On occasion, the consequences of a false positive detection can far outweigh those of a false negative. In many cases, the better a product’s proactive protection rate, the more likely it becomes that it will cause additional false positives.
What is a false positive?
A false positive is a detection (or notification) of a file or resource which has no malicious payload.
Defining what constitutes a malicious payload is not always clear-cut.
There are some gray areas, such as Potentially Unwanted Applications (PUAs), also known as Riskware. There can be fully legitimate reasons to detect certain software as potentially unwanted, since it can be used with malicious intent as well. Some vendors, for instance, opt to detect key generators or cracks that bypass software piracy checks. Ideally, the detection name should make clear that such a program has been detected on purpose.
How to determine the importance of an FP?
There are a number of different criteria that need to be considered:
1.1 Criticality
It is very important to determine the criticality of a false positive: not all false positives have the same impact on the user experience.
Consideration should be given to the following levels when doing dynamic testing; the tester may decide the rating of each of these cases:
- System critical: the core functionality of the OS is not normally usable, or the AV product itself becomes unusable
- Core OS, non-critical: files such as notepad
- Application critical: an installed application is not functional (or cannot be uninstalled)
- Data file/non-PE critical: documents such as Word, Excel and PDF files, but also SWF
- Application non-critical: TODO
- Browsing critical: sites on the same IP can no longer be visited; TODO
Testers may group application non-critical and application critical FPs together for resource purposes, as differentiating them can be very time-consuming.
Having a false positive on a system-critical file is much worse than one on a regular file or resource.
Ideally, there should be a logarithmic scale of sorts to rate FPs based on criticality.
To give an analogy: falsing on highly critical system files should be viewed in a similar way to missing files from the WildCore.
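As an illustration only, such a logarithmic weighting could be sketched in Python as follows; the category ranks and the base of 10 are assumptions chosen for the example, not values prescribed by these guidelines:

    # Hypothetical criticality ranks; higher rank = more severe FP.
    # Both the ranks and the base are illustrative assumptions.
    CRITICALITY_RANK = {
        "data_file_non_pe": 1,
        "application_non_critical": 2,
        "application_critical": 3,
        "core_os_non_critical": 4,
        "system_critical": 5,
    }

    def fp_weight(level: str, base: float = 10.0) -> float:
        """Weight an FP on a logarithmic scale: each step up in
        criticality multiplies the penalty by `base`."""
        return base ** (CRITICALITY_RANK[level] - 1)

    # A single system-critical FP outweighs thousands of trivial ones.
    print(fp_weight("system_critical"))   # 10000.0
    print(fp_weight("data_file_non_pe"))  # 1.0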
Additional consideration should be given to the following when doing this testing:
- Check whether the detection itself may actually be valid. This specifically applies to Riskware/PUAs such as mIRC.
- When dealing with AdWare/Riskware detections, make sure that the detected files are not misclassified.
- The vendor should be contacted to make sure that the detection was not added intentionally.
- TODO: Riskware (misclassification, context)
1.2 Prevalence of an object
Next to criticality, the prevalence of an object is an important measure of how significant a false positive is.
The following should be taken into consideration:
- Asking vendors for telemetry data on how prevalent a certain file or URL is
- Resources such as popular download portals
- Origin of distribution: if it comes with the OS, it must by definition be prevalent
Telemetry data can contain a large number of things.
Ideally, the telemetry shared with a tester includes the following (a sketch of such a record follows the list):
- Freshness of FP
- Prevalence of the file
- Breakdown of prevalence per region
- Response time
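As a Python sketch of what such a telemetry record might look like; all field names here are illustrative assumptions rather than any vendor’s actual schema:

    from __future__ import annotations
    from dataclasses import dataclass, field
    from datetime import datetime, timedelta

    @dataclass
    class FPTelemetry:
        """One false-positive telemetry record (illustrative only)."""
        sha256: str                                # identifies the falsed file
        first_detected: datetime                   # freshness of the FP
        global_prevalence: int                     # affected machines worldwide
        prevalence_by_region: dict[str, int] = field(default_factory=dict)
        fix_published: datetime | None = None      # when the vendor shipped a fix

        @property
        def response_time(self) -> timedelta | None:
            """Vendor response time, if a fix has been published."""
            if self.fix_published is None:
                return None
            return self.fix_published - self.first_detected

The per-region breakdown also supports the grouping by country of origin described next.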
Potentially, the tester can group FPs by country of origin. For instance, some products may false mostly on programs created in China, which does not really affect users in Europe.
1.3 Environment
Testers should take into consideration the intended purpose of the products they are testing. For instance, perimeter defense solutions may have much looser heuristics than desktop solutions.
The impact of an FP there is generally also much less severe.
- Policy detections vs. signature detections
- FPs should be rated differently on a perimeter defense solution: detecting svchost in email is not nearly as bad as detecting it on the desktop.
- TODO
1.4 Correlation of the product (data) (TODO: decent title)
- Determine the settings of the product. FPs that occur only at the highest detection settings are likely to be much less prevalent and should therefore be rated as less severe.
- FPs often come as the cost of higher detection rates. Correlating TP and FP ratios gives a more accurate reflection of efficacy (see the sketch after this list).
- Also take into consideration potential overlap between products; multiple products may be flagging the same file.
- The tester should take into consideration the version of the program. If an anti-malware product falses on v1.7 of a program while v1.9 is the latest one, this should be reported.
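As a purely illustrative Python sketch of such a correlation; the penalty factor and the combined-score formula are assumptions for the example, not something these guidelines mandate:

    def efficacy(tp: int, total_malicious: int,
                 weighted_fp: float, fp_penalty: float = 0.01) -> float:
        """Combine the detection rate with a weighted FP penalty.
        weighted_fp would come from the criticality weighting above;
        the penalty factor is an arbitrary assumption for this sketch."""
        detection_rate = tp / total_malicious
        return detection_rate - fp_penalty * weighted_fp

    # Product A: higher detection rate, but critical FPs drag it down.
    print(efficacy(tp=990, total_malicious=1000, weighted_fp=20.0))  # 0.79
    # Product B: slightly lower detection rate, no FPs at all.
    print(efficacy(tp=970, total_malicious=1000, weighted_fp=0.0))   # 0.97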
1.5 Recoverability: does the user need to take action to recover?
- Off-line recovery
- Recovery from product quarantine/backup
  - Including centralized admin recovery
- Web site/download
- Permanent destruction
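One way a tester might encode these recovery levels as an ordered scale, sketched in Python; the ordering from least to most severe user impact is itself an assumption:

    from enum import IntEnum

    class Recoverability(IntEnum):
        """Recovery levels ordered by user impact (illustrative ordering)."""
        QUARANTINE_RESTORE = 1   # recovery from product quarantine/backup
        CENTRAL_ADMIN = 2        # centralized admin recovery
        REDOWNLOAD = 3           # re-fetch from a web site/download
        OFFLINE_RECOVERY = 4     # requires off-line recovery
        PERMANENT_LOSS = 5       # permanent destruction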
1.6 Response time
A tester may also want to factor in the amount of time it took a vendor to fix a particular false positive.
Special consideration for web FP testing
When a particular host, domain or URL is detected but is not malicious at the time of checking, it does not necessarily mean that there is a false positive. Some web threats perform a GeoIP check and may only deliver malware to certain parts of the world. In some cases the malware may only be pushed to visitors at certain times of the day.
In other cases, vendors may decide to block an entire host or domain once they have found a particular malicious URL on it. Such a detection is then still valid. In these cases it is best to contact the vendor.
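A minimal Python sketch of re-checking a suspect URL at different times of day before logging it as a web FP; checking from multiple geographic vantage points would require infrastructure not shown here, and the interval is an arbitrary assumption:

    import time
    import urllib.request

    def recheck_url(url: str, attempts: int = 4, interval_s: int = 6 * 3600) -> list:
        """Fetch a URL several times spread across the day; a site that
        only serves malware at certain hours can look clean on one visit.
        Records status codes or errors; a real test would also compare
        the payload served on each visit."""
        results = []
        for i in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    results.append(resp.status)
            except Exception as exc:
                results.append(repr(exc))
            if i < attempts - 1:
                time.sleep(interval_s)
        return results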
How to perform FP testing?
Ideally, FP testing is performed in a similar fashion to dynamic testing: a stream of clean files should be used to more accurately measure FP efficacy.
The following objects and resources should be considered for testing:
- Files on the system
- Documents
- Domains, URLs, scripts
There are several ways to conduct FP testing, similar to testing TP performance.
2.1 Static testing
Also see the AMTSO static testing paper.
2.2 Dynamic testing
FP testing can be done in combination with dynamic TP testing; however, this should be explicitly noted. Keep in mind that performing FP testing in combination may lead to different results than performing FP testing on its own.
For notifications vs. detections the same rules should be maintained between TP and FP testing.
Also see the AMTSO dynamic testing paper.
2.3 Whole product testing
For notifications vs. detections the same rules should be maintained between TP and FP testing.
Also see the AMTSO whole product testing paper.
Artificial test scenarios
Similar to the creation of new malicious programs for testing, the creation of new programs for false positive testing has been considered. However, such artificial scenarios should not be employed.
The test should reflect real-life scenarios. For further explanation see the AMTSO malware creation document.
One example of such an artificial scenario is renamed files, for example \prog files\personal av\pav.exe.
Other considerations for false positive testing
In general, testers should avoid scanning competing anti-malware products to see if FPs occur.
One of the main reasons for this is that a scanner may detect the signature databases of a competitor. Such detections are not really false positives.
Corrupted, (incorrectly) disinfected or otherwise modified files
Testers should refrain from having such files in their FP test set. Dynamic or whole-product testing may, however, encounter false positives on incomplete files, for instance when the browser is downloading a file. In such cases the tester should treat the detection as an FP.
PUPs/Riskware
Examples: ServU, mIRC, psexec.
Classification matters: a detection named not-a-virus:Riskware.ServU.501 is valid, whereas Trojan.agent.blabla on the same clean file is a false positive.
There are special cases where a file may be completely clean but is installed by non-clean software (e.g. v1 was malicious and was updated to a non-malicious v2). A sketch of separating such intentional flags from genuine FPs follows.
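A Python sketch of triaging detections by name prefix; the prefix list is an assumption modelled on common vendor naming conventions, not a standard:

    # Prefixes that typically signal an intentional Riskware/PUA flag;
    # this list is an illustrative assumption, not a standard.
    RISKWARE_PREFIXES = ("not-a-virus:", "Riskware.", "PUA.", "PUP.")

    def is_intentional_flag(detection_name: str) -> bool:
        """True if the name indicates a deliberate riskware detection."""
        return detection_name.startswith(RISKWARE_PREFIXES)

    print(is_intentional_flag("not-a-virus:Riskware.ServU.501"))  # True: valid detection
    print(is_intentional_flag("Trojan.agent.blabla"))             # False: count as an FP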
Additionally, testers can approach vendors with a collection of clean objects and have the vendors independently judge the importance of those objects, using the parameters defined in this paper. The amount of time needed to classify these FPs should be logged in order to get a good estimate of the time actually required.
People who worked on this paper: Roel, Mark, Dmitry, Gabor, Denis, Jimmy, Maksym, Mika, Luis, Phillip, Andy, Andreas Clementi, David Perry, Kurt Natvig, Kurt Baumgartner.
Please mail Roel if you worked on this document but are not listed.