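Question: say I delete the extension from a docx document so that only the file name is left. Is there a tool or method that can work out what format the file originally was?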
1
msg7086 2016-04-22 13:58:45 +08:00
Linux has the file tool. http://linux.die.net/man/1/file
|
2
gdtv 2016-04-22 13:59:02 +08:00 via Android 1
I think the first few bytes of a file record its type. Just a guess.
|
3
msg7086 2016-04-22 14:01:23 +08:00 1
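@gdtv Yes, that is the magic number (file signature). Some formats, though, are detected by probing their actual content and structure.
For example, a docx is a zip file, but reporting it as zip is not very useful. So a tool may go on to inspect the structure of the XML files it contains in order to work out the specific file type. |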
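The magic-number check mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `sniff_bytes` and the short labels are made up for this example, and the table covers just a handful of well-known signatures, far fewer than a real tool such as file knows about.

```python
# A tiny, hypothetical signature table: (leading bytes, short label).
SIGNATURES = [
    (b"PK\x03\x04", "zip"),                           # also docx/xlsx/pptx, jar, odt, ...
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),    # legacy doc/xls/ppt container
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def sniff_bytes(head):
    """Return a label for the first matching signature, or None if nothing matches."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


def sniff_file(path):
    """Read the first 16 bytes of a file and match them against the table."""
    with open(path, "rb") as f:
        return sniff_bytes(f.read(16))
```

Note that a docx with its extension removed only comes back as "zip" here, which is exactly the limitation described in the reply: the signature identifies the container, not the document type inside it.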
4
imn1 2016-04-22 14:08:35 +08:00
50 4B 03 04 14 00 06 00
DOCX, PPTX, XLSX: Microsoft Office Open XML Format (OOXML) Document
NOTE: There is no subheader for MS OOXML files as there is with DOC, PPT, and XLS files. To better understand the format of these files, rename any OOXML file to have a .ZIP extension and then unZIP the file; look at the resultant file named [Content_Types].xml to see the content types. In particular, look for the <Override PartName= tag, where you will find word, ppt, or xl, respectively.
Trailer: Look for 50 4B 05 06 (PK..) followed by 18 additional bytes at the end of the file. |
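The [Content_Types].xml check quoted above can be done without renaming or unpacking anything by hand. Below is a minimal sketch using Python's standard zipfile module; the function name `detect_ooxml` is hypothetical, and the plain substring test on PartName is a simplifying assumption rather than a full parse of the XML.

```python
import zipfile


def detect_ooxml(path_or_file):
    """Guess 'docx', 'xlsx' or 'pptx' from [Content_Types].xml, else return None."""
    if not zipfile.is_zipfile(path_or_file):
        return None  # not a zip container at all
    with zipfile.ZipFile(path_or_file) as zf:
        try:
            raw = zf.read("[Content_Types].xml")
        except KeyError:
            return None  # an ordinary zip without the OOXML manifest
    content_types = raw.decode("utf-8", errors="replace")
    # Look for <Override PartName="/word/...">, "/xl/..." or "/ppt/...".
    for prefix, kind in (("/word/", "docx"), ("/xl/", "xlsx"), ("/ppt/", "pptx")):
        if 'PartName="' + prefix in content_types:
            return kind
    return None
```

Calling `detect_ooxml("report")` on an extension-less file (the name is only an example) would return "docx" for a Word document. A stricter version could inspect the ContentType attribute of the main document part instead of relying on the folder prefix.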
5
slixurd 2016-04-22 14:10:05 +08:00
But it does not always identify the type correctly. For example, file on OS X:
Two files:
**.doc: CDF V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 936, Title: ??????????ѧ?ڶ???ڶ??????????ϻ?, Author: ????ľ??, Template: Normal.dot, Last Saved By: ???û?, Revision Number: 15, Name of Creating Application: Microsoft Office Word, Total Editing Time: 01:43:00, Create Time/Date: Sat Aug 27 16:06:00 2011, Last Saved Time/Date: Thu Sep 1 02:24:00 2011, Number of Pages: 29, Number of Words: 1789, Number of Characters: 10203, Security: 0
➜ file **.docx
**.docx: Zip archive data, at least v2.0 to extract |
6
neutrino 2016-04-22 14:10:35 +08:00
Apache POI
https://poi.apache.org/ |
7
shoaly 2016-04-22 14:43:47 +08:00
Search for a program called filetypeid. With the Windows version you just drag the file in and it works out what format it is.
|
8
clino 2016-04-22 14:48:20 +08:00
@slixurd It works for me on Linux:
$ file test.docx
test.docx: Microsoft Word 2007+
$ file test.xlsx
test.xlsx: Microsoft Excel 2007+ |
9
shiny 2016-04-22 14:49:05 +08:00
Open it in iHex and look at the markers in the header; you can infer the format from those.
|
10
Frown 2016-04-22 15:05:18 +08:00
TrIDNet, or drag the file into a text editor.
|