
pdf-extract-kit, paddle, paddleocr: running pdf2markdown.py (results are poor)

GitHub - opendatalab/PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction

https://github.com/opendatalab/PDF-Extract-Kit

 

Problems encountered when running pdf2markdown.py:

Error:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle_infer::Predictor::Predictor(paddle::AnalysisConfig const&)
1   std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig, (paddle::PaddleEngineKind)2>(paddle::AnalysisConfig const&)
2   paddle::AnalysisPredictor::Init(std::shared_ptr<paddle::framework::Scope> const&, std::shared_ptr<paddle::framework::ProgramDesc> const&)
3   paddle::AnalysisPredictor::PrepareProgram(std::shared_ptr<paddle::framework::ProgramDesc> const&)
4   paddle::AnalysisPredictor::OptimizeInferenceProgram()
5   paddle::inference::analysis::Analyzer::RunAnalysis(paddle::inference::analysis::Argument*)
6   paddle::inference::analysis::IrAnalysisPass::RunImpl(paddle::inference::analysis::Argument*)
7   paddle::inference::analysis::IRPassManager::Apply(std::unique_ptr<paddle::framework::ir::Graph, std::default_delete<paddle::framework::ir::Graph> >)
8   paddle::framework::ir::Pass::Apply(paddle::framework::ir::Graph*) const
9   paddle::framework::ir::SelfAttentionFusePass::ApplyImpl(paddle::framework::ir::Graph*) const
10  paddle::framework::ir::GraphPatternDetector::operator()(paddle::framework::ir::Graph*, std::function<void (std::map<paddle::framework::ir::PDNode*, paddle::framework::ir::Node*, paddle::framework::ir::GraphPatternDetector::PDNodeCompare, std::allocator<std::pair<paddle::framework::ir::PDNode* const, paddle::framework::ir::Node*> > > const&, paddle::framework::ir::Graph*)>)

----------------------
Error Message Summary:
----------------------
FatalError: `Illegal instruction` is detected by the operating system.
  [TimeInfo: *** Aborted at 1739780413 (unix time) try "date -d @1739780413" if you are using GNU date ***]
  [SignalInfo: *** SIGILL (@0x7f024e84e31a) received by PID 667042 (TID 0x7f0354c40740) from PID 1317331738 ***]

Fix: install paddlepaddle==2.5.2
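To confirm the pin actually took effect before rerunning pdf2markdown.py, the installed version can be checked from Python. This is only a minimal sketch: the helper names installed_version and pin_status are made up for illustration, and it assumes the CPU distribution named paddlepaddle (a GPU install is published under a different distribution name, so pass that name instead).

```python
from importlib import metadata

PINNED = "2.5.2"  # the version this post reports as working


def installed_version(package="paddlepaddle"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_status(installed, pinned=PINNED):
    """Describe how the installed version relates to the pinned one."""
    if installed is None:
        return "missing"
    return "ok" if installed == pinned else "mismatch"


if __name__ == "__main__":
    found = installed_version()
    print(f"paddlepaddle installed: {found}, status: {pin_status(found)}")
```

A status of "mismatch" means another version is still on the path, for example because a later install step replaced the pinned one.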

Error:

  File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/tools/infer/predict_rec.py", line 628, in __call__
    rec_result = self.postprocess_op(preds)
  File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 121, in __call__
    text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
  File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 83, in decode
    char_list = [
  File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 84, in <listcomp>
    self.character[text_id]
IndexError: list index out of range

Fix: in pdf2markdown.yaml, under ocr: model_config:, set lang: to ch instead of en
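The traceback fails at self.character[text_id], that is, while looking up a predicted class index in the character list. One plausible explanation, which is an assumption and not something confirmed here, is that the recognition model predicts over a larger (Chinese) dictionary than the one selected by lang: en. The toy sketch below reproduces that kind of mismatch; decode, en_chars and ch_chars are hypothetical stand-ins, not PaddleOCR code.

```python
def decode(indices, characters):
    """Map predicted class indices to characters, the way a lookup-based decoder does."""
    return "".join(characters[i] for i in indices)


# Hypothetical dictionaries: a small "en" one and a larger "ch" one.
en_chars = ["blank", "a", "b", "c"]
ch_chars = en_chars + ["学", "校"]

# Pretend the model was trained against the larger dictionary and predicts class 5.
predicted = [1, 5]

print(decode(predicted, ch_chars))  # every index exists in the larger dictionary

try:
    decode(predicted, en_chars)  # index 5 does not exist in the 4-entry list
except IndexError as exc:
    print("IndexError:", exc)
```

If this is indeed the cause, the lang setting has to match the dictionary the loaded recognition model was trained with, which is consistent with ch making the error go away.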

It finally produced results:

[2025/02/17 16:56:20] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process
[2025/02/17 16:56:21] ppocr DEBUG: dt_boxes num : 3, elapsed : 0.10364508628845215
[2025/02/17 16:56:21] ppocr DEBUG: split text box by formula, new dt_boxes num : 7, elapsed : 0.000263214111328125
[2025/02/17 16:56:22] ppocr DEBUG: rec_res num  : 7, elapsed : 1.4980812072753906
[2025/02/17 16:56:22] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process
[2025/02/17 16:56:22] ppocr DEBUG: dt_boxes num : 3, elapsed : 0.10365056991577148
...........
ocr cost: 7.42
Task done, results can be found at outputs/pdf2markdown

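The initial results show that text recognition itself works, but the assembled markdown has problems: the content is not laid out line by line as in the original, and parts of it are duplicated and jumbled. Sample output: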

 4.(3分)下列各句中,没有语病的一句是(4.(3分)下列各句中,没有语病的一句是(一 AC.一所学校能否形成独特、健康的校园文化,学生能否真正接受并融入其中,这对德育C.一所学校能否形成独特、健康的校园文化,学生能否真正接受并融入其中,这对德育活动的有效开展起着至关重要的作用。活动的有效开展起看至关重要的作用。$\textcircled{2}$我国5岁至19岁青少年尝试吸烟率$20\%$,吸烟率近$7\%$。

