Python-Malspider v2.0: Malspider is a web spidering framework that detects characteristics of web compromises.

Latest Release: v2.0

Updated Malspider to support the following:

Yara Integration

  • Scan URLs
  • Scan HTML Body
  • Scan inline JS (external JS will be supported in a future code push)
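
For a rough idea of what this enables, the sketch below scans a URL, an HTML body, or an inline JS string against a compiled rule set using the yara-python module. The rule and helper are illustrative only, not part of Malspider's shipped signatures:

import yara  # pip install yara-python

# Hypothetical rule for illustration only -- not one of Malspider's signatures.
RULE_SOURCE = '''
rule hidden_iframe_marker
{
    strings:
        $iframe = "<iframe" nocase
        $hidden = "visibility:hidden" nocase
    condition:
        $iframe and $hidden
}
'''

rules = yara.compile(source=RULE_SOURCE)

def scan_text(text):
    # Works the same way for a URL string, a full HTML body, or an inline <script> block.
    return [match.rule for match in rules.match(data=text)]

# Example: scan_text('<iframe style="visibility:hidden" src="..."></iframe>')
# returns ['hidden_iframe_marker']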

On-demand analysis

  • Rather than post-processing crawl data, Malspider now analyzes pages as they are crawled.
  • Added an ad-hoc "Scan" button to the GUI so you can scan individual domains on demand.

Malspider

Malspider is a web spidering framework that inspects websites for characteristics of compromise. Malspider has three purposes:

  • Website Integrity Monitoring: monitor your organization's website (or your personal website) for potentially malicious changes.
  • Generate Threat Intelligence: keep an eye on previously compromised sites, currently compromised sites, or sites that may be targeted by various threat actors.
  • Validate Web Compromises: Is this website still compromised?

What can Malspider detect?

Malspider has built-in detection for characteristics of compromise like hidden iframes, reconnaissance frameworks, VBScript injection, email address disclosure, etc. As we find new indicators we will continue to add classifications to this tool, and we hope you will do the same. Malspider will be a much better tool if CIRT teams and security practitioners around the world contribute to the project.
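
To make the hidden iframe case concrete, a simplified version of that kind of check might look like the sketch below. It is illustrative only (the size/style heuristics are not Malspider's exact logic) and uses lxml, which is already among the project's install dependencies:

from lxml import html

def find_hidden_iframes(page_source):
    # Flag iframes that are styled invisible or sized down to nothing --
    # a common trait of injected malicious iframes.
    doc = html.fromstring(page_source)
    hits = []
    for iframe in doc.xpath('//iframe'):
        style = (iframe.get('style') or '').replace(' ', '').lower()
        width = iframe.get('width')
        height = iframe.get('height')
        if ('display:none' in style or 'visibility:hidden' in style
                or width in ('0', '1') or height in ('0', '1')):
            hits.append(html.tostring(iframe))
    return hits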

What's next? How can I help?

As mentioned above, it is very important to get help from other security practitioners. Beyond adding classifications/signatures to the tool, here is a list of enhancements that would benefit the project and the broader infosec community. Don't feel constrained to this list, though.

  • Monitor websites for historical changes (e.g., a script tag was added today)
  • Develop a better mechanism for adding signatures/classifications
  • Attempt to download and store malware hosted on compromised sites

Join the community

Join the official mailing list: https://groups.google.com/forum/#!forum/malspider for support and to talk with other users.

Installation Prerequisites


Please make sure these technologies are installed before continuing:

  • Python 2.7.6
  • Updated version of pip
  • MySQL

Note: If your server already has specific versions of these components installed, you can use a virtualenv to create an isolated python environment.

Tested and working on minimal installations of:

  • Ubuntu 14
  • CentOS 6
  • CentOS 7

(Quick) Installation


Start the installation process by running "./quick_install" from the command line. Please read the prompts carefully!!

Malspider comes with a quick_install script found in the root directory. This script attempts to make the installation process as painless as possible by completing the following steps:

  1. Install Database: creates a database named 'malspider', creates a new MySQL user, and applies the DB schema.
  2. Install Dependencies: installs ALL dependencies and modules required by Malspider.
  3. Django Migrations: applies django migrations to the database (necessary for the web app).
  4. Create Web Admin User: creates an administrative user for the web application.
  5. Add Access Control: creates iptables rules to block port 6802 (used by the daemon) and open port 8080 (web app).
  6. Add Cronjobs: creates crontab entries to schedule jobs, analyze data, and purge the database after a period of time.

Note: The quick_install script uses scripts found under the install/ directory. If any of the above steps fail you can attempt to complete them manually using those scripts.

Note: If you need a permanent or production installation of Malspider, please consider using Apache as your web server. Production installation instructions will be released soon.

(Quick) Start


Start Malspider by running "./quick_start" from the command line.

Malspider comes with a quick_start script found in the root directory. This script attempts to start the daemon and the web application.

Malspider can be accessed from your browser on port 8080 @ http://0.0.0.0:8080

Note: If the daemon and/or the web app fails to start you can attempt to start them separately using the scripts found under the start/ directory.

Using Malspider

Interaction with Malspider happens via an easy-to-use dashboard accessible through your web browser. The dashboard enables you to view alerts, inspect injected code, add websites to monitor, and tune false positives.

Monitoring Websites


Add websites to crawl by navigating to the administrative panel @ http://0.0.0.0:8080/admin (or by clicking on the admin link from the dashboard). Click on "Organizations" and add a new Organization. You'll be prompted for the:

  • website name (e.g., "Cisco Systems")
  • domain (e.g., cisco.com)
  • industry/org category (e.g., Energy, Political, Education)

If you want to bulk import domains, create a CSV file with the following header and leave the first column (the id field) blank:

id,org_name,category,domain
,Cisco Systems,Technology,cisco.com

Click "IMPORT" instead of "ADD ORGANIZATION".

NOTE: Websites are scheduled to be crawled once every 24 hours (at midnight) by a cronjob. If you want to crawl your list of websites more often than that you can edit the crontab entry that looks like this: "0 * * * * python your_path/manage.py manage_spiders"

Pages Per Domain

By default, Malspider crawls 20 pages per domain. You can change this to crawl as many pages per domain as you like, or to crawl only the homepage of each site.

In the malspider/settings.py file you'll find a "PAGES_PER_DOMAIN" variable. Change this to your desired depth.

### Limit pages crawled per domain ###
# 0 = crawl only the home page (start urls)
# X = crawl X pages beyond the home page
PAGES_PER_DOMAIN = 20

Tuning False Positives


Login to the web app administrative panel @ http://0.0.0.0:8080/admin or click on the admin link from the dashboard.

Click on "Custom Whitelist" and add your entry there. This can be a full URL or a partial string. The analyzer won't generate any (new) alerts for elements that match patterns in the whitelist.

NOTE: You will need to delete the old alerts that are false positives. You can do this again through the admin interface by selecting "Alerts" and deleting the entries you don't want.

Anonymous Traffic


We recommend you tunnel all Malspider traffic through a proxy to hide the origin of your requests. Malspider supports a single proxy:

In malspider/settings.py change:

WEBDRIVER_OPTIONS = {
                'service_args': ['--debug=true', '--load-images=true', '--webdriver-loglevel=debug']
#                'service_args': ['--debug=true', '--load-images=true', '--webdriver-loglevel=debug', '--proxy=<address>','--proxy-type=<http,socks5,etc>']
    
}

to

WEBDRIVER_OPTIONS = {
#                'service_args': ['--debug=true', '--load-images=true', '--webdriver-loglevel=debug']
                 'service_args': ['--debug=true', '--load-images=true', '--webdriver-loglevel=debug', '--proxy=<address>','--proxy-type=<http,socks5,etc>']
    
}

and replace <address> with your proxy address and <http,socks5,etc> with the type of proxy (http, socks5, etc.).

TIP: A more advanced (and preferred) setup is to load balance multiple proxies. Some services will do this for you.

Random User Agent Strings


Malspider randomly selects a user agent string from a list found at malspider/resources/useragents.txt. If you would like to add more user agents to the list then simply edit that text file.

NOTE: After editing the text file you'll need to re-deploy the project to the daemon. You can do that by navigating to the root project directory and typing "scrapyd-deploy". This should successfully deploy the changes.
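
For reference, the general shape of a downloader middleware that does this per-request selection is sketched below. This is a simplified illustration, not Malspider's actual RandomUserAgentMiddleware:

import random

class RandomUserAgentSketch(object):
    """Simplified illustration of per-request user agent rotation in Scrapy."""

    def __init__(self, ua_file='malspider/resources/useragents.txt'):
        with open(ua_file) as f:
            self.user_agents = [line.strip() for line in f if line.strip()]

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        request.headers['User-Agent'] = random.choice(self.user_agents)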

Enable Screenshots

Malspider has built-in capabilities for taking screenshots of every page it crawls. Screenshots can be useful in a variety of situations, but they can drastically increase disk usage on the server. For that reason, screenshots are turned off by default. If you want to take screenshots, open malspider/settings.py and locate the following lines of code:

#screenshots
TAKE_SCREENSHOT = False
SCREENSHOT_LOCATION = '<full_file_path>'

Set TAKE_SCREENSHOT to True and change <full_file_path> to the directory where you want the screenshots to be stored.
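
Under the hood this amounts to asking the webdriver for a capture on each page. A hypothetical helper (not project code) using the standard Selenium API would look roughly like this:

import os
import time

def capture(driver, url, screenshot_dir):
    # driver is an already-initialized Selenium webdriver (PhantomJS in Malspider's case).
    safe_name = url.replace('://', '_').replace('/', '_')
    path = os.path.join(screenshot_dir, '%d_%s.png' % (int(time.time()), safe_name))
    driver.save_screenshot(path)  # standard Selenium webdriver method
    return path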

Email Summary of Alerts


Turn on email summaries by opening the malspider_django/malspider_django/settings.py file, locating the email options near the bottom of the file, uncommenting them (removing the preceding #), and supplying the appropriate values:

EMAIL_HOST=""
EMAIL_PORT=
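
For reference, a filled-in configuration might look like the following; the host and port below are placeholders for your own mail relay, and any other Django EMAIL_* settings you need can be set the same way:

EMAIL_HOST = "smtp.example.com"  # placeholder: your SMTP relay
EMAIL_PORT = 25                  # placeholder: the port your relay listens on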

To create an email summary, go to Admin Panel -> Email Alerts -> Add Email Alert. Supply a subject line (e.g., "Malspider Email Summary"), a list of recipients (one per line), and the email frequency (hourly, daily, or weekly).

LDAP Authentication (disabled by default)


If enabled, Malspider will present the user with a login screen requiring auth credentials to view content. To configure LDAP, open /malspider_django/malspider_django/settings.py and uncomment (remove the '#') the two lines in AUTHENTICATION_BACKENDS.

AUTHENTICATION_BACKENDS = (
#    'django_auth_ldap.backend.LDAPBackend',
#    'django.contrib.auth.backends.ModelBackend',
)

Then edit the LDAP variables according to your environment. Here are some you should consider editing:

import ldap
from django_auth_ldap.config import LDAPSearch, GroupOfNamesType  # needed by the settings below

AUTH_LDAP_SERVER_URI = "ldap://example.com"

AUTH_LDAP_BIND_DN = "cn=cnexample,OU=groups,OU=groups,DC=example,DC=com"
AUTH_LDAP_BIND_PASSWORD = "<password>"
AUTH_LDAP_USER_SEARCH = LDAPSearch("ou=group,dc=example,dc=com",
    ldap.SCOPE_SUBTREE, "(cn=%(user)s)")

# Set up the basic group parameters.
AUTH_LDAP_GROUP_SEARCH = LDAPSearch("ou=groups,dc=example,dc=com",
    ldap.SCOPE_SUBTREE, "(objectClass=groupOfNames)"
)

AUTH_LDAP_GROUP_TYPE = GroupOfNamesType(name_attr="cn")

# Simple group restrictions
#AUTH_LDAP_REQUIRE_GROUP = "cn=example2,ou=django,ou=groups,dc=example,dc=com"
#AUTH_LDAP_DENY_GROUP = "cn=disabled,ou=django,ou=groups,dc=example,dc=com"

NOTE: For a more professional, production-grade install, we recommend you set up Malspider with Apache or nginx and use SSL.

Database Purging


The database can grow rather large very quickly. For performance reasons, it is recommended that you delete data from the 'pages' and 'elements' tables once per month... unless you have the storage space, of course.

If you want to perform a monthly purge of the database then uncomment the following line in your crontab file: 0 0 1 * * python your_path/manage.py purge_db

Comments

  • Bump cryptography from 1.3.2 to 3.2 in /install

    Oct 27, 2020

    Bumps cryptography from 1.3.2 to 3.2.

    Changelog

    Sourced from cryptography's changelog.

    3.2 - 2020-10-25

    
    * **SECURITY ISSUE:** Attempted to make RSA PKCS#1v1.5 decryption more constant
      time, to protect against Bleichenbacher vulnerabilities. Due to limitations
      imposed by our API, we cannot completely mitigate this vulnerability and a
      future release will contain a new API which is designed to be resilient to
      these for contexts where it is required. Credit to **Hubert Kario** for
      reporting the issue. *CVE-2020-25659*
    * Support for OpenSSL 1.0.2 has been removed. Users on older version of OpenSSL
      will need to upgrade.
    * Added basic support for PKCS7 signing (including SMIME) via
      :class:`~cryptography.hazmat.primitives.serialization.pkcs7.PKCS7SignatureBuilder`.
    


    3.1.1 - 2020-09-22

    • Updated Windows, macOS, and manylinux wheels to be compiled with OpenSSL 1.1.1h.


    3.1 - 2020-08-26

    
    * **BACKWARDS INCOMPATIBLE:** Removed support for ``idna`` based
      :term:`U-label` parsing in various X.509 classes. This support was originally
      deprecated in version 2.1 and moved to an extra in 2.5.
    * Deprecated OpenSSL 1.0.2 support. OpenSSL 1.0.2 is no longer supported by
      the OpenSSL project. The next version of ``cryptography`` will drop support
      for it.
    * Deprecated support for Python 3.5. This version sees very little use and will
      be removed in the next release.
    * ``backend`` arguments to functions are no longer required and the
      default backend will automatically be selected if no ``backend`` is provided.
    * Added initial support for parsing certificates from PKCS7 files with
      :func:`~cryptography.hazmat.primitives.serialization.pkcs7.load_pem_pkcs7_certificates`
      and
      :func:`~cryptography.hazmat.primitives.serialization.pkcs7.load_der_pkcs7_certificates`
      .
    * Calling ``update`` or ``update_into`` on
      :class:`~cryptography.hazmat.primitives.ciphers.CipherContext` with ``data``
      longer than 2^31 bytes no longer raises an ``OverflowError``. This
      also resolves the same issue in :doc:`/fernet`.
    


    3.0 - 2020-07-20

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

  • Bump twisted from 13.2.0 to 20.3.0 in /install

    Apr 30, 2021

    Bumps twisted from 13.2.0 to 20.3.0.

    Changelog

    Sourced from twisted's changelog.

    Twisted 20.3.0 (2020-03-13)

    Bugfixes

    • twisted.protocols.amp.BoxDispatcher.callRemote and callRemoteString will no longer return failing Deferreds for requiresAnswer=False commands when the transport they're operating on has been disconnected. (#9756)

    Improved Documentation

    • Added a missing hyphen to a reference to the --debug option of pdb in the Trial how-to. (#9690)
    • The documentation of the twisted.cred.checkers module has been extended and corrected. (#9724)

    Deprecations and Removals

    • twisted.news is deprecated. (#9405)

    Misc

    Conch

    Features

    
    - twisted.conch.ssh now supports the curve25519-sha256 key exchange algorithm (requires OpenSSL >= 1.1.0). ([#6814](https://github.com/twisted/twisted/issues/6814))
    - twisted.conch.ssh.keys can now write private keys in the new "openssh-key-v1" format, introduced in OpenSSH 6.5 and made the default in OpenSSH 7.8.  ckeygen has a corresponding new --private-key-subtype=v1 option. ([#9683](https://github.com/twisted/twisted/issues/9683))
    

    Bugfixes

    • twisted.conch.keys.Key.privateBlob now returns the correct blob format for ECDSA (i.e. the same as that implemented by OpenSSH). (#9682)

    Misc

    
    - [#9760](https://github.com/twisted/twisted/issues/9760)
    


    ... (truncated)

    Commits
    • 121c98e Merge branch 'release-20.3-9772' of github.com:twisted/twisted into release-2...
    • b9f8dad Fix a lint error in copyright.py and a release process bug that doesn't consi...
    • 384de59 towncrier for 20.3.0
    • 35db7f1 incremental 20.3.0
    • 0ebf7c5 Revert "20.3rc1 towncrier"
    • 50412c9 20.3rc1 towncrier
    • f80bdfa Fix a newsfile
    • 420f17a 20.3rc1
    • 5bab6b3 it's a brand new year, the sun is high, the birds are singin that 2019 went a...
    • 20c787a Merge pull request from GHSA-8r99-h8j2-rw64
    • Additional commits viewable in compare view

  • Bump urllib3 from 1.9 to 1.26.5 in /install

    Jun 1, 2021

    Bumps urllib3 from 1.9 to 1.26.5.

    Release notes

    Sourced from urllib3's releases.

    1.26.5

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.4

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.3

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed bytes and string comparison issue with headers (Pull #2141)

    • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme (Pull #2107)

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.2

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

    1.26.1

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

    1.26.0

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

    • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning should opt-in explicitly by setting ssl_version=ssl.PROTOCOL_TLSv1_1 (Pull #2002) Starting in urllib3 v2.0: Connections that receive a DeprecationWarning will fail

    • Deprecated Retry options Retry.DEFAULT_METHOD_WHITELIST, Retry.DEFAULT_REDIRECT_HEADERS_BLACKLIST and Retry(method_whitelist=...) in favor of Retry.DEFAULT_ALLOWED_METHODS, Retry.DEFAULT_REMOVE_HEADERS_ON_REDIRECT, and Retry(allowed_methods=...) (Pull #2000) Starting in urllib3 v2.0: Deprecated options will be removed

    ... (truncated)

    Changelog

    Sourced from urllib3's changelog.

    1.26.5 (2021-05-26)

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.

    1.26.4 (2021-03-15)

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

    1.26.3 (2021-01-26)

    • Fixed bytes and string comparison issue with headers (Pull #2141)

    • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme. (Pull #2107)

    1.26.2 (2020-11-12)

    • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

    1.26.1 (2020-11-11)

    • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

    1.26.0 (2020-11-10)

    • NOTE: urllib3 v2.0 will drop support for Python 2. Read more in the v2.0 Roadmap <https://urllib3.readthedocs.io/en/latest/v2-roadmap.html>_.

    • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

    • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning

    ... (truncated)

    Commits
    • d161647 Release 1.26.5
    • 2d4a3fe Improve performance of sub-authority splitting in URL
    • 2698537 Update vendored six to 1.16.0
    • 07bed79 Fix deprecation warnings for Python 3.10 ssl module
    • d725a9b Add Python 3.10 to GitHub Actions
    • 339ad34 Use pytest==6.2.4 on Python 3.10+
    • f271c9c Apply latest Black formatting
    • 1884878 [1.26] Properly proxy EOF on the SSLTransport test suite
    • a891304 Release 1.26.4
    • 8d65ea1 Merge pull request from GHSA-5phf-pp7p-vc2r
    • Additional commits viewable in compare view

  • Bump django from 1.9.1 to 2.2.24 in /install

    Jun 10, 2021

    Bumps django from 1.9.1 to 2.2.24.

    Commits
    • 2da029d [2.2.x] Bumped version for 2.2.24 release.
    • f27c38a [2.2.x] Fixed CVE-2021-33571 -- Prevented leading zeros in IPv4 addresses.
    • 053cc95 [2.2.x] Fixed CVE-2021-33203 -- Fixed potential path-traversal via admindocs'...
    • 6229d87 [2.2.x] Confirmed release date for Django 2.2.24.
    • f163ad5 [2.2.x] Added stub release notes and date for Django 2.2.24.
    • bed1755 [2.2.x] Changed IRC references to Libera.Chat.
    • 63f0d7a [2.2.x] Refs #32718 -- Fixed file_storage.test_generate_filename and model_fi...
    • 5fe4970 [2.2.x] Post-release version bump.
    • 61f814f [2.2.x] Bumped version for 2.2.23 release.
    • b8ecb06 [2.2.x] Fixed #32718 -- Relaxed file name validation in FileField.
    • Additional commits viewable in compare view

  • Bump scrapy from 0.24.4 to 1.8.1 in /install

    Oct 6, 2021

    Bumps scrapy from 0.24.4 to 1.8.1.

    Release notes

    Sourced from scrapy's releases.

    1.8.1

    Security bug fix:

    If you use HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for HTTP authentication, any request exposes your credentials to the request target.

    To prevent unintended exposure of authentication credentials to unintended domains, you must now additionally set a new, additional spider attribute, http_auth_domain, and point it to the specific domain to which the authentication credentials must be sent.

    If the http_auth_domain spider attribute is not set, the domain of the first request will be considered the HTTP authentication target, and authentication credentials will only be sent in requests targeting that domain.

    If you need to send the same HTTP authentication credentials to multiple domains, you can use w3lib.http.basic_auth_header instead to set the value of the Authorization header of your requests.

    If you really want your spider to send the same HTTP authentication credentials to any domain, set the http_auth_domain spider attribute to None.

    Finally, if you are a user of scrapy-splash, know that this version of Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will need to upgrade scrapy-splash to a greater version for it to continue to work.

    1.7.4

    Revert the fix for #3804 (#3819), which has a few undesired side effects (#3897, #3976).

    1.7.3

    Enforce lxml 4.3.5 or lower for Python 3.4 (#3912, #3918)

    1.7.2

    Fix Python 2 support (#3889, #3893, #3896)

    1.7.0

    Highlights:

    • Improvements for crawls targeting multiple domains
    • A cleaner way to pass arguments to callbacks
    • A new class for JSON requests
    • Improvements for rule-based spiders
    • New features for feed exports

    See the full change log

    1.6.0

    Highlights:

    • Better Windows support
    • Python 3.7 compatibility
    • Big documentation improvements, including a switch from .extract_first() + .extract() API to .get() + .getall() API
    • Feed exports, FilePipeline and MediaPipeline improvements
    • Better extensibility: item_error and request_reached_downloader signals; from_crawler support for feed exporters, feed storages and dupefilters.
    • scrapy.contracts fixes and new features
    • Telnet console security improvements, first released as a backport in Scrapy 1.5.2 (2019-01-22)
    • Clean-up of the deprecated code
    • Various bug fixes, small new features and usability improvements across the codebase.

    Full changelog is in the docs.

    ... (truncated)

    Changelog

    Sourced from scrapy's changelog.

    Scrapy 1.8.1 (2021-10-05)

    • Security bug fix:

      If you use :class:~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for HTTP authentication, any request exposes your credentials to the request target.

      To prevent unintended exposure of authentication credentials to unintended domains, you must now additionally set a new, additional spider attribute, http_auth_domain, and point it to the specific domain to which the authentication credentials must be sent.

      If the http_auth_domain spider attribute is not set, the domain of the first request will be considered the HTTP authentication target, and authentication credentials will only be sent in requests targeting that domain.

      If you need to send the same HTTP authentication credentials to multiple domains, you can use :func:w3lib.http.basic_auth_header instead to set the value of the Authorization header of your requests.

      If you really want your spider to send the same HTTP authentication credentials to any domain, set the http_auth_domain spider attribute to None.

      Finally, if you are a user of scrapy-splash_, know that this version of Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will need to upgrade scrapy-splash to a greater version for it to continue to work.

    .. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash


    Scrapy 1.8.0 (2019-10-28)

    Highlights:

    • Dropped Python 3.4 support and updated minimum requirements; made Python 3.8 support official
    • New :meth:Request.from_curl <scrapy.http.Request.from_curl> class method
    • New :setting:ROBOTSTXT_PARSER and :setting:ROBOTSTXT_USER_AGENT settings
    • New :setting:DOWNLOADER_CLIENT_TLS_CIPHERS and :setting:DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING settings

    ... (truncated)

    Commits
    • 283e90e Bump version: 1.8.0 → 1.8.1
    • 99ac4db Cover 1.8.1 in the release notes
    • 1635134 Small documentation fixes.
    • b01d69a Add http_auth_domain to HttpAuthMiddleware.
    • 4183925 Travis CI → GitHub Actions
    • be2e910 Bump version: 1.7.0 → 1.8.0
    • 94f060f Cover Scrapy 1.8.0 in the release notes (#3952)
    • 18b808b Merge pull request #4092 from further-reading/master
    • 93e3dc1 [test_downloadermiddleware_httpcache.py] Cleaning text
    • b73d217 [test_downloadermiddleware_httpcache.py] Fixing pytest mark behaviour
    • Additional commits viewable in compare view

  • Bump lxml from 3.6.0 to 4.6.5 in /install

    Dec 13, 2021

    Bumps lxml from 3.6.0 to 4.6.5.

    Changelog

    Sourced from lxml's changelog.

    4.6.5 (2021-12-12)

    Bugs fixed

    • A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images.

    • A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs.

    4.6.4 (2021-11-01)

    Features added

    • GH#317: A new property system_url was added to DTD entities. Patch by Thirdegree.

    • GH#314: The STATIC_* variables in setup.py can now be passed via env vars. Patch by Isaac Jurado.

    4.6.3 (2021-03-21)

    Bugs fixed

    • A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    ... (truncated)

    Commits
    • a9611ba Fix a test in Py2.
    • a3eacbc Prepare release of 4.6.5.
    • b7ea687 Update changelog.
    • 69a7473 Cleaner: cover some more cases where scripts could sneak through in specially...
    • 54d2985 Fix condition in test decorator.
    • 4b220b5 Use the non-depcrecated TextTestResult instead of _TextTestResult (GH-333)
    • d85c6de Exclude a test when using the macOS system libraries because it fails with li...
    • cd4bec9 Add macOS-M1 as wheel build platform.
    • fd0d471 Install automake and libtool in macOS build to be able to install the latest ...
    • f233023 Cleaner: Remove SVG image data URLs since they can embed script content.
    • Additional commits viewable in compare view

  • Tasks running indefinitely

    Oct 18, 2016

    2016-10-13 00:00:19+0400 [scrapy] INFO: Scrapy 0.24.4 started (bot: full_domain)
    2016-10-13 00:00:19+0400 [scrapy] INFO: Optional features available: ssl, http11, django
    2016-10-13 00:00:19+0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'malspider.spiders', 'SPIDER_MODULES': ['malspider.spiders'], 'LOG_FILE': 'logs/malspider/full_domain/76f0156890b611e6959d005056ae7ab0.log', 'USER_AGENT': 'Mozilla/5.0 (Android; Tablet; rv:30.0) Gecko/30.0 Firefox/30.0', 'BOT_NAME': 'full_domain'}
    2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgentMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, WebdriverSpiderMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled item pipelines: DuplicateFilterPipeline, WhitelistFilterPipeline, MySQLPipeline
    2016-10-13 00:00:19+0400 [full_domain] INFO: Spider opened
    2016-10-13 00:00:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:00:19+0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6026
    2016-10-13 00:00:19+0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6083
    2016-10-13 00:00:19+0400 [scrapy] DEBUG: Downloading https://www.des.gov.ge with webdriver
    2016-10-13 00:01:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:02:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:03:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:04:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:05:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:06:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:07:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:08:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:09:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:10:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:11:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:12:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:13:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:14:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:15:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-13 00:16:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

    This goes on for 24 hours, at which point the task is sigkilled. Seems to happen to all the domains I try to monitor, but not every time. Any idea why this keeps happening?

  • Can not connect to ghostdriver

    Nov 30, 2016

    Is it failing because it's trying HTTPS? Also, I notice that when an IP is entered, it adds www. to it.

    2016-11-30 22:00:40+0000 [scrapy] INFO: Scrapy 0.24.4 started (bot: full_domain)
    2016-11-30 22:00:40+0000 [scrapy] INFO: Optional features available: ssl, http11, django
    2016-11-30 22:00:40+0000 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'malspider.spiders', 'SPIDER_MODULES': ['malspider.spiders'], 'LOG_FILE': 'logs/malspider/full_domain/6bc999bcb74811e6b3e7129119453e14.log', 'USER_AGENT': 'Mozilla/5.0 (Android; Tablet; rv:30.0) Gecko/30.0 Firefox/30.0', 'BOT_NAME': 'full_domain'}
    2016-11-30 22:00:40+0000 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2016-11-30 22:00:40+0000 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgentMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2016-11-30 22:00:40+0000 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, WebdriverSpiderMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2016-11-30 22:00:40+0000 [scrapy] INFO: Enabled item pipelines: DuplicateFilterPipeline, WhitelistFilterPipeline, MySQLPipeline
    2016-11-30 22:00:40+0000 [full_domain] INFO: Spider opened
    2016-11-30 22:00:40+0000 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-11-30 22:00:40+0000 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-11-30 22:00:40+0000 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2016-11-30 22:00:40+0000 [scrapy] DEBUG: Downloading https://test.com with webdriver
    2016-11-30 22:01:09+0000 [full_domain] ERROR: Error downloading <GET https://test.com>
    	Traceback (most recent call last):
    	  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    	    self.__bootstrap_inner()
    	  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    	    self.run()
    	  File "/usr/lib/python2.7/threading.py", line 754, in run
    	    self.__target(*self.__args, **self.__kwargs)
    	--- <exception caught here> ---
    	  File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
    	    result = context.call(ctx, function, *args, **kwargs)
    	  File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    	    return self.currentContext().callWithContext(ctx, func, *args, **kw)
    	  File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    	    return func(*args,**kw)
    	  File "build/bdist.linux-x86_64/egg/malspider/scrapy_webdriver/download.py", line 66, in _download_request
    	    
    	  File "build/bdist.linux-x86_64/egg/malspider/scrapy_webdriver/manager.py", line 75, in webdriver
    	    
    	  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    	    self.service.start()
    	  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/phantomjs/service.py", line 81, in start
    	    raise WebDriverException("Can not connect to GhostDriver")
    	selenium.common.exceptions.WebDriverException: Message: Can not connect to GhostDriver
    
  • Limit Scrapy Depth

    Jun 15, 2016

    Is there a way to limit the depth and recursion that happens per organization, if I just want to scrape initial homepages for quick checkups?

  • Regex Raw field

    Jun 14, 2016

    Is it possible to regex on the raw data? If so, what's the proper method?

    clicky_regex = re.compile("clicky", re.IGNORECASE | re.MULTILINE)

    clicky_elements = Element.objects.filter(Q(event_time__gte=search_start_time), Q(tag_name='script') | Q(tag_name='iframe'), Q(raw__regex=clicky_regex))

  • Sql Error

    Nov 30, 2016

    Fresh install and 1st attempt to hit Dashboard

    Exception Value:	
    (1055, "Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'malspider.alert.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by")
    
    Exception Location:	/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py in defaulterrorhandler, line 36
    
  • Are jobs really cancelled?

    Jun 13, 2016

    python malspider_django/manage.py manage_spiders
    Canceling all outstanding jobs
    canceled job 261c094631af11e6872f79b4a0de6dd8 for project ' malspider '
    canceled job 8976429e31b011e6872f79b4a0de6dd8 for project ' malspider '

    ps -aux
    /usr/bin/python -m scrapyd.runner crawl full_domain -a _job=261c094631af11e6872f79b4a0de6dd8 -a org=1708 .....
    /usr/bin/python -m scrapyd.runner crawl full_domain -a _job=8976429e31b011e6872f79b4a0de6dd8 -a org=1707 ....
